## Introduction

This exercise calculates permutation importance on a sample of data from the [Taxi Fare Prediction](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) competition.

A `RandomForestRegressor` is used to build the model.

In [1]:
import pandas as pd

# Set Jupyter notebook display options
pd.options.display.max_rows = 8
pd.options.display.max_columns = 8

# Check version number
pd.__version__

'0.24.2'

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

import joblib

import eli5
from eli5.sklearn import PermutationImportance

In [3]:
training_dataset_file = 'nyc_taxi_fare_train.csv'
training_dataset_size = 100000

test_dataset_file = 'nyc_taxi_fare_test.csv'

best_model_file = 'nyc_taxi_fare.joblib'

**Trim down original training dataset**

The original training dataset has more than 55 million rows, which can cause out-of-memory errors on a standalone machine.

Many rows have extreme outlier coordinates or negative fares. Removing them leaves more than 34 million rows, and the first 100K of those are saved as our training dataset.

**This code cell is commented out because it only needs to run once**

```python
# read original train dataset
df = pd.read_csv('train.csv', header='infer')
print(df.info())  # RangeIndex: 55423856 entries, 0 to 55423855

# filter out extreme outlier coordinates or negative fares
filter_query = ('pickup_latitude > 40.7 and pickup_latitude < 40.8 and ' + 
                'dropoff_latitude > 40.7 and dropoff_latitude < 40.8 and ' + 
                'pickup_longitude > -74 and pickup_longitude < -73.9 and ' +
                'dropoff_longitude > -74 and dropoff_longitude < -73.9 and ' +
                'fare_amount > 0')

df = df.query(filter_query)
print(df.info())  # Int64Index: 34730008 entries, 2 to 55423855

# reset the index so it starts from 0 again
df = df.reset_index(drop=True)

# pick the first 100K rows
df = df.iloc[:training_dataset_size]

# save 100K to new CSV file
df.to_csv(training_dataset_file, index=False, encoding='utf-8')
```
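
If reading all 55 million rows at once does not fit in memory, the same filter can be applied in chunks. The following is a minimal sketch, assuming the column names used above; the helper name `first_filtered_rows` and the chunk size are illustrative choices, not part of the original notebook. It stops reading as soon as enough rows have passed the filter, and the demo runs on a tiny in-memory CSV rather than the real `train.csv`.

```python
import io

import pandas as pd

# Same bounding box and fare filter as in the cell above.
FILTER_QUERY = (
    'pickup_latitude > 40.7 and pickup_latitude < 40.8 and '
    'dropoff_latitude > 40.7 and dropoff_latitude < 40.8 and '
    'pickup_longitude > -74 and pickup_longitude < -73.9 and '
    'dropoff_longitude > -74 and dropoff_longitude < -73.9 and '
    'fare_amount > 0'
)


def first_filtered_rows(csv_source, n_rows, chunk_size=1_000_000):
    """Stream a CSV in chunks and return the first n_rows that pass the filter."""
    kept, total = [], 0
    for chunk in pd.read_csv(csv_source, chunksize=chunk_size):
        good = chunk.query(FILTER_QUERY)
        kept.append(good)
        total += len(good)
        if total >= n_rows:
            break  # enough rows collected, stop reading the file
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True).iloc[:n_rows]


# Tiny in-memory demo: rows 2 and 4 fail the filter (negative fare, outlier longitude).
demo_csv = io.StringIO(
    'fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude\n'
    '5.7,-73.982738,40.76127,-73.991242,40.750562\n'
    '-3.0,-73.98,40.75,-73.97,40.76\n'
    '7.7,-73.98713,40.733143,-73.991567,40.758092\n'
    '9.0,-75.5,40.75,-73.97,40.76\n'
)
demo = first_filtered_rows(demo_csv, n_rows=2, chunk_size=2)
print(demo)
```

Because chunks are read in file order, this keeps the same rows as filtering the whole file and then taking the first `n_rows`.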

**Reload the trimmed version of our training dataset**

In [4]:
df = pd.read_csv(training_dataset_file, header='infer', nrows=training_dataset_size)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
key                  100000 non-null object
fare_amount          100000 non-null float64
pickup_datetime      100000 non-null object
pickup_longitude     100000 non-null float64
pickup_latitude      100000 non-null float64
dropoff_longitude    100000 non-null float64
dropoff_latitude     100000 non-null float64
passenger_count      100000 non-null int64
dtypes: float64(5), int64(1), object(2)
memory usage: 6.1+ MB


In [6]:
df.head()

| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 UTC | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 2 |
| 1 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 UTC | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 2 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |
| 3 | 2012-11-20 20:35:00.0000001 | 7.5 | 2012-11-20 20:35:00 UTC | -73.980002 | 40.751662 | -73.973802 | 40.764842 | 1 |
| 4 | 2012-01-04 17:22:00.00000081 | 16.5 | 2012-01-04 17:22:00 UTC | -73.9513 | 40.774138 | -73.990095 | 40.751048 | 1 |


**Split the dataset into training and validation datasets**

In [7]:
features = ['pickup_longitude', 
            'pickup_latitude',
            'dropoff_longitude',
            'dropoff_latitude',
            'passenger_count']

label = 'fare_amount'

X = df[features]
y = df[label]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

In [8]:
train_X.describe()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|
| count | 75000.0 | 75000.0 | 75000.0 | 75000.0 | 75000.0 |
| mean | -73.976803 | 40.756925 | -73.975369 | 40.757279 | 1.669893 |
| std | 0.014609 | 0.018178 | 0.015892 | 0.018752 | 1.298424 |
| min | -73.999999 | 40.700013 | -73.999998 | 40.700002 | 0.0 |
| 25% | -73.988046 | 40.744931 | -73.98717 | 40.745643 | 1.0 |
| 50% | -73.979553 | 40.758017 | -73.978534 | 40.758292 | 1.0 |
| 75% | -73.967776 | 40.769501 | -73.966212 | 40.770221 | 2.0 |
| max | -73.900062 | 40.799994 | -73.900062 | 40.799999 | 6.0 |


In [9]:
train_y.describe()

count    75000.000000
mean         8.455531
std          4.496008
min          0.010000
25%          5.500000
50%          7.500000
75%         10.100000
max        165.000000
Name: fare_amount, dtype: float64

In [10]:
val_X.describe()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|
| count | 25000.0 | 25000.0 | 25000.0 | 25000.0 | 25000.0 |
| mean | -73.976865 | 40.756752 | -73.975471 | 40.757254 | 1.6754 |
| std | 0.014609 | 0.018101 | 0.015977 | 0.018831 | 1.310429 |
| min | -73.999995 | 40.70015 | -73.999999 | 40.700003 | 0.0 |
| 25% | -73.987946 | 40.744686 | -73.987383 | 40.745411 | 1.0 |
| 50% | -73.97969 | 40.757756 | -73.978623 | 40.758153 | 1.0 |
| 75% | -73.96804 | 40.769106 | -73.966641 | 40.770274 | 2.0 |
| max | -73.900267 | 40.79998 | -73.900232 | 40.799984 | 6.0 |


In [11]:
val_y.describe()

count    25000.000000
mean         8.466946
std          4.777574
min          2.500000
25%          5.500000
50%          7.500000
75%         10.100000
max        255.000000
Name: fare_amount, dtype: float64

**Train the model**

In [12]:
rfr_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(train_X, train_y)

**Calculate permutation importance of the newly trained model on the validation dataset**

In [13]:
perm = PermutationImportance(rfr_model, random_state=1).fit(val_X, val_y)

eli5.show_weights(perm, feature_names=features)

| Weight | Feature |
|---|---|
| 1.0619 ± 0.0121 | dropoff_latitude |
| 0.8498 ± 0.0179 | pickup_latitude |
| 0.5829 ± 0.0054 | dropoff_longitude |
| 0.5490 ± 0.0126 | pickup_longitude |
| 0.0001 ± 0.0011 | passenger_count |
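
Permutation importance measures how much a model's score drops when one feature column is shuffled while everything else is left intact. The sketch below illustrates the idea with NumPy on a toy linear "model"; it is not eli5's implementation, and the names `permutation_importance_sketch`, `toy_predict`, `X_toy` and `y_toy` are made up for this example. The score used here is R², which can become negative after shuffling, so a drop larger than 1 is possible.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination (R^2)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def permutation_importance_sketch(predict, X, y, n_iter=5, seed=1):
    """Mean and std of the R^2 drop caused by shuffling each column in turn."""
    rng = np.random.default_rng(seed)
    base = r2_score(y, predict(X))
    drops = np.zeros((n_iter, X.shape[1]))
    for i in range(n_iter):
        for j in range(X.shape[1]):
            X_perm = X.copy()
            # Shuffle only column j; this breaks its link to the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[i, j] = base - r2_score(y, predict(X_perm))
    return drops.mean(axis=0), drops.std(axis=0)


# Toy data: the target depends strongly on column 0, weakly on column 1,
# and not at all on column 2.
toy_rng = np.random.default_rng(0)
X_toy = toy_rng.normal(size=(2000, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1]


def toy_predict(X):
    """Stand-in for a fitted model's predict method."""
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]


mean_drop, std_drop = permutation_importance_sketch(toy_predict, X_toy, y_toy)
for name, m, s in zip(['strong', 'weak', 'unused'], mean_drop, std_drop):
    print(f'{name}: {m:.4f} +/- {s:.4f}')
```

The unused column gets an importance of exactly zero because shuffling it cannot change any prediction, which mirrors the near-zero weight of `passenger_count` above.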


The model uses the following features:
- pickup_longitude
- pickup_latitude
- dropoff_longitude
- dropoff_latitude
- passenger_count

The first four relate to travel distance, so they should contribute much more to the fare than the number of passengers. That assumption holds, but why is latitude more important than longitude?

Some hypotheses are:
1. Trips might tend to cover greater latitude distances than longitude distances. If the longitude values were generally closer together, shuffling them wouldn't matter as much.
2. Different parts of the city might have different pricing rules (e.g. price per mile), and pricing rules could vary more by latitude than longitude.
3. Tolls might be greater on roads going North <-> South (changing latitude) than on roads going East <-> West (changing longitude). Thus latitude would have a larger effect on the prediction because it captures the amount of the tolls.

Without detailed knowledge of New York City, it's difficult to rule out most hypotheses about why latitude features matter more than longitude.

A good next step is to disentangle the effect of being in certain parts of the city from the effect of total distance traveled.  

**Two new features for longitudinal and latitudinal distance will be created and used to build a new model:**

In [14]:
df['abs_lon_change'] = abs(df['dropoff_longitude'] - df['pickup_longitude'])
df['abs_lat_change'] = abs(df['dropoff_latitude'] - df['pickup_latitude'])

df.head()

| | key | fare_amount | pickup_datetime | pickup_longitude | ... | dropoff_latitude | passenger_count | abs_lon_change | abs_lat_change |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 UTC | -73.982738 | ... | 40.750562 | 2 | 0.008504 | 0.010708 |
| 1 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 UTC | -73.98713 | ... | 40.758092 | 1 | 0.004437 | 0.024949 |
| 2 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968095 | ... | 40.783762 | 1 | 0.01144 | 0.015754 |
| 3 | 2012-11-20 20:35:00.0000001 | 7.5 | 2012-11-20 20:35:00 UTC | -73.980002 | ... | 40.764842 | 1 | 0.0062 | 0.01318 |
| 4 | 2012-01-04 17:22:00.00000081 | 16.5 | 2012-01-04 17:22:00 UTC | -73.9513 | ... | 40.751048 | 1 | 0.038795 | 0.02309 |


In [15]:
# add the 2 newly created features and remove passenger_count since it does not matter much
features2  = ['pickup_longitude',
              'pickup_latitude',
              'dropoff_longitude',
              'dropoff_latitude',
              'abs_lat_change', 
              'abs_lon_change']

X = df[features2]
train_X2, val_X2, train_y2, val_y2 = train_test_split(X, y, random_state=1)

In [16]:
rfr_model2 = RandomForestRegressor(n_estimators=30, random_state=1).fit(train_X2, train_y2)

In [17]:
perm2 = PermutationImportance(rfr_model2, random_state=1).fit(val_X2, val_y2)

eli5.show_weights(perm2, feature_names=features2)

| Weight | Feature |
|---|---|
| 0.4898 ± 0.0223 | abs_lat_change |
| 0.4234 ± 0.0235 | abs_lon_change |
| 0.0918 ± 0.0099 | dropoff_latitude |
| 0.0781 ± 0.0067 | pickup_latitude |
| 0.0773 ± 0.0058 | dropoff_longitude |
| 0.0593 ± 0.0076 | pickup_longitude |


Distance traveled (`abs_lat_change` and `abs_lon_change`) seems far more important than any location effects. 

But the location still affects model predictions, and dropoff location (`dropoff_latitude`) now matters slightly more than pickup location (`pickup_latitude`). What are hypotheses for why this might be?

Note that the values of `abs_lon_change` and `abs_lat_change` are quite small (all between 0 and 0.1), whereas the other variables have larger values. Could this difference in scale explain why these two features had larger permutation importance values in this case?

**Consider an alternative where the new features are scaled up 1000x and used for a third model. Would this change the resulting permutation importance values?**

In [18]:
df['abs_lon_change1000'] = df['abs_lon_change'] * 1000
df['abs_lat_change1000'] = df['abs_lat_change'] * 1000

df.describe()

| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | ... | abs_lon_change | abs_lat_change | abs_lon_change1000 | abs_lat_change1000 |
|---|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | ... | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 8.458385 | -73.976819 | 40.756882 | -73.975394 | ... | 0.013064 | 0.014863 | 13.063561 | 14.862951 |
| std | 4.568005 | 0.014609 | 0.018159 | 0.015913 | ... | 0.011668 | 0.012129 | 11.668227 | 12.129215 |
| min | 0.01 | -73.999999 | 40.700013 | -73.999999 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 5.5 | -73.988025 | 40.744874 | -73.987225 | ... | 0.004996 | 0.006058 | 4.996 | 6.058 |
| 50% | 7.5 | -73.979579 | 40.757958 | -73.978558 | ... | 0.01005 | 0.011715 | 10.0495 | 11.715 |
| 75% | 10.1 | -73.967823 | 40.76938 | -73.966315 | ... | 0.017698 | 0.020533 | 17.69825 | 20.53325 |
| max | 255.0 | -73.900062 | 40.799994 | -73.900062 | ... | 0.094957 | 0.094655 | 94.957 | 94.655 |


In [19]:
# add 2 newly created features and remove abs_lat_change, abs_lon_change
features3  = ['pickup_longitude',
              'pickup_latitude',
              'dropoff_longitude',
              'dropoff_latitude',
              'abs_lat_change1000', 
              'abs_lon_change1000']

X = df[features3]
train_X3, val_X3, train_y3, val_y3 = train_test_split(X, y, random_state=1)

rfr_model3 = RandomForestRegressor(n_estimators=30, random_state=1).fit(train_X3, train_y3)

perm3 = PermutationImportance(rfr_model3, random_state=1).fit(val_X3, val_y3)

eli5.show_weights(perm3, feature_names=features3)

| Weight | Feature |
|---|---|
| 0.4898 ± 0.0245 | abs_lat_change1000 |
| 0.4242 ± 0.0210 | abs_lon_change1000 |
| 0.0912 ± 0.0110 | dropoff_latitude |
| 0.0803 ± 0.0066 | dropoff_longitude |
| 0.0784 ± 0.0080 | pickup_latitude |
| 0.0601 ± 0.0079 | pickup_longitude |


The scale of a feature does not affect permutation importance per se. Rescaling a feature can only affect it indirectly, if the rescaling helps or hurts the ability of the particular learning method to make use of that feature. That does not happen with tree-based models such as the Random Forest used here. With Ridge Regression, rescaling the features could change the result. That said, the absolute-change features have high importance because they capture total distance traveled, which is the primary determinant of taxi fares. It is not an artifact of feature magnitude.
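
Why rescaling does not matter for trees can be seen with a single split: a tree only compares feature values against a threshold, so multiplying a feature by a positive constant moves the threshold but not the set of rows on each side. The toy sketch below uses a hand-written one-split search (not scikit-learn's implementation; `best_split` is a hypothetical helper) on the five `abs_lat_change` values and fares shown in the `df.head()` output earlier.

```python
def best_split(xs, ys):
    """Find the single threshold split on xs that minimises the squared error of ys.

    Returns the set of row indices sent to the left branch and the resulting error.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_sse, best_left = float('inf'), None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        sse = 0.0
        for group in (left, right):
            mean = sum(ys[i] for i in group) / len(group)
            sse += sum((ys[i] - mean) ** 2 for i in group)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left, best_sse


# First five rows of the training data: abs_lat_change and fare_amount.
abs_lat_change = [0.010708, 0.024949, 0.015754, 0.01318, 0.02309]
fare_amount = [5.7, 7.7, 5.3, 7.5, 16.5]

left_raw, sse_raw = best_split(abs_lat_change, fare_amount)
left_scaled, sse_scaled = best_split([x * 1000 for x in abs_lat_change], fare_amount)

# The same rows go left and the error is unchanged by the rescaling.
print(left_raw == left_scaled, sse_raw == sse_scaled)
```

A method such as Ridge Regression penalises coefficient size, and a coefficient's size depends on the feature's units, which is why rescaling can matter there.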

The feature importance for latitudinal distance is greater than the importance of longitudinal distance. From this, can we conclude whether traveling a fixed latitudinal distance tends to be more expensive than traveling the same longitudinal distance?

Permutation importance alone cannot answer this. Possible reasons the latitude features are more important than the longitude features:

1. Latitudinal distances in the dataset tend to be larger
2. It is more expensive to travel a fixed latitudinal distance
3. Knowledge of NYC geography might help: taxi drivers do not like to go uptown during rush hour because they would make less money

If `abs_lon_change` values were very small, longitudes could be less important to the model even if the cost per mile of travel in that direction were high.
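
One piece of geometry bears on this question: away from the equator, a degree of longitude covers less ground than a degree of latitude. The sketch below converts the mean values reported by `df.describe()` into approximate kilometres. It assumes a spherical Earth with roughly 111 km per degree of latitude, an approximation introduced here rather than taken from the notebook, and the variable names are illustrative.

```python
import math

KM_PER_DEG_LAT = 111.0  # rough spherical-Earth approximation (assumption)


def km_per_degree_lon(latitude_deg):
    """Approximate length in km of one degree of longitude at a given latitude."""
    return KM_PER_DEG_LAT * math.cos(math.radians(latitude_deg))


# Means taken from the df.describe() output above.
mean_pickup_latitude = 40.756882
mean_abs_lat_change = 0.014863
mean_abs_lon_change = 0.013064

lon_km_per_deg = km_per_degree_lon(mean_pickup_latitude)
mean_abs_lat_km = mean_abs_lat_change * KM_PER_DEG_LAT
mean_abs_lon_km = mean_abs_lon_change * lon_km_per_deg

print(f'1 degree of longitude near NYC is about {lon_km_per_deg:.1f} km')
print(f'mean latitudinal distance:  about {mean_abs_lat_km:.2f} km')
print(f'mean longitudinal distance: about {mean_abs_lon_km:.2f} km')
```

Equal changes in degrees are therefore not equal distances on the ground, which is one more reason the two importance values should not be read as a cost per unit of distance.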

**Save Model**

We decided that `rfr_model2` is good enough, so let's save it for future use.

In [20]:
with open(best_model_file, 'wb') as model_file:
  joblib.dump(rfr_model2, model_file)

**Read test dataset and use it for real prediction**

At some point in the future we will need to predict taxi fares in real time, so let's load our saved model.

In [21]:
model = joblib.load(best_model_file)

**Load test dataset**

In [22]:
df = pd.read_csv(test_dataset_file, header='infer')

In [23]:
df['abs_lon_change'] = abs(df['dropoff_longitude'] - df['pickup_longitude'])
df['abs_lat_change'] = abs(df['dropoff_latitude'] - df['pickup_latitude'])

test_X = df[features2]
test_X.describe()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | abs_lat_change | abs_lon_change |
|---|---|---|---|---|---|---|
| count | 9914.0 | 9914.0 | 9914.0 | 9914.0 | 9914.0 | 9914.0 |
| mean | -73.974722 | 40.751041 | -73.973657 | 40.751743 | 0.022133 | 0.023348 |
| std | 0.042774 | 0.033541 | 0.039072 | 0.035435 | 0.025589 | 0.036719 |
| min | -74.252193 | 40.573143 | -74.263242 | 40.568973 | 0.0 | 0.0 |
| 25% | -73.992501 | 40.736125 | -73.991247 | 40.735254 | 0.007279 | 0.006354 |
| 50% | -73.982326 | 40.753051 | -73.980015 | 40.754065 | 0.014715 | 0.013123 |
| 75% | -73.968013 | 40.767113 | -73.964059 | 40.768757 | 0.028261 | 0.024557 |
| max | -72.986532 | 41.709555 | -72.990963 | 41.696683 | 0.633213 | 0.849168 |
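
The summary above shows test coordinates well outside the box used to filter the training data (latitude 40.7 to 40.8, longitude -74 to -73.9); the maximum `pickup_latitude`, for example, is 41.709555. A tree-based model cannot extrapolate beyond the range it was trained on, so it may be worth flagging such rows before trusting their predictions. A minimal sketch, using six test rows that also appear in the prediction output further down; the helper name `inside_training_box` is hypothetical:

```python
import pandas as pd


def inside_training_box(frame):
    """Boolean mask: True where all four coordinates lie inside the training filter box."""
    lats = frame[['pickup_latitude', 'dropoff_latitude']]
    lons = frame[['pickup_longitude', 'dropoff_longitude']]
    lat_ok = ((lats > 40.7) & (lats < 40.8)).all(axis=1)
    lon_ok = ((lons > -74) & (lons < -73.9)).all(axis=1)
    return lat_ok & lon_ok


# Six rows copied from the test set (index labels as in the original file).
sample = pd.DataFrame(
    {
        'pickup_longitude': [-73.973320, -73.986862, -73.945511, -73.991600, -73.985573, -73.988022],
        'pickup_latitude': [40.763805, 40.719383, 40.803600, 40.726608, 40.735432, 40.754070],
        'dropoff_longitude': [-73.981430, -73.998886, -73.960213, -73.789742, -73.939178, -74.000282],
        'dropoff_latitude': [40.743835, 40.739201, 40.776371, 40.647011, 40.801731, 40.759220],
    },
    index=[0, 1, 9910, 9911, 9912, 9913],
)

in_box = inside_training_box(sample)
print(in_box)
print(f'{in_box.mean():.0%} of these rows fall inside the training box')
```

Applied to the full test frame, the mean of this mask would give the share of predictions that stay within the region the model has actually seen.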


In [24]:
test_y = model.predict(test_X)

In [25]:
df['fare_amount'] = test_y
df.drop(['abs_lon_change', 'abs_lat_change'], axis=1)

| | key | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | fare_amount |
|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-27 13:08:24.0000002 | 2015-01-27 13:08:24 UTC | -73.973320 | 40.763805 | -73.981430 | 40.743835 | 1 | 9.050000 |
| 1 | 2015-01-27 13:08:24.0000003 | 2015-01-27 13:08:24 UTC | -73.986862 | 40.719383 | -73.998886 | 40.739201 | 1 | 9.200000 |
| 2 | 2011-10-08 11:53:44.0000002 | 2011-10-08 11:53:44 UTC | -73.982524 | 40.751260 | -73.979654 | 40.746139 | 1 | 4.346667 |
| 3 | 2012-12-01 21:12:12.0000002 | 2012-12-01 21:12:12 UTC | -73.981160 | 40.767807 | -73.990448 | 40.751635 | 1 | 9.050000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9910 | 2015-01-12 17:05:51.0000001 | 2015-01-12 17:05:51 UTC | -73.945511 | 40.803600 | -73.960213 | 40.776371 | 6 | 10.513333 |
| 9911 | 2015-04-19 20:44:15.0000001 | 2015-04-19 20:44:15 UTC | -73.991600 | 40.726608 | -73.789742 | 40.647011 | 6 | 29.313333 |
| 9912 | 2015-01-31 01:05:19.0000005 | 2015-01-31 01:05:19 UTC | -73.985573 | 40.735432 | -73.939178 | 40.801731 | 6 | 15.543333 |
| 9913 | 2015-01-18 14:06:23.0000006 | 2015-01-18 14:06:23 UTC | -73.988022 | 40.754070 | -74.000282 | 40.759220 | 6 | 6.203333 |


In [26]:
df.describe()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | abs_lon_change | abs_lat_change | fare_amount |
|---|---|---|---|---|---|---|---|---|
| count | 9914.0 | 9914.0 | 9914.0 | 9914.0 | 9914.0 | 9914.0 | 9914.0 | 9914.0 |
| mean | -73.974722 | 40.751041 | -73.973657 | 40.751743 | 1.671273 | 0.023348 | 0.022133 | 11.084158 |
| std | 0.042774 | 0.033541 | 0.039072 | 0.035435 | 1.278747 | 0.036719 | 0.025589 | 6.715807 |
| min | -74.252193 | 40.573143 | -74.263242 | 40.568973 | 1.0 | 0.0 | 0.0 | 3.24 |
| 25% | -73.992501 | 40.736125 | -73.991247 | 40.735254 | 1.0 | 0.006354 | 0.007279 | 6.593333 |
| 50% | -73.982326 | 40.753051 | -73.980015 | 40.754065 | 1.0 | 0.013123 | 0.014715 | 8.96 |
| 75% | -73.968013 | 40.767113 | -73.964059 | 40.768757 | 2.0 | 0.024557 | 0.028261 | 13.3125 |
| max | -72.986532 | 41.709555 | -72.990963 | 41.696683 | 6.0 | 0.849168 | 0.633213 | 105.8 |
