<center><a href="https://www.featuretools.com/"><img src="img/featuretools-logo.png" width="400" height="200" /></a></center>

<h2> An Advanced Featuretools Approach </h2>
<p>The following tutorial illustrates an advanced Featuretools model for the NYC Taxi Trip Duration competition on Kaggle. This notebook reads the dataset, merges in an additional dataset, creates new columns, uses deep feature synthesis (DFS) to generate features, and trains a regressor.</p>

<h2>Step 1: Download raw data </h2>
<p>The first step is to download the raw data from the <a href="https://www.kaggle.com/c/nyc-taxi-trip-duration/data">Kaggle website</a>.</p>
<ul>
<li>train.csv</li>
<li>test.csv</li>
</ul>

<p>The second dataset, which provides the files listed below, can be found here: <a href="https://www.kaggle.com/oscarleo/new-york-city-taxi-with-osrm">Kaggle Notebook Output</a>.</p>
<ul>
<li>fastest_routes_train_part_1.csv</li>
<li>fastest_routes_train_part_2.csv</li>
<li>fastest_routes_test.csv</li>
</ul>

In [None]:
import featuretools as ft
import pandas as pd
import numpy as np
import utils
ft.__version__

<h2>Step 2: Prepare the Data </h2>

In [3]:
TRAIN_DIR = "data/train.csv"
TEST_DIR = "data/test.csv"

# We can specify the number of rows to speed up this notebook
data_train, data_test = utils.read_data(TRAIN_DIR, TEST_DIR, nrows=None)

data_train.head(5)

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,1,-73.982155,40.767937,-73.96463,40.765602,False,455
1,id2377394,1,2016-06-12 00:43:35,1,-73.980415,40.738564,-73.999481,40.731152,False,663
2,id3858529,2,2016-01-19 11:35:24,1,-73.979027,40.763939,-74.005333,40.710087,False,2124
3,id3504673,2,2016-04-06 19:32:31,1,-74.01004,40.719971,-74.012268,40.706718,False,429
4,id2181028,2,2016-03-26 13:30:55,1,-73.973053,40.793209,-73.972923,40.78252,False,435


<p>We can also add a dataset that contains the fastest route between the pickup and dropoff coordinates of each trip, keyed by trip id. Here we only load the number of steps in each route.</p>

In [4]:
# https://www.kaggle.com/oscarleo/new-york-city-taxi-with-osrm
fastest_routes_part1 = pd.read_csv('data/fastest_routes_train_part_1.csv', 
                                   usecols=['id', 'number_of_steps'])
fastest_routes_part2 = pd.read_csv('data/fastest_routes_train_part_2.csv', 
                                   usecols=['id', 'number_of_steps'])
fastest_routes_train = pd.concat((fastest_routes_part1, fastest_routes_part2))

fastest_routes_test = pd.read_csv('data/fastest_routes_test.csv',
                                  usecols=['id', 'number_of_steps'])

data_train = data_train.merge(fastest_routes_train, how='left', on='id')
data_test = data_test.merge(fastest_routes_test, how='left', on='id')
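
<p>Because the merge above uses <code>how='left'</code>, every trip is kept even if no route is found for it, and such trips end up with a missing <code>number_of_steps</code>. The following is a minimal sketch of that behavior on made-up toy tables (the values and the <code>indicator=True</code> check are illustrative additions, not part of the original notebook).</p>

```python
import pandas as pd

# Toy stand-ins for the trips table and the routes table (values are made up).
trips = pd.DataFrame({'id': ['a', 'b', 'c'], 'passenger_count': [1, 2, 1]})
routes = pd.DataFrame({'id': ['a', 'b'], 'number_of_steps': [5, 11]})

# how='left' keeps every trip; indicator=True adds a '_merge' column that
# records whether a matching route row was found.
merged = trips.merge(routes, how='left', on='id', indicator=True)

# Trips without a matching route have NaN in number_of_steps.
missing = merged.loc[merged['_merge'] == 'left_only', 'id'].tolist()
print(merged)
print("trips without a route:", missing)
```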

<p>Let's create another column that marks whether each row belongs to the train or the test dataset.</p>

In [5]:
data_train['test_data'] = False
data_test['test_data'] = True

<p>We can now combine the data.</p>

In [6]:
data = pd.concat([data_train, data_test])
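
<p>The <code>test_data</code> flag is what lets the combined table be split apart again later. A small sketch on made-up rows, assuming the split is done with a boolean mask (how <code>utils</code> does it is not shown in this notebook):</p>

```python
import pandas as pd

# Made-up miniature train and test tables.
train = pd.DataFrame({'id': ['a', 'b'], 'trip_duration': [455, 663]})
test = pd.DataFrame({'id': ['c']})

train['test_data'] = False
test['test_data'] = True

# Test rows have no label, so trip_duration is NaN for them after concat.
combined = pd.concat([train, test])

# A boolean mask on the flag recovers the two original groups of rows.
train_again = combined[~combined['test_data']]
test_again = combined[combined['test_data']]
print(train_again['id'].tolist(), test_again['id'].tolist())
```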

<h2>Step 3: Latitude and Longitude Information Extraction </h2>
<p>In this step we will use the latitude and longitude to create some additional data points.</p>

<p>First, we can determine a pickup and dropoff cluster depending on latitude and longitude.</p>

In [7]:
from sklearn.cluster import MiniBatchKMeans
coords = np.vstack((data_train[['pickup_latitude', 'pickup_longitude']].values,
                    data_train[['dropoff_latitude', 'dropoff_longitude']].values,
                    data_test[['pickup_latitude', 'pickup_longitude']].values,
                    data_test[['dropoff_latitude', 'dropoff_longitude']].values))

kmeans = MiniBatchKMeans(n_clusters=100, batch_size=10000).fit(coords)

data['pickup_cluster'] = kmeans.predict(data[['pickup_latitude', 'pickup_longitude']])
data['dropoff_cluster'] = kmeans.predict(data[['dropoff_latitude', 'dropoff_longitude']])
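
<p>Fitting a single model on pickup and dropoff coordinates together means both columns share one set of cluster ids. The toy sketch below illustrates this with two made-up groups of points. The coordinates, <code>n_init</code>, and <code>random_state</code> are assumptions chosen to make the example small and repeatable.</p>

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Two tight, well-separated groups of made-up (latitude, longitude) points.
group_a = np.array([[40.760, -73.980], [40.761, -73.981], [40.759, -73.979]])
group_b = np.array([[40.640, -73.780], [40.641, -73.781], [40.639, -73.779]])
coords = np.vstack((group_a, group_b))

# One model fit on all points, so labels mean the same thing for any column.
toy_kmeans = MiniBatchKMeans(n_clusters=2, batch_size=6, n_init=3,
                             random_state=0).fit(coords)

labels_a = toy_kmeans.predict(group_a)
labels_b = toy_kmeans.predict(group_b)
print(labels_a, labels_b)
```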

<p>Distance can also be derived from the latitude and longitude. The haversine formula gives the great-circle (straight-line) distance between the two points, stored below as <code>distance_euclidean</code>. We can also compute the city block distance, also known as Manhattan distance, by adding the distance traveled along the longitude to the distance traveled along the latitude.</p>

In [9]:
# https://en.wikipedia.org/wiki/Haversine_formula

def haversine_distance(lat1, lon1, lat2, lon2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    km = 6367 * 2 * np.arcsin(np.sqrt(a))
    return km

data['distance_euclidean'] = haversine_distance(data['pickup_latitude'], data['pickup_longitude'], 
                                                data['dropoff_latitude'], data['dropoff_longitude'])

#https://en.wikipedia.org/wiki/Taxicab_geometry

def cityblock_distance(lat1, lon1, lat2, lon2):
    lon_dis = haversine_distance(lat1, lon1, lat1, lon2)
    lat_dist = haversine_distance(lat1, lon1, lat2, lon1)
    return lon_dis + lat_dist

data['distance_cityblock'] = cityblock_distance(data['pickup_latitude'], data['pickup_longitude'], 
                                                data['dropoff_latitude'], data['dropoff_longitude'])
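
<p>As a quick sanity check of the two distance functions, one degree of latitude along a meridian should come out as the Earth radius used above (6367 km) times &pi;/180, and the city block distance should never be shorter than the haversine distance. The sketch below repeats the two functions so it can run on its own, and uses the coordinates of the first trip shown in the table earlier.</p>

```python
import numpy as np

def haversine_distance(lat1, lon1, lat2, lon2):
    # Same formula as above: great-circle distance in km, radius 6367 km.
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

def cityblock_distance(lat1, lon1, lat2, lon2):
    # Distance along the longitude plus distance along the latitude.
    lon_dist = haversine_distance(lat1, lon1, lat1, lon2)
    lat_dist = haversine_distance(lat1, lon1, lat2, lon1)
    return lon_dist + lat_dist

# One degree of latitude at a fixed longitude.
one_degree = haversine_distance(40.0, -74.0, 41.0, -74.0)
expected = 6367 * np.pi / 180

# First trip from the training table: a short trip of about 1.5 km.
straight = haversine_distance(40.767937, -73.982155, 40.765602, -73.96463)
blocks = cityblock_distance(40.767937, -73.982155, 40.765602, -73.96463)
print(one_degree, expected, straight, blocks)
```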

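<p>We can also determine the bearing, that is, the compass direction in degrees from the pickup point to the dropoff point (stored below as <code>bearing_distance</code>).</p>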

In [10]:
# http://www.movable-type.co.uk/scripts/latlong.html
def bearing_distance(lat1, lon1, lat2, lon2):
    delta_lon = np.radians(lon2 - lon1)
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    x = np.cos(lat2) * np.sin(delta_lon)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(delta_lon)
    return np.degrees(np.arctan2(x, y))
data['bearing_distance'] = bearing_distance(data['pickup_latitude'], data['pickup_longitude'], 
                                            data['dropoff_latitude'], data['dropoff_longitude'])
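
<p>To see what the function returns, it helps to try the four compass directions from a point on the equator. The sketch below repeats the function so it runs on its own. The test points are made up.</p>

```python
import numpy as np

def bearing_distance(lat1, lon1, lat2, lon2):
    # Same formula as above: initial bearing in degrees, between -180 and 180.
    delta_lon = np.radians(lon2 - lon1)
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    x = np.cos(lat2) * np.sin(delta_lon)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(delta_lon)
    return np.degrees(np.arctan2(x, y))

north = bearing_distance(0.0, 0.0, 1.0, 0.0)   # heading due north
east = bearing_distance(0.0, 0.0, 0.0, 1.0)    # heading due east
south = bearing_distance(1.0, 0.0, 0.0, 0.0)   # heading due south
west = bearing_distance(0.0, 0.0, 0.0, -1.0)   # heading due west
print(north, east, south, west)
```

<p>North maps to 0 degrees, east to 90, south to 180, and west to -90, so westward headings appear as negative values rather than values above 180.</p>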

<p>Finally, we can determine the center point between the pickup and dropoff coordinates.</p>

In [11]:
data['center_latitude'] = (data['pickup_latitude'] + data['dropoff_latitude']) / 2
data['center_longtitude'] = (data['pickup_longitude'] + data['dropoff_longitude']) / 2

<h2>Step 4: Create new features using DFS </h2>
<p>Let's use Featuretools to see what features DFS comes up with.</p>

In [12]:
from featuretools import variable_types as vtypes

trip_variable_types = {
    'number_of_steps': vtypes.Ordinal,
    'passenger_count': vtypes.Ordinal, 
    'vendor_id': vtypes.Categorical,
    'pickup_cluster': vtypes.Categorical,
    'dropoff_cluster': vtypes.Categorical,
}

<p>We can also normalize many more entities than before.</p>

In [13]:
es = ft.EntitySet("taxi")

es.entity_from_dataframe(entity_id="trips",
                         dataframe=data,
                         index="id",
                         time_index='pickup_datetime',
                         variable_types=trip_variable_types)

es.normalize_entity(base_entity_id="trips",
                    new_entity_id="vendors",
                    index="vendor_id")

es.normalize_entity(base_entity_id="trips",
                    new_entity_id="passenger_cnt",
                    index="passenger_count")

es.normalize_entity(base_entity_id="trips",
                    new_entity_id="steps",
                    index="number_of_steps")

es.normalize_entity(base_entity_id="trips",
                    new_entity_id="pickup_loc",
                    index="pickup_cluster")

es.normalize_entity(base_entity_id="trips",
                    new_entity_id="dropoff_loc",
                    index="dropoff_cluster")


Entityset: taxi
  Entities:
    dropoff_loc (shape = [100, 2])
    pickup_loc (shape = [100, 2])
    passenger_cnt (shape = [8, 2])
    vendors (shape = [2, 2])
    steps (shape = [43, 2])
    ...And 1 more
  Relationships:
    trips.vendor_id -> vendors.vendor_id
    trips.passenger_count -> passenger_cnt.passenger_count
    trips.number_of_steps -> steps.number_of_steps
    trips.pickup_cluster -> pickup_loc.pickup_cluster
    trips.dropoff_cluster -> dropoff_loc.dropoff_cluster
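
<p>Conceptually, normalizing an entity pulls the unique values of one column out into a table of their own, which the original table then references. The pandas sketch below only illustrates that idea and is not how Featuretools implements it. The second column, <code>first_seen</code>, is a hypothetical name, based on the assumption that the two-column shapes printed above hold the index plus the time each value first appears.</p>

```python
import pandas as pd

# Made-up miniature trips table.
trips = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'vendor_id': [2, 1, 2, 2],
    'pickup_datetime': pd.to_datetime(['2016-03-14 17:24:55',
                                       '2016-06-12 00:43:35',
                                       '2016-01-19 11:35:24',
                                       '2016-04-06 19:32:31']),
})

# One row per unique vendor_id, plus the earliest pickup time for that vendor.
vendors = (trips.groupby('vendor_id', as_index=False)['pickup_datetime']
                .min()
                .rename(columns={'pickup_datetime': 'first_seen'}))
print(vendors)
```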

<p>For each instance of the target_entity, we can specify the time at which its features are calculated. This timestamp is the last point in time from which DFS may use data when calculating features. It is specified using a dataframe of cutoff times. Below, the cutoff time for each trip is its pickup time.</p>

In [14]:
cutoff_time = (es['trips'].df[['id', 'pickup_datetime']])
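
<p>To make the effect of a cutoff time concrete, the sketch below computes one aggregation by hand on made-up data: the mean distance of a vendor's trips, using only trips that started at or before the cutoff. The helper <code>vendor_mean_distance</code> is hypothetical and only mimics the idea. It assumes the cutoff is inclusive.</p>

```python
import pandas as pd

# Made-up trips for a single vendor.
trips = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'vendor_id': [1, 1, 1],
    'pickup_datetime': pd.to_datetime(['2016-01-01 08:00',
                                       '2016-01-01 09:00',
                                       '2016-01-01 10:00']),
    'distance': [2.0, 4.0, 9.0],
})

def vendor_mean_distance(trips, vendor_id, cutoff):
    """Mean distance over the vendor's trips that started at or before cutoff."""
    visible = trips[(trips['vendor_id'] == vendor_id) &
                    (trips['pickup_datetime'] <= cutoff)]
    return visible['distance'].mean()

# Use each trip's own pickup time as its cutoff, as in the cell above.
means = [vendor_mean_distance(trips, row['vendor_id'], row['pickup_datetime'])
         for _, row in trips.iterrows()]
print(means)
```

<p>Each trip sees a different value for the same aggregation, because later trips are hidden from earlier ones.</p>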

<p>Given this dataset, we would have about 2 million unique cutoff times. This is a good use case for the <code>approximate</code> parameter of DFS. In a large dataset, direct features that are aggregations on the prediction entity may not change much from one cutoff time to the next. Calculating these aggregation features once per window, for example every hour, and reusing the result for all cutoff times within that window saves time and may lose little information. The <code>approximate</code> parameter lets you specify the window size to use when approximating these direct aggregation features.</p>
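
<p>The saving comes from how few distinct windows there are compared with distinct cutoff times. The sketch below shows only that bucketing idea in pandas, on made-up timestamps with a one-hour window. It is not Featuretools code, and the library may assign windows differently.</p>

```python
import pandas as pd

# Four made-up cutoff times, three of them within the same hour.
cutoffs = pd.Series(pd.to_datetime(['2016-03-14 17:24:55',
                                    '2016-03-14 17:41:10',
                                    '2016-03-14 17:59:59',
                                    '2016-03-14 18:05:00']))

# Round each cutoff time down to the start of its one-hour window.
windows = cutoffs.dt.floor('60min')

# Aggregations would be computed once per window instead of once per cutoff.
print(cutoffs.nunique(), "cutoff times fall into", windows.nunique(), "windows")
```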

<b>Note: to skip the computation, we can load an already calculated feature_matrix instead.</b>
<p>To do so, copy the code below into a cell and run it.</p>
```python
feature_matrix = pd.read_csv('https://s3.amazonaws.com/featuretools-static/nyc_taxi/fm_advanced.csv', 
                             index_col='id')
features = feature_matrix.columns.values
```

In [15]:
from featuretools.primitives import (Sum, Mean, Median, Std, Count, Min, Max, NUnique, Skew,
                                     Day, Hour, Minute, Month, Weekday, Week, Weekend)

# this allows us to create features that are conditioned on a second value before we calculate.
es.add_interesting_values()

agg_primitives = [Sum, Mean, Median, Std, Count, Min, Max, NUnique, Skew]
trans_primitives = [Day, Hour, Minute, Month, Weekday, Week, Weekend]

# calculate feature_matrix using deep feature synthesis
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="trips",
                                  trans_primitives=trans_primitives,
                                  agg_primitives=agg_primitives,
                                  drop_contains=['trips.test_data'],
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  approximate='36d',
                                  max_depth=3)

Building features: 911it [00:00, 5780.16it/s]
Progress: 100%|██████████| 6/6 [13:42<00:00, 150.16s/cutoff time]


<p>Let's look at the many features created by simply using DFS.</p>

In [16]:
features[:30]

[<Feature: dropoff_longitude>,
 <Feature: trip_duration>,
 <Feature: pickup_latitude>,
 <Feature: pickup_longitude>,
 <Feature: distance_cityblock>,
 <Feature: pickup_cluster>,
 <Feature: bearing_distance>,
 <Feature: number_of_steps>,
 <Feature: center_latitude>,
 <Feature: passenger_count>,
 <Feature: distance_euclidean>,
 <Feature: dropoff_latitude>,
 <Feature: store_and_fwd_flag>,
 <Feature: center_longtitude>,
 <Feature: test_data>,
 <Feature: vendor_id>,
 <Feature: dropoff_cluster>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: dropoff_loc.MIN(trips.bearing_distance)>,
 <Feature: passenger_cnt.MAX(trips.center_latitude)>,
 <Feature: dropoff_loc.STD(trips.distance_cityblock)>,
 <Feature: vendors.SUM(trips.dropoff_longitude)>,
 <Feature: passenger_cnt.MEAN(trips.ce

In [17]:
print(len(features))

459


<h2>Step 5: Build the Model </h2>

<p>As before, we need to retrieve our labels for the train dataset, so we should merge our current feature matrix with the original dataset. </p>

<p>We also take the log of the trip duration (plus one, so that the log is defined even for a zero duration) so that a more linear relationship can be found.</p>

In [18]:
# separates the whole feature matrix into train data feature matrix, train data labels, and test data feature matrix 
X_train, labels, X_test = utils.get_train_test_fm(feature_matrix)
labels = np.log(labels.values + 1)
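
<p>A side effect of this transform is that an RMSE measured on the transformed labels is the same quantity as an RMSLE measured on the raw durations, and predictions need the inverse transform to get back to seconds. A small sketch, using three durations from the table above and made-up predictions (the <code>rmsle</code> helper is hypothetical, written out here from its definition):</p>

```python
import numpy as np

durations = np.array([455.0, 663.0, 2124.0])   # seconds, from the table above
predicted = np.array([500.0, 600.0, 2000.0])   # made-up predictions in seconds

labels = np.log(durations + 1)     # same transform as the cell above
restored = np.exp(labels) - 1      # inverse transform back to seconds

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw (untransformed) values."""
    return np.sqrt(np.mean((np.log(y_pred + 1) - np.log(y_true + 1)) ** 2))

# RMSE computed in log space against the transformed labels.
rmse_in_log_space = np.sqrt(np.mean((np.log(predicted + 1) - labels) ** 2))
print(rmse_in_log_space, rmsle(durations, predicted))
```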

In [19]:
model = utils.train_xgb(X_train, labels)

[0]	train-rmse:4.99587	valid-rmse:4.99586
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 60 rounds.
[10]	train-rmse:0.906881	valid-rmse:0.908897
[20]	train-rmse:0.382196	valid-rmse:0.388451
[30]	train-rmse:0.340506	valid-rmse:0.349907
[40]	train-rmse:0.33136	valid-rmse:0.343459
[50]	train-rmse:0.323989	valid-rmse:0.338486
[60]	train-rmse:0.31904	valid-rmse:0.33593
[70]	train-rmse:0.315621	valid-rmse:0.334252
[80]	train-rmse:0.312133	valid-rmse:0.332654
[90]	train-rmse:0.309233	valid-rmse:0.331902
[100]	train-rmse:0.307548	valid-rmse:0.331158
[110]	train-rmse:0.305488	valid-rmse:0.330103
[120]	train-rmse:0.303794	valid-rmse:0.329419
[130]	train-rmse:0.302013	valid-rmse:0.328778
[140]	train-rmse:0.300956	valid-rmse:0.328439
[150]	train-rmse:0.299635	valid-rmse:0.328131
[160]	train-rmse:0.298779	valid-rmse:0.327923
[170]	train-rmse:0.297169	valid-rmse:0.327512
[180]	train-rmse:0.295742	valid-rmse:0.3273

<h2>Step 6: Make a Submission </h2>

In [20]:
submission = utils.predict_xgb(model, X_test)
submission.head(5)

id,trip_duration
id0000002,985.992432
id0000199,2252.50708
id0000446,680.643005
id0000587,1250.717896
id0000604,226.703491


In [21]:
submission.to_csv('trip_duration_ft_advanced.csv', index=True, index_label='id')

<dl>
<dt>This solution:</dt>
<dd>&nbsp; &nbsp; Received a score of 0.38561 on the Kaggle competition.</dd>
<dd>&nbsp; &nbsp; Placed 171 out of 1047.</dd>
<dd>&nbsp; &nbsp; Beat 84% of competitors on the Kaggle competition.</dd>
<dd>&nbsp; &nbsp; Scored 41% better than the baseline solution.</dd>
<dd>&nbsp; &nbsp; Had a modeling RMSLE of 0.32641.</dd>
</dl>

September 7, 2017.

<h2>Additional Analysis</h2>
<p>Let's look at how important each feature was for the model.</p>

In [22]:
feature_names = X_train.columns.values
ft_importances = utils.feature_importances(model, feature_names)
ft_importances[:50]

Unnamed: 0,feature_name,importance
76,distance_euclidean,2264.0
70,distance_cityblock,2043.0
54,bearing_distance,1878.0
154,HOUR(pickup_datetime),1876.0
158,dropoff_latitude,1672.0
68,pickup_latitude,1665.0
156,center_longtitude,1590.0
67,dropoff_longitude,1571.0
51,pickup_longitude,1384.0
56,center_latitude,1273.0
