<center><a href="https://www.featuretools.com/"><img src="img/featuretools-logo.png" width="400" height="200" /></a></center>

<h2> A Featuretools Baseline </h2>
<p>This tutorial builds a Featuretools baseline model for the NYC Taxi Trip Duration competition on Kaggle. The notebook follows the structure of the previous worksheet, but uses Deep Feature Synthesis (DFS) to create the features for the model.</p>

<h2>Step 1: Download raw data </h2>
<p>If you have not yet downloaded the data, it can be found on the <a href="https://www.kaggle.com/c/nyc-taxi-trip-duration/data">Kaggle website</a>. After installing Featuretools by following <a href="https://docs.featuretools.com/">the instructions in the documentation</a>, you can run the following.
</p>


In [1]:
import pandas as pd
import numpy as np
import featuretools as ft
import taxi_utils
ft.__version__



'0.1.16'

In [2]:
TRAIN_DIR = "data/train.csv"
TEST_DIR = "data/test.csv"

data_train, data_test = taxi_utils.read_data(TRAIN_DIR, TEST_DIR)

data_train.head(5)

,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,1,-73.982155,40.767937,-73.96463,40.765602,False,455
1,id2377394,1,2016-06-12 00:43:35,1,-73.980415,40.738564,-73.999481,40.731152,False,663
2,id3858529,2,2016-01-19 11:35:24,1,-73.979027,40.763939,-74.005333,40.710087,False,2124
3,id3504673,2,2016-04-06 19:32:31,1,-74.01004,40.719971,-74.012268,40.706718,False,429
4,id2181028,2,2016-03-26 13:30:55,1,-73.973053,40.793209,-73.972923,40.78252,False,435


<h2>Step 2: Prepare the Data </h2>
<p>Let's add a <b>test_data</b> column that marks whether each row comes from the test or the train dataset.</p>

In [3]:
data_train['test_data'] = False
data_test['test_data'] = True

<p>We can now combine the data. </p>

In [4]:
data = pd.concat([data_train, data_test])
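
<p>As a minimal sketch of why the flag column is useful, the toy example below combines two small frames and then splits them apart again. The frames <code>toy_train</code> and <code>toy_test</code> and their values are made up for illustration; the actual split in this notebook is done later by <code>taxi_utils</code>, whose internals are not shown here.</p>

```python
import pandas as pd

# Toy stand-ins for data_train and data_test (values are made up).
toy_train = pd.DataFrame({"id": ["a", "b"],
                          "vendor_id": [1, 2],
                          "trip_duration": [455, 663]})
toy_test = pd.DataFrame({"id": ["c"], "vendor_id": [2]})

# Mark where each row came from before combining.
toy_train["test_data"] = False
toy_test["test_data"] = True

# Test rows have no trip_duration, so concat fills that column with NaN.
combined = pd.concat([toy_train, toy_test])

# The flag lets us recover both portions later.
train_back = combined[~combined["test_data"]]
test_back = combined[combined["test_data"]].drop(columns=["trip_duration"])

print(combined)
```

<p>The missing <b>trip_duration</b> values for test rows are expected; they show up as empty cells in the combined table further down.</p>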

<h2>Step 3: Create baseline features using DFS </h2>
<p>Instead of manually creating features, such as month of <b>pickup_datetime</b>, we can let featuretools come up with them. </p>
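
<p>For comparison, here is a rough sketch of what creating such features by hand could look like, using only the Python standard library. The helper <code>datetime_features</code> is hypothetical, and the exact conventions Featuretools uses (for example, how weeks are numbered) may differ from the ones assumed here.</p>

```python
from datetime import datetime

def datetime_features(ts):
    """Derive simple calendar features from a single timestamp."""
    return {
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "minute": ts.minute,
        "week": ts.isocalendar()[1],      # ISO week number (an assumption)
        "weekday": ts.weekday(),          # Monday = 0 ... Sunday = 6
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }

# Pickup time of the first training row shown above.
pickup = datetime.strptime("2016-03-14 17:24:55", "%Y-%m-%d %H:%M:%S")
manual_features = datetime_features(pickup)
print(manual_features)
```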

<p>Within Featuretools there is a standard format for representing data that is used to set up predictions and build features. An <b>EntitySet</b> stores information about entities (database tables), variables (columns in database tables), relationships, and the data itself. </p>

<p> First, we create the EntitySet.</p>

In [5]:
es = ft.EntitySet("taxi")

<p>We can then use the `entity_from_dataframe` method to add an Entity called <i>trips</i>. We specify the `index` (the <b>id</b> column), the `time_index` (<b>pickup_datetime</b>), and the types of any variables we want to control in this entity. </p>

<p>As a note: Featuretools will try to interpret the types of variables. We can override this interpretation by specifying the types. In this case, I wanted <b>passenger_count</b> to be a type of Ordinal, and <b>vendor_id</b> to be of type Categorical.</p>

In [6]:
from featuretools import variable_types as vtypes

trip_variable_types = {
    'passenger_count': vtypes.Ordinal, 
    'vendor_id': vtypes.Categorical,
}

es.entity_from_dataframe(entity_id="trips",
                         dataframe=data,
                         index="id",
                         time_index='pickup_datetime',
                         variable_types=trip_variable_types)

Entityset: taxi
  Entities:
    trips (shape = [2050266, 11])
  Relationships:
    No relationships

In [7]:
es['trips'].df

id,dropoff_latitude,dropoff_longitude,id,passenger_count,pickup_datetime,pickup_latitude,pickup_longitude,store_and_fwd_flag,test_data,trip_duration,vendor_id
id0190469,40.829182,-73.938828,id0190469,5,2016-01-01 00:00:17,40.719158,-73.981743,False,False,849.0,2
id0621643,40.769379,-73.969330,id0621643,2,2016-01-01 00:00:22,40.716881,-73.981850,False,True,,2
id1384355,40.891788,-73.854263,id1384355,1,2016-01-01 00:00:28,40.733562,-73.976501,False,True,,1
id1665586,40.717491,-73.958038,id1665586,1,2016-01-01 00:00:53,40.747166,-73.985085,False,False,1294.0,1
id1210365,40.815170,-73.947479,id1210365,5,2016-01-01 00:01:01,40.801041,-73.965279,False,False,408.0,2
id3888279,40.750340,-73.991341,id3888279,1,2016-01-01 00:01:14,40.751331,-73.982292,False,False,280.0,1
id0924227,40.742989,-73.989357,id0924227,1,2016-01-01 00:01:20,40.759800,-73.970108,False,False,736.0,1
id2568735,40.748665,-73.876602,id2568735,2,2016-01-01 00:01:24,40.759865,-73.972267,False,True,,1
id2294362,40.847771,-73.936493,id2294362,1,2016-01-01 00:01:33,40.773891,-73.984993,False,False,712.0,2
id1078247,40.761734,-73.974854,id1078247,1,2016-01-01 00:01:37,40.764072,-73.973335,False,False,114.0,2


<p>We can also normalize some of the columns to create new entities. Below, a <i>vendors</i> entity is created from the unique values in the <i>vendor_id</i> column of <i>trips</i>, and a <i>passenger_cnt</i> entity from the unique values in <i>passenger_count</i>.</p>

In [8]:
es.normalize_entity(base_entity_id="trips",
                    new_entity_id="vendors",
                    index="vendor_id")

es.normalize_entity(base_entity_id="trips",
                    new_entity_id="passenger_cnt",
                    index="passenger_count")

Entityset: taxi
  Entities:
    vendors (shape = [2, 2])
    trips (shape = [2050266, 11])
    passenger_cnt (shape = [8, 2])
  Relationships:
    trips.vendor_id -> vendors.vendor_id
    trips.passenger_count -> passenger_cnt.passenger_count
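
<p>Conceptually, normalizing a column pulls its unique values out into a table of their own. The sketch below imitates that idea in plain Python; it is not how Featuretools implements it. The function <code>normalize</code> and the toy rows are hypothetical, and the <code>first_trips_time</code> column is an assumption based on the feature names that appear later in this notebook.</p>

```python
# Toy rows standing in for the trips entity (values are made up).
toy_trips = [
    {"id": "t1", "vendor_id": 2, "pickup_datetime": "2016-01-01 00:00:17"},
    {"id": "t2", "vendor_id": 1, "pickup_datetime": "2016-01-01 00:00:53"},
    {"id": "t3", "vendor_id": 2, "pickup_datetime": "2016-01-01 00:01:01"},
]

def normalize(rows, index, time_col):
    """Build one row per unique `index` value with its earliest timestamp."""
    earliest = {}
    for row in rows:
        key, ts = row[index], row[time_col]
        # ISO-formatted timestamps sort correctly as plain strings.
        if key not in earliest or ts < earliest[key]:
            earliest[key] = ts
    return [{index: key, "first_trips_time": earliest[key]}
            for key in sorted(earliest)]

toy_vendors = normalize(toy_trips, "vendor_id", "pickup_datetime")
print(toy_vendors)
```

<p>Each row of the new table can then be linked back to the trips that share its <b>vendor_id</b>, which is the relationship listed in the output above.</p>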

<p>We can specify the time for each instance of the target_entity to calculate features. The timestamp represents the last time data can be used for calculating features by DFS. This is specified using a dataframe of cutoff times. Below we can see that the cutoff time for each trip is the pickup time.</p>

In [9]:
cutoff_time = es['trips'].df[['id', 'pickup_datetime']]
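
<p>To make the idea of a cutoff time concrete, the sketch below computes a simple aggregation (the number of trips a vendor has made so far) separately for each trip, using only rows at or before that trip's pickup time. This is an illustration in plain Python rather than Featuretools code; the helper name, the toy rows, and the assumption that the cutoff is inclusive are all mine.</p>

```python
from datetime import datetime

# Toy trips (values are made up for illustration).
toy_trips = [
    {"id": "t1", "vendor_id": 2, "pickup_datetime": datetime(2016, 1, 1, 0, 0, 17)},
    {"id": "t2", "vendor_id": 1, "pickup_datetime": datetime(2016, 1, 1, 0, 0, 53)},
    {"id": "t3", "vendor_id": 2, "pickup_datetime": datetime(2016, 1, 1, 0, 1, 1)},
    {"id": "t4", "vendor_id": 1, "pickup_datetime": datetime(2016, 1, 1, 0, 1, 14)},
]

def vendor_trip_count(rows, vendor_id, cutoff):
    """Count a vendor's trips whose pickup time is at or before `cutoff`."""
    return sum(1 for r in rows
               if r["vendor_id"] == vendor_id and r["pickup_datetime"] <= cutoff)

# One value per trip, each computed at that trip's own cutoff time,
# so no trip ever "sees" data from its future.
counts = {r["id"]: vendor_trip_count(toy_trips, r["vendor_id"], r["pickup_datetime"])
          for r in toy_trips}
print(counts)
```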

<p>Given this dataset, we would have about 2 million unique cutoff times, which makes it a good use case for the approximate parameter of DFS. In a large dataset, direct features that are aggregations on the prediction entity may not change much from one cutoff time to the next. For example, calculating the aggregation features once every hour and reusing them for all cutoff times within that hour would save time and perhaps not lose much information. The approximate parameter lets you specify the window size to use when approximating these direct aggregation features; the call below uses a window of 36 days.</p>
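
<p>The saving comes from collapsing many cutoff times onto a few shared ones. The sketch below illustrates that bucketing idea with the standard library only; it is a conceptual illustration under my own assumptions (hourly toy cutoff times, windows counted from the first timestamp), not a description of how Featuretools computes approximate features internally.</p>

```python
from datetime import datetime, timedelta

def window_start(ts, origin, window):
    """Floor `ts` to the start of its window, counting windows from `origin`."""
    return origin + ((ts - origin) // window) * window

origin = datetime(2016, 1, 1)
window = timedelta(days=36)

# Toy cutoff times: one per hour over 182 days (January through June 2016).
cutoffs = [origin + timedelta(hours=h) for h in range(182 * 24)]

# Every cutoff time inside a window shares that window's start time.
approx_times = sorted({window_start(ts, origin, window) for ts in cutoffs})

print(len(cutoffs), "exact cutoff times ->", len(approx_times), "shared ones")
```

<p>With aggregations computed once per shared time instead of once per trip, the amount of repeated work drops sharply, at the cost of slightly stale aggregation values.</p>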

<p>We now create features using DFS.</p>

<b>Note: to skip the computation, we can load an already calculated feature_matrix by doing the following:</b>
<p>You must copy and run the code yourself.</p>
```python
feature_matrix = pd.read_csv('https://s3.amazonaws.com/featuretools-static/nyc_taxi/fm_simple.csv', 
                             index_col='id')
features = feature_matrix.columns.values
```

In [10]:
from featuretools.primitives import (Day, Hour, Minute, Month, Weekday, Week, Weekend)

es.add_interesting_values()

trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]

feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="trips",
                                  trans_primitives=trans_primitives,
                                  drop_contains=['trips.test_data'],
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  approximate='36d')

Building features: 188it [00:00, 9917.23it/s]
Progress: 100%|██████████| 6/6 [03:55<00:00, 39.29s/cutoff time]


<p>Here are the features created. Notice how some of the features match the manually created features in the previous notebook.</p>

In [11]:
print(len(features))

96


In [12]:
features[:25]

[<Feature: store_and_fwd_flag>,
 <Feature: dropoff_longitude>,
 <Feature: test_data>,
 <Feature: pickup_longitude>,
 <Feature: trip_duration>,
 <Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: pickup_latitude>,
 <Feature: dropoff_latitude>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: passenger_cnt.MAX(trips.pickup_longitude)>,
 <Feature: passenger_cnt.STD(trips.pickup_longitude)>,
 <Feature: passenger_cnt.SUM(trips.pickup_longitude)>,
 <Feature: vendors.SUM(trips.dropoff_longitude)>,
 <Feature: passenger_cnt.WEEKDAY(first_trips_time)>,
 <Feature: vendors.MAX(trips.pickup_longitude)>,
 <Feature: passenger_cnt.MAX(trips.dropoff_latitude)>,
 <Feature: passenger_cnt.MODE(trips.vendor_id)>,
 <Feature: vendors.WEEKDAY(first_trips_time)>]

<h2>Step 4: Build the Model </h2>

<p>We need to retrieve our labels for the train dataset, so we separate the feature matrix into the train features, the train labels, and the test features. </p>
<p>We also take the log of the trip duration, log(trip_duration + 1), so that a more linear relationship can be found.</p>

In [13]:
# separates the whole feature matrix into train data feature matrix, train data labels, and test data feature matrix 
X_train, labels, X_test = taxi_utils.get_train_test_fm(feature_matrix)
labels = np.log(labels.values + 1)

In [14]:
model = taxi_utils.train_xgb(X_train, labels)

[0]	train-rmse:5.00403	valid-rmse:5.00377
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 60 rounds.
[10]	train-rmse:1.01317	valid-rmse:1.01429
[20]	train-rmse:0.595111	valid-rmse:0.598508
[30]	train-rmse:0.550728	valid-rmse:0.555082
[40]	train-rmse:0.519177	valid-rmse:0.524392
[50]	train-rmse:0.489201	valid-rmse:0.495522
[60]	train-rmse:0.47082	valid-rmse:0.478204
[70]	train-rmse:0.462427	valid-rmse:0.47082
[80]	train-rmse:0.453642	valid-rmse:0.462814
[90]	train-rmse:0.446342	valid-rmse:0.456236
[100]	train-rmse:0.444788	valid-rmse:0.455247
[110]	train-rmse:0.44277	valid-rmse:0.453643
[120]	train-rmse:0.435461	valid-rmse:0.44695
[130]	train-rmse:0.415849	valid-rmse:0.428293
[140]	train-rmse:0.412735	valid-rmse:0.425813
[150]	train-rmse:0.407197	valid-rmse:0.420897
[160]	train-rmse:0.406527	valid-rmse:0.420566
[170]	train-rmse:0.401259	valid-rmse:0.415712
[180]	train-rmse:0.397528	valid-rmse:0.412441


<h2>Step 5: Make a Submission </h2>

In [15]:
submission = taxi_utils.predict_xgb(model, X_test)
submission.head(5)

id,trip_duration
id0000002,1140.518677
id0000199,2268.834717
id0000446,790.1474
id0000587,1056.925415
id0000604,203.023926


In [16]:
submission.to_csv('trip_duration_ft_simple.csv', index=True, index_label='id')

<dt>This solution:</dt>
<dd>&nbsp; &nbsp; Received a score of 0.45288 on the Kaggle competition.</dd>
<dd>&nbsp; &nbsp; Placed 685 out of 1257.</dd>
<dd>&nbsp; &nbsp; Beat 45% of competitors on the Kaggle competition.</dd>
<dd>&nbsp; &nbsp; Scored 4% better than the baseline solution.</dd>
<dd>&nbsp; &nbsp; Had a modeling RMSLE of 0.40196.</dd>

December 27, 2017.
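
<p>The RMSLE figure ties back to the log transform applied to the labels before training: an RMSE measured on log(duration + 1) values is the same quantity as an RMSLE measured on the raw durations. The sketch below checks this on a few made-up numbers using only the standard library. The helper functions are hypothetical, and the way <code>taxi_utils.predict_xgb</code> converts predictions back to seconds is assumed, not shown in this notebook.</p>

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)

def rmsle(actual, predicted):
    """Root mean squared logarithmic error on the original scale."""
    return rmse([math.log1p(a) for a in actual],
                [math.log1p(p) for p in predicted])

durations = [455, 663, 2124]          # true trip durations in seconds
predictions = [500.0, 600.0, 2000.0]  # made-up predictions in seconds

# Same transform as in the notebook: log(duration + 1).
log_labels = [math.log(d + 1) for d in durations]
log_preds = [math.log1p(p) for p in predictions]

print("RMSE in log space:", rmse(log_labels, log_preds))
print("RMSLE on raw scale:", rmsle(durations, predictions))

# A model trained on log labels predicts in log space, so predictions
# have to be mapped back with exp(x) - 1 before submission.
recovered = [math.expm1(v) for v in log_preds]
```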

<h2>Additional Analysis</h2>
<p>Let's look at how important each feature was for the model.</p>

In [17]:
feature_names = X_train.columns.values
ft_importances = taxi_utils.feature_importances(model, feature_names)
ft_importances

,feature_name,importance
6,dropoff_latitude,7946.0
5,pickup_latitude,7674.0
2,pickup_longitude,7277.0
1,dropoff_longitude,7231.0
9,HOUR(pickup_datetime),3199.0
8,MINUTE(pickup_datetime),2735.0
72,DAY(pickup_datetime),2212.0
70,WEEK(pickup_datetime),1838.0
71,WEEKDAY(pickup_datetime),1446.0
77,passenger_cnt.STD(trips.pickup_longitude),457.0
