<center><a href="https://www.featuretools.com/"><img src="img/featuretools-logo.png" width="400" height="200" /></a></center>

## New York City Taxi Ride Duration Prediction

In this case study, we build a model to predict taxi ride ``duration``. We will take the following steps:

* Load the data.
* Define the outcome variable, i.e. the variable we are trying to predict.
* Build features using the featuretools package, which implements Deep Feature Synthesis. We will start with simple features, then incrementally improve the feature definitions and examine the accuracy of the system.

In [59]:
import pandas as pd
import numpy as np
import featuretools as ft
import utils
from utils import load_nyc_taxi_data, compute_features
from sklearn.metrics import mean_squared_error
from math import sqrt
ft.__version__
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Step 1: Load the raw data
<p>If you have not yet downloaded the data, you can get it <a href="">from S3</a>. The data is already split into train and test sets, and both are CSV files.
</p>

In [44]:
trips, passenger_cnt, vendors = load_nyc_taxi_data()
trips.head(10)

Unnamed: 0,id,dropoff_latitude,dropoff_longitude,passenger_count,payment_type,pickup_datetime,pickup_latitude,pickup_longitude,store_and_fwd_flag,test_data,trip_distance,trip_duration,vendor_id
0,338479,40.716202,-74.014748,6,2,2016-01-01 00:00:45,40.726471,-73.994202,False,False,1.85,512.0,2
1,209890,40.712761,-74.00975,1,2,2016-01-01 00:00:52,40.744091,-73.99575,False,False,2.5,626.0,2
2,1515010,40.749889,-73.987579,1,1,2016-01-01 00:01:32,40.719131,-74.007088,False,True,2.7,,2
3,32054,40.749062,-73.869568,1,2,2016-01-01 00:02:07,40.760712,-73.96846,False,False,5.69,1387.0,2
4,982081,40.743797,-73.985741,5,1,2016-01-01 00:02:41,40.742138,-74.004005,False,False,1.34,852.0,2
5,1547231,40.7192,-74.005127,1,1,2016-01-01 00:03:35,40.742577,-73.980392,False,True,2.7,,1
6,1872196,40.703861,-73.930763,2,1,2016-01-01 00:04:18,40.718288,-74.000641,False,True,4.9,,1
7,159096,40.768898,-73.985909,1,1,2016-01-01 00:04:39,40.735119,-74.006042,False,False,3.0,1033.0,1
8,335672,40.649349,-74.02034,5,1,2016-01-01 00:04:58,40.710529,-73.984673,False,False,8.79,1495.0,2
9,1108122,40.754318,-73.977547,2,1,2016-01-01 00:05:04,40.782616,-73.980721,False,True,2.76,,2


### Step 2: Prepare the Data 
Let's create entities and relationships. The three entities in this data are
* trips
* vendors (the cab vendors)
* passenger_cnt (a simple entity that holds the unique passenger counts, 1-8)

This data has the following relationships
* vendors --> trips (the same vendor can have multiple trips; vendors is the ``parent_entity`` and trips is the child entity)
* passenger_cnt --> trips (the same passenger_cnt can appear in multiple trips; passenger_cnt is the ``parent_entity`` and trips is the child entity)


In [45]:
entities = {
        "trips": (trips, "id", 'pickup_datetime' ),
        "vendors": (vendors, "vendor_id"),
        "passenger_cnt": (passenger_cnt,"passenger_count")
        }

relationships = [("vendors", "vendor_id","trips", "vendor_id"), 
                ("passenger_cnt", "passenger_count","trips", "passenger_count")]
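Each relationship assumes that every key in the child table also exists in the parent table. A minimal sketch of that check, using plain Python and a few hypothetical miniature tables (the real tables come from `load_nyc_taxi_data()`; the helper `orphaned_keys` is made up for illustration and is not part of featuretools):

```python
# Hypothetical miniature tables built from a few of the sample rows shown earlier.
trips = [
    {"id": 338479, "vendor_id": 2, "passenger_count": 6},
    {"id": 209890, "vendor_id": 2, "passenger_count": 1},
    {"id": 1547231, "vendor_id": 1, "passenger_count": 1},
]
vendors = [{"vendor_id": 1}, {"vendor_id": 2}]
passenger_cnt = [{"passenger_count": n} for n in range(1, 9)]


def orphaned_keys(parent_rows, parent_key, child_rows, child_key):
    """Return child key values that have no matching row in the parent table."""
    parent_values = {row[parent_key] for row in parent_rows}
    child_values = {row[child_key] for row in child_rows}
    return sorted(child_values - parent_values)


# Arguments follow the (parent, parent key, child, child key) order
# used in the relationships list.
missing_vendors = orphaned_keys(vendors, "vendor_id", trips, "vendor_id")
missing_counts = orphaned_keys(
    passenger_cnt, "passenger_count", trips, "passenger_count"
)
print(missing_vendors, missing_counts)
```

An empty list for both checks means every trip can be linked to its parent rows.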

<h2>Step 3: Create baseline features using DFS </h2>
<p>Instead of manually creating features, such as month of <b>pickup_datetime</b>, we can let featuretools come up with them. </p>

<p>Within featuretools there is a standard format for representing data that is used to set up predictions and build features.</p>


<p>As a note: Featuretools will try to infer the types of variables. We can override this inference by specifying the types explicitly. In this case, we want <b>passenger_count</b> to be of type Ordinal and <b>vendor_id</b> to be of type Categorical.</p>

<p>For each instance of the target_entity, we can specify the time at which its features are calculated. This timestamp is the last point in time from which DFS may use data when calculating features. The times are passed as a dataframe of cutoff times. Below, the cutoff time for each trip is its pickup time.</p>

In [46]:
cutoff_time = trips[['id', 'pickup_datetime']]
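To make the role of a cutoff time concrete, here is a minimal sketch in plain Python. It only illustrates the idea that data after the cutoff is hidden; it is not how featuretools implements it, and the helper `visible_at` is hypothetical. The timestamps are taken from the first sample rows of `trips`.

```python
from datetime import datetime

# (trip id, pickup time) pairs from the first four sample rows.
pickups = [
    (338479, datetime(2016, 1, 1, 0, 0, 45)),
    (209890, datetime(2016, 1, 1, 0, 0, 52)),
    (1515010, datetime(2016, 1, 1, 0, 1, 32)),
    (32054, datetime(2016, 1, 1, 0, 2, 7)),
]


def visible_at(events, cutoff):
    """Keep only the ids of events whose timestamp is at or before the cutoff."""
    return [event_id for event_id, ts in events if ts <= cutoff]


# For trip 1515010, only data up to its own pickup time may be used.
cutoff = datetime(2016, 1, 1, 0, 1, 32)
visible = visible_at(pickups, cutoff)
print(visible)
```

The fourth trip starts after the cutoff, so it would not contribute to any feature computed for trip 1515010.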

<p>Given this dataset, we would have about 2 million unique cutoff times. This is a good use case for the approximate parameter of DFS. In a large dataset, direct features that are aggregations on the prediction entity may not change much from one cutoff time to the next. Calculating the aggregation features once per hour and reusing them for all cutoff times within that hour saves time and probably loses little information. The approximate parameter lets you specify the window size to use when approximating these direct aggregation features.</p>
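The windowing idea can be sketched without featuretools: floor each cutoff time to the start of its window, then compute the expensive aggregation once per window instead of once per cutoff. This is an illustration of the concept under the assumption of one-hour windows counted from midnight, not the library's internal implementation; `window_start` is a hypothetical helper and the third timestamp is made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def window_start(ts, window=timedelta(hours=1)):
    """Floor a timestamp to the start of its window, counting from midnight."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    return midnight + (elapsed // window) * window


cutoffs = [
    datetime(2016, 1, 1, 0, 0, 45),
    datetime(2016, 1, 1, 0, 5, 4),
    datetime(2016, 1, 1, 1, 30, 0),  # made-up timestamp in a later window
]

# Group cutoff times by the window they fall into.
groups = defaultdict(list)
for ts in cutoffs:
    groups[window_start(ts)].append(ts)

# Three cutoff times collapse into two windows, so an aggregation would be
# computed twice instead of three times.
print(len(cutoffs), "cutoffs ->", len(groups), "windows")
```

With roughly 2 million cutoff times spread over a few months, hourly windows would reduce the number of distinct aggregation times to a few thousand.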

In [47]:
from featuretools.primitives import (Day, Hour, Minute, Month, Weekday, Week, Weekend)


trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=[],
                   drop_contains=['trips.test_data'],
                   features_only=True)

<p>Here are the features created. Notice how some of the features match the manually created features in the previous notebook.</p>

In [48]:
print(len(features))

30


In [49]:
features[:25]

[<Feature: passenger_count>,
 <Feature: dropoff_longitude>,
 <Feature: payment_type>,
 <Feature: store_and_fwd_flag>,
 <Feature: vendor_id>,
 <Feature: test_data>,
 <Feature: pickup_latitude>,
 <Feature: pickup_longitude>,
 <Feature: trip_duration>,
 <Feature: trip_distance>,
 <Feature: dropoff_latitude>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: passenger_cnt.WEEK(first_trips_time)>,
 <Feature: vendors.DAY(first_trips_time)>,
 <Feature: passenger_cnt.WEEKDAY(first_trips_time)>,
 <Feature: vendors.WEEKDAY(first_trips_time)>,
 <Feature: vendors.MONTH(first_trips_time)>,
 <Feature: passenger_cnt.DAY(first_trips_time)>,
 <Feature: passenger_cnt.MINUTE(first_trips_time)>]
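Transform features such as `HOUR(pickup_datetime)` correspond to simple extractions from a timestamp. The sketch below computes comparable values by hand with the standard library, assuming ISO week numbering and Monday as weekday 0; the conventions used by the featuretools primitives may differ, and `datetime_parts` is a hypothetical helper.

```python
from datetime import datetime


def datetime_parts(ts):
    """Extract calendar parts comparable to the transform primitives above."""
    return {
        "MINUTE": ts.minute,
        "HOUR": ts.hour,
        "DAY": ts.day,
        "WEEK": ts.isocalendar()[1],  # ISO week number (assumed convention)
        "MONTH": ts.month,
        "WEEKDAY": ts.weekday(),  # Monday is 0, Sunday is 6
        "IS_WEEKEND": ts.weekday() >= 5,
    }


# Pickup time of the first sample trip.
parts = datetime_parts(datetime(2016, 1, 1, 0, 0, 45))
for name, value in parts.items():
    print(name, value)
```

Note that 1 January 2016 falls in ISO week 53 of the previous year, which is one reason to check the week convention before comparing against manually built features.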

In [50]:
feature_matrix = compute_features(features,cutoff_time)

<h2>Step 4: Build the Model </h2>

<p>We need to retrieve our labels for the train dataset, so we should merge our current feature matrix with the original dataset. </p>
<p>We also take the log of the trip duration plus one, so that a more linear relationship can be found.</p>

In [55]:
# Split the full feature matrix into train features, train labels, test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix)
y_train = np.log(y_train.values + 1)
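A model trained on `log(duration + 1)` produces predictions on the log scale, so they need to be mapped back with the inverse transform before they can be read as seconds. A minimal sketch of the round trip using the standard library, with durations taken from the sample rows (where in the pipeline the inverse is applied is an assumption here; the `utils` helpers are not shown):

```python
import math

# Trip durations in seconds from the first sample rows.
durations = [512.0, 626.0, 1387.0]

# Forward transform: log(d + 1), the same quantity as np.log(d + 1).
log_durations = [math.log1p(d) for d in durations]

# Inverse transform: exp(v) - 1 recovers the original scale.
recovered = [math.expm1(v) for v in log_durations]

for d, v, r in zip(durations, log_durations, recovered):
    print(d, round(v, 4), round(r, 4))
```

Adding 1 before taking the log keeps a duration of zero finite, since `log1p(0)` is 0.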

In [56]:
model = utils.train_xgb(X_train, y_train)

[0]	train-rmse:5.00153	valid-rmse:5.00566
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 60 rounds.
[10]	train-rmse:0.974724	valid-rmse:0.978011
[20]	train-rmse:0.444732	valid-rmse:0.448364
[30]	train-rmse:0.372997	valid-rmse:0.377877
[40]	train-rmse:0.343329	valid-rmse:0.349475
[50]	train-rmse:0.335098	valid-rmse:0.342742
[60]	train-rmse:0.330965	valid-rmse:0.339851
[70]	train-rmse:0.323858	valid-rmse:0.334437
[80]	train-rmse:0.319718	valid-rmse:0.331676
[90]	train-rmse:0.316818	valid-rmse:0.329861
[100]	train-rmse:0.313605	valid-rmse:0.327871
[110]	train-rmse:0.311416	valid-rmse:0.326687
[120]	train-rmse:0.308945	valid-rmse:0.325237
[130]	train-rmse:0.307372	valid-rmse:0.324473
[140]	train-rmse:0.30443	valid-rmse:0.322506
[150]	train-rmse:0.303034	valid-rmse:0.321743
[160]	train-rmse:0.301195	valid-rmse:0.3207
[170]	train-rmse:0.299955	valid-rmse:0.320197
[180]	train-rmse:0.29939	valid-rmse:0.31993

<h2>Step 5: Evaluate on test data</h2>


In [57]:
y_pred = utils.predict_xgb(model, X_test)
y_pred.head(5)

id,trip_duration
1003399,1314.03894
1003423,371.636261
1003452,812.39563
1003467,1336.536865
1003491,1074.334717


In [62]:
y_test

id
1003399   NaN
1003423   NaN
1003452   NaN
1003467   NaN
1003491   NaN
1003528   NaN
1003609   NaN
1003800   NaN
1003829   NaN
1003846   NaN
1003934   NaN
1003935   NaN
1003964   NaN
1004047   NaN
1004069   NaN
1004090   NaN
1004105   NaN
1004113   NaN
1004141   NaN
1004205   NaN
1004210   NaN
1004232   NaN
1004290   NaN
1004307   NaN
1004340   NaN
1004343   NaN
1004352   NaN
1004358   NaN
1004377   NaN
1004383   NaN
           ..
2003025   NaN
2003041   NaN
2003047   NaN
2003052   NaN
2003058   NaN
2003066   NaN
2003068   NaN
2003071   NaN
2003098   NaN
2003104   NaN
2003109   NaN
2003120   NaN
2003144   NaN
2003171   NaN
2003174   NaN
2003182   NaN
2003184   NaN
2003194   NaN
2003210   NaN
2003227   NaN
2003228   NaN
2003250   NaN
2003287   NaN
2003289   NaN
2003301   NaN
2003304   NaN
2003305   NaN
2003344   NaN
2003348   NaN
2003388   NaN
Name: trip_duration, Length: 1000001, dtype: float64

In [61]:
mean_squared_error(y_test, y_pred['trip_duration'])

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
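The `ValueError` is raised because the `y_test` labels shown above are all `NaN`, so there is nothing to compare the predictions against. One possible way to handle partially labeled data is to score only the rows that have a label. The sketch below does this in plain Python with made-up numbers; `rmse_ignoring_nan` is a hypothetical helper, not part of scikit-learn or `utils`.

```python
import math


def rmse_ignoring_nan(y_true, y_pred):
    """Root mean squared error over the rows whose true label is not NaN."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if not math.isnan(t)]
    if not pairs:
        # No labeled rows at all: the score is undefined.
        return float("nan")
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


# Made-up labels and predictions; the middle row has no label.
y_true = [512.0, float("nan"), 626.0]
y_pred = [500.0, 371.6, 650.0]

score = rmse_ignoring_nan(y_true, y_pred)
print(score)
```

When every label is missing, as in this test set, the helper returns `NaN` instead of raising, which makes the absence of labels explicit.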

<h2>Additional Analysis</h2>
<p>Let's look at how important each feature was for the model.</p>

In [18]:
feature_names = X_train.columns.values
ft_importances = utils.feature_importances(model, feature_names)
ft_importances[:20]

Unnamed: 0,feature_name,importance
92,dropoff_latitude,3598.0
86,dropoff_longitude,3524.0
90,pickup_latitude,3423.0
87,pickup_longitude,2752.0
91,trip_distance,2301.0
75,HOUR(pickup_datetime),1684.0
83,id.1,1370.0
80,WEEK(pickup_datetime),1126.0
76,DAY(pickup_datetime),970.0
78,MINUTE(pickup_datetime),959.0
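Raw importance scores are easier to compare when expressed as shares. A minimal sketch, assuming the scores are the values printed in the table above; the shares are normalized over these ten rows only, not over all features in the model.

```python
# Importance scores copied from the table above.
importances = {
    "dropoff_latitude": 3598.0,
    "dropoff_longitude": 3524.0,
    "pickup_latitude": 3423.0,
    "pickup_longitude": 2752.0,
    "trip_distance": 2301.0,
    "HOUR(pickup_datetime)": 1684.0,
    "id.1": 1370.0,
    "WEEK(pickup_datetime)": 1126.0,
    "DAY(pickup_datetime)": 970.0,
    "MINUTE(pickup_datetime)": 959.0,
}

# Normalize over the listed rows only, then sort from most to least important.
total = sum(importances.values())
shares = sorted(
    ((name, score / total) for name, score in importances.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, share in shares:
    print("{:<26} {:.1%}".format(name, share))
```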
