<center><a href="https://www.featuretools.com/"><img src="img/featuretools-logo.png" width="400" height="200" /></a></center>

<h2> New York City Taxi Ride Duration Prediction </h2>

In this case study, we will build a model to predict taxi ride ``duration``. We will take the following steps:
* First, install the dependencies.
* Next, load the data.
* Define the outcome variable, that is, the variable we are trying to predict.
* Build features using the featuretools package, which implements Deep Feature Synthesis. We will start with simple features, incrementally improve the feature definitions, and examine the accuracy of the system.

<h2>Install Dependencies</h2>
<p>If you have not done so already, download this repository <a href="https://github.com/Featuretools/DSx/archive/master.zip">from GitHub</a>. Once you have downloaded the archive, unzip it and ``cd`` into the directory from the command line. Next, run ``./osx.sh`` if you are on a Mac or ``./linux.sh`` if you are on Linux. This should install all of the dependencies.</p>
<p>If you are on a Windows machine, open the requirements.txt file and install each of the dependencies listed (featuretools, jupyter, pandas, sklearn, xgboost, numpy).</p>
<p>Once you have installed all of the dependencies, open this notebook. On Mac and Linux, navigate to the directory that you downloaded and run ``jupyter notebook`` to open Jupyter in your default web browser. When you open the NewYorkCity_taxi_case_study.ipynb file in the web browser, you can step through the code by clicking the ``Run`` button at the top of the page. If you have any questions about how to use Jupyter, search online or ask in the discussion forum.</p>

<h2>Running the Code</h2>

In [1]:
import pandas as pd
import numpy as np
import featuretools as ft
import utils
from utils import load_nyc_taxi_data, compute_features
from sklearn.metrics import mean_squared_error
from math import sqrt
from featuretools.primitives import (Day, Hour, Minute, Month, Weekday, Week, Weekend, Sum, Mean, Median, Std)
ft.__version__
%load_ext autoreload
%autoreload 2

<h2>Step 1: Load the raw data</h2>
<p>If you have not yet downloaded the data, you can download it <a href="https://s3.amazonaws.com/mit-dsx-data/nyc-taxi-data.zip">from S3</a>. Once you have downloaded the archive, unzip it and place trips.csv, passenger_cnt.csv, and vendors.csv in the nyc-taxi-data folder.
</p>

In [2]:
trips, passenger_cnt, vendors = load_nyc_taxi_data()
trips.head(10)

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,trip_duration
0,0,2,2016-01-01 00:00:19,2016-01-01 00:06:31,3,1.32,-73.961258,40.7962,False,-73.95005,40.787312,2,372.0
1,1,2,2016-01-01 00:01:45,2016-01-01 00:27:38,1,13.7,-73.956169,40.707756,False,-73.939949,40.839558,1,1553.0
2,2,1,2016-01-01 00:01:47,2016-01-01 00:21:51,2,5.3,-73.993103,40.752632,False,-73.953903,40.81654,2,1204.0
3,3,2,2016-01-01 00:01:48,2016-01-01 00:16:06,1,7.19,-73.983009,40.731419,False,-73.930969,40.80846,2,858.0
4,4,1,2016-01-01 00:02:49,2016-01-01 00:20:45,2,2.9,-74.004631,40.747234,False,-73.976395,40.777237,1,1076.0
5,5,2,2016-01-01 00:03:21,2016-01-01 00:12:18,1,2.76,-73.956947,40.76638,False,-73.943008,40.796822,1,537.0
6,6,1,2016-01-01 00:04:20,2016-01-01 00:13:16,4,1.0,-73.98912,40.738045,False,-73.991638,40.748993,1,536.0
7,7,1,2016-01-01 00:05:06,2016-01-01 00:32:46,1,10.6,-73.972755,40.764198,False,-73.834953,40.692356,1,1660.0
8,8,2,2016-01-01 00:05:06,2016-01-01 00:12:27,3,2.32,-73.962997,40.765808,False,-73.967758,40.79039,2,441.0
9,9,2,2016-01-01 00:05:15,2016-01-01 00:08:27,1,0.73,-73.973824,40.792049,False,-73.977913,40.78376,2,192.0


## Step 2: Prepare the Data 
Let's create entities and relationships. The three entities in this data are:
* trips
* vendors (the cab vendors)
* passenger_cnt (a simple entity that holds the unique passenger counts, 1-8)

This data has the following relationships:
* vendors --> trips (the same vendor can have multiple trips; vendors is the ``parent_entity`` and trips is the child entity)
* passenger_cnt --> trips (the same passenger_cnt can appear in multiple trips; passenger_cnt is the ``parent_entity`` and trips is the child entity)
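
A parent-child relationship of this kind can be sanity-checked before the tables are handed to featuretools. The following is a minimal sketch in plain Python, not part of the original notebook: ``vendor_ids`` and ``trip_rows`` are hypothetical toy stand-ins for the real tables, with values taken from the first five sample trips shown earlier.

```python
# Toy stand-ins for the real tables (hypothetical, for illustration only).
vendor_ids = {1, 2}        # parent entity: one row per vendor
trip_rows = [              # child entity: (trip id, vendor_id)
    (0, 2), (1, 2), (2, 1), (3, 2), (4, 1),
]


def orphaned_children(parent_keys, child_rows):
    """Return child ids whose foreign key has no matching parent row."""
    return [child_id for child_id, key in child_rows if key not in parent_keys]


def children_per_parent(child_rows):
    """Count how many child rows point at each parent key."""
    counts = {}
    for _, key in child_rows:
        counts[key] = counts.get(key, 0) + 1
    return counts


orphans = orphaned_children(vendor_ids, trip_rows)
counts = children_per_parent(trip_rows)
print(orphans)
print(counts)
```

An empty list of orphans means every trip points at an existing vendor, and the counts confirm that one parent row can own several child rows.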


In [3]:
entities = {
        "trips": (trips, "id", 'pickup_datetime' ),
        "vendors": (vendors, "vendor_id"),
        "passenger_cnt": (passenger_cnt,"passenger_count")
        }

relationships = [("vendors", "vendor_id","trips", "vendor_id"), 
                ("passenger_cnt", "passenger_count","trips", "passenger_count")]

<p>We can specify the time for each instance of the target_entity to calculate features. The timestamp represents the last time data can be used for calculating features by DFS. This is specified using a dataframe of cutoff times. Below we can see that the cutoff time for each trip is the pickup time.</p>
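
To make the idea of a cutoff time concrete, here is a minimal sketch in plain Python. It assumes that data stamped at or before the cutoff is usable, which matches the description above but is not checked against the library here. ``trips_toy`` and ``visible_trip_ids`` are hypothetical names introduced for illustration.

```python
from datetime import datetime

# Toy rows (id, pickup_datetime) copied from the first sample trips.
trips_toy = [
    (0, datetime(2016, 1, 1, 0, 0, 19)),
    (1, datetime(2016, 1, 1, 0, 1, 45)),
    (2, datetime(2016, 1, 1, 0, 1, 47)),
    (3, datetime(2016, 1, 1, 0, 1, 48)),
]


def visible_trip_ids(rows, cutoff):
    """Return ids of rows whose timestamp is at or before the cutoff."""
    return [trip_id for trip_id, timestamp in rows if timestamp <= cutoff]


# When features are computed for trip 2, its pickup time is the cutoff,
# so trip 3 (picked up one second later) must not contribute any data.
cutoff_for_trip_2 = trips_toy[2][1]
visible = visible_trip_ids(trips_toy, cutoff_for_trip_2)
print(visible)
```

Restricting each row to data available at its own cutoff is what keeps later information from leaking into the features.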

In [4]:
cutoff_time = trips[['id', 'pickup_datetime']]
print(cutoff_time.head(10))

   id     pickup_datetime
0   0 2016-01-01 00:00:19
1   1 2016-01-01 00:01:45
2   2 2016-01-01 00:01:47
3   3 2016-01-01 00:01:48
4   4 2016-01-01 00:02:49
5   5 2016-01-01 00:03:21
6   6 2016-01-01 00:04:20
7   7 2016-01-01 00:05:06
8   8 2016-01-01 00:05:06
9   9 2016-01-01 00:05:15


<h2>Step 3: Create baseline features using DFS </h2>
<p>Instead of manually creating features, such as month of <b>pickup_datetime</b>, we can let featuretools come up with them. </p>

<p>Within featuretools there is a standard format for representing data that is used to set up predictions and build features.</p>


<p>As a note: Featuretools will try to interpret the types of variables. We can override this interpretation by specifying the types. In this case, I wanted <b>passenger_count</b> to be of type Ordinal and <b>vendor_id</b> to be of type Categorical. This override occurred while loading the CSV files.</p>

### Create transform features using transform primitives

As we described in the video, features fall into two major categories: ``transform`` and ``aggregate``. In the Deep Feature Synthesis algorithm, we create transform features by specifying ``transform`` primitives. Below we specify a ``transform`` primitive called ``Weekend``. Here is what it does:

* It can be applied to any ``datetime`` column in the data.
* For each entry in the column, it assesses whether the date falls on a weekend and returns a boolean.

In this specific data, there are two ``datetime`` columns, ``pickup_datetime`` and ``dropoff_datetime``. The tool automatically creates features using the primitive and these two columns, as shown below.
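
For intuition, the behaviour described above can be reproduced by hand for a few timestamps. This is a minimal sketch using only the standard library, not the featuretools implementation. ``is_weekend`` is a hypothetical helper, and all timestamps after the first are made up for illustration.

```python
from datetime import datetime


def is_weekend(timestamp):
    """True for Saturday or Sunday; weekday() counts Monday as 0."""
    return timestamp.weekday() >= 5


pickups = [
    datetime(2016, 1, 1, 0, 0, 19),   # Friday (first sample trip)
    datetime(2016, 1, 2, 10, 30, 0),  # Saturday (made up)
    datetime(2016, 1, 3, 18, 0, 0),   # Sunday (made up)
    datetime(2016, 1, 4, 8, 15, 0),   # Monday (made up)
]

flags = [is_weekend(t) for t in pickups]
print(flags)
```

Applying the same function to both ``pickup_datetime`` and ``dropoff_datetime`` is what yields two features from a single primitive.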

In [6]:

trans_primitives = [Weekend]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=[],
                   features_only=True)

<p>Here are the features created.</p>

In [9]:
print(len(features))
features

12


[<Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: payment_type>,
 <Feature: dropoff_longitude>,
 <Feature: pickup_latitude>,
 <Feature: trip_duration>,
 <Feature: store_and_fwd_flag>,
 <Feature: trip_distance>,
 <Feature: dropoff_latitude>,
 <Feature: pickup_longitude>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>]

Now let's compute the features. 

In [10]:
feature_matrix = compute_features(features,cutoff_time)

<h2>Step 4: Build the Model </h2>

To build a model,
* We first separate the data into a portion for ``training`` (75% in this case) and a portion for ``testing``.
* We also take the log of the trip duration so that a more linear relationship can be found.
* We use ``XGBoost`` to train a model.
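
The first two steps can be sketched in plain Python. This is an illustration under stated assumptions, not the code inside ``utils``: it assumes the split keeps the first 75% of rows for training, which the notebook does not confirm. The durations come from the first four sample trips.

```python
import math

# Trip durations in seconds for the first four sample trips.
durations = [372.0, 1553.0, 1204.0, 858.0]

# Assumed ordered split: the first 75% for training, the rest for testing.
split_index = int(len(durations) * 0.75)
train_durations = durations[:split_index]
test_durations = durations[split_index:]

# log(duration + 1) compresses long trips; the +1 keeps a zero-second
# trip from producing log(0).
log_train = [math.log(d + 1) for d in train_durations]

# Predictions made on the log scale are mapped back with exp(value) - 1.
recovered = [math.exp(v) - 1 for v in log_train]

print(len(train_durations), len(test_durations))
print(log_train)
```

Because the model is trained on the log scale, the RMSE values printed during training are on that scale too, so they are not errors in seconds.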

In [11]:
# split the feature matrix into training features, training labels, test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)
y_train = np.log(y_train.values + 1)

In [12]:
model = utils.train_xgb(X_train, y_train)

[0]	train-rmse:4.98698	valid-rmse:4.98587
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 50 rounds.
[10]	train-rmse:0.973206	valid-rmse:0.972554
[20]	train-rmse:0.436417	valid-rmse:0.436489
[30]	train-rmse:0.380745	valid-rmse:0.382061
[40]	train-rmse:0.37503	valid-rmse:0.377282
[50]	train-rmse:0.367368	valid-rmse:0.370566
[60]	train-rmse:0.362789	valid-rmse:0.366918
[70]	train-rmse:0.358907	valid-rmse:0.364013
[80]	train-rmse:0.357262	valid-rmse:0.362921
[90]	train-rmse:0.354699	valid-rmse:0.361165
[100]	train-rmse:0.353081	valid-rmse:0.360219
[110]	train-rmse:0.351461	valid-rmse:0.359141
[120]	train-rmse:0.35009	valid-rmse:0.358254
[130]	train-rmse:0.34822	valid-rmse:0.357092
[140]	train-rmse:0.346831	valid-rmse:0.35624
[150]	train-rmse:0.346074	valid-rmse:0.355775
[160]	train-rmse:0.345375	valid-rmse:0.3554
[170]	train-rmse:0.34477	valid-rmse:0.355074
[180]	train-rmse:0.343869	valid-rmse:0.35461
[1

<h2>Step 5: Adding more Transform Primitives</h2>

* Adding ``Minute``, ``Hour``, ``Day``, ``Week``, ``Month``, and ``Weekday`` primitives
* All of these transform primitives apply to ``datetime`` columns
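
As a rough picture of what these primitives extract, the sketch below pulls the same parts out of one timestamp with the standard library. It is an illustration, not the featuretools implementation. ``datetime_parts`` is a hypothetical helper, and the ISO 8601 week convention used here is an assumption about how ``Week`` counts.

```python
from datetime import datetime


def datetime_parts(timestamp):
    """Break a timestamp into the calendar parts named by the primitives."""
    return {
        "minute": timestamp.minute,
        "hour": timestamp.hour,
        "day": timestamp.day,
        "week": timestamp.isocalendar()[1],   # ISO 8601 week number
        "month": timestamp.month,
        "weekday": timestamp.weekday(),       # Monday is 0, Sunday is 6
        "is_weekend": timestamp.weekday() >= 5,
    }


pickup = datetime(2016, 1, 1, 0, 5, 15)  # pickup time of sample trip 9
parts = datetime_parts(pickup)
print(parts)
```

Under the ISO convention, 1 January 2016 belongs to week 53 of the previous year, so week and month can disagree near a year boundary.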

In [13]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=[],
                   drop_contains=['trips.test_data'],
                   features_only=True)

In [14]:
print(len(features))
features

36


[<Feature: passenger_count>,
 <Feature: dropoff_longitude>,
 <Feature: payment_type>,
 <Feature: store_and_fwd_flag>,
 <Feature: vendor_id>,
 <Feature: pickup_latitude>,
 <Feature: pickup_longitude>,
 <Feature: trip_duration>,
 <Feature: trip_distance>,
 <Feature: dropoff_latitude>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: passenger_cnt.WEEK(first_trips_time)>,
 <Feature: vendors.DAY(first_trips_time)>,
 <Feature: passenger_cnt.WEEKDAY(first_trips_time)>,
 <Feature: vendors.WEEKDAY(first_trips_time)>,
 <Fe

In [15]:
feature_matrix = compute_features(features,cutoff_time)

<h2>Step 5: Build the new model</h2>

In [16]:
# split the feature matrix into training features, training labels, test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)
y_train = np.log(y_train.values + 1)

In [17]:
model = utils.train_xgb(X_train, y_train)

[0]	train-rmse:4.99672	valid-rmse:4.99546
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 50 rounds.
[10]	train-rmse:0.926123	valid-rmse:0.925607
[20]	train-rmse:0.398269	valid-rmse:0.398866
[30]	train-rmse:0.336353	valid-rmse:0.338614
[40]	train-rmse:0.319268	valid-rmse:0.322974
[50]	train-rmse:0.29361	valid-rmse:0.299003
[60]	train-rmse:0.281483	valid-rmse:0.288059
[70]	train-rmse:0.257367	valid-rmse:0.265217
[80]	train-rmse:0.242748	valid-rmse:0.251557
[90]	train-rmse:0.236299	valid-rmse:0.246339
[100]	train-rmse:0.221303	valid-rmse:0.232359
[110]	train-rmse:0.21447	valid-rmse:0.226336
[120]	train-rmse:0.205326	valid-rmse:0.217802
[130]	train-rmse:0.203326	valid-rmse:0.21675
[140]	train-rmse:0.195485	valid-rmse:0.209856
[150]	train-rmse:0.194128	valid-rmse:0.209188
[160]	train-rmse:0.187765	valid-rmse:0.203539
[170]	train-rmse:0.178377	valid-rmse:0.19481
[180]	train-rmse:0.175451	valid-rmse:0.19234

<h2>Step 6: Add Aggregation Primitives</h2>

In [18]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]
aggregation_primitives = [Sum, Mean, Median, Std]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=aggregation_primitives,
                   drop_contains=['trips.test_data'],
                   features_only=True)
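
An aggregation feature such as ``passenger_cnt.MEAN(trips.trip_distance)`` summarises all child rows that share a parent key. The sketch below shows that idea in plain Python on the ten sample trips. It is an illustration only: it ignores cutoff times, and the choice of the sample standard deviation is an assumption rather than a statement about the library's ``Std`` primitive.

```python
import statistics

# (passenger_count, trip_distance) for the ten sample trips shown earlier.
rows = [
    (3, 1.32), (1, 13.7), (2, 5.3), (1, 7.19), (2, 2.9),
    (1, 2.76), (4, 1.0), (1, 10.6), (3, 2.32), (1, 0.73),
]

# Group the child values by the parent key.
groups = {}
for passenger_count, distance in rows:
    groups.setdefault(passenger_count, []).append(distance)


def aggregate(values):
    """Sum, mean, median and sample standard deviation of one group."""
    return {
        "sum": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # The sample standard deviation needs at least two values.
        "std": statistics.stdev(values) if len(values) > 1 else None,
    }


per_parent = {key: aggregate(values) for key, values in sorted(groups.items())}
for key, stats in per_parent.items():
    print(key, stats)
```

Each parent row ends up with one value per primitive and child column, which is why the feature count grows quickly once aggregation primitives are added.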

In [19]:
print(len(features))
features

92


[<Feature: payment_type>,
 <Feature: store_and_fwd_flag>,
 <Feature: dropoff_longitude>,
 <Feature: pickup_longitude>,
 <Feature: trip_duration>,
 <Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: pickup_latitude>,
 <Feature: trip_distance>,
 <Feature: dropoff_latitude>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: passenger_cnt.STD(trips.pickup_longitude)>,
 <Feature: passenger_cnt.SUM(trips.pickup_longitude)>,
 <Feature: vendors.SUM(trips.dropoff_longitude)>,
 <Feature: passenger_cnt.WEEKDAY(firs

In [20]:
feature_matrix = compute_features(features,cutoff_time)

<h2>Step 6: Build the new model</h2>

In [21]:
# split the feature matrix into training features, training labels, test features, and test labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)
y_train = np.log(y_train.values + 1)

In [23]:
model = utils.train_xgb(X_train, y_train)

KeyboardInterrupt: 

<h2>Step 7: Evaluate on test data</h2>


In [None]:
y_pred = utils.predict_xgb(model, X_test)
y_pred.head(5)

In [None]:
mean_squared_error(y_test, y_pred['trip_duration'])**0.5
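
The expression above is a root mean squared error. The same calculation can be sketched in plain Python. The predictions below are made-up numbers for illustration and are not results from this notebook. ``rmse`` is a hypothetical helper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [372.0, 1553.0, 1204.0]      # durations from the sample rows
predicted = [400.0, 1500.0, 1204.0]   # hypothetical predictions
error = rmse(actual, predicted)
print(error)
```

Both inputs must be on the same scale, so predictions made on the log scale would need to be converted back to seconds before being compared with raw durations.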

<h2>Additional Analysis</h2>
<p>Let's look at how important each feature was for the model.</p>

In [None]:
feature_names = X_train.columns.values
ft_importances = utils.feature_importances(model, feature_names)
ft_importances[:20]