# New York City Taxi Ride Duration Prediction

In this case study, we will build a model to predict the duration of a taxi ride. We will go through the following steps:
  * Install the dependencies.
  * Load the data as pandas dataframes.
  * Define the outcome variable, that is, the variable we are trying to predict.
  * Build features using the featuretools package, which implements Deep Feature Synthesis (DFS). We will start with simple features, incrementally improve the feature definitions, and examine the accuracy of the system at each step.


Allocate at least 2-3 hours to go through this case study end to end.

# Install Dependencies 
<p>If you have not done so already, download this repository <a href="https://github.com/Featuretools/DSx/archive/master.zip">from GitHub</a>. Once you have downloaded this archive, unzip it and cd into the directory from the command line. Next, run the command ``./install_osx.sh`` if you are on a Mac or ``./install_linux.sh`` if you are on Linux. This should install all of the dependencies.</p>
<p>If you are on a Windows machine, open the requirements.txt file and make sure to install each of the dependencies listed (featuretools, jupyter, pandas, sklearn, numpy).</p>
<p>Once you have installed all of the dependencies, open this notebook. On Mac and Linux, navigate to the directory that you downloaded from GitHub and run ``jupyter notebook`` to open Jupyter in your default web browser. When you open the NewYorkCity_taxi_case_study.ipynb file in the web browser, you can step through the code by clicking the ``Run`` button at the top of the page. If you have any questions about how to use <a href="http://jupyter.org/">Jupyter</a>, refer to Google or the discussion forum.</p>

# Running the Code

In [1]:
import pandas as pd
import numpy as np
import featuretools as ft
import utils
from utils import load_nyc_taxi_data, compute_features, preview
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from math import sqrt
from featuretools.primitives import (Day, Hour, Minute, Month, Weekday, 
                                     Week, Weekend, Sum, Mean, Median, Std)
ft.__version__
%load_ext autoreload
%autoreload 2

# Step 1: Download and load the raw data as pandas dataframes
<p>If you have not yet downloaded the data, you can get it <a href="https://s3.amazonaws.com/mit-dsx-data/nyc-taxi-data.zip">from S3</a>. Once you have downloaded the archive, unzip it and place the nyc-taxi-data folder in the same directory as this notebook.
</p>

In [2]:
trips, passenger_cnt, vendors = load_nyc_taxi_data()
preview(trips,10)

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,trip_duration
0,0,2,2016-01-01 00:00:19,2016-01-01 00:06:31,3,1.32,-73.961258,40.7962,False,-73.95005,40.787312,2,372.0
679995,679995,1,2016-04-30 12:57:36,2016-04-30 13:04:36,1,1.1,-73.979973,40.770679,False,-73.969696,40.785587,1,420.0
679996,679996,2,2016-04-30 12:57:40,2016-04-30 13:06:01,5,1.22,-73.940399,40.79388,False,-73.952667,40.804859,1,501.0
679997,679997,2,2016-04-30 12:57:45,2016-04-30 13:07:11,1,1.58,-73.924728,40.744068,False,-73.953087,40.74929,1,566.0
679998,679998,2,2016-04-30 12:57:49,2016-04-30 13:15:28,1,4.76,-73.985863,40.746799,False,-74.005951,40.711269,2,1059.0
679999,679999,2,2016-04-30 12:58:04,2016-04-30 13:08:30,1,1.79,-73.959747,40.773682,False,-73.981071,40.778381,2,626.0
680000,680000,2,2016-04-30 12:58:33,2016-04-30 13:08:51,1,1.32,-73.9813,40.752972,False,-73.973923,40.764381,1,618.0
680001,680001,2,2016-04-30 12:58:39,2016-04-30 13:16:19,2,1.99,-73.987549,40.756226,False,-73.998032,40.765732,2,1060.0
680002,680002,2,2016-04-30 12:58:47,2016-04-30 13:13:47,2,4.0,-73.951172,40.77422,False,-73.909988,40.801823,1,900.0
680003,680003,1,2016-04-30 12:58:56,2016-04-30 13:24:28,1,6.1,-74.008163,40.70364,False,-73.984138,40.75898,2,1532.0


The ``trips`` table has the following fields:
* ``id`` uniquely identifies the trip
* ``vendor_id`` is the taxi cab company; in our case study we have data from three different cab companies
* ``pickup_datetime`` the timestamp for pickup
* ``dropoff_datetime`` the timestamp for drop-off
* ``passenger_count`` the number of passengers for the trip
* ``trip_distance`` total distance of the trip in miles
* ``pickup_longitude`` the longitude of pickup
* ``pickup_latitude`` the latitude of pickup
* ``dropoff_longitude`` the longitude of drop-off
* ``dropoff_latitude`` the latitude of drop-off
* ``payment_type`` a numeric code signifying how the passenger paid for the trip: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided
* ``trip_duration`` the duration we would like to predict using the other fields

# Step 2: Prepare the Data
Let's create entities and relationships. The three entities in this data are:
* trips
* vendors (these are the cab companies)
* passenger_cnt (a simple entity that has the unique number of passenger counts 1-8)

This data has the following relationships:
* vendors --> trips (the same vendor can have multiple trips; vendors is the ``parent_entity`` and trips is the child entity)
* passenger_cnt --> trips (the same passenger_cnt can appear in multiple trips; passenger_cnt is the ``parent_entity`` and trips is the child entity)

In <a href="https://www.featuretools.com/">featuretools</a> (an automated feature engineering software package), we specify the list of entities and relationships as follows:

In [3]:
entities = {
        "trips": (trips, "id", 'pickup_datetime' ),
        "vendors": (vendors, "vendor_id"),
        "passenger_cnt": (passenger_cnt,"passenger_count")
        }

relationships = [("vendors", "vendor_id","trips", "vendor_id"), 
                ("passenger_cnt", "passenger_count","trips", "passenger_count")]

<p>We specify a time for each instance of the target_entity, in this case ``trips``, at which to calculate features. This timestamp is the last point in time from which DFS may use data when calculating features for that instance. It is specified using a dataframe of cutoff times. Here, the cutoff time for each trip is its pickup time.</p>
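To make the role of the cutoff time concrete, here is a minimal sketch in plain Python. It is not how featuretools implements cutoff times; it assumes a hypothetical helper, `rows_visible_at`, that keeps only the rows whose timestamp is at or before the cutoff, which is the data a feature for that trip would be allowed to use.

```python
from datetime import datetime

# Toy trip records; the ids and pickup times come from the preview above.
toy_trips = [
    {"id": 0, "pickup_datetime": datetime(2016, 1, 1, 0, 0, 19)},
    {"id": 679995, "pickup_datetime": datetime(2016, 4, 30, 12, 57, 36)},
    {"id": 679996, "pickup_datetime": datetime(2016, 4, 30, 12, 57, 40)},
]


def rows_visible_at(rows, cutoff, time_key="pickup_datetime"):
    """Hypothetical helper: keep rows whose timestamp is at or before the cutoff."""
    return [row for row in rows if row[time_key] <= cutoff]


# The cutoff for trip 679995 is its own pickup time, so trip 679996
# (picked up four seconds later) is not visible to its features.
cutoff = datetime(2016, 4, 30, 12, 57, 36)
visible_ids = [row["id"] for row in rows_visible_at(toy_trips, cutoff)]
print(visible_ids)
```

Filtering this way is what prevents information from later trips leaking into the features of an earlier one.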

In [4]:
cutoff_time = (trips[['id', 'pickup_datetime']])
preview(cutoff_time,10)

Unnamed: 0,id,pickup_datetime
0,0,2016-01-01 00:00:19
679995,679995,2016-04-30 12:57:36
679996,679996,2016-04-30 12:57:40
679997,679997,2016-04-30 12:57:45
679998,679998,2016-04-30 12:57:49
679999,679999,2016-04-30 12:58:04
680000,680000,2016-04-30 12:58:33
680001,680001,2016-04-30 12:58:39
680002,680002,2016-04-30 12:58:47
680003,680003,2016-04-30 12:58:56


# Step 3: Create baseline features using DFS 
<p>Instead of manually creating features, such as month of <b>pickup_datetime</b>, we can let featuretools come up with them. </p> 

Featuretools does this in two steps:
* It interprets the types of the variables: categorical, numeric and others. We can override this interpretation by specifying the types. In this case study, we wanted <b>passenger_count</b> to be of type Ordinal and <b>vendor_id</b> to be of type Categorical. This override occurred while loading the csv files.
* Then, based on the primitives we specify, it matches each primitive to the columns to which it can be applied.

# Create transform features using transform primitives

As we described in the video, features fall into two major categories, ``transform`` and ``aggregate``. In featuretools, we can create transform features by specifying ``transform`` primitives. Below we specify a ``transform`` primitive called ``Weekend``. Here is what it does:

* It can be applied to any ``datetime`` column in the data.
* For each entry in the column, it assesses whether the entry falls on a weekend and returns a boolean.
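As a rough illustration of what such a primitive computes, the sketch below applies a weekend check to a small column of timestamps using only the standard library. The function `is_weekend` and the list `pickup_column` are illustrative names, not part of featuretools.

```python
from datetime import datetime


def is_weekend(timestamp):
    """Return True for Saturday or Sunday (weekday() is 5 or 6)."""
    return timestamp.weekday() >= 5


# Two pickup times from the preview above:
# 2016-01-01 is a Friday and 2016-04-30 is a Saturday.
pickup_column = [
    datetime(2016, 1, 1, 0, 0, 19),
    datetime(2016, 4, 30, 12, 57, 36),
]

# Applying the function to every entry yields one boolean feature value per row.
is_weekend_feature = [is_weekend(ts) for ts in pickup_column]
print(is_weekend_feature)
```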

In this specific data, there are two ``datetime`` columns, ``pickup_datetime`` and ``dropoff_datetime``. The tool automatically creates features using the primitive and these two columns, as shown below.

In [5]:
trans_primitives = [Weekend]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=[],
                   features_only=True)

<p>Here are the features created.</p>

In [6]:
print(len(features))
features

12


[<Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: payment_type>,
 <Feature: dropoff_longitude>,
 <Feature: pickup_latitude>,
 <Feature: trip_duration>,
 <Feature: store_and_fwd_flag>,
 <Feature: trip_distance>,
 <Feature: dropoff_latitude>,
 <Feature: pickup_longitude>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>]

Now let's compute the features. 

In [7]:
feature_matrix = compute_features(features,cutoff_time)

# Step 4: Build the Model

To build a model:
* We first separate the data into a portion for ``training`` (75% in this case) and a portion for ``testing``.
* We also take the log of the trip duration so that a more linear relationship can be found.
* We use ``GradientBoostingRegressor`` to train a model.
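The helper ``utils.get_train_test_fm`` is not shown in this notebook, so the following is only a sketch of the idea under stated assumptions: rows are kept in their original order, the first 75% go to training, the label column is removed from the inputs, and the label is log-transformed. The function name `split_with_log_label` and the toy rows are hypothetical.

```python
import math


def split_with_log_label(rows, label_key, train_fraction):
    """Hypothetical helper: ordered split with log-transformed labels."""
    split_at = int(len(rows) * train_fraction)

    def features(row):
        # Drop the label so it cannot leak into the model inputs.
        return {key: value for key, value in row.items() if key != label_key}

    def label(row):
        return math.log(row[label_key])

    train, test = rows[:split_at], rows[split_at:]
    return ([features(r) for r in train], [label(r) for r in train],
            [features(r) for r in test], [label(r) for r in test])


# Toy rows using distances and durations from the first preview.
toy_rows = [
    {"trip_distance": 1.32, "trip_duration": 372.0},
    {"trip_distance": 1.1, "trip_duration": 420.0},
    {"trip_distance": 1.22, "trip_duration": 501.0},
    {"trip_distance": 1.58, "trip_duration": 566.0},
]
X_train, y_train, X_test, y_test = split_with_log_label(
    toy_rows, "trip_duration", 0.75)
print(len(X_train), len(X_test))
```

Predictions made on the log scale can be mapped back to seconds with ``math.exp``.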

In [8]:
# separates the whole feature matrix into train data feature matrix,
# train data labels, test data feature matrix, and test data labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)

In [9]:
model = GradientBoostingRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.67914562001890844

# Step 5: Adding more Transform Primitives

* Adding the ``Minute``, ``Hour``, ``Day``, ``Week``, ``Month`` and ``Weekday`` primitives
* All of these transform primitives apply to ``datetime`` columns
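To give a feel for the values these primitives produce, here is a small standard-library sketch that extracts the same calendar parts from a single timestamp. The function `datetime_parts` is an illustrative name, and the use of the ISO week number and a Monday-is-0 weekday are assumptions about the convention, not a statement of how featuretools defines them.

```python
from datetime import datetime


def datetime_parts(timestamp):
    """Calendar parts similar in spirit to the primitives listed above."""
    return {
        "MINUTE": timestamp.minute,
        "HOUR": timestamp.hour,
        "DAY": timestamp.day,
        "WEEK": timestamp.isocalendar()[1],   # ISO week number (assumed convention)
        "MONTH": timestamp.month,
        "WEEKDAY": timestamp.weekday(),       # Monday is 0 (assumed convention)
        "IS_WEEKEND": timestamp.weekday() >= 5,
    }


# Pickup time of the first trip in the preview.
parts = datetime_parts(datetime(2016, 1, 1, 0, 0, 19))
print(parts)
```

Note that 2016-01-01 falls in ISO week 53 of the previous year, so a week feature is not always a simple function of the month and day.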

In [10]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=[],
                   features_only=True)

In [11]:
print(len(features))
features

36


[<Feature: passenger_count>,
 <Feature: dropoff_longitude>,
 <Feature: payment_type>,
 <Feature: store_and_fwd_flag>,
 <Feature: vendor_id>,
 <Feature: pickup_latitude>,
 <Feature: pickup_longitude>,
 <Feature: trip_duration>,
 <Feature: trip_distance>,
 <Feature: dropoff_latitude>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: passenger_cnt.WEEK(first_trips_time)>,
 <Feature: vendors.DAY(first_trips_time)>,
 <Feature: passenger_cnt.WEEKDAY(first_trips_time)>,
 <Feature: vendors.WEEKDAY(first_trips_time)>,
 <Fe

Now let's compute the features. 

In [12]:
feature_matrix = compute_features(features,cutoff_time)

In [13]:
preview(feature_matrix,10)

id,passenger_count,dropoff_longitude,payment_type,store_and_fwd_flag,vendor_id,pickup_latitude,pickup_longitude,trip_duration,trip_distance,dropoff_latitude,...,passenger_cnt.WEEKDAY(first_trips_time),vendors.WEEKDAY(first_trips_time),vendors.MONTH(first_trips_time),passenger_cnt.DAY(first_trips_time),passenger_cnt.MINUTE(first_trips_time),passenger_cnt.HOUR(first_trips_time),vendors.HOUR(first_trips_time),passenger_cnt.MONTH(first_trips_time),vendors.MINUTE(first_trips_time),vendors.WEEK(first_trips_time)
0,3,-73.95005,2,False,2,40.7962,-73.961258,372.0,1.32,40.787312,...,4,4,1,1,0,0,0,1,1,53
679995,1,-73.969696,1,False,1,40.770679,-73.979973,420.0,1.1,40.785587,...,4,4,1,1,45,1,0,1,1,53
679996,5,-73.952667,1,False,2,40.79388,-73.940399,501.0,1.22,40.804859,...,4,4,1,1,7,0,0,1,1,53
679997,1,-73.953087,1,False,2,40.744068,-73.924728,566.0,1.58,40.74929,...,4,4,1,1,45,1,0,1,1,53
679998,1,-74.005951,2,False,2,40.746799,-73.985863,1059.0,4.76,40.711269,...,4,4,1,1,45,1,0,1,1,53
679999,1,-73.981071,2,False,2,40.773682,-73.959747,626.0,1.79,40.778381,...,4,4,1,1,45,1,0,1,1,53
680000,1,-73.973923,1,False,2,40.752972,-73.9813,618.0,1.32,40.764381,...,4,4,1,1,45,1,0,1,1,53
680001,2,-73.998032,2,False,2,40.756226,-73.987549,1060.0,1.99,40.765732,...,4,4,1,1,47,1,0,1,1,53
680002,2,-73.909988,1,False,2,40.77422,-73.951172,900.0,4.0,40.801823,...,4,4,1,1,47,1,0,1,1,53
680003,1,-73.984138,2,False,1,40.70364,-74.008163,1532.0,6.1,40.75898,...,4,4,1,1,45,1,0,1,1,53


# Step 6: Build the new model

In [14]:
# separates the whole feature matrix into train data feature matrix,
# train data labels, test data feature matrix, and test data labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)

In [15]:
model = GradientBoostingRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.74751581723434957

# Step 7: Add Aggregation Primitives

Now let's add aggregation primitives. These primitives generate features for the parent entities, in this case both ``vendors`` and ``passenger_cnt``, and then add them to the ``trips`` entity (the entity for which we are trying to make predictions).
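The idea can be sketched in plain Python: group the child rows by their parent key, aggregate each group, and copy the result back onto every child row. This is an illustration on toy data rather than the featuretools implementation. The feature names mimic the ones DFS prints, and using the sample standard deviation for ``STD`` is an assumption.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Toy child rows (trips); values come from the first preview in this notebook.
toy_trips = [
    {"id": 0, "vendor_id": 2, "trip_distance": 1.32},
    {"id": 679995, "vendor_id": 1, "trip_distance": 1.1},
    {"id": 679996, "vendor_id": 2, "trip_distance": 1.22},
    {"id": 680003, "vendor_id": 1, "trip_distance": 6.1},
]

# 1. Group the child values by their parent key.
distances_by_vendor = defaultdict(list)
for trip in toy_trips:
    distances_by_vendor[trip["vendor_id"]].append(trip["trip_distance"])

# 2. Aggregate each group into parent-level features.
vendor_features = {
    vendor: {
        "SUM(trips.trip_distance)": sum(values),
        "MEAN(trips.trip_distance)": mean(values),
        "MEDIAN(trips.trip_distance)": median(values),
        "STD(trips.trip_distance)": stdev(values),  # sample std; an assumption
    }
    for vendor, values in distances_by_vendor.items()
}

# 3. Copy the parent features back onto each child row.
for trip in toy_trips:
    for name, value in vendor_features[trip["vendor_id"]].items():
        trip["vendors." + name] = value

print(toy_trips[0]["vendors.MEAN(trips.trip_distance)"])
```

The sketch aggregates over all trips at once. In the notebook, the cutoff time additionally limits which trips may contribute to the aggregates for a given row.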

In [16]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]
aggregation_primitives = [Sum, Mean, Median, Std]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=aggregation_primitives,
                   features_only=True)

In [17]:
print(len(features))
features

92


[<Feature: payment_type>,
 <Feature: store_and_fwd_flag>,
 <Feature: dropoff_longitude>,
 <Feature: pickup_longitude>,
 <Feature: trip_duration>,
 <Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: pickup_latitude>,
 <Feature: trip_distance>,
 <Feature: dropoff_latitude>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: HOUR(dropoff_datetime)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: WEEKDAY(dropoff_datetime)>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: IS_WEEKEND(dropoff_datetime)>,
 <Feature: WEEK(dropoff_datetime)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: MONTH(dropoff_datetime)>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: DAY(dropoff_datetime)>,
 <Feature: MINUTE(dropoff_datetime)>,
 <Feature: passenger_cnt.STD(trips.pickup_longitude)>,
 <Feature: passenger_cnt.SUM(trips.pickup_longitude)>,
 <Feature: vendors.SUM(trips.dropoff_longitude)>,
 <Feature: passenger_cnt.WEEKDAY(firs

In [18]:
feature_matrix = compute_features(features,cutoff_time)

In [19]:
preview(feature_matrix,10)

id,payment_type,store_and_fwd_flag,dropoff_longitude,pickup_longitude,trip_duration,vendor_id,passenger_count,pickup_latitude,trip_distance,dropoff_latitude,...,passenger_cnt.MEAN(trips.pickup_longitude),vendors.MEDIAN(trips.dropoff_latitude),passenger_cnt.SUM(trips.trip_distance),passenger_cnt.MEDIAN(trips.payment_type),passenger_cnt.STD(trips.dropoff_latitude),passenger_cnt.MEDIAN(trips.trip_duration),vendors.MEDIAN(trips.payment_type),passenger_cnt.MEDIAN(trips.dropoff_latitude),vendors.MEAN(trips.pickup_longitude),vendors.MEAN(trips.dropoff_longitude)
510001,2,False,-73.998131,-73.982216,674.0,1,1,40.763084,1.5,40.765652,...,-73.974532,40.75441,5549035.11,1.0,0.029019,622.0,1.0,40.754593,-73.975087,-73.974294
679994,1,False,-74.000252,-74.009727,612.0,2,1,40.713009,1.08,40.726639,...,-73.974591,40.754723,5948678.22,1.0,0.029112,631.0,1.0,40.754616,-73.973981,-73.974037
679995,1,False,-73.969696,-73.979973,420.0,1,1,40.770679,1.1,40.785587,...,-73.974591,40.754452,5948678.22,1.0,0.029112,631.0,1.0,40.754616,-73.975138,-73.974177
679996,1,False,-73.952667,-73.940399,501.0,2,5,40.79388,1.22,40.804859,...,-73.97367,40.754723,102501.77,1.0,0.029189,651.0,1.0,40.7547,-73.973981,-73.974037
679997,1,False,-73.953087,-73.924728,566.0,2,1,40.744068,1.58,40.74929,...,-73.974591,40.754723,5948678.22,1.0,0.029112,631.0,1.0,40.754616,-73.973981,-73.974037
679998,2,False,-74.005951,-73.985863,1059.0,2,1,40.746799,4.76,40.711269,...,-73.974591,40.754723,5948678.22,1.0,0.029112,631.0,1.0,40.754616,-73.973981,-73.974037
679999,2,False,-73.981071,-73.959747,626.0,2,1,40.773682,1.79,40.778381,...,-73.974591,40.754723,5948678.22,1.0,0.029112,631.0,1.0,40.754616,-73.973981,-73.974037
680000,1,False,-73.973923,-73.9813,618.0,2,1,40.752972,1.32,40.764381,...,-73.974591,40.754723,5948678.22,1.0,0.029112,631.0,1.0,40.754616,-73.973981,-73.974037
680001,2,False,-73.998032,-73.987549,1060.0,2,2,40.756226,1.99,40.765732,...,-73.974236,40.754723,277797.69,1.0,0.029832,666.0,1.0,40.754372,-73.973981,-73.974037
680002,1,False,-73.909988,-73.951172,900.0,2,2,40.77422,4.0,40.801823,...,-73.974236,40.754723,277797.69,1.0,0.029832,666.0,1.0,40.754372,-73.973981,-73.974037


# Step 8: Build the new model

In [20]:
# separates the whole feature matrix into train data feature matrix,
# train data labels, test data feature matrix, and test data labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)

In [21]:
model = GradientBoostingRegressor()
model.fit(X_train,y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)

# Step 9: Evaluate on test data

In [23]:
y_pred = model.predict(X_test)
y_pred[5:]

array([  783.56907268,   458.17670278,   659.62407679, ...,  1165.6161969 ,
        1909.1846298 ,   837.368847  ])

In [24]:
model.score(X_test,y_test)

0.74813440151888877
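For a scikit-learn regressor, ``score`` returns the coefficient of determination (R²). As a sketch of what that number measures, the toy example below computes R² by hand, along with the root mean squared error, which is what the ``mean_squared_error`` and ``sqrt`` imports at the top of this notebook could be used for. The function names and toy values are illustrative and not part of the notebook.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the labels."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy labels and predictions: only the last prediction is off, by 1.0.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.0, 2.0, 3.0, 5.0]

print(r_squared(toy_true, toy_pred))
print(rmse(toy_true, toy_pred))
```

An R² of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean of the test labels.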

# Additional Analysis
<p>Let's look at how important each feature was for the model.</p>