# New York City Taxi Ride Duration Prediction

In this case study, we will build a model that predicts the duration of a taxi ride. We will take the following steps:
  * Install the dependencies.
  * Load the data as pandas dataframes.
  * Define the outcome variable, that is, the variable we are trying to predict.
  * Build features using the [featuretools](https://www.featuretools.com/) package, which implements Deep Feature Synthesis. We will start with simple features, incrementally improve the feature definitions, and examine the accuracy of the system.
  


Allocate at least 2-3 hours to go through this case study end-to-end.

# Install Dependencies 
<p>If you have not done so already, download this repository <a href="https://github.com/Featuretools/DSx/archive/master.zip">from git</a>. Once you have downloaded the archive, unzip it and cd into the directory from the command line. Next, run ``./install_osx.sh`` if you are on a Mac or ``./install_linux.sh`` if you are on Linux. This should install all of the dependencies.</p>
<p>If you are on a Windows machine, open the requirements.txt file and install each of the dependencies listed (featuretools, jupyter, pandas, sklearn, numpy).</p>
<p>Once you have installed all of the dependencies, open this notebook. On Mac and Linux, navigate to the directory that you downloaded from git and run ``jupyter notebook`` to open Jupyter in your default web browser. When you open the NewYorkCity_taxi_case_study.ipynb file in the web browser, you can step through the code by clicking the ``Run`` button at the top of the page. If you have any questions about how to use <a href="http://jupyter.org/">Jupyter</a>, refer to Google or the discussion forum.</p>

# Running the Code

In [2]:
%load_ext autoreload
%autoreload 2

import featuretools as ft
import utils
from utils import load_nyc_taxi_data, compute_features, preview, feature_importances
from sklearn.ensemble import GradientBoostingRegressor
from featuretools.primitives import (Day, Hour, Minute, Month, Weekday,
                                     Week, Weekend, Sum, Mean, Median, Std, Count)

# Print the installed featuretools version for reference.
print(ft.__version__)



# Step 1: Download and load the raw data as pandas dataframes
<p>If you have not yet downloaded the data, you can get it <a href="https://s3.amazonaws.com/mit-dsx-data/nyc-taxi-data.zip">from S3</a>. Once you have downloaded the archive, unzip it and place the nyc-taxi-data folder in the same directory as this notebook.
</p>

In [None]:
trips, passenger_cnt, vendors = load_nyc_taxi_data()
preview(trips,10)


The ``trips`` table has the following fields:
* ``id`` uniquely identifies the trip
* ``vendor_id`` is the taxi cab company; in our case study we have data from three different cab companies
* ``pickup_datetime`` is the timestamp for pickup
* ``dropoff_datetime`` is the timestamp for drop-off
* ``passenger_count`` is the number of passengers for the trip
* ``trip_distance`` is the total distance of the trip in miles
* ``pickup_longitude`` is the longitude for pickup
* ``pickup_latitude`` is the latitude for pickup
* ``dropoff_longitude`` is the longitude for drop-off
* ``dropoff_latitude`` is the latitude for drop-off
* ``payment_type`` is a numeric code signifying how the passenger paid for the trip: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided
* ``trip_duration`` is the duration we would like to predict using the other fields

# Step 2: Prepare the Data
Let's create entities and relationships. The three entities in this data are:
* trips
* vendors (these are the cab companies)
* passenger_cnt (a simple entity that holds the unique passenger counts, 1-8)

This data has the following relationships:
* vendors --> trips (the same vendor can have multiple trips; vendors is the ``parent_entity`` and trips is the child entity)
* passenger_cnt --> trips (the same passenger_cnt can appear in multiple trips; passenger_cnt is the ``parent_entity`` and trips is the child entity)
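
A parent-child relationship of this kind behaves like a foreign key: each key value appears once in the parent, and every key value used by the child exists in the parent. The pandas sketch below checks that on small made-up tables. The data and the helper name ``check_relationship`` are illustrative assumptions and are not part of this case study's ``utils`` module.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "vendor_id": [1, 2, 1, 2],
    "passenger_count": [1, 2, 1, 5],
})
vendors = pd.DataFrame({"vendor_id": [1, 2]})
passenger_cnt = pd.DataFrame({"passenger_count": [1, 2, 5]})


def check_relationship(parent, parent_key, child, child_key):
    """Return True if parent_key is unique and covers every child_key value."""
    unique_in_parent = parent[parent_key].is_unique
    covered = child[child_key].isin(parent[parent_key]).all()
    return bool(unique_in_parent and covered)


vendors_ok = check_relationship(vendors, "vendor_id", trips, "vendor_id")
passengers_ok = check_relationship(passenger_cnt, "passenger_count",
                                   trips, "passenger_count")
print(vendors_ok, passengers_ok)
```

A check like this is a cheap way to catch a missing vendor or passenger count before handing the tables to a feature engineering tool.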

In <a href="https://www.featuretools.com/">featuretools</a> (an automated feature engineering software package), we specify the list of entities and relationships as follows:

In [None]:
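# Each entity is (dataframe, index column[, time index column]).
entities = {
    "trips": (trips, "id", "pickup_datetime"),
    "vendors": (vendors, "vendor_id"),
    "passenger_cnt": (passenger_cnt, "passenger_count"),
}

# Each relationship is (parent entity, parent column, child entity, child column).
relationships = [("vendors", "vendor_id", "trips", "vendor_id"),
                 ("passenger_cnt", "passenger_count", "trips", "passenger_count")]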

<p>For each instance of the target entity, in this case ``trips``, we specify a time at which its features are calculated. This timestamp is the last point in time from which DFS may use data when calculating features for that instance. We pass it in as a dataframe of cutoff times, and the cutoff time for each trip is its pickup time. So that every trip has a minimum amount of earlier data behind it, we only use trips after January 12th, 2016.</p>

In [None]:
cutoff_time = trips[['id', 'pickup_datetime']]
cutoff_time = cutoff_time[cutoff_time['pickup_datetime'] > "2016-01-12"]
preview(cutoff_time, 10)
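
To make the role of the cutoff time concrete, the sketch below filters a toy table the way a cutoff conceptually does: for a given trip, only rows whose timestamp is at or before that trip's pickup time are visible. This is a pandas illustration on made-up data, not a description of how featuretools implements cutoff times internally, and ``rows_visible_at`` is a hypothetical helper.

```python
import pandas as pd

# Made-up trips; only the id and pickup time matter for this illustration.
trips = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "pickup_datetime": pd.to_datetime([
        "2016-01-11 08:00", "2016-01-12 09:30",
        "2016-01-12 18:45", "2016-01-13 07:15",
    ]),
})

# Same construction as in the cell above: one cutoff per trip, after Jan 12.
cutoff_time = trips[["id", "pickup_datetime"]]
cutoff_time = cutoff_time[cutoff_time["pickup_datetime"] > "2016-01-12"]


def rows_visible_at(df, time_col, cutoff):
    """Rows that may be used when computing features at `cutoff`."""
    return df[df[time_col] <= cutoff]


# For trip 12, only trips picked up no later than its own pickup are usable.
cutoff_12 = cutoff_time.loc[cutoff_time["id"] == 12, "pickup_datetime"].iloc[0]
visible = rows_visible_at(trips, "pickup_datetime", cutoff_12)
print(cutoff_time["id"].tolist())
print(visible["id"].tolist())
```

Trip 10 is picked up before the January 12th threshold, so it gets no cutoff row of its own, but it still counts as historical data for later trips.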

# Step 3: Create baseline features using DFS 
<p>Instead of manually creating features, such as "<b>month of pickup_datetime</b>", we can let featuretools come up with them. </p> 

Featuretools does this in two steps:
* First, it interprets the types of the variables: categorical, numeric, and others. We can override this interpretation by specifying the types. In this case study, we wanted <b>passenger_count</b> to be of type Ordinal and <b>vendor_id</b> to be of type Categorical. This override occurred while loading the CSV files.
* Then, based on the primitives we specify, it matches each primitive to the columns to which it can be applied.
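
The matching step can be pictured as a lookup from each primitive's accepted input type to the columns of that type. The toy sketch below does this in plain Python. The type labels, the ``PRIMITIVE_INPUT_TYPES`` table, and the generated feature names are illustrative assumptions; they do not reproduce featuretools' internal representation or its exact feature naming.

```python
# Column types as they might be inferred or overridden (illustrative labels).
column_types = {
    "pickup_datetime": "datetime",
    "dropoff_datetime": "datetime",
    "trip_distance": "numeric",
    "passenger_count": "ordinal",
    "vendor_id": "categorical",
}

# Hypothetical table: which input type each transform primitive accepts.
PRIMITIVE_INPUT_TYPES = {
    "WEEKEND": "datetime",
    "HOUR": "datetime",
}


def match_primitives(column_types, primitive_input_types):
    """Pair every primitive with every column of the type it accepts."""
    features = []
    for primitive, wanted in sorted(primitive_input_types.items()):
        for column, col_type in sorted(column_types.items()):
            if col_type == wanted:
                features.append("{}({})".format(primitive, column))
    return features


feature_names = match_primitives(column_types, PRIMITIVE_INPUT_TYPES)
print(feature_names)
```

Two datetime primitives and two datetime columns give four candidate features, which is why the number of features grows quickly as primitives are added.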

# Create transform features using transform primitives

As we described in the video, features fall into two major categories, ``transform`` and ``aggregate``. In featuretools, we can create transform features by specifying ``transform`` primitives. Below we specify a ``transform`` primitive called ``weekend``. Here is what it does:

* It can be applied to any ``datetime`` column in the data.
* For each entry in the column, it assesses whether the date falls on a weekend and returns a boolean.
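
As a rough picture of what such a primitive computes, the same per-row check can be written directly in pandas. This is a sketch on made-up timestamps; the featuretools implementation may differ in detail.

```python
import pandas as pd

pickup = pd.Series(pd.to_datetime([
    "2016-01-15 23:10",  # Friday
    "2016-01-16 00:05",  # Saturday
    "2016-01-17 12:00",  # Sunday
    "2016-01-18 08:30",  # Monday
]))

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
is_weekend = pickup.dt.dayofweek >= 5
print(is_weekend.tolist())
```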

In this specific data, there are two ``datetime`` columns, ``pickup_datetime`` and ``dropoff_datetime``. The tool automatically creates features by applying the primitive to these two columns, as shown below.

In [None]:
trans_primitives = [Weekend]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=[],
                  features_only=True)

<p>Here are the features created.</p>

In [None]:
print(len(features))
features

Now let's compute the features. 

In [None]:
feature_matrix = compute_features(features, cutoff_time)

# Step 4: Build the Model

To build a model:
* We first separate the data into a portion for ``training`` (75% in this case) and a portion for ``testing``.
* We also take the log of the trip duration so that a more linear relationship can be found.
* We use ``GradientBoostingRegressor`` to train a model.
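
The first two steps can be sketched as follows. The implementation of ``utils.get_train_test_fm`` is not shown in this notebook, so ``split_with_log_target`` below is a hypothetical stand-in. It assumes the feature matrix carries a ``trip_duration`` column, that rows are already in the desired order, and that all durations are positive.

```python
import numpy as np
import pandas as pd


def split_with_log_target(feature_matrix, train_fraction, target="trip_duration"):
    """Hypothetical helper: ordered split plus a log-transformed target."""
    n_train = int(len(feature_matrix) * train_fraction)
    X = feature_matrix.drop(columns=[target])
    y = np.log(feature_matrix[target])  # requires strictly positive durations
    return X.iloc[:n_train], y.iloc[:n_train], X.iloc[n_train:], y.iloc[n_train:]


# Made-up feature matrix with one feature and the target.
toy = pd.DataFrame({
    "hour": [8, 9, 17, 18, 23, 2, 12, 14],
    "trip_duration": [300.0, 450.0, 1200.0, 900.0, 600.0, 240.0, 720.0, 660.0],
})
X_train, y_train, X_test, y_test = split_with_log_target(toy, 0.75)
print(len(X_train), len(X_test))
```

Because the model is trained on the log of the duration, its predictions are also on the log scale and would need ``np.exp`` to be read as durations again.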

In [None]:
# separates the whole feature matrix into the train data feature matrix,
# train data labels, test data feature matrix, and test data labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)

In [None]:
model = GradientBoostingRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

# Step 5: Adding more Transform Primitives

* Add the ``Minute``, ``Hour``, ``Day``, ``Week``, ``Month``, and ``Weekday`` primitives
* All of these transform primitives apply to ``datetime`` columns
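
For intuition, each of these primitives extracts one calendar component from a timestamp. The pandas sketch below computes comparable values on two made-up pickup times. The column labels are illustrative, and the week number uses the ISO convention, which is an assumption that may not match featuretools exactly.

```python
import pandas as pd

pickup = pd.Series(pd.to_datetime(["2016-01-16 14:35:00", "2016-03-07 06:05:00"]))

calendar_features = pd.DataFrame({
    "MINUTE": pickup.dt.minute,
    "HOUR": pickup.dt.hour,
    "DAY": pickup.dt.day,
    # ISO week number, taken from the standard isocalendar() tuple.
    "WEEK": pickup.map(lambda ts: ts.isocalendar()[1]),
    "MONTH": pickup.dt.month,
    "WEEKDAY": pickup.dt.dayofweek,  # Monday is 0, Sunday is 6
    "WEEKEND": pickup.dt.dayofweek >= 5,
})
print(calendar_features)
```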

In [None]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=[],
                   features_only=True)

In [None]:
print(len(features))
features

Now let's compute the features. 

In [None]:
feature_matrix = compute_features(features, cutoff_time)

In [None]:
preview(feature_matrix, 10)

# Step 6: Build the new model

In [None]:
# separates the whole feature matrix into the train data feature matrix,
# train data labels, test data feature matrix, and test data labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)

In [None]:
model = GradientBoostingRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

# Step 7: Add Aggregation Primitives

Now let's add aggregation primitives. These primitives generate features for the parent entities, in this case both ``vendors`` and ``passenger_cnt``, and then add them to the ``trips`` entity, which is the entity for which we are trying to make predictions.
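
Conceptually, this is a group-by on the child table followed by a join back onto it. The pandas sketch below shows the idea for ``vendors`` on made-up data. The column naming is illustrative, and the sketch ignores cutoff times, which would restrict each aggregate to trips that happened before the trip being described.

```python
import pandas as pd

trips = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "vendor_id": [1, 1, 2, 2, 2],
    "trip_distance": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Aggregate the child rows (trips) for each parent (vendor).
vendor_stats = (
    trips.groupby("vendor_id")["trip_distance"]
    .agg(["count", "sum", "mean", "median", "std"])
    .add_prefix("vendors.trip_distance.")
    .reset_index()
)

# Attach the parent-level aggregates back to every trip of that vendor.
trips_with_vendor_stats = trips.merge(vendor_stats, on="vendor_id", how="left")
print(trips_with_vendor_stats)
```

Every trip of the same vendor receives the same aggregate values, so these features describe the vendor rather than the individual trip.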

In [None]:
trans_primitives = [Minute, Hour, Day, Week, Month, Weekday, Weekend]
aggregation_primitives = [Count, Sum, Mean, Median, Std]

features = ft.dfs(entities=entities,
                   relationships=relationships,
                   target_entity="trips",
                   trans_primitives=trans_primitives,
                   agg_primitives=aggregation_primitives,
                   features_only=True)

In [None]:
print(len(features))
features

In [None]:
feature_matrix = compute_features(features, cutoff_time)

In [None]:
preview(feature_matrix, 10)

# Step 8: Build the new model

In [None]:
# separates the whole feature matrix into the train data feature matrix,
# train data labels, test data feature matrix, and test data labels
X_train, y_train, X_test, y_test = utils.get_train_test_fm(feature_matrix,.75)

In [None]:
model = GradientBoostingRegressor()
model.fit(X_train,y_train)

# Step 9: Evaluate on test data

In [None]:
y_pred = model.predict(X_test)
y_pred[:5]  # preview the first five predictions

In [None]:
model.score(X_test,y_test)
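
For scikit-learn regressors, ``score`` returns the coefficient of determination R^2 of the predictions. The sketch below computes the same quantity by hand with NumPy on made-up numbers, to show what the score measures. If the target is the log of the trip duration, as described in Step 4, the score describes the fit on the log scale.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [5.0, 6.0, 7.0, 8.0]
perfect = r_squared(y_true, y_true)        # predictions equal the targets
mean_only = r_squared(y_true, [6.5] * 4)   # always predict the mean
print(perfect, mean_only)
```

A score near 1 means the model explains most of the variance in the target, while a score near 0 means it does no better than always predicting the mean.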

# Additional Analysis
<p>Let's look at how important each feature was for the model.</p>

In [None]:
feature_importances(model, feature_matrix.columns.tolist(), n=15)
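
The ``feature_importances`` helper comes from this case study's ``utils`` module and its implementation is not shown here. As a rough idea of what such a ranking involves, the sketch below sorts importance values against feature names. ``top_features`` is a hypothetical function and the numbers are made up.

```python
def top_features(importances, names, n=15):
    """Hypothetical helper: the n largest importances with their feature names."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Made-up importances; with a fitted model these would come from
# model.feature_importances_ and feature_matrix.columns.tolist().
names = ["HOUR(pickup_datetime)", "trip_distance", "WEEKEND(pickup_datetime)"]
importances = [0.25, 0.70, 0.05]

for name, value in top_features(importances, names, n=2):
    print("{}: {:.2f}".format(name, value))
```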