## Project Predictive Analytics: New York City Taxi Ride Duration Prediction

## **Marks: 40**
---------------

## **Context**
---------------

New York City taxi rides form the core of the traffic in the city of New York. The many rides taken every day by New Yorkers can give us a good picture of traffic times, road blockages, and so on. A typical taxi company faces the common problem of assigning cabs to passengers efficiently so that the service is hassle-free. One of the main issues is predicting the duration of the current ride, so that the company can predict when the cab will be free for its next trip. The data set contains information about taxi trips in New York City and their durations. We will apply different techniques to gain insights into the data and determine how the trip duration depends on the other variables.

-----------------
## **Objective**
-----------------

- Build a predictive model for the duration of a taxi ride.
- Use automated feature engineering to create new features.

-----------------
## **Dataset**
-----------------

The ``trips`` table has the following fields
* ``id`` which uniquely identifies the trip
* ``vendor_id`` is the taxi cab company - in our case study we have data from two different cab companies
* ``pickup_datetime`` the time stamp for pickup
* ``dropoff_datetime`` the time stamp for drop-off
* ``passenger_count`` the number of passengers for the trip
* ``trip_distance`` total distance of the trip in miles 
* ``pickup_longitude`` the longitude for pickup
* ``pickup_latitude`` the latitude for pickup
* ``dropoff_longitude`` the longitude of dropoff
* ``dropoff_latitude`` the latitude of dropoff
* ``payment_type`` a numeric code signifying how the passenger paid for the trip: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided
* ``trip_duration`` this is the duration we would like to predict using other fields 
* ``pickup_neighborhood`` a one or two letter id of the neighborhood where the trip started
* ``dropoff_neighborhood`` a one or two letter id of the neighborhood where the trip ended
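
The ``payment_type`` codes listed above can be turned into readable labels with a simple lookup. The following is a minimal sketch on hypothetical toy rows (``toy_trips`` and ``PAYMENT_TYPE_LABELS`` are illustrative names, not part of the dataset); the codes are kept as strings because the loading function in this notebook converts ``payment_type`` to ``str``.

```python
import pandas as pd

# Code-to-label mapping copied from the field description above.
PAYMENT_TYPE_LABELS = {
    "1": "Credit card",
    "2": "Cash",
    "3": "No charge",
    "4": "Dispute",
    "5": "Unknown",
    "6": "Voided",
}

# Hypothetical toy rows; the real values come from trips.csv.
toy_trips = pd.DataFrame({"payment_type": [1, 2, 2, 4]})

# Store the codes as strings, then map each code to its label.
toy_trips["payment_type"] = toy_trips["payment_type"].astype(str)
toy_trips["payment_label"] = toy_trips["payment_type"].map(PAYMENT_TYPE_LABELS)
print(toy_trips)
```

Any code missing from the dictionary would map to ``NaN``, which makes unexpected values easy to spot.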



### We will do the following steps:
  * Install the dependencies
  * Load the data as pandas dataframe
  * Perform EDA on the dataset
  * Build features with Deep Feature Synthesis using the [featuretools](https://featuretools.com) package. We will start with simple features and incrementally improve the feature definitions and examine the accuracy of the system

#### Uncomment the code given below, and run the line of code to install featuretools library

In [None]:
# Uncomment the code given below, and run the line of code to install featuretools library

#!pip install featuretools==0.27.0

### Note: If `!pip install featuretools` doesn't work, install the library from the Anaconda Prompt with the following command:
      conda install -c conda-forge featuretools==0.27.0

### Importing libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Featuretools for feature engineering
import featuretools as ft

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Importing the evaluation metric and the regression models used for prediction
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

#importing primitives
from featuretools.primitives import (Minute, Hour, Day, Month,
                                     Weekday, IsWeekend, Count, Sum, Mean, Median, Std, Min, Max)

print(ft.__version__)
%load_ext autoreload
%autoreload 2

In [None]:
# set global random seed
np.random.seed(40)

# To load the dataset
def load_nyc_taxi_data():
    trips = pd.read_csv('trips.csv',
                        parse_dates=["pickup_datetime","dropoff_datetime"],
                        dtype={'vendor_id':"category",'passenger_count':'int64'},
                        encoding='utf-8')
    trips["payment_type"] = trips["payment_type"].apply(str)
    trips = trips.dropna(axis=0, how='any', subset=['trip_duration'])

    pickup_neighborhoods = pd.read_csv("pickup_neighborhoods.csv", encoding='utf-8')
    dropoff_neighborhoods = pd.read_csv("dropoff_neighborhoods.csv", encoding='utf-8')

    return trips, pickup_neighborhoods, dropoff_neighborhoods

### To preview the n rows (default 5) that have the fewest nulls.
def preview(df, n=5):
    """return n rows that have fewest number of nulls"""
    order = df.isnull().sum(axis=1).sort_values().head(n).index
    return df.loc[order]



#to compute features using automated feature engineering. 
def compute_features(features, cutoff_time):
    # shuffle so the encoded features are not all grouped at the front or the back

    np.random.shuffle(features)
    feature_matrix = ft.calculate_feature_matrix(features,
                                                 cutoff_time=cutoff_time,
                                                 approximate='36d',
                                                 verbose=True)
    print("Finishing computing...")
    feature_matrix, features = ft.encode_features(feature_matrix, features,
                                                  to_encode=["pickup_neighborhood", "dropoff_neighborhood"],
                                                  include_unknown=False)
    return feature_matrix


#to generate train and test dataset
def get_train_test_fm(feature_matrix, percentage):
    nrows = feature_matrix.shape[0]
    head = int(nrows * percentage)
    tail = nrows-head
    X_train = feature_matrix.head(head)
    y_train = X_train['trip_duration']
    X_train = X_train.drop(['trip_duration'], axis=1)
    imp = SimpleImputer()
    X_train = imp.fit_transform(X_train)
    X_test = feature_matrix.tail(tail)
    y_test = X_test['trip_duration']
    X_test = X_test.drop(['trip_duration'], axis=1)
    X_test = imp.transform(X_test)

    return (X_train, y_train, X_test,y_test)



#to see the feature importance of variables in the final model
def feature_importances(model, feature_names, n=5):
    importances = model.feature_importances_
    zipped = sorted(zip(feature_names, importances), key=lambda x: -x[1])
    for i, f in enumerate(zipped[:n]):
        print("%d: Feature: %s, %.3f" % (i+1, f[0], f[1]))

### Load the Datasets

In [None]:
trips, pickup_neighborhoods, dropoff_neighborhoods = load_nyc_taxi_data()
preview(trips, 10)

### Display first five rows

In [None]:
trips.head()

### Display info of the dataset

In [None]:
#checking the info of the dataset
trips.info()

- There are 974409 non null values in the dataset

### Check the number of unique values in the dataset.

In [None]:
# Check the number of unique values in each column
trips.nunique()

**Write your answers here:_____**
- vendor_id has only 2 unique values, which implies there are only 2 major taxi vendors.
- passenger_count has 8 unique values and payment_type has 4.
- There are 49 neighborhoods in the dataset in which a pickup or dropoff occurs.

### Question 1 : Check summary statistics of the dataset (1 Mark)

In [None]:
#checking the descriptive stats of the data

#Remove _________ and complete the code

trips._____________

**Write your answers here:_____**

#### Checking for the rows for which trip_distance is 0

In [None]:
#Checking the rows where trip distance is 0
trips[trips['trip_distance']==0]

**Write your answers here:_____**
- Where trip distance is 0, trip duration is not 0, so we can replace those zero distances.
- There are 3807 such rows.

#### Replacing the 0 values with median of the trip distance

In [None]:
trips['trip_distance']=trips['trip_distance'].replace(0,trips['trip_distance'].median())

In [None]:
trips[trips['trip_distance']==0].count()
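
The replacement above computes the median over all rows, including the zeros being replaced. The sketch below uses a hypothetical toy series (not the real data) to show how that median compares with one computed over the non-zero values only; which of the two is preferable is a judgment call rather than something this notebook prescribes.

```python
import pandas as pd

# Hypothetical toy distances with two zero entries.
toy_distance = pd.Series([0.0, 1.0, 2.0, 3.0, 0.0, 10.0])

# Median over all rows: the zeros pull it down.
median_with_zeros = toy_distance.median()

# Median over non-zero rows only (an alternative, shown for comparison).
median_without_zeros = toy_distance[toy_distance > 0].median()

# Same pattern as the cell above: swap every 0 for the chosen median.
replaced = toy_distance.replace(0, median_with_zeros)

print(median_with_zeros, median_without_zeros)
print(replaced.tolist())
```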

#### Checking for the rows for which trip_duration is 0

In [None]:
trips[trips['trip_duration']==0].head()

**Write your answers here:_____**
- Some trips have a trip duration of 0, which is not plausible for a completed ride, hence we can replace those values with the median trip duration.

In [None]:
trips['trip_duration']=trips['trip_duration'].replace(0,trips['trip_duration'].median())

In [None]:
trips[trips['trip_duration']==0].count()

### Question 2: Univariate Analysis

### Question 2.1: Build histograms for numerical columns (1 Mark)

In [None]:
#Remove _________ and complete the code
trips._______________
plt.show()

**Write your answers here:_____**

In [None]:
sns.boxplot(x=trips['trip_distance'])
plt.show()

- We can see there is an extreme outlier in the dataset; we should investigate it further.

In [None]:
trips[trips['trip_distance']>100]

- There are 2 observations with trip_distance > 500, and their trip durations differ hugely.
- Covering a distance of 501.4 in 141 seconds is not possible, so it is better to clip these values to 50.

#### Clipping the outliers of trip distance to 50

In [None]:
trips['trip_distance']=trips['trip_distance'].clip(trips['trip_distance'].min(),50)

In [None]:
sns.boxplot(x=trips['trip_distance'])
plt.show()
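
Clipping caps extreme values at a bound instead of dropping the rows, so the number of trips stays the same. A minimal sketch on a hypothetical toy series (only the 501.4 value is taken from the observation above):

```python
import pandas as pd

# Hypothetical toy distances; 501.4 mirrors the outlier discussed above.
toy_distance = pd.Series([0.5, 2.0, 12.0, 501.4])

# Cap everything above 50 at 50; values at or below 50 are untouched.
clipped = toy_distance.clip(upper=50)

# Count how many rows were affected by the cap.
capped_rows = int((toy_distance > 50).sum())

print(clipped.tolist(), capped_rows)
```

Passing only ``upper=50`` has the same effect here as passing the series minimum as the lower bound, as the cell above does.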

### Question 2.2 Plotting countplot for passenger_count (1 Mark)

In [None]:
#Remove _________ and complete the code

import seaborn as sns
plt.figure(figsize=(20,5))
sns.countplot(________________)
plt.show()

In [None]:
trips.passenger_count.value_counts(normalize=True)

**Write your answers here:_____**


### Question 2.3 Plotting countplot for pickup_neighborhood and dropoff_neighborhood (2 Marks)

In [None]:
#Remove _________ and complete the code
trips._____________

In [None]:
#Remove _________ and complete the code

trips._______________

**Write your answers here:_____**

In [None]:
pickup_neighborhoods.head()

### Bivariate analysis

#### Plot a scatter plot for trip distance and trip duration

In [None]:
sns.scatterplot(x=trips['trip_distance'], y=trips['trip_duration'])

- There is some positive correlation between trip_distance and trip_duration.

In [None]:
sns.countplot(x=trips['passenger_count'], hue=trips['payment_type'])

- No specific pattern can be observed.

### Step 2: Prepare the Data

Let's create entities and relationships. The three entities in this data are:
* trips 
* pickup_neighborhoods
* dropoff_neighborhoods

This data has the following relationships
* pickup_neighborhoods --> trips (one neighborhood can have multiple trips that start in it. This means pickup_neighborhoods is the ``parent_entity`` and trips is the child entity)
* dropoff_neighborhoods --> trips (one neighborhood can have multiple trips that end in it. This means dropoff_neighborhoods is the ``parent_entity`` and trips is the child entity)
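
The one-to-many relationships listed above can be checked in plain pandas before handing the tables to any feature engineering tool. This is a sketch on hypothetical toy tables; in particular, the key column name ``neighborhood_id`` is an assumption and may differ in the actual CSV files.

```python
import pandas as pd

# Hypothetical parent table: one row per neighborhood.
toy_neighborhoods = pd.DataFrame({"neighborhood_id": ["A", "B", "AA"]})

# Hypothetical child table: many trips can point at the same neighborhood.
toy_trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "pickup_neighborhood": ["A", "A", "B", "AA"],
})

# validate="many_to_one" raises a MergeError if the parent key is not unique,
# i.e. if the parent table does not really have one row per neighborhood.
joined = toy_trips.merge(
    toy_neighborhoods,
    left_on="pickup_neighborhood",
    right_on="neighborhood_id",
    how="left",
    validate="many_to_one",
)

# Number of child rows (trips) attached to each parent row (neighborhood).
trips_per_neighborhood = toy_trips.groupby("pickup_neighborhood")["id"].count()
print(trips_per_neighborhood)
```

A trip whose neighborhood is missing from the parent table would show up with ``NaN`` in ``neighborhood_id`` after the left join.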

In <a href="https://www.featuretools.com/">featuretools</a> (an automated feature engineering software package), we specify the list of entities and relationships as follows:


### Question 3: Define entities and relationships for the Deep Feature Synthesis (2 Marks)

In [None]:
#Remove _________ and complete the code

entities = { __________________ }

#Remove _________ and complete the code
relationships = [_________________________]

Next, we specify the cutoff time for each instance of the target_entity, in this case ``trips``. This timestamp represents the last time at which data can be used by DFS to calculate features. In this scenario, that is the pickup time, because we would like to predict the duration using only data available before the trip starts.

For the purposes of the case study, we choose to only select trips that started after January 12th, 2016. 

In [None]:
cutoff_time = trips[['id', 'pickup_datetime']]
cutoff_time = cutoff_time[cutoff_time['pickup_datetime'] > "2016-01-12"]
preview(cutoff_time, 10)
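
To make the idea of a cutoff time concrete, the sketch below computes, for each trip in a hypothetical toy table, a feature that only uses trips picked up before that trip's own pickup time. It is a simplified pandas illustration of the principle, not a reproduction of how featuretools handles cutoff times internally; the helper ``mean_distance_before`` is a hypothetical name.

```python
import pandas as pd

# Hypothetical toy trips, already sorted by pickup time.
toy_trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "pickup_datetime": pd.to_datetime([
        "2016-01-10 08:00", "2016-01-11 09:00",
        "2016-01-12 10:00", "2016-01-13 11:00",
    ]),
    "trip_distance": [1.0, 3.0, 5.0, 7.0],
})

def mean_distance_before(cutoff):
    """Mean trip_distance over trips picked up strictly before `cutoff`."""
    earlier = toy_trips[toy_trips["pickup_datetime"] < cutoff]
    return earlier["trip_distance"].mean()

# One cutoff per trip: its own pickup time.
cutoffs = toy_trips[["id", "pickup_datetime"]].copy()
cutoffs["mean_prior_distance"] = cutoffs["pickup_datetime"].apply(mean_distance_before)
print(cutoffs)
```

The first trip has no earlier data, so its feature is ``NaN``; later trips never see information from the future.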

### Step 3: Create baseline features using Deep Feature Synthesis

Instead of manually creating features, such as "month of pickup datetime", we can let DFS come up with them automatically. It does this by 
* interpreting the variable types of the columns e.g categorical, numeric and others 
* matching the columns to the primitives that can be applied to their variable types
* creating features based on these matches

**Create transform features using transform primitives**

As we described in the video, features fall into two major categories, ``transform`` and ``aggregate``. In featuretools, we can create transform features by specifying ``transform`` primitives. Below we specify a ``transform`` primitive called ``weekend``; here is what it does:

* It can be applied to any ``datetime`` column in the data. 
* For each entry in the column, it assesses whether the entry falls on a ``weekend`` and returns a boolean.

In this specific data, there are two ``datetime`` columns ``pickup_datetime`` and ``dropoff_datetime``. The tool automatically creates features using the primitive and these two columns as shown below. 
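
For intuition, the same kind of weekend flag can be computed by hand in pandas. This sketch uses hypothetical toy timestamps and treats Saturday and Sunday as the weekend; it illustrates what such a transform feature contains and does not replace the featuretools call in the question below.

```python
import pandas as pd

# Hypothetical toy pickups: a Friday, a Saturday, a Sunday and a Monday.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-01-15 23:50", "2016-01-16 00:10",
        "2016-01-17 12:00", "2016-01-18 08:30",
    ])
})

# dayofweek counts Monday as 0 and Sunday as 6, so 5 and 6 are the weekend.
toy["is_weekend"] = toy["pickup_datetime"].dt.dayofweek >= 5
print(toy)
```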

### Question 4: Creating a baseline model with only 1 transform primitive (10 Marks)

**Question 4.1: Define the transform primitive for weekend and define features using dfs**

In [None]:
#Remove _________ and complete the code
trans_primitives = [_______________]

#Remove _________ and complete the code
features = ft.dfs(entities=______________,
                  relationships=____________________,
                  target_entity="trips",
                  trans_primitives=_________________,
                  agg_primitives=[],
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)

*If you're interested in parameters to DFS such as `ignore_variables`, you can learn more about them [here](https://docs.featuretools.com/generated/featuretools.dfs.html#featuretools.dfs)*
<p>Here are the features created.</p>

In [None]:
print ("Number of features: %d" % len(features))
features


Now let's compute the features. 

**Question: 4.2 Compute features and define feature matrix**

In [None]:
def compute_features(features, cutoff_time):
    # shuffle so the encoded features are not all grouped at the front or the back

    np.random.shuffle(features)
    feature_matrix = ft.calculate_feature_matrix(features,
                                                 cutoff_time=cutoff_time,
                                                 approximate='36d',
                                                 verbose=True,entities=entities, relationships=relationships)
    print("Finishing computing...")
    feature_matrix, features = ft.encode_features(feature_matrix, features,
                                                  to_encode=["pickup_neighborhood", "dropoff_neighborhood"],
                                                  include_unknown=False)
    return feature_matrix

In [None]:
#Remove _________ and complete the code
feature_matrix1 = compute_features(______________)

In [None]:
preview(feature_matrix1, 5)

In [None]:
feature_matrix1.shape

### Build the Model

To build a model, we
* Separate the data into a portion for ``training`` (75% in this case) and a portion for ``testing`` 
* Transform the trip duration (square root or log, compared below) so that a more linear relationship can be found.
* Train a model using a ``Linear Regression, Decision Tree and Random Forest model``
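
The split used in this notebook takes the first 75% of the rows for training and the remaining rows for testing, and fits the imputer on the training part only. The sketch below reproduces that pattern on a hypothetical toy feature matrix (``toy_fm`` and ``feature_a`` are illustrative names).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy feature matrix with missing values in both parts.
toy_fm = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 5.0, 7.0, 4.0, np.nan, 100.0],
    "trip_duration": [10, 20, 30, 40, 50, 60, 70, 80],
})

# Ordered split: first 75% of rows for training, the rest for testing.
n_train = int(len(toy_fm) * 0.75)
train, test = toy_fm.iloc[:n_train], toy_fm.iloc[n_train:]

# The imputer learns the column mean from the training rows only ...
imputer = SimpleImputer()  # default strategy is the mean
X_train = imputer.fit_transform(train.drop(columns="trip_duration"))

# ... and reuses that training mean to fill gaps in the test rows.
X_test = imputer.transform(test.drop(columns="trip_duration"))

print(X_train.ravel(), X_test.ravel())
```

Because the test rows never influence the imputed value, the extreme value 100.0 in the test part does not leak into the fill value.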

#### Transforming the duration variable with sqrt and log

In [None]:
plt.hist(np.sqrt(trips['trip_duration']))

In [None]:
plt.hist(np.log(trips['trip_duration']))

- The sqrt transformation gives a nearly normal distribution, therefore we choose the sqrt transformation for the dependent variable (trip_duration).

### Splitting the data into train and test

In [None]:
# separates the whole feature matrix into train features, train labels,
# test features, and test labels
X_train, y_train, X_test, y_test = get_train_test_fm(feature_matrix1,.75)
y_train = np.sqrt(y_train)
y_test = np.sqrt(y_test)
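
Since the models are trained on the square root of the duration, predictions must be squared to get back to the original scale. The sketch below uses hypothetical durations (assumed to be in seconds) and a made-up constant error of 1 on the sqrt scale to show that the same sqrt-scale error corresponds to larger errors on the original scale for longer trips.

```python
import numpy as np

# Hypothetical trip durations, assumed to be in seconds.
durations = np.array([100.0, 400.0, 900.0, 3600.0])

# Target on the square-root scale, as used for training.
y_sqrt = np.sqrt(durations)

# Stand-in for model output: every prediction is off by 1 on the sqrt scale.
pretend_predictions = y_sqrt + 1.0

# Undo the transform by squaring, then compare on the original scale.
back_transformed = pretend_predictions ** 2
error_seconds = back_transformed - durations

print(error_seconds)
```

This is worth keeping in mind when reading RMSE and MAE values computed on the transformed target.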

### Defining functions to check the performance of the model

In [None]:
#RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())

# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))


# Model Performance on test and train data
def model_pref(model, x_train, x_test, y_train,y_test):

    # Insample Prediction
    y_pred_train = model.predict(x_train)
    y_observed_train = y_train

    # Prediction on test data
    y_pred_test = model.predict(x_test)
    y_observed_test = y_test

    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                'RSquared':
                    [r2_score(y_observed_train,y_pred_train),
                    r2_score(y_observed_test,y_pred_test )
                    ],
                "RMSE": [
                    rmse(y_pred_train, y_observed_train),
                    rmse(y_pred_test, y_observed_test),
                ],
                "MAE": [
                    mae(y_pred_train, y_observed_train),
                    mae(y_pred_test, y_observed_test),
                ],
            }
        )
    )
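
As a quick sanity check of the two error metrics, here is a small worked example on hypothetical numbers. The functions are repeated so that the snippet runs on its own.

```python
import numpy as np

def rmse(predictions, targets):
    """Root mean squared error."""
    return np.sqrt(((targets - predictions) ** 2).mean())

def mae(predictions, targets):
    """Mean absolute error."""
    return np.mean(np.abs(targets - predictions))

# Hypothetical targets and predictions: errors are 1, 0 and -2.
targets = np.array([3.0, 5.0, 7.0])
predictions = np.array([2.0, 5.0, 9.0])

rmse_value = rmse(predictions, targets)  # sqrt((1 + 0 + 4) / 3), about 1.291
mae_value = mae(predictions, targets)    # (1 + 0 + 2) / 3 = 1.0

print(rmse_value, mae_value)
```

RMSE is larger than MAE here because squaring gives the single error of 2 more weight than the error of 1.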

#### Question 4.3 Build Linear regression using only weekend transform primitive

In [None]:
#Remove _________ and complete the code

#defining the model

lr1=_______________

#fitting the model
lr1.___________


#### Check the performance of the model

In [None]:
#Remove _________ and complete the code
model_pref(_______________)  

**Write your answers here:_____**

#### Question 4.4 Building decision tree using only weekend transform primitive

In [None]:
#Remove _________ and complete the code

#define the model
dt=______________

#fit the model

dt.fit(_________________)

#### Check the performance of the model

In [None]:
#Remove _________ and complete the code
model_pref(_________________)  

**Write your answers here:_____**


#### Question 4.5 Building Pruned decision tree using only weekend transform primitive

In [None]:
#Remove _________ and complete the code
#define the model

#use max_depth=7
dt_pruned=_____________________

#fit the model
dt_pruned.fit(_____________________)

#### Check the performance of the model

In [None]:
#Remove _________ and complete the code
model_pref(_______________)  

**Write your answers here:_____**


#### Question 4.6 Building Random Forest using only weekend transform primitive

In [None]:
#Remove _________ and complete the code

#define the model

#using (n_estimators=60,max_depth=7)

rf=RandomForestRegressor(n_estimators=60,max_depth=7)

In [None]:
#fit the model

#Remove _________ and complete the code
rf._____________________

#### Check the performance of the model

In [None]:
#Remove _________ and complete the code

model_pref(___________________________)

**Write your answers here:_____**


### Step 4: Adding more Transform Primitives and creating new model

* Add ``Minute``, ``Hour``, ``Month``, ``Weekday`` , etc primitives
* All these transform primitives apply to ``datetime`` columns
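
Each of these primitives extracts one component of a timestamp. For intuition, the pandas ``.dt`` accessor exposes the same components; the sketch below applies it to a single hypothetical pickup time and is only an illustration, not the featuretools definition asked for in the question below.

```python
import pandas as pd

# One hypothetical pickup: Saturday, 16 January 2016, 17:45.
ts = pd.Series(pd.to_datetime(["2016-01-16 17:45:00"]))

# One column per datetime component; weekday counts Monday as 0.
parts = pd.DataFrame({
    "minute": ts.dt.minute,
    "hour": ts.dt.hour,
    "day": ts.dt.day,
    "month": ts.dt.month,
    "weekday": ts.dt.weekday,
})
print(parts)
```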

### Question 5: Create models with more transform primitives (10 Marks)

**Question 5.1 Define more transform primitives and define features using dfs?**

In [None]:
#Remove _________ and complete the code
trans_primitives = [_____________________________]

#Remove _________ and complete the code
features = ft.dfs(entities=_________,
                  relationships=____________,
                  target_entity="trips",
                  trans_primitives=____________________,
                  agg_primitives=[],
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)

In [None]:
print ("Number of features: %d" % len(features))
features

Now let's compute the features. 

**Question: 5.2 Compute features and define feature matrix**

In [None]:
#Remove _________ and complete the code
feature_matrix2 = compute_features(______________)

In [None]:
feature_matrix2.shape

In [None]:
feature_matrix2.head()

### Build the new models with more transform features

In [None]:
# separates the whole feature matrix into train features, train labels,
# test features, and test labels
X_train2, y_train2, X_test2, y_test2 = get_train_test_fm(feature_matrix2,.75)
y_train2 = np.sqrt(y_train2)
y_test2 = np.sqrt(y_test2)

#### Question 5.3 Building Linear regression using more transform primitive

In [None]:
#Remove _________ and complete the code

#defining the model

lr2=_______________

#fitting the model
lr2.___________


#### Check the performance of the model

In [None]:
#Remove _________ and complete the code
model_pref(_______________)  

**Write your answers here:_____**

#### Question 5.4 Building Decision tree using more transform primitive

In [None]:
#Remove _________ and complete the code

#define the model
dt2=______________

#fit the model

dt2.fit(_________________)

#### Check the performance of the model

In [None]:
#Remove _________ and complete the code
model_pref(_________)  

**Write your answers here:_____**

#### Question 5.5 Building Pruned Decision tree using more transform primitive

In [None]:
#Remove _________ and complete the code
#define the model

#use max_depth=7
dt_pruned2=_____________________

#fit the model
dt_pruned2.fit(_____________________)

#### Check the performance of the model

In [None]:
#Remove _________ and complete the code
model_pref(_________________________)  

**Write your answers here:_____**

#### Question 5.6 Building Random Forest using more transform primitive

In [None]:
#define the model

#Remove _________ and complete the code
#using (n_estimators=60,max_depth=7)

rf2=_____________________

#fit the model

#Remove _________ and complete the code
rf2._____________________

#### Check the performance of the model

In [None]:
#Remove _________ and complete the code
model_pref(________________)  

**Write your answers here:_____**


**Question: 5.7 Comment on how the modeling accuracy differs when including more transform features.**

**Write your answers here:_____**

### Step 5: Add Aggregation Primitives

Now let's add aggregation primitives. These primitives will generate features for the parent entities ``pickup_neighborhoods`` and ``dropoff_neighborhoods``, and then add them to the ``trips`` entity, which is the entity for which we are trying to make a prediction.
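
The aggregation step described above can be pictured as a group-by on the parent key followed by a join back onto the trips. The sketch below does this in plain pandas on a hypothetical toy table; unlike DFS with cutoff times, it aggregates over all trips at once, so it is an illustration of the shape of the resulting features rather than a leakage-safe implementation.

```python
import pandas as pd

# Hypothetical toy trips with their pickup neighborhood.
toy_trips = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "pickup_neighborhood": ["A", "A", "B", "A"],
    "trip_distance": [1.0, 3.0, 10.0, 2.0],
})

# Aggregate the child rows per parent: count, mean and max of trip_distance.
per_neighborhood = (
    toy_trips.groupby("pickup_neighborhood")["trip_distance"]
    .agg(["count", "mean", "max"])
    .add_prefix("pickup_neighborhood_trip_distance_")
    .reset_index()
)

# Attach the parent-level aggregates to every trip from that neighborhood.
enriched = toy_trips.merge(per_neighborhood, on="pickup_neighborhood", how="left")
print(enriched)
```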

### Question 6: Create models with transform and aggregate primitives (10 Marks)
**6.1 Define more transform and aggregate primitives and define features using dfs**

In [None]:
#Remove _________ and complete the code

trans_primitives = [____________]
aggregation_primitives = [____________________]

features = ft.dfs(entities=entities,
                  relationships=relationships,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=aggregation_primitives,
                  ignore_variables={"trips": ["pickup_latitude", "pickup_longitude",
                                              "dropoff_latitude", "dropoff_longitude"]},
                  features_only=True)

In [None]:
print ("Number of features: %d" % len(features))
features

**Question: 6.2 Compute features and define feature matrix**

In [None]:
#Remove _________ and complete the code
feature_matrix3 = compute_features(_______________)

In [None]:
feature_matrix3.head()

### Build the new models with more transform and aggregate features

In [None]:
# separates the whole feature matrix into train features, train labels,
# test features, and test labels
X_train3, y_train3, X_test3, y_test3 = get_train_test_fm(feature_matrix3,.75)
y_train3 = np.sqrt(y_train3)
y_test3 = np.sqrt(y_test3)

#### Question 6.3 Building Linear regression model with transform and aggregate primitives

In [None]:
#Remove _________ and complete the code

#defining the model

lr3=_______________

#fitting the model
lr3.___________


#### Check the performance of the model

In [None]:
#Remove _________ and complete the code
model_pref(______________)  

**Write your answers here:_____**

#### Question 6.4 Building Decision tree with transform and aggregate primitives

In [None]:
#Remove _________ and complete the code

#define the model
dt3=______________

#fit the model

dt3.fit(_________________)

#### Check the performance of the model

In [None]:
#Remove _________ and complete the code
model_pref(________________)  

**Write your answers here:_____**

#### Question 6.5 Building Pruned Decision tree with transform and aggregate primitives

In [None]:
#Remove _________ and complete the code
#define the model

#use max_depth=7
dt_pruned3=_____________________

#fit the model
dt_pruned3.fit(_____________________)

#### Check the performance of the model

In [None]:
#Remove _________ and complete the code
model_pref(___________________)  

**Write your answers here:_____**


#### Question 6.6 Building Random Forest with transform and aggregate primitives

In [None]:
#define the model

#Remove _________ and complete the code
#using (n_estimators=60,max_depth=7)

rf3=_____________________

#fit the model

#Remove _________ and complete the code
rf3._____________________

#### Check the performance of the model

In [None]:
model_pref(rf3, X_train3, X_test3,y_train3,y_test3)  

**Write your answers here:_____**



**Question 6.7 How do these aggregate transforms impact performance? How do they impact training time?**

**Write your answers here:_____**

#### Based on the above 3 models, we make predictions using model 2: it gives almost the same accuracy as model 3, and its training time is much shorter

In [None]:
y_pred = rf2.predict(X_test2)
y_pred = y_pred**2 # undo the sqrt we took earlier
y_pred[:5] # show the first five predictions

### Question 7: What are some important features based on model2 and how can they affect the duration of the rides? (3 Marks)

In [None]:
feature_importances(___________________________)
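
To see how impurity-based importances behave, the sketch below fits a small decision tree on synthetic data in which only one column drives the target, then ranks the columns the same way the helper function does. The data, the column names ``informative`` and ``noise``, and the choice of a single tree are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on the first column only.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = 3.0 * X[:, 0]
names = ["informative", "noise"]

# A shallow tree is enough to pick up the single useful column.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Sort features from most to least important; importances sum to 1.
ranked = sorted(zip(names, tree.feature_importances_), key=lambda p: -p[1])
for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}: Feature: {name}, {score:.3f}")
```

A high importance says that the model relies on a feature for its splits; it does not by itself say in which direction the feature moves the predicted duration.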

**Write your answers here:_____**
