# <b><u> Project Title : Taxi trip time Prediction : Predicting total ride duration of taxi trips in New York City</u></b>

## <b> Problem Description </b>

### The task is to build a model that predicts the total ride duration of taxi trips in New York City. The primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

## <b> Data Description </b>

### Data fields
* #### id - a unique identifier for each trip
* #### vendor_id - a code indicating the provider associated with the trip record
* #### pickup_datetime - date and time when the meter was engaged
* #### dropoff_datetime - date and time when the meter was disengaged
* #### passenger_count - the number of passengers in the vehicle (driver entered value)
* #### pickup_longitude - the longitude where the meter was engaged
* #### pickup_latitude - the latitude where the meter was engaged
* #### dropoff_longitude - the longitude where the meter was disengaged
* #### dropoff_latitude - the latitude where the meter was disengaged
* #### store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* #### trip_duration - duration of the trip in seconds

# <b>Import libraries</b>

In [None]:
!pip install klib

In [None]:
import klib 
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.metrics import accuracy_score, auc
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
import datetime as dt
import warnings; warnings.simplefilter('ignore')

# <b>Mount Google Drive</b>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# <b>Import Dataset</b>

In [None]:
nyc_taxi = pd.read_csv('/content/drive/MyDrive/NYC/NYC.csv')

# <b>Data Overview</b>

In [None]:
nyc_taxi.head()

In [None]:
nyc_taxi.tail()

In [None]:
nyc_taxi.info()

In [None]:
nyc_taxi.describe(include= 'all')

In [None]:
nyc_taxi.isnull().sum()

In [None]:
nyc_taxi.nunique()

# <b>Exploratory Data Analysis</b>

## <b>Let's explore the dataset and deal with any inconsistencies we find.</b>

<b>We use the klib Python library for cleaning, analyzing and preprocessing data.</b>

In [None]:
klib.cat_plot(nyc_taxi)

In [None]:
klib.dist_plot(nyc_taxi)

In [None]:
klib.corr_plot(nyc_taxi, target=nyc_taxi['trip_duration'])

In [None]:
print("Number of rows is: ", nyc_taxi.shape[0])
print("Number of columns is: ", nyc_taxi.shape[1])

In [None]:
nyc_taxi.columns

In [None]:
nyc_taxi['pickup_datetime'] = pd.to_datetime(nyc_taxi['pickup_datetime'])
nyc_taxi['dropoff_datetime'] = pd.to_datetime(nyc_taxi['dropoff_datetime'])

In [None]:
nyc_taxi.describe()

# **Feature Creation**
Now let us use the parsed pickup_datetime and dropoff_datetime columns to create some new date and time features.

In [None]:
nyc_taxi['pickup_day']=nyc_taxi['pickup_datetime'].dt.day_name()
nyc_taxi['dropoff_day']=nyc_taxi['dropoff_datetime'].dt.day_name()

nyc_taxi['pickup_day_no']=nyc_taxi['pickup_datetime'].dt.weekday
nyc_taxi['dropoff_day_no']=nyc_taxi['dropoff_datetime'].dt.weekday

nyc_taxi['pickup_hour']=nyc_taxi['pickup_datetime'].dt.hour
nyc_taxi['dropoff_hour']=nyc_taxi['dropoff_datetime'].dt.hour

nyc_taxi['pickup_month']=nyc_taxi['pickup_datetime'].dt.month
nyc_taxi['dropoff_month']=nyc_taxi['dropoff_datetime'].dt.month

**I have created the following features:**

* **pickup_day and dropoff_day**, which contain the name of the day on which the ride was taken.
* **pickup_day_no and dropoff_day_no**, which contain the day number instead of the name, with Monday=0 and Sunday=6.
* **pickup_hour and dropoff_hour**, the hour of the day in 24-hour format.
* **pickup_month and dropoff_month**, the month number (January=1).
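As a quick sanity check of the `.dt` accessors used above, the sketch below applies them to a single made-up timestamp (it is not a row from the dataset, and the `sample` and `features` names are only for illustration).

```python
import pandas as pd

# Hypothetical timestamp, used only to illustrate the accessors.
sample = pd.Series(pd.to_datetime(['2016-03-14 17:24:55']))

features = {
    'day': sample.dt.day_name().iloc[0],        # name of the weekday
    'day_no': int(sample.dt.weekday.iloc[0]),   # Monday=0 ... Sunday=6
    'hour': int(sample.dt.hour.iloc[0]),        # 0-23
    'month': int(sample.dt.month.iloc[0]),      # 1-12
}
print(features)
```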

### **Importing the geopy.distance library, which will help us calculate distance from geographical coordinates.**

In [None]:
from geopy.distance import great_circle

In [None]:
def cal_distance(pickup_lat,pickup_long,dropoff_lat,dropoff_long):
 
 start_coordinates=(pickup_lat,pickup_long)
 stop_coordinates=(dropoff_lat,dropoff_long)
 
 return great_circle(start_coordinates,stop_coordinates).km

In [None]:
nyc_taxi['distance'] = nyc_taxi.apply(lambda x: cal_distance(x['pickup_latitude'],x['pickup_longitude'],x['dropoff_latitude'],x['dropoff_longitude'] ), axis=1)
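Row-wise `apply` can be slow on a dataset of this size. As a possible alternative, the sketch below computes the great-circle distance with the haversine formula on whole NumPy arrays at once. The function name `haversine_km` and the radius constant are assumptions made for this illustration, and results may differ very slightly from geopy's.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009  # mean Earth radius, assumed for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or arrays given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of longitude along the equator is roughly 111 km.
print(haversine_km(0.0, 0.0, 0.0, 1.0))
```

With the notebook's columns this could be called as `haversine_km(nyc_taxi['pickup_latitude'], nyc_taxi['pickup_longitude'], nyc_taxi['dropoff_latitude'], nyc_taxi['dropoff_longitude'])`.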

**Let's create a new feature, speed (km/h), from the distance and duration columns.**

In [None]:
nyc_taxi['speed'] = (nyc_taxi.distance*3600/(nyc_taxi.trip_duration))



**Next, we bucket the pickup and dropoff hours into four times of day:**

1.**Morning** (from 6:00 am to 11:59 am),

2.**Afternoon** (from 12 noon to 3:59 pm),

3.**Evening** (from 4:00 pm to 9:59 pm), and

4.**Late Night** (from 10:00 pm to 5:59 am)

In [None]:
def time_of_day(x):
    if x in range(6,12):
        return 'Morning'
    elif x in range(12,16):
        return 'Afternoon'
    elif x in range(16,22):
        return 'Evening'
    else:
        return 'Late night'

In [None]:
nyc_taxi['pickup_timeofday'] = nyc_taxi['pickup_hour'].apply(time_of_day)
nyc_taxi['dropoff_timeofday']=nyc_taxi['dropoff_hour'].apply(time_of_day)
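The same bucketing can be written without a Python-level loop. The sketch below is one possible vectorised equivalent using `np.select`; the function name `time_of_day_vectorized` is made up for this illustration.

```python
import numpy as np

def time_of_day_vectorized(hours):
    """Map hours (0-23) to the four time-of-day labels used in this notebook."""
    hours = np.asarray(hours)
    conditions = [
        (hours >= 6) & (hours <= 11),    # Morning
        (hours >= 12) & (hours <= 15),   # Afternoon
        (hours >= 16) & (hours <= 21),   # Evening
    ]
    labels = ['Morning', 'Afternoon', 'Evening']
    # Anything not matched above (22-23 and 0-5) falls back to 'Late night'.
    return np.select(conditions, labels, default='Late night')

print(time_of_day_vectorized([0, 6, 12, 16, 22]).tolist())
```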

In [None]:
nyc_taxi.head()

In [None]:
nyc_taxi.dtypes

Our dataset is now ready for further analysis before we select features and train the models.

# **Analysis**

**Target Variable**

Let us start with the target variable, i.e. trip duration.
# **1. Trip duration**

In [None]:
sns.histplot(nyc_taxi['trip_duration'],kde=False,bins=20)

In [None]:
sns.boxplot(x=nyc_taxi['trip_duration'])  # pass the column by keyword; newer seaborn versions expect this

In [None]:
for i in range(0,100,10):
  duration= nyc_taxi['trip_duration'].values
  duration= np.sort(duration, axis= None)
  print("{} percentile value is {}".format(i, duration[int(len(duration)*(float(i)/100))]))
print("100 percentile value is ",duration[-1])

In [None]:
for i in range(90,100):
  duration= nyc_taxi['trip_duration'].values
  duration= np.sort(duration, axis= None)
  print("{} percentile value is {}".format(i, duration[int(len(duration)*(float(i)/100))]))
print("100 percentile value is ",duration[-1])

In [None]:
for i in range(0,10):
  duration= nyc_taxi['trip_duration'].values
  duration= np.sort(duration, axis= None)
  print("{} percentile value is {}".format(i, duration[int(len(duration)*(float(i)/100))]))
print("100 percentile value is ",duration[-1])

In [None]:
nyc_taxi = nyc_taxi[nyc_taxi.trip_duration <= 3400]

In [None]:
plt.figure(figsize = (10,5))
sns.distplot(nyc_taxi['trip_duration'])
plt.xlabel('Trip Duration')
plt.show()

In [None]:
plt.figure(figsize = (10,5))
sns.distplot(np.log10(nyc_taxi['trip_duration']))
plt.xlabel('Trip Duration')
plt.show()

In [None]:
plt.figure(figsize = (10,5))
sns.boxplot(x=nyc_taxi.trip_duration)  # pass the column by keyword; newer seaborn versions expect this
plt.xlabel('Trip Duration')
plt.show()

The first box plot clearly shows outliers, which should be removed for data consistency.

We calculate the 0th to 100th percentiles to find a suitable cut-off for removing outliers.

The 90th percentile, i.e. 1634 seconds, looks reasonable, but the 100th percentile, i.e. 3526282 seconds (about 41 days), is clearly an outlier, so removing such trips is a good idea.

We then expand the range from the 90th to the 100th percentile to look more closely at the 99th percentile. The filter above keeps only trips of at most 3400 seconds.
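The percentile loops can also be expressed with pandas' built-in `quantile`. The sketch below uses a toy series rather than the taxi data, and the helper name `percentile_table` is an assumption for illustration. Note that `quantile` interpolates linearly between values by default, so its results can differ slightly from the index-based lookup used in the loops.

```python
import pandas as pd

def percentile_table(values, percents):
    """Return a Series mapping each percentile (0-100) to its value."""
    s = pd.Series(values)
    result = s.quantile([p / 100 for p in percents])
    result.index = list(percents)  # label rows by percentile instead of fraction
    return result

# Toy data: the integers 0..100, so the p-th percentile is simply p.
toy_durations = pd.Series(range(101))
table = percentile_table(toy_durations, range(0, 101, 10))
print(table)
```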

**Let's visualize the number of trips taken in different slabs of ... seconds respectively**

In [None]:
plt.figure(figsize = (10,5))
nyc_taxi.trip_duration.groupby(pd.cut(nyc_taxi.trip_duration, np.arange(1,5000,500))).count().plot(kind='bar')
plt.xlabel('Trip Duration Slots in Second')
plt.ylabel('Trip Counts')
plt.show()

In [None]:
plt.figure(figsize = (10,5))
nyc_taxi.trip_duration.groupby(pd.cut(nyc_taxi.trip_duration, np.arange(0,600,60))).count().plot(kind='bar')
plt.xlabel('Trip Duration Slots in Second')
plt.ylabel('Trip Counts')
plt.show()

In [None]:
plt.figure(figsize = (10,5))
nyc_taxi.trip_duration.groupby(pd.cut(nyc_taxi.trip_duration, np.arange(0,61,5))).count().plot(kind='bar')
plt.xlabel('Trip Duration Slots in Second')
plt.ylabel('Trip Counts')
plt.show()

# **2.Pickup_timeofday & Dropoff_timeofday**

In [None]:
figure,(ax3,ax4)=plt.subplots(ncols=2,figsize=(20,5))
ax3.set_title('Pickup Time of Day')
ax=sns.countplot(x="pickup_timeofday",data=nyc_taxi,ax=ax3)
ax4.set_title('Dropoff Time of Day')
ax=sns.countplot(x="dropoff_timeofday",data=nyc_taxi,ax=ax4)

As we saw above, evenings are the busiest.

# <b>3. Vendor id

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(x=nyc_taxi.vendor_id)  # pass the column by keyword; newer seaborn versions expect this
plt.xlabel('Vendor ID')
plt.ylabel('Count')
plt.show()

We see that there is not much difference between the number of trips taken with the two vendors.

# <b>4. Passenger count

In [None]:
sns.boxplot(x=nyc_taxi['passenger_count'])  # pass the column by keyword; newer seaborn versions expect this

In [None]:
no_of_passenger = nyc_taxi['passenger_count'].value_counts().reset_index()
no_of_passenger.rename(columns={'index':'no_of_passenger', 'passenger_count':'trip_counts'})

Let us remove the rows with a passenger count of 0, 7, 8 or 9.

In [None]:
nyc_taxi = nyc_taxi[nyc_taxi['passenger_count'] != 0]
nyc_taxi = nyc_taxi[nyc_taxi['passenger_count']<=6]

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(x='passenger_count',data=nyc_taxi)
plt.ylabel('Count')
plt.xlabel('No. of Passengers')
plt.show()

* We see that the highest number of trips was taken by a single passenger.
* Large groups of people travelling together are rare.

# <b>5.Store and Forward Flag

In [None]:
nyc_taxi['store_and_fwd_flag'].value_counts(normalize=True)

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(x='store_and_fwd_flag',data=nyc_taxi)
plt.ylabel('Count')
plt.xlabel('store_and_fwd_flag')
plt.show()

# <b>6.Distance

In [None]:
plt.figure(figsize = (10,5))
sns.distplot(nyc_taxi['distance'])
plt.xlabel('distance')
plt.show()

In [None]:
nyc_taxi = nyc_taxi[nyc_taxi['distance'] > 0.05]

In [None]:
plt.figure(figsize = (10,5))
sns.distplot(np.log10(nyc_taxi['distance']))
plt.xlabel('distance')
plt.show()

In [None]:
plt.figure(figsize = (10,5))
nyc_taxi.distance.groupby(pd.cut(nyc_taxi.distance, np.arange(0,1200,100))).count().plot(kind='bar')
plt.xlabel('Trip distance Slots in km')
plt.ylabel('Trip Counts')
plt.show()

In [None]:
plt.figure(figsize = (10,5))
nyc_taxi.distance.groupby(pd.cut(nyc_taxi.distance, np.arange(100,1001,100))).count().plot(kind='bar')
plt.xlabel('Trip distance Slots in km')
plt.ylabel('Trip Counts')
plt.show()

In [None]:
plt.figure(figsize = (10,5))
nyc_taxi.distance.groupby(pd.cut(nyc_taxi.distance, np.arange(0.5,10.1,0.5))).count().plot(kind='bar')
plt.xlabel('Trip distance Slots in km')
plt.ylabel('Trip Counts')
plt.show()

In [None]:
plt.figure(figsize = (10,5))
nyc_taxi.speed.groupby(pd.cut(nyc_taxi.distance, np.arange(0,1.05,0.05))).count().plot(kind='bar')
plt.xlabel('Trip distance Slots in km')
plt.ylabel('Trip Counts')
plt.show()

In [None]:
nyc_taxi = nyc_taxi[nyc_taxi['distance'] <= 100]

In [None]:
nyc_taxi.distance.max()

# <b>7.Speed

In [None]:
plt.figure(figsize = (10,5))
sns.distplot(nyc_taxi['speed'])
plt.xlabel('Speed (km/hr)')
plt.show()

In [None]:
nyc_taxi.speed.max()

So in some rows the computed speed of the taxi is very high, from **200 up to 9274 km/h**, which is unreasonable.

Possible reasons include:

1. Some passengers may have cancelled the trip partway, after travelling some distance.
2. The dropoff location couldn't be tracked.
3. The passengers or driver cancelled the trip due to some issue.
4. A technical issue in the software, etc.

So, in order to have consistent data, **let's drop the rows with a speed above 50 km/h or below 5 km/h.**

In [None]:
nyc_taxi = nyc_taxi[nyc_taxi['speed']<=50]
nyc_taxi = nyc_taxi[nyc_taxi['speed']>=5]
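The cleaning thresholds used throughout this analysis (duration, passenger count, distance and speed) are applied in separate cells. As a sketch of how they could be gathered in one place, the hypothetical helper `filter_plausible_trips` below combines them into a single boolean mask; the default values mirror the cut-offs used in this notebook, and the toy frame is made-up data.

```python
import pandas as pd

def filter_plausible_trips(df, max_duration=3400, min_distance=0.05,
                           max_distance=100, min_speed=5, max_speed=50,
                           max_passengers=6):
    """Keep only rows that pass all the plausibility checks."""
    mask = (
        (df['trip_duration'] <= max_duration)
        & (df['distance'] > min_distance)
        & (df['distance'] <= max_distance)
        & df['speed'].between(min_speed, max_speed)          # inclusive on both ends
        & df['passenger_count'].between(1, max_passengers)   # drops 0 and 7+
    )
    return df[mask]

# Toy rows: only the first and last satisfy every condition.
toy = pd.DataFrame({
    'trip_duration':   [600, 5000, 300, 200, 700, 3400],
    'distance':        [3.0, 10.0, 0.01, 6.7, 3.5, 100.0],
    'speed':           [18.0, 7.2, 6.0, 120.0, 18.0, 5.0],
    'passenger_count': [1, 1, 2, 1, 0, 6],
})
clean = filter_plausible_trips(toy)
print(clean.index.tolist())
```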

In [None]:
plt.figure(figsize = (10,5))
sns.distplot(nyc_taxi['speed'])
plt.xlabel('Speed (km/hr)')
plt.show()

Before filtering, there were trips recorded at speeds of over 100 km/h.

As per NYC rules, the speed limit in New York City is 25 mph (approx. 40 km/h).

 **Most trips are made at speeds in the range of 5-25 km/h.**

# <b>8. Pickup_hour & Dropoff_hour</b>

In [None]:
figure,(ax3,ax4)=plt.subplots(ncols=2,figsize=(20,5))
ax3.set_title('Pickup Time of Day (24hr format)')
ax=sns.countplot(x="pickup_hour",data=nyc_taxi,ax=ax3)
ax4.set_title('Dropoff Time of Day (24hr format)')
ax=sns.countplot(x="dropoff_hour",data=nyc_taxi,ax=ax4)

We see the busiest hours are 6:00 pm to 7:00 pm which makes sense as this is the time for people to return home from work.

# <b>9. Pickup_day & Dropoff_day</b>

In [None]:
figure,(ax1,ax2)=plt.subplots(ncols=2,figsize=(20,5))
ax1.set_title('Pickup Days')
ax=sns.countplot(x="pickup_day",data=nyc_taxi,ax=ax1)
ax2.set_title('Dropoff Days')
ax=sns.countplot(x="dropoff_day",data=nyc_taxi,ax=ax2)

We see Fridays are the busiest days, followed by Saturdays. That is probably because the weekend is starting.

# **10. Pickup_month & Dropoff_month**

In [None]:
figure,(ax1,ax2)=plt.subplots(ncols=2,figsize=(20,5))
ax1.set_title('Pickup Months (Jan=1 to June=6)')
ax=sns.countplot(x="pickup_month",data=nyc_taxi,ax=ax1)
ax2.set_title('Dropoff Months (Jan=1 to June=6)')
ax=sns.countplot(x="dropoff_month",data=nyc_taxi,ax=ax2)

There is not much difference in the number of trips across months.

## <b>Latitude and longitude</b>

In [None]:
figure,(ax3,ax4)=plt.subplots(ncols=2,figsize=(20,5))
ax3.set_title('Pickup Location')
ax=sns.scatterplot(x=nyc_taxi.pickup_longitude,y=nyc_taxi.pickup_latitude,ax=ax3)
ax4.set_title('Dropoff Location')
ax=sns.scatterplot(x=nyc_taxi.dropoff_longitude,y=nyc_taxi.dropoff_latitude,ax=ax4)

# <b>Bivariate Analysis</b>

<b>1. Trip Duration per Vendor</b>

In [None]:
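# catplot creates its own figure, so a separate plt.figure() call is not needed.
# kind='bar' lets `estimator` apply: each bar shows the mean trip duration per vendor.
sns.catplot(y='trip_duration',x='vendor_id',data=nyc_taxi,kind='bar',estimator=np.mean)
plt.xlabel('Vendor ID')
plt.ylabel('Trip Duration')
plt.show()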

There is no notable difference between vendors 1 and 2.

<b>2. Trip Duration per Store and Forward Flag</b>

In [None]:
plt.figure(figsize = (10,5))
sns.catplot(y='trip_duration',x='store_and_fwd_flag',data=nyc_taxi,kind='strip')
plt.xlabel('Store and Forward Flag')
plt.ylabel('Duration (seconds)')
plt.show()

From the graph above we can see that the longest trips mostly have the flag set to N, i.e. their records were not held in vehicle memory before being sent to the vendor.

<b>3.Trip Duration per hour

In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(x='pickup_hour',y='trip_duration',data=nyc_taxi)
plt.xlabel('Time of Pickup (24hr format)')
plt.ylabel('Duration (seconds)')
plt.show()

* We see the trip duration is the maximum around 3 pm which may be because of traffic on the roads.
* Trip duration is the lowest around 6 am as streets may not be busy.

<b>4.Trip duration per weekday

In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(x='pickup_day_no',y='trip_duration',data = nyc_taxi)
plt.ylabel('Duration (seconds)')
plt.xlabel('week days')
plt.show()

Trip duration on Thursday is the longest among all days.


<b>5.Trip duration per month

In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(x='pickup_month',y='trip_duration', data = nyc_taxi)
plt.ylabel('Duration (seconds)')
plt.xlabel('Month of Trip ')
plt.show()

* From February, we can see trip duration rising every month.
* There might be some seasonal parameters like wind/rain which can be a factor of this gradual increase in trip duration over a period.


<b>6.Distance and Hour

In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(y='distance',x='pickup_hour',data=nyc_taxi)
plt.ylabel('Distance')
plt.xlabel('Pickup Hour')
plt.show()

* Trip distance is highest during early morning hours.
* It is fairly equal from morning till the evening varying around 3 - 3.5 kms.
* It starts increasing gradually towards the late night hours starting from evening till 5 AM and decrease steeply towards morning.


**7.Passenger_count and Trip Duration**

In [None]:
sns.catplot(y='trip_duration',x='passenger_count',data=nyc_taxi)

<b>8.Distance and Trip Duration

In [None]:
plt.figure(figsize = (10,5))
plt.scatter(x='trip_duration', y='distance',data=nyc_taxi)
plt.ylabel('Distance')
plt.xlabel('Trip Duration')
plt.show()


**9.Passenger_count and Distance**

In [None]:
sns.catplot(y='distance',x='passenger_count',data=nyc_taxi,kind='strip')

We see some of the longer distances are covered by either 1 or 2 or 4 passenger rides.

<b>10. Pickup_month and Distance

In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(x='pickup_month',y='distance',data= nyc_taxi)

The maximum average distance is covered in the month of May.

<b>11. Distance and Store and Forward Flag

In [None]:
sns.catplot(y='distance',x='store_and_fwd_flag',data=nyc_taxi,kind='strip')

Longer distances are covered when the flag is N, i.e. when the trip record was not stored and forwarded.

<b>12. Distance and Vendor

In [None]:
sns.catplot(y='distance',x='vendor_id',data=nyc_taxi,kind='strip')

# <b>Feature Engineering</b>

**One Hot Encoding**

Dummify the categorical features 'store_and_fwd_flag', 'pickup_timeofday' and 'dropoff_timeofday'.

In [None]:
nyc_taxi.head(2)

In [None]:
nyc_taxi = pd.get_dummies(nyc_taxi, columns=["store_and_fwd_flag", "pickup_timeofday","dropoff_timeofday"], prefix=["store", "pickup","dropoff"])
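To make the effect of `pd.get_dummies` with the `prefix` argument concrete, the sketch below runs it on a three-row toy frame (made-up values, not taken from the dataset). Each category becomes its own indicator column named `<prefix>_<category>`.

```python
import pandas as pd

# Toy frame with two of the categorical columns used above.
toy = pd.DataFrame({
    'store_and_fwd_flag': ['N', 'Y', 'N'],
    'pickup_timeofday': ['Morning', 'Evening', 'Late night'],
})

encoded = pd.get_dummies(
    toy,
    columns=['store_and_fwd_flag', 'pickup_timeofday'],
    prefix=['store', 'pickup'],
)
print(list(encoded.columns))
```

These are the same column names (for example `store_Y` or `pickup_Late night`) that appear later when columns are dropped before modeling.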

# <b>Correlation Analysis</b>

In [None]:
plt.figure(figsize=(20,12))
correlation = nyc_taxi.corr(numeric_only=True)  # skip id, day-name and datetime columns, which cannot be correlated
sns.heatmap(abs(correlation), annot=True)

In [None]:
df_corr = nyc_taxi.copy()
df_corr.columns

In [None]:
df_corr.drop(['dropoff_Afternoon','dropoff_Evening', 'dropoff_Late night', 'dropoff_Morning','store_Y','store_N','dropoff_day_no',
              'pickup_Evening','pickup_Morning','dropoff_month','dropoff_hour', 'id'],axis=1,inplace=True)

In [None]:
df_corr.columns

In [None]:
plt.figure(figsize=(20,12))
correlation = df_corr.corr(numeric_only=True)  # skip day-name and datetime columns, which cannot be correlated
sns.heatmap(abs(correlation), annot=True)

# <b>Preparing Dataset for Modeling</b>

In [None]:
independent_variables=['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude','distance', 'pickup_hour']

dependent_variables = 'trip_duration'

In [None]:
X = df_corr[independent_variables]

y = df_corr[dependent_variables]

In [None]:
print(X.shape)
print(y.shape)

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
X[0:2]

<b>Splitting the data into train and test sets</b>

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=0)

The selected dataset is split 65-35 for training and testing respectively.
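In this notebook the scaler is fitted on all of `X` before the split, so the test rows influence the scaling statistics. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training portion only. All names ending in `_demo` and the random data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.35, random_state=0)

# Fit on the training rows only, then reuse the same statistics for the test rows.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0), X_te_scaled.mean(axis=0))
```

The training columns end up with mean 0 and unit variance, while the test columns are only approximately standardised, which is expected.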

In [None]:
print('Train Data Shape')
print(X_train.shape)
print(y_train.shape)
print('\n')
print('Test Data Shape')
print(X_test.shape)
print(y_test.shape)

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# <b>Linear Regression</b>

In [None]:
from sklearn.linear_model import LinearRegression
linear_reg =  LinearRegression()

linear_reg.fit(X_train, y_train)

In [None]:
linear_reg.score(X_train, y_train)

In [None]:
y_pred_train = linear_reg.predict(X_train)
y_pred_test = linear_reg.predict(X_test)

<b>Linear Regression Model Evaluation

In [None]:
lr_train_mse  = mean_squared_error((y_train), (y_pred_train))
print("Train MSE :" , lr_train_mse)

lr_train_r2 = r2_score((y_train), (y_pred_train))
print("Train R2 :" ,lr_train_r2) 

lr_train_r2_ = 1-(1-r2_score((y_train), (y_pred_train)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",lr_train_r2_)

In [None]:
lr_test_mse  = mean_squared_error((y_test), (y_pred_test))
print("Test MSE :" , lr_test_mse)

lr_test_r2 = r2_score((y_test), (y_pred_test))
print("Test R2 :" ,lr_test_r2)

lr_test_r2_ = 1-(1-r2_score((y_test), (y_pred_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",lr_test_r2_)
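The adjusted R2 expression used in these cells is 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. Since the same expression recurs for every model, it could be wrapped in a small helper. The name `adjusted_r2` below is an assumption for illustration, not something defined elsewhere in this notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2: penalises R2 for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# With few samples the penalty is large: R2 = 0.5 on 11 samples and 5 features
# gives 1 - 0.5 * 10 / 5.
print(adjusted_r2(0.5, 11, 5))
```

With the variables from the cells above, a call would look like `adjusted_r2(lr_test_r2, X_test.shape[0], X_test.shape[1])`.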

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_test, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_test, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

As we can clearly see, the linear regression model does not give accurate predictions: it has a high prediction error on the metrics we tested.
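MSE is expressed in squared seconds, which is hard to interpret. Taking the square root gives the RMSE, which is back in seconds. The helper below is a minimal sketch; the name `rmse` is an assumption for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of 3 and 4 seconds: sqrt((9 + 16) / 2).
print(rmse([0, 0], [3, 4]))
```

For the linear model above, this would be `rmse(y_test, y_pred_test)`, or equivalently `np.sqrt(lr_test_mse)`.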

In [None]:
plt.figure(figsize=(10,5))
sns.distplot(y_test - y_pred_test)
plt.title('Error Term', fontsize=20)
plt.show()

# <b>Running Lasso Regression</b>

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='r2', cv=5)

In [None]:
lasso_regressor.fit(X_train, y_train)

In [None]:
print('The best fit alpha value is found out to be :', lasso_regressor.best_params_)
print('The R2 score using the same alpha is :', lasso_regressor.best_score_)

**The best parameter for the Lasso regression, as found in an earlier run, is recorded here to save time when running the notebook again.**
* The best fit alpha value is found out to be : {'alpha': 0.01}

In [None]:
lasso_regressor.score(X_train, y_train)

In [None]:
y_pred_lasso_train = lasso_regressor.predict(X_train)
y_pred_lasso_test = lasso_regressor.predict(X_test)

<b>Lasso Regression Model Evaluation

In [None]:
lasso_train_mse  = mean_squared_error(y_train, y_pred_lasso_train)
print("Train MSE :" , lasso_train_mse)

lasso_train_r2 = r2_score(y_train, y_pred_lasso_train)
print("Train R2 :" ,lasso_train_r2)

lasso_train_r2_= 1-(1-r2_score(y_train, y_pred_lasso_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ", lasso_train_r2_)

In [None]:
lasso_test_mse  = mean_squared_error(y_test, y_pred_lasso_test)
print("Test MSE :" , lasso_test_mse)

lasso_test_r2 = r2_score(y_test, y_pred_lasso_test)
print("Test R2 :" ,lasso_test_r2)

lasso_test_r2_= 1-(1-r2_score(y_test, y_pred_lasso_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ", lasso_test_r2_)

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_lasso_test, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_lasso_test, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

The Lasso regression model doesn't improve on the linear model.

In [None]:
plt.figure(figsize=(10,5))
sns.distplot(y_test - y_pred_lasso_test)
plt.title('Error Term', fontsize=20)
plt.show()

# <b>Running Ridge Regression</b>

In [None]:
from sklearn.linear_model import Ridge
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='r2', cv=5)
ridge_regressor.fit(X_train, y_train)

In [None]:
print('The best fit alpha value is found out to be :' ,ridge_regressor.best_params_)
print('The R2 score using the same alpha is :', ridge_regressor.best_score_)

**The best parameter for the Ridge regression, as found in an earlier run, is recorded here to save time when running the notebook again.**
* The best fit alpha value is found out to be : {'alpha': 30}

In [None]:
y_pred_ridge_train=ridge_regressor.predict(X_train)
y_pred_ridge_test = ridge_regressor.predict(X_test)

<b>Ridge Regression Model Evaluation

In [None]:
ridge_train_mse  = mean_squared_error(y_train, y_pred_ridge_train)
print("Train MSE :" , ridge_train_mse)

ridge_train_r2 = r2_score(y_train, y_pred_ridge_train)
print("Train R2 :" ,ridge_train_r2)

ridge_train_r2_= 1-(1-r2_score(y_train, y_pred_ridge_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ", ridge_train_r2_)

In [None]:
ridge_test_mse  = mean_squared_error(y_test, y_pred_ridge_test)
print("Test MSE :" , ridge_test_mse)

ridge_test_r2 = r2_score(y_test, y_pred_ridge_test)
print("Test R2 :" ,ridge_test_r2)

ridge_test_r2_= 1-(1-r2_score(y_test, y_pred_ridge_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ", ridge_test_r2_)

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_ridge_test, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_ridge_test, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

The Ridge regression model doesn't improve on the linear model either.

In [None]:
plt.figure(figsize=(10,5))
sns.distplot(y_test - y_pred_ridge_test)
plt.title('Error Term', fontsize=20)
plt.show()

# <b>Running Decision Tree Regressor</b>

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
max_depth = [4,6,8,10]

min_samples_split = [10,20,30]

min_samples_leaf = [8,16,22]

param_dict_dt = {'max_depth' : max_depth,'min_samples_split' : min_samples_split,'min_samples_leaf' : min_samples_leaf}

In [None]:
dt = DecisionTreeRegressor()

dt_grid = GridSearchCV(estimator=dt, param_grid = param_dict_dt, cv = 5, verbose=2, scoring='r2')

dt_grid.fit(X_train,y_train)

In [None]:
print('The best hyperparameters are found to be :' ,dt_grid.best_params_)
print('The best cross-validated R2 score with these hyperparameters is :', dt_grid.best_score_)

**The best parameters for the Decision Tree regression, as found in an earlier run, are recorded here to save time when running the notebook again.**
* {'max_depth': 10, 'min_samples_leaf': 22, 'min_samples_split': 30}

In [None]:
y_pred_dt_train=dt_grid.predict(X_train)
y_pred_dt_test=dt_grid.predict(X_test)

<b>Decision Tree Regressor Model Evaluation

In [None]:
dt_train_mse  = mean_squared_error(y_train, y_pred_dt_train)
print("Train MSE :" , dt_train_mse)

dt_train_r2 = r2_score(y_train, y_pred_dt_train)
print("Train R2 :" ,dt_train_r2)

dt_train_r2_= 1-(1-r2_score(y_train, y_pred_dt_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ", dt_train_r2_)

In [None]:
dt_test_mse  = mean_squared_error(y_test, y_pred_dt_test)
print("Test MSE :" , dt_test_mse)

dt_test_r2 = r2_score(y_test, y_pred_dt_test)
print("Test R2 :" ,dt_test_r2)

dt_test_r2_= 1-(1-r2_score(y_test, y_pred_dt_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ", dt_test_r2_)

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_dt_test, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_dt_test, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.distplot(y_test - y_pred_dt_test )
plt.title('Error Term', fontsize=20)
plt.show()

The decision tree with the selected hyperparameters does improve the predictions of the model considerably. It still isn't ideal but it is certainly much better than Linear models.

# <b>Random Forest</b>

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
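n_estimators = [int(x) for x in np.linspace(start = 100, stop = 500, num = 5)]
# 'auto' is no longer accepted by recent scikit-learn releases; for regressors it meant "all features", i.e. 1.0
max_features = [1.0, 'sqrt']
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]
random_grid = {'n_estimators': n_estimators,'max_features': max_features,'max_depth': max_depth,'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

# A dict has no .fit(); assuming a randomized search over this grid was intended (n_iter and cv are illustrative choices).
from sklearn.model_selection import RandomizedSearchCV
rf_random = RandomizedSearchCV(RandomForestRegressor(), param_distributions=random_grid, n_iter=10, cv=3, scoring='r2', random_state=0)
rf_random.fit(X_train, y_train)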

The settings for the Random Forest regression, as chosen in an earlier run, are recorded here to save time when running the notebook again.

{'n_estimators': 40}, with n_jobs = -4 to control how many CPU cores are used.

In [None]:
forest_reg = RandomForestRegressor(n_estimators = 40, n_jobs = -4)

In [None]:
forest_reg.fit(X_train, y_train)

In [None]:
y_pred_forest_train = forest_reg.predict(X_train)
y_pred_forest_test = forest_reg.predict(X_test)

<b>Random Forest Model Evaluation

In [None]:
forest_train_mse  = mean_squared_error(y_train, y_pred_forest_train)
print("Train MSE :" , forest_train_mse)

forest_train_r2 = r2_score(y_train, y_pred_forest_train)
print("Train R2 :" ,forest_train_r2)

forest_train_r2_= 1-(1-r2_score((y_train), (y_pred_forest_train)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",forest_train_r2_)

In [None]:
forest_test_mse  = mean_squared_error(y_test, y_pred_forest_test)
print("Test MSE :" , forest_test_mse)

forest_test_r2 = r2_score(y_test, y_pred_forest_test)
print("Test R2 :" ,forest_test_r2)

forest_test_r2_= 1-(1-r2_score((y_test), (y_pred_forest_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ", forest_test_r2_)

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_forest_test, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_forest_test, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.distplot(y_test - y_pred_forest_test )
plt.title('Error Term', fontsize=20)
plt.show()

# <b>Running XGBoost Regressor</b>

In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

In [None]:
xgb_model = xgb.XGBRegressor(random_state=0, objective='reg:squarederror')
param_tuning = {'learning_rate': [0.1, 0.2, 0.3],'max_depth': [5, 8, 10],'min_samples_' : [2,4,6],'n_estimators' : [100,200,300]}

xgb_model = GridSearchCV(xgb_model, param_grid = param_tuning,scoring = 'r2', cv=5,verbose=1,)
xgb_model.fit(X_train,y_train)

**The best parameters for XGBoost, as found in an earlier run, are recorded here to save time when running the notebook again.**
* {'learning_rate': 0.2, 'max_depth': 8, 'min_samples_': 4, 'n_estimators': 200}

In [None]:
xgb_model = xgb.XGBRegressor()
grid_values = {'n_estimators' : [200], 'max_depth': [8],'min_samples_' : [4],'learning_rate' : [0.2]}
xgb_model = GridSearchCV(estimator = xgb_model, param_grid = grid_values, scoring = 'r2', cv=3,verbose=1,)

In [None]:
xgb_model.fit(X_train,y_train)

In [None]:
print('The best cross-validated R2 score is :', xgb_model.best_score_)

In [None]:
xgb_model.best_params_

In [None]:
y_pred_xgb_train=xgb_model.predict(X_train)
y_pred_xgb_test=xgb_model.predict(X_test)

<b>XGBoost Regressor Model Evaluation

In [None]:
xgb_train_mse  = mean_squared_error(y_train, y_pred_xgb_train)
print("Train MSE :" , xgb_train_mse)

xgb_train_r2 = r2_score(y_train, y_pred_xgb_train)
print("Train R2 :" ,xgb_train_r2)

xgb_train_r2_= 1-(1-r2_score((y_train), (y_pred_xgb_train)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ", xgb_train_r2_)

In [None]:
xgb_test_mse  = mean_squared_error(y_test, y_pred_xgb_test)
print("Test MSE :" , xgb_test_mse)

xgb_test_r2 = r2_score(y_test, y_pred_xgb_test)
print("Test R2 :" ,xgb_test_r2)

xgb_test_r2_= 1-(1-r2_score((y_test), (y_pred_xgb_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ", xgb_test_r2_)

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_xgb_test, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_xgb_test, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.distplot(y_test - y_pred_xgb_test)
plt.title('Error Term', fontsize=20)
plt.show()

## <b>Finally, let's also look at the feature importance.</b>

In [None]:
importance_df= pd.DataFrame({'Features': independent_variables, 'Feature_importance': list(xgb_model.best_estimator_.feature_importances_)})
importance_df

In [None]:
importance_df.sort_values(by=['Feature_importance'],ascending=False,inplace=True)

Let's look at it using a bar graph.

In [None]:
plt.figure(figsize=(15,6))
plt.title('Feature Importance', fontsize=20)
sns.barplot(x="Feature_importance",y='Features', data=importance_df[:6], orient = 'h')
plt.show()

Clearly, we can see that distance is the top contributor to trip duration, followed by the remaining model inputs (the pickup and dropoff coordinates and the pickup hour).

# <b>Evaluating the models</b>
Models summary for the train data.

In [None]:
models= ['Linear Regression', 'Lasso Regression', 'Ridge Regression','DecisionTree Regressor','Random Forest', 'XGBoost Regressor']
train_mse= [lr_train_mse, lasso_train_mse, ridge_train_mse, dt_train_mse, forest_train_mse, xgb_train_mse]
train_r2= [lr_train_r2, lasso_train_r2, ridge_train_r2, dt_train_r2, forest_train_r2, xgb_train_r2]
train_adjusted_r2= [lr_train_r2_, lasso_train_r2_, ridge_train_r2_, dt_train_r2_, forest_train_r2_, xgb_train_r2_]

<b>Models Summary for the test data.

In [None]:
models= ['Linear Regression', 'Lasso Regression', 'Ridge Regression','DecisionTree Regressor','Random Forest', 'XGBoost Regressor']
test_mse= [lr_test_mse, lasso_test_mse, ridge_test_mse, dt_test_mse, forest_test_mse, xgb_test_mse]
test_r2= [lr_test_r2, lasso_test_r2, ridge_test_r2, dt_test_r2, forest_test_r2, xgb_test_r2]
test_adjusted_r2= [lr_test_r2_, lasso_test_r2_, ridge_test_r2_, dt_test_r2_, forest_test_r2_, xgb_test_r2_]

<b>Model Comparison & Selection

In [None]:
model_comparison = pd.DataFrame({'Model Name': models,
                          'Train MSE': train_mse,'Test MSE': test_mse,
                          'Train R^2': train_r2, 'Test R^2': test_r2,
                          'Train Adjusted R^2': train_adjusted_r2, 'Test Adjusted R^2': test_adjusted_r2})
model_comparison

# **Conclusion**
* We used MSE, R2 and adjusted R2 to evaluate the performance of the regression models: **Linear Regression, Lasso, Ridge, Decision Tree, Random Forest and XGBoost Regressor.**
* The linear models don't perform well on either the training or the test data.
* From the table above we can conclude that the **XGBoost Regressor (81%)** is the best of the models compared for predicting the trip duration of a taxi ride.