<a href="https://colab.research.google.com/github/aynaval/nyc-taxi-trip-duration-predicton/blob/main/NYC_Taxi_Trip_Time_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Taxi trip time Prediction : Predicting total ride duration of taxi trips in New York City</u></b>

## <b> Problem Description </b>

### Your task is to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

## <b> Data Description </b>

### The dataset is based on the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). The data was sampled and cleaned for the purposes of this project. Based on individual trip attributes, you should predict the duration of each trip in the test set.

### <b>NYC Taxi Data.csv</b> - the training set (contains 1458644 trip records)


### Data fields
* #### id - a unique identifier for each trip
* #### vendor_id - a code indicating the provider associated with the trip record
* #### pickup_datetime - date and time when the meter was engaged
* #### dropoff_datetime - date and time when the meter was disengaged
* #### passenger_count - the number of passengers in the vehicle (driver entered value)
* #### pickup_longitude - the longitude where the meter was engaged
* #### pickup_latitude - the latitude where the meter was engaged
* #### dropoff_longitude - the longitude where the meter was disengaged
* #### dropoff_latitude - the latitude where the meter was disengaged
* #### store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* #### trip_duration - duration of the trip in seconds

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import datetime
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

pd.set_option("display.max_columns", 36)
plt.style.use('seaborn')

plt.rcParams["font.weight"] = "bold"
plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["axes.titlesize"] = 25
plt.rcParams["axes.titleweight"] = 'bold'
plt.rcParams['xtick.labelsize']=15
plt.rcParams['ytick.labelsize']=15
plt.rcParams["axes.labelsize"] = 20
plt.rcParams["legend.fontsize"] = 15
plt.rcParams["legend.title_fontsize"] = 15
plt.rcParams['figure.figsize'] = [20, 10]
from geopy.distance import great_circle



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
data = pd.read_csv('/content/drive/MyDrive/NYC Taxi Trip Time Prediction - Lavanya M/Copy of NYC Taxi Data.csv')

# **EDA**

In [4]:
# !pip install pandas-profiling==2.7.1
# from pandas_profiling import ProfileReport
# prof = ProfileReport(data)
# prof.to_file(output_file='output.html')

In [5]:
data.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435


In [6]:
data.tail()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
1458639,id2376096,2,2016-04-08 13:31:04,2016-04-08 13:44:02,4,-73.982201,40.745522,-73.994911,40.74017,N,778
1458640,id1049543,1,2016-01-10 07:35:15,2016-01-10 07:46:10,1,-74.000946,40.747379,-73.970184,40.796547,N,655
1458641,id2304944,2,2016-04-22 06:57:41,2016-04-22 07:10:25,1,-73.959129,40.768799,-74.004433,40.707371,N,764
1458642,id2714485,1,2016-01-05 15:56:26,2016-01-05 16:02:39,1,-73.982079,40.749062,-73.974632,40.757107,N,373
1458643,id1209952,1,2016-04-05 14:44:25,2016-04-05 14:47:43,1,-73.979538,40.78175,-73.972809,40.790585,N,198


In [7]:
data.sample(3)

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
167768,id1966168,2,2016-04-26 15:24:12,2016-04-26 15:39:27,1,-73.986961,40.731544,-74.008263,40.736137,N,915
157395,id2126971,2,2016-04-16 03:32:00,2016-04-16 03:49:41,2,-73.990112,40.760723,-73.898048,40.750378,N,1061
137178,id2228243,2,2016-04-19 12:33:15,2016-04-19 12:45:36,1,-73.970352,40.761688,-73.980904,40.767891,N,741


In [8]:
data.shape

(1458644, 11)

In [9]:
data.size

16045084

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   id                  1458644 non-null  object 
 1   vendor_id           1458644 non-null  int64  
 2   pickup_datetime     1458644 non-null  object 
 3   dropoff_datetime    1458644 non-null  object 
 4   passenger_count     1458644 non-null  int64  
 5   pickup_longitude    1458644 non-null  float64
 6   pickup_latitude     1458644 non-null  float64
 7   dropoff_longitude   1458644 non-null  float64
 8   dropoff_latitude    1458644 non-null  float64
 9   store_and_fwd_flag  1458644 non-null  object 
 10  trip_duration       1458644 non-null  int64  
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB


In [11]:
data.describe()

Unnamed: 0,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration
count,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0
mean,1.53495,1.66453,-73.97349,40.75092,-73.97342,40.7518,959.4923
std,0.4987772,1.314242,0.07090186,0.03288119,0.07064327,0.03589056,5237.432
min,1.0,0.0,-121.9333,34.3597,-121.9333,32.18114,1.0
25%,1.0,1.0,-73.99187,40.73735,-73.99133,40.73588,397.0
50%,2.0,1.0,-73.98174,40.7541,-73.97975,40.75452,662.0
75%,2.0,2.0,-73.96733,40.76836,-73.96301,40.76981,1075.0
max,2.0,9.0,-61.33553,51.88108,-61.33553,43.92103,3526282.0


* The minimum trip duration is 1 second and the maximum is 3526282 seconds (approximately 40 days).
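
As a quick check of the conversion quoted above, here is a small sketch that uses only the standard library. The variable names are illustrative and not part of the notebook.

```python
from datetime import timedelta

max_duration_s = 3526282  # maximum trip_duration reported by data.describe()
max_duration = timedelta(seconds=max_duration_s)

# Whole days contained in the maximum duration, plus the fractional value
days_whole = max_duration.days
days_exact = max_duration_s / 86400  # 86400 seconds per day

print(days_whole, round(days_exact, 1))
```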


In [12]:
data.nunique()

id                    1458644
vendor_id                   2
pickup_datetime       1380222
dropoff_datetime      1380377
passenger_count            10
pickup_longitude        23047
pickup_latitude         45245
dropoff_longitude       33821
dropoff_latitude        62519
store_and_fwd_flag          2
trip_duration            7417
dtype: int64

In [None]:
sns.heatmap(data.isnull());
plt.title('null values')
plt.tight_layout()

* There are no missing values in the data.

In [None]:
data.duplicated().value_counts()

* There are no duplicate values.

## **Distribution**

### Categorical columns

In [None]:
sns.countplot(data=data,x='vendor_id');


* The two vendors account for almost equal shares of the trips.

In [None]:
sns.countplot(data=data,x='store_and_fwd_flag');


* Only a few records were held in vehicle memory before being forwarded (Y).

### Numerical columns

* pickup_datetime and dropoff_datetime were both stored with the object dtype, so they are converted to datetime to make the data easier to explore.

In [4]:
data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'])

In [5]:
data['dropoff_datetime'] = pd.to_datetime(data['dropoff_datetime'])

In [None]:
print(data['pickup_datetime'].dt.year.unique(),data['dropoff_datetime'].dt.year.unique())

* Both the pickup and drop-off columns contain the same year, so the year can be ignored.

In [None]:
(data['pickup_datetime'].dt.month.unique(),data['dropoff_datetime'].dt.month.unique())

In [None]:
(data[data['dropoff_datetime'].dt.month==7]['trip_duration']//60).value_counts().sort_index().plot();

In [None]:
data[data['dropoff_datetime'].dt.month==7][['pickup_datetime','dropoff_datetime']]

* dropoff_datetime covers one extra month compared to pickup_datetime: only 127 observations have a drop-off month of 7.
* These observations come from rides that were mostly picked up late at night at the end of month 6.
* Most of these rides lasted under 30 minutes (the plot above bins trip_duration in minutes).

In [None]:
((data['dropoff_datetime']-data['pickup_datetime']).dt.total_seconds().astype(int) == data['trip_duration']).value_counts() 

* The drop-off column can be dropped because (drop-off time - pickup time) equals trip_duration.
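
The same identity can be checked by hand on the first three rows shown in the `data.head()` output. This is a minimal standard-library sketch; `elapsed_seconds` is a hypothetical helper, not something the notebook defines.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (pickup, dropoff, recorded trip_duration) copied from the data.head() output
rows = [
    ("2016-03-14 17:24:55", "2016-03-14 17:32:30", 455),
    ("2016-06-12 00:43:35", "2016-06-12 00:54:38", 663),
    ("2016-01-19 11:35:24", "2016-01-19 12:10:48", 2124),
]

def elapsed_seconds(pickup, dropoff):
    """Return the whole seconds between two timestamp strings."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    return int(delta.total_seconds())

# True for every row where the timestamps reproduce the recorded duration
matches = [elapsed_seconds(p, d) == dur for p, d, dur in rows]
print(matches)
```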

In [6]:
data.drop('dropoff_datetime',axis =1,inplace= True)

In [None]:
plt.figure(figsize=(15,10))
n =1;
for i in ['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']:
  plt.subplot(2,2,n)
  sns.boxplot(data[i])
  n+=1
plt.tight_layout()

Some longitude and latitude coordinates lie outside New York City. The city borders are taken as:

* city_long_border = (-74.03, -73.75)
* city_lat_border = (40.63, 40.85)
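
One possible way to apply these borders is a single boolean mask instead of a chain of reassignments. This is a sketch under assumptions: `within_city` is a hypothetical helper, and the second row of the sample frame is a made-up out-of-range point used only for illustration.

```python
import pandas as pd

# Borders quoted above
CITY_LONG_BORDER = (-74.03, -73.75)
CITY_LAT_BORDER = (40.63, 40.85)

def within_city(df):
    """Boolean mask: True where both pickup and dropoff lie inside the box."""
    mask = pd.Series(True, index=df.index)
    for col in ("pickup_longitude", "dropoff_longitude"):
        mask &= df[col].between(*CITY_LONG_BORDER)  # inclusive on both ends
    for col in ("pickup_latitude", "dropoff_latitude"):
        mask &= df[col].between(*CITY_LAT_BORDER)
    return mask

# Row 0 is the first trip from data.head(); row 1 is a hypothetical far-away point
sample = pd.DataFrame({
    "pickup_longitude": [-73.982155, -121.9333],
    "pickup_latitude": [40.767937, 34.3597],
    "dropoff_longitude": [-73.96463, -121.9333],
    "dropoff_latitude": [40.765602, 32.18114],
})
filtered = sample[within_city(sample)]
print(len(filtered))
```

`Series.between` is inclusive by default, which matches the `<=` and `>=` comparisons used in the notebook.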



In [7]:


# dropping outliers
data = data[data['pickup_longitude'] <= -73.75]
data = data[data['pickup_longitude'] >= -74.03]
data = data[data['pickup_latitude'] <= 40.85]
data = data[data['pickup_latitude'] >= 40.63]
data = data[data['dropoff_longitude'] <= -73.75]
data = data[data['dropoff_longitude'] >= -74.03]
data = data[data['dropoff_latitude'] <= 40.85]
data = data[data['dropoff_latitude'] >= 40.63]

In [None]:
sns.boxplot(data['trip_duration']);

In [None]:
data['trip_duration'].describe()

* The minimum value is 1 second and the maximum is 352682 seconds (i.e., about 4 days).
* Outliers are removed by keeping only rows within 3 standard deviations of the mean of log10(trip_duration).
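
A minimal sketch of this kind of filter on the log10 scale, assuming synthetic durations (the values below are illustrative, not taken from the dataset) and a hypothetical helper `log_sigma_mask`. It uses three standard deviations, as in the cells that follow, but computes both bounds from one mean and one standard deviation rather than recomputing them after the first cut.

```python
import numpy as np

def log_sigma_mask(durations_s, n_sigma=3.0):
    """True where log10(duration) lies strictly within n_sigma std devs of its mean."""
    log_d = np.log10(np.asarray(durations_s, dtype=float))
    mean = log_d.mean()
    std = log_d.std(ddof=1)  # ddof=1 matches the default of pandas' Series.std()
    lower, upper = mean - n_sigma * std, mean + n_sigma * std
    return (log_d > lower) & (log_d < upper)

# Synthetic durations in seconds: 1000 typical trips plus two extreme values
rng = np.random.default_rng(0)
typical = rng.integers(300, 1500, size=1000)
durations = np.concatenate([typical, [1, 3526282]])

mask = log_sigma_mask(durations)
print(int(mask.sum()), "of", mask.size, "rows kept")
```

Working on the log scale keeps the cut roughly symmetric for a heavily right-skewed variable; on the raw scale the lower bound would fall below zero and remove nothing.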

In [None]:

sns.histplot(data['trip_duration'],bins = 100)
plt.title('skew :'+str(data['trip_duration'].skew()))
plt.ticklabel_format(style='plain')


In [None]:
sns.histplot(np.log10(data['trip_duration']),bins=100);
plt.title('skew :'+str(np.log10(data['trip_duration']).skew()));

In [8]:
data['log_trip_duration']= np.log10(data['trip_duration'])

In [None]:
data['log_trip_duration'].mean()- 3*data['log_trip_duration'].std()

In [9]:
data = data[data['log_trip_duration']>(data['log_trip_duration'].mean()- 3*data['log_trip_duration'].std())]
data = data[data['log_trip_duration']<(data['log_trip_duration'].mean()+ 3*data['log_trip_duration'].std())]


In [10]:
data['month'] = data['pickup_datetime'].dt.month_name()

In [11]:
data['day_no'] = data['pickup_datetime'].dt.day

In [12]:
data['day'] = data['pickup_datetime'].dt.day_name()

In [13]:
data['hour'] = data['pickup_datetime'].dt.hour

In [14]:
data['minute'] = data['pickup_datetime'].dt.minute

In [15]:
data['second'] = data['pickup_datetime'].dt.second

In [None]:

sns.lineplot(data=data,x='month',y='trip_duration');


* January has the shortest average trip duration and June has the longest.

In [None]:
sns.lineplot(data=data,x='day_no',y='trip_duration' );

* Trip duration is lower at the beginning and end of the month, and highest around the 25th day of the month.

In [None]:
sns.lineplot(data=data,x='day',y='trip_duration');

* Trip duration is lowest at weekends and highest on Thursdays.
* This trend holds for all months.

In [None]:
sns.boxplot(data=data,x='month',y='trip_duration',hue='day');

In [None]:
sns.lineplot(data=data,x='hour',y='trip_duration',marker="x");

* Trip duration is lowest from midnight to early morning (12 am to 6 am).
* Trip duration increases in the morning after 6 am until late afternoon (around 4 pm), after which it starts to decrease.
* The highest trip duration occurs at around 3 pm.

In [None]:
sns.catplot(data=data,kind='count',x='hour',col='day');
plt.tight_layout()

* On weekends people do travel from after midnight to early morning, while on weekdays few people travel around midnight.

In [None]:
sns.barplot(data=data,x='passenger_count',y='trip_duration',hue='day');

* Trips recorded with 0 passengers show high trip durations; these are likely outliers, so they are removed.

In [16]:
len(data[data['passenger_count']==0])

14

In [129]:
data = data[data['passenger_count']>0]

In [17]:
def distancer(row):
    coords_1 = (row['pickup_latitude'], row['pickup_longitude'])
    coords_2 = (row['dropoff_latitude'], row['dropoff_longitude'])
    return great_circle(coords_1, coords_2).km

data['pickup_dropoff_distance'] = data.apply(distancer, axis=1)
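
The `pickup_dropoff_distance` feature above is a great-circle distance. For reference, the haversine formula gives the same kind of distance with the standard library alone. This is a sketch rather than geopy's implementation: `haversine_km` is a hypothetical helper and the Earth radius constant is an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Pickup and drop-off of the first trip in data.head()
d = haversine_km(40.767937, -73.982155, 40.765602, -73.96463)
print(round(d, 2), "km")
```

Because the formula uses only elementary functions, it can also be written with NumPy arrays to avoid a row-wise `apply` over more than a million rows.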

In [None]:
sns.lineplot(data=data,y='pickup_dropoff_distance',x='hour');

* The longest distances are travelled at around 5 am and the shortest at around 9 am.

In [None]:
sns.catplot(data=data,x='store_and_fwd_flag',y='pickup_dropoff_distance');

* The distance distribution is almost the same for both flag values.

In [None]:
sns.catplot(data=data,x='vendor_id',y='pickup_dropoff_distance');

In [18]:
data['store_and_fwd_flag'] = data['store_and_fwd_flag'].map(dict(N=0,Y=1))

In [19]:
data = pd.get_dummies(data ,columns= ['month','day'],drop_first=True)

In [20]:
data.drop(['id','pickup_datetime','trip_duration'],axis=1,inplace=True)

In [None]:
plt.figure(figsize=(17,10))
sns.heatmap(data.corr(),annot=True);

In [None]:
# ! pip install sweetviz

In [None]:
# import sweetviz as sv
# #You could specify which variable in your dataset is the target for your model creation. We can specify it using the target_feat parameter.
# my_report = sv.analyze(data, target_feat ='log_trip_duration')
# my_report.show_html()

In [21]:
scaler = MinMaxScaler()

In [22]:
data[['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude','day_no']]= scaler.fit_transform(data[['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude','day_no']])

In [23]:
X = data.drop('log_trip_duration',axis = 1)

In [24]:
y = data.loc[:,'log_trip_duration']

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [27]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1142957, 23), (285740, 23), (1142957,), (285740,))

In [None]:
data.columns

In [None]:
n=1
plt.figure(figsize=(15,15))
for i in data.columns:
  plt.subplot(5,5,n)
  sns.distplot(data[i])
  n=n+1
plt.tight_layout()


# **Linear regression**

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X_train,y_train)

In [None]:
lr.get_params()

In [None]:
lr.coef_

In [None]:
y_pred_train_lr= lr.predict(X_train)

In [None]:
y_pred_test_lr = lr.predict(X_test)

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

In [None]:
# for train data
lr_train_mse  = mean_squared_error((y_train), (y_pred_train_lr))
print("Train MSE :" , lr_train_mse)

lr_train_rmse = np.sqrt(lr_train_mse)

print("Train RMSE :" ,lr_train_rmse)

lr_train_r2 = r2_score((y_train), (y_pred_train_lr))
print("Train R2 :" ,lr_train_r2) 

lr_train_r2_ = 1-(1-r2_score((y_train), (y_pred_train_lr)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",lr_train_r2_)

In [None]:
lr_test_mse  = mean_squared_error((y_test), (y_pred_test_lr))
print("Test MSE :" , lr_test_mse)

lr_test_rmse = np.sqrt(lr_test_mse)

print("Test RMSE :" ,lr_test_rmse)

lr_test_r2 = r2_score((y_test), (y_pred_test_lr))
print("Test R2 :" ,lr_test_r2)

lr_test_r2_ = 1-(1-r2_score((y_test), (y_pred_test_lr)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",lr_test_r2_)
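
The same four metrics (MSE, RMSE, R2 and adjusted R2) are computed for each model in this notebook. A small helper can keep that calculation in one place; the sketch below uses NumPy only, `regression_metrics` is a hypothetical name, and the toy values are for illustration. Adjusted R2 follows the expression used in the cells above, 1 - (1 - R2)(n - 1)/(n - p - 1), with n rows and p features.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = float(np.sum((y_true - y_pred) ** 2))         # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": mse, "rmse": mse ** 0.5, "r2": r2, "adj_r2": adj_r2}

# Toy values, for illustration only
m = regression_metrics([2.0, 3.0, 4.0, 5.0, 6.0], [2.1, 2.9, 4.2, 4.8, 6.0], n_features=2)
print(m)
```

Calling it as `regression_metrics(y_test, y_pred_test_lr, X_test.shape[1])` would return the test metrics for the linear model in a single dictionary.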

In [None]:
c= [i for i in range(0, len(y_train))]
plt.plot(c, y_train, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_train_lr, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Train Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
#Actual vs Prediction

c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_test_lr, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_test_lr, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

# **Lasso Regression**

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

In [None]:
#Cross validation
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='r2', cv=5)
lasso_regressor.fit(X_train, y_train)

In [None]:
print('The best fit alpha value is found out to be :', lasso_regressor.best_params_)
print('The R2 score using the same alpha is :', lasso_regressor.best_score_)

In [None]:
y_pred_train_lasso = lasso_regressor.predict(X_train)
y_pred_test_lasso = lasso_regressor.predict(X_test)


In [None]:
# for train data
lasso_train_mse  = mean_squared_error((y_train), (y_pred_train_lasso))
print("Train MSE :" , lasso_train_mse)

lasso_train_rmse = np.sqrt(lasso_train_mse)

print("Train RMSE :" ,lasso_train_rmse)

lasso_train_r2 = r2_score((y_train), (y_pred_train_lasso))
print("Train R2 :" ,lasso_train_r2) 

lasso_train_r2_ = 1-(1-r2_score((y_train), (y_pred_train_lasso)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",lasso_train_r2_)

In [None]:
lasso_test_mse  = mean_squared_error((y_test), (y_pred_test_lasso))
print("Test MSE :" , lasso_test_mse)

lasso_test_rmse = np.sqrt(lasso_test_mse)

print("Test RMSE :" ,lasso_test_rmse)

lasso_test_r2 = r2_score((y_test), (y_pred_test_lasso))
print("Test R2 :" ,lasso_test_r2)

lasso_test_r2_ = 1-(1-r2_score((y_test), (y_pred_test_lasso)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",lasso_test_r2_)

In [None]:
c= [i for i in range(0, len(y_train))]
plt.plot(c, y_train, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_train_lasso, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Train Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_test_lasso, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_test_lasso, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

# **Ridge**

In [None]:
from sklearn.linear_model import Ridge
#Cross validation
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='r2', cv=5)
ridge_regressor.fit(X_train, y_train)

In [None]:
print('The best fit alpha value is found out to be :' ,ridge_regressor.best_params_)
print('The R2 score using the same alpha is :', ridge_regressor.best_score_)

In [None]:
ridge_regressor.best_estimator_

In [None]:
ridge_regressor.score(X_train, y_train)

In [None]:
y_pred_train_ridge = ridge_regressor.predict(X_train)
y_pred_test_ridge = ridge_regressor.predict(X_test)
# for train data
ridge_train_mse  = mean_squared_error((y_train), (y_pred_train_ridge))
print("Train MSE :" , ridge_train_mse)

ridge_train_rmse = np.sqrt(ridge_train_mse)

print("Train RMSE :" ,ridge_train_rmse)

ridge_train_r2 = r2_score((y_train), (y_pred_train_ridge))
print("Train R2 :" ,ridge_train_r2) 

ridge_train_r2_ = 1-(1-r2_score((y_train), (y_pred_train_ridge)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",ridge_train_r2_)

ridge_test_mse  = mean_squared_error((y_test), (y_pred_test_ridge))
print("Test MSE :" , ridge_test_mse)

ridge_test_rmse = np.sqrt(ridge_test_mse)

print("Test RMSE :" ,ridge_test_rmse)

ridge_test_r2 = r2_score((y_test), (y_pred_test_ridge))
print("Test R2 :" ,ridge_test_r2)

ridge_test_r2_ = 1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",ridge_test_r2_)

In [None]:
c= [i for i in range(0, len(y_train))]
plt.plot(c, y_train, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_train_ridge, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Train Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_test_ridge, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_test_ridge, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

# **DecisionTree**

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
# Maximum depth of trees
max_depth = [4,6,8,10]
 
# Minimum number of samples required to split a node
min_samples_split = [10,20,30]
 
# Minimum number of samples required at each leaf node
min_samples_leaf = [8,16,22]
 
# Hyperparameter Grid
param_dict_dt = {
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}
# best params 
best_dr = {'max_depth': [10], 'min_samples_leaf': [22], 'min_samples_split': [10]}
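
The full grid `param_dict_dt` defined above describes what an exhaustive search would explore, while `best_dr` pins a single combination. A quick way to see the cost difference is to count the candidates. This standard-library sketch assumes `cv=5`, the value passed to `GridSearchCV` in this notebook.

```python
from itertools import product

param_dict_dt = {
    "max_depth": [4, 6, 8, 10],
    "min_samples_split": [10, 20, 30],
    "min_samples_leaf": [8, 16, 22],
}
cv_folds = 5  # assumed to match cv=5 in the GridSearchCV call

# Every combination of the listed values is one candidate model
candidates = [dict(zip(param_dict_dt, values))
              for values in product(*param_dict_dt.values())]

# One fit per candidate per fold (the final refit on all data is not counted)
n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "fits")
```

With more than a million training rows, each of those fits is expensive, which is one reason to search once and then reuse the selected combination.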


In [None]:
dtree = DecisionTreeRegressor()
dtree_regr = GridSearchCV(dtree,best_dr, scoring='r2', cv=5)

In [None]:
dtree_regr.fit(X_train,y_train)

In [None]:
print('The best parameters are found to be :', dtree_regr.best_params_)
print('The cross-validated R2 score using these parameters is :', dtree_regr.best_score_)

In [None]:
y_pred_train_dt = dtree_regr.predict(X_train)
y_pred_test_dt = dtree_regr.predict(X_test)

In [None]:
# for train data
dt_train_mse  = mean_squared_error((y_train), (y_pred_train_dt))
print("Train MSE :" , dt_train_mse)

dt_train_rmse = np.sqrt(dt_train_mse)

print("Train RMSE :" ,dt_train_rmse)

dt_train_r2 = r2_score((y_train), (y_pred_train_dt))
print("Train R2 :" ,dt_train_r2) 

dt_train_r2_ = 1-(1-r2_score((y_train), (y_pred_train_dt)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",dt_train_r2_)

In [None]:
dt_test_mse  = mean_squared_error((y_test), (y_pred_test_dt))
print("Test MSE :" , dt_test_mse)

dt_test_rmse = np.sqrt(dt_test_mse)

print("Test RMSE :" ,dt_test_rmse)

dt_test_r2 = r2_score((y_test), (y_pred_test_dt))
print("Test R2 :" ,dt_test_r2)

dt_test_r2_ = 1-(1-r2_score((y_test), (y_pred_test_dt)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",dt_test_r2_)

In [None]:
c= [i for i in range(0, len(y_train))]
plt.plot(c, y_train, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_train_dt, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Train Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_test_dt, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_test_dt, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

# **xgboost**

In [None]:
n_estimators = [80,150,200]
 
# Maximum depth of trees
max_depth = [5,8,10]
min_samples_split = [40,50]
learning_rate=[0.2,0.4,0.6]
 
# Hyperparameter Grid
param_xgb = {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
             'min_samples_' : min_samples_split,
             'learning_rate' : learning_rate
             }
# Best parameters found after tuning
best = {'learning_rate': [0.2],
 'max_depth': [10],
 'min_samples_': [40],
 'n_estimators': [200]}

In [28]:
import xgboost as xgb

In [None]:
import xgboost as xgb
xgb_model = xgb.XGBRegressor(tree_method = 'gpu_hist',silent=1)

# Grid search
xgb_grid = GridSearchCV(estimator=xgb_model,
                        param_grid = best,
                        cv = 3, verbose=1,
                        scoring="r2")

xgb_grid.fit(X_train,y_train)

In [None]:
xgb_grid.score(X_train,y_train)

In [None]:
y_pred_train_xgb = xgb_grid.predict(X_train)
y_pred_test_xgb = xgb_grid.predict(X_test)

In [None]:
xgb_grid.score(X_test,y_test)

In [None]:
# for train data
xgb_train_mse  = mean_squared_error((y_train), (y_pred_train_xgb))
print("Train MSE :" , xgb_train_mse)

xgb_train_rmse = np.sqrt(xgb_train_mse)

print("Train RMSE :" ,xgb_train_rmse)

xgb_train_r2 = r2_score((y_train), (y_pred_train_xgb))
print("Train R2 :" ,xgb_train_r2) 

xgb_train_r2_ = 1-(1-r2_score((y_train), (y_pred_train_xgb)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",xgb_train_r2_)

In [None]:
xgb_test_mse  = mean_squared_error((y_test), (y_pred_test_xgb))
print("Test MSE :" , xgb_test_mse)

xgb_test_rmse = np.sqrt(xgb_test_mse)

print("Test RMSE :" ,xgb_test_rmse)

xgb_test_r2 = r2_score((y_test), (y_pred_test_xgb))
print("Test R2 :" ,xgb_test_r2)

xgb_test_r2_ = 1-(1-r2_score((y_test), (y_pred_test_xgb)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",xgb_test_r2_)

In [None]:
c= [i for i in range(0, len(y_train))]
plt.plot(c, y_train, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_train_xgb, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Train Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_test_xgb, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_test_xgb, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

In [None]:
sns.distplot(y_test-y_pred_test_xgb);

# **lgb**

In [None]:
import lightgbm as lgb
from lightgbm import LGBMRegressor

In [None]:
n_estimators = [80,150,200]
 
# Maximum depth of trees
max_depth = [5,8,10,50]
min_samples_split = [40,50,100]
learning_rate=[0.2,0.4,0.6]
 
# Hyperparameter Grid
param_lgb = {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
             'min_samples_' : min_samples_split,
             'learning_rate' : learning_rate
             }
# best parameters after evaluating
best_lgb= {'learning_rate': [0.4],
 'max_depth': [50],
 'min_samples_': [40],
 'n_estimators': [200]}

In [None]:
# lgb_grid.best_params_

In [None]:
model = LGBMRegressor()
lgb_grid = GridSearchCV(estimator=model,
                        param_grid = best_lgb,
                        cv = 3, verbose=1,
                        scoring="r2")



In [None]:
lgb_grid.fit(X_train,y_train)

In [None]:
y_pred_train_lgb = lgb_grid.predict(X_train)
y_pred_test_lgb = lgb_grid.predict(X_test)

In [None]:
lgb_grid.score(X_train,y_train)

In [None]:
lgb_grid.score(X_test,y_test)

In [None]:
# for train data
lgb_train_mse  = mean_squared_error((y_train), (y_pred_train_lgb))
print("Train MSE :" , lgb_train_mse)

lgb_train_rmse = np.sqrt(lgb_train_mse)

print("Train RMSE :" ,lgb_train_rmse)

lgb_train_r2 = r2_score((y_train), (y_pred_train_lgb))
print("Train R2 :" ,lgb_train_r2) 

lgb_train_r2_ = 1-(1-r2_score((y_train), (y_pred_train_lgb)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train adjusted R2 :" ,lgb_train_r2_)

In [None]:
lgb_test_mse  = mean_squared_error((y_test), (y_pred_test_lgb))
print("Test MSE :" , lgb_test_mse)

lgb_test_rmse = np.sqrt(lgb_test_mse)

print("Test RMSE :" ,lgb_test_rmse)

lgb_test_r2 = r2_score((y_test), (y_pred_test_lgb))
print("Test R2 :" ,lgb_test_r2)

lgb_test_r2_ = 1-(1-r2_score((y_test), (y_pred_test_lgb)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test adjusted R2 :" ,lgb_test_r2_)

In [None]:
c= [i for i in range(0, len(y_train))]
plt.plot(c, y_train, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_train_lgb, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Train Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test, color='blue', linewidth=2.5, linestyle='-')
plt.plot(c, y_pred_test_lgb, color='red', linewidth=2.5, linestyle='-')
plt.title('Actual vs Predicted for Test Data', fontsize=20)
plt.legend(["Actual", "Predicted"])
plt.show()

In [None]:
plt.figure(figsize= (10,5))
c= [i for i in range(0, len(y_test))]
plt.plot(c, y_test-y_pred_test_lgb, color='blue', linewidth=2.5, linestyle='-')
plt.title('Error Term', fontsize=20)
plt.show()

In [None]:
sns.distplot(y_test-y_pred_test_lgb);

* XGBoost gives the best R2 score on the training data.

In [33]:
xgb_model_1 = xgb.XGBRegressor(silent=1,learning_rate= 0.2,max_depth= 10,min_samples_= 40,n_estimators= 200)


In [34]:
xgb_model_1.fit(X_train,y_train)

XGBRegressor(learning_rate=0.2, max_depth=10, min_samples_=40, n_estimators=200,
             silent=1)

In [67]:
xgb_model_1.score(X_test,y_test)

0.8200019933179409
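
The score above is an R2 on log10(trip_duration), since the models are trained on `log_trip_duration`. To express errors in seconds, predictions can be mapped back with a power of 10. The sketch below assumes NumPy, a hypothetical helper `to_seconds`, and made-up prediction errors; only the three true durations come from the `data.head()` output.

```python
import numpy as np

def to_seconds(log10_duration):
    """Invert the log10 transform applied to trip_duration."""
    return np.power(10.0, np.asarray(log10_duration, dtype=float))

# True durations from data.head(), expressed on the log10 scale
y_true_log = np.log10(np.array([455.0, 663.0, 2124.0]))
# Hypothetical model output: the truth shifted by small errors in log10 units
y_pred_log = y_true_log + np.array([0.05, -0.10, 0.02])

y_true_s = to_seconds(y_true_log)
y_pred_s = to_seconds(y_pred_log)

# RMSE in seconds, which is easier to interpret than RMSE in log10 units
rmse_seconds = float(np.sqrt(np.mean((y_true_s - y_pred_s) ** 2)))
print(np.round(y_pred_s, 1), round(rmse_seconds, 1))
```

An error of 0.05 in log10 units corresponds to a multiplicative factor of about 1.12, so the same log-scale error costs more seconds on long trips than on short ones.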

In [32]:
! pip install eli5

Collecting eli5
  Downloading eli5-0.11.0-py2.py3-none-any.whl (106 kB)
Installing collected packages: eli5
Successfully installed eli5-0.11.0


In [33]:
import eli5 as eli

In [34]:
eli.explain_weights(xgb_model_1)

Weight,Feature
0.5272,pickup_dropoff_distance
0.0818,day_Sunday
0.0621,day_Saturday
0.0591,hour
0.0367,day_Monday
0.0363,dropoff_latitude
0.0206,dropoff_longitude
0.0203,month_January
0.0178,pickup_longitude
0.0177,pickup_latitude


In [36]:
! pip install shapash



In [37]:
from shapash.explainer.smart_explainer import SmartExplainer
xpl = SmartExplainer()

In [60]:
y_pred = pd.DataFrame(xgb_model_1.predict(X_test),columns=['pred'],index=X_test.index)

In [61]:
xpl.compile(
    x=X_test[0:30],
    model=xgb_model_1,
     # Optional: compile step can use inverse_transform method
    y_pred=y_pred[0:30], # Optional
    # Optional: see tutorial postprocessing
)

Backend: Shap TreeExplainer


In [62]:
app = xpl.run_app()

Dash is running on http://0.0.0.0:8050/



INFO:root:Your Shapash application run on http://b8fbd30a485a:8050/
INFO:root:Use the method .kill() to down your app.
INFO:shapash.webapp.smart_app:Dash is running on http://0.0.0.0:8050/



 * Serving Flask app "shapash.webapp.smart_app" (lazy loading)
 * Environment: production


In [64]:
xpl.plot.features_importance()

In [66]:
xpl.plot.contribution_plot("hour")

In [72]:
xpl.plot.local_plot(331927)