<a href="https://colab.research.google.com/github/HariTarz/Transport_Demand_Prediction/blob/main/Demand_Prediction_for_Public_Transport_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Traffic Jam: Predicting People's Movement into Nairobi </u></b>

## <b> Problem Description </b>

### This challenge asks you to build a model that predicts the number of seats that Mobiticket can expect to sell for each ride, i.e. for a specific route on a specific date and time. There are 14 routes in this dataset. All of the routes end in Nairobi and originate in towns to the North-West of Nairobi towards Lake Victoria.


### The towns from which these routes originate are:

* Awendo
* Homa Bay
* Kehancha
* Kendu Bay
* Keroka
* Keumbu
* Kijauri
* Kisii
* Mbita
* Migori
* Ndhiwa
* Nyachenge
* Oyugis
* Rodi
* Rongo
* Sirare
* Sori

### The routes from these 14 origins to the first stop in the outskirts of Nairobi takes approximately 8 to 9 hours from time of departure. From the first stop in the outskirts of Nairobi into the main bus terminal, where most passengers get off, in Central Business District, takes another 2 to 3 hours depending on traffic.

### The three stops that all these routes make in Nairobi (in order) are:

1. Kawangware: the first stop in the outskirts of Nairobi
2. Westlands
3. Afya Centre: the main bus terminal where most passengers disembark

### All of these points are mapped [here](https://www.google.com/maps/d/viewer?mid=1Ef2pFdP8keVHHid8bwju2raoRvjOGagN&ll=-0.8281897101491997%2C35.51706279999996&z=8).

### Passengers of these bus (or shuttle) rides are affected by Nairobi traffic not only during their ride into the city, but from there they must continue their journey to their final destination in Nairobi wherever that may be. Traffic can act as a deterrent for those who have the option to avoid buses that arrive in Nairobi during peak traffic hours. On the other hand, traffic may be an indication for people’s movement patterns, reflecting business hours, cultural events, political events, and holidays.

## <b> Data Description </b>

### <b>Nairobi Transport Data.csv (zipped)</b> is the dataset of tickets purchased from Mobiticket for the 14 routes from “up country” into Nairobi between 17 October 2017 and 20 April 2018. This dataset includes the variables: ride_id, seat_number, payment_method, payment_receipt, travel_date, travel_time, travel_from, travel_to, car_type, max_capacity.


### Uber Movement traffic data can be accessed [here](https://movement.uber.com). Data is available for Nairobi through June 2018. Uber Movement provided historic hourly travel time between any two points in Nairobi. Any tables that are extracted from the Uber Movement platform can be used in your model.

### Variables description:

* #### ride_id: unique ID of a vehicle on a specific route on a specific day and time.
* #### seat_number: seat assigned to ticket
* #### payment_method: method used by customer to purchase ticket from Mobiticket (cash or Mpesa)
* #### payment_receipt: unique id number for ticket purchased from Mobiticket
* #### travel_date: date of ride departure. (MM/DD/YYYY)
* #### travel_time: scheduled departure time of ride. Rides generally depart on time. (hh:mm)
* #### travel_from: town from which ride originated
* #### travel_to: destination of ride. All rides are to Nairobi.
* #### car_type: vehicle type (shuttle or bus)
* #### max_capacity: number of seats on the vehicle

## Importing required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline

## Mounting the drive and reading the dataset file

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/ALMABETTER - Python For Data Science/CAPSTONE PROJECTS/Public Transport/Demand-Prediction-for-public-transport-main/train_revised.csv'
df = pd.read_csv(path)

## Exploring the dataset

In [None]:
# Viewing the data for the first time
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
# Viewing the type 'object' columns description 
df.describe(include = 'object')

In [None]:
# Checking for null value containing row counts
df.isnull().sum()

<b>Finding the target variable</b>

Since we are not given the target variable so we need to find target variable first.

The idea to find the target variable 'Number  of Tickets' allocated is for each ticket of an individual bus on individual day will be created with the sae 'ride_id'

In [None]:
# Calculation of Target varible based on the ride id
tmp_no_tickect_df = df.groupby(['ride_id']).seat_number.count().rename('number_of_ticket').reset_index()
tmp_no_tickect_df.head()

In [None]:
# droping the duplicate value rows with the repeating ride_id
df.drop_duplicates('ride_id', inplace = True)

# Dropping the columns which are not relevant to our target variable
df.drop(['seat_number','payment_method','payment_receipt', 'travel_to'], inplace= True, axis = 1)
df.shape

In [None]:
# Merging the calculated target variable column to the dataset based on the ride_id
df = df.merge(tmp_no_tickect_df, how= 'left', on='ride_id')

In [None]:
# Combaining the date column and time column to get the complete timestamp
df['travel_date_and_time'] = df['travel_date'] + " " + df['travel_time']
df['travel_date_and_time'] = pd.to_datetime(df['travel_date_and_time'])
df.drop(['travel_date', 'travel_time'], inplace= True, axis= 1)

In [None]:
df.head()

## Exploratory Data Analysis

In [None]:
# Plotting the target variable 'number_of_ticket'
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.histplot(x='number_of_ticket', data=df, color='#ad1759')
sns.set_theme(style='darkgrid')
plt.xticks(rotation=90)
# ax.set_xlabel('travel_from')
# ax.set_ylabel('Frequency')
ax.set_title('Disbribution of number_of_ticket')

In [None]:
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.barplot(x="travel_from", y="number_of_ticket", data=df, palette= 'rocket')
sns.set_theme(style='darkgrid')
plt.xticks(rotation=90)
# ax.set_xlabel('travel_from')
# ax.set_ylabel('Frequency')
ax.set_title('travel_from counts')

In [None]:
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.histplot(x='travel_from', data=df, hue='travel_from', palette='rocket')
sns.set_theme(style='darkgrid')
plt.xticks(rotation=90)
# ax.set_xlabel('travel_from')
# ax.set_ylabel('Frequency')
ax.set_title('travel_from counts')

In [None]:
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.countplot(x='car_type', data=df, hue='car_type', palette='rocket')
sns.set_theme(style='darkgrid')
ax.set_xlabel('car_type')
# ax.set_ylabel('Frequency')
ax.set_title('car_type counts')

In [None]:
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.countplot(x='max_capacity', data=df, hue='max_capacity', palette='rocket')
sns.set_theme(style='darkgrid')
ax.set_xlabel('max_capacity')
# ax.set_ylabel('Frequency')
ax.set_title('max_capacity counts')

## Feature Engineering

In [None]:
# Coping the dataset to a new variable
trans_df = df.copy()

In [None]:
# Extracting the date and time column to get time based informations
 
trans_df['travel_year']= trans_df['travel_date_and_time'].dt.year
trans_df['travel_month']= trans_df['travel_date_and_time'].dt.month
trans_df['travel_year_quarter']= trans_df['travel_date_and_time'].dt.quarter
trans_df['travel_day_of_year']= trans_df['travel_date_and_time'].dt.dayofyear
trans_df['travel_day_of_month']= trans_df['travel_date_and_time'].dt.day
trans_df['travel_day_of_week']= trans_df['travel_date_and_time'].dt.dayofweek
trans_df['travel_is_in_weekend']= trans_df['travel_day_of_week'].apply(lambda d: 1 if d in [5,6] else 0)
trans_df['travel_hour']= trans_df['travel_date_and_time'].dt.hour
# trans_df['travel_minute']= trans_df['travel_date_and_time'].dt.minute

In [None]:
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.barplot(x="travel_month", y="number_of_ticket", data=trans_df, palette='rocket')
sns.set_theme(style='darkgrid')
ax.set_title('month based travel counts')

From the above plot ticket booking are happening in all the 12 months of a year.

In [None]:
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.scatterplot(x="travel_day_of_month", y="number_of_ticket", data=trans_df, color='#ad1759')
sns.set_theme(style='darkgrid')
ax.set_title('dates of a month based travel counts')

We can see that there is the gap between 5 to 11 in the day of the month. We can assume that there is official holyday of public transport between these days. we can also say that the number of tickets in all the days of month are same.

In [None]:
# days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.barplot(x="travel_day_of_week", y="number_of_ticket", data=trans_df, palette= 'rocket')
plt.legend( loc='upper right')
sns.set_theme(style='darkgrid')
ax.set_title('days of a week based travel counts')

From the above plot ticket booking are happening in all the 7 days of a week.

In [None]:
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.scatterplot(x="travel_hour", y="number_of_ticket", data=trans_df, color='#ad1759')
sns.set_theme(style='darkgrid')
ax.set_title('hour of a day based travel counts')

We can see that most of the tickets were sold at 7 AM and 8 PM. And that seems true because in the morning most of the people go to the work and office.

From the above we can say that there is not ride between 12 PM to 5.30 PM

In [None]:
def time_to_period(h):
  '''This function can take hours as input and return the time period of a day as output'''
  if h >= 7 and h < 11:
    return 'morning'
  elif h >= 11 and h < 15:
    return 'after_noon'
  elif h >= 15 and h < 19:
    return 'evening'
  elif h >= 19 and h <= 24:
    return 'night'
  else:
    return 'early_morning'

In [None]:
# Calculation of time period based on the travel_date_and_time feature
trans_df['travel_time_period'] = trans_df.travel_hour.apply(time_to_period)

In [None]:
# Creating a seperate column for giving hour wise weights for the hours column
travel_time_period_counts = dict(trans_df.travel_time_period.value_counts())
trans_df['travel_hour_wise_weights'] = np.log1p(trans_df.travel_time_period.map(travel_time_period_counts))

In [None]:
# Creating a seperate column for giving day of a year wise weights for the hours column
travel_day_of_year_counts = dict(trans_df.travel_day_of_year.value_counts())
trans_df['travel_day_of_year_wise_weights'] = np.log1p(trans_df.travel_day_of_year.map(travel_day_of_year_counts))

In [None]:
# Counts of tickets in booked in each dates of a month
trans_df.travel_day_of_month.value_counts()

In [None]:
# Giving weights to the each days of the month based on the frequency of ticket bookings
travel_day_of_month_wise_weights_dict = {2:1, 12:1, 3:1, 4:2, 1:3, 13:3, 14:3, 16:3, 28:3, 19:3, 18:3, 15:3, 17:3, 20:3, 22:4, 21:4, 27:4, 29:4, 23:4, 24:4, 26:4, 30:4, 25:4, 31:4}
trans_df['travel_day_of_month_wise_weights'] = trans_df.travel_day_of_month.replace(travel_day_of_month_wise_weights_dict)

In [None]:
# Counts of tickets in booked in each months of year
trans_df.travel_month.value_counts()

In [None]:
# Creating a column for giving weights to the each months of a year based on the frequency of ticket bookings
travel_month_wise_weights_dict = {12: 1,
 2: 1,
 1: 1,
 3: 1,
 4: 1,
 11: 2,
 9: 3,
 7: 3,
 8: 3,
 10: 3,
 6: 3,
 5: 3}
trans_df['travel_month_wise_weights'] = trans_df.travel_month.replace(travel_month_wise_weights_dict)

In [None]:
trans_df.head()

In [None]:
# tmp_df = trans_df.copy()

In [None]:
def calculate_next_and_previous_timings(tmp_df):
  tmp_df.sort_values(['travel_from', 'travel_date_and_time'], inplace= True)
  tmp_df['delay_btw_initial_to_next_and_previous_bus'] = (tmp_df.groupby(['travel_from']).travel_date_and_time.shift(-1) - tmp_df.groupby(['travel_from']).travel_date_and_time.shift(1)).dt.total_seconds()/3600
  tmp_df['delay_btw_1bus_and_next_bus'] = (tmp_df.travel_date_and_time - tmp_df.groupby(['travel_from']).travel_date_and_time.shift(-1)).dt.total_seconds()/3600
  tmp_df['delay_btw_1bus_and_previous_bus'] = (tmp_df.travel_date_and_time - tmp_df.groupby(['travel_from']).travel_date_and_time.shift(1)).dt.total_seconds()/3600
  tmp_df['delay_btw_2bus_and_next_bus'] = (tmp_df.travel_date_and_time - tmp_df.groupby(['travel_from']).travel_date_and_time.shift(-2)).dt.total_seconds()/3600
  tmp_df['delay_btw_2bus_and_previous_bus'] = (tmp_df.travel_date_and_time - tmp_df.groupby(['travel_from']).travel_date_and_time.shift(2)).dt.total_seconds()/3600
  tmp_df['delay_btw_3bus_and_next_bus'] = (tmp_df.travel_date_and_time - tmp_df.groupby(['travel_from']).travel_date_and_time.shift(-3)).dt.total_seconds()/3600
  tmp_df['delay_btw_3bus_and_previous_bus'] = (tmp_df.travel_date_and_time - tmp_df.groupby(['travel_from']).travel_date_and_time.shift(3)).dt.total_seconds()/3600
  new_col = ['delay_btw_initial_to_next_and_previous_bus', 'delay_btw_1bus_and_next_bus', 'delay_btw_1bus_and_previous_bus', 'delay_btw_2bus_and_next_bus', 'delay_btw_2bus_and_previous_bus', 'delay_btw_3bus_and_next_bus','delay_btw_3bus_and_previous_bus']
  tmp_df[new_col] = tmp_df.groupby(['travel_from'])[new_col].fillna(method = 'ffill')
  tmp_df[new_col] = tmp_df.groupby(['travel_from'])[new_col].fillna(method = 'backfill')
  
  return tmp_df

In [None]:
trans_df = calculate_next_and_previous_timings(trans_df)

In [None]:
trans_df.isnull().sum()

In [None]:
trans_df.dropna(inplace= True)
trans_df.shape

In [None]:
trans_df.head()

In [None]:
travel_distance_to_Nairobi_dict = {'Awendo':352, 'Homa Bay':368, 'Kehancha':308, 'Kendu Bay':343, 'Keroka':281, 'Keumbu':295,
                                   'Kijauri':272, 'Kisii':306, 'Mbita':406, 'Migori':373, 'Ndhiwa':385, 'Nyachenge':313, 'Oyugis':324, 
                                   'Rodi':348, 'Rongo':333, 'Sirare':415, 'Sori':407}
trans_df['travel_distance_to_Nairobi'] = trans_df.travel_from.map(travel_distance_to_Nairobi_dict)

In [None]:
sns.histplot(x="travel_distance_to_Nairobi", data=trans_df, color='#ad1759', kde=True)

In [None]:
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.barplot(x='travel_from' ,y="travel_distance_to_Nairobi", data=trans_df, palette= 'rocket')
sns.set_theme(style='darkgrid')
ax.set_title('Travel distance to Nairobi')
plt.xticks(rotation=90)

In [None]:
travel_time_to_Nairobi_dict = {'Awendo': 6*60+24, 'Homa Bay': 6*60+29, 'Kehancha': 6*60+11, 'Kendu Bay': 6*60, 'Keroka': 4*60+55, 'Keumbu': 5*60+13, 'Kijauri': 4*60+44, 
 'Kisii': 5*60+29, 'Mbita': 7*60+8, 'Migori': 6*60+54, 'Ndhiwa': 6*60+47, 'Nyachenge': 5*60+40, 'Oyugis': 5*60+42, 'Rodi': 6*60+40, 'Rongo': 6*60+5, 'Sirare': 8*60+4, 'Sori': 7*60+11}
trans_df['travel_time_to_Nairobi'] = trans_df.travel_from.map(travel_time_to_Nairobi_dict)

In [None]:
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.histplot(x="travel_time_to_Nairobi", data=trans_df, color='#ad1759', kde=True)
sns.set_theme(style='darkgrid')
ax.set_title('Distribution of travel time to Nairobi')

In [None]:
trans_df['travel_speed_to_Nairobi'] = trans_df.travel_distance_to_Nairobi / trans_df.travel_time_to_Nairobi

In [None]:
fig = plt.figure(figsize=(10,7))
ax = fig.gca()
sns.histplot(x="travel_speed_to_Nairobi", data=trans_df, color='#ad1759', kde=True)
sns.set_theme(style='darkgrid')
ax.set_title('Distribution of travel speed to Nairobi')

In [None]:
from datetime import timedelta

for key in travel_time_to_Nairobi_dict.keys(): 
    travel_time_to_Nairobi_dict[key]=timedelta( minutes=travel_time_to_Nairobi_dict[key])
travel_time_to_Nairobi_dict

In [None]:
trans_df['travel_arrival_data_and_time'] = trans_df.travel_from.map(travel_time_to_Nairobi_dict)
trans_df['travel_arrival_data_and_time'] = trans_df.travel_date_and_time + trans_df['travel_arrival_data_and_time']
trans_df['travel_arrival_hour'] = trans_df.travel_arrival_data_and_time.dt.hour
trans_df['travel_arrival_minute'] = trans_df.travel_arrival_data_and_time.dt.minute
trans_df["travel_is_in_rush_hour"]= trans_df.travel_arrival_hour.apply(lambda h: 1 if (h>=7) and (h<= 17) else 0)

## Handling Categorical Features

In [None]:
trans_df.columns

In [None]:
cat_features = ['travel_from','travel_day_of_month_wise_weights','travel_month_wise_weights']
trans_df = pd.get_dummies(trans_df, columns=cat_features)
label_enc = {'Bus':1, 'shuttle':0}
trans_df.car_type.replace(label_enc, inplace= True)

In [None]:
trans_df.head()

In [None]:
trans_df.columns

In [None]:
req_columns = ['car_type', 'travel_day_of_week','travel_day_of_year', 'travel_is_in_weekend', 'travel_hour', 'travel_year', 'travel_year_quarter',
       'travel_hour_wise_weights', 'travel_day_of_year_wise_weights',
       'delay_btw_initial_to_next_and_previous_bus',
       'delay_btw_1bus_and_next_bus', 'delay_btw_1bus_and_previous_bus',
       'delay_btw_2bus_and_next_bus', 'delay_btw_2bus_and_previous_bus',
       'delay_btw_3bus_and_next_bus', 'delay_btw_3bus_and_previous_bus',
       'travel_distance_to_Nairobi', 'travel_time_to_Nairobi', 'travel_speed_to_Nairobi', 'travel_arrival_hour', 'travel_is_in_rush_hour', 
       'travel_from_Awendo', 'travel_from_Homa Bay', 'travel_from_Kehancha', 
       'travel_from_Keroka', 'travel_from_Keumbu', 'travel_from_Kijauri', 
       'travel_from_Kisii', 'travel_from_Mbita', 'travel_from_Migori', 
       'travel_from_Ndhiwa', 'travel_from_Nyachenge', 'travel_from_Rodi', 
       'travel_from_Rongo', 'travel_from_Sirare', 'travel_from_Sori',
       'travel_day_of_month_wise_weights_1',
       'travel_day_of_month_wise_weights_2',
       'travel_day_of_month_wise_weights_3',
       'travel_day_of_month_wise_weights_4',
       'travel_month_wise_weights_1', 'travel_month_wise_weights_2', 'travel_month_wise_weights_3',
       'number_of_ticket']
len(req_columns)

In [None]:
transport_df = trans_df[req_columns]
transport_df.head()

In [None]:
transport_df_corr = transport_df.corr()
fig = plt.figure(figsize=(35,32))
ax = fig.gca()
sns.heatmap(abs(transport_df_corr), annot=True, cmap='rocket')
plt.title('Public Transport dataset correlation table')

## Training models

In [None]:
# Importing the required sklearn packages
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error,mean_absolute_percentage_error, r2_score

In [None]:
def adjusted_r2(x, r2):
  '''This function will take X variables' dataset and r^2 value as inputs and can return the adjusted r^2 as output'''
  n = len(x)
  p = len(x.columns)
  adj_r2 = 1-((1-r2)*(n-1)/(n-p-1))
  return adj_r2

In [None]:
# Seperating dependent and independent variables of the dataset
X = transport_df.drop(['number_of_ticket'], axis= 1).copy()
y = transport_df['number_of_ticket'].copy()

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

In [None]:
# Splitting the dataset for Training and Testing models
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= 30)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}\n")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_test: {y_test.shape}")

## Random Forest Regressor



In [None]:
# Importing the RandomForest packages
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Training the simple gradient Boosting model
randf_reg = RandomForestRegressor(criterion='squared_error', max_leaf_nodes=10, random_state=30)
randf_reg.fit(X_train, y_train)

In [None]:
# Predicting the values for tarining data
y_train_rf_reg = randf_reg.predict(X_train)

In [None]:
# Sample of predicted values of training data
y_train_rf_reg[:10]

In [None]:
# Actual values of test data
y_test[:10]

In [None]:
# Predicting the values for test data
y_test_rf_reg = randf_reg.predict(X_test)

In [None]:
# Sample of predicted values of test data
y_test_rf_reg[:10]

In [None]:
print("Train data Reg Score :",randf_reg.score(X_train,y_train))
print("Test data Reg Score :",randf_reg.score(X_test,y_test))

In [None]:
# Evaluation metrics for training data
MSE_train_rf_reg  = mean_squared_error(y_train, y_train_rf_reg)
print("MSE for Train data :" , MSE_train_rf_reg)

RMSE_train_rf_reg = np.sqrt(MSE_train_rf_reg)
print("RMSE for Train data:" ,RMSE_train_rf_reg)

MAE_train_rf_reg = mean_absolute_error(y_train, y_train_rf_reg)
print("MAE for Train data:" ,MAE_train_rf_reg)

MAPE_train_rf_reg = mean_absolute_percentage_error(y_train, y_train_rf_reg)
print("MAPE for Train data:" ,MAPE_train_rf_reg)

r2_score_train_rf_reg = r2_score(y_train, y_train_rf_reg)
print("R2 for Train data:" ,r2_score_train_rf_reg)
print("Adjusted R2 for Train data: " ,adjusted_r2(X_train, r2_score_train_rf_reg))

In [None]:
# Evaluation metrics for test data
MSE_test_rf_reg  = mean_squared_error(y_test, y_test_rf_reg)
print("MSE for Test data :" , MSE_test_rf_reg)

RMSE_test_rf_reg = np.sqrt(MSE_test_rf_reg)
print("RMSE for Test data:" ,RMSE_test_rf_reg)

MAE_test_rf_reg = mean_absolute_error(y_test, y_test_rf_reg)
print("MAE for Test data:" ,MAE_test_rf_reg)

MAPE_test_rf_reg = mean_absolute_percentage_error(y_test, y_test_rf_reg)
print("MAPE for Test data:" ,MAPE_test_rf_reg)

r2_score_test_rf_reg = r2_score(y_test, y_test_rf_reg)
print("R2 for Test data:" ,r2_score_test_rf_reg)
print("Adjusted R2 for Test data: " ,adjusted_r2(X_test, r2_score_test_rf_reg))

In [None]:
# Accuracy of the model for training data
train_accuracy_rf_reg = cross_val_score(randf_reg, X_train,y_train, cv=5 )
print(f"Train_data_accuracy: {train_accuracy_rf_reg}")

In [None]:
# Accuracy of the model for test data
test_accuracy_rf_reg = cross_val_score(randf_reg, X_test,y_test, cv=5 )
print(f"Test_data_accuracy: {test_accuracy_rf_reg}")

## Grid Search Cross Validation on Random Forest Regressor

In [None]:
rfr = RandomForestRegressor()
from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rfr.get_params())

In [None]:
# Importing the GridSearch Cross Valiation Packages
from sklearn.model_selection import GridSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 400, stop = 1000, num = 4)]

# Number of features to consider at every split
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(40, 100, num = 4)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [5, 10, 12]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the parameters grid
grid_params_dict = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(grid_params_dict)

In [None]:
# First create the base model to tune
rfr = RandomForestRegressor()

# Grid Search of parameters, using 3 fold cross validation, 
rf_gridCV = GridSearchCV(estimator = rfr, param_grid = grid_params_dict, cv = 3, verbose=2, n_jobs = -1)

# Fit the random search model
rf_gridCV.fit(X, y)

In [None]:
# Viewing the best parameters of the optimal model
rf_gridCV.best_params_

In [None]:
rf_gridCV.best_estimator_

In [None]:
# Predicting the values of training data using the calculated optimal model
rf_grid_optimal_model =rf_gridCV.best_estimator_
y_train_pred_gridCV = rf_grid_optimal_model.predict(X_train)

In [None]:
# Evaluation metrics for training data
MSE_train_rf_reg_best_gridCV  = mean_squared_error(y_train, y_train_pred_gridCV)
print("MSE for Train data :" , MSE_train_rf_reg_best_gridCV)

RMSE_train_rf_reg_best_gridCV = np.sqrt(MSE_train_rf_reg_best_gridCV)
print("RMSE for Train data:" ,RMSE_train_rf_reg_best_gridCV)

MAE_train_rf_reg_best_gridCV = mean_absolute_error(y_train, y_train_pred_gridCV)
print("MAE for Train data:" ,MAE_train_rf_reg_best_gridCV)

MAPE_train_rf_reg_best_gridCV = mean_absolute_percentage_error(y_train, y_train_pred_gridCV)
print("MAPE for Train data:" ,MAPE_train_rf_reg_best_gridCV)

r2_score_train_rf_reg_gridCV = r2_score(y_train, y_train_pred_gridCV)
print("R2 for Train data:" ,r2_score_train_rf_reg_gridCV)
print("Adjusted R2 for Train data: " ,adjusted_r2(X_train, r2_score_train_rf_reg_gridCV))

In [None]:
# Predicting the target values of test data using calculated best model
y_test_pred_gridCV = rf_grid_optimal_model.predict(X_test)

In [None]:
# Evaluation metrics for test data
MSE_test_rf_reg_best_gridCV  = mean_squared_error(y_test, y_test_pred_gridCV)
print("MSE for Test data :" , MSE_test_rf_reg_best_gridCV)

RMSE_test_rf_reg_best_gridCV = np.sqrt(MSE_test_rf_reg_best_gridCV)
print("RMSE for Test data:" ,RMSE_test_rf_reg_best_gridCV)

MAE_test_rf_reg_best_gridCV = mean_absolute_error(y_test, y_test_pred_gridCV)
print("MAE for Test data:" ,MAE_test_rf_reg_best_gridCV)

MAPE_test_rf_reg_best_gridCV = mean_absolute_percentage_error(y_test, y_test_pred_gridCV)
print("MAPE for Test data:" ,MAPE_test_rf_reg_best_gridCV)

r2_score_test_rf_reg_best_gridCV = r2_score(y_test, y_test_pred_gridCV)
print("R2 for Test data:" ,r2_score_test_rf_reg_best_gridCV)
print("Adjusted R2 for Test data: " ,adjusted_r2(X_test, r2_score_test_rf_reg_best_gridCV))

In [None]:
# Accuracy of the model for training data
train_accuracy_rf_reg_best_gridCV = cross_val_score(rf_grid_optimal_model, X_train,y_train, cv=5 )
print(f"Train_data_accuracy: {train_accuracy_rf_reg_best_gridCV}")

In [None]:
# Accuracy of the model for test data
test_accuracy_rf_reg_best_gridCV = cross_val_score(rf_grid_optimal_model, X_test,y_test, cv=5 )
print(f"Test_data_accuracy: {test_accuracy_rf_reg_best_gridCV}")

## Gradient Boosting Regressor

In [None]:
# Importing the GradientBoosting algorithm
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
# Training the simple gradient Boosting model
gb_reg = GradientBoostingRegressor(random_state= 30)
gb_reg.fit(X_train, y_train)

In [None]:
# Predicting the values for tarining data
y_train_gb_reg = gb_reg.predict(X_train)

In [None]:
# Sample of predicted values of training data
y_train_gb_reg[:10]

In [None]:
# Actual values of training data
y_train[:10]

In [None]:
# Predicting the values for test data
y_test_gb_reg = gb_reg.predict(X_test)

In [None]:
# Sample of predicted values of test data
y_test_gb_reg[:10]

In [None]:
# Actual values of test data
y_test[:10]

In [None]:
print("Train data Reg Score :",gb_reg.score(X_train,y_train))
print("Test data Reg Score :",gb_reg.score(X_test,y_test))

In [None]:
# Evaluation metrics for training data
MSE_train_gb_reg  = mean_squared_error(y_train, y_train_gb_reg)
print("MSE for Train data :" , MSE_train_gb_reg)

RMSE_train_gb_reg = np.sqrt(MSE_train_gb_reg)
print("RMSE for Train data:" ,RMSE_train_gb_reg)

MAE_train_gb_reg = mean_absolute_error(y_train, y_train_gb_reg)
print("MAE for Train data:" ,MAE_train_gb_reg)

MAPE_train_gb_reg = mean_absolute_percentage_error(y_train, y_train_gb_reg)
print("MAPE for Train data:" ,MAPE_train_gb_reg)

r2_score_train_gb_reg = r2_score(y_train, y_train_gb_reg)
print("R2 for Train data:" ,r2_score_train_gb_reg)
print("Adjusted R2 for Train data: " ,adjusted_r2(X_train, r2_score_train_gb_reg))

In [None]:
# Evaluation metrics for test data
MSE_test_gb_reg  = mean_squared_error(y_test, y_test_gb_reg)
print("MSE for Test data :" , MSE_test_gb_reg)

RMSE_test_gb_reg = np.sqrt(MSE_test_gb_reg)
print("RMSE for Test data:" ,RMSE_test_gb_reg)

MAE_test_gb_reg = mean_absolute_error(y_test, y_test_gb_reg)
print("MAE for Test data:" ,MAE_test_gb_reg)

MAPE_test_gb_reg = mean_absolute_percentage_error(y_test, y_test_gb_reg)
print("MAPE for Test data:" ,MAPE_test_gb_reg)

r2_score_test_gb_reg = r2_score(y_test, y_test_gb_reg)
print("R2 for Test data:" ,r2_score_test_gb_reg)
print("Adjusted R2 for Test data: " ,adjusted_r2(X_test, r2_score_test_gb_reg))

In [None]:
# Accuracy of the model for training data
train_accuracy_gb_reg = cross_val_score(gb_reg, X_train,y_train, cv=5 )
print(f"Train_data_accuracy: {train_accuracy_gb_reg}")

In [None]:
# Accuracy of the model for test data
test_accuracy_gb_reg = cross_val_score(gb_reg, X_test,y_test, cv=5 )
print(f"Test_data_accuracy: {test_accuracy_gb_reg}")

## Grid Search Cross Validation on Gradient Boosting Regressor

In [None]:
gbr = GradientBoostingRegressor()
from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(gbr.get_params())

In [None]:
from sklearn.model_selection import GridSearchCV

#Create the parameters grid
#Magnitude of this change in the estimates
learning_rate=  [0.01, 0.05, 0.1, 1, 5]

# Maximum number of levels in tree
max_depth= [4, 6, 8, 10]

# Number of trees in random forest
n_estimators= [20, 30, 50, 60, 70]

#Fraction of observations to be selected for each tree
subsample= [0.1, 0.3, 0.5, 0.6, 0.7, 0.9, 1]

gb_grid_params_dict = {'learning_rate': learning_rate,
         'max_depth': max_depth,
         'n_estimators': n_estimators,
         'subsample': subsample}
pprint(gb_grid_params_dict)

In [None]:
# First create the base model to tune
gbr = GradientBoostingRegressor(criterion = 'friedman_mse')

# Grid Search of parameters, using 3 fold cross validation, 
gbr_grid = GridSearchCV(estimator = gbr, param_grid = gb_grid_params_dict, cv = 3, verbose=2, n_jobs = -1)

# Fit the GridSearch model
gbr_grid.fit(X, y)

In [None]:
# Viewing the best model parameters
gbr_grid.best_params_

In [None]:
gbr_grid.best_estimator_

In [None]:
# Predicting the target values of training data using calculated best model
gbr_optimal_model =gbr_grid.best_estimator_
y_train_pred_gbr_gridCV = gbr_optimal_model.predict(X_train)

In [None]:
# Evaluation metrics for training data
MSE_train_gbr_best_gridCV  = mean_squared_error(y_train, y_train_pred_gbr_gridCV)
print("MSE for Train data :" , MSE_train_gbr_best_gridCV)

RMSE_train_gbr_best_gridCV = np.sqrt(MSE_train_gbr_best_gridCV)
print("RMSE for Train data:" ,RMSE_train_gbr_best_gridCV)

MAE_train_gbr_best_gridCV = mean_absolute_error(y_train, y_train_pred_gbr_gridCV)
print("MAE for Train data:" ,MAE_train_gbr_best_gridCV)

MAPE_train_gbr_best_gridCV = mean_absolute_percentage_error(y_train, y_train_pred_gbr_gridCV)
print("MAPE for Train data:" ,MAPE_train_gbr_best_gridCV)

r2_score_train_gbr_gridCV = r2_score(y_train, y_train_pred_gbr_gridCV)
print("R2 for Train data:" ,r2_score_train_gbr_gridCV)
print("Adjusted R2 for Train data: " ,adjusted_r2(X_train, r2_score_train_gbr_gridCV))

In [None]:
# Predicting the target values of test data using calculated best model
y_test_pred_gbr_gridCV = gbr_optimal_model.predict(X_test)

In [None]:
# Evaluation metrics for test data
MSE_test_gbr_best_gridCV  = mean_squared_error(y_test, y_test_pred_gbr_gridCV)
print("MSE for Test data :" , MSE_test_gbr_best_gridCV)

RMSE_test_gbr_best_gridCV = np.sqrt(MSE_test_gbr_best_gridCV)
print("RMSE for Test data:" ,RMSE_test_gbr_best_gridCV)

MAE_test_gbr_best_gridCV = mean_absolute_error(y_test, y_test_pred_gbr_gridCV)
print("MAE for Test data:" ,MAE_test_gbr_best_gridCV)

MAPE_test_gbr_best_gridCV = mean_absolute_percentage_error(y_test, y_test_pred_gbr_gridCV)
print("MAPE for Test data:" ,MAPE_test_gbr_best_gridCV)

r2_score_test_gbr_best_gridCV = r2_score(y_test, y_test_pred_gbr_gridCV)
print("R2 for Test data:" ,r2_score_test_gbr_best_gridCV)
print("Adjusted R2 for Test data: " ,adjusted_r2(X_test, r2_score_test_gbr_best_gridCV))

In [None]:
# Accuracy of the model for training data
train_accuracy_gbr_best_gridCV = cross_val_score(gbr_optimal_model, X_train,y_train, cv=5 )
print(f"Train_data_accuracy: {train_accuracy_gbr_best_gridCV}")

In [None]:
# Accuracy of the model for test data
test_accuracy_gbr_best_gridCV = cross_val_score(gbr_optimal_model, X_test,y_test, cv=5 )
print(f"Test_data_accuracy: {test_accuracy_gbr_best_gridCV}")

## XGBoost Regressior

In [None]:
# Importing Extreme Gradient Boosting
import xgboost as xgb

In [None]:
#Training basic XGBoost model
xgb_reg = xgb.XGBRegressor(objective='reg:squarederror')
xgb_reg.fit(X_train, y_train)

In [None]:
# Predicting the values for training data
y_train_xgb_reg = xgb_reg.predict(X_train)

In [None]:
# Sample of predicted values for tarining data
y_train_xgb_reg[:10]

In [None]:
# Actual values of training data
y_train[:10]

In [None]:
# Predicting the values for test data
y_test_xgb_reg = xgb_reg.predict(X_test)

In [None]:
# Sample of predicted values for test data
y_test_xgb_reg[:10]

In [None]:
# Actual values of test data
y_test[:10]

In [None]:
print("Train data Reg Score :",xgb_reg.score(X_train,y_train))
print("Test data Reg Score :",xgb_reg.score(X_test,y_test))

In [None]:
# Evaluation metrics for training data
MSE_train_xgb_reg  = mean_squared_error(y_train, y_train_xgb_reg)
print("MSE for Train data :" , MSE_train_xgb_reg)

RMSE_train_xgb_reg = np.sqrt(MSE_train_xgb_reg)
print("RMSE for Train data:" ,RMSE_train_xgb_reg)

MAE_train_xgb_reg = mean_absolute_error(y_train, y_train_xgb_reg)
print("MAE for Train data:" ,MAE_train_xgb_reg)

MAPE_train_xgb_reg = mean_absolute_percentage_error(y_train, y_train_xgb_reg)
print("MAPE for Train data:" ,MAPE_train_xgb_reg)

r2_score_train_xgb_reg = r2_score(y_train, y_train_xgb_reg)
print("R2 for Train data:" ,r2_score_train_xgb_reg)
print("Adjusted R2 for Train data: " ,adjusted_r2(X_train, r2_score_train_xgb_reg))

In [None]:
# Evaluation metrics for test data
MSE_test_xgb_reg  = mean_squared_error(y_test, y_test_xgb_reg)
print("MSE for Test data :" , MSE_test_xgb_reg)

RMSE_test_xgb_reg = np.sqrt(MSE_test_xgb_reg)
print("RMSE for Test data:" ,RMSE_test_xgb_reg)

MAE_test_xgb_reg = mean_absolute_error(y_test, y_test_xgb_reg)
print("MAE for Test data:" ,MAE_test_xgb_reg)

MAPE_test_xgb_reg = mean_absolute_percentage_error(y_test, y_test_xgb_reg)
print("MAPE for Test data:" ,MAPE_test_xgb_reg)

r2_score_test_xgb_reg = r2_score(y_test, y_test_xgb_reg)
print("R2 for Test data:" ,r2_score_test_xgb_reg)
print("Adjusted R2 for Test data: " ,adjusted_r2(X_test, r2_score_test_xgb_reg))

In [None]:
# Accuarcy of the model for training data
train_accuracy_xgb_reg = cross_val_score(xgb_reg, X_train,y_train, cv=5 )
print(f"Train_data_accuracy: {train_accuracy_xgb_reg}")

In [None]:
# Accuarcy of the model for test data
test_accuracy_xgb_reg = cross_val_score(xgb_reg, X_test,y_test, cv=5 )
print(f"Test_data_accuracy: {test_accuracy_xgb_reg}")

## GridSearch Cross Validation on XGBoost Regressior

In [None]:
xgbr = xgb.XGBRegressor()
from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(xgbr.get_params())

In [None]:
from sklearn.model_selection import GridSearchCV

# Fraction of columns to be randomly samples for each tree
colsample_bytree= [0.1, 0.3, 0.5, 0.7, 0.9]
eta = [0.0001, 0.0004, 0.001, 0.004]

# Magnitude of this change in the estimates
learning_rate=  [0.01, 0.05, 0.1]

# Maximum number of levels in tree
max_depth= [6, 8, 10, 12]
min_child_weight= [7, 8, 10, 12]

# Number of trees in random forest
n_estimators= [70, 100, 120]

#Fraction of observations to be selected for each tree
subsample= [0.5, 0.7, 0.9, 1]

# Create the random grid
xgb_grid_params_dict = {'colsample_bytree': colsample_bytree,
         'eta': eta,
         'learning_rate': learning_rate,
         'max_depth': max_depth,
         'min_child_weight': min_child_weight,
         'n_estimators': n_estimators,
         'subsample': subsample}
pprint(xgb_grid_params_dict)

In [None]:
# First create the base model to tune
xgbr = xgb.XGBRegressor(objective='reg:squarederror', random_state = 3)

# Grid Search of parameters, using 3 fold cross validation,
xgbr_grid = GridSearchCV(estimator = xgbr, param_grid = xgb_grid_params_dict, cv = 3, verbose=2, n_jobs = -1)

# Fit the Grid Search model
xgbr_grid.fit(X, y)

In [None]:
# Viewing the best paramters for the optimal model
xgbr_grid.best_params_

In [None]:
xgbr_grid.best_estimator_

In [None]:
# Testing the optimal model with training data
xgbr_optimal_model =xgbr_grid.best_estimator_
y_train_pred_xgbr_gridCV = xgbr_optimal_model.predict(X_train)

In [None]:
#Evaluation metrics of the model for training data
MSE_train_xgbr_best_gridCV  = mean_squared_error(y_train, y_train_pred_xgbr_gridCV)
print("MSE for Train data :" , MSE_train_xgbr_best_gridCV)

RMSE_train_xgbr_best_gridCV = np.sqrt(MSE_train_xgbr_best_gridCV)
print("RMSE for Train data:" ,RMSE_train_xgbr_best_gridCV)

MAE_train_xgbr_best_gridCV = mean_absolute_error(y_train, y_train_pred_xgbr_gridCV)
print("MAE for Train data:" ,MAE_train_xgbr_best_gridCV)

MAPE_train_xgbr_best_gridCV = mean_absolute_percentage_error(y_train, y_train_pred_xgbr_gridCV)
print("MAPE for Train data:" ,MAPE_train_xgbr_best_gridCV)

r2_score_train_xgbr_gridCV = r2_score(y_train, y_train_pred_xgbr_gridCV)
print("R2 for Train data:" ,r2_score_train_xgbr_gridCV)
print("Adjusted R2 for Train data: " ,adjusted_r2(X_train, r2_score_train_xgbr_gridCV))

In [None]:
# Testing the optimal model with training data
y_test_pred_xgbr_gridCV = xgbr_optimal_model.predict(X_test)

In [None]:
#Evaluation metrics of the model for test data
MSE_test_xgbr_best_gridCV  = mean_squared_error(y_test, y_test_pred_xgbr_gridCV)
print("MSE for Test data :" , MSE_test_xgbr_best_gridCV)

RMSE_test_xgbr_best_gridCV = np.sqrt(MSE_test_xgbr_best_gridCV)
print("RMSE for Test data:" ,RMSE_test_xgbr_best_gridCV)

MAE_test_xgbr_best_gridCV = mean_absolute_error(y_test, y_test_pred_xgbr_gridCV)
print("MAE for Test data:" ,MAE_test_xgbr_best_gridCV)

MAPE_test_xgbr_best_gridCV = mean_absolute_percentage_error(y_test, y_test_pred_xgbr_gridCV)
print("MAPE for Test data:" ,MAPE_test_xgbr_best_gridCV)

r2_score_test_xgbr_best_gridCV = r2_score(y_test, y_test_pred_xgbr_gridCV)
print("R2 for Test data:" ,r2_score_test_xgbr_best_gridCV)
print("Adjusted R2 for Test data: " ,adjusted_r2(X_test, r2_score_test_xgbr_best_gridCV))

In [None]:
# Accuarcy of the model for training data
train_accuracy_xgbr_best_gridCV = cross_val_score(xgbr_optimal_model, X_train,y_train, cv=5 )
print(f"Train_data_accuracy: {train_accuracy_xgbr_best_gridCV}")

In [None]:
# Accuarcy of the model for test data
test_accuracy_xgbr_best_gridCV = cross_val_score(xgbr_optimal_model, X_test,y_test, cv=5 )
print(f"Test_data_accuracy: {test_accuracy_xgbr_best_gridCV}")

## Important Features

In [None]:
importances = xgbr_optimal_model.feature_importances_

In [None]:
importance_dict = {'Feature' : list(X_train.columns), 'Feature Importance' : importances}

In [None]:
importance_df = pd.DataFrame(importance_dict)

In [None]:
important_features=importance_df.sort_values(by=['Feature Importance'],ascending=False).head(20)

In [None]:
imp_features = important_features['Feature'].tolist()
print(f"Import Features are: {imp_features}")

In [None]:
#plotting the important fetures obtainind fro the optimal XGB model
fig = plt.figure(figsize=(25,22))
ax = fig.gca()
sns.barplot(x = 'Feature Importance', y = 'Feature', data=important_features, palette= 'rocket')
sns.set_theme(style='darkgrid')
plt.xticks(rotation=90)
# ax.set_xlabel('travel_from')
# ax.set_ylabel('Frequency')
ax.set_title('travel_from counts')

## Conclusion

  * In this project, we have used three different types of regression-based algorithms like Random Forest Regressor, Gradient Boosting Regressor, and XGBoost Regressor. We have done hyperparameter tuning parameters for them to find the best model to get the best results, and we also found the important features for training the model.
  * Out of the three different models the hyperparameters tuned XGBoost Regressor algorithm gives the best results with an accuracy of around 86%.