<a href="https://colab.research.google.com/github/antonypradeep54/Machine-Learning---NYC-taxi-trip-duration-prediction/blob/main/NYC_Taxi_Trip_Duration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font size=+3 color="blue"><center><b> Assignment : NYC Taxi Trip Duration</b></center></font>


## Table of contents

- Problem statement
- Libraries and functions
- Import dataset
- Data cleaning
- Hypothesis test (t-test)
- Regression model (Random Forest)
- References



## Problem Statement

The NYC Taxi Trip Duration dataset provides detailed records of taxi trips, including pickup and drop-off times, locations, and other relevant features. The objective of this project is to develop a predictive model to estimate the trip duration based on these attributes. Accurate predictions can enhance ride-hailing services, optimize route planning, and improve overall transportation efficiency in New York City.

<a id="Librarys"></a>

## Libraries and functions

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from google.colab import files
import os

<a id="ImportDatasets"></a>

## Import dataset

In [None]:
uploaded = files.upload()

Saving nyc_taxi_trip_duration.csv to nyc_taxi_trip_duration.csv


In [None]:
print(os.listdir())

['.config', 'nyc_taxi_trip_duration.csv', 'sample_data']


In [None]:
# Creates the "dataset" directory if it doesn't exist
os.makedirs("dataset", exist_ok=True)

# Imports the shutil module, which provides functions for file and directory operations.
import shutil

# This moves the file nyc_taxi_trip_duration.csv from its current location to the dataset/ folder.
shutil.move("nyc_taxi_trip_duration.csv", "dataset/nyc_taxi_trip_duration.csv")

'dataset/nyc_taxi_trip_duration.csv'

In [None]:
NYC_Taxi_df = pd.read_csv('dataset/nyc_taxi_trip_duration.csv')
NYC_Taxi_df.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id1080784,2,2016-02-29 16:40:21,2016-02-29 16:47:01,1,-73.953918,40.778873,-73.963875,40.771164,N,400
1,id0889885,1,2016-03-11 23:35:37,2016-03-11 23:53:57,2,-73.988312,40.731743,-73.994751,40.694931,N,1100
2,id0857912,2,2016-02-21 17:59:33,2016-02-21 18:26:48,2,-73.997314,40.721458,-73.948029,40.774918,N,1635
3,id3744273,2,2016-01-05 09:44:31,2016-01-05 10:03:32,6,-73.96167,40.75972,-73.956779,40.780628,N,1141
4,id0232939,1,2016-02-17 06:42:23,2016-02-17 06:56:31,1,-74.01712,40.708469,-73.988182,40.740631,N,848


<a id="Data Cleaning"></a>

## Data Cleaning

In [None]:
NYC_Taxi_df.dtypes

Unnamed: 0,0
id,object
vendor_id,int64
pickup_datetime,object
dropoff_datetime,object
passenger_count,int64
pickup_longitude,float64
pickup_latitude,float64
dropoff_longitude,float64
dropoff_latitude,float64
store_and_fwd_flag,object


#### The following columns have the `object` data type and need to be converted to datetime:

*   pickup_datetime
*   dropoff_datetime



In [None]:
# Convert pickup_datetime and dropoff_datetime columns to datetime
NYC_Taxi_df['pickup_datetime'] = pd.to_datetime(NYC_Taxi_df['pickup_datetime'])
NYC_Taxi_df['dropoff_datetime'] = pd.to_datetime(NYC_Taxi_df['dropoff_datetime'])
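
Once both columns are real datetimes, a quick sanity check is possible: the elapsed time between drop-off and pickup should match `trip_duration`. The sketch below uses two rows copied from the `head()` output above; the explicit `fmt` string and the names `sample`, `elapsed_seconds`, and `matches` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the head() output shown earlier
sample = pd.DataFrame({
    "pickup_datetime": ["2016-02-29 16:40:21", "2016-03-11 23:35:37"],
    "dropoff_datetime": ["2016-02-29 16:47:01", "2016-03-11 23:53:57"],
    "trip_duration": [400, 1100],
})

# Assumed format, inferred from the rows above; passing it explicitly
# avoids relying on pandas to guess the format
fmt = "%Y-%m-%d %H:%M:%S"
sample["pickup_datetime"] = pd.to_datetime(sample["pickup_datetime"], format=fmt)
sample["dropoff_datetime"] = pd.to_datetime(sample["dropoff_datetime"], format=fmt)

# Recompute the duration in seconds from the two timestamps
elapsed_seconds = (sample["dropoff_datetime"] - sample["pickup_datetime"]).dt.total_seconds()

# True where the recomputed duration agrees with the recorded trip_duration
matches = elapsed_seconds == sample["trip_duration"]
print(matches.all())
```

On the full dataset, the same comparison could flag rows where the recorded duration and the timestamps disagree.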

In [None]:
# Add a few columns derived from pickup_datetime and dropoff_datetime for further analysis

NYC_Taxi_df['pickup_hour'] = NYC_Taxi_df['pickup_datetime'].dt.hour
NYC_Taxi_df['pickup_day'] = NYC_Taxi_df['pickup_datetime'].dt.day
NYC_Taxi_df['day_of_week'] = NYC_Taxi_df['pickup_datetime'].dt.dayofweek  # same values as pickup_weekday below
NYC_Taxi_df['pickup_weekday'] = NYC_Taxi_df['pickup_datetime'].dt.weekday  # 0=Monday, 6=Sunday
NYC_Taxi_df['pickup_month'] = NYC_Taxi_df['pickup_datetime'].dt.month

NYC_Taxi_df['dropoff_hour'] = NYC_Taxi_df['dropoff_datetime'].dt.hour
NYC_Taxi_df['dropoff_day'] = NYC_Taxi_df['dropoff_datetime'].dt.day
NYC_Taxi_df['dropoff_weekday'] = NYC_Taxi_df['dropoff_datetime'].dt.weekday  # 0=Monday, 6=Sunday
NYC_Taxi_df['dropoff_month'] = NYC_Taxi_df['dropoff_datetime'].dt.month



In [None]:
NYC_Taxi_df.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,pickup_hour,pickup_day,day_of_week,pickup_weekday,pickup_month,dropoff_hour,dropoff_day,dropoff_weekday,dropoff_month
0,id1080784,2,2016-02-29 16:40:21,2016-02-29 16:47:01,1,-73.953918,40.778873,-73.963875,40.771164,N,400,16,29,0,0,2,16,29,0,2
1,id0889885,1,2016-03-11 23:35:37,2016-03-11 23:53:57,2,-73.988312,40.731743,-73.994751,40.694931,N,1100,23,11,4,4,3,23,11,4,3
2,id0857912,2,2016-02-21 17:59:33,2016-02-21 18:26:48,2,-73.997314,40.721458,-73.948029,40.774918,N,1635,17,21,6,6,2,18,21,6,2
3,id3744273,2,2016-01-05 09:44:31,2016-01-05 10:03:32,6,-73.96167,40.75972,-73.956779,40.780628,N,1141,9,5,1,1,1,10,5,1,1
4,id0232939,1,2016-02-17 06:42:23,2016-02-17 06:56:31,1,-74.01712,40.708469,-73.988182,40.740631,N,848,6,17,2,2,2,6,17,2,2


In [None]:
NYC_Taxi_df.dtypes

Unnamed: 0,0
id,object
vendor_id,int64
pickup_datetime,datetime64[ns]
dropoff_datetime,datetime64[ns]
passenger_count,int64
pickup_longitude,float64
pickup_latitude,float64
dropoff_longitude,float64
dropoff_latitude,float64
store_and_fwd_flag,object


#### Map the following columns from numeric values to weekday and month names, respectively:

*   pickup_weekday
*   dropoff_weekday
*   pickup_month
*   dropoff_month

In [None]:
# Create a dictionary to perform the required weekday mapping
weekday_mapping = {
    0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday',
    4: 'Friday', 5: 'Saturday', 6: 'Sunday'
}

# Convert numeric weekday to weekday name
NYC_Taxi_df['pickup_weekday_mapped'] = NYC_Taxi_df['pickup_weekday'].map(weekday_mapping)
NYC_Taxi_df['dropoff_weekday_mapped'] = NYC_Taxi_df['dropoff_weekday'].map(weekday_mapping)

In [None]:
# Create a dictionary to perform the required month mapping
month_mapping = {
    1: 'January', 2: 'February', 3: 'March', 4: 'April',
    5: 'May', 6: 'June', 7: 'July', 8: 'August',
    9: 'September', 10: 'October', 11: 'November', 12: 'December'
}

# Convert numeric month to month name
NYC_Taxi_df['pickup_month_mapped'] = NYC_Taxi_df['pickup_month'].map(month_mapping)
NYC_Taxi_df['dropoff_month_mapped'] = NYC_Taxi_df['dropoff_month'].map(month_mapping)
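
As an alternative to hand-written dictionaries, pandas can produce the same names directly from a datetime column through the `.dt.day_name()` and `.dt.month_name()` accessors. This is a minimal sketch on three pickup timestamps copied from the output above; the variable names are illustrative only.

```python
import pandas as pd

# Three pickup timestamps copied from the head() output
pickups = pd.Series(pd.to_datetime([
    "2016-02-29 16:40:21",
    "2016-03-11 23:35:37",
    "2016-02-21 17:59:33",
]))

# Names come straight from the datetime values (English by default)
weekday_names = pickups.dt.day_name()
month_names = pickups.dt.month_name()

print(weekday_names.tolist())
print(month_names.tolist())
```

This skips the intermediate numeric columns, although the explicit dictionaries remain useful when custom labels are needed.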

In [None]:
NYC_Taxi_df.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,...,pickup_weekday,pickup_month,dropoff_hour,dropoff_day,dropoff_weekday,dropoff_month,pickup_weekday_mapped,dropoff_weekday_mapped,pickup_month_mapped,dropoff_month_mapped
0,id1080784,2,2016-02-29 16:40:21,2016-02-29 16:47:01,1,-73.953918,40.778873,-73.963875,40.771164,N,...,0,2,16,29,0,2,Monday,Monday,February,February
1,id0889885,1,2016-03-11 23:35:37,2016-03-11 23:53:57,2,-73.988312,40.731743,-73.994751,40.694931,N,...,4,3,23,11,4,3,Friday,Friday,March,March
2,id0857912,2,2016-02-21 17:59:33,2016-02-21 18:26:48,2,-73.997314,40.721458,-73.948029,40.774918,N,...,6,2,18,21,6,2,Sunday,Sunday,February,February
3,id3744273,2,2016-01-05 09:44:31,2016-01-05 10:03:32,6,-73.96167,40.75972,-73.956779,40.780628,N,...,1,1,10,5,1,1,Tuesday,Tuesday,January,January
4,id0232939,1,2016-02-17 06:42:23,2016-02-17 06:56:31,1,-74.01712,40.708469,-73.988182,40.740631,N,...,2,2,6,17,2,2,Wednesday,Wednesday,February,February


<a id="HypothesisTest"></a>

## Hypothesis Test


<a id="Ttest"></a>

## t-Test

A t-test compares the means of two independent groups to determine whether there is statistical evidence that the associated population means differ. The null and alternative hypotheses for the t-test are given below.

* Null hypothesis: The means of the two groups are equal (there is no difference between the two groups).
* Alternative hypothesis: The means of the two groups are not equal (there is a difference between the two groups).

Note: If the calculated p-value is below the significance level (typically 0.05), the null hypothesis is rejected.
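
For reference, the pooled-variance (Student's) two-sample statistic, which is what `stats.ttest_ind` computes with its default `equal_var=True`, can be written as below. The symbols are introduced here for illustration: $\bar{x}_i$, $s_i^2$, and $n_i$ are the sample mean, sample variance, and size of group $i$.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
```

Under the null hypothesis, $t$ follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom, and the two-sided p-value is the probability of a value at least as extreme as the observed $|t|$.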

In [None]:
# Separate the data for the two vendors
vendor_1_data = NYC_Taxi_df[NYC_Taxi_df['vendor_id'] == 1]['trip_duration']
vendor_2_data = NYC_Taxi_df[NYC_Taxi_df['vendor_id'] == 2]['trip_duration']

# Perform the t-test
t_stat, p_value = stats.ttest_ind(vendor_1_data, vendor_2_data)

# Output the results
print(f"P-value: {p_value}")

P-value: 3.228422510941252e-124


#### Insight : t-test

* Since the calculated p-value (about 3.2e-124) is far below the 0.05 significance level, the null hypothesis is rejected in favor of the alternative hypothesis.
* The mean trip durations for Vendor 1 and Vendor 2 are not equal.
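
The default `stats.ttest_ind` call assumes both groups have equal variance. When that assumption is doubtful, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic data; the group sizes, means, and spreads are made up for illustration and are not taken from the taxi dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for two vendors' trip durations in seconds
# (hypothetical values, not the real data)
group_a = rng.normal(loc=840, scale=300, size=500)
group_b = rng.normal(loc=960, scale=900, size=300)

# Student's t-test: pools the two sample variances
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b)

# Welch's t-test: keeps the two variances separate
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Pooled: t={t_pooled:.3f}, p={p_pooled:.3g}")
print(f"Welch:  t={t_welch:.3f}, p={p_welch:.3g}")
```

With very large samples, even a small difference in means can yield a tiny p-value, so it is worth reporting the difference in means alongside the test result.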


## Regression model (Random Forest)

In [None]:
def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # Radius of earth in km
    return c * r


NYC_Taxi_df['distance'] = haversine(NYC_Taxi_df['pickup_longitude'], NYC_Taxi_df['pickup_latitude'],
                           NYC_Taxi_df['dropoff_longitude'], NYC_Taxi_df['dropoff_latitude'])

# Log transform target
NYC_Taxi_df['log_trip_duration'] = np.log1p(NYC_Taxi_df['trip_duration'])
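
A quick way to gain confidence in the haversine function is to test it on cases with a known answer. The sketch below restates the same formula under the assumed name `haversine_km` and checks two cases: identical points (distance 0) and points one degree of latitude apart (about 111.19 km for a 6371 km radius). The coordinates are arbitrary illustrative values.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Same point twice: the distance should be zero
same_point = haversine_km(-73.98, 40.75, -73.98, 40.75)

# One degree of latitude apart at fixed longitude: radius * pi / 180
one_degree_lat = haversine_km(-73.98, 40.0, -73.98, 41.0)

# The function also works element-wise on arrays, as in the vectorized call above
batch = haversine_km(np.array([-73.98, -73.98]), np.array([40.75, 40.0]),
                     np.array([-73.98, -73.98]), np.array([40.75, 41.0]))

print(same_point, one_degree_lat, batch)
```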

In [None]:
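# Row-wise equivalent of the vectorized haversine call in the previous cell.
# It recomputes the same 'distance' column and is much slower on a large DataFrame,
# so this cell is redundant and can be skipped.
NYC_Taxi_df['distance'] = NYC_Taxi_df.apply(
    lambda row: haversine(row['pickup_longitude'], row['pickup_latitude'],
                          row['dropoff_longitude'], row['dropoff_latitude']),
    axis=1
)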

In [None]:
# Features & Target
features = ['passenger_count', 'pickup_hour', 'day_of_week', 'distance']
X = NYC_Taxi_df[features]
y = NYC_Taxi_df['log_trip_duration']

In [None]:
from sklearn.model_selection import train_test_split

# Train/val split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from math import sqrt
from sklearn.metrics import mean_squared_log_error

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

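# Evaluation
# Caution: y_val and y_pred are already log1p-transformed, and mean_squared_log_error
# applies log1p again, so this value is not the RMSLE on the original trip_duration scale.
y_pred = model.predict(X_val)
rmsle = sqrt(mean_squared_log_error(y_val, y_pred))
print("Validation RMSLE:", rmsle)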

Validation RMSLE: 0.07391562246339738
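
Because the model is trained on `log1p(trip_duration)`, the validation targets and predictions are already in log space. Passing them to `mean_squared_log_error` takes the logarithm a second time, which shrinks the reported error. The sketch below illustrates the effect with NumPy only; the `rmsle` helper and the predicted durations are hypothetical, and the true durations are the five values from the `head()` output.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSLE on the original scale: RMSE of the log1p-transformed values."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

durations_true = np.array([400.0, 1100.0, 1635.0, 1141.0, 848.0])  # seconds, from head()
durations_pred = np.array([450.0, 1000.0, 1500.0, 1200.0, 900.0])  # hypothetical predictions

# What the model actually sees and predicts: log1p of the duration
log_true = np.log1p(durations_true)
log_pred = np.log1p(durations_pred)

# Option 1: plain RMSE computed directly in log space
rmse_in_log_space = float(np.sqrt(np.mean((log_pred - log_true) ** 2)))

# Option 2: convert back to seconds with expm1, then compute RMSLE
rmsle_original_scale = rmsle(np.expm1(log_true), np.expm1(log_pred))

# The double-log pitfall: RMSLE applied to values that are already logs
rmsle_double_log = rmsle(log_true, log_pred)

print(rmse_in_log_space, rmsle_original_scale, rmsle_double_log)
```

Options 1 and 2 agree, while the double-log value is noticeably smaller. In the notebook, the equivalent of option 1 would be `sqrt(mean_squared_error(y_val, y_pred))`, and of option 2 `sqrt(mean_squared_log_error(np.expm1(y_val), np.expm1(y_pred)))`; the resulting value would need to be obtained by re-running the cell.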


Rule-of-thumb benchmarks for Root Mean Squared Logarithmic Error (RMSLE):

* Less than 0.1 → Excellent
* Between 0.1 and 0.2 → Good
* Between 0.2 and 0.5 → Moderate / acceptable depending on use case
* Greater than 0.5 → Poor


In [None]:
print(f"Since RMSLE value of the model is: {rmsle:.4f}, the model is performing well")

Since RMSLE value of the model is: 0.0739, the model is performing well


<a id="References"></a>

## References

* ChatGPT for Python syntax.
* DataTab for statistics.


<font size=+1 color="blue"><center><b> End of Assignment </b></center></font>