
# **Predicting Taxi Fares with Machine Learning: A Comprehensive Analysis Using RandomForestRegressor**

**Description:**

Taxis move millions of people through cities every day, and accurate fare prediction matters to passengers, drivers, taxi companies, and policymakers. This Python project builds a predictive model for taxi fares using scikit-learn's RandomForestRegressor.

1. **Data Acquisition and Exploration:**
The project starts from a dataset of 50,000 taxi rides. Each record holds a pickup timestamp, pickup and dropoff coordinates, a passenger count, and the fare amount. Exploratory data analysis (EDA) is used to inspect the data's structure, distributions, and value ranges.

2. **Data Preprocessing and Feature Engineering:**
The data is checked for missing values (none are found), the `unique_id` column is dropped, and the pickup timestamp is parsed into a datetime. Feature engineering then derives the hour of the day, day of the month, month, day of the week, and the haversine distance between pickup and dropoff.

3. **Filtering Outliers and Ensuring Data Integrity:**
The summary statistics reveal anomalies such as negative fares, impossible coordinates, and trips of thousands of kilometres. The notebook defines thresholds for fares (a $2.50 minimum) and for trip distance (1 km to 130 km) to screen out such erroneous entries before training.

4. **Model Selection and Training:**
RandomForestRegressor, an ensemble of decision trees that can capture non-linear relationships and feature interactions, is chosen as the model. The data is split 80/20 into training and test sets, and the model is trained on pickup and dropoff coordinates, passenger count, distance traveled, and the temporal features.

5. **Model Evaluation and Performance Metrics:**
The trained model is evaluated on the test set with Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). Comparing predictions against the actual fares in the test set indicates how well the model generalizes to unseen data.

In [1]:
# Import the libraries used in this notebook

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [2]:
# Load the TaxiFare dataset (the path assumes the file was uploaded to the Colab session)
df = pd.read_csv("/content/TaxiFare.csv")

In [3]:
df.head(10)

| | unique_id | date_time_of_pickup | longitude_of_pickup | latitude_of_pickup | longitude_of_dropoff | latitude_of_dropoff | no_of_passenger | amount |
|---|---|---|---|---|---|---|---|---|
| 0 | 26:21.0 | 2009-06-15 17:26:21 UTC | -73.844311 | 40.721319 | -73.84161 | 40.712278 | 1 | 4.5 |
| 1 | 52:16.0 | 2010-01-05 16:52:16 UTC | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 | 16.9 |
| 2 | 35:00.0 | 2011-08-18 00:35:00 UTC | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 2 | 5.7 |
| 3 | 30:42.0 | 2012-04-21 04:30:42 UTC | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 1 | 7.7 |
| 4 | 51:00.0 | 2010-03-09 07:51:00 UTC | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 | 5.3 |
| 5 | 50:45.0 | 2011-01-06 09:50:45 UTC | -74.000964 | 40.73163 | -73.972892 | 40.758233 | 1 | 12.1 |
| 6 | 35:00.0 | 2012-11-20 20:35:00 UTC | -73.980002 | 40.751662 | -73.973802 | 40.764842 | 1 | 7.5 |
| 7 | 22:00.0 | 2012-01-04 17:22:00 UTC | -73.9513 | 40.774138 | -73.990095 | 40.751048 | 1 | 16.5 |
| 8 | 10:00.0 | 2012-12-03 13:10:00 UTC | -74.006462 | 40.726713 | -73.993078 | 40.731628 | 1 | 9.0 |
| 9 | 11:00.0 | 2009-09-02 01:11:00 UTC | -73.980658 | 40.733873 | -73.99154 | 40.758138 | 2 | 8.9 |


In [4]:
# Drop the unique_id column, which is an identifier rather than a feature
df = df.drop("unique_id",axis = 1)

In [5]:
df.describe()

| | longitude_of_pickup | latitude_of_pickup | longitude_of_dropoff | latitude_of_dropoff | no_of_passenger | amount |
|---|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | -72.509756 | 39.933759 | -72.504616 | 39.926251 | 1.66784 | 11.364171 |
| std | 10.39386 | 6.224857 | 10.40757 | 6.014737 | 1.289195 | 9.685557 |
| min | -75.423848 | -74.006893 | -84.654241 | -74.006377 | 0.0 | -5.0 |
| 25% | -73.992062 | 40.73488 | -73.991152 | 40.734372 | 1.0 | 6.0 |
| 50% | -73.98184 | 40.752678 | -73.980082 | 40.753372 | 1.0 | 8.5 |
| 75% | -73.967148 | 40.76736 | -73.963584 | 40.768167 | 2.0 | 12.5 |
| max | 40.783472 | 401.083332 | 40.851027 | 43.41519 | 6.0 | 200.0 |
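
The summary above contains coordinates that cannot be valid, such as a pickup latitude of 401.08 and latitudes near -74 that look like swapped longitudes. The notebook does not filter on coordinates directly; the later distance filter removes some of these rows only indirectly. As a minimal sketch, assuming an explicit check were wanted, the hypothetical helper `within_box` below keeps only rows whose pickup and dropoff fall inside a bounding box. The box limits are illustrative assumptions for the New York City area, not values from the original analysis.

```python
import pandas as pd

# Hypothetical bounding box (assumed values, roughly the New York City area).
LON_MIN, LON_MAX = -75.0, -72.0
LAT_MIN, LAT_MAX = 40.0, 42.0

def within_box(frame: pd.DataFrame) -> pd.Series:
    """Return a boolean mask of rows whose pickup and dropoff lie inside the box."""
    lon_ok = (frame["longitude_of_pickup"].between(LON_MIN, LON_MAX)
              & frame["longitude_of_dropoff"].between(LON_MIN, LON_MAX))
    lat_ok = (frame["latitude_of_pickup"].between(LAT_MIN, LAT_MAX)
              & frame["latitude_of_dropoff"].between(LAT_MIN, LAT_MAX))
    return lon_ok & lat_ok

# Toy rows: one plausible ride, one with an impossible latitude, one with swapped axes.
toy = pd.DataFrame({
    "longitude_of_pickup":  [-73.98, -73.99, 40.78],
    "latitude_of_pickup":   [40.75, 401.08, -73.97],
    "longitude_of_dropoff": [-73.97, -73.98, 40.85],
    "latitude_of_dropoff":  [40.76, 40.75, -74.00],
})
clean = toy[within_box(toy)]
print(len(toy), "->", len(clean))
```

Applying a filter like this to the real data would change the row counts and metrics reported later in the notebook.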


In [6]:
df.isna().sum()   # no missing values found

date_time_of_pickup     0
longitude_of_pickup     0
latitude_of_pickup      0
longitude_of_dropoff    0
latitude_of_dropoff     0
no_of_passenger         0
amount                  0
dtype: int64

In [7]:
df.info()  # check data types and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date_time_of_pickup   50000 non-null  object 
 1   longitude_of_pickup   50000 non-null  float64
 2   latitude_of_pickup    50000 non-null  float64
 3   longitude_of_dropoff  50000 non-null  float64
 4   latitude_of_dropoff   50000 non-null  float64
 5   no_of_passenger       50000 non-null  int64  
 6   amount                50000 non-null  float64
dtypes: float64(5), int64(1), object(1)
memory usage: 2.7+ MB


In [8]:
# Extract date-time components as separate independent variables
# First convert date_time_of_pickup to a datetime type
df["date_time_of_pickup"] = pd.to_datetime(df["date_time_of_pickup"])
new_df = df.assign(hour = df["date_time_of_pickup"].dt.hour,
                  dayOfTheMonth = df["date_time_of_pickup"].dt.day,
                  month = df["date_time_of_pickup"].dt.month,
                  dayOfTheWeek = df["date_time_of_pickup"].dt.dayofweek)

# Remove date_time_of_pickup
new_df.drop("date_time_of_pickup", axis = 1, inplace = True)

new_df.head()

| | longitude_of_pickup | latitude_of_pickup | longitude_of_dropoff | latitude_of_dropoff | no_of_passenger | amount | hour | dayOfTheMonth | month | dayOfTheWeek |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -73.844311 | 40.721319 | -73.84161 | 40.712278 | 1 | 4.5 | 17 | 15 | 6 | 0 |
| 1 | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 | 16.9 | 16 | 5 | 1 | 1 |
| 2 | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 2 | 5.7 | 0 | 18 | 8 | 3 |
| 3 | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 1 | 7.7 | 4 | 21 | 4 | 5 |
| 4 | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 | 5.3 | 7 | 9 | 3 | 1 |


In [9]:
new_df.shape

(50000, 10)

In [10]:
def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c # 6367 km approximates the Earth's radius (the mean radius is about 6371 km)
    return km

new_df["distance"] = haversine_np(new_df["longitude_of_pickup"], new_df["latitude_of_pickup"],
                                   new_df["longitude_of_dropoff"], new_df["latitude_of_dropoff"])

new_df.head()

| | longitude_of_pickup | latitude_of_pickup | longitude_of_dropoff | latitude_of_dropoff | no_of_passenger | amount | hour | dayOfTheMonth | month | dayOfTheWeek | distance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -73.844311 | 40.721319 | -73.84161 | 40.712278 | 1 | 4.5 | 17 | 15 | 6 | 0 | 1.030117 |
| 1 | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 | 16.9 | 16 | 5 | 1 | 1 | 8.444828 |
| 2 | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 2 | 5.7 | 0 | 18 | 8 | 3 | 1.388653 |
| 3 | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 1 | 7.7 | 4 | 21 | 4 | 5 | 2.797513 |
| 4 | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 | 5.3 | 7 | 9 | 3 | 1 | 1.997902 |
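
As a quick sanity check, the haversine formula can be evaluated on a single pair of points. The sketch below restates the same formula as a standalone function (the name `haversine_km` and the `radius_km` parameter are introduced here for illustration) and applies it to the coordinates of row 0, which should give roughly the 1.03 km shown in the `distance` column above.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6367.0):
    """Great-circle distance in km; accepts scalars or equal-length arrays."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return radius_km * 2.0 * np.arcsin(np.sqrt(a))

# Coordinates of row 0 from the notebook output.
d0 = haversine_km(-73.844311, 40.721319, -73.84161, 40.712278)
print(round(float(d0), 3))

# Identical pickup and dropoff points give zero distance.
d_same = haversine_km(-73.98, 40.75, -73.98, 40.75)
print(float(d_same))
```

Note that this is a straight-line distance over the Earth's surface, so it is a lower bound on the distance actually driven.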


In [11]:
# Let's do a simple check of the range of values currently present in each variable
new_df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| longitude_of_pickup | 50000.0 | -72.509756 | 10.39386 | -75.423848 | -73.992062 | -73.98184 | -73.967148 | 40.783472 |
| latitude_of_pickup | 50000.0 | 39.933759 | 6.224857 | -74.006893 | 40.73488 | 40.752678 | 40.76736 | 401.083332 |
| longitude_of_dropoff | 50000.0 | -72.504616 | 10.40757 | -84.654241 | -73.991152 | -73.980082 | -73.963584 | 40.851027 |
| latitude_of_dropoff | 50000.0 | 39.926251 | 6.014737 | -74.006377 | 40.734372 | 40.753372 | 40.768167 | 43.41519 |
| no_of_passenger | 50000.0 | 1.66784 | 1.289195 | 0.0 | 1.0 | 1.0 | 2.0 | 6.0 |
| amount | 50000.0 | 11.364171 | 9.685557 | -5.0 | 6.0 | 8.5 | 12.5 | 200.0 |
| hour | 50000.0 | 13.48908 | 6.506935 | 0.0 | 9.0 | 14.0 | 19.0 | 23.0 |
| dayOfTheMonth | 50000.0 | 15.67204 | 8.660789 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |
| month | 50000.0 | 6.2733 | 3.461157 | 1.0 | 3.0 | 6.0 | 9.0 | 12.0 |
| dayOfTheWeek | 50000.0 | 3.02998 | 1.956936 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 |


In [12]:
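# A. Keep only fares of at least $2.50, the minimum fare
# NOTE: the filtered rows are assigned to fullRaw, not new_df, so new_df is unchanged
# and the two summaries printed below are identical. Assign the result back to new_df to apply the filter.

print(new_df["amount"].describe())
fullRaw = new_df[new_df["amount"] >= 2.5]
print(new_df["amount"].describe())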

count    50000.000000
mean        11.364171
std          9.685557
min         -5.000000
25%          6.000000
50%          8.500000
75%         12.500000
max        200.000000
Name: amount, dtype: float64
count    50000.000000
mean        11.364171
std          9.685557
min         -5.000000
25%          6.000000
50%          8.500000
75%         12.500000
max        200.000000
Name: amount, dtype: float64
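
The two summaries above are identical because the filtered rows are stored in `fullRaw` while `new_df` is left untouched. A minimal sketch of the filter that was presumably intended, shown on a toy frame rather than the real data, assigns the result back to the frame that is then summarized. The names `MIN_FARE` and `toy_df` are introduced here for illustration.

```python
import pandas as pd

MIN_FARE = 2.5  # minimum fare in dollars, as stated in the notebook comment

# Toy stand-in for new_df; the values are made up for illustration.
toy_df = pd.DataFrame({"amount": [-5.0, 0.0, 2.5, 8.5, 200.0]})

print(toy_df["amount"].describe())
# Reassign the filtered frame so that later steps see the change.
toy_df = toy_df[toy_df["amount"] >= MIN_FARE].reset_index(drop=True)
print(toy_df["amount"].describe())
```

Applying the same reassignment to the real data would change the row counts and metrics reported later in the notebook.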


In [13]:
# B. Keep trips with a travel distance of at least 1 km and at most 130 km

print(new_df["distance"].describe())
new_df = new_df[(new_df["distance"] >= 1) & (new_df["distance"] <= 130)]
print(new_df["distance"].describe())

count    50000.000000
mean        18.497326
std        355.341070
min          0.000000
25%          1.222378
50%          2.118783
75%          3.893124
max       8662.376766
Name: distance, dtype: float64
count    40926.000000
mean         3.919168
std          4.491177
min          1.000150
25%          1.662453
50%          2.576932
75%          4.490593
max        129.868894
Name: distance, dtype: float64


In [14]:
new_df.shape

(40926, 11)

In [15]:
x = new_df.drop(["amount"], axis = 1).copy()
y = new_df["amount"].copy()


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.20,random_state=100)


print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(32740, 10)
(8186, 10)
(32740,)
(8186,)


In [17]:
M1 = RandomForestRegressor(random_state=123)
M1 = M1.fit(x_train,y_train) # Indep, Dep variable

In [18]:
varImpDf = pd.DataFrame()
varImpDf["Importance"] = M1.feature_importances_
varImpDf["Variable"] = x_train.columns
varImpDf.sort_values("Importance", ascending = False, inplace = True)

varImpDf.head()

| | Importance | Variable |
|---|---|---|
| 9 | 0.833108 | distance |
| 2 | 0.046062 | longitude_of_dropoff |
| 3 | 0.033203 | latitude_of_dropoff |
| 0 | 0.026619 | longitude_of_pickup |
| 1 | 0.019757 | latitude_of_pickup |
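
The importances above are impurity-based: they are computed from the training data and can be skewed by high-cardinality or correlated features, and here `distance` is itself derived from the four coordinate columns. One common cross-check is permutation importance on held-out data. The sketch below uses scikit-learn's `permutation_importance` on synthetic data (not the taxi dataset), so the column names, coefficients, and resulting numbers are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Synthetic features: the target depends strongly on "distance" and not on "noise".
X = pd.DataFrame({
    "distance": rng.uniform(1, 20, n),
    "hour": rng.integers(0, 24, n),
    "noise": rng.normal(size=n),
})
y = 2.5 + 1.5 * X["distance"] + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each test-set column in turn and measure the drop in the R^2 score.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
perm_df = (pd.DataFrame({"Variable": X.columns, "Importance": result.importances_mean})
           .sort_values("Importance", ascending=False))
print(perm_df)
```

On the real data the same call would take the fitted `M1`, `x_test`, and `y_test` in place of the synthetic objects.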


In [19]:
# Model prediction on the test set

testPredDf = pd.DataFrame()

testPredDf["Prediction"] = M1.predict(x_test)

# Create a column to store actuals
testPredDf["Actual"] = y_test.values

# Validate if the above worked
testPredDf.head()

| | Prediction | Actual |
|---|---|---|
| 0 | 13.2817 | 15.5 |
| 1 | 9.181 | 10.0 |
| 2 | 8.382 | 8.0 |
| 3 | 6.14 | 5.0 |
| 4 | 15.753 | 21.0 |


In [20]:
# RMSE
print("RMSE",np.sqrt(np.mean((testPredDf["Actual"] - testPredDf["Prediction"])**2)))
# RMSE is in dollars: the typical prediction error is about $4.25, with large errors weighted more heavily
# The lower the RMSE, the better the model's predictions

# MAPE
print("MAPE",(np.mean(np.abs(((testPredDf["Actual"] - testPredDf["Prediction"])/testPredDf["Actual"]))))*100)
# On average, the predicted fare is off by about 19% of the actual fare
# The lower the MAPE, the better the model's predictions

RMSE 4.249447879749447
MAPE 18.924176905704467
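
Both metrics can be wrapped in small helper functions, which also makes their caveats explicit: RMSE is expressed in dollars and weights large errors more heavily, while MAPE is undefined when an actual value is zero. The sketch below is illustrative (the function names `rmse` and `mape`, and the choice to skip zero actuals, are not from the notebook) and evaluates only the five rows shown in the prediction preview above.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, in the units of the target."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    """Mean absolute percentage error; rows with a zero actual are skipped."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    nonzero = actual != 0
    ratio = (actual[nonzero] - predicted[nonzero]) / actual[nonzero]
    return float(np.mean(np.abs(ratio)) * 100)

# First five test rows from the preview above (Prediction, Actual).
pred = [13.2817, 9.181, 8.382, 6.14, 15.753]
act = [15.5, 10.0, 8.0, 5.0, 21.0]
print("RMSE", rmse(act, pred))
print("MAPE", mape(act, pred))
```

Five rows are far too few to estimate either metric, so the values printed here are not comparable to the full test-set figures.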



**Conclusion and Future Directions:**

This project shows that a RandomForestRegressor trained on pickup and dropoff coordinates, passenger count, trip distance, and time-of-pickup features can predict taxi fares with a test RMSE of about $4.25 and a MAPE of about 18.9%. Distance is by far the most influential feature, accounting for roughly 83% of the model's impurity-based importance. Future work could incorporate additional features and explore other modeling techniques to improve accuracy.

**Project Title: Taxi Fare Prediction**

**Description:**
Implemented a machine learning solution to predict taxi fares from pickup and dropoff locations, date and time of pickup, and travel distance. Applied data preprocessing, including feature extraction and outlier filtering, to improve data quality. Trained a RandomForestRegressor model for fare estimation and evaluated it with RMSE and MAPE, obtaining a test RMSE of about $4.25 and a MAPE of about 18.9%. This project demonstrates data preprocessing, feature engineering, and regression modeling for a real-world application in the transportation industry.