# New York City Taxi Fare Prediction

This project is part of the Jovian course Machine Learning with Python.

We will train a machine learning model to predict the fare amount of a taxi ride in New York City, given information like the pickup date and time, the pickup and dropoff coordinates, and the number of passengers.

Dataset link: https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data

# Project Outline:
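- Downloading the Dataset
- Exploring the Dataset
- Preparing the Dataset for training
- Training hardcoded and baseline models
- Making predictions and submitting to Kaggle
- Feature engineering
- Training and evaluating different models
- Tuning hyperparameters for the XGBoost model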

# 1. Downloading the Dataset

(1.1)  Installing required libraries

In [None]:
!pip install jovian opendatasets pandas numpy scikit-learn xgboost --quiet




In [None]:
import jovian

In [None]:
jovian.commit()

[jovian] Detected Colab notebook...
[jovian] jovian.commit() is no longer required on Google Colab. If you ran this notebook from Jovian, 
then just save this file in Colab using Ctrl+S/Cmd+S and it will be updated on Jovian. 
Also, you can also delete this cell, it's no longer necessary.


(1.2) Downloading data from kaggle

In [None]:
import opendatasets as od

In [None]:
dataset_url = 'https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data'

In [None]:
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:

In [None]:
data_dir = 'new-york-city-taxi-fare-prediction'

(1.3) Viewing dataset files

In [None]:
#getting the size of files
!ls -lh {data_dir}

In [None]:
!wc -l {data_dir}/train.csv

In [None]:
!wc -l {data_dir}/test.csv

In [None]:
!head {data_dir}/train.csv

In [None]:
!head {data_dir}/test.csv

What does the sample submission file look like? Let's take a look:

In [None]:
!head {data_dir}/sample_submission.csv

OBSERVATIONS:
- This is a supervised learning regression problem
- The training data is 5.4 GB in size
- The test data is 960 KB in size
- The training set has 8 columns:
   - key
   - fare_amount (target column)
   - pickup_datetime
   - pickup_longitude
   - pickup_latitude
   - dropoff_longitude
   - dropoff_latitude
   - passenger_count
- The test set contains the same columns except the fare_amount column
- The submission file should contain the key and fare_amount

(1.4) Loading Training Set

In [None]:
selected_cols = 'key,pickup_datetime,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count'.split(',')
selected_cols


In [None]:
dtypes = {
 'fare_amount': 'float32',
 'pickup_longitude': 'float32',
 'pickup_latitude': 'float32',
 'dropoff_longitude': 'float32',
 'dropoff_latitude': 'float32',
 'passenger_count': 'uint8'

}

In [None]:
import pandas as pd

In [None]:
import random
def skip_row(idx):
  if idx == 0:
    return False
  return random.random() > 0.01
random.seed(42)
train_df = pd.read_csv(data_dir+ '/train.csv' , dtype = dtypes, usecols = selected_cols, skiprows = skip_row, parse_dates = ['pickup_datetime'])
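
The callable passed to `skiprows` above decides, row by row, what to drop: returning `True` skips a row, and row 0 (the header) is always kept. A minimal sketch of the same idea on a small in-memory CSV follows. The names `csv_text`, `keep_fraction` and `sample_df` and the 10% fraction are assumptions made for illustration; they are not part of the notebook.

```python
import io
import random

import pandas as pd

# Build a small in-memory CSV (a header plus 1,000 data rows) to stand in for train.csv.
csv_text = "key,fare_amount\n" + "".join(f"{i},{i * 0.5}\n" for i in range(1000))

def keep_fraction(idx, fraction=0.1):
    # Row 0 is the header, so it must never be skipped.
    if idx == 0:
        return False
    # Returning True skips the row; each data row is kept with probability `fraction`.
    return random.random() > fraction

random.seed(42)  # fixed seed so the same rows are sampled on every run
sample_df = pd.read_csv(io.StringIO(csv_text), skiprows=keep_fraction)
print(len(sample_df), "of 1000 rows kept")
```

Because rows are dropped while the file is being parsed, the full file never has to be held in memory, which is what makes this approach practical for a multi-gigabyte training file.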

In [None]:
train_df

In [None]:
jovian.commit()

(1.5) Loading Test Set

In [None]:
test_df = pd.read_csv(data_dir+ '/test.csv', dtype = dtypes, parse_dates = ['pickup_datetime'])

In [None]:
test_df

In [None]:
jovian.commit()

# 2. Exploring the Dataset

(2.1) Training Set

In [None]:
train_df.info()

In [None]:
train_df.describe()

In [None]:
train_df['pickup_datetime'].min(), train_df['pickup_datetime'].max()

OBSERVATIONS ABOUT TRAINING DATA:
- No missing data (in sample)
- passenger_count ranges from 0 to 208
- fare_amount ranges from -52 to 499
- Errors in longitude and latitude values
- The training set contains details of rides from 1st Jan 2009 to 30th June 2015


In [None]:
jovian.commit()

(2.2) Test Set

In [None]:
test_df.info()

In [None]:
test_df.describe()

In [None]:
test_df['pickup_datetime'].min(), test_df['pickup_datetime'].max()

OBSERVATIONS ABOUT TEST DATA:
- Around 10k rows of data
- No missing values
- Pickup dates are in the same range as the training data
- passenger_count is between 1 and 6, so we can limit training data to this range

(2.3) Exploratory Data Analysis and Visualization

Let's perform some exploratory analysis to gain a deeper understanding of our data

Let's begin by importing the libraries required for data visualisation


In [None]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
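# numeric_only=True skips non-numeric columns such as key (needed in recent pandas versions)
train_df.corr(numeric_only=True)

In [None]:
sns.heatmap(train_df.corr(numeric_only=True))

In [None]:
train_df.corr(numeric_only=True)['fare_amount'].sort_values(ascending=False)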

The fare_amount column has its highest positive correlation with the passenger_count column, which is intuitive: more passengers may require a bigger taxi, which in turn has a higher fare.

However, it wouldn't be fair to draw conclusions from this data yet, as we still need to do some feature engineering to calculate the distance between the pickup and dropoff locations.

Let's also take a look at the distributions of the fare_amount and passenger_count columns

In [None]:
import numpy as np
plt.figure(figsize=(17, 8))
plt.title("Distribution of fare_amount")
plt.hist(train_df['fare_amount'], color = 'orange');
plt.xlabel("Fare amount")



In [None]:

plt.figure(figsize=(17, 8))
plt.title("Distribution of passenger_count")
# bin edges 1..7 so that each passenger count from 1 to 6 gets its own bar
plt.hist(train_df['passenger_count'], bins = np.arange(1,8), color = 'orange');
plt.xlabel("Passenger count")
plt.ylabel("No. of rides")


These histograms show how fare_amount and passenger_count are distributed in our sample. The invalid values noted earlier (negative fares, passenger counts outside the range 1 to 6) are handled in the feature engineering section.

# 3. Preparing the Dataset for training

(3.1) Splitting training and validation set

We will set aside 20% of our training data as validation set

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# reassign train_df so that the validation rows are no longer part of the training data
train_df, val_df = train_test_split(train_df, test_size = 0.2, random_state = 42)
len(train_df), len(val_df)

(3.2) Extract Inputs and Outputs

In [None]:
train_df.columns

In [None]:
input_cols = ['pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count']

In [None]:
target_col = 'fare_amount'

Training

In [None]:
train_inputs = train_df[input_cols]
train_targets = train_df[target_col]

In [None]:
train_inputs

In [None]:
train_targets

Validation

In [None]:
val_inputs = val_df[input_cols]
val_targets = val_df[target_col]

In [None]:
val_inputs

In [None]:
val_targets

Test

In [None]:
test_inputs = test_df[input_cols]


In [None]:
test_inputs

# 4. Train Hardcoded and Baseline Models

(4.1) Let's create a simple model that predicts the average

In [None]:
import numpy as np
class MeanRegressor:
  def fit(self, inputs, targets):
    # store the exact mean of the training targets (no rounding)
    self.mean = targets.mean()

  def predict(self, inputs):
    # predict the same mean value for every row
    return np.full(inputs.shape[0], self.mean)

In [None]:
mean_model = MeanRegressor()

In [None]:
mean_model.fit(train_inputs, train_targets)

In [None]:
mean_model.mean

In [None]:
train_preds = mean_model.predict(train_inputs)

In [None]:
train_preds

In [None]:
val_preds = mean_model.predict(val_inputs)

In [None]:
val_preds

In [None]:
val_targets

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
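def rmse(targets,preds):
  # take the square root explicitly; the squared=False argument is deprecated in recent scikit-learn releases
  return np.sqrt(mean_squared_error(targets,preds))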
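
To make the metric concrete, here is a small worked sketch of the root mean squared error computed by hand with NumPy. The helper `rmse_by_hand` and the three example values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def rmse_by_hand(targets, preds):
    # RMSE = sqrt(mean((target - prediction)^2))
    targets = np.asarray(targets, dtype=float)
    preds = np.asarray(preds, dtype=float)
    return np.sqrt(np.mean((targets - preds) ** 2))

# Three hypothetical fares and a constant prediction of 10 for each of them.
example_targets = [10.0, 12.0, 8.0]
example_preds = [10.0, 10.0, 10.0]

# The errors are 0, 2 and -2, so the mean squared error is (0 + 4 + 4) / 3.
example_rmse = rmse_by_hand(example_targets, example_preds)
print(example_rmse)
```

Since the errors are squared before averaging, large mistakes on a few rides weigh more heavily than many small ones.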

In [None]:
train_rmse = rmse(train_targets, train_preds)
train_rmse

In [None]:
val_rmse = rmse(val_targets, val_preds)
val_rmse

(4.2) Train and evaluate baseline model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linear_model = LinearRegression()

In [None]:
linear_model.fit(train_inputs, train_targets)

In [None]:
train_preds = linear_model.predict(train_inputs)



In [None]:
train_preds

In [None]:
train_targets

In [None]:
rmse(train_targets, train_preds)

In [None]:
val_preds = linear_model.predict(val_inputs)

In [None]:
rmse(val_targets, val_preds)

# 5. Make predictions and submit to kaggle

In [None]:
test_inputs

In [None]:
test_preds = linear_model.predict(test_inputs)

In [None]:
test_preds

In [None]:
submission_df = pd.read_csv(data_dir + '/sample_submission.csv')

In [None]:
submission_df

In [None]:
submission_df['fare_amount'] = test_preds

In [None]:
submission_df.to_csv('submission.csv', index = None)

In [None]:
def submit(fname, model, test_inputs):
  # predict on the test inputs and write them into a copy of the sample submission file
  preds = model.predict(test_inputs)
  submission_df = pd.read_csv(data_dir + '/sample_submission.csv')
  submission_df['fare_amount'] = preds
  submission_df.to_csv(fname, index = None)


# 6. Feature Engineering

- Extracting parts of date (day, month, year)
- Removing outliers and invalid data
- Adding distance between pickup and drop
- Adding distance from landmarks

(6.1) Extracting parts of Date

Let's extract the year, month, day, weekday and hour from our datetime column

In [None]:
def extract_date(df,col):
  df['Year'] = df[col].dt.year
  df['Month'] = df[col].dt.month
  df['Day'] = df[col].dt.day
  df['WeekDay'] = df[col].dt.weekday
  df['Hour'] = df[col].dt.hour

In [None]:
extract_date(train_df, 'pickup_datetime')

In [None]:
extract_date(val_df, 'pickup_datetime')

In [None]:
extract_date(test_df, 'pickup_datetime')

In [None]:
train_df

In [None]:
val_df

In [None]:
test_df

(6.2) Adding distance between pickup and drop

We will be using the Haversine formula


In [None]:
import numpy as np

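def haversine_form(lon1, lat1, lon2, lat2):
  # convert degrees to radians
  lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
  dlon = lon2 - lon1
  dlat = lat2 - lat1
  a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0) ** 2
  # central angle between the two points (the Haversine formula includes the factor 2)
  c = 2 * np.arcsin(np.sqrt(a))
  # multiply by an approximate Earth radius in km
  km = 6367 * c
  return km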
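
As a sanity check on the Haversine formula, d = 2R * arcsin(sqrt(a)), the self-contained sketch below evaluates it on two cases with a known rough answer. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this illustration; the JFK and LGA coordinates are the approximate values used elsewhere in this notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km

def haversine_km(lon1, lat1, lon2, lat2):
    # Convert degrees to radians before applying the trigonometric functions.
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    # Great-circle distance: radius times the central angle 2 * arcsin(sqrt(a)).
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of latitude along a meridian should come out at roughly 111 km.
one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)

# JFK to LGA, using the approximate (longitude, latitude) pairs from this notebook.
jfk_to_lga_km = haversine_km(-73.7781, 40.6413, -73.8740, 40.7769)

print(one_degree_km, jfk_to_lga_km)
```

Checking a distance function against a case with a known answer is a cheap way to catch a missing factor or swapped longitude and latitude arguments before the feature is fed to a model.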




In [None]:
def trip_distance(df):
  df['tripdistance'] = haversine_form(df['pickup_longitude'],df['pickup_latitude'],df['dropoff_longitude'] ,df['dropoff_latitude'])

In [None]:
trip_distance(train_df)

In [None]:
trip_distance(val_df)

In [None]:
trip_distance(test_df)

(6.3) Adding distance from Landmarks

Let's add the distance between the drop-off locations and popular destinations in New York to see if proximity to popular landmarks has any effect on the taxi fare

Some popular landmarks in New York:
- JFK Airport
- LGA Airport
- EWR Airport
- Met Museum
- World Trade Centre

In [None]:
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
met_lonlat = -73.9632, 40.7794
wtc_lonlat = -74.0099, 40.7126


In [None]:
def add_landmark_distance(df, landmark, landmarklonlat):
  lon,lat = landmarklonlat
  df[landmark + 'dropoff_distance'] = haversine_form(lon,lat,df['dropoff_longitude'],df['dropoff_latitude'])

In [None]:
landmarks = {'jfk': jfk_lonlat, 'lga': lga_lonlat, 'ewr': ewr_lonlat, 'met': met_lonlat, 'wtc': wtc_lonlat}
def add_landmarks(df):
  for i in landmarks:
    add_landmark_distance(df, i, landmarks[i])


In [None]:
add_landmarks(train_df)

In [None]:
add_landmarks(val_df)

In [None]:
add_landmarks(test_df)

In [None]:
train_df

In [None]:
val_df

In [None]:
test_df

(6.4) Removing outliers and invalid data

We noticed invalid data in the following columns when we did data analysis:

- Pickup latitude and longitude
- Drop latitude and longitude
- Passenger count
- Fare amount

In [None]:
test_df.describe()

Modified ranges for the above columns:
- Fare amount: $1 to $500
- Longitudes: -75 to -72
- Latitudes: 40 to 42
- Passenger count: 1 to 6

We use these ranges because the test set is limited to them


In [None]:
def remove_outliers(df):
  return df[(df['fare_amount']>=1.0) &
            (df['fare_amount']<=500.0 )&
            (df['pickup_longitude']>=-75) &
            (df['pickup_longitude']<=-72 ) &
            (df['dropoff_longitude']>=-75) &
            (df['dropoff_longitude']<=-72) &
            (df['pickup_latitude']>=40) &
            (df['pickup_latitude']<=42 )&
            (df['dropoff_latitude']>=40 )&
            (df['dropoff_latitude']<=42) &
            (df['passenger_count']>=1.0) &
            (df['passenger_count']<=6.0 )
  ]
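
The same filter can also be written in a table-driven way, which keeps the bounds in one place. The sketch below is one possible variant, not the notebook's method: `VALID_RANGES`, `remove_outliers_by_range` and `toy_df` are hypothetical names, and the toy rows are made up to show one valid ride and two invalid ones.

```python
import pandas as pd

# Hypothetical table of inclusive (low, high) bounds, copied from the ranges listed above.
VALID_RANGES = {
    'fare_amount': (1.0, 500.0),
    'pickup_longitude': (-75, -72),
    'dropoff_longitude': (-75, -72),
    'pickup_latitude': (40, 42),
    'dropoff_latitude': (40, 42),
    'passenger_count': (1, 6),
}

def remove_outliers_by_range(df, ranges=VALID_RANGES):
    # Start with every row kept, then AND in one inclusive range check per column.
    mask = pd.Series(True, index=df.index)
    for col, (low, high) in ranges.items():
        mask &= df[col].between(low, high)
    return df[mask]

# Toy data: row 0 is valid, row 1 has a negative fare, row 2 has a pickup longitude of 0.
toy_df = pd.DataFrame({
    'fare_amount': [7.5, -3.0, 12.0],
    'pickup_longitude': [-73.98, -73.95, 0.0],
    'dropoff_longitude': [-73.97, -73.96, -73.99],
    'pickup_latitude': [40.75, 40.76, 40.70],
    'dropoff_latitude': [40.76, 40.77, 40.71],
    'passenger_count': [1, 2, 3],
})
cleaned_df = remove_outliers_by_range(toy_df)
print(cleaned_df)
```

With this layout, tightening or loosening a bound means editing one dictionary entry rather than two comparison lines.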

In [None]:
train_df = remove_outliers(train_df)

In [None]:
val_df = remove_outliers(val_df)

# 7. Train and evaluate different models

We will be training each of the following models:
- Ridge Regression
- Random Forests
- Gradient Boosting

In [None]:
train_df.columns

In [None]:
input_cols = ['pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count', 'Year', 'Month', 'Day', 'WeekDay', 'Hour',
       'tripdistance', 'jfkdropoff_distance', 'lgadropoff_distance',
       'ewrdropoff_distance', 'metdropoff_distance', 'wtcdropoff_distance']

In [None]:
target_col = 'fare_amount'

In [None]:
train_inputs = train_df[input_cols]
train_targets = train_df[target_col]

In [None]:
val_inputs = val_df[input_cols]
val_targets = val_df[target_col]

In [None]:
test_inputs = test_df[input_cols]

In [None]:
def evaluate(model):
  train_preds = model.predict(train_inputs)
  train_rmse = rmse(train_targets, train_preds)
  val_preds = model.predict(val_inputs)
  val_rmse = rmse(val_targets, val_preds)
  return train_rmse, val_rmse, train_preds, val_preds

Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

In [None]:
model1 = Ridge(random_state = 42, alpha = 0.9)

In [None]:
model1.fit(train_inputs, train_targets)

In [None]:
evaluate(model1)

In [None]:
submit('ridge_submission.csv', model1, test_inputs)

Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
model2 = RandomForestRegressor(random_state = 42, n_jobs = -1, max_depth = 10 )

In [None]:
model2.fit(train_inputs, train_targets)

In [None]:
evaluate(model2)

In [None]:
submit('randomforest_submission.csv', model2, test_inputs)

Gradient Boosting

In [None]:
from xgboost import XGBRegressor

In [None]:
model3 = XGBRegressor(max_depth = 5,objective = 'reg:squarederror', n_estimators = 200, random_state = 42, n_jobs = -1)

In [None]:
model3.fit(train_inputs, train_targets)

In [None]:
evaluate(model3)

In [None]:
submit('xgb_submission.csv', model3, test_inputs)

# 8. Tune Hyperparameters for the XGBoost Model

In [None]:
import matplotlib.pyplot as plt

def test_params(ModelClass, **params):
    # train a model with the given hyperparameters and return (train RMSE, validation RMSE)
    model = ModelClass(**params).fit(train_inputs,train_targets)
    train_rmse = rmse(train_targets, model.predict(train_inputs))
    val_rmse = rmse(val_targets, model.predict(val_inputs))
    return (train_rmse, val_rmse)

def test_param_and_plot(ModelClass, param_name, param_values, **other_params):
    train_errors, val_errors = [], []
    for value in param_values:
      params = dict(other_params)
      params[param_name] = value
      train_rmse, val_rmse = test_params(ModelClass, **params)
      train_errors.append(train_rmse)
      val_errors.append(val_rmse)
    plt.figure(figsize = (10,6))
    plt.title('Overfitting curve:' + param_name)
    plt.plot(param_values, train_errors, 'b-o')
    plt.plot(param_values, val_errors, 'r-o')
    plt.xlabel(param_name)
    plt.ylabel('RMSE')
    plt.legend(['Training','Validation'])


In [None]:
best_params = {
    'random_state': 42,
    'n_jobs': -1,
    'objective': 'reg:squarederror'
}

(8.1) Number of Trees

In [None]:
test_param_and_plot(XGBRegressor, 'n_estimators', [100,200,300,400,500], **best_params)

400 estimators gives the lowest RMSE

In [None]:
# the XGBRegressor parameter is called n_estimators
best_params['n_estimators'] = 400

(8.2) Max Depth

In [None]:
test_param_and_plot(XGBRegressor, 'max_depth', [3,5,6,7,8,9,10,11,12,14,16,20,24,26,30,40], **best_params)

The max depth of 6 gives us the lowest rmse

In [None]:
best_params['max_depth'] = 6

(8.3) Learning Rate

In [None]:
test_param_and_plot(XGBRegressor, 'learning_rate', [0.05,0.1,0.2,0.3,0.4,0.6,0.7,0.8,1.0,1.2] ,**best_params)

The best learning rate seems to be 0.20

In [None]:
best_params['learning_rate'] = 0.2

In [None]:
jovian.commit()