<a href="https://colab.research.google.com/github/ahassanzadeh/Taxi_Fare_Prediction/blob/main/Taxi_Fare_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# loading data to Google Colab
# !wget -O train.csv https://www.dropbox.com/s/mnty1y72gweqjj1/train.csv?dl=0
# !wget -O test.csv https://www.dropbox.com/s/7cvc0s50u9350lo/test.csv?dl=0
# !wget -O sample_submission.csv https://www.dropbox.com/s/euh08kcj7khs89b/sample_submission.csv?dl=0

In [None]:
# Visualization package
# !pip install folium

In [None]:
# install bayesian-optimization package 
# !pip install bayesian-optimization


In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# from mpl_toolkits.basemap import Basemap

In [None]:
%%time
# To reduce the computation and memory allocation, just read 2M rows initially 
df_train= pd.read_csv('train.csv', nrows=2000000)
df_test = pd.read_csv('test.csv')

In [None]:
# transform the pickup_datetime column from object (string) to datetime format
df_train['pickup_datetime'] = df_train['pickup_datetime'].str.slice(0, 16)
df_train['pickup_datetime'] = pd.to_datetime(df_train['pickup_datetime'], utc=True, format='%Y-%m-%d %H:%M')

In [None]:
df_train.head()

In [None]:
df_train.dtypes

In [None]:
df_train.describe()

The `describe` output indicates that there are negative fare amounts, so this has to be corrected.

In [None]:
df_train.columns

In [None]:
df_train = df_train[df_train.fare_amount>=0]

In [None]:
# plot histogram of fare
plt.figure(figsize=(20, 3))
df_train[df_train.fare_amount<60].fare_amount.hist(bins=60)
plt.xlabel('Taxi Fare ($)')
plt.title('Histogram');

There are some unexpected spikes in the histogram at \$45, \$50 and \$55 that have to be investigated.
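
One way to start that investigation is to count how often each exact fare value occurs: a spike caused by a fixed-price fare would show up as one amount repeated many times. The sketch below uses a small made-up list of fares (not taken from the dataset) and only the standard library; the helper name `most_common_fares` is hypothetical. On the real data, `df_train.fare_amount.value_counts()` would serve the same purpose.

```python
from collections import Counter

# Hypothetical sample of fares, for illustration only (not real data)
sample_fares = [6.5, 45.0, 8.1, 45.0, 12.5, 45.0, 49.8, 50.0, 50.0, 9.0]

def most_common_fares(fares, low, high, top=3):
    """Return the `top` most frequent exact fares within [low, high]."""
    in_range = [f for f in fares if low <= f <= high]
    return Counter(in_range).most_common(top)

# Each entry is (fare, number of occurrences), most frequent first
print(most_common_fares(sample_fares, 40, 60))
```

If a few exact amounts dominate the counts, the spikes are more likely fixed prices than noise.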

In [None]:
# remove any possible null rows in the dataset 
df_train = df_train.dropna(how = 'any', axis = 'rows')

# Location Data 
As pickup and dropoff locations are critical, it is necessary to visualize the locations in the train and test data, find any outliers, and check whether their effect on the prediction is dominant.

In [None]:
# first find the maximum and minimum longitude and latitude in the test dataset
print('maximum latitude is ', max(df_test.pickup_latitude.max(), df_test.dropoff_latitude.max()))
print('minimum latitude is ', min(df_test.pickup_latitude.min(), df_test.dropoff_latitude.min()))
print('maximum longitude is ', max(df_test.pickup_longitude.max(), df_test.dropoff_longitude.max()))
print('minimum longitude is ', min(df_test.pickup_longitude.min(), df_test.dropoff_longitude.min()))

In [None]:
import folium
from folium.plugins import FastMarkerCluster

def plot_map(df, maxpoints=len(df_test)):
    # Center the map on the median pickup location
    # (named fmap to avoid shadowing the built-in map)
    fmap = folium.Map(location=[df["pickup_latitude"].median(), df["pickup_longitude"].median()], width='90%', height='90%', zoom_start=10)

    # Pickup locations in green, at most maxpoints markers
    for row in list(zip(df["pickup_latitude"].values, df["pickup_longitude"].values))[:maxpoints]:
        folium.CircleMarker(location=row, radius=2, weight=1, color='green').add_to(fmap)

    # Dropoff locations in red, at most maxpoints markers
    for row in list(zip(df["dropoff_latitude"].values, df["dropoff_longitude"].values))[:maxpoints]:
        folium.CircleMarker(location=row, radius=2, weight=1, color='red').add_to(fmap)

    return fmap

In [None]:
plot_map(df_train)

In [None]:
plot_map(df_test)


As the visualization of the train and test datasets shows, the location data in both sets follow a similar distribution. There is also a minority of outliers and wrong locations (for example, points in the water); however, as there are very few of them, they can be ignored at this stage.
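
To put a number on "very few", one could measure the share of points that fall outside a bounding box around the city. The sketch below is only an illustration: the coordinates are made up, the helper name `share_outside_box` is hypothetical, and the default limits (latitude 40 to 42, longitude -75 to -73) are the same ones used in the cleaning step of this notebook.

```python
def share_outside_box(points, lat_range=(40, 42), long_range=(-75, -73)):
    """Fraction of (latitude, longitude) points lying outside the given box."""
    outside = sum(
        1 for lat, lon in points
        if not (lat_range[0] <= lat <= lat_range[1]
                and long_range[0] <= lon <= long_range[1])
    )
    return outside / len(points)

# Hypothetical points for illustration: three near NYC, one at (0, 0)
sample_points = [(40.75, -73.99), (40.64, -73.78), (40.69, -74.17), (0.0, 0.0)]
print(share_outside_box(sample_points))
```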

# Feature Engineering 

We add features for distance and for the NYC taxi fare regulations.
## There are additional regulations for Taxis in NYC:

- The initial charge for most rides (excluding rides from JFK and other airports) is \$2.50 upon entry. After that, \$0.50 is charged per unit, where a unit is 1/5 of a mile when the taxicab is traveling at 12 miles per hour or more.
- An additional surcharge of \$0.50 applies between 8PM and 6AM.
- A peak-hour weekday surcharge of \$1 applies Monday-Friday between 4PM and 8PM.
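
The rules above can be combined into a rough fare estimate, which shows why distance and pickup time are useful features. This is a minimal sketch under simplifying assumptions: the function name `estimated_metered_fare` is hypothetical, the distance is assumed to be given in miles, and waiting time, tolls, tips and airport flat rates are ignored.

```python
from datetime import datetime

def estimated_metered_fare(distance_miles, pickup):
    """Rough fare from the rules listed above (ignores waiting time, tolls, tips)."""
    fare = 2.5                                   # initial charge
    units = int(distance_miles * 5)              # completed 1/5-mile units
    fare += 0.5 * units                          # $0.50 per unit
    if pickup.hour >= 20 or pickup.hour < 6:     # 8PM-6AM surcharge
        fare += 0.5
    if pickup.weekday() < 5 and 16 <= pickup.hour < 20:  # Mon-Fri, 4PM-8PM
        fare += 1.0
    return fare

# A 2-mile ride on a Wednesday at 5:30PM, and a 1-mile ride on a Saturday at 11PM
print(estimated_metered_fare(2.0, datetime(2015, 6, 3, 17, 30)))
print(estimated_metered_fare(1.0, datetime(2015, 6, 6, 23, 0)))
```

The two surcharges depend only on the hour and the weekday, which is what the time features later in this notebook encode as binary flags.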

## Cleaning Data

In [None]:
# Removing observations with erroneous values
limit = df_train['pickup_longitude'].between(-75, -73)
limit &= df_train['dropoff_longitude'].between(-75, -73)
limit &= df_train['pickup_latitude'].between(40, 42)
limit &= df_train['dropoff_latitude'].between(40, 42)
limit &= df_train['passenger_count'].between(0, 8)
limit &= df_train['fare_amount'].between(0, 250)

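# Keep the filtered rows in df_train so the cleaning is applied to the following steps
df_train = df_train[limit].copy()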

In [None]:
# Distance feature
# The best way to get the road distance between two locations is the Google Distance Matrix API
# (https://developers.google.com/maps/documentation/distance-matrix/overview).
# As it is not free, the next best thing is used here: the Manhattan distance between the coordinates.
# Note that this distance is in degrees of latitude/longitude, not in miles.
# It is rounded to 2 decimals, as a difference below 0.01 is treated as negligible.

def distance_func(pickup_lat, pickup_long, dropoff_lat, dropoff_long):  
    distance = np.abs(dropoff_lat - pickup_lat) + np.abs(dropoff_long - pickup_long)

    return round(distance,2)

# Feature-engineer the trip distance and the distance to 3 major airports in New York, in order to consider the airport effect on the fare forecast

def distance_features(data):

    # Extract date attributes and then drop the pickup_datetime column
    data['hour'] = data['pickup_datetime'].dt.hour
    data['day'] = data['pickup_datetime'].dt.day
    data['weekday'] = data['pickup_datetime'].dt.weekday
    data['month'] = data['pickup_datetime'].dt.month
    data['year'] = data['pickup_datetime'].dt.year
    data = data.drop('pickup_datetime', axis=1)

    # (longitude, latitude) of the city center and nearby airports
    NewYork = (-74.0063889, 40.72)
    JFK_airport = (-73.7822222222, 40.64)
    Neward_airport = (-74.175, 40.69)
    Laguardia_airport = (-73.87, 40.77)


    # Adding feature columns 
    data['distance'] = distance_func(data['pickup_latitude'], data['pickup_longitude'], data['dropoff_latitude'], data['dropoff_longitude'])

    data['distance_to_center_NewYork'] = distance_func(NewYork[1], NewYork[0],
                                          data['pickup_latitude'], data['pickup_longitude'])
    data['pickup_distance_to_JFK_airport'] = distance_func(JFK_airport[1], JFK_airport[0],
                                         data['pickup_latitude'], data['pickup_longitude'])
    data['dropoff_distance_to_JFK_airport'] = distance_func(JFK_airport[1], JFK_airport[0],
                                           data['dropoff_latitude'], data['dropoff_longitude'])
    data['pickup_distance_to_Neward_airport'] = distance_func(Neward_airport[1], Neward_airport[0], 
                                          data['pickup_latitude'], data['pickup_longitude'])
    data['dropoff_distance_to_Neward_airport'] = distance_func(Neward_airport[1], Neward_airport[0],
                                           data['dropoff_latitude'], data['dropoff_longitude'])
    data['pickup_distance_to_Laguardia_airport'] = distance_func(Laguardia_airport[1], Laguardia_airport[0],
                                          data['pickup_latitude'], data['pickup_longitude'])
    data['dropoff_distance_to_Laguardia_airport'] = distance_func(Laguardia_airport[1], Laguardia_airport[0],
                                           data['dropoff_latitude'], data['dropoff_longitude'])
    
    data['long_dist'] = abs(data['pickup_longitude'] - data['dropoff_longitude'])
    data['lat_dist'] = abs(data['pickup_latitude'] - data['dropoff_latitude'])
    
    data = data.dropna(how = 'any', axis = 'rows')

    return data
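
Because `distance_func` works directly on latitude and longitude differences, its result is in degrees. If a distance in miles were needed, the degree differences could be converted approximately. The sketch below is an assumption-laden illustration, not part of the pipeline: the function name `manhattan_miles` is hypothetical, one degree of latitude is taken as roughly 69 miles, and a degree of longitude is scaled by the cosine of the mean latitude.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # approximate value, assumed for this sketch

def manhattan_miles(pickup_lat, pickup_long, dropoff_lat, dropoff_long):
    """Approximate Manhattan distance in miles between two lat/long points."""
    mean_lat = math.radians((pickup_lat + dropoff_lat) / 2)
    lat_miles = abs(dropoff_lat - pickup_lat) * MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    long_miles = (abs(dropoff_long - pickup_long)
                  * MILES_PER_DEGREE_LAT * math.cos(mean_lat))
    return lat_miles + long_miles

# 0.1 degrees of longitude at the latitude of New York is a little over 5 miles
print(round(manhattan_miles(40.7, -74.0, 40.7, -73.9), 2))
```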



In [None]:
# Adding distance features to the dataset  
df_train_new = distance_features(df_train)

In [None]:
# remove data points with a distance below 0.01 (too close);
# distance_func returns the distance in degrees of latitude/longitude
idx = (df_train_new['distance'] >= 0.01)
df_train_new = df_train_new[idx].copy()

In [None]:
# adding time features

# binary flag for the 8PM-6AM surcharge (hours 20-23 and 0-5)
night = (df_train_new['hour'] >= 20) | (df_train_new['hour'] < 6)
df_train_new['daily_subcharge'] = night.astype('int')

# binary flag for the peak-hour weekday surcharge of $1, Monday-Friday between 4PM-8PM
# (weekday is 0 for Monday through 4 for Friday; 'day' is the day of the month)
peak = df_train_new['hour'].between(16, 19) & (df_train_new['weekday'] <= 4)
df_train_new['weekday_subcharge'] = peak.astype('int')

# Training and Cross Validation 

In [None]:
# Training library
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from bayes_opt import BayesianOptimization
from sklearn.metrics import mean_squared_error

In [None]:
X, y = df_train_new.drop('fare_amount', axis = 1), df_train_new['fare_amount']

In [None]:
# find correlation of training data 
plt.figure(figsize = (20, 10))
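# X still contains the string column 'key', so restrict the correlation to numeric columns
sns.heatmap(X.select_dtypes(include='number').corr(), annot = True, cmap="YlGnBu")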
plt.show()

In [None]:
df_train_new = df_train_new.drop(['key'], axis=1)

In [None]:
df_train_new.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train_new.drop('fare_amount', axis=1), df_train_new['fare_amount'], test_size=0.2)

In [None]:
del(df_train_new)
dtrain = xgb.DMatrix(X_train, label=y_train)
del(X_train)
dtest = xgb.DMatrix(X_test)
del(X_test)

In [None]:
def evaluate(max_depth, gamma, colsample_bytree):
    params = {'eval_metric': 'rmse',
              'max_depth': int(max_depth),
              'subsample': 0.8,
              'eta': 0.1,
              'gamma': gamma,
              'colsample_bytree': colsample_bytree}
    # Used around 1000 boosting rounds in the full model
    cv_result = xgb.cv(params, dtrain, num_boost_round=100, nfold=3)    
    
    # Bayesian optimization only knows how to maximize, not minimize, so return the negative RMSE
    return -1.0 * cv_result['test-rmse-mean'].iloc[-1]
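
The sign flip in `evaluate` can be illustrated on a toy problem: maximizing the negative of a loss finds the same point as minimizing the loss. In the sketch below, a simple random search stands in for the Bayesian optimizer and a quadratic stands in for the cross-validated RMSE; both `loss` and `maximize` are hypothetical helpers, not part of `bayes_opt`.

```python
import random

def loss(x):
    """Toy loss with its minimum at x = 2 (stands in for the CV RMSE)."""
    return (x - 2.0) ** 2

def maximize(objective, low, high, n_trials=2000, seed=0):
    """Random-search maximizer; a stand-in for a maximize-only optimizer."""
    rng = random.Random(seed)
    best_x = low
    best_val = objective(low)
    for _ in range(n_trials):
        x = rng.uniform(low, high)
        val = objective(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Maximize the negative loss, then flip the sign back to report the loss
best_x, best_val = maximize(lambda x: -loss(x), 0.0, 5.0)
print(best_x, -best_val)
```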

In [None]:
xgb_boost = BayesianOptimization(evaluate, {'max_depth': (3, 7), 
                                             'gamma': (0, 1),
                                             'colsample_bytree': (0.3, 0.9)})
xgb_boost.maximize(init_points=3, n_iter=5, acq='ei')

In [None]:
# Extract the parameters of the best model found by the search for training xgboost.
params = xgb_boost.max['params']
params['max_depth'] = int(params['max_depth'])

# Testing

In [None]:
# Train a new model with the best parameters from the search
model_xgboost = xgb.train(params, dtrain, num_boost_round=250)

# Predict on testing and training set
y_pred = model_xgboost.predict(dtest)
y_train_pred = model_xgboost.predict(dtrain)

# Report testing and training RMSE
print(np.sqrt(mean_squared_error(y_test, y_pred)))
print(np.sqrt(mean_squared_error(y_train, y_train_pred)))

In [None]:
# Evaluation of the model
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure(figsize=(20,10))
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)

# Feature Importance 

In [None]:
fscores = pd.DataFrame({'X': list(model_xgboost.get_fscore().keys()), 'Y': list(model_xgboost.get_fscore().values())})
fscores.sort_values(by='Y').plot.bar(x='X')

# Predict on the given test dataset 

In [None]:
test = pd.read_csv('test.csv').set_index('key')
test['pickup_datetime'] = test['pickup_datetime'].str.slice(0, 16)
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'], utc=True, format='%Y-%m-%d %H:%M')

# Add the same date and distance features that were used for training
test = distance_features(test)


In [None]:
# adding time features

# binary flag for the 8PM-6AM surcharge (hours 20-23 and 0-5)
night = (test['hour'] >= 20) | (test['hour'] < 6)
test['daily_subcharge'] = night.astype('int')

# binary flag for the peak-hour weekday surcharge of $1, Monday-Friday between 4PM-8PM
# (weekday is 0 for Monday through 4 for Friday; 'day' is the day of the month)
peak = test['hour'].between(16, 19) & (test['weekday'] <= 4)
test['weekday_subcharge'] = peak.astype('int')

In [None]:
dtest = xgb.DMatrix(test)
y_pred_test = model_xgboost.predict(dtest)

# Submission

In [None]:
holdout = pd.DataFrame({'key': test.index, 'fare_amount': y_pred_test})
holdout.to_csv('submission.csv', index=False)