# Getting the dataset

## Converting the data to `parquet` format

In [None]:
!pip install opendatasets --upgrade --quiet

In [None]:
import opendatasets as od

dataset_url = "https://www.kaggle.com/c/new-york-city-taxi-fare-prediction"

od.download(dataset_url)

Downloading new-york-city-taxi-fare-prediction.zip to ./new-york-city-taxi-fare-prediction


100%|██████████| 1.56G/1.56G [00:37<00:00, 44.4MB/s]



Extracting archive ./new-york-city-taxi-fare-prediction/new-york-city-taxi-fare-prediction.zip to ./new-york-city-taxi-fare-prediction


In [None]:
data_path = 'new-york-city-taxi-fare-prediction'

In [None]:
!ls -lh {data_path}

total 5.4G
-rw-r--r-- 1 root root  486 Aug  8 05:39 GCP-Coupons-Instructions.rtf
-rw-r--r-- 1 root root 336K Aug  8 05:39 sample_submission.csv
-rw-r--r-- 1 root root 960K Aug  8 05:39 test.csv
-rw-r--r-- 1 root root 5.4G Aug  8 05:40 train.csv


In [None]:
!wc -l {data_path}/train.csv

55423856 new-york-city-taxi-fare-prediction/train.csv


In [None]:
!wc -l {data_path}/test.csv

9914 new-york-city-taxi-fare-prediction/test.csv


In [None]:
selected_cols = 'fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count'.split(',')
selected_cols

['fare_amount',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']

In [None]:
dtypes = {'fare_amount': 'float32',
 'pickup_longitude': 'float32',
 'pickup_latitude': 'float32',
 'dropoff_longitude': 'float32',
 'dropoff_latitude': 'float32',
 'passenger_count': 'uint8'}

In [None]:
import random
random.random()

0.89370233537046

In [None]:
sample_fraction = 0.05
random.seed(42)  # seed before sampling so the same rows are picked on every run

def skip_row(row_idx):
  '''
  Return True if the row at row_idx should be skipped.
  The header (row 0) is always kept. Every other row is kept with
  probability sample_fraction, the share of the records we use
  for our project.
  '''
  if row_idx == 0:
    return False
  return random.random() > sample_fraction

#### We will read around 5% of the dataset and save it in `parquet` format to make analysis easier. This way we don't have to load the whole CSV every time; we work from the `parquet` file instead.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("/content/new-york-city-taxi-fare-prediction/train.csv", usecols=selected_cols,
                 dtype=dtypes, 
                 parse_dates=["pickup_datetime"],
                 skiprows=skip_row)
df.to_parquet("nyc-taxi-fare-prediction.parquet", engine="pyarrow", compression=None)
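
The sampling idea above can be checked on a small in-memory CSV before running it on the full file. The sketch below is only an illustration: `make_skip_row` and the toy columns `a` and `b` are hypothetical names, and a private `random.Random` generator is used instead of the module-level seed so the sample does not depend on other calls to `random`.

```python
import io
import random

import pandas as pd


def make_skip_row(sample_fraction, seed=42):
    """Return a skiprows callable that keeps roughly sample_fraction of the data rows."""
    rng = random.Random(seed)  # private generator, so the sample is reproducible

    def skip_row(row_idx):
        if row_idx == 0:
            return False  # row 0 is the header; never skip it
        return rng.random() > sample_fraction

    return skip_row


# Toy CSV with 10,000 data rows standing in for train.csv
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10_000))
sample = pd.read_csv(io.StringIO(csv_text), skiprows=make_skip_row(0.05))

print(list(sample.columns), len(sample))
```

The number of rows kept should land near 5% of 10,000, and the column names survive because the header row is never skipped.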

## Facts about the dataset

- This is a supervised learning regression problem.
- Training data is 5.4 GB in size and has about 55.4 million rows of records.
- The test set is much smaller, with only 9914 rows. Since the test set does not include the target variable `fare_amount`, we will handle the training and test sets differently.
- The training set has 8 columns:
  - key (a unique identifier).
  - fare_amount (target_column).
  - pickup_datetime
  - pickup_longitude
  - pickup_latitude
  - dropoff_longitude
  - dropoff_latitude
  - passenger_count


# Loading and Exploring the data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

In [None]:
df = pd.read_parquet("/content/nyc-taxi-fare-prediction.parquet", engine="pyarrow")

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2769960 entries, 0 to 2769959
Data columns (total 7 columns):
 #   Column             Dtype              
---  ------             -----              
 0   fare_amount        float32            
 1   pickup_datetime    datetime64[ns, UTC]
 2   pickup_longitude   float32            
 3   pickup_latitude    float32            
 4   dropoff_longitude  float32            
 5   dropoff_latitude   float32            
 6   passenger_count    uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 76.6 MB


In [None]:
df.isna().sum()

fare_amount           0
pickup_datetime       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude    16
dropoff_latitude     16
passenger_count       0
dtype: int64

In [None]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,2769960.0,2769960.0,2769960.0,2769944.0,2769944.0,2769960.0
mean,11.35242,-72.50467,39.91627,-72.50388,39.91397,1.685441
std,9.850647,12.72167,10.37595,13.00545,10.44149,1.320807
min,-300.0,-3439.245,-3492.264,-3367.929,-3483.855,0.0
25%,6.0,-73.99205,40.73492,-73.99141,40.73399,1.0
50%,8.5,-73.9818,40.75264,-73.98017,40.75312,1.0
75%,12.5,-73.96708,40.76711,-73.96368,40.76808,2.0
max,1273.31,3442.185,3376.602,3442.185,3351.403,208.0


## Initial Observations from the data
- About 2.77 million rows of records.
- There are 16 values missing in `dropoff_longitude` and `dropoff_latitude`. We will simply remove the rows that contain null values, since they are a tiny fraction of the records.
- Dates range from 1 January 2009 to 30 June 2015.
- We may need to deal with outliers and data entry errors before we train our model.

## How to split into train and test sets?
We will divide the dataset into train and test sets by year, so let's first add some attributes derived from the datetime value. The resulting validation and test sets will be used to simulate production: data from the "future" is passed to the deployed model to see how it performs.

In [7]:
def add_dateparts(df, col):
  df[col +'_year'] = df[col].dt.year
  df[col +'_month'] = df[col].dt.month
  df[col +'_day'] = df[col].dt.day
  df[col +'_weekday'] = df[col].dt.weekday
  df[col +'_hour'] = df[col].dt.hour

In [None]:
col = 'pickup_datetime'
df[col].dt.year

0          2010
1          2009
2          2009
3          2014
4          2011
           ... 
2769955    2013
2769956    2014
2769957    2010
2769958    2013
2769959    2010
Name: pickup_datetime, Length: 2769960, dtype: int64

In [None]:
add_dateparts(df, 'pickup_datetime')

In [None]:
df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_weekday', 'pickup_datetime_hour'],
      dtype='object')

In [None]:
df["pickup_datetime_year"].value_counts()

2012    445806
2011    442430
2013    432968
2009    426683
2010    417164
2014    412709
2015    192200
Name: pickup_datetime_year, dtype: int64

## Train test split
Years `2009-2012` -> train set
#### We will use the data for 2013, 2014 and 2015 to simulate a production environment or for model retraining.

In [None]:
df_2009 = df.loc[df["pickup_datetime_year"]==2009,].copy()
df_2010 = df.loc[df["pickup_datetime_year"]==2010,].copy()
df_2011 = df.loc[df["pickup_datetime_year"]==2011,].copy()
df_2012 = df.loc[df["pickup_datetime_year"]==2012,].copy()

df_2013 = df.loc[df["pickup_datetime_year"]==2013,].copy()

df_2014 = df.loc[df["pickup_datetime_year"]==2014,].copy()
df_2015 = df.loc[df["pickup_datetime_year"]==2015,].copy()

In [None]:
df_2009_2010 = pd.concat([df_2009,df_2010], axis=0)
df_2011_2010_2009 = pd.concat([df_2009_2010, df_2011], axis=0)
train_df = pd.concat([df_2011_2010_2009, df_2012], axis=0)
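
The same year-based split can also be written with a single boolean mask instead of per-year frames and repeated `pd.concat` calls. The sketch below uses a toy frame and hypothetical names (`train_years`, `later_parts`). Unlike the concatenation above, it keeps the rows in their original order rather than grouping them by year.

```python
import pandas as pd

# Toy frame standing in for df after add_dateparts
toy = pd.DataFrame(
    {
        "pickup_datetime_year": [2009, 2010, 2011, 2012, 2013, 2014, 2015, 2012],
        "fare_amount": [5.7, 7.7, 11.7, 8.5, 6.0, 12.5, 9.0, 4.5],
    }
)

train_years = [2009, 2010, 2011, 2012]
is_train = toy["pickup_datetime_year"].isin(train_years)

train_part = toy[is_train].copy()

# One frame per later year, keyed by year
later_parts = {
    year: part.copy()
    for year, part in toy[~is_train].groupby("pickup_datetime_year")
}

print(len(train_part), sorted(later_parts))
```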

In [None]:
len(train_df)

1732083

In [None]:
train_df.to_parquet("train_set.parquet", engine="pyarrow", compression=None)

In real life, when we retrain models on new data, these date features have to be added to the new dataset as well. To keep things realistic, we will write out the data for 2013, 2014 and 2015 in `parquet` format without the date features, which we added above only in order to split the dataset by year.

In [15]:
date_cols = ["pickup_datetime_year","pickup_datetime_month","pickup_datetime_day","pickup_datetime_weekday","pickup_datetime_hour"]
for year_df in (df_2013, df_2014, df_2015):
  year_df.drop(columns=date_cols, inplace=True)

In [16]:
df_2013.to_parquet("data_2013.parquet", engine="pyarrow", compression=None)
df_2014.to_parquet("data_2014.parquet", engine="pyarrow", compression=None)
df_2015.to_parquet("data_2015.parquet", engine="pyarrow", compression=None)
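
Because the date parts are derived entirely from `pickup_datetime`, dropping them loses no information. The sketch below illustrates this on a toy frame: it selects the derived columns by their shared prefix, drops them, and then regenerates them. The toy data and the names `derived_cols`, `stripped` and `restored` are assumptions made for the example.

```python
import pandas as pd


def add_dateparts(df, col):
    # Same helper as in the notebook: adds the date parts in place
    df[col + "_year"] = df[col].dt.year
    df[col + "_month"] = df[col].dt.month
    df[col + "_day"] = df[col].dt.day
    df[col + "_weekday"] = df[col].dt.weekday
    df[col + "_hour"] = df[col].dt.hour


raw = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(["2013-03-01 08:15:00", "2014-07-04 21:40:00"], utc=True),
        "fare_amount": [8.0, 13.1],
    }
)

with_parts = raw.copy()
add_dateparts(with_parts, "pickup_datetime")

# Derived columns all share the "pickup_datetime_" prefix
derived_cols = [c for c in with_parts.columns if c.startswith("pickup_datetime_")]
stripped = with_parts.drop(columns=derived_cols)

# Re-adding the date parts recovers exactly the same frame
restored = stripped.copy()
add_dateparts(restored, "pickup_datetime")
print(restored.equals(with_parts))
```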

## Let's work with the train set

In [None]:
train_df = pd.read_parquet("train_set.parquet", engine="pyarrow")

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1732083 entries, 1 to 2769951
Data columns (total 12 columns):
 #   Column                   Dtype              
---  ------                   -----              
 0   fare_amount              float32            
 1   pickup_datetime          datetime64[ns, UTC]
 2   pickup_longitude         float32            
 3   pickup_latitude          float32            
 4   dropoff_longitude        float32            
 5   dropoff_latitude         float32            
 6   passenger_count          uint8              
 7   pickup_datetime_year     int64              
 8   pickup_datetime_month    int64              
 9   pickup_datetime_day      int64              
 10  pickup_datetime_weekday  int64              
 11  pickup_datetime_hour     int64              
dtypes: datetime64[ns, UTC](1), float32(5), int64(5), uint8(1)
memory usage: 127.2 MB


In [None]:
train_df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour
count,1732083.0,1732083.0,1732083.0,1732074.0,1732074.0,1732083.0,1732083.0,1732083.0,1732083.0,1732083.0,1732083.0
mean,10.48085,-72.5061,39.94386,-72.51649,39.94347,1.676351,2010.524,6.501521,15.76115,3.031379,13.4995
std,8.76688,13.738,11.78713,14.20628,12.10593,1.28551,1.121105,3.441698,8.663839,1.944523,6.514927
min,-50.0,-3439.245,-3492.264,-3367.929,-3483.855,0.0,2009.0,1.0,1.0,0.0,0.0
25%,5.7,-73.99203,40.73509,-73.99142,40.7342,1.0,2010.0,4.0,8.0,1.0,9.0
50%,7.7,-73.9818,40.75279,-73.98026,40.75323,1.0,2011.0,6.0,16.0,3.0,14.0
75%,11.7,-73.96725,40.76716,-73.96414,40.76806,2.0,2012.0,10.0,23.0,5.0,19.0
max,500.0,3442.185,3376.602,3442.185,3351.403,208.0,2012.0,12.0,31.0,6.0,23.0


In [None]:
train_df.isna().sum()

fare_amount                0
pickup_datetime            0
pickup_longitude           0
pickup_latitude            0
dropoff_longitude          9
dropoff_latitude           9
passenger_count            0
pickup_datetime_year       0
pickup_datetime_month      0
pickup_datetime_day        0
pickup_datetime_weekday    0
pickup_datetime_hour       0
dtype: int64

#### The full sample had 16 null values in these columns. Nine of them are in the train set, so the remaining 7 fall in the 2013-2015 data.

In [None]:
train_df.loc[train_df.dropoff_longitude.isna(),]

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour
977790,10.32,2011-12-10 08:55:32+00:00,-73.991989,40.759422,,,0,2011,12,10,5,8
1486196,19.92,2011-11-14 13:24:03+00:00,-73.979828,40.765251,,,0,2011,11,14,0,13
2634367,8.5,2011-06-30 11:57:47+00:00,-73.9618,40.764301,,,0,2011,6,30,3,11
666100,8.0,2012-12-11 12:21:57+00:00,-73.977684,40.757519,,,0,2012,12,11,1,12
955084,21.6,2012-12-11 13:01:46+00:00,-73.999222,40.734257,,,0,2012,12,11,1,13
1065485,17.5,2012-12-11 12:09:48+00:00,-74.005547,40.726398,,,0,2012,12,11,1,12
2081500,22.200001,2012-12-13 14:45:57+00:00,0.0,0.0,,,0,2012,12,13,3,14
2310119,13.8,2012-12-11 12:23:37+00:00,-73.990158,40.751694,,,0,2012,12,11,1,12
2608390,13.1,2012-12-11 12:49:04+00:00,-73.999596,40.762009,,,0,2012,12,11,1,12


In [None]:
train_df.loc[train_df.dropoff_longitude.isna(),].describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour
count,9.0,9.0,9.0,0.0,0.0,9.0,9.0,9.0,9.0,9.0,9.0
mean,14.993334,-65.767311,36.224541,,,0.0,2011.666667,11.222222,13.555556,1.777778,11.888889
std,5.523342,24.662746,13.584209,,,0.0,0.5,1.986063,6.287112,1.563472,1.691482
min,8.0,-74.005547,0.0,,,0.0,2011.0,6.0,10.0,0.0,8.0
25%,10.32,-73.999222,40.734257,,,0.0,2011.0,12.0,11.0,1.0,12.0
50%,13.8,-73.990158,40.757519,,,0.0,2012.0,12.0,11.0,1.0,12.0
75%,19.92,-73.977684,40.762009,,,0.0,2012.0,12.0,13.0,3.0,13.0
max,22.200001,0.0,40.765251,,,0.0,2012.0,12.0,30.0,5.0,14.0


## Remove Null values from `dropoff_longitude` and `dropoff_latitude`.

In [None]:
train_df1 = train_df.dropna()

In [None]:
train_df1.isna().sum()

fare_amount                0
pickup_datetime            0
pickup_longitude           0
pickup_latitude            0
dropoff_longitude          0
dropoff_latitude           0
passenger_count            0
pickup_datetime_year       0
pickup_datetime_month      0
pickup_datetime_day        0
pickup_datetime_weekday    0
pickup_datetime_hour       0
dtype: int64

## Removing outliers in `pickup_longitude` `pickup_latitude` and `dropoff_longitude` `dropoff_latitude` and `passenger_count`.

#### Setting up lower and higher ranges for filtering out outliers

In [None]:
pickup_lon_low = train_df1["pickup_longitude"].quantile(0.001)
pickup_lon_high = train_df1["pickup_longitude"].quantile(0.999)

pickup_lat_low = train_df1["pickup_latitude"].quantile(0.001)
pickup_lat_high = train_df1["pickup_latitude"].quantile(0.999)

dropoff_lon_low = train_df1["dropoff_longitude"].quantile(0.001)
dropoff_lon_high = train_df1["dropoff_longitude"].quantile(0.999)

dropoff_lat_low = train_df1["dropoff_latitude"].quantile(0.001)
dropoff_lat_high = train_df1["dropoff_latitude"].quantile(0.999)

In [None]:
# Build each mask from the frame being filtered so the indexes always match
train_df2 = train_df1.loc[(train_df1["pickup_longitude"]>pickup_lon_low)&(train_df1["pickup_longitude"]<pickup_lon_high),]
train_df3 = train_df2.loc[(train_df2["pickup_latitude"]>pickup_lat_low)&(train_df2["pickup_latitude"]<pickup_lat_high),]
train_df4 = train_df3.loc[(train_df3["dropoff_longitude"]>dropoff_lon_low)&(train_df3["dropoff_longitude"]<dropoff_lon_high),]
train_df_mod = train_df4.loc[(train_df4["dropoff_latitude"]>dropoff_lat_low)&(train_df4["dropoff_latitude"]<dropoff_lat_high),]
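
The four filtering steps above follow one pattern, so they could be folded into a single helper. The sketch below is one possible way to do that; `filter_by_quantiles` is a hypothetical name, and the toy coordinates are randomly generated with two planted outliers. Like the cells above, it computes every bound on the unfiltered frame and keeps only rows strictly inside all of them.

```python
import numpy as np
import pandas as pd


def filter_by_quantiles(df, cols, low=0.001, high=0.999):
    """Keep rows whose value lies strictly between the low and high quantile in every column."""
    mask = pd.Series(True, index=df.index)
    for c in cols:
        lo, hi = df[c].quantile([low, high])  # bounds from the unfiltered frame
        mask &= (df[c] > lo) & (df[c] < hi)
    return df[mask].copy()  # copy, so later column assignments do not warn


# Toy coordinates around New York with two planted data entry errors
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "pickup_longitude": rng.normal(-73.97, 0.02, 1000),
        "pickup_latitude": rng.normal(40.75, 0.02, 1000),
    }
)
toy.loc[0, "pickup_longitude"] = 3442.185
toy.loc[1, "pickup_latitude"] = -3492.264

filtered = filter_by_quantiles(toy, ["pickup_longitude", "pickup_latitude"])
print(len(toy), len(filtered))
```

Returning a copy is a design choice: it avoids the `SettingWithCopyWarning` that pandas can emit when new columns are assigned to a filtered slice.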

In [None]:
train_df_mod.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour
count,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0
mean,10.40812,-73.96602,40.74825,-73.96556,40.74826,1.675293,2010.52,6.501468,15.76043,3.031291,13.5024
std,8.441167,0.8199587,0.3184218,0.7960032,0.3316749,1.274519,1.122122,3.442358,8.664017,1.944582,6.513868
min,-50.0,-74.06762,1e-05,-74.17792,8e-06,0.0,2009.0,1.0,1.0,0.0,0.0
25%,5.7,-73.99222,40.7367,-73.99157,40.73583,1.0,2010.0,4.0,8.0,1.0,9.0
50%,7.7,-73.98209,40.75349,-73.98069,40.75391,1.0,2011.0,6.0,16.0,3.0,14.0
75%,11.7,-73.96849,40.76753,-73.96587,40.76833,2.0,2012.0,10.0,23.0,5.0,19.0
max,500.0,-5e-06,40.87978,-2e-06,40.90434,6.0,2012.0,12.0,31.0,6.0,23.0


#### Let's round the coordinate values to make the `describe` output easier to read

In [None]:
train_df_mod[["pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude"]]= train_df_mod[["pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude"]].round(4)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


In [None]:
train_df_mod.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour
count,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0,1692031.0
mean,10.40812,-73.96602,40.74825,-73.96556,40.74826,1.675293,2010.52,6.501468,15.76043,3.031291,13.5024
std,8.441167,0.8199586,0.3184218,0.7960032,0.3316749,1.274519,1.122122,3.442358,8.664017,1.944582,6.513868
min,-50.0,-74.0676,0.0,-74.1779,0.0,0.0,2009.0,1.0,1.0,0.0,0.0
25%,5.7,-73.9922,40.7367,-73.9916,40.7358,1.0,2010.0,4.0,8.0,1.0,9.0
50%,7.7,-73.9821,40.7535,-73.9807,40.7539,1.0,2011.0,6.0,16.0,3.0,14.0
75%,11.7,-73.9685,40.7675,-73.9659,40.7683,2.0,2012.0,10.0,23.0,5.0,19.0
max,500.0,-0.0,40.8798,-0.0,40.9043,6.0,2012.0,12.0,31.0,6.0,23.0


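#### Outlier in `passenger_count`
The maximum `passenger_count` dropped from 208 to 6, so the outlier must have been filtered out along with the unusual coordinate values. The row in question, shown below, has all of its coordinates set to 0.0.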

In [None]:
index = train_df.loc[train_df["passenger_count"]==208,].index

In [None]:
index

Int64Index([145519], dtype='int64')

In [None]:
train_df.loc[145519,]

fare_amount                                      4.5
pickup_datetime            2010-12-16 06:44:00+00:00
pickup_longitude                                 0.0
pickup_latitude                                  0.0
dropoff_longitude                                0.0
dropoff_latitude                                 0.0
passenger_count                                  208
pickup_datetime_year                            2010
pickup_datetime_month                             12
pickup_datetime_day                               16
pickup_datetime_weekday                            3
pickup_datetime_hour                               6
Name: 145519, dtype: object

#### Outliers in the target variable `fare_amount`

In [None]:
train_df_mod.fare_amount.describe()

count    1.692031e+06
mean     1.040812e+01
std      8.441167e+00
min     -5.000000e+01
25%      5.700000e+00
50%      7.700000e+00
75%      1.170000e+01
max      5.000000e+02
Name: fare_amount, dtype: float64

In [None]:
train_df_mod1 = train_df_mod.loc[train_df_mod.fare_amount>0,]

In [None]:
train_df_mod1.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour
count,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0
mean,10.4088,-73.96602,40.74824,-73.96555,40.74826,1.675295,2010.52,6.501614,15.76037,3.031287,13.50245
std,8.440215,0.8199735,0.3184275,0.7960176,0.3316809,1.274521,1.122138,3.442335,8.664081,1.944572,6.513853
min,0.01,-74.0676,0.0,-74.1779,0.0,0.0,2009.0,1.0,1.0,0.0,0.0
25%,5.7,-73.9922,40.7367,-73.9916,40.7358,1.0,2010.0,4.0,8.0,1.0,9.0
50%,7.7,-73.9821,40.7535,-73.9807,40.7539,1.0,2011.0,6.0,16.0,3.0,14.0
75%,11.7,-73.9685,40.7675,-73.9659,40.7683,2.0,2012.0,10.0,23.0,5.0,19.0
max,500.0,-0.0,40.8798,-0.0,40.9043,6.0,2012.0,12.0,31.0,6.0,23.0


# Feature Engineering


## Adding a new feature
#### We add a new feature, `trip_distance`, which gives the distance covered by each ride. A helper function converts the pickup and dropoff coordinates into the great-circle distance between the two points.

In [None]:
def haversine_np(lon1, lat1, lon2, lat2):
  '''
  Great-circle distance in km between two points on the earth
  (specified in decimal degrees). Accepts scalars or arrays;
  array arguments must all be of equal length.
  '''
  # convert decimal degrees to radians
  lon1, lat1, lon2, lat2  = map(np.radians, [lon1, lat1, lon2, lat2])

  dlon = lon2 - lon1
  dlat = lat2 - lat1

  # haversine formula
  a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

  c = 2 * np.arcsin(np.sqrt(a))
  km = 6367 * c  # 6367 km is the earth radius used here
  return km
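
A few hand-checkable cases help confirm that the function behaves as expected. The sketch below repeats the function so it runs on its own. One degree of latitude should come out at about 111 km for this radius, and identical points should give 0. The Midtown and JFK coordinates are approximate values chosen only for illustration.

```python
import numpy as np


def haversine_np(lon1, lat1, lon2, lat2):
    # Same formula as in the notebook (earth radius 6367 km)
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a))


# One degree of latitude along the same meridian
one_degree = haversine_np(0.0, 0.0, 0.0, 1.0)

# Identical pickup and dropoff points
same_point = haversine_np(-73.98, 40.75, -73.98, 40.75)

# Approximate Midtown Manhattan -> JFK airport coordinates (illustrative)
midtown_to_jfk = haversine_np(-73.9855, 40.7580, -73.7781, 40.6413)

# The function is vectorised: arrays in, array out
zeros = np.zeros(2)
both = haversine_np(zeros, zeros, zeros, np.array([1.0, 2.0]))

print(one_degree, same_point, midtown_to_jfk, both)
```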

In [None]:
def add_trip_distance(df):
  df['trip_distance'] = haversine_np(df['pickup_longitude'],
                                      df['pickup_latitude'],
                                      df['dropoff_longitude'],
                                      df['dropoff_latitude'])

In [None]:
add_trip_distance(train_df_mod1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


In [None]:
train_df_mod1.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour,trip_distance
count,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0,1691969.0
mean,10.4088,-73.96602,40.74824,-73.96555,40.74826,1.675295,2010.52,6.501614,15.76037,3.031287,13.50245,4.833371
std,8.440215,0.8199735,0.3184275,0.7960176,0.3316809,1.274521,1.122138,3.442335,8.664081,1.944572,6.513853,91.90954
min,0.01,-74.0676,0.0,-74.1779,0.0,0.0,2009.0,1.0,1.0,0.0,0.0,0.0
25%,5.7,-73.9922,40.7367,-73.9916,40.7358,1.0,2010.0,4.0,8.0,1.0,9.0,1.245078
50%,7.7,-73.9821,40.7535,-73.9807,40.7539,1.0,2011.0,6.0,16.0,3.0,14.0,2.14263
75%,11.7,-73.9685,40.7675,-73.9659,40.7683,2.0,2012.0,10.0,23.0,5.0,19.0,3.873166
max,500.0,-0.0,40.8798,-0.0,40.9043,6.0,2012.0,12.0,31.0,6.0,23.0,8615.8


# Define inputs and outputs for model

## First check the correlation of columns with target variable `fare_amount`.

In [None]:
train_df_mod1.corr()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour,trip_distance
fare_amount,1.0,0.020592,-0.017504,0.015584,-0.016543,0.014242,0.046176,0.044848,0.001635,0.006091,-0.017488,0.034744
pickup_longitude,0.020592,1.0,-0.280249,0.21287,-0.256064,0.003693,0.008948,0.00092,-0.000592,-0.000768,0.001136,0.580933
pickup_latitude,-0.017504,-0.280249,1.0,-0.27415,0.397163,-0.004398,-0.005922,0.001245,-0.000877,-0.003293,0.001667,-0.231677
dropoff_longitude,0.015584,0.21287,-0.27415,1.0,-0.270736,0.003866,0.009044,-0.000812,0.000155,-0.000201,-0.000281,0.554258
dropoff_latitude,-0.016543,-0.256064,0.397163,-0.270736,1.0,-0.003293,-0.005564,-0.000195,-0.00097,-0.001181,0.000615,-0.249297
passenger_count,0.014242,0.003693,-0.004398,0.003866,-0.003293,1.0,-0.00397,0.00902,0.004551,0.040673,0.019075,0.00532
pickup_datetime_year,0.046176,0.008948,-0.005922,0.009044,-0.005564,-0.00397,1.0,-0.013893,-0.005841,0.003243,-0.002963,0.016623
pickup_datetime_month,0.044848,0.00092,0.001245,-0.000812,-0.000195,0.00902,-0.013893,1.0,-0.024994,-0.006248,-0.005966,-0.000597
pickup_datetime_day,0.001635,-0.000592,-0.000877,0.000155,-0.00097,0.004551,-0.005841,-0.024994,1.0,0.004743,-0.000845,-0.001318
pickup_datetime_weekday,0.006091,-0.000768,-0.003293,-0.000201,-0.001181,0.040673,0.003243,-0.006248,0.004743,1.0,-0.089766,-0.000377


#### None of the columns shows much correlation with the target variable. So we will start out training models with `passenger_count`, `pickup_datetime_month`, `pickup_datetime_year`, `pickup_datetime_hour` and `trip_distance` as input columns.
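
When only the relationship with the target matters, it can be easier to read a single column of the correlation matrix, sorted by absolute value. The sketch below does this on synthetic data, where the fare is deliberately constructed from the distance, so the numbers say nothing about the real dataset. On a frame that still contains a datetime column, newer pandas releases may additionally need `numeric_only=True` in `corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic data: fare depends on distance, passenger_count is pure noise
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame(
    {
        "trip_distance": rng.uniform(0.5, 20, n),
        "passenger_count": rng.integers(1, 7, n),
    }
)
toy["fare_amount"] = 2.5 + 1.6 * toy["trip_distance"] + rng.normal(0, 1.0, n)

# Correlation of every column with the target, strongest first
target_corr = (
    toy.corr()["fare_amount"]
    .drop("fare_amount")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```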

In [None]:
input_cols = ["passenger_count","pickup_datetime_month","pickup_datetime_year","pickup_datetime_hour", "trip_distance"]
target = "fare_amount"

## Train and validation split
#### The earlier split on `pickup_datetime_year` is reserved for simulating the production environment, where we pass the later years' data to the deployed model and evaluate it. Here we split the 2009-2012 data randomly into train and validation sets for model development.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_set, val_set = train_test_split(train_df_mod1, test_size=0.2, random_state=42)

In [None]:
train_inputs = train_set[input_cols]
train_targets = train_set[target]
val_inputs = val_set[input_cols]
val_targets = val_set[target]

# Model Training and Evaluation

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

In [None]:
import pickle

In [None]:
!pip install mlflow --quiet

## Tracking experiments with `DagsHub`
#### We will use the tracking server from `DagsHub` and log the experiment runs, which can be viewed under the `Experiments` tab of the `DagsHub` repository. Since a model registry is not available with `DagsHub`, this method cannot be used in real production. The mlflow run id of a particular model can still be obtained from the `DagsHub` UI.

In [None]:
import mlflow
import os
from getpass import getpass

os.environ['MLFLOW_TRACKING_USERNAME'] = input('Enter your DAGsHub username: ')
os.environ['MLFLOW_TRACKING_PASSWORD'] = getpass('Enter your DAGsHub access token: ')
os.environ['MLFLOW_TRACKING_PROJECTNAME'] = input('Enter your DAGsHub project name: ')

mlflow.set_tracking_uri(f"https://dagshub.com/{os.environ['MLFLOW_TRACKING_USERNAME']}/{os.environ['MLFLOW_TRACKING_PROJECTNAME']}.mlflow")


Enter your DAGsHub username: maiden90
Enter your DAGsHub access token: ··········
Enter your DAGsHub project name: mlops-zoomcamp-project


#### We will train three algorithms before trying any hyperparameter combinations: `LinearRegression`, `RandomForestRegressor` and `XGBRegressor` (XGBoost).

In [None]:
models = {
    "LRMODEL": LinearRegression(),
    "RFMODEL": RandomForestRegressor(),
    "XGBOOST": XGBRegressor()
}

In [None]:
mlflow.set_experiment("nyc-taxi-fare-models-test")
os.makedirs("models", exist_ok=True)  # the pickled models below are written here

for name, model in models.items():
  # one mlflow run per model, logging the validation rmse
  with mlflow.start_run():
    model.fit(train_inputs, train_targets)
    val_pred = model.predict(val_inputs)
    rmse = mean_squared_error(val_targets, val_pred, squared=False)
    mlflow.log_metric("rmse", rmse)
  # save the fitted model locally
  with open(f'models/{name}.bin', 'wb') as f_out:
    pickle.dump(model, f_out)



From the experiment logs in DagsHub, XGBoost performed the best of the three models. Since the `rmse` for this model was around 4, we will skip hyperparameter optimization and use this model for deployment.
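
For reference, the `rmse` metric is the square root of the mean squared difference between actual and predicted fares, so it is expressed in the same unit as `fare_amount`. The sketch below computes it with NumPy on made-up numbers. The `rmse` helper and the sample values are illustrative and are not results from the experiments above.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up fares and predictions; errors are -1, 2, 0 and -4
fares = [8.5, 12.0, 6.0, 20.0]
preds = [9.5, 10.0, 6.0, 24.0]

# Squared errors 1, 4, 0, 16 -> mean 5.25 -> rmse = sqrt(5.25)
print(rmse(fares, preds))
```

Because the errors are squared before averaging, a few large misses (such as the 4-unit error here) dominate the score.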

In [None]:
!pip freeze > requirements.txt