# Homework
The goal of this homework is to train a simple model for predicting the duration of a ride.

## Setup
All libraries used in this notebook are imported here.

In [1]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# settings
pd.set_option('display.max_columns', None)

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

In [2]:
# Loading the data
df = pd.read_parquet('./data/fhv_tripdata_2021-01.parquet')

# display the first few rows
display(df.head())
print('The dataframe contains {} rows and {} columns'.format(df.shape[0], df.shape[1]))

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,,,,B00009
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,,,,B00009
2,B00013,2021-01-01 00:01:00,2021-01-01 01:51:00,,,,B00013
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,,72.0,,B00037
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,,61.0,,B00037


The dataframe contains 1154112 rows and 7 columns


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the average trip duration in January?

In [3]:
# compute the duration variable
df['duration'] = df.dropOff_datetime - df.pickup_datetime
# convert to minutes (vectorized, equivalent to calling total_seconds() row by row)
df['duration'] = df['duration'].dt.total_seconds() / 60

# inspect the first few rows
display(df.head())

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,duration
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,,,,B00009,17.0
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,,,,B00009,17.0
2,B00013,2021-01-01 00:01:00,2021-01-01 01:51:00,,,,B00013,110.0
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,,72.0,,B00037,8.283333
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,,61.0,,B00037,15.216667


In [4]:
print(f'The average trip duration in January is {df.duration.mean()} minutes.')

The average trip duration in January is 19.1672240937939 minutes.


## Data preparation

Check the distribution of the duration variable. There are some outliers. 

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop? 

## Q3. Missing values

The features we'll use for our model are the pickup and dropoff location IDs.

Both columns have a lot of missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? That is, the fraction of "-1"s after filling the NAs.


In [5]:
# keep only the records where the duration was between 1 and 60 minutes (inclusive)
df = df[(df.duration >= 1) & (df.duration <= 60)]

print(f'df.shape: {df.shape}')

# 1154112 - 1109826 = 44286 records were dropped

df.shape: (1109826, 8)
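
The distribution check mentioned above is not shown in the notebook. A minimal sketch of how it could be done, using a small made-up `durations` series rather than the real data (the values and the variable names `durations`, `keep` and `n_dropped` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical durations in minutes, including a few extreme values.
durations = pd.Series([0.5, 1.0, 8.3, 15.2, 17.0, 60.0, 110.0, 1440.0], name="duration")

# Summary statistics with some upper percentiles to expose the long tail.
print(durations.describe(percentiles=[0.5, 0.95, 0.99]))

# Boolean mask for the rides to keep: 1 to 60 minutes, both ends inclusive.
keep = durations.between(1, 60)

# Count how many rows the filter removes.
n_dropped = int((~keep).sum())
print(f"kept {int(keep.sum())} rows, dropped {n_dropped} rows")
```

On the real dataframe, the same count can be obtained by comparing the row counts before and after filtering.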


In [6]:
# fill NaNs in the pickup and dropoff location IDs with -1
# (fillna avoids chained assignment, which can silently fail to update df)
df['PUlocationID'] = df['PUlocationID'].fillna(-1)
df['DOlocationID'] = df['DOlocationID'].fillna(-1)

In [7]:
# check the percentage of missing values (now -1) in PUlocationID
print(f"Percentage of missing values for the pickup location ID: { len(df[df['PUlocationID'] == -1]) / df.shape[0] * 100}%.")

Percentage of missing values for the pickup location ID: 83.52732770722618%.
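
As a sanity check, the share of "-1"s after filling should equal the share of NaNs before filling, provided -1 is not already used as a real location ID. A small sketch on made-up values (the series `pu` and its contents are assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical pickup IDs; NaN marks a missing value.
pu = pd.Series([np.nan, 72.0, np.nan, np.nan, 61.0], name="PUlocationID")

# Fraction of missing values before filling: the mean of a boolean mask.
frac_missing_before = pu.isna().mean()

# Fill the gaps with -1 and measure the fraction of -1s afterwards.
pu_filled = pu.fillna(-1)
frac_minus_one_after = (pu_filled == -1).mean()

print(frac_missing_before, frac_minus_one_after)
```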


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix? (The number of columns).

In [8]:
# check dtypes
display(df.dtypes)

dispatching_base_num              object
pickup_datetime           datetime64[ns]
dropOff_datetime          datetime64[ns]
PUlocationID                     float64
DOlocationID                     float64
SR_Flag                           object
Affiliated_base_number            object
duration                         float64
dtype: object

In [9]:
# convert 'PUlocationID' and 'DOlocationID' to strings so that DictVectorizer one-hot encodes them
df = df.astype({"PUlocationID": str, "DOlocationID": str})

In [10]:
# use only pickup and dropoff location IDs as features for our model
features = ['PUlocationID', 'DOlocationID']
# turn into a list of dictionaries
train_dicts = df[features].to_dict(orient='records')
# fit a dictionary vectorizer
dv = DictVectorizer()
# transform
X_train = dv.fit_transform(train_dicts)

In [11]:
print(f'X_train dim: {X_train.shape}')

X_train dim: (1109826, 525)


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

In [12]:
# set target
y_train = df['duration'].values

In [13]:
# Initialise the linear regression model
lr = LinearRegression()  

# Fit model
lr.fit(X_train, y_train)

# Generate predictions
y_pred_lr_train = lr.predict(X_train)

# Calculate the RMSE of the model on the training data
print('Train RMSE:', np.sqrt(mean_squared_error(y_train, y_pred_lr_train)))

Train RMSE: 10.5285191072048
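
RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as the target (minutes here). A short sketch on made-up numbers showing that the manual computation agrees with the `mean_squared_error` route used above (`y_true` and `y_hat` are illustrative values, not model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true durations and predictions, in minutes.
y_true = np.array([17.0, 17.0, 8.0, 15.0])
y_hat = np.array([15.0, 20.0, 10.0, 15.0])

# Manual RMSE: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's MSE.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```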


## Q6. Evaluating the model

Now let's apply this model to the validation dataset (Feb 2021). 

What's the RMSE on validation?

In [14]:
# load a parquet file and apply the same preprocessing as for the training data
def transform(data_path):
    # load the parquet file
    df = pd.read_parquet(data_path)
    # compute the duration variable
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    # convert to minutes
    df['duration'] = df['duration'].dt.total_seconds() / 60
    # keep only the records where the duration was between 1 and 60 minutes (inclusive)
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
    # fill NaNs in the pickup and dropoff location IDs with -1
    df['PUlocationID'] = df['PUlocationID'].fillna(-1)
    df['DOlocationID'] = df['DOlocationID'].fillna(-1)
    # convert the location IDs to strings so that DictVectorizer one-hot encodes them
    df = df.astype({"PUlocationID": str, "DOlocationID": str})
    return df

In [15]:
val_df = transform('./data/fhv_tripdata_2021-02.parquet')

# display the first few rows
display(val_df.head())
print('The dataframe contains {} rows and {} columns'.format(val_df.shape[0], val_df.shape[1]))

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number,duration
1,B00021,2021-02-01 00:55:40,2021-02-01 01:06:20,173.0,82.0,,B00021,10.666667
2,B00021,2021-02-01 00:14:03,2021-02-01 00:28:37,173.0,56.0,,B00021,14.566667
3,B00021,2021-02-01 00:27:48,2021-02-01 00:35:45,82.0,129.0,,B00021,7.95
4,B00037,2021-02-01 00:12:50,2021-02-01 00:26:38,-1.0,225.0,,B00037,13.8
5,B00037,2021-02-01 00:00:37,2021-02-01 00:09:35,-1.0,61.0,,B00037,8.966667


The dataframe contains 990113 rows and 8 columns


In [16]:
# turn into a list of dictionaries
val_dicts = val_df[features].to_dict(orient='records')
# transform
X_val = dv.transform(val_dicts)
print(f'X_val dim: {X_val.shape}')

# set target
y_val = val_df['duration'].values

X_val dim: (990113, 525)


In [17]:
# Generate predictions
y_pred_lr = lr.predict(X_val)

# Calculate the RMSE of the model on the validation data
print('Val RMSE:', np.sqrt(mean_squared_error(y_val, y_pred_lr)))

Val RMSE: 11.01428314516757
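
One detail worth keeping in mind when reusing the fitted vectorizer on February data: `DictVectorizer.transform` silently ignores feature values it did not see during fitting, so such a record gets all zeros for that feature and the column count stays the same. A small sketch with made-up records (the names and the ID "999.0" are hypothetical):

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training and validation records.
train_records = [{"PUlocationID": "-1.0"}, {"PUlocationID": "173.0"}]
val_records = [{"PUlocationID": "173.0"}, {"PUlocationID": "999.0"}]  # "999.0" is unseen

toy_dv = DictVectorizer()
X_tr = toy_dv.fit_transform(train_records)  # learns the two training columns
X_va = toy_dv.transform(val_records)        # reuses them, adds no new columns

print(X_va.shape)
print(X_va.toarray())
```

The second validation row comes out as all zeros, which is why the validation matrix always has the same number of columns as the training matrix.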
