##### Q1. Downloading the data
##### We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "Yellow Taxi Trip Records".

##### Download the data for January and February 2023.

##### Read the data for January. How many columns are there?

In [1]:
import pandas as pd
import numpy as np

In [2]:
# read the January data
jan_df = pd.read_parquet('data/yellow_tripdata_2023-01.parquet') 
jan_df.head() 

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,2,2023-01-01 00:32:10,2023-01-01 00:40:36,1.0,0.97,1.0,N,161,141,2,9.3,1.0,0.5,0.0,0.0,1.0,14.3,2.5,0.0
1,2,2023-01-01 00:55:08,2023-01-01 01:01:27,1.0,1.1,1.0,N,43,237,1,7.9,1.0,0.5,4.0,0.0,1.0,16.9,2.5,0.0
2,2,2023-01-01 00:25:04,2023-01-01 00:37:49,1.0,2.51,1.0,N,48,238,1,14.9,1.0,0.5,15.0,0.0,1.0,34.9,2.5,0.0
3,1,2023-01-01 00:03:48,2023-01-01 00:13:25,0.0,1.9,1.0,N,138,7,1,12.1,7.25,0.5,0.0,0.0,1.0,20.85,0.0,1.25
4,2,2023-01-01 00:10:29,2023-01-01 00:21:19,1.0,1.43,1.0,N,107,79,1,11.4,1.0,0.5,3.28,0.0,1.0,19.68,2.5,0.0


In [3]:
# number of columns
len(jan_df.columns)

19

##### Q2. Computing duration
##### Now let's compute the duration variable. It should contain the duration of a ride in minutes.

##### What's the standard deviation of the trip duration in January?

In [4]:
# duration column in minutes (vectorized via the .dt accessor instead of a row-wise apply)
jan_df['duration'] = jan_df['tpep_dropoff_datetime'] - jan_df['tpep_pickup_datetime']
jan_df['duration'] = jan_df['duration'].dt.total_seconds() / 60

In [5]:
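# standard deviation of duration column
# (np.std defaults to ddof=0, the population standard deviation; pandas' .std() defaults to ddof=1)
np.std(jan_df['duration'])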

42.5943442974141
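As a minimal sketch of the duration step, the cell below recomputes trip durations on a tiny made-up frame using the vectorized `.dt.total_seconds()` accessor, and contrasts the two common standard deviation conventions. The frame `toy_trips` and its timestamps are assumptions for illustration only, not values from the taxi data.

```python
import numpy as np
import pandas as pd

# Made-up timestamps standing in for the taxi records (illustration only).
toy_trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 00:10:00", "2023-01-01 01:00:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:08:30", "2023-01-01 00:25:00", "2023-01-01 01:02:00"]
    ),
})

# Subtracting two datetime columns gives a timedelta column;
# .dt.total_seconds() converts the whole column at once.
delta = toy_trips["tpep_dropoff_datetime"] - toy_trips["tpep_pickup_datetime"]
toy_trips["duration"] = delta.dt.total_seconds() / 60

# Two conventions for the standard deviation:
std_population = toy_trips["duration"].std(ddof=0)  # divides by n (np.std default)
std_sample = toy_trips["duration"].std()            # divides by n - 1 (pandas default)

print(toy_trips["duration"].tolist())
print(std_population, std_sample)
```

With millions of rows the two conventions differ only in the later decimal places, so either is likely to lead to the same homework answer.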

##### Q3. Dropping outliers
##### Next, we need to check the distribution of the duration variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

##### What fraction of the records is left after you drop the outliers?

In [6]:
# removing outliers (.copy() avoids a SettingWithCopyWarning when columns are reassigned later)
jan_df = jan_df[(jan_df['duration'] >= 1) & (jan_df['duration'] <= 60)].copy()
jan_df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,duration
0,2,2023-01-01 00:32:10,2023-01-01 00:40:36,1.0,0.97,1.0,N,161,141,2,9.3,1.0,0.5,0.0,0.0,1.0,14.3,2.5,0.0,8.433333
1,2,2023-01-01 00:55:08,2023-01-01 01:01:27,1.0,1.1,1.0,N,43,237,1,7.9,1.0,0.5,4.0,0.0,1.0,16.9,2.5,0.0,6.316667
2,2,2023-01-01 00:25:04,2023-01-01 00:37:49,1.0,2.51,1.0,N,48,238,1,14.9,1.0,0.5,15.0,0.0,1.0,34.9,2.5,0.0,12.75
3,1,2023-01-01 00:03:48,2023-01-01 00:13:25,0.0,1.9,1.0,N,138,7,1,12.1,7.25,0.5,0.0,0.0,1.0,20.85,0.0,1.25,9.616667
4,2,2023-01-01 00:10:29,2023-01-01 00:21:19,1.0,1.43,1.0,N,107,79,1,11.4,1.0,0.5,3.28,0.0,1.0,19.68,2.5,0.0,10.833333


In [7]:
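# Fraction left after dropping outliers
# NOTE: jan_df was already filtered in In [6], so the first value is trivially 1.0.
# The requested fraction has to be measured against the row count before the filter.
((jan_df['duration'] >= 1) & (jan_df['duration'] <= 60)).mean(), ((jan_df['duration'] > 1) & (jan_df['duration'] <= 60)).mean()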

(1.0, 0.9999066188617272)
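The check above runs on a frame that has already been filtered, so it cannot report how many records were dropped. A minimal sketch of the intended calculation, on a made-up frame (`toy_durations` and its values are assumptions for illustration), builds the mask first and takes its mean before filtering:

```python
import pandas as pd

# Made-up durations in minutes; two of the five fall outside [1, 60].
toy_durations = pd.DataFrame({"duration": [0.5, 1.0, 12.0, 60.0, 75.0]})

n_before = len(toy_durations)

# Boolean mask for the inclusive range; its mean is the share of True values.
mask = (toy_durations["duration"] >= 1) & (toy_durations["duration"] <= 60)
fraction_left = mask.mean()

# Apply the filter only after the fraction has been recorded.
toy_filtered = toy_durations[mask].copy()

print(fraction_left)
print(len(toy_filtered) / n_before)
```

Applied to the notebook, the same idea would mean computing the mask on `jan_df` before reassigning it in the filtering cell.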

##### Q4. One-hot encoding
##### Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

##### Turn the dataframe into a list of dictionaries (remember to cast the IDs to strings; otherwise the vectorizer treats them as numeric values instead of one-hot encoding them)
##### Fit a dictionary vectorizer
##### Get a feature matrix from it
##### What's the dimensionality of this matrix (number of columns)?

In [8]:
# get two categorical features
categorical_features = ['PULocationID', 'DOLocationID']
jan_df[categorical_features].head(), jan_df[categorical_features].dtypes

(   PULocationID  DOLocationID
 0           161           141
 1            43           237
 2            48           238
 3           138             7
 4           107            79,
 PULocationID    int64
 DOLocationID    int64
 dtype: object)

In [9]:
# convert the categorical variables to strings
jan_df[categorical_features] = jan_df[categorical_features].astype('str') 
jan_df[categorical_features].dtypes

PULocationID    object
DOLocationID    object
dtype: object

In [10]:
# import and instantiate DictVectorizer
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer()

In [11]:
# turn features to dict
train_dicts = jan_df[categorical_features].to_dict(orient='records')

In [12]:
# get feature matrix 
X_train = dv.fit_transform(train_dicts)

In [13]:
# dimensionality of the matrix
X_train.shape, X_train.ndim

((3009173, 515), 2)
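To illustrate why the IDs are cast to strings first, the sketch below fits `DictVectorizer` on the same three made-up rides twice: once with string IDs and once with integer IDs. The ride values and the names `rides_as_str`, `rides_as_int`, `dv_str`, and `dv_int` are assumptions for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Three made-up rides with string location IDs.
rides_as_str = [
    {"PULocationID": "161", "DOLocationID": "141"},
    {"PULocationID": "43", "DOLocationID": "237"},
    {"PULocationID": "161", "DOLocationID": "237"},
]
# The same rides with integer IDs.
rides_as_int = [{key: int(value) for key, value in ride.items()} for ride in rides_as_str]

# String values: one column per distinct "feature=value" pair (one-hot encoding).
dv_str = DictVectorizer()
X_str = dv_str.fit_transform(rides_as_str)

# Integer values: one numeric column per feature, holding the raw ID.
dv_int = DictVectorizer()
X_int = dv_int.fit_transform(rides_as_int)

print(X_str.shape, dv_str.feature_names_)
print(X_int.shape, dv_int.feature_names_)
```

With string IDs the column count equals the number of distinct pickup IDs plus the number of distinct dropoff IDs, which is how the 515 columns above arise from only two features.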

##### Q5. Training a model
##### Now let's use the feature matrix from the previous step to train a model.

##### Train a plain linear regression model with default parameters
##### Calculate the RMSE of the model on the training data
##### What's the RMSE on train?

In [14]:
# get the y_train
y_train = jan_df['duration'].values

In [15]:
# train a plain Linear regression model
from sklearn.linear_model import LinearRegression
model_LR = LinearRegression()

In [16]:
# fit the training data
model_LR.fit(X_train, y_train)

In [17]:
# predict on the training features
y_pred = model_LR.predict(X_train)

In [18]:
# calculate RMSE on the training data
from sklearn.metrics import mean_squared_error
print('The RMSE on train data is', np.sqrt(mean_squared_error(y_train, y_pred)))

The RMSE on train data is 7.649262109734842
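RMSE is the square root of the mean squared difference between targets and predictions. As a small sanity check, the sketch below computes it both through `mean_squared_error` and directly with NumPy; the arrays `y_true_toy` and `y_hat_toy` are made-up values, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions (minutes), for illustration only.
y_true_toy = np.array([10.0, 20.0, 30.0])
y_hat_toy = np.array([12.0, 18.0, 33.0])

# RMSE via scikit-learn: square root of the mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true_toy, y_hat_toy))

# RMSE written out by hand: sqrt(mean((y - y_hat)^2)).
rmse_manual = np.sqrt(np.mean((y_true_toy - y_hat_toy) ** 2))

print(rmse_sklearn, rmse_manual)
```

Because RMSE is expressed in the units of the target, a value near 7.6 here reads as a typical error of roughly 7.6 minutes.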


##### Q6. Evaluating the model
##### Now let's apply this model to the validation dataset (February 2023).

##### What's the RMSE on validation?

In [19]:
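# function to read a parquet file and prepare the dataframe

def read_dataframe(filename):
    """Read a trip-record parquet file, add 'duration' in minutes, and drop outliers."""
    df = pd.read_parquet(filename)

    # trip duration in minutes
    df['duration'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
    df['duration'] = df['duration'].dt.total_seconds() / 60

    # keep trips between 1 and 60 minutes (inclusive); copy to avoid SettingWithCopyWarning
    df = df[(df['duration'] >= 1) & (df['duration'] <= 60)].copy()

    # cast location IDs to strings so DictVectorizer one-hot encodes them
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype('str')

    return df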

In [20]:
# get January and February data with the function above
jan_df = read_dataframe('data/yellow_tripdata_2023-01.parquet')
feb_df = read_dataframe('data/yellow_tripdata_2023-02.parquet')

In [21]:
# repeat the vectorization steps from above: fit on January, transform February
categorical = ['PULocationID', 'DOLocationID']

dv = DictVectorizer()

train_dicts = jan_df[categorical].to_dict(orient='records')  
X_train = dv.fit_transform(train_dicts)    

val_dicts = feb_df[categorical].to_dict(orient='records') 
X_val = dv.transform(val_dicts)

In [22]:
# values of the y_train and y_val
y_train = jan_df['duration'].values
y_val = feb_df['duration'].values

In [23]:
# fit linear regression on January and compute RMSE on the validation data (February)

model_LR.fit(X_train, y_train)

y_pred = model_LR.predict(X_val)    

print('The RMSE on validation data is', np.sqrt(mean_squared_error(y_val, y_pred)))

The RMSE on validation data is 7.8118141443234945
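The validation matrix comes from `transform` rather than `fit_transform`, so it keeps the columns learned from the training month. The sketch below, on made-up dictionaries (`toy_train`, `toy_val`, and `toy_dv` are assumed names), shows that a location ID absent from training is silently ignored and yields an all-zero row.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up training and validation records; "999" never appears in training.
toy_train = [{"PULocationID": "161"}, {"PULocationID": "43"}]
toy_val = [{"PULocationID": "43"}, {"PULocationID": "999"}]

toy_dv = DictVectorizer()
X_toy_train = toy_dv.fit_transform(toy_train)  # learns the column layout
X_toy_val = toy_dv.transform(toy_val)          # reuses that layout unchanged

print(toy_dv.feature_names_)
print(X_toy_val.toarray())
```

Refitting the vectorizer on the validation month instead would likely produce a different column layout, and the trained model's coefficients would no longer line up with the features.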
