## 01. Introduction Homework
The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.

In [9]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#### Q1. Downloading the data
We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.

Read the data for January. How many columns are there?

In [3]:
df_jan_2023 = pd.read_parquet('../../data/yellow_tripdata_2023-01.parquet')
df_feb_2023 = pd.read_parquet('../../data/yellow_tripdata_2023-02.parquet')

print(f"jan 2023: {len(df_jan_2023.columns)}, feb 2023: {len(df_feb_2023.columns)}.")

jan 2023: 19, feb 2023: 19.


#### Q2. Computing duration
Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

In [4]:
# subtract the timestamps to get a timedelta, then convert it to minutes (vectorized)
df_jan_2023['duration'] = df_jan_2023.tpep_dropoff_datetime - df_jan_2023.tpep_pickup_datetime
df_jan_2023['duration'] = df_jan_2023['duration'].dt.total_seconds() / 60

df_feb_2023['duration'] = df_feb_2023.tpep_dropoff_datetime - df_feb_2023.tpep_pickup_datetime
df_feb_2023['duration'] = df_feb_2023['duration'].dt.total_seconds() / 60

std_dev = df_jan_2023['duration'].std()

print(np.round(std_dev, 2))

42.59
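
As a sanity check, the duration computation can be tried on a few hand-made rides where the answer is easy to verify. The timestamps below are invented for illustration and are not taken from the dataset. The sketch also shows that `Series.std()` in pandas defaults to the sample standard deviation (`ddof=1`), whereas `np.std` defaults to the population version (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Toy data: three rides with made-up pickup/dropoff timestamps
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 10:15:00", "2023-01-01 23:50:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:12:30", "2023-01-01 10:45:00", "2023-01-02 00:05:00"]
    ),
})

# Subtracting two datetime columns yields a timedelta column;
# .dt.total_seconds() / 60 converts it to minutes as floats
delta = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
toy["duration"] = delta.dt.total_seconds() / 60
print(toy["duration"].tolist())  # [12.5, 30.0, 15.0]

# pandas uses ddof=1 by default, NumPy uses ddof=0
sample_std = toy["duration"].std()
population_std = np.std(toy["duration"].to_numpy())
print(sample_std, population_std)
```

On millions of rows the two estimates are nearly identical, so the choice rarely changes a rounded answer, but it is worth knowing which one is being reported.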


#### Q3. Dropping outliers
Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

In [5]:
# filter the DataFrame to remove outliers
df_jan_filtered = df_jan_2023[(df_jan_2023['duration'] >= 1) & (df_jan_2023['duration'] <= 60)]
df_feb_filtered = df_feb_2023[(df_feb_2023['duration'] >= 1) & (df_feb_2023['duration'] <= 60)]

# calculate the fraction of the records left
fraction_left = len(df_jan_filtered) / len(df_jan_2023)

print(fraction_left)

0.9812202822125979
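
The same inclusive filter can also be written with `Series.between`, which includes both endpoints by default. The sketch below uses invented durations to show the boundary behaviour; the values are illustrative only.

```python
import pandas as pd

# Made-up durations in minutes, including boundary values and outliers
durations = pd.Series([0.5, 1.0, 12.0, 60.0, 60.5, 240.0], name="duration")

# between() is inclusive on both ends by default, matching ">= 1 and <= 60"
mask = durations.between(1, 60)

# The mean of a boolean mask is the fraction of rows that are kept
toy_fraction_left = mask.mean()

print(durations[mask].tolist())  # [1.0, 12.0, 60.0]
print(toy_fraction_left)         # 0.5
```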


#### Q4. One-hot encoding
Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to cast the IDs to strings first; otherwise 
  `DictVectorizer` treats them as numeric values instead of one-hot encoding them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

In [6]:
# convert the dataframe to a list of dictionaries
train_data_dict = df_jan_filtered[['PULocationID', 'DOLocationID']].astype(str).to_dict('records')

# initialize the DictVectorizer
vec = DictVectorizer()

# fit and transform the data
feature_matrix = vec.fit_transform(train_data_dict)

print(f"the dimensionality of the matrix is: {feature_matrix.shape[1]}")

the dimensionality of the matrix is: 515


#### Q5. Training a model
Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?


In [7]:
# initialize the linear regression model
model = LinearRegression()

# fit the model on the feature matrix and the target variable
model.fit(feature_matrix, df_jan_filtered['duration'])

# predict the target variable on the training data
train_preds = model.predict(feature_matrix)

# calculate the RMSE on the training data
rmse_train = np.sqrt(mean_squared_error(df_jan_filtered['duration'], train_preds))

print(f"the RMSE on the training data is: {rmse_train}")

the RMSE on the training data is: 7.649261936284003
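
For reference, RMSE is the square root of the mean squared difference between the true and predicted values. The sketch below uses invented numbers to confirm that computing it by hand agrees with taking the square root of scikit-learn's `mean_squared_error`, as done in the cell above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted durations in minutes
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 36.0])

# RMSE by hand: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the target is in minutes, the RMSE is in minutes too, which makes it easy to interpret as a typical prediction error.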


#### Q6. Evaluating the model
Now let's apply this model to the validation dataset (February 2023). 

What's the RMSE on validation?

In [8]:
# convert the validation dataframe to a list of dictionaries
val_data_dict = df_feb_filtered[['PULocationID', 'DOLocationID']].astype(str).to_dict('records')

# transform the validation data using the same DictVectorizer
val_feature_matrix = vec.transform(val_data_dict)

# predict the target variable on the validation data
val_preds = model.predict(val_feature_matrix)

# calculate the RMSE on the validation data
rmse_val = np.sqrt(mean_squared_error(df_feb_filtered['duration'], val_preds))

print(f"the RMSE on the validation data is: {rmse_val}")

the RMSE on the validation data is: 7.811818654341152
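
One detail worth keeping in mind when reusing the fitted vectorizer: `DictVectorizer.transform` silently ignores feature values it did not see during fitting. A ride whose location ID appears only in February therefore gets an all-zero encoding for that feature. A minimal sketch with made-up IDs:

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up training and validation records; "99" never appears in training
toy_train = [{"PULocationID": "10"}, {"PULocationID": "11"}]
toy_val = [{"PULocationID": "10"}, {"PULocationID": "99"}]

toy_vec = DictVectorizer()
toy_vec.fit(toy_train)

# The column layout is fixed by the training data
X_val = toy_vec.transform(toy_val)

print(X_val.shape)      # (2, 2)
print(X_val.toarray())  # second row is all zeros: "99" was ignored
```

This is why the validation matrix always has the same number of columns as the training matrix, which the fitted model requires.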
