## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.


## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19

In [1]:
import sys
import os

# Add the 'helper' directory to the import path.
# __file__ is not defined in a notebook, so resolve relative to the working directory.
helper_dir = os.path.join(os.getcwd(), 'helper')
if helper_dir not in sys.path:
    sys.path.append(helper_dir)

# Now the helper function can be imported
from download_tlc_helper import download_tlc_data



In [2]:
# Example usage of the helper function
record_type = "yellow_tripdata"
year = 2023
months = [1, 2]

download_tlc_data(record_type, year, months)


Downloading from: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
File downloaded and saved to: tlc_data/yellow_tripdata/yellow_tripdata_2023-01.parquet
Downloading from: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet
File downloaded and saved to: tlc_data/yellow_tripdata/yellow_tripdata_2023-02.parquet
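
The `download_tlc_helper` module itself is not shown in this notebook. As a rough illustration only, a helper of this kind might look like the sketch below. The function names `build_tlc_url` and `download_tlc_data_sketch`, the `target_root` parameter, and the skip-if-present behaviour are all assumptions; only the URL pattern and the target directory layout are taken from the log output above.

```python
import os
import urllib.request

# Assumption: URL pattern inferred from the download log above.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def build_tlc_url(record_type, year, month):
    """Build the parquet URL for one month, e.g. yellow_tripdata_2023-01.parquet."""
    return f"{BASE_URL}/{record_type}_{year}-{month:02d}.parquet"


def download_tlc_data_sketch(record_type, year, months, target_root="tlc_data"):
    """Hypothetical stand-in for download_tlc_data; returns the local file paths."""
    target_dir = os.path.join(target_root, record_type)
    os.makedirs(target_dir, exist_ok=True)

    paths = []
    for month in months:
        url = build_tlc_url(record_type, year, month)
        path = os.path.join(target_dir, os.path.basename(url))
        # Skip files that are already on disk to avoid repeated downloads
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)
        paths.append(path)
    return paths
```

Calling `download_tlc_data_sketch("yellow_tripdata", 2023, [1, 2])` would then fetch the two files used in this homework.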


In [3]:
import pandas as pd

df_2023_01 = pd.read_parquet("tlc_data/yellow_tripdata/yellow_tripdata_2023-01.parquet")
print("Columns Count of Dataset: ", len(df_2023_01.columns))

Columns Count of Dataset:  19


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

* 32.59
* 42.59
* 52.59
* 62.59

In [4]:
df_2023_01.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee'],
      dtype='object')

In [5]:
# Ride duration in minutes, computed from the pickup and dropoff timestamps
df_2023_01["duration"] = (
    df_2023_01.tpep_dropoff_datetime - df_2023_01.tpep_pickup_datetime
).dt.total_seconds() / 60
print("Standard deviation of the trips duration in January: " , df_2023_01.duration.std().round(2))

Standard deviation of the trips duration in January:  42.59


## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).
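
One way to check the distribution before filtering is to look at a few percentiles. The sketch below uses made-up durations rather than the TLC data, and the choice of percentiles is an assumption for illustration.

```python
import pandas as pd

# Toy durations in minutes (made-up values, not taken from the TLC data)
durations = pd.Series([0.2, 5.0, 12.5, 30.0, 60.0, 61.0, 600.0])

# Percentiles show how far the tail extends beyond the bulk of the rides
print(durations.describe(percentiles=[0.5, 0.95, 0.99]))

# Series.between is inclusive on both ends by default,
# so rides of exactly 1 or 60 minutes are kept
mask = durations.between(1, 60)

# The mean of a boolean mask is the fraction of records kept
print(mask.mean())
```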

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98%

In [6]:
df_2023_01_wo = df_2023_01[(df_2023_01.duration >= 1) & (df_2023_01.duration <= 60)]  # keep rides of 1 to 60 minutes (inclusive)

# What fraction of the records left after you dropped the outliers?
fraction = len(df_2023_01_wo) / len(df_2023_01)
print("Fraction of the records left after you dropped the outliers: ", fraction)


Fraction of the records left after you dropped the outliers:  0.9812202822125979


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to re-cast the ids to strings, otherwise the vectorizer will treat them as numeric features instead of one-hot encoding them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it
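
To see why the cast to strings matters, here is a minimal sketch on two toy records (the values are invented). With integer values, `DictVectorizer` passes each key through as a single numeric column; with string values, it creates one column per `key=value` pair.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records with made-up location ids
records_int = [
    {"PULocationID": 1, "DOLocationID": 2},
    {"PULocationID": 3, "DOLocationID": 2},
]
# Same records with the ids cast to strings
records_str = [{k: str(v) for k, v in r.items()} for r in records_int]

dv_int = DictVectorizer()
X_int = dv_int.fit_transform(records_int)

dv_str = DictVectorizer()
X_str = dv_str.fit_transform(records_str)

# Integer ids: one numeric column per key
print(dv_int.feature_names_, X_int.shape)

# String ids: one indicator column per distinct key=value pair
print(dv_str.feature_names_, X_str.shape)
```

On the toy data the integer version yields 2 columns and the string version yields 3, one for each distinct pickup or dropoff id.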

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715

In [7]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error

In [8]:
from sklearn.feature_extraction import DictVectorizer

# Turn the dataframe into a list of dictionaries
train_dicts = df_2023_01_wo[['PULocationID', 'DOLocationID']].astype(str).to_dict(orient='records')

# Fit a dictionary vectorizer
dv = DictVectorizer()
dv.fit(train_dicts)

# Get the feature matrix
X_train = dv.transform(train_dicts)

# Get the dimensionality of the matrix
num_columns = X_train.shape[1]
num_columns

515

## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data
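
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand on invented values and compares it with the square root of scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions for illustration
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```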

What's the RMSE on train?

* 3.64
* 7.64
* 11.64
* 16.64

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

target = 'duration'
y_train = df_2023_01_wo[target].values

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the target variable
y_train_pred = model.predict(X_train)

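# Calculate the RMSE as the square root of the MSE
# (the `squared=False` argument is deprecated in newer scikit-learn versions)
rmse = mean_squared_error(y_train, y_train_pred) ** 0.5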
rmse



7.649261822035489

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2023). 
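
When applying the model to new data, the vectorizer fitted on the training set should be reused so that the columns line up with what the model was trained on. A minimal sketch with invented ids shows what happens to a location that only appears in the validation data.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy data with made-up ids; "99" never appears in the training records
train = [{"PULocationID": "1"}, {"PULocationID": "2"}]
val = [{"PULocationID": "2"}, {"PULocationID": "99"}]

dv_demo = DictVectorizer()
X_tr = dv_demo.fit_transform(train)

# transform() keeps the training columns and silently ignores unseen features
X_va = dv_demo.transform(val)

print(dv_demo.feature_names_)
print(X_va.toarray())
```

The second validation row comes out as all zeros, because its only feature was not seen during fitting. Fitting a separate vectorizer on the validation data would instead produce a different set of columns that the trained model cannot interpret.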

What's the RMSE on validation?

* 3.81
* 7.81
* 11.81
* 16.81


In [10]:
df_2023_02 = pd.read_parquet("tlc_data/yellow_tripdata/yellow_tripdata_2023-02.parquet")

In [13]:
# Compute the duration in minutes and apply the same 1 to 60 minute filter as for January
df_2023_02["duration"] = (
    df_2023_02.tpep_dropoff_datetime - df_2023_02.tpep_pickup_datetime
).dt.total_seconds() / 60
df_2023_02_wo = df_2023_02[(df_2023_02.duration >= 1) & (df_2023_02.duration <= 60)]

val_dicts = df_2023_02_wo[['PULocationID', 'DOLocationID']].astype(str).to_dict(orient='records')

# Reuse the vectorizer fitted on the January data; do not fit a new one on the validation data
X_test = dv.transform(val_dicts)

y_test_pred = model.predict(X_test)


In [14]:
y_test = df_2023_02_wo[target].values

rmse_test = mean_squared_error(y_test, y_test_pred) ** 0.5  # RMSE as the square root of the MSE
rmse_test



7.811821332387183