## Homework

In [61]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.

In [4]:
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet

Read the data for January. How many columns are there?

In [13]:
df_jan = pd.read_parquet("yellow_tripdata_2023-01.parquet")
print(len(df_jan.columns))

19


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

In [25]:
df_jan["duration"] = (df_jan.tpep_dropoff_datetime - df_jan.tpep_pickup_datetime).dt.total_seconds() / 60

What's the standard deviation of the trip durations in January?

In [27]:
df_jan.duration.std()

42.59435124195458
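As a quick sanity check of the duration computation, here is a minimal sketch on a toy dataframe. The timestamps are made up for illustration; only the column names match the Yellow Taxi files.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the Yellow Taxi files.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 10:15:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:12:30", "2023-01-01 11:00:00"]
    ),
})

# Subtracting two datetime columns yields a timedelta column;
# .dt.total_seconds() converts it to float seconds, and / 60 gives minutes.
delta = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
toy["duration"] = delta.dt.total_seconds() / 60

print(toy["duration"].tolist())  # [12.5, 45.0]
```

The `.dt.total_seconds()` accessor works on the whole column at once, so it avoids calling a Python function for every row.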

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

In [29]:
df_jan_filtered = df_jan[df_jan.duration.between(1, 60, inclusive='both')]

What fraction of the records is left after you drop the outliers?

In [32]:
len(df_jan_filtered)/len(df_jan)

0.9812202822125979
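The filter and the fraction can be illustrated on a small example. This is a sketch with made-up durations; it assumes a pandas version that accepts the string form of `inclusive` (1.3 or later).

```python
import pandas as pd

# Hypothetical durations in minutes, including values on both boundaries.
toy = pd.DataFrame({"duration": [0.5, 1.0, 15.0, 60.0, 61.0, 240.0]})

# between(..., inclusive="both") keeps rows where 1 <= duration <= 60.
mask = toy["duration"].between(1, 60, inclusive="both")
kept = toy[mask]

# The mean of a boolean mask is the share of True values,
# which equals len(kept) / len(toy).
fraction = mask.mean()

print(len(kept), fraction)  # 3 0.5
```

Taking the mean of the mask gives the same number as dividing the two lengths, without building the filtered dataframe first.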

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

* Turn the dataframe into a list of dictionaries (remember to re-cast the IDs to strings; otherwise
  `DictVectorizer` will treat them as numeric values instead of one-hot encoding them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it

In [54]:
CATEGORICAL_COLUMNS = ["PULocationID", "DOLocationID"]
TARGET = "duration"

df_train = df_jan_filtered.copy()
df_train[CATEGORICAL_COLUMNS] = df_train[CATEGORICAL_COLUMNS].astype(str)
dict_train = df_train[CATEGORICAL_COLUMNS].to_dict(orient='records')

dv = DictVectorizer()
X_train = dv.fit_transform(dict_train)
y_train = df_train[TARGET].values

What's the dimensionality of this matrix (number of columns)?

In [56]:
X_train.shape[1]

515


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

In [58]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

What's the RMSE on the training data?

In [62]:
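# RMSE is the square root of the MSE; the `squared=False` argument
# is deprecated since scikit-learn 1.4, so take the root explicitly.
mean_squared_error(y_train, y_pred) ** 0.5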



7.649261929201487
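For reference, RMSE can also be computed by hand. The sketch below uses made-up target and prediction values, not the taxi data, to show the three steps: square the errors, average them, take the square root.

```python
import numpy as np

# Hypothetical true durations and predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3; their squares are 4, 4, 9 with a mean of 17 / 3.
squared_errors = (y_true - y_hat) ** 2
rmse = np.sqrt(squared_errors.mean())

print(rmse)
```

Because of the final square root, RMSE is expressed in the same unit as the target, here minutes.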