In [1]:
!python -V

Python 3.10.14


In [2]:
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import root_mean_squared_error

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19

In [3]:
df_jan = pd.read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet")
df_feb = pd.read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet")

In [4]:
len(df_jan.columns)

19

In [5]:
assert len(df_jan.columns) == len(df_feb.columns)

In [6]:
df = df_jan

In [7]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee'],
      dtype='object')

## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

* 32.59
* 42.59
* 52.59
* 62.59

In [9]:
df["duration"] = (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]) / pd.Timedelta(minutes=1)

In [10]:
df["duration"].describe().apply(lambda x: format(x, ".2f"))

count    3066766.00
mean          15.67
std           42.59
min          -29.20
25%            7.12
50%           11.52
75%           18.30
max        10029.18
Name: duration, dtype: object
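The negative minimum and the very large maximum in the summary above suggest that some records have unreliable timestamps. As a minimal sketch on made-up timestamps (the column names `pickup` and `dropoff` and all values are illustrative assumptions, not TLC data), the snippet below shows that dividing by `pd.Timedelta(minutes=1)` and using `.dt.total_seconds() / 60` give the same duration in minutes, and that a dropoff earlier than its pickup yields a negative duration.

```python
import pandas as pd

# Toy timestamps (hypothetical values). The second trip "ends" before it starts.
trips = pd.DataFrame({
    "pickup": pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 10:15:00"]),
    "dropoff": pd.to_datetime(["2023-01-01 00:12:30", "2023-01-01 10:10:00"]),
})

delta = trips["dropoff"] - trips["pickup"]

# Two equivalent ways to express a timedelta column in minutes.
by_division = delta / pd.Timedelta(minutes=1)
by_seconds = delta.dt.total_seconds() / 60

print(by_division.tolist())
print(by_seconds.tolist())
```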

## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98%

In [11]:
n_raw_records = len(df)
df = df[(1 <= df["duration"]) & (df["duration"] <= 60)]

In [12]:
len(df) / n_raw_records

0.9812202822125979
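The same inclusive filter can also be written with `Series.between`, which includes both endpoints by default. This is only a sketch on made-up durations, meant to show how the boundary values 1 and 60 are treated.

```python
import pandas as pd

# Toy durations in minutes (hypothetical values chosen to hit both boundaries).
durations = pd.Series([0.5, 1.0, 30.0, 60.0, 60.1])

# between() is inclusive on both ends unless told otherwise.
mask = durations.between(1, 60)
kept = durations[mask]

# Fraction of records that survive the filter.
fraction_kept = len(kept) / len(durations)
print(kept.tolist(), fraction_kept)
```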

## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to cast the IDs to strings first; otherwise
  `DictVectorizer` treats them as numeric features instead of one-hot encoding them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715

In [13]:
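# Cast the IDs to strings so that DictVectorizer one-hot encodes them instead of treating them as numbers.
train_dicts = df[["PULocationID", "DOLocationID"]].astype(str).to_dict(orient="records")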
# train_dicts = (df["PULocationID"].astype(str) + "_" + df["DOLocationID"].astype(str)).to_dict()

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [14]:
X_train

<3009173x515 sparse matrix of type '<class 'numpy.float64'>'
	with 6018346 stored elements in Compressed Sparse Row format>


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 3.64
* 7.64
* 11.64
* 16.64

In [15]:
target = "duration"
y_train = df[target].values

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

root_mean_squared_error(y_train, y_pred)

7.649261929201487
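For reference, RMSE is the square root of the mean of the squared residuals. The hand-rolled function below is a sketch for illustration only (the name `rmse` and the toy values are assumptions); the notebook itself relies on scikit-learn's `root_mean_squared_error`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared residuals."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy durations in minutes (hypothetical values).
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]

# Residuals are -2, 2 and -3, so the result is sqrt((4 + 4 + 9) / 3).
toy_rmse = rmse(actual, predicted)
print(round(toy_rmse, 4))
```

Because the target is in minutes, the RMSE is in minutes as well, so a value near 7.65 means the typical prediction error is between seven and eight minutes.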

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2023). 

What's the RMSE on validation?

* 3.81
* 7.81
* 11.81
* 16.81

In [17]:
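# Apply the same preprocessing as for January: duration in minutes, then keep trips of 1 to 60 minutes.
df_feb["duration"] = (df_feb["tpep_dropoff_datetime"] - df_feb["tpep_pickup_datetime"]) / pd.Timedelta(minutes=1)
df_feb = df_feb[(1 <= df_feb["duration"]) & (df_feb["duration"] <= 60)].copy()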
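Since January and February go through identical preprocessing, the two steps could be wrapped in one function. The helper below is a hypothetical sketch (the name `prepare_trips` is not part of the notebook) and is exercised here on made-up timestamps rather than the real data.

```python
import pandas as pd

def prepare_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add `duration` in minutes and keep trips of 1 to 60 minutes."""
    df = df.copy()  # avoid mutating the caller's frame
    df["duration"] = (
        df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    ) / pd.Timedelta(minutes=1)
    # between() is inclusive on both ends by default.
    return df[df["duration"].between(1, 60)].copy()

# Toy frame (hypothetical values): durations of 0.5, 10 and 61 minutes.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-02-01 08:00:00"] * 3),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-02-01 08:00:30", "2023-02-01 08:10:00", "2023-02-01 09:01:00"]
    ),
})

prepared = prepare_trips(toy)
print(prepared["duration"].tolist())
```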

In [18]:
y_val = df_feb[target].values

In [19]:
val_dicts = df_feb[["PULocationID", "DOLocationID"]].astype(str).to_dict(orient="records")
X_val = dv.transform(val_dicts)

In [20]:
root_mean_squared_error(y_val, lr.predict(X_val))

7.811819793542861