# Homework 1

Link to exercise:
https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2025/01-intro/homework.md

## Setup

Environment: Conda `mlops-zoomcamp`

In [67]:
# dependencies
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
from pathlib import Path
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [11]:
# paths
PATH_REPO = Path.cwd().parent.parent
PATH_DATA_DIR = PATH_REPO / 'data'
PATH_DATA_JAN = PATH_DATA_DIR / 'yellow_tripdata_2023-01.parquet'
PATH_DATA_FEB = PATH_DATA_DIR / 'yellow_tripdata_2023-02.parquet'


## Q1. Downloading the data

Question: How many columns are in the dataset?

The data was downloaded previously from the website linked in the exercise.


In [46]:
# load data from file and convert to pandas data frame
trips_jan = pq.read_table(PATH_DATA_JAN).to_pandas()
trips_feb = pq.read_table(PATH_DATA_FEB).to_pandas()

In [47]:
# display the data frame; the column count is stated below the output
trips_jan

|  | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2023-01-01 00:32:10 | 2023-01-01 00:40:36 | 1.0 | 0.97 | 1.0 | N | 161 | 141 | 2 | 9.30 | 1.00 | 0.5 | 0.00 | 0.0 | 1.0 | 14.30 | 2.5 | 0.00 |
| 1 | 2 | 2023-01-01 00:55:08 | 2023-01-01 01:01:27 | 1.0 | 1.10 | 1.0 | N | 43 | 237 | 1 | 7.90 | 1.00 | 0.5 | 4.00 | 0.0 | 1.0 | 16.90 | 2.5 | 0.00 |
| 2 | 2 | 2023-01-01 00:25:04 | 2023-01-01 00:37:49 | 1.0 | 2.51 | 1.0 | N | 48 | 238 | 1 | 14.90 | 1.00 | 0.5 | 15.00 | 0.0 | 1.0 | 34.90 | 2.5 | 0.00 |
| 3 | 1 | 2023-01-01 00:03:48 | 2023-01-01 00:13:25 | 0.0 | 1.90 | 1.0 | N | 138 | 7 | 1 | 12.10 | 7.25 | 0.5 | 0.00 | 0.0 | 1.0 | 20.85 | 0.0 | 1.25 |
| 4 | 2 | 2023-01-01 00:10:29 | 2023-01-01 00:21:19 | 1.0 | 1.43 | 1.0 | N | 107 | 79 | 1 | 11.40 | 1.00 | 0.5 | 3.28 | 0.0 | 1.0 | 19.68 | 2.5 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3066761 | 2 | 2023-01-31 23:58:34 | 2023-02-01 00:12:33 |  | 3.05 |  |  | 107 | 48 | 0 | 15.80 | 0.00 | 0.5 | 3.96 | 0.0 | 1.0 | 23.76 |  |  |
| 3066762 | 2 | 2023-01-31 23:31:09 | 2023-01-31 23:50:36 |  | 5.80 |  |  | 112 | 75 | 0 | 22.43 | 0.00 | 0.5 | 2.64 | 0.0 | 1.0 | 29.07 |  |  |
| 3066763 | 2 | 2023-01-31 23:01:05 | 2023-01-31 23:25:36 |  | 4.67 |  |  | 114 | 239 | 0 | 17.61 | 0.00 | 0.5 | 5.32 | 0.0 | 1.0 | 26.93 |  |  |
| 3066764 | 2 | 2023-01-31 23:40:00 | 2023-01-31 23:53:00 |  | 3.15 |  |  | 230 | 79 | 0 | 18.15 | 0.00 | 0.5 | 4.43 | 0.0 | 1.0 | 26.58 |  |  |


The dataset has 19 columns. The leftmost values in the output above are the row index, not a column.
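
The column count can also be read from the data frame's shape instead of the rendered output. A minimal sketch on a made-up three-column frame (the name `toy` and its values are illustrative and not taken from the dataset):

```python
import pandas as pd

# hypothetical miniature frame standing in for trips_jan
toy = pd.DataFrame({
    'VendorID': [2, 1],
    'trip_distance': [0.97, 1.90],
    'PULocationID': [161, 138],
})

# shape is (rows, columns); the row index is not counted as a column
n_rows, n_columns = toy.shape
print(f'Number of columns: {n_columns}')
```

Applied to the real data, the same pattern would be `trips_jan.shape[1]`.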

## Q2. Computing duration

What's the standard deviation of the trips duration in January?

In [48]:
# compute duration column
trips_jan['duration'] = trips_jan['tpep_dropoff_datetime'] - trips_jan['tpep_pickup_datetime']

# convert to minutes
trips_jan['duration'] = trips_jan['duration'].dt.total_seconds() / 60

# compute and print result
print(
    'Standard deviation of the trips duration in January: '
    f'{trips_jan["duration"].std():.2f}'
)

Standard deviation of the trips duration in January: 42.59
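
To make the duration computation easier to follow, here is a small sketch on three invented trips lasting 8, 12 and 10 minutes (the timestamps and the name `toy` are assumptions for illustration). It also shows that `Series.std()` defaults to the sample standard deviation (`ddof=1`); with about three million rows the difference from the population value should be negligible.

```python
import pandas as pd

# hypothetical trips lasting 8, 12 and 10 minutes
toy = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime([
        '2023-01-01 00:00:00', '2023-01-01 00:10:00', '2023-01-01 00:30:00',
    ]),
    'tpep_dropoff_datetime': pd.to_datetime([
        '2023-01-01 00:08:00', '2023-01-01 00:22:00', '2023-01-01 00:40:00',
    ]),
})

# subtracting two datetime columns yields a timedelta column
delta = toy['tpep_dropoff_datetime'] - toy['tpep_pickup_datetime']

# convert the timedeltas to minutes as plain floats
duration_min = delta.dt.total_seconds() / 60

# sample standard deviation (pandas default) vs. population standard deviation
std_sample = duration_min.std()
std_population = duration_min.std(ddof=0)
print(f'sample: {std_sample:.2f}, population: {std_population:.2f}')
```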


## Q3. Dropping outliers

What fraction of the records is left after you dropped the outliers?

In [49]:
# get number of records before dropping outliers
n_records_before = trips_jan.shape[0]

# drop outliers according to exercise
trips_jan = trips_jan[
    (trips_jan['duration'] >= 1) & (trips_jan['duration'] <= 60)
]

# get number of records after dropping outliers
n_records_after = trips_jan.shape[0]

# compute and print result
print(
    'Fraction of the records left after dropping outliers: '
    f'{n_records_after / n_records_before:.2f}'
)

Fraction of the records left after dropping outliers: 0.98
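
The same filter can be written with `Series.between`, which includes both bounds by default, and the kept fraction is then the mean of the boolean mask. A minimal sketch with invented durations (not values from the dataset):

```python
import pandas as pd

# hypothetical durations in minutes, including values on both boundaries
duration = pd.Series([0.5, 1.0, 15.0, 60.0, 75.0])

# True for 1 <= duration <= 60; both ends are inclusive by default
mask = duration.between(1, 60)

# the mean of a boolean mask is the fraction of True values
fraction_kept = mask.mean()
print(f'Fraction kept: {fraction_kept:.2f}')
```

Computing the fraction from the mask avoids having to store the record count before and after filtering.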


## Q4. One-hot encoding

What's the number of columns of the one-hot encoded matrix?

In [66]:
# select only the location columns, convert to strings and make a dictionary
locations_dict = trips_jan[
    ['PULocationID', 'DOLocationID']
].astype(str).to_dict(orient='records')

# create and fit a dict vectorizer
# sparse=False returns a dense numpy array instead of a sparse matrix;
# about 3 million rows x 515 float64 columns need roughly 12 GB of RAM
dv = DictVectorizer(sparse=False)
feature_matrix = dv.fit_transform(locations_dict)

# get number of columns
print(f'Number of columns: {feature_matrix.shape[1]}')

Number of columns: 515


## Q5. Training a model

What's the RMSE on train?

In [72]:
# get labels
labels = trips_jan['duration']
labels

0           8.433333
1           6.316667
2          12.750000
3           9.616667
4          10.833333
             ...    
3066761    13.983333
3066762    19.450000
3066763    24.516667
3066764    13.000000
3066765    14.400000
Name: duration, Length: 3009173, dtype: float64

In [68]:
# initialize the model
model = LinearRegression()

In [None]:
# fit the model
model.fit(
    feature_matrix,
    labels
)

In [None]:
# get predictions
predictions_train = model.predict(feature_matrix)

# compute RMSE on train set
rmse_train = np.sqrt(mean_squared_error(labels, predictions_train))
print(f"RMSE: {rmse_train:.2f} minutes")
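
For reference, RMSE is the square root of the mean squared difference between labels and predictions. A minimal sketch of that arithmetic with invented numbers (`y_true` and `y_pred` are made up, not model output):

```python
import numpy as np

# hypothetical durations and predictions in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# squared errors: 4, 4, 9
squared_errors = (y_true - y_pred) ** 2

# root of the mean squared error
rmse = np.sqrt(squared_errors.mean())
print(f'RMSE: {rmse:.2f} minutes')
```

Because of the square root, the result is in the same unit as the labels, here minutes.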

## Q6. Evaluating the model

What's the RMSE on validation?

In [None]:
# compute duration for February data
trips_feb['duration'] = (
    trips_feb['tpep_dropoff_datetime'] -
    trips_feb['tpep_pickup_datetime']
).dt.total_seconds() / 60

# apply the same duration filter as for January data
trips_feb = trips_feb[
    (trips_feb['duration'] >= 1) &
    (trips_feb['duration'] <= 60)
]

# prepare features the same way
val_dicts = trips_feb[
    ['PULocationID', 'DOLocationID']
].astype(str).to_dict(orient='records')

# transform using the SAME DictVectorizer that was fit on January data
val_feature_matrix = dv.transform(val_dicts)  # note: use transform(), not fit_transform()

# get labels
val_y = trips_feb['duration'].values

In [None]:
# evaluate model on validation data; the question asks for RMSE, not R-squared
val_predictions = model.predict(val_feature_matrix)
rmse_val = np.sqrt(mean_squared_error(val_y, val_predictions))
print(f"Validation RMSE: {rmse_val:.2f} minutes")
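
The reason for calling `transform()` on the vectorizer fit on January data can be illustrated on a toy case: the validation matrix keeps the training columns, and a location that was not seen during fitting is silently ignored. A sketch with invented records (all names and IDs here are hypothetical):

```python
from sklearn.feature_extraction import DictVectorizer

# hypothetical training and validation records
train_dicts = [{'PULocationID': '161'}, {'PULocationID': '43'}]
val_dicts_toy = [{'PULocationID': '161'}, {'PULocationID': '999'}]

toy_dv = DictVectorizer(sparse=False)

# fit_transform learns the columns from the training records
train_matrix = toy_dv.fit_transform(train_dicts)

# transform reuses those columns; the unseen ID '999' gets an all-zero row
val_matrix = toy_dv.transform(val_dicts_toy)
print(val_matrix)
```

Calling `fit_transform()` on the validation records instead would build a different set of columns, so the fitted model's coefficients would no longer line up with the features.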