# Homework 1: Introduction for MLOps Zoomcamp 2025

**Downloading the "Yellow Taxi Trip Records" data for January and February 2023 from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page**

In [1]:
!mkdir -p data
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -P ./data
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet -P ./data

--2025-05-25 01:37:31--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 65.8.245.51, 65.8.245.50, 65.8.245.171, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|65.8.245.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47673370 (45M) [application/x-www-form-urlencoded]
Saving to: ‘./data/yellow_tripdata_2023-01.parquet.4’


2025-05-25 01:37:39 (6,20 MB/s) - ‘./data/yellow_tripdata_2023-01.parquet.4’ saved [47673370/47673370]

--2025-05-25 01:37:39--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 65.8.245.50, 65.8.245.51, 65.8.245.171, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|65.8.245.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47748012 (46M) [a

**Import all necessary modules**

In [2]:
import pandas as pd

import pickle

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, Lasso, Ridge

from sklearn.metrics import mean_squared_error

**Read January data**

In [3]:
df_train = pd.read_parquet('./data/yellow_tripdata_2023-01.parquet')

**Retrieve number of rows and columns**

In [4]:
df_train.shape

(3066766, 19)

**Computing duration**

In [5]:
# Trip duration in minutes (vectorized; equivalent to calling total_seconds() on each row)
df_train['duration'] = df_train.tpep_dropoff_datetime - df_train.tpep_pickup_datetime
df_train.duration = df_train.duration.dt.total_seconds() / 60

In [6]:
df_train.duration.std()

np.float64(42.59435124195458)
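
To make the conversion concrete, here is a minimal sketch on a made-up two-row frame. Only the column names come from the dataset; the timestamps are illustrative assumptions. Subtracting two datetime columns gives a timedelta column, which is then expressed in minutes.

```python
import pandas as pd

# Toy frame with the same two timestamp columns as the TLC data (values are made up).
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 10:00:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-01-01 00:12:30", "2023-01-01 11:05:00"]),
})

# Datetime minus datetime yields a timedelta Series.
delta = toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]

# Convert seconds to minutes.
toy["duration"] = delta.dt.total_seconds() / 60

print(toy["duration"].tolist())  # [12.5, 65.0]
```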

**Dropping outliers**

In [7]:
((df_train.duration >= 1) & (df_train.duration <= 60)).mean()

np.float64(0.9812202822125979)

In [8]:
# Keep trips between 1 and 60 minutes; .copy() makes later column assignments safe
df_train = df_train[(df_train.duration >= 1) & (df_train.duration <= 60)].copy()
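
The fraction computed in the previous cell works because the mean of a boolean mask is the share of `True` values. A small sketch with made-up durations (illustrative values, not taken from the dataset):

```python
import pandas as pd

# Five made-up trip durations in minutes; two fall outside the [1, 60] range.
durations = pd.Series([0.5, 5.0, 30.0, 60.0, 75.0])

# Boolean mask: True where the trip is kept (both bounds inclusive).
mask = (durations >= 1) & (durations <= 60)

# True counts as 1 and False as 0, so the mean is the fraction of rows kept.
kept_fraction = mask.mean()

# Indexing with the mask drops the outliers.
filtered = durations[mask]

print(kept_fraction)      # 0.6
print(filtered.tolist())  # [5.0, 30.0, 60.0]
```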

**One-hot encoding pickup and dropoff location IDs**

In [9]:
categorical = ['PULocationID', 'DOLocationID']

In [10]:
df_train[categorical] = df_train[categorical].astype(str)

In [11]:
dv = DictVectorizer()

train_dicts = df_train[categorical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

In [12]:
X_train.shape

(3009173, 515)

**Training a model**

In [13]:
target = 'duration'
y_train = df_train[target].values

In [14]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [15]:
y_pred = lr.predict(X_train)

In [16]:
mean_squared_error(y_train, y_pred) ** 0.5

7.6492619533128225
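
The value above is the root mean squared error (RMSE), in minutes: the square root of the average squared difference between actual and predicted durations. A small sketch with made-up numbers shows that taking `mean_squared_error(...) ** 0.5` matches the formula computed by hand:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted durations in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity using the approach from the notebook.
rmse_sklearn = mean_squared_error(y_true, y_hat) ** 0.5

print(rmse_manual, rmse_sklearn)
```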

**Evaluating the model**

In [17]:
df_val = pd.read_parquet('./data/yellow_tripdata_2023-02.parquet')

In [18]:
# Trip duration in minutes, computed the same way as for the training data
df_val['duration'] = df_val.tpep_dropoff_datetime - df_val.tpep_pickup_datetime
df_val.duration = df_val.duration.dt.total_seconds() / 60

In [19]:
# Same 1 to 60 minute filter as for training; .copy() makes later column assignments safe
df_val = df_val[(df_val.duration >= 1) & (df_val.duration <= 60)].copy()

In [20]:
df_val[categorical] = df_val[categorical].astype(str)

In [21]:
val_dicts = df_val[categorical].to_dict(orient='records')
X_val = dv.transform(val_dicts)
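
Note that the validation data goes through `transform`, not `fit_transform`, so the columns stay aligned with the ones learned from January. A location ID that did not appear during fitting is silently ignored and yields an all-zero row for that feature. A minimal sketch, with `dv_toy` and the IDs as illustrative assumptions:

```python
from sklearn.feature_extraction import DictVectorizer

# Fit on two made-up pickup IDs only.
dv_toy = DictVectorizer()
dv_toy.fit([{"PULocationID": "161"}, {"PULocationID": "43"}])

# A value seen during fit sets exactly one column to 1.
X_seen = dv_toy.transform([{"PULocationID": "43"}])

# A value never seen during fit is dropped: the row is all zeros.
X_unseen = dv_toy.transform([{"PULocationID": "999"}])

print(X_seen.toarray())
print(X_unseen.toarray())
```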

In [22]:
y_val = df_val[target].values

In [23]:
y_pred = lr.predict(X_val)

In [24]:
mean_squared_error(y_val, y_pred) ** 0.5

7.811816520976144
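
The training and validation frames go through the same three steps (compute duration, filter to 1 to 60 minutes, cast the location IDs to strings). One way to avoid repeating them is a small helper. The sketch below is hypothetical: `prepare_trips` and `CATEGORICAL` are names introduced here, not part of the notebook, and the toy frame uses made-up values.

```python
import pandas as pd

CATEGORICAL = ["PULocationID", "DOLocationID"]


def prepare_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper bundling the preprocessing steps used in this notebook."""
    df = df.copy()

    # Trip duration in minutes.
    delta = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    df["duration"] = delta.dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes (inclusive).
    df = df[(df["duration"] >= 1) & (df["duration"] <= 60)].copy()

    # Cast IDs to strings so DictVectorizer one-hot encodes them.
    df[CATEGORICAL] = df[CATEGORICAL].astype(str)
    return df


# Made-up frame: durations of 0.5, 10 and 90 minutes; only the middle trip survives.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-01 08:00:00"] * 3),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 08:00:30", "2023-01-01 08:10:00", "2023-01-01 09:30:00"]
    ),
    "PULocationID": [1, 161, 43],
    "DOLocationID": [2, 141, 237],
})

prepared = prepare_trips(toy)
print(prepared[["PULocationID", "DOLocationID", "duration"]])
```

With such a helper, each month would be prepared with one call on the frame returned by `pd.read_parquet`, and the vectorizer would still be fitted on the training month only.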