## ML Zoomcamp 2025 - HW 2



Q1) What column has missing values?

In [21]:
import pandas as pd

df = pd.read_csv('car_fuel_efficiency.csv')
columns = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year', 'fuel_efficiency_mpg']
df = df[columns]

missing_counts = df.isnull().sum()
print(missing_counts)

engine_displacement      0
horsepower             708
vehicle_weight           0
model_year               0
fuel_efficiency_mpg      0
dtype: int64


Q2) What is the median of horsepower?

In [24]:
median_hp = df['horsepower'].median()
print('Median horsepower:', median_hp)

Median horsepower: 149.0


Q3)

- We need to deal with missing values for the column from Q1.
- We have two options: fill it with 0 or with the mean of this variable.
- Try both options. For each, train a linear regression model without regularization using the code from the lessons.
- For computing the mean, use the training set only!
- Use the validation dataset to evaluate the models and compare the RMSE of each option.
- Round the RMSE scores to 2 decimal digits using round(score, 2)
- Which option gives better RMSE?

In [27]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

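# Shuffle once with seed 42. This overwrites df, so later cells that call
# df.sample() start from this shuffled order rather than the file order.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)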
n = len(df)
n_train = int(0.6 * n)
n_val = int(0.2 * n)

df_train = df.iloc[:n_train].copy()
df_val = df.iloc[n_train:n_train+n_val].copy()

features = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year']

# Option 1: fill missing horsepower values with 0
df_train_0 = df_train.copy()
df_val_0 = df_val.copy()
df_train_0['horsepower'] = df_train_0['horsepower'].fillna(0)
df_val_0['horsepower'] = df_val_0['horsepower'].fillna(0)

model_0 = LinearRegression()
model_0.fit(df_train_0[features], df_train_0['fuel_efficiency_mpg'])
pred_0 = model_0.predict(df_val_0[features])
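# Take the square root of the MSE explicitly; the squared=False argument
# was deprecated in scikit-learn 1.4.
rmse_0 = np.sqrt(mean_squared_error(df_val_0['fuel_efficiency_mpg'], pred_0))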
print('RMSE (fill with 0):', round(rmse_0, 2))

# Option 2: fill with the mean, computed on the training split only
mean_hp = df_train['horsepower'].mean()
df_train_mean = df_train.copy()
df_val_mean = df_val.copy()
df_train_mean['horsepower'] = df_train_mean['horsepower'].fillna(mean_hp)
df_val_mean['horsepower'] = df_val_mean['horsepower'].fillna(mean_hp)

model_mean = LinearRegression()
model_mean.fit(df_train_mean[features], df_train_mean['fuel_efficiency_mpg'])
pred_mean = model_mean.predict(df_val_mean[features])
rmse_mean = np.sqrt(mean_squared_error(df_val_mean['fuel_efficiency_mpg'], pred_mean))
print('RMSE (fill with mean):', round(rmse_mean, 2))


RMSE (fill with 0): 0.52
RMSE (fill with mean): 0.46
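
RMSE is the square root of the mean squared error between the targets and the predictions. A minimal NumPy sketch of that definition follows. The helper name `rmse` and the toy numbers are assumptions for illustration and are not taken from the homework dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, computed directly with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values (hypothetical): errors are 1, 0 and -2
toy_true = [3.0, 5.0, 7.0]
toy_pred = [2.0, 5.0, 9.0]
toy_score = rmse(toy_true, toy_pred)
print(round(toy_score, 2))  # sqrt(5/3), which rounds to 1.29
```

A helper like this does not depend on which scikit-learn version is installed.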




Q4)

- Now let's train a regularized linear regression.
- For this question, fill the NAs with 0.
- Try different values of r from this list: [0, 0.01, 0.1, 1, 5, 10, 100].
- Use RMSE to evaluate the model on the validation dataset.
- Round the RMSE scores to 2 decimal digits.
- Which r gives the best RMSE?
- If multiple options give the same best RMSE, select the smallest r.

In [19]:
from sklearn.linear_model import Ridge

df_train_zero = df_train.fillna(0)
df_val_zero = df_val.fillna(0)
X_train = df_train_zero[features].values
y_train = df_train_zero['fuel_efficiency_mpg'].values
X_val = df_val_zero[features].values
y_val = df_val_zero['fuel_efficiency_mpg'].values

for r in [0, 0.01, 0.1, 1, 5, 10, 100]:
    model = Ridge(alpha=r)
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    print(f"r={r}: RMSE={round(rmse, 2)}")

r=0: RMSE=0.52
r=0.01: RMSE=0.52
r=0.1: RMSE=0.52
r=1: RMSE=0.52
r=5: RMSE=0.52
r=10: RMSE=0.52
r=100: RMSE=0.52
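
The parameter r in Q4 is a regularization strength. A minimal sketch of one common way to apply it, assuming the lesson code follows the usual regularized normal equation w = (XᵀX + rI)⁻¹Xᵀy with a bias column prepended to X. The helper name `train_linear_regression_reg` and the synthetic data are assumptions for illustration, not the homework dataset.

```python
import numpy as np

def train_linear_regression_reg(X, y, r=0.0):
    """Solve (X^T X + r I) w = X^T y, with a column of ones prepended for the bias."""
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T @ X + r * np.eye(X.shape[1])  # r is added to the diagonal
    w = np.linalg.solve(XTX, X.T @ y)
    return w[0], w[1:]  # bias, feature weights

# Synthetic data (hypothetical): y = 2 + 3*x1 - x2 + small noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 2.0 + 3.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

for r in [0, 0.01, 1, 100]:
    w0, w = train_linear_regression_reg(X_toy, y_toy, r=r)
    print(r, round(float(w0), 3), np.round(w, 3))
```

Note that scikit-learn's `Ridge(alpha=r)` does not penalize the intercept, while this form penalizes the bias term too, so the two approaches can give slightly different numbers for the same r.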




Q5)

- We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
- Try different seed values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
- For each seed, do the train/validation/test split with 60%/20%/20% distribution.
- Fill the missing values with 0 and train a model without regularization.
- For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
- What's the standard deviation of all the scores? To compute the standard deviation, use np.std.
- Round the result to 3 decimal digits (round(std, 3))

In [30]:
rmse_scores = []
for seed in range(10):
    df_shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n = len(df_shuffled)
    n_train = int(n * 0.6)
    n_val = int(n * 0.2)
    df_train = df_shuffled[:n_train].fillna(0)
    df_val = df_shuffled[n_train:n_train+n_val].fillna(0)
    X_train = df_train[features].values
    y_train = df_train['fuel_efficiency_mpg'].values
    X_val = df_val[features].values
    y_val = df_val['fuel_efficiency_mpg'].values
    model = LinearRegression()
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    rmse_scores.append(rmse)
std = np.std(rmse_scores)
print('Std of RMSE:', round(std, 3))

Std of RMSE: 0.008
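
The 60%/20%/20% split is repeated for every seed, so it can help to wrap it in a function. A minimal sketch, assuming a hypothetical helper `split_60_20_20` and a toy DataFrame. It shuffles with a seeded NumPy permutation rather than `df.sample`, so it will not reproduce the exact rows or scores from the cell above.

```python
import numpy as np
import pandas as pd

def split_60_20_20(frame, seed):
    """Shuffle rows with a seeded permutation and cut into train/val/test."""
    n = len(frame)
    n_train = int(n * 0.6)
    n_val = int(n * 0.2)
    order = np.random.default_rng(seed).permutation(n)
    shuffled = frame.iloc[order].reset_index(drop=True)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]  # the remainder goes to test
    return train, val, test

# Toy frame (hypothetical), 103 rows so the sizes do not divide evenly
toy = pd.DataFrame({"x": np.arange(103), "y": np.arange(103) * 2.0})
train, val, test = split_60_20_20(toy, seed=1)
print(len(train), len(val), len(test))
```

Because the function takes the original frame and the seed as arguments, each seed starts from the same row order.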




Q6)

- Split the dataset like previously, use seed 9.
- Combine train and validation datasets.
- Fill the missing values with 0 and train a model with r=0.001.
- What's the RMSE on the test dataset?

In [33]:
seed = 9
df_shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
n = len(df_shuffled)
n_train = int(n * 0.6)
n_val = int(n * 0.2)

df_train = df_shuffled[:n_train].fillna(0)
df_val = df_shuffled[n_train:n_train+n_val].fillna(0)
df_test = df_shuffled[n_train+n_val:].fillna(0)

df_combined = pd.concat([df_train, df_val])
X_train = df_combined[features].values
y_train = df_combined['fuel_efficiency_mpg'].values
X_test = df_test[features].values
y_test = df_test['fuel_efficiency_mpg'].values

model = Ridge(alpha=0.001)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
print("Test RMSE (r=0.001):", round(rmse_test, 2))

Test RMSE (r=0.001): 0.53


