# Homework 2 - ML Zoomcamp 2025

Hi, I'm Norman! This is my submission for Homework 2 of ML Zoomcamp 2025.

Let’s connect on [LinkedIn](https://www.linkedin.com/in/anormanangel/) and [Twitter](https://x.com/anormanangel) to keep learning together.


In [1]:
# Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Importing the dataset

df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv')
df.head()

|   | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


### Question 1. Missing values (1 point)


In [3]:
df.isna().sum()

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64

### Question 2. Median for horsepower (1 point)

In [4]:
# Calculating the median of horsepower

df.horsepower.median()

149.0

### Question 3. Filling NAs (1 point)

ANSWER = MEAN 

When filling missing numeric values (such as num_cylinders, horsepower, or acceleration), using the mean (or median) is usually a better choice than filling with 0, because:

1. 0 is not a plausible value for features like horsepower or acceleration, so zero-filling adds artificial outliers that distort the data.
2. Mean imputation leaves each column's mean unchanged, although it does shrink the column's variance.
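
To make the two points above concrete, here is a minimal sketch on made-up horsepower values (the numbers are illustrative and not taken from the dataset). It uses only the standard library, and the variable names are hypothetical.

```python
from statistics import mean, pvariance

# Illustrative horsepower readings; None marks a missing value (made-up data).
raw = [100.0, 150.0, None, 200.0, None]

observed = [v for v in raw if v is not None]
col_mean = mean(observed)  # mean of the non-missing values

# Two candidate imputations of the same column.
zero_filled = [0.0 if v is None else v for v in raw]
mean_filled = [col_mean if v is None else v for v in raw]

print("mean:     observed, zero-filled, mean-filled")
print(mean(observed), mean(zero_filled), mean(mean_filled))

print("variance: observed, mean-filled")
print(pvariance(observed), pvariance(mean_filled))
```

In this toy column, zero-filling pulls the mean well below the observed mean, while mean-filling keeps the mean where it was but lowers the variance, because every filled row sits exactly at the center.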

In [6]:
# Fill numerical columns with their respective means

# Define the columns to fill
cols = ['num_cylinders', 'horsepower', 'acceleration', 'num_doors']

for col in cols:
    if col in df.columns and df[col].isna().sum() > 0:
        mean_value = df[col].mean()
        df[col] = df[col].fillna(mean_value)  # assign back instead of a chained in-place fill
        print(f"Filled {col} with mean: {mean_value}")

# Check if all NAs are filled
print("\nMissing values after filling:")
print(df.isna().sum())

# Let's check the median of horsepower after filling
print(f"\nMedian of horsepower after filling NAs: {df.horsepower.median()}")


Missing values after filling:
engine_displacement    0
num_cylinders          0
horsepower             0
vehicle_weight         0
acceleration           0
model_year             0
origin                 0
fuel_type              0
drivetrain             0
num_doors              0
fuel_efficiency_mpg    0
dtype: int64

Median of horsepower after filling NAs: 149.65729212983547


### Question 4. Best regularization (1 point)

ANSWER = 0.01

Regularization controls model complexity: a small value such as 0.01 usually adds enough penalty to reduce overfitting without underfitting the model.

- 0: No regularization → overfitting risk
- 10 or 100: Too strong → underfitting
- 1: Moderate, but might still be a bit high
- 0.01: Often the best balance between bias and variance
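
As a rough illustration of how the penalty strength acts on a model, the sketch below uses the closed-form ridge solution for a single feature with no intercept, `slope = sum(x*y) / (sum(x*x) + alpha)`. The function `ridge_slope` and the data are hypothetical, chosen only to show the shrinkage effect; it does not show which value is best for this dataset.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge slope for one feature and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # alpha is added to the denominator, so a larger alpha shrinks the slope.
    return sxy / (sxx + alpha)

# Made-up data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for alpha in [0, 0.01, 1, 10, 100]:
    print(f"alpha = {alpha:>6}: slope = {ridge_slope(xs, ys, alpha):.4f}")
```

With `alpha = 0` the slope is the ordinary least-squares fit. As `alpha` grows, the slope is pulled toward zero, which is the underfitting risk described for large values.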



### Question 5. RMSE Standard Deviation (1 point)

In [8]:
# Question 5: RMSE Standard Deviation using Cross-Validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Prepare the data
# Select features (excluding the target variable)
features = ['engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight', 
           'acceleration', 'model_year']

X = df[features].copy()
y = df['fuel_efficiency_mpg'].copy()

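# Standardize the features
# (note: fitting the scaler on all rows before splitting lets validation rows influence the scaling)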
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform 5-fold cross-validation with different regularization values
kf = KFold(n_splits=5, shuffle=True, random_state=42)
regularization_values = [0, 0.01, 0.1, 1, 10]

rmse_scores = {}

for alpha in regularization_values:
    fold_rmses = []
    
    for train_idx, val_idx in kf.split(X_scaled):
        X_train, X_val = X_scaled[train_idx], X_scaled[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
        # alpha = 0 means no penalty, so use plain LinearRegression; otherwise use Ridge
        if alpha == 0:
            model = LinearRegression()
        else:
            model = Ridge(alpha=alpha)
        
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        fold_rmses.append(rmse)
    
    rmse_scores[alpha] = fold_rmses

# Calculate standard deviation of RMSE for each regularization value
print("RMSE Standard Deviation for different regularization values:")
print("-" * 60)

for alpha in regularization_values:
    rmse_std = np.std(rmse_scores[alpha])
    rmse_mean = np.mean(rmse_scores[alpha])
    print(f"Alpha = {alpha:4}: Mean RMSE = {rmse_mean:.3f}, Std RMSE = {rmse_std:.3f}")

# Find the regularization with the lowest standard deviation
best_alpha = min(regularization_values, key=lambda x: np.std(rmse_scores[x]))
best_std = np.std(rmse_scores[best_alpha])

print(f"\nBest regularization (lowest std): {best_alpha}")
print(f"Standard deviation of RMSE: {best_std:.3f}")

# Check which option it's closest to
options = [0.001, 0.006, 0.060, 0.600]
closest_option = min(options, key=lambda x: abs(x - best_std))
print(f"\nClosest to option: {closest_option}")

RMSE Standard Deviation for different regularization values:
------------------------------------------------------------
Alpha =    0: Mean RMSE = 0.396, Std RMSE = 0.005
Alpha = 0.01: Mean RMSE = 0.396, Std RMSE = 0.005
Alpha =  0.1: Mean RMSE = 0.396, Std RMSE = 0.005
Alpha =    1: Mean RMSE = 0.396, Std RMSE = 0.005
Alpha =   10: Mean RMSE = 0.396, Std RMSE = 0.005

Best regularization (lowest std): 0
Standard deviation of RMSE: 0.005

Closest to option: 0.006
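
The "Std RMSE" column is computed with `np.std`, which by default returns the population standard deviation (it divides by the number of folds, not by one less). The sketch below reproduces both quantities with the standard library on made-up fold scores; `rmse` is a hypothetical helper and the scores are not the ones from the run above.

```python
from math import sqrt
from statistics import pstdev, stdev

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# RMSE of a tiny made-up prediction.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))

# Made-up per-fold RMSE scores from a 5-fold split.
fold_rmses = [0.50, 0.52, 0.48, 0.51, 0.49]

print(pstdev(fold_rmses))  # population std, matches np.std's default (ddof=0)
print(stdev(fold_rmses))   # sample std, matches np.std(..., ddof=1)
```

With only five folds the two definitions differ noticeably, so it helps to state which one a reported spread uses.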


### Question 6. Evaluation on test (1 point)

ANSWER = 0.515

The test RMSE was 0.396, and among the given options (0.15, 0.515, 5.15, 51.5), the closest value is 0.515.
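
The "closest value" claim can be checked by listing the absolute distance from the quoted test RMSE to each option. This is a small sketch; the variable names are hypothetical.

```python
test_rmse = 0.396  # test RMSE quoted in the text above
options = [0.15, 0.515, 5.15, 51.5]

# Absolute distance from the measured RMSE to each answer option.
distances = {opt: abs(opt - test_rmse) for opt in options}
for opt, dist in distances.items():
    print(f"option {opt}: distance {dist:.3f}")

closest = min(distances, key=distances.get)
print(f"closest option: {closest}")
```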

In [9]:
# Question 6: Evaluation on Test Set
from sklearn.model_selection import train_test_split

# Split the data into train and test sets (80-20 split)
X_train_final, X_test, y_train_final, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train_final.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

# Train models with different regularization values on the training set
test_rmse_scores = {}

for alpha in regularization_values:
    if alpha == 0:
        model = LinearRegression()
    else:
        model = Ridge(alpha=alpha)
    
    # Fit on training data
    model.fit(X_train_final, y_train_final)
    
    # Predict on test data
    y_test_pred = model.predict(X_test)
    
    # Calculate RMSE on test set
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_rmse_scores[alpha] = test_rmse
    
    print(f"Alpha = {alpha:4}: Test RMSE = {test_rmse:.3f}")

# Find the best regularization parameter (lowest test RMSE)
best_test_alpha = min(test_rmse_scores.keys(), key=lambda x: test_rmse_scores[x])
best_test_rmse = test_rmse_scores[best_test_alpha]

print(f"\nBest regularization on test set: {best_test_alpha}")
print(f"Best test RMSE: {best_test_rmse:.3f}")

# Compare with cross-validation results
print("\nComparison:")
print(f"Best alpha from CV (lowest std): {best_alpha}")
print(f"Best alpha from test evaluation: {best_test_alpha}")

# Train final model with best regularization and evaluate
if best_test_alpha == 0:
    final_model = LinearRegression()
else:
    final_model = Ridge(alpha=best_test_alpha)

final_model.fit(X_train_final, y_train_final)
final_test_pred = final_model.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_test_pred))

print(f"\nFinal model test RMSE: {final_rmse:.3f}")

# Additional evaluation metrics
from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(y_test, final_test_pred)
r2 = r2_score(y_test, final_test_pred)

print(f"Mean Absolute Error: {mae:.3f}")
print(f"R² Score: {r2:.3f}")

Training set size: 7763
Test set size: 1941
Alpha =    0: Test RMSE = 0.396
Alpha = 0.01: Test RMSE = 0.396
Alpha =  0.1: Test RMSE = 0.396
Alpha =    1: Test RMSE = 0.396
Alpha =   10: Test RMSE = 0.396

Best regularization on test set: 0
Best test RMSE: 0.396

Comparison:
Best alpha from CV (lowest std): 0
Best alpha from test evaluation: 0

Final model test RMSE: 0.396
Mean Absolute Error: 0.315
R² Score: 0.976


### Homework URL 

https://github.com/anormanangel/Machine-Learning-Zoomcamp/blob/main/02-Regression/Home%20work%202.ipynb

### Learning in Public

https://x.com/anormanangel/status/1975651301745664129