## Homework

> Note: sometimes your answer doesn't match one of 
> the options exactly. That's fine. 
> Select the option that's closest to your solution.
> If it's exactly in between two options, select the higher value.


### Dataset

In this homework, we continue using the fuel efficiency dataset.
Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can do it with wget:

In [2]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv

--2025-11-02 12:40:02--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 874188 (854K) [text/plain]
Saving to: ‘car_fuel_efficiency.csv’


2025-11-02 12:40:02 (129 MB/s) - ‘car_fuel_efficiency.csv’ saved [874188/874188]




The goal of this homework is to create a regression model for predicting the car fuel efficiency (column `'fuel_efficiency_mpg'`).



### Preparing the dataset 

Preparation:

* Fill missing values with zeros. ✔️
* Do train/validation/test split with 60%/20%/20% distribution. ✔️
* Use the `train_test_split` function and set the `random_state` parameter to 1. ✔️
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.

In [1]:
import pandas as pd, numpy as np

file = 'car_fuel_efficiency.csv'
df_original = pd.read_csv(file)
df = df_original.copy()
display(df.isnull().sum())
df = df.fillna(0)
df

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64

Unnamed: 0,engine_displacement,num_cylinders,horsepower,vehicle_weight,acceleration,model_year,origin,fuel_type,drivetrain,num_doors,fuel_efficiency_mpg
0,170,3.0,159.0,3413.433759,17.7,2003,Europe,Gasoline,All-wheel drive,0.0,13.231729
1,130,5.0,97.0,3149.664934,17.8,2007,USA,Gasoline,Front-wheel drive,0.0,13.688217
2,170,0.0,78.0,3079.038997,15.1,2018,Europe,Gasoline,Front-wheel drive,0.0,14.246341
3,220,4.0,0.0,2542.392402,20.2,2009,USA,Diesel,All-wheel drive,2.0,16.912736
4,210,1.0,140.0,3460.870990,14.4,2009,Europe,Gasoline,All-wheel drive,2.0,12.488369
...,...,...,...,...,...,...,...,...,...,...,...
9699,140,5.0,164.0,2981.107371,17.3,2013,Europe,Diesel,Front-wheel drive,0.0,15.101802
9700,180,0.0,154.0,2439.525729,15.0,2004,USA,Gasoline,All-wheel drive,0.0,17.962326
9701,220,2.0,138.0,2583.471318,15.1,2008,USA,Diesel,All-wheel drive,-1.0,17.186587
9702,230,4.0,177.0,2905.527390,19.4,2011,USA,Diesel,Front-wheel drive,1.0,15.331551


In [2]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.fuel_efficiency_mpg.values
y_val = df_val.fuel_efficiency_mpg.values
y_test = df_test.fuel_efficiency_mpg.values

In [3]:
del df_train['fuel_efficiency_mpg']
del df_val['fuel_efficiency_mpg']
del df_test['fuel_efficiency_mpg']

In [4]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=True)
train_dict = df_train.to_dict(orient='records')
val_dicts = df_val.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
display(X_train)
X_val = dv.transform(val_dicts)
X_val

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 58220 stored elements and shape (5822, 14)>

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 19410 stored elements and shape (1941, 14)>

## Question 1

Let's train a decision tree regressor to predict the `fuel_efficiency_mpg` variable. 

* Train a model with `max_depth=1`.


Which feature is used for splitting the data?


* `'vehicle_weight'` <---
* `'model_year'`
* `'origin'`
* `'fuel_type'`

In [5]:
from sklearn.tree import export_text

dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))

|--- vehicle_weight <= 3022.11
|   |--- value: [16.88]
|--- vehicle_weight >  3022.11
|   |--- value: [12.94]
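
Besides reading the printed tree, the root split can also be looked up programmatically. The sketch below is self-contained and uses synthetic data with hypothetical feature names (`feature_a`, `feature_b`, `feature_c`); it relies on the fitted tree's `tree_.feature` array, whose first entry is the column index used at the root node.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Hypothetical data: the target depends only on the second column.
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
demo_names = ["feature_a", "feature_b", "feature_c"]

stump = DecisionTreeRegressor(max_depth=1, random_state=1)
stump.fit(X_demo, y_demo)

# tree_.feature[0] holds the column index used for the split at the root node.
root_feature = demo_names[stump.tree_.feature[0]]
print(root_feature)
```

In this notebook the equivalent lookup would presumably be `dv.get_feature_names_out()[dt.tree_.feature[0]]`.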



## Question 2

Train a random forest regressor with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1` (optional - to make training faster)


What's the RMSE of this model on the validation data?

* 0.045
* 0.45 <---
* 4.5
* 45.0

In [9]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rfr = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rfr.fit(X_train, y_train)

# Predict
y_pred = rfr.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"RMSE: {rmse}")

RMSE: 0.4595777223092726
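
For reference, the quantity computed by `np.sqrt(mean_squared_error(y_val, y_pred))` is the root mean squared error. The symbols are introduced here for illustration: $y_i$ is the true value for validation row $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of validation rows.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
$$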


## Question 3

Now let's experiment with the `n_estimators` parameter:

* Try different values of this parameter from 10 to 200 with step 10.
* Set `random_state` to `1`.
* Evaluate the model on the validation dataset.


After which value of `n_estimators` does RMSE stop improving?
Consider 3 decimal places for calculating the answer.

- 10
- 25
- 80
- 200 <---

If it doesn't stop improving, use the last `n_estimators` value you tried
in your answer.

In [8]:
rmse_values = []

for n in range(10, 201, 10):
    print(f'n = {n}')
    model = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    rmse_values.append((n, rmse))

# Print results
for n, rmse in rmse_values:
    print(f"n_estimators: {n}, RMSE: {rmse:.3f}")

n = 10
n = 20
n = 30
n = 40
n = 50
n = 60
n = 70
n = 80
n = 90
n = 100
n = 110
n = 120
n = 130
n = 140
n = 150
n = 160
n = 170
n = 180
n = 190
n = 200
n_estimators: 10, RMSE: 0.460
n_estimators: 20, RMSE: 0.454
n_estimators: 30, RMSE: 0.452
n_estimators: 40, RMSE: 0.449
n_estimators: 50, RMSE: 0.447
n_estimators: 60, RMSE: 0.445
n_estimators: 70, RMSE: 0.445
n_estimators: 80, RMSE: 0.445
n_estimators: 90, RMSE: 0.445
n_estimators: 100, RMSE: 0.445
n_estimators: 110, RMSE: 0.444
n_estimators: 120, RMSE: 0.444
n_estimators: 130, RMSE: 0.444
n_estimators: 140, RMSE: 0.443
n_estimators: 150, RMSE: 0.443
n_estimators: 160, RMSE: 0.443
n_estimators: 170, RMSE: 0.443
n_estimators: 180, RMSE: 0.442
n_estimators: 190, RMSE: 0.442
n_estimators: 200, RMSE: 0.442
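
One way to read the answer off this output is to find the `n_estimators` value at which the rounded RMSE last decreased. The helper below is a sketch of that reading (the function name `last_improvement` is made up here); the data are the rounded values printed above.

```python
# Rounded validation RMSE values copied from the output above.
rmse_by_n = [
    (10, 0.460), (20, 0.454), (30, 0.452), (40, 0.449), (50, 0.447),
    (60, 0.445), (70, 0.445), (80, 0.445), (90, 0.445), (100, 0.445),
    (110, 0.444), (120, 0.444), (130, 0.444), (140, 0.443), (150, 0.443),
    (160, 0.443), (170, 0.443), (180, 0.442), (190, 0.442), (200, 0.442),
]


def last_improvement(results, decimals=3):
    """Return the n_estimators value at which the rounded RMSE last decreased."""
    best_n, best_rmse = None, float("inf")
    for n, rmse in results:
        rounded = round(rmse, decimals)
        if rounded < best_rmse:  # strictly better at the chosen precision
            best_n, best_rmse = n, rounded
    return best_n


print(last_improvement(rmse_by_n))  # 180
```

At three decimal places the RMSE keeps improving until 180, and 200 is the closest of the listed options.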


## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values,
  * try different values of `n_estimators` from 10 to 200 (with step 10)
  * calculate the mean RMSE 
* Fix the random seed: `random_state=1`


What's the best `max_depth`, using the mean RMSE?

* 10
* 15
* 20
* 25
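
The procedure described in the question can be sketched as a double loop that evaluates each model on the validation split, as in Question 3. This is only an illustration under stated assumptions: it runs on synthetic data from `make_regression`, uses a reduced grid so it finishes quickly, and the helper name `mean_rmse_by_depth` is made up here. Note that the cell in this notebook takes a different route and scores each setting with 5-fold cross-validation on the training data, which fits five models per setting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def mean_rmse_by_depth(X_tr, y_tr, X_va, y_va, depths, n_values):
    """For each max_depth, average the validation RMSE over all n_estimators values."""
    means = {}
    for depth in depths:
        rmses = []
        for n in n_values:
            model = RandomForestRegressor(
                n_estimators=n, max_depth=depth, random_state=1, n_jobs=-1
            )
            model.fit(X_tr, y_tr)
            rmse = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))
            rmses.append(rmse)
        means[depth] = float(np.mean(rmses))
    return means


# Synthetic stand-in data so the sketch runs on its own.
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=1.0, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.25, random_state=1)

# A reduced grid keeps the demo fast; the homework grid is
# depths=[10, 15, 20, 25] and n_values=range(10, 201, 10).
demo_means = mean_rmse_by_depth(X_tr, y_tr, X_va, y_va, depths=[2, 5], n_values=[10, 20])
best_depth_demo = min(demo_means, key=demo_means.get)
print(demo_means, best_depth_demo)
```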

In [None]:
import csv
import os
from sklearn.model_selection import cross_val_score

from tqdm.auto import tqdm

max_depth_values = [10, 15, 20, 25]
n_estimators_range = range(10, 201, 10)
results_file = "results.csv"
results = {}

# --- Load existing results if file exists ---
if os.path.exists(results_file):
    with open(results_file, "r") as f:
        reader = csv.reader(f)
        for row in reader:
            if not row or row[0] == "max_depth":
                continue  # skip blank lines and header rows
            depth = int(row[0])
            print(f'{depth} detected')
            mean_rmse = float(row[1])
            results[depth] = mean_rmse
            print(f"Result for depth={depth}: {mean_rmse:.4f}")

# --- Open file for appending new results ---
with open(results_file, "a", newline="") as f:
    writer = csv.writer(f)
    if os.path.getsize(results_file) == 0:  # write header only if the file is new or empty
        print('header being written')
        writer.writerow(["max_depth", "mean_rmse"])

    for depth in tqdm(max_depth_values):
        if depth in results:
            print(f"Skipping depth={depth} (already computed)")
            continue

        print(f"Computing depth={depth}")
        rmse_list = []
        for n in tqdm(n_estimators_range, leave=True):
            print(f"    estimator={n}")
            model = RandomForestRegressor(
                max_depth=depth,
                n_estimators=n,
                random_state=1,
                n_jobs=-1
            )
            scores = cross_val_score(model, X_train, y_train,
                                     scoring='neg_root_mean_squared_error', cv=5)
            mean_rmse = -np.mean(scores)
            rmse_list.append(mean_rmse)

        avg_rmse = np.mean(rmse_list)
        results[depth] = avg_rmse
        writer.writerow([depth, avg_rmse])
        f.flush()
        print(f"Result for depth={depth}: {avg_rmse:.4f}")

# --- Final result ---
best_depth = min(results, key=results.get)
print("Best max_depth:", best_depth)
print("Mean RMSE:", results[best_depth])


10 detected
Result for depth=10: 0.4432


  0%|          | 0/4 [00:00<?, ?it/s]

Skipping depth=10 (already computed)
Computing depth=15


  0%|          | 0/20 [00:00<?, ?it/s]

    estimator=10
    estimator=20
    estimator=30
    estimator=40
    estimator=50
    estimator=60
    estimator=70
    estimator=80
    estimator=90
    estimator=100
    estimator=110
    estimator=120
    estimator=130
    estimator=140
    estimator=150
    estimator=160
    estimator=170
    estimator=180
    estimator=190
    estimator=200
Result for depth=15: 0.4457
Computing depth=20


  0%|          | 0/20 [00:00<?, ?it/s]

    estimator=10
    estimator=20
    estimator=30
    estimator=40
    estimator=50
    estimator=60
    estimator=70
    estimator=80


## Question 5

We can extract feature importance information from tree-based models. 

At each step, the decision tree learning algorithm finds the best split. 
While doing so, it can calculate the "gain": the reduction in impurity achieved by the split. 
This gain is useful for understanding which features matter most to a tree-based model.

In Scikit-Learn, tree-based models contain this information in the
[`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_)
field. 

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
  * `n_estimators=10`,
  * `max_depth=20`,
  * `random_state=1`,
  * `n_jobs=-1` (optional)
* Get the feature importance information from this model


What's the most important feature (among these 4)? 

* `vehicle_weight`
* `horsepower`
* `acceleration`
* `engine_displacement`
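
A minimal sketch of reading feature importances from a random forest follows. It uses synthetic data with hypothetical feature names so that it runs on its own; the model parameters are the ones listed in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical data: the target is driven almost entirely by the first column.
names = ["feature_a", "feature_b", "feature_c"]
X_imp = rng.normal(size=(300, 3))
y_imp = 10.0 * X_imp[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
forest.fit(X_imp, y_imp)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

In this notebook the names would presumably come from `dv.get_feature_names_out()`, which is aligned with the columns of `X_train`.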

## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` from `0.3` to `0.1`.

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* Both give equal value