## Homework

> Note: sometimes your answer doesn't match one of 
> the options exactly. That's fine. 
> Select the option that's closest to your solution.
> If it's exactly in between two options, select the higher value.


### Dataset

In this homework, we continue using the fuel efficiency dataset.
Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

The goal of this homework is to create a regression model for predicting the car fuel efficiency (column `'fuel_efficiency_mpg'`).



### Preparing the dataset 

Preparation:

* Fill missing values with zeros.
* Do train/validation/test split with 60%/20%/20% distribution. 
* Use the `train_test_split` function and set the `random_state` parameter to 1.
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

car_fuel_eff_url = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
df = pd.read_csv(car_fuel_eff_url)

In [2]:
# Filling "na"s with 0s
df = df.fillna(0)

In [3]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train['fuel_efficiency_mpg']
y_val = df_val['fuel_efficiency_mpg']
y_test = df_test['fuel_efficiency_mpg']

df_train.drop(columns=['fuel_efficiency_mpg'], inplace=True)
df_val.drop(columns=['fuel_efficiency_mpg'], inplace=True)
df_test.drop(columns=['fuel_efficiency_mpg'], inplace=True)
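
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the rows: 0.25 * 0.8 = 0.2, which gives the 60%/20%/20% distribution. The sketch below checks that arithmetic on a synthetic array of 1,000 rows; the array and the `toy_*` names are illustrative assumptions, not part of the homework data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataframe: 1,000 row ids (assumption for illustration)
toy_rows = np.arange(1000)

# First split: hold out 20% of all rows for testing
toy_full_train, toy_test = train_test_split(toy_rows, test_size=0.2, random_state=1)

# Second split: 25% of the remaining 80% equals 20% of the original rows
toy_train, toy_val = train_test_split(toy_full_train, test_size=0.25, random_state=1)

split_sizes = (len(toy_train), len(toy_val), len(toy_test))
print(split_sizes)
```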


In [5]:
# prepare data for dict vectorizer
train_dicts = df_train.to_dict(orient='records')
val_dicts = df_val.to_dict(orient='records')
test_dicts = df_test.to_dict(orient='records')

# init dict vectorizer
dv = DictVectorizer(sparse=True)

dv.fit(train_dicts)

# encode the train, validation, and test sets
X_train = dv.transform(train_dicts)
X_val = dv.transform(val_dicts)
X_test = dv.transform(test_dicts)

## Question 1

Let's train a decision tree regressor to predict the `fuel_efficiency_mpg` variable. 

* Train a model with `max_depth=1`.


Which feature is used for splitting the data?


* `'vehicle_weight'`
* `'model_year'`
* `'origin'`
* `'fuel_type'`



In [6]:
from sklearn.tree import DecisionTreeRegressor

In [7]:
dtr = DecisionTreeRegressor(max_depth=1)

dtr.fit(X=X_train, y=y_train)


parameter                 value
criterion                 'squared_error'
splitter                  'best'
max_depth                 1
min_samples_split         2
min_samples_leaf          1
min_weight_fraction_leaf  0.0
max_features              None
random_state              None
max_leaf_nodes            None
min_impurity_decrease     0.0


In [8]:
dv.get_feature_names_out()

array(['acceleration', 'drivetrain=All-wheel drive',
       'drivetrain=Front-wheel drive', 'engine_displacement',
       'fuel_type=Diesel', 'fuel_type=Gasoline', 'horsepower',
       'model_year', 'num_cylinders', 'num_doors', 'origin=Asia',
       'origin=Europe', 'origin=USA', 'vehicle_weight'], dtype=object)

In [9]:
features_df = pd.DataFrame(list(zip(dv.get_feature_names_out(), dtr.feature_importances_)), columns=['feature', 'importance'])
features_df.sort_values(by='importance', ascending=False, inplace=True)
features_df.head(1)


           feature  importance
13  vehicle_weight         1.0
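
With `max_depth=1` the tree has a single split, so the split feature can also be read directly from the fitted tree instead of going through the importances. The sketch below does that on synthetic data; the data, the feature names `a`, `b`, `c`, and the `stump` variable are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (assumption): the target depends almost entirely on column 'c'
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
y_toy = 5.0 * X_toy[:, 2] + 0.1 * rng.normal(size=200)
toy_feature_names = ['a', 'b', 'c']

stump = DecisionTreeRegressor(max_depth=1, random_state=1)
stump.fit(X_toy, y_toy)

# tree_.feature[0] is the column index used by the root node
root_feature = toy_feature_names[stump.tree_.feature[0]]
print(root_feature)

# export_text prints the split rule in a readable form
print(export_text(stump, feature_names=toy_feature_names))
```

In the notebook, the equivalent lookup would be `dv.get_feature_names_out()[dtr.tree_.feature[0]]`.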


## Question 2

Train a random forest regressor with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1` (optional - to make training faster)


What's the RMSE of this model on the validation data?

* 0.045
* 0.45
* 4.5
* 45.0


In [10]:
from sklearn.ensemble import RandomForestRegressor

In [11]:
rfr = RandomForestRegressor(n_estimators=10, random_state=1)

# train model
rfr.fit(X_train, y=y_train)

parameter                 value
n_estimators              10
criterion                 'squared_error'
max_depth                 None
min_samples_split         2
min_samples_leaf          1
min_weight_fraction_leaf  0.0
max_features              1.0
max_leaf_nodes            None
min_impurity_decrease     0.0
bootstrap                 True


In [12]:
from sklearn.metrics import root_mean_squared_error

y_pred = rfr.predict(X_val)
root_mean_squared_error(y_true=y_val, y_pred=y_pred)


0.4595777223092726
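
For reference, RMSE is the square root of the mean squared difference between the true and predicted values. A minimal sketch that computes it by hand with NumPy on three made-up values (the `*_toy` arrays are assumptions for illustration); this form can also stand in for `root_mean_squared_error` on scikit-learn installations that do not ship that function.

```python
import numpy as np

# Hypothetical true values and predictions
y_true_toy = np.array([3.0, 5.0, 7.0])
y_pred_toy = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the mean squared error is (1 + 0 + 4) / 3
squared_errors = (y_true_toy - y_pred_toy) ** 2
rmse_manual = float(np.sqrt(squared_errors.mean()))

print(round(rmse_manual, 3))
```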

## Question 3

Now let's experiment with the `n_estimators` parameter.

* Try different values of this parameter from 10 to 200 with step 10.
* Set `random_state` to `1`.
* Evaluate the model on the validation dataset.


After which value of `n_estimators` does RMSE stop improving?
Consider 3 decimal places for calculating the answer.

- 10
- 25
- 80
- 200

If it doesn't stop improving, use the latest iteration number in
your answer.



In [13]:
import numpy as np

In [15]:
n_estimators = np.arange(10, 210, step=10)

RANDOM_STATE = 1

n_est_rmse = []

for n_est in n_estimators:
    
    rfr_worker = RandomForestRegressor(n_estimators=n_est, random_state=RANDOM_STATE, n_jobs=-1)
    rfr_worker.fit(X=X_train, y=y_train)

    y_pred = rfr_worker.predict(X_val)
    rmse = root_mean_squared_error(y_true=y_val, y_pred=y_pred)

    n_est_rmse.append((n_est, round(rmse, 3)))


n_est_rmse

[(np.int64(10), 0.46),
 (np.int64(20), 0.454),
 (np.int64(30), 0.452),
 (np.int64(40), 0.449),
 (np.int64(50), 0.447),
 (np.int64(60), 0.445),
 (np.int64(70), 0.445),
 (np.int64(80), 0.445),
 (np.int64(90), 0.445),
 (np.int64(100), 0.445),
 (np.int64(110), 0.444),
 (np.int64(120), 0.444),
 (np.int64(130), 0.444),
 (np.int64(140), 0.443),
 (np.int64(150), 0.443),
 (np.int64(160), 0.443),
 (np.int64(170), 0.443),
 (np.int64(180), 0.442),
 (np.int64(190), 0.442),
 (np.int64(200), 0.442)]

### Visually, the RMSE stops improving significantly after `n_estimators` = 60; at 3 decimal places it still decreases slowly, reaching 0.442 at 180

In [None]:
from matplotlib import pyplot as plt

df_n_est_rmse = pd.DataFrame(n_est_rmse, columns=['n_est', 'rmse'])

plt.plot(df_n_est_rmse.n_est, df_n_est_rmse.rmse)
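
Reading the plateau off a plot is subjective, so it can help to check the rounded values directly. The sketch below takes the `(n_estimators, rmse)` pairs printed above and finds the first value of `n_estimators` that reaches the lowest rounded RMSE. Treating "stops improving" as "first reaches the minimum at 3 decimal places" is one possible reading of the question, not the only one.

```python
# (n_estimators, RMSE rounded to 3 decimals), copied from the output above
rmse_by_n = [
    (10, 0.46), (20, 0.454), (30, 0.452), (40, 0.449), (50, 0.447),
    (60, 0.445), (70, 0.445), (80, 0.445), (90, 0.445), (100, 0.445),
    (110, 0.444), (120, 0.444), (130, 0.444), (140, 0.443), (150, 0.443),
    (160, 0.443), (170, 0.443), (180, 0.442), (190, 0.442), (200, 0.442),
]

# Lowest rounded RMSE over the whole sweep
best_rmse = min(rmse for _, rmse in rmse_by_n)

# First n_estimators at which that minimum is reached
first_best_n = next(n for n, rmse in rmse_by_n if rmse == best_rmse)

# Total gain from 60 trees onward, to quantify how flat the curve is
gain_after_60 = round(dict(rmse_by_n)[60] - best_rmse, 3)

print(best_rmse, first_best_n, gain_after_60)
```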

## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values,
  * try different values of `n_estimators` from 10 to 200 (with step 10)
  * calculate the mean RMSE 
* Fix the random seed: `random_state=1`


What's the best `max_depth`, using the mean RMSE?

* 10
* 15
* 20
* 25


In [30]:
max_depths = [10, 15, 20, 25]

# list of (max_depth, n_estimators, rmse) tuples
md_n_est_md_rmse = []

for md in max_depths:
    for n_est_md in n_estimators:
        rfr_md = RandomForestRegressor(
            n_estimators=n_est_md,
            max_depth=md,
            random_state=RANDOM_STATE)

        rfr_md.fit(X_train, y_train)

        y_pred = rfr_md.predict(X_val)
        rmse_md = root_mean_squared_error(y_val, y_pred)

        md_n_est_md_rmse.append((md, n_est_md, rmse_md))

md_n_est_md_rmse


[(10, np.int64(10), 0.4502486597058524),
 (10, np.int64(20), 0.44685703362920204),
 (10, np.int64(30), 0.44547396459413735),
 (10, np.int64(40), 0.4430673112962584),
 (10, np.int64(50), 0.44195668621793566),
 (10, np.int64(60), 0.4416730330613033),
 (10, np.int64(70), 0.4412975503694072),
 (10, np.int64(80), 0.4414352350072895),
 (10, np.int64(90), 0.4415215165581006),
 (10, np.int64(100), 0.44121699790710184),
 (10, np.int64(110), 0.440526227247825),
 (10, np.int64(120), 0.4407083659646053),
 (10, np.int64(130), 0.440629500094825),
 (10, np.int64(140), 0.44033941277349004),
 (10, np.int64(150), 0.43994270355172643),
 (10, np.int64(160), 0.43979740503833187),
 (10, np.int64(170), 0.4400174394744503),
 (10, np.int64(180), 0.4397488696817066),
 (10, np.int64(190), 0.43985420021815086),
 (10, np.int64(200), 0.43984510625501455),
 (15, np.int64(10), 0.4576238478625812),
 (15, np.int64(20), 0.45307187353797973),
 (15, np.int64(30), 0.4508686966214543),
 (15, np.int64(40), 0.448609348897959)

In [None]:
import seaborn as sns

df_md_n_est_rmse = pd.DataFrame(md_n_est_md_rmse, columns=['max_depth', 'n_estimators', 'rmse'])

df_pivot = df_md_n_est_rmse.pivot(index='n_estimators', columns='max_depth', values='rmse').round(3)

sns.heatmap(df_pivot, annot=True, fmt='.3f')

In [38]:
df_pivot.round(3)

max_depth        10     15     20     25
n_estimators
10             0.45  0.458  0.459  0.459
20            0.447  0.453  0.454  0.454
30            0.445  0.451  0.452  0.452
40            0.443  0.449  0.449  0.449
50            0.442  0.446  0.447  0.447
60            0.442  0.445  0.446  0.446
70            0.441  0.445  0.445  0.445
80            0.441  0.445  0.446  0.445
90            0.442  0.445  0.446  0.445
100           0.441  0.444  0.445  0.445


### It seems that a `max_depth` of 10 gives the smallest RMSE: it is the lowest in every row of the table above
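
The question asks for the mean RMSE per `max_depth`, which can be computed rather than read off the table. The sketch below averages only the ten rows visible in the table above (`n_estimators` from 10 to 100), so the numbers are a partial illustration and not the full 20-value mean. The name `partial_pivot` is an assumption.

```python
import pandas as pd

# Rounded RMSE values copied from the visible part of the pivot table
partial_pivot = pd.DataFrame(
    {
        10: [0.45, 0.447, 0.445, 0.443, 0.442, 0.442, 0.441, 0.441, 0.442, 0.441],
        15: [0.458, 0.453, 0.451, 0.449, 0.446, 0.445, 0.445, 0.445, 0.445, 0.444],
        20: [0.459, 0.454, 0.452, 0.449, 0.447, 0.446, 0.445, 0.446, 0.446, 0.445],
        25: [0.459, 0.454, 0.452, 0.449, 0.447, 0.446, 0.445, 0.445, 0.445, 0.445],
    },
    index=range(10, 110, 10),
)

# Column-wise mean gives one value per max_depth
mean_rmse = partial_pivot.mean()
best_depth = int(mean_rmse.idxmin())

print(mean_rmse.round(4))
print(best_depth)
```

On the full long-format results, `df_md_n_est_rmse.groupby('max_depth')['rmse'].mean()` would give the same kind of summary over all 20 values of `n_estimators`.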

## Question 5

We can extract feature importance information from tree-based models. 

At each step, the decision tree learning algorithm finds the best split. 
For that split, we can calculate the "gain": the reduction in impurity from before the split to after it. 
This gain helps us understand which features are important for tree-based models.
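
As a worked illustration of gain for a regression tree, the sketch below uses the mean squared error around the node mean as the impurity and computes the reduction for one candidate split. The four target values and the helper `node_mse` are made up for illustration; scikit-learn performs a similar computation internally and additionally weights and normalizes the result when building `feature_importances_`.

```python
import numpy as np

def node_mse(values):
    """Impurity of a node: mean squared error around the node's mean."""
    return float(np.mean((values - values.mean()) ** 2))

# Hypothetical target values in a parent node and a candidate split
parent = np.array([10.0, 12.0, 20.0, 22.0])
left, right = parent[:2], parent[2:]

# Children impurities are weighted by their share of the parent's samples
weighted_children = (
    len(left) / len(parent) * node_mse(left)
    + len(right) / len(parent) * node_mse(right)
)

# Gain: impurity before the split minus weighted impurity after it
gain = node_mse(parent) - weighted_children
print(node_mse(parent), weighted_children, gain)
```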

In Scikit-Learn, tree-based models contain this information in the
[`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_)
field. 

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
  * `n_estimators=10`,
  * `max_depth=20`,
  * `random_state=1`,
  * `n_jobs=-1` (optional)
* Get the feature importance information from this model


What's the most important feature (among these 4)? 

* `vehicle_weight`
* `horsepower`
* `acceleration`
* `engine_displacement`


In [48]:
rfr_md20 = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1)

rfr_md20.fit(X_train, y_train)

feat_imp = pd.DataFrame(
    list(
        zip(dv.get_feature_names_out(), rfr_md20.feature_importances_)
        ),
    columns=['feature_name', 'importance']
    )

feat_imp.sort_values('importance', ascending=False)

           feature_name  importance
13       vehicle_weight     0.95915
6            horsepower    0.015998
0          acceleration     0.01148
3   engine_displacement    0.003273
7            model_year    0.003212
8         num_cylinders    0.002343
9             num_doors    0.001635
12           origin=USA     0.00054
11        origin=Europe    0.000519
10          origin=Asia    0.000462


## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```python
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` from `0.3` to `0.1`.

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* Both give equal value


In [7]:
import xgboost as xgb

In [12]:
features = list(dv.get_feature_names_out())

dtrain = xgb.DMatrix(data=X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(data=X_val, label=y_val, feature_names=features)

In [21]:
watchlist = [(dtrain, 'train'), (dval, 'val')]

In [33]:
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'reg:squarederror',
    'nthread': 24,

    'seed': 1,
    'verbosity': 1,
}

# NOTE: the task above asks for 100 rounds; this run uses num_boost_round=200
model = xgb.train(params=xgb_params, dtrain=dtrain, num_boost_round=200,
                  evals=watchlist)

[0]	train-rmse:1.81393	val-rmse:1.85444
[1]	train-rmse:1.31919	val-rmse:1.35353
[2]	train-rmse:0.98120	val-rmse:1.01316
[3]	train-rmse:0.75443	val-rmse:0.78667
[4]	train-rmse:0.60680	val-rmse:0.64318
[5]	train-rmse:0.51381	val-rmse:0.55664
[6]	train-rmse:0.45470	val-rmse:0.50321
[7]	train-rmse:0.41881	val-rmse:0.47254
[8]	train-rmse:0.39534	val-rmse:0.45509
[9]	train-rmse:0.38038	val-rmse:0.44564
[10]	train-rmse:0.37115	val-rmse:0.43896
[11]	train-rmse:0.36361	val-rmse:0.43594
[12]	train-rmse:0.35850	val-rmse:0.43558
[13]	train-rmse:0.35365	val-rmse:0.43394
[14]	train-rmse:0.35025	val-rmse:0.43349
[15]	train-rmse:0.34666	val-rmse:0.43362
[16]	train-rmse:0.34459	val-rmse:0.43378
[17]	train-rmse:0.34128	val-rmse:0.43405
[18]	train-rmse:0.33822	val-rmse:0.43391
[19]	train-rmse:0.33709	val-rmse:0.43374
[20]	train-rmse:0.33553	val-rmse:0.43376
[21]	train-rmse:0.33243	val-rmse:0.43453
[22]	train-rmse:0.33031	val-rmse:0.43510
[23]	train-rmse:0.32815	val-rmse:0.43601
[24]	train-rmse:0.32670	val-rmse:0.43592
[25]	train-rmse:0.32268	val-rmse:0.43683
[26]	train-rmse:0.32085	val-rmse:0.43678
[27]	train-rmse:0.32035	val-rmse:0.43681
[28]	train-rmse:0.31879	val-rmse:0.43719
[29]	train-rmse:0.31653	val-rmse:0.43739
[30]	train-rmse:0.31

In [15]:
from sklearn.metrics import root_mean_squared_error

In [29]:
y_pred = model.predict(dval)
rmse = root_mean_squared_error(y_val, y_pred)
rmse

0.46203156380184496
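
The final RMSE above is higher than the validation RMSE logged in the early rounds, which suggests the model with `eta=0.3` overfits as boosting continues. The sketch below parses a few lines copied from the log above and finds the round with the lowest validation RMSE. The regular expression assumes the `[round] train-rmse:... val-rmse:...` layout shown in this notebook, and the log is truncated, so only the visible rounds are considered.

```python
import re

# Rounds 10-20 copied from the eta=0.3 training log above
log_text = """
[10] train-rmse:0.37115 val-rmse:0.43896
[11] train-rmse:0.36361 val-rmse:0.43594
[12] train-rmse:0.35850 val-rmse:0.43558
[13] train-rmse:0.35365 val-rmse:0.43394
[14] train-rmse:0.35025 val-rmse:0.43349
[15] train-rmse:0.34666 val-rmse:0.43362
[16] train-rmse:0.34459 val-rmse:0.43378
[17] train-rmse:0.34128 val-rmse:0.43405
[18] train-rmse:0.33822 val-rmse:0.43391
[19] train-rmse:0.33709 val-rmse:0.43374
[20] train-rmse:0.33553 val-rmse:0.43376
"""

# \s+ matches both the tabs in the real log and the spaces used here
pattern = r"\[(\d+)\]\s+train-rmse:([\d.]+)\s+val-rmse:([\d.]+)"
rows = [(int(i), float(tr), float(va)) for i, tr, va in re.findall(pattern, log_text)]

# Round with the lowest validation RMSE among the parsed lines
best_round, best_train, best_val = min(rows, key=lambda row: row[2])
print(best_round, best_val)
```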

In [34]:
xgb_params = {
    'eta': 0.1,
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'reg:squarederror',
    'nthread': 24,

    'seed': 1,
    'verbosity': 1,
}

# NOTE: the task above asks for 100 rounds; this run uses num_boost_round=200
model = xgb.train(params=xgb_params, dtrain=dtrain, num_boost_round=200,
                  evals=watchlist)

[0]	train-rmse:2.28944	val-rmse:2.34561
[1]	train-rmse:2.07396	val-rmse:2.12434
[2]	train-rmse:1.88066	val-rmse:1.92597
[3]	train-rmse:1.70730	val-rmse:1.74987
[4]	train-rmse:1.55163	val-rmse:1.59059
[5]	train-rmse:1.41247	val-rmse:1.44988
[6]	train-rmse:1.28796	val-rmse:1.32329
[7]	train-rmse:1.17660	val-rmse:1.20930
[8]	train-rmse:1.07736	val-rmse:1.10830
[9]	train-rmse:0.98883	val-rmse:1.02009
[10]	train-rmse:0.91008	val-rmse:0.94062
[11]	train-rmse:0.84030	val-rmse:0.87100
[12]	train-rmse:0.77874	val-rmse:0.80916
[13]	train-rmse:0.72417	val-rmse:0.75465
[14]	train-rmse:0.67626	val-rmse:0.70780
[15]	train-rmse:0.63402	val-rmse:0.66672
[16]	train-rmse:0.59690	val-rmse:0.63062
[17]	train-rmse:0.56447	val-rmse:0.60016
[18]	train-rmse:0.53619	val-rmse:0.57383
[19]	train-rmse:0.51138	val-rmse:0.55044
[20]	train-rmse:0.48983	val-rmse:0.53064
[21]	train-rmse:0.47135	val-rmse:0.51451
[22]	train-rmse:0.45501	val-rmse:0.49998
[23]	train-rmse:0.44120	val-rmse:0.48790
[24]	train-rmse:0.42929	va

In [31]:
y_pred = model.predict(dval)
rmse = root_mean_squared_error(y_val, y_pred)
rmse

0.43151829815693454

### With `eta` set to 0.1 the validation RMSE is lower (0.432 vs. 0.462 for `eta=0.3`), so the results are better


## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw06
* If your answer doesn't match options exactly, select the closest one. If the answer is exactly in between two options, select the higher value.