## Homework - Week 06 | Trees


### Dataset

In this homework, we continue using the fuel efficiency dataset.
Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

The goal of this homework is to create a regression model for predicting the car fuel efficiency (column `'fuel_efficiency_mpg'`).



### Preparing the dataset 

Preparation:

* Fill missing values with zeros.
* Do train/validation/test split with 60%/20%/20% distribution. 
* Use the `train_test_split` function and set the `random_state` parameter to 1.
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# load data
df = pd.read_csv('data/car_fuel_efficiency.csv')

In [7]:
# check missing values
df.isnull().sum()

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64

In [8]:
# fill missing values with 0
df.fillna(0, inplace=True)

In [10]:
# Do train/validation/test split with 60%/20%/20% distribution. 
# Use the `train_test_split` function and set the `random_state` parameter to 1.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

print(len(df_train), len(df_val), len(df_test))

5822 1941 1941
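
The second split uses `test_size=0.25` because the validation set is carved out of the remaining 80% of the data, and 25% of 80% is 20% of the full dataset. A quick sanity check on the row counts printed above (the variable names here are only for illustration):

```python
# Row counts copied from the output of the split cell.
n_train, n_val, n_test = 5822, 1941, 1941
n_total = n_train + n_val + n_test

# Share of the full dataset that ended up in each part.
train_fraction = n_train / n_total
val_fraction = n_val / n_total
test_fraction = n_test / n_total

print(round(train_fraction, 2), round(val_fraction, 2), round(test_fraction, 2))
```

The rounded shares come out to the intended 60%/20%/20% distribution.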


In [15]:
# set y and remove it from dataframes
y_train = df_train['fuel_efficiency_mpg'].values
y_val = df_val['fuel_efficiency_mpg'].values
y_test = df_test['fuel_efficiency_mpg'].values

del df_train['fuel_efficiency_mpg']
del df_val['fuel_efficiency_mpg']
del df_test['fuel_efficiency_mpg']

In [16]:
# Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.
dv = DictVectorizer(sparse=True)

train_dicts = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val.to_dict(orient='records')
X_val = dv.transform(val_dicts)

test_dicts = df_test.to_dict(orient='records')
X_test = dv.transform(test_dicts)


## Question 1

Let's train a decision tree regressor to predict the `fuel_efficiency_mpg` variable. 

* Train a model with `max_depth=1`.


Which feature is used for splitting the data?


> `'vehicle_weight'`
* `'model_year'`
* `'origin'`
* `'fuel_type'`

In [17]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X_train, y_train)



In [18]:
# visualize the tree

from sklearn.tree import export_text

r = export_text(dt, feature_names=dv.get_feature_names_out().tolist())
print(r)

|--- vehicle_weight <= 3022.11
|   |--- value: [16.88]
|--- vehicle_weight >  3022.11
|   |--- value: [12.94]



## Question 2

Train a random forest regressor with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1` (optional - to make training faster)


What's the RMSE of this model on the validation data?

* 0.045
> 0.45
* 4.5
* 45.0

In [19]:
# train a random forest with the parameters n_estimators=10, random_state=1

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error

rf = RandomForestRegressor(n_jobs=-1, n_estimators=10, random_state=1)

rf.fit(X_train, y_train)

# calculate RMSE on validation set
y_pred = rf.predict(X_val)
rmse = root_mean_squared_error(y_val, y_pred)
print(f'RMSE on validation set: {rmse}')

RMSE on validation set: 0.45957772230927263
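
For reference, RMSE is the square root of the mean of the squared differences between targets and predictions. A minimal pure-Python sketch of the metric (the helper name `rmse` and the toy values are illustrative assumptions, not part of the homework):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared differences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy values, not taken from the dataset.
toy_rmse = rmse([15.0, 12.0, 18.0], [14.0, 12.0, 20.0])
print(toy_rmse)
```

Because the errors are squared before averaging, RMSE is expressed in the same units as the target (here, miles per gallon).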


## Question 3

Now let's experiment with the `n_estimators` parameter.

* Try different values of this parameter from 10 to 200 with step 10.
* Set `random_state` to `1`.
* Evaluate the model on the validation dataset.


After which value of `n_estimators` does RMSE stop improving?
Consider 3 decimal places for calculating the answer.

- 10
- 25
- 80
> 200

If it doesn't stop improving, use the latest iteration number in
your answer.
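
One possible reading of "stops improving" is: round each RMSE to 3 decimal places and take the first `n_estimators` value that reaches the best rounded score. The sketch below follows that reading; the helper name `stop_improving_at` and the toy scores are hypothetical, and other interpretations of the question are possible.

```python
def stop_improving_at(results, decimals=3):
    """Return the first n_estimators whose rounded RMSE equals the best rounded RMSE.

    `results` is a list of (n_estimators, rmse) pairs in increasing n_estimators order.
    """
    rounded = [(n, round(score, decimals)) for n, score in results]
    best = min(score for _, score in rounded)
    for n, score in rounded:
        if score == best:
            return n

# Hypothetical scores for illustration only, not the homework results.
toy_results = [(10, 0.4601), (20, 0.4552), (30, 0.4498), (40, 0.4502), (50, 0.4499)]
toy_answer = stop_improving_at(toy_results)
print(toy_answer)
```

If the best rounded score is only reached at the last value tried, the helper returns that last value, which matches the rule stated above.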

In [21]:
# set different values for n_estimators for experimentation, from 10 to 200 with step 10
for n_estimators in range(10, 201, 10):
    rf = RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators, random_state=1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_val)
    rmse = root_mean_squared_error(y_val, y_pred)
    print(f'n_estimators: {n_estimators}, RMSE on validation set: {rmse}')

n_estimators: 10, RMSE on validation set: 0.4595777223092726
n_estimators: 20, RMSE on validation set: 0.45359067251247054
n_estimators: 30, RMSE on validation set: 0.45168672575457125
n_estimators: 40, RMSE on validation set: 0.4487208301736997
n_estimators: 50, RMSE on validation set: 0.4466568972416094
n_estimators: 60, RMSE on validation set: 0.44545970260811213
n_estimators: 70, RMSE on validation set: 0.4451263244986996
n_estimators: 80, RMSE on validation set: 0.44498431197772836
n_estimators: 90, RMSE on validation set: 0.4448614906399874
n_estimators: 100, RMSE on validation set: 0.44465186808680407
n_estimators: 110, RMSE on validation set: 0.44357876439860233
n_estimators: 120, RMSE on validation set: 0.4439118681233816
n_estimators: 130, RMSE on validation set: 0.44370259039668697
n_estimators: 140, RMSE on validation set: 0.4433549955101688
n_estimators: 150, RMSE on validation set: 0.44289761494219454
n_estimators: 160, RMSE on validation set: 0.4427612219659299
n_estimat

## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values,
  * try different values of `n_estimators` from 10 to 200 (with step 10)
  * calculate the mean RMSE 
* Fix the random seed: `random_state=1`


What's the best `max_depth`, using the mean RMSE?

> 10
* 15
* 20
* 25

In [26]:
# list of (max_depth, rmse) pairs, one per trained model; means are computed from it below
mean_rmse_list = []

# try different values of max_depth: 10, 15, 20, 25
for max_depth in [10, 15, 20, 25]:
    for n_estimators in range(10, 201, 10):
        rf = RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators, max_depth=max_depth, random_state=1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)

        rmse = root_mean_squared_error(y_val, y_pred)
        mean_rmse_list.append((max_depth, rmse))

        #print(f'max_depth: {max_depth}, RMSE on validation set: {rmse}')
    
    # calculate mean RMSE for this max_depth
    rmse_values = [rmse for depth, rmse in mean_rmse_list if depth == max_depth]
    mean_rmse = np.mean(rmse_values)
    print('---')
    print(f'max_depth: {max_depth}, Mean RMSE: {mean_rmse}')
    print('---')

# find the single run with the lowest RMSE (a per-run minimum, not the mean RMSE)
best_combination = min(mean_rmse_list, key=lambda x: x[1])
print(f'Best combination: max_depth={best_combination[0]}, RMSE={best_combination[1]}')

---
max_depth: 10, Mean RMSE: 0.4418078609323356
---
---
max_depth: 15, Mean RMSE: 0.44541664456381075
---
---
max_depth: 20, Mean RMSE: 0.44625292424422536
---
---
max_depth: 25, Mean RMSE: 0.44590993626161624
---
Best combination: max_depth=10, RMSE=0.43974886968170657
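
The question asks for the best `max_depth` by mean RMSE, whereas the "Best combination" line reports the lowest single-run RMSE. A small sketch of selecting by the mean instead, assuming the per-run scores are stored as `(max_depth, rmse)` pairs (the helper name and the toy pairs are hypothetical):

```python
from collections import defaultdict
from statistics import mean

def mean_rmse_per_depth(pairs):
    """Group (max_depth, rmse) pairs by depth and average the RMSE values."""
    grouped = defaultdict(list)
    for depth, score in pairs:
        grouped[depth].append(score)
    return {depth: mean(scores) for depth, scores in grouped.items()}

# Hypothetical per-run scores, only to show the grouping step.
toy_pairs = [(10, 0.44), (10, 0.46), (15, 0.45), (15, 0.47)]
toy_means = mean_rmse_per_depth(toy_pairs)

# Mean RMSE values copied from the output printed above.
reported_means = {
    10: 0.4418078609323356,
    15: 0.44541664456381075,
    20: 0.44625292424422536,
    25: 0.44590993626161624,
}
best_depth = min(reported_means, key=reported_means.get)
print(best_depth)  # 10
```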


## Question 5

We can extract feature importance information from tree-based models. 

At each step, the decision tree learning algorithm finds the best split.
For each split, we can calculate the "gain": the reduction in impurity achieved by the split.
This gain helps us understand which features matter most to a tree-based model.

In Scikit-Learn, tree-based models contain this information in the
[`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_)
field. 
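
To make "gain" concrete for regression, here is a minimal sketch that measures impurity as the variance of the target and computes how much a split reduces it. The helper `split_gain` and the toy values are illustrative assumptions; Scikit-Learn's own computation additionally accumulates these reductions per feature across all nodes and trees and normalizes them.

```python
from statistics import pvariance

def split_gain(values, left, right):
    """Weighted reduction in variance (MSE impurity) achieved by a split."""
    n = len(values)
    weighted_child = (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)
    return pvariance(values) - weighted_child

# Hypothetical target values: three efficient cars and three inefficient ones.
y = [17.0, 16.5, 17.5, 13.0, 12.5, 13.5]

# A split that separates the two groups cleanly.
good = split_gain(y, y[:3], y[3:])
# A split that mixes the two groups.
poor = split_gain(y, [17.0, 13.0, 12.5], [16.5, 17.5, 13.5])

print(good > poor)  # True
```

A feature whose splits produce large gains, like the clean split here, ends up with a high importance score.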

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
  * `n_estimators=10`,
  * `max_depth=20`,
  * `random_state=1`,
  * `n_jobs=-1` (optional)
* Get the feature importance information from this model


What's the most important feature (among these 4)? 

> `vehicle_weight`
* `horsepower`
* `acceleration`
* `engine_displacement`

In [28]:
# train new model with set parameters
rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)

rf.fit(X_train, y_train)

# get the feature importances
importances = rf.feature_importances_
feature_names = dv.get_feature_names_out()

# print feature importances ordered by importance
feature_importances = sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True)
for feature, importance in feature_importances:
    print(f'{feature}: {importance}')

vehicle_weight: 0.9591499647407432
horsepower: 0.015997897714266237
acceleration: 0.01147970063142938
engine_displacement: 0.0032727919136094864
model_year: 0.003212300094794675
num_cylinders: 0.0023433469524512048
num_doors: 0.0016349895439306998
origin=USA: 0.0005397216891829147
origin=Europe: 0.000518739638586969
origin=Asia: 0.0004622464955097423
fuel_type=Gasoline: 0.00036038360069172865
drivetrain=All-wheel drive: 0.0003571085493021933
drivetrain=Front-wheel drive: 0.00034538411263183535
fuel_type=Diesel: 0.000325424322869738


## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` from `0.3` to `0.1`.

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
> 0.1
* Both give equal value

In [34]:
# train an XGBoost model with the given parameters
import xgboost as xgb

# set the parameters
def set_xgb_params(eta: float) -> dict:
    params = {
        'eta': eta, # learning rate
        'max_depth': 6, # maximum depth of each tree
        'min_child_weight': 1, # minimum sum of instance weight (hessian) required in a child

        'objective': 'reg:squarederror', # regression task
        'nthread': 8,

        'seed': 1,
        'verbosity': 1,
    }

    return params

# create DMatrix for XGBoost
features = list(dv.get_feature_names_out())
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)


# fit the model with the parameters
for eta in [0.3, 0.1]:
    xgb_params = set_xgb_params(eta)
    print(f'Training XGBoost model with eta={eta}')
    xgb_model = xgb.train(xgb_params, dtrain, num_boost_round=100)
    # RMSE on validation set
    y_pred = xgb_model.predict(dval)
    rmse = root_mean_squared_error(y_val, y_pred)
    print(f'RMSE on validation set: {rmse}')



Training XGBoost model with eta=0.3
RMSE on validation set: 0.45017755678087246
Training XGBoost model with eta=0.1
RMSE on validation set: 0.42622800553359225
