# 📝 Exercise M6.01

The aim of this notebook is to investigate if we can tune the hyperparameters
of a bagging regressor and evaluate the gain obtained.

We will load the California housing dataset and split it into a training and
a testing set.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5)

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

Create a `BaggingRegressor` and provide a `DecisionTreeRegressor`
to its parameter `estimator`. Train the regressor and evaluate its
generalization performance on the testing set using the mean absolute error.

In [60]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
import numpy as np

from sklearn.metrics import mean_absolute_error

In [63]:
# Create a BaggingRegressor wrapping a decision tree
estimator = DecisionTreeRegressor(random_state=0)
baggign_regressor = BaggingRegressor(
    estimator=estimator,
    #n_estimators=100,
    random_state=0
)

# train the regressor
_ = baggign_regressor.fit(data_train, target_train)

target_predict = baggign_regressor.predict(data_test)
# evaluate generalization performance using the mean absolute error (MAE)
np.mean(np.abs(target_predict - target_test.to_numpy()))

36.69490981589148

In [64]:
print(f"Basic mean absolute error of the bagging regressor:\n"
      f"{mean_absolute_error(target_test, target_predict):.2f} k$")

Basic mean absolute error of the bagging regressor:
36.69 k$
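
The two cells above compute the same quantity, once by hand with NumPy and once with `mean_absolute_error`. A minimal sketch of that equivalence, using toy values that are purely illustrative and not taken from the housing data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy values (hypothetical, not from the California housing dataset).
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE by hand: average of the absolute differences.
manual_mae = np.mean(np.abs(y_pred - y_true))

# Same metric through scikit-learn.
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Both values should agree up to floating-point precision, which is why the two outputs above match once rounded.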


Now, create a `RandomizedSearchCV` instance using the previous model and
tune the important parameters of the bagging regressor. Find the best
parameters and check whether any set of parameters improves on the default
regressor, still using the mean absolute error as the metric.

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">You can list the bagging regressor's parameters using the <tt class="docutils literal">get_params</tt>
method.</p>
</div>

In [67]:
# solution: see which parameters can be tuned
for param in baggign_regressor.get_params().keys():
    print(param)

base_estimator
bootstrap
bootstrap_features
estimator__ccp_alpha
estimator__criterion
estimator__max_depth
estimator__max_features
estimator__max_leaf_nodes
estimator__min_impurity_decrease
estimator__min_samples_leaf
estimator__min_samples_split
estimator__min_weight_fraction_leaf
estimator__random_state
estimator__splitter
estimator
max_features
max_samples
n_estimators
n_jobs
oob_score
random_state
verbose
warm_start


In [68]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
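
Before using `randint` in a search space, it helps to recall that `scipy.stats.randint(low, high)` draws integers from `low` up to `high - 1`: the upper bound is excluded. A small sketch to check this, where the bounds 10 and 100 and the name `n_estimators_dist` are illustrative choices:

```python
from scipy.stats import randint

# Discrete uniform distribution over the integers 10, 11, ..., 99.
n_estimators_dist = randint(10, 100)

# Draw a reproducible sample and inspect its range.
samples = n_estimators_dist.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())

# The probability mass at the upper bound is zero, so 100 is never drawn.
print(n_estimators_dist.pmf(99), n_estimators_dist.pmf(100))
```

With such a distribution, a parameter like `n_estimators` would therefore never take the value 100.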

In [71]:
param_grid = {
    'n_estimators': randint(10, 100),
    'max_samples': [0.5, 0.8, 1.0],
    'max_features': [0.5, 0.8, 1.0],
    'estimator__max_depth': randint(3, 10)
}
cv = 10

tree = RandomizedSearchCV(
    baggign_regressor,
    param_distributions=param_grid,
    n_iter=20,
    scoring='neg_mean_absolute_error',
    cv=cv,
    n_jobs=4,
    return_train_score=True,
)

_ = tree.fit(data_train, target_train)


target_predict = tree.predict(data_test)
# evaluate generalization performance using the mean absolute error (MAE)
np.mean(np.abs(target_predict - target_test.to_numpy()))

39.59224907939918

In [74]:
import pandas as pd

columns = [f'param_{name}' for name in param_grid.keys()]
columns += ['mean_test_error', 'std_test_error']

cv_results = pd.DataFrame(tree.cv_results_)
cv_results['mean_test_error'] = -cv_results['mean_test_score']
cv_results['std_test_error'] = cv_results['std_test_score']
cv_results[columns].sort_values(by='mean_test_error')

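|    | param_n_estimators | param_max_samples | param_max_features | param_estimator__max_depth | mean_test_error | std_test_error |
|----|--------------------|-------------------|--------------------|----------------------------|-----------------|----------------|
| 19 | 62 | 0.8 | 0.8 | 8 | 39.593619 | 1.397274 |
| 16 | 13 | 0.5 | 0.8 | 9 | 39.998971 | 1.359944 |
| 8  | 83 | 0.5 | 0.5 | 9 | 43.585736 | 1.082776 |
| 7  | 81 | 0.5 | 1.0 | 6 | 44.769777 | 1.482435 |
| 10 | 41 | 0.5 | 1.0 | 6 | 44.836071 | 1.406291 |
| 0  | 75 | 0.8 | 1.0 | 6 | 45.005132 | 1.539216 |
| 17 | 39 | 0.5 | 0.5 | 8 | 45.249322 | 1.12704 |
| 18 | 10 | 0.5 | 1.0 | 6 | 45.2999 | 1.475842 |
| 12 | 74 | 0.5 | 0.8 | 5 | 47.512311 | 1.281349 |
| 13 | 95 | 0.5 | 1.0 | 5 | 47.534731 | 1.506151 |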


In [75]:
target_predicted = tree.predict(data_test)
print(f"Mean absolute error after tuning of the bagging regressor:\n"
      f"{mean_absolute_error(target_test, target_predicted):.2f} k$")

Mean absolute error after tuning of the bagging regressor:
39.59 k$
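
With this search, the tuned model (39.59 k$) does not improve on the default regressor (36.69 k$). One possible explanation, offered here as a hypothesis rather than a result of the notebook, is that the search space excludes the default configuration: `randint(3, 10)` limits `estimator__max_depth` to values from 3 to 9, whereas a default `DecisionTreeRegressor` uses `max_depth=None` (fully grown trees), and the best candidates in the table sit at depths 8 and 9, the upper end of that range. A sketch of a wider search space that includes the default depth is shown below; the specific depth values and the name `wider_distributions` are illustrative assumptions, and `ParameterSampler` is only used to preview candidates without fitting anything.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Hypothetical wider search space (values chosen for illustration only).
wider_distributions = {
    "n_estimators": randint(10, 100),
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    # None means fully grown trees, the DecisionTreeRegressor default.
    "estimator__max_depth": [None, 5, 10, 20],
}

# Preview a few candidates that a randomized search could evaluate.
candidates = list(
    ParameterSampler(wider_distributions, n_iter=5, random_state=0)
)
for candidate in candidates:
    print(candidate)
```

Such a dictionary could be passed as `param_distributions` to `RandomizedSearchCV`; whether it actually beats the default would still have to be checked on the testing set.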
