# 📝 Exercise M6.01

The aim of this notebook is to investigate if we can tune the hyperparameters
of a bagging regressor and evaluate the gain obtained.

We will load the California housing dataset and split it into a training and
a testing set.

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5)

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

Create a `BaggingRegressor` and provide a `DecisionTreeRegressor`
to its parameter `base_estimator` (renamed `estimator` in scikit-learn 1.2).
Train the regressor and evaluate its statistical performance on the testing
set using the mean absolute error.

In [18]:
# Write your code here.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

bagged_trees = BaggingRegressor(
    base_estimator=DecisionTreeRegressor(),
    n_estimators=100,
)

bagged_trees.fit(data_train, target_train)
target_predicted = bagged_trees.predict(data_test)
abs(target_test - target_predicted).mean()

34.559549116279086
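
For intuition, the procedure behind a bagging regressor can be sketched by
hand: fit each tree on a bootstrap sample of the training rows, then average
the predictions. This is only an illustrative sketch, not scikit-learn's
implementation; the synthetic dataset and the ensemble size `n_trees` are
assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for the housing dataset (assumption, for brevity).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

n_trees = 10  # hypothetical ensemble size, for illustration only
trees = []
for _ in range(n_trees):
    # Bootstrap: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Aggregate: average the predictions of the individual trees.
bagged_prediction = np.mean([tree.predict(X) for tree in trees], axis=0)
print(bagged_prediction.shape)
```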

Now, create a `RandomizedSearchCV` instance using the previous model and
tune the important parameters of the bagging regressor. Find the best
parameters and check whether you can find a set of parameters that
improves on the default regressor, still using the mean absolute error as
the metric.

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">You can list the bagging regressor's parameters using the <tt class="docutils literal">get_params</tt>
method.</p>
</div>
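
As a small sketch of the tip above, the snippet below lists the parameter
names exposed by `get_params`. The tree is passed positionally so that the
example does not depend on whether the keyword is called `base_estimator` or
`estimator` in the installed scikit-learn version; the variable names are
illustrative.

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

model = BaggingRegressor(DecisionTreeRegressor())
params = model.get_params()

# Parameters of the inner tree appear with a "<prefix>__" in their name.
nested = sorted(name for name in params if name.endswith("__max_depth"))

for name in sorted(params):
    print(name, "->", params[name])
```

Names containing a double underscore refer to parameters of the inner tree,
and they can be used as keys in the search space of `RandomizedSearchCV`.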

In [22]:
import numpy as np
np.linspace(1, 10)

array([ 1.        ,  1.18367347,  1.36734694,  1.55102041,  1.73469388,
        1.91836735,  2.10204082,  2.28571429,  2.46938776,  2.65306122,
        2.83673469,  3.02040816,  3.20408163,  3.3877551 ,  3.57142857,
        3.75510204,  3.93877551,  4.12244898,  4.30612245,  4.48979592,
        4.67346939,  4.85714286,  5.04081633,  5.2244898 ,  5.40816327,
        5.59183673,  5.7755102 ,  5.95918367,  6.14285714,  6.32653061,
        6.51020408,  6.69387755,  6.87755102,  7.06122449,  7.24489796,
        7.42857143,  7.6122449 ,  7.79591837,  7.97959184,  8.16326531,
        8.34693878,  8.53061224,  8.71428571,  8.89795918,  9.08163265,
        9.26530612,  9.44897959,  9.63265306,  9.81632653, 10.        ])

In [46]:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

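# randint(low, high) samples integers from low (included) to high (excluded).
param_grid = {
    "n_estimators": randint(10, 30),
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    "base_estimator__max_depth": randint(3, 10),
}
# Scores are negated so that a higher value is always better.
search = RandomizedSearchCV(
    bagged_trees, param_grid, n_iter=20, scoring="neg_mean_absolute_error"
)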
_ = search.fit(data_train, target_train)
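
To see which values the search can draw, here is a minimal sketch of sampling
from the same kind of `scipy.stats.randint` distribution used for
`n_estimators`. The sample size and the seed are arbitrary choices made for
illustration.

```python
from scipy.stats import randint

# Same bounds as the "n_estimators" entry of the search space.
n_estimators_dist = randint(10, 30)

# Draw many candidates to inspect the range actually covered.
samples = n_estimators_dist.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())
```

The upper bound is excluded, so this distribution never proposes 30
estimators, and `randint(3, 10)` never proposes a depth of 10.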

In [57]:
import pandas as pd

# Collect the cross-validation results, best-ranked candidates first.
df = pd.DataFrame(search.cv_results_)
df = df.sort_values("rank_test_score")
# Scores are negated errors; flip the sign to read them as errors.
df["mean_test_error"] = -df["mean_test_score"]
df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_base_estimator__max_depth,param_max_features,param_max_samples,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score,mean_test_error
19,0.198077,0.003182,0.004389,0.0004884613,8,0.8,0.8,15,"{'base_estimator__max_depth': 8, 'max_features...",-41.194488,-41.639472,-40.648088,-42.114165,-41.578395,-41.434922,0.490043,1,41.434922
15,0.373723,0.00389,0.005786,0.0003980672,7,1.0,1.0,21,"{'base_estimator__max_depth': 7, 'max_features...",-43.778301,-44.125926,-41.861638,-44.305708,-41.282434,-43.070802,1.248933,2,43.070802
2,0.274838,0.002178,0.00518,0.0007359981,6,1.0,1.0,19,"{'base_estimator__max_depth': 6, 'max_features...",-45.952036,-46.124363,-44.220228,-46.12648,-42.5894,-45.002501,1.404174,3,45.002501
1,0.131839,0.002923,0.004605,0.0004852462,6,0.8,0.5,18,"{'base_estimator__max_depth': 6, 'max_features...",-46.658055,-46.450382,-45.487516,-45.62773,-43.602013,-45.565139,1.080919,4,45.565139
4,0.109393,0.001594,0.002994,3.017291e-06,6,1.0,0.5,11,"{'base_estimator__max_depth': 6, 'max_features...",-46.492244,-46.499554,-45.137333,-46.72992,-43.896638,-45.751138,1.08474,5,45.751138
16,0.093949,0.001471,0.00359,0.0004893176,6,0.8,0.5,12,"{'base_estimator__max_depth': 6, 'max_features...",-46.264816,-47.495388,-45.491335,-45.551815,-44.661043,-45.892879,0.948826,6,45.892879
14,0.076496,0.000887,0.00399,2.336015e-07,8,0.5,0.5,11,"{'base_estimator__max_depth': 8, 'max_features...",-43.057289,-46.871978,-49.582994,-45.423206,-45.748395,-46.136773,2.124156,7,46.136773
10,0.217688,0.007399,0.007987,0.000633944,9,0.5,0.5,29,"{'base_estimator__max_depth': 9, 'max_features...",-43.616726,-47.191472,-46.144342,-48.707807,-47.862502,-46.70457,1.757241,8,46.70457
17,0.125961,0.002111,0.003989,1.994753e-06,8,0.5,1.0,12,"{'base_estimator__max_depth': 8, 'max_features...",-43.212918,-49.763557,-48.506569,-50.344013,-45.075275,-47.380466,2.771722,9,47.380466
7,0.097531,0.002779,0.003391,0.0004879553,7,0.5,0.8,12,"{'base_estimator__max_depth': 7, 'max_features...",-45.458446,-52.143081,-47.147471,-47.974649,-45.856658,-47.716061,2.389262,10,47.716061


In [62]:
from sklearn.metrics import mean_absolute_error

target_pred = search.predict(data_test)
# scikit-learn metrics expect the arguments in the order (y_true, y_pred).
mean_absolute_error(target_test, target_pred)

41.352642284467635

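With this search space, the best candidate reaches a mean absolute error of
about 41.4 k$ on the testing set, which is not better than the 34.6 k$
obtained with the initial bagging regressor. Note that the search limits the
ensemble to fewer than 30 trees of depth below 10, whereas the initial model
averages 100 trees of unrestricted depth. We see that the bagging regressor
provides a predictor for which fine tuning is not as important as in the case
of fitting a single decision tree.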
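
One way to probe this claim is to vary `max_depth` for a single tree and for
a bagging ensemble, and compare the cross-validated errors. The sketch below
does so on a small synthetic dataset; the dataset, the depth values, and the
helper `cv_mae` are assumptions made for illustration, and the numbers it
prints need not match what would be observed on the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Small synthetic problem so that the comparison runs quickly (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)


def cv_mae(model):
    """Return the mean absolute error averaged over 3 cross-validation folds."""
    scores = cross_val_score(
        model, X, y, cv=3, scoring="neg_mean_absolute_error"
    )
    return -scores.mean()


results = {}
for depth in [2, 5, None]:  # None lets the trees grow without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    bagging = BaggingRegressor(
        DecisionTreeRegressor(max_depth=depth, random_state=0),
        n_estimators=20,
        random_state=0,
    )
    results[depth] = (cv_mae(tree), cv_mae(bagging))
    print(f"max_depth={depth}: tree={results[depth][0]:.1f}, "
          f"bagging={results[depth][1]:.1f}")
```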