# 📝 Exercise M6.01

The aim of this notebook is to investigate whether tuning the hyperparameters
of a bagging regressor improves its performance, and to evaluate the gain
obtained.

We will load the California housing dataset and split it into a training and a
testing set.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target from units of 100 k$ to k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5
)

In [2]:
data.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [3]:
target.head()

0    452.6
1    358.5
2    352.1
3    341.3
4    342.2
Name: MedHouseVal, dtype: float64

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

Create a `BaggingRegressor` and provide a `DecisionTreeRegressor` to its
parameter `estimator`. Train the regressor and evaluate its generalization
performance on the testing set using the mean absolute error.

In [8]:
# Write your code here.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error

estimator = DecisionTreeRegressor(random_state=0)
bagging_regressor = BaggingRegressor(
    estimator=estimator, n_estimators=20, random_state=0
)
bagging_regressor.fit(data_train, target_train)
target_predicted = bagging_regressor.predict(data_test)

print(
    "Mean absolute error of the bagging regressor on the testing set: "
    f"{mean_absolute_error(target_test, target_predicted):.2f} k$"
)

Mean absolute error of the bagging regressor on the testing set: 35.65 k$


Now, create a `RandomizedSearchCV` instance using the previous model and tune
the important parameters of the bagging regressor. Find the best parameters
and check whether they improve on the default regressor, still using the mean
absolute error as a metric.

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">You can list the bagging regressor's parameters using the <tt class="docutils literal">get_params</tt> method.</p>
</div>

In [16]:
# Write your code here.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import pandas as pd

for param in bagging_regressor.get_params():
    print(param)

bootstrap
bootstrap_features
estimator__ccp_alpha
estimator__criterion
estimator__max_depth
estimator__max_features
estimator__max_leaf_nodes
estimator__min_impurity_decrease
estimator__min_samples_leaf
estimator__min_samples_split
estimator__min_weight_fraction_leaf
estimator__monotonic_cst
estimator__random_state
estimator__splitter
estimator
max_features
max_samples
n_estimators
n_jobs
oob_score
random_state
verbose
warm_start


In [None]:
param_grid = {
    # Bagging procedure: number of estimators in the ensemble (10 to 29)
    "n_estimators": randint(10, 30),
    # Bagging procedure: fraction of samples drawn to train each estimator
    "max_samples": [0.5, 0.8, 1.0],
    # Bagging procedure: fraction of features drawn for each estimator
    "max_features": [0.5, 0.8, 1.0],
    # Decision tree: maximum depth (3 to 9, the upper bound is exclusive)
    "estimator__max_depth": randint(3, 10),
}

search = RandomizedSearchCV(
    bagging_regressor, param_grid, n_iter=20, scoring="neg_mean_absolute_error"
)

_ = search.fit(data_train, target_train)  # run the search on the training set

In [17]:
columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_error", "std_test_error"]
cv_results = pd.DataFrame(search.cv_results_)
cv_results["mean_test_error"] = -cv_results["mean_test_score"]
cv_results["std_test_error"] = cv_results["std_test_score"]
cv_results[columns].sort_values(by="mean_test_error")

,param_n_estimators,param_max_samples,param_max_features,param_estimator__max_depth,mean_test_error,std_test_error
12,22,0.5,0.8,9,39.506223,0.913794
6,23,0.5,1.0,9,39.833023,0.722214
10,19,1.0,0.8,8,40.818321,0.873531
0,18,0.8,0.8,8,40.821208,0.998461
9,15,0.5,1.0,8,41.600365,0.839702
18,17,0.8,0.8,7,43.123931,1.04315
11,25,0.8,1.0,6,45.140791,0.98631
2,22,0.8,0.5,8,45.210917,0.788723
17,19,0.8,0.8,6,45.369537,1.017241
1,14,0.5,1.0,6,45.520025,1.244338


In [18]:
target_predicted = search.predict(data_test)
print(
    "Mean absolute error after tuning of the bagging regressor:\n"
    f"{mean_absolute_error(target_test, target_predicted):.2f} k$"
)

Mean absolute error after tuning of the bagging regressor:
39.15 k$


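The tuned model reaches a mean absolute error of 39.15 k$ on the testing set,
which is worse than the 35.65 k$ obtained by the untuned bagging regressor.
The search only explored `estimator__max_depth` values between 3 and 9, whereas
the trees of the untuned model are grown without a depth limit, and the best
candidates all sit at the upper end of the explored depths. We see that the
predictor provided by the bagging regressor does not need much hyperparameter
tuning compared to a single decision tree.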