## Hyperparameter tuning

In the previous section, we did not discuss the hyperparameters of random forest and gradient boosting. However, a few points are worth keeping in mind when setting them.

This notebook shows how to set the most important hyperparameters of both random forest and gradient-boosting decision tree models.

## Random forest

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0)

In [2]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
}
grid_search = GridSearchCV(
    RandomForestRegressor(n_jobs=2), param_grid=param_grid,
    scoring="neg_mean_absolute_error", n_jobs=2,
)
grid_search.fit(data_train, target_train)

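# Keep only the tuned hyperparameters and the cross-validated scores.
columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_score", "rank_test_score"]
cv_results = pd.DataFrame(grid_search.cv_results_)
# Scores are negated MAE (higher is better); flip the sign to read errors in k$.
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results[columns].sort_values(by="rank_test_score")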

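| | param_n_estimators | param_max_depth | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 8 | 30 | None | 34.666337 | 1 |
| 7 | 20 | None | 34.84536 | 2 |
| 6 | 10 | None | 35.895705 | 3 |
| 4 | 20 | 5.0 | 48.629913 | 4 |
| 5 | 30 | 5.0 | 48.822695 | 5 |
| 3 | 10 | 5.0 | 48.986628 | 6 |
| 1 | 20 | 3.0 | 57.082996 | 7 |
| 2 | 30 | 3.0 | 57.144623 | 8 |
| 0 | 10 | 3.0 | 57.388213 | 9 |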


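In this grid search, unbounded depth (`max_depth=None`) combined with the
largest `n_estimators` gave the lowest mean absolute error, about 34.7 k$.
The depth setting dominates: every `max_depth=None` configuration outranks
every depth-limited one, whereas going from 10 to 30 trees changes the error
by only about 1 k$.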
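
Cross-validated scores only rank the candidates, so it can be useful to check the selected model on held-out data as well. The snippet below is a minimal sketch of that check, not part of the original notebook. It assumes a small synthetic dataset from `make_regression` in place of the housing data so that it runs quickly, and a reduced grid chosen purely for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic data stands in for the California housing dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assumption: a reduced grid, chosen only to keep the example fast.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 30], "max_depth": [3, None]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_train, y_train)

# Best combination found by cross-validation on the training set.
best_params = search.best_params_
# best_score_ is a negated MAE, so flip the sign to get an error.
cv_mae = -search.best_score_
# With the default refit=True, the search object predicts with the best
# model refitted on the whole training set.
test_mae = mean_absolute_error(y_test, search.predict(X_test))

print(f"Best parameters: {best_params}")
print(f"Cross-validated MAE: {cv_mae:.2f}")
print(f"Held-out MAE: {test_mae:.2f}")
```

If the held-out error is much worse than the cross-validated one, the grid search may have overfitted the validation folds.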

## Gradient-boosting decision trees

In [3]:
from sklearn.ensemble import GradientBoostingRegressor

param_grid = {
    "n_estimators": [10, 30, 50],
    "max_depth": [3, 5, None],
    "learning_rate": [0.1, 1],
}
grid_search = GridSearchCV(
    GradientBoostingRegressor(), param_grid=param_grid,
    scoring="neg_mean_absolute_error", n_jobs=2
)
grid_search.fit(data_train, target_train)

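# Keep only the tuned hyperparameters and the cross-validated scores.
columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_score", "rank_test_score"]
cv_results = pd.DataFrame(grid_search.cv_results_)
# Scores are negated MAE (higher is better); flip the sign to read errors in k$.
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results[columns].sort_values(by="rank_test_score")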

| | param_n_estimators | param_max_depth | param_learning_rate | mean_test_score | rank_test_score |
|---|---|---|---|---|---|
| 5 | 50 | 5.0 | 0.1 | 35.660356 | 1 |
| 11 | 50 | 3.0 | 1.0 | 36.67084 | 2 |
| 10 | 30 | 3.0 | 1.0 | 37.524121 | 3 |
| 13 | 30 | 5.0 | 1.0 | 39.067819 | 4 |
| 4 | 30 | 5.0 | 0.1 | 39.358128 | 5 |
| 12 | 10 | 5.0 | 1.0 | 39.437542 | 6 |
| 14 | 50 | 5.0 | 1.0 | 39.858166 | 7 |
| 2 | 50 | 3.0 | 0.1 | 40.602496 | 8 |
| 9 | 10 | 3.0 | 1.0 | 41.589715 | 9 |
| 7 | 30 | None | 0.1 | 45.546211 | 10 |


<div class="admonition caution alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Caution!</p>
<p class="last">Here, we tune <tt class="docutils literal">n_estimators</tt> with a grid search, but be aware that using early stopping, as
in the previous exercise, is a better way to set it.</p>
</div>
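
As a reminder of what early stopping can look like, here is a minimal sketch. The previous exercise is not reproduced here, so the approach shown is an assumption: it relies on the `n_iter_no_change` and `validation_fraction` parameters of `GradientBoostingRegressor`, and on synthetic data from `make_regression` in place of the housing data. The values of `max_trees`, `validation_fraction`, and `n_iter_no_change` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Assumption: synthetic data stands in for the California housing dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

# Upper bound on the number of trees; early stopping may use fewer.
max_trees = 200

gbdt = GradientBoostingRegressor(
    n_estimators=max_trees,
    learning_rate=0.1,
    # Hold out 20% of the training data to monitor the validation loss.
    validation_fraction=0.2,
    # Stop when the validation loss has not improved for 5 iterations.
    n_iter_no_change=5,
    random_state=0,
)
gbdt.fit(X, y)

# Number of trees actually fitted before training stopped.
n_trees_used = gbdt.n_estimators_
print(f"Trees fitted: {n_trees_used} out of at most {max_trees}")
```

With this setup, `n_estimators` only acts as an upper bound, so it no longer needs to be part of the grid, and the search can focus on `max_depth` and `learning_rate`.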