# Hyperparameter Tuning

In the previous notebook, we saw the impact of applying a `Quantile Transformer` to the dataset. In addition, the `Random Forest Regressor` was confirmed as the best-performing algorithm.

The objective of this notebook is to apply `GridSearchCV` and `RandomizedSearchCV` to find the best hyperparameter configuration. [This article from Jason Brownlee](https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/) was really helpful for understanding the basics of both methods and selecting the best possible configuration.

`Hyperparameters` specify the configuration of a model and guide how the machine learning algorithm learns. Unlike `parameters`, `hyperparameters` are not learned from the data during training, so they must be set manually before the learning process starts.
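The distinction can be seen directly in scikit-learn. The following is a minimal sketch on synthetic data (the `X_demo` / `y_demo` arrays and the chosen values are illustrative assumptions, not the cars dataset): hyperparameters are passed to the constructor and read back with `get_params()`, while learned parameters only exist after `fit` and carry a trailing underscore.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (illustrative data, not the cars dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Hyperparameters: chosen by us before training
model = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0)
hyperparams = model.get_params()
print(hyperparams["n_estimators"], hyperparams["max_depth"])

# Parameters: learned from the data during fit (trailing underscore)
model.fit(X_demo, y_demo)
n_trees = len(model.estimators_)           # the fitted decision trees
importances = model.feature_importances_   # derived from the fitted trees
print(n_trees, importances.round(3))
```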

In [1]:
import pandas as pd
import numpy as np
from time import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RepeatedKFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

## Loading transformed dataset

In [2]:
root = r'../data/regression/cars_reg_trf.csv'

df = pd.read_csv(root)

df.head()

| Unnamed: 0 | co2_emiss | height | length | max_speed | mixed_cons | weight | tank_vol | acc | price | gearbox_Automatic | ... | doors_2 | doors_3 | doors_4 | doors_5 | brand_encoded | model_encoded | city_encoded | color_encoded | type_encoded | chassis_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.120443 | 0.285121 | -2.87753 | -2.14361 | -0.679227 | -2.897385 | -2.456544 | 1.199172 | -1.286133 | 5.199338 | ... | 5.199338 | -5.199338 | -5.199338 | -5.199338 | -1.579469 | -1.083566 | -1.505232 | -0.902609 | 5.199338 | 0.608153 |
| 1 | 1.303557 | -1.624785 | -0.685567 | 1.345832 | 1.358362 | -0.272066 | -0.0855 | -1.377568 | -0.457524 | -5.199338 | ... | -5.199338 | 5.199338 | -5.199338 | -5.199338 | 0.32989 | -0.367258 | 0.314018 | -0.902609 | 5.199338 | 1.521013 |
| 2 | -0.572277 | -1.297705 | 0.171631 | 0.937119 | -0.833062 | 0.438088 | -1.220053 | -0.794587 | 1.419275 | 5.199338 | ... | -5.199338 | -5.199338 | 5.199338 | -5.199338 | 0.698331 | 0.833062 | 0.596109 | 5.199338 | 5.199338 | 1.521013 |
| 3 | 0.63558 | 0.540027 | 0.519795 | 0.651025 | 0.361889 | 1.07906 | 0.93323 | -1.018128 | 1.035926 | 5.199338 | ... | -5.199338 | -5.199338 | -5.199338 | 5.199338 | 0.698331 | 1.6636 | -1.056848 | -0.902609 | 5.199338 | 1.521013 |
| 4 | 3.74407 | -2.967122 | 0.012559 | -0.174181 | 3.436439 | 0.439471 | 1.120443 | -2.87753 | 1.035145 | 5.199338 | ... | 5.199338 | -5.199338 | -5.199338 | -5.199338 | -1.183849 | 0.277283 | -0.287738 | -0.367258 | 5.199338 | 1.521013 |


In [3]:
# splitting features (X) & target (y)
X = df.drop('price', axis=1)
y = df['price']

print(X.shape)
print(y.shape)

if X.shape[0] == y.shape[0]:
    print("Correct shape to proceed with the fit!")
else:
    print("Please review the shapes, since they do not match for X & y")

(55366, 29)
(55366,)
Correct shape to proceed with the fit!


## Optimizing Hyperparameters with Grid & Random Search

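I found [this Towards Data Science article](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) really useful for building the search methods and finding the best hyperparameters. First, let's define the four components that each search needs:
* model: the estimator whose hyperparameters are tuned
* space: the hyperparameter values to explore
* cv: the cross-validation strategy used to evaluate each configuration
* scoring: the metric used to rank the configurations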
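As a quick illustration of how these four pieces fit together, here is a minimal sketch on a small synthetic dataset. The data, the reduced search space and the fold counts are assumptions chosen so that it runs in a few seconds; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold

# Illustrative data only (not the cars dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=0)

model = RandomForestRegressor(random_state=0)                    # model
space = {"n_estimators": [10, 20], "max_depth": [3, 5, None]}    # space
cv = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)      # cv
scoring = "r2"                                                   # scoring

search = RandomizedSearchCV(
    estimator=model,
    param_distributions=space,
    n_iter=3,          # number of configurations sampled from the space
    cv=cv,
    scoring=scoring,
    random_state=0,
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])  # configurations evaluated
n_splits = cv.get_n_splits()                      # folds x repeats
print(search.best_params_, round(search.best_score_, 3))
```

Each sampled configuration is scored on every split, so this toy search runs 3 x 6 = 18 fits before refitting the best configuration on all the data.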

### Random Search method

These are the `Random Forest Regressor` hyperparameters included in the search:
- `n_estimators` => number of trees in the forest
- `max_features` => max number of features considered for splitting a node
- `max_depth` => max number of levels in each decision tree
- `min_samples_split` => min number of data points a node must contain before it can be split
- `min_samples_leaf` => min number of data points allowed in a leaf node
- `bootstrap` => method for sampling data points (with or without replacement)

Their impact on the model is explained in more detail in [this hyperparameter tuning article](https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/) dedicated to `RF`.

In [4]:
# define model
rf = RandomForestRegressor()

# define search space
param_grid = dict()
param_grid['n_estimators'] = [int(x) for x in np.linspace(start = 10, stop = 200, num = 5)]
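# NOTE: 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3;
# for a regressor it is equivalent to max_features=1.0 (all features)
param_grid['max_features'] = ['auto', 'sqrt', 'log2']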
param_grid['max_depth'] = [int(x) for x in np.linspace(10, 50, num = 5)]
param_grid['max_depth'].append(None)
param_grid['min_samples_split'] = [2, 5, 10]
param_grid['min_samples_leaf'] = [1, 2, 4]
param_grid['bootstrap'] = [True, False]

# define evaluation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=33)

# define scoring
scoring='r2'

# define search
search_rdm = RandomizedSearchCV(    # RandomSearch
    estimator=rf,
    param_distributions=param_grid,
    n_iter=10,
    cv=cv,
    scoring=scoring,
    n_jobs=1,
    random_state=33,
)

In [5]:
param_grid

{'n_estimators': [10, 57, 105, 152, 200],
 'max_features': ['auto', 'sqrt', 'log2'],
 'max_depth': [10, 20, 30, 40, 50, None],
 'min_samples_split': [2, 5, 10],
 'min_samples_leaf': [1, 2, 4],
 'bootstrap': [True, False]}
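To see why the random search samples only 10 configurations, the size of this search space can be worked out by hand. The sketch below is plain Python arithmetic on the values shown above; the variable names are illustrative.

```python
from math import prod

# Same search space as printed above
space = {
    "n_estimators": [10, 57, 105, 152, 200],
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": [10, 20, 30, 40, 50, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Number of distinct configurations: product of the list lengths
n_combinations = prod(len(values) for values in space.values())

# RepeatedKFold(n_splits=10, n_repeats=3) evaluates each configuration 30 times
n_folds = 10 * 3
n_iter = 10  # configurations sampled by RandomizedSearchCV

exhaustive_fits = n_combinations * n_folds  # cost of trying every configuration
random_fits = n_iter * n_folds              # cost of the random search
coverage = n_iter / n_combinations          # share of the space actually sampled

print(n_combinations, exhaustive_fits, random_fits, f"{coverage:.2%}")
```

The space holds 1,620 configurations, so an exhaustive search would need 48,600 model fits, while the random search needs 300 (plus one final refit on the full data) and covers well under 1% of the space.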

In [6]:
# execute search
start_time = time()

result_rdm = search_rdm.fit(X, y)

rdm_time = time() - start_time

In [7]:
# summarize result
print('Best Score: {}'.format(result_rdm.best_score_))
print('Best Hyperparameters: {}'.format(result_rdm.best_params_))
print('Time consumed on the Search: {}'.format(rdm_time))

Best Score: 0.8929381150194977
Best Hyperparameters: {'n_estimators': 105, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 50, 'bootstrap': False}
Time consumed on the Search: 2305.2893855571747


### Grid Search method

The Random Search is used first to narrow down the range of values for each hyperparameter. We can now build a smaller grid around the best configuration it found and evaluate every combination in it with a Grid Search.

In [8]:
# define model
rf = RandomForestRegressor()

# define search space
param_grid = dict()
param_grid['n_estimators'] = [90, 100, 110]
param_grid['max_features'] = ['sqrt']
param_grid['max_depth'] = [50, 60, 80]
param_grid['min_samples_split'] = [10, 12, 15]
param_grid['min_samples_leaf'] = [1, 2, 4]
param_grid['bootstrap'] = [False]

# define evaluation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=33)

# define scoring
scoring='r2'

# define search
search_grd = GridSearchCV(    # GridSearch
    estimator=rf,
    param_grid=param_grid,
    cv=cv,
    scoring=scoring,
    n_jobs=1,
)

In [9]:
param_grid

{'n_estimators': [90, 100, 110],
 'max_features': ['sqrt'],
 'max_depth': [50, 60, 80],
 'min_samples_split': [10, 12, 15],
 'min_samples_leaf': [1, 2, 4],
 'bootstrap': [False]}
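Unlike the random search, the grid search evaluates every combination in this grid. A rough sketch of that enumeration in plain Python follows; `itertools.product` stands in for what `GridSearchCV` does internally, and the ordering of candidates here is illustrative, not necessarily the order scikit-learn uses.

```python
from itertools import product

# Same grid as printed above
grid = {
    "n_estimators": [90, 100, 110],
    "max_features": ["sqrt"],
    "max_depth": [50, 60, 80],
    "min_samples_split": [10, 12, 15],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [False],
}

# Cartesian product of all value lists: one dict per candidate configuration
keys = list(grid)
candidates = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

# Each candidate is fitted once per split of RepeatedKFold(n_splits=10, n_repeats=3)
n_fits = len(candidates) * 10 * 3

print(len(candidates), n_fits)
print(candidates[0])
```

The grid contains 81 candidates, which means 2,430 model fits under the same repeated 10-fold scheme.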

Once the search is defined, we can fit it to the data and retrieve the best resulting `hyperparameters`.

In [10]:
# execute search
start_time = time()

result_grd = search_grd.fit(X, y)

grd_time = time() - start_time

In [12]:
# summarize result
print('Best Score: {}'.format(result_grd.best_score_))
print('Best Hyperparameters: {}'.format(result_grd.best_params_))
print('Time consumed on the Search: {}'.format(grd_time))

Best Score: 0.8938731587848249
Best Hyperparameters: {'bootstrap': False, 'max_depth': 50, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 110}
Time consumed on the Search: 17251.929941654205
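To put both searches side by side, the reported scores and timings can be compared with a few lines of arithmetic. The values below are copied from the outputs above; `time()` returns seconds, so the timings are in seconds.

```python
# Results reported by the two searches above
rdm_score, rdm_time = 0.8929381150194977, 2305.2893855571747   # random search
grd_score, grd_time = 0.8938731587848249, 17251.929941654205   # grid search

score_gain = grd_score - rdm_score   # absolute improvement in mean CV R2
time_ratio = grd_time / rdm_time     # how much longer the grid search took

print(f"R2 gain: {score_gain:.5f}")
print(f"Grid search: {grd_time / 3600:.1f} h, random search: {rdm_time / 60:.0f} min")
print(f"Time ratio: {time_ratio:.1f}x")
```

The grid search improved the mean cross-validated R2 by roughly 0.001 while taking about 7.5 times as long as the random search.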
