<a href="https://colab.research.google.com/github/dajebbar/FreeCodeCamp-python-data-analysis/blob/main/Hyperparameter_tuning_by_randomized_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Hyperparameter tuning by randomized-search
---
In the previous notebook, we showed how to use a grid-search approach to search for the best hyperparameters maximizing the generalization performance of a predictive model.

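However, a grid-search approach has limitations. It does not scale well as the number of parameters to tune increases, because the number of combinations to evaluate is the product of the number of values tried for each parameter. In addition, the grid imposes a regularity on the search that can be problematic: only the values placed on the grid are ever evaluated.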
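
To make the scaling issue concrete, the sketch below counts the candidates of a grid as hyperparameters are added. The parameter names (`param_a` to `param_d`) and the choice of five values each are illustrative assumptions, not the grid from the previous notebook.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: five candidate values for each hyperparameter.
values = [0.01, 0.1, 1, 10, 100]
names = ["param_a", "param_b", "param_c", "param_d"]

grid_sizes = []
for n_params in range(1, len(names) + 1):
    # Build a grid over the first n_params hyperparameters and count it.
    grid = ParameterGrid({name: values for name in names[:n_params]})
    grid_sizes.append(len(grid))
    print(f"{n_params} hyperparameter(s): {len(grid)} combinations")
```

Each combination is then fitted once per cross-validation fold, so the total number of model fits grows even faster than these counts.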

In this notebook, we will present another method to tune hyperparameters called randomized search.

## Predictive model

In [1]:
from sklearn import set_config

set_config(display="diagram")

In [2]:
import pandas as pd

adult_census = pd.read_csv("./adult.csv")

We extract the column containing the target.


In [3]:
target_name = "class"
target = adult_census[target_name]
target

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object

We drop the target from our data, along with the "`education-num`" column, which duplicates the information in the "`education`" column.

In [4]:
data = adult_census.drop(columns=[target_name, "education-num"])
data.head()

| | age | workclass | fnlwgt | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | 103497 | Some-college | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |


Once the dataset is loaded, we split it into training and testing sets.

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data, target, random_state=42)

## Predictive Pipeline

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_selector as selector

# Select column names by dtype: strings are categorical, numbers are numerical.
categorical_column = selector(dtype_include='object')(data)
numerical_column = selector(dtype_include='number')(data)

# Encode categories as integers; categories unseen during fit are mapped to -1.
categorical_preprocessor = OrdinalEncoder(handle_unknown='use_encoded_value',
                                          unknown_value=-1)
numerical_preprocessor = StandardScaler()

# Dispatch each group of columns to its preprocessor.
preprocessor = ColumnTransformer([
    ('standard-scaler', numerical_preprocessor, numerical_column),
    ('cat-prep', categorical_preprocessor, categorical_column),
], remainder='passthrough', sparse_threshold=0)
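
The `handle_unknown` and `unknown_value` settings of the `OrdinalEncoder` matter during cross-validation, since a rare category may be absent from a training fold and only show up at prediction time. The following minimal sketch, using a made-up toy column rather than the census data, shows how such an unseen category is encoded.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on a toy column with two known categories (sorted alphabetically).
encoder.fit(np.array([["Private"], ["Local-gov"]], dtype=object))

# "Self-emp" was not seen during fit, so it is mapped to unknown_value.
encoded = encoder.transform(
    np.array([["Local-gov"], ["Private"], ["Self-emp"]], dtype=object)
)
print(encoded.ravel())
```

Without these settings, the encoder's default behaviour is to raise an error when it meets an unknown category at transform time.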

In [7]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', HistGradientBoostingClassifier(random_state=42,
                                                  max_leaf_nodes=4)),
])

## Tuning using a randomized-search

With the `GridSearchCV` estimator, the parameter values need to be specified explicitly. We already mentioned that exploring a large number of values for different parameters quickly becomes intractable.

Instead, we can randomly generate the parameter candidates. Such an approach avoids the regularity of the grid, so every additional evaluation increases the resolution of the search in each direction. This matters in the frequent situation where the choice of some hyperparameters is not very important.

Consider two hyperparameters, where hyperparameter 2 has only a weak influence on the score. With a grid, the evaluation budget has to be divided across the two hyperparameters, so only a few distinct values of hyperparameter 1 are tried, and the region of good values can fall between the lines of the grid. A stochastic search instead samples hyperparameter 1 independently from hyperparameter 2, so each candidate tries a new value of hyperparameter 1 and the search is more likely to land in the optimal region.
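
The argument can be checked numerically. The sketch below assumes a budget of 9 evaluations and two hypothetical hyperparameters in the range [0, 1]; it compares how many distinct values of hyperparameter 1 are tried by a 3 x 3 grid and by 9 random draws.

```python
import numpy as np

rng = np.random.default_rng(0)
budget = 9  # total number of candidates we can afford to evaluate

# Grid search: a 3 x 3 grid over two hypothetical hyperparameters in [0, 1].
axis = np.linspace(0, 1, 3)
grid_points = np.array([(h1, h2) for h1 in axis for h2 in axis])

# Random search: the same budget, but each coordinate is drawn independently.
random_points = rng.uniform(0, 1, size=(budget, 2))

n_grid_values = len(np.unique(grid_points[:, 0]))
n_random_values = len(np.unique(random_points[:, 0]))
print(f"distinct values of hyperparameter 1 (grid):   {n_grid_values}")
print(f"distinct values of hyperparameter 1 (random): {n_random_values}")
```

If hyperparameter 2 barely affects the score, the grid effectively spends its whole budget on three values of hyperparameter 1, whereas the random draws explore nine.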

The `RandomizedSearchCV` class allows for such a stochastic search. It is used similarly to `GridSearchCV`, but sampling distributions need to be specified instead of parameter values. For instance, we will draw candidates using a log-uniform distribution because the parameters we are interested in take positive values with a natural log scaling (0.1 is as close to 1 as 10 is).
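
As a rough illustration of this log scaling, the sketch below draws samples from `scipy.stats.loguniform` over four decades and counts the share falling in each decade. The range, sample size, and seed are arbitrary choices made for the example.

```python
import numpy as np
from scipy.stats import loguniform

# Four decades, from 1e-3 to 10.
samples = loguniform(1e-3, 10).rvs(size=10_000, random_state=0)

# Count how many samples fall in each decade [1e-3, 1e-2), [1e-2, 1e-1), ...
edges = np.logspace(-3, 1, num=5)
counts, _ = np.histogram(samples, bins=edges)
fractions = counts / counts.sum()
for low, high, frac in zip(edges[:-1], edges[1:], fractions):
    print(f"[{low:g}, {high:g}): {frac:.2f}")
```

Each decade should receive roughly a quarter of the samples, whereas a plain uniform distribution over the same range would put about 90% of them between 1 and 10.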

Random search (with `RandomizedSearchCV`) is typically beneficial compared to grid search (with `GridSearchCV`) when optimizing 3 or more hyperparameters.

We will optimize 3 other parameters in addition to the ones we optimized in the notebook presenting the `GridSearchCV`:

- `l2_regularization`: it corresponds to the constant used to regularize the loss function;
- `min_samples_leaf`: it corresponds to the minimum number of samples required in a leaf;
- `max_bins`: it corresponds to the maximum number of bins to construct the histograms.  

We recall the meaning of the 2 remaining parameters:

- `learning_rate`: it corresponds to the speed at which the gradient-boosting will correct the residuals at each boosting iteration;
- `max_leaf_nodes`: it corresponds to the maximum number of leaves for each tree in the ensemble.

Note:
`scipy.stats.loguniform` can be used to generate floating-point numbers. To generate random values for integer-valued parameters (e.g. `min_samples_leaf`), we can adapt it as follows:

In [8]:
from scipy.stats import loguniform


class LogUniformInt:
    """Integer valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)
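
As a quick sanity check of this helper, the sketch below repeats the class so that it runs on its own, draws a batch of values, and then samples candidates through `ParameterSampler`, which calls the `rvs` method of each distribution. The bounds, seed, and number of candidates are arbitrary choices.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler


class LogUniformInt:
    """Copy of the helper above, repeated so this snippet runs on its own."""

    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        return self._distribution.rvs(*args, **kwargs).astype(int)


# The float draws are truncated, so values are integers within the bounds.
samples = LogUniformInt(2, 256).rvs(size=1_000, random_state=0)
print(samples.dtype, samples.min(), samples.max())

# ParameterSampler calls rvs once per candidate and per parameter.
candidates = list(ParameterSampler(
    {"min_samples_leaf": LogUniformInt(1, 100)}, n_iter=5, random_state=0))
print(candidates)
```

Because `astype(int)` truncates toward zero, the upper bound itself is essentially never returned.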

Now, we can define the randomized search using the different distributions. Executing 10 iterations of 10-fold cross-validation for random parametrizations of this model on this dataset can take from 10 seconds to several minutes, depending on the speed of the host computer and the number of available processors.

Since the search addresses the parameters of the pipeline by name, we first list the available parameter names.

In [10]:
for p in model.get_params():
  print(p)

memory
steps
verbose
preprocessor
classifier
preprocessor__n_jobs
preprocessor__remainder
preprocessor__sparse_threshold
preprocessor__transformer_weights
preprocessor__transformers
preprocessor__verbose
preprocessor__verbose_feature_names_out
preprocessor__standard-scaler
preprocessor__cat-prep
preprocessor__standard-scaler__copy
preprocessor__standard-scaler__with_mean
preprocessor__standard-scaler__with_std
preprocessor__cat-prep__categories
preprocessor__cat-prep__dtype
preprocessor__cat-prep__handle_unknown
preprocessor__cat-prep__unknown_value
classifier__categorical_features
classifier__early_stopping
classifier__l2_regularization
classifier__learning_rate
classifier__loss
classifier__max_bins
classifier__max_depth
classifier__max_iter
classifier__max_leaf_nodes
classifier__min_samples_leaf
classifier__monotonic_cst
classifier__n_iter_no_change
classifier__random_state
classifier__scoring
classifier__tol
classifier__validation_fraction
classifier__verbose
classifier__warm_start


In [11]:
%%time
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'classifier__l2_regularization': loguniform(1e-6, 1e3),
    'classifier__learning_rate': loguniform(.001, 10),
    'classifier__max_leaf_nodes': LogUniformInt(2, 256),
    'classifier__min_samples_leaf': LogUniformInt(1, 100),
    'classifier__max_bins': LogUniformInt(2, 255),
}

model_random_search = RandomizedSearchCV(model,
                                         param_distributions=param_distributions,
                                         n_jobs=4,
                                         cv=10,
                                         verbose=1)

model_random_search.fit(X_train, y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits
CPU times: user 5.67 s, sys: 221 ms, total: 5.89 s
Wall time: 1min 10s
