# Hyperparameter Tuning



Hello learners! So far, we have built a theoretical understanding of hyperparameter tuning and its techniques. In this video, let us explore the first hyperparameter tuning technique: grid search.

We will begin with the basics. Import all the necessary libraries and read the dataset we prepared in the previous lesson.

In [25]:
#import relevant libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from skopt import BayesSearchCV
# parameter ranges are specified by one of below
from skopt.space import Real, Integer

In [26]:
#read train, validation and test data
df_train = pd.read_csv('train_new.csv')
df_test = pd.read_csv('test_new.csv')
df_val = pd.read_csv('val_new.csv')

In [27]:
x_train = df_train.drop(['Units_sold>1000'], axis = 1)
y_train = df_train['Units_sold>1000']

In [28]:
x_test = df_test.drop(['Units_sold>1000'], axis = 1)
y_test = df_test['Units_sold>1000']

In [29]:
x_val = df_val.drop(['Units_sold>1000'], axis = 1)
y_val = df_val['Units_sold>1000']

## Video 2 : Grid Search

<p style = 'color:green'><b>Run all the cells above before you begin</b></p>

We have been using GradientBoostingClassifier to make our predictions throughout this module. We will continue to do so.

We have been using three hyperparameters in GradientBoostingClassifier: n_estimators, max_depth and min_samples_leaf. This time, instead of min_samples_leaf, we will use learning_rate, as it is an important parameter for gradient boosting.

- n_estimators controls the number of decision trees that are built in sequence in a boosted ensemble model. Increasing the number of estimators generally improves the performance of the model, but a very high number can lead to overfitting. Since our goal is to increase the performance of the model on test data, we will check the performance of the model across a list of values - [50, 100, 200, 300, 400].

- max_depth sets the maximum depth of each tree. The higher the number, the more the model learns from the training data, which may result in overfitting. Our model was slightly overfitting at the end of feature engineering at a max_depth of 9, so let us try values from 6 to 12 in steps of 2.

- learning_rate controls the weight each tree gets in the final prediction. A higher value of learning_rate can lead to overfit models. So far we have been using the default value of 0.1. This time, we will try two other values, 0.2 and 0.3, along with the default value.

So let us begin with grid search by importing the relevant library.

In [30]:
from sklearn.model_selection import GridSearchCV

Then, we set the parameter grid with the values discussed above. The grid has to be presented in a dictionary format with the key being the name of the hyperparameter and the value being the list of values we want to try.

In [31]:
param_grid = {
    'n_estimators': [50, 100, 200, 300, 400],
    'max_depth': range(6, 13, 2),
    'learning_rate': [0.1, 0.2, 0.3]
}
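Before fitting anything, it can help to count how many models a grid like this implies. The sketch below uses scikit-learn's `ParameterGrid` for that; the fold count of 5 is an assumption here, chosen to match the cross-validation setting used later in this video.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above; range() is expanded to a list for readability
param_grid = {
    'n_estimators': [50, 100, 200, 300, 400],
    'max_depth': list(range(6, 13, 2)),   # [6, 8, 10, 12]
    'learning_rate': [0.1, 0.2, 0.3],
}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 4 * 3 combinations
cv_folds = 5                                   # assumed fold count
n_fits = n_candidates * cv_folds               # one fit per combination per fold

print(n_candidates, n_fits)
```

Adding one more value to any list multiplies the number of fits, which is why the grid is kept small.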

Then let's create an instance, model_GBC, that holds the GradientBoostingClassifier with its default hyperparameter values. Only random_state is fixed, so that the results are reproducible.

In [32]:
model_GBC = GradientBoostingClassifier(random_state=42)

Now comes the step where we create the grid_search_cv instance, which holds all the settings needed to build the multiple models with different hyperparameters.

In [33]:
grid_search_cv = GridSearchCV(estimator=model_GBC, param_grid=param_grid, n_jobs=-1, verbose=2,
                      cv=5, scoring='f1')

Now, let us fit this instance to our training data. Note that this may take some time, as we are evaluating 60 hyperparameter combinations and each combination is trained 5 times, once per fold, since we are using 5-fold cross-validation. That is 300 fits in total. For context, it took us more than 17 minutes to execute this code. Also, the verbose parameter prints the progress of training on the different folds as it happens.

Note that by adding %%time at the beginning of the cell, you can get the time taken for the code block to execute.
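%%time is a notebook cell magic, so it is not available in a plain Python script. A minimal sketch of the same idea outside a notebook, using the standard library's `time.perf_counter`, could look like this; the workload is only a placeholder for a call such as `fit`.

```python
import time

start = time.perf_counter()

# Placeholder workload standing in for a long-running call such as fit()
total = sum(i * i for i in range(100_000))

elapsed = time.perf_counter() - start
print(f"Wall time: {elapsed:.3f} s")
```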

In [34]:
%%time
grid_search_cv.fit(x_train, y_train)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
CPU times: user 43.2 s, sys: 558 ms, total: 43.7 s
Wall time: 17min 40s


Now, let us get the best hyperparameter values and the best cross-validation score.

In [36]:
# Best performing model
grid_search_cv.best_estimator_

In [37]:
# Hyperparameters of the best performing model
grid_search_cv.best_params_

{'learning_rate': 0.3, 'max_depth': 10, 'n_estimators': 400}

We can see that for learning_rate and n_estimators, the best value sits at the upper end of the range we tried. This suggests that there might be better-performing hyperparameters outside the range we have explored, which means the search space may not adequately cover the optimal values. For now, we are continuing with these values. You can experiment further by changing the values in the parameter grid we defined above.
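The boundary check described here can also be done in code. The sketch below uses a hypothetical helper, `params_on_boundary`, which is not part of scikit-learn, together with the grid and the best parameters reported above.

```python
# Grid and best parameters as reported above
param_grid = {
    'n_estimators': [50, 100, 200, 300, 400],
    'max_depth': list(range(6, 13, 2)),
    'learning_rate': [0.1, 0.2, 0.3],
}
best_params = {'learning_rate': 0.3, 'max_depth': 10, 'n_estimators': 400}


def params_on_boundary(best, grid):
    """Return the names of hyperparameters whose best value is the smallest or largest value tried."""
    return sorted(
        name for name, value in best.items()
        if value in (min(grid[name]), max(grid[name]))
    )


edge_params = params_on_boundary(best_params, param_grid)
print(edge_params)  # ['learning_rate', 'n_estimators']
```

Any name in this list is a candidate for widening its range in the next search.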

In [39]:
# Mean cross-validated F1 score of the best estimator
grid_search_cv.best_score_

0.9077340243467941

We can see that the cross-validation score is almost a percent higher than what we got at the end of feature selection in the last lesson. But to be really convinced of the performance, we need to check it on the validation set. Note that the grid_search_cv instance automatically stores the model with the best hyperparameters, so we can call predict on it directly.
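To see what "stores the best model" means in practice, here is a small sketch on synthetic data, not the course dataset, with a decision tree standing in for the slower gradient boosting model. With the default `refit=True`, calling `predict` on the search object gives the same result as calling it on `best_estimator_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any small binary dataset works here)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4, 6]},
    cv=3,
    scoring='f1',
)  # refit=True is the default
search.fit(X, y)

# predict() on the search object delegates to the refitted best model
same_predictions = np.array_equal(
    search.predict(X), search.best_estimator_.predict(X)
)
print(search.best_params_, same_predictions)
```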

In [40]:
#predict the dependent values
y_train_grid_search_pred = grid_search_cv.predict(x_train)
y_val_grid_search_pred = grid_search_cv.predict(x_val)

f1_train_grid_search = f1_score(y_train, y_train_grid_search_pred)
f1_val_grid_search = f1_score(y_val, y_val_grid_search_pred)

print("F1 Score on Train data:", f1_train_grid_search)
print("F1 Score on Val data:", f1_val_grid_search)

F1 Score on Train data: 1.0
F1 Score on Val data: 0.9122368146758391


As we can see, the model is overfitting, with a perfect score of 1 on the training data and an F1 score of 91.22% on the validation data, which is our highest so far.

And with that, we come to the end of this video, where we have successfully implemented grid search with cross-validation. Without much manual trial and error, we were able to generate a higher-performing model using grid search. But it comes with its own advantages and disadvantages. Let us discuss these next.

__Jump to PPT__

## Video 3 - RandomizedSearchCV
<p style = 'color:green'><b>Run all the cells above before you begin</b></p>


The entire process remains the same as in grid search, so we will move through this quickly.

Let us begin with importing the relevant library.

In [41]:
from sklearn.model_selection import RandomizedSearchCV

Then we set up the random_cv instance, which holds all the settings needed to build the multiple models. The only difference is an additional setting, n_iter=10, which says we will only try 10 randomly chosen combinations of hyperparameters.

Unlike GridSearchCV, which exhaustively tries all possible combinations of the specified hyperparameters, RandomizedSearchCV randomly selects combinations. This can be significantly faster, especially when the hyperparameter space is large and only a limited number of iterations are needed to find a good combination.
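To get a feel for what "randomly selects combinations" means, the sketch below draws 10 combinations from the same grid with scikit-learn's `ParameterSampler`. The `random_state` is set here only to make the sketch repeatable; it is an assumption and not part of the search configured in this video.

```python
from sklearn.model_selection import ParameterSampler

param_grid = {
    'n_estimators': [50, 100, 200, 300, 400],
    'max_depth': list(range(6, 13, 2)),
    'learning_rate': [0.1, 0.2, 0.3],
}

# Draw 10 of the 60 possible combinations
sampled = list(ParameterSampler(param_grid, n_iter=10, random_state=0))

# When every parameter is given as a list, combinations are drawn without replacement
unique = {tuple(sorted(p.items())) for p in sampled}

for params in sampled:
    print(params)
print(len(sampled), len(unique))
```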

In [42]:
random_cv = RandomizedSearchCV(estimator=model_GBC, n_iter=10, 
                               param_distributions=param_grid, n_jobs=-1, 
                               cv=5, scoring='f1')

Now we fit the model, find the best score and the estimator.

In [43]:
%%time
random_cv.fit(x_train, y_train)

CPU times: user 42.8 s, sys: 230 ms, total: 43 s
Wall time: 3min 11s


Notice how this took less than 4 minutes, compared to the whopping 17 minutes it took to execute the grid search code. Now let us get the best values of parameters.

In [48]:
#best model estimator as per random search
random_cv.best_estimator_

In [49]:
random_cv.best_params_

{'n_estimators': 400, 'max_depth': 10, 'learning_rate': 0.3}

Notice how random search found the same parameters as grid search in a fraction of the time. It was faster because it evaluated only 10 of the 60 combinations instead of the entire pool. Landing on the same best combination is not guaranteed, though; a different random draw may miss it. Next, let's check the cross-validation score.
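As a rough worked example, assume exactly one of the 60 combinations is the best one and that the 10 combinations are drawn without replacement. The chance that a single random search run includes it is then:

```python
n_candidates = 5 * 4 * 3   # combinations in the grid
n_iter = 10                # combinations random search tries

# Chance that one specific combination is among the 10 drawn
p_hit = n_iter / n_candidates
print(f"{p_hit:.1%}")  # 16.7%
```

So matching the grid search result here is plausible but should not be expected on every run.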

In [50]:
#cross validation score for the best model after random search
random_cv.best_score_

0.9077340243467941

The score looks good. Now let us check the score on validation set.

In [47]:
y_train_randomcv_pred = random_cv.predict(x_train)

y_val_randomcv_pred = random_cv.predict(x_val)

f1_train_randomcv = f1_score(y_train, y_train_randomcv_pred)
f1_val_randomcv = f1_score(y_val, y_val_randomcv_pred)

print("F1 Score on Train data:", f1_train_randomcv)
print("F1 Score on Val data:", f1_val_randomcv)

F1 Score on Train data: 1.0
F1 Score on Val data: 0.9122368146758391

And we can see that we got the same score with much less effort than grid search, which had to do extra work and run 300 fits. Now let us understand the advantages and disadvantages of this method.

__Jump to PPT__

## Video 5 - Bayesian Optimization
<p style = 'color:green'><b>Run all the cells above before you begin</b></p>


Let us begin by importing the libraries. We will import the BayesSearchCV class from the skopt library.
Scikit-Optimize, or skopt, is a simple and efficient library for minimizing (very) expensive functions. It is built on top of NumPy, SciPy and Scikit-Learn.

In [9]:
from skopt import BayesSearchCV
# parameter ranges are specified by one of below
from skopt.space import Real, Integer

Now let us set up and configure an instance of BayesSearchCV, which takes the model that will be tuned during Bayesian Optimization and the search space as a dictionary. We will tune the same hyperparameters as before, but each one is now given as a range (Real or Integer) rather than a list of values.

Then we add n_iter, random_state and n_jobs.

In [12]:
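# Real and Integer use a uniform prior by default, so each range below is searched uniformly
# scoring is not set, so the estimator's default metric (accuracy) is used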
opt = BayesSearchCV(
    GradientBoostingClassifier(),
    {
        'learning_rate': Real(0.1, 0.3),
        'max_depth': Integer(6, 13),
        'n_estimators': Integer(50, 500),
    },
    n_iter=10,
    random_state=0,
    n_jobs=-1
)

Next, we fit the train data to the opt instance we defined above. As usual, this will take time to run. As you can see, it took us more than 7 minutes to run this code.

In [13]:
%%time
_ = opt.fit(x_train, y_train)

CPU times: user 30.7 s, sys: 1.78 s, total: 32.4 s
Wall time: 7min 29s


Now, we find the best parameters.

In [15]:
opt.best_params_

OrderedDict([('learning_rate', 0.29342624497686),
             ('max_depth', 11),
             ('n_estimators', 242)])

The benefit of using Bayesian search is that we may get any parameter value that lies within the defined range, whereas grid search and random search, as set up here, stick to the combinations of values listed in the parameter grid.
Now let us get the scores on train and validation data.

In [14]:
print(opt.score(x_train, y_train))
print(opt.score(x_val, y_val))

1.0
0.8937850229240958


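As we can see, we are getting an extremely overfit model, which scores lower on the validation data than random search. Keep in mind that opt was created without a scoring argument, so these scores are accuracy rather than F1, and the comparison with the earlier F1 scores is only indicative. The good thing is that there is scope to get a better model by increasing the number of iterations. But this will come at the cost of a higher run time.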
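What a search object's score method returns depends on how the search was configured. The sketch below illustrates this on synthetic data, not the course dataset, with GridSearchCV and a decision tree as stand-ins so that it runs quickly and without skopt. BayesSearchCV is built on the same scikit-learn search interface, so the same behaviour is assumed to apply to opt.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in data
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=0
)

grid = {'max_depth': [2, 3]}
tree = DecisionTreeClassifier(random_state=0)

default_search = GridSearchCV(tree, grid, cv=3).fit(X, y)           # no scoring given
f1_search = GridSearchCV(tree, grid, cv=3, scoring='f1').fit(X, y)  # F1 requested

# score() follows the metric the search was configured with
default_is_accuracy = np.isclose(
    default_search.score(X, y), accuracy_score(y, default_search.predict(X))
)
f1_is_f1 = np.isclose(
    f1_search.score(X, y), f1_score(y, f1_search.predict(X))
)
print(default_is_accuracy, f1_is_f1)
```

Passing scoring='f1' when creating the search, or calling f1_score on its predictions as in the earlier videos, keeps the three methods comparable.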

With that, let us jump to the advantages and disadvantages of Bayesian Optimization.