# <font color='blue'>Parameter Optimization with Randomized Search</font>

Every Machine Learning model exposes settings that control how it is built and trained. These settings are called hyperparameters.

In code, Machine Learning algorithms are represented by functions or classes, and the arguments we pass to them are exactly what we call hyperparameters.

The word "parameters" on its own is commonly used for the model coefficients, which are learned from the data and only known at the end of training. Hyperparameters, in contrast, are set by us before training begins.
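
To make the distinction concrete, here is a minimal sketch. It assumes a synthetic dataset and a LogisticRegression model, neither of which is part of this notebook; they are used only because the learned coefficients are easy to inspect.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset, used only for illustration (not the credit data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameters: chosen by us before training
model = LogisticRegression(C=0.5, max_iter=200)
hyperparams = model.get_params()
print('Hyperparameter C:', hyperparams['C'])

# Parameters (coefficients): estimated from the data during fit()
model.fit(X_demo, y_demo)
print('Learned coefficients:', model.coef_)
print('Learned intercept:', model.intercept_)
```

The value of C is whatever we typed, while coef_ and intercept_ only exist after fit() has seen the data.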

Part of our job as Data Scientists is to find the best combination of hyperparameters for each model.

In Ensemble Methods this work is even more complex, as we have the hyperparameters of the base estimator and the hyperparameters of the ensemble model, as shown in this example below:

Base estimator:
estim_base = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform')

Ensemble Model:
BaggingClassifier(base_estimator=estim_base, bootstrap=True, bootstrap_features=False, max_features=0.5, max_samples=0.5, n_estimators=10, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
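
The two levels of hyperparameters can be listed with get_params(). The sketch below is only an illustration: the base estimator is passed positionally on the assumption that this works across scikit-learn releases, since the keyword name differs between versions (base_estimator in older releases, estimator in newer ones).

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Base estimator with its own hyperparameters
estim_base = KNeighborsClassifier(n_neighbors=5)

# Ensemble model wrapping the base estimator (passed positionally on purpose)
ensemble = BaggingClassifier(estim_base, n_estimators=10,
                             max_samples=0.5, max_features=0.5)

# deep=True also returns the nested hyperparameters of the base estimator,
# using the '<component>__<name>' naming convention
params = ensemble.get_params(deep=True)
ensemble_level = sorted(k for k in params if '__' not in k)
base_level = sorted(k for k in params if '__' in k)

print('Ensemble hyperparameters:', ensemble_level)
print('Base estimator hyperparameters:', base_level)
```

The names containing a double underscore are the ones a search would use to tune the base estimator through the ensemble.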

## Extremely Randomized Forest

Baseline model, with manually chosen hyperparameters (everything else is left at its default value).

In [2]:
# Imports
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import warnings
warnings.simplefilter(action = 'ignore', category=FutureWarning)

In [2]:
# Loading dataset
data = pd.read_excel('dados/credit.xls',skiprows=1)

In [4]:
display(data)

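| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
| 5 | 6 | 50000 | 1 | 1 | 2 | 37 | 0 | 0 | 0 | 0 | ... | 19394 | 19619 | 20024 | 2500 | 1815 | 657 | 1000 | 1000 | 800 | 0 |
| 6 | 7 | 500000 | 1 | 1 | 2 | 29 | 0 | 0 | 0 | 0 | ... | 542653 | 483003 | 473944 | 55000 | 40000 | 38000 | 20239 | 13750 | 13770 | 0 |
| 7 | 8 | 100000 | 2 | 2 | 2 | 23 | 0 | -1 | -1 | 0 | ... | 221 | -159 | 567 | 380 | 601 | 0 | 581 | 1687 | 1542 | 0 |
| 8 | 9 | 140000 | 2 | 3 | 1 | 28 | 0 | 0 | 2 | 0 | ... | 12211 | 11793 | 3719 | 3329 | 0 | 432 | 1000 | 1000 | 1000 | 0 |
| 9 | 10 | 20000 | 1 | 3 | 2 | 35 | -2 | -2 | -2 | -2 | ... | 0 | 13007 | 13912 | 0 | 0 | 0 | 13007 | 1122 | 0 | 0 |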


In [6]:
# Target variable
target = 'default payment next month'
y = np.asarray(data[target])

In [7]:
# Predictor variables
features = data.columns.drop(['ID', target])
X = np.array(data[features])

In [8]:
# Train/test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 99)

In [9]:
# Classifier
clf = ExtraTreesClassifier(n_estimators=500, random_state=99)

In [10]:
# Model training
clf.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=500,
                     n_jobs=None, oob_score=False, random_state=99, verbose=0,
                     warm_start=False)

In [11]:
# Score
scores = cross_val_score(clf, X_train, y_train, cv = 3, scoring = 'accuracy', n_jobs=-1)

In [12]:
# Printing the result
print('ExtraTreesClassifier -> Training Accuracy: Mean = %0.3f Standard Deviation = %0.3f' % (np.mean(scores), np.std(scores)))

ExtraTreesClassifier -> Training Accuracy: Mean = 0.812 Standard Deviation = 0.002


In [13]:
# Making predictions
y_pred = clf.predict(X_test)

In [14]:
# Confusion Matrix
confusionMatrix = confusion_matrix(y_test, y_pred)
print(confusionMatrix)

[[6532  446]
 [1273  749]]


In [15]:
# Accuracy
print('Test Accuracy:', accuracy_score(y_test, y_pred))

Test Accuracy: 0.809
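
The accuracy above can be recovered by hand from the confusion matrix. The short sketch below redoes that arithmetic with the numbers printed in the previous cell, relying on the scikit-learn convention that rows are true classes and columns are predicted classes. The recall line is an extra check that is not part of the original notebook.

```python
# Confusion matrix reported above: rows = true class, columns = predicted class
cm = [[6532, 446],
      [1273, 749]]

tn, fp = cm[0]  # true class 0: correctly / incorrectly classified
fn, tp = cm[1]  # true class 1: incorrectly / correctly classified
total = tn + fp + fn + tp

# Accuracy is the share of correct predictions (the main diagonal)
accuracy = (tn + tp) / total

# Recall for class 1 (defaults): share of real defaults that were detected
recall_default = tp / (fn + tp)

print('Accuracy: %.3f' % accuracy)
print('Recall for class 1: %.3f' % recall_default)
```

Accuracy alone can hide how the model behaves on the minority class, which is why it can be worth looking at the matrix itself and not only at the single score.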


## Parameter Optimization with Randomized Search

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

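Randomized Search draws a fixed number of parameter combinations (n_iter) at random from the candidate values or distributions we supply. When the candidates are given as lists, each value in a list is equally likely to be picked. A model is built and evaluated with cross-validation for each sampled combination.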
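
To see what the sampling step looks like in isolation, the sketch below draws candidates with ParameterSampler from sklearn.model_selection, using the same value lists as the search in this section. The random_state value is an arbitrary choice made here for reproducibility.

```python
from sklearn.model_selection import ParameterSampler

# Same candidate values used by the search in this section
param_dist = {'max_depth': [1, 3, 7, 8, 12, None],
              'max_features': [8, 9, 10, 11, 16, 22],
              'min_samples_split': [8, 10, 11, 14, 16, 19],
              'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7],
              'bootstrap': [True, False]}

# Size of the full grid these lists would generate
total_combinations = 1
for values in param_dist.values():
    total_combinations *= len(values)

# Draw 25 candidate combinations at random (random_state is an assumption)
candidates = list(ParameterSampler(param_dist, n_iter=25, random_state=99))

print('Combinations in the full grid:', total_combinations)
print('Combinations actually sampled:', len(candidates))
print('First candidate:', candidates[0])
```

Only a small fraction of the possible combinations is evaluated, which is what keeps the search affordable.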

In [16]:
# Import
from sklearn.model_selection import RandomizedSearchCV

In [17]:
# Parameter definition
param_dist = {'max_depth': [1, 3, 7, 8, 12, None],
              'max_features': [8, 9 ,10, 11, 16, 22],
              'min_samples_split': [8, 10, 11, 14, 16, 19],
              'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7],
              'bootstrap': [True, False]} 

# For the classifier created with ExtraTrees, we test 25 randomly sampled combinations of parameters
rsearch = RandomizedSearchCV(clf, param_distributions = param_dist, n_iter = 25, return_train_score = True)

# Running the search on the training dataset (each candidate is scored with cross-validation)
rsearch.fit(X_train, y_train)

# Results 
rsearch.cv_results_

# Printing the best estimator
bestclf = rsearch.best_estimator_
display(bestclf)

# Applying the best estimator to make predictions
y_pred = bestclf.predict(X_test)

# Confusion Matrix
confusionMatrix = confusion_matrix(y_test, y_pred)
display(confusionMatrix)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
display(accuracy)

ExtraTreesClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=8, max_features=22,
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=2, min_samples_split=8,
                     min_weight_fraction_leaf=0.0, n_estimators=500,
                     n_jobs=None, oob_score=False, random_state=99, verbose=0,
                     warm_start=False)

array([[6655,  323],
       [1287,  735]], dtype=int64)

0.8211111111111111
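
Besides best_estimator_, a fitted search also exposes best_params_ and best_score_. The sketch below shows them on a small synthetic problem so that it runs quickly; the dataset, the reduced value lists, and the random_state passed to the search are assumptions made for this illustration, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset, used only for illustration (not the credit data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=99)

demo_clf = ExtraTreesClassifier(n_estimators=20, random_state=99)
demo_dist = {'max_depth': [3, 8, None],
             'min_samples_leaf': [1, 2, 5]}

# random_state on the search makes the sampled candidates reproducible
demo_search = RandomizedSearchCV(demo_clf, param_distributions=demo_dist,
                                 n_iter=4, cv=3, random_state=99)
demo_search.fit(X_demo, y_demo)

# Winning combination and its mean cross-validated score
print('Best parameters:', demo_search.best_params_)
print('Best CV score: %.3f' % demo_search.best_score_)
```

Note that best_score_ is a cross-validated score on the training data, so it is not directly comparable to the accuracy measured on the held-out test set.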

In [18]:
# Getting the results for every sampled parameter combination
rsearch.cv_results_

{'mean_fit_time': array([ 1.94569707, 26.37685919, 11.89235215,  5.26725478,  6.34976158,
         5.7690093 ,  2.51860065, 12.2201858 ,  2.85677147,  2.97098913,
         7.03003883,  4.72834454, 11.87376575,  5.30516758,  6.61777487,
         8.99297986,  2.1223762 ,  9.21238074,  5.10316858, 14.61928654,
         2.7393549 ,  6.1493432 , 18.30297132,  1.51610589,  9.16166816]),
 'std_fit_time': array([0.06391663, 0.24894138, 0.09201779, 0.08045881, 0.06214636,
        0.04038322, 0.07310179, 0.17175526, 0.02834409, 0.07089477,
        0.39867629, 0.26090535, 0.56203547, 0.35556586, 0.30335021,
        0.12185528, 0.35452263, 0.69731019, 0.1403083 , 0.55050327,
        0.04111368, 0.07775296, 0.43651785, 0.03476116, 0.04733845]),
 'mean_score_time': array([0.13202348, 0.53315101, 0.29235711, 0.2399755 , 0.31209311,
        0.22934499, 0.17918897, 0.32198129, 0.14802127, 0.17808981,
        0.24091787, 0.17859435, 0.52246437, 0.25442376, 0.22809029,
        0.23583755, 0.16116376, 0.2
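
The raw cv_results_ dictionary is hard to read. One common way to inspect it is to load it into a pandas DataFrame and sort by rank. The sketch below does this for a small synthetic search; the dataset and the reduced value lists are assumptions made so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset, used only for illustration (not the credit data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=99)

demo_search = RandomizedSearchCV(
    ExtraTreesClassifier(n_estimators=20, random_state=99),
    param_distributions={'max_depth': [3, 8, None],
                         'min_samples_leaf': [1, 2, 5]},
    n_iter=5, cv=3, random_state=99)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, one column per cv_results_ key
results = pd.DataFrame(demo_search.cv_results_)

# Keep the most informative columns and show the best candidates first
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
top = results[cols].sort_values('rank_test_score').head(3)
print(top)
```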

## Grid Search vs. Randomized Search for Hyperparameter Estimation

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

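Grid Search methodically evaluates every combination of the candidate values supplied for each hyperparameter, forming a grid. Randomized Search, by contrast, evaluates only a fixed number of randomly sampled combinations.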
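
The cost difference can be estimated before running anything. The sketch below enumerates the grid used in the Grid Search cell of this section and compares the number of model fits with a 20-candidate randomized search. The 5-fold cross-validation is an assumption (it is the default in recent scikit-learn versions).

```python
from itertools import product

# Grid used in the Grid Search cell of this section
param_grid = {'max_depth': [3, None],
              'max_features': [1, 3, 10],
              'min_samples_leaf': [1, 3, 10],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}

# Every combination of candidate values: the Cartesian product of the lists
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]
n_grid = len(combos)

cv_folds = 5          # assumption: default number of folds
n_iter_search = 20    # candidates sampled by the randomized search

grid_fits = n_grid * cv_folds
random_fits = n_iter_search * cv_folds

print('Grid Search candidates:', n_grid)
print('Grid Search fits:', grid_fits)
print('Randomized Search fits:', random_fits)
```

Each extra hyperparameter multiplies the size of the grid, while the cost of the randomized search stays fixed at n_iter candidates.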

In [20]:
# Importing libraries/functions
import numpy as np
from time import time
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits

# Dataset loading
digits = load_digits()
X, y = digits.data, digits.target

# Building classifier
clf = RandomForestClassifier(n_estimators = 20)

In [21]:
# Randomized Search

# Values and distributions of the parameters to be tested
param_dist = {'max_depth': [3, None],
              'max_features': sp_randint(1,11),
              'min_samples_leaf': sp_randint(1,11),
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}

# Executing Randomized Search
n_iter_search = 20
random_search = RandomizedSearchCV(clf,
                                   param_distributions = param_dist,
                                   n_iter = n_iter_search,
                                   return_train_score = True)

start = time()
random_search.fit(X, y)
print('RandomizedSearchCV ran in %.2f seconds for %d candidate parameter settings.'
      % ((time() - start), n_iter_search))

# Displays the sampled parameter combinations with their timing and score statistics
random_search.cv_results_

RandomizedSearchCV ran in 7.38 seconds for 20 candidate parameter settings.


{'mean_fit_time': array([0.08319497, 0.05199428, 0.03120189, 0.07743802, 0.0584909 ,
        0.04498496, 0.02874384, 0.03963089, 0.07546082, 0.04466381,
        0.06806479, 0.07358327, 0.10969782, 0.04830961, 0.05280528,
        0.11528282, 0.02997828, 0.07280784, 0.04043875, 0.05743456]),
 'std_fit_time': array([0.00820742, 0.00783205, 0.00084198, 0.01796276, 0.00316634,
        0.00294178, 0.00371716, 0.00336157, 0.00484614, 0.00749127,
        0.00188536, 0.00530221, 0.00975992, 0.00461434, 0.00194285,
        0.00733682, 0.00715059, 0.00219804, 0.00621608, 0.00570729]),
 'mean_score_time': array([0.00406528, 0.00439529, 0.00435567, 0.00462217, 0.00338497,
        0.00357413, 0.00267038, 0.00323653, 0.00459805, 0.00369959,
        0.00465088, 0.00463357, 0.00470438, 0.00438604, 0.00312271,
        0.00400438, 0.00270295, 0.00344925, 0.00441685, 0.00383401]),
 'std_score_time': array([0.00061404, 0.00103964, 0.00040051, 0.0011035 , 0.00102661,
        0.00121881, 0.00038568, 0.000388
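
In the Randomized Search cell, max_features and min_samples_leaf are given as sp_randint(1,11), a distribution and not a list. The sketch below checks which values such a distribution can produce; the sample size and random_state are arbitrary choices made for this illustration.

```python
from scipy.stats import randint as sp_randint

# Discrete uniform distribution over the integers low..high-1
dist = sp_randint(1, 11)

# Draw many samples to inspect the range of values produced
samples = dist.rvs(size=1000, random_state=0)

print('Smallest value drawn:', samples.min())
print('Largest value drawn:', samples.max())
```

The upper bound is exclusive, so sp_randint(1,11) yields integers from 1 to 10. RandomizedSearchCV draws one value from the distribution for each candidate it builds.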

In [22]:
# Grid Search

# Values of the parameters to be tested
param_grid = {'max_depth': [3, None],
              'max_features': [1, 3, 10],
              'min_samples_leaf': [1, 3, 10],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}

# Executing Grid Search
grid_search = GridSearchCV(clf,
                           param_grid = param_grid,
                           return_train_score = True)

start = time()
grid_search.fit(X, y)

print('GridSearchCV ran in %.2f seconds for all candidate parameter combinations.'
      % (time() - start))

# Displays the parameter combinations with their timing and score statistics
grid_search.cv_results_

GridSearchCV ran in 25.57 seconds for all candidate parameter combinations.


{'mean_fit_time': array([0.03912082, 0.03944354, 0.03258467, 0.03772817, 0.03246126,
        0.04078197, 0.05211825, 0.05478292, 0.05765777, 0.0542593 ,
        0.04757261, 0.04678392, 0.07442474, 0.05130744, 0.05148029,
        0.07524252, 0.07964392, 0.0736969 , 0.03268661, 0.0339417 ,
        0.03890471, 0.04646511, 0.04222207, 0.03216143, 0.05360637,
        0.0489995 , 0.05691981, 0.0563426 , 0.04275041, 0.04532938,
        0.06086297, 0.0561933 , 0.0487144 , 0.11347437, 0.10807409,
        0.08959684, 0.02478914, 0.02641182, 0.02850218, 0.03503838,
        0.03240089, 0.03261967, 0.04851818, 0.04279094, 0.05746369,
        0.053056  , 0.04655952, 0.03803816, 0.05148573, 0.0615447 ,
        0.06156802, 0.09678073, 0.09079299, 0.07749929, 0.03050818,
        0.03334479, 0.03189287, 0.03688159, 0.03384066, 0.04264941,
        0.06261101, 0.06370006, 0.04815612, 0.05461283, 0.04639568,
        0.03645782, 0.06887102, 0.09060616, 0.10029688, 0.23023081,
        0.21380715, 0.14089913]
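
To compare the two searches it helps to summarize cv_results_ and not read the raw arrays. The sketch below defines a hypothetical helper named report (not part of scikit-learn) that lists the best-ranked candidates. It is shown on a reduced randomized search over the digits data so that it runs quickly; the reduced value lists and random_state values are assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


def report(results, n_top=3):
    """Hypothetical helper: return (rank, mean, std, params) for top candidates."""
    rows = []
    for rank in range(1, n_top + 1):
        # Several candidates may share the same rank when scores are tied
        for idx in np.flatnonzero(results['rank_test_score'] == rank):
            rows.append((rank,
                         results['mean_test_score'][idx],
                         results['std_test_score'][idx],
                         results['params'][idx]))
    return rows


X_demo, y_demo = load_digits(return_X_y=True)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions={'max_depth': [3, None],
                         'min_samples_leaf': [1, 3, 10]},
    n_iter=4, cv=3, random_state=0)
demo_search.fit(X_demo, y_demo)

rows = report(demo_search.cv_results_)
for rank, mean, std, params in rows:
    print('Rank %d: mean accuracy %.3f (std %.3f) with %s' % (rank, mean, std, params))
```

The same helper can be applied to random_search.cv_results_ and grid_search.cv_results_ to check whether the much cheaper randomized search reached a score close to the one found by the exhaustive grid.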