# Hyper Parameter Optimization Techniques Contd.

## 3. Automated Hyperparameter Tuning

Automated Hyperparameter Tuning can be done using techniques such as:

- Bayesian Optimization
- Gradient Descent
- Evolutionary Algorithms

## 3.1 Bayesian Optimization

1. Bayesian optimization uses a probabilistic model to find the minimum of a function.
2. The aim is to find the input value to a function that gives the lowest possible output value.
3. It often performs better than random, grid, and manual search, giving better performance in the testing phase and shorter optimization time.
4. In Hyperopt, Bayesian Optimization is implemented by passing three main parameters to the function fmin:

- Objective Function: defines the loss function to minimize.
- Domain Space: defines the range of input values to test (in Bayesian Optimization this space creates a probability distribution for each of the Hyperparameters used).
- Optimization Algorithm: defines the search algorithm used to select the best input values for each new iteration.
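
The three ingredients above can be sketched without Hyperopt itself. The following is a minimal stand-in, assuming a toy quadratic loss and plain random sampling in place of a Bayesian algorithm. The names `minimize` and `random_suggest` and the layout of `space` are hypothetical and chosen for illustration only; they are not Hyperopt's API.

```python
import random

def objective(params):
    # Objective function: the loss to minimize (toy quadratic, minimum at x = 2).
    return (params["x"] - 2.0) ** 2

# Domain space: for each hyperparameter, a function that draws one candidate value.
space = {"x": lambda rng: rng.uniform(-10.0, 10.0)}

def random_suggest(space, history, rng):
    # Optimization algorithm: proposes the next point to evaluate.
    # This stand-in ignores the history; a Bayesian method would use it
    # to concentrate on promising regions.
    return {name: draw(rng) for name, draw in space.items()}

def minimize(fn, space, algo, max_evals, seed=0):
    rng = random.Random(seed)
    history = []  # (loss, params) pairs evaluated so far
    for _ in range(max_evals):
        params = algo(space, history, rng)
        history.append((fn(params), params))
    # Return the evaluation with the smallest loss.
    return min(history, key=lambda pair: pair[0])

best_loss, best_params = minimize(objective, space, random_suggest, max_evals=200)
print(best_loss, best_params)
```

Swapping `random_suggest` for a function that actually reads `history` is where a Bayesian method would differ from this sketch.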

In [9]:

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
df=pd.read_csv('diabetes.csv')
df.head()



Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [10]:
import numpy as np
df['Glucose']=np.where(df['Glucose']==0,df['Glucose'].median(),df['Glucose'])
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72,35,0,33.6,0.627,50,1
1,1,85.0,66,29,0,26.6,0.351,31,0
2,8,183.0,64,0,0,23.3,0.672,32,1
3,1,89.0,66,23,94,28.1,0.167,21,0
4,0,137.0,40,35,168,43.1,2.288,33,1


In [11]:
#### Independent And Dependent features
X=df.drop('Outcome',axis=1)
y=df['Outcome']

In [12]:
#### Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=0)

In [2]:
!pip install hyperopt

Collecting hyperopt
  Downloading hyperopt-0.2.4-py2.py3-none-any.whl (964 kB)
Installing collected packages: hyperopt
Successfully installed hyperopt-0.2.4


In [3]:
from hyperopt import hp,fmin,tpe,STATUS_OK,Trials

In [4]:
space = {'criterion': hp.choice('criterion', ['entropy', 'gini']),
        'max_depth': hp.quniform('max_depth', 10, 1200, 10),
        'max_features': hp.choice('max_features', ['auto', 'sqrt','log2', None]),
        'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.5),
        'min_samples_split' : hp.uniform ('min_samples_split', 0, 1),
        'n_estimators' : hp.choice('n_estimators', [10, 50, 300, 750, 1200,1300,1500])
    }


## hp.choice picks one option from a list
## hp.quniform samples evenly spaced values from a range (here multiples of 10, returned as floats)
## hp.uniform samples a floating-point number uniformly from a range
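
As a rough illustration of these three sampling expressions, the sketch below mimics them with the standard library. It assumes the documented behaviour of `hp.quniform(label, low, high, q)`, which returns `round(uniform(low, high) / q) * q`. The helper functions are hypothetical stand-ins, not Hyperopt code.

```python
import random

rng = random.Random(0)

def choice(options):
    # Like hp.choice: pick one entry from a list.
    return rng.choice(options)

def uniform(low, high):
    # Like hp.uniform: a float drawn uniformly between low and high.
    return rng.uniform(low, high)

def quniform(low, high, q):
    # Like hp.quniform: a uniform draw rounded to the nearest multiple of q.
    # Hyperopt reports the value as a float, so the sketch does the same.
    return float(round(rng.uniform(low, high) / q) * q)

sample = {
    "criterion": choice(["entropy", "gini"]),
    "max_depth": quniform(10, 1200, 10),
    "min_samples_leaf": uniform(0, 0.5),
}
print(sample)
```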

In [5]:
space

{'criterion': <hyperopt.pyll.base.Apply at 0x2a16c279a88>,
 'max_depth': <hyperopt.pyll.base.Apply at 0x2a16c279ec8>,
 'max_features': <hyperopt.pyll.base.Apply at 0x2a16c282488>,
 'min_samples_leaf': <hyperopt.pyll.base.Apply at 0x2a16c282848>,
 'min_samples_split': <hyperopt.pyll.base.Apply at 0x2a16c282c08>,
 'n_estimators': <hyperopt.pyll.base.Apply at 0x2a16c2844c8>}

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [7]:
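from sklearn.model_selection import cross_val_score

def objective(space):
    # Build a random forest from one sampled point of the search space.
    # hp.quniform returns a float, so max_depth is cast to int.
    model = RandomForestClassifier(criterion = space['criterion'], max_depth = int(space['max_depth']),
                                 max_features = space['max_features'],
                                 min_samples_leaf = space['min_samples_leaf'],
                                 min_samples_split = space['min_samples_split'],
                                 n_estimators = space['n_estimators'],
                                 )

    # Mean accuracy over 5-fold cross-validation on the training set
    accuracy = cross_val_score(model, X_train, y_train, cv = 5).mean()

    # We aim to maximize accuracy, therefore we return it as a negative value
    return {'loss': -accuracy, 'status': STATUS_OK }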

In [13]:
from sklearn.model_selection import cross_val_score
trials = Trials()
best = fmin(fn= objective,
            space= space,
            algo= tpe.suggest,
            max_evals = 80,
            trials= trials)
best

100%|███████████████████████████████████████████████| 80/80 [11:16<00:00,  8.46s/trial, best loss: -0.7622684259629482]


{'criterion': 0,
 'max_depth': 90.0,
 'max_features': 2,
 'min_samples_leaf': 0.014039745879919538,
 'min_samples_split': 0.10785280990155477,
 'n_estimators': 4}

In [14]:
## For hp.choice parameters, fmin returns the index of the selected option rather than the option itself,
## so the dictionaries below map each index back to its value

crit = {0: 'entropy', 1: 'gini'}
feat = {0: 'auto', 1: 'sqrt', 2: 'log2', 3: None}
est = {0: 10, 1: 50, 2: 300, 3: 750, 4: 1200,5:1300,6:1500}


print(crit[best['criterion']])
print(feat[best['max_features']])
print(est[best['n_estimators']])

entropy
log2
1200
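
The manual lookup above can be generalised. The following is a minimal sketch, assuming the option lists are kept in a plain dictionary in the same order they were passed to `hp.choice`; `choices` and `decode` are hypothetical names. Hyperopt also provides `space_eval(space, best)` for the same purpose.

```python
# Option lists, in the same order they were passed to hp.choice.
choices = {
    "criterion": ["entropy", "gini"],
    "max_features": ["auto", "sqrt", "log2", None],
    "n_estimators": [10, 50, 300, 750, 1200, 1300, 1500],
}

def decode(best, choices):
    # Replace the index stored for each hp.choice parameter with the option
    # it refers to; every other parameter is passed through unchanged.
    return {
        name: choices[name][value] if name in choices else value
        for name, value in best.items()
    }

# The result returned by fmin in the run above.
best = {
    "criterion": 0,
    "max_depth": 90.0,
    "max_features": 2,
    "min_samples_leaf": 0.014039745879919538,
    "min_samples_split": 0.10785280990155477,
    "n_estimators": 4,
}

decoded = decode(best, choices)
print(decoded["criterion"], decoded["max_features"], decoded["n_estimators"])
# prints: entropy log2 1200
```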


In [15]:
# max_depth comes back from hp.quniform as a float (90.0), so cast it to int
trainedforest = RandomForestClassifier(criterion = crit[best['criterion']], max_depth = int(best['max_depth']), 
                                       max_features = feat[best['max_features']], 
                                       min_samples_leaf = best['min_samples_leaf'], 
                                       min_samples_split = best['min_samples_split'], 
                                       n_estimators = est[best['n_estimators']]).fit(X_train,y_train)
predictionforest = trainedforest.predict(X_test)

print('---------------------- confusion_matrix -------------------------')
print(confusion_matrix(y_test,predictionforest))

print('---------------------- accuracy_score -------------------------')
print(accuracy_score(y_test,predictionforest))

print('---------------------- classification_report -------------------------')
print(classification_report(y_test,predictionforest))
acc5 = accuracy_score(y_test,predictionforest)

---------------------- confusion_matrix -------------------------
[[97 10]
 [22 25]]
---------------------- accuracy_score -------------------------
0.7922077922077922
---------------------- classification_report -------------------------
              precision    recall  f1-score   support

           0       0.82      0.91      0.86       107
           1       0.71      0.53      0.61        47

    accuracy                           0.79       154
   macro avg       0.76      0.72      0.73       154
weighted avg       0.78      0.79      0.78       154



In [16]:
acc5

0.7922077922077922
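
The accuracy and the class-1 scores in the report above follow directly from the confusion matrix. As a worked check, assuming scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
# Confusion matrix from the run above: rows are true classes, columns are predictions.
tn, fp = 97, 10   # true class 0: predicted 0, predicted 1
fn, tp = 22, 25   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many were correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
# prints: 0.7922 0.71 0.53 0.61
```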

## Genetic Algorithms
Genetic Algorithms try to apply natural selection mechanisms to Machine Learning contexts.

1. Let's imagine we create a population of N Machine Learning models with some predefined Hyperparameters. 
2. We can then calculate the accuracy of each model and keep just half of the models (the ones that perform best). 
3. We can now generate offspring with Hyperparameters similar to those of the best models, so that we again have a population of N models. 
4. At this point we can again calculate the accuracy of each model and repeat the cycle for a defined number of generations. 
5. In this way, only the best models survive at the end of the process.
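
The five steps above can be sketched in plain Python. This is a toy illustration under stated assumptions: `fitness` is a hypothetical stand-in for model accuracy (no model is trained), and the mutation scheme is one simple choice among many. It is not how TPOT implements its search.

```python
import random

def fitness(params):
    # Hypothetical stand-in for model accuracy: highest (0.0) when
    # n_estimators = 300 and max_depth = 20.
    return -abs(params["n_estimators"] - 300) / 100 - abs(params["max_depth"] - 20) / 10

def random_model(rng):
    # A "model" is represented only by its hyperparameters.
    return {"n_estimators": rng.randrange(10, 1001, 10), "max_depth": rng.randint(2, 50)}

def mutate(parent, rng):
    # Offspring: a copy of the parent with slightly perturbed hyperparameters,
    # clamped to the allowed ranges.
    return {
        "n_estimators": min(1000, max(10, parent["n_estimators"] + rng.choice([-50, 0, 50]))),
        "max_depth": min(50, max(2, parent["max_depth"] + rng.choice([-2, 0, 2]))),
    }

def evolve(n=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [random_model(rng) for _ in range(n)]     # step 1: initial population
    for _ in range(generations):                           # step 4: repeat the cycle
        population.sort(key=fitness, reverse=True)         # step 2: score every model
        survivors = population[: n // 2]                   #         keep the best half
        offspring = [mutate(rng.choice(survivors), rng)    # step 3: refill to N models
                     for _ in range(n - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness)                    # step 5: best survivor

best = evolve()
print(best, fitness(best))
```

Because the survivors are carried over unchanged, the best score in the population can never get worse from one generation to the next.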

In [17]:
import numpy as np

from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt','log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]
# Create the random grid
param = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['entropy','gini']}
print(param)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}


In [18]:
!pip install tpot

## needs TensorFlow to be installed

Collecting tpot
  Downloading TPOT-0.11.5-py3-none-any.whl (82 kB)
Successfully installed deap-1.3.1 stopit-1.1.2 tpot-0.11.5 update-checker-0.17


In [19]:
from tpot import TPOTClassifier


tpot_classifier = TPOTClassifier(generations= 5, population_size= 24, offspring_size= 12,
                                 verbosity= 2, early_stop= 12,
                                 config_dict={'sklearn.ensemble.RandomForestClassifier': param}, 
                                 cv = 4, scoring = 'accuracy')
tpot_classifier.fit(X_train,y_train)


Generation 1 - Current best internal CV score: 0.7606209150326797
Generation 2 - Current best internal CV score: 0.7606209150326797
Generation 3 - Current best internal CV score: 0.7622336813513284
Generation 4 - Current best internal CV score: 0.7622336813513284
Generation 5 - Current best internal CV score: 0.7622336813513284
Best pipeline: RandomForestClassifier(RandomForestClassifier(input_matrix, criterion=gini, max_depth=780, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=2000), criterion=gini, max_depth=890, max_features=log2, min_samples_leaf=6, min_samples_split=5, n_estimators=1800)


In [20]:
accuracy = tpot_classifier.score(X_test, y_test)
print(accuracy)

0.8376623376623377
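
As a sanity check on the cost of the TPOT run above: according to the TPOT documentation, a run evaluates `population_size + generations × offspring_size` pipelines in total, and here each pipeline is scored with 4-fold cross-validation. For the settings used in this notebook:

```python
# Settings passed to TPOTClassifier above.
generations = 5
population_size = 24
offspring_size = 12

# Initial population plus the offspring created in every generation.
total_pipelines = population_size + generations * offspring_size
print(total_pipelines)  # prints: 84
```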


## Optimize hyperparameters of the model using Optuna

1. The random forest used above has hyperparameters such as n_estimators and max_depth, for which we can try different values to see whether the model accuracy improves. 
2. The objective function is written to accept a trial object. 
3. The trial object provides several methods for sampling hyperparameters. 
4. We create a study to run the hyperparameter optimization and finally read off the best hyperparameters.
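
The pattern in these steps can be sketched without Optuna. In the toy version below, `ToyTrial` and `run_study` are hypothetical stand-ins that sample at random and return made-up scores. They only mirror the shape of the workflow and are not Optuna's implementation.

```python
import random

class ToyTrial:
    """Hypothetical stand-in for a trial: samples values and records them."""

    def __init__(self, rng):
        self.rng = rng
        self.params = {}

    def suggest_int(self, name, low, high):
        self.params[name] = self.rng.randint(low, high)
        return self.params[name]

    def suggest_categorical(self, name, options):
        self.params[name] = self.rng.choice(options)
        return self.params[name]

def objective(trial):
    # Hyperparameters are requested inside the objective, so later requests
    # can depend on earlier ones (max_depth is only sampled for "forest").
    kind = trial.suggest_categorical("classifier", ["forest", "svc"])
    if kind == "forest":
        depth = trial.suggest_int("max_depth", 1, 30)
        return 0.9 - abs(depth - 12) / 100   # toy score, not a real accuracy
    return 0.7                               # toy score for the other branch

def run_study(objective, n_trials, seed=0):
    # Plays the role of a study: run the trials and keep the best one.
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        trial = ToyTrial(rng)
        results.append((objective(trial), trial.params))
    return max(results, key=lambda result: result[0])

best_value, best_params = run_study(objective, n_trials=50)
print(best_value, best_params)
```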

In [21]:
!pip install optuna


Collecting optuna

  Downloading optuna-1.5.0.tar.gz (200 kB)

In [22]:
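import optuna
import sklearn.ensemble
import sklearn.model_selection
import sklearn.svm

def objective(trial):
    # Let Optuna choose which classifier to evaluate in this trial
    classifier = trial.suggest_categorical('classifier', ['RandomForest', 'SVC'])
    
    if classifier == 'RandomForest':
        # n_estimators: integers from 200 to 2000 in steps of 10
        n_estimators = trial.suggest_int('n_estimators', 200, 2000, step=10)
        # max_depth: sampled on a log scale, then truncated to an integer
        max_depth = int(trial.suggest_float('max_depth', 10, 100, log=True))

        clf = sklearn.ensemble.RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth)
    else:
        # C: sampled on a log scale
        c = trial.suggest_float('svc_c', 1e-10, 1e10, log=True)
        
        clf = sklearn.svm.SVC(C=c, gamma='auto')

    # Mean 3-fold cross-validated accuracy on the training set
    return sklearn.model_selection.cross_val_score(
        clf,X_train,y_train, n_jobs=-1, cv=3).mean()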
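
Both `suggest_float(..., log=True)` calls sample on a logarithmic scale, which spreads trials evenly across orders of magnitude rather than across raw values. A minimal sketch of that idea using only the standard library follows; `sample_log_uniform` is a hypothetical helper, not an Optuna function.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    # Draw uniformly between log(low) and log(high), then map back with exp.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
draws = [sample_log_uniform(1e-10, 1e10, rng) for _ in range(10_000)]

# About half of the draws fall below 1, the midpoint of the range on the log
# scale. Plain uniform sampling on [1e-10, 1e10] would almost never go below 1.
share_below_one = sum(d < 1.0 for d in draws) / len(draws)
print(share_below_one)
```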

In [23]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

trial = study.best_trial

print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

[I 2020-07-26 11:06:44,666] Finished trial#0 with value: 0.640068547744301 with parameters: {'classifier': 'SVC', 'svc_c': 4.620567714545393}. Best is trial#0 with value: 0.640068547744301.
[I 2020-07-26 11:06:50,347] Finished trial#1 with value: 0.7507970667941973 with parameters: {'classifier': 'RandomForest', 'n_estimators': 1360, 'max_depth': 32.917356765853285}. Best is trial#1 with value: 0.7507970667941973.
[I 2020-07-26 11:06:52,653] Finished trial#2 with value: 0.7475450342738722 with parameters: {'classifier': 'RandomForest', 'n_estimators': 640, 'max_depth': 44.58695933445998}. Best is trial#1 with value: 0.7507970667941973.
[I 2020-07-26 11:07:00,108] Finished trial#3 with value: 0.7524310537223019 with parameters: {'classifier': 'RandomForest', 'n_estimators': 810, 'max_depth': 22.23830168249349}. Best is trial#3 with value: 0.7524310537223019.
[I 2020-07-26 11:07:18,748] Finished trial#4 with value: 0.7491790212019768 with parameters: {'classifier': 'RandomForest', 'n_est

[I 2020-07-26 11:10:28,839] Finished trial#36 with value: 0.7459269886816515 with parameters: {'classifier': 'RandomForest', 'n_estimators': 1490, 'max_depth': 35.1417149884023}. Best is trial#31 with value: 0.7605770763589988.
[I 2020-07-26 11:10:33,207] Finished trial#37 with value: 0.7508050374621393 with parameters: {'classifier': 'RandomForest', 'n_estimators': 1260, 'max_depth': 53.13414446095385}. Best is trial#31 with value: 0.7605770763589988.
[I 2020-07-26 11:10:38,911] Finished trial#38 with value: 0.7524310537223019 with parameters: {'classifier': 'RandomForest', 'n_estimators': 1690, 'max_depth': 24.64669624572094}. Best is trial#31 with value: 0.7605770763589988.
[I 2020-07-26 11:10:46,286] Finished trial#39 with value: 0.7508130081300813 with parameters: {'classifier': 'RandomForest', 'n_estimators': 1550, 'max_depth': 33.00937373850031}. Best is trial#31 with value: 0.7605770763589988.
[I 2020-07-26 11:10:53,760] Finished trial#40 with value: 0.7475450342738722 with par

[I 2020-07-26 11:14:10,919] Finished trial#72 with value: 0.7540490993145226 with parameters: {'classifier': 'RandomForest', 'n_estimators': 1550, 'max_depth': 24.96618441571684}. Best is trial#31 with value: 0.7605770763589988.
[I 2020-07-26 11:14:19,937] Finished trial#73 with value: 0.7524390243902439 with parameters: {'classifier': 'RandomForest', 'n_estimators': 1760, 'max_depth': 28.24708508320576}. Best is trial#31 with value: 0.7605770763589988.
[I 2020-07-26 11:14:28,389] Finished trial#74 with value: 0.7524310537223019 with parameters: {'classifier': 'RandomForest', 'n_estimators': 1600, 'max_depth': 20.191274910118423}. Best is trial#31 with value: 0.7605770763589988.
[I 2020-07-26 11:14:29,748] Finished trial#75 with value: 0.7507970667941973 with parameters: {'classifier': 'RandomForest', 'n_estimators': 250, 'max_depth': 29.315704474766573}. Best is trial#31 with value: 0.7605770763589988.
[I 2020-07-26 11:14:29,914] Finished trial#76 with value: 0.640068547744301 with pa

Accuracy: 0.7605770763589988
Best hyperparameters: {'classifier': 'RandomForest', 'n_estimators': 1630, 'max_depth': 21.80337012847624}


In [24]:
study.best_params

{'classifier': 'RandomForest',
 'n_estimators': 1630,
 'max_depth': 21.80337012847624}

In [25]:
study.best_trial

FrozenTrial(number=31, value=0.7605770763589988, datetime_start=datetime.datetime(2020, 7, 26, 11, 9, 53, 830675), datetime_complete=datetime.datetime(2020, 7, 26, 11, 9, 59, 429471), params={'classifier': 'RandomForest', 'n_estimators': 1630, 'max_depth': 21.80337012847624}, distributions={'classifier': CategoricalDistribution(choices=('RandomForest', 'SVC')), 'n_estimators': IntUniformDistribution(high=2000, low=200, step=10), 'max_depth': LogUniformDistribution(high=100, low=10)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=31, state=TrialState.COMPLETE)

In [26]:
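## Note: these values differ from the best parameters reported above
## (n_estimators=1630, max_depth of about 21.8); they appear to come from a different run
rf=RandomForestClassifier(n_estimators=330,max_depth=30)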
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=30, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=330,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [29]:
!pip install plotly

Collecting plotly
  Downloading plotly-4.9.0-py2.py3-none-any.whl (12.9 MB)
Successfully installed plotly-4.9.0 retrying-1.3.3


In [31]:
## needs plotly to be installed
optuna.visualization.plot_optimization_history(study)

ImportError: Plotly is not available. Please install plotly to use this feature. Plotly can be installed by executing `$ pip install plotly`. For further information, please refer to the installation guide of plotly. (The actual import error is as follows: No module named 'plotly')

In [33]:
y_pred=rf.predict(X_test)

print('---------------------- confusion_matrix -------------------------')
print(confusion_matrix(y_test,y_pred))

print('---------------------- accuracy_score -------------------------')
print(accuracy_score(y_test,y_pred))

print('---------------------- classification_report -------------------------')
print(classification_report(y_test,y_pred))


---------------------- confusion_matrix -------------------------
[[92 15]
 [15 32]]
---------------------- accuracy_score -------------------------
0.8051948051948052
---------------------- classification_report -------------------------
              precision    recall  f1-score   support

           0       0.86      0.86      0.86       107
           1       0.68      0.68      0.68        47

    accuracy                           0.81       154
   macro avg       0.77      0.77      0.77       154
weighted avg       0.81      0.81      0.81       154

