# <center>Hyperparameter Tuning</center><br>
<img src = "https://miro.medium.com/max/1005/1*qv2Su1gKmUJxpfG8lt2Jmw.png"></img><br>
#### <div align='right'>Made by: **Asad Mahmood</div>**

In [64]:
import pandas as pd
import numpy as np

from sklearn import ensemble # Access RandomForestClassifier and other ensemble classifers
from sklearn import metrics  # Access different metrics like confusion matrices, accuracy, etc.
from sklearn import model_selection 
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import decomposition
from sklearn import pipeline


from functools import partial
from skopt import space
from skopt import gp_minimize

from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope

import optuna

<a id="toc"></a>

<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Table of Contents</center></h2>

1. [Objective](#Obj)
2. [Data Preprocessing and Perparation](#Data)
2. [Different ways to do Hyper-parameter tuning](#Hyp)
    1. Grid Search CV
    2. Random Search
    3. Bayseian Optimization with Gaussian Process
    4. HyperOpt

<a name="Obj"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Objective</center></h3>

This is a template notebook that shows how to do hyper-parameter tuning using different methods. It is using an example dataset provided on kaggle titled as [Mobile Price Classification](#https://www.kaggle.com/iabhishekofficial/mobile-price-classification).

This data set is about a guy named Bob who has started his own mobile company. He wants to give tough fight to big companies like Apple,Samsung etc. I like this dataset because it has evenly distributed labels and in this scenario will work perfectly.

<a name="Data"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Data Preprocessing and Preparation</center></h3>

#### Reading in Data

In [2]:
df = pd.read_csv('mobile_price_data.csv')

#### Test Train Split

In [3]:
X = df.drop('price_range', axis = 1).values
y = df.price_range.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Grid Search</center></h4>

In [4]:
# Using Random Forest as example 
clf = ensemble.RandomForestClassifier(n_jobs = -1) #This tells my pc to run all cores on the problem
param_grid = {
    #Put the parameters here
    #### Example 1: "n_estimators": [i for i in range(0,50)]#Change 0 and 50 here to your requirments
    "n_estimators": [100,200,300,400],
    "max_depth": [1, 3, 5, 7],
    "criterion": ["gini", "entropy"]
}

# Implementing GridSearch
model = model_selection.GridSearchCV(
    estimator = clf,
    param_grid = param_grid,
    scoring = "accuracy",
    verbose = 10,
    n_jobs = 1,
    cv = 5,
)
model.fit(X_train, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV] criterion=gini, max_depth=1, n_estimators=100 ...................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  criterion=gini, max_depth=1, n_estimators=100, score=0.647, total=   3.8s
[CV] criterion=gini, max_depth=1, n_estimators=100 ...................
[CV]  criterion=gini, max_depth=1, n_estimators=100, score=0.566, total=   0.2s
[CV] criterion=gini, max_depth=1, n_estimators=100 ...................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    3.9s remaining:    0.0s


[CV]  criterion=gini, max_depth=1, n_estimators=100, score=0.556, total=   0.1s
[CV] criterion=gini, max_depth=1, n_estimators=100 ...................
[CV]  criterion=gini, max_depth=1, n_estimators=100, score=0.553, total=   0.2s
[CV] criterion=gini, max_depth=1, n_estimators=100 ...................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    4.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    4.2s remaining:    0.0s


[CV]  criterion=gini, max_depth=1, n_estimators=100, score=0.622, total=   0.2s
[CV] criterion=gini, max_depth=1, n_estimators=200 ...................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    4.4s remaining:    0.0s


[CV]  criterion=gini, max_depth=1, n_estimators=200, score=0.572, total=   0.3s
[CV] criterion=gini, max_depth=1, n_estimators=200 ...................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    4.8s remaining:    0.0s


[CV]  criterion=gini, max_depth=1, n_estimators=200, score=0.544, total=   0.3s
[CV] criterion=gini, max_depth=1, n_estimators=200 ...................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    5.1s remaining:    0.0s


[CV]  criterion=gini, max_depth=1, n_estimators=200, score=0.616, total=   0.3s
[CV] criterion=gini, max_depth=1, n_estimators=200 ...................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    5.4s remaining:    0.0s


[CV]  criterion=gini, max_depth=1, n_estimators=200, score=0.603, total=   0.3s
[CV] criterion=gini, max_depth=1, n_estimators=200 ...................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    5.6s remaining:    0.0s


[CV]  criterion=gini, max_depth=1, n_estimators=200, score=0.628, total=   0.3s
[CV] criterion=gini, max_depth=1, n_estimators=300 ...................
[CV]  criterion=gini, max_depth=1, n_estimators=300, score=0.625, total=   0.4s
[CV] criterion=gini, max_depth=1, n_estimators=300 ...................
[CV]  criterion=gini, max_depth=1, n_estimators=300, score=0.578, total=   0.4s
[CV] criterion=gini, max_depth=1, n_estimators=300 ...................
[CV]  criterion=gini, max_depth=1, n_estimators=300, score=0.591, total=   0.4s
[CV] criterion=gini, max_depth=1, n_estimators=300 ...................
[CV]  criterion=gini, max_depth=1, n_estimators=300, score=0.588, total=   0.4s
[CV] criterion=gini, max_depth=1, n_estimators=300 ...................
[CV]  criterion=gini, max_depth=1, n_estimators=300, score=0.619, total=   0.4s
[CV] criterion=gini, max_depth=1, n_estimators=400 ...................
[CV]  criterion=gini, max_depth=1, n_estimators=400, score=0.597, total=   0.5s
[CV] criterion

[CV]  criterion=gini, max_depth=7, n_estimators=200, score=0.850, total=   0.3s
[CV] criterion=gini, max_depth=7, n_estimators=200 ...................
[CV]  criterion=gini, max_depth=7, n_estimators=200, score=0.853, total=   0.3s
[CV] criterion=gini, max_depth=7, n_estimators=200 ...................
[CV]  criterion=gini, max_depth=7, n_estimators=200, score=0.863, total=   0.3s
[CV] criterion=gini, max_depth=7, n_estimators=200 ...................
[CV]  criterion=gini, max_depth=7, n_estimators=200, score=0.856, total=   0.3s
[CV] criterion=gini, max_depth=7, n_estimators=200 ...................
[CV]  criterion=gini, max_depth=7, n_estimators=200, score=0.838, total=   0.3s
[CV] criterion=gini, max_depth=7, n_estimators=300 ...................
[CV]  criterion=gini, max_depth=7, n_estimators=300, score=0.853, total=   0.5s
[CV] criterion=gini, max_depth=7, n_estimators=300 ...................
[CV]  criterion=gini, max_depth=7, n_estimators=300, score=0.850, total=   0.5s
[CV] criterion

[CV]  criterion=entropy, max_depth=3, n_estimators=400, score=0.781, total=   0.5s
[CV] criterion=entropy, max_depth=5, n_estimators=100 ................
[CV]  criterion=entropy, max_depth=5, n_estimators=100, score=0.841, total=   0.2s
[CV] criterion=entropy, max_depth=5, n_estimators=100 ................
[CV]  criterion=entropy, max_depth=5, n_estimators=100, score=0.831, total=   0.2s
[CV] criterion=entropy, max_depth=5, n_estimators=100 ................
[CV]  criterion=entropy, max_depth=5, n_estimators=100, score=0.853, total=   0.2s
[CV] criterion=entropy, max_depth=5, n_estimators=100 ................
[CV]  criterion=entropy, max_depth=5, n_estimators=100, score=0.834, total=   0.2s
[CV] criterion=entropy, max_depth=5, n_estimators=100 ................
[CV]  criterion=entropy, max_depth=5, n_estimators=100, score=0.834, total=   0.2s
[CV] criterion=entropy, max_depth=5, n_estimators=200 ................
[CV]  criterion=entropy, max_depth=5, n_estimators=200, score=0.859, total= 

[Parallel(n_jobs=1)]: Done 160 out of 160 | elapsed:  1.0min finished


GridSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=-1), n_jobs=1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [1, 3, 5, 7],
                         'n_estimators': [100, 200, 300, 400]},
             scoring='accuracy', verbose=10)

In [5]:
# Printing best accuracy score because labels are equally divided and the estimators
print(model.best_score_)
print(model.best_estimator_.get_params())

0.8706250000000001
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 7, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


In [6]:
#Fine tuned model
clf = ensemble.RandomForestClassifier(**model.best_params_)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

#Printing model accuracy
gridAcc= metrics.accuracy_score(y_test,y_pred)
gridAcc

0.86

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Random Search</center></h4>

In [7]:
# Using Random Forest as example 
clf = ensemble.RandomForestClassifier(n_jobs = -1) #This tells my pc to run all cores on the problem
param_grid = {
    #Put the parameters here
    #### Example 1: "n_estimators": [i for i in range(0,50)]#Change 0 and 50 here to your requirments
    "n_estimators": np.arange(100, 500, 100),
    "max_depth": np.arange(1, 20),
    "criterion": ["gini", "entropy"]
}

# Implementing GridSearch
model = model_selection.RandomizedSearchCV(
    estimator = clf,
    param_distributions = param_grid,
    scoring = "accuracy",
    n_iter = 10,
    verbose = 10,
    n_jobs = 1,
    cv = 5,
)
model.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] n_estimators=400, max_depth=16, criterion=gini ..................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  n_estimators=400, max_depth=16, criterion=gini, score=0.878, total=   0.5s
[CV] n_estimators=400, max_depth=16, criterion=gini ..................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s


[CV]  n_estimators=400, max_depth=16, criterion=gini, score=0.872, total=   0.5s
[CV] n_estimators=400, max_depth=16, criterion=gini ..................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s


[CV]  n_estimators=400, max_depth=16, criterion=gini, score=0.887, total=   0.6s
[CV] n_estimators=400, max_depth=16, criterion=gini ..................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.6s remaining:    0.0s


[CV]  n_estimators=400, max_depth=16, criterion=gini, score=0.878, total=   0.6s
[CV] n_estimators=400, max_depth=16, criterion=gini ..................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    2.1s remaining:    0.0s


[CV]  n_estimators=400, max_depth=16, criterion=gini, score=0.894, total=   0.6s
[CV] n_estimators=200, max_depth=9, criterion=gini ...................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.7s remaining:    0.0s


[CV]  n_estimators=200, max_depth=9, criterion=gini, score=0.850, total=   0.3s
[CV] n_estimators=200, max_depth=9, criterion=gini ...................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    3.0s remaining:    0.0s


[CV]  n_estimators=200, max_depth=9, criterion=gini, score=0.859, total=   0.3s
[CV] n_estimators=200, max_depth=9, criterion=gini ...................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    3.3s remaining:    0.0s


[CV]  n_estimators=200, max_depth=9, criterion=gini, score=0.887, total=   0.3s
[CV] n_estimators=200, max_depth=9, criterion=gini ...................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    3.6s remaining:    0.0s


[CV]  n_estimators=200, max_depth=9, criterion=gini, score=0.850, total=   0.3s
[CV] n_estimators=200, max_depth=9, criterion=gini ...................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    3.9s remaining:    0.0s


[CV]  n_estimators=200, max_depth=9, criterion=gini, score=0.878, total=   0.3s
[CV] n_estimators=200, max_depth=5, criterion=entropy ................
[CV]  n_estimators=200, max_depth=5, criterion=entropy, score=0.856, total=   0.3s
[CV] n_estimators=200, max_depth=5, criterion=entropy ................
[CV]  n_estimators=200, max_depth=5, criterion=entropy, score=0.828, total=   0.3s
[CV] n_estimators=200, max_depth=5, criterion=entropy ................
[CV]  n_estimators=200, max_depth=5, criterion=entropy, score=0.856, total=   0.3s
[CV] n_estimators=200, max_depth=5, criterion=entropy ................
[CV]  n_estimators=200, max_depth=5, criterion=entropy, score=0.838, total=   0.3s
[CV] n_estimators=200, max_depth=5, criterion=entropy ................
[CV]  n_estimators=200, max_depth=5, criterion=entropy, score=0.831, total=   0.3s
[CV] n_estimators=400, max_depth=1, criterion=gini ...................
[CV]  n_estimators=400, max_depth=1, criterion=gini, score=0.588, total=   0.5s

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:   21.7s finished


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=-1), n_jobs=1,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19]),
                                        'n_estimators': array([100, 200, 300, 400])},
                   scoring='accuracy', verbose=10)

In [8]:
# Printing best accuracy score because labels are equally divided and the estimators
print(model.best_score_)
print(model.best_estimator_.get_params())

0.884375
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 18, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 200, 'n_jobs': -1, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


In [9]:
#Fine tuned model
clf = ensemble.RandomForestClassifier(**model.best_params_)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

#Printing model accuracy
randAcc= metrics.accuracy_score(y_test,y_pred)
randAcc

0.875

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Grid/Random Search with Pipelines</center></h4>

Wedon't really need to do PCA or scaling on this data but for the sake of example we are scaling the data, doing PCA and a creating a random forest classifier. 

In [10]:
scl = preprocessing.StandardScaler()
pca = decomposition.PCA()
rf = ensemble.RandomForestClassifier(n_jobs=-1)

**Steps in the created the pipeline:**

1. We are **scaling.**
2. We are doing **dimenstionality reduction** using PCA
3. Defing the **classification model** as a random forest

In [18]:
classfier = pipeline.Pipeline([
    ("scaling", scl),
    ("pca", pca),
    ("rf", rf)
])

param_grid = {
    #Put the parameters here--- In this case use two underscores 
    "pca__n_components": np.arange(5, 10),
    "rf__n_estimators": np.arange(100, 500, 100),
    "rf__max_depth": np.arange(1, 20),
    "rf__criterion": ["gini", "entropy"]
}

# Implementing GridSearch
model = model_selection.RandomizedSearchCV( #Can be changed to GridSearch
    estimator = classfier,
    param_distributions = param_grid,
    scoring = "accuracy",
    n_iter = 10,
    verbose = 10,
    n_jobs = 1,
    cv = 5,
)
model.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] rf__n_estimators=300, rf__max_depth=12, rf__criterion=entropy, pca__n_components=8 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  rf__n_estimators=300, rf__max_depth=12, rf__criterion=entropy, pca__n_components=8, score=0.472, total=   4.3s
[CV] rf__n_estimators=300, rf__max_depth=12, rf__criterion=entropy, pca__n_components=8 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.2s remaining:    0.0s


[CV]  rf__n_estimators=300, rf__max_depth=12, rf__criterion=entropy, pca__n_components=8, score=0.372, total=   0.6s
[CV] rf__n_estimators=300, rf__max_depth=12, rf__criterion=entropy, pca__n_components=8 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.8s remaining:    0.0s


[CV]  rf__n_estimators=300, rf__max_depth=12, rf__criterion=entropy, pca__n_components=8, score=0.425, total=   0.7s
[CV] rf__n_estimators=300, rf__max_depth=12, rf__criterion=entropy, pca__n_components=8 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    5.6s remaining:    0.0s


[CV]  rf__n_estimators=300, rf__max_depth=12, rf__criterion=entropy, pca__n_components=8, score=0.444, total=   0.7s
[CV] rf__n_estimators=300, rf__max_depth=12, rf__criterion=entropy, pca__n_components=8 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    6.2s remaining:    0.0s


[CV]  rf__n_estimators=300, rf__max_depth=12, rf__criterion=entropy, pca__n_components=8, score=0.362, total=   0.6s
[CV] rf__n_estimators=100, rf__max_depth=17, rf__criterion=entropy, pca__n_components=6 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    6.9s remaining:    0.0s


[CV]  rf__n_estimators=100, rf__max_depth=17, rf__criterion=entropy, pca__n_components=6, score=0.356, total=   0.3s
[CV] rf__n_estimators=100, rf__max_depth=17, rf__criterion=entropy, pca__n_components=6 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    7.1s remaining:    0.0s


[CV]  rf__n_estimators=100, rf__max_depth=17, rf__criterion=entropy, pca__n_components=6, score=0.322, total=   0.3s
[CV] rf__n_estimators=100, rf__max_depth=17, rf__criterion=entropy, pca__n_components=6 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    7.4s remaining:    0.0s


[CV]  rf__n_estimators=100, rf__max_depth=17, rf__criterion=entropy, pca__n_components=6, score=0.378, total=   0.3s
[CV] rf__n_estimators=100, rf__max_depth=17, rf__criterion=entropy, pca__n_components=6 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    7.7s remaining:    0.0s


[CV]  rf__n_estimators=100, rf__max_depth=17, rf__criterion=entropy, pca__n_components=6, score=0.331, total=   0.3s
[CV] rf__n_estimators=100, rf__max_depth=17, rf__criterion=entropy, pca__n_components=6 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    8.0s remaining:    0.0s


[CV]  rf__n_estimators=100, rf__max_depth=17, rf__criterion=entropy, pca__n_components=6, score=0.306, total=   0.3s
[CV] rf__n_estimators=300, rf__max_depth=4, rf__criterion=gini, pca__n_components=9 
[CV]  rf__n_estimators=300, rf__max_depth=4, rf__criterion=gini, pca__n_components=9, score=0.506, total=   0.5s
[CV] rf__n_estimators=300, rf__max_depth=4, rf__criterion=gini, pca__n_components=9 
[CV]  rf__n_estimators=300, rf__max_depth=4, rf__criterion=gini, pca__n_components=9, score=0.394, total=   0.5s
[CV] rf__n_estimators=300, rf__max_depth=4, rf__criterion=gini, pca__n_components=9 
[CV]  rf__n_estimators=300, rf__max_depth=4, rf__criterion=gini, pca__n_components=9, score=0.453, total=   0.5s
[CV] rf__n_estimators=300, rf__max_depth=4, rf__criterion=gini, pca__n_components=9 
[CV]  rf__n_estimators=300, rf__max_depth=4, rf__criterion=gini, pca__n_components=9, score=0.450, total=   0.5s
[CV] rf__n_estimators=300, rf__max_depth=4, rf__criterion=gini, pca__n_components=9 
[CV]  

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:   26.8s finished


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('scaling', StandardScaler()),
                                             ('pca', PCA()),
                                             ('rf',
                                              RandomForestClassifier(n_jobs=-1))]),
                   n_jobs=1,
                   param_distributions={'pca__n_components': array([5, 6, 7, 8, 9]),
                                        'rf__criterion': ['gini', 'entropy'],
                                        'rf__max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19]),
                                        'rf__n_estimators': array([100, 200, 300, 400])},
                   scoring='accuracy', verbose=10)

In [19]:
# Printing best accuracy score because labels are equally divided and the estimators
print(model.best_score_)
print(model.best_estimator_.get_params())

0.4462499999999999
{'memory': None, 'steps': [('scaling', StandardScaler()), ('pca', PCA(n_components=9)), ('rf', RandomForestClassifier(max_depth=4, n_estimators=300, n_jobs=-1))], 'verbose': False, 'scaling': StandardScaler(), 'pca': PCA(n_components=9), 'rf': RandomForestClassifier(max_depth=4, n_estimators=300, n_jobs=-1), 'scaling__copy': True, 'scaling__with_mean': True, 'scaling__with_std': True, 'pca__copy': True, 'pca__iterated_power': 'auto', 'pca__n_components': 9, 'pca__random_state': None, 'pca__svd_solver': 'auto', 'pca__tol': 0.0, 'pca__whiten': False, 'rf__bootstrap': True, 'rf__ccp_alpha': 0.0, 'rf__class_weight': None, 'rf__criterion': 'gini', 'rf__max_depth': 4, 'rf__max_features': 'auto', 'rf__max_leaf_nodes': None, 'rf__max_samples': None, 'rf__min_impurity_decrease': 0.0, 'rf__min_impurity_split': None, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 2, 'rf__min_weight_fraction_leaf': 0.0, 'rf__n_estimators': 300, 'rf__n_jobs': -1, 'rf__oob_score': False, 'rf_

In [20]:
#Fine tuned model
clf = ensemble.RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

#Printing model accuracy
pipelineAcc= metrics.accuracy_score(y_test,y_pred)
pipelineAcc

0.8825

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Bayseian Optimization with Gaussian Process</center></h4>

In [40]:
def optimize(params, param_names, X, y):
    params = dict(zip(param_names, params))
    model = ensemble.RandomForestClassifier(**params)
    kf = model_selection.StratifiedKFold(n_splits = 5)
    accuracies = []
    for idx in kf.split(X=X, y=y):
        train_idx, test_idx = idx[0], idx[1]
        xtrain = X[train_idx]
        ytrain = y[train_idx]
        
        xtest = X[test_idx]
        ytest = y[test_idx]
        
        model.fit(xtrain, ytrain)
        preds = model.predict(xtest)
        fold_acc = metrics.accuracy_score(ytest, preds)
        accuracies.append(fold_acc)
        
        return -1.0 * np.mean(accuracies)

In [45]:
param_space = [
    space.Integer(3, 15, name = "max_depth"),
    space.Integer(100, 600, name = "n_estimators"),
    space.Categorical(["gini", "entropy"], name = "criterion"),
    space.Real(0.01, 1, prior = "uniform", name = "max_features")
]

param_names = [
    "max_depth",
    "n_estimators",
    "criterion",
    "max_features"
]

optimization_function = partial(
    optimize,
    param_names=param_names,
    X=X,
    y=y
)

result = gp_minimize(
    optimization_function,
    dimensions = param_space,
    n_calls = 15,
    n_random_starts = 10,
    verbose = 10,
)

print(
    dict(zip(param_names, result.X))
)

Iteration No: 1 started. Evaluating function at random point.
Iteration No: 1 ended. Evaluation done at random point.
Time taken: 5.1694
Function value obtained: -0.9200
Current minimum: -0.9200
Iteration No: 2 started. Evaluating function at random point.
Iteration No: 2 ended. Evaluation done at random point.
Time taken: 1.3031
Function value obtained: -0.8975
Current minimum: -0.9200
Iteration No: 3 started. Evaluating function at random point.
Iteration No: 3 ended. Evaluation done at random point.
Time taken: 2.4082
Function value obtained: -0.8975
Current minimum: -0.9200
Iteration No: 4 started. Evaluating function at random point.
Iteration No: 4 ended. Evaluation done at random point.
Time taken: 2.5582
Function value obtained: -0.9100
Current minimum: -0.9200
Iteration No: 5 started. Evaluating function at random point.
Iteration No: 5 ended. Evaluation done at random point.
Time taken: 0.7091
Function value obtained: -0.8950
Current minimum: -0.9200
Iteration No: 6 started. 

AttributeError: X

<div class="list-group" id="list-tab" role="tablist">
<h4 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gray; border:0' role="tab" aria-controls="home"><center>Optuna</center></h4>

In [68]:
def optimize(trial, X, y):
    
    criterion = trial.suggest_categorical("criterion", ["gini", "entropy"])
    n_estimators = trial.suggest_int("n_estimators", 100, 1500)
    max_depth = trial.suggest_int("max_depth", 3, 15)
    max_features = trial.suggest_uniform("max_features", 0.01, 1.0)
    model = ensemble.RandomForestClassifier(
        n_estimators = n_estimators,
        max_depth = max_depth,
        max_features = max_features,
        criterion = criterion,
    )
    kf = model_selection.StratifiedKFold(n_splits = 5)
    accuracies = []
    for idx in kf.split(X=X, y=y):
        train_idx, test_idx = idx[0], idx[1]
        xtrain = X[train_idx]
        ytrain = y[train_idx]
        
        xtest = X[test_idx]
        ytest = y[test_idx]
        
        model.fit(xtrain, ytrain)
        preds = model.predict(xtest)
        fold_acc = metrics.accuracy_score(ytest, preds)
        accuracies.append(fold_acc)
        
        return -1.0 * np.mean(accuracies)

In [69]:
optimization_function = partial(optimize, X=X, y=y)
study = optuna.create_study(direction = "minimize")
study.optimize(optimization_function, n_trials = 15)

[32m[I 2021-02-10 22:34:47,222][0m A new study created in memory with name: no-name-42c8b070-eb0e-4c63-9c99-25d23e240310[0m
[32m[I 2021-02-10 22:34:49,943][0m Trial 0 finished with value: -0.915 and parameters: {'criterion': 'entropy', 'n_estimators': 406, 'max_depth': 15, 'max_features': 0.6238888037615989}. Best is trial 0 with value: -0.915.[0m
[32m[I 2021-02-10 22:34:54,546][0m Trial 1 finished with value: -0.8975 and parameters: {'criterion': 'gini', 'n_estimators': 1156, 'max_depth': 13, 'max_features': 0.35286140785954856}. Best is trial 0 with value: -0.915.[0m
[32m[I 2021-02-10 22:34:56,789][0m Trial 2 finished with value: -0.8475 and parameters: {'criterion': 'entropy', 'n_estimators': 600, 'max_depth': 14, 'max_features': 0.11263493325609157}. Best is trial 0 with value: -0.915.[0m
[32m[I 2021-02-10 22:34:59,364][0m Trial 3 finished with value: -0.825 and parameters: {'criterion': 'gini', 'n_estimators': 804, 'max_depth': 10, 'max_features': 0.1133684072823349}