## All Techniques Of Hyperparameter Optimization

1. GridSearchCV
2. RandomizedSearchCV
3. Bayesian Optimization: Automate Hyperparameter Tuning (Hyperopt) or **using Gaussian processes**
4. Sequential Model-Based Optimization (tuning a scikit-learn estimator with skopt)
5. Optuna: Automate Hyperparameter Tuning
6. Genetic Algorithms (TPOT Classifier)

#### References

- https://github.com/fmfn/BayesianOptimization
- https://github.com/hyperopt/hyperopt
- https://www.jeremyjordan.me/hyperparameter-tuning/
- https://optuna.org/
- https://towardsdatascience.com/hyperparameters-optimization-526348bb8e2d (by Pier Paolo Ippolito)
- https://scikit-optimize.github.io/stable/auto_examples/hyperparameter-optimization.html

In [10]:
import warnings
warnings.filterwarnings('ignore')

In [11]:
import pandas as pd
df=pd.read_csv('diabetes.csv')
df.head()

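| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |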


In [12]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [13]:
import numpy as np

# A value of 0 in these columns is treated as a missing reading and replaced
# with the column median (computed over all rows, zeros included).
df['Glucose']=np.where(df['Glucose']==0,df['Glucose'].median(),df['Glucose'])
df['Insulin']=np.where(df['Insulin']==0,df['Insulin'].median(),df['Insulin'])
df['SkinThickness']=np.where(df['SkinThickness']==0,df['SkinThickness'].median(),df['SkinThickness'])
df['BMI']=np.where(df['BMI']==0,df['BMI'].median(),df['BMI'])
df['BloodPressure']=np.where(df['BloodPressure']==0,df['BloodPressure'].median(),df['BloodPressure'])
df.head()

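| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 30.5 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 30.5 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 23.0 | 30.5 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |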


#### Independent And Dependent features

In [14]:
X=df.drop('Outcome',axis=1)
y=df['Outcome']

In [15]:
X.head()

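| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 30.5 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 30.5 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183.0 | 64.0 | 23.0 | 30.5 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 |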


In [16]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

#### Train Test Split

In [17]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=33)

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier=RandomForestClassifier(n_estimators=10).fit(X_train,y_train)
prediction=rf_classifier.predict(X_test)

In [None]:
y.value_counts()

0    500
1    268
Name: Outcome, dtype: int64

* The dataset is not severely imbalanced: 268 positive to 500 negative cases is roughly a 1:2 ratio, which is fine.

In [None]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
print(confusion_matrix(y_test,prediction))
print(accuracy_score(y_test,prediction))
print(classification_report(y_test,prediction))

[[82 17]
 [27 28]]
0.7142857142857143
              precision    recall  f1-score   support

           0       0.75      0.83      0.79        99
           1       0.62      0.51      0.56        55

    accuracy                           0.71       154
   macro avg       0.69      0.67      0.67       154
weighted avg       0.71      0.71      0.71       154
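
The figures in the report follow directly from the confusion matrix. As a small sketch in plain Python (the variable names are illustrative, and it assumes scikit-learn's layout of true labels in rows and predicted labels in columns), the accuracy and the class-1 precision, recall, and F1 can be recomputed by hand:

```python
# Confusion matrix printed by the baseline model above:
# rows are true labels, columns are predicted labels.
cm = [[82, 17],
      [27, 28]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision_1 = tp / (tp + fp)          # of the predicted positives, how many are real
recall_1 = tp / (tp + fn)             # of the real positives, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The rounded values match the 0.71 accuracy and the 0.62 / 0.51 / 0.56 row for class 1 in the report.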



The main parameters used by a Random Forest Classifier are:

- **criterion** = the function used to evaluate the quality of a split.
- **max_depth** = maximum number of levels allowed in each tree.
- **max_features** = maximum number of features considered when splitting a node.
- **min_samples_leaf** = minimum number of samples which can be stored in a tree leaf.
- **min_samples_split** = minimum number of samples necessary in a node to cause node splitting.
- **n_estimators** = number of trees in the ensemble.

### Manual Hyperparameter Tuning

In [None]:
model=RandomForestClassifier(n_estimators=300,criterion='entropy',
                             max_features='sqrt',min_samples_leaf=10,random_state=100).fit(X_train,y_train)
predictions=model.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(accuracy_score(y_test,predictions))
print(classification_report(y_test,predictions))

[[87 12]
 [27 28]]
0.7467532467532467
              precision    recall  f1-score   support

           0       0.76      0.88      0.82        99
           1       0.70      0.51      0.59        55

    accuracy                           0.75       154
   macro avg       0.73      0.69      0.70       154
weighted avg       0.74      0.75      0.74       154



### 1. RandomizedSearchCV

In [None]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

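# Number of trees in random forest.
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split.
# Note: 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3;
# for classifiers it behaved like 'sqrt'.
max_features = ['auto', 'sqrt', 'log2']

# Maximum number of levels in tree.
max_depth = [int(x) for x in np.linspace(10, 1000,10)]

# Minimum number of samples required to split a node.
# Note: scikit-learn requires an int >= 2 or a float in (0, 1], so candidates
# that draw the value 1 fail to fit and receive a NaN score.
min_samples_split = [1, 3, 4, 5, 7, 9]

# Minimum number of samples required at each leaf node.
min_samples_leaf = [1, 2, 4, 6, 8]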

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'criterion':['entropy','gini']}

print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [1, 3, 4, 5, 7, 9], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}


In [None]:
rf=RandomForestClassifier()
rf_randomcv=RandomizedSearchCV(estimator=rf,param_distributions=random_grid,n_iter=100,cv=3,verbose=2,
                               random_state=100,n_jobs=-1)
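
To give a feel for what `n_iter=100` means, here is a rough sketch in plain Python of how a randomized search draws candidates from a grid of lists. It is not scikit-learn's implementation: the sampling loop and the names `total_combinations` and `candidates` are illustrative, and unlike scikit-learn (which samples without replacement when every parameter is a list) this simple version may repeat a combination.

```python
import math
import random

# Same candidate values as random_grid above.
random_grid = {
    'n_estimators': list(range(200, 2001, 200)),
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': list(range(10, 1001, 110)),
    'min_samples_split': [1, 3, 4, 5, 7, 9],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'criterion': ['entropy', 'gini'],
}

# Size of the full grid: product of the number of options per parameter.
total_combinations = math.prod(len(values) for values in random_grid.values())

# Draw a handful of random candidates, picking one value per parameter.
rng = random.Random(100)
candidates = [{name: rng.choice(values) for name, values in random_grid.items()}
              for _ in range(5)]

print(total_combinations)  # 18000
for candidate in candidates:
    print(candidate)
```

With 18,000 possible combinations, 100 iterations evaluate well under 1% of the grid, which is why the randomized search is a cheap first pass.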

#### Fit the randomized model

In [None]:
rf_randomcv.fit(X_train,y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    8.3s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   42.2s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  1.5min finished


RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'criterion': ['entropy', 'gini'],
                                        'max_depth': [10, 120, 230, 340, 450,
                                                      560, 670, 780, 890,
                                                      1000],
                                        'max_features': ['auto', 'sqrt',
                                                         'log2'],
                                        'min_samples_leaf': [1, 2, 4, 6, 8],
                                        'min_samples_split': [1, 3, 4, 5, 7, 9],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=100, verbose=2)

In [None]:
rf_randomcv.best_params_

{'n_estimators': 200,
 'min_samples_split': 3,
 'min_samples_leaf': 8,
 'max_features': 'auto',
 'max_depth': 670,
 'criterion': 'gini'}

**Model with the Best Parameters**

In [None]:
rf_randomcv.best_estimator_

RandomForestClassifier(max_depth=670, min_samples_leaf=8, min_samples_split=3,
                       n_estimators=200)

In [None]:
best_random_grid=rf_randomcv.best_estimator_

In [None]:
from sklearn.metrics import accuracy_score
y_pred=best_random_grid.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print("Accuracy Score {}".format(accuracy_score(y_test,y_pred)))
print("Classification report: {}".format(classification_report(y_test,y_pred)))

[[86 13]
 [28 27]]
Accuracy Score 0.7337662337662337
Classification report:               precision    recall  f1-score   support

           0       0.75      0.87      0.81        99
           1       0.68      0.49      0.57        55

    accuracy                           0.73       154
   macro avg       0.71      0.68      0.69       154
weighted avg       0.73      0.73      0.72       154



### 2. GridSearchCV

In [None]:
rf_randomcv.best_params_

{'n_estimators': 200,
 'min_samples_split': 3,
 'min_samples_leaf': 8,
 'max_features': 'auto',
 'max_depth': 670,
 'criterion': 'gini'}

In [None]:
from sklearn.model_selection import GridSearchCV

# Build a narrow grid around the best parameters found by the randomized search.
best_rs = rf_randomcv.best_params_

param_grid = {
    'criterion': [best_rs['criterion']],
    'max_depth': [best_rs['max_depth']],
    'max_features': [best_rs['max_features']],
    'min_samples_leaf': [best_rs['min_samples_leaf'] + d for d in (0, 2, 4)],
    # Note: with a best value of 3 this range starts at 1, which scikit-learn rejects
    # (it requires an int >= 2 or a float in (0, 1]); those candidates fail to fit.
    'min_samples_split': [best_rs['min_samples_split'] + d for d in (-2, -1, 0, 1, 2)],
    'n_estimators': [best_rs['n_estimators'] + d for d in (-100, 0, 100, 200, 300)],
}

print(param_grid)

{'criterion': ['gini'], 'max_depth': [670], 'max_features': ['auto'], 'min_samples_leaf': [8, 10, 12], 'min_samples_split': [1, 2, 3, 4, 5], 'n_estimators': [100, 200, 300, 400, 500]}


#### Fit the grid_search to the data

* **GridSearchCV has no `n_iter` parameter**; it evaluates every combination in the grid.

* The number of candidates is the product of the lengths of the parameter lists: 1 × 1 × 1 × 3 × 5 × 5 = **75 candidates**.

* With `cv=10`, each candidate is fitted 10 times, for 75 × 10 = 750 fits in total.
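
The count can be checked with a few lines of plain Python. This is only a sketch: the grid values are copied from the printed `param_grid` above, and `n_candidates`, `all_candidates`, and `n_fits` are illustrative names.

```python
import itertools
import math

# Grid values copied from the printed param_grid above.
param_grid = {'criterion': ['gini'], 'max_depth': [670], 'max_features': ['auto'],
              'min_samples_leaf': [8, 10, 12],
              'min_samples_split': [1, 2, 3, 4, 5],
              'n_estimators': [100, 200, 300, 400, 500]}

# Number of candidates = product of the list lengths.
n_candidates = math.prod(len(values) for values in param_grid.values())

# Enumerating the grid explicitly gives the same count.
all_candidates = [dict(zip(param_grid, combo))
                  for combo in itertools.product(*param_grid.values())]

cv_folds = 10
n_fits = n_candidates * cv_folds

print(n_candidates, len(all_candidates), n_fits)  # 75 75 750
```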

In [None]:
rf=RandomForestClassifier()
grid_search=GridSearchCV(estimator=rf,param_grid=param_grid,cv=10,n_jobs=-1,verbose=2)
grid_search.fit(X_train,y_train)

Fitting 10 folds for each of 75 candidates, totalling 750 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:   26.5s
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:   48.3s
[Parallel(n_jobs=-1)]: Done 750 out of 750 | elapsed:   59.5s finished


GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini'], 'max_depth': [670],
                         'max_features': ['auto'],
                         'min_samples_leaf': [8, 10, 12],
                         'min_samples_split': [1, 2, 3, 4, 5],
                         'n_estimators': [100, 200, 300, 400, 500]},
             verbose=2)

In [None]:
grid_search.best_estimator_

RandomForestClassifier(max_depth=670, min_samples_leaf=8, min_samples_split=4)

In [None]:
best_grid=grid_search.best_estimator_

In [None]:
best_grid

RandomForestClassifier(max_depth=670, min_samples_leaf=8, min_samples_split=4)

In [None]:
y_pred=best_grid.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print("Accuracy Score {}".format(accuracy_score(y_test,y_pred)))
print("Classification report: {}".format(classification_report(y_test,y_pred)))

[[87 12]
 [25 30]]
Accuracy Score 0.7597402597402597
Classification report:               precision    recall  f1-score   support

           0       0.78      0.88      0.82        99
           1       0.71      0.55      0.62        55

    accuracy                           0.76       154
   macro avg       0.75      0.71      0.72       154
weighted avg       0.75      0.76      0.75       154



### Automated Hyperparameter Tuning

Automated Hyperparameter Tuning can be done by using techniques such as:

- Bayesian Optimization
- Gradient Descent
- Evolutionary Algorithms

### Bayesian Optimization

Bayesian optimization builds a probabilistic model of a function and uses it to find the function's minimum. The aim is to find the input value that gives the lowest possible output value. It usually performs better than random, grid, and manual search, giving better performance in the testing phase and reduced optimization time.


In Hyperopt, Bayesian optimization is implemented by passing three main arguments to the function **fmin**:


- **Objective Function** = defines the loss function to minimize.
- **Domain Space** = defines the range of input values to test (in Bayesian optimization, this space defines a probability distribution for each hyperparameter).
- **Optimization Algorithm** = defines the search algorithm used to select the input values to try in each new iteration.

In [None]:
from hyperopt import hp,fmin,tpe,STATUS_OK,Trials

#### Domain Space

In [None]:
space = {'criterion': hp.choice('criterion', ['entropy', 'gini']),
        'max_depth': hp.quniform('max_depth', 10, 1200, 10),
        'max_features': hp.choice('max_features', ['auto', 'sqrt','log2', None]),
        'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.5),
        'min_samples_split' : hp.uniform ('min_samples_split', 0, 1),
        'n_estimators' : hp.choice('n_estimators', [10, 50, 300, 750, 1200,1300,1500])
    }

In [None]:
space

{'criterion': <hyperopt.pyll.base.Apply at 0x7f085b822c10>,
 'max_depth': <hyperopt.pyll.base.Apply at 0x7f085b822e50>,
 'max_features': <hyperopt.pyll.base.Apply at 0x7f085b822f90>,
 'min_samples_leaf': <hyperopt.pyll.base.Apply at 0x7f085b830290>,
 'min_samples_split': <hyperopt.pyll.base.Apply at 0x7f085b830410>,
 'n_estimators': <hyperopt.pyll.base.Apply at 0x7f085b830590>}

#### Objective Function and Optimization Algorithm (Random Forest Classifier)

In [None]:
from sklearn.model_selection import cross_val_score

def objective(space):
    # Build a forest from one sampled point of the search space.
    model = RandomForestClassifier(criterion = space['criterion'],
                                   # hp.quniform returns floats; max_depth must be an integer.
                                   max_depth = int(space['max_depth']),
                                   max_features = space['max_features'],
                                   min_samples_leaf = space['min_samples_leaf'],
                                   min_samples_split = space['min_samples_split'],
                                   n_estimators = space['n_estimators'],
                                 )

    # Mean 5-fold cross-validated accuracy on the training set.
    accuracy = cross_val_score(model, X_train, y_train, cv = 5).mean()

    # fmin minimizes the loss, so return the negated accuracy in order to maximize accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK }
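
The sign flip is the only trick here: a minimizer that sees the negated accuracy as its loss ends up picking the most accurate candidate. A tiny plain-Python illustration, with made-up accuracies and hypothetical candidate names:

```python
# Hypothetical cross-validated accuracies for three candidate settings.
accuracies = {'candidate_a': 0.71, 'candidate_b': 0.77, 'candidate_c': 0.74}

# A minimizer sees the negated accuracy as its loss.
losses = {name: -acc for name, acc in accuracies.items()}

best_by_loss = min(losses, key=losses.get)
best_by_accuracy = max(accuracies, key=accuracies.get)

print(best_by_loss, best_by_accuracy)  # candidate_b candidate_b
```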

In [None]:
trials = Trials()

best = fmin(fn= objective,       # Objective Function
            space= space,        # Domain Space
            algo= tpe.suggest,   # Optimization Algorithm
            max_evals = 80,      # Number of evaluations
            trials= trials)
best

100%|██████████| 80/80 [06:38<00:00,  4.98s/trial, best loss: -0.7720378515260563]


{'criterion': 1,
 'max_depth': 780.0,
 'max_features': 3,
 'min_samples_leaf': 0.05602019841464171,
 'min_samples_split': 0.09877144025590444,
 'n_estimators': 6}

In [None]:
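# fmin reports each hp.choice parameter as the index of the chosen option,
# so map the indices back to the actual values.
crit = {0: 'entropy', 1: 'gini'}
feat = {0: 'auto', 1: 'sqrt', 2: 'log2', 3: None}
est = {0: 10, 1: 50, 2: 300, 3: 750, 4: 1200,5: 1300,6: 1500}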


print(crit[best['criterion']])
print(feat[best['max_features']])
print(est[best['n_estimators']])

gini
None
1500


In [None]:
# Refit on the training set with the best values found; max_depth is cast to int
# because hp.quniform returns a float.
trainedforest = RandomForestClassifier(criterion = crit[best['criterion']],
                                       max_depth = int(best['max_depth']),
                                       max_features = feat[best['max_features']],
                                       min_samples_leaf = best['min_samples_leaf'],
                                       min_samples_split = best['min_samples_split'],
                                       n_estimators = est[best['n_estimators']]).fit(X_train,y_train)

predictionforest = trainedforest.predict(X_test)
print(confusion_matrix(y_test,predictionforest))
print(accuracy_score(y_test,predictionforest))
print(classification_report(y_test,predictionforest))

[[87 12]
 [27 28]]
0.7467532467532467
              precision    recall  f1-score   support

           0       0.76      0.88      0.82        99
           1       0.70      0.51      0.59        55

    accuracy                           0.75       154
   macro avg       0.73      0.69      0.70       154
weighted avg       0.74      0.75      0.74       154



### Genetic Algorithms
Genetic algorithms apply natural selection mechanisms to machine learning.

* Imagine we create a population of **N** machine learning models with some predefined hyperparameters.
* We calculate the accuracy of each model and keep just half of the models (the ones that perform best).
* We then generate offspring with hyperparameters similar to those of the best models, so that we again have a population of N models.
* At this point we calculate the accuracy of each model again and repeat the cycle for a defined number of generations.
* In this way, only the best models survive at the end of the process.
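
The cycle described above can be sketched in plain Python. This is a toy illustration, not how TPOT is implemented: `fitness` is a hypothetical stand-in for "train a model and measure its accuracy", and the parameter ranges, mutation sizes, and function names are all assumptions made for the example.

```python
import random

rng = random.Random(0)

def fitness(params):
    # Hypothetical stand-in for training a model and scoring it:
    # the score peaks at max_depth=12 and min_samples_leaf=4.
    return -((params['max_depth'] - 12) ** 2) - (params['min_samples_leaf'] - 4) ** 2

def random_params():
    return {'max_depth': rng.randint(1, 50), 'min_samples_leaf': rng.randint(1, 20)}

def mutate(parent):
    # An offspring is a copy of a parent with each value nudged slightly.
    return {
        'max_depth': max(1, parent['max_depth'] + rng.randint(-3, 3)),
        'min_samples_leaf': max(1, parent['min_samples_leaf'] + rng.randint(-2, 2)),
    }

N = 20
population = [random_params() for _ in range(N)]
initial_best = max(fitness(p) for p in population)

for generation in range(15):
    # Keep the best-performing half of the population.
    population.sort(key=fitness, reverse=True)
    survivors = population[:N // 2]
    # Refill to N models with offspring of the survivors.
    offspring = [mutate(rng.choice(survivors)) for _ in range(N - len(survivors))]
    population = survivors + offspring

best = max(population, key=fitness)
print(best, fitness(best))
```

Because the survivors are carried over unchanged, the best score in the population can never get worse from one generation to the next.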

In [None]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest.
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split.
max_features = ['auto', 'sqrt','log2']

# Maximum number of levels in tree.
max_depth = [int(x) for x in np.linspace(10, 1000,10)]

# Minimum number of samples required to split a node.
min_samples_split = [2, 5, 10, 14]

# Minimum number of samples required at each leaf node.
min_samples_leaf = [1, 2, 4, 6, 8]

# Create the parameter grid for TPOT
param = {'n_estimators': n_estimators,
         'max_features': max_features,
         'max_depth': max_depth,
         'min_samples_split': min_samples_split,
         'min_samples_leaf': min_samples_leaf,
         'criterion':['entropy','gini']}

print(param)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}


In [None]:
!pip install tpot

In [None]:
from tpot import TPOTClassifier

tpot_classifier = TPOTClassifier(generations= 5,
                                 population_size= 24,
                                 offspring_size= 12,
                                 verbosity= 2,
                                 early_stop= 12,
                                 config_dict={'sklearn.ensemble.RandomForestClassifier': param},
                                 cv = 4,
                                 scoring = 'accuracy')

tpot_classifier.fit(X_train,y_train)


Generation 1 - Current best internal CV score: 0.7752631355572532
Generation 2 - Current best internal CV score: 0.7818096935743994
Generation 3 - Current best internal CV score: 0.7818096935743994
Generation 4 - Current best internal CV score: 0.7818096935743994

Best pipeline: RandomForestClassifier(RandomForestClassifier(input_matrix, criterion=entropy, max_depth=560, max_features=auto, min_samples_leaf=4, min_samples_split=14, n_estimators=600), criterion=entropy, max_depth=230, max_features=log2, min_samples_leaf=6, min_samples_split=10, n_estimators=1400)


TPOTClassifier(config_dict={'sklearn.ensemble.RandomForestClassifier': {'criterion': ['entropy',
                                                                                      'gini'],
                                                                        'max_depth': [10,
                                                                                      120,
                                                                                      230,
                                                                                      340,
                                                                                      450,
                                                                                      560,
                                                                                      670,
                                                                                      780,
                                                                                 

In [None]:
accuracy = tpot_classifier.score(X_test, y_test)
print(accuracy)

0.7597402597402597


### Optimize hyperparameters of the model using Optuna

* For the random forest, the hyperparameters tuned here are `n_estimators` and `max_depth`; the search also tries an SVC with a tuned `C`, to see whether model accuracy can be improved.

* The `objective` function is modified to accept a trial object.

* This trial has several methods for sampling hyperparameters.

* We create a study to run the hyperparameter optimization and finally read the best hyperparameters.

In [18]:
!pip install optuna

In [19]:
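import optuna
import sklearn.ensemble
import sklearn.model_selection
import sklearn.svm

def objective(trial):
    # Let Optuna choose between the two model families.
    classifier = trial.suggest_categorical('classifier', ['RandomForest', 'SVC'])

    if classifier == 'RandomForest':
        # Pass step by keyword so the call does not depend on argument order.
        n_estimators = trial.suggest_int('n_estimators', 200, 2000, step=10)
        # Sample max_depth on a log scale, then truncate it to an integer.
        max_depth = int(trial.suggest_float('max_depth', 10, 100, log=True))

        clf = sklearn.ensemble.RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth)
    else:
        c = trial.suggest_float('svc_c', 1e-10, 1e10, log=True)

        clf = sklearn.svm.SVC(C=c, gamma='auto')

    # Return the mean 3-fold cross-validated accuracy, which the study maximizes.
    return sklearn.model_selection.cross_val_score(
        clf, X_train, y_train, n_jobs=-1, cv=3).mean()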


In [20]:
# The direction is 'maximize' because we want to maximize the accuracy of the model.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

trial = study.best_trial

print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

[I 2020-08-11 14:19:31,295] Trial 0 finished with value: 0.6530926191614858 and parameters: {'classifier': 'SVC', 'svc_c': 0.00011586657074918459}. Best is trial 0 with value: 0.6530926191614858.
[I 2020-08-11 14:19:31,373] Trial 1 finished with value: 0.6530926191614858 and parameters: {'classifier': 'SVC', 'svc_c': 148921417.20673385}. Best is trial 0 with value: 0.6530926191614858.
[I 2020-08-11 14:19:37,000] Trial 2 finished with value: 0.7638928742228598 and parameters: {'classifier': 'RandomForest', 'n_estimators': 1360, 'max_depth': 12.484691478273927}. Best is trial 2 with value: 0.7638928742228598.
[I 2020-08-11 14:19:44,868] Trial 3 finished with value: 0.7704208512673363 and parameters: {'classifier': 'RandomForest', 'n_estimators': 1930, 'max_depth': 16.91760135096698}. Best is trial 3 with value: 0.7704208512673363.
[I 2020-08-11 14:19:44,926] Trial 4 finished with value: 0.6530926191614858 and parameters: {'classifier': 'SVC', 'svc_c': 0.12917337627473002}. Best is trial 

Accuracy: 0.775314841383708
Best hyperparameters: {'classifier': 'RandomForest', 'n_estimators': 710, 'max_depth': 23.85656263227053}


In [21]:
trial

FrozenTrial(number=21, value=0.775314841383708, datetime_start=datetime.datetime(2020, 8, 11, 14, 20, 34, 594452), datetime_complete=datetime.datetime(2020, 8, 11, 14, 20, 37, 504986), params={'classifier': 'RandomForest', 'n_estimators': 710, 'max_depth': 23.85656263227053}, distributions={'classifier': CategoricalDistribution(choices=('RandomForest', 'SVC')), 'n_estimators': IntUniformDistribution(high=2000, low=200, step=10), 'max_depth': LogUniformDistribution(high=100, low=10)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=21, state=TrialState.COMPLETE)

In [22]:
study.best_params

{'classifier': 'RandomForest',
 'max_depth': 23.85656263227053,
 'n_estimators': 710}
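
One way to carry these values into a final model is to turn `study.best_params` into keyword arguments. The sketch below uses plain Python with the values copied from the output above; the name `rf_kwargs` is illustrative, and the commented-out lines at the end assume scikit-learn and the `X_train`/`y_train` variables from earlier cells.

```python
# Values copied from study.best_params above.
best_params = {'classifier': 'RandomForest',
               'max_depth': 23.85656263227053,
               'n_estimators': 710}

# Drop the model selector and keep only the estimator arguments.
rf_kwargs = {k: v for k, v in best_params.items() if k != 'classifier'}

# The objective truncated the sampled depth with int(), so do the same here.
rf_kwargs['max_depth'] = int(rf_kwargs['max_depth'])

print(rf_kwargs)  # {'max_depth': 23, 'n_estimators': 710}

# With scikit-learn available, the tuned model could then be built as:
# from sklearn.ensemble import RandomForestClassifier
# rf = RandomForestClassifier(**rf_kwargs).fit(X_train, y_train)
```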

In [24]:
from sklearn.ensemble import RandomForestClassifier

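# Note: these values are set by hand and differ from study.best_params above
# (n_estimators=710, max_depth of about 23.9).
rf=RandomForestClassifier(n_estimators=330,max_depth=30)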
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=30, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=330,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [26]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

y_pred=rf.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[82 17]
 [21 34]]
0.7532467532467533
              precision    recall  f1-score   support

           0       0.80      0.83      0.81        99
           1       0.67      0.62      0.64        55

    accuracy                           0.75       154
   macro avg       0.73      0.72      0.73       154
weighted avg       0.75      0.75      0.75       154

