# <font color='blue'>Parameter Optimization with Randomized Search</font>

Every Machine Learning model has parameters that let you customize the model. These parameters are also called hyperparameters.

In code, Machine Learning algorithms are represented by functions (or classes), and each one exposes customization parameters, which are exactly what we call hyperparameters.

It is also common for people to refer to the model's coefficients (found at the end of training) as parameters.
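
To make the distinction concrete, here is a minimal sketch. The synthetic dataset, the choice of `LogisticRegression`, and the value `C=0.5` are illustrative assumptions and not part of this notebook. `get_params()` returns the hyperparameters chosen before training, while `coef_` and `intercept_` hold the coefficients learned during training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameters are set by us, before training
model = LogisticRegression(C=0.5, max_iter=200)
hyperparams = model.get_params()
print(hyperparams["C"], hyperparams["max_iter"])

# Coefficients (often also called "parameters") only exist after fit()
model.fit(X_demo, y_demo)
print(model.coef_.shape, model.intercept_.shape)
```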

Part of our work as Data Scientists is to find the best combination of hyperparameters for each model.

With Ensemble Methods this work is even more complex, because we have both the hyperparameters of the base estimator and the hyperparameters of the ensemble model, as in the example below:

* Base estimator:

estim_base = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform')

* Ensemble model:

BaggingClassifier(base_estimator=estim_base,
                  bootstrap=True, bootstrap_features=False, max_features=0.5,
                  max_samples=0.5, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=None, verbose=0,
                  warm_start=False)
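
The two levels of hyperparameters can be listed programmatically. The sketch below is an illustration rather than part of the original notebook. It passes the base estimator positionally, because the keyword for it is `base_estimator` in older scikit-learn releases and `estimator` in recent ones. Hyperparameters that belong to the base estimator appear with a double underscore in their name.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Base estimator with its own hyperparameters
estim_base = KNeighborsClassifier(n_neighbors=5, weights="uniform")

# Passed positionally: the keyword name differs between scikit-learn versions
bagging = BaggingClassifier(estim_base, n_estimators=10, max_samples=0.5, max_features=0.5)

# Nested hyperparameters of the base estimator contain a double underscore
all_params = bagging.get_params()
nested = sorted(name for name in all_params if "__" in name)
top_level = sorted(name for name in all_params if "__" not in name)

print("Base estimator hyperparameters:", nested)
print("Ensemble hyperparameters:", top_level)
```

These nested names are the keys a search space would use to tune the base estimator and the ensemble together.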

# <font color='red'>-----------------------------</font>

# <font color='GREEN'>Extremely Randomized Forest</font>

Baseline model, with manually chosen hyperparameters

In [65]:
# Imports
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

In [66]:
# Load the dataset
data = pd.read_excel("C:/formacao_dataScience_DSA_DADOS/04_machineLearning/Cap09/dados_python/credit.xls", skiprows=1)

In [67]:
data

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,220000,1,3,1,39,0,0,0,0,...,88004,31237,15980,8500,20000,5003,3047,5000,1000,0
29996,29997,150000,1,3,2,43,-1,-1,-1,-1,...,8979,5190,0,1837,3526,8998,129,0,0,0
29997,29998,30000,1,2,2,37,4,3,2,-1,...,20878,20582,19357,0,0,22000,4200,2000,3100,1
29998,29999,80000,1,3,1,41,1,-1,0,0,...,52774,11855,48944,85900,3409,1178,1926,52964,1804,1


In [68]:
data["default payment next month"]

0        1
1        1
2        0
3        0
4        0
        ..
29995    0
29996    0
29997    1
29998    1
29999    1
Name: default payment next month, Length: 30000, dtype: int64

In [69]:
# Target variable
target = "default payment next month"
y = np.asarray(data[target])

In [70]:
# Predictor variables (every column except the ID and the target)
features = data.columns.drop(["ID", target])
X = np.asarray(data[features])

In [71]:
len(X)

30000

In [72]:
X_test, y_test = X[21000:], y[21000:]
X_train, y_train = X[:21000], y[:21000]

In [73]:
# # Split the data into training and test sets
# X_train, X_test, y_train, y_test = train_test_split(X,
#                                                     y,
#                                                     test_size=0.30)
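
When using `train_test_split` instead of manual slicing, note that it returns the splits in the order `X_train, X_test, y_train, y_test`. A minimal sketch on toy arrays follows. The arrays, the `stratify` argument, and the seed are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for X and y (illustrative only)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# Return order: X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=99
)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Unlike the positional slice `X[:21000]`, this shuffles the rows before splitting, which matters when the file is ordered in some meaningful way.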

In [74]:
# Classifier
clf = ExtraTreesClassifier(n_estimators=500, random_state=99)

In [75]:
# Train the model
clf.fit(X_train, y_train)

In [76]:
clf.get_params()

{'bootstrap': False,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 500,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 99,
 'verbose': 0,
 'warm_start': False}

In [77]:
# Cross-validation score (3-fold accuracy) on the training data
scores = cross_val_score(clf,
                         X_train,
                         y_train,
                         cv=3,
                         scoring="accuracy",
                         n_jobs=-1)

In [79]:
# Print the results
print(f"ExtraTreesClassifier -> cross-validation accuracy on training data: mean = {round(np.mean(scores), 3)}")
print(f"ExtraTreesClassifier -> standard deviation = {round(np.std(scores), 3)}")

ExtraTreesClassifier -> cross-validation accuracy on training data: mean = 0.805
ExtraTreesClassifier -> standard deviation = 0.009


In [80]:
print(f"ExtraTreesClassifier cross-validation scores => {scores}")


ExtraTreesClassifier cross-validation scores => [0.79785714 0.79942857 0.81871429]


In [81]:
# Make predictions on the test set
y_pred = clf.predict(X_test)

In [82]:
# Confusion matrix (rows: true class, columns: predicted class)
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

[[6774  386]
 [1190  650]]


In [83]:
# Test accuracy
print(f"Test accuracy => {round(accuracy_score(y_test, y_pred), 2)}")

Test accuracy => 0.82
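
The accuracy can be recomputed by hand from the confusion matrix printed above, which also exposes what accuracy hides. The sketch below takes class 1 (a default next month) as the positive class and uses scikit-learn's convention of true classes in rows and predicted classes in columns. Accuracy is about 0.82, but only about 35% of the actual defaults are detected.

```python
import numpy as np

# Confusion matrix of the baseline model (rows: true class, columns: predicted class)
cm = np.array([[6774, 386],
               [1190, 650]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all predictions that are correct
recall = tp / (tp + fn)           # share of actual defaults that were detected
precision = tp / (tp + fp)        # share of predicted defaults that were correct

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```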


# <font color='red'>-----------------------------</font>

## <font color='GREEN'>Hyperparameter Optimization with Randomized Search</font>

Randomized Search samples hyperparameter values from the distributions you provide (a list of values is sampled uniformly) for a fixed number of iterations, set by `n_iter`. A model is built and evaluated with cross-validation for each sampled combination of hyperparameters.
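
The sampling step can be seen in isolation with scikit-learn's `ParameterSampler` utility. This is a minimal sketch: the reduced search space and the seed are illustrative assumptions, not the values tuned in this notebook.

```python
from sklearn.model_selection import ParameterSampler

# Illustrative search space for a tree ensemble
space = {
    "max_depth": [3, 7, 12, None],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Draw 5 of the 4 * 3 * 2 = 24 possible combinations
candidates = list(ParameterSampler(space, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```

Each printed dictionary is one candidate that a randomized search would fit and score. The remaining combinations are never evaluated, which is where the time savings come from.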

In [46]:
# Imports

from sklearn.model_selection import RandomizedSearchCV

In [49]:
# Hyperparameter search space
param_dist = {"max_depth": [1, 3, 7, 8, 12, None],
              "max_features": [8, 9, 10, 11, 16, 22],
              "min_samples_split": [8, 10, 11, 14, 16, 19],
              "min_samples_leaf": [1, 2, 3, 4, 5, 6, 7],
              "bootstrap": [True, False]}

# For the ExtraTrees classifier created earlier, try 25 randomly sampled combinations
rsearch = RandomizedSearchCV(clf,
                             param_distributions=param_dist,
                             n_iter=25,
                             return_train_score=True)


In [51]:
# Run the search on the training data (fits and scores every sampled candidate)
rsearch.fit(X_train, y_train)

In [52]:
# Results
rsearch.cv_results_

# Print the best estimator
bestclf = rsearch.best_estimator_
print(bestclf)

ExtraTreesClassifier(max_depth=12, max_features=11, min_samples_leaf=2,
                     min_samples_split=10, n_estimators=500, random_state=99)
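
The raw `cv_results_` dictionary is easier to read as a table. The sketch below is self-contained and runs on a small synthetic dataset with a reduced search space. Both are assumptions made so it executes quickly, and its numbers say nothing about the credit data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the sketch runs quickly (not the credit data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

search = RandomizedSearchCV(
    ExtraTreesClassifier(n_estimators=20, random_state=0),
    param_distributions={"max_depth": [3, 7, None], "min_samples_leaf": [1, 2, 4]},
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate, best-ranked first
results = pd.DataFrame(search.cv_results_)
summary = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
summary = summary.sort_values("rank_test_score")
print(summary)
print(search.best_params_, round(search.best_score_, 3))
```

The same two lines that build `summary` should work on `rsearch.cv_results_` from the cell above.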


In [54]:
# Use the best estimator to make predictions
y_predict = bestclf.predict(X_test)

In [84]:
y_test = np.array(y_test)
y_predict = np.array(y_predict)

# Confusion matrix
confusionMatrix = confusion_matrix(y_test, y_predict)
print(f"Confusion matrix => {confusionMatrix}")

# Accuracy
accuracy = accuracy_score(y_test, y_predict)
print(f"Accuracy => {accuracy}")

Confusion matrix => [[6895  265]
 [1202  638]]
Accuracy => 0.837


In [61]:
type(y_test)

numpy.ndarray

In [62]:
type(y_predict)

numpy.ndarray

# <font color='red'>-----------------------------</font>

# <font color='GREEN'>Grid Search vs. Randomized Search for Hyperparameter Estimation</font>

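Grid Search methodically evaluates every combination of the listed hyperparameter values, forming a grid. The number of candidates is therefore the product of the number of values given for each hyperparameter.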
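
The size of a grid can be checked before running anything, using `ParameterGrid`. The sketch below uses the same values as the grid searched later in this notebook. The fold count of 5 is an assumption based on the default `cv` in recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched later in this notebook
grid = {
    "max_depth": [3, None],
    "max_features": [1, 3, 10],
    "min_samples_leaf": [1, 3, 10],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(grid))  # 2 * 3 * 3 * 2 * 2
n_folds = 5                              # assumed default cv of GridSearchCV
print(f"{n_candidates} candidates, {n_candidates * n_folds} model fits")
```

A randomized search with `n_iter=20` fits only 20 candidates over the same space, which is the main reason it finishes sooner.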

In [85]:
# Imports

import numpy as np
from time import time
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits

# Load the dataset
digits = load_digits()
X_comparativo, y_comparativo = digits.data, digits.target

# Build the classifier
clf_testes = RandomForestClassifier(n_estimators=20)

In [87]:
# Randomized Search

# Hyperparameter values and distributions to sample from
param_distribuicao = {
 "max_depth": [3, None],
 "max_features": sp_randint(1, 11),
 "min_samples_split": sp_randint(2, 11),  # an integer min_samples_split must be >= 2
 "bootstrap": [True, False],
 "criterion": ["gini", "entropy"]}

# Run the Randomized Search
n_iter_search = 20
random_search = RandomizedSearchCV(clf_testes,
                                   param_distributions=param_distribuicao,
                                   n_iter=n_iter_search,
                                   return_train_score=True)

start = time()
random_search.fit(X_comparativo, y_comparativo)
print(f"RandomizedSearchCV ran in {round(time() - start, 2)} seconds for {n_iter_search} candidates")

# Show each parameter combination with its fit times and cross-validation scores
random_search.cv_results_

RandomizedSearchCV ran in 8.39 seconds for 20 candidates


{'mean_fit_time': array([0.10352502, 0.13533678, 0.05096722, 0.04557314, 0.1065958 ,
        0.06952639, 0.05931425, 0.05921731, 0.05126777, 0.0633235 ,
        0.06275144, 0.04122658, 0.04412975, 0.09494052, 0.0602571 ,
        0.06468406, 0.04523349, 0.09447331, 0.03956356, 0.11320076]),
 'std_fit_time': array([0.0247363 , 0.01142319, 0.00423586, 0.00095839, 0.00469769,
        0.00268937, 0.00616711, 0.00263822, 0.00469568, 0.00529783,
        0.00524585, 0.00609405, 0.00671404, 0.00461427, 0.00626679,
        0.00658622, 0.00301913, 0.01029195, 0.00305257, 0.00359618]),
 'mean_score_time': array([0.00539565, 0.0047689 , 0.00320091, 0.00408812, 0.00561738,
        0.00412278, 0.00401626, 0.00523248, 0.00366402, 0.00465236,
        0.00499973, 0.00349112, 0.00444632, 0.00491452, 0.00404449,
        0.00450258, 0.00362673, 0.00459795, 0.0045579 , 0.00465617]),
 'std_score_time': array([0.00138876, 0.00090862, 0.00064564, 0.0006542 , 0.00070222,
        0.0010157 , 0.000673  , 0.000758
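
A note on the integer distributions used in the randomized search: `scipy.stats.randint(low, high)` samples integers from `low` up to `high - 1`, so `high` itself is never drawn. This matters for `min_samples_split`, which must be at least 2 when given as an integer, so the lower bound of its distribution should be 2. A minimal check, with an arbitrary sample size and seed:

```python
from scipy.stats import randint

# randint(low, high) draws integers in [low, high): high itself is excluded
dist = randint(2, 11)
samples = dist.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())
```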

In [88]:
# Grid Search

# Hyperparameter values to test
param_gridSV = {
 "max_depth": [3, None],
 "max_features": [1, 3, 10],
 "min_samples_leaf": [1, 3, 10],
 "bootstrap": [True, False],
 "criterion": ["gini", "entropy"]}

# Run the Grid Search
grid_search = GridSearchCV(clf_testes,
                           param_grid=param_gridSV,
                           return_train_score=True)

start = time()
grid_search.fit(X_comparativo, y_comparativo)
# The grid has 2 * 3 * 3 * 2 * 2 = 72 candidates, so count them instead of reusing n_iter_search
n_candidates = len(grid_search.cv_results_["params"])
print(f"GridSearchCV ran in {round(time() - start, 2)} seconds for {n_candidates} candidates")

# Show each parameter combination with its fit times and cross-validation scores
grid_search.cv_results_

GridSearchCV ran in 23.34 seconds for 72 candidates


{'mean_fit_time': array([0.03888497, 0.03219914, 0.03647342, 0.03865223, 0.04127269,
        0.03761415, 0.04324121, 0.04682164, 0.04815116, 0.05665956,
        0.04522481, 0.04265113, 0.06274109, 0.05466948, 0.05497117,
        0.09178624, 0.08259697, 0.07241654, 0.03619161, 0.03755383,
        0.03548713, 0.0403605 , 0.04199467, 0.04001975, 0.05244908,
        0.05411744, 0.05194097, 0.06228547, 0.04855046, 0.04027133,
        0.0808188 , 0.06867189, 0.05341554, 0.10350113, 0.09917488,
        0.08156366, 0.02983069, 0.02736788, 0.02833128, 0.03251009,
        0.03397036, 0.03325915, 0.04650059, 0.04435811, 0.0439877 ,
        0.0526278 , 0.03800869, 0.03362055, 0.06624427, 0.05550332,
        0.04931235, 0.10426826, 0.09235997, 0.08398738, 0.02930503,
        0.02918677, 0.02947669, 0.03563967, 0.03657265, 0.03470712,
        0.05246844, 0.0520277 , 0.05387096, 0.06235509, 0.0453599 ,
        0.03703809, 0.07363968, 0.06556778, 0.05302958, 0.12256041,
        0.11288772, 0.09077296]