# Introduction

Now it's your turn! Use what we built in the guided practice to apply grid search to new data.


Guidelines:<br>
    1) Use the diabetes.csv file to build a KNN model<br>
    2) Read the documentation for the KNeighborsClassifier class and check which hyperparameters <br>are available and which values they accept. Choose 3 hyperparameters to optimize.<br>
    3) Create a Pipeline that combines ```MinMaxScaler``` and ```KNeighborsClassifier```<br>
    4) Create a grid to hold the value ranges you chose for each hyperparameter<br>
    5) Define a validation scheme<br>
    6) Apply ```GridSearchCV```<br>
    7) Evaluate performance on the training and test sets.<br>
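
The guidelines above can be sketched end to end as follows. This is a minimal sketch rather than the exercise solution: it uses synthetic data from `make_classification` in place of diabetes.csv, and the grid values, the 5-fold `StratifiedKFold`, and the `neg_log_loss` scoring are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for diabetes.csv (assumption: 8 numeric features, binary target)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.15, random_state=123)

# Step 3: scaler and KNN combined in one pipeline
pipe = Pipeline(steps=[('pre_processing', MinMaxScaler()),
                       ('model', KNeighborsClassifier())])

# Step 4: illustrative grid over three hyperparameters
param_grid = {'model__metric': ['manhattan', 'euclidean'],
              'model__n_neighbors': [5, 15, 25],
              'model__weights': ['uniform', 'distance']}

# Step 5: validation scheme (stratified 5-fold cross-validation)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

# Step 6: exhaustive search; the best combination is refit on X_train by default
search = GridSearchCV(pipe, param_grid, scoring='neg_log_loss', cv=cv)
search.fit(X_train, y_train)

# Step 7: log loss on the training and test sets
train_loss = log_loss(y_train, search.predict_proba(X_train))
test_loss = log_loss(y_test, search.predict_proba(X_test))
print(search.best_params_, train_loss, test_loss)
```

The `model__` prefix routes each hyperparameter to the pipeline step named `model`, so the scaler is refit inside every fold and never sees the held-out rows.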

# Setup

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from scipy.stats import randint
from sklearn.model_selection import StratifiedKFold, ParameterGrid, ParameterSampler, GridSearchCV, RandomizedSearchCV, train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import log_loss, accuracy_score

In [2]:
# Note: the CSV appears to carry a saved row index, which pandas loads as "Unnamed: 0".
df_diabetes = pd.read_csv("diabetes.csv")
df_diabetes.head()

Unnamed: 0,preg,plas,pres,skin,insu,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,tested_positive
1,1,85,66,29,0,26.6,0.351,31,tested_negative
2,8,183,64,0,0,23.3,0.672,32,tested_positive
3,1,89,66,23,94,28.1,0.167,21,tested_negative
4,0,137,40,35,168,43.1,2.288,33,tested_positive


# Pipeline and Grid

In [3]:
# Scale features to [0, 1] before KNN, since neighbor distances depend on feature scale
knn = Pipeline(steps=[('pre_processing', MinMaxScaler()) , ('model' , KNeighborsClassifier())])

In [4]:
# Hold out 15% of the rows as a test set, stratified by class
df_train , df_teste = train_test_split(df_diabetes, stratify=df_diabetes['class'], test_size=0.15 , random_state=123)

In [5]:
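# Size the validation set at 15% of the full dataset,
# expressed as a fraction of the rows remaining in df_train
n_rows_validation = int(0.15*df_diabetes.shape[0])
percent_validation = round((n_rows_validation/df_train.shape[0]), 2)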

In [6]:
# Split the remaining rows into training and validation sets, again stratified by class
df_train , df_validation = train_test_split(df_train, stratify=df_train['class'], test_size = percent_validation, random_state = 123)
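
The arithmetic behind the two-stage split can be checked on a toy example. The sketch below assumes a hypothetical dataset of 1000 rows (the real file's size is not stated here) and shows why the second `test_size` must be larger than 0.15 to keep the validation set at about 15% of the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_total = 1000  # hypothetical dataset size, for illustration only
rows = np.arange(n_total)

# First split: 15% of all rows go to the test set
train_rows, test_rows = train_test_split(rows, test_size=0.15, random_state=123)

# Target validation size: 15% of the full dataset
n_rows_validation = int(0.15 * n_total)

# Same target, expressed as a fraction of the rows left after the first split
percent_validation = round(n_rows_validation / len(train_rows), 2)  # 150 / 850, about 0.18

# Second split: carve the validation set out of the remaining training rows
train_rows, validation_rows = train_test_split(
    train_rows, test_size=percent_validation, random_state=123)

print(len(train_rows), len(validation_rows), len(test_rows))
```

Rounding the fraction to two decimals means the validation set may differ from the test set by a few rows.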

# Grid Search

In [7]:
# Grid definition for the grid search
param_grid_search = {'model__metric':['manhattan' , 'euclidean'],
                    'model__n_neighbors':list(range(1,52,2)),
                    'model__weights': ['uniform','distance']}

In [8]:
# Grid definition for the random search
# (randint(1, 51) samples integers from 1 to 50; the upper bound is exclusive)
param_random_search = {'model__metric':['manhattan','euclidean'],
                      'model__n_neighbors': randint(1,51),
                      'model__weights': ['uniform','distance']}
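
A grid that contains a distribution, such as `randint(1, 51)`, cannot be enumerated with `ParameterGrid`; candidates have to be sampled from it. A minimal sketch with `ParameterSampler` follows, where `n_iter=10` and `random_state=123` are arbitrary choices made for this example.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_random_search = {'model__metric': ['manhattan', 'euclidean'],
                       'model__n_neighbors': randint(1, 51),
                       'model__weights': ['uniform', 'distance']}

# Draw 10 candidate combinations; lists are sampled uniformly,
# distributions are sampled through their rvs() method
sampled_params = list(
    ParameterSampler(param_random_search, n_iter=10, random_state=123))

for combinacao in sampled_params:
    print(combinacao)
```

Each sampled dictionary can be passed to `knn.set_params(**combinacao)` in the same way as a grid search combination.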

In [9]:
# Separate the features from the target ('class') in each split
X_train , y_train = df_train.drop('class' , axis=1),df_train['class']
X_test , y_test = df_teste.drop('class', axis=1) , df_teste['class']
X_validation , y_validation = df_validation.drop('class', axis=1) , df_validation['class']

In [10]:
# Grid search loop: fit each combination on the training set and score it with log loss
validation_score_grid_search = []
train_score_grid_search = []
list_grid_search_params = list(ParameterGrid(param_grid_search))
for combinacao in list_grid_search_params:
    knn.set_params(**combinacao)
    knn.fit(X_train,y_train)
    y_validation_predict=knn.predict_proba(X_validation)
    y_train_predict = knn.predict_proba(X_train)
    validation_score_grid_search.append(log_loss(y_validation, y_validation_predict))
    train_score_grid_search.append(log_loss(y_train, y_train_predict))
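
A hand-written loop over a single hold-out set can also be expressed with `GridSearchCV` by passing a `PredefinedSplit`. The sketch below is one possible equivalent, not the notebook's method: it runs on synthetic data, the small grid is illustrative, and `refit=False` is an assumption made so the search does not retrain on training and validation rows combined.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training and validation splits
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

# Stack both splits; -1 marks rows that are always used for training,
# 0 marks rows that belong to the single validation fold
X_all = np.vstack([X_tr, X_val])
y_all = np.concatenate([y_tr, y_val])
test_fold = np.concatenate([np.full(len(X_tr), -1),
                            np.zeros(len(X_val), dtype=int)])

pipe = Pipeline(steps=[('pre_processing', MinMaxScaler()),
                       ('model', KNeighborsClassifier())])
grid = {'model__n_neighbors': [3, 7, 11],
        'model__weights': ['uniform', 'distance']}

search = GridSearchCV(pipe, grid, scoring='neg_log_loss',
                      cv=PredefinedSplit(test_fold), refit=False)
search.fit(X_all, y_all)

# Scores are negated log loss, so the best score is the largest one
print(search.best_params_, -search.best_score_)
```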

In [11]:
np.min(validation_score_grid_search)

0.48424436121745135

In [12]:
np.argmin(validation_score_grid_search)

79

In [13]:
best_grid_search_params = list_grid_search_params[np.argmin(validation_score_grid_search)]
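
To see which combination a position in the list refers to, it helps to know how `ParameterGrid` orders its output: parameter names are sorted alphabetically and the last one varies fastest. The sketch below rebuilds the same grid and looks up position 79, the index returned by `np.argmin` above.

```python
from sklearn.model_selection import ParameterGrid

param_grid_search = {'model__metric': ['manhattan', 'euclidean'],
                     'model__n_neighbors': list(range(1, 52, 2)),
                     'model__weights': ['uniform', 'distance']}

list_grid_search_params = list(ParameterGrid(param_grid_search))

# 2 metrics x 26 neighbor counts x 2 weighting schemes
print(len(list_grid_search_params))

# Combination stored at position 79
print(list_grid_search_params[79])
```

With this ordering, positions 0 to 51 cover `manhattan` and positions 52 to 103 cover `euclidean`, so position 79 should correspond to `euclidean`, `n_neighbors=27`, and `weights='distance'`.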

In [14]:
# set_params returns the pipeline itself, so knn_grid_search and knn are the same object
knn_grid_search = knn.set_params(**best_grid_search_params)
knn_grid_search.fit(X_train , y_train)

# Performance

In [15]:
print(f'Performance on train: {log_loss(y_train, knn_grid_search.predict_proba(X_train))}')

Performance on train: 2.2204460492503136e-16
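
A training log loss this close to zero is what one would expect if the selected configuration uses `weights='distance'`: each training point is its own nearest neighbor at distance zero, so it effectively decides its own prediction. The sketch below illustrates the effect on synthetic data; the dataset and `n_neighbors=27` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with continuous features, so duplicate rows are not expected
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

losses = {}
for weights in ['uniform', 'distance']:
    model = KNeighborsClassifier(n_neighbors=27, weights=weights).fit(X, y)
    # Score the model on the same rows it was fit on
    losses[weights] = log_loss(y, model.predict_proba(X))

print(losses)
```

For this reason the training score of a distance-weighted KNN says little about generalization; the validation and test scores are the ones to compare.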


In [16]:
print(f'Performance on test: {log_loss(y_test, knn_grid_search.predict_proba(X_test))}')

Performance on test: 0.4617165016364378


In [17]:
df_metrics_grid_search = pd.DataFrame({'params':list_grid_search_params, 'train_score':train_score_grid_search, 'validation_score':validation_score_grid_search})

In [18]:
df_metrics_grid_search.sort_values(by='validation_score', inplace=True)

In [19]:
df_metrics_grid_search['validation_score']

79     0.484244
77     0.484616
76     0.484912
78     0.485690
75     0.487739
        ...    
3      4.884952
53    10.996369
52    10.996369
1     11.912733
0     11.912733
Name: validation_score, Length: 104, dtype: float64

In [20]:
len(list_grid_search_params)

104