# LAB: Estimando hiperparámetros con `GridSearchCV` para Regresión Logística y KNN

## Introducción

El objetivo de esta práctica es ajusten hiperparámetros usando Cross Validation. Para eso, usaremos `GridSearchCV`.

Utilizaremos el dataset ya trabajado sobre cáncer de mama. Contiene información de estudios clínicos y celulares. El objetivo es predecir el carácter benigno ($class_t=0$) maligno ($class_t=1$) del cáncer en función de una serie de predictores a nivel celular.

    + class_t es la variable target
    + el resto son variables con valores normalizados de 1 a 10

[Aquí](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names) pueden encontrar más información sobre el dataset.

**Nota:** se eliminaron del dataset original 16 casos con valores perdidos en algunos campos.

## Tareas

Para esta práctica deberá 

1. Construir dos clasificadores: Regresión Logística y K-Vecinos más Cercanos (KNN)
2. Estimar los hiperpáremtros del modelo

          *2.1 LogReg:* deberá tunear un modelo con solver 'saga', C's = 1, 10, 100, 1000, y regularización L1 y L2
          
          *2.2 KNN:* deberá tunear tanto el parámetro k, como la medida del peso dado a los K vecinos (uniforme o distancia). También podría probar con el parámetro p que define el tipo de distancia con el que se calculan los vecinos más cercanos.
      
3. Estimar los modelos finales
4. Evaluar cuál de los dos performa mejor

**Importante:** recuerde que deberá diseñar cuidadosamente las diferentes estrategias de validación de las diferentes etapas de estimación del modelo.

Importamos los paquetes necesarios


In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import normalize

Importamos el dataset

In [2]:
df = pd.read_csv('../Data/breast-cancer.csv', header = None)
df.columns = ['ID', 'clump_Thickness', 'unif_cell_size', 'unif_cell_shape', 'adhesion', 'epith_cell_Size', 'bare_nuclei',
              'bland_chromatin ','norm_nucleoli', 'mitoses', 'class_t']

In [3]:
df.shape

(683, 11)

In [4]:
df.head()

Unnamed: 0,ID,clump_Thickness,unif_cell_size,unif_cell_shape,adhesion,epith_cell_Size,bare_nuclei,bland_chromatin,norm_nucleoli,mitoses,class_t
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


Recodificamos las clases en "0" y "1"

In [5]:
df.class_t[df['class_t'] == 2] = 0
df.class_t[df['class_t'] == 4] = 1

Hacemos el split entre target y features

In [6]:
X = df.iloc[:,1:9]
y = df['class_t']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

¿Hace falta estandarizar en este caso?

In [7]:
# Utilizamos sklearn para estandarizar la matriz de Features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

**Pista 1:** Conviene realizar dos listas, una con los estimadores de los modelos y otra con la grid de parámetros a estimar en cada modelo.

**Pista 2:** Conviene iterar sobre esas listas para estimar los hiperparámetros de los modelos

In [8]:
models = [LogisticRegression(),
          KNeighborsClassifier()]

In [9]:
params = [
    {'C': [1, 10, 100, 1000],
     'penalty': ['l1', 'l2',],
     'solver': ['saga']},
    {'n_neighbors': range(1,200),
     'weights' : ['uniform', 'distance'],
     'p' : [1, 2, 3]}
]

In [10]:
grids = []
for i in range(len(models)):
    gs = GridSearchCV(estimator=models[i], param_grid=params[i], scoring='accuracy', cv=10, n_jobs=4, iid=True)
    print (gs)
    fit = gs.fit(X_train, y_train)
    grids.append(fit)

GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid=True, n_jobs=4,
             param_grid={'C': [1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
                         'solver': ['saga']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm

In [11]:
for i in grids:
    print (i.best_score_)
    print (i.best_estimator_)
    print (i.best_params_)

0.9633251833740831
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
{'C': 1, 'penalty': 'l1', 'solver': 'saga'}
0.9731051344743277
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=1,
                     weights='uniform')
{'n_neighbors': 3, 'p': 1, 'weights': 'uniform'}


In [12]:
pd.DataFrame(grids[0].cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_penalty,param_solver,params,split0_test_score,split1_test_score,...,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.022701,0.008612,0.001101,0.001373,1,l1,saga,"{'C': 1, 'penalty': 'l1', 'solver': 'saga'}",0.952381,0.97619,...,0.97619,0.95122,0.925,0.95,1.0,0.95,0.95,0.963325,0.022943,1
1,0.009699,0.002528,0.0012,0.001987,1,l2,saga,"{'C': 1, 'penalty': 'l2', 'solver': 'saga'}",0.952381,0.97619,...,0.97619,0.95122,0.925,0.95,1.0,0.95,0.95,0.96088,0.019982,8
2,0.0193,0.013849,0.000501,0.000501,10,l1,saga,"{'C': 10, 'penalty': 'l1', 'solver': 'saga'}",0.952381,0.97619,...,0.97619,0.95122,0.925,0.95,1.0,0.95,0.975,0.963325,0.020031,1
3,0.0116,0.001563,0.000701,0.000459,10,l2,saga,"{'C': 10, 'penalty': 'l2', 'solver': 'saga'}",0.952381,0.97619,...,0.97619,0.95122,0.925,0.95,1.0,0.95,0.975,0.963325,0.020031,1
4,0.014299,0.005156,0.000601,0.000491,100,l1,saga,"{'C': 100, 'penalty': 'l1', 'solver': 'saga'}",0.952381,0.97619,...,0.97619,0.95122,0.925,0.95,1.0,0.95,0.975,0.963325,0.020031,1
5,0.008699,0.002411,0.000601,0.000491,100,l2,saga,"{'C': 100, 'penalty': 'l2', 'solver': 'saga'}",0.952381,0.97619,...,0.97619,0.95122,0.925,0.95,1.0,0.95,0.975,0.963325,0.020031,1
6,0.011299,0.003001,0.0004,0.000491,1000,l1,saga,"{'C': 1000, 'penalty': 'l1', 'solver': 'saga'}",0.952381,0.97619,...,0.97619,0.95122,0.925,0.95,1.0,0.95,0.975,0.963325,0.020031,1
7,0.010199,0.002181,0.000701,0.000459,1000,l2,saga,"{'C': 1000, 'penalty': 'l2', 'solver': 'saga'}",0.952381,0.97619,...,0.97619,0.95122,0.925,0.95,1.0,0.95,0.975,0.963325,0.020031,1


In [13]:
pd.DataFrame(grids[1].cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,param_p,param_weights,params,split0_test_score,split1_test_score,...,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001697,4.567818e-04,0.006199,0.004045,1,1,uniform,"{'n_neighbors': 1, 'p': 1, 'weights': 'uniform'}",0.928571,0.976190,...,1.000000,0.926829,0.950,0.925,1.000,0.950,0.975,0.955990,0.028393,190
1,0.001501,4.974592e-04,0.002102,0.001045,1,1,distance,"{'n_neighbors': 1, 'p': 1, 'weights': 'distance'}",0.928571,0.976190,...,1.000000,0.926829,0.950,0.925,1.000,0.950,0.975,0.955990,0.028393,190
2,0.001402,4.903098e-04,0.003998,0.000632,1,2,uniform,"{'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}",0.928571,0.976190,...,1.000000,0.926829,0.950,0.925,1.000,0.925,0.950,0.955990,0.028800,190
3,0.001300,4.585857e-04,0.001599,0.000491,1,2,distance,"{'n_neighbors': 1, 'p': 2, 'weights': 'distance'}",0.928571,0.976190,...,1.000000,0.926829,0.950,0.925,1.000,0.925,0.950,0.955990,0.028800,190
4,0.002501,2.871775e-03,0.008299,0.005178,1,3,uniform,"{'n_neighbors': 1, 'p': 3, 'weights': 'uniform'}",0.928571,0.976190,...,1.000000,0.902439,0.925,0.925,0.975,0.900,0.975,0.948655,0.034040,399
5,0.003701,3.164108e-03,0.004200,0.000600,1,3,distance,"{'n_neighbors': 1, 'p': 3, 'weights': 'distance'}",0.928571,0.976190,...,1.000000,0.902439,0.925,0.925,0.975,0.900,0.975,0.948655,0.034040,399
6,0.001602,4.905525e-04,0.006399,0.003555,2,1,uniform,"{'n_neighbors': 2, 'p': 1, 'weights': 'uniform'}",0.928571,0.976190,...,1.000000,0.926829,0.875,0.925,0.950,0.925,0.950,0.936430,0.033564,824
7,0.002298,1.188912e-03,0.003599,0.005143,2,1,distance,"{'n_neighbors': 2, 'p': 1, 'weights': 'distance'}",0.928571,0.976190,...,1.000000,0.926829,0.950,0.925,1.000,0.950,0.975,0.955990,0.028393,190
8,0.001501,5.010390e-04,0.004399,0.001021,2,2,uniform,"{'n_neighbors': 2, 'p': 2, 'weights': 'uniform'}",0.928571,0.952381,...,1.000000,0.926829,0.875,0.925,0.975,0.875,0.950,0.938875,0.039083,735
9,0.005000,7.375993e-03,0.003001,0.002758,2,2,distance,"{'n_neighbors': 2, 'p': 2, 'weights': 'distance'}",0.928571,0.976190,...,1.000000,0.926829,0.950,0.925,1.000,0.925,0.950,0.955990,0.028800,190


In [14]:

X_test = scaler.transform(X_test)

In [15]:
y_preds_log = grids[0].predict(X_test)
y_preds_knn = grids[1].predict(X_test)

In [16]:
print (classification_report(y_test, y_preds_log), confusion_matrix(y_test, y_preds_log))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98       180
           1       0.98      0.93      0.95        94

    accuracy                           0.97       274
   macro avg       0.97      0.96      0.96       274
weighted avg       0.97      0.97      0.97       274
 [[178   2]
 [  7  87]]


In [17]:
print (classification_report(y_test, y_preds_knn), confusion_matrix(y_test, y_preds_knn))

              precision    recall  f1-score   support

           0       0.98      0.99      0.99       180
           1       0.98      0.97      0.97        94

    accuracy                           0.98       274
   macro avg       0.98      0.98      0.98       274
weighted avg       0.98      0.98      0.98       274
 [[178   2]
 [  3  91]]


## Diferencia de performance entre Random Search y Gridsearch

Dado el siguiente conjunto de parámetros:

        param_dist = {
                    'n_neighbors': range(1,200),
                    'weights' : ['uniform', 'distance'],
                    'p' : [1, 2, 3]
                    }

Implementar una búsqueda del conjunto óptimo de hiperparámetros tanto con GridSearchCV como con RandomSearchCV.
Verificar la diferencia en cada caso de:
    
    1. El tiempo de ejecución (utilizando la biblioteca time)
    2. La combinación óptima de parámetros
    3. La performance del mejor modelo en cada caso sobre los datos del test set que separamos anteriormente en términos de accuracy


In [18]:
from sklearn.model_selection import RandomizedSearchCV

In [19]:
def busquedaGridsearch(params_):
    gs = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params_, scoring='accuracy', cv=10, n_jobs=4, iid=True)
    fit = gs.fit(X_train, y_train)
    return gs    

In [20]:
def busquedaRandomSearch(params_,iter_):
    gs = RandomizedSearchCV(estimator=KNeighborsClassifier(), param_distributions=params_, \
                            scoring='accuracy', cv=10, n_jobs=4, n_iter = iter_ , iid=True)
    fit = gs.fit(X_train, y_train)
    return gs    

In [21]:
param_dist = {
    'n_neighbors': range(1,200),
    'weights' : ['uniform', 'distance'],
    'p' : [1, 2, 3]
}


In [22]:
import time

In [23]:
tic = time.time()
gs_random_search = busquedaRandomSearch(param_dist,100)        
toc = time.time()
print(str(toc-tic) + ' Segundos')

2.2960002422332764 Segundos


In [24]:
tic = time.time()
gs_grid_search = busquedaGridsearch(param_dist)   
toc = time.time()
print(str(toc-tic) + ' Segundos')

27.07099962234497 Segundos


In [25]:
gs_random_search.best_params_

{'weights': 'uniform', 'p': 1, 'n_neighbors': 3}

In [26]:
gs_grid_search.best_params_

{'n_neighbors': 3, 'p': 1, 'weights': 'uniform'}

In [27]:
from sklearn.metrics import accuracy_score

def obtener_performance(estimator):
    y_pred = estimator.predict(X_test)
    return accuracy_score(y_pred,y_test, normalize = True)

In [28]:
obtener_performance(gs_grid_search.best_estimator_)

0.9817518248175182

In [29]:
obtener_performance(gs_random_search.best_estimator_)

0.9817518248175182