# LAB: Tuning hyperparameters with `GridSearchCV` for SVC and KNN

## Introduction

The goal of this lab is to get you started tuning hyperparameters with cross-validation. To do so, we will use `GridSearchCV`.

We will use the breast cancer dataset we have already worked with. It contains information from clinical and cellular studies. The goal is to predict whether the tumor is benign ($class\_t=0$) or malignant ($class\_t=1$) from a set of cell-level predictors.

+ `class_t` is the target variable
+ the remaining variables take values normalized to a 1 to 10 scale

More information about the dataset can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names).

**Note:** 16 cases with missing values in some fields were removed from the original dataset.

## Tasks

For this lab you will need to:

1. Build two classifiers: Support Vector Machines and K-Nearest Neighbors (KNN)
2. Tune the hyperparameters of each model

   *2.1 SVM:* tune one model with a linear kernel and C = 1, 10, 100, 1000; and another with an rbf kernel, C = 1, 10, 100, 1000 and [gamma](http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html) = 0.1, 0.01, 0.001, 0.0001, 0.00001

   *2.2 KNN:* tune both the parameter k and the weighting given to the k neighbors (uniform or distance). You may also try the parameter p, which defines the type of distance used to find the nearest neighbors.

3. Fit the final models
4. Evaluate which of the two performs better
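
As a small illustration of the parameter p mentioned in task 2.2: with the default `metric='minkowski'`, `KNeighborsClassifier` measures the distance between two points as $\left(\sum_i |x_i - y_i|^p\right)^{1/p}$, so `p=1` gives the Manhattan distance and `p=2` the Euclidean distance. The sketch below computes it by hand; the helper name `minkowski` and the two sample points are made up for illustration.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two 1-D points (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = [0, 0], [3, 4]     # made-up points
d1 = minkowski(a, b, 1)   # Manhattan: |3| + |4|
d2 = minkowski(a, b, 2)   # Euclidean: sqrt(9 + 16)
d3 = minkowski(a, b, 3)   # higher p gives more weight to the largest coordinate gap
print(d1, d2, d3)
```

For the same pair of points the distance shrinks as p grows, which can change which neighbors count as "nearest".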

**Important:** remember to design carefully the validation strategy used at each stage of model fitting.

We import the required packages.


In [1]:
from sklearn import svm, linear_model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import normalize

We load the dataset.

In [2]:
df = pd.read_csv('breast-cancer.csv', header = None)
df.columns = ['ID', 'clump_Thickness', 'unif_cell_size', 'unif_cell_shape', 'adhesion', 'epith_cell_Size', 'bare_nuclei',
              'bland_chromatin', 'norm_nucleoli', 'mitoses', 'class_t']

We recode the classes as "0" and "1" (the raw file uses 2 for benign and 4 for malignant).

In [3]:
# Use .loc to avoid chained assignment (SettingWithCopyWarning)
df.loc[df['class_t'] == 2, 'class_t'] = 0
df.loc[df['class_t'] == 4, 'class_t'] = 1

We separate the features from the target, then split both into training and test sets.

In [4]:
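# Columns 1..8 as features: this skips 'ID' (column 0) and, as written, also leaves out 'mitoses' (column 9)
X = df.iloc[:, 1:9]
y = df['class_t']
# Hold out 40% of the rows as a test set for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)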
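
The split above is random and unseeded, so the scores reported in this notebook will vary from run to run. A minimal sketch of a reproducible, stratified split follows. The synthetic arrays `X_demo` and `y_demo` and the choice `random_state=0` are assumptions for illustration and are not part of the original lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.randint(1, 11, size=(100, 8))   # made-up features with values in 1..10
y_demo = np.array([0] * 65 + [1] * 35)       # made-up labels, 35% positives

# stratify keeps the class proportions equal in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=0)

print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```

Stratifying matters most when one class is much rarer than the other; with a seed fixed, the cross-validation scores and test metrics can be compared across runs.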

**Hint 1:** It helps to build two lists, one with the model estimators and another with the parameter grid to search for each model.

**Hint 2:** It helps to iterate over those lists to tune the hyperparameters of each model.

In [5]:
models = [svm.SVC(),
          KNeighborsClassifier()]

In [6]:
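params = [
    # SVC: one sub-grid per kernel, so gamma is only searched for the rbf kernel
    [
        {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
        {'C': [1, 10, 100, 1000], 'gamma': [0.1, 0.01, 0.001, 0.0001, 0.00001], 'kernel': ['rbf']}
    ],
    # KNN: number of neighbors, weighting scheme and Minkowski exponent p
    {'n_neighbors': range(1, 200),
     'weights': ['uniform', 'distance'],
     'p': [1, 2, 3]}
]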
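
To get a feel for the cost of each search before running it, the number of candidates can be counted with `ParameterGrid`, which expands a grid into the same list of combinations that `GridSearchCV` evaluates. The variable names `svc_grid` and `knn_grid` are introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

svc_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.1, 0.01, 0.001, 0.0001, 0.00001], 'kernel': ['rbf']},
]
knn_grid = {'n_neighbors': list(range(1, 200)),
            'weights': ['uniform', 'distance'],
            'p': [1, 2, 3]}

n_svc = len(ParameterGrid(svc_grid))   # 4 linear + 4 * 5 rbf candidates
n_knn = len(ParameterGrid(knn_grid))   # 199 * 2 * 3 candidates
print(n_svc, n_knn)
```

Each candidate is fitted once per fold, so with `cv=10` the KNN search performs roughly fifty times as many fits as the SVC search.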

In [7]:
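grids = []
for model, grid in zip(models, params):
    # 10-fold CV on the training set only; the test set is kept for the final evaluation
    gs = GridSearchCV(estimator=model, param_grid=grid, scoring='accuracy', cv=10, n_jobs=4)
    print(gs)
    gs.fit(X_train, y_train)  # fit() returns the same GridSearchCV object
    grids.append(gs)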

GridSearchCV(cv=10, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=4,
       param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['linear']}, {'C': [1, 10, 100, 1000], 'gamma': [0.1, 0.01, 0.001, 0.0001, 1e-05], 'kernel': ['rbf']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)
GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=4,
       param_grid={'n_neighbors': range(1, 200), 'weights': ['uniform', 'distance'], 'p': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True

In [8]:
for gs in grids:
    print(gs.best_score_)
    print(gs.best_estimator_)
    print(gs.best_params_)

0.973105134474
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
{'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.975550122249
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='distance')
{'n_neighbors': 5, 'p': 2, 'weights': 'distance'}


In [9]:
pd.DataFrame(grids[0].cv_results_)

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_C,param_gamma,param_kernel,params,rank_test_score,split0_test_score,...,split7_test_score,split7_train_score,split8_test_score,split8_train_score,split9_test_score,split9_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.006847,0.002157,0.968215,0.973104,1,,linear,"{'C': 1, 'kernel': 'linear'}",8,1.0,...,0.975,0.9729,0.975,0.9729,0.975,0.9729,0.001869,0.003694,0.024554,0.002845
1,0.027923,0.000737,0.968215,0.973648,10,,linear,"{'C': 10, 'kernel': 'linear'}",8,1.0,...,0.975,0.9729,0.975,0.9729,0.975,0.9729,0.010789,7.2e-05,0.024554,0.002738
2,0.186663,0.000688,0.968215,0.973648,100,,linear,"{'C': 100, 'kernel': 'linear'}",8,1.0,...,0.975,0.9729,0.975,0.9729,0.975,0.9729,0.067837,0.000242,0.024554,0.002738
3,1.426656,0.000512,0.968215,0.973648,1000,,linear,"{'C': 1000, 'kernel': 'linear'}",8,1.0,...,0.975,0.9729,0.975,0.9729,0.975,0.9729,0.6445,4.7e-05,0.024554,0.002738
4,0.005193,0.000762,0.96088,0.995381,1,0.1,rbf,"{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}",19,1.0,...,0.975,0.99729,0.95,0.99458,0.975,0.99458,0.001717,0.000226,0.022212,0.001246
5,0.004793,0.001186,0.968215,0.973377,1,0.01,rbf,"{'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}",8,1.0,...,1.0,0.97019,0.975,0.9729,0.95,0.9729,0.00106,0.000799,0.032908,0.003604
6,0.007169,0.001033,0.97066,0.972018,1,0.001,rbf,"{'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}",3,1.0,...,1.0,0.96748,0.975,0.9729,0.975,0.9729,0.004816,0.000353,0.03418,0.003657
7,0.007061,0.001332,0.96577,0.96767,1,0.0001,rbf,"{'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}",14,1.0,...,0.975,0.96748,0.975,0.96748,0.95,0.96748,0.002914,0.000548,0.033195,0.003943
8,0.008921,0.001294,0.677262,0.677262,1,1e-05,rbf,"{'C': 1, 'gamma': 1e-05, 'kernel': 'rbf'}",24,0.666667,...,0.675,0.677507,0.675,0.677507,0.675,0.677507,0.002593,0.000172,0.006375,0.000715
9,0.008022,0.001347,0.963325,1.0,10,0.1,rbf,"{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}",16,1.0,...,0.975,1.0,0.975,1.0,0.975,1.0,0.000581,0.000358,0.022255,0.0


In [10]:
pd.DataFrame(grids[1].cv_results_)

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_n_neighbors,param_p,param_weights,params,rank_test_score,split0_test_score,...,split7_test_score,split7_train_score,split8_test_score,split8_train_score,split9_test_score,split9_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.004194,0.002357,0.970660,1.000000,1,1,uniform,"{'n_neighbors': 1, 'p': 1, 'weights': 'uniform'}",43,1.000000,...,0.975,1.000000,0.950,1.000000,1.000,1.000000,0.003340,0.000741,0.021381,0.000000
1,0.002243,0.002923,0.970660,1.000000,1,1,distance,"{'n_neighbors': 1, 'p': 1, 'weights': 'distance'}",43,1.000000,...,0.975,1.000000,0.950,1.000000,1.000,1.000000,0.000404,0.002144,0.021381,0.000000
2,0.002410,0.001986,0.965770,1.000000,1,2,uniform,"{'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}",342,1.000000,...,0.975,1.000000,0.950,1.000000,1.000,1.000000,0.000892,0.000835,0.022357,0.000000
3,0.003438,0.003322,0.965770,1.000000,1,2,distance,"{'n_neighbors': 1, 'p': 2, 'weights': 'distance'}",342,1.000000,...,0.975,1.000000,0.950,1.000000,1.000,1.000000,0.001618,0.001808,0.022357,0.000000
4,0.002919,0.007033,0.960880,1.000000,1,3,uniform,"{'n_neighbors': 1, 'p': 3, 'weights': 'uniform'}",573,1.000000,...,0.975,1.000000,0.950,1.000000,1.000,1.000000,0.000953,0.002198,0.027026,0.000000
5,0.002218,0.007262,0.960880,1.000000,1,3,distance,"{'n_neighbors': 1, 'p': 3, 'weights': 'distance'}",573,1.000000,...,0.975,1.000000,0.950,1.000000,1.000,1.000000,0.000299,0.003125,0.027026,0.000000
6,0.003233,0.002357,0.951100,0.982886,2,1,uniform,"{'n_neighbors': 2, 'p': 1, 'weights': 'uniform'}",883,0.976190,...,0.950,0.983740,0.950,0.981030,1.000,0.978320,0.003801,0.001397,0.026732,0.004037
7,0.002309,0.001987,0.965770,1.000000,2,1,distance,"{'n_neighbors': 2, 'p': 1, 'weights': 'distance'}",342,1.000000,...,0.975,1.000000,0.950,1.000000,1.000,1.000000,0.000441,0.000690,0.022357,0.000000
8,0.002187,0.001619,0.938875,0.983971,2,2,uniform,"{'n_neighbors': 2, 'p': 2, 'weights': 'uniform'}",1092,1.000000,...,0.975,0.983740,0.950,0.986450,0.975,0.981030,0.000233,0.000166,0.040947,0.002840
9,0.002165,0.001513,0.965770,1.000000,2,2,distance,"{'n_neighbors': 2, 'p': 2, 'weights': 'distance'}",342,1.000000,...,0.975,1.000000,0.950,1.000000,1.000,1.000000,0.000488,0.000130,0.022357,0.000000
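
The full `cv_results_` table is wide, so it is often easier to keep a few columns and sort by rank. The sketch below does this on a small synthetic problem; `make_classification`, the reduced grid and the names `X_demo`, `y_demo`, `gs_demo` are assumptions for illustration, not the lab's data. In recent scikit-learn releases the train-score columns only appear when `return_train_score=True` is passed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

gs_demo = GridSearchCV(KNeighborsClassifier(),
                       param_grid={'n_neighbors': [1, 5, 15],
                                   'weights': ['uniform', 'distance']},
                       scoring='accuracy', cv=5)
gs_demo.fit(X_demo, y_demo)

# One row per candidate; keep the summary columns and show the best three
results = pd.DataFrame(gs_demo.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
top = results[cols].sort_values('rank_test_score').head(3)
print(top)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between the top candidates is larger than the fold-to-fold variation.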


In [11]:
y_preds_svm = grids[0].predict(X_test)
y_preds_knn = grids[1].predict(X_test)

In [12]:
print (classification_report(y_test, y_preds_svm), confusion_matrix(y_test, y_preds_svm))

             precision    recall  f1-score   support

          0       0.95      0.98      0.96       167
          1       0.96      0.93      0.94       107

avg / total       0.96      0.96      0.96       274
 [[163   4]
 [  8  99]]


In [13]:
print (classification_report(y_test, y_preds_knn), confusion_matrix(y_test, y_preds_knn))

             precision    recall  f1-score   support

          0       0.96      0.98      0.97       167
          1       0.96      0.94      0.95       107

avg / total       0.96      0.96      0.96       274
 [[163   4]
 [  6 101]]
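
To compare the two models on a single number, accuracy and the recall for the malignant class can be recomputed directly from the confusion matrices printed above (rows are true classes, columns are predicted classes). The helper name `summarize` is introduced here for illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs above
cm_svm = np.array([[163, 4], [8, 99]])
cm_knn = np.array([[163, 4], [6, 101]])

def summarize(cm):
    """Return (accuracy, recall for class 1) from a 2x2 confusion matrix."""
    accuracy = np.trace(cm) / cm.sum()       # correct predictions / all predictions
    recall_pos = cm[1, 1] / cm[1].sum()      # true positives / all actual positives
    return float(accuracy), float(recall_pos)

acc_svm, rec_svm = summarize(cm_svm)
acc_knn, rec_knn = summarize(cm_knn)
print(round(acc_svm, 4), round(rec_svm, 4))
print(round(acc_knn, 4), round(rec_knn, 4))
```

In this run the two models differ by two malignant cases out of 274 test rows, a small gap for a single random split, so it should not be read as strong evidence that one model is better.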


## Performance difference between Random Search and Grid Search

Given the following set of parameters:

        param_dist = {
                    "C": [0.1, 0.5, 0.8, 1, 10, 100, 1000],
                    "kernel": ['linear','rbf'],
                    "gamma": [0.1, 0.01, 0.001, 0.0001, 0.00001]
                     }

Implement a search for the best set of hyperparameters with both `GridSearchCV` and `RandomizedSearchCV`.
For each case, compare:

1. The execution time (using the `%%time` magic function)
2. The best combination of parameters
3. The accuracy of the best model on the test set we held out earlier


In [25]:
from sklearn.model_selection import RandomizedSearchCV

In [49]:
def busquedaGridsearch(params_):
    # Exhaustive search: every combination in params_ is evaluated with 10-fold CV
    gs = GridSearchCV(estimator=svm.SVC(), param_grid=params_, scoring='accuracy', cv=10, n_jobs=4)
    gs.fit(X_train, y_train)
    return gs

In [50]:
def busquedaRandomSearch(params_, iter_):
    # Random search: only iter_ combinations sampled from params_ are evaluated with 10-fold CV
    gs = RandomizedSearchCV(estimator=svm.SVC(), param_distributions=params_, scoring='accuracy', cv=10, n_jobs=4, n_iter=iter_)
    gs.fit(X_train, y_train)
    return gs

In [51]:
param_dist = {
            "C": [0.1, 0.5, 0.8, 1, 10, 100, 1000],
            "kernel": ['linear','rbf'],
            "gamma": [0.1, 0.01, 0.001, 0.0001, 0.00001]
             }


In [52]:
%%time
gs_random_search = busquedaRandomSearch(param_dist,10)        

CPU times: user 140 ms, sys: 44 ms, total: 184 ms
Wall time: 14.6 s


In [53]:
%%time
gs_grid_search = busquedaGridsearch(param_dist)   

CPU times: user 764 ms, sys: 48 ms, total: 812 ms
Wall time: 45.5 s
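
One way to read the timings above is to count the fits each search performs (ignoring the final refit). The sketch below uses `ParameterGrid` and `ParameterSampler`, the utilities that enumerate and sample candidates for the two search classes; `random_state=0` and the variable names are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_dist = {
    "C": [0.1, 0.5, 0.8, 1, 10, 100, 1000],
    "kernel": ['linear', 'rbf'],
    "gamma": [0.1, 0.01, 0.001, 0.0001, 0.00001],
}

grid_candidates = list(ParameterGrid(param_dist))
random_candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=0))

# gamma has no effect on a linear kernel, so many grid candidates are duplicates
distinct_models = {
    (c['kernel'], c['C'], c['gamma'] if c['kernel'] == 'rbf' else None)
    for c in grid_candidates
}

cv = 10
print(len(grid_candidates) * cv, len(random_candidates) * cv, len(distinct_models))
```

Because a single dictionary crosses every `gamma` with the linear kernel, the grid search spends part of its budget refitting equivalent linear models; splitting the grid by kernel, as in the first part of the lab, avoids that. Fit times also depend on the parameters themselves, so the wall-time ratio need not match the ratio of fit counts.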


In [54]:
gs_random_search.best_params_

{'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}

In [55]:
gs_grid_search.best_params_

{'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}

In [57]:
from sklearn.metrics import accuracy_score

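def obtener_performance(estimator):
    # accuracy_score expects (y_true, y_pred); normalize=True returns the fraction of correct predictions
    y_pred = estimator.predict(X_test)
    return accuracy_score(y_test, y_pred, normalize=True)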

In [58]:
obtener_performance(gs_grid_search.best_estimator_)

0.95620437956204385

In [59]:
obtener_performance(gs_random_search.best_estimator_)

0.96350364963503654