# LAB: Tuning hyperparameters with `GridSearchCV` for SVC and KNN

## Introduction

The goal of this lab is to get you started tuning hyperparameters with cross-validation. To do that, we will use `GridSearchCV`.

We will use the breast cancer dataset we have already worked with. It contains information from clinical and cellular studies. The goal is to predict whether the cancer is benign (`class_t = 0`) or malignant (`class_t = 1`) from a set of cell-level predictors.

    + class_t is the target variable
    + the remaining variables take values normalized to a 1 to 10 scale

[Here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names) you can find more information about the dataset.

**Note:** 16 cases with missing values in some fields were removed from the original dataset.
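As a rough illustration of how such rows could be dropped, here is a minimal sketch. It assumes the raw file marks missing values with `?`; the three rows, the `_demo` names and the column list are hypothetical and only mimic the layout of the real file.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the raw layout: 11 comma-separated fields, no header.
# The '?' in the second row stands for a missing bare_nuclei value (assumed encoding).
raw_demo = io.StringIO(
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "2,8,4,5,1,2,?,7,3,1,4\n"
    "3,8,10,10,8,7,10,9,7,1,4\n"
)
cols_demo = ['ID', 'clump_Thickness', 'unif_cell_size', 'unif_cell_shape', 'adhesion',
             'epith_cell_Size', 'bare_nuclei', 'bland_chromatin', 'norm_nucleoli',
             'mitoses', 'class_t']

# na_values='?' turns the placeholder into NaN so that dropna() can discard the row
sample_demo = pd.read_csv(raw_demo, header=None, names=cols_demo, na_values='?')
clean_demo = sample_demo.dropna().reset_index(drop=True)
print(len(sample_demo), '->', len(clean_demo))  # prints: 3 -> 2
```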

## Tasks

For this lab you must:

1. Build two classifiers: Support Vector Machines and K-Nearest Neighbors (KNN)
2. Tune the hyperparameters of each model

**2.1 SVM:** tune one model with a linear kernel and C = 1, 10, 100, 1000; and another with an rbf kernel, C = 1, 10, 100, 1000 and [gamma](http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html) = 0.1, 0.01, 0.001, 0.0001, 0.00001

**2.2 KNN:** tune both the number of neighbors k (`n_neighbors`) and the weighting given to the k neighbors (`weights`: uniform or distance). You could also try the parameter `p`, which defines the type of distance used to find the nearest neighbors.
      
3. Fit the final models
4. Evaluate which of the two performs better

**Important:** remember that you must carefully design the validation strategy for each stage of model estimation.
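One possible way to organize those stages is sketched here, under stated assumptions: the data is synthetic (`make_classification` stands in for the lab dataset), the grid is deliberately tiny, and all `_demo` names are hypothetical. The idea is that cross-validation inside `GridSearchCV` only ever sees the training part, while the held-out test set is used once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 9 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

# Stage 1: hold out a test set that the hyperparameter search never sees
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo)

# Stage 2: choose hyperparameters by cross-validation on the training part only
search_demo = GridSearchCV(SVC(kernel='rbf'),
                           {'C': [1, 10], 'gamma': [0.1, 0.01]},
                           cv=5, scoring='accuracy')
search_demo.fit(X_tr, y_tr)

# Stage 3: refit=True (the default) has already retrained the best model on all of X_tr;
# the held-out test set is used once, for the final estimate
cv_accuracy = search_demo.best_score_
test_accuracy = search_demo.score(X_te, y_te)
print(f"CV accuracy: {cv_accuracy:.3f}  test accuracy: {test_accuracy:.3f}")
```

The cross-validated score is what guides model selection, so it tends to be optimistic; the test score is the one to report when comparing the final models.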

We import the required packages


In [2]:
from sklearn import svm, linear_model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import normalize

We import the dataset

In [3]:
df = pd.read_csv('../Data/breast-cancer.csv', header = None)
df.columns = ['ID', 'clump_Thickness', 'unif_cell_size', 'unif_cell_shape', 'adhesion', 'epith_cell_Size', 'bare_nuclei',
              'bland_chromatin ','norm_nucleoli', 'mitoses', 'class_t']

We recode the classes as "0" and "1"

In [4]:
# Recode with .loc to avoid chained assignment: 2 (benign) -> 0, 4 (malignant) -> 1
df.loc[df['class_t'] == 2, 'class_t'] = 0
df.loc[df['class_t'] == 4, 'class_t'] = 1

We split the data into target and features

In [22]:
X = df[['clump_Thickness', 'unif_cell_size', 'unif_cell_shape', 'adhesion', 'epith_cell_Size', 'bare_nuclei',
              'bland_chromatin ','norm_nucleoli', 'mitoses']]
y = df['class_t']

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

**Hint 1:** It helps to build two lists, one with the model estimators and another with the parameter grid to search for each model.

**Hint 2:** It helps to iterate over those lists to tune the hyperparameters of each model.
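The two hints could be combined roughly as follows. This is only a sketch: the data is synthetic, the grids are shortened so it runs quickly, and the `_demo` names are hypothetical. In the lab you would fit on `X_train` and `y_train` with the full grids from the tasks.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

# List 1: the estimators. List 2: one (shortened) parameter grid per estimator.
estimators_demo = [SVC(), KNeighborsClassifier()]
param_grids_demo = [
    {'kernel': ['rbf'], 'C': [1, 10], 'gamma': [0.01, 0.001]},
    {'n_neighbors': [3, 5, 7]},
]

# Iterate over both lists in parallel and keep each fitted search by model name
searches_demo = {}
for estimator, grid in zip(estimators_demo, param_grids_demo):
    search = GridSearchCV(estimator, grid, cv=5, scoring='accuracy')
    search.fit(X_demo, y_demo)
    searches_demo[type(estimator).__name__] = search

for name, search in searches_demo.items():
    print(name, search.best_params_, round(search.best_score_, 3))
```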

In [28]:
svm_param_grid_list = [
    {
        'kernel': ['linear'],
        'C': [1, 10, 100, 1000]
    },
    {
        'kernel': ['rbf'],
        'C': [1, 10, 100, 1000],
        'gamma': [0.1, 0.01, 0.001, 0.0001, 0.00001]
    }
]

In [29]:
svm_c = svm.SVC()

In [30]:
smv_grid = GridSearchCV(svm_c, svm_param_grid_list, cv=10, scoring='accuracy')

In [31]:
smv_grid.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}, {'kernel': ['rbf'], 'C': [1, 10, 100, 1000], 'gamma': [0.1, 0.01, 0.001, 0.0001, 1e-05]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [32]:
pd.DataFrame(smv_grid.cv_results_)

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_C,param_gamma,param_kernel,params,rank_test_score,split0_test_score,...,split7_test_score,split7_train_score,split8_test_score,split8_train_score,split9_test_score,split9_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.003851,0.000408,0.964989,0.974955,1,,linear,"{'C': 1, 'kernel': 'linear'}",12,0.978723,...,0.977778,0.978155,0.911111,0.985437,0.977778,0.973301,0.000914,3.7e-05,0.0224,0.004085
1,0.0153,0.000486,0.964989,0.974955,10,,linear,"{'C': 10, 'kernel': 'linear'}",12,0.978723,...,0.977778,0.978155,0.911111,0.985437,0.977778,0.973301,0.006938,5.8e-05,0.0224,0.004085
2,0.138362,0.00057,0.962801,0.974955,100,,linear,"{'C': 100, 'kernel': 'linear'}",15,0.978723,...,0.955556,0.978155,0.911111,0.985437,0.977778,0.973301,0.091966,1.9e-05,0.022127,0.004085
3,1.203065,0.000577,0.962801,0.974955,1000,,linear,"{'C': 1000, 'kernel': 'linear'}",15,0.978723,...,0.955556,0.978155,0.911111,0.985437,0.977778,0.973301,0.835272,2e-05,0.022127,0.004085
4,0.005425,0.000673,0.95186,0.994895,1,0.1,rbf,"{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}",18,0.93617,...,0.955556,0.995146,0.933333,0.995146,1.0,0.992718,0.000118,1e-05,0.025011,0.001305
5,0.002427,0.000449,0.973742,0.978603,1,0.01,rbf,"{'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}",1,1.0,...,0.977778,0.975728,0.933333,0.985437,0.977778,0.980583,6.7e-05,1.6e-05,0.019202,0.002842
6,0.002518,0.00046,0.969365,0.974713,1,0.001,rbf,"{'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}",5,1.0,...,0.977778,0.968447,0.933333,0.980583,0.977778,0.978155,6.7e-05,2.3e-05,0.022299,0.003956
7,0.004528,0.000661,0.969365,0.969365,1,0.0001,rbf,"{'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}",5,1.0,...,0.977778,0.968447,0.933333,0.973301,0.977778,0.968447,7.7e-05,2e-05,0.024401,0.002714
8,0.006889,0.00084,0.660832,0.660832,1,1e-05,rbf,"{'C': 1, 'gamma': 1e-05, 'kernel': 'rbf'}",24,0.659574,...,0.666667,0.660194,0.666667,0.660194,0.666667,0.660194,0.000288,3.1e-05,0.006303,0.000699
9,0.005426,0.000674,0.95186,1.0,10,0.1,rbf,"{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}",18,0.93617,...,0.955556,1.0,0.933333,1.0,1.0,1.0,5.5e-05,2e-05,0.025011,0.0


In [33]:
smv_grid.best_estimator_, smv_grid.best_score_, smv_grid.best_params_

(SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 0.97374179431072205,
 {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'})
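Once the search has picked a model, it can be checked on the held-out test set with the `confusion_matrix` and `classification_report` functions imported earlier. The sketch below assumes synthetic data and uses a hypothetical `best_demo` model as a stand-in for `smv_grid.best_estimator_`, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Stand-in for smv_grid.best_estimator_: an SVC with the parameters selected above
best_demo = SVC(kernel='rbf', C=1, gamma=0.01).fit(X_tr, y_tr)

y_pred = best_demo.predict(X_te)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class
print(cm)
print(classification_report(y_te, y_pred))
```

In the notebook itself, the equivalent call would be `smv_grid.best_estimator_.predict(X_test)` compared against `y_test`.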

### KNN

In [3]:
knn = KNeighborsClassifier()

In [4]:
KNeighborsClassifier?

In [None]:
knn_param_grid = {
    
}

## Performance difference between Random Search and Grid Search

Given the following parameter set:

        param_dist = {
                    "C": [0.1, 0.5, 0.8, 1, 10, 100, 1000],
                    "kernel": ['linear','rbf'],
                    "gamma": [0.1, 0.01, 0.001, 0.0001, 0.00001]
                     }

Implement a search for the best hyperparameter combination with both `GridSearchCV` and `RandomizedSearchCV`, trying different numbers of iterations for the latter.
<br/>
For each case, compare:
    
    1. The execution time (using the %%time magic function)
    2. The best parameter combination
    3. The accuracy of the best model in each case on the test set we set aside earlier
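A possible skeleton for this comparison is sketched below. Several things in it are assumptions made so that it runs as a plain script: the data is synthetic, `time.perf_counter` replaces the `%%time` magic, `max_iter` caps the SVC solver to keep the run short, and the `n_iter` values and `_demo` names are arbitrary.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), kept small and well separated so fits are quick
X_demo, y_demo = make_classification(n_samples=200, n_features=9, n_clusters_per_class=1,
                                     class_sep=2.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

param_dist = {
    "C": [0.1, 0.5, 0.8, 1, 10, 100, 1000],
    "kernel": ['linear', 'rbf'],
    "gamma": [0.1, 0.01, 0.001, 0.0001, 0.00001],
}

def timed_fit(search):
    """Fit the search on the training split and return the elapsed seconds."""
    start = time.perf_counter()
    search.fit(X_tr, y_tr)
    return time.perf_counter() - start

rows = []

# Exhaustive search: every one of the 7 * 2 * 5 = 70 combinations is evaluated.
# max_iter caps the solver so the sketch stays fast (it may emit a ConvergenceWarning).
grid_demo = GridSearchCV(SVC(max_iter=10000), param_dist, cv=3, scoring='accuracy')
seconds = timed_fit(grid_demo)
rows.append(('grid', seconds, grid_demo.best_params_, grid_demo.score(X_te, y_te)))

# Randomized search: only n_iter combinations are sampled from the same lists
for n_iter in (5, 20):
    rnd_demo = RandomizedSearchCV(SVC(max_iter=10000), param_dist, n_iter=n_iter,
                                  cv=3, scoring='accuracy', random_state=0)
    seconds = timed_fit(rnd_demo)
    rows.append((f'random(n_iter={n_iter})', seconds,
                 rnd_demo.best_params_, rnd_demo.score(X_te, y_te)))

for name, seconds, params, test_acc in rows:
    print(f"{name:20s} {seconds:6.2f}s  test accuracy={test_acc:.3f}  {params}")
```

Note that `gamma` has no effect on the linear kernel, so with this single dictionary the exhaustive search repeats equivalent linear fits once per `gamma` value; the list of two dictionaries used for the SVM grid earlier avoids that redundancy.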
