# Assignment 2 - Supervised classification in scikit-learn

## Data Mining 2017/2018 - Francisco Martínez Esteso

---

# Table of Contents

* [Introduction](#Introduction)

---
# Introduction

---
## Packages and libraries

We first load all the required packages and libraries:

In [None]:
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import seaborn as sns
sns.set(color_codes=True)

# GridSearch
from sklearn.model_selection import GridSearchCV

# KNN
from sklearn import neighbors

# Metrics
import sklearn.metrics as metrics

In [None]:
%matplotlib inline
mpl.rcParams["figure.figsize"] = "8, 4"
import warnings
warnings.simplefilter("ignore")

---
## Loading the data

We define an initial seed, which makes our experiments **reproducible**:

In [None]:
seed = 6342
np.random.seed(seed)

Now we import the **datasets** available on campusvirtual and split them into train and test sets:

In [None]:
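df = pd.read_csv("./data/pima.csv", dtype={"label": "category"})
# Separate the predictor attributes from the class label
# (the positional axis argument of drop() is not accepted by recent pandas versions)
dfAttributes = df.drop(columns="label")
dfLabel = df["label"]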

In [None]:
from sklearn.model_selection import train_test_split
train_atts, test_atts, train_label, test_label = train_test_split( 
    dfAttributes,
    dfLabel,
    test_size=0.2,
    random_state=seed,
    stratify=dfLabel)
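
The effect of `stratify` can be checked on a small example. The sketch below uses hypothetical toy data (100 samples with a 70/30 class ratio, not one of the course datasets) and an arbitrary `random_state`; it only illustrates that a stratified split keeps the class proportions in both partitions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 70 of class "a" and 30 of class "b"
X = [[i] for i in range(100)]
y = ["a"] * 70 + ["b"] * 30

# Same call pattern as above: 20% test, fixed seed, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Count how many samples of each class ended up in each partition
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts, test_counts)
```

Both partitions keep the 70/30 ratio of the full set, which matters when the classes are unbalanced.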

In [None]:
def definirDataset(datasetname):
    """Load ./data/<datasetname>.csv and return a stratified 80/20 train/test split."""
    df = pd.read_csv("./data/" + datasetname + ".csv", dtype={"label": "category"})
    # Separate the predictor attributes from the class label
    dfAttributes = df.drop(columns="label")
    dfLabel = df["label"]

    train_atts, test_atts, train_label, test_label = train_test_split(
        dfAttributes,
        dfLabel,
        test_size=0.2,
        random_state=seed,
        stratify=dfLabel)

    return train_atts, test_atts, train_label, test_label

In [None]:
definirDataset("pima")


---
# GridSearch

The **GridSearch** algorithm takes a grid of parameter values and performs an exhaustive search, evaluating every possible combination by cross-validation.
The parameters of GridSearch follow this scheme:
```
GridSearchCV(
    estimator,   # The learning algorithm to optimize
    param_grid,  # A dictionary with the parameter names and the values to consider
    scoring,     # The metric to optimize
    cv           # Number of folds in the cross-validation
)
```
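
To make "every possible combination" concrete, the following sketch enumerates a grid with scikit-learn's `ParameterGrid`. The second parameter, `weights`, and the chosen values are illustrative assumptions and are not part of the search run in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: 3 values of n_neighbors x 2 values of weights
param_grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}

# ParameterGrid expands the dictionary into every combination of values
combinations = list(ParameterGrid(param_grid))
for params in combinations:
    print(params)

# With 10-fold cross-validation, each combination is fitted once per fold
n_folds = 10
total_fits = len(combinations) * n_folds
print(total_fits)
```

The number of model fits grows multiplicatively with each parameter added to the grid, so grids are usually kept small.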

## KNN classifier

**KNN** is an easy-to-understand classification method whose main parameter, **k**, sets the number of nearest neighbours each sample is compared against.
Its parameters follow this scheme:
```
neighbors.KNeighborsClassifier(
    n_neighbors # number of neighbours used in the classification
)
```
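
The idea behind KNN can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: the helper `knn_predict` and the toy points are hypothetical, it assumes Euclidean distance, and it takes a simple majority vote among the k closest points.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (0.5, 0.5), k=3))
print(knn_predict(train_points, train_labels, (5.5, 5.5), k=3))
```

Small values of k follow the training data closely, while larger values smooth the decision boundary, which is why k is worth tuning.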

We therefore implement a cross-validation process with ... folds to optimize the value of k in the KNN algorithm.

In [None]:
clf = GridSearchCV(
    estimator = neighbors.KNeighborsClassifier(),
    param_grid = { 'n_neighbors' : [1,2,3,4,5] },
    scoring = 'accuracy',
    cv = 10
)

fitted = clf.fit(train_atts, train_label)

In [None]:
fitted.best_params_

In [None]:
# Mean cross-validation score of each parameter configuration
means = fitted.cv_results_['mean_test_score']
# Standard deviation of the score across folds
stds = fitted.cv_results_['std_test_score']
# Parameter values of each configuration
conf = fitted.cv_results_['params']

# Print the three values together (the interval shown is two standard deviations)
for mean, std, params in zip(means, stds, conf):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
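
The same results can also be inspected as a table. The sketch below is an alternative view and not part of the original workflow: it assumes scikit-learn's bundled iris data (`load_iris`) instead of the course CSV files, and loads `cv_results_` into a pandas DataFrame sorted by rank.

```python
import pandas as pd
from sklearn import neighbors
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

# Assumption: the bundled iris data stands in for the course datasets
X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 2, 3, 4, 5]},
    scoring="accuracy",
    cv=10,
).fit(X, y)

# cv_results_ is a dict of arrays, one entry per parameter configuration
results = pd.DataFrame(search.cv_results_)
summary = results[["param_n_neighbors", "mean_test_score",
                   "std_test_score", "rank_test_score"]]
summary = summary.sort_values("rank_test_score")
print(summary)
```

Sorting by `rank_test_score` puts the best configuration first, and the standard deviation column shows how stable each score is across folds.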

In [None]:
prediction = clf.predict(test_atts)

In [None]:
metrics.confusion_matrix(test_label, prediction)

In [None]:
metrics.accuracy_score(test_label, prediction)
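
The two metrics above are closely related: accuracy is the sum of the diagonal of the confusion matrix divided by the total number of samples. A minimal check on hypothetical labels (not the notebook's predictions):

```python
import sklearn.metrics as metrics

# Hypothetical true and predicted labels for five samples
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions lie on the diagonal
accuracy_from_cm = cm.trace() / cm.sum()
accuracy = metrics.accuracy_score(y_true, y_pred)
print(accuracy_from_cm, accuracy)
```

The off-diagonal entries show which classes are confused with each other, information that the single accuracy value hides.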

In [None]:
def KNNgridSearchCV(datasetname):

    # Use the split of the requested dataset rather than the global variables
    train_atts, test_atts, train_label, test_label = definirDataset(datasetname)

    clf = GridSearchCV(
        estimator = neighbors.KNeighborsClassifier(),
        param_grid = { 'n_neighbors' : [1,2,3,4,5] },
        scoring = 'accuracy',
        cv = 10
    )

    fitted = clf.fit(train_atts, train_label)

    # Best value of k found by the search
    print(fitted.best_params_['n_neighbors'])
    print("\n")

    # Mean cross-validation score of each parameter configuration
    means = fitted.cv_results_['mean_test_score']
    # Standard deviation of the score across folds
    stds = fitted.cv_results_['std_test_score']
    # Parameter values of each configuration
    conf = fitted.cv_results_['params']

    # Print the three values together
    for mean, std, params in zip(means, stds, conf):
        print("%0.3f (+/-%0.03f) for %r"
            % (mean, std * 2, params))
    print("\n")

    # Evaluate the best model on the held-out test set
    prediction = clf.predict(test_atts)

    print(metrics.confusion_matrix(test_label, prediction))
    print("\n")
    print(metrics.accuracy_score(test_label, prediction))

In [None]:
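def pintar(datasetname):
    # Load the requested dataset so the plots match the data being analysed
    train_atts, test_atts, train_label, test_label = definirDataset(datasetname)
    model = neighbors.KNeighborsClassifier(5)
    classifierPrintBoundaries(model, train_atts, train_label, test_atts, test_label)

In [None]:
KNNgridSearchCV("iris")
pintar("iris")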

In [None]:
def classifierPrintBoundaries(model, train_atts, train_label, test_atts, test_label):

    attsPair = [ (x,y) for x in train_atts.columns for y in train_atts.columns if x != y]
    
    for (att1_name, att2_name) in attsPair:
        
        xx, yy = np.meshgrid(np.arange(min(train_atts[att1_name])-1, max(train_atts[att1_name])+1, 0.05),
                             np.arange(min(train_atts[att2_name])-1, max(train_atts[att2_name])+1, 0.05))

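        mesh = pd.DataFrame({ 'x' : xx.ravel(), 'y' : yy.ravel() })
        
        cls = model.fit(train_atts[[att1_name, att2_name]], train_label)

        # Predict with the column names used in fit; recent scikit-learn versions check that feature names match
        Z = cls.predict(mesh.rename(columns={'x': att1_name, 'y': att2_name}))
        mesh = mesh.assign( label = pd.Categorical(Z, categories=train_label.cat.categories) )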

        colors = ["#4D73AB","#54A86F","#C44D54"]

        mesh = mesh.assign(colors = mesh.label.cat.codes.map(lambda x: colors[x]))
        colorBoundary = list(mesh.label.cat.codes.map(lambda x: colors[x]))
        colorObservations = list(test_label.cat.codes.map(lambda x: colors[x]))

        fig, ax = plt.subplots()
        # Plot using Seaborn
        sns.regplot(x='x', y='y', data=mesh,
                   fit_reg=False, 
                   scatter_kws={'color': colorBoundary})

        sns.regplot(x=att1_name, y=att2_name, data=test_atts,
                   fit_reg=False,
                   scatter_kws={'color': colorObservations,  'lw': 1, 'edgecolor':'#FFFFFF'})