# MODULE 4: DATA SCIENCE AND PREDICTIVE ANALYTICS MODELS IN INDUSTRY 4.0

## 4 - Supervised learning

## Exercise on imbalanced data

Learning a robust, accurate supervised model is usually very hard when the dataset is heavily imbalanced, that is, when one class (or value of the target variable) accounts for a large volume of the data and another class has only a few examples.
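
A minimal sketch of why imbalance is a problem. The class counts (960 vs 40) are made-up numbers for illustration, not taken from the wine dataset. A "model" that always predicts the majority class reaches high accuracy while never detecting the minority class:

```python
import numpy as np

# Hypothetical labels: 960 majority (0) and 40 minority (1) examples
y_true = np.array([0] * 960 + [1] * 40)

# A trivial classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the minority class: detected positives / actual positives
minority_recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.96
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

This is why the rest of the notebook reports per-class precision, recall and F-score rather than accuracy alone.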

In this notebook we will work with the following concepts:
- Resampling techniques: oversampling (RandomOverSampler, SMOTE and ADASYN) and undersampling (RandomUnderSampler, TomekLinks and EditedNearestNeighbours)
- Supervised classification models: decision trees, kNN and neural networks
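
As a rough illustration of the two resampling families listed above, the sketch below resamples a toy dataset by hand with NumPy. It mirrors only the simplest techniques (random oversampling and random undersampling). The toy data, the seed and the helper name `random_resample` are assumptions made for this example; the notebook itself relies on the imbalanced-learn classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data: 90 majority (0) and 10 minority (1) rows
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

def random_resample(X, y, target_class, n_samples, replace):
    """Hypothetical helper: draw n_samples rows of one class at random."""
    idx = np.flatnonzero(y == target_class)
    chosen = rng.choice(idx, size=n_samples, replace=replace)
    return X[chosen], y[chosen]

# Oversampling: draw minority rows with replacement until both classes match
X_min_over, y_min_over = random_resample(X, y, 1, 90, replace=True)
X_over = np.vstack([X[y == 0], X_min_over])
y_over = np.concatenate([y[y == 0], y_min_over])

# Undersampling: keep a random subset of majority rows, without replacement
X_maj_under, y_maj_under = random_resample(X, y, 0, 10, replace=False)
X_under = np.vstack([X_maj_under, X[y == 1]])
y_under = np.concatenate([y_maj_under, y[y == 1]])

print(np.bincount(y_over))   # [90 90]
print(np.bincount(y_under))  # [10 10]
```

Oversampling keeps every original row but repeats minority rows, while undersampling discards majority rows and therefore information.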

The idea is to apply the different models to the imbalanced dataset before and after using the resampling techniques, and to compare the results.

The exercise consists of applying the same pipeline defined for the example to another provided dataset, and of proposing a new classification model.

References: https://www.jeremyjordan.me/imbalanced-data/

In [None]:
import warnings
# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np

Load the dataset we will work with, which describes the quality of a Portuguese wine. 
It is available at https://archive.ics.uci.edu/ml/datasets/wine+quality

In [None]:
# Wine quality dataset contains 12 features
# Target class derived as target: <=4 (score between 1 and 10)
df = pd.read_csv('wine_quality.csv')
df.head()

We prepare the data by separating the predictor variables, X, from the target variable to predict, y.

In [None]:
X, y = df.values[:,:-1], df.values[:,-1]
print(np.unique(y))
# convert the values of y to integers in {0, 1}
y = (y==1).astype(int)

#### Visualize the dataset

This lets us check visually how each class is distributed and see the imbalanced nature of the dataset. To do so, we apply a dimensionality reduction technique, PCA, and keep the first 2 principal components so that the data can be plotted in two dimensions.
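
For intuition, the projection onto the first two principal components can be reproduced with a singular value decomposition of the centered data. This is a sketch on random toy data (an assumption for illustration). The sign of each component is arbitrary, so a library implementation may return a mirrored projection.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))  # toy data with 5 features

# PCA works on centered data
X_centered = X_toy - X_toy.mean(axis=0)

# Rows of Vt are the principal directions, ordered by explained variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first 2 principal components
X_2d = X_centered @ Vt[:2].T

# Fraction of the total variance kept by each of the 2 components
explained_ratio = (S[:2] ** 2) / (S ** 2).sum()

print(X_2d.shape)  # (200, 2)
print(explained_ratio)
```

If the explained-variance ratios are small, the 2D scatter plot shows only part of the structure of the data, so classes that overlap in the plot do not necessarily overlap in the full feature space.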

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
import seaborn as sns

# Reduce dataset to 2 feature dimensions in order to visualize the data
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)

fig, ax = plt.subplots(1, 2, figsize= (15,5))

ax[0].scatter(X_reduced[y == 0, 0], X_reduced[y == 0, 1], label="low quality wine", alpha=0.2)
ax[0].scatter(X_reduced[y == 1, 0], X_reduced[y == 1, 1], label="high quality wine", alpha=0.2)
ax[0].set_title('PCA of original dataset')
ax[0].legend()

# Draw the class counts explicitly on the second subplot
sns.countplot(x=y, ax=ax[1])
ax[1].set_title('Number of observations per class')

#### Train test split

We split the dataset into training and test sets in order to validate the wine quality prediction model we want to build.
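
The `stratify=y` argument used in the split matters with imbalanced data: it keeps the class proportions roughly equal in both subsets, so the test set is not left with too few minority examples. A small sketch on synthetic labels (the data and the `random_state` value are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 10% minority class
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 900 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=0
)

# Both subsets keep (approximately) the original 10% minority share
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```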

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)

## Pipeline with different methods for handling imbalanced data

We will use the 'model_resampling_pipeline(...)' function to compare the different resampling methods. Specifically, we will use the following:
- oversampling: RandomOverSampler, SMOTE, ADASYN
- undersampling: RandomUnderSampler, TomekLinks, EditedNearestNeighbours
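
The core idea behind SMOTE can be sketched in a few lines: a synthetic minority point is created by interpolating between a minority sample and one of its nearest minority neighbours. The code below is a simplified illustration on toy data, not the imbalanced-learn implementation. The helper name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style interpolation between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its k neighbours for each new sample
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Random position on the segment between base point and neighbour
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 2))  # toy minority class
X_synthetic = smote_like_samples(X_minority, n_new=30)

print(X_synthetic.shape)  # (30, 2)
```

ADASYN follows the same interpolation idea but generates more synthetic points around minority samples that are harder to learn. TomekLinks and EditedNearestNeighbours, in contrast, only remove samples, so they generally do not produce a perfectly balanced dataset.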

In [None]:
from sklearn import metrics 
from collections import Counter

from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours


def model_resampling_pipeline(X_train, 
                              X_test, 
                              y_train, 
                              y_test, 
                              model):
    results = {'ordinary': {},
               'class_weight': {},
               'oversample': {},
               'undersample': {}}
    
    # ------ No balancing ------
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, predictions)
    precision, recall, fscore, support = metrics.precision_recall_fscore_support(y_test, predictions)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions).ravel()
    fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions, pos_label=1)
    auc = metrics.auc(fpr, tpr)
    
    results['ordinary'] = {'accuracy': accuracy, 'precision': precision, 'recall': recall, 
                          'fscore': fscore, 'n_occurences': support,
                          'predictions_count': Counter(predictions),
                          'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn,
                          'auc': auc}
    
    
    # ------ Class weight ------
    if 'class_weight' in model.get_params().keys():
        # Remember the original setting so it can be restored afterwards
        original_class_weight = model.get_params()['class_weight']
        model.set_params(class_weight='balanced')
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        accuracy = metrics.accuracy_score(y_test, predictions)
        precision, recall, fscore, support = metrics.precision_recall_fscore_support(y_test, predictions)
        tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions).ravel()
        fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions, pos_label=1)
        auc = metrics.auc(fpr, tpr)

        results['class_weight'] = {'accuracy': accuracy, 'precision': precision, 'recall': recall, 
                                  'fscore': fscore, 'n_occurences': support,
                                  'predictions_count': Counter(predictions),
                                  'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn,
                                  'auc': auc}

        # Restore the original setting so that the resampling runs below
        # are not also trained with balanced class weights
        model.set_params(class_weight=original_class_weight)

    
    # ------------ OVERSAMPLING TECHNIQUES ------------
    print('------ Oversampling methods ------')
    techniques = [RandomOverSampler(),
                  SMOTE(),
                  ADASYN()]
    
    for sampler in techniques:
        technique = sampler.__class__.__name__
        print(f'Technique: {technique}')
        print(f'Before resampling: {sorted(Counter(y_train).items())}')
        X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
        print(f'After resampling: {sorted(Counter(y_resampled).items())}')

        model.fit(X_resampled, y_resampled)
        predictions = model.predict(X_test)
        accuracy = metrics.accuracy_score(y_test, predictions)
        precision, recall, fscore, support = metrics.precision_recall_fscore_support(y_test, predictions)
        tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions).ravel()
        fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions, pos_label=1)
        auc = metrics.auc(fpr, tpr)

        results['oversample'][technique] = {'accuracy': accuracy, 
                                            'precision': precision, 
                                            'recall': recall,
                                            'fscore': fscore, 
                                            'n_occurences': support,
                                            'predictions_count': Counter(predictions),
                                            'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn,
                                            'auc': auc}

    
    # ------------ UNDERSAMPLING TECHNIQUES ------------
    print('------ Undersampling methods ------')
    techniques = [RandomUnderSampler(),                  
                  TomekLinks(),
                  EditedNearestNeighbours()]
    
    for sampler in techniques:
        technique = sampler.__class__.__name__
        print(f'Technique: {technique}')
        print(f'Before resampling: {sorted(Counter(y_train).items())}')
        X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
        print(f'After resampling: {sorted(Counter(y_resampled).items())}')

        model.fit(X_resampled, y_resampled)
        predictions = model.predict(X_test)
        accuracy = metrics.accuracy_score(y_test, predictions)
        precision, recall, fscore, support = metrics.precision_recall_fscore_support(y_test, predictions)
        tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions).ravel()
        fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions, pos_label=1)
        auc = metrics.auc(fpr, tpr)

        results['undersample'][technique] = {'accuracy': accuracy, 
                                            'precision': precision, 
                                            'recall': recall,
                                            'fscore': fscore, 
                                            'n_occurences': support,
                                            'predictions_count': Counter(predictions),
                                            'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn,
                                            'auc': auc}
        

    return results
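
The quantities stored by the pipeline are all derived from the confusion matrix. The following sketch works through the arithmetic with made-up counts (not results from the wine dataset). It also shows that, because the pipeline passes hard class predictions to `roc_curve`, the resulting AUC reduces to the average of recall and specificity, that is, the balanced accuracy.

```python
# Made-up confusion-matrix counts for the positive (minority) class
tp, fp, fn, tn = 30, 10, 20, 940

precision = tp / (tp + fp)                  # 30 / 40
recall = tp / (tp + fn)                     # 30 / 50
specificity = tn / (tn + fp)                # 940 / 950
fscore = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 970 / 1000

# With 0/1 predictions the ROC curve has a single interior point
# (1 - specificity, recall), so its area is the mean of both rates
auc_from_labels = (recall + specificity) / 2

print(f"precision={precision:.3f} recall={recall:.3f} fscore={fscore:.3f}")
print(f"accuracy={accuracy:.3f} auc={auc_from_labels:.3f}")
```

A ranking-based AUC would need continuous scores, for example from `predict_proba`, which the pipeline as written does not use.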

## Results visualization

To evaluate visually the results obtained by the models we are going to apply, we define the function 'evaluate_method(...)'.

In [None]:
def evaluate_method(results, 
                    method, 
                    metric_names=('precision', 'recall', 'fscore')):
    """Plot per-class metrics and AUC for one resampling family.

    `method` is 'oversample' or 'undersample'. The horizontal line marks the
    model trained without resampling; each bar is one balancing technique.
    """
    fig, ax = plt.subplots(1, 7, sharey=True, figsize=(16, 6))
    
    for i, metric in enumerate(metric_names):
        # Even columns show class 0, odd columns show class 1
        ax[i*2].axhline(results['ordinary'][metric][0], label='No Resampling')
        ax[i*2+1].axhline(results['ordinary'][metric][1], label='No Resampling')
        
        if results['class_weight']:
            ax[i*2].bar(0, results['class_weight'][metric][0], label='Adjust Class Weight')
            ax[i*2+1].bar(0, results['class_weight'][metric][1], label='Adjust Class Weight')
        
        for j, (technique, result) in enumerate(results[method].items()):
            ax[i*2].bar(j+1, result[metric][0], label=technique)
            ax[i*2+1].bar(j+1, result[metric][1], label=technique)
        
        ax[i*2].set_title(f'Low quality wine: \n{metric}')
        ax[i*2+1].set_title(f'High quality wine: \n{metric}')
    
    # Single legend, built from the first subplot once all its artists exist
    ax[0].legend(loc='upper center', bbox_to_anchor=(9, 1.01),
                 ncol=1, fancybox=True, shadow=True)
    
    # AUC vis
    ax[6].set_title('Area under curve')
    ax[6].axhline(results['ordinary']['auc'], label='No Resampling')
    if results['class_weight']:
        ax[6].bar(0, results['class_weight']['auc'], label='Adjust Class Weight')
    for j, (technique, result) in enumerate(results[method].items()):
        ax[6].bar(j+1, result['auc'], label=technique)

# Learning models

We will apply some simple supervised models, the ones seen in class:
- Decision trees: DecisionTreeClassifier
- kNN: KNeighborsClassifier
- Neural networks: MLPClassifier
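
Among these three models, only DecisionTreeClassifier exposes a `class_weight` parameter, so it is the only one for which the pipeline runs its class-weight branch. With `class_weight='balanced'`, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below evaluates that formula on made-up labels:

```python
import numpy as np

# Made-up labels: 950 majority (0) and 50 minority (1) examples
y_toy = np.array([0] * 950 + [1] * 50)

n_samples = len(y_toy)
n_classes = len(np.unique(y_toy))

# Documented formula for class_weight='balanced'
weights = n_samples / (n_classes * np.bincount(y_toy))

print(weights)  # class 0 is about 0.526, class 1 is 10.0
```

With these counts, each minority example weighs about 19 times as much as a majority example during training.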

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

## Decision trees

Visualize a decision tree trained on imbalanced data

In [None]:
model = DecisionTreeClassifier(max_depth=4)
model.fit(X_train, y_train)

# sklearn.externals.six is no longer shipped with scikit-learn; use the standard library
from io import StringIO
from IPython.display import Image
import pydot
from sklearn import tree

dot_data = StringIO()  
tree.export_graphviz(model, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph[0].create_png())
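
If Graphviz or pydot is not available, a fitted tree can also be inspected as plain text. This is a sketch using `sklearn.tree.export_text` on synthetic data generated with `make_classification`. The data, its class weights and the variable names are assumptions for illustration, not the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1
X_toy, y_toy = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.9, 0.1], random_state=0
)

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Text rendering of the fitted tree: one line per split or leaf
tree_text = export_text(toy_tree)
print(tree_text)
```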

In [None]:
model = DecisionTreeClassifier()
results = model_resampling_pipeline(X_train, X_test, y_train, y_test, model)

In [None]:
evaluate_method(results, 'oversample')

In [None]:
evaluate_method(results, 'undersample')

## kNN

In [None]:
model = KNeighborsClassifier()
results = model_resampling_pipeline(X_train, X_test, y_train, y_test, model)

In [None]:
evaluate_method(results, 'oversample')

In [None]:
evaluate_method(results, 'undersample')

## Neural networks

In [None]:
model = MLPClassifier(hidden_layer_sizes=(50, 50), activation='relu', solver='sgd')
results = model_resampling_pipeline(X_train, X_test, y_train, y_test, model)

In [None]:
evaluate_method(results, 'oversample')

In [None]:
evaluate_method(results, 'undersample')

# Exercise 1

### Do the same using another dataset and comment on the results

Load the dataset we will work with, which describes the crime index in the US. 
It is available at http://archive.ics.uci.edu/ml/datasets/communities+and+crime

In [None]:
# US crime dataset contains 100 features, descriptions found here: 
# Target class derived as target: >0.65
df = pd.read_csv('communities.csv')
df.head()

In [None]:
...

# Exercise 2

### Propose another supervised learning method and check whether it improves on the results obtained by the 3 proposed models

In [None]:
...