# Modeling: XGBoost

In [None]:
! jupyter nbconvert --to html 5_Modelado_XGB.ipynb

[NbConvertApp] Converting notebook 5_Modelado_XGB.ipynb to html
[NbConvertApp] Writing 726086 bytes to 5_Modelado_XGB.html


In [None]:
import pandas as pd
import collections
from typing import List, Dict, Union, Tuple

import matplotlib.pyplot as plt
import numpy as np
from IPython.display import clear_output
from sklearn import metrics
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
# plot_confusion_matrix is unused here and was removed in scikit-learn 1.2
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier


# Type alias for the supported classifiers
ModelRegressor = Union[SVC, KNeighborsClassifier, DecisionTreeClassifier,
                       GaussianNB, MultinomialNB, ComplementNB,
                       LogisticRegression, XGBClassifier]



%matplotlib inline

pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None  # default='warn'
from IPython.display import display

import os
import sys

# Add my custom Python library to the import path
module_path = os.path.abspath(os.path.join(os.getcwd().replace('notebooks', 'src')))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import rain
import importlib

def reload():
    """Reload the custom modules."""
    libs_list = [rain]
    for lib in libs_list:
        importlib.reload(lib)

    print("Reload complete")

## Notebook objective

In this notebook I use the **feature importance** scores that XGBoost computes automatically when a model is trained.

<br>

XGBoost provides several scores for measuring feature importance:
- total_gain
- total_cover
- weight
- gain
- cover

Each score ranks the features by importance, and the rankings can differ from one score to another.
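
As a rough illustration of how the five scores relate, the sketch below recomputes them from a few made-up split records. In XGBoost, `weight` counts the splits that use a feature, `total_gain` and `total_cover` sum the gain and cover of those splits, and `gain` and `cover` are the per-split averages. The `splits` data and the `importance_scores` helper are hypothetical and are not part of XGBoost.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical split records: (feature, gain of the split, cover of the split)
splits: List[Tuple[str, float, float]] = [
    ("Humidity3pm", 120.0, 900.0),
    ("Humidity3pm", 80.0, 400.0),
    ("Sunshine", 30.0, 500.0),
]


def importance_scores(
        records: List[Tuple[str, float, float]]) -> Dict[str, Dict[str, float]]:
    """Aggregate split records into the five importance scores per feature."""
    weight: Dict[str, int] = defaultdict(int)
    total_gain: Dict[str, float] = defaultdict(float)
    total_cover: Dict[str, float] = defaultdict(float)

    for feature, gain, cover in records:
        weight[feature] += 1            # number of splits using the feature
        total_gain[feature] += gain     # summed gain over those splits
        total_cover[feature] += cover   # summed cover over those splits

    return {
        feature: {
            "weight": weight[feature],
            "total_gain": total_gain[feature],
            "total_cover": total_cover[feature],
            "gain": total_gain[feature] / weight[feature],    # average gain
            "cover": total_cover[feature] / weight[feature],  # average cover
        }
        for feature in weight
    }


scores = importance_scores(splits)
print(scores["Humidity3pm"])
```

Because `gain` is an average and `total_gain` is a sum, a feature used in many small splits can rank high on one and low on the other, which is why the notebook tries all five rankings.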

Next, for each importance ranking, I iteratively build a series of datasets from the most important variables: train on all ranked features, drop the least important one, and repeat.

For example, if the importance ranking is [Variable_C, Variable_D, Variable_A] (most to least important), I first build a dataset with all three variables and evaluate the model. I then remove the least important variable, leaving [Variable_C, Variable_D], and continue until no variables remain.

With this procedure I test several classification algorithms and check whether model performance improves; if it does not, it will be better to change the approach to analyzing the data.
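
The elimination scheme described above can be sketched as a small generator. This is a minimal illustration, assuming the ranking is sorted from least to most important (the order the notebook uses later); the `backward_subsets` name is hypothetical.

```python
from typing import Iterator, List


def backward_subsets(ranked_features: List[str]) -> Iterator[List[str]]:
    """Yield shrinking feature subsets, dropping the least important first.

    `ranked_features` must be sorted from least to most important.
    """
    remaining = list(ranked_features)  # copy so the caller's list is untouched
    while remaining:
        yield list(remaining)
        remaining.pop(0)  # remove the current least important feature


ranked = ["Variable_A", "Variable_D", "Variable_C"]  # least -> most important
subsets = list(backward_subsets(ranked))
for subset in subsets:
    print(len(subset), subset)
```

A ranking of N features therefore produces N candidate datasets, not every possible combination of features.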

<br>


In [None]:
# Columns to drop before modeling
drop_columns = ['MinTemp', 'Pressure3pm', 'Temp9am', 'Temp3pm', # Multicollinearity
                'WindGustDir', 'WindDir9am', 'WindDir3pm', # High cardinality
                'Location', 'Date', 'month', 'year', # Not useful for the model
                'month_cos', # month_sin performs better
                # 'Sunshine', 'Evaporation', 'Cloud3pm', 'Cloud9am' # >35% missing data
                
               ]

In [None]:
# As usual, start by reading and preprocessing the data.
# Since I have not yet developed a process for handling the cardinality
# of some variables, I exclude them through the cardinality_threshold parameter
X_train, y_train, X_test, y_test = rain.pipline_process_data(drop_columns)

### Obtaining the most important variables with total gain

In [None]:
model = XGBClassifier()
model.fit(X_train, y_train)





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
importance_type = 'total_gain'
imp_scores_d = model.get_booster().get_score(
        importance_type=importance_type)

In [None]:
imp_scores_d

{'Humidity3pm': 45750.850255763435,
 'Sunshine': 8489.55127910801,
 'WindGustSpeed': 9039.062232533392,
 'Pressure9am': 9172.011989376006,
 'WindSpeed9am': 1709.6892495027,
 'MaxTemp': 5033.241020737997,
 'Rainfall': 4734.679244640002,
 'Humidity9am': 3253.201954278399,
 'Cloud3pm': 2670.357133191199,
 'WindSpeed3pm': 2039.0384879803507,
 'month_sin': 1743.6139687670006,
 'Evaporation': 2088.9761646969987,
 'Cloud9am': 1063.3254668859997}

This gives importance scores for 13 of the 14 columns in the dataset; `RainToday_Yes` is absent because `get_score` omits features that are not used in any split. What remains is to sort them in ascending order:

In [None]:
sorted_imp = sorted(imp_scores_d.items(), key=lambda kv: kv[1])
sorted_dict = collections.OrderedDict(sorted_imp)
sorted_dict

OrderedDict([('Cloud9am', 1063.3254668859997),
             ('WindSpeed9am', 1709.6892495027),
             ('month_sin', 1743.6139687670006),
             ('WindSpeed3pm', 2039.0384879803507),
             ('Evaporation', 2088.9761646969987),
             ('Cloud3pm', 2670.357133191199),
             ('Humidity9am', 3253.201954278399),
             ('Rainfall', 4734.679244640002),
             ('MaxTemp', 5033.241020737997),
             ('Sunshine', 8489.55127910801),
             ('WindGustSpeed', 9039.062232533392),
             ('Pressure9am', 9172.011989376006),
             ('Humidity3pm', 45750.850255763435)])

In [None]:
X_train.shape

(106644, 14)

In [None]:
# And get a list with the variable names
print([key for key in sorted_dict.keys()])

['Cloud9am', 'WindSpeed9am', 'month_sin', 'WindSpeed3pm', 'Evaporation', 'Cloud3pm', 'Humidity9am', 'Rainfall', 'MaxTemp', 'Sunshine', 'WindGustSpeed', 'Pressure9am', 'Humidity3pm']
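
The ranking holds 13 names while `X_train` has 14 columns. Since `get_score` leaves out features that no tree uses in a split, a quick comparison shows which column received no score. The column list is copied from the `X_train.head()` output shown later in this notebook; the `unscored_features` helper is hypothetical.

```python
from typing import List

# Columns of X_train, as shown by X_train.head()
columns = ['MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed',
           'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
           'Pressure9am', 'Cloud9am', 'Cloud3pm', 'month_sin', 'RainToday_Yes']

# Features ranked by total_gain, as printed above
ranked = ['Cloud9am', 'WindSpeed9am', 'month_sin', 'WindSpeed3pm',
          'Evaporation', 'Cloud3pm', 'Humidity9am', 'Rainfall', 'MaxTemp',
          'Sunshine', 'WindGustSpeed', 'Pressure9am', 'Humidity3pm']


def unscored_features(all_columns: List[str], scored: List[str]) -> List[str]:
    """Return the columns that have no importance score, in column order."""
    scored_set = set(scored)
    return [col for col in all_columns if col not in scored_set]


missing = unscored_features(columns, ranked)
print(missing)  # ['RainToday_Yes']
```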


I train the models iteratively: the first iteration uses all N ranked features, the next N - 1, and so on. Here that means training with 13 features, then 12, down to 1, always removing the least important variable.

In [None]:
def get_features_by_xgb_importance(
        model: XGBClassifier, importance_type: str) -> List:
    """Returns a list with the features sorted by importance"""

    imp_scores_d = model.get_booster().get_score(
        importance_type=importance_type)
    sorted_imp = sorted(imp_scores_d.items(), key=lambda kv: kv[1])
    sorted_dict = collections.OrderedDict(sorted_imp)

    return [key for key in sorted_dict.keys()]

In [None]:
def estimate_score_metrics(y_test: pd.Series,
                           y_pred: np.ndarray,
                           y_prob: np.ndarray
                           ) -> Tuple[float, float, int, int, int, int]:
    """Returns the following evaluation metrics: ROC, ROC_AUC,
    F1-score, Recall, Accuracy, Brier"""
    # 'ROC' is the AUC computed on hard labels; 'ROC_AUC' uses probabilities
    roc = round(metrics.roc_auc_score(y_test, y_pred), 2)
    roc_auc = round(metrics.roc_auc_score(y_test, y_prob), 2)

    # Percentages rounded to whole numbers
    f1 = round(metrics.f1_score(y_test, y_pred) * 100)
    recall = round(metrics.recall_score(y_test, y_pred) * 100)
    accuracy = round(metrics.accuracy_score(y_test, y_pred) * 100)
    # NOTE: scored on hard labels, so this equals the error rate
    # (100 - accuracy); pass y_prob to get a probabilistic Brier score
    brier = round(metrics.brier_score_loss(y_test, y_pred) * 100)

    return roc, roc_auc, f1, recall, accuracy, brier
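
One detail of the function above is worth illustrating: the `Brier` column is computed from hard predictions, so it matches the misclassification rate (in the results table further down, Brier and Accuracy always add up to 100). The sketch below uses a hand-written `brier_score` helper and made-up labels and probabilities to compare both ways of scoring; it is an illustration, not a replacement for `metrics.brier_score_loss`.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Mean squared difference between outcomes and predicted scores."""
    return sum((s - t) ** 2 for t, s in zip(y_true, y_score)) / len(y_true)


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.8, 0.4]             # hypothetical predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]  # hard labels at a 0.5 threshold

# Fraction of misclassified samples
error_rate = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

brier_hard = brier_score(y_true, y_pred)  # identical to the error rate
brier_prob = brier_score(y_true, y_prob)  # rewards calibrated probabilities

print(brier_hard, error_rate)
print(brier_prob)
```

Scoring the probabilities instead would make the column carry information that Accuracy does not already provide.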

In [None]:
def get_total_iterations(model, importance_types_list: List) -> int:
    """Returns the total of iterations for the modeling by xgb
    feature importance"""
    no_elements = 1
    for imp_type in importance_types_list:
        features = get_features_by_xgb_importance(
            model=model, importance_type=imp_type)

        while len(features) > 0:
            _ = features.pop(0)
            no_elements += 1

    return no_elements

In [None]:
def modeling_by_subset(
        model: ModelRegressor,
        x_train: pd.DataFrame,
        y_train: Union[pd.Series, pd.DataFrame],
        features: List) -> ModelRegressor:
    """Fits the model on the given feature subset and returns it"""
    return model.fit(x_train[features], y_train)


def predict_by_subset(
        predictor: ModelRegressor,
        x_test: pd.DataFrame,
        features: List) -> Tuple[np.ndarray, np.ndarray]:
    """Returns the predicted labels and the positive-class probabilities
    (rounded to 2 decimals) for the given feature subset"""
    x_test_subset = x_test[features]
    y_pred = predictor.predict(x_test_subset)
    y_prob = np.around(predictor.predict_proba(x_test_subset)[:, 1], 2)

    return y_pred, y_prob


def modeling_by_xgb_importance(
        model_name: str,
        model: ModelRegressor,
        x_train: pd.DataFrame,
        y_train: pd.DataFrame,
        x_test: pd.DataFrame,
        y_test) -> pd.DataFrame:
    """Returns a dataframe of model scores, using xgboost feature
    importance to select the best features
    """
    gral_model = XGBClassifier(n_jobs=-1)
    gral_model_fitted = gral_model.fit(x_train, y_train)
    imp_types_lst = ['total_gain', 'total_cover', 'weight', 'gain', 'cover']

    no_elements = get_total_iterations(gral_model_fitted, imp_types_lst)
    count = 1
    row_lst: List[np.array] = []
    row_array = np.array(row_lst)

    for importance_type in imp_types_lst:
        features_list = get_features_by_xgb_importance(
            model=gral_model_fitted, importance_type=importance_type)

        while len(features_list) > 0:

            predictor = modeling_by_subset(model=clone(model),
                                           x_train=x_train,
                                           y_train=y_train,
                                           features=features_list)

            y_pred, y_prob = predict_by_subset(predictor=predictor,
                                               x_test=x_test,
                                               features=features_list)

            score_metrics = estimate_score_metrics(
                y_test=y_test, y_pred=y_pred, y_prob=y_prob)

            row_lst.append(np.array([
                model_name,
                'model_' + str(count),
                len(features_list),
                *score_metrics,
                importance_type,
                ','.join(features_list)]))

            count += 1
            _ = features_list.pop(0)

            txt = 'of ' + ' Modeling with ' + str(model_name) + ' :'
            update_progress(count / no_elements, progress_text=txt)

        row_array = np.array(row_lst)

    # Column names and dtypes of the results dataframe
    cols_dict = {
        'Models': str, 'Id': str, 'No_features': int, 'ROC': float,
        'ROC_AUC': float, 'F1': float, 'Recall': float, 'Accuracy': float,
        'Brier': float, 'Importance_type': str, 'Best_features': str,}

    cols_name = [names for names in cols_dict]
    df = pd.DataFrame(data=row_array, columns=cols_name)
    df = df.astype(cols_dict)

    df = df.sort_values(['F1', 'ROC_AUC', 'No_features'],
                        ascending=False).reset_index(drop=True)

    clear_output(wait=False)

    return df

In [None]:
def update_progress(progress, progress_text=''):
    """ Print the progress of a 'FOR' inside a function """

    bar_length = 40
    if isinstance(progress, int):
        progress = float(progress)
    if not isinstance(progress, float):
        progress = 0
    progress = max(progress, 0)
    progress = min(progress, 1)
    block = int(round(bar_length * progress))
    clear_output(wait=True)
    text = ' '.join(['Progress', progress_text, '[{0}] {1:.1f}%'])
    ouput_text = text.format("#" * block + "-" * (bar_length - block),
                             progress * 100)

    print(ouput_text)

### Initial test with Naive Bayes

In [None]:
X_train, y_train, X_test, y_test = rain.pipline_process_data(drop_columns)

In [None]:
X_train.head()

Unnamed: 0,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Cloud9am,Cloud3pm,month_sin,RainToday_Yes
45,1.445476,-0.441505,-0.095563,0.173866,0.381396,0.772517,1.240342,-1.991449,-1.208374,-1.54607,0.154836,0.128999,0.692809,0
123922,-0.280233,-0.441505,-0.095563,0.173866,-0.564591,-0.581357,-0.702506,0.381944,-0.538072,-0.013757,0.154836,0.128999,1.212339,0
116523,-0.34132,-0.441505,-0.095563,0.444504,-0.048598,0.403279,0.463203,-0.522206,-0.899004,0.952267,-2.025007,0.128999,-0.016881,0
80142,1.873085,-0.441505,0.974961,1.17909,-0.220596,-0.827516,-0.961552,0.890528,-1.569306,-0.513424,-1.589039,-0.359097,1.212339,0
17559,0.69716,-0.441505,-0.095563,0.173866,-0.048598,-0.581357,0.074633,0.212416,0.029106,-0.013757,0.154836,0.128999,-0.016881,0


In [None]:
X_train.shape

(106644, 14)

In [None]:
X_test.head()

Unnamed: 0,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Cloud9am,Cloud3pm,month_sin,RainToday_Yes
14881,1.750911,-0.441505,-0.095563,0.173866,-0.220596,1.264835,-0.961552,-0.861262,-1.517744,0.469255,0.154836,0.128999,-0.016881,0
49181,-0.891103,-0.441505,-0.095563,0.173866,-0.736589,-0.089039,-1.479645,-0.465697,-0.744319,1.88498,0.154836,0.128999,0.692809,0
91010,0.483355,3.371542,-0.095563,0.173866,1.413382,-0.089039,0.074633,1.342602,1.833765,-1.679315,0.154836,0.128999,0.692809,1
90944,0.193192,1.963648,-0.095563,0.173866,0.725391,2.003312,1.628912,0.325434,1.111902,-0.26359,0.154836,0.128999,-0.726572,1
32468,-0.616211,-0.441505,-1.433717,-0.483395,-1.166583,-0.089039,-0.443459,1.399112,0.544723,1.88498,1.462742,-0.847194,-1.246102,0


In [None]:
X_test.shape

(35549, 14)

Here is an example with the **Naive Bayes** classification algorithm:

In [None]:
df2 = modeling_by_xgb_importance(model_name='NAIVE BAYES', model=GaussianNB(),
                                 x_train=X_train, y_train=y_train,
                                 x_test=X_test, y_test=y_test)

In [None]:
df2

Unnamed: 0,Models,Id,No_features,ROC,ROC_AUC,F1,Recall,Accuracy,Brier,Importance_type,Best_features
0,NAIVE BAYES,model_34,6,0.73,0.83,58.0,57.0,81.0,19.0,weight,"WindGustSpeed,Sunshine,Humidity9am,Humidity3pm..."
1,NAIVE BAYES,model_1,13,0.73,0.82,58.0,62.0,79.0,21.0,total_gain,"Cloud9am,WindSpeed9am,month_sin,WindSpeed3pm,E..."
2,NAIVE BAYES,model_14,13,0.73,0.82,58.0,62.0,79.0,21.0,total_cover,"month_sin,Cloud9am,Cloud3pm,WindSpeed9am,WindS..."
3,NAIVE BAYES,model_27,13,0.73,0.82,58.0,62.0,79.0,21.0,weight,"Cloud9am,Cloud3pm,month_sin,Rainfall,WindSpeed..."
4,NAIVE BAYES,model_40,13,0.73,0.82,58.0,62.0,79.0,21.0,gain,"Evaporation,WindSpeed9am,Cloud9am,WindSpeed3pm..."
...,...,...,...,...,...,...,...,...,...,...,...
60,NAIVE BAYES,model_26,1,0.67,0.79,49.0,38.0,82.0,18.0,total_cover,Humidity3pm
61,NAIVE BAYES,model_52,1,0.67,0.79,49.0,38.0,82.0,18.0,gain,Humidity3pm
62,NAIVE BAYES,model_65,1,0.59,0.69,34.0,25.0,78.0,22.0,cover,Sunshine
63,NAIVE BAYES,model_38,2,0.54,0.71,15.0,8.0,79.0,21.0,weight,"MaxTemp,Pressure9am"


In total, 65 models were generated (5 importance types × 13 ranked features); the best one reached an F1 score of 58%.
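
Several models tie at an F1 of 58, so the top row is decided by the tie-breakers in `sort_values(['F1', 'ROC_AUC', 'No_features'], ascending=False)`. The sketch below reproduces that ordering in plain Python for three rows copied from the table above; it only illustrates the ranking rule and does not rerun any model.

```python
# (Id, No_features, ROC_AUC, F1) copied from three rows of the results table
rows = [
    ("model_1", 13, 0.82, 58.0),
    ("model_34", 6, 0.83, 58.0),
    ("model_26", 1, 0.79, 49.0),
]

# Sort by F1, then ROC_AUC, then No_features, all in descending order
ranked_rows = sorted(rows, key=lambda r: (r[3], r[2], r[1]), reverse=True)
best = ranked_rows[0]

print(best[0])  # model_34
```

The 6-feature model wins on ROC_AUC (0.83 against 0.82), not on F1, so the gain over the full 13-feature model is marginal.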

In [None]:
best_features = df2.Best_features.iloc[0].split(',')

The features that contributed to the best model are:

In [None]:
best_features

['WindGustSpeed',
 'Sunshine',
 'Humidity9am',
 'Humidity3pm',
 'MaxTemp',
 'Pressure9am']

### Verifying the results

In [None]:
model = GaussianNB()

In [None]:
model.fit(X_train[best_features], y_train)

GaussianNB()

In [None]:
print(classification_report(y_train, model.predict(X_train[best_features])))

              precision    recall  f1-score   support

           0       0.88      0.89      0.88     82736
           1       0.59      0.57      0.58     23908

    accuracy                           0.81    106644
   macro avg       0.73      0.73      0.73    106644
weighted avg       0.81      0.81      0.81    106644



In [None]:
y_pred = model.predict(X_test[best_features])

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.88      0.88     27580
           1       0.59      0.57      0.58      7969

    accuracy                           0.81     35549
   macro avg       0.73      0.73      0.73     35549
weighted avg       0.81      0.81      0.81     35549



In [None]:
round(metrics.f1_score(y_test, y_pred) * 100)

58
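
As a sanity check on the reports above, the F1 score is the harmonic mean of precision and recall. The sketch below plugs in the rounded test-set values for class 1 (precision 0.59, recall 0.57); because the inputs are rounded, the result is only approximate.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded class-1 values taken from the test classification report
f1 = f1_from_precision_recall(0.59, 0.57)
print(round(f1 * 100))  # 58
```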

### Expanding the set of classification algorithms

The following dictionary stores the algorithms that I will run through the best-feature selection in order to analyze their performance.

In [None]:
X_train, y_train, X_test, y_test = rain.pipline_process_data(drop_columns)

In [None]:
models_dict = {
    'NAIVE BAYES': GaussianNB(),
    'LOGISTIC REGRESSION': LogisticRegression(),
    # 'SVC': SVC(probability=True, random_state=0),
    # 'SVC POLY': SVC(kernel='poly', probability=True, random_state=0),
    'XGB': XGBClassifier(n_jobs=-1),
    'XGB DART 4': XGBClassifier(booster='dart', max_depth=4, n_jobs=-1),
    'XGB DEEP 5': XGBClassifier(max_depth=5, n_jobs=-1),
    'KNN 3': KNeighborsClassifier(n_neighbors=3, n_jobs=-1),
    # 'KNN 5': KNeighborsClassifier(n_neighbors=5, n_jobs=-1),
    'KNN 7': KNeighborsClassifier(n_neighbors=7, n_jobs=-1),
             }

**Note:** I save each model's performance results under the following path: ../results/models_results

In [None]:
for key_model in models_dict.keys():
    try:
        df_tmp = rain.modeling_by_xgb_importance(model_name=key_model, model=models_dict[key_model],
                           x_train=X_train, y_train=y_train, x_test=X_test, y_test=y_test)
        df_tmp.to_csv(r'../results/models_results/' + key_model.replace(' ', '_') + '.csv',
             index=False)
        
    except Exception as exc:
        # Report which model failed and why, then continue with the next one
        print(key_model, exc)

Progress of  Modeling with XGB DEEP 5 : [##--------------------------------------] 4.5%
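
Once the loop has written one CSV per algorithm, the files can be compared by picking the row with the highest F1 in each. The sketch below is one possible way to do that with the standard library. It assumes the saved files contain the `Id` and `F1` columns built by `modeling_by_xgb_importance` in this notebook; the `best_row_per_file` helper, the file names, and the demo numbers are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict


def best_row_per_file(results_dir: Path) -> Dict[str, Dict[str, str]]:
    """Return, for each CSV in results_dir, the row with the highest F1."""
    best: Dict[str, Dict[str, str]] = {}
    for csv_path in sorted(results_dir.glob("*.csv")):
        with csv_path.open(newline="") as handle:
            rows = list(csv.DictReader(handle))
        if rows:  # skip empty files
            best[csv_path.stem] = max(rows, key=lambda row: float(row["F1"]))
    return best


# Demo with made-up numbers in a temporary folder (not real results)
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "MODEL_A.csv").write_text("Id,F1\nmodel_1,50.0\nmodel_2,55.0\n")
    (demo_dir / "MODEL_B.csv").write_text("Id,F1\nmodel_1,60.0\n")
    summary = best_row_per_file(demo_dir)

print({name: row["Id"] for name, row in summary.items()})
```

For the real results, the same function could be pointed at `Path('../results/models_results')`.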




### Conclusions

From this test, I found that:
- The best predictive model is XGBoost, with an F1 score of 61%.
- The best features are Cloud9am,WindSpeed9am,month_sin,WindSpeed3pm,Evaporation,Cloud3pm,Humidity9am,Rainfall,MaxTemp,Sunshine,WindGustSpeed,Pres
- After analyzing different classification algorithms, I concluded that a different approach is needed.
- To that end, I will focus on a more local model, restricted to a single city.