# **ED MACHINA CHALLENGE**
Alen Jiménez - February 2024

**MACHINE LEARNING (ML)**

The goal of this notebook is to present machine learning models that help us predict whether a student will pass a course.

# Table of Contents
* 0. [General Set Up](#set_up_general)
* 1. [Preparatory processing for machine learning](#procesamiento)
* 2. [Machine Learning: Model preselection](#ml_preseleccion)
* 3. [Machine Learning: Model training and evaluation](#ml_procesamiento_modelos)

In [257]:
# Import libraries

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh
import altair as alt
import re
import scipy
import warnings
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score, classification_report,
                             precision_recall_curve, auc)

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import NearestCentroid
from sklearn.naive_bayes import BernoulliNB

from xgboost import XGBClassifier # XGBoost classifier
from bayes_opt import BayesianOptimization # Bayesian optimization

import lazypredict
from lazypredict.Supervised import LazyRegressor
from lazypredict.Supervised import LazyClassifier

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

In [237]:
# Working directory

directorio_de_trabajo = 'C:/Users/alenj/Escritorio/proyectos/desafio_edmachina'

os.chdir(directorio_de_trabajo)

print(f'Current working directory: {os.getcwd()}')

Current working directory: C:\Users\alenj\Escritorio\proyectos\desafio_edmachina


In [238]:
# Load the CSV

df = pd.read_csv('output/output_dw.csv', sep = ',', encoding = 'utf-8')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2638 entries, 0 to 2637
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        2638 non-null   int64  
 1   user_uuid                 2638 non-null   object 
 2   periodo                   2638 non-null   int64  
 3   course_uuid               2638 non-null   object 
 4   dias_hasta_primer_examen  2638 non-null   float64
 5   semestre_2                2638 non-null   int64  
 6   aprobo                    2638 non-null   int64  
 7   nota_parcial              2638 non-null   float64
 8   score                     2479 non-null   float64
 9   tiempo_hasta_submision    2417 non-null   float64
dtypes: float64(4), int64(4), object(2)
memory usage: 206.2+ KB


# 1. Preparatory processing for machine learning <a class = 'anchor' id = 'procesamiento'></a>

There are missing values in score (159 rows) and tiempo_hasta_submision (221 rows). We fill them with each column's mean.

In [239]:
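# Fill missing values with the column mean

score_mean = df['score'].mean()
tiempo_mean = df['tiempo_hasta_submision'].mean()

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['score'] = df['score'].fillna(score_mean)
df['tiempo_hasta_submision'] = df['tiempo_hasta_submision'].fillna(tiempo_mean)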
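
As a minimal illustration of the mean imputation in this cell, the sketch below fills missing entries with the mean of the observed values (pandas' `mean()` skips missing values by default, so this is the same quantity). The helper name `impute_mean` and the sample numbers are hypothetical, and plain Python lists stand in for the data frame columns.

```python
from statistics import fmean

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    mean = fmean(observed)
    return [mean if v is None else v for v in values]

# Hypothetical scores with two missing entries
scores = [80.0, None, 60.0, 100.0, None]
filled = impute_mean(scores)
print(filled)

# Mean imputation leaves the column mean unchanged
print(fmean(filled))
```

One consequence worth keeping in mind is that the column mean is preserved while its variance shrinks, since every imputed row sits exactly at the mean.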

In [240]:
# Check the class imbalance
 
df['aprobo'].value_counts()

aprobo
1    2573
0      65
Name: count, dtype: int64

In [241]:
# For analysis purposes, swap the classes so that failing is the positive class (1)

df['desaprobo'] = df['aprobo'].apply(lambda x: 1 if x == 0 else 0)
df['desaprobo'].value_counts()

desaprobo
0    2573
1      65
Name: count, dtype: int64

In [242]:
ratio = df['desaprobo'].value_counts()[0] / df['desaprobo'].value_counts()[1]
ratio

39.58461538461538

The class ratio is roughly 40:1 (2573 passing students versus 65 failing students).
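
To make the consequence of this imbalance concrete, the short sketch below computes what a trivial model that always predicts "passes" would score. The counts come from the `value_counts()` output above; the variable names are only illustrative.

```python
# Class counts taken from the value_counts() output above
n_pass, n_fail = 2573, 65
total = n_pass + n_fail

ratio = n_pass / n_fail

# A trivial model that always predicts "passes" (desaprobo = 0)
baseline_accuracy = n_pass / total
baseline_recall_fail = 0 / n_fail  # it never flags a failing student

print(f'ratio = {ratio:.2f}')
print(f'baseline accuracy = {baseline_accuracy:.4f}')
print(f'baseline recall on desaprobo = {baseline_recall_fail:.2f}')
```

Such a model reaches about 97.5% accuracy while detecting no failing students, which is why recall on the `desaprobo` class is a more informative metric here than accuracy.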

In [243]:
# Split the data frame into target and features

X = df.drop(columns = ['desaprobo','aprobo','id','user_uuid','periodo','course_uuid','semestre_2']) # semestre_2 is constant
y = df['desaprobo']

In [244]:
# Split into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y
                                                    , test_size = 0.3
                                                    , random_state = 41
                                                    , stratify = y)
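
The arithmetic behind this stratified split can be checked by hand. The sketch below assumes scikit-learn's behavior of rounding a fractional test size up, and uses the per-split counts of failing students that appear in the notebook's later outputs (45 in train, 20 in test); the variable names are illustrative.

```python
import math

n_total, n_fail = 2638, 65
test_size = 0.3

# scikit-learn rounds the test set size up for a fractional test_size
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

# Failing-student counts that appear in the notebook's later outputs
fail_train, fail_test = 45, 20

print(n_train, n_test)
print(round(n_fail / n_total, 4),
      round(fail_train / n_train, 4),
      round(fail_test / n_test, 4))
```

The three proportions are all close to 2.5%, which is what `stratify = y` is meant to guarantee: both splits keep roughly the same share of failing students as the full dataset.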

In [245]:
# Oversample the minority class using SMOTE

print('Class counts in the original dataset %s' % Counter(y_train))

sm = SMOTE(sampling_strategy = 0.5
           , k_neighbors = 5
           , random_state = 0)

X_train_sm, y_train_sm = sm.fit_resample(X_train,y_train)

print('Class counts in the resampled dataset %s' % Counter(y_train_sm))

Class counts in the original dataset Counter({0: 1801, 1: 45})
Class counts in the resampled dataset Counter({0: 1801, 1: 900})
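
With `sampling_strategy = 0.5`, SMOTE adds synthetic minority samples until the minority class is half the size of the majority class, which for 1801 majority rows gives the 900 seen above. Each synthetic point is an interpolation between a minority sample and one of its nearest minority neighbors. The sketch below illustrates only that interpolation step; it is a simplified illustration with hypothetical points and a hypothetical helper `interpolate`, not imblearn's implementation.

```python
import random

n_majority, n_minority = 1801, 45
sampling_strategy = 0.5

# Number of synthetic samples needed to reach the target ratio (rounded down)
n_synthetic = int(sampling_strategy * n_majority - n_minority)
print(n_minority + n_synthetic)  # 900

def interpolate(x, neighbor, rng):
    """Return a random point on the segment between x and neighbor."""
    lam = rng.random()  # uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
x = [1.0, 2.0]          # hypothetical minority sample
neighbor = [3.0, 6.0]   # hypothetical nearest minority neighbor
synthetic = interpolate(x, neighbor, rng)
print(synthetic)
```

Because every synthetic row lies between two existing failing students, SMOTE does not add new information about the minority class; it only densifies the region the 45 original samples already cover.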


In [246]:
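# Standardize

scaler = StandardScaler()

X_train_sc = scaler.fit_transform(X_train) # Fit on the original training data
X_test_sc = scaler.transform(X_test)

# Use a second scaler for the resampled data so the first one is not overwritten
scaler_sm = StandardScaler()

X_train_sm_sc = scaler_sm.fit_transform(X_train_sm) # Fit on the SMOTE-resampled training data
X_test_sm_sc = scaler_sm.transform(X_test)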
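
As a rough sketch of what this standardization does, the example below computes the mean and population standard deviation on the training values only and then applies them to both splits. The helpers `fit_scaler` and `transform` and the sample values are hypothetical; they mimic the fit/transform pattern on a single column with plain Python.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Center and scale a column with statistics learned elsewhere."""
    return [(v - mean) / std for v in column]

train = [2.0, 4.0, 6.0, 8.0]   # hypothetical training values
test = [5.0, 10.0]             # hypothetical test values

mean, std = fit_scaler(train)
train_sc = transform(train, mean, std)
test_sc = transform(test, mean, std)   # reuses the training statistics

print(train_sc)
print(test_sc)
```

The test values are scaled with the training statistics rather than their own, so the test set does not influence the preprocessing.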

# 2. Machine Learning: Model preselection <a class = 'anchor' id = 'ml_preseleccion'></a>

In [247]:
# Classification without resampling

clf0 = LazyClassifier(verbose = 0, ignore_warnings=True, predictions=True, custom_metric = recall_score)
models0, predictions0 = clf0.fit(X_train_sc, X_test_sc, y_train, y_test)
models0

100%|██████████| 29/29 [00:01<00:00, 23.62it/s]

[LightGBM] [Info] Number of positive: 45, number of negative: 1801
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000152 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 542
[LightGBM] [Info] Number of data points in the train set: 1846, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.024377 -> initscore=-3.689435
[LightGBM] [Info] Start training from score -3.689435





| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | recall_score | Time Taken |
|---|---|---|---|---|---|---|
| NearestCentroid | 0.72 | 0.74 | 0.74 | 0.82 | 0.75 | 0.01 |
| Perceptron | 0.96 | 0.57 | 0.57 | 0.96 | 0.15 | 0.0 |
| DecisionTreeClassifier | 0.95 | 0.56 | 0.56 | 0.95 | 0.15 | 0.01 |
| KNeighborsClassifier | 0.97 | 0.52 | 0.52 | 0.96 | 0.05 | 0.03 |
| ExtraTreeClassifier | 0.95 | 0.51 | 0.51 | 0.95 | 0.05 | 0.01 |
| AdaBoostClassifier | 0.97 | 0.5 | 0.5 | 0.96 | 0.0 | 0.11 |
| LogisticRegression | 0.97 | 0.5 | 0.5 | 0.96 | 0.0 | 0.01 |
| SVC | 0.97 | 0.5 | 0.5 | 0.96 | 0.0 | 0.04 |
| SGDClassifier | 0.97 | 0.5 | 0.5 | 0.96 | 0.0 | 0.02 |
| RidgeClassifierCV | 0.97 | 0.5 | 0.5 | 0.96 | 0.0 | 0.01 |


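In terms of recall, NearestCentroid and QuadraticDiscriminantAnalysis perform best. NearestCentroid reaches 0.75, while most other models stay at or near 0. (QuadraticDiscriminantAnalysis does not appear among the rows shown above.)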

In [248]:
# Classification with resampling

clf1 = LazyClassifier(verbose = 0, ignore_warnings=True, predictions=True, custom_metric = recall_score)
models1, predictions1 = clf1.fit(X_train_sm_sc, X_test_sm_sc, y_train_sm, y_test)
models1

100%|██████████| 29/29 [00:02<00:00, 13.85it/s]

[LightGBM] [Info] Number of positive: 900, number of negative: 1801
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000067 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 930
[LightGBM] [Info] Number of data points in the train set: 2701, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.333210 -> initscore=-0.693703
[LightGBM] [Info] Start training from score -0.693703





| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | recall_score | Time Taken |
|---|---|---|---|---|---|---|
| BernoulliNB | 0.7 | 0.77 | 0.77 | 0.8 | 0.85 | 0.01 |
| PassiveAggressiveClassifier | 0.8 | 0.75 | 0.75 | 0.87 | 0.7 | 0.01 |
| Perceptron | 0.75 | 0.75 | 0.75 | 0.84 | 0.75 | 0.01 |
| NearestCentroid | 0.76 | 0.73 | 0.73 | 0.84 | 0.7 | 0.01 |
| QuadraticDiscriminantAnalysis | 0.84 | 0.7 | 0.7 | 0.89 | 0.55 | 0.01 |
| CalibratedClassifierCV | 0.85 | 0.68 | 0.68 | 0.9 | 0.5 | 0.03 |
| LogisticRegression | 0.85 | 0.68 | 0.68 | 0.9 | 0.5 | 0.01 |
| LinearSVC | 0.85 | 0.68 | 0.68 | 0.9 | 0.5 | 0.07 |
| RidgeClassifierCV | 0.84 | 0.68 | 0.68 | 0.89 | 0.5 | 0.01 |
| RidgeClassifier | 0.84 | 0.68 | 0.68 | 0.89 | 0.5 | 0.01 |


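- Both Balanced Accuracy and ROC AUC improve in almost all cases.
- Recall improves substantially for most models (for example, LogisticRegression goes from 0.00 to 0.50). NearestCentroid is an exception, dropping slightly from 0.75 to 0.70.
- BernoulliNB becomes the best model overall (balanced accuracy 0.77, recall 0.85), while NearestCentroid still performs well.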

# 3. Machine Learning: Model training and evaluation <a class = 'anchor' id = 'ml_procesamiento_modelos'></a>

In [249]:
# Try BernoulliNB

def bernoulliNB_report(X_train,X_test,y_train,y_test):

    model=BernoulliNB()
    
    model.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    y_proba=model.predict_proba(X_test)
    
    print(classification_report(y_test,y_pred))
    
    print('Area under the ROC curve:',np.round(roc_auc_score(y_test,y_proba[:,1]),4))
    
    precision, recall, threshold=precision_recall_curve(y_test,y_proba[:,1])

    print('Area under the Precision-Recall curve:',np.round(auc(recall,precision),4))

In [250]:
bernoulliNB_report(X_train_sm_sc, X_test_sm_sc, y_train_sm, y_test)

              precision    recall  f1-score   support

           0       0.99      0.69      0.82       772
           1       0.07      0.85      0.12        20

    accuracy                           0.70       792
   macro avg       0.53      0.77      0.47       792
weighted avg       0.97      0.70      0.80       792

Area under the ROC curve: 0.7503
Area under the Precision-Recall curve: 0.0498
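
The very low precision on class 1, despite a recall of 0.85, follows from the class imbalance in the test set. The sketch below reconstructs it approximately from the rounded recall values and supports in the report above, so the counts it derives are approximations and not the exact confusion matrix.

```python
# Rounded values read from the classification report above
n_pos, n_neg = 20, 772          # support of class 1 and class 0
recall_pos = 0.85               # true positive rate
recall_neg = 0.69               # true negative rate

true_pos = recall_pos * n_pos               # about 17 failing students flagged
false_pos = (1 - recall_neg) * n_neg        # about 239 passing students flagged

precision_pos = true_pos / (true_pos + false_pos)
print(round(precision_pos, 2))
```

Even a moderate false positive rate on 772 passing students produces far more false alarms than there are failing students to find, which caps the achievable precision.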


In [251]:
# Try XGBClassifier

# Lower and upper bounds of the hyperparameters to optimize.
pbounds = {
    'learning_rate': (0.01, 1.0),
    'n_estimators': (100, 1000),
    'max_depth': (3,10),
    'subsample': (1.0, 1.0),  # Keep subsample at 1, the XGBoost default, since there is not much data.
    'colsample': (1.0, 1.0),  # There are few features, so this also stays at 1.
    'gamma': (0, 5)}

# Objective function for the hyperparameter optimization
def xgboost_hyper_param(learning_rate,
                        n_estimators,
                        max_depth,
                        subsample,
                        colsample,
                        gamma):
    # Cast max_depth and n_estimators to int, since XGBoost does not accept floats for them.
    max_depth = int(max_depth)
    n_estimators = int(n_estimators)
    
    # Instantiate the XGBClassifier (binary classification).
    # subsample and colsample are fixed at 1.0 (the defaults); passing them makes that explicit.
    clf = XGBClassifier(
        max_depth=max_depth,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        subsample=subsample,
        colsample_bytree=colsample,
        gamma=gamma)
    
    # Return the mean cross-validated accuracy of the model
    return np.mean(cross_val_score(clf, X_train_sm_sc, y_train_sm, cv=3, scoring='accuracy'))

# Instantiate the Bayesian optimization.
optimizer = BayesianOptimization(
    f=xgboost_hyper_param,
    pbounds=pbounds,
    random_state=1,
)

optimizer.maximize(init_points=20, n_iter=4)

print('Best result:', optimizer.max)

|   iter    |  target   | colsample |   gamma   | learni... | max_depth | n_esti... | subsample |
-------------------------------------------------------------------------------------------------


| 1         | 0.8978    | 1.0       | 3.602     | 0.01011   | 5.116     | 232.1     | 1.0       |
| 2         | 0.9471    | 1.0       | 1.728     | 0.4028    | 6.772     | 477.3     | 1.0       |
| 3         | 0.9278    | 1.0       | 4.391     | 0.03711   | 7.693     | 475.6     | 1.0       |
| 4         | 0.9478    | 1.0       | 0.9905    | 0.8027    | 9.778     | 382.1     | 1.0       |
| 5         | 0.9004    | 1.0       | 4.473     | 0.09419   | 3.273     | 252.8     | 1.0       |
| 6         | 0.9426    | 1.0       | 2.106     | 0.9583    | 6.732     | 722.7     | 1.0       |
| 7

In [252]:
modelo = XGBClassifier(
        max_depth = 5,
        learning_rate=0.2481,
        n_estimators=630,
        gamma=0.003)

modelo.fit(X_train_sm_sc, y_train_sm)

# Predict on the test set and store the predictions so we can evaluate them
y_pred = modelo.predict(X_test_sm_sc)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.98      0.97      0.97       772
           1       0.08      0.10      0.09        20

    accuracy                           0.95       792
   macro avg       0.53      0.53      0.53       792
weighted avg       0.95      0.95      0.95       792



In [254]:
# Build a very simple neural network

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(5, kernel_initializer = 'uniform', activation = 'relu', input_dim = 4))
model.add(Dense(5, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dense(1, kernel_initializer = 'uniform', activation = 'sigmoid'))
model.compile(optimizer="adam", loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X_train_sm_sc, y_train_sm, batch_size=32, epochs=50)
y_pred = np.rint(model.predict(X_test_sm_sc).flatten())



Epoch 1/50
...
Epoch 50/50


In [255]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.98      0.85      0.91       772
           1       0.08      0.50      0.13        20

    accuracy                           0.84       792
   macro avg       0.53      0.67      0.52       792
weighted avg       0.96      0.84      0.89       792
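
The predictions above come from `np.rint`, which rounds the sigmoid output to the nearest integer and so roughly amounts to a decision threshold of 0.5. The sketch below shows, on hypothetical probabilities and labels, how lowering that threshold trades precision for recall; `recall_at` is an illustrative helper, not part of the notebook.

```python
def recall_at(threshold, probs, labels):
    """Recall of the positive class when predicting 1 for probs >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    true_pos = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    return true_pos / sum(labels)

# Hypothetical predicted probabilities and true labels (1 = failed)
probs  = [0.05, 0.20, 0.35, 0.45, 0.60, 0.80, 0.30, 0.10]
labels = [0,    0,    1,    1,    1,    1,    0,    0]

for t in (0.5, 0.3):
    print(t, recall_at(t, probs, labels))
```

In this toy data the lower threshold recovers every failing student but also flags one passing student, so the choice of threshold depends on how costly a missed failure is relative to a false alarm.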



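BernoulliNB appears to be the best-performing algorithm in terms of recall on the failing class (0.85, versus 0.50 for the neural network and 0.10 for XGBClassifier), although there is plenty of room for improvement: its precision on that class is only 0.07.

We use logistic regression to inspect the importance of the features.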

In [258]:
model=LogisticRegressionCV(scoring='f1')
    
model.fit(X_train_sm_sc, y_train_sm)
y_pred=model.predict(X_test_sm_sc)
y_proba=model.predict_proba(X_test_sm_sc)

print(classification_report(y_test,y_pred))

print('Area under the ROC curve:',np.round(roc_auc_score(y_test,y_proba[:,1]),4))

precision, recall, threshold=precision_recall_curve(y_test,y_proba[:,1])

print('Area under the Precision-Recall curve:',np.round(auc(recall,precision),4))

              precision    recall  f1-score   support

           0       0.99      0.87      0.92       772
           1       0.09      0.50      0.15        20

    accuracy                           0.86       792
   macro avg       0.54      0.68      0.54       792
weighted avg       0.96      0.86      0.90       792

Area under the ROC curve: 0.8206
Area under the Precision-Recall curve: 0.1207
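
To put the Precision-Recall areas in context: for a classifier that ranks at random, precision is expected to sit near the positive-class prevalence at every recall level, so the prevalence is a common reference point. The sketch below compares the two areas reported in this notebook against that reference, using the test-set supports from the reports; the dictionary keys are just labels.

```python
n_pos, n_total = 20, 792            # test-set support from the reports above
prevalence = n_pos / n_total        # expected precision of a random classifier

# Precision-Recall areas printed by the notebook
pr_auc = {'BernoulliNB': 0.0498, 'LogisticRegressionCV': 0.1207}

print(round(prevalence, 4))
for name, value in pr_auc.items():
    print(name, round(value / prevalence, 1))
```

By this measure the logistic regression ranks failing students better than BernoulliNB (about 4.8 times the reference versus about 2 times), even though BernoulliNB has the higher recall at its default threshold.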


In [261]:
model.n_features_in_

4

In [263]:
X.columns

Index(['dias_hasta_primer_examen', 'nota_parcial', 'score',
       'tiempo_hasta_submision'],
      dtype='object')

In [259]:
model.coef_

array([[-0.02385422, -1.56879313, -0.10673112,  0.13278198]])

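Because the features are standardized, the absolute values of the coefficients are comparable. Ranked by that magnitude, the order of feature importance is:
- nota_parcial (-1.569)
- tiempo_hasta_submision (0.133)
- score (-0.107)
- dias_hasta_primer_examen (-0.024)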
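
The ranking above can be reproduced by sorting the features by the absolute value of their coefficients. The sketch below copies the names from `X.columns` and the values from `model.coef_`, and also prints `exp(coef)`, the multiplicative change in the odds of failing per one standard deviation increase in the feature. That interpretation assumes the standardized, SMOTE-resampled training data the model was fit on.

```python
import math

# Feature names from X.columns and coefficients from model.coef_ above
features = ['dias_hasta_primer_examen', 'nota_parcial', 'score', 'tiempo_hasta_submision']
coefs = [-0.02385422, -1.56879313, -0.10673112, 0.13278198]

# Sort by coefficient magnitude, largest first
ranked = sorted(zip(features, coefs), key=lambda fc: abs(fc[1]), reverse=True)

for name, coef in ranked:
    print(f'{name:26s} coef={coef:+.4f} odds_ratio={math.exp(coef):.3f}')
```

Under those assumptions, a one standard deviation increase in `nota_parcial` multiplies the odds of failing by roughly 0.21, while the other three features move the odds by less than 15% each.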

---