# XGBOOST MODELING (NO OPTIMIZATION)

Objective: Train an XGBoost model with default hyperparameters, plus scale_pos_weight to account for class imbalance, and compare it with the baseline (Logistic Regression).

Temporal phases:
- T0 (Enrollment)          : Variables available at the time of enrollment
- T1 (End of 1st semester) : T0 + 1st-semester academic variables, payment status (debtor, tuition_fees_up_to_date) and macroeconomic indicators
- T2 (End of 2nd semester) : T1 + 2nd-semester academic variables

XGBoost-specific preprocessing:
- No scaling required
- Label Encoding for the grouped categorical variables
- Target Encoding for 'course'
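
As a rough illustration of the two encodings listed above, the sketch below applies label encoding and a smoothed target encoding to a toy column in plain Python. The helper names (`label_encode`, `target_encode`), the additive smoothing formula and the parameter `m` are assumptions made for this example; `category_encoders.TargetEncoder` applies its own smoothing scheme, so its values will differ.

```python
from collections import defaultdict


def label_encode(values):
    """Map each category to an integer, following the sorted order of the categories."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def target_encode(values, targets, m=1.0):
    """Replace each category by a smoothed mean of the binary target.

    encoded = (sum_of_targets + m * global_mean) / (count + m)
    A larger m pulls rare categories towards the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    table = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}
    return [table[v] for v in values], table


# Toy data: course codes taken from the dataset preview, targets invented.
courses = [171, 9254, 9254, 9070, 171, 9254]
dropout = [1, 0, 0, 1, 1, 1]

codes, code_map = label_encode(courses)
encoded, te_table = target_encode(courses, dropout, m=1.0)
print(code_map)
print(te_table)
```

Label encoding only assigns arbitrary integer codes, whereas target encoding injects information from the target, which is why it has to be fitted on training data only.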

Pipeline:
1. Load the preprocessed data
2. Define the variables for each temporal phase
3. Stratified split (80/20)
4. XGBoost-specific preprocessing
5. Training with 5-fold cross-validation
6. Evaluation on the test set
7. Comparison of results by phase

## 0. Libraries and configuration

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Preprocessing
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.preprocessing import LabelEncoder

# Model
from xgboost import XGBClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    roc_curve, precision_recall_curve, average_precision_score
)

# Target Encoding
from category_encoders import TargetEncoder

# Display configuration
plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)

# Seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Output directories
OUTPUT_DIR = "../outputs/figures/modelado/XGBoost/"
OUTPUT_DIR_REPORTES = "../outputs/models/XGBoost/"
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR_REPORTES, exist_ok=True)

# mlflow
import mlflow
import mlflow.sklearn

## 1. Loading the preprocessed data

In [31]:
# Load the preprocessed dataset
df = pd.read_csv('../data/processed/preprocessed_data.csv')

print(f"Dataset cargado: {df.shape[0]} filas x {df.shape[1]} columnas")
print(df['target_binario'].value_counts())
print(f"\nRatio de desbalance: {df['target_binario'].value_counts()[0] / df['target_binario'].value_counts()[1]:.2f}:1")

df.head()

Dataset cargado: 4424 filas x 38 columnas
target_binario
0    3003
1    1421
Name: count, dtype: int64

Ratio de desbalance: 2.11:1


Unnamed: 0,application_order,course,daytimeevening_attendance,previous_qualification_grade,admission_grade,displaced,educational_special_needs,debtor,tuition_fees_up_to_date,gender,scholarship_holder,age_at_enrollment,international,curricular_units_1st_sem_credited,curricular_units_1st_sem_enrolled,curricular_units_1st_sem_evaluations,curricular_units_1st_sem_approved,curricular_units_1st_sem_grade,curricular_units_1st_sem_without_evaluations,curricular_units_2nd_sem_credited,curricular_units_2nd_sem_enrolled,curricular_units_2nd_sem_evaluations,curricular_units_2nd_sem_approved,curricular_units_2nd_sem_grade,curricular_units_2nd_sem_without_evaluations,unemployment_rate,inflation_rate,gdp,is_single,application_mode_risk,is_over_23_entry,previous_qualification_risk,mothers_qualification_level,fathers_qualification_level,mothers_occupation_level,fathers_occupation_level,has_unknown_parent_info,target_binario
0,5,171,1,122.0,127.3,1,0,0,1,1,0,20,0,0,0,0,0,0.0,0,0,0,0,0,0.0,0,10.8,1.4,1.74,1,Bajo_Riesgo,0,Bajo_Riesgo,Basica_Media,Secundaria,Otro_Trabajo,Otro_Trabajo,0,1
1,1,9254,1,160.0,142.5,1,0,0,0,1,0,19,0,0,6,6,6,14.0,0,0,6,6,6,13.6667,0,13.9,-0.3,0.79,1,Bajo_Riesgo,0,Bajo_Riesgo,Secundaria,Superior,Profesional,Profesional,0,0
2,5,9070,1,122.0,124.8,1,0,0,0,1,0,19,0,0,6,0,0,0.0,0,0,6,0,0,0.0,0,10.8,1.4,1.74,1,Bajo_Riesgo,0,Bajo_Riesgo,Basica_Baja,Basica_Baja,Otro_Trabajo,Otro_Trabajo,0,1
3,2,9773,1,122.0,119.6,1,0,0,1,0,0,20,0,0,6,8,6,13.4286,0,0,6,10,5,12.4,0,9.4,-0.8,-3.12,1,Bajo_Riesgo,0,Bajo_Riesgo,Basica_Media,Basica_Baja,Otro_Trabajo,Profesional,0,0
4,1,8014,0,100.0,141.5,0,0,0,1,0,0,45,0,0,6,9,5,12.3333,0,0,6,6,6,13.0,0,13.9,-0.3,0.79,0,Alto_Riesgo,1,Bajo_Riesgo,Basica_Baja,Basica_Media,Otro_Trabajo,Otro_Trabajo,0,0


## 2. Definition of variables by temporal phase (T0, T1, T2)

In [32]:
# TARGET
TARGET = 'target_binario'

# -----------------------------------------------------------------------------
# BINARY VARIABLES (no encoding required, already 0/1)
# -----------------------------------------------------------------------------
VARS_BINARIAS_T0 = [
    'daytimeevening_attendance',
    'displaced',
    'educational_special_needs',
    'gender',
    'scholarship_holder',
    'international',
    'is_single'
]

VARS_BINARIAS_T1 = [
    'debtor',
    'tuition_fees_up_to_date'
]


# -----------------------------------------------------------------------------
# NUMERIC VARIABLES (NO scaling required for XGBoost)
# -----------------------------------------------------------------------------
VARS_NUMERICAS_T0 = [
    'age_at_enrollment',
    'admission_grade',
    'previous_qualification_grade'
]

VARS_NUMERICAS_T1 = [
    'curricular_units_1st_sem_credited',
    'curricular_units_1st_sem_enrolled',
    'curricular_units_1st_sem_evaluations',
    'curricular_units_1st_sem_approved',
    'curricular_units_1st_sem_grade',
    'curricular_units_1st_sem_without_evaluations',
    'unemployment_rate',
    'inflation_rate',
    'gdp'
]

VARS_NUMERICAS_T2 = [
    'curricular_units_2nd_sem_credited',
    'curricular_units_2nd_sem_enrolled',
    'curricular_units_2nd_sem_evaluations',
    'curricular_units_2nd_sem_approved',
    'curricular_units_2nd_sem_grade',
    'curricular_units_2nd_sem_without_evaluations'
]

# -----------------------------------------------------------------------------
# GROUPED CATEGORICAL VARIABLES (require Label Encoding for XGBoost)
# -----------------------------------------------------------------------------
VARS_CATEGORICAS_AGRUPADAS_T0 = [
    'application_mode_risk',
    'previous_qualification_risk',
    'mothers_qualification_level',
    'fathers_qualification_level',
    'mothers_occupation_level',
    'fathers_occupation_level'
]

# -----------------------------------------------------------------------------
# CATEGORICAL VARIABLES FOR TARGET ENCODING
# -----------------------------------------------------------------------------
VARS_TARGET_ENCODING_T0 = ['course']

# -----------------------------------------------------------------------------
# ORDINAL VARIABLE (treated as numeric)
# -----------------------------------------------------------------------------
VARS_ORDINALES_T0 = ['application_order']

# =============================================================================
# VARIABLE SETS BY TEMPORAL PHASE
# =============================================================================

# T0: Variables available at enrollment
VARS_T0 = (
    VARS_BINARIAS_T0 +
    VARS_NUMERICAS_T0 +
    VARS_CATEGORICAS_AGRUPADAS_T0 +
    VARS_TARGET_ENCODING_T0 +
    VARS_ORDINALES_T0
)

# T1: T0 + 1st-semester variables
VARS_T1 = (
    VARS_T0 +
    VARS_BINARIAS_T1 +
    VARS_NUMERICAS_T1
)

# T2: T1 + 2nd-semester variables
VARS_T2 = (
    VARS_T1 +
    VARS_NUMERICAS_T2
)

print("================================================================================")
print("  VARIABLES POR FASE TEMPORAL")
print("================================================================================")
print(f"\n T0 (Matrícula): {len(VARS_T0)} variables")
print(f" T1 (Fin 1er Sem): {len(VARS_T1)} variables (+{len(VARS_T1) - len(VARS_T0)})")
print(f" T2 (Fin 2do Sem): {len(VARS_T2)} variables (+{len(VARS_T2) - len(VARS_T1)})")

  VARIABLES POR FASE TEMPORAL

 T0 (Matrícula): 18 variables
 T1 (Fin 1er Sem): 29 variables (+11)
 T2 (Fin 2do Sem): 35 variables (+6)
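
The counts printed above follow directly from the sizes of the individual lists. A quick arithmetic check, with the group sizes copied by hand from the definitions in this cell (the dictionary layout is only for illustration):

```python
# Number of variables that each phase adds, per group, as defined above.
group_sizes = {
    "T0": {"binarias": 7, "numericas": 3, "categoricas_agrupadas": 6,
           "target_encoding": 1, "ordinales": 1},
    "T1": {"binarias": 2, "numericas": 9},
    "T2": {"numericas": 6},
}

cumulative = {}
running_total = 0
for fase in ("T0", "T1", "T2"):
    running_total += sum(group_sizes[fase].values())
    cumulative[fase] = running_total

print(cumulative)  # {'T0': 18, 'T1': 29, 'T2': 35}
```

The dataset preview shows 37 non-target columns, so two of them (`is_over_23_entry` and `has_unknown_parent_info`) do not appear in any of the lists and are not used by the models in this notebook.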


## 3. TRAIN/TEST split

In [33]:
# The split is done on the full dataset; the variables for each temporal phase are selected afterwards for training and evaluation

X = df[VARS_T2].copy()
y = df[TARGET].copy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=RANDOM_STATE
)

print("================================================================================")
print("  SPLIT TRAIN/TEST")
print("================================================================================")
print(f"\nTrain: {X_train.shape[0]} ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"Test:  {X_test.shape[0]} ({X_test.shape[0]/len(df)*100:.1f}%)")

print(f"\nDistribución del target en Train:")
print(y_train.value_counts())
print(f"Ratio de desbalance: {y_train.value_counts()[0] / y_train.value_counts()[1]:.2f}:1")

print(f"\nDistribución del target en Test:")
print(y_test.value_counts())
print(f"Ratio de desbalance: {y_test.value_counts()[0] / y_test.value_counts()[1]:.2f}:1")

  SPLIT TRAIN/TEST

Train: 3539 (80.0%)
Test:  885 (20.0%)

Distribución del target en Train:
target_binario
0    2402
1    1137
Name: count, dtype: int64
Ratio de desbalance: 2.11:1

Distribución del target en Test:
target_binario
0    601
1    284
Name: count, dtype: int64
Ratio de desbalance: 2.12:1
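
The class counts reported above can be used to confirm that the stratified split keeps the class proportions nearly identical across subsets, and to reproduce the value that the training function later passes as `scale_pos_weight`. A small sketch with the counts copied from the outputs (the `counts` dictionary is just a convenient container for this check):

```python
# Class counts copied from the outputs above (0 = negative class, 1 = positive class).
counts = {
    "full": {0: 3003, 1: 1421},
    "train": {0: 2402, 1: 1137},
    "test": {0: 601, 1: 284},
}

for name, c in counts.items():
    total = c[0] + c[1]
    positive_share = c[1] / total
    ratio = c[0] / c[1]
    print(f"{name:<5} n={total:>4}  positive share={positive_share:.4f}  ratio={ratio:.2f}:1")

# Negative-to-positive ratio of the training split, used as scale_pos_weight.
scale_pos_weight = counts["train"][0] / counts["train"][1]
print(f"scale_pos_weight = {scale_pos_weight:.2f}")
```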


## 4. Preprocessing functions for XGBOOST

In [34]:
def obtiene_variables_por_fase(fase):
    # Returns a dictionary with the variable lists for the given temporal phase
    if fase == 'T0':
        return {
            'binarias': VARS_BINARIAS_T0,
            'numericas': VARS_NUMERICAS_T0 + VARS_ORDINALES_T0,
            'categoricas_le': VARS_CATEGORICAS_AGRUPADAS_T0,
            'categoricas_te': VARS_TARGET_ENCODING_T0,
            'all': VARS_T0
        }
    elif fase == 'T1':
        return {
            'binarias': VARS_BINARIAS_T0 + VARS_BINARIAS_T1,
            'numericas': VARS_NUMERICAS_T0 + VARS_ORDINALES_T0 + VARS_NUMERICAS_T1,
            'categoricas_le': VARS_CATEGORICAS_AGRUPADAS_T0,
            'categoricas_te': VARS_TARGET_ENCODING_T0,
            'all': VARS_T1
        }
    elif fase == 'T2':
        return {
            'binarias': VARS_BINARIAS_T0 + VARS_BINARIAS_T1,
            'numericas': VARS_NUMERICAS_T0 + VARS_ORDINALES_T0 + VARS_NUMERICAS_T1 + VARS_NUMERICAS_T2,
            'categoricas_le': VARS_CATEGORICAS_AGRUPADAS_T0,
            'categoricas_te': VARS_TARGET_ENCODING_T0,
            'all': VARS_T2
        }
    else:
        raise ValueError(f"Fase no válida: {fase}. Usar 'T0', 'T1', o 'T2'")


def preprocesamiento_xgboost(X_train, X_test, y_train, fase):
    # Preprocess the data for XGBoost; all encoders are fit on the training split only
    variables_fase = obtiene_variables_por_fase(fase)
    
    # Keep only the variables of the requested phase
    X_train_fase = X_train[variables_fase['all']].copy()
    X_test_fase = X_test[variables_fase['all']].copy()
    
    # Dictionary to store the fitted label encoders
    label_encoders = {}
    
    # -------------------------------------------------------------------------
    # 1. TARGET ENCODING for 'course'
    # -------------------------------------------------------------------------
    te = TargetEncoder(cols=variables_fase['categoricas_te'], smoothing=0.3)
    
    for col in variables_fase['categoricas_te']:
        X_train_fase[col + '_encoded'] = te.fit_transform(X_train_fase[[col]], y_train)[col]
        X_test_fase[col + '_encoded'] = te.transform(X_test_fase[[col]])[col]
        # Drop the original column
        X_train_fase = X_train_fase.drop(columns=[col])
        X_test_fase = X_test_fase.drop(columns=[col])
    
    # -------------------------------------------------------------------------
    # 2. LABEL ENCODING for the grouped categorical variables
    # -------------------------------------------------------------------------
    for col in variables_fase['categoricas_le']:
        le = LabelEncoder()
        X_train_fase[col] = le.fit_transform(X_train_fase[col].astype(str))
        X_test_fase[col] = le.transform(X_test_fase[col].astype(str))
        label_encoders[col] = le
    
    # -------------------------------------------------------------------------
    # Store the preprocessing information
    # -------------------------------------------------------------------------
    variables = X_train_fase.columns.tolist()
    preprocessors = {
        'target_encoder': te,
        'label_encoders': label_encoders,
        'feature_names': variables
    }
    
    return X_train_fase, X_test_fase, variables, preprocessors
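
One point worth keeping in mind: `preprocesamiento_xgboost` fits the target encoder on the whole training split, and the cross-validation in the next section runs on the already encoded data, so the targets of each validation fold have contributed to that fold's own `course_encoded` values. A possible alternative is to fit the encoding inside each fold. The sketch below shows the idea in plain Python with unsmoothed means; the helper names (`fit_target_means`, `out_of_fold_encode`) and the toy data are hypothetical and not part of this notebook.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Return ({category: mean target}, global mean) computed from the given rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    return {c: sums[c] / counts[c] for c in counts}, global_mean


def out_of_fold_encode(categories, targets, folds):
    """Encode each row with statistics computed without that row's own fold.

    `folds` is a list of (train_idx, val_idx) pairs, such as those produced by
    StratifiedKFold.split. Categories unseen in a fold's training part fall
    back to that part's global mean.
    """
    encoded = [None] * len(categories)
    for train_idx, val_idx in folds:
        means, fallback = fit_target_means(
            [categories[i] for i in train_idx],
            [targets[i] for i in train_idx],
        )
        for i in val_idx:
            encoded[i] = means.get(categories[i], fallback)
    return encoded


# Toy example with two hand-written folds (hypothetical data).
courses = ["A", "A", "B", "B", "A", "C"]
dropout = [1, 0, 1, 1, 0, 1]
folds = [([0, 1, 2], [3, 4, 5]), ([3, 4, 5], [0, 1, 2])]
oof_encoded = out_of_fold_encode(courses, dropout, folds)
print(oof_encoded)
```

With scikit-learn, a similar effect can usually be obtained by placing the encoder and the classifier in a single `Pipeline` and cross-validating the pipeline, so that the encoder is refit on the training part of every fold.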

## 5. Training functions

In [35]:
def entrena_xgboost(X_train, y_train, fase, cv_folds=5):
    """Train and evaluate XGBoost with cross-validation."""

    mlflow.end_run()
         
    print("================================================================================")
    print(f"  ENTRENAMIENTO XGBOOST - FASE {fase}")
    print("================================================================================")
    print(f"\nVariables: {X_train.shape[1]}")
    print(f"Registros: {X_train.shape[0]}")
    
    # -------------------------------------------------------------------------
    # Compute scale_pos_weight to handle class imbalance
    # -------------------------------------------------------------------------
    n_neg = (y_train == 0).sum()
    n_pos = (y_train == 1).sum()
    scale_pos_weight = n_neg / n_pos
    print(f"\nscale_pos_weight: {scale_pos_weight:.2f}")

    print(f"\nHiperparámetros (por defecto):")
    print(f"   • n_estimators: 100")
    print(f"   • max_depth: 6")
    print(f"   • learning_rate: 0.3")
    print(f"   • subsample: 1.0")
    print(f"   • colsample_bytree: 1.0")
    
    # -------------------------------------------------------------------------
    # Cross-validation with a manual loop
    # -------------------------------------------------------------------------
    print(f"\nCross-Validation ({cv_folds}-fold):")
    
    cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=RANDOM_STATE)
    
    # Store the results of each fold
    cv_results = {
        'train_accuracy': [], 'test_accuracy': [],
        'train_precision': [], 'test_precision': [],
        'train_recall': [], 'test_recall': [],
        'train_f1': [], 'test_f1': [],
        'train_roc_auc': [], 'test_roc_auc': []
    }
    
    for fold, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
        X_fold_train = X_train.iloc[train_idx]
        X_fold_val = X_train.iloc[val_idx]
        y_fold_train = y_train.iloc[train_idx]
        y_fold_val = y_train.iloc[val_idx]
        
        # Create and train the model
        model = XGBClassifier(
            n_estimators=100,
            max_depth=6,
            learning_rate=0.3,
            subsample=1.0,
            colsample_bytree=1.0,
            min_child_weight=1,
            gamma=0,
            reg_alpha=0,
            reg_lambda=1,
            scale_pos_weight=scale_pos_weight,
            objective='binary:logistic',
            eval_metric='logloss',
            use_label_encoder=False,
            random_state=RANDOM_STATE,
            n_jobs=-1
        )
        model.fit(X_fold_train, y_fold_train)
        
        # Predictions
        y_train_pred = model.predict(X_fold_train)
        y_train_proba = model.predict_proba(X_fold_train)[:, 1]
        y_val_pred = model.predict(X_fold_val)
        y_val_proba = model.predict_proba(X_fold_val)[:, 1]
        
        # Training metrics
        cv_results['train_accuracy'].append(accuracy_score(y_fold_train, y_train_pred))
        cv_results['train_precision'].append(precision_score(y_fold_train, y_train_pred))
        cv_results['train_recall'].append(recall_score(y_fold_train, y_train_pred))
        cv_results['train_f1'].append(f1_score(y_fold_train, y_train_pred))
        cv_results['train_roc_auc'].append(roc_auc_score(y_fold_train, y_train_proba))
        
        # Validation metrics
        cv_results['test_accuracy'].append(accuracy_score(y_fold_val, y_val_pred))
        cv_results['test_precision'].append(precision_score(y_fold_val, y_val_pred))
        cv_results['test_recall'].append(recall_score(y_fold_val, y_val_pred))
        cv_results['test_f1'].append(f1_score(y_fold_val, y_val_pred))
        cv_results['test_roc_auc'].append(roc_auc_score(y_fold_val, y_val_proba))
    
    # Convert to numpy arrays
    for key in cv_results:
        cv_results[key] = np.array(cv_results[key])
    
    # -------------------------------------------------------------------------
    # Per-fold results
    # -------------------------------------------------------------------------
    print("\n Resultados por fold:")
    for i in range(cv_folds):
        print(f"\n  Fold {i+1}:")
        for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
            train_score = cv_results[f'train_{metric}'][i]
            val_score = cv_results[f'test_{metric}'][i]
            print(f"    {metric:<10} | Train: {train_score:.4f} | Val: {val_score:.4f}")


    mlflow.set_experiment("TFM_Dropout_Prediction")
    with mlflow.start_run(run_name=f"XGBoost_CV5_{fase}"):
        mlflow.set_tag("modelo", 'Params por default')
        mlflow.set_tag("tipo", 'Validacion cruzada')
        mlflow.log_params(model.get_params())

        # -------------------------------------------------------------------------
        # CV summary (mean ± std)
        # -------------------------------------------------------------------------
        print(f"\n Resumen Cross-Validation:")
        print(f"\n   {'Métrica':<12} {'Train Mean':>12} {'Train Std':>12} {'Val Mean':>12} {'Val Std':>12}")
        print(f"   {'-'*60}")
        for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
            train_mean = cv_results[f'train_{metric}'].mean()
            train_std = cv_results[f'train_{metric}'].std()
            val_mean = cv_results[f'test_{metric}'].mean()
            val_std = cv_results[f'test_{metric}'].std()
            # mlflow
            mlflow.log_metric(f'test_{metric}_mean', val_mean.round(4))
            mlflow.log_metric(f'test_{metric}_std', val_std.round(4))
            print(f"   {metric:<12} {train_mean:>12.4f} {train_std:>12.4f} {val_mean:>12.4f} {val_std:>12.4f}")
    
    
    # -------------------------------------------------------------------------
    # Return results
    # -------------------------------------------------------------------------
    results = {
        'phase': fase,
        'model': model,  # model fitted on the last CV fold, not refit on the full training split
        'n_features': X_train.shape[1],
        'cv_results': cv_results,
    }
    
    return results



def resumen_cv(cv_results, fase, modelo):
    metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

    summary = {
        'modelo': modelo,
        'fase': fase
    }


    # -------------------------
    # VALIDATION metrics
    # -------------------------
    for metric in metrics:
        summary[f'{metric}_val_mean'] = cv_results[f'test_{metric}'].mean()
        summary[f'{metric}_val_std']  = cv_results[f'test_{metric}'].std()
        

    # -------------------------
    # TRAIN metrics
    # -------------------------
    for metric in metrics:
        summary[f'{metric}_train_mean'] = cv_results[f'train_{metric}'].mean()
        summary[f'{metric}_train_std']  = cv_results[f'train_{metric}'].std()


    return pd.DataFrame([summary])
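
A detail that matters when comparing these summaries with tables built elsewhere: the fold scores are stored as NumPy arrays, and NumPy's `.std()` divides by n (population standard deviation) by default, whereas pandas' `.std()` divides by n - 1. The sketch below illustrates the difference with the standard library, using the validation accuracies of the three T0 folds visible in section 6; since the remaining folds are truncated in that output, the numbers are illustrative only.

```python
import statistics

# Validation accuracies of the three T0 folds visible in section 6 (illustrative subset).
fold_scores = [0.7020, 0.7020, 0.6907]

mean_score = statistics.fmean(fold_scores)
std_population = statistics.pstdev(fold_scores)  # divides by n, like NumPy's default
std_sample = statistics.stdev(fold_scores)       # divides by n - 1, like pandas' default

print(f"mean={mean_score:.4f}  pstdev={std_population:.4f}  stdev={std_sample:.4f}")
```

With only five folds the sample estimate is about 12% larger than the population estimate, so the two conventions should not be mixed within the same comparison table.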

## 6. Modeling PHASE T0 (ENROLLMENT)

In [36]:
# Preprocessing for T0
X_train_T0, X_test_T0, features_T0, prep_T0 = preprocesamiento_xgboost(
    X_train, X_test, y_train, fase='T0'
)

print(f"\nT0 - Dimensiones después del preprocesamiento:")
print(f"   Train: {X_train_T0.shape}")
print(f"   Test:  {X_test_T0.shape}")
print(f"   Variables: {len(features_T0)}")
print(f"   \nLas variables son:")
X_train_T0.columns



T0 - Dimensiones después del preprocesamiento:
   Train: (3539, 18)
   Test:  (885, 18)
   Variables: 18
   
Las variables son:


Index(['daytimeevening_attendance', 'displaced', 'educational_special_needs',
       'gender', 'scholarship_holder', 'international', 'is_single',
       'age_at_enrollment', 'admission_grade', 'previous_qualification_grade',
       'application_mode_risk', 'previous_qualification_risk',
       'mothers_qualification_level', 'fathers_qualification_level',
       'mothers_occupation_level', 'fathers_occupation_level',
       'application_order', 'course_encoded'],
      dtype='object')

In [37]:
# Train and evaluate T0
results_T0 = entrena_xgboost(X_train_T0, y_train, fase='T0')

df_resumen_xgboost = resumen_cv(
    cv_results=results_T0['cv_results'],
    fase='T0',
    modelo='XGBoost'
)
# Save the comparison table
df_resumen_xgboost.to_csv(f"{OUTPUT_DIR_REPORTES}cv_summary_XGBoost.csv", index=False)

print(f"Resultados guardados en: {OUTPUT_DIR_REPORTES}cv_summary_XGBoost.csv")

  ENTRENAMIENTO XGBOOST - FASE T0

Variables: 18
Registros: 3539

scale_pos_weight: 2.11

Hiperparámetros (por defecto):
   • n_estimators: 100
   • max_depth: 6
   • learning_rate: 0.3
   • subsample: 1.0
   • colsample_bytree: 1.0

Cross-Validation (5-fold):

 Resultados por fold:

  Fold 1:
    accuracy   | Train: 0.9700 | Val: 0.7020
    precision  | Train: 0.9383 | Val: 0.5399
    recall     | Train: 0.9703 | Val: 0.5044
    f1         | Train: 0.9540 | Val: 0.5215
    roc_auc    | Train: 0.9967 | Val: 0.7090

  Fold 2:
    accuracy   | Train: 0.9820 | Val: 0.7020
    precision  | Train: 0.9574 | Val: 0.5374
    recall     | Train: 0.9879 | Val: 0.5351
    f1         | Train: 0.9724 | Val: 0.5363
    roc_auc    | Train: 0.9990 | Val: 0.7137

  Fold 3:
    accuracy   | Train: 0.9749 | Val: 0.6907
    precision  | Train: 0.9439 | Val: 0.5161
    recall     | Train: 0.9802 | Val: 0.5639
    f1         | Train: 0.9617 | Val: 0.5389
    roc_auc    | Train: 0.9980 | Val: 0.71

### Comments on PHASE T0

1. All training metrics come close to the maximum value (1.0), which indicates strong overfitting, although slightly less than for Random Forest (RF) in Phase T0. Validation recall is 0.5339 according to the summary table in section 9, below that of Logistic Regression (LR).
2. Validation metrics drop substantially relative to training, by 0.3762 on average across the five metrics in that table, so the model does not generalize adequately; the validation standard deviations also show greater variation (average 0.0182).
3. XGBoost offers no advantage in Phase T0 and overfits.
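
Because F1 is the harmonic mean of precision and recall, the fold-level values printed for this phase can be cross-checked against each other. A minimal sketch using the validation figures of T0, fold 1 (the helper `f1_from_precision_recall` is introduced here for illustration; the inputs are already rounded, so only approximate agreement is expected):

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Validation values reported for T0, fold 1 (rounded to four decimals).
precision, recall, reported_f1 = 0.5399, 0.5044, 0.5215

f1 = f1_from_precision_recall(precision, recall)
print(f"recomputed F1 = {f1:.4f} (reported {reported_f1:.4f})")
```

Since the harmonic mean is dominated by the smaller of the two values, a validation recall near 0.5 keeps F1 close to 0.5 regardless of how high accuracy is.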

## 7. Modeling PHASE T1 (END OF 1ST SEMESTER)

In [38]:
# Preprocessing for T1
X_train_T1, X_test_T1, features_T1, prep_T1 = preprocesamiento_xgboost(
    X_train, X_test, y_train, fase='T1'
)

print(f"\nT1 - Dimensiones después del preprocesamiento:")
print(f"   Train: {X_train_T1.shape}")
print(f"   Test:  {X_test_T1.shape}")
print(f"   Features: {len(features_T1)}")
print(f"   \nLas variables son:")
X_train_T1.columns


T1 - Dimensiones después del preprocesamiento:
   Train: (3539, 29)
   Test:  (885, 29)
   Features: 29
   
Las variables son:


Index(['daytimeevening_attendance', 'displaced', 'educational_special_needs',
       'gender', 'scholarship_holder', 'international', 'is_single',
       'age_at_enrollment', 'admission_grade', 'previous_qualification_grade',
       'application_mode_risk', 'previous_qualification_risk',
       'mothers_qualification_level', 'fathers_qualification_level',
       'mothers_occupation_level', 'fathers_occupation_level',
       'application_order', 'debtor', 'tuition_fees_up_to_date',
       'curricular_units_1st_sem_credited',
       'curricular_units_1st_sem_enrolled',
       'curricular_units_1st_sem_evaluations',
       'curricular_units_1st_sem_approved', 'curricular_units_1st_sem_grade',
       'curricular_units_1st_sem_without_evaluations', 'unemployment_rate',
       'inflation_rate', 'gdp', 'course_encoded'],
      dtype='object')

In [39]:
# Train and evaluate T1
results_T1 = entrena_xgboost(X_train_T1, y_train, fase='T1')

df_resumen_XGBoost_T1 = resumen_cv(
    cv_results=results_T1['cv_results'],
    fase='T1',
    modelo='XGBoost'
)

xg_path = "../outputs/models/XGBoost/cv_summary_XGBoost.csv"
df_xg = pd.read_csv(xg_path)
df_final = pd.concat([df_xg, df_resumen_XGBoost_T1], ignore_index=True)

# Save the comparison table
df_final.to_csv(xg_path, index=False)
print(f"\nResultados guardados en: {OUTPUT_DIR_REPORTES}cv_summary_XGBoost.csv")


  ENTRENAMIENTO XGBOOST - FASE T1

Variables: 29
Registros: 3539

scale_pos_weight: 2.11

Hiperparámetros (por defecto):
   • n_estimators: 100
   • max_depth: 6
   • learning_rate: 0.3
   • subsample: 1.0
   • colsample_bytree: 1.0

Cross-Validation (5-fold):

 Resultados por fold:

  Fold 1:
    accuracy   | Train: 1.0000 | Val: 0.8404
    precision  | Train: 1.0000 | Val: 0.7602
    recall     | Train: 1.0000 | Val: 0.7368
    f1         | Train: 1.0000 | Val: 0.7483
    roc_auc    | Train: 1.0000 | Val: 0.8785

  Fold 2:
    accuracy   | Train: 0.9996 | Val: 0.8333
    precision  | Train: 0.9989 | Val: 0.7644
    recall     | Train: 1.0000 | Val: 0.6974
    f1         | Train: 0.9995 | Val: 0.7294
    roc_auc    | Train: 1.0000 | Val: 0.8783

  Fold 3:
    accuracy   | Train: 0.9996 | Val: 0.8164
    precision  | Train: 0.9989 | Val: 0.7175
    recall     | Train: 1.0000 | Val: 0.7048
    f1         | Train: 0.9995 | Val: 0.7111
    roc_auc    | Train: 1.0000 | Val: 0.88

### Comments on PHASE T1

1. Nearly all training metrics reach the maximum value (1.0), which indicates severe overfitting. Validation recall is 0.7203 according to the summary table in section 9, similar to Logistic Regression (LR) and Random Forest (RF).
2. Validation metrics drop substantially relative to training (by about 0.21 on average, taking training scores of approximately 1.0), so the model does not generalize adequately. The validation standard deviations show less variation (average 0.0150) than in Phase T0, which indicates relative stability.

## 8. Modeling PHASE T2 (END OF 2ND SEMESTER)

In [40]:
# Preprocessing for T2
X_train_T2, X_test_T2, features_T2, prep_T2 = preprocesamiento_xgboost(
    X_train, X_test, y_train, fase='T2'
)

print(f"\nT2 - Dimensiones después del preprocesamiento:")
print(f"   Train: {X_train_T2.shape}")
print(f"   Test:  {X_test_T2.shape}")
print(f"   Features: {len(features_T2)}")
print(f"   \nLas variables son:")
X_train_T2.columns



T2 - Dimensiones después del preprocesamiento:
   Train: (3539, 35)
   Test:  (885, 35)
   Features: 35
   
Las variables son:


Index(['daytimeevening_attendance', 'displaced', 'educational_special_needs',
       'gender', 'scholarship_holder', 'international', 'is_single',
       'age_at_enrollment', 'admission_grade', 'previous_qualification_grade',
       'application_mode_risk', 'previous_qualification_risk',
       'mothers_qualification_level', 'fathers_qualification_level',
       'mothers_occupation_level', 'fathers_occupation_level',
       'application_order', 'debtor', 'tuition_fees_up_to_date',
       'curricular_units_1st_sem_credited',
       'curricular_units_1st_sem_enrolled',
       'curricular_units_1st_sem_evaluations',
       'curricular_units_1st_sem_approved', 'curricular_units_1st_sem_grade',
       'curricular_units_1st_sem_without_evaluations', 'unemployment_rate',
       'inflation_rate', 'gdp', 'curricular_units_2nd_sem_credited',
       'curricular_units_2nd_sem_enrolled',
       'curricular_units_2nd_sem_evaluations',
       'curricular_units_2nd_sem_approved', 'curricular_units_2nd

In [41]:
# Train and evaluate T2
results_T2 = entrena_xgboost(X_train_T2, y_train, fase='T2')

df_resumen_XGBoost_T2 = resumen_cv(
    cv_results=results_T2['cv_results'],
    fase='T2',
    modelo='XGBoost'
)

xg_path = "../outputs/models/XGBoost/cv_summary_XGBoost.csv"
df_xg = pd.read_csv(xg_path)
df_final = pd.concat([df_xg, df_resumen_XGBoost_T2], ignore_index=True)

# Save the comparison table
df_final.to_csv(xg_path, index=False)
print(f"\nResultados guardados en: {OUTPUT_DIR_REPORTES}cv_summary_XGBoost.csv")

  ENTRENAMIENTO XGBOOST - FASE T2

Variables: 35
Registros: 3539

scale_pos_weight: 2.11

Hiperparámetros (por defecto):
   • n_estimators: 100
   • max_depth: 6
   • learning_rate: 0.3
   • subsample: 1.0
   • colsample_bytree: 1.0

Cross-Validation (5-fold):

 Resultados por fold:

  Fold 1:
    accuracy   | Train: 0.9996 | Val: 0.8573
    precision  | Train: 0.9989 | Val: 0.7822
    recall     | Train: 1.0000 | Val: 0.7719
    f1         | Train: 0.9995 | Val: 0.7770
    roc_auc    | Train: 1.0000 | Val: 0.8944

  Fold 2:
    accuracy   | Train: 1.0000 | Val: 0.8503
    precision  | Train: 1.0000 | Val: 0.7824
    recall     | Train: 1.0000 | Val: 0.7412
    f1         | Train: 1.0000 | Val: 0.7613
    roc_auc    | Train: 1.0000 | Val: 0.9039

  Fold 3:
    accuracy   | Train: 1.0000 | Val: 0.8559
    precision  | Train: 1.0000 | Val: 0.7934
    recall     | Train: 1.0000 | Val: 0.7445
    f1         | Train: 1.0000 | Val: 0.7682
    roc_auc    | Train: 1.0000 | Val: 0.90

### Comments on PHASE T2

1. Training metrics remain at the maximum value (1.0), which indicates severe overfitting. Validation recall is 0.7485 and AUC is 0.9080, reflecting greater predictive capacity.
2. Validation metrics drop substantially relative to training, by 0.2544 on average, so the model does not generalize adequately. The standard deviations show somewhat greater variation (average 0.0162) than in Phase T1, though still lower than in Phase T0.

## 9. XGBoost final summary

In [42]:
# Show the final summary
df_final = pd.read_csv(f"{OUTPUT_DIR_REPORTES}cv_summary_XGBoost.csv")

print("================================================================================")
print("  RESUMEN XGBoost - CROSS VALIDATION")
print("================================================================================")
print(df_final.to_string(index=False))

  RESUMEN XGBoost - CROSS VALIDATION
 modelo fase  accuracy_val_mean  accuracy_val_std  precision_val_mean  precision_val_std  recall_val_mean  recall_val_std  f1_val_mean  f1_val_std  roc_auc_val_mean  roc_auc_val_std  accuracy_train_mean  accuracy_train_std  precision_train_mean  precision_train_std  recall_train_mean  recall_train_std  f1_train_mean  f1_train_std  roc_auc_train_mean  roc_auc_train_std
XGBoost   T0             0.6883            0.0174              0.5157             0.0265           0.5339          0.0206       0.5241      0.0173            0.7130           0.0091               0.9745              0.0041                0.9435               0.0074             0.9793            0.0056         0.9611        0.0063              0.9976             0.0008
XGBoost   T1             0.8358            0.0114              0.7572             0.0214           0.7203          0.0180       0.7382      0.0175            0.8849           0.0066               0.9997              0.000
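
The size of the train/validation gap can be read directly from this table. As a worked example, the sketch below recomputes the per-metric gap, the mean gap and the mean validation standard deviation for the T0 row, with the values copied by hand from the output above (the T1 row is truncated there, so it is not included):

```python
METRICS = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# T0 row of the summary table above.
t0_val_mean = {"accuracy": 0.6883, "precision": 0.5157, "recall": 0.5339,
               "f1": 0.5241, "roc_auc": 0.7130}
t0_val_std = {"accuracy": 0.0174, "precision": 0.0265, "recall": 0.0206,
              "f1": 0.0173, "roc_auc": 0.0091}
t0_train_mean = {"accuracy": 0.9745, "precision": 0.9435, "recall": 0.9793,
                 "f1": 0.9611, "roc_auc": 0.9976}

# Gap = mean training score minus mean validation score, per metric.
gaps = {m: t0_train_mean[m] - t0_val_mean[m] for m in METRICS}
mean_gap = sum(gaps.values()) / len(METRICS)
mean_val_std = sum(t0_val_std.values()) / len(METRICS)

for m in METRICS:
    print(f"{m:<10} gap = {gaps[m]:.4f}")
print(f"mean gap = {mean_gap:.4f}, mean validation std = {mean_val_std:.4f}")
```

For T0 the largest gap is in recall, the metric of most interest when the goal is to detect students who drop out.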

## 10. Full training summary (all algorithms)

In [43]:
resumen_path = "../outputs/models/cv_summary_entrenamiento.csv"
df_resumen = pd.read_csv(resumen_path)

xg_path = "../outputs/models/XGBoost/cv_summary_XGBoost.csv"
df_xg = pd.read_csv(xg_path)

df_resumen = pd.concat([df_resumen, df_xg], ignore_index=True)

# Save the comparison table
df_resumen.to_csv(f"../outputs/models/cv_summary_entrenamiento.csv", index=False)

print(f"Resultados guardados en: ../outputs/models/cv_summary_entrenamiento.csv")

Resultados guardados en: ../outputs/models/cv_summary_entrenamiento.csv


## Conclusion

Although XGBoost shows moderate standard deviations, higher than those of Logistic Regression (LR) and similar to those of Random Forest (RF), they decrease as more information is added. Compared with RF it achieves better recall and AUC and lower variance; compared with LR, however, it is less able to detect students who drop out and has lower discriminative power. Although it improves substantially once academic information is incorporated in phases T1 and T2, it overfits the training set, which suggests the need for proper hyperparameter optimization.
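
As a possible starting point for that optimization, the sketch below draws random candidate configurations from a small search space that favours shallower trees, lower learning rates and row/column subsampling. The parameter names are `XGBClassifier` arguments, but the ranges, the `SEARCH_SPACE` dictionary and the `sample_configurations` helper are illustrative assumptions, not values used or validated in this study.

```python
import random

# Hypothetical search space; the ranges are illustrative assumptions only.
SEARCH_SPACE = {
    "n_estimators": [100, 200, 400],
    "max_depth": [2, 3, 4, 5],
    "learning_rate": [0.01, 0.03, 0.05, 0.1],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "min_child_weight": [1, 3, 5],
    "reg_lambda": [1, 5, 10],
}


def sample_configurations(space, n, seed=42):
    """Draw n random configurations (with replacement) from the search space."""
    rng = random.Random(seed)
    return [{name: rng.choice(options) for name, options in space.items()}
            for _ in range(n)]


candidates = sample_configurations(SEARCH_SPACE, n=5)
for params in candidates:
    print(params)
```

Each candidate could then be passed to `XGBClassifier(**params, scale_pos_weight=...)` and scored with the same stratified 5-fold procedure, keeping the configuration with the best validation recall or AUC.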