# CATBOOST MODELING (NO OPTIMIZATION)

Objective: Train a CatBoost model with fixed, untuned hyperparameters to compare against the baseline (Logistic Regression), XGBoost, and LightGBM.

Temporal phases:
- T0 (Enrollment)          : Variables available at the time of enrollment
- T1 (End of 1st Semester) : T0 + academic variables from the 1st semester
- T2 (End of 2nd Semester) : T1 + academic variables from the 2nd semester

CatBoost-specific preprocessing:
- No scaling required
- Native handling of categorical variables (no Label Encoding)
- Target Encoding for 'course' (high cardinality)

Pipeline:
1. Load the preprocessed data
2. Define the variables for each temporal phase
3. Apply CatBoost-specific preprocessing
4. Train with 5-fold cross-validation
5. Compare results across phases
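
The stratified 5-fold split in step 4 keeps the class ratio roughly constant in every fold. As a rough, dependency-free illustration of that idea (this is not scikit-learn's `StratifiedKFold` implementation, and the helper name `stratified_folds` is hypothetical), it can be sketched as:

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=42):
    """Assign each sample index to a fold, class by class, in round-robin order."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Dealing indices round-robin keeps per-class counts within 1 across folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Same class counts as the dataset in this notebook: 3003 negatives, 1421 positives.
labels = [0] * 3003 + [1] * 1421
folds = stratified_folds(labels)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)
```

Each fold ends up with 284 or 285 positive samples, so every validation fold sees approximately the same imbalance as the full dataset.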

## 0. Libraries and configuration

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Preprocessing
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

# Model
from catboost import CatBoostClassifier
import catboost

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score
)

# Target Encoding
from category_encoders import TargetEncoder

# Plotting configuration
plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)

# Seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Output directories
OUTPUT_DIR = "../outputs/figures/modelado/CatBoost/"
OUTPUT_DIR_REPORTES = "../outputs/models/CatBoost/"
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR_REPORTES, exist_ok=True)

# MLflow
import mlflow
import mlflow.sklearn

## 1. Loading the preprocessed data

In [21]:
# Load the preprocessed dataset
df = pd.read_csv('../data/processed/preprocessed_data.csv')

print(f"Dataset loaded: {df.shape[0]} rows x {df.shape[1]} columns")
print(df['target_binario'].value_counts())
print(f"\nImbalance ratio: {df['target_binario'].value_counts()[0] / df['target_binario'].value_counts()[1]:.2f}:1")

df.head()

Dataset loaded: 4424 rows x 38 columns
target_binario
0    3003
1    1421
Name: count, dtype: int64

Imbalance ratio: 2.11:1


Unnamed: 0,application_order,course,daytimeevening_attendance,previous_qualification_grade,admission_grade,displaced,educational_special_needs,debtor,tuition_fees_up_to_date,gender,scholarship_holder,age_at_enrollment,international,curricular_units_1st_sem_credited,curricular_units_1st_sem_enrolled,curricular_units_1st_sem_evaluations,curricular_units_1st_sem_approved,curricular_units_1st_sem_grade,curricular_units_1st_sem_without_evaluations,curricular_units_2nd_sem_credited,curricular_units_2nd_sem_enrolled,curricular_units_2nd_sem_evaluations,curricular_units_2nd_sem_approved,curricular_units_2nd_sem_grade,curricular_units_2nd_sem_without_evaluations,unemployment_rate,inflation_rate,gdp,is_single,application_mode_risk,is_over_23_entry,previous_qualification_risk,mothers_qualification_level,fathers_qualification_level,mothers_occupation_level,fathers_occupation_level,has_unknown_parent_info,target_binario
0,5,171,1,122.0,127.3,1,0,0,1,1,0,20,0,0,0,0,0,0.0,0,0,0,0,0,0.0,0,10.8,1.4,1.74,1,Bajo_Riesgo,0,Bajo_Riesgo,Basica_Media,Secundaria,Otro_Trabajo,Otro_Trabajo,0,1
1,1,9254,1,160.0,142.5,1,0,0,0,1,0,19,0,0,6,6,6,14.0,0,0,6,6,6,13.6667,0,13.9,-0.3,0.79,1,Bajo_Riesgo,0,Bajo_Riesgo,Secundaria,Superior,Profesional,Profesional,0,0
2,5,9070,1,122.0,124.8,1,0,0,0,1,0,19,0,0,6,0,0,0.0,0,0,6,0,0,0.0,0,10.8,1.4,1.74,1,Bajo_Riesgo,0,Bajo_Riesgo,Basica_Baja,Basica_Baja,Otro_Trabajo,Otro_Trabajo,0,1
3,2,9773,1,122.0,119.6,1,0,0,1,0,0,20,0,0,6,8,6,13.4286,0,0,6,10,5,12.4,0,9.4,-0.8,-3.12,1,Bajo_Riesgo,0,Bajo_Riesgo,Basica_Media,Basica_Baja,Otro_Trabajo,Profesional,0,0
4,1,8014,0,100.0,141.5,0,0,0,1,0,0,45,0,0,6,9,5,12.3333,0,0,6,6,6,13.0,0,13.9,-0.3,0.79,0,Alto_Riesgo,1,Bajo_Riesgo,Basica_Baja,Basica_Media,Otro_Trabajo,Otro_Trabajo,0,0


## 2. Defining variables by temporal phase (T0, T1, T2)

In [22]:
# TARGET
TARGET = 'target_binario'

# -----------------------------------------------------------------------------
# BINARY VARIABLES (no encoding needed, already 0/1)
# -----------------------------------------------------------------------------
VARS_BINARIAS_T0 = [
    'daytimeevening_attendance',
    'displaced',
    'educational_special_needs',
    'gender',
    'scholarship_holder',
    'international',
    'is_single'
]

VARS_BINARIAS_T1 = [
    'debtor',
    'tuition_fees_up_to_date'
]

# -----------------------------------------------------------------------------
# NUMERIC VARIABLES (no scaling needed for CatBoost)
# -----------------------------------------------------------------------------
VARS_NUMERICAS_T0 = [
    'age_at_enrollment',
    'admission_grade',
    'previous_qualification_grade'
]

VARS_NUMERICAS_T1 = [
    'curricular_units_1st_sem_credited',
    'curricular_units_1st_sem_enrolled',
    'curricular_units_1st_sem_evaluations',
    'curricular_units_1st_sem_approved',
    'curricular_units_1st_sem_grade',
    'curricular_units_1st_sem_without_evaluations',
    'unemployment_rate',
    'inflation_rate',
    'gdp'
]

VARS_NUMERICAS_T2 = [
    'curricular_units_2nd_sem_credited',
    'curricular_units_2nd_sem_enrolled',
    'curricular_units_2nd_sem_evaluations',
    'curricular_units_2nd_sem_approved',
    'curricular_units_2nd_sem_grade',
    'curricular_units_2nd_sem_without_evaluations'
]

# -----------------------------------------------------------------------------
# GROUPED CATEGORICAL VARIABLES (handled natively by CatBoost)
# -----------------------------------------------------------------------------
VARS_CATEGORICAS_AGRUPADAS_T0 = [
    'application_mode_risk',
    'previous_qualification_risk',
    'mothers_qualification_level',
    'fathers_qualification_level',
    'mothers_occupation_level',
    'fathers_occupation_level'
]

# -----------------------------------------------------------------------------
# CATEGORICAL VARIABLES FOR TARGET ENCODING (high cardinality)
# -----------------------------------------------------------------------------
VARS_TARGET_ENCODING_T0 = ['course']

# -----------------------------------------------------------------------------
# ORDINAL VARIABLE (treated as numeric)
# -----------------------------------------------------------------------------
VARS_ORDINALES_T0 = ['application_order']

# =============================================================================
# VARIABLE COMPOSITION BY TEMPORAL PHASE
# =============================================================================

# T0: Variables available at enrollment
VARS_T0 = (
    VARS_BINARIAS_T0 +
    VARS_NUMERICAS_T0 +
    VARS_CATEGORICAS_AGRUPADAS_T0 +
    VARS_TARGET_ENCODING_T0 +
    VARS_ORDINALES_T0
)

# T1: T0 + 1st-semester variables
VARS_T1 = (
    VARS_T0 +
    VARS_BINARIAS_T1 +
    VARS_NUMERICAS_T1
)

# T2: T1 + 2nd-semester variables
VARS_T2 = (
    VARS_T1 +
    VARS_NUMERICAS_T2
)

print("================================================================================")
print("  VARIABLES BY TEMPORAL PHASE")
print("================================================================================")
print(f"\n T0 (Enrollment): {len(VARS_T0)} variables")
print(f" T1 (End of 1st Sem): {len(VARS_T1)} variables (+{len(VARS_T1) - len(VARS_T0)})")
print(f" T2 (End of 2nd Sem): {len(VARS_T2)} variables (+{len(VARS_T2) - len(VARS_T1)})")

  VARIABLES BY TEMPORAL PHASE

 T0 (Enrollment): 18 variables
 T1 (End of 1st Sem): 29 variables (+11)
 T2 (End of 2nd Sem): 35 variables (+6)
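
Because each phase is built by concatenating lists, a quick sanity check can confirm that every phase strictly extends the previous one and that no feature appears twice. The following is a minimal sketch; `check_nested_phases` is a hypothetical helper, not part of the notebook, and the feature lists are a small subset of the real columns chosen for illustration.

```python
def check_nested_phases(phases):
    """Verify each phase extends the previous one and has no duplicate names.

    `phases` maps phase name -> list of feature names, in temporal order.
    Returns the number of features added at each phase.
    """
    added = {}
    previous = []
    for name, features in phases.items():
        if len(set(features)) != len(features):
            raise ValueError(f"{name} contains duplicate features")
        if features[:len(previous)] != previous:
            raise ValueError(f"{name} does not extend the previous phase")
        added[name] = len(features) - len(previous)
        previous = features
    return added

# Small illustrative subset of the notebook's columns.
t0 = ["gender", "admission_grade", "course"]
t1 = t0 + ["debtor", "curricular_units_1st_sem_grade"]
t2 = t1 + ["curricular_units_2nd_sem_grade"]
added_per_phase = check_nested_phases({"T0": t0, "T1": t1, "T2": t2})
print(added_per_phase)
```

Applied to `VARS_T0`, `VARS_T1`, and `VARS_T2`, the same check would be expected to report the 18, +11, and +6 increments printed above.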


## 3. Data preparation

In [23]:
# Prepare X and y
X = df[VARS_T2].copy()
y = df[TARGET].copy()

print("================================================================================")
print("  TRAINING DATA")
print("================================================================================")
print(f"\nTotal records: {X.shape[0]}")
print(f"Total variables: {X.shape[1]}")

print("\nTarget distribution:")
print(y.value_counts())
print(f"Imbalance ratio: {y.value_counts()[0] / y.value_counts()[1]:.2f}:1")

  TRAINING DATA

Total records: 4424
Total variables: 35

Target distribution:
target_binario
0    3003
1    1421
Name: count, dtype: int64
Imbalance ratio: 2.11:1
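
The 2.11:1 ratio is the same quantity that the training function in section 5 uses as the weight of the positive class. A minimal sketch of that calculation, using only the class counts printed above (`balanced_class_weights` is a hypothetical helper name):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight the positive class by n_negative / n_positive; negatives keep weight 1."""
    counts = Counter(labels)
    return {0: 1.0, 1: counts[0] / counts[1]}

labels = [0] * 3003 + [1] * 1421   # class counts printed above
weights = balanced_class_weights(labels)
print(round(weights[1], 2))  # 2.11

# With these weights both classes carry the same total weight in the loss.
total_negative = 3003 * weights[0]
total_positive = 1421 * weights[1]
```

The design choice here is that errors on the minority class cost about twice as much, so that both classes contribute equally to the weighted loss.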


## 4. Preprocessing functions for CatBoost

In [24]:
def obtiene_variables_por_fase(fase):
    # Return the variable lists for the given temporal phase
    if fase == 'T0':
        return {
            'binarias': VARS_BINARIAS_T0,
            'numericas': VARS_NUMERICAS_T0 + VARS_ORDINALES_T0,
            'categoricas_nativas': VARS_CATEGORICAS_AGRUPADAS_T0,
            'categoricas_te': VARS_TARGET_ENCODING_T0,
            'all': VARS_T0
        }
    elif fase == 'T1':
        return {
            'binarias': VARS_BINARIAS_T0 + VARS_BINARIAS_T1,
            'numericas': VARS_NUMERICAS_T0 + VARS_ORDINALES_T0 + VARS_NUMERICAS_T1,
            'categoricas_nativas': VARS_CATEGORICAS_AGRUPADAS_T0,
            'categoricas_te': VARS_TARGET_ENCODING_T0,
            'all': VARS_T1
        }
    elif fase == 'T2':
        return {
            'binarias': VARS_BINARIAS_T0 + VARS_BINARIAS_T1,
            'numericas': VARS_NUMERICAS_T0 + VARS_ORDINALES_T0 + VARS_NUMERICAS_T1 + VARS_NUMERICAS_T2,
            'categoricas_nativas': VARS_CATEGORICAS_AGRUPADAS_T0,
            'categoricas_te': VARS_TARGET_ENCODING_T0,
            'all': VARS_T2
        }
    else:
        raise ValueError(f"Invalid phase: {fase}. Use 'T0', 'T1', or 'T2'")


def preprocesamiento_catboost(X, y, fase):
    """
    Preprocess the data for CatBoost:
    - No Label Encoding needed (categoricals are handled natively)
    - Target Encoding only for 'course' (high cardinality)
    - Casts categorical columns to string for CatBoost

    Note: the target encoder is fit on every row passed in, so when the
    result is later split into CV folds, validation rows have already
    contributed their targets to the encoding (a potential source of leakage).
    """
    variables_fase = obtiene_variables_por_fase(fase)
    
    # Select only the variables for this phase
    X_fase = X[variables_fase['all']].copy()
    
    # -------------------------------------------------------------------------
    # 1. TARGET ENCODING for 'course' (high cardinality)
    # -------------------------------------------------------------------------
    te = TargetEncoder(cols=variables_fase['categoricas_te'], smoothing=0.3)
    
    for col in variables_fase['categoricas_te']:
        X_fase[col + '_encoded'] = te.fit_transform(X_fase[[col]], y)[col]
        X_fase = X_fase.drop(columns=[col])
    
    # -------------------------------------------------------------------------
    # 2. Cast categorical columns to string so CatBoost can process them
    # -------------------------------------------------------------------------
    for col in variables_fase['categoricas_nativas']:
        X_fase[col] = X_fase[col].astype(str)
    
    # -------------------------------------------------------------------------
    # 3. Get the positional indices of the categorical columns for CatBoost
    # -------------------------------------------------------------------------
    cat_features_idx = [X_fase.columns.get_loc(c) for c in variables_fase['categoricas_nativas']]
    
    # -------------------------------------------------------------------------
    # Store preprocessing metadata
    # -------------------------------------------------------------------------
    variables = X_fase.columns.tolist()
    preprocessors = {
        'target_encoder': te,
        'feature_names': variables,
        'cat_features_idx': cat_features_idx
    }
    
    return X_fase, variables, preprocessors
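
The function above fits the target encoder on all rows before cross-validation starts. One common alternative is to learn the encoding from the training rows of each fold only and then apply it to that fold's validation rows. The sketch below illustrates this with plain Python; it uses simple additive smoothing toward the global mean, which is an assumption made for illustration and not the formula used by `category_encoders.TargetEncoder`. The helper names `fit_target_means` and `encode` are hypothetical.

```python
from collections import defaultdict

def fit_target_means(categories, targets, smoothing=10.0):
    """Learn a smoothed target mean per category from training rows only.

    Uses additive smoothing toward the global mean; this is an
    illustrative assumption, not category_encoders' formula.
    """
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    means = {
        cat: (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
        for cat in counts
    }
    return means, prior

def encode(categories, means, prior):
    """Map categories to learned means; unseen categories fall back to the prior."""
    return [means.get(cat, prior) for cat in categories]

# Toy fold: the encoder never sees the validation targets.
train_course = [171, 171, 9254, 9254, 9254]
train_target = [1, 1, 0, 0, 1]
val_course = [171, 8014]          # 8014 is unseen in the training rows

means, prior = fit_target_means(train_course, train_target, smoothing=1.0)
encoded_val = encode(val_course, means, prior)
print(encoded_val)
```

Inside a CV loop, `fit_target_means` would be called once per fold on the training indices, so the validation metric is computed on encodings that did not use the validation targets.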

## 5. Training functions

In [25]:
def entrena_catboost_cv(X, y, fase, cat_features_idx, cv_folds=5):
    """
    Train CatBoost with cross-validation.
    """

    mlflow.end_run()

    print("================================================================================")
    print(f"  CATBOOST TRAINING - PHASE {fase}")
    print("================================================================================")
    print(f"\nVariables: {X.shape[1]}")
    print(f"Records: {X.shape[0]}")
    print(f"Categorical variables (native): {len(cat_features_idx)}")
    
    # -------------------------------------------------------------------------
    # Compute class_weights to handle class imbalance
    # -------------------------------------------------------------------------
    n_neg = (y == 0).sum()
    n_pos = (y == 1).sum()
    scale_pos_weight = n_neg / n_pos
    class_weights = {0: 1.0, 1: scale_pos_weight}
    print(f"\nclass_weights: {{0: 1.0, 1: {scale_pos_weight:.2f}}}")
    
    # Note: these values are fixed and untuned; iterations is set explicitly
    # to 100, whereas CatBoost's own library default is 1000.
    print(f"\nHyperparameters (default):")
    print(f"   • iterations: 100")
    print(f"   • depth: 6")
    print(f"   • learning_rate: 0.1")
    print(f"   • l2_leaf_reg: 3")
    
    # -------------------------------------------------------------------------
    # Cross-Validation
    # -------------------------------------------------------------------------
    print(f"\nCross-Validation ({cv_folds}-fold):")
    
    cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=RANDOM_STATE)
    
    cv_results = {
        'train_accuracy': [], 'test_accuracy': [],
        'train_precision': [], 'test_precision': [],
        'train_recall': [], 'test_recall': [],
        'train_f1': [], 'test_f1': [],
        'train_roc_auc': [], 'test_roc_auc': []
    }
    
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        X_fold_train = X.iloc[train_idx]
        X_fold_val = X.iloc[val_idx]
        y_fold_train = y.iloc[train_idx]
        y_fold_val = y.iloc[val_idx]
        
        # Build the model
        model = CatBoostClassifier(
            iterations=100, depth=6, learning_rate=0.1, l2_leaf_reg=3,
            border_count=254, class_weights=class_weights,
            loss_function='Logloss', eval_metric='AUC',
            cat_features=cat_features_idx, random_seed=RANDOM_STATE, verbose=False
        )
        model.fit(X_fold_train, y_fold_train)
        
        # Predictions
        y_train_pred = model.predict(X_fold_train)
        y_train_proba = model.predict_proba(X_fold_train)[:, 1]
        y_val_pred = model.predict(X_fold_val)
        y_val_proba = model.predict_proba(X_fold_val)[:, 1]
        
        # Train metrics
        cv_results['train_accuracy'].append(accuracy_score(y_fold_train, y_train_pred))
        cv_results['train_precision'].append(precision_score(y_fold_train, y_train_pred))
        cv_results['train_recall'].append(recall_score(y_fold_train, y_train_pred))
        cv_results['train_f1'].append(f1_score(y_fold_train, y_train_pred))
        cv_results['train_roc_auc'].append(roc_auc_score(y_fold_train, y_train_proba))
        
        # Validation metrics
        cv_results['test_accuracy'].append(accuracy_score(y_fold_val, y_val_pred))
        cv_results['test_precision'].append(precision_score(y_fold_val, y_val_pred))
        cv_results['test_recall'].append(recall_score(y_fold_val, y_val_pred))
        cv_results['test_f1'].append(f1_score(y_fold_val, y_val_pred))
        cv_results['test_roc_auc'].append(roc_auc_score(y_fold_val, y_val_proba))
    
    # Convert to NumPy arrays
    for key in cv_results:
        cv_results[key] = np.array(cv_results[key])
    
    # -------------------------------------------------------------------------
    # Per-fold results
    # -------------------------------------------------------------------------
    print("\n Results by fold:")
    for i in range(cv_folds):
        print(f"\n  Fold {i+1}:")
        for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
            train_score = cv_results[f'train_{metric}'][i]
            val_score = cv_results[f'test_{metric}'][i]
            print(f"    {metric:<10} | Train: {train_score:.4f} | Val: {val_score:.4f}")


    mlflow.set_experiment("TFM_Dropout_Prediction")
    with mlflow.start_run(run_name=f"CatBoost_CV5_{fase}"):
        mlflow.set_tag("modelo", 'Params por default')
        mlflow.set_tag("tipo", 'Validacion cruzada')
        mlflow.log_params(model.get_params())

        # -------------------------------------------------------------------------
        # CV summary
        # -------------------------------------------------------------------------
        print(f"\n   {'Metric':<12} {'Train Mean':>12} {'Train Std':>12} {'Val Mean':>12} {'Val Std':>12}")
        print(f"   {'-'*60}")
        for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
            train_mean = cv_results[f'train_{metric}'].mean()
            train_std = cv_results[f'train_{metric}'].std()
            val_mean = cv_results[f'test_{metric}'].mean()
            val_std = cv_results[f'test_{metric}'].std()
            # mlflow
            mlflow.log_metric(f'test_{metric}_mean', val_mean.round(4))
            mlflow.log_metric(f'test_{metric}_std', val_std.round(4))
            print(f"   {metric:<12} {train_mean:>12.4f} {train_std:>12.4f} {val_mean:>12.4f} {val_std:>12.4f}")
        
        return {
            'phase': fase,
            'n_features': X.shape[1],
            'cv_results': cv_results
        }


def resumen_cv(cv_results, fase, modelo):
    metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
    summary = {'modelo': modelo, 'fase': fase}
    
    # -------------------------
    # VALIDATION metrics
    # -------------------------
    for metric in metrics:
        summary[f'{metric}_val_mean'] = cv_results[f'test_{metric}'].mean()
        summary[f'{metric}_val_std']  = cv_results[f'test_{metric}'].std()
        

    # -------------------------
    # TRAIN metrics
    # -------------------------
    for metric in metrics:
        summary[f'{metric}_train_mean'] = cv_results[f'train_{metric}'].mean()
        summary[f'{metric}_train_std']  = cv_results[f'train_{metric}'].std()

    
    return pd.DataFrame([summary])
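
Note that `.std()` on a NumPy array uses the population definition by default (`ddof=0`), so the `*_std` columns produced above divide by the number of folds rather than by one less. A small sketch of the difference, using the standard library and hypothetical fold scores (the values below are illustrative, not results from this notebook):

```python
import math
import statistics

# Hypothetical validation scores for five folds (illustrative values only).
fold_scores = [0.69, 0.70, 0.69, 0.71, 0.68]

population_std = statistics.pstdev(fold_scores)  # same definition as NumPy's default ddof=0
sample_std = statistics.stdev(fold_scores)       # ddof=1, divides by n - 1

# For n folds the two differ by a factor of sqrt(n / (n - 1)).
ratio = sample_std / population_std
print(round(ratio, 4), round(math.sqrt(5 / 4), 4))
```

With only five folds the sample standard deviation is about 12% larger, which matters when comparing spreads against results reported with a different convention.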

## 6. Modeling PHASE T0 (ENROLLMENT)

In [26]:
# Preprocessing for T0
X_T0, features_T0, prep_T0 = preprocesamiento_catboost(X, y, fase='T0')

print(f"\nT0 - Dimensions: {X_T0.shape}")
print(f"Variables: {len(features_T0)}")
print(f"Categorical (indices): {prep_T0['cat_features_idx']}")


T0 - Dimensions: (4424, 18)
Variables: 18
Categorical (indices): [10, 11, 12, 13, 14, 15]
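
The indices 10 to 15 follow from the column order after preprocessing: `course` is dropped and its encoded version is appended at the end, which leaves the six native categorical columns right after the seven binary and three numeric ones. A minimal sketch of that bookkeeping, with placeholder column names standing in for the real ones (`cat_feature_positions` is a hypothetical helper):

```python
def cat_feature_positions(columns, encoded_cols, native_cats):
    """Return positions of native categorical columns after target-encoded
    columns are dropped and re-appended as '<name>_encoded' at the end."""
    remaining = [c for c in columns if c not in encoded_cols]
    final = remaining + [f"{c}_encoded" for c in encoded_cols]
    return [final.index(c) for c in native_cats], final

# Column layout mirroring T0: 7 binary, 3 numeric, 6 categorical, 'course', 'application_order'.
binary = [f"bin_{i}" for i in range(7)]        # placeholder names
numeric = [f"num_{i}" for i in range(3)]       # placeholder names
categorical = [f"cat_{i}" for i in range(6)]   # placeholder names
columns = binary + numeric + categorical + ["course", "application_order"]

positions, final_columns = cat_feature_positions(columns, ["course"], categorical)
print(positions)           # [10, 11, 12, 13, 14, 15]
print(final_columns[-2:])  # ['application_order', 'course_encoded']
```

Since the categorical columns come before `course` in every phase, the same indices would be expected for T1 and T2 as well.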


In [27]:
# Train T0
results_T0 = entrena_catboost_cv(X_T0, y, fase='T0', cat_features_idx=prep_T0['cat_features_idx'])

df_resumen_catboost = resumen_cv(results_T0['cv_results'], fase='T0', modelo='CatBoost')
df_resumen_catboost.to_csv(f"{OUTPUT_DIR_REPORTES}cv_summary_CatBoost.csv", index=False)
print(f"\nResults saved to: {OUTPUT_DIR_REPORTES}cv_summary_CatBoost.csv")

  CATBOOST TRAINING - PHASE T0

Variables: 18
Records: 4424
Categorical variables (native): 6

class_weights: {0: 1.0, 1: 2.11}

Hyperparameters (default):
   • iterations: 100
   • depth: 6
   • learning_rate: 0.1
   • l2_leaf_reg: 3

Cross-Validation (5-fold):

 Results by fold:

  Fold 1:
    accuracy   | Train: 0.7253 | Val: 0.6893
    precision  | Train: 0.5544 | Val: 0.5128
    recall     | Train: 0.7359 | Val: 0.7053
    f1         | Train: 0.6324 | Val: 0.5938
    roc_auc    | Train: 0.8111 | Val: 0.7582

  Fold 2:
    accuracy   | Train: 0.7313 | Val: 0.7006
    precision  | Train: 0.5630 | Val: 0.5265
    recall     | Train: 0.7309 | Val: 0.6655
    f1         | Train: 0.6361 | Val: 0.5879
    roc_auc    | Train: 0.8143 | Val: 0.7635

  Fold 3:
    accuracy   | Train: 0.7197 | Val: 0.6904
    precision  | Train: 0.5472 | Val: 0.5130
    recall     | Train: 0.7388 | Val: 0.6937
    f1         | Train: 0.6287 | Val: 0.5898
    roc_auc    | Train: 0.8103

## 7. Modeling PHASE T1 (END OF 1ST SEMESTER)

In [28]:
# Preprocessing for T1
X_T1, features_T1, prep_T1 = preprocesamiento_catboost(X, y, fase='T1')

print(f"\nT1 - Dimensions: {X_T1.shape}")
print(f"Variables: {len(features_T1)}")


T1 - Dimensions: (4424, 29)
Variables: 29


In [29]:
# Train T1
results_T1 = entrena_catboost_cv(X_T1, y, fase='T1', cat_features_idx=prep_T1['cat_features_idx'])

df_resumen_T1 = resumen_cv(results_T1['cv_results'], fase='T1', modelo='CatBoost')

cb_path = f"{OUTPUT_DIR_REPORTES}cv_summary_CatBoost.csv"
df_cb = pd.read_csv(cb_path)
df_final = pd.concat([df_cb, df_resumen_T1], ignore_index=True)
df_final.to_csv(cb_path, index=False)
print(f"\nResults saved to: {cb_path}")

  CATBOOST TRAINING - PHASE T1

Variables: 29
Records: 4424
Categorical variables (native): 6

class_weights: {0: 1.0, 1: 2.11}

Hyperparameters (default):
   • iterations: 100
   • depth: 6
   • learning_rate: 0.1
   • l2_leaf_reg: 3

Cross-Validation (5-fold):

 Results by fold:

  Fold 1:
    accuracy   | Train: 0.8906 | Val: 0.8452
    precision  | Train: 0.8134 | Val: 0.7517
    recall     | Train: 0.8556 | Val: 0.7754
    f1         | Train: 0.8340 | Val: 0.7634
    roc_auc    | Train: 0.9523 | Val: 0.8897

  Fold 2:
    accuracy   | Train: 0.8966 | Val: 0.8350
    precision  | Train: 0.8292 | Val: 0.7255
    recall     | Train: 0.8540 | Val: 0.7817
    f1         | Train: 0.8414 | Val: 0.7525
    roc_auc    | Train: 0.9562 | Val: 0.8987

  Fold 3:
    accuracy   | Train: 0.8986 | Val: 0.8452
    precision  | Train: 0.8285 | Val: 0.7543
    recall     | Train: 0.8628 | Val: 0.7676
    f1         | Train: 0.8453 | Val: 0.7609
    roc_auc    | Train: 0.9555
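
As a consistency check on the per-fold numbers, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The sketch below does this for the Fold 1 validation scores shown above; because the inputs are rounded to four decimals, the result is only expected to match the reported F1 approximately.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Fold 1 validation values reported above for phase T1.
precision, recall, reported_f1 = 0.7517, 0.7754, 0.7634
computed_f1 = f1_from_precision_recall(precision, recall)
print(round(computed_f1, 4))
```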

## 8. Modeling PHASE T2 (END OF 2ND SEMESTER)

In [30]:
# Preprocessing for T2
X_T2, features_T2, prep_T2 = preprocesamiento_catboost(X, y, fase='T2')

print(f"\nT2 - Dimensions: {X_T2.shape}")
print(f"Variables: {len(features_T2)}")


T2 - Dimensions: (4424, 35)
Variables: 35


In [31]:
# Train T2
results_T2 = entrena_catboost_cv(X_T2, y, fase='T2', cat_features_idx=prep_T2['cat_features_idx'])

df_resumen_T2 = resumen_cv(results_T2['cv_results'], fase='T2', modelo='CatBoost')

cb_path = f"{OUTPUT_DIR_REPORTES}cv_summary_CatBoost.csv"
df_cb = pd.read_csv(cb_path)
df_final = pd.concat([df_cb, df_resumen_T2], ignore_index=True)
df_final.to_csv(cb_path, index=False)
print(f"\nResults saved to: {cb_path}")

  CATBOOST TRAINING - PHASE T2

Variables: 35
Records: 4424
Categorical variables (native): 6

class_weights: {0: 1.0, 1: 2.11}

Hyperparameters (default):
   • iterations: 100
   • depth: 6
   • learning_rate: 0.1
   • l2_leaf_reg: 3

Cross-Validation (5-fold):

 Results by fold:

  Fold 1:
    accuracy   | Train: 0.9053 | Val: 0.8599
    precision  | Train: 0.8380 | Val: 0.7674
    recall     | Train: 0.8741 | Val: 0.8105
    f1         | Train: 0.8557 | Val: 0.7884
    roc_auc    | Train: 0.9645 | Val: 0.9046

  Fold 2:
    accuracy   | Train: 0.9101 | Val: 0.8655
    precision  | Train: 0.8527 | Val: 0.7835
    recall     | Train: 0.8707 | Val: 0.8028
    f1         | Train: 0.8616 | Val: 0.7930
    roc_auc    | Train: 0.9645 | Val: 0.9188

  Fold 3:
    accuracy   | Train: 0.9014 | Val: 0.8633
    precision  | Train: 0.8373 | Val: 0.7880
    recall     | Train: 0.8602 | Val: 0.7852
    f1         | Train: 0.8486 | Val: 0.7866
    roc_auc    | Train: 0.9608

## 9. Final summary

In [32]:
# Show the final summary
df_final = pd.read_csv(f"{OUTPUT_DIR_REPORTES}cv_summary_CatBoost.csv")

print("================================================================================")
print("  CATBOOST SUMMARY - CROSS VALIDATION")
print("================================================================================")
print(df_final.to_string(index=False))

  CATBOOST SUMMARY - CROSS VALIDATION
  modelo fase  accuracy_val_mean  accuracy_val_std  precision_val_mean  precision_val_std  recall_val_mean  recall_val_std  f1_val_mean  f1_val_std  roc_auc_val_mean  roc_auc_val_std  accuracy_train_mean  accuracy_train_std  precision_train_mean  precision_train_std  recall_train_mean  recall_train_std  f1_train_mean  f1_train_std  roc_auc_train_mean  roc_auc_train_std
CatBoost   T0             0.6944            0.0155              0.5188             0.0189           0.6875          0.0157       0.5912      0.0152            0.7702           0.0137               0.7277              0.0046                0.5576               0.0061             0.7373            0.0122         0.6349        0.0055              0.8109             0.0029
CatBoost   T1             0.8449            0.0068              0.7508             0.0153           0.7748          0.0063       0.7625      0.0083            0.9032           0.0090               0.8925              0
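
One way to read the summary is to look at the gap between mean train and mean validation scores, which gives a rough indication of overfitting. The sketch below computes it for phase T0 only, using the values from the complete T0 row above; `generalization_gap` is a hypothetical helper.

```python
# Mean train and validation scores for phase T0, copied from the summary row above.
t0_summary = {
    "accuracy": {"train": 0.7277, "val": 0.6944},
    "precision": {"train": 0.5576, "val": 0.5188},
    "recall": {"train": 0.7373, "val": 0.6875},
    "f1": {"train": 0.6349, "val": 0.5912},
    "roc_auc": {"train": 0.8109, "val": 0.7702},
}

def generalization_gap(summary):
    """Train-minus-validation difference per metric, rounded to 4 decimals."""
    return {m: round(s["train"] - s["val"], 4) for m, s in summary.items()}

gaps = generalization_gap(t0_summary)
for metric, gap in gaps.items():
    print(f"{metric:<10} gap: {gap:.4f}")
```

For T0 every gap is between roughly 0.03 and 0.05, so the train scores are consistently but moderately higher than the validation scores.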

## 10. Full training summary (all algorithms)

In [33]:
resumen_path = "../outputs/models/cv_summary_entrenamiento.csv"
df_resumen = pd.read_csv(resumen_path)

cb_path = "../outputs/models/CatBoost/cv_summary_CatBoost.csv"
df_cb = pd.read_csv(cb_path)

df_resumen = pd.concat([df_resumen, df_cb], ignore_index=True)

# Save the comparison table
df_resumen.to_csv("../outputs/models/cv_summary_entrenamiento.csv", index=False)

print("Results saved to: ../outputs/models/cv_summary_entrenamiento.csv")

Results saved to: ../outputs/models/cv_summary_entrenamiento.csv
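
Because the cell above concatenates and overwrites the shared summary file, running it a second time would append the CatBoost rows again. One possible safeguard is to replace rows that share the same model and phase instead of appending blindly. The sketch below shows the idea on plain lists of dictionaries; `upsert_rows` is a hypothetical helper and the XGBoost metric value is a placeholder.

```python
def upsert_rows(existing, new, key=("modelo", "fase")):
    """Merge two lists of row dicts; rows in `new` replace rows with the same key."""
    merged = {tuple(row[k] for k in key): row for row in existing}
    for row in new:
        merged[tuple(row[k] for k in key)] = row
    return list(merged.values())

# Illustrative rows only; the XGBoost value is a placeholder.
summary = [{"modelo": "XGBoost", "fase": "T0", "f1_val_mean": 0.0}]
catboost_rows = [{"modelo": "CatBoost", "fase": "T0", "f1_val_mean": 0.5912}]

once = upsert_rows(summary, catboost_rows)
twice = upsert_rows(once, catboost_rows)   # simulates running the cell a second time
print(len(once), len(twice))  # 2 2
```

With pandas, a comparable effect could be obtained by calling `drop_duplicates(subset=['modelo', 'fase'], keep='last')` on the concatenated frame before saving.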
