# Flight Delay Prediction: Comparison of Strategies and Models (v5)

**Objective:** Fairly compare several *feature engineering* strategies and several models to find the best combination for predicting flight delays (`RETRASADO_LLEGADA`).

**Validation Methodology:**
To avoid data leakage, a **temporal split** is used:
- **Train:** Months 1–9
- **Valid:** Months 10–12
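
As a minimal sketch of this split, assuming a toy DataFrame with a `MONTH` column (the toy data and the name `toy` are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data (illustrative): one row per month with a dummy binary target.
toy = pd.DataFrame({
    "MONTH": list(range(1, 13)),
    "RETRASADO_LLEGADA": [0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
})

# Rows are assigned by calendar month, never at random, so every
# validation row is later in the year than every training row.
train = toy[toy["MONTH"].between(1, 9)]
valid = toy[toy["MONTH"].between(10, 12)]

print(len(train), len(valid))
print(train["MONTH"].max() < valid["MONTH"].min())
```

A random split would mix months on both sides, so the model could be scored on periods it had already seen.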

**Experiments:**
1.  **Feature comparison (with LGBM):**
    - **Exp 1:** LabelEncoder (simple)
    - **Exp 2:** Target Encoding (advanced)
    - **Exp 3:** Target Encoding + Historical Aggregates (most advanced)
2.  **Model comparison (with the best features):**
    - LightGBM
    - XGBoost
    - Random Forest

2: Imports and Global Configuration (Code)

In [25]:
import os, time, json, math, warnings
import numpy as np
import pandas as pd
from joblib import dump, load
import lightgbm as lgb
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    roc_auc_score, f1_score, precision_recall_curve, auc as sk_auc
)

warnings.filterwarnings('ignore')
pd.options.mode.chained_assignment = None

# --- Global configuration ---
# IMPORTANT: adjust this path to your local file
DATA_PATH = r"D:\OneDrive\DOCUMENTOS\Personales\2024\uniandes\8 S\seminario\g11-caso-estudio-flights\data\processed\flights_clean.csv"

TARGET_COL = "RETRASADO_LLEGADA"
RESULTS = []  # Each training function appends its metrics dict here

3: Step 1 - Data Loading and Preparation (Helpers) (Code)

In [26]:
# ==============================================================================
# STEP 1: DATA PREPARATION FUNCTIONS (Helpers)
# ==============================================================================

def load_and_prep_data(data_path):
    """Load the CSV and derive every feature the experiments need."""
    print(f"Cargando datos desde {data_path}...")
    
    # Columns needed from the original CSV
    need_cols = [
        "MONTH", "DAY_OF_WEEK", "AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
        "SCHEDULED_DEPARTURE", "SCHEDULED_ARRIVAL", 
        "SCHEDULED_TIME", "DISTANCE",
        "ORIGEN_LAT", "ORIGEN_LON", "DEST_LAT", "DEST_LON",
        "SALIDA_SIN", "SALIDA_COS", "LLEGADA_SIN", "LLEGADA_COS",
        "RETRASADO_LLEGADA"
    ]
    
    try:
        header = pd.read_csv(data_path, nrows=0).columns.tolist()
    except FileNotFoundError:
        print(f"ERROR: No se encontró el archivo en {data_path}")
        return None
        
    present = [c for c in need_cols if c in header]
    
    # Define dtypes to save memory
    dtype_map = {
        "MONTH":"int8", "DAY_OF_WEEK":"int8", "AIRLINE":"category", 
        "ORIGIN_AIRPORT":"category", "DESTINATION_AIRPORT":"category",
        "SCHEDULED_DEPARTURE":"int32", "SCHEDULED_ARRIVAL":"int32",
        "SCHEDULED_TIME":"float32", "DISTANCE":"float32",
        "ORIGEN_LAT":"float32", "ORIGEN_LON":"float32",
        "DEST_LAT":"float32", "DEST_LON":"float32", 
        "SALIDA_SIN":"float32", "SALIDA_COS":"float32", 
        "LLEGADA_SIN":"float32", "LLEGADA_COS":"float32",
        "RETRASADO_LLEGADA":"int8"
    }
    dtype_eff = {k:v for k,v in dtype_map.items() if k in present}

    v = pd.read_csv(data_path, usecols=present, dtype=dtype_eff, low_memory=False)

    # --- Derive MISSING features (if they did not come in the CSV) ---
    
    def haversine_km(lat1, lon1, lat2, lon2):
        R = 6371.0
        lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
        dlat = lat2 - lat1; dlon = lon2 - lon1
        a = np.sin(dlat/2.0)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2.0)**2
        return (2*R*np.arcsin(np.sqrt(a))).astype(np.float32)

    if "DISTANCIA_HAV" not in v.columns and "DISTANCE" not in v.columns:
        if {"ORIGEN_LAT", "ORIGEN_LON", "DEST_LAT", "DEST_LON"}.issubset(v.columns):
            v["DISTANCIA_HAV"] = haversine_km(v["ORIGEN_LAT"], v["ORIGEN_LON"], v["DEST_LAT"], v["DEST_LON"])
        else:
            v["DISTANCIA_HAV"] = 0.0
    elif "DISTANCE" in v.columns and "DISTANCIA_HAV" not in v.columns:
        v["DISTANCIA_HAV"] = v["DISTANCE"].astype("float32") 

    if "SCHEDULED_TIME" not in v.columns: v["SCHEDULED_TIME"] = 0.0
        
    if "MINUTO_DIA_SALIDA" not in v.columns and "SCHEDULED_DEPARTURE" in v.columns:
        hs = (v["SCHEDULED_DEPARTURE"] // 100).clip(0, 23).astype("int16")
        ms = (v["SCHEDULED_DEPARTURE"] % 100).clip(0, 59).astype("int16")
        v["MINUTO_DIA_SALIDA"] = (hs * 60 + ms).astype("int16")
        v["HORA_SALIDA"] = hs
    
    if "SALIDA_SIN" not in v.columns and "MINUTO_DIA_SALIDA" in v.columns:
        rad = 2*np.pi*(v["MINUTO_DIA_SALIDA"].astype(float)/(24*60))
        v["SALIDA_SIN"] = np.sin(rad).astype("float32")
        v["SALIDA_COS"] = np.cos(rad).astype("float32")

    if "MINUTO_DIA_LLEGADA" not in v.columns and "SCHEDULED_ARRIVAL" in v.columns:
        hl = (v["SCHEDULED_ARRIVAL"] // 100).clip(0, 23).astype("int16")
        ml = (v["SCHEDULED_ARRIVAL"] % 100).clip(0, 59).astype("int16")
        v["MINUTO_DIA_LLEGADA"] = (hl * 60 + ml).astype("int16")
    
    if "LLEGADA_SIN" not in v.columns and "MINUTO_DIA_LLEGADA" in v.columns:
        rad_l = 2*np.pi*(v["MINUTO_DIA_LLEGADA"].astype(float)/(24*60))
        v["LLEGADA_SIN"] = np.sin(rad_l).astype("float32")
        v["LLEGADA_COS"] = np.cos(rad_l).astype("float32")
        
    if "MONTH_SIN" not in v.columns and "MONTH" in v.columns:
        v["MONTH_SIN"] = np.sin(2*np.pi * v["MONTH"]/12).astype("float32")
        v["MONTH_COS"] = np.cos(2*np.pi * v["MONTH"]/12).astype("float32")

    if "RUTA" not in v.columns:
        v["RUTA"] = v["ORIGIN_AIRPORT"].astype(str) + "_" + v["DESTINATION_AIRPORT"].astype(str)
    
    print(f"Datos preparados. Shape: {v.shape}")
    return v

def split_temporal(df, target_col):
    """Temporal split: train on months 1-9, validate on months 10-12."""
    print("Realizando split temporal (Train 1-9, Valid 10-12)...")
    train_mask = df["MONTH"].between(1, 9)
    valid_mask = df["MONTH"].between(10, 12)
    
    y = df[target_col].astype("int8")
    X = df.drop(columns=[target_col])
    
    X_train, y_train = X.loc[train_mask].copy(), y.loc[train_mask].copy()
    X_valid, y_valid = X.loc[valid_mask].copy(), y.loc[valid_mask].copy()
    
    print(f"X_train: {X_train.shape}, X_valid: {X_valid.shape}")
    return X_train, y_train, X_valid, y_valid

# Run loading and split
v_full = load_and_prep_data(DATA_PATH)
if v_full is not None:
    X_train_base, y_train_base, X_valid_base, y_valid_base = split_temporal(v_full, TARGET_COL)

Cargando datos desde D:\OneDrive\DOCUMENTOS\Personales\2024\uniandes\8 S\seminario\g11-caso-estudio-flights\data\processed\flights_clean.csv...
Datos preparados. Shape: (5231130, 25)
Realizando split temporal (Train 1-9, Valid 10-12)...
X_train: (4299046, 24), X_valid: (932084, 24)
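
`load_and_prep_data` turns HHMM schedule times into minute-of-day and then into sine/cosine pairs. The sketch below reproduces that idea with the standard library only, to show why the pair is useful. The helper names `hhmm_to_minute` and `cyclic_encode` are hypothetical and not part of the notebook.

```python
import math

def hhmm_to_minute(hhmm):
    """Convert an HHMM integer (e.g. 730, 2359) to minutes since midnight.

    Hours are clipped to 0-23 and minutes to 0-59, as in the notebook.
    """
    hour = min(max(hhmm // 100, 0), 23)
    minute = min(max(hhmm % 100, 0), 59)
    return hour * 60 + minute

def cyclic_encode(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

late = cyclic_encode(hhmm_to_minute(2359), 24 * 60)   # 23:59
early = cyclic_encode(hhmm_to_minute(1), 24 * 60)     # 00:01

# Raw minutes put these two times far apart; on the circle they are neighbours.
raw_gap = abs(hhmm_to_minute(2359) - hhmm_to_minute(1))
circle_gap = math.dist(late, early)
print(raw_gap, round(circle_gap, 4))
```

One side effect of the clipping is that a value of 2400 becomes 23:00 (minute 1380) rather than midnight, which may or may not matter depending on how the source data encodes midnight.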


4: Step 2 - Feature Engineering Functions (Encoders) (Code)

In [27]:
# ==============================================================================
# STEP 2: FEATURE ENGINEERING FUNCTIONS (Encoders)
# ==============================================================================

def apply_label_encoder(X_train_subset, X_valid_subset):
    """Apply LabelEncoder per column; unseen validation categories map to '<unknown>'."""
    print("Aplicando LabelEncoder...")
    X_train_le = X_train_subset.copy()
    X_valid_le = X_valid_subset.copy()
    encoders = {}

    for col in X_train_subset.columns:
        le = LabelEncoder()
        X_train_le[col] = le.fit_transform(X_train_le[col].astype(str))

        # Replace categories not seen in train with a reserved '<unknown>' token
        le_classes = set(le.classes_)
        valid_str = X_valid_le[col].astype(str)
        valid_str = valid_str.where(valid_str.isin(le_classes), '<unknown>')
        if '<unknown>' not in le_classes:
            le.classes_ = np.append(le.classes_, '<unknown>')

        X_valid_le[col] = le.transform(valid_str)
        encoders[col] = le

    return X_train_le, X_valid_le, encoders

def kfold_target_encode(s_train, y_train, s_valid, smoothing=50):
    """K-fold (out-of-fold) target encoding for a single column, to avoid leakage."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    gmean = float(y_train.mean())
    enc_train = pd.Series(index=s_train.index, dtype="float32")

    for tr_idx, val_idx in skf.split(s_train, y_train):
        s_tr, y_tr = s_train.iloc[tr_idx], y_train.iloc[tr_idx]
        s_val = s_train.iloc[val_idx]

        stats = y_tr.groupby(s_tr.astype(str)).mean()
        cnts = y_tr.groupby(s_tr.astype(str)).size()
        smoothed = ((stats * cnts + gmean * smoothing) / (cnts + smoothing)).to_dict()
        enc_train.iloc[val_idx] = s_val.astype(str).map(smoothed).fillna(gmean)

    full_stats = y_train.groupby(s_train.astype(str)).mean()
    full_cnts = y_train.groupby(s_train.astype(str)).size()
    mapping = ((full_stats * full_cnts + gmean * smoothing) / (full_cnts + smoothing)).to_dict()
    enc_valid = s_valid.astype(str).map(mapping).fillna(gmean).astype("float32")
    
    return enc_train.astype("float32"), enc_valid

def apply_target_encoding(X_train, y_train, X_valid, cat_cols):
    """Apply K-fold target encoding and return ONLY the new columns (index preserved)."""
    print("Aplicando Target Encoding K-Fold...")
    X_train_te = pd.DataFrame(index=X_train.index)
    X_valid_te = pd.DataFrame(index=X_valid.index)
    
    for col in cat_cols:
        new_col_name = f"{col}_TE"
        enc_tr, enc_val = kfold_target_encode(X_train[col], y_train, X_valid[col])
        X_train_te[new_col_name] = enc_tr
        X_valid_te[new_col_name] = enc_val
        
    return X_train_te, X_valid_te
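
The smoothing step inside `kfold_target_encode` can be checked by hand. The sketch below isolates the same formula in a hypothetical helper, `smoothed_rate`. The global mean of 0.19 is an illustrative value, not a figure from the data.

```python
def smoothed_rate(group_mean, group_count, global_mean, smoothing=50):
    """Shrink a group's target mean toward the global mean.

    Same formula as in kfold_target_encode:
    (mean * n + global_mean * smoothing) / (n + smoothing)
    """
    return (group_mean * group_count + global_mean * smoothing) / (group_count + smoothing)

global_mean = 0.19  # illustrative delay rate, not taken from the data

rare = smoothed_rate(0.5, 2, global_mean)       # category seen twice
common = smoothed_rate(0.5, 5000, global_mean)  # category seen 5000 times

print(round(rare, 4), round(common, 4))
```

With only two observations the encoded value stays close to the global mean, while with thousands it is dominated by the category's own rate. This is the reason for using a smoothing constant with high-cardinality columns such as `RUTA`.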

# ================================================================
# *** CORRECTED VERSION (v7) of apply_historical_aggs ***
# ================================================================
def apply_historical_aggs(X_train, y_train, X_valid, agg_specs):
    """
    Compute historical aggregates and return ONLY the new columns.
    (v7 fix: restore the original index after the merge.)
    Note: train-row rates are in-sample (they include each row's own label);
    validation-row rates use train labels only.
    """
    print("Aplicando Agregados Históricos (v7 Fix)...")
    gmean = float(y_train.mean())
    
    X_train_agg_cols = pd.DataFrame(index=X_train.index)
    X_valid_agg_cols = pd.DataFrame(index=X_valid.index)

    df_train = X_train.copy()
    df_train[TARGET_COL] = y_train
    
    for keys, pref in agg_specs:
        rate_col, n_col = f"{pref}_rate", f"{pref}_n"
        
        # 1. Compute stats on train *only*
        agg = df_train.groupby(keys, observed=True)[TARGET_COL].agg(["mean", "size"]).reset_index()
        agg.columns = keys + [rate_col, n_col]
        
        # 2. Map the stats onto X_train and X_valid with a left merge on the keys
        X_train_merged = X_train[keys].merge(agg, on=keys, how="left")
        X_valid_merged = X_valid[keys].merge(agg, on=keys, how="left")

        # 3. *** FIX ***
        # merge() returns a fresh RangeIndex. A left merge against unique keys keeps
        # the row order, so the original index of X_train/X_valid can be reassigned
        # BEFORE filling NaNs and assigning.
        X_train_merged.index = X_train.index
        X_valid_merged.index = X_valid.index

        # 4. Fill NaNs and assign (the indices now match)
        X_train_agg_cols[rate_col] = X_train_merged[rate_col].fillna(gmean).astype("float32")
        X_train_agg_cols[n_col] = X_train_merged[n_col].fillna(0).astype("float32")
        X_valid_agg_cols[rate_col] = X_valid_merged[rate_col].fillna(gmean).astype("float32")
        X_valid_agg_cols[n_col] = X_valid_merged[n_col].fillna(0).astype("float32")

    return X_train_agg_cols, X_valid_agg_cols
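
In `apply_historical_aggs`, the rate attached to each training row is computed from a group that contains that row's own label, whereas validation rows only ever see training labels. One possible way to make both sides comparable is to compute the training-side rate out of fold, in the same spirit as `kfold_target_encode`. The sketch below is not the notebook's method: `oof_group_rate`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def oof_group_rate(keys_df, y, n_splits=5, seed=42):
    """Out-of-fold mean of y per key group.

    Each row's rate is computed from the other folds only; groups that do not
    appear in those folds fall back to the mean of y over those folds.
    """
    keys = list(keys_df.columns)
    frame = keys_df.reset_index(drop=True).copy()
    frame["_y"] = np.asarray(y, dtype="float64")

    rng = np.random.default_rng(seed)
    fold_id = rng.integers(0, n_splits, size=len(frame))  # random fold per row
    out = np.empty(len(frame), dtype="float64")

    for k in range(n_splits):
        held = fold_id == k
        fit = frame.loc[~held]
        rates = fit.groupby(keys)["_y"].mean().rename("_rate").reset_index()
        merged = frame.loc[held, keys].merge(rates, on=keys, how="left")
        out[held] = merged["_rate"].fillna(fit["_y"].mean()).to_numpy()

    return pd.Series(out, index=keys_df.index, name="oof_rate")

# Toy data: one route with 50 on-time flights, one route with a single delayed flight.
toy_keys = pd.DataFrame({"RUTA": ["A_B"] * 50 + ["C_D"]})
toy_y = pd.Series([0] * 50 + [1])

in_sample = toy_y.groupby(toy_keys["RUTA"]).transform("mean")
oof = oof_group_rate(toy_keys, toy_y)
print(in_sample.iloc[-1], oof.iloc[-1])
```

For the single-flight route, the in-sample rate simply repeats the row's own label, while the out-of-fold rate cannot see it. Whether this changes the Exp 3 results would have to be tested; the sketch only illustrates the mechanism.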

5: Step 3 - Training Functions (LGBM, XGB, RF) (Code)

In [62]:
# ==============================================================================
# STEP 3: TRAINING AND EVALUATION FUNCTIONS (FIXED v16)
# ==============================================================================

def find_best_f1_threshold(y_true, y_proba):
    """Return (best F1, threshold) over the precision-recall curve."""
    prec, rec, thr = precision_recall_curve(y_true, y_proba)
    # precision_recall_curve returns one more (precision, recall) point than
    # thresholds; the final point (precision=1, recall=0) has no threshold and F1=0.
    prec, rec = prec[:-1], rec[:-1]
    f1s = (2 * prec * rec) / (prec + rec + 1e-9)  # epsilon avoids division by zero
    best_f1_idx = np.argmax(f1s)
    return f1s[best_f1_idx], thr[best_f1_idx]
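
To make the threshold search concrete, here is a small sketch of the same idea written without scikit-learn, so that each step is visible. The function name `scan_f1_thresholds` and the toy labels and scores are hypothetical. A row is predicted positive when its score is greater than or equal to the threshold.

```python
def scan_f1_thresholds(y_true, y_score):
    """Try every distinct score as a threshold and keep the one with the best F1."""
    total_pos = sum(y_true)
    best_f1, best_thr = 0.0, None
    for thr in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
        if tp == 0:
            continue  # precision and recall are both 0 here, so F1 is 0
        precision = tp / (tp + fp)
        recall = tp / total_pos
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_thr = f1, thr
    return best_f1, best_thr

# Toy labels and scores (illustrative only)
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

best_f1, best_thr = scan_f1_thresholds(y_true, y_score)
print(round(best_f1, 4), best_thr)
```

On this toy input the best threshold is 0.35, which captures all three positives at the cost of one false positive.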
    

def train_lgbm(X_train, y_train, X_valid, y_valid, exp_name, categorical_features=None):
    """Train LGBM and report metrics."""
    print(f"\n--- Entrenando Experimento: {exp_name} (LGBM) ---")
    
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'n_estimators': 1000, 
        'learning_rate': 0.05,
        'num_leaves': 127,
        'class_weight': 'balanced',  # use this to handle the class imbalance
        'n_jobs': -1,
        'random_state': 42,
        'colsample_bytree': 0.8,
        'subsample': 0.8,  # note: row subsampling only applies when subsample_freq > 0
        'min_child_samples': 200
    }
    
    model = lgb.LGBMClassifier(**params)
    t0 = time.time()
    
    fit_params = {
        "eval_set": [(X_valid, y_valid)],
        "eval_metric": "auc",
        "callbacks": [lgb.early_stopping(100), lgb.log_evaluation(200)]
    }
    
    # *** FIX v16 (LGBM) ***
    # With LabelEncoder (Exp 1), only the column NAMES are passed here.
    # The conversion to 'category' dtype happens OUTSIDE (in cell 7).
    if categorical_features:
        fit_params["categorical_feature"] = categorical_features
        
    model.fit(X_train, y_train, **fit_params)  # X_train/y_train are already aligned
    t1 = time.time()
    
    print(f"Entrenamiento completado en {t1-t0:.1f}s (Best iter: {model.best_iteration_})\n")
    
    y_proba = model.predict_proba(X_valid)[:, 1]
    auc_roc = roc_auc_score(y_valid, y_proba)
    prec, rec, _ = precision_recall_curve(y_valid, y_proba)
    auc_pr = sk_auc(rec, prec)
    best_f1, best_thr = find_best_f1_threshold(y_valid, y_proba)
    
    metrics = {
        "Modelo": "LGBM",
        "Experimento": exp_name,
        "ROC-AUC": round(auc_roc, 4),
        "PR-AUC": round(auc_pr, 4),
        "Best_F1": round(best_f1, 4),
        "Best_F1_Threshold": round(best_thr, 3),
        "Tiempo (s)": round(t1 - t0, 1)
    }
    RESULTS.append(metrics)
    return model, metrics

def train_xgb(X_train, y_train, X_valid, y_valid, exp_name, categorical_features=None):
    """Train XGBoost and report metrics."""
    print(f"\n--- Entrenando Experimento: {exp_name} (XGBoost) ---")
    
    neg = (y_train == 0).sum()
    pos = (y_train == 1).sum()
    scale_pos_weight = neg / pos
    
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'n_estimators': 1000,
        'learning_rate': 0.05,
        'max_depth': 8, 
        'scale_pos_weight': scale_pos_weight,
        'n_jobs': -1,
        'random_state': 42,
        'colsample_bytree': 0.8,
        'subsample': 0.8,
        'min_child_weight': 5,
        'early_stopping_rounds': 100 # v14 fix
    }
    
    # *** FIX v16 (XGB) ***
    # The conversion to 'category' dtype happens OUTSIDE (in cell 7).
    if categorical_features:
        params['enable_categorical'] = True

    model = XGBClassifier(**params)
    t0 = time.time()
    
    model.fit(
        X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        verbose=200
    )
    t1 = time.time()
    
    print(f"Entrenamiento completado en {t1-t0:.1f}s (Best iter: {model.best_iteration})\n")
    
    y_proba = model.predict_proba(X_valid)[:, 1]
    auc_roc = roc_auc_score(y_valid, y_proba)
    prec, rec, _ = precision_recall_curve(y_valid, y_proba)
    auc_pr = sk_auc(rec, prec)
    best_f1, best_thr = find_best_f1_threshold(y_valid, y_proba)
    
    metrics = {
        "Modelo": "XGBoost",
        "Experimento": exp_name,
        "ROC-AUC": round(auc_roc, 4),
        "PR-AUC": round(auc_pr, 4),
        "Best_F1": round(best_f1, 4),
        "Best_F1_Threshold": round(best_thr, 3),
        "Tiempo (s)": round(t1 - t0, 1)
    }
    RESULTS.append(metrics)
    return model, metrics

def train_rf(X_train, y_train, X_valid, y_valid, exp_name):
    """Train Random Forest and report metrics."""
    print(f"\n--- Entrenando Experimento: {exp_name} (RandomForest) ---")
    
    params = {
        'n_estimators': 100,  # Reduced for speed
        'max_depth': 20, 
        'min_samples_leaf': 100, 
        'max_features': 'sqrt',
        'n_jobs': -1,
        'random_state': 42,
        'class_weight': 'balanced'
    }
    
    model = RandomForestClassifier(**params)
    t0 = time.time()
    
    model.fit(X_train, y_train)
    t1 = time.time()
    
    print(f"Entrenamiento completado en {t1-t0:.1f}s\n")
    
    y_proba = model.predict_proba(X_valid)[:, 1]
    auc_roc = roc_auc_score(y_valid, y_proba)
    prec, rec, _ = precision_recall_curve(y_valid, y_proba)
    auc_pr = sk_auc(rec, prec)
    best_f1, best_thr = find_best_f1_threshold(y_valid, y_proba)
    
    metrics = {
        "Modelo": "RandomForest",
        "Experimento": exp_name,
        "ROC-AUC": round(auc_roc, 4),
        "PR-AUC": round(auc_pr, 4),
        "Best_F1": round(best_f1, 4),
        "Best_F1_Threshold": round(best_thr, 3),
        "Tiempo (s)": round(t1 - t0, 1)
    }
    RESULTS.append(metrics)
    return model, metrics
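
The three training functions handle class imbalance in two ways: `scale_pos_weight` for XGBoost and `class_weight='balanced'` for LightGBM and Random Forest. The arithmetic below uses the class counts reported in this notebook's LightGBM training logs. It assumes that "balanced" follows scikit-learn's formula `n_samples / (n_classes * count)`.

```python
n_pos = 805_372     # positives in train, from the LightGBM log
n_neg = 3_493_674   # negatives in train, from the LightGBM log
n_total = n_pos + n_neg

# Single ratio, as computed in train_xgb
scale_pos_weight = n_neg / n_pos

# 'balanced' per-class weights (assumed formula: n_samples / (n_classes * count))
w_pos = n_total / (2 * n_pos)
w_neg = n_total / (2 * n_neg)

# Weighted share of positives after reweighting
weighted_pos_share = (w_pos * n_pos) / (w_pos * n_pos + w_neg * n_neg)

print(round(scale_pos_weight, 3), round(w_pos, 3), round(w_neg, 3), weighted_pos_share)
```

Both schemes give positives the same relative weight (`w_pos / w_neg` equals `scale_pos_weight`). The weighted positive share of one half would be consistent with the `pavg=0.500000` line in the LightGBM logs.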

6: Step 4 - Running the Experiments (Setup) (Code)

In [63]:
# ==============================================================================
# STEP 4: PREPARING THE BASE VARIABLES
# ==============================================================================

# Columns used for feature engineering
cat_cols = ["AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "RUTA"]
num_cols = [
    "MONTH", "DAY_OF_WEEK", 
    "SALIDA_SIN", "SALIDA_COS", 
    "LLEGADA_SIN", "LLEGADA_COS",
    "MONTH_SIN", "MONTH_COS",
    "SCHEDULED_TIME", "DISTANCIA_HAV" 
]
# Keep only the columns that actually exist in v_full
num_cols = [c for c in num_cols if c in v_full.columns]
cat_cols = [c for c in cat_cols if c in v_full.columns]

agg_specs = [
    (["RUTA", "HORA_SALIDA"], "RUTA_HORA"),
    (["AIRLINE"], "AIR"),
    (["ORIGIN_AIRPORT"], "ORI"),
    (["DESTINATION_AIRPORT"], "DES")
]

# --- Targets with reset index (used in every experiment) ---
y_train_r = y_train_base.reset_index(drop=True)
y_valid_r = y_valid_base.reset_index(drop=True)

# Store the base numeric matrices (with reset index)
X_train_num_r = X_train_base[num_cols].reset_index(drop=True)
X_valid_num_r = X_valid_base[num_cols].reset_index(drop=True)

print(f"Usando {len(num_cols)} features numéricas y {len(cat_cols)} categóricas.")

Usando 10 features numéricas y 4 categóricas.


7: Experiment 1 (LabelEncoder) (Code)

In [64]:
# ==============================================================================
# --- Exp 1: LabelEncoder ---
# ==============================================================================
print("\n=== INICIANDO EXPERIMENTO 1: LabelEncoder ===")

# 1. Apply label encoding
X_train_le, X_valid_le, le_encoders = apply_label_encoder(X_train_base[cat_cols], X_valid_base[cat_cols])

# 2. Join numeric features (index reset) + label-encoded features (index reset)
X_train_1 = pd.concat([X_train_num_r, X_train_le.reset_index(drop=True)], axis=1)
X_valid_1 = pd.concat([X_valid_num_r, X_valid_le.reset_index(drop=True)], axis=1)

# 3. Names of the categorical columns
cat_features_names = list(X_train_le.columns)

# 4. *** FIX v16 (LGBM/XGB) ***
# Make copies and convert them to a UNIFIED 'category' dtype.
# This must happen BEFORE calling .fit()
X_train_1_cat = X_train_1.copy()
X_valid_1_cat = X_valid_1.copy()

for col in cat_features_names:
    # 1. Build the unified dtype (from train AND valid)
    all_cats = pd.concat([X_train_1_cat[col], X_valid_1_cat[col]]).unique()
    cat_type = pd.CategoricalDtype(categories=all_cats, ordered=False)
    # 2. Apply it to both
    X_train_1_cat[col] = X_train_1_cat[col].astype(cat_type)
    X_valid_1_cat[col] = X_valid_1_cat[col].astype(cat_type)
    
# 5. Train the models
# (X/y with reset index (0..N); the _cat versions go to LGBM/XGB)
train_lgbm(X_train_1_cat, y_train_r, X_valid_1_cat, y_valid_r, "Exp 1: LabelEncoder", categorical_features=cat_features_names)
train_xgb(X_train_1_cat, y_train_r, X_valid_1_cat, y_valid_r, "Exp 1: LabelEncoder", categorical_features=cat_features_names)

# RF works on the plain integer codes (X_train_1)
train_rf(X_train_1, y_train_r, X_valid_1, y_valid_r, "Exp 1: LabelEncoder")


=== INICIANDO EXPERIMENTO 1: LabelEncoder ===
Aplicando LabelEncoder...

--- Entrenando Experimento: Exp 1: LabelEncoder (LGBM) ---
[LightGBM] [Info] Number of positive: 805372, number of negative: 3493674
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.121568 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5871
[LightGBM] [Info] Number of data points in the train set: 4299046, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[71]	valid_0's auc: 0.615217
Entrenamiento completado en 89.0s (Best iter: 71)


--- Entrenando Experimento: Exp 1: LabelEncoder (XGBoost) ---
[0]	validation_0-auc:0.57678
[166]	validation_0-auc:0.61389
Entr

(RandomForestClassifier(class_weight='balanced', max_depth=20,
                        min_samples_leaf=100, n_jobs=-1, random_state=42),
 {'Modelo': 'RandomForest',
  'Experimento': 'Exp 1: LabelEncoder',
  'ROC-AUC': 0.6155,
  'PR-AUC': 0.2344,
  'Best_F1': np.float64(0.3297),
  'Best_F1_Threshold': np.float64(0.425),
  'Tiempo (s)': 248.5})
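
The unified `CategoricalDtype` step in this cell matters for any consumer that reads the integer codes behind a categorical column. The toy sketch below (the values and the helper `code_of` are hypothetical) shows how converting train and validation independently can give the same value two different codes.

```python
import pandas as pd

train_codes = pd.Series([0, 2, 1, 2])   # label-encoded values seen in train
valid_codes = pd.Series([2, 3, 0])      # 3 stands in for an unseen category

# Independent conversion: each Series infers its own category list.
naive_train = train_codes.astype("category")
naive_valid = valid_codes.astype("category")

# Unified conversion, as done in the cell above.
all_cats = pd.concat([train_codes, valid_codes]).unique()
cat_type = pd.CategoricalDtype(categories=all_cats, ordered=False)
uni_train = train_codes.astype(cat_type)
uni_valid = valid_codes.astype(cat_type)

def code_of(cat_series, value):
    """Internal integer code that a categorical Series assigns to `value`."""
    return int(cat_series.cat.codes[cat_series == value].iloc[0])

print(code_of(naive_train, 2), code_of(naive_valid, 2))
print(code_of(uni_train, 2), code_of(uni_valid, 2))
```

With a shared dtype, the value 2 maps to the same internal code on both sides, so the model sees a consistent representation at training and validation time.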

8: Experiment 2 (Target Encoding) (Code)

In [65]:
# ==============================================================================
# --- Exp 2: Target Encoding ---
# ==============================================================================
print("\n=== INICIANDO EXPERIMENTO 2: Target Encoding ===")

# 1. Apply target encoding
# (y_train_base has the original index, aligned with X_train_base)
# X_train_te_cols and X_valid_te_cols keep the original index (0, 1, 5, 8...)
X_train_te_cols, X_valid_te_cols = apply_target_encoding(X_train_base[cat_cols], y_train_base, X_valid_base[cat_cols], cat_cols)

# 2. Join numeric features (reset 0..N) + TE features (reset 0..N)
X_train_2 = pd.concat([X_train_num_r, X_train_te_cols.reset_index(drop=True)], axis=1)
X_valid_2 = pd.concat([X_valid_num_r, X_valid_te_cols.reset_index(drop=True)], axis=1)

# 3. Train the models (no categorical_features, since TE output is numeric)
train_lgbm(X_train_2, y_train_r, X_valid_2, y_valid_r, "Exp 2: TargetEncoding")
train_xgb(X_train_2, y_train_r, X_valid_2, y_valid_r, "Exp 2: TargetEncoding")
train_rf(X_train_2, y_train_r, X_valid_2, y_valid_r, "Exp 2: TargetEncoding")


=== INICIANDO EXPERIMENTO 2: Target Encoding ===
Aplicando Target Encoding K-Fold...

--- Entrenando Experimento: Exp 2: TargetEncoding (LGBM) ---
[LightGBM] [Info] Number of positive: 805372, number of negative: 3493674
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.065566 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2398
[LightGBM] [Info] Number of data points in the train set: 4299046, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Training until validation scores don't improve for 100 rounds
[200]	valid_0's auc: 0.611184
[400]	valid_0's auc: 0.612138
[600]	valid_0's auc: 0.61272
[800]	valid_0's auc: 0.612675
Early stopping, best iteration is:
[704]	valid_0's auc: 0.612817
Entrenamiento completado en 229.1s (Best it

(RandomForestClassifier(class_weight='balanced', max_depth=20,
                        min_samples_leaf=100, n_jobs=-1, random_state=42),
 {'Modelo': 'RandomForest',
  'Experimento': 'Exp 2: TargetEncoding',
  'ROC-AUC': 0.6204,
  'PR-AUC': 0.2415,
  'Best_F1': np.float64(0.3316),
  'Best_F1_Threshold': np.float64(0.423),
  'Tiempo (s)': 331.7})

9: Experiment 3 (Target Encoding + Aggregates) (Code)

In [66]:
# ==============================================================================
# --- Exp 3: TE + Aggregates ---
# ==============================================================================
print("\n=== INICIANDO EXPERIMENTO 3: TE + Agregados Históricos ===")

# 1. Compute the aggregates (they keep the original index 0, 1, 5, 8...)
X_train_agg_cols, X_valid_agg_cols = apply_historical_aggs(X_train_base, y_train_base, X_valid_base, agg_specs)

# 2. X_train_te_cols already exists from the previous step (with the original index 0, 1, 5, 8...)

# 3. Join numeric (reset) + TE (reset) + aggregates (reset)
X_train_3 = pd.concat([
    X_train_num_r, 
    X_train_te_cols.reset_index(drop=True), # Reset (0..N)
    X_train_agg_cols.reset_index(drop=True) # Reset (0..N)
], axis=1)

X_valid_3 = pd.concat([
    X_valid_num_r, 
    X_valid_te_cols.reset_index(drop=True), # Reset (0..N)
    X_valid_agg_cols.reset_index(drop=True) # Reset (0..N)
], axis=1)

# 4. Train the models
train_lgbm(X_train_3, y_train_r, X_valid_3, y_valid_r, "Exp 3: TE + Agregados")
train_xgb(X_train_3, y_train_r, X_valid_3, y_valid_r, "Exp 3: TE + Agregados")
train_rf(X_train_3, y_train_r, X_valid_3, y_valid_r, "Exp 3: TE + Agregados")


=== INICIANDO EXPERIMENTO 3: TE + Agregados Históricos ===
Aplicando Agregados Históricos (v7 Fix)...

--- Entrenando Experimento: Exp 3: TE + Agregados (LGBM) ---
[LightGBM] [Info] Number of positive: 805372, number of negative: 3493674
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.118530 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3686
[LightGBM] [Info] Number of data points in the train set: 4299046, number of used features: 22
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[5]	valid_0's auc: 0.608403
Entrenamiento completado en 58.6s (Best iter: 5)


--- Entrenando Experimento: Exp 3: TE + Agregados (XGBoost) ---
[0]	validation_0-auc:0.59705
[

(RandomForestClassifier(class_weight='balanced', max_depth=20,
                        min_samples_leaf=100, n_jobs=-1, random_state=42),
 {'Modelo': 'RandomForest',
  'Experimento': 'Exp 3: TE + Agregados',
  'ROC-AUC': 0.6079,
  'PR-AUC': 0.2356,
  'Best_F1': np.float64(0.324),
  'Best_F1_Threshold': np.float64(0.391),
  'Tiempo (s)': 402.9})

10: Step 5 - Final Report (Code)

In [68]:
# ==============================================================================
# STEP 5: FINAL REPORT
# ==============================================================================

print("\n\n--- Comparación Final de Alternativas (Validadas en Meses 10-12) ---")
df_results = pd.DataFrame(RESULTS).set_index(["Experimento", "Modelo"])

# Sort by the best metric (ROC-AUC, then PR-AUC)
df_sorted = df_results.sort_values(by=["ROC-AUC", "PR-AUC"], ascending=False)

# Print as markdown for easy reading
print(df_sorted.to_markdown(floatfmt=".4f"))

# Determine the winner
if not df_sorted.empty:
    winner_exp = df_sorted.index[0][0]
    winner_model = df_sorted.index[0][1]
    winner_auc = df_sorted.iloc[0]["ROC-AUC"]
    print(f"\n🏆 Ganador (por ROC-AUC): {winner_exp} con {winner_model} (AUC: {winner_auc:.4f})")
else:
    print("\nNo se completaron experimentos para determinar un ganador.")
print("---")
print("Nota: Un ROC-AUC más alto es mejor para distinguir clases (ranking general).")
print("Un PR-AUC más alto es mejor para encontrar retrasos (desbalanceado).")



--- Comparación Final de Alternativas (Validadas en Meses 10-12) ---
|                                           |   ROC-AUC |   PR-AUC |   Best_F1 |   Best_F1_Threshold |   Tiempo (s) |
|:------------------------------------------|----------:|---------:|----------:|--------------------:|-------------:|
| ('Exp 2: TargetEncoding', 'RandomForest') |    0.6204 |   0.2415 |    0.3316 |              0.4230 |     331.7000 |
| ('Exp 1: LabelEncoder', 'XGBoost')        |    0.6161 |   0.2360 |    0.3297 |              0.3900 |     112.9000 |
| ('Exp 1: LabelEncoder', 'RandomForest')   |    0.6155 |   0.2344 |    0.3297 |              0.4250 |     248.5000 |
| ('Exp 1: LabelEncoder', 'LGBM')           |    0.6152 |   0.2353 |    0.3299 |              0.3710 |      89.0000 |
| ('Exp 2: TargetEncoding', 'XGBoost')      |    0.6135 |   0.2371 |    0.3276 |              0.3450 |     397.0000 |
| ('Exp 1: LabelEncoder', 'LGBM')           |    0.6135 |   0.2334 |    0.3296 |              0.3890 |