![Image in a markdown cell](https://cursos.utnba.centrodeelearning.com/pluginfile.php/1/theme_space/customlogo/1738330016/Logo%20UTN%20Horizontal.png)



# **Diploma in Data Science and Advanced Analytics**
# **Unit 5: Predictive Modeling I**: Regression and Classification

---

# **Kaggle Competition Project: Customer Churn Prediction**

## **Course:** Diploma in Data Science

# **Team Member Names:**
### *   [Full Name of Member 1]
### *   [Full Name of Member 2]
### *   [Full Name of Member 3]

# **Objective:**
## The objective of this project is to build and evaluate several classification models that predict whether a customer of a telecommunications company will leave the service (churn). The final performance of the best model is measured in the Kaggle competition using the **ROC AUC metric**.
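
Since the competition is scored with ROC AUC, it helps to recall what the metric measures: the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it directly from that definition. The helper `roc_auc_pairwise` and the toy labels and scores are illustrative assumptions, not part of the project code; on the same inputs the result should agree with `sklearn.metrics.roc_auc_score`.

```python
from itertools import product


def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = churn) and predicted churn probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(y_true, y_score)
print(f"ROC AUC: {auc:.2f}")
```

Because the metric depends only on how the scores rank the customers, predicted probabilities generally carry more information for it than hard 0/1 labels.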


---

# **Link to join the competition**
### **USE THE LINK TO JOIN AS A TEAM, NOT INDIVIDUALLY**

https://www.kaggle.com/t/57b70c381e4d451b8ae38e164b91a2aa


### **Please follow the instructions provided on the platform**


# 0. **Initial Setup and Library Imports**

## In this section we import all the libraries the project needs. It is good practice to keep all imports in the first cell.


In [16]:
    from print_utils import USE_EMOJIS, safe_print, supports_emojis

    safe_print("\n📦 *** Importing libraries...")

    # Basic imports
    import importlib
    import pathlib
    from datetime import datetime
    from typing import Any, Dict, List, Tuple

    import matplotlib.gridspec as gridspec
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (
        accuracy_score,
        f1_score,
        precision_score,
        recall_score,
        roc_auc_score,
    )
    from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler


📦 *** Importing libraries...


In [17]:
# 🔧 Import project modules (if they exist)
try:
    import data_loader
    import dataset_splitter
    import eda
    import metrics
    import models
    import models_stable  # Stable module without emojis
    import submission

    print("✅  Project modules imported successfully")
    print("INFO: models_stable module available")
except ImportError as e:
    print(f"⚠️  Some project modules are not available: {e}")
    print("💡  Continuing with basic functionality...")


✅  Project modules imported successfully
INFO: models_stable module available


In [19]:
# ========================================================================
# 📊 MLFLOW CONFIGURATION
# ========================================================================
import os

# MLflow is optional: fall back gracefully when it is not installed.
try:
    import mlflow

    MLFLOW_AVAILABLE = True
except ImportError:
    MLFLOW_AVAILABLE = False

print("\n⚙️  Configuring MLflow...")

if MLFLOW_AVAILABLE:
    # Set up the tracking directory (Windows compatible)
    mlflow_tracking_dir = os.path.join(os.getcwd(), "mlruns")
    os.makedirs(mlflow_tracking_dir, exist_ok=True)

    # Use a URI with the file:// scheme for full compatibility
    tracking_uri = pathlib.Path(mlflow_tracking_dir).as_uri()

    print(f"[FOLDER] Setting tracking URI: {tracking_uri}")
    mlflow.set_tracking_uri(tracking_uri)

    # Set environment variables to avoid Model Registry warnings
    os.environ["MLFLOW_DISABLE_ENV_MANAGER_CONDA_WARNING"] = "TRUE"

    experiment_name = "Churn_Prediction_TP5"

    # Create the experiment, or fetch it if it already exists
    try:
        existing_experiment = mlflow.get_experiment_by_name(experiment_name)
        if existing_experiment is not None:
            experiment_id = existing_experiment.experiment_id
            print(
                f"✅ Experiment '{experiment_name}' found with ID: {experiment_id}"
            )
        else:
            experiment_id = mlflow.create_experiment(experiment_name)
            print(
                f"✅ Experiment '{experiment_name}' created with ID: {experiment_id}"
            )
    except Exception as e:
        print(f"⚠️ Experiment setup hit limitations: {str(e)[:100]}...")
        print(" Continuing with the default experiment")
        experiment_id = "0"  # Use the default experiment

    # Set the active experiment
    try:
        mlflow.set_experiment(experiment_name)
        print(f"🎯 Active experiment: {experiment_name}")
        MLFLOW_TRACKING_ENABLED = True
    except Exception:
        print(" Using the default experiment - basic tracking available")
        MLFLOW_TRACKING_ENABLED = True  # Tracking still works

    print("✅ MLflow configured successfully")

    # Start the main run for the whole execution
    main_run_name = (
        f"churn_prediction_run_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    )
    if MLFLOW_TRACKING_ENABLED:
        try:
            mlflow.start_run(run_name=main_run_name)
            print(f"✅ Main run started: {main_run_name}")
            active_run = mlflow.active_run()
            if active_run is not None:
                main_run_id = active_run.info.run_id
                print(f" Run ID: {main_run_id}")
            else:
                print("⚠️ Could not get the active run")
                main_run_id = "fallback"
        except Exception as e:
            print(f"⚠️ Error starting the main run: {e}")
            main_run_id = "fallback"

else:
    print("⚠️ MLflow not available - using fallback configuration")
    MLFLOW_TRACKING_ENABLED = False
    main_run_id = f"fallback_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    experiment_id = "fallback"


#  **1. Data Loading**
We load the datasets provided for the competition: `train.csv`, `test.csv`, and `sample_submission.csv`.

In [None]:
# 📁 Load data from CSV files
import os

print("📁 1. Loading data...")

try:
    # Load the datasets. Note: X_train holds the full training frame,
    # including the target column `Churn`, until it is split off later.
    X_train = pd.read_csv('train.csv')
    X_test = pd.read_csv('test.csv')
    sample_submission_df = pd.read_csv('sample_submission.csv')

    print("✅ Data loaded successfully:")
    print(f"📊   - Training dataset: {X_train.shape}")
    print(f"🧪   - Test dataset: {X_test.shape}")
    print(f"📝   - Sample submission: {sample_submission_df.shape}")

except FileNotFoundError as e:
    print(f"❌ Error: {e}")
    print("💡 Make sure the .csv files are in the correct directory")
    print("📂 Expected files: train.csv, test.csv, sample_submission.csv")

    # List the CSV files that are available
    csv_files = [f for f in os.listdir('.') if f.endswith('.csv')]
    print(f"📋 CSV files found: {csv_files}")
except Exception as e:
    print(f"❌ Unexpected error: {e}")

# Show basic information about the data
if 'X_train' in locals():
    print("\n📊 Basic information about the training dataset:")
    print(f"   - Shape: {X_train.shape}")
    print(f"   - Columns: {list(X_train.columns)}")

    print("\n📋 First 5 rows of the training dataset:")
    print(X_train.head())

    print("\n📋 First 5 rows of the test dataset:")
    print(X_test.head())

    print("\n📋 Sample submission structure:")
    print(sample_submission_df.head())
else:
    print("⚠️ The data could not be loaded")

# **2. Exploratory Data Analysis (EDA)**
In this phase we explore the training dataset to understand the data better, find patterns, identify missing values, and visualize relationships between the features and the target variable (`Churn`).

## Goal: understand the data distribution, the target, and the column types.

Variables such as `Contract`, `InternetService`, and `PaymentMethod` require OneHotEncoding or LabelEncoding. TODO: verify.

Target `Churn`: the dataset is imbalanced (~20% churn). TODO: verify the imbalance.
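
A quick way to verify the imbalance noted above is to look at the normalized class counts. The snippet is a minimal sketch on a hypothetical stand-in series; in the notebook the same call would be applied to the real `Churn` column once `train.csv` is loaded.

```python
import pandas as pd

# Hypothetical stand-in for the target column; the real values come from train.csv.
churn = pd.Series(["No"] * 8 + ["Yes"] * 2, name="Churn")

# Share of each class, as a fraction of all rows.
class_share = churn.value_counts(normalize=True)
minority_share = class_share.min()

print(class_share)
print(f"Minority class share: {minority_share:.0%}")
```

If the minority share on the real data is far from the expected ~20%, the note above should be updated accordingly.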

## Parameter description




In [18]:
# 📊 2. Exploratory Data Analysis (EDA)
import importlib
import sys

print("📊 2. Running exploratory analysis...")


def print_basic_info(df):
    """Fallback summary used when the eda module is missing or fails."""
    print("ℹ️ GENERAL DATASET INFORMATION")
    print("=" * 70)
    print(f"📐 Dimensions: {df.shape}")
    print(f"📋 Columns: {list(df.columns)}")

    # Check for missing values
    missing_values = df.isnull().sum()
    print("🔍 MISSING VALUES:")
    if missing_values.sum() == 0:
        print("✅ No missing values")
    else:
        for col, count in missing_values[missing_values > 0].items():
            print(f"   {col}: {count} ({count/len(df)*100:.1f}%)")


# Make sure the data is loaded
if 'X_train' not in locals() or X_train is None:
    print("❌ Error: X_train is not defined. Run the data loading cell first.")
else:
    try:
        # Reload the eda module to pick up any changes
        if 'eda' in sys.modules:
            import eda
            importlib.reload(eda)
        else:
            import eda

        print("ℹ️ GENERAL DATASET INFORMATION")
        print("=" * 70)
        eda.basic_info(X_train)

    except ImportError:
        print("⚠️ Module 'eda' not available. Running basic analysis...")
        print_basic_info(X_train)
    except Exception as e:
        print(f"❌ Error in the eda module: {e}")
        print("Running basic analysis...")
        print_basic_info(X_train)

    # Distribution of the target variable
    if 'Churn' in X_train.columns:
        print("\n🎯 TARGET VARIABLE DISTRIBUTION (Churn):")
        y_train = X_train['Churn']
        churn_counts = y_train.value_counts()
        churn_pct = y_train.value_counts(normalize=True) * 100

        for value, count in churn_counts.items():
            pct = churn_pct[value]
            label = "No Churn" if value == "No" else "Churn"
            print(f"{label}: {count} ({pct:.1f}%)")

        # Plot the target distribution
        plt.figure(figsize=(12, 4))

        plt.subplot(1, 2, 1)
        y_train.value_counts().plot(kind='bar', color=['lightblue', 'salmon'])
        plt.title('Churn Distribution')
        plt.xlabel('Churn')
        plt.ylabel('Frequency')
        plt.xticks(rotation=0)

        plt.subplot(1, 2, 2)
        y_train.value_counts(normalize=True).plot(kind='pie', autopct='%1.1f%%', colors=['lightblue', 'salmon'])
        plt.title('Churn Proportion')
        plt.ylabel('')

        plt.tight_layout()
        plt.savefig('churn_distribution.png', dpi=150, bbox_inches='tight')
        plt.show()
        print("✅ Distribution plot saved: churn_distribution.png")

        # Show how selected features relate to the target
        print("\n🔍 Feature vs. Churn analysis:")

        # Candidate features for the analysis
        numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
        categorical_features = ['InternetService', 'Dependents', 'OnlineSecurity']

        # Keep only the features that are present in the dataset
        available_numeric = [col for col in numeric_features if col in X_train.columns]
        available_categorical = [col for col in categorical_features if col in X_train.columns]
        features_to_plot = (available_numeric + available_categorical)[:6]  # At most 6 plots

        if features_to_plot:
            fig, axes = plt.subplots(2, 3, figsize=(15, 10))
            axes = axes.flatten()

            for ax, col in zip(axes, features_to_plot):
                try:
                    if col in available_numeric:
                        sns.boxplot(x='Churn', y=col, data=X_train, ax=ax)
                    else:
                        # For categorical features, use a count plot
                        sns.countplot(x=col, hue='Churn', data=X_train, ax=ax)
                        ax.tick_params(axis='x', rotation=45)

                    ax.set_title(f'{col} vs Churn')
                except Exception:
                    ax.text(0.5, 0.5, f"Error: {col}",
                            transform=ax.transAxes, ha='center')

            # Hide unused axes
            for ax in axes[len(features_to_plot):]:
                ax.axis('off')

            plt.tight_layout()
            plt.show()
        else:
            print("⚠️ No known features found for analysis")
    else:
        print("⚠️ Column 'Churn' not found in the dataset")

📊 2. Running exploratory analysis...
❌ Error: X_train is not defined. Run the data loading cell first.


# **3. Data Preprocessing**

We prepare the data so that the Machine Learning models can use it.
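
The internals of `ChurnPredictor.create_preprocessor` are not shown in this notebook. A common approach, consistent with the `ColumnTransformer`, `StandardScaler`, and `OneHotEncoder` imports above, is to scale numeric columns and one-hot encode categorical ones. The sketch below illustrates that idea on a hypothetical miniature frame; it is an assumption about the general technique, not the project's actual implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the churn features.
df = pd.DataFrame({
    "tenure": [1, 24, 60, 12],
    "MonthlyCharges": [29.85, 56.95, 99.65, 70.70],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
})

# Split columns by type: numeric columns are scaled, the rest are encoded.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    # handle_unknown="ignore" avoids errors on categories unseen during fit.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_transformed = preprocessor.fit_transform(df)
print(X_transformed.shape)
```

Two numeric columns plus three `Contract` categories give five output columns, which is why the feature count after preprocessing is larger than the number of raw columns.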

In [None]:
# 3.1 - Data preprocessing using the project's models module
from models import ChurnPredictor

# Initialize the predictor
predictor = ChurnPredictor(random_state=42)

# Build the preprocessor on the features only: the target and the ID column
# must not be scaled or encoded as if they were predictors.
feature_df = X_train.drop(columns=['Churn', 'customerID'], errors='ignore')
predictor.create_preprocessor(feature_df)

print("✅ Preprocessor configured successfully")
X_features = feature_df.shape[1]
print(f"📊 Features to process: {X_features}")

# Show the state of selected columns after preprocessing.
predictor.inspect_transformed_columns(
    X_original=feature_df,
    columns=['Partner', 'Dependents', 'Contract', 'PaymentMethod']
)

# Show preprocessor information
print("\n🔧 Preprocessor configuration:")
numeric_features = feature_df.select_dtypes(include=['int64', 'float64']).columns
categorical_features = feature_df.select_dtypes(include=['object']).columns

print(f"   - Numeric features: {len(numeric_features)}: {list(numeric_features)}")
print(f"   - Categorical features: {len(categorical_features)}: {list(categorical_features)}")

# **4. Modeling and Evaluation**

We now train and evaluate the three required models: Logistic Regression, k-NN, and Naive Bayes. A Random Forest is added as a fourth model.
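
To make the comparison concrete, the following sketch cross-validates the four model families by ROC AUC. It runs on synthetic data from `make_classification`, and the hyperparameters (`n_neighbors=15`, `n_estimators=100`, `max_iter=1000`) are illustrative assumptions; the project's own pipelines come from `ChurnPredictor.create_models()`, whose configuration is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 20% positives).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=15),
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_auc = {}
for name, clf in candidates.items():
    # Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
    pipe = Pipeline([("scaler", StandardScaler()), ("classifier", clf)])
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {cv_auc[name]:.3f}")
```

Scoring with `scoring="roc_auc"` keeps model selection aligned with the competition metric.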

In [None]:
# Always import the most recent version of the ChurnPredictor class

import sys
import importlib

# Name of your module (adjust if necessary)
module_name = "models"

# Reload the module if it is already cached, otherwise import it.
# sys.modules is used because the name `models` may be rebound later in the notebook.
if module_name in sys.modules:
    models = importlib.reload(sys.modules[module_name])
else:
    models = importlib.import_module(module_name)

# Instantiate the class (fixed seed, as in the preprocessing cell)
predictor = models.ChurnPredictor(random_state=42)

# 4.1 Modeling
print("🤖 Starting model training...")

# Prepare the data for the train/validation split
print("🔧 Preparing data for the train/validation split...")

try:
    # Separate features and target
    y = X_train["Churn"]
    print(f"📊 Target variable extracted: {y.shape}")

    # Extract the features (X): remove Churn and customerID
    columns_to_drop = ['Churn']
    if 'customerID' in X_train.columns:
        columns_to_drop.append('customerID')

    X = X_train.drop(columns_to_drop, axis=1)
    print(f"📊 Features extracted: {X.shape}")
    print(f"📋 Columns removed: {columns_to_drop}")

except Exception as e:
    print(f"❌ Error preparing data: {e}")
    print("💡 Make sure the dataset is loaded correctly")
    raise  # Without X and y the split below cannot run

# Split the data into training and internal validation sets
print("\n🔄 Splitting data into internal train/validation sets...")
X_train_split, X_val, y_train_split, y_val = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
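
The `stratify=y` argument matters because the target is imbalanced: it keeps the churn rate the same in both parts of the split. The sketch below shows the effect on a hypothetical 100-row frame with 20% churners; the data and variable names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with 20% churners.
demo = pd.DataFrame({"tenure": range(100), "Churn": ["Yes"] * 20 + ["No"] * 80})

X_tr, X_va, y_tr, y_va = train_test_split(
    demo[["tenure"]],
    demo["Churn"],
    test_size=0.2,
    random_state=42,
    stratify=demo["Churn"],  # preserve the class ratio in both parts
)

train_rate = (y_tr == "Yes").mean()
val_rate = (y_va == "Yes").mean()
print(f"train churn rate: {train_rate:.2f}, validation churn rate: {val_rate:.2f}")
```

Without stratification, a small validation set can end up with noticeably more or fewer churners than the training set, which makes validation ROC AUC less stable.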

In [None]:
# 4.2 Create the models and inspect the validation data.
try:
    # Create the models first (each pipeline includes the preprocessor)
    models = predictor.create_models()

    print("✅ Models with preprocessor configured successfully")
    print(f"📊 Models created: {list(models.keys())}")

    # Inspect the structure of one of the models
    sample_model = list(models.values())[0]
    print(f"🔧 Pipeline structure: {list(sample_model.named_steps.keys())}")

    X_val_features = X_val.shape[1]
    print(f"📊 Features to process: {X_val_features}")

    # Show information about the data
    print("\n🔧 Data information:")
    numeric_features = X_val.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = X_val.select_dtypes(include=['object']).columns

    print(f"   - Numeric features: {len(numeric_features)}: {list(numeric_features)}")
    print(f"   - Categorical features: {len(categorical_features)}: {list(categorical_features)}")

    # Check for problematic categorical columns
    print("\n🔍 Checking categorical data:")
    for col in categorical_features:
        unique_vals = X_val[col].unique()
        print(f"   - {col}: {unique_vals[:5]}{'...' if len(unique_vals) > 5 else ''}")

except Exception as e:
    print(f"❌ Error configuring models: {e}")
    print("💡 Make sure the dataset is loaded correctly")



In [None]:
# 4.3 Create and train the models
print("📊 X_val:")
display(X_val.head())

print("📊 Original y_val")
print(y_val.head())

# ✅ FIXED: Map the correct split data, not the full training data
print("📊 Mapping data for training and evaluation...")
label_mapping = {"No": 0, "Yes": 1}

# Map the SPLIT training data (not the full y_train)
print(f"🔍 y_train_split unique values before mapping: {y_train_split.unique()}")
y_train_split_mapped = y_train_split.map(label_mapping)
print(f"✅ y_train_split_mapped unique values after mapping: {y_train_split_mapped.unique()}")

print(f"🔍 y_val unique values before mapping: {y_val.unique()}")
y_val_mapped = y_val.map(label_mapping)
print(f"✅ y_val_mapped unique values after mapping: {y_val_mapped.unique()}")

print("\n📊 Checking shapes:")
print(f"   X_train_split: {X_train_split.shape}")
print(f"   y_train_split_mapped: {y_train_split_mapped.shape}")
print(f"   X_val: {X_val.shape}")
print(f"   y_val_mapped: {y_val_mapped.shape}")

# ✅ FIXED WORKFLOW: Create preprocessor BEFORE creating models
print("\n🔧 Setting up preprocessor first...")
if predictor.preprocessor is None:
    preprocessor = predictor.create_preprocessor(X_train_split)
    print(f"✅ Preprocessor created: {type(preprocessor)}")

# Create models with proper preprocessor
print("\n🔧 Creating models with preprocessor...")
models = predictor.create_models()
print("✅ Models created with proper preprocessing pipeline")

# ✅ FIXED: Use matching datasets - X_train_split with y_train_split_mapped
print("\n🎯 Training models with mapped data and consistent shapes...")
predictor.train_models(X_train_split, y_train_split_mapped)

print("\n🎉 Training completed for all models:")
for name in predictor.models.keys():
    print(f"   ✅ {name}")

In [None]:
# 🔧 FIXING THE PREPROCESSING ISSUE
# The problem is that create_models() is called before the preprocessor is set up
# Let's check if we need to create the preprocessor first

print("🔍 Checking current predictor state:")
print(f"   - Preprocessor: {type(predictor.preprocessor)}")
print(f"   - Models created: {len(predictor.models)} models")

# If preprocessor is None, we need to create it first
if predictor.preprocessor is None:
    print("\n⚠️  Preprocessor is None - creating it first...")
    preprocessor = predictor.create_preprocessor(X_train_split)
    print(f"✅ Preprocessor created: {type(preprocessor)}")

# Now create models with the proper preprocessor
if len(predictor.models) == 0:
    print("\n🔧 Creating models with preprocessor...")
    models = predictor.create_models()
    print(f"✅ Models created: {list(models.keys())}")
else:
    print("\n✅ Models already exist")

# Final check that models have proper preprocessors
sample_model = list(predictor.models.values())[0]
print("\n🔍 Sample model pipeline steps:")
for step_name, step_obj in sample_model.steps:
    print(f"   - {step_name}: {type(step_obj)}")

In [None]:
# 🔧 RECREATE MODELS WITH PROPER PREPROCESSOR
# Since the existing models were created with preprocessor=None, we need to recreate them

print("🔧 Recreating models with proper preprocessor...")

# Recreate the models now that preprocessor is set up
models = predictor.create_models()
print(f"✅ Models recreated: {list(models.keys())}")

# Verify the models now have proper preprocessors
sample_model = list(predictor.models.values())[0]
print("\n🔍 Sample model pipeline steps (after recreation):")
for step_name, step_obj in sample_model.steps:
    print(f"   - {step_name}: {type(step_obj)}")

print("\n🎯 Ready to train models!")

In [None]:
# 🧪 TEST MODEL TRAINING WITH FIXED PREPROCESSOR
print("🧪 Testing model training with fixed preprocessing pipeline...")

print(f"📊 Training data shape: {X_train_split.shape}")
print(f"📊 Target data shape: {y_train_split_mapped.shape}")

# Train on the mapped 0/1 labels so that training matches the labels
# used later for evaluation (y_val_mapped).
try:
    predictor.train_models(X_train_split, y_train_split_mapped)
    print("\n🎉 SUCCESS! Model training completed without errors.")
except Exception as e:
    print(f"\n❌ Error during training: {e}")
    import traceback
    traceback.print_exc()

In [None]:
# 🔍 DEBUG DATA SHAPE INCONSISTENCY
print("🔍 Debugging data shape inconsistency...")
print(f"X_train shape: {X_train.shape}")
print(f"X_train_split shape: {X_train_split.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_train_split shape: {y_train_split.shape}")

print(f"\n🔍 The issue: X_train_split has {X_train_split.shape[0]} samples")
print(f"              but the full y_train has {y_train.shape[0]} samples")
print("              We need to use y_train_split instead!")

# Map the correct y_train_split data
print("\n🔧 Mapping y_train_split instead of y_train...")
label_mapping = {"No": 0, "Yes": 1}
y_train_split_mapped = y_train_split.map(label_mapping)
print(f"✅ y_train_split_mapped shape: {y_train_split_mapped.shape}")

print("\nNow the shapes match:")
print(f"   X_train_split: {X_train_split.shape}")
print(f"   y_train_split_mapped: {y_train_split_mapped.shape}")
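
One caveat with `Series.map` and a dictionary: any label missing from the dictionary silently becomes `NaN`, which would later break or distort training. The sketch below shows a simple check on a hypothetical series that contains a stray lowercase `"yes"`; the example values are assumptions for illustration.

```python
import pandas as pd

label_mapping = {"No": 0, "Yes": 1}

# Hypothetical labels; note the stray lowercase "yes" that the mapping does not cover.
labels = pd.Series(["No", "Yes", "yes", "No"])

mapped = labels.map(label_mapping)

# Labels that were not in the dictionary show up as NaN after mapping.
unmapped = labels[mapped.isna()].unique().tolist()
print(f"Unmapped labels: {unmapped}")
```

Running the same check on `y_train_split` and `y_val` before training would confirm that every label was mapped to 0 or 1.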

In [None]:
# 🔍 CONCISE DIAGNOSIS OF THE PROBLEM
from sklearn.base import clone

print("🔍 QUICK DIAGNOSIS")
print("="*40)

# Check categorical data
categorical_cols = X_train_split.select_dtypes(include=['object']).columns
print(f"📊 Categorical columns: {len(categorical_cols)}")
if len(categorical_cols) > 0:
    print(f"   Columns: {list(categorical_cols)}")
    for col in categorical_cols[:3]:  # Only the first 3
        sample_vals = X_train_split[col].head(2).tolist()
        print(f"   {col}: {sample_vals}")

# Check the model structure
sample_model = list(models.values())[0]
print(f"\n🤖 Model type: {type(sample_model)}")
if hasattr(sample_model, 'named_steps'):
    steps = list(sample_model.named_steps.keys())
    print(f"   Pipeline steps: {steps}")
else:
    print("   ⚠️ Not a pipeline!")

# Quick test of the full pipeline on an unfitted copy, so the trained
# model stored in the predictor is not overwritten by this check.
print("\n🧪 Full pipeline test:")
test_pipeline = clone(sample_model)
# A single row cannot contain both classes, so use a small slice instead.
X_sample = X_train_split.head(200)
y_sample = y_train_split.loc[X_sample.index]
try:
    test_pipeline.fit(X_sample, y_sample)
    print("   ✅ Pipeline works correctly")
except Exception as e:
    print(f"   ❌ Error: {str(e)[:100]}...")

    # Check whether the problem is in the preprocessor
    if hasattr(test_pipeline, 'named_steps') and 'preprocessor' in test_pipeline.named_steps:
        try:
            preprocessor = test_pipeline.named_steps['preprocessor']
            X_transformed = preprocessor.fit_transform(X_sample)
            print(f"   📊 Preprocessor OK: {X_transformed.shape}")
            print(f"   ❌ Error in the classifier: {str(e)[:50]}...")
        except Exception as prep_error:
            print(f"   ❌ Error in the preprocessor: {str(prep_error)[:50]}...")

print("="*40)

In [None]:
# 4.4 Model evaluation
try:
    from metrics import MetricsCalculator

    # ✅ FIXED: Use y_val_mapped instead of y_val for consistency
    print("📊 Evaluating models with mapped data...")
    results = predictor.evaluate_models(X_val, y_val_mapped)
    best_model_name, best_model = predictor.get_best_model('ROC_AUC', results)

    # ✅ FIXED: Use y_val_mapped for model report generation too
    print("📈 Generating model report...")
    predictor.generate_model_report(X_val, y_val_mapped)

    # Calculate detailed metrics
    calc = MetricsCalculator()
    y_pred = best_model.predict(X_val)
    y_pred_proba = best_model.predict_proba(X_val)[:, 1]
    detailed_report = calc.generate_detailed_report(
        y_val_mapped, y_pred, y_pred_proba,
        class_names=['No Churn', 'Churn'],
        model_name=best_model_name
    )
    print(f"\n🏆 Best model selected: {best_model_name}")

except ImportError:
    print("⚠️ Module metrics not available, running a basic evaluation...")

    # ✅ FIXED: Use y_val_mapped for basic evaluation too
    results = predictor.evaluate_models(X_val, y_val_mapped)
    best_model_name, best_model = predictor.get_best_model('ROC_AUC', results)
    print(f"\n🏆 Best model selected: {best_model_name}")

    # Show basic results (model_metrics avoids shadowing the metrics module)
    print("\n📊 Evaluation results:")
    for model_name, model_metrics in results.items():
        print(f"   {model_name}:")
        for metric_name, value in model_metrics.items():
            print(f"      - {metric_name}: {value:.4f}")
        print()

In [None]:
# 🔍 DIAGNOSIS: Feature Mismatch Issue
print("🔍 Diagnosing the feature mismatch problem...")

# Check data dimensions
print("\n📊 Data dimensions:")
print(f"   X_train_split (used for training): {X_train_split.shape}")
print(f"   X_val (used for evaluation): {X_val.shape}")

# Check if they have the same columns
print("\n📋 Column comparison:")
if set(X_train_split.columns) == set(X_val.columns):
    print("   ✅ X_train_split and X_val have the same columns")
else:
    print("   ⚠️ PROBLEM: X_train_split and X_val have different columns!")

    train_cols = set(X_train_split.columns)
    val_cols = set(X_val.columns)

    missing_in_val = train_cols - val_cols
    extra_in_val = val_cols - train_cols

    if missing_in_val:
        print(f"   Columns in train but NOT in val: {missing_in_val}")
    if extra_in_val:
        print(f"   Columns in val but NOT in train: {extra_in_val}")

# Check what the model expects vs what we're giving it
sample_model = list(predictor.models.values())[0]
print("\n🤖 Model information:")
print(f"   Type: {type(sample_model)}")

# Check what the fitted classifier expects. Note that n_features_in_ counts the
# columns produced by the preprocessor (after encoding), not the raw columns of X_val.
classifier = None
if hasattr(sample_model, 'steps'):
    classifier = sample_model.steps[-1][1]
    if hasattr(classifier, 'n_features_in_'):
        print(f"   Classifier expects: {classifier.n_features_in_} features (after preprocessing)")
        print(f"   X_val has: {X_val.shape[1]} raw features")
        print(f"   Difference: {classifier.n_features_in_ - X_val.shape[1]} (columns added by encoding are expected)")
expected_features = getattr(classifier, 'n_features_in_', None)

# Check preprocessor output
print("\n🔧 Checking the preprocessor:")
if predictor.preprocessor is not None:
    print(f"   Preprocessor configured: {type(predictor.preprocessor)}")

    # Test preprocessor with a small sample
    try:
        X_val_sample = X_val.head(1)
        X_transformed_sample = predictor.preprocessor.transform(X_val_sample)
        print(f"   Original X_val: {X_val_sample.shape}")
        print(f"   After the preprocessor: {X_transformed_sample.shape}")

        if expected_features is None:
            print("   ⚠️ The classifier is not fitted, so feature counts cannot be compared")
        elif X_transformed_sample.shape[1] != expected_features:
            print(f"   ⚠️ PROBLEM: Preprocessor produces {X_transformed_sample.shape[1]} features")
            print(f"               but the model expects {expected_features} features")
        else:
            print("   ✅ Preprocessor produces the correct number of features")

    except Exception as e:
        print(f"   ❌ Error in the preprocessor: {e}")
else:
    print("   ❌ No preprocessor configured!")

print("\n" + "="*60)

In [None]:
# 🔧 SOLUTION: Fix Preprocessor Mismatch
print("🔧 Solving the preprocessor mismatch problem...")

# The issue is that the model pipeline has its own preprocessor
# that was trained during the fit process, but we're trying to use
# a separate preprocessor instance

print("\n🔍 Analyzing the trained model's pipeline...")
sample_model = list(predictor.models.values())[0]

# Extract the fitted preprocessor from the model pipeline
if hasattr(sample_model, 'steps') and len(sample_model.steps) >= 2:
    pipeline_preprocessor = sample_model.steps[0][1]  # First step should be preprocessor
    pipeline_classifier = sample_model.steps[1][1]   # Second step should be classifier

    print(f"   Pipeline step 0 (preprocessor): {type(pipeline_preprocessor)}")
    print(f"   Pipeline step 1 (classifier): {type(pipeline_classifier)}")

    # Test the pipeline preprocessor
    try:
        X_val_sample = X_val.head(1)
        X_transformed_pipeline = pipeline_preprocessor.transform(X_val_sample)
        print(f"   ✅ Pipeline preprocessor produces: {X_transformed_pipeline.shape[1]} features")

        if X_transformed_pipeline.shape[1] == pipeline_classifier.n_features_in_:
            print("   ✅ CORRECT: Pipeline preprocessor matches the classifier")
            print("   📊 The problem was using the external preprocessor instead of the pipeline's internal one")
        else:
            print("   ⚠️ There is still a mismatch inside the pipeline")

    except Exception as e:
        print(f"   ❌ Error with pipeline preprocessor: {e}")

# The solution is simple: the models already have the correct preprocessor built-in
# We just need to call model.predict(X_val) directly without separate preprocessing
print("\n💡 SOLUTION:")
print("   - The Pipeline models already include the correct preprocessor")
print("   - Do NOT use predictor.preprocessor separately")
print("   - Call model.predict(X_val) and model.predict_proba(X_val) directly")

# Test the solution
print("\n🧪 Testing the solution:")
try:
    test_model = list(predictor.models.values())[0]
    y_pred_test = test_model.predict(X_val.head(1))
    y_pred_proba_test = test_model.predict_proba(X_val.head(1))

    print("   ✅ SUCCESS: model.predict() works correctly")
    print(f"   Prediction: {y_pred_test}")
    print(f"   Probabilities: {y_pred_proba_test}")

except Exception as e:
    print(f"   ❌ ERROR: {e}")

print("\n🎯 Now we can proceed with the correct evaluation")
print("="*60)

In [None]:
# 🔧 FINAL FIX: retrain the models with consistent preprocessing
print("🔧 The problem requires retraining with consistent preprocessing...")

# Training was done with an inconsistent preprocessor,
# so we retrain the models from a clean start.

print("\n🔄 Reinitializing the full pipeline...")

# 1. Reload the models module and create a fresh predictor instance
import importlib
import models as models_module
importlib.reload(models_module)

predictor_fixed = models_module.ChurnPredictor(random_state=42)

# 2. Create the preprocessor from the training data
print("📊 Creating the preprocessor from the training data...")
preprocessor_fixed = predictor_fixed.create_preprocessor(X_train_split)

# 3. Fit the preprocessor on the training data only
print("🔧 Fitting the preprocessor on the training data...")
preprocessor_fixed.fit(X_train_split)

# 4. Verify that train and validation rows transform to the same width
print("\n🔍 Checking the fixed preprocessor:")
X_train_sample = X_train_split.head(1)
X_val_sample = X_val.head(1)

X_train_transformed = preprocessor_fixed.transform(X_train_sample)
X_val_transformed = preprocessor_fixed.transform(X_val_sample)

print(f"   X_train_split original: {X_train_sample.shape}")
print(f"   X_train_split transformed: {X_train_transformed.shape}")
print(f"   X_val original: {X_val_sample.shape}")
print(f"   X_val transformed: {X_val_transformed.shape}")

if X_train_transformed.shape[1] == X_val_transformed.shape[1]:
    print(f"   ✅ Consistent shapes: {X_train_transformed.shape[1]} features")
    
    # 5. Create the models with the fixed preprocessor
    print("\n🤖 Creating models with the fixed preprocessor...")
    models_fixed = predictor_fixed.create_models()
    
    # 6. Retrain on the same data that was used before
    print("\n🎯 Retraining models on consistent data...")
    predictor_fixed.train_models(X_train_split, y_train_split_mapped)
    
    # 7. Test the retrained models on a single validation row
    print("\n🧪 Testing the fixed models...")
    test_model_fixed = list(predictor_fixed.models.values())[0]
    
    y_pred_test_fixed = test_model_fixed.predict(X_val.head(1))
    y_pred_proba_test_fixed = test_model_fixed.predict_proba(X_val.head(1))
    
    print("   ✅ SUCCESS: retrained models work correctly")
    print(f"   Prediction: {y_pred_test_fixed}")
    print(f"   Probabilities shape: {y_pred_proba_test_fixed.shape}")
    
    # 8. Replace the original predictor with the fixed one
    print("\n🔄 Replacing the original predictor with the fixed version...")
    predictor = predictor_fixed
    models = predictor.models  # keep the models dict in sync as well
    
    print("   ✅ Predictor updated successfully")
    print(f"   📊 Available models: {list(models.keys())}")
    print("\n🎉 PROBLEM SOLVED: we can now proceed with the evaluation")
    
else:
    print("   ❌ Shapes are still inconsistent")
    print(f"   Train: {X_train_transformed.shape[1]}, Val: {X_val_transformed.shape[1]}")

print("="*60)

#  **5. Model Selection and Kaggle Submission Generation**

## Based on your validation results, choose the best model. Then retrain it on **all the data in `train.csv`** and use it to make predictions on `test.csv`.
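
The selection step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the candidate dictionary and the variable names (`X_full`, `X_test_demo`, `final_model`) are assumptions made for the example. It scores each candidate by validation ROC AUC, then refits a fresh copy of the winner on the full training set before predicting probabilities for the test set.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for train.csv and test.csv (assumption: the real data is not used here)
X_all, y_all = make_classification(n_samples=600, n_features=8, random_state=42)
X_full, X_test_demo, y_full, _ = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42, stratify=y_all
)
# Internal split used only to compare candidates
X_tr, X_va, y_tr, y_va = train_test_split(
    X_full, y_full, test_size=0.2, random_state=42, stratify=y_full
)

# Hypothetical candidates; each pipeline bundles its own preprocessing
candidates = {
    "LogisticRegression": Pipeline([
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]),
    "KNN": Pipeline([
        ("scaler", StandardScaler()),
        ("classifier", KNeighborsClassifier()),
    ]),
}

# Score every candidate on the held-out validation split
val_auc = {}
for name, pipe in candidates.items():
    pipe.fit(X_tr, y_tr)
    val_auc[name] = roc_auc_score(y_va, pipe.predict_proba(X_va)[:, 1])

best_name = max(val_auc, key=val_auc.get)

# Refit an unfitted copy of the winner on ALL training rows, then predict on test
final_model = clone(candidates[best_name]).fit(X_full, y_full)
test_proba = final_model.predict_proba(X_test_demo)[:, 1]

print(f"Best model: {best_name} (validation ROC AUC = {val_auc[best_name]:.4f})")
print(f"Test probabilities shape: {test_proba.shape}")
```

Using `clone` gives an unfitted copy with the same hyperparameters, so the final fit does not depend on state left over from the validation run.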

In [None]:
# 🔧 CHURNPREDICTOR SETUP AND TRAINING

from models import ChurnPredictor

print("🚀 Initializing ChurnPredictor...")

# Use the ORIGINAL data (before preprocessing) to build the preprocessor
predictor = ChurnPredictor(random_state=42)

# Check and clean the data before preprocessing
print("🔍 Checking data before preprocessing...")

# X_train is assumed to hold the original, unprocessed data (possibly with customerID).
# Drop customerID from X_train if it is present.
if 'customerID' in X_train.columns:
    X_train_clean = X_train.drop(['customerID'], axis=1)
else:
    X_train_clean = X_train.copy()

# Keep the customer IDs for the submission file, then drop the column from X_test.
if 'customerID' in X_test.columns:
    customer_ids = X_test['customerID']
    X_test_clean = X_test.drop(['customerID'], axis=1)
else:
    # Fallback (assumption): use the index as the ID when the column is missing
    customer_ids = pd.Series(X_test.index, name='customerID')
    X_test_clean = X_test.copy()


# FINAL SYNCHRONIZATION: make sure X and y have the same number of samples
if X_train_clean.shape[0] != y_train.shape[0]:
    print("⚠️ Synchronizing final data:")
    print(f"   - X_train_clean: {X_train_clean.shape[0]} rows, y_train: {y_train.shape[0]} rows")
    
    # Truncating assumes both objects are still in the same row order
    min_samples = min(X_train_clean.shape[0], y_train.shape[0])
    X_train_clean = X_train_clean.iloc[:min_samples]
    y_train_sync = y_train.iloc[:min_samples] if hasattr(y_train, 'iloc') else y_train[:min_samples]
    
    print(f"   → Truncated both to {min_samples} samples")
else:
    y_train_sync = y_train
    print("✅ Data is already synchronized")

# IMPORTANT: map y_train_sync to numeric labels for type consistency
print("\n🔧 Mapping y_train_sync to numeric format...")
y_train_sync = predictor.map_target(y_train_sync)

# Create the preprocessor from the original data
preprocessor = predictor.create_preprocessor(X_train_clean)

print("✅ Preprocessor configured successfully")
print(f"📊 Input features: {X_train_clean.shape[1]}")
print(f"📊 Training samples: {X_train_clean.shape[0]}")

# Create the models (this uses the preprocessor automatically)
models = predictor.create_models()

# TRAIN on the synchronized data
print("\n🎯 Starting training on synchronized data...")
predictor.train_models(X_train_clean, y_train_sync)

print("\n🎉 Training completed for all models:")
for model_name in models.keys():
    print(f"   ✅ {model_name}")

print("\n📊 Training summary:")
print(f"   - Training data: {X_train_clean.shape}")
print(f"   - Labels: {y_train_sync.shape if hasattr(y_train_sync, 'shape') else len(y_train_sync)}")
print(f"   - Models trained: {len(models)}")

# **Function to generate the submission file**
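
The project's `create_submission_file` lives in `submission.py` and is not shown in this notebook. As a rough illustration of what such a step involves, here is a minimal sketch of a hypothetical helper, `build_submission`, that pairs customer IDs with predicted churn probabilities and validates them before writing. The column names `customerID` and `Churn` are assumptions; check the competition's sample submission for the exact header.

```python
import numpy as np
import pandas as pd


def build_submission(customer_ids, churn_proba, id_col="customerID", target_col="Churn"):
    """Hypothetical helper: build a two-column submission frame with basic checks."""
    # Reset the index so IDs and probabilities are paired by position, not by label
    ids = pd.Series(customer_ids).reset_index(drop=True)
    proba = np.asarray(churn_proba, dtype=float)

    if len(ids) != len(proba):
        raise ValueError(f"{len(ids)} IDs but {len(proba)} predictions")
    if np.isnan(proba).any():
        raise ValueError("Predictions contain NaN")
    if ((proba < 0) | (proba > 1)).any():
        raise ValueError("Probabilities must lie in [0, 1]")

    return pd.DataFrame({id_col: ids, target_col: proba})


# Made-up IDs and probabilities, for illustration only
demo = build_submission(["0001-A", "0002-B", "0003-C"], [0.12, 0.87, 0.45])
csv_text = demo.to_csv(index=False)  # pass a path instead to write the file
print(csv_text)
```

Because ROC AUC depends only on how the predictions are ranked, the submission should carry the positive-class probability rather than hard 0/1 labels.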

In [None]:
# Final predictions and creation of the submission file
from submission import create_submission_file

print("📄 Generating final predictions...")


print(f"📊 Data for final training: {len(X_train_clean):,} samples")

# Single source of truth for the output name, reused in the final message
submission_filename = "submission_grupoM_inicial.csv"

# Create the submission file
submission_df = create_submission_file(
    final_model=best_model,
    X_train_full=X_train_clean,  # features only
    y_train_full=y_train_sync,   # synchronized target variable
    X_test_full=X_test_clean,    # test features only
    customer_ids=customer_ids,
    filename=submission_filename
)

# Show the first predictions
print("\n📋 First 10 predictions:")
print(submission_df.head(10))

# Prediction statistics (second column holds the predicted probabilities)
predictions = submission_df.iloc[:, 1].values
print("\n📊 Prediction statistics:")
print(f"   - Churn predictions (>0.5): {np.sum(predictions > 0.5):,} ({np.mean(predictions > 0.5)*100:.1f}%)")
print(f"   - No-churn predictions (≤0.5): {np.sum(predictions <= 0.5):,} ({np.mean(predictions <= 0.5)*100:.1f}%)")
print(f"   - Range: [{predictions.min():.4f}, {predictions.max():.4f}]")

print(f"\n✅ Submission file '{submission_filename}' created successfully")
print("🎯 Ready to upload to Kaggle!")

In [None]:
# 🔧 Hyperparameter optimization for the best model
from models import hyperparameter_tuning

print(f"🔧 Optimizing hyperparameters for {best_model_name}...")

# Define the parameter grid according to the model type
if 'Logistic' in best_model_name:
    param_grid = {
        'classifier__C': [0.1, 1.0, 10.0],
        'classifier__solver': ['liblinear', 'lbfgs']
    }
elif 'KNN' in best_model_name:
    param_grid = {
        'classifier__n_neighbors': [3, 5, 7, 9],
        'classifier__weights': ['uniform', 'distance']
    }
elif 'Random' in best_model_name:
    param_grid = {
        'classifier__n_estimators': [50, 100, 200],
        'classifier__max_depth': [None, 10, 20],
        'classifier__min_samples_split': [2, 5]
    }
else:
    # For Naive Bayes or other models
    param_grid = {
        'classifier__var_smoothing': [1e-9, 1e-8, 1e-7]
    }

# Run the hyperparameter search
grid_search = hyperparameter_tuning(
    best_model, param_grid, X_train_clean, y_train_sync,
    cv=5, scoring='roc_auc'
)

print("\n🎯 Optimized hyperparameters:")
print(f"   - Best CV score: {grid_search.best_score_:.4f}")
print(f"   - Best parameters: {grid_search.best_params_}")

# Take the best estimator found by the search
optimized_model = grid_search.best_estimator_

print("\n🤖 Generating predictions with the optimized model...")

# NOTE: if X_val was split from the same rows as X_train_clean, the search has
# already seen them, so metrics computed on X_val will be optimistic.
# The cross-validated best_score_ above is the less biased estimate.
y_pred_opt = optimized_model.predict(X_val)
y_pred_proba_opt = optimized_model.predict_proba(X_val)[:, 1]

print("✅ Predictions generated successfully")
print("📋 Metrics are computed in the following cells, after the type correction")

In [None]:
# 🔧 FIX: resolve the type mismatch between y_val and y_pred_opt

print("🔍 Diagnosing data types...")
print(f"Type of y_val: {type(y_val)}")
print(f"Unique values in y_val: {np.unique(y_val)}")
print(f"Type of y_pred_opt: {type(y_pred_opt)}")
print(f"Unique values in y_pred_opt: {np.unique(y_pred_opt)}")

# Check whether the predictions are strings
if y_pred_opt.dtype == 'object' or isinstance(y_pred_opt[0], str):
    print("⚠️ Predictions are text labels, converting to numeric...")
    
    # Map text predictions to numbers ('Yes' -> 1, anything else -> 0)
    y_pred_opt_numeric = np.where(y_pred_opt == 'Yes', 1, 0)
    
    print("✅ Predictions converted:")
    print(f"   - Before: {np.unique(y_pred_opt)}")
    print(f"   - After: {np.unique(y_pred_opt_numeric)}")
    
    # Replace the predictions
    y_pred_opt = y_pred_opt_numeric
else:
    print("✅ Predictions are already numeric")

# Check y_val for the same problem
if y_val.dtype == 'object' or isinstance(y_val.iloc[0] if hasattr(y_val, 'iloc') else y_val[0], str):
    print("⚠️ y_val is also text, converting...")
    
    # Convert y_val if needed
    if hasattr(y_val, 'map'):  # pandas Series
        y_val_numeric = y_val.map({'Yes': 1, 'No': 0})
    else:  # numpy array
        y_val_numeric = np.where(y_val == 'Yes', 1, 0)
    
    y_val = y_val_numeric
    print("✅ y_val converted to numeric format")

print("\n📊 Final state:")
print(f"   - y_val: type {type(y_val)}, unique values {np.unique(y_val)}")
print(f"   - y_pred_opt: type {type(y_pred_opt)}, unique values {np.unique(y_pred_opt)}")
print("✅ Data types synchronized")

In [None]:
# 📈 FINAL EVALUATION: compute the optimized model's metrics and log them to MLflow

print("📊 Computing metrics for the optimized model with corrected types...")

# Import the metrics
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score

try:
    # Compute the optimized model's metrics
    opt_auc = roc_auc_score(y_val, y_pred_proba_opt)
    opt_acc = accuracy_score(y_val, y_pred_opt)
    opt_precision = precision_score(y_val, y_pred_opt)
    opt_recall = recall_score(y_val, y_pred_opt)
    opt_f1 = f1_score(y_val, y_pred_opt)

    print("\n🎯 Optimized model metrics:")
    print(f"   - ROC AUC: {opt_auc:.4f}")
    print(f"   - Accuracy: {opt_acc:.4f}")
    print(f"   - Precision: {opt_precision:.4f}")
    print(f"   - Recall: {opt_recall:.4f}")
    print(f"   - F1-Score: {opt_f1:.4f}")

    # 🎯 MLflow: log the optimized model
    print("\n🔄 Logging the optimized model to MLflow...")
    
    # Collect the optimized model's hyperparameters
    if hasattr(optimized_model, 'named_steps'):
        classifier = optimized_model.named_steps['classifier']
        optimized_hyperparams = classifier.get_params()
    else:
        optimized_hyperparams = optimized_model.get_params()
    
    # Add information about the optimization itself
    optimized_hyperparams['optimization_method'] = 'GridSearchCV'
    optimized_hyperparams['cv_folds'] = 5
    optimized_hyperparams['cv_best_score'] = grid_search.best_score_
    
    # Log to MLflow
    optimized_run_id = log_model_metrics(
        model_name=f"{best_model_name}_OPTIMIZED",
        y_true=y_val,
        y_pred=y_pred_opt,
        y_pred_proba=y_pred_proba_opt,
        model=optimized_model,
        X_test=X_val,
        hyperparams=optimized_hyperparams
    )
    
    print(f"   ✅ Optimized model logged - Run ID: {optimized_run_id[:8]}...")

    # 📊 Compare with the original model if its results are available
    if 'results' in locals() or 'results' in globals():
        original_auc = results[best_model_name]['ROC_AUC']
        improvement = opt_auc - original_auc
        
        print("\n📈 Comparison with the original model:")
        print(f"   - Original ROC AUC: {original_auc:.4f}")
        print(f"   - Optimized ROC AUC: {opt_auc:.4f}")
        print(f"   - Improvement: {improvement:.4f} ({improvement*100:.2f} percentage points)")
        
        # 🎯 MLflow: log the comparison
        with mlflow.start_run(run_name=f"COMPARISON_{best_model_name}"):
            mlflow.log_metric("original_roc_auc", original_auc)
            mlflow.log_metric("optimized_roc_auc", opt_auc)
            mlflow.log_metric("improvement", improvement)
            mlflow.log_metric("improvement_percentage", improvement * 100)
            
            mlflow.set_tag("comparison", "original_vs_optimized")
            mlflow.set_tag("best_model", best_model_name)
            
            # Record the decision; 0.001 AUC is the threshold chosen for this project
            if improvement > 0.001:
                mlflow.set_tag("decision", "use_optimized")
                print("🎉 Optimization succeeded: ROC AUC improved by more than 0.001")
                best_model = optimized_model
                print("✅ Best model updated to the optimized version")
            else:
                mlflow.set_tag("decision", "use_original")
                print("📝 Optimization did not improve ROC AUC by more than 0.001")
                print("✅ Keeping the original model")
                
        print("   ✅ Comparison logged to MLflow")
    else:
        print("📝 No original results available for comparison")

    print("\n✅ Metric computation completed successfully")

except Exception as e:
    print(f"❌ Error computing metrics: {e}")
    print("💡 Checking which variables are available...")
    
    available_vars = []
    for var in ['y_val', 'y_pred_opt', 'y_pred_proba_opt', 'results', 'best_model_name']:
        if var in locals() or var in globals():
            available_vars.append(var)
            
    print(f"Available variables: {available_vars}")
    
    # Log the error to MLflow
    with mlflow.start_run(run_name="ERROR_OPTIMIZATION_EVALUATION"):
        mlflow.set_tag("status", "error")
        mlflow.set_tag("error_message", str(e))
        mlflow.log_param("available_variables", str(available_vars))

In [None]:

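from pathlib import Path

# Build the path with pathlib: the original "submissions\s..." literal contained an
# invalid "\s" escape sequence and only worked as a Windows path.
submissions_dir = Path("submissions")
submissions_dir.mkdir(exist_ok=True)
filename = str(submissions_dir / f"submission_grupoM_optimized_{optimized_run_id[:5]}.csv")

submission_df_optimized = create_submission_file(
    final_model=optimized_model,
    X_train_full=X_train_clean,
    y_train_full=y_train_sync,
    X_test_full=X_test_clean,
    customer_ids=customer_ids,
    filename=filename
)

print("\n📋 First 10 predictions from the optimized model:")
print(submission_df_optimized.head(10))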

#  **6. Conclusions (Optional but Recommended)**

## Write a brief summary of your findings.
* ## Which model performed best, and why do you think that was?
  Logistic regression performed best; it was the model that best fit the available data.
* ## What were the most important features or the most interesting findings from the EDA?
  Pending
* ## What challenges did you face, and how did you solve them?
  We faced several challenges, which can be grouped as follows:
  1. Technical challenges in setting up the environment.
  2. Challenges in developing the ML pipeline, mainly in data preprocessing, internal validation using a train/test split, parameter conversion, NaN validation, and casting problems caused by mismatched data types.
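
The preprocessing and type-casting challenges listed above can be illustrated with a minimal sketch. It is not the project's `ChurnPredictor`: the toy columns (`tenure`, `MonthlyCharges`, `Contract`) and their values are made up for the example. It maps the text target to 0/1 once, lets the pipeline impute NaN values, and keeps the preprocessor inside the pipeline so that the classifier always receives the feature count it was trained on.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with a missing value and a text target (illustrative values only)
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 60],
    "MonthlyCharges": [29.85, 56.95, np.nan, 42.30, 99.65, 89.10, 29.75, 104.80],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Month-to-month", "Month-to-month", "Two year"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes", "No", "No"],
})

# Map the target once, up front, and fail loudly on unexpected labels
y = df["Churn"].map({"No": 0, "Yes": 1})
if y.isna().any():
    raise ValueError("Unexpected label in target column")
X = df.drop(columns="Churn")

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # handles the NaN
        ("scale", StandardScaler()),
    ]), ["tenure", "MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

# The preprocessor is a pipeline step, so it is fitted and applied with the model
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# The pipeline's own preprocessor always matches its classifier
n_out = model.named_steps["preprocessor"].transform(X.head(1)).shape[1]
n_expected = model.named_steps["classifier"].n_features_in_
proba = model.predict_proba(X.head(2))[:, 1]  # raw rows in, probabilities out

print(f"Preprocessor output features: {n_out}, classifier expects: {n_expected}")
```

Calling `model.predict_proba` on raw rows is what avoids the feature-count mismatch: no separately fitted preprocessor can drift out of sync with the classifier.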


In [None]:
# 🎯 MLflow: experiment dashboard and model comparison

print("📊 Building the MLflow experiment dashboard...")

# Fetch every run of the current experiment
experiment = mlflow.get_experiment_by_name("Churn_Prediction_TP5")
if experiment is None:
    raise RuntimeError("Experiment 'Churn_Prediction_TP5' not found; check the tracking URI")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

print(f"\n🧪 Experiment summary: {experiment.name}")
print(f"   - Total runs: {len(runs)}")
print(f"   - Tracking URI: {mlflow.get_tracking_uri()}")

# Show the top 5 models by ROC AUC
if len(runs) > 0 and 'metrics.roc_auc' in runs.columns:
    print("\n🏆 Top 5 models by ROC AUC:")
    top_models = runs.nlargest(5, 'metrics.roc_auc')[
        ['tags.model_type', 'metrics.roc_auc', 'metrics.accuracy', 'metrics.f1_score']
    ]
    
    for i, (idx, row) in enumerate(top_models.iterrows(), 1):
        model_name = row.get('tags.model_type', 'Unknown')
        roc_auc = row.get('metrics.roc_auc', 0)
        accuracy = row.get('metrics.accuracy', 0)
        f1 = row.get('metrics.f1_score', 0)  # named f1 to avoid shadowing sklearn's f1_score
        
        print(f"   {i}. {model_name}")
        print(f"      - ROC AUC: {roc_auc:.4f}")
        print(f"      - Accuracy: {accuracy:.4f}")
        print(f"      - F1-Score: {f1:.4f}")

    # Comparison plots
    plt.figure(figsize=(15, 5))
    
    # Subplot 1: ROC AUC comparison
    plt.subplot(1, 3, 1)
    # str() guards against runs that have no model_type tag (NaN/None)
    model_names = [str(name) if len(str(name)) <= 15 else str(name)[:12] + '...'
                   for name in top_models['tags.model_type']]
    plt.bar(model_names, top_models['metrics.roc_auc'], color='skyblue')
    plt.title('ROC AUC Comparison')
    plt.ylabel('ROC AUC Score')
    plt.xticks(rotation=45)
    plt.grid(axis='y', alpha=0.3)
    
    # Subplot 2: accuracy comparison
    plt.subplot(1, 3, 2)
    plt.bar(model_names, top_models['metrics.accuracy'], color='lightgreen')
    plt.title('Accuracy Comparison')
    plt.ylabel('Accuracy Score')
    plt.xticks(rotation=45)
    plt.grid(axis='y', alpha=0.3)
    
    # Subplot 3: F1-score comparison
    plt.subplot(1, 3, 3)
    plt.bar(model_names, top_models['metrics.f1_score'], color='salmon')
    plt.title('F1-Score Comparison')
    plt.ylabel('F1-Score')
    plt.xticks(rotation=45)
    plt.grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Detailed comparison table
    print("\n📋 Detailed comparison table:")
    comparison_df = runs[[
        'tags.model_type', 'metrics.roc_auc', 'metrics.accuracy', 
        'metrics.precision', 'metrics.recall', 'metrics.f1_score'
    ]].round(4)
    comparison_df.columns = ['Model', 'ROC_AUC', 'Accuracy', 'Precision', 'Recall', 'F1_Score']
    display(comparison_df.sort_values('ROC_AUC', ascending=False))
    
else:
    print("⚠️ No ROC AUC metrics were found in the runs")
    if len(runs) > 0:
        print("📋 Available columns:")
        print(runs.columns.tolist())

# How to reach the MLflow UI
print("\n🌐 To open the full MLflow dashboard:")
print("   1. Open a terminal in the project folder")
print(f"   2. Run: mlflow ui --backend-store-uri {mlflow.get_tracking_uri()}")
print("   3. Open your browser at: http://localhost:5000")
print(f"   4. Look for the experiment: '{experiment.name}'")

print("\n✅ Experiment dashboard generated successfully")
print("🎯 You can now easily compare all your models and experiments!")

# 🚀 **Start the MLflow UI for Interactive Exploration**

## **Steps to open the MLflow web dashboard:**

### **1. Open a terminal**
- In VS Code: press ``Ctrl+Shift+` `` (backtick) or go to Terminal > New Terminal

### **2. Go to the project directory**
```bash
cd "notebooks\UTN-elearning-analisis-datos-avanzado\Unidades\Unidad5\TP5"
```

### **3. Start the MLflow server**
```bash
mlflow ui
```

### **4. Open the dashboard**
- Open your web browser
- Go to: **http://localhost:5000**
- Look for the experiment: **"Churn_Prediction_TP5"**

## **MLflow UI features:**

### **📊 Experiment Comparison**
- **Comparison table** of all trained models
- **Filtering and sorting** by any metric
- **Comparison charts** generated automatically

### **📈 Detailed Metrics**
- **ROC AUC, Accuracy, Precision, Recall, F1-Score**
- **Hyperparameters** of each model
- **Artifacts** (saved models, plots, etc.)

### **🔍 Run Analysis**
- **Timeline** of experiments
- **Side-by-side comparison** of models
- **Download** of trained models

### **🏆 Best Practices**
- Full **reproducibility** of experiments
- Model **versioning**
- Team **collaboration**

---
### 💡 **Tip:** The MLflow server keeps running in that terminal until you stop it with `Ctrl+C`, so you can leave it open while you work in the notebook.