[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/husseinlopez/diplomadoIA/blob/main/temp.ipynb)

# Module 1: Introduction to Data Mining
## Hands-On Exercises in Data Cleaning and Preparation

**Diplomado en Inteligencia Artificial**  
**Dr. Irvin Hussein López Nava**  
**CICESE - UABC**

---

## Session objectives

1. **Identify and correct quality problems** in real-world datasets
2. **Handle missing values** with different imputation strategies
3. **Detect and treat outliers** without losing relevant information
4. **Apply dimensionality reduction techniques** (PCA, t-SNE)
5. **Select relevant features** using filter, wrapper, and embedded methods
6. **Balance imbalanced classes** with over- and undersampling techniques
7. **Integrate all of these techniques** into a robust preprocessing pipeline

## Notebook structure

### Part 1: Data Cleaning
* Initial inspection and problem detection
* Handling missing values
* Identifying and treating outliers
* Transformations and scaling

### Part 2: Dimensionality Reduction
* Principal Component Analysis (PCA)
* t-SNE for nonlinear visualization
* Comparison of methods

### Part 3: Feature Selection
* Filter methods
* Wrapper methods
* Embedded methods
* Consensus across methods

### Part 4: Class Balancing
* Oversampling techniques (SMOTE, ADASYN)
* Undersampling techniques
* Evaluating the impact on metrics

### Part 5: Complete Pipeline
* Integration of all techniques
* Documentation of decisions
* Final validation

---
## 0. Environment Setup

We import every library needed for the full analysis.
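
The import cells below rely on `imbalanced-learn` (imported as `imblearn`), which is a separate package from scikit-learn. The following is a minimal sketch, assuming a standard Python environment, of how one might check for missing packages before running the notebook. The helper name `missing_packages` and the module-to-package mapping are illustrative and not part of the course material.

```python
from importlib.util import find_spec

# Import name -> pip package name (illustrative mapping for this notebook).
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "plotly": "plotly",
    "sklearn": "scikit-learn",
    "imblearn": "imbalanced-learn",
}

def missing_packages(required):
    """Return the pip names of top-level modules that cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages; install with: pip install " + " ".join(missing))
else:
    print("All required packages are available")
```

If anything is reported as missing, it can be installed from a notebook cell with `%pip install` followed by the listed names.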

In [None]:
# Data handling
import numpy as np
import pandas as pd
from scipy import stats

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Plot settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# pandas display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Reproducibility
np.random.seed(42)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✓ Core libraries imported successfully")

In [None]:
# Preprocessing
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    LabelEncoder, OneHotEncoder, PowerTransformer
)
from sklearn.impute import SimpleImputer, KNNImputer

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Feature selection
from sklearn.feature_selection import (
    SelectKBest, chi2, f_classif, mutual_info_classif,
    RFE, SelectFromModel
)

# Models for embedded selection
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.tree import DecisionTreeClassifier

# Class balancing
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek

# Evaluation and validation
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import (
    classification_report, confusion_matrix,
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve
)

# Datasets
from sklearn.datasets import (
    load_breast_cancer, load_wine, load_iris,
    make_classification, make_blobs
)

print("✓ ML and preprocessing libraries imported successfully")

---
# Part 1: Data Cleaning

In this section we work with a dataset that exhibits common problems:
- Missing values
- Outliers
- Incompatible scales
- Incorrect data types

## 1.1 Creating a Dataset with Realistic Problems

We build a synthetic dataset that simulates medical data with typical problems.

In [None]:
def create_messy_health_dataset(n_samples=500):
    """
    Create a synthetic health dataset with realistic data-quality problems:
    - Missing values (MCAR, MAR, MNAR)
    - Outliers
    - Inconsistent scales
    - Recording errors
    """
    np.random.seed(42)
    
    # Base variables
    data = {
        'edad': np.random.normal(45, 15, n_samples).clip(18, 90),
        'peso': np.random.normal(70, 15, n_samples).clip(40, 150),
        'estatura': np.random.normal(165, 10, n_samples).clip(140, 200),
        'presion_sistolica': np.random.normal(120, 15, n_samples).clip(80, 200),
        'presion_diastolica': np.random.normal(80, 10, n_samples).clip(60, 120),
        'glucosa': np.random.normal(100, 20, n_samples).clip(70, 300),
        'colesterol': np.random.normal(200, 40, n_samples).clip(120, 350),
        'trigliceridos': np.random.normal(150, 50, n_samples).clip(50, 500),
        'frecuencia_cardiaca': np.random.normal(75, 10, n_samples).clip(50, 120),
    }
    
    df = pd.DataFrame(data)
    
    # Compute BMI (column 'imc')
    df['imc'] = df['peso'] / ((df['estatura']/100) ** 2)
    
    # Categorical variables
    df['genero'] = np.random.choice(['M', 'F'], n_samples)
    df['fumador'] = np.random.choice(['Si', 'No', 'Exfumador'], n_samples, p=[0.2, 0.6, 0.2])
    df['diabetes'] = (df['glucosa'] > 126).astype(int)
    df['hipertension'] = (df['presion_sistolica'] > 140).astype(int)
    
    # Introduce missing values under different mechanisms
    
    # MCAR (Missing Completely At Random): 5% of 'edad'
    mcar_mask = np.random.random(n_samples) < 0.05
    df.loc[mcar_mask, 'edad'] = np.nan
    
    # MAR (Missing At Random): people with diabetes have more missing cholesterol values
    mar_mask = (df['diabetes'] == 1) & (np.random.random(n_samples) < 0.15)
    df.loc[mar_mask, 'colesterol'] = np.nan
    
    # MNAR (Missing Not At Random): high glucose values are more likely to be missing
    high_glucose = df['glucosa'] > df['glucosa'].quantile(0.75)
    mnar_mask = high_glucose & (np.random.random(n_samples) < 0.10)
    df.loc[mnar_mask, 'glucosa'] = np.nan
    
    # Additional missing values
    df.loc[np.random.random(n_samples) < 0.08, 'trigliceridos'] = np.nan
    df.loc[np.random.random(n_samples) < 0.03, 'frecuencia_cardiaca'] = np.nan
    
    # Introduce outliers
    
    # Extreme outliers (measurement errors)
    outlier_indices = np.random.choice(n_samples, size=10, replace=False)
    df.loc[outlier_indices[:3], 'peso'] = np.random.uniform(200, 250, 3)
    df.loc[outlier_indices[3:6], 'presion_sistolica'] = np.random.uniform(220, 280, 3)
    df.loc[outlier_indices[6:], 'glucosa'] = np.random.uniform(400, 600, 4)
    
    # Moderate outliers (real but unusual values)
    moderate_outliers = np.random.choice(n_samples, size=20, replace=False)
    df.loc[moderate_outliers, 'colesterol'] = np.random.uniform(300, 400, 20)
    
    # Introduce inconsistencies
    
    # Most heights are in cm; a few are recorded in meters
    error_indices = np.random.choice(n_samples, size=5, replace=False)
    df.loc[error_indices, 'estatura'] = df.loc[error_indices, 'estatura'] / 100
    
    # Compute the target variable (cardiovascular risk)
    risk_score = (
        (df['edad'] > 55).astype(int) * 2 +
        (df['imc'] > 30).astype(int) * 2 +
        df['diabetes'] * 3 +
        df['hipertension'] * 3 +
        (df['fumador'] == 'Si').astype(int) * 2 +
        (df['colesterol'] > 240).fillna(0).astype(int) * 2
    )
    
    # Binarize the risk score, flipping about 10% of the labels as noise
    noise = np.random.random(n_samples) < 0.1
    df['riesgo_alto'] = ((risk_score >= 6) != noise).astype(int)
    
    return df

# Create the dataset
df_health = create_messy_health_dataset(500)

print(f"Dataset created with {len(df_health)} observations and {len(df_health.columns)} variables")
print("\nFirst rows:")
df_health.head(10)

## 1.2 Initial Inspection

A first look at the structure and quality of the data.

In [None]:
def inspect_dataset(df):
    """
    Run a full inspection of the dataset
    """
    print("="*80)
    print("GENERAL DATASET INSPECTION")
    print("="*80)
    
    print(f"\n📊 Dimensions: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    print("\n" + "="*80)
    print("DATA TYPES")
    print("="*80)
    print(df.dtypes)
    
    print("\n" + "="*80)
    print("MISSING VALUES")
    print("="*80)
    
    missing = df.isnull().sum()
    missing_pct = 100 * missing / len(df)
    missing_table = pd.DataFrame({
        'Column': missing.index,
        'Missing': missing.values,
        'Percent': missing_pct.values
    })
    missing_table = missing_table[missing_table['Missing'] > 0].sort_values('Percent', ascending=False)
    
    if len(missing_table) > 0:
        print(missing_table.to_string(index=False))
        print(f"\n⚠️  Total missing values: {missing.sum()} ({100*missing.sum()/(df.shape[0]*df.shape[1]):.2f}% of the dataset)")
    else:
        print("✓ No missing values")
    
    print("\n" + "="*80)
    print("DESCRIPTIVE STATISTICS (NUMERIC VARIABLES)")
    print("="*80)
    print(df.describe().T)
    
    print("\n" + "="*80)
    print("DISTRIBUTION OF CATEGORICAL VARIABLES")
    print("="*80)
    
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    for col in categorical_cols:
        print(f"\n{col}:")
        print(df[col].value_counts())
        print(f"Unique values: {df[col].nunique()}")

inspect_dataset(df_health)

## 1.3 Visualizing Missing Values

Understanding the pattern of missing data is crucial for deciding how to handle it.

In [None]:
def visualize_missing_data(df):
    """
    Build a multi-panel overview of missing values
    """
    fig = plt.figure(figsize=(16, 12))
    gs = fig.add_gridspec(3, 2, hspace=0.3, wspace=0.3)
    
    # 1. Missing-value matrix
    ax1 = fig.add_subplot(gs[0, :])
    missing_matrix = df.isnull().astype(int)
    sns.heatmap(missing_matrix.T, cmap='YlOrRd', cbar=True, ax=ax1,
                yticklabels=df.columns, xticklabels=False)
    ax1.set_title('Missing-Value Matrix\n(Yellow = Present, Red = Missing)', 
                  fontsize=14, fontweight='bold')
    ax1.set_xlabel('Observations')
    
    # 2. Percentage of missing values per column
    ax2 = fig.add_subplot(gs[1, 0])
    missing_pct = 100 * df.isnull().sum() / len(df)
    missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=True)
    
    if len(missing_pct) > 0:
        colors = ['#d62728' if x > 10 else '#ff7f0e' if x > 5 else '#2ca02c' for x in missing_pct]
        missing_pct.plot(kind='barh', ax=ax2, color=colors)
        ax2.set_xlabel('Percentage of missing values (%)')
        ax2.set_title('Missing Values by Variable', fontweight='bold')
        ax2.axvline(x=5, color='orange', linestyle='--', alpha=0.5, label='5%')
        ax2.axvline(x=10, color='red', linestyle='--', alpha=0.5, label='10%')
        ax2.legend()
        ax2.grid(axis='x', alpha=0.3)
    
    # 3. Number of missing values per row
    ax3 = fig.add_subplot(gs[1, 1])
    missing_per_row = df.isnull().sum(axis=1)
    missing_counts = missing_per_row.value_counts().sort_index()
    
    ax3.bar(missing_counts.index, missing_counts.values, color='steelblue', alpha=0.7)
    ax3.set_xlabel('Number of missing values')
    ax3.set_ylabel('Number of observations')
    ax3.set_title('Distribution of Missing Values per Row', fontweight='bold')
    ax3.grid(axis='y', alpha=0.3)
    
    # Add a text box with summary statistics
    total_rows_with_missing = (missing_per_row > 0).sum()
    ax3.text(0.95, 0.95, 
             f'Rows with missing values: {total_rows_with_missing}\n'
             f'Complete rows: {len(df) - total_rows_with_missing}',
             transform=ax3.transAxes, fontsize=10,
             verticalalignment='top', horizontalalignment='right',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    # 4. Correlation between missingness indicators
    ax4 = fig.add_subplot(gs[2, :])
    missing_corr = df.isnull().corr()
    mask = np.triu(np.ones_like(missing_corr), k=1)
    
    sns.heatmap(missing_corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
                center=0, ax=ax4, cbar_kws={'label': 'Correlation'})
    ax4.set_title('Correlation between Missing-Value Patterns\n'
                  '(High values suggest non-random missingness)', fontweight='bold')
    
    plt.suptitle('Comprehensive Missing-Value Analysis', 
                 fontsize=16, fontweight='bold', y=0.995)
    
    return fig

fig = visualize_missing_data(df_health)
plt.show()

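## 1.4 Analyzing Missing-Value Patterns

We look for evidence of whether the missing values are MCAR, MAR, or MNAR. Observed data can only provide evidence against MCAR; MAR and MNAR cannot be told apart without knowing the values that are missing.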
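
The idea behind this check can be shown on toy data before applying it to the health dataset. The following is a minimal sketch, assuming an illustrative MAR mechanism in which `y` goes missing more often when the fully observed `x` is positive; the variable names and probabilities are made up for the example. It uses Welch's t-test (`equal_var=False`) to compare `x` between rows where `y` is missing and rows where it is present.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 1000

# Toy data: x is fully observed, y will receive missing values.
toy = pd.DataFrame({"x": rng.normal(0, 1, n), "y": rng.normal(0, 1, n)})

# MAR mechanism (assumed for illustration): y is missing far more often when x > 0.
p_missing = np.where(toy["x"] > 0, 0.40, 0.05)
toy.loc[rng.random(n) < p_missing, "y"] = np.nan

# Compare the observed variable x between rows where y is missing and present.
is_missing = toy["y"].isna()
t_stat, p_value = stats.ttest_ind(
    toy.loc[is_missing, "x"],
    toy.loc[~is_missing, "x"],
    equal_var=False,  # Welch's t-test: does not assume equal variances
)

print(f"mean x | y missing: {toy.loc[is_missing, 'x'].mean():.2f}")
print(f"mean x | y present: {toy.loc[~is_missing, 'x'].mean():.2f}")
print(f"Welch t-test p-value: {p_value:.2e}")
```

A small p-value here is evidence against MCAR. It does not show whether the mechanism is MAR or MNAR, because that depends on the unobserved values of `y`.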

In [None]:
# Detailed analysis of missing-value patterns
def analyze_missing_patterns(df):
    """
    Look for evidence of whether missing values are MCAR, MAR, or MNAR
    """
    print("="*80)
    print("MISSING-VALUE PATTERN ANALYSIS")
    print("="*80)
    
    # Columns that contain missing values
    cols_with_missing = df.columns[df.isnull().any()].tolist()
    
    for col in cols_with_missing:
        print(f"\n{'='*80}")
        print(f"Variable: {col}")
        print(f"{'='*80}")
        
        missing_mask = df[col].isnull()
        
        # Compare observations with and without missing values in this column
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        numeric_cols = [c for c in numeric_cols if c != col]
        
        print("\nComparison of means (value present → value missing):")
        print("-" * 60)
        
        for other_col in numeric_cols[:5]:  # Limit to 5 columns to keep the output short
            if df[other_col].notna().sum() > 0:
                mean_missing = df.loc[missing_mask, other_col].mean()
                mean_present = df.loc[~missing_mask, other_col].mean()
                
                if pd.notna(mean_missing) and pd.notna(mean_present):
                    diff_pct = 100 * (mean_missing - mean_present) / mean_present
                    
                    # t-test for the difference in means
                    try:
                        t_stat, p_value = stats.ttest_ind(
                            df.loc[missing_mask, other_col].dropna(),
                            df.loc[~missing_mask, other_col].dropna()
                        )
                        significance = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else ""
                    except Exception:
                        p_value = np.nan
                        significance = ""
                    
                    print(f"{other_col:30s}: {mean_present:7.2f} → {mean_missing:7.2f} "
                          f"({diff_pct:+6.1f}%) p={p_value:.3f} {significance}")
    
    print("\n" + "="*80)
    print("INTERPRETATION:")
    print("="*80)
    print("* = p < 0.05  (statistically significant difference)")
    print("** = p < 0.01 (high significance)")
    print("*** = p < 0.001 (very high significance)")
    print("\nSignificant differences suggest MAR or MNAR missingness")
    print("No significant differences are consistent with MCAR (Missing Completely At Random)")

analyze_missing_patterns(df_health)

## 1.4 Handling Missing Values

We compare different imputation strategies.
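
One effect worth keeping in mind when reading the comparison is that mean imputation preserves the mean but shrinks the spread, because every imputed value sits exactly at the center. The following is a minimal sketch on synthetic data (the distribution parameters and the 20% missing rate are illustrative assumptions):

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(42)
values = rng.normal(100, 20, 500)

# Remove about 20% of the entries completely at random.
with_gaps = values.copy()
with_gaps[rng.random(500) < 0.20] = np.nan

observed = with_gaps[~np.isnan(with_gaps)]
imputed = SimpleImputer(strategy="mean").fit_transform(with_gaps.reshape(-1, 1)).ravel()

print(f"observed: mean={observed.mean():.2f}, std={observed.std(ddof=1):.2f}")
print(f"imputed:  mean={imputed.mean():.2f}, std={imputed.std(ddof=1):.2f}")
```

The two means agree, while the standard deviation after imputation is smaller than that of the observed values. This is the same shift that the Δstd value in the plot titles of the comparison below reports.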

In [None]:
def compare_imputation_methods(df, column):
    """
    Compare different imputation methods on a single column
    """
    df_test = df.copy()
    missing_mask = df_test[column].isnull()
    original_values = df_test.loc[~missing_mask, column].copy()
    
    methods = {}
    
    # 1. Deletion
    methods['Deletion'] = df_test[column].dropna()
    
    # 2. Mean
    imputer_mean = SimpleImputer(strategy='mean')
    methods['Mean'] = pd.Series(
        imputer_mean.fit_transform(df_test[[column]]).ravel(),
        index=df_test.index
    )
    
    # 3. Median
    imputer_median = SimpleImputer(strategy='median')
    methods['Median'] = pd.Series(
        imputer_median.fit_transform(df_test[[column]]).ravel(),
        index=df_test.index
    )
    
    # 4. KNN imputer
    numeric_cols = df_test.select_dtypes(include=[np.number]).columns.tolist()
    if len(numeric_cols) > 1:
        imputer_knn = KNNImputer(n_neighbors=5)
        df_knn = df_test[numeric_cols].copy()
        imputed_knn = imputer_knn.fit_transform(df_knn)
        col_idx = numeric_cols.index(column)
        methods['KNN (k=5)'] = pd.Series(
            imputed_knn[:, col_idx],
            index=df_test.index
        )
    
    # Comparative visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    axes = axes.ravel()
    
    # Original distribution
    ax = axes[0]
    ax.hist(original_values, bins=30, alpha=0.7, color='gray', edgecolor='black')
    ax.axvline(original_values.mean(), color='red', linestyle='--', 
               linewidth=2, label=f'Mean: {original_values.mean():.2f}')
    ax.axvline(original_values.median(), color='blue', linestyle='--', 
               linewidth=2, label=f'Median: {original_values.median():.2f}')
    ax.set_title('Original Distribution\n(observed values only)', fontweight='bold')
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')
    ax.legend()
    ax.grid(alpha=0.3)
    
    # One panel per method
    for idx, (method_name, imputed_data) in enumerate(methods.items(), 1):
        if idx >= len(axes):
            break
        ax = axes[idx]
        ax.hist(original_values, bins=30, alpha=0.4, color='gray', label='Original', edgecolor='black')
        ax.hist(imputed_data.dropna(), bins=30, alpha=0.6, color='steelblue', label=method_name, edgecolor='black')
        mean_diff = imputed_data.mean() - original_values.mean()
        std_diff = imputed_data.std() - original_values.std()
        ax.set_title(f'{method_name}\nΔmean: {mean_diff:+.2f}, Δstd: {std_diff:+.2f}', fontweight='bold')
        ax.set_xlabel(column)
        ax.set_ylabel('Frequency')
        ax.legend()
        ax.grid(alpha=0.3)
    
    for idx in range(len(methods) + 1, len(axes)):
        axes[idx].axis('off')
    
    plt.suptitle(f'Comparison of Imputation Methods: {column}', fontsize=16, fontweight='bold')
    plt.tight_layout()
    
    # Summary statistics
    print("="*80)
    print(f"COMPARISON OF IMPUTATION METHODS: {column}")
    print("="*80)
    print(f"\nOriginal: N={len(original_values)}, Mean={original_values.mean():.2f}, Std={original_values.std():.2f}")
    for method_name, imputed_data in methods.items():
        print(f"{method_name}: N={len(imputed_data.dropna())}, Mean={imputed_data.mean():.2f}, Std={imputed_data.std():.2f}")
    
    return fig, methods

# Compare methods for glucose
fig, methods = compare_imputation_methods(df_health, 'glucosa')
plt.show()

In [None]:
# Apply KNN imputation
def apply_imputation(df, strategy='knn'):
    """
    Apply an imputation strategy to the whole dataset
    """
    df_imputed = df.copy()
    
    numeric_cols = df_imputed.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df_imputed.select_dtypes(include=['object', 'category']).columns.tolist()
    
    if strategy == 'knn':
        # Note: KNNImputer measures distances on the unscaled numeric columns
        imputer_num = KNNImputer(n_neighbors=5)
        df_imputed[numeric_cols] = imputer_num.fit_transform(df_imputed[numeric_cols])
        
        for col in categorical_cols:
            if df_imputed[col].isnull().any():
                mode_value = df_imputed[col].mode()[0]
                # Assign back instead of using inplace=True on a column selection
                df_imputed[col] = df_imputed[col].fillna(mode_value)
    else:
        raise ValueError(f"Unsupported imputation strategy: {strategy!r}")
    
    print(f"Imputation applied with strategy: {strategy}")
    print(f"Rows before: {len(df)} → Rows after: {len(df_imputed)}")
    print(f"Remaining missing values: {df_imputed.isnull().sum().sum()}")
    
    return df_imputed

df_health_imputed = apply_imputation(df_health, strategy='knn')

## 1.5 Outlier Detection and Treatment

We identify outliers using several methods.
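
Before the full function, the two simplest rules and a consensus vote can be sketched on a one-dimensional array. This is a minimal illustration on synthetic weights with one injected recording error; the data and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(70, 10, 200)
weights[0] = 250.0  # one injected recording error

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(weights, [25, 75])
iqr = q3 - q1
iqr_mask = (weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)

# Z-score rule: flag values more than 3 standard deviations from the mean.
z_scores = np.abs((weights - weights.mean()) / weights.std())
z_mask = z_scores > 3

# Keep every mask one-dimensional so that adding them counts votes per observation.
votes = iqr_mask.astype(int) + z_mask.astype(int)
consensus = votes == 2

print(f"IQR: {iqr_mask.sum()}, Z-score: {z_mask.sum()}, both: {consensus.sum()}")
```

The Z-score rule relies on the mean and standard deviation, which the outliers themselves inflate, so it tends to flag fewer points than the quartile-based IQR rule. Requiring agreement between methods is one way to reduce false alarms.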

In [None]:
def detect_outliers_multiple_methods(df, column):
    """
    Detect outliers using several methods:
    1. IQR (Interquartile Range)
    2. Z-score
    3. Isolation Forest
    """
    from sklearn.ensemble import IsolationForest
    
    data = df[column].dropna().values.reshape(-1, 1)
    outliers = {}
    
    # 1. IQR method
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers['IQR'] = (data < lower_bound) | (data > upper_bound)
    
    # 2. Z-score
    z_scores = np.abs(stats.zscore(data))
    outliers['Z-score'] = z_scores > 3
    
    # 3. Isolation Forest
    iso_forest = IsolationForest(contamination=0.1, random_state=42)
    outliers['Isolation Forest'] = iso_forest.fit_predict(data) == -1
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # Box plot
    ax = axes[0, 0]
    bp = ax.boxplot([data.ravel()], vert=True, patch_artist=True,
                     boxprops=dict(facecolor='lightblue', alpha=0.7),
                     medianprops=dict(color='red', linewidth=2))
    ax.axhline(lower_bound, color='orange', linestyle='--', label=f'IQR lower: {lower_bound:.2f}')
    ax.axhline(upper_bound, color='orange', linestyle='--', label=f'IQR upper: {upper_bound:.2f}')
    ax.set_ylabel(column)
    ax.set_title('Box Plot with IQR Limits', fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)
    
    # Distribution with outliers
    ax = axes[0, 1]
    ax.hist(data, bins=50, alpha=0.6, color='steelblue', edgecolor='black')
    for method_name, is_outlier in outliers.items():
        outlier_values = data[is_outlier.ravel()]
        if len(outlier_values) > 0:
            ax.scatter(outlier_values, [0] * len(outlier_values), s=100, alpha=0.6, label=method_name)
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')
    ax.set_title('Distribution with Detected Outliers', fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)
    
    # Z-scores
    ax = axes[1, 0]
    sorted_idx = np.argsort(data.ravel())
    ax.scatter(range(len(data)), z_scores[sorted_idx], alpha=0.5, s=20)
    ax.axhline(3, color='red', linestyle='--', label='Threshold Z=3')
    ax.set_xlabel('Observations (sorted)')
    ax.set_ylabel('|Z-score|')
    ax.set_title('Z-scores', fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)
    
    # Comparison
    ax = axes[1, 1]
    method_names = list(outliers.keys())
    counts = [outliers[m].sum() for m in method_names]
    bars = ax.barh(method_names, counts, color=['#ff7f0e', '#2ca02c', '#d62728'])
    ax.set_xlabel('Number of outliers detected')
    ax.set_title('Comparison of Methods', fontweight='bold')
    ax.grid(axis='x', alpha=0.3)
    
    for bar, count in zip(bars, counts):
        width = bar.get_width()
        ax.text(width, bar.get_y() + bar.get_height()/2,
               f'{int(count)} ({100*count/len(data):.1f}%)',
               ha='left', va='center', fontweight='bold')
    
    plt.suptitle(f'Outlier Detection: {column}', fontsize=16, fontweight='bold')
    plt.tight_layout()
    
    # Consensus: flatten each mask to 1-D first. The IQR and Z-score masks have
    # shape (n, 1) and would otherwise broadcast against the (n,) Isolation Forest
    # mask into an (n, n) matrix.
    consensus_outliers = sum(np.ravel(m).astype(int) for m in outliers.values()) >= 2
    
    print("="*80)
    print(f"OUTLIER DETECTION: {column}")
    print("="*80)
    print(f"\nTotal observations: {len(data)}")
    for method, is_outlier in outliers.items():
        n_outliers = is_outlier.sum()
        print(f"{method:20s}: {n_outliers:4d} ({100*n_outliers/len(data):5.2f}%)")
    print(f"\nConsensus (≥2 methods): {consensus_outliers.sum()} ({100*consensus_outliers.sum()/len(data):.2f}%)")
    
    return fig, outliers, consensus_outliers

# Detect outliers
fig_out, outliers_peso, consensus_peso = detect_outliers_multiple_methods(df_health_imputed, 'peso')
plt.show()

In [None]:
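# Outlier treatment
def treat_outliers(df, column, method='cap', outlier_mask=None):
    """
    Treat outliers with one of two strategies:
    - 'remove': drop the rows flagged by outlier_mask (IQR rule if no mask is given)
    - 'cap': clip the column to its 5th-95th percentile range (outlier_mask is not used)
    """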
    df_treated = df.copy()
    original = df_treated[column].copy()
    
    if outlier_mask is None:
        Q1 = df_treated[column].quantile(0.25)
        Q3 = df_treated[column].quantile(0.75)
        IQR = Q3 - Q1
        outlier_mask = (df_treated[column] < Q1 - 1.5*IQR) | (df_treated[column] > Q3 + 1.5*IQR)
    
    if method == 'remove':
        df_treated = df_treated[~outlier_mask]
    elif method == 'cap':
        lower = df_treated[column].quantile(0.05)
        upper = df_treated[column].quantile(0.95)
        df_treated[column] = df_treated[column].clip(lower, upper)
    
    # Visualization
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    axes[0].hist(original, bins=50, alpha=0.7, color='red', edgecolor='black')
    axes[0].set_title('Before Treatment', fontweight='bold')
    axes[0].set_xlabel(column)
    
    axes[1].hist(df_treated[column], bins=50, alpha=0.7, color='green', edgecolor='black')
    axes[1].set_title(f'After ({method})', fontweight='bold')
    axes[1].set_xlabel(column)
    
    axes[2].boxplot([original.dropna(), df_treated[column].dropna()],
                    labels=['Before', 'After'], patch_artist=True)
    axes[2].set_title('Comparison', fontweight='bold')
    
    plt.tight_layout()
    return df_treated, fig

df_peso_treated, fig_treat = treat_outliers(df_health_imputed, 'peso', method='cap', outlier_mask=consensus_peso.ravel())
plt.show()

## 1.6 Scaling and Transformations

A comparison of different scaling methods.
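
The comparison below fits each scaler on the full dataset, which is fine for exploring distributions. In a modeling workflow the scaler is usually fitted on the training split only and then applied to the test split. The following is a minimal sketch of that pattern on synthetic data with a few extreme values; the data and the split sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

rng = np.random.default_rng(1)
X = rng.normal(120, 15, size=(300, 1))
X[:3] = 280.0  # a few extreme values

X_train, X_test = X[:200], X[200:]

# Fit on the training split only, then reuse the learned statistics on the test split.
standard = StandardScaler().fit(X_train)
robust = RobustScaler().fit(X_train)

X_train_std = standard.transform(X_train)
X_train_rob = robust.transform(X_train)
X_test_std = standard.transform(X_test)

print(f"StandardScaler: mean={X_train_std.mean():.3f}, std={X_train_std.std():.3f}")
print(f"RobustScaler:   median={np.median(X_train_rob):.3f}")
```

`StandardScaler` centers on the mean and divides by the standard deviation, both of which are pulled by the extreme values. `RobustScaler` centers on the median and divides by the interquartile range, so it is less sensitive to them.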

In [None]:
def compare_scaling_methods(df, columns=None):
    """
    Compare different scaling methods
    """
    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns[:4]
    
    df_subset = df[columns].copy()
    
    scalers = {
        'Original': None,
        'StandardScaler': StandardScaler(),
        'MinMaxScaler': MinMaxScaler(),
        'RobustScaler': RobustScaler(),
        'PowerTransformer': PowerTransformer(method='yeo-johnson')
    }
    
    scaled_data = {}
    for name, scaler in scalers.items():
        if scaler is None:
            scaled_data[name] = df_subset.values
        else:
            scaled_data[name] = scaler.fit_transform(df_subset)
    
    # Visualization
    fig, axes = plt.subplots(len(scalers), len(columns), figsize=(5*len(columns), 4*len(scalers)))
    if len(columns) == 1:
        axes = axes.reshape(-1, 1)
    
    for i, (method_name, data) in enumerate(scaled_data.items()):
        for j, col in enumerate(columns):
            ax = axes[i, j]
            ax.hist(data[:, j], bins=50, alpha=0.7, color='steelblue', edgecolor='black')
            mean = np.mean(data[:, j])
            std = np.std(data[:, j])
            if i == 0:
                ax.set_title(f'{col}\n{method_name}\nμ={mean:.2f}, σ={std:.2f}', fontweight='bold')
            else:
                ax.set_title(f'{method_name}\nμ={mean:.2f}, σ={std:.2f}', fontweight='bold')
            ax.axvline(mean, color='red', linestyle='--', linewidth=2, alpha=0.7)
            ax.grid(alpha=0.3)
    
    plt.suptitle('Comparison of Scaling Methods', fontsize=16, fontweight='bold')
    plt.tight_layout()
    return fig, scaled_data

cols_to_scale = ['edad', 'peso', 'presion_sistolica', 'glucosa']
fig_scale, scaled_results = compare_scaling_methods(df_health_imputed, cols_to_scale)
plt.show()

---
# Part 2: Dimensionality Reduction

We explore techniques for reducing the number of variables while preserving as much information as possible.

## 2.1 Preparation: Breast Cancer Dataset

We use the classic Wisconsin Breast Cancer dataset, which has 30 features.

In [None]:
# Load the dataset
cancer = load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_cancer = cancer.target

print("="*80)
print("DATASET: Wisconsin Breast Cancer")
print("="*80)
print(f"\nDimensions: {X_cancer.shape}")
print(f"Classes: {np.unique(y_cancer, return_counts=True)}")
print("\nFirst features:")
print(X_cancer.columns.tolist()[:10])
print("...")

# Scale first (needed for PCA and t-SNE)
scaler = StandardScaler()
X_cancer_scaled = scaler.fit_transform(X_cancer)
X_cancer_scaled_df = pd.DataFrame(X_cancer_scaled, columns=X_cancer.columns)

print("\n✓ Data scaled with StandardScaler")

## 2.2 Principal Component Analysis (PCA)

PCA finds orthogonal directions of maximum variance.
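
Those directions are the eigenvectors of the data's covariance matrix, and the variance along each one is the corresponding eigenvalue. The following is a minimal sketch, on synthetic correlated data (an assumption made for the example), that computes them with NumPy and checks them against scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated toy data: 200 samples, 5 features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_centered = X - X.mean(axis=0)

# Eigendecomposition of the sample covariance matrix (n - 1 in the denominator).
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending order; PCA reports components by decreasing variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

pca = PCA().fit(X)
explained_ratio = eigenvalues / eigenvalues.sum()

print(np.allclose(eigenvalues, pca.explained_variance_))
print(np.allclose(explained_ratio, pca.explained_variance_ratio_))
```

The eigenvalues correspond to `explained_variance_`, which is what the Kaiser criterion in the analysis below thresholds at 1. That threshold is only meaningful when the inputs have been standardized, as they are in this notebook.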

In [None]:
def perform_pca_analysis(X, y=None, feature_names=None):
    """
    Realiza an√°lisis completo de PCA con m√∫ltiples visualizaciones
    """
    # PCA completo
    pca_full = PCA()
    X_pca_full = pca_full.fit_transform(X)
    
    explained_variance = pca_full.explained_variance_ratio_
    cumulative_variance = np.cumsum(explained_variance)
    
    # Encontrar componentes para 90%, 95%, 99%
    n_90 = np.argmax(cumulative_variance >= 0.90) + 1
    n_95 = np.argmax(cumulative_variance >= 0.95) + 1
    n_99 = np.argmax(cumulative_variance >= 0.99) + 1
    
    print("="*80)
    print("AN√ÅLISIS PCA")
    print("="*80)
    print(f"\nDimensiones originales: {X.shape[1]}")
    print(f"\nComponentes necesarios para:")
    print(f"  - 90% varianza: {n_90} componentes")
    print(f"  - 95% varianza: {n_95} componentes")
    print(f"  - 99% varianza: {n_99} componentes")
    print(f"\nPrimeros 5 componentes explican: {cumulative_variance[4]:.1%}")
    print(f"Primeros 10 componentes explican: {cumulative_variance[9]:.1%}")
    
    # Visualizaciones
    fig = plt.figure(figsize=(20, 12))
    gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
    
    # 1. Varianza por componente (Scree Plot)
    ax1 = fig.add_subplot(gs[0, 0])
    components = np.arange(1, min(21, len(explained_variance)+1))
    ax1.bar(components, explained_variance[:20], alpha=0.7, color='steelblue', edgecolor='black')
    ax1.set_xlabel('Componente Principal', fontsize=12)
    ax1.set_ylabel('Varianza Explicada', fontsize=12)
    ax1.set_title('Scree Plot\n(Primeras 20 componentes)', fontweight='bold', fontsize=13)
    ax1.grid(alpha=0.3, axis='y')
    ax1.set_xticks(components[::2])
    
    # 2. Varianza acumulada
    ax2 = fig.add_subplot(gs[0, 1])
    ax2.plot(range(1, len(cumulative_variance)+1), cumulative_variance, 
             marker='o', linewidth=2, markersize=4, color='steelblue')
    ax2.axhline(y=0.90, color='green', linestyle='--', linewidth=2, label='90%', alpha=0.7)
    ax2.axhline(y=0.95, color='orange', linestyle='--', linewidth=2, label='95%', alpha=0.7)
    ax2.axhline(y=0.99, color='red', linestyle='--', linewidth=2, label='99%', alpha=0.7)
    ax2.axvline(x=n_95, color='orange', linestyle=':', alpha=0.5)
    ax2.set_xlabel('N√∫mero de Componentes', fontsize=12)
    ax2.set_ylabel('Varianza Acumulada', fontsize=12)
    ax2.set_title('Varianza Explicada Acumulada', fontweight='bold', fontsize=13)
    ax2.legend(fontsize=10)
    ax2.grid(alpha=0.3)
    ax2.set_xlim(0, min(30, len(cumulative_variance)))
    
    # 3. Raz√≥n de varianza (Kaiser criterion)
    ax3 = fig.add_subplot(gs[0, 2])
    eigenvalues = pca_full.explained_variance_[:20]
    ax3.plot(range(1, len(eigenvalues)+1), eigenvalues, marker='s', 
             linewidth=2, markersize=6, color='darkred')
    ax3.axhline(y=1, color='black', linestyle='--', linewidth=2, label='Kaiser criterion (Œª=1)', alpha=0.7)
    ax3.set_xlabel('Componente Principal', fontsize=12)
    ax3.set_ylabel('Eigenvalue (Œª)', fontsize=12)
    ax3.set_title('Eigenvalues\n(Kaiser: retener Œª > 1)', fontweight='bold', fontsize=13)
    ax3.legend(fontsize=10)
    ax3.grid(alpha=0.3)
    n_kaiser = np.sum(pca_full.explained_variance_ > 1)
    ax3.text(0.98, 0.98, f'n={n_kaiser}', transform=ax3.transAxes,
             ha='right', va='top', fontsize=11, fontweight='bold',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
    
    # 4. 2D projection (PC1 vs PC2)
    ax4 = fig.add_subplot(gs[1, :2])
    if y is not None:
        scatter = ax4.scatter(X_pca_full[:, 0], X_pca_full[:, 1], 
                            c=y, cmap='RdYlGn', alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
        cbar = plt.colorbar(scatter, ax=ax4)
        cbar.set_label('Class', fontsize=11)
    else:
        ax4.scatter(X_pca_full[:, 0], X_pca_full[:, 1], 
                   alpha=0.6, s=50, color='steelblue', edgecolors='black', linewidth=0.5)
    
    ax4.set_xlabel(f'PC1 ({explained_variance[0]:.1%} variance)', fontsize=12)
    ax4.set_ylabel(f'PC2 ({explained_variance[1]:.1%} variance)', fontsize=12)
    ax4.set_title(f'Projection onto the First 2 Components\n(Total: {explained_variance[0]+explained_variance[1]:.1%} variance)', 
                 fontweight='bold', fontsize=13)
    ax4.grid(alpha=0.3)
    ax4.axhline(0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
    ax4.axvline(0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
    
    # 5. Loadings PC1
    ax5 = fig.add_subplot(gs[1, 2])
    if feature_names is not None:
        loadings_pc1 = pd.Series(pca_full.components_[0], index=feature_names)
        top_loadings = pd.concat([loadings_pc1.nlargest(5), loadings_pc1.nsmallest(5)])
        colors = ['red' if x < 0 else 'green' for x in top_loadings.values]
        top_loadings.plot(kind='barh', ax=ax5, color=colors, alpha=0.7, edgecolor='black')
        ax5.set_xlabel('Loading', fontsize=11)
        ax5.set_title('Top Loadings PC1', fontweight='bold', fontsize=13)
        ax5.axvline(0, color='black', linewidth=1)
        ax5.grid(alpha=0.3, axis='x')
    
    # 6. Loadings PC2
    ax6 = fig.add_subplot(gs[2, 0])
    if feature_names is not None:
        loadings_pc2 = pd.Series(pca_full.components_[1], index=feature_names)
        top_loadings = pd.concat([loadings_pc2.nlargest(5), loadings_pc2.nsmallest(5)])
        colors = ['red' if x < 0 else 'green' for x in top_loadings.values]
        top_loadings.plot(kind='barh', ax=ax6, color=colors, alpha=0.7, edgecolor='black')
        ax6.set_xlabel('Loading', fontsize=11)
        ax6.set_title('Top Loadings PC2', fontweight='bold', fontsize=13)
        ax6.axvline(0, color='black', linewidth=1)
        ax6.grid(alpha=0.3, axis='x')
    
    # 7. Biplot (PC1 vs PC2 with feature vectors)
    ax7 = fig.add_subplot(gs[2, 1:])
    if y is not None:
        scatter = ax7.scatter(X_pca_full[:, 0], X_pca_full[:, 1], 
                            c=y, cmap='RdYlGn', alpha=0.3, s=30)
    else:
        ax7.scatter(X_pca_full[:, 0], X_pca_full[:, 1], alpha=0.3, s=30, color='gray')
    
    if feature_names is not None:
        # Draw feature vectors (only the 8 with the largest |loading| on PC1)
        scale = 4
        top_features = np.argsort(np.abs(pca_full.components_[0]))[-8:]
        for i in top_features:
            ax7.arrow(0, 0, 
                     pca_full.components_[0, i]*scale, 
                     pca_full.components_[1, i]*scale,
                     head_width=0.1, head_length=0.1, fc='red', ec='red', alpha=0.6, linewidth=2)
            ax7.text(pca_full.components_[0, i]*scale*1.15, 
                    pca_full.components_[1, i]*scale*1.15,
                    feature_names[i], fontsize=9, ha='center', 
                    bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7))
    
    ax7.set_xlabel(f'PC1 ({explained_variance[0]:.1%})', fontsize=12)
    ax7.set_ylabel(f'PC2 ({explained_variance[1]:.1%})', fontsize=12)
    ax7.set_title('Biplot (Observations + Variables)', fontweight='bold', fontsize=13)
    ax7.axhline(0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
    ax7.axvline(0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
    ax7.grid(alpha=0.3)
    
    plt.suptitle('Complete Principal Component Analysis (PCA)', 
                fontsize=18, fontweight='bold', y=0.998)
    
    return pca_full, X_pca_full, fig

# Apply PCA
pca_model, X_cancer_pca, fig_pca = perform_pca_analysis(
    X_cancer_scaled, 
    y_cancer, 
    X_cancer.columns
)
plt.show()

## 2.3 3D Visualization with PCA

Let's explore the first three components in an interactive plot.

In [None]:
def plot_pca_3d_interactive(X_pca, y=None, explained_variance=None):
    """
    Creates an interactive 3D PCA visualization
    """
    fig = go.Figure()
    
    if y is not None:
        # One trace per class (label 0 is shown as malignant, any other label as benign)
        for class_label in np.unique(y):
            mask = y == class_label
            class_name = 'Malignant' if class_label == 0 else 'Benign'
            color = 'red' if class_label == 0 else 'green'
            
            fig.add_trace(go.Scatter3d(
                x=X_pca[mask, 0],
                y=X_pca[mask, 1],
                z=X_pca[mask, 2],
                mode='markers',
                name=class_name,
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.6,
                    line=dict(color='black', width=0.5)
                ),
                text=[class_name] * mask.sum(),
                hovertemplate='<b>%{text}</b><br>PC1: %{x:.2f}<br>PC2: %{y:.2f}<br>PC3: %{z:.2f}<extra></extra>'
            ))
    else:
        fig.add_trace(go.Scatter3d(
            x=X_pca[:, 0],
            y=X_pca[:, 1],
            z=X_pca[:, 2],
            mode='markers',
            marker=dict(size=5, color='steelblue', opacity=0.6),
        ))
    
    # Axis labels
    if explained_variance is not None:
        xlabel = f'PC1 ({explained_variance[0]:.1%})'
        ylabel = f'PC2 ({explained_variance[1]:.1%})'
        zlabel = f'PC3 ({explained_variance[2]:.1%})'
        total_var = explained_variance[0] + explained_variance[1] + explained_variance[2]
        title = f'PCA 3D - Total Variance: {total_var:.1%}'
    else:
        xlabel, ylabel, zlabel = 'PC1', 'PC2', 'PC3'
        title = 'PCA 3D'
    
    fig.update_layout(
        title=dict(text=title, font=dict(size=20, color='black'), x=0.5, xanchor='center'),
        scene=dict(
            xaxis=dict(title=xlabel, backgroundcolor='rgb(230, 230,230)'),
            yaxis=dict(title=ylabel, backgroundcolor='rgb(230, 230,230)'),
            zaxis=dict(title=zlabel, backgroundcolor='rgb(230, 230,230)'),
        ),
        width=900,
        height=700,
        showlegend=True
    )
    
    return fig

# Create the 3D visualization
fig_3d = plot_pca_3d_interactive(
    X_cancer_pca, 
    y_cancer, 
    pca_model.explained_variance_ratio_
)
fig_3d.show()

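## 2.4 t-SNE for Nonlinear Visualization

t-SNE (t-Distributed Stochastic Neighbor Embedding) preserves the local structure of the data: points that are neighbors in the original space tend to remain neighbors in the embedding, whereas distances between well-separated clusters are not reliable.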
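One way to put a number on "preserves local structure" is scikit-learn's `trustworthiness` score, which checks whether neighbors in the embedding were also neighbors in the original space. The sketch below is only an illustration: it assumes the scikit-learn breast cancer dataset, a 200-sample subset, and `n_neighbors=10`, none of which come from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

# Small standardized subset so the example runs quickly (illustrative choice)
X_demo = StandardScaler().fit_transform(load_breast_cancer().data[:200])

emb_pca = PCA(n_components=2).fit_transform(X_demo)
emb_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_demo)

# Score in [0, 1]; higher means local neighborhoods are better preserved
trust_pca = trustworthiness(X_demo, emb_pca, n_neighbors=10)
trust_tsne = trustworthiness(X_demo, emb_tsne, n_neighbors=10)

print(f"PCA   trustworthiness: {trust_pca:.3f}")
print(f"t-SNE trustworthiness: {trust_tsne:.3f}")
```

t-SNE often scores higher than a 2D PCA projection on this metric, but the gap depends on the data and on the perplexity, so treat the numbers as a diagnostic rather than a ranking of the methods.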

In [None]:
def compare_perplexity_tsne(X, y=None, perplexities=(10, 30, 50, 90)):
    """
    Compares t-SNE with different perplexity values
    (the 2x2 grid below assumes at most four values)
    """
    results = {}
    
    print("Running t-SNE with different perplexities...")
    print("(This may take several minutes)")
    print("="*80)
    
    for perp in perplexities:
        print(f"  Perplexity = {perp}...", end=' ')
        # The iteration count is left at its default (1000): the keyword was
        # renamed from n_iter to max_iter in scikit-learn 1.5
        tsne = TSNE(n_components=2, perplexity=perp, random_state=42, verbose=0)
        X_tsne = tsne.fit_transform(X)
        results[perp] = X_tsne
        print("✓")
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 14))
    axes = axes.ravel()
    
    for idx, perp in enumerate(perplexities):
        ax = axes[idx]
        X_tsne = results[perp]
        
        if y is not None:
            scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], 
                               c=y, cmap='RdYlGn', alpha=0.6, s=50, 
                               edgecolors='black', linewidth=0.5)
            if idx == 0:
                cbar = plt.colorbar(scatter, ax=ax)
                cbar.set_label('Class (0=Malignant, 1=Benign)', fontsize=10)
        else:
            ax.scatter(X_tsne[:, 0], X_tsne[:, 1], 
                      alpha=0.6, s=50, color='steelblue',
                      edgecolors='black', linewidth=0.5)
        
        ax.set_xlabel('t-SNE 1', fontsize=12)
        ax.set_ylabel('t-SNE 2', fontsize=12)
        ax.set_title(f'Perplexity = {perp}', fontweight='bold', fontsize=14)
        ax.grid(alpha=0.3)
    
    plt.suptitle('Comparison of t-SNE with Different Perplexities', 
                fontsize=16, fontweight='bold')
    plt.tight_layout()
    
    print("\n" + "="*80)
    print("INTERPRETING PERPLEXITY:")
    print("="*80)
    print("  • Low values (5-15): emphasize LOCAL structure")
    print("  • Medium values (30-50): balance between local and global")
    print("  • High values (>50): emphasize GLOBAL structure")
    print("  • Default value: 30 (a good starting point)")
    
    return results, fig

# Run t-SNE
# NOTE: use a subset for speed (t-SNE is expensive)
tsne_results, fig_tsne = compare_perplexity_tsne(
    X_cancer_scaled[:500], 
    y_cancer[:500],
    perplexities=[10, 30, 50, 90]
)
plt.show()

## 2.5 Comparison: PCA vs t-SNE

Advantages and disadvantages of each method.
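One practical difference is worth seeing in code. PCA learns a linear map that can be applied to new observations and approximately inverted, while scikit-learn's `TSNE` only exposes `fit_transform`. The following minimal sketch uses random data and hypothetical variable names purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(100, 5))   # illustrative training data
X_new_demo = rng.normal(size=(10, 5))      # unseen observations

pca_demo = PCA(n_components=2).fit(X_train_demo)
Z_new = pca_demo.transform(X_new_demo)        # project new points with the learned map
X_back = pca_demo.inverse_transform(Z_new)    # approximate reconstruction in the original space

print(Z_new.shape, X_back.shape)
print("PCA has transform:  ", hasattr(pca_demo, "transform"))
print("t-SNE has transform:", hasattr(TSNE(), "transform"))
```

This is one reason PCA fits naturally inside a modeling pipeline, while t-SNE is usually reserved for exploratory plots of a fixed dataset.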

In [None]:
def compare_pca_tsne(X, y=None, n_components_pca=2):
    """
    Compares PCA and t-SNE side by side
    """
    # PCA
    pca = PCA(n_components=n_components_pca)
    X_pca = pca.fit_transform(X)
    
    # t-SNE
    tsne = TSNE(n_components=2, random_state=42, perplexity=30)
    X_tsne = tsne.fit_transform(X)
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(18, 7))
    
    # PCA
    ax = axes[0]
    if y is not None:
        scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], 
                           c=y, cmap='RdYlGn', alpha=0.6, s=60,
                           edgecolors='black', linewidth=0.5)
        plt.colorbar(scatter, ax=ax, label='Class')
    else:
        ax.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.6, s=60, color='steelblue')
    
    var_exp = pca.explained_variance_ratio_
    ax.set_xlabel(f'PC1 ({var_exp[0]:.1%})', fontsize=13)
    ax.set_ylabel(f'PC2 ({var_exp[1]:.1%})', fontsize=13)
    ax.set_title(f'PCA\nTotal variance: {var_exp.sum():.1%}', 
                fontweight='bold', fontsize=15)
    ax.grid(alpha=0.3)
    ax.axhline(0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
    ax.axvline(0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
    
    # Add PCA characteristics
    ax.text(0.02, 0.98, 
           '✓ Linear\n✓ Fast\n✓ Interpretable\n✓ Deterministic\n✗ Assumes linearity',
           transform=ax.transAxes, fontsize=11, verticalalignment='top',
           bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
    
    # t-SNE
    ax = axes[1]
    if y is not None:
        scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], 
                           c=y, cmap='RdYlGn', alpha=0.6, s=60,
                           edgecolors='black', linewidth=0.5)
        plt.colorbar(scatter, ax=ax, label='Class')
    else:
        ax.scatter(X_tsne[:, 0], X_tsne[:, 1], alpha=0.6, s=60, color='steelblue')
    
    ax.set_xlabel('t-SNE 1', fontsize=13)
    ax.set_ylabel('t-SNE 2', fontsize=13)
    ax.set_title('t-SNE\n(perplexity=30)', fontweight='bold', fontsize=15)
    ax.grid(alpha=0.3)
    
    # Add t-SNE characteristics
    ax.text(0.02, 0.98,
           '✓ Nonlinear\n✓ Preserves clusters\n✓ Good for visualization\n✗ Slow\n✗ Stochastic (seed-dependent)',
           transform=ax.transAxes, fontsize=11, verticalalignment='top',
           bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))
    
    plt.suptitle('Comparison: PCA vs t-SNE', fontsize=18, fontweight='bold')
    plt.tight_layout()
    
    return X_pca, X_tsne, fig

# Compare
X_pca_2d, X_tsne_2d, fig_compare = compare_pca_tsne(X_cancer_scaled, y_cancer)
plt.show()

print("\n" + "="*80)
print("WHEN TO USE EACH METHOD")
print("="*80)
print("\nPCA:")
print("  • Dimensionality reduction for modeling")
print("  • Interpretability matters")
print("  • Large datasets")
print("  • You need reproducibility")
print("\nt-SNE:")
print("  • Exploratory visualization")
print("  • Detecting nonlinear clusters")
print("  • Small to medium datasets (<10k observations)")
print("  • You do not need interpretable axes")

---
# Part 3: Feature Selection

We will identify the most relevant features using three families of methods.

## 3.1 Filter Methods

They evaluate the relevance of each feature independently of the model.
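To make "independently of the model" concrete, the sketch below checks that the `f_classif` score of a single feature is a one-way ANOVA F statistic computed from that feature and the class labels alone; no classifier is trained. It assumes the scikit-learn breast cancer dataset for illustration, and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# One F statistic (and p-value) per feature, each computed in isolation
f_scores, p_values = f_classif(X_demo, y_demo)

# Recompute the score of the first feature directly with SciPy's one-way ANOVA
first_col = X_demo[:, 0]
f_manual, p_manual = f_oneway(first_col[y_demo == 0], first_col[y_demo == 1])

print(f"f_classif: {f_scores[0]:.4f} | f_oneway: {f_manual:.4f}")
print("Match:", np.isclose(f_scores[0], f_manual))
```

Because each score looks at one feature at a time, filter methods are fast but cannot see redundancy or interactions between features.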

In [None]:
def apply_filter_methods(X, y, k=15):
    """
    Applies multiple filter methods for feature selection
    """
    feature_names = X.columns if hasattr(X, 'columns') else [f'F{i}' for i in range(X.shape[1])]
    results = {}
    
    print("="*80)
    print("FILTER METHODS - FEATURE SELECTION")
    print("="*80)
    
    # 1. ANOVA F-test (for classification)
    print("\n1. Running ANOVA F-test...", end=' ')
    f_selector = SelectKBest(f_classif, k='all')
    f_selector.fit(X, y)
    results['F-test'] = pd.DataFrame({
        'Feature': feature_names,
        'Score': f_selector.scores_,
        'p-value': f_selector.pvalues_
    }).sort_values('Score', ascending=False)
    print("✓")
    
    # 2. Mutual Information
    print("2. Running Mutual Information...", end=' ')
    mi_selector = SelectKBest(mutual_info_classif, k='all')
    mi_selector.fit(X, y)
    results['Mutual Info'] = pd.DataFrame({
        'Feature': feature_names,
        'Score': mi_selector.scores_
    }).sort_values('Score', ascending=False)
    print("✓")
    
    # 3. Chi-squared (requires non-negative values)
    # Note: chi2 is designed for counts/frequencies, so scores on rescaled
    # continuous features are only a rough heuristic
    print("3. Running Chi-squared...", end=' ')
    # Rescale to [0, 1] for chi2
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    X_normalized = scaler.fit_transform(X)
    chi2_selector = SelectKBest(chi2, k='all')
    chi2_selector.fit(X_normalized, y)
    results['Chi-squared'] = pd.DataFrame({
        'Feature': feature_names,
        'Score': chi2_selector.scores_
    }).sort_values('Score', ascending=False)
    print("✓")
    
    # Visualization
    fig = plt.figure(figsize=(20, 6))
    
    for idx, (method_name, scores_df) in enumerate(results.items(), 1):
        ax = plt.subplot(1, 3, idx)
        top_features = scores_df.head(k)
        
        # Colors based on the normalized score
        scores_norm = (top_features['Score'] - top_features['Score'].min()) / (top_features['Score'].max() - top_features['Score'].min())
        colors = plt.cm.RdYlGn(scores_norm)
        
        bars = ax.barh(range(len(top_features)), top_features['Score'].values, color=colors, edgecolor='black')
        ax.set_yticks(range(len(top_features)))
        ax.set_yticklabels(top_features['Feature'].values, fontsize=10)
        ax.invert_yaxis()
        ax.set_xlabel('Score', fontsize=12)
        ax.set_title(f'{method_name}\nTop {k} Features', fontweight='bold', fontsize=14)
        ax.grid(axis='x', alpha=0.3)
        
        # Add value labels
        for bar, score in zip(bars, top_features['Score'].values):
            width = bar.get_width()
            ax.text(width, bar.get_y() + bar.get_height()/2,
                   f' {score:.2f}', ha='left', va='center', fontsize=9, fontweight='bold')
    
    plt.suptitle('Filter Methods: Feature Ranking', fontsize=18, fontweight='bold')
    plt.tight_layout()
    
    # Print rankings
    print("\n" + "="*80)
    print("TOP 10 FEATURES BY METHOD")
    print("="*80)
    for method_name, scores_df in results.items():
        print(f"\n{method_name}:")
        print(scores_df.head(10)[['Feature', 'Score']].to_string(index=False))
    
    return results, fig

# Apply filter methods
filter_results, fig_filter = apply_filter_methods(
    pd.DataFrame(X_cancer_scaled, columns=X_cancer.columns), 
    y_cancer, 
    k=15
)
plt.show()

## 3.2 Correlation Analysis

Identify highly correlated (redundant) features.
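Once redundant pairs are identified, a common follow-up is to keep one feature from each pair. The helper below (`drop_correlated`, a hypothetical name that is not part of the notebook) sketches one greedy rule: scan the upper triangle of the absolute correlation matrix and drop any column that is highly correlated with a column that appears before it. The toy data are made up for the illustration.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.8):
    """Greedy sketch: drop each column whose |r| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: 'a_copy' is almost identical to 'a'; 'b' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

reduced, dropped = drop_correlated(toy, threshold=0.8)
print("Dropped:", dropped)
print("Kept:   ", list(reduced.columns))
```

Which member of a pair to keep is a judgment call. This rule keeps whichever column comes first, so ordering the columns by a relevance score beforehand (for example the F-test ranking) is one way to make that choice less arbitrary.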

In [None]:
def plot_feature_correlations(X, top_features=None, threshold=0.8):
    """
    Visualizes correlations between features and detects redundancy
    """
    if top_features is not None:
        X_subset = X[top_features]
    else:
        X_subset = X
    
    # Compute correlations
    corr_matrix = X_subset.corr()
    
    # Find highly correlated pairs
    high_corr_pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                high_corr_pairs.append({
                    'Feature1': corr_matrix.columns[i],
                    'Feature2': corr_matrix.columns[j],
                    'Correlation': corr_matrix.iloc[i, j]
                })
    
    # Visualization
    fig = plt.figure(figsize=(18, 14))
    
    # Full heatmap
    ax1 = plt.subplot(2, 1, 1)
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
    sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', 
                cmap='coolwarm', center=0, square=True,
                linewidths=0.5, cbar_kws={'label': 'Correlation'},
                ax=ax1, vmin=-1, vmax=1)
    ax1.set_title('Feature Correlation Matrix', fontweight='bold', fontsize=16)
    
    # Distribution of correlations
    ax2 = plt.subplot(2, 2, 3)
    corr_values = corr_matrix.values[np.triu_indices_from(corr_matrix.values, k=1)]
    ax2.hist(corr_values, bins=50, alpha=0.7, color='steelblue', edgecolor='black')
    ax2.axvline(threshold, color='red', linestyle='--', linewidth=2, label=f'Threshold: ±{threshold}')
    ax2.axvline(-threshold, color='red', linestyle='--', linewidth=2)
    ax2.set_xlabel('Correlation', fontsize=12)
    ax2.set_ylabel('Frequency', fontsize=12)
    ax2.set_title('Distribution of Correlations', fontweight='bold', fontsize=14)
    ax2.legend()
    ax2.grid(alpha=0.3)
    
    # Table of highly correlated features
    ax3 = plt.subplot(2, 2, 4)
    ax3.axis('tight')
    ax3.axis('off')
    
    if high_corr_pairs:
        df_high_corr = pd.DataFrame(high_corr_pairs)
        df_high_corr = df_high_corr.sort_values('Correlation', ascending=False, key=abs)
        
        table_data = []
        for _, row in df_high_corr.head(15).iterrows():
            table_data.append([
                row['Feature1'][:20],
                row['Feature2'][:20],
                f"{row['Correlation']:.3f}"
            ])
        
        table = ax3.table(cellText=table_data,
                         colLabels=['Feature 1', 'Feature 2', 'Corr'],
                         cellLoc='left',
                         loc='center',
                         colWidths=[0.4, 0.4, 0.2])
        table.auto_set_font_size(False)
        table.set_fontsize(9)
        table.scale(1, 2)
        
        # Color the header row
        for i in range(3):
            table[(0, i)].set_facecolor('#40466e')
            table[(0, i)].set_text_props(weight='bold', color='white')
        
        ax3.set_title(f'Highly Correlated Features (|r| > {threshold})\n{len(high_corr_pairs)} pairs found',
                     fontweight='bold', fontsize=14, pad=20)
    else:
        ax3.text(0.5, 0.5, f'No feature pairs with |r| > {threshold}',
                ha='center', va='center', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    
    print("="*80)
    print(f"CORRELATION ANALYSIS (threshold = {threshold})")
    print("="*80)
    print(f"\nTotal highly correlated pairs: {len(high_corr_pairs)}")
    if high_corr_pairs:
        print("\nTop 10 most correlated pairs:")
        df_high_corr = pd.DataFrame(high_corr_pairs).sort_values('Correlation', ascending=False, key=abs)
        print(df_high_corr.head(10).to_string(index=False))
    
    return fig, high_corr_pairs

# Analyze correlations among the top F-test features
top_15_features = filter_results['F-test'].head(15)['Feature'].tolist()
fig_corr, high_corr = plot_feature_correlations(
    pd.DataFrame(X_cancer_scaled, columns=X_cancer.columns),
    top_features=top_15_features,
    threshold=0.8
)
plt.show()

## 3.3 Wrapper Methods

They evaluate feature subsets by training models.
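Because wrapper methods use the labels to pick features, running the selection on the full dataset and evaluating afterwards can give optimistic scores. The sketch below shows one way to avoid this: placing `RFE` inside a scikit-learn pipeline so the selection is refit on each training fold. The dataset, the logistic regression estimator, and `step=2` are illustrative assumptions, not choices made in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling, feature selection and the final classifier are all fit per fold
pipe = make_pipeline(
    StandardScaler(),
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=2),
    LogisticRegression(max_iter=1000),
)

cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"Accuracy with 10 RFE-selected features: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

The cost is visible here too: every fold retrains the estimator once per elimination step, which is why wrapper methods are much slower than filter methods.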

In [None]:
def apply_wrapper_methods(X, y, n_features_to_select=10):
    """
    Applies RFE (Recursive Feature Elimination) with different models
    """
    # pd.Index supports the boolean-mask indexing used below, even when X has no column names
    feature_names = pd.Index(X.columns if hasattr(X, 'columns') else [f'F{i}' for i in range(X.shape[1])])
    
    # Define models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
    }
    
    results = {}
    
    print("="*80)
    print("WRAPPER METHODS - RFE (Recursive Feature Elimination)")
    print("="*80)
    print(f"\nSelecting the top {n_features_to_select} features with each model...")
    
    for model_name, model in models.items():
        print(f"\n{model_name}...", end=' ')
        
        # RFE
        rfe = RFE(estimator=model, n_features_to_select=n_features_to_select, step=1)
        rfe.fit(X, y)
        
        # Store results
        results[model_name] = {
            'selected': feature_names[rfe.support_].tolist(),
            'ranking': rfe.ranking_
        }
        
        print("✓")
        print(f"  Selected features: {results[model_name]['selected'][:5]}...")
    
    # Visualization
    fig = plt.figure(figsize=(20, 12))
    
    # 1. Ranking per model
    for idx, (model_name, result) in enumerate(results.items(), 1):
        ax = plt.subplot(2, 3, idx)
        
        ranking_df = pd.DataFrame({
            'Feature': feature_names,
            'Ranking': result['ranking']
        }).sort_values('Ranking')
        
        top_features = ranking_df.head(15)
        colors = ['green' if r == 1 else 'orange' if r <= 3 else 'red' 
                 for r in top_features['Ranking']]
        
        bars = ax.barh(range(len(top_features)), top_features['Ranking'].values, 
                      color=colors, alpha=0.7, edgecolor='black')
        ax.set_yticks(range(len(top_features)))
        ax.set_yticklabels(top_features['Feature'].values, fontsize=9)
        ax.invert_yaxis()
        ax.set_xlabel('Ranking (1 = selected)', fontsize=11)
        ax.set_title(f'{model_name}\nTop 15 Features', fontweight='bold', fontsize=13)
        ax.grid(axis='x', alpha=0.3)
        
        # RFE assigns ranking 1 to every selected feature, so the cutoff sits at ranking = 1
        ax.axvline(1, color='blue', linestyle='--', 
                  linewidth=2, alpha=0.5, label=f'Selected (top {n_features_to_select})')
        ax.legend()
    
    # 2. Consensus summary (text stand-in for a Venn diagram)
    ax4 = plt.subplot(2, 3, 4)
    ax4.axis('off')
    
    selected_sets = {name: set(result['selected']) for name, result in results.items()}
    
    # Intersections
    all_three = selected_sets['Logistic Regression'] & selected_sets['Random Forest'] & selected_sets['Gradient Boosting']
    lr_rf = (selected_sets['Logistic Regression'] & selected_sets['Random Forest']) - all_three
    lr_gb = (selected_sets['Logistic Regression'] & selected_sets['Gradient Boosting']) - all_three
    rf_gb = (selected_sets['Random Forest'] & selected_sets['Gradient Boosting']) - all_three
    
    only_lr = selected_sets['Logistic Regression'] - selected_sets['Random Forest'] - selected_sets['Gradient Boosting']
    only_rf = selected_sets['Random Forest'] - selected_sets['Logistic Regression'] - selected_sets['Gradient Boosting']
    only_gb = selected_sets['Gradient Boosting'] - selected_sets['Logistic Regression'] - selected_sets['Random Forest']
    
    # Text (plain labels: colored-circle emoji are not in Matplotlib's default font)
    y_pos = 0.9
    ax4.text(0.5, y_pos, 'CONSENSUS AMONG MODELS', ha='center', fontsize=16, fontweight='bold')
    y_pos -= 0.1
    
    ax4.text(0.1, y_pos, f'All 3 models ({len(all_three)}):', fontsize=12, fontweight='bold')
    y_pos -= 0.05
    for feat in sorted(all_three):
        ax4.text(0.15, y_pos, f'• {feat}', fontsize=10)
        y_pos -= 0.04
    
    y_pos -= 0.03
    ax4.text(0.1, y_pos, '2 models:', fontsize=12, fontweight='bold')
    y_pos -= 0.05
    for feat in sorted(lr_rf | lr_gb | rf_gb):
        ax4.text(0.15, y_pos, f'• {feat}', fontsize=10)
        y_pos -= 0.04
        if y_pos < 0.1:
            break
    
    ax4.set_xlim(0, 1)
    ax4.set_ylim(0, 1)
    
    # 3. Selection heatmap
    ax5 = plt.subplot(2, 3, 5)
    selection_matrix = []
    model_names_list = list(results.keys())
    
    for model_name in model_names_list:
        row = [1 if feat in results[model_name]['selected'] else 0 
               for feat in feature_names]
        selection_matrix.append(row)
    
    selection_df = pd.DataFrame(selection_matrix, 
                               index=model_names_list,
                               columns=feature_names)
    
    # Sort by number of selections
    feature_counts = selection_df.sum(axis=0)
    selection_df = selection_df[feature_counts.sort_values(ascending=False).index]
    
    sns.heatmap(selection_df.iloc[:, :20], annot=True, fmt='d', cmap='RdYlGn',
                cbar_kws={'label': 'Selected'}, ax=ax5,
                linewidths=0.5, vmin=0, vmax=1)
    ax5.set_title('Features Selected per Model\n(Top 20 most frequent)', 
                 fontweight='bold', fontsize=13)
    ax5.set_xlabel('')
    ax5.set_ylabel('')
    
    # 4. Selection frequency
    ax6 = plt.subplot(2, 3, 6)
    feature_counts_sorted = feature_counts.sort_values(ascending=False).head(15)
    colors_freq = ['green' if c == 3 else 'orange' if c == 2 else 'red' 
                   for c in feature_counts_sorted.values]
    
    bars = ax6.barh(range(len(feature_counts_sorted)), feature_counts_sorted.values,
                   color=colors_freq, alpha=0.7, edgecolor='black')
    ax6.set_yticks(range(len(feature_counts_sorted)))
    ax6.set_yticklabels(feature_counts_sorted.index, fontsize=10)
    ax6.invert_yaxis()
    ax6.set_xlabel('Number of models that selected it', fontsize=11)
    ax6.set_title('Selection Frequency\nTop 15 Features', fontweight='bold', fontsize=13)
    ax6.set_xticks([0, 1, 2, 3])
    ax6.grid(axis='x', alpha=0.3)
    
    plt.suptitle('Wrapper Methods: RFE with Multiple Models', 
                fontsize=18, fontweight='bold')
    plt.tight_layout()
    
    return results, all_three, fig

# Apply RFE
wrapper_results, consensus_features, fig_wrapper = apply_wrapper_methods(
    pd.DataFrame(X_cancer_scaled, columns=X_cancer.columns),
    y_cancer,
    n_features_to_select=10
)
plt.show()

print("\n" + "="*80)
print("CONSENSUS FEATURES (selected by all 3 models):")
print("="*80)
for feat in sorted(consensus_features):
    print(f"  ✓ {feat}")

---
# Part 4: Class Balancing

We will handle class imbalance using over- and undersampling techniques.

## 4.1 Creating an Imbalanced Dataset

We will simulate a realistic scenario of severe imbalance.

In [None]:
# Create imbalanced dataset
def create_imbalanced_dataset(n_samples=1000, imbalance_ratio=0.1):
    """
    Creates a classification dataset with class imbalance
    (imbalance_ratio is the approximate fraction of samples in the minority class)
    """
    X_imb, y_imb = make_classification(
        n_samples=n_samples,
        n_features=20,
        n_informative=15,
        n_redundant=5,
        n_classes=2,
        weights=[1-imbalance_ratio, imbalance_ratio],
        flip_y=0.01,
        random_state=42
    )
    
    # Convert to DataFrame
    feature_names = [f'feature_{i}' for i in range(X_imb.shape[1])]
    X_imb_df = pd.DataFrame(X_imb, columns=feature_names)
    
    print("="*80)
    print("IMBALANCED DATASET CREATED")
    print("="*80)
    print(f"\nTotal observations: {len(y_imb)}")
    unique, counts = np.unique(y_imb, return_counts=True)
    for cls, count in zip(unique, counts):
        pct = 100 * count / len(y_imb)
        print(f"  Class {cls}: {count:4d} ({pct:5.2f}%)")
    
    ratio = counts[1] / counts[0]
    print(f"\nMinority/Majority ratio: {ratio:.3f} ({ratio:.1%})")
    
    return X_imb_df, y_imb

# Create dataset
X_imb, y_imb = create_imbalanced_dataset(n_samples=1000, imbalance_ratio=0.10)

# Visualize distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Class distribution
ax = axes[0]
unique, counts = np.unique(y_imb, return_counts=True)
bars = ax.bar(unique, counts, color=['red', 'green'], alpha=0.7, edgecolor='black', width=0.6)
ax.set_xlabel('Class', fontsize=13)
ax.set_ylabel('Number of observations', fontsize=13)
ax.set_title('Original Class Distribution', fontweight='bold', fontsize=15)
ax.set_xticks([0, 1])
ax.set_xticklabels(['Class 0\n(Majority)', 'Class 1\n(Minority)'])
ax.grid(axis='y', alpha=0.3)

for bar, count in zip(bars, counts):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
           f'{count}\n({100*count/len(y_imb):.1f}%)',
           ha='center', va='bottom', fontsize=12, fontweight='bold')

# PCA of the imbalanced dataset
ax = axes[1]
pca_imb = PCA(n_components=2)
X_imb_pca = pca_imb.fit_transform(X_imb)

scatter = ax.scatter(X_imb_pca[:, 0], X_imb_pca[:, 1], 
                    c=y_imb, cmap='RdYlGn', alpha=0.6, s=50,
                    edgecolors='black', linewidth=0.5)
ax.set_xlabel(f'PC1 ({pca_imb.explained_variance_ratio_[0]:.1%})', fontsize=12)
ax.set_ylabel(f'PC2 ({pca_imb.explained_variance_ratio_[1]:.1%})', fontsize=12)
ax.set_title('PCA Visualization of the Imbalanced Dataset', fontweight='bold', fontsize=15)
ax.grid(alpha=0.3)
plt.colorbar(scatter, ax=ax, label='Class', ticks=[0, 1])

plt.tight_layout()
plt.show()

## 4.2 Comparison of Balancing Methods

We will compare different over- and undersampling techniques.
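To see what separates SMOTE-style oversampling from plain duplication, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbors. It is an illustrative reimplementation under simplifying assumptions (the function name `smote_like_samples` is hypothetical, and the data are random), not imbalanced-learn's actual code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Illustrative SMOTE-style interpolation between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh_idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)   # random minority points
    pick = rng.integers(1, k + 1, size=n_new)        # one of their k neighbors (column 0 is the point itself)
    neighbors = X_min[neigh_idx[base, pick]]
    gap = rng.random((n_new, 1))                     # position along the connecting segment
    return X_min[base] + gap * (neighbors - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 2))   # toy minority class
X_synth = smote_like_samples(X_minority, n_new=30)
print(X_synth.shape)
```

Random oversampling would instead return exact copies of existing rows. Interpolation produces new points that lie on segments between existing minority samples, which adds variety but can also place points in regions that overlap the majority class.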

In [None]:
def compare_balancing_methods(X, y):
    """
    Compara m√∫ltiples m√©todos de balanceo de clases
    """
    methods = {}
    
    print("="*80)
    print("APLICANDO M√âTODOS DE BALANCEO")
    print("="*80)
    
    # 1. Original (no balancing)
    methods['1. Original'] = (X.copy(), y.copy())
    print("‚úì 1. Original (sin balanceo)")
    
    # 2. Random Oversampling
    ros = RandomOverSampler(random_state=42)
    X_ros, y_ros = ros.fit_resample(X, y)
    methods['2. Random\nOversampling'] = (X_ros, y_ros)
    print("‚úì 2. Random Oversampling")
    
    # 3. SMOTE
    smote = SMOTE(random_state=42, k_neighbors=5)
    X_smote, y_smote = smote.fit_resample(X, y)
    methods['3. SMOTE'] = (X_smote, y_smote)
    print("‚úì 3. SMOTE (Synthetic Minority Over-sampling)")
    
    # 4. ADASYN
    try:
        adasyn = ADASYN(random_state=42, n_neighbors=5)
        X_adasyn, y_adasyn = adasyn.fit_resample(X, y)
        methods['4. ADASYN'] = (X_adasyn, y_adasyn)
        print("‚úì 4. ADASYN (Adaptive Synthetic)")
    except:
        print("‚ö† 4. ADASYN - No aplicable (muy pocos ejemplos minoritarios)")
    
    # 5. BorderlineSMOTE
    try:
        bsmote = BorderlineSMOTE(random_state=42, k_neighbors=5)
        X_bsmote, y_bsmote = bsmote.fit_resample(X, y)
        methods['5. Borderline\nSMOTE'] = (X_bsmote, y_bsmote)
        print("‚úì 5. BorderlineSMOTE")
    except:
        print("‚ö† 5. BorderlineSMOTE - No aplicable")
    
    # 6. Random Undersampling
    rus = RandomUnderSampler(random_state=42)
    X_rus, y_rus = rus.fit_resample(X, y)
    methods['6. Random\nUndersampling'] = (X_rus, y_rus)
    print("✓ 6. Random Undersampling")
    
    # 7. SMOTE + Tomek Links
    try:
        smote_tomek = SMOTETomek(random_state=42)
        X_st, y_st = smote_tomek.fit_resample(X, y)
        methods['7. SMOTE +\nTomek'] = (X_st, y_st)
        print("✓ 7. SMOTE + Tomek Links (Hybrid)")
    except (ValueError, RuntimeError) as exc:
        print(f"⚠ 7. SMOTE + Tomek - Not applicable ({exc})")
    
    # Visualization
    n_methods = len(methods)
    n_cols = 4
    n_rows = (n_methods + n_cols - 1) // n_cols
    
    fig = plt.figure(figsize=(20, 5*n_rows))
    
    for idx, (method_name, (X_bal, y_bal)) in enumerate(methods.items(), 1):
        # Class distribution
        ax = plt.subplot(n_rows, n_cols, idx)
        
        unique, counts = np.unique(y_bal, return_counts=True)
        bars = ax.bar(unique, counts, alpha=0.7, 
                     color=['red', 'green'], edgecolor='black', width=0.6)
        
        ax.set_xlabel('Class', fontsize=11)
        ax.set_ylabel('Number of samples', fontsize=11)
        
        ratio = counts[1] / counts[0] if len(counts) > 1 else 0
        total_samples = len(y_bal)
        
        title = f'{method_name}\nN={total_samples} | Ratio={ratio:.2f}'
        ax.set_title(title, fontweight='bold', fontsize=12)
        ax.grid(axis='y', alpha=0.3)
        ax.set_xticks([0, 1])
        
        # Add labels on the bars
        for bar, count in zip(bars, counts):
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height,
                   f'{count}\n({100*count/total_samples:.1f}%)',
                   ha='center', va='bottom', fontsize=10, fontweight='bold')
    
    plt.suptitle('Comparison of Class-Balancing Methods', 
                fontsize=18, fontweight='bold')
    plt.tight_layout()
    
    return methods, fig

# Apply the methods
balancing_results, fig_balance = compare_balancing_methods(X_imb, y_imb)
plt.show()

## 4.3 Visualizing the Impact of Balancing

We now look at how each method changes the feature space.

In [None]:
def visualize_balancing_impact(X_original, y_original, balancing_results):
    """
    Visualize the impact of each balancing method in PCA space.
    """
    # Fit PCA on the original data
    pca = PCA(n_components=2)
    pca.fit(X_original)
    
    n_methods = len(balancing_results)
    n_cols = 3
    n_rows = (n_methods + n_cols - 1) // n_cols
    
    fig = plt.figure(figsize=(18, 6*n_rows))
    
    for idx, (method_name, (X_bal, y_bal)) in enumerate(balancing_results.items(), 1):
        ax = plt.subplot(n_rows, n_cols, idx)
        
        # Project the balanced data with the PCA fitted on the original data
        X_bal_pca = pca.transform(X_bal)
        
        # Split by class for plotting
        mask_0 = y_bal == 0
        mask_1 = y_bal == 1
        
        ax.scatter(X_bal_pca[mask_0, 0], X_bal_pca[mask_0, 1],
                  c='red', alpha=0.4, s=30, label='Class 0', edgecolors='none')
        ax.scatter(X_bal_pca[mask_1, 0], X_bal_pca[mask_1, 1],
                  c='green', alpha=0.6, s=30, label='Class 1', edgecolors='none')
        
        ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})', fontsize=11)
        ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})', fontsize=11)
        
        n_samples = len(y_bal)
        n_class_1 = (y_bal == 1).sum()
        ax.set_title(f'{method_name}\nN={n_samples} | Class 1: {n_class_1}',
                    fontweight='bold', fontsize=13)
        ax.grid(alpha=0.3)
        ax.legend(loc='upper right', fontsize=9)
    
    plt.suptitle('Impact of Balancing on the Feature Space (PCA)', 
                fontsize=18, fontweight='bold')
    plt.tight_layout()
    
    return fig

fig_impact = visualize_balancing_impact(X_imb, y_imb, balancing_results)
plt.show()

## 4.4 Evaluating the Impact on Performance

We compare classification metrics for each balancing method.
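
Before comparing methods, it helps to see why accuracy alone is a poor yardstick on imbalanced data. The sketch below uses made-up predictions (not results from this notebook) on a hypothetical 95/5 test set: a model that always predicts the majority class scores high accuracy while detecting nothing, whereas a model that recovers most of the minority class scores lower accuracy but far better recall and F1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical test set: 95 negatives followed by 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# Model A: always predicts the majority class
y_majority = np.zeros_like(y_true)

# Model B: finds 4 of the 5 positives at the cost of 10 false alarms
y_finds_minority = y_true.copy()
y_finds_minority[:10] = 1   # 10 false positives
y_finds_minority[95] = 0    # 1 missed positive

for name, y_pred in [("always majority", y_majority),
                     ("finds minority", y_finds_minority)]:
    print(f"{name:16s} "
          f"acc={accuracy_score(y_true, y_pred):.2f} "
          f"prec={precision_score(y_true, y_pred, zero_division=0):.2f} "
          f"rec={recall_score(y_true, y_pred, zero_division=0):.2f} "
          f"f1={f1_score(y_true, y_pred, zero_division=0):.2f}")
```

This is the reason the evaluation that follows reports precision, recall, F1, and ROC-AUC alongside accuracy.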

In [None]:
def evaluate_balancing_methods(X_original, y_original, balancing_results):
    """
    Evaluate classification performance with each balancing method.
    """
    # Split original data
    X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
        X_original, y_original, test_size=0.3, random_state=42, stratify=y_original
    )
    
    results_metrics = []
    
    print("="*80)
    print("EVALUATING BALANCING METHODS")
    print("="*80)
    print("\nTraining models with each balancing method...")
    
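    # Rebuild each sampler and fit it on the training split only. The arrays in
    # balancing_results were resampled from the full dataset, so training on
    # them would leak test rows (or their synthetic neighbors) into the model.
    samplers = {
        'Random\nOversampling': RandomOverSampler(random_state=42),
        'SMOTE': SMOTE(random_state=42, k_neighbors=5),
        'ADASYN': ADASYN(random_state=42, n_neighbors=5),
        'Borderline\nSMOTE': BorderlineSMOTE(random_state=42, k_neighbors=5),
        'Random\nUndersampling': RandomUnderSampler(random_state=42),
        'SMOTE +\nTomek': SMOTETomek(random_state=42),
    }
    
    for method_name in balancing_results:
        print(f"  {method_name}...", end=' ')
        
        # Strip the "N. " prefix to recover the sampler key
        key = method_name.split('. ', 1)[-1]
        if key == 'Original':
            X_train, y_train = X_train_orig, y_train_orig
        else:
            try:
                X_train, y_train = samplers[key].fit_resample(X_train_orig, y_train_orig)
            except (ValueError, RuntimeError) as exc:
                print(f"⚠ skipped ({exc})")
                continue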
        
        # Train the model
        model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
        model.fit(X_train, y_train)
        
        # Predict on the test set (always the same one)
        y_pred = model.predict(X_test_orig)
        y_pred_proba = model.predict_proba(X_test_orig)[:, 1]
        
        # Compute metrics
        metrics = {
            'Method': method_name.replace('\n', ' '),
            'Accuracy': accuracy_score(y_test_orig, y_pred),
            'Precision': precision_score(y_test_orig, y_pred, zero_division=0),
            'Recall': recall_score(y_test_orig, y_pred, zero_division=0),
            'F1-Score': f1_score(y_test_orig, y_pred, zero_division=0),
            'ROC-AUC': roc_auc_score(y_test_orig, y_pred_proba)
        }
        
        results_metrics.append(metrics)
        print("✓")
    
    # Convert to DataFrame
    df_metrics = pd.DataFrame(results_metrics)
    
    # Visualization
    fig = plt.figure(figsize=(18, 10))
    
    # 1. All metrics
    ax1 = plt.subplot(2, 2, 1)
    df_plot = df_metrics.set_index('Method')[['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']]
    df_plot.plot(kind='bar', ax=ax1, width=0.8, edgecolor='black')
    ax1.set_ylabel('Score', fontsize=12)
    ax1.set_title('Comparison of All Metrics', fontweight='bold', fontsize=14)
    ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right')
    ax1.legend(loc='lower right', fontsize=10)
    ax1.grid(axis='y', alpha=0.3)
    ax1.set_ylim(0, 1.05)
    ax1.axhline(y=0.5, color='red', linestyle='--', alpha=0.5, linewidth=1)
    
    # 2. Precision vs Recall
    ax2 = plt.subplot(2, 2, 2)
    colors = plt.cm.tab10(np.linspace(0, 1, len(df_metrics)))
    for idx, row in df_metrics.iterrows():
        ax2.scatter(row['Recall'], row['Precision'], s=200, alpha=0.7,
                   color=colors[idx], edgecolors='black', linewidth=2)
        ax2.annotate(row['Method'], 
                    (row['Recall'], row['Precision']),
                    fontsize=9, ha='center')
    
    ax2.plot([0, 1], [0, 1], 'k--', alpha=0.3)
    ax2.set_xlabel('Recall', fontsize=12)
    ax2.set_ylabel('Precision', fontsize=12)
    ax2.set_title('Precision vs Recall Trade-off', fontweight='bold', fontsize=14)
    ax2.grid(alpha=0.3)
    ax2.set_xlim(-0.05, 1.05)
    ax2.set_ylim(-0.05, 1.05)
    
    # 3. F1-Score ranking
    ax3 = plt.subplot(2, 2, 3)
    df_sorted = df_metrics.sort_values('F1-Score')
    colors_f1 = plt.cm.RdYlGn(df_sorted['F1-Score'].values)
    bars = ax3.barh(range(len(df_sorted)), df_sorted['F1-Score'].values,
                    color=colors_f1, edgecolor='black')
    ax3.set_yticks(range(len(df_sorted)))
    ax3.set_yticklabels(df_sorted['Method'].values, fontsize=10)
    ax3.set_xlabel('F1-Score', fontsize=12)
    ax3.set_title('Ranking by F1-Score', fontweight='bold', fontsize=14)
    ax3.grid(axis='x', alpha=0.3)
    ax3.set_xlim(0, 1)
    
    for bar, score in zip(bars, df_sorted['F1-Score'].values):
        width = bar.get_width()
        ax3.text(width + 0.02, bar.get_y() + bar.get_height()/2,
                f'{score:.3f}', ha='left', va='center', fontsize=10, fontweight='bold')
    
    # 4. Results table
    ax4 = plt.subplot(2, 2, 4)
    ax4.axis('tight')
    ax4.axis('off')
    
    table_data = []
    for _, row in df_metrics.iterrows():
        table_data.append([
            row['Method'][:20],
            f"{row['Accuracy']:.3f}",
            f"{row['Precision']:.3f}",
            f"{row['Recall']:.3f}",
            f"{row['F1-Score']:.3f}",
            f"{row['ROC-AUC']:.3f}"
        ])
    
    table = ax4.table(cellText=table_data,
                     colLabels=['Method', 'Acc', 'Prec', 'Rec', 'F1', 'AUC'],
                     cellLoc='center',
                     loc='center',
                     colWidths=[0.35, 0.13, 0.13, 0.13, 0.13, 0.13])
    table.auto_set_font_size(False)
    table.set_fontsize(9)
    table.scale(1, 2)
    
    # Color the header
    for i in range(6):
        table[(0, i)].set_facecolor('#40466e')
        table[(0, i)].set_text_props(weight='bold', color='white')
    
    # Highlight the best value in each column
    for col_idx in range(1, 6):
        col_values = [float(table_data[i][col_idx]) for i in range(len(table_data))]
        best_idx = np.argmax(col_values)
        table[(best_idx + 1, col_idx)].set_facecolor('#90EE90')
        table[(best_idx + 1, col_idx)].set_text_props(weight='bold')
    
    ax4.set_title('Results Table\n(Green = best for each metric)', 
                 fontweight='bold', fontsize=14, pad=20)
    
    plt.suptitle('Evaluating the Impact of Balancing on Performance', 
                fontsize=18, fontweight='bold')
    plt.tight_layout()
    
    print("\n" + "="*80)
    print("RESULTS:")
    print("="*80)
    print(df_metrics.to_string(index=False))
    
    # Identify the best method
    best_f1 = df_metrics.loc[df_metrics['F1-Score'].idxmax()]
    print(f"\n🏆 Best method by F1-Score: {best_f1['Method']} ({best_f1['F1-Score']:.3f})")
    
    return df_metrics, fig

# Evaluate the methods
metrics_df, fig_eval = evaluate_balancing_methods(X_imb, y_imb, balancing_results)
plt.show()

---
# Part 5: Complete Preprocessing Pipeline

We will integrate all the techniques into a reproducible, robust pipeline.

## 5.1 Building the Pipeline

We will build a modular pipeline that integrates every stage.

In [None]:
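# imblearn's Pipeline is required here: it accepts samplers such as SMOTE as
# intermediate steps and applies them only during fit, whereas
# sklearn.pipeline.Pipeline rejects steps that lack a transform method.
from imblearn.pipeline import Pipeline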
from sklearn.base import BaseEstimator, TransformerMixin

class DataCleaningTransformer(BaseEstimator, TransformerMixin):
    """
    Custom transformer for data cleaning (imputation plus optional outlier capping).
    """
    def __init__(self, imputation_strategy='knn', outlier_method='cap'):
        self.imputation_strategy = imputation_strategy
        self.outlier_method = outlier_method
        self.imputer = None
        self.outlier_bounds = {}
    
    def fit(self, X, y=None):
        # Fit the imputer
        if self.imputation_strategy == 'knn':
            self.imputer = KNNImputer(n_neighbors=5)
        elif self.imputation_strategy == 'median':
            self.imputer = SimpleImputer(strategy='median')
        else:
            self.imputer = SimpleImputer(strategy='mean')
        
        self.imputer.fit(X)
        
        # Compute outlier limits; reset first so a refit never keeps stale bounds
        self.outlier_bounds = {}
        if self.outlier_method == 'cap':
            X_clean = self.imputer.transform(X)
            for i in range(X_clean.shape[1]):
                q05 = np.percentile(X_clean[:, i], 5)
                q95 = np.percentile(X_clean[:, i], 95)
                self.outlier_bounds[i] = (q05, q95)
        
        return self
    
    def transform(self, X):
        # Impute
        X_clean = self.imputer.transform(X)
        
        # Cap outliers to the percentile limits learned in fit
        if self.outlier_method == 'cap':
            for i, (lower, upper) in self.outlier_bounds.items():
                X_clean[:, i] = np.clip(X_clean[:, i], lower, upper)
        
        return X_clean

def create_preprocessing_pipeline(
    use_cleaning=True,
    use_scaling=True,
    use_pca=False,
    use_feature_selection=False,
    use_balancing=False,
    n_components_pca=0.95,
    n_features=10,
    balancing_method='smote'
):
    """
    Create a configurable preprocessing pipeline.
    """
    steps = []
    
    # 1. Cleaning (optional)
    if use_cleaning:
        steps.append(('cleaning', DataCleaningTransformer(
            imputation_strategy='knn',
            outlier_method='cap'
        )))
    
    # 2. Scaling (optional)
    if use_scaling:
        steps.append(('scaler', StandardScaler()))
    
    # 3. Dimensionality reduction (optional)
    if use_pca:
        steps.append(('pca', PCA(n_components=n_components_pca)))
    
    # 4. Feature selection (optional)
    if use_feature_selection:
        steps.append(('feature_selection', SelectKBest(
            f_classif, k=n_features
        )))
    
    # 5. Balancing (optional)
    if use_balancing:
        if balancing_method == 'smote':
            steps.append(('balancing', SMOTE(random_state=42)))
        elif balancing_method == 'adasyn':
            steps.append(('balancing', ADASYN(random_state=42)))
        elif balancing_method == 'borderline':
            steps.append(('balancing', BorderlineSMOTE(random_state=42)))
        else:
            # Fail loudly instead of silently building a pipeline without balancing
            raise ValueError(f"Unknown balancing_method: {balancing_method!r}")
    
    # 6. Classifier
    steps.append(('classifier', RandomForestClassifier(
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    )))
    
    pipeline = Pipeline(steps)
    
    return pipeline

print("="*80)
print("COMPLETE PIPELINE EXAMPLE")
print("="*80)

# Create the pipeline
pipeline = create_preprocessing_pipeline(
    use_cleaning=True,
    use_scaling=True,
    use_pca=True,
    use_feature_selection=False,
    use_balancing=True,
    n_components_pca=0.95,
    balancing_method='smote'
)

print("\nPipeline steps:")
for name, step in pipeline.steps:
    print(f"  {name:20s}: {step.__class__.__name__}")

## 5.2 Evaluating the Pipeline

Let us compare different pipeline configurations.

In [None]:
def evaluate_pipeline_configurations(X, y):
    """
    Evaluate multiple pipeline configurations.
    """
    # Configurations to test
    configs = [
        {
            'name': 'Baseline (classifier only)',
            'use_cleaning': False,
            'use_scaling': False,
            'use_pca': False,
            'use_feature_selection': False,
            'use_balancing': False
        },
        {
            'name': 'Scaling only',
            'use_cleaning': False,
            'use_scaling': True,
            'use_pca': False,
            'use_feature_selection': False,
            'use_balancing': False
        },
        {
            'name': 'Scaling + PCA',
            'use_cleaning': False,
            'use_scaling': True,
            'use_pca': True,
            'use_feature_selection': False,
            'use_balancing': False
        },
        {
            'name': 'Scaling + Feature Selection',
            'use_cleaning': False,
            'use_scaling': True,
            'use_pca': False,
            'use_feature_selection': True,
            'use_balancing': False
        },
        {
            'name': 'Scaling + SMOTE',
            'use_cleaning': False,
            'use_scaling': True,
            'use_pca': False,
            'use_feature_selection': False,
            'use_balancing': True
        },
        {
            'name': 'Complete Pipeline',
            'use_cleaning': True,
            'use_scaling': True,
            'use_pca': True,
            'use_feature_selection': False,
            'use_balancing': True
        }
    ]
    
    results = []
    
    print("="*80)
    print("EVALUATING PIPELINE CONFIGURATIONS")
    print("="*80)
    
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    
    for config in configs:
        print(f"\n{config['name']}...", end=' ')
        
        # Create the pipeline
        name = config.pop('name')
        pipeline = create_preprocessing_pipeline(**config)
        
        try:
            # Train
            pipeline.fit(X_train, y_train)
            
            # Predict
            y_pred = pipeline.predict(X_test)
            y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
            
            # Metrics
            results.append({
                'Configuration': name,
                'Accuracy': accuracy_score(y_test, y_pred),
                'Precision': precision_score(y_test, y_pred, zero_division=0),
                'Recall': recall_score(y_test, y_pred, zero_division=0),
                'F1-Score': f1_score(y_test, y_pred, zero_division=0),
                'ROC-AUC': roc_auc_score(y_test, y_pred_proba)
            })
            
            print("✓")
            
        except Exception as e:
            print(f"✗ Error: {str(e)[:50]}")
            results.append({
                'Configuration': name,
                'Accuracy': 0,
                'Precision': 0,
                'Recall': 0,
                'F1-Score': 0,
                'ROC-AUC': 0
            })
    
    # Convert to DataFrame
    df_results = pd.DataFrame(results)
    
    # Visualization
    fig = plt.figure(figsize=(18, 10))
    
    # 1. Comparison of all metrics
    ax1 = plt.subplot(2, 2, 1)
    df_plot = df_results.set_index('Configuration')[['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']]
    df_plot.plot(kind='bar', ax=ax1, width=0.8, edgecolor='black')
    ax1.set_ylabel('Score', fontsize=12)
    ax1.set_title('Metrics by Configuration', fontweight='bold', fontsize=14)
    ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right', fontsize=10)
    ax1.legend(fontsize=10)
    ax1.grid(axis='y', alpha=0.3)
    ax1.set_ylim(0, 1.05)
    
    # 2. Metrics heatmap
    ax2 = plt.subplot(2, 2, 2)
    metrics_matrix = df_results.set_index('Configuration')[['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']]
    sns.heatmap(metrics_matrix.T, annot=True, fmt='.3f', cmap='RdYlGn',
                center=0.5, vmin=0, vmax=1, ax=ax2,
                linewidths=0.5, cbar_kws={'label': 'Score'})
    ax2.set_title('Metrics Heatmap', fontweight='bold', fontsize=14)
    ax2.set_xlabel('')
    ax2.set_ylabel('', fontsize=12)
    
    # 3. Ranking by F1-Score
    ax3 = plt.subplot(2, 2, 3)
    df_sorted = df_results.sort_values('F1-Score')
    colors = plt.cm.RdYlGn(df_sorted['F1-Score'].values)
    bars = ax3.barh(range(len(df_sorted)), df_sorted['F1-Score'].values,
                    color=colors, edgecolor='black')
    ax3.set_yticks(range(len(df_sorted)))
    ax3.set_yticklabels(df_sorted['Configuration'].values, fontsize=10)
    ax3.set_xlabel('F1-Score', fontsize=12)
    ax3.set_title('Ranking by F1-Score', fontweight='bold', fontsize=14)
    ax3.grid(axis='x', alpha=0.3)
    ax3.set_xlim(0, 1)
    
    for bar, score in zip(bars, df_sorted['F1-Score'].values):
        width = bar.get_width()
        ax3.text(width + 0.02, bar.get_y() + bar.get_height()/2,
                f'{score:.3f}', ha='left', va='center', fontsize=10, fontweight='bold')
    
    # 4. Relative improvement over the baseline
    ax4 = plt.subplot(2, 2, 4)
    baseline_f1 = df_results[df_results['Configuration'].str.contains('Baseline')]['F1-Score'].values[0]
    df_results['Improvement'] = ((df_results['F1-Score'] - baseline_f1) / baseline_f1) * 100
    
    df_improvement = df_results[~df_results['Configuration'].str.contains('Baseline')].sort_values('Improvement')
    colors_imp = ['red' if x < 0 else 'green' for x in df_improvement['Improvement'].values]
    
    bars = ax4.barh(range(len(df_improvement)), df_improvement['Improvement'].values,
                    color=colors_imp, alpha=0.7, edgecolor='black')
    ax4.set_yticks(range(len(df_improvement)))
    ax4.set_yticklabels(df_improvement['Configuration'].values, fontsize=10)
    ax4.set_xlabel('F1-Score improvement (%)', fontsize=12)
    ax4.set_title(f'Relative Improvement vs Baseline\n(Baseline F1={baseline_f1:.3f})', 
                 fontweight='bold', fontsize=14)
    ax4.axvline(0, color='black', linewidth=1)
    ax4.grid(axis='x', alpha=0.3)
    
    for bar, improvement in zip(bars, df_improvement['Improvement'].values):
        width = bar.get_width()
        label_pos = width + (5 if width > 0 else -5)
        ha = 'left' if width > 0 else 'right'
        ax4.text(label_pos, bar.get_y() + bar.get_height()/2,
                f'{improvement:+.1f}%', ha=ha, va='center', fontsize=10, fontweight='bold')
    
    plt.suptitle('Evaluation of Pipeline Configurations', 
                fontsize=18, fontweight='bold')
    plt.tight_layout()
    
    print("\n" + "="*80)
    print("RESULTS:")
    print("="*80)
    print(df_results[['Configuration', 'Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']].to_string(index=False))
    
    best_config = df_results.loc[df_results['F1-Score'].idxmax()]
    print(f"\n🏆 Best configuration: {best_config['Configuration']}")
    print(f"   F1-Score: {best_config['F1-Score']:.3f}")
    print(f"   Improvement vs Baseline: {best_config['Improvement']:.1f}%")
    
    return df_results, fig

# Evaluate on the imbalanced dataset
results_pipeline, fig_pipeline = evaluate_pipeline_configurations(X_imb, y_imb)
plt.show()

## 5.3 Cross-Validating the Best Pipeline

We will check the robustness of the pipeline with stratified k-fold cross-validation.
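
As a point of comparison for a hand-written fold loop, scikit-learn's `cross_validate` can compute several metrics per fold in a single call. The sketch below is a minimal illustration on synthetic data standing in for `X_imb` / `y_imb`; the names `X_demo`, `y_demo`, and `demo_pipeline` are assumptions, and the pipeline is deliberately simpler than the one built in section 5.1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (roughly 85% / 15%)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=42
)

demo_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# Stratified folds keep the class ratio similar in every split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_out = cross_validate(
    demo_pipeline, X_demo, y_demo, cv=skf,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)

# cross_validate returns one array per metric, with one entry per fold
for key in sorted(k for k in cv_out if k.startswith("test_")):
    print(f"{key[5:]:10s} {cv_out[key].mean():.3f} ± {cv_out[key].std():.3f}")
```

`cross_validate` clones the estimator for each fold, so no fitted state carries over between folds. A manual loop remains useful when per-fold plots or custom bookkeeping are needed.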

In [None]:
def cross_validate_pipeline(X, y, pipeline=None, cv=5):
    """
    Run stratified cross-validation on the pipeline.
    """
    if pipeline is None:
        pipeline = create_preprocessing_pipeline(
            use_cleaning=True,
            use_scaling=True,
            use_pca=True,
            use_balancing=True
        )
    
    print("="*80)
    print("CROSS-VALIDATION OF THE PIPELINE")
    print("="*80)
    print(f"\nRunning cross-validation with {cv} folds...")
    
    # Stratified cross-validation
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    
    scores = {
        'accuracy': [],
        'precision': [],
        'recall': [],
        'f1': [],
        'roc_auc': []
    }
    
    fold_results = []
    
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), 1):
        print(f"  Fold {fold}/{cv}...", end=' ')
        
        X_train = X.iloc[train_idx] if hasattr(X, 'iloc') else X[train_idx]
        X_test = X.iloc[test_idx] if hasattr(X, 'iloc') else X[test_idx]
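        # Index y positionally too, so a pandas Series with a non-default index works
        y_train = y.iloc[train_idx] if hasattr(y, 'iloc') else y[train_idx]
        y_test = y.iloc[test_idx] if hasattr(y, 'iloc') else y[test_idx]
        
        # Train
        pipeline.fit(X_train, y_train)
        
        # Predict
        y_pred = pipeline.predict(X_test)
        y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
        
        # Compute metrics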
        fold_scores = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred, zero_division=0),
            'recall': recall_score(y_test, y_pred, zero_division=0),
            'f1': f1_score(y_test, y_pred, zero_division=0),
            'roc_auc': roc_auc_score(y_test, y_pred_proba)
        }
        
        for metric, score in fold_scores.items():
            scores[metric].append(score)
        
        fold_results.append(fold_scores)
        print("✓")
    
    # Compute statistics
    stats = {}
    for metric, values in scores.items():
        stats[metric] = {
            'mean': np.mean(values),
            'std': np.std(values),
            'min': np.min(values),
            'max': np.max(values)
        }
    
    # Visualization
    fig = plt.figure(figsize=(18, 10))
    
    # 1. Box plots of the metrics
    ax1 = plt.subplot(2, 2, 1)
    metrics_df = pd.DataFrame(scores)
    bp = ax1.boxplot([metrics_df[col].values for col in metrics_df.columns],
                     labels=['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC'],
                     patch_artist=True, showmeans=True,
                     meanprops=dict(marker='D', markerfacecolor='red', markersize=8))
    
    for patch in bp['boxes']:
        patch.set_facecolor('lightblue')
    
    ax1.set_ylabel('Score', fontsize=12)
    ax1.set_title(f'Distribution of Metrics\n({cv}-Fold Cross-Validation)', 
                 fontweight='bold', fontsize=14)
    ax1.grid(axis='y', alpha=0.3)
    ax1.set_ylim(0, 1.05)
    
    # 2. Metrics per fold
    ax2 = plt.subplot(2, 2, 2)
    folds = list(range(1, cv+1))
    for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
        ax2.plot(folds, scores[metric], marker='o', label=metric.replace('_', '-').title(), linewidth=2)
    
    ax2.set_xlabel('Fold', fontsize=12)
    ax2.set_ylabel('Score', fontsize=12)
    ax2.set_title('Metrics per Fold', fontweight='bold', fontsize=14)
    ax2.legend(fontsize=10)
    ax2.grid(alpha=0.3)
    ax2.set_xticks(folds)
    ax2.set_ylim(0, 1.05)
    
    # 3. Mean and standard deviation
    ax3 = plt.subplot(2, 2, 3)
    metrics_names = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
    means = [stats[m]['mean'] for m in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']]
    stds = [stats[m]['std'] for m in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']]
    
    x_pos = np.arange(len(metrics_names))
    bars = ax3.bar(x_pos, means, yerr=stds, alpha=0.7, capsize=5,
                  color='steelblue', edgecolor='black', error_kw={'linewidth': 2})
    ax3.set_xticks(x_pos)
    ax3.set_xticklabels(metrics_names, rotation=45, ha='right')
    ax3.set_ylabel('Score', fontsize=12)
    ax3.set_title('Mean ± Standard Deviation', fontweight='bold', fontsize=14)
    ax3.grid(axis='y', alpha=0.3)
    ax3.set_ylim(0, 1.05)
    
    for bar, mean, std in zip(bars, means, stds):
        height = bar.get_height()
        ax3.text(bar.get_x() + bar.get_width()/2., height + std + 0.02,
                f'{mean:.3f}±{std:.3f}', ha='center', va='bottom', fontsize=9, fontweight='bold')
    
    # 4. Statistics table
    ax4 = plt.subplot(2, 2, 4)
    ax4.axis('tight')
    ax4.axis('off')
    
    table_data = []
    for metric_name, metric_key in zip(metrics_names, ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']):
        row = [
            metric_name,
            f"{stats[metric_key]['mean']:.4f}",
            f"{stats[metric_key]['std']:.4f}",
            f"{stats[metric_key]['min']:.4f}",
            f"{stats[metric_key]['max']:.4f}"
        ]
        table_data.append(row)
    
    table = ax4.table(cellText=table_data,
                     colLabels=['Metric', 'Mean', 'Std', 'Min', 'Max'],
                     cellLoc='center',
                     loc='center')
    table.auto_set_font_size(False)
    table.set_fontsize(11)
    table.scale(1, 2.5)
    
    for i in range(5):
        table[(0, i)].set_facecolor('#40466e')
        table[(0, i)].set_text_props(weight='bold', color='white')
    
    ax4.set_title(f'{cv}-Fold Cross-Validation Statistics', 
                 fontweight='bold', fontsize=14, pad=20)
    
    plt.suptitle('Cross-Validation of the Pipeline', fontsize=18, fontweight='bold')
    plt.tight_layout()
    
    print("\n" + "="*80)
    print("CROSS-VALIDATION RESULTS:")
    print("="*80)
    for metric_name, metric_key in zip(metrics_names, ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']):
        print(f"{metric_name:15s}: {stats[metric_key]['mean']:.4f} ± {stats[metric_key]['std']:.4f} "
              f"(min={stats[metric_key]['min']:.4f}, max={stats[metric_key]['max']:.4f})")
    
    return stats, fig

# Validate the complete pipeline
cv_stats, fig_cv = cross_validate_pipeline(X_imb, y_imb, cv=5)
plt.show()

---
# Summary and Conclusions

## ✅ What we have learned

### 1. Data Cleaning
* Missing values require careful analysis of the missingness mechanism (MCAR, MAR, MNAR)
* KNN imputation generally outperforms simple methods such as mean or median imputation
* Outliers should be investigated before they are removed
* Scaling is crucial for many algorithms

### 2. Dimensionality Reduction
* **PCA**: fast, interpretable, linear
  * Useful for actual dimensionality reduction
  * Preserves global variance
  
* **t-SNE**: slow, not interpretable, nonlinear
  * Excellent for visualization
  * Preserves local structure (clusters)
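
The "preserves global variance" point can be checked directly: passing a float to `n_components` asks PCA to keep just enough components to reach that fraction of the variance. The following is a small sketch on scikit-learn's bundled Digits dataset, chosen here only for illustration; the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_digits, _ = load_digits(return_X_y=True)   # 64 pixel features per image
X_scaled = StandardScaler().fit_transform(X_digits)

# A float in (0, 1) means "keep enough components to explain this much variance"
pca_95 = PCA(n_components=0.95).fit(X_scaled)
cum_var = np.cumsum(pca_95.explained_variance_ratio_)

print(f"Components kept: {pca_95.n_components_} of {X_scaled.shape[1]}")
print(f"Variance retained: {cum_var[-1]:.3f}")
```

The fitted `pca_95` can then be reused with `transform` on new data, which is what makes PCA suitable for real dimensionality reduction inside a pipeline. scikit-learn's t-SNE has no `transform` method for unseen samples, which is one reason it is mainly used for visualization.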

### 3. Feature Selection
* **Filter methods**: fast but independent of the model
* **Wrapper methods**: slower but specific to the model
* **Consensus**: combining several methods increases robustness
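
One simple way to build such a consensus is to intersect the features chosen by a filter method and a wrapper method. The sketch below assumes synthetic data and picks `SelectKBest` with an ANOVA F-test as the filter and `RFE` around a logistic regression as the wrapper; both choices, and `k = 5`, are illustrative rather than the notebook's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X_fs, y_fs = make_classification(
    n_samples=500, n_features=12, n_informative=4, n_redundant=2,
    shuffle=False, random_state=0,
)

k = 5
# Filter: rank features by a univariate F-test, independent of any model
filter_sel = SelectKBest(f_classif, k=k).fit(X_fs, y_fs)
filter_idx = set(filter_sel.get_support(indices=True))

# Wrapper: recursive feature elimination driven by a specific model
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X_fs, y_fs)
wrapper_idx = set(wrapper_sel.get_support(indices=True))

# Consensus: features that both methods agree on
consensus = sorted(filter_idx & wrapper_idx)
print("Filter   :", sorted(filter_idx))
print("Wrapper  :", sorted(wrapper_idx))
print("Consensus:", consensus)
```

Features selected by both approaches are less likely to be artifacts of one particular scoring criterion. A voting scheme across three or more methods generalizes the same idea.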

### 4. Class Balancing
* Severe imbalance biases models toward the majority class
* **SMOTE** is generally superior to random oversampling
* **ADASYN** adapts the synthesis to local density, generating more samples where minority points are surrounded by majority neighbors
* Balancing must be applied ONLY to the training set
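
The last point is the easiest to get wrong, so here is a minimal sketch of the correct order of operations: split first, resample only the training split, and evaluate on the untouched test split. To stay dependency-light it uses random oversampling via `sklearn.utils.resample` rather than SMOTE; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X_all, y_all = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# 1) Split first, so the test set keeps the original class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=42
)

# 2) Oversample the minority class in the training split only
X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)
X_tr_bal = np.vstack([X_maj, X_min_up])
y_tr_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                           np.ones(len(X_min_up), dtype=int)])

# 3) Fit on the balanced training data, evaluate on the untouched test split
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr_bal, y_tr_bal)
print("Train class counts:", np.bincount(y_tr_bal))
print("Test class counts :", np.bincount(y_te))
print(f"Minority recall on test: {recall_score(y_te, clf.predict(X_te)):.3f}")
```

Placing the sampler inside an `imblearn.pipeline.Pipeline` achieves the same separation automatically, because samplers run during `fit` and are skipped during `predict`.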

### 5. Pipelines
* Automate and standardize preprocessing
* Prevent data leakage
* Make results easier to reproduce
* Allow a fair comparison of configurations

## 🎯 Best Practices

1. **Always split the data BEFORE** preprocessing
2. **Document** preprocessing **decisions**
3. **Validate the impact** of each transformation
4. **Use cross-validation** to assess robustness
5. **Do not remove data without investigating** first
6. **Balance classes with care** (training set only)
7. **Scale before PCA** or distance-based methods
8. **Prefer pipelines** to ad hoc code
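
Practice 7 is easy to demonstrate. In the sketch below, two independent synthetic features differ only in scale; without standardization the first principal component is almost entirely the large-scale feature, while after scaling both contribute about equally. The data are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent features on very different scales
X_raw = np.column_stack([
    rng.normal(0, 1, 500),      # small scale
    rng.normal(0, 1000, 500),   # large scale
])

pca_raw = PCA(n_components=2).fit(X_raw)
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X_raw))

print("Unscaled:", np.round(pca_raw.explained_variance_ratio_, 4))
print("Scaled  :", np.round(pca_std.explained_variance_ratio_, 4))
```

PCA maximizes variance, and variance depends on units, so an unscaled feature measured in large units dominates the components regardless of how informative it is.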

## 📊 Key Results from This Notebook

From our experiments:
* Scaling improved the metrics in all cases
* PCA reduced the number of dimensions without a significant loss in performance
* SMOTE significantly improved recall for the minority class
* The complete pipeline achieved the best precision/recall balance

## 🚀 Next Steps

1. Apply these techniques to your own datasets
2. Experiment with different configurations
3. Document the decision-making process
4. Compare multiple strategies systematically

## 📚 References and Resources

* Scikit-learn Documentation: https://scikit-learn.org
* Imbalanced-learn: https://imbalanced-learn.org
* "Feature Engineering and Selection" - Kuhn & Johnson
* "Hands-On Machine Learning" - Aurélien Géron

---
## 💡 Additional Exercises (Optional)

Test your understanding:

### Exercise 1: Custom Dataset
Apply the complete pipeline to one of these datasets:
* Wine Quality
* Iris
* Digits
* Your own data

### Exercise 2: Hyperparameter Optimization
Use GridSearchCV to optimize:
* Number of neighbors in KNN Imputer
* Number of components in PCA
* K in SMOTE
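
As a starting point for this exercise, pipeline parameters are addressed in a grid as `<step name>__<parameter name>`. The sketch below tunes the imputer and PCA on synthetic data with injected missing values; the step names (`imputer`, `pca`, `clf`), the grid values, and the data are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_gs, y_gs = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)
# Blank out about 5% of the values so the imputer has something to do
rng = np.random.default_rng(0)
X_gs[rng.random(X_gs.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Keys follow the <step name>__<parameter name> convention
param_grid = {
    "imputer__n_neighbors": [3, 5],
    "pca__n_components": [3, 5],
}

search = GridSearchCV(
    pipe, param_grid, scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_gs, y_gs)
print(search.best_params_, round(search.best_score_, 3))
```

To tune K in SMOTE the same way, the pipeline would need to be an `imblearn.pipeline.Pipeline` with a SMOTE step; if that step were named `smote`, the grid key would be `smote__k_neighbors`.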

### Exercise 3: Sensitivity Analysis
Investigate how performance varies when you:
* Change the percentage of missing values
* Vary the degree of imbalance
* Modify the number of outliers

### Exercise 4: Advanced Pipeline
Extend the pipeline to:
* Handle categorical variables
* Include feature engineering
* Try multiple classifiers
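
For the first item, one common approach is a `ColumnTransformer` that routes numeric and categorical columns through separate sub-pipelines. The sketch below uses a tiny hypothetical DataFrame (the column names `age`, `city`, and `label` are invented) and is meant as a starting point, not a finished solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric and one categorical column
df = pd.DataFrame({
    "age":   [25, 32, np.nan, 51, 46, 38, 29, 60],
    "city":  ["A", "B", "A", np.nan, "C", "B", "A", "C"],
    "label": [0, 0, 0, 1, 1, 0, 0, 1],
})
X_cat, y_cat = df[["age", "city"]], df["label"]

# Numeric columns: median imputation, then scaling
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation, then one-hot encoding
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
model.fit(X_cat, y_cat)

X_enc = model.named_steps["prep"].transform(X_cat)
print(X_enc.shape)  # one scaled numeric column plus one column per city
```

Swapping the final `clf` step for another estimator, or looping over a list of estimators, covers the "multiple classifiers" item with the same preprocessing.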

---

**Excellent work completing this module!** 🎉

You have mastered the fundamental data cleaning and preparation techniques that are essential for any data science or machine learning project.