# üöÄ Clasificaci√≥n Multivariable con Transformer + Features Num√©ricas/Categ√≥ricas

---

## üìã Descripci√≥n
Script robusto para clasificaci√≥n combinando:
- **Texto** procesado con Transformer (BETO o XLM-RoBERTa)
- **Features num√©ricas** (edad)
- **Features categ√≥ricas** (nivel educativo, desempe√±o)

### üéØ Variables de Entrada:
1. **texto_final** (texto): Procesado con Transformer
2. **p208a** (edad): Variable num√©rica
3. **p301a** (nivel_educativo): Variable categ√≥rica (ya num√©rica)
4. **p507** (desempe√±o): Variable categ√≥rica (ya num√©rica)

### üéØ Variable Objetivo:
- **p505r4**: Clasificaci√≥n de ocupaci√≥n

### ‚ú® Caracter√≠sticas:
- ‚úÖ Arquitectura h√≠brida: Transformer + Dense Layers
- ‚úÖ Normalizaci√≥n de features num√©ricas
- ‚úÖ Embeddings para features categ√≥ricas
- ‚úÖ M√©tricas completas (macro, micro, weighted)
- ‚úÖ Compatible con BETO y XLM-RoBERTa

---


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# ============================================================================
# INSTALACI√ìN DE DEPENDENCIAS
# ============================================================================

# Descomenta si necesitas instalar
# !pip install transformers==4.36.0 datasets==2.15.0 scikit-learn==1.3.2
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# !pip install accelerate sentencepiece

print("‚úÖ Si las librer√≠as ya est√°n instaladas, contin√∫a")


‚úÖ Si las librer√≠as ya est√°n instaladas, contin√∫a


In [3]:
# ============================================================================
# IMPORTACIONES Y VERIFICACI√ìN DEL ENTORNO
# ============================================================================

import sys
import os
import warnings
import logging
from datetime import datetime
from pathlib import Path

# Data & ML
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    classification_report,
    confusion_matrix
)

# Transformers
from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoConfig,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)

# Utilities
from tqdm.auto import tqdm
import pickle
import json

warnings.filterwarnings('ignore')

# ============================================================================
# VERIFICACI√ìN DE GPU
# ============================================================================

print("\n" + "="*80)
print("üîç VERIFICACI√ìN DEL ENTORNO")
print("="*80)

print(f"\nüì¶ Versiones:")
print(f"   Python: {sys.version.split()[0]}")
print(f"   PyTorch: {torch.__version__}")
print(f"   CUDA disponible: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"\nüéÆ GPU Detectada:")
    print(f"   Dispositivo: {torch.cuda.get_device_name(0)}")
    print(f"   Memoria total: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("\n‚ö†Ô∏è  GPU no detectada - El entrenamiento ser√° lento")

print("\n" + "="*80)
print("‚úÖ Importaciones completadas correctamente")
print("="*80 + "\n")



üîç VERIFICACI√ìN DEL ENTORNO

üì¶ Versiones:
   Python: 3.12.12
   PyTorch: 2.8.0+cu126
   CUDA disponible: True

üéÆ GPU Detectada:
   Dispositivo: Tesla T4
   Memoria total: 15.83 GB

‚úÖ Importaciones completadas correctamente



In [4]:
# ============================================================================
# ‚öôÔ∏è  CONFIGURACI√ìN PRINCIPAL - MODIFICA AQU√ç
# ============================================================================

class MultimodalConfig:
    """
    Configuraci√≥n para clasificaci√≥n multivariable:
    Transformer (texto) + Features num√©ricas + Features categ√≥ricas
    """

    # ========================================================================
    # üéØ SELECCI√ìN DEL MODELO - CAMBIA SOLO ESTA L√çNEA
    # ========================================================================

    # MODEL_NAME = "FacebookAI/xlm-roberta-base"  # Opci√≥n 1: XLM-RoBERTa
    # MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"  # Opci√≥n 2: BETO
    MODEL_NAME = "bertin-project/bertin-roberta-base-spanish"

    # ========================================================================
    # üìÇ RUTAS DE DATOS
    # ========================================================================

    DATA_PATH = "/content/drive/MyDrive/classification_coding_open_ended_occupational_responses_ENAHO/CLEAN_DATA/BASE_LIMPIA_VF.parquet"  # Cambia esta ruta
    BASE_OUTPUT_DIR = "/content/drive/MyDrive/classification_coding_open_ended_occupational_responses_ENAHO/05_BERTIN"

    # ========================================================================
    # üìä COLUMNAS DEL DATASET
    # ========================================================================

    # Variable objetivo
    TARGET_COLUMN = "p505r4"

    # Variable de texto
    TEXT_COLUMN = "texto_final"

    # Variables num√©ricas
    NUMERIC_FEATURES = ["p208a"]  # edad

    # Variables categ√≥ricas (ya en formato num√©rico)
    CATEGORICAL_FEATURES = [
        "p301a",  # nivel_educativo
        "p507"    # desempe√±o
    ]

    # N√∫mero de categor√≠as √∫nicas
    # Se calcular√° autom√°ticamente si se deja en None
    CATEGORICAL_CARDINALITIES = {
        "p301a": None,  # Se calcular√° autom√°ticamente
        "p507": None    # Se calcular√° autom√°ticamente
    }

    # ========================================================================
    # üéõÔ∏è  HIPERPAR√ÅMETROS DEL MODELO
    # ========================================================================

    # Tokenizaci√≥n
    MAX_LENGTH = 128

    # Dimensiones de embeddings para categ√≥ricas
    CATEGORICAL_EMBEDDING_DIM = 16  # Dimensi√≥n de los embeddings categ√≥ricos

    # Arquitectura de la capa de fusi√≥n
    FUSION_HIDDEN_DIM = 256  # Dimensi√≥n oculta para fusionar features
    DROPOUT_RATE = 0.3       # Dropout para regularizaci√≥n

    # Entrenamiento
    BATCH_SIZE = 16
    LEARNING_RATE = 2e-5
    NUM_EPOCHS = 3
    WARMUP_STEPS = 500
    WEIGHT_DECAY = 0.01

    # Divisi√≥n de datos
    TEST_SIZE = 0.15
    VAL_SIZE = 0.15
    RANDOM_STATE = 2025

    # Filtrado
    MIN_SAMPLES_PER_CLASS = 10

    # Early stopping
    EARLY_STOPPING_PATIENCE = 3

    # ========================================================================
    # üîß CONFIGURACI√ìN AUTOM√ÅTICA
    # ========================================================================

    def __init__(self):
        """Inicializa configuraci√≥n"""
        # Detectar tipo de modelo
        if "roberta" in self.MODEL_NAME.lower():
            self.model_type = "xlm-roberta"
        elif "bert" in self.MODEL_NAME.lower():
            self.model_type = "bert"
        else:
            self.model_type = "transformer"

        # Crear nombre del experimento
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        model_short_name = self.MODEL_NAME.split('/')[-1]
        self.experiment_name = f"{model_short_name}_multimodal_{timestamp}"

        # Configurar directorios
        self.OUTPUT_DIR = os.path.join(self.BASE_OUTPUT_DIR, self.experiment_name)
        self.MODEL_SAVE_DIR = os.path.join(self.OUTPUT_DIR, "final_model")
        self.CHECKPOINT_DIR = os.path.join(self.OUTPUT_DIR, "checkpoints")

        # Crear directorios
        for dir_path in [self.OUTPUT_DIR, self.MODEL_SAVE_DIR, self.CHECKPOINT_DIR]:
            os.makedirs(dir_path, exist_ok=True)

        # Device
        self.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

        # Configurar logging
        self.logger = self._setup_logging()
        self.logger.info(f"Experimento iniciado: {self.experiment_name}")
        self.logger.info(f"Modelo seleccionado: {self.MODEL_NAME}")
        self.logger.info(f"Dispositivo: {self.DEVICE}")

    def _setup_logging(self):
        """Configura logging"""
        log_file = os.path.join(
            self.OUTPUT_DIR,
            f'training_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
        )

        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler(log_file, encoding='utf-8'),
                logging.StreamHandler(sys.stdout)
            ]
        )
        return logging.getLogger(__name__)

    def display_config(self):
        """Muestra la configuraci√≥n"""
        print("\n" + "="*80)
        print("‚öôÔ∏è  CONFIGURACI√ìN DEL MODELO MULTIMODAL")
        print("="*80)
        print(f"\nü§ñ Modelo base: {self.MODEL_NAME}")
        print(f"   Experimento: {self.experiment_name}")
        print(f"\nüìä Variables de entrada:")
        print(f"   Texto: {self.TEXT_COLUMN}")
        print(f"   Num√©ricas: {', '.join(self.NUMERIC_FEATURES)}")
        print(f"   Categ√≥ricas: {', '.join(self.CATEGORICAL_FEATURES)}")
        print(f"\nüéØ Variable objetivo: {self.TARGET_COLUMN}")
        print(f"\nüèóÔ∏è  Arquitectura:")
        print(f"   Embedding categ√≥rico: {self.CATEGORICAL_EMBEDDING_DIM}D")
        print(f"   Capa de fusi√≥n: {self.FUSION_HIDDEN_DIM}D")
        print(f"   Dropout: {self.DROPOUT_RATE}")
        print(f"\nüéõÔ∏è  Entrenamiento:")
        print(f"   Batch size: {self.BATCH_SIZE}")
        print(f"   Learning rate: {self.LEARNING_RATE}")
        print(f"   Epochs: {self.NUM_EPOCHS}")
        print("\n" + "="*80 + "\n")

    def save_config(self):
        """Guarda la configuraci√≥n"""
        config_dict = {
            'model_name': self.MODEL_NAME,
            'model_type': self.model_type,
            'experiment_name': self.experiment_name,
            'text_column': self.TEXT_COLUMN,
            'numeric_features': self.NUMERIC_FEATURES,
            'categorical_features': self.CATEGORICAL_FEATURES,
            'target_column': self.TARGET_COLUMN,
            'max_length': self.MAX_LENGTH,
            'categorical_embedding_dim': self.CATEGORICAL_EMBEDDING_DIM,
            'fusion_hidden_dim': self.FUSION_HIDDEN_DIM,
            'dropout_rate': self.DROPOUT_RATE,
            'batch_size': self.BATCH_SIZE,
            'learning_rate': self.LEARNING_RATE,
            'num_epochs': self.NUM_EPOCHS,
            'device': self.DEVICE,
            'timestamp': datetime.now().isoformat()
        }

        config_path = os.path.join(self.OUTPUT_DIR, 'config.json')
        with open(config_path, 'w', encoding='utf-8') as f:
            json.dump(config_dict, f, indent=2, ensure_ascii=False)

        self.logger.info(f"Configuraci√≥n guardada en: {config_path}")
        return config_path


# Inicializar configuraci√≥n
config = MultimodalConfig()
config.display_config()
config.save_config()



‚öôÔ∏è  CONFIGURACI√ìN DEL MODELO MULTIMODAL

ü§ñ Modelo base: bertin-project/bertin-roberta-base-spanish
   Experimento: bertin-roberta-base-spanish_multimodal_20251120_043339

üìä Variables de entrada:
   Texto: texto_final
   Num√©ricas: p208a
   Categ√≥ricas: p301a, p507

üéØ Variable objetivo: p505r4

üèóÔ∏è  Arquitectura:
   Embedding categ√≥rico: 16D
   Capa de fusi√≥n: 256D
   Dropout: 0.3

üéõÔ∏è  Entrenamiento:
   Batch size: 16
   Learning rate: 2e-05
   Epochs: 3




'/content/drive/MyDrive/classification_coding_open_ended_occupational_responses_ENAHO/05_BERTIN/bertin-roberta-base-spanish_multimodal_20251120_043339/config.json'

In [5]:
# ============================================================================
# üìÇ CARGA Y PREPARACI√ìN DE DATOS MULTIVARIABLES
# ============================================================================

class MultimodalDataLoader:
    """Cargador de datos para entrada multivariable"""

    def __init__(self, config):
        self.config = config
        self.logger = config.logger
        self.scaler = StandardScaler()

    def load_data(self):
        """Carga datos desde archivo"""
        try:
            self.logger.info(f"Cargando datos desde: {self.config.DATA_PATH}")

            if not os.path.exists(self.config.DATA_PATH):
                raise FileNotFoundError(
                    f"‚ùå El archivo no existe: {self.config.DATA_PATH}"
                )

            # Cargar seg√∫n extensi√≥n
            file_ext = os.path.splitext(self.config.DATA_PATH)[1].lower()

            if file_ext == '.parquet':
                df = pd.read_parquet(self.config.DATA_PATH)
            elif file_ext == '.csv':
                df = pd.read_csv(self.config.DATA_PATH)
            elif file_ext in ['.xlsx', '.xls']:
                df = pd.read_excel(self.config.DATA_PATH)
            else:
                raise ValueError(f"‚ùå Formato no soportado: {file_ext}")

            self.logger.info(f"‚úÖ Datos cargados: {df.shape[0]:,} filas x {df.shape[1]} columnas")
            return df

        except Exception as e:
            self.logger.error(f"‚ùå Error al cargar datos: {str(e)}")
            raise

    def validate_data(self, df):
        """Valida que los datos tengan todas las columnas necesarias"""
        self.logger.info("Validando estructura de datos...")

        # Verificar columnas requeridas
        required_cols = (
            [self.config.TEXT_COLUMN, self.config.TARGET_COLUMN] +
            self.config.NUMERIC_FEATURES +
            self.config.CATEGORICAL_FEATURES
        )

        missing_cols = [col for col in required_cols if col not in df.columns]

        if missing_cols:
            available_cols = list(df.columns)
            raise ValueError(
                f"‚ùå Columnas faltantes: {missing_cols}\n"
                f"   Columnas disponibles: {available_cols}\n"
                f"   Verifica la configuraci√≥n en MultimodalConfig"
            )

        # Validar datos nulos
        for col in required_cols:
            null_count = df[col].isna().sum()
            if null_count > 0:
                self.logger.warning(f"   ‚ö†Ô∏è  {col}: {null_count:,} valores nulos")

        self.logger.info("‚úÖ Validaci√≥n completada")

    def filter_valid_records(self, df):
        """Filtra registros v√°lidos"""
        self.logger.info("Filtrando registros v√°lidos...")

        initial_count = len(df)

        # Crear m√°scara de validez
        valid_mask = (
            df[self.config.TEXT_COLUMN].notna() &
            df[self.config.TARGET_COLUMN].notna() &
            (df[self.config.TEXT_COLUMN].str.strip() != '')
        )

        # Validar num√©ricas
        for col in self.config.NUMERIC_FEATURES:
            valid_mask &= df[col].notna()

        # Validar categ√≥ricas
        for col in self.config.CATEGORICAL_FEATURES:
            valid_mask &= df[col].notna()

        df_clean = df[valid_mask].copy()

        final_count = len(df_clean)
        removed = initial_count - final_count

        self.logger.info(
            f"   Registros iniciales: {initial_count:,}\n"
            f"   Registros v√°lidos: {final_count:,}\n"
            f"   Removidos: {removed:,} ({removed/initial_count*100:.2f}%)"
        )

        if final_count == 0:
            raise ValueError("‚ùå No quedan registros v√°lidos")

        return df_clean

    def calculate_categorical_cardinalities(self, df):
        """Calcula la cardinalidad de variables categ√≥ricas"""
        self.logger.info("Calculando cardinalidad de variables categ√≥ricas...")

        for col in self.config.CATEGORICAL_FEATURES:
            unique_values = df[col].nunique()
            self.config.CATEGORICAL_CARDINALITIES[col] = unique_values
            self.logger.info(f"   {col}: {unique_values} categor√≠as √∫nicas")

    def filter_rare_classes(self, df):
        """Filtra clases con pocas muestras"""
        self.logger.info(
            f"Filtrando clases con < {self.config.MIN_SAMPLES_PER_CLASS} muestras..."
        )

        class_counts = df[self.config.TARGET_COLUMN].value_counts()
        valid_classes = class_counts[class_counts >= self.config.MIN_SAMPLES_PER_CLASS].index
        df_filtered = df[df[self.config.TARGET_COLUMN].isin(valid_classes)].copy()

        self.logger.info(
            f"   Clases originales: {len(class_counts):,}\n"
            f"   Clases mantenidas: {len(valid_classes):,}\n"
            f"   Registros despu√©s: {len(df_filtered):,}"
        )

        return df_filtered

    def create_label_mapping(self, df):
        """Crea mapeo de etiquetas"""
        self.logger.info("Creando mapeo de etiquetas...")

        unique_labels = sorted(df[self.config.TARGET_COLUMN].unique())
        label2id = {label: idx for idx, label in enumerate(unique_labels)}
        id2label = {idx: label for label, idx in label2id.items()}

        df['label_id'] = df[self.config.TARGET_COLUMN].map(label2id)

        self.logger.info(
            f"‚úÖ Mapeo creado: {len(label2id)} clases (√≠ndices 0-{len(label2id)-1})"
        )

        return df, label2id, id2label

    def normalize_numeric_features(self, train_df, val_df, test_df):
        """
        Normaliza features num√©ricas usando StandardScaler
        Fit en train, transform en val y test
        """
        self.logger.info("Normalizando features num√©ricas...")

        if not self.config.NUMERIC_FEATURES:
            return train_df, val_df, test_df

        # Fit en train
        self.scaler.fit(train_df[self.config.NUMERIC_FEATURES])

        # Transform en todos
        for df_split, name in [(train_df, 'train'), (val_df, 'val'), (test_df, 'test')]:
            normalized = self.scaler.transform(df_split[self.config.NUMERIC_FEATURES])

            for i, col in enumerate(self.config.NUMERIC_FEATURES):
                df_split[f"{col}_normalized"] = normalized[:, i]

        self.logger.info("‚úÖ Features num√©ricas normalizadas")

        return train_df, val_df, test_df

    def split_data(self, df):
        """Divide datos en train, val, test"""
        self.logger.info("Dividiendo datos...")

        try:
            # Primero separar test
            train_val, test = train_test_split(
                df,
                test_size=self.config.TEST_SIZE,
                random_state=self.config.RANDOM_STATE,
                stratify=df['label_id']
            )

            # Luego separar train y val
            val_size_adjusted = self.config.VAL_SIZE / (1 - self.config.TEST_SIZE)
            train, val = train_test_split(
                train_val,
                test_size=val_size_adjusted,
                random_state=self.config.RANDOM_STATE,
                stratify=train_val['label_id']
            )

            self.logger.info(
                f"‚úÖ Divisi√≥n completada:\n"
                f"   Train: {len(train):,}\n"
                f"   Validation: {len(val):,}\n"
                f"   Test: {len(test):,}"
            )

            return train, val, test

        except ValueError as e:
            self.logger.error(f"‚ùå Error al dividir datos: {str(e)}")
            raise


# ============================================================================
# EJECUTAR CARGA DE DATOS
# ============================================================================

print("\n" + "="*80)
print("üìÇ CARGANDO Y PREPARANDO DATOS")
print("="*80 + "\n")

try:
    data_loader = MultimodalDataLoader(config)

    # Cargar
    df_raw = data_loader.load_data()

    # Validar
    data_loader.validate_data(df_raw)

    # Filtrar v√°lidos
    df_valid = data_loader.filter_valid_records(df_raw)

    # Calcular cardinalidades
    data_loader.calculate_categorical_cardinalities(df_valid)

    # Filtrar clases raras
    df_filtered = data_loader.filter_rare_classes(df_valid)

    # Crear mapeo
    df_final, label2id, id2label = data_loader.create_label_mapping(df_filtered)

    # Dividir
    train_df, val_df, test_df = data_loader.split_data(df_final)

    # Normalizar features num√©ricas
    train_df, val_df, test_df = data_loader.normalize_numeric_features(
        train_df, val_df, test_df
    )

    print("\n" + "="*80)
    print("‚úÖ DATOS PREPARADOS EXITOSAMENTE")
    print("="*80 + "\n")

except Exception as e:
    print("\n" + "="*80)
    print("‚ùå ERROR EN LA CARGA DE DATOS")
    print("="*80)
    print(f"\n{str(e)}\n")
    raise



üìÇ CARGANDO Y PREPARANDO DATOS






‚úÖ DATOS PREPARADOS EXITOSAMENTE



In [6]:
# ============================================================================
# üî§ DATASET MULTIMODAL
# ============================================================================

class MultimodalDataset(Dataset):
    """
    Dataset que combina:
    - Texto tokenizado
    - Features num√©ricas normalizadas
    - Features categ√≥ricas
    """

    def __init__(self, dataframe, tokenizer, config, is_train=False):
        """
        Args:
            dataframe: DataFrame con todos los datos
            tokenizer: Tokenizer de HuggingFace
            config: Configuraci√≥n del modelo
            is_train: Si es True, puede aplicar augmentation
        """
        self.df = dataframe.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.config = config
        self.is_train = is_train

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]

        # 1. TEXTO - Tokenizar
        text = str(row[self.config.TEXT_COLUMN])
        encoding = self.tokenizer(
            text,
            max_length=self.config.MAX_LENGTH,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # 2. FEATURES NUM√âRICAS - Normalizadas
        numeric_features = []
        for col in self.config.NUMERIC_FEATURES:
            normalized_col = f"{col}_normalized"
            value = row[normalized_col] if normalized_col in row else row[col]
            numeric_features.append(float(value))

        # 3. FEATURES CATEG√ìRICAS - Como √≠ndices
        categorical_features = []
        for col in self.config.CATEGORICAL_FEATURES:
            value = int(row[col])
            categorical_features.append(value)

        # 4. LABEL
        label = int(row['label_id'])

        return {
            # Texto
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),

            # Features num√©ricas
            'numeric_features': torch.tensor(numeric_features, dtype=torch.float),

            # Features categ√≥ricas
            'categorical_features': torch.tensor(categorical_features, dtype=torch.long),

            # Label
            'labels': torch.tensor(label, dtype=torch.long)
        }


print("‚úÖ Clase MultimodalDataset definida")


‚úÖ Clase MultimodalDataset definida


In [7]:
# ============================================================================
# üèóÔ∏è  MODELO MULTIMODAL: TRANSFORMER + NUMERIC + CATEGORICAL
# ============================================================================

class MultimodalTransformerClassifier(nn.Module):
    """
    Arquitectura h√≠brida que combina:
    1. Transformer pre-entrenado para texto
    2. Embeddings para features categ√≥ricas
    3. Features num√©ricas directas
    4. Capa de fusi√≥n para combinar todo
    """

    def __init__(self, config, model_name, num_labels, categorical_cardinalities):
        super().__init__()

        self.config = config
        self.num_labels = num_labels

        # 1. TRANSFORMER BASE (solo encoder)
        self.transformer = AutoModel.from_pretrained(model_name)
        self.transformer_dim = self.transformer.config.hidden_size

        # 2. EMBEDDINGS PARA CATEG√ìRICAS
        self.categorical_embeddings = nn.ModuleList([
            nn.Embedding(
                num_embeddings=categorical_cardinalities[cat_name] + 1,  # +1 por seguridad
                embedding_dim=config.CATEGORICAL_EMBEDDING_DIM
            )
            for cat_name in config.CATEGORICAL_FEATURES
        ])

        # 3. DIMENSIONES
        self.num_numeric = len(config.NUMERIC_FEATURES)
        self.num_categorical = len(config.CATEGORICAL_FEATURES)
        self.total_categorical_dim = self.num_categorical * config.CATEGORICAL_EMBEDDING_DIM

        # Dimensi√≥n total de entrada a la capa de fusi√≥n
        self.fusion_input_dim = (
            self.transformer_dim +      # Del transformer
            self.num_numeric +          # Features num√©ricas
            self.total_categorical_dim  # Embeddings categ√≥ricos
        )

        # 4. CAPAS DE FUSI√ìN
        self.fusion_layers = nn.Sequential(
            nn.Linear(self.fusion_input_dim, config.FUSION_HIDDEN_DIM),
            nn.LayerNorm(config.FUSION_HIDDEN_DIM),
            nn.ReLU(),
            nn.Dropout(config.DROPOUT_RATE),

            nn.Linear(config.FUSION_HIDDEN_DIM, config.FUSION_HIDDEN_DIM // 2),
            nn.LayerNorm(config.FUSION_HIDDEN_DIM // 2),
            nn.ReLU(),
            nn.Dropout(config.DROPOUT_RATE),
        )

        # 5. CLASIFICADOR
        self.classifier = nn.Linear(config.FUSION_HIDDEN_DIM // 2, num_labels)

        # 6. DROPOUT
        self.dropout = nn.Dropout(config.DROPOUT_RATE)

    def forward(self, input_ids, attention_mask, numeric_features,
                categorical_features, labels=None):
        """
        Forward pass

        Args:
            input_ids: [batch_size, seq_len]
            attention_mask: [batch_size, seq_len]
            numeric_features: [batch_size, num_numeric]
            categorical_features: [batch_size, num_categorical]
            labels: [batch_size] (opcional)
        """
        # 1. PROCESAR TEXTO CON TRANSFORMER
        transformer_outputs = self.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # Usar [CLS] token (primer token) como representaci√≥n del texto
        text_representation = transformer_outputs.last_hidden_state[:, 0, :]  # [batch_size, transformer_dim]
        text_representation = self.dropout(text_representation)

        # 2. PROCESAR FEATURES CATEG√ìRICAS
        categorical_embeddings = []
        for i, embedding_layer in enumerate(self.categorical_embeddings):
            cat_values = categorical_features[:, i]  # [batch_size]
            embedded = embedding_layer(cat_values)    # [batch_size, embedding_dim]
            categorical_embeddings.append(embedded)

        # Concatenar todos los embeddings categ√≥ricos
        categorical_representation = torch.cat(categorical_embeddings, dim=1)  # [batch_size, total_cat_dim]

        # 3. FUSIONAR TODO
        # Concatenar: texto + num√©ricas + categ√≥ricas
        fused = torch.cat([
            text_representation,
            numeric_features,
            categorical_representation
        ], dim=1)  # [batch_size, fusion_input_dim]

        # 4. PASAR POR CAPAS DE FUSI√ìN
        fused = self.fusion_layers(fused)  # [batch_size, fusion_hidden_dim // 2]

        # 5. CLASIFICAR
        logits = self.classifier(fused)  # [batch_size, num_labels]

        # 6. CALCULAR LOSS SI HAY LABELS
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits, labels)

        # Retornar en formato compatible con Trainer
        return {
            'loss': loss,
            'logits': logits
        }


print("‚úÖ Clase MultimodalTransformerClassifier definida")
print("\nüìä Arquitectura del modelo:")
print("   1. Transformer (texto) ‚Üí Representaci√≥n densa")
print("   2. Embeddings (categ√≥ricas) ‚Üí Vectores densos")
print("   3. Features num√©ricas ‚Üí Valores normalizados")
print("   4. Fusi√≥n ‚Üí Capas densas con dropout")
print("   5. Clasificador ‚Üí Predicci√≥n final")


‚úÖ Clase MultimodalTransformerClassifier definida

üìä Arquitectura del modelo:
   1. Transformer (texto) ‚Üí Representaci√≥n densa
   2. Embeddings (categ√≥ricas) ‚Üí Vectores densos
   3. Features num√©ricas ‚Üí Valores normalizados
   4. Fusi√≥n ‚Üí Capas densas con dropout
   5. Clasificador ‚Üí Predicci√≥n final


In [8]:
# ============================================================================
# üî§ INICIALIZAR TOKENIZER Y DATASETS
# ============================================================================

print("\n" + "="*80)
print("üî§ INICIALIZANDO TOKENIZER Y DATASETS")
print("="*80 + "\n")

try:
    # Cargar tokenizer
    config.logger.info(f"Cargando tokenizer: {config.MODEL_NAME}")
    tokenizer = AutoTokenizer.from_pretrained(config.MODEL_NAME)

    print(f"‚úÖ Tokenizer cargado: {config.MODEL_NAME}")
    print(f"   Vocabulario: {len(tokenizer):,} tokens")

    # Crear datasets
    config.logger.info("Creando datasets multimodales...")

    train_dataset = MultimodalDataset(
        dataframe=train_df,
        tokenizer=tokenizer,
        config=config,
        is_train=True
    )

    val_dataset = MultimodalDataset(
        dataframe=val_df,
        tokenizer=tokenizer,
        config=config,
        is_train=False
    )

    test_dataset = MultimodalDataset(
        dataframe=test_df,
        tokenizer=tokenizer,
        config=config,
        is_train=False
    )

    print(f"\n‚úÖ Datasets multimodales creados:")
    print(f"   Train: {len(train_dataset):,} ejemplos")
    print(f"   Validation: {len(val_dataset):,} ejemplos")
    print(f"   Test: {len(test_dataset):,} ejemplos")

    # Verificar un ejemplo
    sample = train_dataset[0]
    print(f"\nüìù Ejemplo de muestra:")
    print(f"   Input IDs shape: {sample['input_ids'].shape}")
    print(f"   Attention mask shape: {sample['attention_mask'].shape}")
    print(f"   Numeric features shape: {sample['numeric_features'].shape}")
    print(f"   Categorical features shape: {sample['categorical_features'].shape}")
    print(f"   Label: {sample['labels'].item()}")

    print("\n" + "="*80)
    print("‚úÖ TOKENIZACI√ìN Y DATASETS COMPLETADOS")
    print("="*80 + "\n")

except Exception as e:
    print("\n" + "="*80)
    print("‚ùå ERROR EN LA TOKENIZACI√ìN")
    print("="*80)
    print(f"\n{str(e)}\n")
    raise



üî§ INICIALIZANDO TOKENIZER Y DATASETS



tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

‚úÖ Tokenizer cargado: bertin-project/bertin-roberta-base-spanish
   Vocabulario: 50,262 tokens

‚úÖ Datasets multimodales creados:
   Train: 220,882 ejemplos
   Validation: 47,332 ejemplos
   Test: 47,332 ejemplos

üìù Ejemplo de muestra:
   Input IDs shape: torch.Size([128])
   Attention mask shape: torch.Size([128])
   Numeric features shape: torch.Size([1])
   Categorical features shape: torch.Size([2])
   Label: 319

‚úÖ TOKENIZACI√ìN Y DATASETS COMPLETADOS



In [9]:
# ============================================================================
# üìä FUNCIONES DE M√âTRICAS DETALLADAS
# ============================================================================

def compute_detailed_metrics(eval_pred):
    """
    Calcula m√©tricas completas: Accuracy, Precision, Recall, F1
    Con variantes: macro, micro y weighted
    """
    predictions = eval_pred.predictions
    labels = eval_pred.label_ids

    # Obtener predicciones
    if predictions.ndim > 1:
        preds = np.argmax(predictions, axis=1)
    else:
        preds = predictions

    # Accuracy
    accuracy = accuracy_score(labels, preds)

    # Macro
    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
        labels, preds, average='macro', zero_division=0
    )

    # Micro
    precision_micro, recall_micro, f1_micro, _ = precision_recall_fscore_support(
        labels, preds, average='micro', zero_division=0
    )

    # Weighted
    precision_weighted, recall_weighted, f1_weighted, _ = precision_recall_fscore_support(
        labels, preds, average='weighted', zero_division=0
    )

    return {
        'accuracy': accuracy,
        'f1_macro': f1_macro,
        'f1_micro': f1_micro,
        'f1_weighted': f1_weighted,
        'precision_macro': precision_macro,
        'precision_micro': precision_micro,
        'precision_weighted': precision_weighted,
        'recall_macro': recall_macro,
        'recall_micro': recall_micro,
        'recall_weighted': recall_weighted,
    }



def display_metrics(metrics, title="M√©tricas"):
    """Muestra las m√©tricas de forma organizada"""
    print("\n" + "="*80)
    print(f"üìä {title.upper()}")
    print("="*80 + "\n")

    # Mostrar Loss si existe
    loss = (
        metrics.get('test_loss') or
        metrics.get('eval_loss') or
        metrics.get('loss', None)
    )
    if loss is not None:
        print(f"üí• LOSS: {loss:.4f}")

    # Accuracy
    acc = (
        metrics.get('test_accuracy') or
        metrics.get('eval_accuracy') or
        metrics.get('accuracy', 0)
    )
    print(f"üéØ ACCURACY: {acc:.4f}")
    print("\n" + "-"*80)

    # Tabla
    print(f"\n{'M√©trica':<20} {'Macro':>12} {'Micro':>12} {'Weighted':>12}")
    print("-"*60)

    def get_m(name):
        return (
            metrics.get(f'test_{name}') or
            metrics.get(f'eval_{name}') or
            metrics.get(name, 0)
        )

    print(f"{'F1 Score':<20} {get_m('f1_macro'):>12.4f} {get_m('f1_micro'):>12.4f} {get_m('f1_weighted'):>12.4f}")
    print(f"{'Precision':<20} {get_m('precision_macro'):>12.4f} {get_m('precision_micro'):>12.4f} {get_m('precision_weighted'):>12.4f}")
    print(f"{'Recall':<20} {get_m('recall_macro'):>12.4f} {get_m('recall_micro'):>12.4f} {get_m('recall_weighted'):>12.4f}")

    print("\n" + "="*80 + "\n")


print("‚úÖ Funciones de m√©tricas cargadas (con loss incluido)")

‚úÖ Funciones de m√©tricas cargadas (con loss incluido)


In [10]:
# ============================================================================
# ü§ñ INICIALIZAR MODELO MULTIMODAL
# ============================================================================

print("\n" + "="*80)
print("ü§ñ INICIALIZANDO MODELO MULTIMODAL")
print("="*80 + "\n")

try:
    config.logger.info(f"Cargando modelo: {config.MODEL_NAME}")

    # Crear modelo multimodal
    model = MultimodalTransformerClassifier(
        config=config,
        model_name=config.MODEL_NAME,
        num_labels=len(label2id),
        categorical_cardinalities=config.CATEGORICAL_CARDINALITIES
    )

    # Mover a GPU si est√° disponible
    model.to(config.DEVICE)

    # Informaci√≥n del modelo
    num_params = sum(p.numel() for p in model.parameters())
    num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(f"‚úÖ Modelo multimodal cargado")
    print(f"   Modelo base: {config.MODEL_NAME}")
    print(f"   N√∫mero de clases: {len(label2id)}")
    print(f"   Par√°metros totales: {num_params:,}")
    print(f"   Par√°metros entrenables: {num_trainable:,}")
    print(f"   Dispositivo: {config.DEVICE}")

    # Informaci√≥n de arquitectura
    print(f"\nüèóÔ∏è  Arquitectura:")
    print(f"   Transformer dim: {model.transformer_dim}")
    print(f"   Num. features num√©ricas: {model.num_numeric}")
    print(f"   Num. features categ√≥ricas: {model.num_categorical}")
    print(f"   Dim. embeddings categ√≥ricos: {config.CATEGORICAL_EMBEDDING_DIM}")
    print(f"   Dim. entrada fusi√≥n: {model.fusion_input_dim}")
    print(f"   Dim. oculta fusi√≥n: {config.FUSION_HIDDEN_DIM}")

    if torch.cuda.is_available():
        print(f"\n   Memoria GPU asignada: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")

    config.logger.info(f"Modelo inicializado con {num_params:,} par√°metros")

    print("\n" + "="*80)
    print("‚úÖ MODELO LISTO PARA ENTRENAMIENTO")
    print("="*80 + "\n")

except Exception as e:
    print("\n" + "="*80)
    print("‚ùå ERROR AL CARGAR EL MODELO")
    print("="*80)
    print(f"\n{str(e)}\n")
    raise



ü§ñ INICIALIZANDO MODELO MULTIMODAL



config.json:   0%|          | 0.00/674 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at bertin-project/bertin-roberta-base-spanish and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


‚úÖ Modelo multimodal cargado
   Modelo base: bertin-project/bertin-roberta-base-spanish
   N√∫mero de clases: 357
   Par√°metros totales: 124,928,693
   Par√°metros entrenables: 124,928,693
   Dispositivo: cuda

üèóÔ∏è  Arquitectura:
   Transformer dim: 768
   Num. features num√©ricas: 1
   Num. features categ√≥ricas: 2
   Dim. embeddings categ√≥ricos: 16
   Dim. entrada fusi√≥n: 801
   Dim. oculta fusi√≥n: 256

   Memoria GPU asignada: 0.50 GB

‚úÖ MODELO LISTO PARA ENTRENAMIENTO



In [11]:
# ============================================================================
# üîß CUSTOM TRAINER PARA MODELO MULTIMODAL
# ============================================================================

class MultimodalTrainer(Trainer):
    """
    Trainer personalizado que maneja inputs multimodales
    (texto + num√©ricas + categ√≥ricas)
    """

    def compute_loss(self, model, inputs, return_outputs=False,**kwargs):
        """
        Calcula la p√©rdida pasando todos los inputs al modelo
        """
        # Extraer inputs
        input_ids = inputs.get('input_ids')
        attention_mask = inputs.get('attention_mask')
        numeric_features = inputs.get('numeric_features')
        categorical_features = inputs.get('categorical_features')
        labels = inputs.get('labels')

        # Forward pass
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            numeric_features=numeric_features,
            categorical_features=categorical_features,
            labels=labels
        )

        loss = outputs['loss']

        return (loss, outputs) if return_outputs else loss

    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        """
        Realiza un paso de predicci√≥n
        """
        # Extraer inputs
        input_ids = inputs.get('input_ids')
        attention_mask = inputs.get('attention_mask')
        numeric_features = inputs.get('numeric_features')
        categorical_features = inputs.get('categorical_features')
        labels = inputs.get('labels')

        device = self.args.device

        # Mover a device
        if input_ids is not None:
            input_ids = input_ids.to(device)
        if attention_mask is not None:
            attention_mask = attention_mask.to(device)
        if numeric_features is not None:
            numeric_features = numeric_features.to(device)
        if categorical_features is not None:
            categorical_features = categorical_features.to(device)
        if labels is not None:
            labels = labels.to(device)

        # Forward pass
        with torch.no_grad():
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                numeric_features=numeric_features,
                categorical_features=categorical_features,
                labels=labels
            )

        loss = outputs['loss']
        logits = outputs['logits']

        if prediction_loss_only:
            return (loss, None, None)

        return (loss, logits, labels)


print("‚úÖ Clase MultimodalTrainer definida")


‚úÖ Clase MultimodalTrainer definida


In [12]:
# ============================================================================
# ‚öôÔ∏è  CONFIGURACI√ìN DEL ENTRENAMIENTO
# ============================================================================

print("\n" + "="*80)
print("‚öôÔ∏è  CONFIGURANDO ENTRENAMIENTO")
print("="*80 + "\n")

# Configuraci√≥n de argumentos
training_args = TrainingArguments(
    output_dir=config.CHECKPOINT_DIR,
    logging_dir=os.path.join(config.OUTPUT_DIR, 'logs'),

    # Hiperpar√°metros
    learning_rate=config.LEARNING_RATE,
    per_device_train_batch_size=config.BATCH_SIZE,
    per_device_eval_batch_size=config.BATCH_SIZE,
    num_train_epochs=config.NUM_EPOCHS,
    warmup_steps=config.WARMUP_STEPS,
    weight_decay=config.WEIGHT_DECAY,

    # Evaluaci√≥n
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_weighted",
    greater_is_better=True,

    # Logging
    logging_steps=100,
    logging_strategy="steps",

    # Optimizaci√≥n
    fp16=torch.cuda.is_available(),
    gradient_accumulation_steps=1,

    # Otros
    seed=config.RANDOM_STATE,
    report_to="none",
    disable_tqdm=False,
)

# Inicializar Trainer
trainer = MultimodalTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_detailed_metrics,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=config.EARLY_STOPPING_PATIENCE
        )
    ]
)

print("‚úÖ Configuraci√≥n:")
print(f"   Learning rate: {config.LEARNING_RATE}")
print(f"   Batch size: {config.BATCH_SIZE}")
print(f"   Epochs: {config.NUM_EPOCHS}")
print(f"   Early stopping: {config.EARLY_STOPPING_PATIENCE} epochs")
print(f"   FP16: {training_args.fp16}")
print(f"   M√©trica principal: f1_weighted")

print("\n" + "="*80)
print("‚úÖ TRAINER CONFIGURADO Y LISTO")
print("="*80 + "\n")



‚öôÔ∏è  CONFIGURANDO ENTRENAMIENTO

‚úÖ Configuraci√≥n:
   Learning rate: 2e-05
   Batch size: 16
   Epochs: 3
   Early stopping: 3 epochs
   FP16: True
   M√©trica principal: f1_weighted

‚úÖ TRAINER CONFIGURADO Y LISTO



In [13]:
# ============================================================================
# üöÄ ENTRENAMIENTO DEL MODELO MULTIMODAL
# ============================================================================

print("\n" + "="*80)
print("üöÄ INICIANDO ENTRENAMIENTO")
print("="*80)
print(f"\nModelo: {config.MODEL_NAME} (Multimodal)")
print(f"Entradas: Texto + {len(config.NUMERIC_FEATURES)} num√©ricas + {len(config.CATEGORICAL_FEATURES)} categ√≥ricas")
print(f"Datos de entrenamiento: {len(train_dataset):,} ejemplos")
print(f"Datos de validaci√≥n: {len(val_dataset):,} ejemplos")
print(f"\n‚è±Ô∏è  Esto puede tomar tiempo...")
print("="*80 + "\n")

try:
    start_time = datetime.now()
    config.logger.info("Iniciando entrenamiento multimodal...")

    # ENTRENAR
    train_result = trainer.train()

    end_time = datetime.now()
    training_time = end_time - start_time

    config.logger.info(f"Entrenamiento completado en {training_time}")

    print("\n" + "="*80)
    print("‚úÖ ENTRENAMIENTO COMPLETADO")
    print("="*80)
    print(f"\nTiempo total: {training_time}")
    print(f"Training loss: {train_result.training_loss:.4f}")

    # Evaluar en validation
    print("\n" + "-"*80)
    print("üìä Evaluando en conjunto de validaci√≥n...")
    val_metrics = trainer.evaluate()
    display_metrics(val_metrics, "M√©tricas de Validaci√≥n")

    print("="*80 + "\n")

except KeyboardInterrupt:
    print("\n‚ö†Ô∏è  ENTRENAMIENTO INTERRUMPIDO")
    raise

except Exception as e:
    print("\n‚ùå ERROR DURANTE EL ENTRENAMIENTO")
    print(f"\n{str(e)}\n")
    config.logger.error(f"Error: {str(e)}", exc_info=True)
    raise



üöÄ INICIANDO ENTRENAMIENTO

Modelo: bertin-project/bertin-roberta-base-spanish (Multimodal)
Entradas: Texto + 1 num√©ricas + 2 categ√≥ricas
Datos de entrenamiento: 220,882 ejemplos
Datos de validaci√≥n: 47,332 ejemplos

‚è±Ô∏è  Esto puede tomar tiempo...



Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,F1 Micro,F1 Weighted,Precision Macro,Precision Micro,Precision Weighted,Recall Macro,Recall Micro,Recall Weighted
1,0.7295,0.607811,0.9003,0.29135,0.9003,0.878316,0.30607,0.9003,0.866342,0.302301,0.9003,0.9003


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,F1 Micro,F1 Weighted,Precision Macro,Precision Micro,Precision Weighted,Recall Macro,Recall Micro,Recall Weighted
1,0.7295,0.607811,0.9003,0.29135,0.9003,0.878316,0.30607,0.9003,0.866342,0.302301,0.9003,0.9003
2,0.4941,0.429896,0.924575,0.398793,0.924575,0.912239,0.415329,0.924575,0.906371,0.404712,0.924575,0.924575
3,0.392,0.395263,0.930237,0.427094,0.930237,0.920032,0.444322,0.930237,0.915408,0.433148,0.930237,0.930237



‚úÖ ENTRENAMIENTO COMPLETADO

Tiempo total: 1:31:39.519969
Training loss: 0.8570

--------------------------------------------------------------------------------
üìä Evaluando en conjunto de validaci√≥n...



üìä M√âTRICAS DE VALIDACI√ìN

üí• LOSS: 0.3953
üéØ ACCURACY: 0.9302

--------------------------------------------------------------------------------

M√©trica                     Macro        Micro     Weighted
------------------------------------------------------------
F1 Score                   0.4271       0.9302       0.9200
Precision                  0.4443       0.9302       0.9154
Recall                     0.4331       0.9302       0.9302





In [14]:
# ============================================================================
# üß™ EVALUACI√ìN EN TEST SET
# ============================================================================

print("\n" + "="*80)
print("üß™ EVALUACI√ìN EN TEST SET")
print("="*80 + "\n")

try:
    config.logger.info("Evaluando en test set...")

    # Obtener predicciones
    test_predictions = trainer.predict(test_dataset)
    test_metrics = test_predictions.metrics

    # Mostrar m√©tricas
    display_metrics(test_metrics, "M√©tricas de Test (Evaluaci√≥n Final)")

    # Guardar m√©tricas
    metrics_file = os.path.join(config.OUTPUT_DIR, 'test_metrics.json')
    with open(metrics_file, 'w', encoding='utf-8') as f:
        json.dump(test_metrics, f, indent=2)

    print(f"‚úÖ M√©tricas guardadas en: {metrics_file}")

    # An√°lisis detallado
    print("\n" + "="*80)
    print("üìà AN√ÅLISIS DETALLADO")
    print("="*80 + "\n")

    y_pred = np.argmax(test_predictions.predictions, axis=1)
    y_true = test_predictions.label_ids

    # Reporte de clasificaci√≥n
    target_names = [id2label[i] for i in range(len(id2label))]
    class_report = classification_report(
        y_true,
        y_pred,
        target_names=target_names,
        zero_division=0,
        digits=4
    )
    print(class_report)

    # Guardar reporte
    report_file = os.path.join(config.OUTPUT_DIR, 'classification_report.txt')
    with open(report_file, 'w', encoding='utf-8') as f:
        f.write("REPORTE DE CLASIFICACI√ìN MULTIMODAL - TEST SET\n")
        f.write("="*80 + "\n\n")
        f.write(f"Modelo: {config.MODEL_NAME} (Multimodal)\n")
        f.write(f"Features: Texto + {config.NUMERIC_FEATURES} + {config.CATEGORICAL_FEATURES}\n")
        f.write(f"Fecha: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write("\n" + "="*80 + "\n\n")
        f.write(class_report)

    print(f"\n‚úÖ Reporte guardado en: {report_file}")

    # An√°lisis de errores
    incorrect_mask = y_pred != y_true
    num_incorrect = incorrect_mask.sum()

    print("\n" + "="*80)
    print("üîç AN√ÅLISIS DE ERRORES")
    print("="*80 + "\n")
    print(f"Total: {len(y_true):,}")
    print(f"Correctas: {(~incorrect_mask).sum():,}")
    print(f"Incorrectas: {num_incorrect:,} ({num_incorrect/len(y_true)*100:.2f}%)")

    # Guardar errores
    if num_incorrect > 0:
        errors_df = test_df[incorrect_mask].copy()
        errors_df['predicted_label'] = [id2label[pred] for pred in y_pred[incorrect_mask]]
        errors_df['true_label'] = [id2label[true] for true in y_true[incorrect_mask]]

        probs = torch.nn.functional.softmax(torch.tensor(test_predictions.predictions), dim=-1)
        max_probs = probs.max(dim=-1).values.numpy()
        errors_df['confidence'] = max_probs[incorrect_mask]

        errors_file = os.path.join(config.OUTPUT_DIR, 'error_analysis.csv')
        errors_df.to_csv(errors_file, index=False, encoding='utf-8')
        print(f"\n‚úÖ An√°lisis de errores guardado en: {errors_file}")

    print("\n" + "="*80 + "\n")

except Exception as e:
    print("\n‚ùå ERROR EN LA EVALUACI√ìN")
    print(f"\n{str(e)}\n")
    raise



üß™ EVALUACI√ìN EN TEST SET




üìä M√âTRICAS DE TEST (EVALUACI√ìN FINAL)

üí• LOSS: 0.4005
üéØ ACCURACY: 0.9287

--------------------------------------------------------------------------------

M√©trica                     Macro        Micro     Weighted
------------------------------------------------------------
F1 Score                   0.4271       0.9287       0.9188
Precision                  0.4466       0.9287       0.9145
Recall                     0.4314       0.9287       0.9287


‚úÖ M√©tricas guardadas en: /content/drive/MyDrive/classification_coding_open_ended_occupational_responses_ENAHO/05_BERTIN/bertin-roberta-base-spanish_multimodal_20251120_043339/test_metrics.json

üìà AN√ÅLISIS DETALLADO

              precision    recall  f1-score   support

        0111     0.0000    0.0000    0.0000         3
        0112     0.0000    0.0000    0.0000         2
        0120     0.0000    0.0000    0.0000         9
        0211     1.0000    0.1111    0.2000         9
        0212     0.0000    0.0000 

In [15]:
# # ============================================================================
# # üíæ GUARDADO COMPLETO DEL MODELO Y ARTEFACTOS
# # ============================================================================

# print("\n" + "="*80)
# print("üíæ GUARDANDO MODELO Y ARTEFACTOS")
# print("="*80 + "\n")

# try:
#     # 1. Guardar modelo completo
#     torch.save({
#         'model_state_dict': model.state_dict(),
#         'config': {
#             'model_name': config.MODEL_NAME,
#             'num_labels': len(label2id),
#             'categorical_cardinalities': config.CATEGORICAL_CARDINALITIES,
#             'categorical_embedding_dim': config.CATEGORICAL_EMBEDDING_DIM,
#             'fusion_hidden_dim': config.FUSION_HIDDEN_DIM,
#             'dropout_rate': config.DROPOUT_RATE,
#         }
#     }, os.path.join(config.MODEL_SAVE_DIR, 'pytorch_model.bin'))

#     # 2. Guardar tokenizer
#     tokenizer.save_pretrained(config.MODEL_SAVE_DIR)

#     print(f"‚úÖ Modelo guardado en: {config.MODEL_SAVE_DIR}")

#     # 3. Guardar artefactos
#     artifacts = {
#         'label2id': label2id,
#         'id2label': id2label,
#         'num_labels': len(label2id),
#         'model_name': config.MODEL_NAME,
#         'model_type': config.model_type,
#         'text_column': config.TEXT_COLUMN,
#         'numeric_features': config.NUMERIC_FEATURES,
#         'categorical_features': config.CATEGORICAL_FEATURES,
#         'categorical_cardinalities': config.CATEGORICAL_CARDINALITIES,
#         'target_column': config.TARGET_COLUMN,
#         'max_length': config.MAX_LENGTH,
#         'categorical_embedding_dim': config.CATEGORICAL_EMBEDDING_DIM,
#         'fusion_hidden_dim': config.FUSION_HIDDEN_DIM,
#         'dropout_rate': config.DROPOUT_RATE,
#         'test_metrics': test_metrics,
#         'training_date': datetime.now().isoformat(),
#         'scaler': data_loader.scaler,  # Guardar el scaler
#     }

#     artifacts_file = os.path.join(config.OUTPUT_DIR, 'artifacts.pkl')
#     with open(artifacts_file, 'wb') as f:
#         pickle.dump(artifacts, f)

#     print(f"‚úÖ Artefactos guardados en: {artifacts_file}")

#     # 4. Crear README
#     readme_content = f"""# Modelo Multimodal: {config.experiment_name}

# ## Informaci√≥n del Modelo
# - **Modelo Base**: {config.MODEL_NAME}
# - **Tipo**: Multimodal (Texto + Num√©ricas + Categ√≥ricas)
# - **N√∫mero de Clases**: {len(label2id)}
# - **Fecha de Entrenamiento**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

# ## Features de Entrada
# 1. **Texto**: {config.TEXT_COLUMN}
# 2. **Num√©ricas**: {', '.join(config.NUMERIC_FEATURES)}
# 3. **Categ√≥ricas**: {', '.join(config.CATEGORICAL_FEATURES)}

# ## Arquitectura
# - Transformer dim: {model.transformer_dim}
# - Embedding categ√≥rico: {config.CATEGORICAL_EMBEDDING_DIM}D
# - Fusion hidden: {config.FUSION_HIDDEN_DIM}D
# - Dropout: {config.DROPOUT_RATE}

# ## Resultados (Test Set)
# - **Accuracy**: {test_metrics.get('test_accuracy', test_metrics.get('eval_accuracy', 0)):.4f}
# - **F1 Weighted**: {test_metrics.get('test_f1_weighted', test_metrics.get('eval_f1_weighted', 0)):.4f}
# - **F1 Macro**: {test_metrics.get('test_f1_macro', test_metrics.get('eval_f1_macro', 0)):.4f}

# ## Archivos
# - `pytorch_model.bin`: Modelo completo
# - `artifacts.pkl`: Mapeos y metadata (incluye scaler)
# - `test_metrics.json`: M√©tricas completas
# - `classification_report.txt`: Reporte por clase
# - `error_analysis.csv`: An√°lisis de errores
# """

#     readme_file = os.path.join(config.OUTPUT_DIR, 'README.md')
#     with open(readme_file, 'w', encoding='utf-8') as f:
#         f.write(readme_content)

#     print(f"‚úÖ README creado en: {readme_file}")

#     print("\n" + "="*80)
#     print("üéâ GUARDADO COMPLETADO")
#     print("="*80 + "\n")

# except Exception as e:
#     print("\n‚ùå ERROR AL GUARDAR")
#     print(f"\n{str(e)}\n")
#     raise
