# 🎯 Model Training: Cluster-Stratified Random Forest

**Objective:** Train cluster-stratified models for Crohn's disease and ulcerative colitis (CU)

**Input:**
- `../data/processed/crohn/ml_dataset_enhanced.csv`
- `../data/processed/cu/ml_dataset_enhanced.csv`
- `../data/processed/crohn/user_clusters.csv`
- `../data/processed/cu/user_clusters.csv`

**Output:** Trained models:
- `../models/crohn/` (global + per-cluster)
- `../models/cu/` (global + per-cluster)

**Author:** Asier Ortiz García  
**Date:** November 2025

## 📦 Imports and Configuration

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import pickle
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from imblearn.over_sampling import SMOTE

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (12, 6)

# Create output directories
Path('../models/crohn').mkdir(parents=True, exist_ok=True)
Path('../models/cu').mkdir(parents=True, exist_ok=True)

print("=" * 80)
print("MODEL TRAINING: Cluster-Stratified Random Forest")
print("=" * 80)

MODEL TRAINING: Cluster-Stratified Random Forest


## 🔧 Training Functions

In [2]:
def train_cluster_stratified_models(ibd_type='crohn'):
    """
    Train a global model plus cluster-specific models for one IBD type.
    """
    print(f"\n{'='*80}")
    print(f"TRAINING MODELS: {ibd_type.upper()}")
    print(f"{'='*80}\n")

    # Load the dataset and attach each user's cluster assignment
    df = pd.read_csv(f'../data/processed/{ibd_type}/ml_dataset_enhanced.csv')
    clusters_df = pd.read_csv(f'../data/processed/{ibd_type}/user_clusters.csv')

    df = df.merge(clusters_df[['user_id', 'cluster']], on='user_id', how='left')

    print(f"✓ Dataset loaded: {len(df):,} records")
    print(f"  Users: {df['user_id'].nunique():,}")
    print(f"  Cluster distribution: {df['cluster'].value_counts().to_dict()}")
    print(f"  Risk distribution: {df['risk_level'].value_counts().to_dict()}")
    
    # Features
    exclude_cols = ['user_id', 'checkin_date', 'risk_level', 'severity_score', 'cluster',
                    'sex', 'first_checkin', 'days_since_first_checkin', 'is_flare_day',
                    'cumulative_flare_days', 'is_bad_day', 'risk_numeric']
    
    feature_cols = [col for col in df.columns if col not in exclude_cols]
    X = df[feature_cols].copy()
    y = df['risk_level'].copy()
    
    # Encode categorical
    if 'gender' in X.columns:
        X = pd.get_dummies(X, columns=['gender'], drop_first=True)
    
    # Fill missing
    for col in X.columns:
        if X[col].dtype in ['float64', 'int64']:
            # Assign back instead of using inplace=True on a column selection
            X[col] = X[col].fillna(X[col].median())
    
    print(f"\nFeatures: {len(X.columns)}")
    
    # 1. Train the global model
    print(f"\n1️⃣ Training GLOBAL model...")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    # SMOTE (applied to the training split only)
    print("  Applying SMOTE...")
    smote = SMOTE(sampling_strategy='not majority', random_state=42)
    X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
    print(f"  Before SMOTE: {len(X_train):,} | After SMOTE: {len(X_train_res):,}")

    # Train
    rf_global = RandomForestClassifier(n_estimators=200, max_depth=15, min_samples_split=10, random_state=42, n_jobs=-1)
    rf_global.fit(X_train_res, y_train_res)

    # Evaluate
    y_pred = rf_global.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n✅ Global model trained - Accuracy: {acc:.3f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

    # Save global
    model_path = f'../models/{ibd_type}/rf_severity_classifier_global.pkl'
    with open(model_path, 'wb') as f:
        pickle.dump(rf_global, f)
    print(f"💾 Saved: {model_path}")
    
    # 2. Train per-cluster models
    cluster_models = {}
    cluster_model_files = {}  # model file that actually backs each cluster
    n_clusters = df['cluster'].nunique()

    for cluster_id in range(n_clusters):
        print(f"\n2️⃣ Training CLUSTER {cluster_id} model...")
        df_cluster = df[df['cluster'] == cluster_id].copy()

        if len(df_cluster) < 50:
            print(f"  ⚠️  Too few data points ({len(df_cluster)} records), using global model")
            cluster_models[cluster_id] = rf_global
            # No cluster file is written here, so the metadata points at the global model
            cluster_model_files[f'cluster_{cluster_id}'] = 'rf_severity_classifier_global.pkl'
            continue

        X_c = df_cluster[feature_cols].copy()
        y_c = df_cluster['risk_level'].copy()

        if 'gender' in X_c.columns:
            X_c = pd.get_dummies(X_c, columns=['gender'], drop_first=True)

        for col in X_c.columns:
            if X_c[col].dtype in ['float64', 'int64']:
                X_c[col] = X_c[col].fillna(X_c[col].median())

        # Align with global features
        for col in X.columns:
            if col not in X_c.columns:
                X_c[col] = 0
        X_c = X_c[X.columns]

        X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_c, y_c, test_size=0.2, random_state=42)

        # SMOTE if enough samples
        if len(X_train_c) > 30:
            try:
                smote_c = SMOTE(sampling_strategy='not majority', random_state=42)
                X_train_c, y_train_c = smote_c.fit_resample(X_train_c, y_train_c)
            except Exception:
                # Catch Exception rather than using a bare except, so Ctrl+C still interrupts
                print("  ⚠️  SMOTE failed, using original data")

        rf_cluster = RandomForestClassifier(n_estimators=150, max_depth=12, random_state=42, n_jobs=-1)
        rf_cluster.fit(X_train_c, y_train_c)

        y_pred_c = rf_cluster.predict(X_test_c)
        acc_c = accuracy_score(y_test_c, y_pred_c)
        print(f"  ✅ Cluster {cluster_id} - Accuracy: {acc_c:.3f}")

        cluster_models[cluster_id] = rf_cluster

        # Save
        model_file = f'rf_severity_classifier_cluster_{cluster_id}.pkl'
        model_path = f'../models/{ibd_type}/{model_file}'
        with open(model_path, 'wb') as f:
            pickle.dump(rf_cluster, f)
        cluster_model_files[f'cluster_{cluster_id}'] = model_file
        print(f"  💾 Saved: {model_path}")

    # Metadata
    metadata = {
        'ibd_type': ibd_type,
        'n_clusters': n_clusters,
        'n_samples': len(df),
        'n_features': len(X.columns),
        'features': list(X.columns),
        'global_accuracy': float(acc),
        'cluster_models': cluster_model_files
    }

    with open(f'../models/{ibd_type}/cluster_models_metadata.json', 'w') as f:
        json.dump(metadata, f, indent=2)

    print(f"\n✅ {ibd_type.upper()} complete!")
    return rf_global, cluster_models, metadata

print("✓ Training function defined")

✓ Training function defined
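
The per-cluster branch of the training function wraps SMOTE in a `try`/`except`. One common reason SMOTE raises is a minority class with too few samples: imbalanced-learn's `SMOTE` defaults to `k_neighbors=5`, which needs at least six samples in every class it oversamples. The sketch below is not part of the notebook; `safe_k_neighbors` is a hypothetical helper that picks a neighbor count the smallest class can support, assuming the labels are available as a plain sequence.

```python
from collections import Counter


def safe_k_neighbors(y, default_k=5):
    """Return a k_neighbors value the smallest class can support, or None.

    Hypothetical helper (not in the notebook). None means oversampling
    should be skipped: there is only one class, or some class has a single
    sample and therefore no neighbor to interpolate with.
    """
    counts = Counter(y)
    if len(counts) < 2:
        return None
    smallest = min(counts.values())
    if smallest < 2:
        return None
    # SMOTE looks up k neighbors besides the sample itself,
    # so a class of n samples supports at most n - 1 neighbors.
    return min(default_k, smallest - 1)


labels = ["low"] * 80 + ["medium"] * 15 + ["high"] * 3
print(safe_k_neighbors(labels))  # 2
```

The returned value could be passed as `SMOTE(k_neighbors=k, ...)`, with oversampling skipped when the helper returns `None`.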


## 🔄 Train Crohn

In [3]:
rf_crohn_global, crohn_cluster_models, crohn_metadata = train_cluster_stratified_models('crohn')


TRAINING MODELS: CROHN

✓ Dataset loaded: 7,618 records
  Users: 897
  Cluster distribution: {1.0: 6612, 0.0: 292, 2.0: 123}
  Risk distribution: {'low': 5732, 'medium': 1757, 'high': 129}

Features: 34

1️⃣ Training GLOBAL model...
  Applying SMOTE...
  Before SMOTE: 6,094 | After SMOTE: 13,755

✅ Global model trained - Accuracy: 0.995

Classification Report:
              precision    recall  f1-score   support

        high       0.96      1.00      0.98        26
         low       1.00      0.99      1.00      1147
      medium       0.98      1.00      0.99       351

    accuracy                           0.99      1524
   macro avg       0.98      1.00      0.99      1524
weighted avg       0.99      0.99      0.99      1524

💾 Saved: ../models/crohn/rf_severity_classifier_global.pkl

2️⃣ Training CLUSTER 0 model...
  ✅ Cluster 0 - Accuracy: 0.966
  💾 Saved: ../models/crohn/rf_severity_classifier_cluster_0.pkl

2️⃣ Training CLUSTER 1 model...
  ✅ Cluster 1 - Accuracy: 0.995
  💾 Saved: ../models/crohn/rf_severity_classifier_cluster_1.pkl

2️⃣ Training CLUSTER 2 model...
  ⚠️  SMOTE failed, using original data
  ✅ Cluster 2 - Accuracy: 1.000
  💾 Saved: ../models/crohn/rf_severity_classifier_cluster_2.pkl

✅ CROHN complete!
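
The SMOTE sizes in the Crohn log can be checked by hand. With `sampling_strategy='not majority'`, every class except the largest is oversampled up to the majority count, so the resampled training set should hold the majority count times the number of classes. The sketch below redoes that arithmetic using only figures printed above, taking the training counts as the full class counts minus the test-set support from the classification report.

```python
# Class counts printed by the notebook for the Crohn dataset.
full_counts = {"low": 5732, "medium": 1757, "high": 129}
# Test-set support from the classification report (20% stratified split).
test_support = {"low": 1147, "medium": 351, "high": 26}

# What remains after the split is the training set.
train_counts = {c: full_counts[c] - test_support[c] for c in full_counts}
n_before = sum(train_counts.values())

# 'not majority' raises every non-majority class to the majority size.
majority = max(train_counts.values())
n_after = majority * len(train_counts)

print(train_counts)       # {'low': 4585, 'medium': 1406, 'high': 103}
print(n_before, n_after)  # 6094 13755
```

Both totals match the logged `6,094` and `13,755`. The `high` class grows from 103 real rows to 4,585, so most of its training rows are synthetic.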


## 🔄 Train Ulcerative Colitis

In [4]:
rf_cu_global, cu_cluster_models, cu_metadata = train_cluster_stratified_models('cu')


TRAINING MODELS: CU

✓ Dataset loaded: 6,860 records
  Users: 589
  Cluster distribution: {0.0: 5772, 1.0: 772, 2.0: 3}
  Risk distribution: {'low': 5676, 'medium': 1114, 'high': 70}

Features: 34

1️⃣ Training GLOBAL model...
  Applying SMOTE...
  Before SMOTE: 5,488 | After SMOTE: 13,623

✅ Global model trained - Accuracy: 0.994

Classification Report:
              precision    recall  f1-score   support

        high       1.00      0.93      0.96        14
         low       1.00      0.99      1.00      1135
      medium       0.97      1.00      0.98       223

    accuracy                           0.99      1372
   macro avg       0.99      0.97      0.98      1372
weighted avg       0.99      0.99      0.99      1372

💾 Saved: ../models/cu/rf_severity_classifier_global.pkl

2️⃣ Training CLUSTER 0 model...
  ✅ Cluster 0 - Accuracy: 0.996
  💾 Saved: ../models/cu/rf_severity_classifier_cluster_0.pkl

2️⃣ Training CLUSTER 1 model...
  ✅ Cluster 1 - Accuracy: 0.987
  💾 Saved: ../models/cu/rf_severity_classifier_cluster_1.pkl

2️⃣ Training CLUSTER 2 model...
  ⚠️  Too few data points (3 records), using global model

✅ CU complete!
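
In both runs the cluster counts add up to less than the number of records. `value_counts()` omits missing values by default, and the cluster keys print as floats, which suggests that the left merge left some rows with no cluster assignment. The logs alone cannot confirm this. The sketch below uses only the logged figures to count those rows and to list the clusters that fall under the 50-record threshold used by the training function.

```python
MIN_CLUSTER_RECORDS = 50  # threshold used in train_cluster_stratified_models

# Figures copied from the two training logs above.
runs = {
    "crohn": {"records": 7618, "clusters": {1: 6612, 0: 292, 2: 123}},
    "cu": {"records": 6860, "clusters": {0: 5772, 1: 772, 2: 3}},
}

# Rows counted in the dataset but absent from every cluster.
unassigned = {
    name: run["records"] - sum(run["clusters"].values())
    for name, run in runs.items()
}

# Clusters too small to get their own model (they reuse the global one).
fallback = {
    name: sorted(cid for cid, n in run["clusters"].items() if n < MIN_CLUSTER_RECORDS)
    for name, run in runs.items()
}

print(unassigned)  # {'crohn': 591, 'cu': 313}
print(fallback)    # {'crohn': [], 'cu': [2]}
```

If the assumption holds, those rows still contribute to the global model but to no cluster model, because the filter `df['cluster'] == cluster_id` never matches a missing value.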


## ✅ Final Summary

In [5]:
print("\n" + "="*80)
print("FINAL SUMMARY")
print("="*80)

print(f"\n📊 CROHN:")
print(f"  Models trained: 1 global + {crohn_metadata['n_clusters']} cluster-specific")
print(f"  Global accuracy: {crohn_metadata['global_accuracy']:.3f}")
print(f"  Features: {crohn_metadata['n_features']}")

print(f"\n📊 CU:")
print(f"  Models trained: 1 global + {cu_metadata['n_clusters']} cluster-specific")
print(f"  Global accuracy: {cu_metadata['global_accuracy']:.3f}")
print(f"  Features: {cu_metadata['n_features']}")

print("\n📂 Generated files:")
print(f"  - ../models/crohn/ ({1 + crohn_metadata['n_clusters']} models)")
print(f"  - ../models/cu/ ({1 + cu_metadata['n_clusters']} models)")
print("  - Metadata JSON files")

print("\n" + "="*80)
print("✅ MODEL TRAINING COMPLETE")
print("="*80)
print("\nModels ready for prediction via API!")


FINAL SUMMARY

📊 CROHN:
  Models trained: 1 global + 3 cluster-specific
  Global accuracy: 0.995
  Features: 34

📊 CU:
  Models trained: 1 global + 3 cluster-specific
  Global accuracy: 0.994
  Features: 34

📂 Generated files:
  - ../models/crohn/ (4 models)
  - ../models/cu/ (4 models)
  - Metadata JSON files

✅ MODEL TRAINING COMPLETE

Models ready for prediction via API!
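
For the prediction side, something has to choose between a cluster model and the global one. The notebook does not show that code, so the following is only a sketch under stated assumptions: `load_model_for_cluster` is a hypothetical helper, and it relies solely on the file names and the `cluster_models` metadata key written by the training function. It falls back to the global model whenever the cluster has no entry or its file is missing on disk, as happens for CU cluster 2.

```python
import json
import pickle
from pathlib import Path

GLOBAL_MODEL_FILE = "rf_severity_classifier_global.pkl"


def load_model_for_cluster(models_dir, cluster_id):
    """Return (model, file_name) for a cluster, falling back to the global model.

    Hypothetical helper, not part of the notebook. It reads
    cluster_models_metadata.json from models_dir and unpickles the chosen file.
    """
    models_dir = Path(models_dir)
    metadata = json.loads((models_dir / "cluster_models_metadata.json").read_text())

    # Metadata keys look like "cluster_0", "cluster_1", ...
    file_name = metadata.get("cluster_models", {}).get(f"cluster_{int(cluster_id)}")
    path = models_dir / file_name if file_name else None

    # Unknown cluster or missing file: use the global model instead.
    if path is None or not path.exists():
        path = models_dir / GLOBAL_MODEL_FILE

    with open(path, "rb") as f:
        return pickle.load(f), path.name
```

A call such as `model, source = load_model_for_cluster("../models/cu", 2)` would then be expected to return the global model. Input columns should follow the order stored under `features` in the same metadata file, and pickle files should only be loaded from trusted locations.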
