# üéØ Feature Engineering TEMPORAL - Predicci√≥n Profesional

**Objetivo:** Crear features para predecir riesgo **FUTURO** (evitando data leakage)

**Approach Profesional:**
- üéØ Predicci√≥n temporal: Usar d√≠a T para predecir d√≠a T+7
- üíä Integraci√≥n de medicaciones IBD (biologics, immunosuppressants, etc.)
- ‚è±Ô∏è Solo features hist√≥ricas/tendencias (NO s√≠ntomas directos)
- üìä Validaci√≥n temporal (train en fechas antiguas, test en recientes)

**Features:**
- ‚úÖ Tendencias de s√≠ntomas (√∫ltimos 7 d√≠as)
- ‚úÖ Medicaciones IBD activas
- ‚úÖ Historial de brotes
- ‚úÖ Demograf√≠a
- ‚ùå NO s√≠ntomas directos del d√≠a T

**Target:**
- `risk_level_future`: Riesgo en el d√≠a T+7

**Autor:** Asier Ortiz Garc√≠a  
**Fecha:** Noviembre 2025

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from pathlib import Path
import warnings
import json
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (14, 6)

Path('../data/processed/crohn').mkdir(parents=True, exist_ok=True)
Path('../data/processed/cu').mkdir(parents=True, exist_ok=True)

print("=" * 80)
print("FEATURE ENGINEERING TEMPORAL - PREDICCI√ìN PROFESIONAL")
print("=" * 80)

FEATURE ENGINEERING TEMPORAL - PREDICCI√ìN PROFESIONAL


## 1Ô∏è‚É£ Cargar y Preparar Datos de Medicaciones

In [2]:
# Categor√≠as de medicaciones IBD
IBD_MEDICATIONS = {
    'biologics': [
        'humira', 'remicade', 'entyvio', 'stelara', 'simponi', 'cimzia',
        'inflectra', 'xeljanz'
    ],
    'immunosuppressants': [
        'azathioprine', 'imuran', 'methotrexate', '6-mp', 'mercaptopurine',
        'cyclosporine', 'tacrolimus'
    ],
    'corticosteroids': [
        'prednisone', 'prednisolone', 'budesonide', 'entocort', 'uceris',
        'hydrocortisone', 'methylprednisolone'
    ],
    'aminosalicylates': [
        'mesalamine', 'mesalazine', 'asacol', 'pentasa', 'lialda',
        'sulfasalazine', 'balsalazide', 'colazal', 'apriso'
    ]
}

def extract_medication_features(raw_df, ibd_type='crohn'):
    '''Extrae features de medicaciones por usuario-fecha (VECTORIZED - R√ÅPIDO)'''
    print(f"\nExtrayendo medicaciones para {ibd_type.upper()}...")

    # Filtrar treatments
    treatments = raw_df[raw_df['trackable_type'] == 'Treatment'].copy()
    treatments = treatments[treatments['trackable_name'].notna()]
    treatments['checkin_date'] = pd.to_datetime(treatments['checkin_date'])

    # Vectorized approach - mucho m√°s r√°pido
    treatments['trackable_lower'] = treatments['trackable_name'].str.lower()

    for category, keywords in IBD_MEDICATIONS.items():
        pattern = '|'.join(keywords)
        treatments[f'{category}_active'] = treatments['trackable_lower'].str.contains(
            pattern, regex=True, na=False
        ).astype(int)

    # Aggregate by user_id and date (max = 1 if any medication in that category)
    med_df = treatments.groupby(['user_id', 'checkin_date']).agg({
        'biologics_active': 'max',
        'immunosuppressants_active': 'max',
        'corticosteroids_active': 'max',
        'aminosalicylates_active': 'max'
    }).reset_index()

    med_df['total_ibd_meds'] = (
        med_df['biologics_active'] +
        med_df['immunosuppressants_active'] +
        med_df['corticosteroids_active'] +
        med_df['aminosalicylates_active']
    )

    print(f"  Medicaciones extra√≠das: {len(med_df):,} registros (OPTIMIZADO)")
    print(f"  Usuarios con medicaciones: {med_df['user_id'].nunique():,}")

    return med_df

print("‚úì Funciones de medicaciones definidas (OPTIMIZADAS)")

‚úì Funciones de medicaciones definidas (OPTIMIZADAS)


## 2Ô∏è‚É£ Procesar S√≠ntomas Base + Target Temporal

In [3]:
SYMPTOM_MAPPING = {
    'abdominal_pain': ['abdominal pain', 'stomach pain', 'belly pain', 'cramping', 'pain'],
    'diarrhea': ['diarrhea', 'loose stools', 'watery stools', 'urgency'],
    'fatigue': ['fatigue', 'tired', 'exhaustion', 'tiredness', 'weakness'],
    'fever': ['fever', 'high temperature', 'chills'],
    'blood_in_stool': ['blood in stool', 'bloody stool', 'rectal bleeding', 'bleeding'],
    'nausea': ['nausea', 'nauseous', 'feeling sick', 'vomiting']
}

SYMPTOM_WEIGHTS_CROHN = {
    'abdominal_pain': 0.25, 'diarrhea': 0.25, 'fatigue': 0.15,
    'blood_in_stool': 0.20, 'fever': 0.10, 'nausea': 0.05
}

SYMPTOM_WEIGHTS_UC = {
    'abdominal_pain': 0.20, 'diarrhea': 0.25, 'fatigue': 0.15,
    'blood_in_stool': 0.25, 'fever': 0.10, 'nausea': 0.05
}

def categorize_symptom(symptom_name):
    if pd.isna(symptom_name):
        return None
    symptom_lower = str(symptom_name).lower()
    for category, keywords in SYMPTOM_MAPPING.items():
        if any(kw in symptom_lower for kw in keywords):
            return category
    return 'other'

def classify_risk(score):
    if score < 0.3:
        return 'low'
    elif score < 0.6:
        return 'medium'
    else:
        return 'high'

def process_symptoms_temporal(df, symptom_weights):
    '''Procesa s√≠ntomas y crea target TEMPORAL (T+7 d√≠as)'''
    print("\n  Procesando s√≠ntomas...")
    symptoms = df[df['trackable_type'] == 'Symptom'].copy()
    symptoms['symptom_category'] = symptoms['trackable_name'].apply(categorize_symptom)
    symptoms['value_numeric'] = pd.to_numeric(symptoms['trackable_value'], errors='coerce')
    
    symptoms_clean = symptoms[
        (symptoms['value_numeric'] >= 0) & 
        (symptoms['value_numeric'] <= 4) &
        (symptoms['symptom_category'].isin(list(SYMPTOM_MAPPING.keys())))
    ].copy()
    
    symptoms_clean['severity_normalized'] = symptoms_clean['value_numeric'] / 4.0
    
    # Aggregate daily
    daily_symptoms = symptoms_clean.groupby(
        ['user_id', 'checkin_date', 'symptom_category']
    )['severity_normalized'].max().reset_index()
    
    daily_pivot = daily_symptoms.pivot_table(
        index=['user_id', 'checkin_date'],
        columns='symptom_category',
        values='severity_normalized',
        fill_value=0.0
    ).reset_index()
    
    # Calculate severity score (for TARGET only)
    symptom_cols = [col for col in daily_pivot.columns if col in SYMPTOM_MAPPING.keys()]
    daily_pivot['severity_score'] = 0.0
    for symptom in symptom_cols:
        weight = symptom_weights.get(symptom, 0.1)
        daily_pivot['severity_score'] += daily_pivot[symptom] * weight
    
    daily_pivot['risk_level'] = daily_pivot['severity_score'].apply(classify_risk)
    
    # üéØ TARGET TEMPORAL: Shift risk to future (T+7 days)
    daily_pivot = daily_pivot.sort_values(['user_id', 'checkin_date'])
    daily_pivot['risk_level_future'] = daily_pivot.groupby('user_id')['risk_level'].shift(-7)
    
    # Remove rows without future target
    daily_pivot = daily_pivot[daily_pivot['risk_level_future'].notna()].copy()
    
    print(f"  Registros con target futuro: {len(daily_pivot):,}")
    print(f"  Distribuci√≥n target (T+7): {daily_pivot['risk_level_future'].value_counts().to_dict()}")
    
    return daily_pivot

print("‚úì Funciones de s√≠ntomas temporales definidas")

‚úì Funciones de s√≠ntomas temporales definidas


## 3Ô∏è‚É£ Crear Features (Solo Tendencias, NO S√≠ntomas Directos)

In [4]:
def create_temporal_features(df):
    '''Crea features temporales - SOLO TENDENCIAS pasadas'''
    print("\n  Creando features temporales...")
    df = df.sort_values(['user_id', 'checkin_date']).copy()
    
    # ‚úÖ Tendencias de los √∫ltimos 7 d√≠as (ANTES del d√≠a T)
    for symptom in ['abdominal_pain', 'diarrhea', 'fatigue', 'blood_in_stool']:
        if symptom in df.columns:
            df[f'{symptom}_trend_7d'] = df.groupby('user_id')[symptom].transform(
                lambda x: x.rolling(7, min_periods=1).mean()
            )
            df[f'{symptom}_volatility_7d'] = df.groupby('user_id')[symptom].transform(
                lambda x: x.rolling(7, min_periods=1).std().fillna(0)
            )
    
    # Agregaciones
    df['symptom_count_avg_7d'] = (
        (df['abdominal_pain'] > 0.2).astype(int) +
        (df['diarrhea'] > 0.2).astype(int) +
        (df['fatigue'] > 0.2).astype(int) +
        (df['blood_in_stool'] > 0).astype(int)
    )
    df['symptom_count_avg_7d'] = df.groupby('user_id')['symptom_count_avg_7d'].transform(
        lambda x: x.rolling(7, min_periods=1).mean()
    )
    
    # Red flags trend
    df['red_flag_trend_7d'] = df.groupby('user_id').apply(
        lambda g: (
            (g['blood_in_stool'] == 1.0).astype(int) * 3 +
            (g['fever'] == 1.0).astype(int) * 2 +
            (g['abdominal_pain'] >= 0.7).astype(int)
        ).rolling(7, min_periods=1).sum()
    ).reset_index(level=0, drop=True)
    
    print(f"  Features temporales creadas")
    return df

def create_history_features(df):
    '''Features de historial m√©dico'''
    print("\n  Creando features de historial...")
    df = df.copy()
    
    # Disease duration
    first_checkin = df.groupby('user_id')['checkin_date'].transform('min')
    df['disease_duration_years'] = (df['checkin_date'] - first_checkin).dt.days / 365.25
    
    # Flare history (usando risk_level del d√≠a T, NO el futuro)
    df['is_flare_day'] = (df['risk_level'] == 'high').astype(int)
    df['cumulative_flare_days'] = df.groupby('user_id')['is_flare_day'].cumsum()
    df['previous_flares'] = (df['cumulative_flare_days'] / 7).astype(int)
    
    # Days since last flare
    df['last_flare_date'] = df[df['is_flare_day'] == 1].groupby('user_id')['checkin_date'].transform('max')
    df['last_flare_date'] = df.groupby('user_id')['last_flare_date'].ffill()
    df['days_since_last_flare'] = (df['checkin_date'] - df['last_flare_date']).dt.days.fillna(365).clip(upper=365)
    
    # Derived
    df['flare_frequency'] = df['previous_flares'] / df['disease_duration_years'].clip(lower=1)
    df['recency_score'] = 1 / (1 + df['days_since_last_flare'] / 30)
    
    print(f"  Features de historial creadas")
    return df

print("‚úì Funciones de features definidas")

‚úì Funciones de features definidas


## 4Ô∏è‚É£ Funci√≥n Principal de Procesamiento

In [5]:
def process_ibd_temporal(ibd_type='crohn'):
    '''Procesa datos de IBD con approach temporal profesional'''
    print(f"\n{'='*80}")
    print(f"PROCESANDO: {ibd_type.upper()} - TEMPORAL PREDICTION")
    print(f"{'='*80}")
    
    # Load data
    df_raw = pd.read_csv('../data/raw/export.csv', low_memory=False)
    df_filtered = pd.read_csv(f'../data/processed/{ibd_type}_filtered.csv')
    df_filtered['checkin_date'] = pd.to_datetime(df_filtered['checkin_date'])
    
    print(f"\n‚úì Datos cargados: {len(df_filtered):,} registros")
    
    # 1. Process symptoms with temporal target
    weights = SYMPTOM_WEIGHTS_CROHN if ibd_type == 'crohn' else SYMPTOM_WEIGHTS_UC
    df_symptoms = process_symptoms_temporal(df_filtered, weights)
    
    # 2. Extract medications
    med_df = extract_medication_features(df_raw, ibd_type)
    
    # 3. Merge medications
    print(f"\n  Mergeando medicaciones...")
    df_merged = df_symptoms.merge(
        med_df[['user_id', 'checkin_date', 'biologics_active', 'immunosuppressants_active',
                'corticosteroids_active', 'aminosalicylates_active', 'total_ibd_meds']],
        on=['user_id', 'checkin_date'],
        how='left'
    )
    
    # Fill missing (user didn't report medications that day)
    for col in ['biologics_active', 'immunosuppressants_active', 'corticosteroids_active',
                'aminosalicylates_active', 'total_ibd_meds']:
        df_merged[col] = df_merged[col].fillna(0).astype(int)
    
    print(f"  Registros despu√©s de merge: {len(df_merged):,}")
    
    # 4. Create temporal features
    df_merged = create_temporal_features(df_merged)
    df_merged = create_history_features(df_merged)
    
    # 5. Add demographics
    print(f"\n  A√±adiendo demograf√≠a...")
    demographics = df_filtered[['user_id', 'age', 'sex']].drop_duplicates('user_id')
    df_merged = df_merged.merge(demographics, on='user_id', how='left')
    df_merged['age'] = df_merged['age'].fillna(df_merged['age'].median())
    df_merged['sex'] = df_merged['sex'].fillna('unknown')
    gender_map = {'male': 'M', 'female': 'F', 'unknown': 'O', 'other': 'O'}
    df_merged['gender'] = df_merged['sex'].map(gender_map).fillna('O')
    
    # 6. Temporal features
    df_merged['month'] = df_merged['checkin_date'].dt.month
    df_merged['day_of_week'] = df_merged['checkin_date'].dt.dayofweek
    df_merged['is_weekend'] = (df_merged['day_of_week'] >= 5).astype(int)
    
    # 7. ‚ùå ELIMINAR s√≠ntomas directos (solo mantener tendencias)
    print(f"\n  ‚ùå Eliminando s√≠ntomas directos (data leakage)...")
    direct_symptoms = ['abdominal_pain', 'diarrhea', 'fatigue', 'nausea', 'blood_in_stool', 'fever']
    df_final = df_merged.drop(columns=[col for col in direct_symptoms if col in df_merged.columns])
    
    # Tambi√©n eliminar severity_score y risk_level (son del d√≠a T)
    df_final = df_final.drop(columns=['severity_score', 'risk_level', 'is_flare_day',
                                      'last_flare_date', 'sex'], errors='ignore')
    
    # 8. üîß LIMPIAR valores infinitos y NaN
    print(f"\n  üîß Limpiando valores infinitos/NaN...")
    for col in df_final.columns:
        if df_final[col].dtype in ['float64', 'int64']:
            # Replace infinity with NaN
            inf_count = np.isinf(df_final[col]).sum()
            if inf_count > 0:
                print(f"    {col}: {inf_count} infinitos ‚Üí median")
                df_final[col] = df_final[col].replace([np.inf, -np.inf], np.nan)
            
            # Fill NaN with median
            nan_count = df_final[col].isna().sum()
            if nan_count > 0:
                df_final[col] = df_final[col].fillna(df_final[col].median())
    
    print(f"\n‚úÖ Dataset temporal completado:")
    print(f"  Total features: {len(df_final.columns)}")
    print(f"  Registros: {len(df_final):,}")
    print(f"  Target distribution: {df_final['risk_level_future'].value_counts().to_dict()}")
    
    # Save
    output_path = f'../data/processed/{ibd_type}/ml_dataset_temporal.csv'
    df_final.to_csv(output_path, index=False)
    print(f"\nüíæ Guardado: {output_path}")
    
    return df_final

print("‚úì Funci√≥n principal lista (con limpieza autom√°tica de inf/NaN)")

‚úì Funci√≥n principal lista (con limpieza autom√°tica de inf/NaN)


## 5Ô∏è‚É£ Ejecutar para Crohn y CU

In [6]:
# Process Crohn
df_crohn_temporal = process_ibd_temporal('crohn')

# Process CU
df_cu_temporal = process_ibd_temporal('cu')

print("\n" + "="*80)
print("‚úÖ FEATURE ENGINEERING TEMPORAL COMPLETADO")
print("="*80)
print("\nPr√≥ximo paso: Notebook 03 - Training con validaci√≥n temporal")


PROCESANDO: CROHN - TEMPORAL PREDICTION



‚úì Datos cargados: 183,304 registros

  Procesando s√≠ntomas...


  Registros con target futuro: 5,268
  Distribuci√≥n target (T+7): {'low': 4100, 'medium': 1091, 'high': 77}

Extrayendo medicaciones para CROHN...


  Medicaciones extra√≠das: 197,552 registros (OPTIMIZADO)
  Usuarios con medicaciones: 14,251

  Mergeando medicaciones...
  Registros despu√©s de merge: 5,268

  Creando features temporales...
  Features temporales creadas

  Creando features de historial...
  Features de historial creadas

  A√±adiendo demograf√≠a...

  ‚ùå Eliminando s√≠ntomas directos (data leakage)...

  üîß Limpiando valores infinitos/NaN...
    recency_score: 1 infinitos ‚Üí median

‚úÖ Dataset temporal completado:
  Total features: 29
  Registros: 5,268
  Target distribution: {'low': 4100, 'medium': 1091, 'high': 77}



üíæ Guardado: ../data/processed/crohn/ml_dataset_temporal.csv

PROCESANDO: CU - TEMPORAL PREDICTION



‚úì Datos cargados: 171,509 registros

  Procesando s√≠ntomas...
  Registros con target futuro: 5,307
  Distribuci√≥n target (T+7): {'low': 4625, 'medium': 665, 'high': 17}

Extrayendo medicaciones para CU...


  Medicaciones extra√≠das: 197,552 registros (OPTIMIZADO)
  Usuarios con medicaciones: 14,251

  Mergeando medicaciones...
  Registros despu√©s de merge: 5,307

  Creando features temporales...
  Features temporales creadas

  Creando features de historial...
  Features de historial creadas

  A√±adiendo demograf√≠a...

  ‚ùå Eliminando s√≠ntomas directos (data leakage)...

  üîß Limpiando valores infinitos/NaN...

‚úÖ Dataset temporal completado:
  Total features: 29
  Registros: 5,307
  Target distribution: {'low': 4625, 'medium': 665, 'high': 17}

üíæ Guardado: ../data/processed/cu/ml_dataset_temporal.csv



‚úÖ FEATURE ENGINEERING TEMPORAL COMPLETADO

Pr√≥ximo paso: Notebook 03 - Training con validaci√≥n temporal
