# Model Comparison: Blumenstock et al. (2022) Replication

This notebook replicates the experiments from:

> Blumenstock, G., Lessmann, S., & Seow, H-V. (2022). Deep learning for survival and competing risk modelling. *Journal of the Operational Research Society*, 73(1), 26-38.

## Models
1. **CSC**: Cause-Specific Cox
2. **FGR**: Fine-Gray Model
3. **RSF**: Random Survival Forest
4. **DeepHit**: Deep Learning approach (Lee et al., 2018)

## Experiments (Dataset 2: 2010-2025)
- **Exp 4.1**: Loan-level variables only (9 features)
- **Exp 4.2**: Macroeconomic variables only (13 features)
- **Exp 4.3**: All variables (22 features)

## Evaluation
- Time-dependent concordance index at 24, 48, 72 months
- Separate evaluation for prepayment (k=1) and default (k=2)
- Results averaged across cross-validation folds
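
As a rough illustration of the metric, the sketch below computes an unweighted, horizon-truncated concordance index for a single cause. It is not the `time_dependent_concordance_index` imported from `src.competing_risks`, whose implementation is not shown in this notebook. The function name `cindex_at_horizon`, the toy data, and the simplifications (no censoring weights, only subjects with a strictly longer duration used as comparators) are assumptions made for illustration.

```python
import numpy as np


def cindex_at_horizon(duration, event_code, risk, cause, horizon):
    """Unweighted C-index for `cause`, restricted to events up to `horizon`.

    A pair (i, j) is comparable when subject i had the event of interest at
    or before the horizon and subject j was still under observation later.
    The pair is concordant when i received the higher risk score.
    """
    duration = np.asarray(duration, dtype=float)
    event_code = np.asarray(event_code)
    risk = np.asarray(risk, dtype=float)

    concordant, comparable = 0.0, 0
    cases = np.where((event_code == cause) & (duration <= horizon))[0]
    for i in cases:
        later = duration > duration[i]          # comparators for subject i
        comparable += int(later.sum())
        concordant += float((risk[i] > risk[later]).sum())
        concordant += 0.5 * float((risk[i] == risk[later]).sum())  # ties
    return concordant / comparable if comparable else float("nan")


# Toy data (made up): durations in months, 0 = censored, 1 = prepay, 2 = default
duration = np.array([10, 20, 30, 40, 60])
event_code = np.array([1, 2, 1, 0, 1])
risk_prepay = np.array([0.9, 0.2, 0.3, 0.1, 0.7])

c24 = cindex_at_horizon(duration, event_code, risk_prepay, cause=1, horizon=24)
c48 = cindex_at_horizon(duration, event_code, risk_prepay, cause=1, horizon=48)
print(f"C(24)={c24:.3f}  C(48)={c48:.3f}")
```

Extending the horizon from 24 to 48 months brings the prepayment at month 30 into the set of cases, which is why the two values can differ.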

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import warnings
warnings.filterwarnings('ignore')

# Models
from lifelines import CoxPHFitter
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Competing risks module
import sys
sys.path.insert(0, '..')
from src.competing_risks import (
    CompetingRisksRSF,
    CompetingRisksDeepHit,
    time_dependent_concordance_index,
    evaluate_all_events,
    format_results_table,
    plot_concordance_comparison,
    EVAL_TIMES,
)

# Check if PyTorch/pycox available for DeepHit
try:
    import torch
    import pycox
    DEEPHIT_AVAILABLE = True
    print(f"PyTorch version: {torch.__version__}")
    print(f"pycox available: {DEEPHIT_AVAILABLE}")
except ImportError:
    DEEPHIT_AVAILABLE = False
    print("DeepHit not available (install pycox: pip install pycox torch torchtuples)")

sns.set_style('whitegrid')
%matplotlib inline

print("\nImports complete.")
print(f"Evaluation times: {EVAL_TIMES} months")

## Load Data

In [None]:
DATA_DIR = Path('../data/processed')

# Load the Blumenstock-style dataset
data_path = DATA_DIR / 'blumenstock_dataset2.parquet'
if data_path.exists():
    df = pd.read_parquet(data_path)
    print(f"Loaded {len(df):,} observations")
else:
    print("Dataset not found. Please run 03_data_preparation_blumenstock.ipynb first.")
    # Fall back to standard survival data
    df = pd.read_parquet(DATA_DIR / 'survival_data.parquet')
    print(f"Loaded fallback data: {len(df):,} observations")

# Load variable config
var_config_path = DATA_DIR / 'blumenstock_variables.json'
if var_config_path.exists():
    with open(var_config_path) as f:
        var_config = json.load(f)
    LOAN_LEVEL_VARS = var_config['loan_level_vars']
    MACRO_VARS = var_config['macro_vars']
    ALL_VARS = var_config['all_vars']
else:
    # Default variables
    LOAN_LEVEL_VARS = ['int_rate', 'orig_upb', 'fico_score', 'dti_r', 'ltv_r']
    MACRO_VARS = []
    ALL_VARS = LOAN_LEVEL_VARS

print(f"\nLoan-level variables: {len(LOAN_LEVEL_VARS)}")
print(f"Macro variables: {len(MACRO_VARS)}")

In [None]:
# Data summary
print("=== Data Summary ===")
print(f"Observations: {len(df):,}")
print(f"Folds: {df['fold'].nunique() if 'fold' in df.columns else 'N/A'}")

print(f"\nEvent distribution:")
event_counts = df['event_code'].value_counts().sort_index()
for code, count in event_counts.items():
    name = {0: 'Censored', 1: 'Prepay', 2: 'Default'}.get(code, 'Other')
    print(f"  {name} (k={code}): {count:,} ({count/len(df)*100:.1f}%)")

print(f"\nDuration statistics:")
print(df['duration'].describe())

## Setup Experiments

Define variable sets for each experiment.

In [None]:
# Filter to available variables
available_cols = df.columns.tolist()

LOAN_VARS_AVAIL = [v for v in LOAN_LEVEL_VARS if v in available_cols]
MACRO_VARS_AVAIL = [v for v in MACRO_VARS if v in available_cols]
ALL_VARS_AVAIL = [v for v in ALL_VARS if v in available_cols]

# Experiment configurations
EXPERIMENTS = {
    'Exp 4.1': {
        'name': 'Loan-level only',
        'features': LOAN_VARS_AVAIL,
    },
    'Exp 4.2': {
        'name': 'Macro only',
        'features': MACRO_VARS_AVAIL,
    },
    'Exp 4.3': {
        'name': 'All variables',
        'features': ALL_VARS_AVAIL,
    },
}

print("=== Experiment Setup ===")
for exp_name, config in EXPERIMENTS.items():
    print(f"\n{exp_name}: {config['name']}")
    feats = config['features']
    # Show at most the first five feature names
    shown = f"{feats[:5]}..." if len(feats) > 5 else f"{feats}"
    print(f"  Features ({len(feats)}): {shown}")

## Model Classes

Wrapper classes for consistent interface.
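
The shared interface can be sketched as a `typing.Protocol`. The names `RiskModel` and `EventRateBaseline` are hypothetical and not part of this notebook or of `src.competing_risks`. The method signatures mirror the `fit(X, duration, event_code, feature_cols)` and `predict_risk(X, event)` calls used by the CSC and FGR wrappers in this notebook.

```python
from typing import Protocol, Sequence, runtime_checkable

import numpy as np


@runtime_checkable
class RiskModel(Protocol):
    """Hypothetical interface: fit on competing-risks data, score per cause."""

    def fit(self, X, duration, event_code, feature_cols: Sequence[str]): ...

    def predict_risk(self, X, event: int) -> np.ndarray: ...


class EventRateBaseline:
    """Toy model: predicts the training event rate for every row."""

    def fit(self, X, duration, event_code, feature_cols):
        event_code = np.asarray(event_code)
        # Share of training rows that ended in each event type
        self.rates_ = {k: float(np.mean(event_code == k)) for k in (1, 2)}
        return self

    def predict_risk(self, X, event):
        return np.full(len(X), self.rates_[event])


X = np.zeros((4, 2))
baseline = EventRateBaseline().fit(X, [5, 8, 12, 20], [1, 0, 2, 1], ["a", "b"])
scores = baseline.predict_risk(X, event=1)
print(isinstance(baseline, RiskModel), scores)
```

A constant-score baseline like this one is a useful sanity check, since its concordance index should sit near 0.5.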

In [None]:
class CauseSpecificCoxModel:
    """Cause-Specific Cox model for competing risks."""
    
    def __init__(self, penalizer=0.01):
        self.penalizer = penalizer
        self.models = {}
        self.scaler = StandardScaler()
        self.feature_cols = []
        
    def fit(self, X, duration, event_code, feature_cols):
        self.feature_cols = feature_cols
        
        # Scale features
        X_scaled = self.scaler.fit_transform(X)
        
        # Fit model for each event
        for event in [1, 2]:  # Prepay, Default
            # Create cause-specific event indicator
            event_indicator = (event_code == event).astype(int)
            
            # Create DataFrame for lifelines
            train_df = pd.DataFrame(X_scaled, columns=feature_cols)
            train_df['duration'] = duration
            train_df['event'] = event_indicator
            
            # Fit Cox model
            cph = CoxPHFitter(penalizer=self.penalizer)
            cph.fit(train_df, duration_col='duration', event_col='event')
            self.models[event] = cph
            
        return self
    
    def predict_risk(self, X, event):
        X_scaled = self.scaler.transform(X)
        X_df = pd.DataFrame(X_scaled, columns=self.feature_cols)
        return self.models[event].predict_partial_hazard(X_df).values.flatten()


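class FineGrayModel:
    """Fine-Gray stand-in: one binary logistic regression per event type.

    Note: durations are not used. Each model scores P(event == k) directly,
    so this is a time-independent proxy, not a subdistribution-hazard fit.
    """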
    
    def __init__(self, C=1.0):
        self.C = C
        self.models = {}
        self.scaler = StandardScaler()
        
    def fit(self, X, duration, event_code, feature_cols):
        # Scale features
        X_scaled = self.scaler.fit_transform(X)
        
        # Fit model for each event
        for event in [1, 2]:  # Prepay, Default
            # Fine-Gray: primary event = 1, competing events stay in risk set with y=0
            y = (event_code == event).astype(int)
            
            # Fit logistic regression
            model = LogisticRegression(C=self.C, penalty='l2', solver='lbfgs', max_iter=1000)
            model.fit(X_scaled, y)
            self.models[event] = model
            
        return self
    
    def predict_risk(self, X, event):
        X_scaled = self.scaler.transform(X)
        return self.models[event].predict_proba(X_scaled)[:, 1]


print("Model classes defined.")

## Run Experiments

Following the paper's cross-validation design with 10 folds. For speed, the loop below evaluates only the first three folds.
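
A minimal sketch of fold-based splitting, assuming a `fold` column with integer labels as in this notebook's data. The helper name `fold_splits` and the toy frame are hypothetical. Unlike the loop in this notebook, which trains on every row outside the test fold, this variant draws both train and test rows only from the listed CV folds, so a fold held out for tuning never enters training.

```python
import pandas as pd


def fold_splits(df, cv_folds, fold_col="fold"):
    """Yield (fold, train_df, test_df), drawing both only from the CV folds."""
    cv_df = df[df[fold_col].isin(cv_folds)]
    for fold in cv_folds:
        test_mask = cv_df[fold_col] == fold
        yield fold, cv_df[~test_mask], cv_df[test_mask]


# Toy frame: 12 loans spread evenly over folds 0..3
toy = pd.DataFrame({"loan_id": range(12), "fold": [0, 1, 2, 3] * 3})

# Fold 3 is treated as the tuning fold and left out of cross-validation
splits = list(fold_splits(toy, cv_folds=[0, 1, 2]))
for fold, train, test in splits:
    print(f"fold={fold} train={len(train)} test={len(test)}")
```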

In [None]:
def run_single_experiment(train_df, test_df, feature_cols, exp_name, val_df=None):
    """
    Run experiment with all four models (CSC, FGR, RSF, DeepHit).
    
    Returns results DataFrame with C-index at 24, 48, 72 months.
    """
    results = []
    
    # Prepare data
    X_train = train_df[feature_cols].values
    X_test = test_df[feature_cols].values
    
    duration_train = train_df['duration'].values
    duration_test = test_df['duration'].values
    
    event_train = train_df['event_code'].values
    event_test = test_df['event_code'].values
    
    # === 1. Cause-Specific Cox ===
    print("  Training CSC...")
    csc = CauseSpecificCoxModel(penalizer=0.01)
    csc.fit(X_train, duration_train, event_train, feature_cols)
    
    risk_prepay_csc = csc.predict_risk(X_test, event=1)
    risk_default_csc = csc.predict_risk(X_test, event=2)
    
    csc_results = evaluate_all_events(
        duration_test, event_test, risk_prepay_csc, risk_default_csc, EVAL_TIMES
    )
    csc_results['Model'] = 'CSC'
    csc_results['Experiment'] = exp_name
    results.append(csc_results)
    
    # === 2. Fine-Gray ===
    print("  Training FGR...")
    fgr = FineGrayModel(C=1.0)
    fgr.fit(X_train, duration_train, event_train, feature_cols)
    
    risk_prepay_fgr = fgr.predict_risk(X_test, event=1)
    risk_default_fgr = fgr.predict_risk(X_test, event=2)
    
    fgr_results = evaluate_all_events(
        duration_test, event_test, risk_prepay_fgr, risk_default_fgr, EVAL_TIMES
    )
    fgr_results['Model'] = 'FGR'
    fgr_results['Experiment'] = exp_name
    results.append(fgr_results)
    
    # === 3. Random Survival Forest ===
    print("  Training RSF...")
    rsf = CompetingRisksRSF(
        n_estimators=100,
        min_samples_split=10,
        min_samples_leaf=5,
        n_jobs=-1,
        random_state=42
    )
    rsf.fit(X_train, duration_train, event_train, event_types=[1, 2])
    
    risk_prepay_rsf = rsf.predict_risk(X_test, event=1)
    risk_default_rsf = rsf.predict_risk(X_test, event=2)
    
    rsf_results = evaluate_all_events(
        duration_test, event_test, risk_prepay_rsf, risk_default_rsf, EVAL_TIMES
    )
    rsf_results['Model'] = 'RSF'
    rsf_results['Experiment'] = exp_name
    results.append(rsf_results)
    
    # === 4. DeepHit ===
    if DEEPHIT_AVAILABLE:
        print("  Training DeepHit...")
        
        # Prepare validation data
        if val_df is not None:
            val_data = (
                val_df[feature_cols],
                val_df['duration'].values,
                val_df['event_code'].values,
            )
        else:
            # No separate validation set supplied; pass None through to fit()
            val_data = None
        
        deephit = CompetingRisksDeepHit(
            num_durations=100,
            num_nodes_shared=[64, 64],
            num_nodes_indiv=[32],
            batch_norm=True,
            dropout=0.1,
            alpha=0.2,
            sigma=0.1,
            lr=0.01,
            weight_decay=0.01,
            batch_size=256,
            epochs=100,  # Reduced for speed
            patience=5,
            verbose=False,
            random_state=42,
        )
        
        try:
            deephit.fit(
                train_df[feature_cols],
                duration_train,
                event_train,
                event_types=[1, 2],
                val_data=val_data,
            )
            
            risk_prepay_dh = deephit.predict_risk(test_df[feature_cols], event=1)
            risk_default_dh = deephit.predict_risk(test_df[feature_cols], event=2)
            
            dh_results = evaluate_all_events(
                duration_test, event_test, risk_prepay_dh, risk_default_dh, EVAL_TIMES
            )
            dh_results['Model'] = 'DeepHit'
            dh_results['Experiment'] = exp_name
            results.append(dh_results)
        except Exception as e:
            print(f"    DeepHit failed: {e}")
    else:
        print("  Skipping DeepHit (not available)")
    
    return pd.concat(results, ignore_index=True)


print("Experiment function defined.")

In [None]:
# Run all experiments
all_results = []

# Use fold-based cross-validation if available
if 'fold' in df.columns:
    n_folds = df['fold'].nunique()
    # Use first 10 folds for CV, last fold reserved for tuning
    cv_folds = list(range(min(10, n_folds)))
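else:
    # No fold column: build a single random 80/20 hold-out split
    # (fold 0 = test, fold 1 = train) so the training set is not empty
    cv_folds = [0]
    rng = np.random.default_rng(42)
    df['fold'] = np.where(rng.random(len(df)) < 0.2, 0, 1)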

for exp_name, config in EXPERIMENTS.items():
    print(f"\n{'='*60}")
    print(f"Running {exp_name}: {config['name']}")
    print(f"Features: {len(config['features'])}")
    print(f"{'='*60}")
    
    feature_cols = config['features']
    
    if len(feature_cols) == 0:
        print("  No features available, skipping...")
        continue
    
    # Filter to complete cases for this feature set
    df_exp = df.dropna(subset=feature_cols + ['duration', 'event_code'])
    print(f"Complete cases: {len(df_exp):,}")
    
    if len(df_exp) < 1000:
        print("  Insufficient data, skipping...")
        continue
    
    # Cross-validation
    for fold in cv_folds[:3]:  # Limit folds for speed
        print(f"\nFold {fold + 1}/{len(cv_folds[:3])}")
        
        # Split by fold
        test_df = df_exp[df_exp['fold'] == fold]
        train_df = df_exp[df_exp['fold'] != fold]
        
        if len(test_df) < 100 or len(train_df) < 500:
            print("  Insufficient data in fold, skipping...")
            continue
        
        print(f"  Train: {len(train_df):,}, Test: {len(test_df):,}")
        
        # Run experiment
        fold_results = run_single_experiment(train_df, test_df, feature_cols, exp_name)
        fold_results['Fold'] = fold
        all_results.append(fold_results)

# Combine all results
if all_results:
    results_df = pd.concat(all_results, ignore_index=True)
    print(f"\n\nTotal results: {len(results_df)} rows")
else:
    print("No results generated.")

## Results Summary

Format results following paper's Table 3.
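
Reporting the spread across folds alongside the mean makes it easier to judge whether a gap between models is meaningful. The sketch below assumes the column layout used in this notebook (`Experiment`, `Model`, `Metric`, `Fold`, `Combined`). The numbers are made-up placeholders, not results from this replication or from the paper.

```python
import pandas as pd

# Placeholder per-fold scores (illustrative values only)
fold_results = pd.DataFrame({
    "Experiment": ["Exp 4.1"] * 4,
    "Model": ["CSC", "CSC", "RSF", "RSF"],
    "Metric": ["C(24)"] * 4,
    "Fold": [0, 1, 0, 1],
    "Combined": [0.70, 0.72, 0.75, 0.79],
})

# Mean, standard deviation, and fold count per experiment/model/metric
summary = (
    fold_results.groupby(["Experiment", "Model", "Metric"])["Combined"]
    .agg(mean="mean", std="std", n_folds="count")
    .reset_index()
)
summary[["mean", "std"]] = summary[["mean", "std"]] * 100  # report as C-index x 100
print(summary.round(2).to_string(index=False))
```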

In [None]:
if 'results_df' in dir() and len(results_df) > 0:
    # Aggregate across folds
    agg_results = results_df.groupby(['Experiment', 'Model', 'Metric']).agg({
        'Prepay (k=1)': 'mean',
        'Default (k=2)': 'mean',
        'Combined': 'mean'
    }).reset_index()
    
    print("=" * 70)
    print("RESULTS SUMMARY (Mean across folds)")
    print("=" * 70)
    
    for exp_name in EXPERIMENTS.keys():
        exp_data = agg_results[agg_results['Experiment'] == exp_name]
        if len(exp_data) == 0:
            continue
            
        print(f"\n{exp_name}: {EXPERIMENTS[exp_name]['name']}")
        print("-" * 70)
        
        # Format as table similar to paper
        model_order = ['CSC', 'FGR', 'RSF', 'DeepHit']
        for model in model_order:
            model_data = exp_data[exp_data['Model'] == model]
            if len(model_data) == 0:
                continue
                
            c_values = {}
            for _, row in model_data.iterrows():
                metric = row['Metric']
                c_values[metric] = {
                    'prepay': row['Prepay (k=1)'] * 100,
                    'default': row['Default (k=2)'] * 100,
                    'combined': row['Combined'] * 100,
                }
            
            # Print row
            # Missing metrics print as nan rather than a misleading 0.00
            nan = float('nan')
            c24 = c_values.get('C(24)', {}).get('combined', nan)
            c48 = c_values.get('C(48)', {}).get('combined', nan)
            c72 = c_values.get('C(72)', {}).get('combined', nan)
            oc = c_values.get('ØC', {}).get('combined', nan)
            
            print(f"{model:10} C(24)={c24:5.2f}  C(48)={c48:5.2f}  C(72)={c72:5.2f}  ØC={oc:5.2f}")
else:
    print("No results to display.")

In [None]:
if 'agg_results' in dir() and len(agg_results) > 0:
    # Create detailed table
    print("\n=== Detailed Results (C-index × 100) ===")
    
    for exp_name in EXPERIMENTS.keys():
        exp_data = agg_results[agg_results['Experiment'] == exp_name]
        if len(exp_data) == 0:
            continue
            
        print(f"\n{exp_name}")
        print("-" * 80)
        
        # Create pivot table
        pivot = exp_data.pivot(index='Model', columns='Metric',
                               values=['Prepay (k=1)', 'Default (k=2)', 'Combined'])
        pivot = pivot * 100  # Convert to percentage
        
        # Reorder columns
        cols_order = []
        for metric in ['C(24)', 'C(48)', 'C(72)', 'ØC']:
            for col in ['Prepay (k=1)', 'Default (k=2)', 'Combined']:
                if (col, metric) in pivot.columns:
                    cols_order.append((col, metric))
        
        pivot = pivot[cols_order]
        print(pivot.round(2).to_string())

## Visualization

In [None]:
if 'agg_results' in dir() and len(agg_results) > 0:
    # Plot comparison
    fig, axes = plt.subplots(1, 3, figsize=(16, 5))
    
    # Define colors for each model
    model_colors = {'CSC': 'C0', 'FGR': 'C1', 'RSF': 'C2', 'DeepHit': 'C3'}
    
    for idx, exp_name in enumerate(EXPERIMENTS.keys()):
        ax = axes[idx]
        exp_data = agg_results[agg_results['Experiment'] == exp_name]
        
        if len(exp_data) == 0:
            ax.text(0.5, 0.5, 'No data', ha='center', va='center', transform=ax.transAxes)
            ax.set_title(exp_name)
            continue
        
        # Filter to time-specific metrics
        time_data = exp_data[exp_data['Metric'].str.startswith('C(')]
        
        # Plot bars
        models = ['CSC', 'FGR', 'RSF', 'DeepHit']
        models = [m for m in models if m in time_data['Model'].values]
        metrics = sorted(time_data['Metric'].unique())
        x = np.arange(len(metrics))
        width = 0.8 / len(models) if models else 0.2
        
        for i, model in enumerate(models):
            model_data = time_data[time_data['Model'] == model].set_index('Metric')
            values = [model_data.loc[m, 'Combined'] * 100 if m in model_data.index else 0 
                      for m in metrics]
            ax.bar(x + i * width, values, width, label=model, 
                   color=model_colors.get(model, f'C{i}'), alpha=0.8)
        
        ax.set_xlabel('Time Point')
        ax.set_ylabel('C-index (×100)')
        ax.set_title(f"{exp_name}\n{EXPERIMENTS[exp_name]['name']}")
        ax.set_xticks(x + width * (len(models) - 1) / 2)
        ax.set_xticklabels(metrics)
        ax.legend()
        ax.axhline(y=50, color='gray', linestyle='--', alpha=0.5)
        ax.set_ylim(40, 100)
        ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    Path('../reports/figures').mkdir(parents=True, exist_ok=True)  # ensure output dir exists
    plt.savefig('../reports/figures/model_comparison_blumenstock.png', dpi=150)
    plt.show()

In [None]:
if 'agg_results' in dir() and len(agg_results) > 0:
    # Plot by event type
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Define colors for each model
    model_colors = {'CSC': 'C0', 'FGR': 'C1', 'RSF': 'C2', 'DeepHit': 'C3'}
    
    # Best experiment (4.3 if available)
    exp_name = 'Exp 4.3' if 'Exp 4.3' in agg_results['Experiment'].values else agg_results['Experiment'].iloc[0]
    exp_data = agg_results[agg_results['Experiment'] == exp_name]
    time_data = exp_data[exp_data['Metric'].str.startswith('C(')]
    
    for idx, (col, title) in enumerate([('Prepay (k=1)', 'Prepayment'), ('Default (k=2)', 'Default')]):
        ax = axes[idx]
        
        models = ['CSC', 'FGR', 'RSF', 'DeepHit']
        models = [m for m in models if m in time_data['Model'].values]
        metrics = sorted(time_data['Metric'].unique())
        x = np.arange(len(metrics))
        width = 0.8 / len(models) if models else 0.2
        
        for i, model in enumerate(models):
            model_data = time_data[time_data['Model'] == model].set_index('Metric')
            values = [model_data.loc[m, col] * 100 if m in model_data.index else 0 
                      for m in metrics]
            ax.bar(x + i * width, values, width, label=model,
                   color=model_colors.get(model, f'C{i}'), alpha=0.8)
        
        ax.set_xlabel('Time Point')
        ax.set_ylabel('C-index (×100)')
        ax.set_title(f'{title} Risk Prediction ({exp_name})')
        ax.set_xticks(x + width * (len(models) - 1) / 2)
        ax.set_xticklabels(metrics)
        ax.legend()
        ax.axhline(y=50, color='gray', linestyle='--', alpha=0.5)
        ax.set_ylim(40, 100)
        ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    Path('../reports/figures').mkdir(parents=True, exist_ok=True)  # ensure output dir exists
    plt.savefig('../reports/figures/model_comparison_by_event.png', dpi=150)
    plt.show()

## Save Results

In [None]:
if 'results_df' in dir() and len(results_df) > 0:
    # Save detailed results (create the reports directory if needed)
    Path('../reports').mkdir(parents=True, exist_ok=True)
    results_df.to_csv('../reports/blumenstock_results_detailed.csv', index=False)
    print("Saved detailed results to reports/blumenstock_results_detailed.csv")
    
    # Save aggregated results
    agg_results.to_csv('../reports/blumenstock_results_summary.csv', index=False)
    print("Saved summary results to reports/blumenstock_results_summary.csv")

## Conclusions

### Key Findings (Expected based on paper)

1. **DeepHit and RSF should outperform statistical models** - Deep learning and ML models typically achieve higher C-index

2. **Default prediction is easier** - ØC₂ > ØC₁ typically (loan-level variables are strong predictors of default)

3. **Macro variables help prepayment more than default** - Interest rate environment drives prepayment

4. **Combined variables perform best** - Exp 4.3 should show highest ØC

### Model Ranking (Paper's findings)

From the paper, the expected ranking for Dataset 2 is:
- **DeepHit** ≈ **RSF** > **FGR** > **CSC**

### Implementation Notes

- **DeepHit**: Uses pycox library with cause-specific network architecture
- **RSF**: Cause-specific approach using scikit-survival
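- **FGR**: Approximated here by one binary logistic regression per event type; durations are not used, so it is a proxy rather than a true subdistribution-hazard model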
- **CSC**: Semi-parametric Cox model from lifelines
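
The cause-specific setup used by the CSC wrapper can be sketched as follows: for each cause, only that event counts as an event, while competing events and censored rows both receive an indicator of 0. The helper name `cause_specific_targets` and the toy values are hypothetical.

```python
import numpy as np


def cause_specific_targets(duration, event_code, cause):
    """Return (duration, indicator) where only `cause` counts as an event.

    Competing events and censored rows both get indicator 0, so a Cox model
    fit on these targets treats competing events as censoring.
    """
    duration = np.asarray(duration, dtype=float)
    indicator = (np.asarray(event_code) == cause).astype(int)
    return duration, indicator


# Toy data: 0 = censored, 1 = prepay, 2 = default
duration = [12, 30, 45, 60]
event_code = [1, 2, 0, 1]

_, prepay = cause_specific_targets(duration, event_code, cause=1)
_, default = cause_specific_targets(duration, event_code, cause=2)
print("prepay :", prepay)
print("default:", default)
```

The logistic-regression stand-in for FGR uses the same indicator but drops the durations entirely, which is the main difference between the two wrappers in this notebook.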

### References

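- Blumenstock, G., Lessmann, S., & Seow, H-V. (2022). Deep learning for survival and competing risk modelling. *Journal of the Operational Research Society*, 73(1), 26-38. (Main paper being replicated.)
- Lee, C., Zame, W. R., Yoon, J., & van der Schaar, M. (2018). DeepHit: A deep learning approach to survival analysis with competing risks. *Proceedings of the AAAI Conference on Artificial Intelligence*, 32(1). (DeepHit original paper.)
- Ishwaran, H., Gerds, T. A., Kogalur, U. B., Moore, R. D., Gange, S. J., & Lau, B. M. (2014). Random survival forests for competing risks. *Biostatistics*, 15(4), 757-773. (RSF for competing risks.)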