# Random Survival Forest for Competing Risks

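This notebook fits a **Random Survival Forest (RSF)** to competing risks. It takes a cause-specific approach (one forest per event type, with competing events treated as censored), which is simpler than the single-forest competing-risks splitting rules of Ishwaran et al. (2014). RSF is a non-parametric ensemble method that can capture non-linear effects and interactions without specifying them in advance.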

## Methodology

| Aspect | Description |
|--------|-------------|
| **Model** | Cause-specific RSF (separate model per event) |
| **Base** | scikit-survival RandomSurvivalForest |
| **Features** | Blumenstock et al. (2022) - 21 variables |
| **Evaluation** | Time-dependent C-index at 24, 48, 72 months |

## Comparison to Semi-Parametric Models

| Model | Assumptions | Interpretability | Flexibility |
|-------|-------------|------------------|-------------|
| Cox PH | Proportional hazards | High (coefficients) | Low |
| Fine-Gray | Subdistribution PH | High (coefficients) | Low |
| **RSF** | No PH or functional-form assumption | Medium (importance) | High |

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Survival analysis
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv
from sksurv.metrics import concordance_index_censored, concordance_index_ipcw

# Custom module
import sys
sys.path.insert(0, '..')
from src.competing_risks.random_forest import CompetingRisksRSF, fit_rsf_competing_risks

from sklearn.preprocessing import StandardScaler
import pickle

sns.set_style('whitegrid')
%matplotlib inline

# Time horizons for evaluation (matching notebooks 05 and 06)
TIME_HORIZONS = [24, 48, 72]

print("Imports complete.")
print(f"Time horizons for C-index evaluation: {TIME_HORIZONS} months")

In [None]:
# === CONFIGURATION ===
DATA_DIR = Path('../data/processed')
FIGURES_DIR = Path('../reports/figures')
MODELS_DIR = Path('../models')

FIGURES_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# Cross-validation folds (Blumenstock methodology)
TRAIN_FOLDS = list(range(10))  # Folds 0-9 for training
TEST_FOLD = 10                  # Fold 10 for testing

# RSF hyperparameters
RSF_PARAMS = {
    'n_estimators': 100,
    'max_depth': 10,
    'min_samples_split': 20,
    'min_samples_leaf': 10,
    'max_features': 'sqrt',
    'n_jobs': -1,
    'random_state': 42,
}

print(f"Training folds: {TRAIN_FOLDS}")
print(f"Test fold: {TEST_FOLD}")
print(f"\nRSF parameters: {RSF_PARAMS}")

---

## Load Loan-Month Panel Data

In [None]:
# Load the loan-month panel data
print("Loading loan-month panel data...")
panel_df = pd.read_parquet(DATA_DIR / 'loan_month_panel.parquet')

print(f"Loaded {len(panel_df):,} loan-months")
print(f"Unique loans: {panel_df['loan_sequence_number'].nunique():,}")
print(f"Folds: {sorted(panel_df['fold'].unique())}")
print(f"Vintages: {panel_df['vintage_year'].min()} - {panel_df['vintage_year'].max()}")

print("\nEvent distribution (terminal observations):")
event_names = {0: 'Censored', 1: 'Prepay', 2: 'Default'}
terminal_events = panel_df[panel_df['event'] == 1].groupby('event_code').size()
for code, count in terminal_events.items():
    print(f"  {event_names.get(code, 'Other')} (k={code}): {count:,}")

---

## Define Features (Blumenstock et al. 2022)

In [None]:
# Define feature groups (Blumenstock et al. 2022, Table 2)
# Matching notebooks 05 and 06 exactly

# Static covariates (fixed at origination) - 5 variables
STATIC_FEATURES = [
    'int_rate',      # Initial interest rate
    'orig_upb',      # Original unpaid balance
    'fico_score',    # Initial FICO score
    'dti_r',         # Initial debt-to-income ratio
    'ltv_r',         # Initial loan-to-value ratio
]

# Behavioral covariates (time-varying) - 4 variables
BEHAVIORAL_FEATURES = [
    'bal_repaid',      # Current repaid balance in percent
    't_act_12m',       # No. of times not being delinquent in last 12 months
    't_del_30d_12m',   # No. of times being 30 days delinquent in last 12 months
    't_del_60d_12m',   # No. of times being 60 days delinquent in last 12 months
]

# Macro covariates (time-varying) - 12 variables
MACRO_FEATURES = [
    'hpi_st_d_t_o',    # HPI difference between origination and today (state)
    'ppi_c_FRMA',      # Current prepayment incentive
    'TB10Y_d_t_o',     # Treasury rate difference
    'FRMA30Y_d_t_o',   # 30Y FRM difference
    'ppi_o_FRMA',      # Prepayment incentive at origination
    'hpi_st_log12m',   # HPI 12-month log return (state)
    'hpi_r_st_us',     # Ratio of state HPI to national HPI
    'st_unemp_r12m',   # Unemployment 12-month log return (state)
    'st_unemp_r3m',    # Unemployment 3-month log return (state)
    'TB10Y_r12m',      # Treasury rate 12-month return
    'T10Y3MM',         # Yield spread (10Y - 3M)
    'T10Y3MM_r12m',    # Yield spread 12-month return
]

ALL_FEATURES = STATIC_FEATURES + BEHAVIORAL_FEATURES + MACRO_FEATURES

# Filter to available features
feature_cols = [f for f in ALL_FEATURES if f in panel_df.columns]
missing_features = [f for f in ALL_FEATURES if f not in panel_df.columns]

print("=== Feature Groups (Blumenstock et al. 2022) ===")
print(f"Static features: {len([f for f in STATIC_FEATURES if f in feature_cols])}/5")
print(f"Behavioral features: {len([f for f in BEHAVIORAL_FEATURES if f in feature_cols])}/4")
print(f"Macro features: {len([f for f in MACRO_FEATURES if f in feature_cols])}/12")
print(f"\nTotal available: {len(feature_cols)}/21")

if missing_features:
    print(f"\nMissing features ({len(missing_features)}):")
    for f in missing_features:
        print(f"  - {f}")

---

## Prepare Data for RSF

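scikit-survival's `RandomSurvivalForest` takes one row per subject, so the loan-month panel is collapsed to each loan's terminal (last) observation. Time-varying covariates therefore enter the model at their last observed value.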
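
One caveat worth checking when collapsing a panel: pandas `GroupBy.last()` returns the last non-null value of each column, which is not necessarily the last row. The sketch below contrasts it with `tail(1)` on a toy panel; the column names `loan_id` and `fico` are illustrative and not taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy panel: loan A has a missing value in its final month
panel = pd.DataFrame({
    'loan_id': ['A', 'A', 'A', 'B', 'B'],
    'loan_age': [1, 2, 3, 1, 2],
    'fico': [700.0, 700.0, np.nan, 650.0, 650.0],
}).sort_values(['loan_id', 'loan_age'])

# last(): per-column last non-null value (NaN in month 3 is skipped)
last_non_null = panel.groupby('loan_id').last().reset_index()

# tail(1): the actual last row of each loan, NaN included
last_row = panel.groupby('loan_id').tail(1).reset_index(drop=True)

print(last_non_null)
print(last_row)
```

Which behaviour is wanted depends on whether carrying an older value forward is acceptable; with `tail(1)`, a later `dropna` removes loans whose final row is incomplete.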

In [None]:
# For RSF, we need terminal observations (one per loan)
# Get the last observation for each loan
print("=== Preparing Terminal Observations ===")

time_col = 'loan_age'
event_col = 'event_code'

# Sort and keep the last row per loan
# (GroupBy.last() would return the last non-null value per column, not the last row)
panel_df = panel_df.sort_values(['loan_sequence_number', time_col])
terminal_df = panel_df.groupby('loan_sequence_number').tail(1).reset_index(drop=True)

print(f"Terminal observations: {len(terminal_df):,} loans")

# Lag bal_repaid to avoid data leakage (matching notebook 06)
# At prepayment, bal_repaid=100% by definition
# Use the second-to-last observation's bal_repaid
if 'bal_repaid' in feature_cols:
    print("\nLagging bal_repaid to avoid data leakage...")
    
    # Get second-to-last observation for bal_repaid
    def get_lagged_bal_repaid(group):
        if len(group) >= 2:
            return group['bal_repaid'].iloc[-2]
        else:
            return group['bal_repaid'].iloc[-1]
    
    bal_repaid_lag = panel_df.groupby('loan_sequence_number').apply(get_lagged_bal_repaid)
    terminal_df['bal_repaid_lag1'] = terminal_df['loan_sequence_number'].map(bal_repaid_lag)
    
    # Replace bal_repaid with lagged version
    feature_cols = [f if f != 'bal_repaid' else 'bal_repaid_lag1' for f in feature_cols]
    print("✓ Created bal_repaid_lag1")

# Log transform UPB
if 'orig_upb' in terminal_df.columns:
    terminal_df['log_upb'] = np.log(terminal_df['orig_upb'])
    feature_cols = [f if f != 'orig_upb' else 'log_upb' for f in feature_cols]
    print("✓ Created log_upb")

# Drop rows with missing features
n_before = len(terminal_df)
terminal_df = terminal_df.dropna(subset=feature_cols)
n_after = len(terminal_df)
print(f"\nAfter dropping NaN: {n_after:,} loans (dropped {n_before - n_after:,})")

print(f"\nFeatures ({len(feature_cols)}): {feature_cols}")
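
The lagging step can also be written without a per-group Python function. The following is a minimal sketch on toy data, assuming the panel is already sorted by loan and age; `loan_id` is an illustrative column name. It differs slightly from the function above: if the second-to-last value is missing, it falls back to the current value instead of returning NaN.

```python
import pandas as pd

# Toy panel: loan A prepays in month 3 (bal_repaid jumps to 100), loan B has one row
panel = pd.DataFrame({
    'loan_id': ['A', 'A', 'A', 'B'],
    'loan_age': [1, 2, 3, 1],
    'bal_repaid': [1.0, 2.5, 100.0, 0.4],
}).sort_values(['loan_id', 'loan_age'])

# Previous month's value within each loan (NaN on a loan's first row)
panel['bal_repaid_lag1'] = panel.groupby('loan_id')['bal_repaid'].shift(1)

# Fall back to the current value when no earlier observation exists
panel['bal_repaid_lag1'] = panel['bal_repaid_lag1'].fillna(panel['bal_repaid'])

# Keep the terminal row of each loan
terminal = panel.groupby('loan_id').tail(1).reset_index(drop=True)
print(terminal)
```

Note that loans with a single observation keep their unlagged value under either approach, so any leakage in those rows remains.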

In [None]:
# Split by folds (matching notebooks 05 and 06)
print("=== Splitting Data by Folds ===")

train_df = terminal_df[terminal_df['fold'].isin(TRAIN_FOLDS)].copy()
test_df = terminal_df[terminal_df['fold'] == TEST_FOLD].copy()

print(f"Training set (folds 0-9): {len(train_df):,} loans")
print(f"Test set (fold 10): {len(test_df):,} loans")

# Event distribution
print("\nTraining set event distribution:")
for code, name in event_names.items():
    count = (train_df[event_col] == code).sum()
    print(f"  {name}: {count:,}")

---

## Fit Random Survival Forest

We fit one cause-specific RSF model for prepayment and one for default. In each model, the other event type is treated as censoring.
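
To make the cause-specific encoding concrete, the sketch below builds the two targets from a toy vector of event codes using plain NumPy. The helper `cause_specific_target` is hypothetical; the field names `event` and `time` are chosen to mirror the defaults of `Surv.from_arrays`.

```python
import numpy as np

# Toy data: 0 = censored, 1 = prepay, 2 = default
event_code = np.array([0, 1, 2, 1, 0])
loan_age = np.array([60.0, 14.0, 33.0, 48.0, 72.0])

def cause_specific_target(event_code, time, cause):
    """Structured (event, time) array in which only `cause` counts as an event."""
    y = np.empty(len(time), dtype=[('event', bool), ('time', float)])
    y['event'] = event_code == cause   # competing events become censored
    y['time'] = time
    return y

y_prepay = cause_specific_target(event_code, loan_age, cause=1)
y_default = cause_specific_target(event_code, loan_age, cause=2)

print(y_prepay)
print(y_default)
```

The loan that defaults at month 33 appears as censored at month 33 in the prepayment target, and the two prepaid loans appear as censored in the default target.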

In [None]:
# Fit RSF for PREPAYMENT (event=1)
print("=== Fitting RSF for PREPAYMENT ===")

# Prepare data
X_train = train_df[feature_cols].values
X_test = test_df[feature_cols].values

# Create cause-specific event indicator (prepay=True, others=False)
event_prepay_train = (train_df[event_col] == 1).values
event_prepay_test = (test_df[event_col] == 1).values

duration_train = train_df[time_col].values
duration_test = test_df[time_col].values

# Create structured array for scikit-survival
y_train_prepay = Surv.from_arrays(event_prepay_train, duration_train)
y_test_prepay = Surv.from_arrays(event_prepay_test, duration_test)

# Fit RSF
rsf_prepay = RandomSurvivalForest(**RSF_PARAMS)
rsf_prepay.fit(X_train, y_train_prepay)

print(f"\n✓ RSF Prepayment model fitted")
print(f"  Trees: {rsf_prepay.n_estimators}")
print(f"  Max depth: {rsf_prepay.max_depth}")
print(f"  Training events: {event_prepay_train.sum():,}")

In [None]:
# Fit RSF for DEFAULT (event=2)
print("=== Fitting RSF for DEFAULT ===")

# Create cause-specific event indicator (default=True, others=False)
event_default_train = (train_df[event_col] == 2).values
event_default_test = (test_df[event_col] == 2).values

# Create structured array
y_train_default = Surv.from_arrays(event_default_train, duration_train)
y_test_default = Surv.from_arrays(event_default_test, duration_test)

# Fit RSF
rsf_default = RandomSurvivalForest(**RSF_PARAMS)
rsf_default.fit(X_train, y_train_default)

print(f"\n✓ RSF Default model fitted")
print(f"  Trees: {rsf_default.n_estimators}")
print(f"  Max depth: {rsf_default.max_depth}")
print(f"  Training events: {event_default_train.sum():,}")

---

## Model Evaluation

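Evaluate using the IPCW concordance index (Uno's C) truncated at τ = 24, 48, and 72 months, alongside Harrell's C-index over the full follow-up. Because each model treats the competing event as censoring, the IPCW weights are estimated from a censoring distribution that includes competing events.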
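
The pairwise logic behind a truncated concordance index can be sketched in a few lines. The function `truncated_c_index` and the toy data are hypothetical, and the sketch deliberately omits the inverse-probability-of-censoring weights that `concordance_index_ipcw` applies, so its values will generally differ from scikit-survival's on censored data.

```python
import numpy as np

def truncated_c_index(time, event, risk, tau=None):
    """Unweighted Harrell-style concordance, restricted to events before tau."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        # Subject i anchors a pair only if its event was observed (before tau)
        if not event[i] or (tau is not None and time[i] >= tau):
            continue
        for j in range(len(time)):
            if time[j] > time[i]:          # j was still at risk when i failed
                comparable += 1
                if risk[i] > risk[j]:      # earlier failure has higher risk
                    concordant += 1.0
                elif risk[i] == risk[j]:   # ties in risk count as half
                    concordant += 0.5
    return concordant / comparable if comparable else float('nan')

time = [10, 30, 50, 80]
event = [True, False, True, True]
risk = [0.9, 0.2, 0.1, 0.6]

c_all = truncated_c_index(time, event, risk)
c_24 = truncated_c_index(time, event, risk, tau=24)
print(c_all, c_24)  # 0.75 1.0
```

With τ = 24 only the event at month 10 anchors comparisons, which is why a short horizon can give a different value than the full follow-up.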

In [None]:
# Evaluate PREPAYMENT model
print("=== RSF Prepayment Model Evaluation ===")

# Get risk scores (higher = more risk of prepayment)
risk_prepay_train = rsf_prepay.predict(X_train)
risk_prepay_test = rsf_prepay.predict(X_test)

# Time-dependent C-index
print("\nTime-Dependent C-index (IPCW) for PREPAYMENT:")
print("-" * 50)

cindex_prepay_results = {}
for tau in TIME_HORIZONS:
    try:
        c_tau = concordance_index_ipcw(
            y_train_prepay,
            y_test_prepay,
            risk_prepay_test,
            tau=tau
        )
        cindex_prepay_results[tau] = c_tau[0]
        print(f"  τ = {tau:3d} months: C-index = {c_tau[0]:.4f}")
    except Exception as e:
        print(f"  τ = {tau:3d} months: Error - {str(e)[:50]}")

# Overall C-index (Harrell's)
c_index_prepay = concordance_index_censored(
    event_prepay_test,
    duration_test,
    risk_prepay_test
)
print(f"\nOverall C-index (Harrell): {c_index_prepay[0]:.4f}")
print(f"  Concordant: {c_index_prepay[1]:,} | Discordant: {c_index_prepay[2]:,} | Tied risk: {c_index_prepay[3]:,}")

In [None]:
# Evaluate DEFAULT model
print("=== RSF Default Model Evaluation ===")

# Get risk scores
risk_default_train = rsf_default.predict(X_train)
risk_default_test = rsf_default.predict(X_test)

# Time-dependent C-index
print("\nTime-Dependent C-index (IPCW) for DEFAULT:")
print("-" * 50)

cindex_default_results = {}
for tau in TIME_HORIZONS:
    try:
        c_tau = concordance_index_ipcw(
            y_train_default,
            y_test_default,
            risk_default_test,
            tau=tau
        )
        cindex_default_results[tau] = c_tau[0]
        print(f"  τ = {tau:3d} months: C-index = {c_tau[0]:.4f}")
    except Exception as e:
        print(f"  τ = {tau:3d} months: Error - {str(e)[:50]}")

# Overall C-index (Harrell's)
c_index_default = concordance_index_censored(
    event_default_test,
    duration_test,
    risk_default_test
)
print(f"\nOverall C-index (Harrell): {c_index_default[0]:.4f}")
print(f"  Concordant: {c_index_default[1]:,} | Discordant: {c_index_default[2]:,} | Tied risk: {c_index_default[3]:,}")

In [None]:
# Plot time-dependent C-index comparison
fig, ax = plt.subplots(figsize=(10, 6))

# Prepare data for plotting
horizons = sorted(set(cindex_prepay_results.keys()) & set(cindex_default_results.keys()))
prepay_cindex = [cindex_prepay_results[h] for h in horizons]
default_cindex = [cindex_default_results[h] for h in horizons]

x = np.arange(len(horizons))
width = 0.35

bars1 = ax.bar(x - width/2, prepay_cindex, width, label='Prepayment', color='steelblue', alpha=0.8)
bars2 = ax.bar(x + width/2, default_cindex, width, label='Default', color='indianred', alpha=0.8)

# Add value labels
for bar, val in zip(bars1, prepay_cindex):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
            f'{val:.3f}', ha='center', va='bottom', fontsize=10)
for bar, val in zip(bars2, default_cindex):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
            f'{val:.3f}', ha='center', va='bottom', fontsize=10)

ax.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='Random (0.5)')

ax.set_xlabel('Time Horizon (months)', fontsize=12)
ax.set_ylabel('C-index (IPCW)', fontsize=12)
ax.set_title('RSF: Time-Dependent Concordance Index by Event Type', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels([f'τ = {h}' for h in horizons])
ax.set_ylim(0.4, 1.0)
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'rsf_time_dependent_cindex.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"Figure saved to: {FIGURES_DIR / 'rsf_time_dependent_cindex.png'}")

---

## Feature Importance

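scikit-survival's `RandomSurvivalForest` does not implement `feature_importances_`, so importance is estimated by permutation: each feature is shuffled on the test set and the resulting drop in Harrell's C-index is recorded.

In [None]:
# Estimate permutation feature importance on the test set
from sklearn.inspection import permutation_importance

print("=== Feature Importance ===")

# Prepayment model: mean drop in C-index when a feature is shuffled
perm_prepay = permutation_importance(
    rsf_prepay, X_test, y_test_prepay, n_repeats=5, random_state=42
)
importance_prepay = pd.DataFrame({
    'feature': feature_cols,
    'importance': perm_prepay.importances_mean
}).sort_values('importance', ascending=False)

# Default model
perm_default = permutation_importance(
    rsf_default, X_test, y_test_default, n_repeats=5, random_state=42
)
importance_default = pd.DataFrame({
    'feature': feature_cols,
    'importance': perm_default.importances_mean
}).sort_values('importance', ascending=False)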

print("\nTop 10 Features - PREPAYMENT:")
print(importance_prepay.head(10).to_string(index=False))

print("\nTop 10 Features - DEFAULT:")
print(importance_default.head(10).to_string(index=False))
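
The idea behind permutation importance can be shown with a small NumPy sketch. Here `permutation_importance_sketch` and the toy scoring function `neg_mse` are hypothetical stand-ins; for an RSF, the score would be the model's concordance index on held-out data.

```python
import numpy as np

def permutation_importance_sketch(score_fn, X, y, n_repeats=5, seed=0):
    """Mean drop in score_fn(X, y) after shuffling each column of X."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(X, y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between this column and the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - score_fn(X_perm, y))
        importances[col] = np.mean(drops)
    return importances

# Toy check: the target depends on column 0 only
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 2))
y_toy = X_toy[:, 0]

def neg_mse(X, y):
    # Stand-in "model" that predicts y from column 0; higher is better
    return -np.mean((X[:, 0] - y) ** 2)

imp = permutation_importance_sketch(neg_mse, X_toy, y_toy)
print(imp)
```

Shuffling the column the score depends on lowers the score, while shuffling the irrelevant column leaves it unchanged. Importances near zero or negative suggest a feature adds little on the evaluation set.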

In [None]:
# Plot feature importance comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 8))

# Prepayment
ax = axes[0]
top_n = 15
plot_df = importance_prepay.head(top_n).iloc[::-1]  # Reverse for horizontal bar
ax.barh(plot_df['feature'], plot_df['importance'], color='steelblue', alpha=0.7)
ax.set_xlabel('Importance')
ax.set_title('RSF Feature Importance: Prepayment', fontsize=12)
ax.grid(True, alpha=0.3, axis='x')

# Default
ax = axes[1]
plot_df = importance_default.head(top_n).iloc[::-1]
ax.barh(plot_df['feature'], plot_df['importance'], color='indianred', alpha=0.7)
ax.set_xlabel('Importance')
ax.set_title('RSF Feature Importance: Default', fontsize=12)
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'rsf_feature_importance.png', dpi=150, bbox_inches='tight')
plt.show()

---

## Save Models

In [None]:
# Save models
with open(MODELS_DIR / 'rsf_prepay.pkl', 'wb') as f:
    pickle.dump(rsf_prepay, f)

with open(MODELS_DIR / 'rsf_default.pkl', 'wb') as f:
    pickle.dump(rsf_default, f)

# Save feature importance
importance_prepay.to_csv(MODELS_DIR / 'rsf_importance_prepay.csv', index=False)
importance_default.to_csv(MODELS_DIR / 'rsf_importance_default.csv', index=False)

print(f"Models saved to {MODELS_DIR}:")
print(f"  - rsf_prepay.pkl")
print(f"  - rsf_default.pkl")
print(f"  - rsf_importance_prepay.csv")
print(f"  - rsf_importance_default.csv")

---

## Summary

In [None]:
print("=" * 70)
print("RANDOM SURVIVAL FOREST - SUMMARY")
print("=" * 70)

print(f"\nData:")
print(f"  Training loans: {len(train_df):,}")
print(f"  Test loans: {len(test_df):,}")

print(f"\nFeatures: {len(feature_cols)}")
# Map original feature names to their transformed replacements before counting
renamed = {'orig_upb': 'log_upb', 'bal_repaid': 'bal_repaid_lag1'}
print(f"  Static: {sum(renamed.get(f, f) in feature_cols for f in STATIC_FEATURES)}")
print(f"  Behavioral: {sum(renamed.get(f, f) in feature_cols for f in BEHAVIORAL_FEATURES)}")
print(f"  Macro: {len([f for f in MACRO_FEATURES if f in feature_cols])}")

print(f"\nRSF Parameters:")
for param, value in RSF_PARAMS.items():
    print(f"  {param}: {value}")

print(f"\n{'='*70}")
print("MODEL PERFORMANCE (Test Set)")
print("=" * 70)

print(f"\nPREPAYMENT MODEL:")
print(f"  Training events: {event_prepay_train.sum():,}")
print(f"  Overall C-index (Harrell): {c_index_prepay[0]:.4f}")
print(f"  Time-Dependent C-index (IPCW):")
for tau, c in cindex_prepay_results.items():
    print(f"    τ = {tau:3d} months: {c:.4f}")

print(f"\nDEFAULT MODEL:")
print(f"  Training events: {event_default_train.sum():,}")
print(f"  Overall C-index (Harrell): {c_index_default[0]:.4f}")
print(f"  Time-Dependent C-index (IPCW):")
for tau, c in cindex_default_results.items():
    print(f"    τ = {tau:3d} months: {c:.4f}")

print(f"\nTop 3 Important Features:")
print(f"  Prepayment: {', '.join(importance_prepay['feature'].head(3).tolist())}")
print(f"  Default: {', '.join(importance_default['feature'].head(3).tolist())}")

---

## Next Steps

**Notebook 08**: Model Comparison

Compare all models:
- Cause-Specific Cox (notebook 05)
- Fine-Gray (notebook 06)
- Random Survival Forest (this notebook)

Key comparisons:
- Time-dependent C-index at multiple horizons
- Calibration assessment
- Cumulative incidence predictions
- Computational cost