# Fine-Gray Subdistribution Hazard Model

This notebook implements the **Fine-Gray competing risks model** using a discrete-time approximation. The Fine-Gray model estimates the subdistribution hazard, which directly relates to the cumulative incidence function.

## Key Concepts

| Model | Estimates | Risk Set | Use Case |
|-------|-----------|----------|----------|
| **Cause-specific Cox** | Hazard among those at risk | Subjects leave at any event | Etiology, risk factors |
| **Fine-Gray** | Subdistribution hazard | Competing events stay in risk set | Cumulative incidence prediction |
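
For reference, the two hazards in the table can be written in their standard continuous-time form. The notation is introduced here for illustration and is not taken from the notebook's code: $T$ is the event time, $K$ the event type, and $k$ the event of interest.

$$\lambda_k^{cs}(t) = \lim_{\Delta \to 0} \frac{P(t \le T < t + \Delta,\ K = k \mid T \ge t)}{\Delta}$$

$$\lambda_k^{sd}(t) = \lim_{\Delta \to 0} \frac{P\bigl(t \le T < t + \Delta,\ K = k \mid T \ge t \ \text{or}\ (T < t \ \text{and}\ K \ne k)\bigr)}{\Delta}$$

The only difference is the conditioning event: the subdistribution hazard keeps subjects who already had a competing event in the risk set. That is what ties it directly to the cumulative incidence function:

$$F_k(t) = P(T \le t,\ K = k) = 1 - \exp\left(-\int_0^t \lambda_k^{sd}(s)\,ds\right)$$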

## Why Discrete-Time?

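1. Continuous-time Fine-Gray with time-varying covariates is mathematically complex
2. The discrete-time model converges to the continuous-time model as the time intervals shrink
3. It is easy to implement with standard tools such as logistic regression
4. It is used in practice at major mortgage institutions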
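
The convergence claim in point 2 can be checked numerically. The sketch below is illustrative only: it assumes a constant continuous-time hazard rate within each interval, and the function name `interval_odds_ratio` and the chosen rates are hypothetical. It shows that the odds ratio from a per-interval logit model approaches the continuous-time hazard ratio as the interval width shrinks.

```python
import numpy as np

def interval_odds_ratio(base_rate, hazard_ratio, width):
    """Odds ratio of the per-interval event probability for two groups.

    With a constant rate r, the event probability over an interval is
    h = 1 - exp(-r * width), so the odds h / (1 - h) equal exp(r * width) - 1.
    """
    odds_base = np.expm1(base_rate * width)
    odds_exposed = np.expm1(hazard_ratio * base_rate * width)
    return odds_exposed / odds_base

# Illustrative values: baseline rate 0.1 per unit time, true hazard ratio 2
widths = [1.0, 0.1, 0.01]
odds_ratios = [interval_odds_ratio(0.1, 2.0, w) for w in widths]

for w, o in zip(widths, odds_ratios):
    print(f"interval width={w:<5} odds ratio={o:.4f}")
```

With monthly intervals and small monthly event probabilities, the exponentiated logit coefficients can therefore be read as approximate hazard ratios.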

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Competing risks module
import sys
sys.path.insert(0, '..')
from src.competing_risks import (
    DiscreteTimeFineGray,
    fit_discrete_time_competing_risks,
    create_fine_gray_dataset,
)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

sns.set_style('whitegrid')
%matplotlib inline

## Load Data

For discrete-time Fine-Gray, we ideally need the loan-month panel. If not available, we'll use the survival data with approximations.

In [3]:
DATA_DIR = Path('../data/processed')

# Try to load loan-month panel, fall back to survival data
panel_path = DATA_DIR / 'loan_month_panel.parquet'
survival_path = DATA_DIR / 'survival_data.parquet'

if panel_path.exists():
    print("Loading loan-month panel data...")
    df = pd.read_parquet(panel_path)
    use_panel = True
else:
    print("Loan-month panel not found. Loading survival data...")
    print("(Run notebook 03 first to create the panel)")
    df = pd.read_parquet(survival_path)
    use_panel = False
    
    # Create event code
    event_map = {
        'censored': 0, 'prepay': 1, 'default': 2,
        'matured': 0, 'other': 3, 'defect': 3,
    }
    df['event_code'] = df['event_type'].map(event_map)

print(f"Loaded {len(df):,} records")

Loading loan-month panel data...
Loaded 5,462,536 records


In [None]:
# Define feature groups (Blumenstock et al. 2022, Table 2)
# Matching notebook 05 (Cause-Specific Cox) exactly

# Static covariates (fixed at origination) - 5 variables
STATIC_FEATURES = [
    'int_rate',      # Initial interest rate
    'orig_upb',      # Original unpaid balance
    'fico_score',    # Initial FICO score
    'dti_r',         # Initial debt-to-income ratio
    'ltv_r',         # Initial loan-to-value ratio
]

# Behavioral covariates (time-varying) - 4 variables
BEHAVIORAL_FEATURES = [
    'bal_repaid',      # Current repaid balance in percent
    't_act_12m',       # No. of times not being delinquent in last 12 months
    't_del_30d_12m',   # No. of times being 30 days delinquent in last 12 months
    't_del_60d_12m',   # No. of times being 60 days delinquent in last 12 months
]

# Macro covariates (time-varying) - 12 variables from Blumenstock Table 2
MACRO_FEATURES = [
    # Origination-relative differences
    'hpi_st_d_t_o',    # HPI difference between origination and today (state)
    'ppi_c_FRMA',      # Current prepayment incentive (loan rate - current mortgage rate)
    'TB10Y_d_t_o',     # Treasury rate difference (today - origination)
    'FRMA30Y_d_t_o',   # 30Y FRM difference (today - origination)
    'ppi_o_FRMA',      # Prepayment incentive at origination (loan rate - orig mortgage rate)
    
    # State-level variables
    'hpi_st_log12m',   # HPI 12-month log return (state)
    'hpi_r_st_us',     # Ratio of state HPI to national HPI
    'st_unemp_r12m',   # Unemployment 12-month log return (state)
    'st_unemp_r3m',    # Unemployment 3-month log return (state)
    
    # National macro variables
    'TB10Y_r12m',      # Treasury rate 12-month return
    'T10Y3MM',         # Yield spread (10Y - 3M)
    'T10Y3MM_r12m',    # Yield spread 12-month return
]

# All features (9 loan-level + 12 macro = 21 total, matching Blumenstock Dataset 2)
ALL_FEATURES = STATIC_FEATURES + BEHAVIORAL_FEATURES + MACRO_FEATURES

if use_panel:
    time_col = 'loan_age'
    event_col = 'event_code'
else:
    time_col = 'duration'
    event_col = 'event_code'

# Filter to available features
feature_cols = [f for f in ALL_FEATURES if f in df.columns]
missing_features = [f for f in ALL_FEATURES if f not in df.columns]

print("=== Feature Groups (Blumenstock et al. 2022) ===")
print(f"Static features: {len([f for f in STATIC_FEATURES if f in feature_cols])}/5")
print(f"Behavioral features: {len([f for f in BEHAVIORAL_FEATURES if f in feature_cols])}/4")
print(f"Macro features: {len([f for f in MACRO_FEATURES if f in feature_cols])}/12")
print(f"\nTotal available: {len(feature_cols)}/21")

if missing_features:
    print(f"\nMissing features ({len(missing_features)}):")
    for f in missing_features:
        print(f"  - {f}")

In [None]:
# Prepare modeling data
print("=== Preparing Model Data ===")

# Required columns for modeling and train-test split
required_cols = [time_col, event_col, 'loan_sequence_number', 'fold']

if not use_panel:
    # Work with one record per loan (terminal)
    available_cols = [c for c in feature_cols + required_cols if c in df.columns]
    df_model = df[available_cols].dropna(subset=feature_cols).copy()
else:
    # Use panel data directly
    extra_cols = ['is_terminal'] if 'is_terminal' in df.columns else []
    available_cols = [c for c in feature_cols + required_cols + extra_cols if c in df.columns]
    df_model = df[available_cols].dropna(subset=feature_cols).copy()

# Log transform UPB for better coefficient interpretation (matching notebook 05)
if 'orig_upb' in df_model.columns:
    df_model['log_upb'] = np.log(df_model['orig_upb'])
    # Replace orig_upb with log_upb in feature list
    feature_cols = [f if f != 'orig_upb' else 'log_upb' for f in feature_cols]
    print("✓ Created log_upb (log of original UPB)")

print(f"Model data: {len(df_model):,} records")
print(f"Unique loans: {df_model['loan_sequence_number'].nunique():,}")
print(f"Folds: {sorted(df_model['fold'].unique())}")
print(f"Features ({len(feature_cols)}): {feature_cols}")

## Prepare Fine-Gray Dataset

In Fine-Gray, subjects with competing events remain in the risk set (with event=0) for the primary event.
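
Recoding the outcome alone keeps a competing event as `y = 0` in the month it occurs, but a loan-month panel normally stops at the terminal event. Staying in the risk set means the loan keeps contributing event-free rows afterwards. The sketch below shows one way this could be done on a toy panel. The helper `expand_competing_rows`, the `carried_forward` flag, the toy column values, and the common `horizon` are all assumptions for illustration. It carries covariates forward from the last observed row and ignores censoring weights, which is only reasonable when censoring is purely administrative.

```python
import pandas as pd

def expand_competing_rows(panel, id_col, time_col, event_col, horizon, primary=1):
    """Append event-free rows after a competing event, up to `horizon`.

    Hypothetical helper: covariates are copied from the last observed row.
    """
    panel = panel.reset_index(drop=True)
    # Terminal row of each subject
    last = panel.sort_values([id_col, time_col]).groupby(id_col).tail(1)
    # Subjects whose terminal event is neither censoring (0) nor the primary event
    competing = last[~last[event_col].isin([0, primary])]
    # Number of additional periods each of them stays in the risk set
    n_extra = (horizon - competing[time_col]).clip(lower=0).astype(int)
    extra = competing.loc[competing.index.repeat(n_extra.to_numpy())]
    extra = extra.reset_index(drop=True)
    extra[time_col] = extra[time_col] + extra.groupby(id_col).cumcount() + 1
    extra[event_col] = 0  # no primary event can occur in these rows
    out = pd.concat(
        [panel.assign(carried_forward=False), extra.assign(carried_forward=True)],
        ignore_index=True,
    )
    return out.sort_values([id_col, time_col]).reset_index(drop=True)

# Toy panel: A defaults at month 3, B prepays at month 2, C is censored at month 4
toy = pd.DataFrame({
    'loan_id':    ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'loan_age':   [1, 2, 3, 1, 2, 1, 2, 3, 4],
    'event_code': [0, 0, 2, 0, 1, 0, 0, 0, 0],
})

expanded = expand_competing_rows(toy, 'loan_id', 'loan_age', 'event_code', horizon=5)
expanded['y_prepay'] = (expanded['event_code'] == 1).astype(int)
print(expanded)
```

Loan A gains rows for months 4 and 5 with `y_prepay = 0`, while loans B and C are unchanged. In real data the horizon would typically differ per loan (its potential follow-up time), and time-varying covariates after the competing event need an explicit rule.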

In [None]:
# Create Fine-Gray formatted dataset for prepayment (event=1)
# In Fine-Gray:
# - Primary event (prepay=1) -> y = 1
# - Competing event (default=2) -> y = 0, but STAY in risk set
# - Censored -> y = 0

df_fg = df_model.copy()
df_fg['y_prepay'] = (df_fg[event_col] == 1).astype(int)
df_fg['y_default'] = (df_fg[event_col] == 2).astype(int)

print("Fine-Gray outcome distribution:")
print(f"  Prepay (y=1): {df_fg['y_prepay'].sum():,}")
print(f"  Non-prepay (y=0): {(df_fg['y_prepay'] == 0).sum():,}")

In [None]:
# Train-test split using fold structure from loan_month_panel
# Matching notebook 05 (Cause-Specific Cox) methodology
# Folds 0-9 for training, fold 10 for testing/tuning

TRAIN_FOLDS = list(range(10))
TEST_FOLD = 10

train_df = df_fg[df_fg['fold'].isin(TRAIN_FOLDS)].copy()
test_df = df_fg[df_fg['fold'] == TEST_FOLD].copy()

print("Using Blumenstock fold structure (from loan_month_panel):")
print(f"  Training folds: {TRAIN_FOLDS}")
print(f"  Test fold: {TEST_FOLD}")
print(f"\nTraining: {len(train_df):,} records ({train_df['loan_sequence_number'].nunique():,} loans)")
print(f"Test: {len(test_df):,} records ({test_df['loan_sequence_number'].nunique():,} loans)")

## Fit Discrete-Time Fine-Gray Model

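We use logistic regression to model the discrete-time subdistribution hazard. With the loan-month panel, each row is one loan-month, and the predicted probability for a row is the hazard of prepayment in that month.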
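
A discrete-time hazard model usually also carries a baseline term for time itself, and the feature list in this notebook does not include loan age. The sketch below uses synthetic data (the hazards, sample size, and variable names are all made up for illustration) to show that a logistic regression with one dummy per period recovers the per-period hazard from person-period rows. A very large `C` is used so that the fit is effectively unpenalized.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
true_hazard = np.array([0.02, 0.05, 0.10, 0.05, 0.02])  # illustrative values
n_subjects, n_periods = 20_000, len(true_hazard)

# Potential event indicator for every subject and period
draws = (rng.random((n_subjects, n_periods)) < true_hazard).astype(int)
# A subject is at risk in a period only if no event occurred in an earlier period
events_before = np.cumsum(draws, axis=1) - draws
at_risk = events_before == 0

# Person-period panel: one row per subject and period at risk
subject_idx, period_idx = np.nonzero(at_risk)
panel = pd.DataFrame({
    'period': period_idx + 1,
    'y': draws[subject_idx, period_idx],
})

# One dummy per period acts as a flexible baseline hazard
X = pd.get_dummies(panel['period'], prefix='t').to_numpy(dtype=float)
model = LogisticRegression(C=1e6, fit_intercept=False, tol=1e-8, max_iter=1000)
model.fit(X, panel['y'])

fitted_hazard = model.predict_proba(np.eye(n_periods))[:, 1]
empirical_hazard = panel.groupby('period')['y'].mean().to_numpy()
print(pd.DataFrame({'fitted': fitted_hazard, 'empirical': empirical_hazard}).round(4))
```

With many distinct loan ages, a spline or a coarser binning of `loan_age` would be a more economical baseline than one dummy per month.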

In [None]:
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(train_df[feature_cols])
X_test = scaler.transform(test_df[feature_cols])

y_train = train_df['y_prepay'].values
y_test = test_df['y_prepay'].values

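# Fit logistic regression (sklearn). This fit is L2-penalized (C=1.0), so its
# coefficients can differ slightly from the unpenalized statsmodels fit below.
fg_model = LogisticRegression(C=1.0, penalty='l2', solver='lbfgs', max_iter=1000)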
fg_model.fit(X_train, y_train)

print("Discrete-time Fine-Gray model fitted (sklearn)")

In [None]:
# Also fit with statsmodels for inference (p-values, CIs)
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)

fg_sm = sm.Logit(y_train, X_train_sm)
fg_result = fg_sm.fit(disp=False)

print("\n=== Fine-Gray Model Summary (Prepayment) ===")
print(fg_result.summary2())

In [None]:
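# Extract and display coefficients
# Note: exp(coefficient) from a logit model is an odds ratio; it approximates
# the hazard ratio when the per-period hazard is small.
coef_df = pd.DataFrame({
    'feature': ['intercept'] + feature_cols,
    'coefficient': np.asarray(fg_result.params),
    'std_err': np.asarray(fg_result.bse),
    'z': np.asarray(fg_result.tvalues),
    'p_value': np.asarray(fg_result.pvalues),
    'hazard_ratio': np.exp(np.asarray(fg_result.params)),
})

# Add confidence intervals on the hazard-ratio scale.
# X_train_sm is a NumPy array, so conf_int() returns an (n_params, 2) array:
# column 0 holds the lower bounds and column 1 the upper bounds.
conf_int = np.asarray(fg_result.conf_int())
coef_df['ci_lower'] = np.exp(conf_int[:, 0])
coef_df['ci_upper'] = np.exp(conf_int[:, 1])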

print("\n=== Fine-Gray Coefficients (Standardized) ===")
print(coef_df.round(4).to_string(index=False))

In [None]:
# Plot hazard ratios with confidence intervals
fig, ax = plt.subplots(figsize=(10, 6))

# Skip intercept
plot_df = coef_df[coef_df['feature'] != 'intercept'].copy()

y_pos = np.arange(len(plot_df))
colors = ['green' if c < 0 else 'red' for c in plot_df['coefficient']]

ax.barh(y_pos, plot_df['hazard_ratio'], color=colors, alpha=0.7)
ax.errorbar(plot_df['hazard_ratio'], y_pos, 
            xerr=[plot_df['hazard_ratio'] - plot_df['ci_lower'],
                  plot_df['ci_upper'] - plot_df['hazard_ratio']],
            fmt='none', color='black', capsize=3)

ax.axvline(x=1, color='black', linestyle='--', linewidth=1)
ax.set_yticks(y_pos)
ax.set_yticklabels(plot_df['feature'])
ax.set_xlabel('Hazard Ratio (95% CI)')
ax.set_title('Fine-Gray Model: Prepayment Subdistribution Hazard Ratios')
ax.grid(True, alpha=0.3, axis='x')

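plt.tight_layout()
# Create the output directory if needed so savefig does not fail
Path('../reports/figures').mkdir(parents=True, exist_ok=True)
plt.savefig('../reports/figures/fine_gray_hazard_ratios.png', dpi=150)
plt.show()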

## Model Evaluation

In [None]:
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss

# Predict probabilities
y_pred_train = fg_model.predict_proba(X_train)[:, 1]
y_pred_test = fg_model.predict_proba(X_test)[:, 1]

# Calculate metrics
print("=== Model Evaluation ===")
print(f"\nTraining Set:")
print(f"  AUC: {roc_auc_score(y_train, y_pred_train):.4f}")
print(f"  Brier Score: {brier_score_loss(y_train, y_pred_train):.4f}")
print(f"  Log Loss: {log_loss(y_train, y_pred_train):.4f}")

print(f"\nTest Set:")
print(f"  AUC: {roc_auc_score(y_test, y_pred_test):.4f}")
print(f"  Brier Score: {brier_score_loss(y_test, y_pred_test):.4f}")
print(f"  Log Loss: {log_loss(y_test, y_pred_test):.4f}")

In [None]:
# Calibration plot
from sklearn.calibration import calibration_curve

fig, ax = plt.subplots(figsize=(8, 8))

prob_true, prob_pred = calibration_curve(y_test, y_pred_test, n_bins=10)

ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
ax.plot(prob_pred, prob_true, 'o-', label='Fine-Gray model')

ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Fraction of positives')
ax.set_title('Calibration Plot: Fine-Gray Prepayment Model')
ax.legend()
ax.grid(True, alpha=0.3)

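plt.tight_layout()
# Create the output directory if needed so savefig does not fail
Path('../reports/figures').mkdir(parents=True, exist_ok=True)
plt.savefig('../reports/figures/fine_gray_calibration.png', dpi=150)
plt.show()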

## Cumulative Incidence Prediction

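For discrete-time models, the cumulative incidence function is
$$CIF(t) = \sum_{s \leq t} h(s) \cdot S(s-1) = 1 - \prod_{s \leq t} \bigl(1 - h(s)\bigr)$$

where $h(s)$ is the subdistribution hazard in period $s$ and $S(s) = \prod_{u \leq s} \bigl(1 - h(u)\bigr)$ is the probability of not yet having experienced the primary event by the end of period $s$, with $S(0) = 1$. Note that $S$ here is $1 - CIF$, not the overall survival from all causes.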
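
As a quick check, the sum over $h(s) \cdot S(s-1)$ and the product form $1 - \prod (1 - h(s))$ give the same cumulative incidence when $S$ is the running product of $(1 - h)$, because each term $h(s) \cdot S(s-1)$ equals $S(s-1) - S(s)$ and the sum telescopes. The hazard values below are made up for illustration.

```python
import numpy as np

# Illustrative per-period subdistribution hazards for one subject
h = np.array([0.01, 0.03, 0.05, 0.02])

S = np.cumprod(1 - h)                      # S(1), ..., S(T)
S_prev = np.concatenate(([1.0], S[:-1]))   # S(0), ..., S(T-1), with S(0) = 1

cif_sum = np.cumsum(h * S_prev)            # sum form
cif_prod = 1 - S                           # product form

print(np.column_stack([cif_sum, cif_prod]))
```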

In [None]:
def predict_cif_simple(hazard_probs: np.ndarray) -> np.ndarray:
    """
    Convert hazard probabilities to cumulative incidence.
    
    For a single subject's hazard over time:
    CIF(t) = 1 - prod(1 - h(s)) for s = 1 to t
    """
    survival = np.cumprod(1 - hazard_probs)
    cif = 1 - survival
    return cif

# Inspect the predicted per-period subdistribution hazards (the inputs to predict_cif_simple)
print("Predicted subdistribution hazard (first 10 records):")
print(y_pred_test[:10])
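
To turn per-row hazards into a CIF path for each loan, the cumulative product has to be taken within each loan and in time order. A minimal sketch on a toy frame follows; the column names `loan_id`, `loan_age`, and `hazard` and all values are illustrative assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy predicted hazards for two loans (illustrative values)
toy = pd.DataFrame({
    'loan_id':  ['A', 'A', 'A', 'B', 'B'],
    'loan_age': [1, 2, 3, 1, 2],
    'hazard':   [0.10, 0.20, 0.50, 0.05, 0.05],
})

# Sort so that the cumulative product runs forward in time within each loan
toy = toy.sort_values(['loan_id', 'loan_age']).reset_index(drop=True)

# CIF(t) = 1 - prod_{s <= t} (1 - h(s)), computed separately per loan
toy['cif'] = 1 - (1 - toy['hazard']).groupby(toy['loan_id']).cumprod()
print(toy)
```

The same pattern would apply to the test set by attaching `y_pred_test` as a column and grouping by `loan_sequence_number`, provided each loan's rows cover consecutive months from the start of follow-up.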

## Save Model

In [None]:
import pickle

MODELS_DIR = Path('../models')
MODELS_DIR.mkdir(exist_ok=True)

# Save sklearn model
with open(MODELS_DIR / 'fine_gray_prepay.pkl', 'wb') as f:
    pickle.dump({'model': fg_model, 'scaler': scaler, 'features': feature_cols}, f)

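# Save statsmodels result (str() keeps this compatible with statsmodels
# versions that expect a string path rather than a Path object)
fg_result.save(str(MODELS_DIR / 'fine_gray_prepay_sm.pkl'))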

print(f"Models saved to {MODELS_DIR}")

## Summary

In [None]:
print("=" * 60)
print("FINE-GRAY MODEL SUMMARY")
print("=" * 60)

print(f"\nModel: Discrete-time Fine-Gray (Logistic Regression)")
print(f"Primary Event: Prepayment (event=1)")
print(f"Competing Event: Default (event=2)")

print(f"\nFeatures (Blumenstock et al. 2022):")
print(f"  Static: {len([f for f in STATIC_FEATURES if f in feature_cols or (f == 'orig_upb' and 'log_upb' in feature_cols)])}/5")
print(f"  Behavioral: {len([f for f in BEHAVIORAL_FEATURES if f in feature_cols])}/4")
print(f"  Macro: {len([f for f in MACRO_FEATURES if f in feature_cols])}/12")
print(f"  Total: {len(feature_cols)}/21")

print(f"\nPerformance:")
print(f"  Test AUC: {roc_auc_score(y_test, y_pred_test):.4f}")
print(f"  Test Brier Score: {brier_score_loss(y_test, y_pred_test):.4f}")

print(f"\nKey Coefficients (significant at p<0.05):")
sig_coefs = coef_df[(coef_df['p_value'] < 0.05) & (coef_df['feature'] != 'intercept')]
for _, row in sig_coefs.iterrows():
    direction = '↑' if row['coefficient'] > 0 else '↓'
    print(f"  {row['feature']}: HR={row['hazard_ratio']:.3f} {direction}")

## Next Steps

**Notebook 07**: Compare Fine-Gray vs Cause-Specific Cox models

Key comparisons:
- Coefficient differences and interpretations
- Discrimination metrics (C-index, time-dependent AUC)
- Cumulative incidence predictions vs observed
- Calibration assessment