# Cause-Specific Cox Proportional Hazards Models with Time-Varying Covariates

This notebook fits **cause-specific Cox models** for prepayment and default using the loan-month panel data with time-varying covariates.

## Methodology (Blumenstock et al. 2022)

### Data Structure
- **Interval format**: Each loan contributes multiple rows (one per month)
- **Time-varying covariates**: Behavioral and macroeconomic variables updated each month
- **Fold-based split**: 11 folds (0-10); folds 0-9 are used for training and fold 10 is held out for tuning
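
As a minimal sketch of the interval format described above, the snippet below expands one hypothetical loan into monthly `(start, stop]` rows. The helper `to_intervals` and the toy values are illustrative assumptions, not part of this notebook's pipeline; only the column names `start`, `stop`, and `event` follow the panel used later.

```python
import pandas as pd

def to_intervals(loan_id, n_months, terminal_event):
    """Expand one loan into monthly (start, stop] rows (hypothetical helper)."""
    rows = []
    for age in range(1, n_months + 1):
        rows.append({
            "loan_sequence_number": loan_id,
            "start": age - 1,   # interval opens at loan_age - 1
            "stop": age,        # interval closes at loan_age
            # the event flag is set only on the final row of the loan
            "event": int(terminal_event and age == n_months),
        })
    return pd.DataFrame(rows)

toy = to_intervals("L001", n_months=3, terminal_event=True)
print(toy.to_string(index=False))
```

Every row except the last carries `event = 0`, which is what lets the covariates change from month to month while the loan stays in the risk set.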

### Covariates
- **Static**: FICO, LTV, DTI, interest rate, original UPB
- **Behavioral**: Balance repaid %, 12-month rolling delinquency counts
- **Macro**: HPI changes, mortgage rate spread, Treasury rate changes, unemployment
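
To illustrate how a behavioral covariate such as a 12-month rolling delinquency count could be derived, here is a small sketch on toy data. The input flag `is_del_30d` is a hypothetical column, and whether the window should include the current month is an assumption; the panel loaded below already contains the finished `t_del_30d_12m` column.

```python
import pandas as pd

# Toy monthly history for one hypothetical loan; is_del_30d flags a 30-day delinquency
hist = pd.DataFrame({
    "loan_sequence_number": ["L001"] * 5,
    "loan_age": [1, 2, 3, 4, 5],
    "is_del_30d": [0, 1, 1, 0, 0],
})

WINDOW = 12  # months; early rows use however many months exist so far

hist = hist.sort_values(["loan_sequence_number", "loan_age"])
hist["t_del_30d_12m"] = (
    hist.groupby("loan_sequence_number")["is_del_30d"]
        .transform(lambda s: s.rolling(WINDOW, min_periods=1).sum())
        .astype(int)
)
print(hist["t_del_30d_12m"].tolist())  # [0, 1, 2, 2, 2]
```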

### Models
1. Cause-specific Cox for **prepayment** (default treated as censored)
2. Cause-specific Cox for **default** (prepayment treated as censored)
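
In symbols, the two models can be sketched with standard cause-specific Cox notation. The symbols are introduced here for illustration and are not taken from the notebook: $\lambda_{0k}(t)$ is the baseline hazard for cause $k$, $x_i(t)$ is the covariate vector of loan $i$ in month $t$, and $\beta_k$ is the coefficient vector for cause $k$.

```latex
\lambda_k\bigl(t \mid x_i(t)\bigr) = \lambda_{0k}(t)\,\exp\bigl(\beta_k^{\top} x_i(t)\bigr),
\qquad k \in \{1\ (\text{prepayment}),\ 2\ (\text{default})\}
```

Each cause has its own baseline hazard and coefficients, and a loan that exits through the other cause leaves the risk set as a censored observation.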

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Survival analysis
from lifelines import CoxTimeVaryingFitter  # Correct fitter for time-varying covariates
from sksurv.metrics import concordance_index_censored

from sklearn.preprocessing import StandardScaler

sns.set_style('whitegrid')
%matplotlib inline

print("Imports complete.")

In [None]:
# === CONFIGURATION ===
DATA_DIR = Path('../data/processed')
FIGURES_DIR = Path('../reports/figures')
MODELS_DIR = Path('../models')

FIGURES_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# Cross-validation folds (Blumenstock: fold 10 for hyperparameter tuning)
TRAIN_FOLDS = list(range(10))  # Folds 0-9 for training/validation
TUNE_FOLD = 10                  # Fold 10 for hyperparameter tuning

print(f"Training folds: {TRAIN_FOLDS}")
print(f"Tuning fold: {TUNE_FOLD}")

---

## Load Loan-Month Panel Data

In [None]:
# Load the loan-month panel data
print("Loading loan-month panel data...")
panel_df = pd.read_parquet(DATA_DIR / 'loan_month_panel.parquet')

print(f"Loaded {len(panel_df):,} loan-months")
print(f"Unique loans: {panel_df['loan_sequence_number'].nunique():,}")
print(f"Folds: {sorted(panel_df['fold'].unique())}")
print(f"Vintages: {panel_df['vintage_year'].min()} - {panel_df['vintage_year'].max()}")

print("\nEvent distribution:")
event_names = {0: 'Censored', 1: 'Prepay', 2: 'Default'}
terminal_events = panel_df[panel_df['event'] == 1].groupby('event_code').size()
for code, count in terminal_events.items():
    print(f"  {event_names.get(code, 'Other')} (k={code}): {count:,}")

In [None]:
# Examine data structure
print("=== Panel Data Structure ===")
print(f"\nColumns: {list(panel_df.columns)}")

print("\n=== Example: Single Loan History ===")
example_loan = panel_df['loan_sequence_number'].iloc[0]
example_df = panel_df[panel_df['loan_sequence_number'] == example_loan].head(10)
print(example_df[['loan_sequence_number', 'start', 'stop', 'event', 'event_code', 
                  'loan_age', 'bal_repaid', 't_del_30d_12m']].to_string(index=False))

print("\n=== Interval Format ===")
print("Each row represents interval (start, stop] where:")
print("  - start: Beginning of interval (loan_age - 1)")
print("  - stop: End of interval (loan_age)")
print("  - event: 1 if terminal event occurs at stop, 0 otherwise")

In [None]:
# Verify interval format validity
print("=== Verifying Interval Format ===")

# Check start < stop
invalid_intervals = (panel_df['start'] >= panel_df['stop']).sum()
print(f"Invalid intervals (start >= stop): {invalid_intervals:,}")

# Check start values
print(f"\nStart value distribution:")
print(f"  Min: {panel_df['start'].min()}")
print(f"  Max: {panel_df['start'].max()}")
print(f"  Negative starts: {(panel_df['start'] < 0).sum():,}")

# Check stop values
print(f"\nStop value distribution:")
print(f"  Min: {panel_df['stop'].min()}")
print(f"  Max: {panel_df['stop'].max()}")

# For lifelines, start must be >= 0
# Since loan_age starts at 1, start = loan_age - 1 starts at 0
if panel_df['start'].min() < 0:
    # This shouldn't happen with loan_age starting at 1; nothing is adjusted here
    print("\n⚠ Warning: Negative start values found. Inspect these rows before fitting.")
else:
    print("\n✓ All start values are non-negative")

---

## Define Features

In [None]:
# Define feature groups (Blumenstock et al. 2022, Table 2)

# Static covariates (fixed at origination) - 5 variables
STATIC_FEATURES = [
    'int_rate',      # Initial interest rate
    'orig_upb',      # Original unpaid balance
    'fico_score',    # Initial FICO score
    'dti_r',         # Initial debt-to-income ratio
    'ltv_r',         # Initial loan-to-value ratio
]

# Behavioral covariates (time-varying) - 4 variables
BEHAVIORAL_FEATURES = [
    'bal_repaid',      # Current repaid balance in percent
    't_act_12m',       # No. of times not being delinquent in last 12 months
    't_del_30d_12m',   # No. of times being 30 days delinquent in last 12 months
    't_del_60d_12m',   # No. of times being 60 days delinquent in last 12 months
]

# Macro covariates (time-varying) - 12 variables from Blumenstock Table 2
MACRO_FEATURES = [
    # Origination-relative differences
    'hpi_st_d_t_o',    # HPI difference between origination and today (state)
    'ppi_c_FRMA',      # Current prepayment incentive (loan rate - current mortgage rate)
    'TB10Y_d_t_o',     # Treasury rate difference (today - origination)
    'FRMA30Y_d_t_o',   # 30Y FRM difference (today - origination)
    'ppi_o_FRMA',      # Prepayment incentive at origination (loan rate - orig mortgage rate)
    
    # State-level variables
    'hpi_st_log12m',   # HPI 12-month log return (state)
    'hpi_r_st_us',     # Ratio of state HPI to national HPI
    'st_unemp_r12m',   # Unemployment 12-month log return (state)
    'st_unemp_r3m',    # Unemployment 3-month log return (state)
    
    # National macro variables
    'TB10Y_r12m',      # Treasury rate 12-month return
    'T10Y3MM',         # Yield spread (10Y - 3M)
    'T10Y3MM_r12m',    # Yield spread 12-month return
]

# All features (9 loan-level + 12 macro = 21 total, matching Blumenstock Dataset 2)
ALL_FEATURES = STATIC_FEATURES + BEHAVIORAL_FEATURES + MACRO_FEATURES

# Filter to available features
available_features = [f for f in ALL_FEATURES if f in panel_df.columns]
missing_features = [f for f in ALL_FEATURES if f not in panel_df.columns]

print("=== Feature Groups (Blumenstock et al. 2022) ===")
print(f"Static features: {len([f for f in STATIC_FEATURES if f in available_features])}/5")
print(f"Behavioral features: {len([f for f in BEHAVIORAL_FEATURES if f in available_features])}/4")
print(f"Macro features: {len([f for f in MACRO_FEATURES if f in available_features])}/12")
print(f"\nTotal available: {len(available_features)}/21")
if missing_features:
    print(f"\nMissing features ({len(missing_features)}):")
    for f in missing_features:
        print(f"  - {f}")

In [None]:
# Prepare modeling data
print("=== Preparing Model Data ===")

# Required columns for Cox regression in counting-process (start, stop] format
required_cols = ['start', 'stop', 'event', 'event_code', 'fold', 'loan_sequence_number']
model_cols = required_cols + available_features

# Filter to complete cases
df_model = panel_df[model_cols].copy()
n_before = len(df_model)
df_model = df_model.dropna(subset=available_features)
n_after = len(df_model)

print(f"Rows before dropna: {n_before:,}")
print(f"Rows after dropna: {n_after:,}")
print(f"Dropped: {n_before - n_after:,} ({(n_before - n_after) / n_before * 100:.1f}%)")

print(f"\nUnique loans: {df_model['loan_sequence_number'].nunique():,}")

# Log transform UPB for better coefficient interpretation
if 'orig_upb' in df_model.columns:
    df_model['log_upb'] = np.log(df_model['orig_upb'])
    # Replace orig_upb with log_upb in feature list
    available_features = [f if f != 'orig_upb' else 'log_upb' for f in available_features]
    print("\n✓ Created log_upb (log of original UPB)")

In [None]:
# Check feature coverage (sanity check: expected to be 100% after the dropna above)
print("=== Feature Coverage ===")
for feature in available_features:
    coverage = df_model[feature].notna().mean()
    print(f"  {feature}: {coverage:.1%}")

---

## Split Data by Folds

In [None]:
# Split by folds (Blumenstock methodology)
print("=== Splitting Data by Folds ===")

# Training set: folds 0-9
train_df = df_model[df_model['fold'].isin(TRAIN_FOLDS)].copy()

# Tuning set: fold 10
tune_df = df_model[df_model['fold'] == TUNE_FOLD].copy()

print(f"Training set (folds 0-9):")
print(f"  Loan-months: {len(train_df):,}")
print(f"  Unique loans: {train_df['loan_sequence_number'].nunique():,}")

print(f"\nTuning set (fold 10):")
print(f"  Loan-months: {len(tune_df):,}")
print(f"  Unique loans: {tune_df['loan_sequence_number'].nunique():,}")

# Event distribution in training set
print("\nTraining set event distribution:")
train_events = train_df[train_df['event'] == 1].groupby('event_code').size()
for code, count in train_events.items():
    print(f"  {event_names.get(code, 'Other')}: {count:,}")

---

## Cause-Specific Cox Model for Prepayment

Fit Cox model with time-varying covariates where:
- **Event** = prepayment (event_code = 1)
- **Censored** = default, actual censoring, and ongoing observations
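
A minimal sketch of this cause-specific coding on toy data follows. The three loans are hypothetical; the event codes (0 = censored, 1 = prepay, 2 = default) follow `event_names` defined earlier.

```python
import pandas as pd

# Toy terminal rows for three hypothetical loans
toy = pd.DataFrame({
    "loan_sequence_number": ["A", "B", "C"],
    "event": [1, 1, 0],
    "event_code": [1, 2, 0],
})

# Prepayment model: only prepayments count as events; defaults become censored
toy["event_prepay"] = ((toy["event"] == 1) & (toy["event_code"] == 1)).astype(int)
print(toy["event_prepay"].tolist())  # [1, 0, 0]
```

Loan B defaulted, so it contributes to the prepayment risk set up to its exit time but never as a prepayment event.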

In [None]:
# Prepare data for prepayment model (cause-specific)
print("=== Preparing Prepayment Model Data ===")

# Create prepayment-specific event indicator
# Prepay = 1, everything else (default, censored) = 0
train_prepay = train_df.copy()
train_prepay['event_prepay'] = ((train_prepay['event'] == 1) & 
                                 (train_prepay['event_code'] == 1)).astype(int)

tune_prepay = tune_df.copy()
tune_prepay['event_prepay'] = ((tune_prepay['event'] == 1) & 
                                (tune_prepay['event_code'] == 1)).astype(int)

print(f"Training prepayments: {train_prepay['event_prepay'].sum():,}")
print(f"Tuning prepayments: {tune_prepay['event_prepay'].sum():,}")

In [None]:
# Fit cause-specific Cox model for prepayment with time-varying covariates
print("=== Fitting Cause-Specific Cox Model: PREPAYMENT ===")

# Columns for CoxTimeVaryingFitter
# Requires: id_col, start, stop, event, and covariates
cox_cols = ['loan_sequence_number', 'start', 'stop', 'event_prepay'] + available_features

# Fit model with penalization for stability
ctv_prepay = CoxTimeVaryingFitter(penalizer=0.01)
ctv_prepay.fit(
    train_prepay[cox_cols],
    id_col='loan_sequence_number',
    start_col='start',
    stop_col='stop',
    event_col='event_prepay',
    show_progress=True
)

print("\n" + "=" * 60)
print("PREPAYMENT MODEL RESULTS")
print("=" * 60)
ctv_prepay.print_summary()

In [None]:
# Plot coefficients (log hazard ratios) for prepayment model
fig, ax = plt.subplots(figsize=(10, 8))
ctv_prepay.plot(ax=ax)  # lifelines plots coef = log(HR) with 95% CIs
ax.set_title('Cause-Specific Cox Model: Prepayment\nLog Hazard Ratios (95% CI)', fontsize=14)
ax.axvline(x=0, color='black', linestyle='--', alpha=0.5)  # log(HR) = 0 means no effect
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'cox_prepay_hazard_ratios_tv.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Evaluate prepayment model on tuning set
print("=== Prepayment Model Evaluation (Tuning Set) ===")

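# Get each loan's last observed row for the C-index calculation.
# Note: this scores a time-varying model on covariates at the last interval only,
# so the resulting C-index is an approximation.
tune_terminal = (tune_df.sort_values(['loan_sequence_number', 'stop'])
                 .groupby('loan_sequence_number').tail(1)
                 .reset_index(drop=True))
# Require event == 1 so the indicator matches the training definition
tune_terminal['event_prepay'] = ((tune_terminal['event'] == 1) &
                                 (tune_terminal['event_code'] == 1)).astype(int)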

# Compute risk scores using model coefficients
# Use the model's coefficient index to ensure correct feature alignment
model_features = ctv_prepay.params_.index.tolist()
X_tune = tune_terminal[model_features].values
coefs = ctv_prepay.params_.values

# Risk = exp(X @ beta) - higher means higher hazard
risk_prepay = np.exp(X_tune @ coefs)

# Calculate C-index
c_index_prepay = concordance_index_censored(
    tune_terminal['event_prepay'].astype(bool),
    tune_terminal['stop'],
    risk_prepay.flatten()
)

print(f"Prepayment Model - Tuning Set C-index: {c_index_prepay[0]:.4f}")
print(f"  Concordant pairs: {c_index_prepay[1]:,}")
print(f"  Discordant pairs: {c_index_prepay[2]:,}")
print(f"  Tied risk pairs: {c_index_prepay[3]:,}")

---

## Cause-Specific Cox Model for Default

Fit Cox model with time-varying covariates where:
- **Event** = default (event_code = 2)
- **Censored** = prepayment, actual censoring, and ongoing observations

In [None]:
# Prepare data for default model (cause-specific)
print("=== Preparing Default Model Data ===")

# Create default-specific event indicator
# Default = 1, everything else (prepay, censored) = 0
train_default = train_df.copy()
train_default['event_default'] = ((train_default['event'] == 1) & 
                                   (train_default['event_code'] == 2)).astype(int)

tune_default = tune_df.copy()
tune_default['event_default'] = ((tune_default['event'] == 1) & 
                                  (tune_default['event_code'] == 2)).astype(int)

print(f"Training defaults: {train_default['event_default'].sum():,}")
print(f"Tuning defaults: {tune_default['event_default'].sum():,}")

In [None]:
# Fit cause-specific Cox model for default with time-varying covariates
print("=== Fitting Cause-Specific Cox Model: DEFAULT ===")

# Columns for CoxTimeVaryingFitter
cox_cols = ['loan_sequence_number', 'start', 'stop', 'event_default'] + available_features

# Fit model with penalization for stability
ctv_default = CoxTimeVaryingFitter(penalizer=0.01)
ctv_default.fit(
    train_default[cox_cols],
    id_col='loan_sequence_number',
    start_col='start',
    stop_col='stop',
    event_col='event_default',
    show_progress=True
)

print("\n" + "=" * 60)
print("DEFAULT MODEL RESULTS")
print("=" * 60)
ctv_default.print_summary()

In [None]:
# Plot coefficients (log hazard ratios) for default model
fig, ax = plt.subplots(figsize=(10, 8))
ctv_default.plot(ax=ax)  # lifelines plots coef = log(HR) with 95% CIs
ax.set_title('Cause-Specific Cox Model: Default\nLog Hazard Ratios (95% CI)', fontsize=14)
ax.axvline(x=0, color='black', linestyle='--', alpha=0.5)  # log(HR) = 0 means no effect
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'cox_default_hazard_ratios_tv.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Evaluate default model on tuning set
print("=== Default Model Evaluation (Tuning Set) ===")

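# Get terminal observations (reuse tune_terminal from prepay evaluation)
# Require event == 1 so the indicator matches the training definition
tune_terminal['event_default'] = ((tune_terminal['event'] == 1) &
                                  (tune_terminal['event_code'] == 2)).astype(int)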

# Compute risk scores using model coefficients
# Use the model's coefficient index to ensure correct feature alignment
model_features = ctv_default.params_.index.tolist()
X_tune = tune_terminal[model_features].values
coefs = ctv_default.params_.values

# Risk = exp(X @ beta) - higher means higher hazard
risk_default = np.exp(X_tune @ coefs)

# Calculate C-index
c_index_default = concordance_index_censored(
    tune_terminal['event_default'].astype(bool),
    tune_terminal['stop'],
    risk_default.flatten()
)

print(f"Default Model - Tuning Set C-index: {c_index_default[0]:.4f}")
print(f"  Concordant pairs: {c_index_default[1]:,}")
print(f"  Discordant pairs: {c_index_default[2]:,}")
print(f"  Tied risk pairs: {c_index_default[3]:,}")

---

## Compare Coefficients: Prepayment vs Default

In [None]:
# Extract and compare coefficients
prepay_coefs = ctv_prepay.summary[['coef', 'exp(coef)', 'p']].copy()
prepay_coefs.columns = ['coef_prepay', 'hr_prepay', 'p_prepay']

default_coefs = ctv_default.summary[['coef', 'exp(coef)', 'p']].copy()
default_coefs.columns = ['coef_default', 'hr_default', 'p_default']

# Combine
comparison = prepay_coefs.join(default_coefs)
comparison['coef_diff'] = comparison['coef_prepay'] - comparison['coef_default']
comparison['effect_direction'] = np.where(
    comparison['coef_prepay'] * comparison['coef_default'] > 0,
    'Same', 'Opposite'
)

print("=== Coefficient Comparison: Prepayment vs Default ===")
print(comparison.round(4).to_string())

In [None]:
# Plot coefficient comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Left: Bar chart of coefficients
ax = axes[0]
x = np.arange(len(comparison))  # one bar group per fitted coefficient
width = 0.35

bars1 = ax.barh(x - width/2, comparison['coef_prepay'], width, 
                label='Prepayment', color='steelblue', alpha=0.7)
bars2 = ax.barh(x + width/2, comparison['coef_default'], width, 
                label='Default', color='indianred', alpha=0.7)

ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
ax.set_ylabel('Feature')
ax.set_xlabel('Coefficient')
ax.set_title('Cause-Specific Cox Coefficients', fontsize=12)
ax.set_yticks(x)
ax.set_yticklabels(comparison.index)  # label bars in the model's coefficient order
ax.legend()
ax.grid(True, alpha=0.3, axis='x')

# Right: Hazard ratios
ax = axes[1]
ax.barh(x - width/2, comparison['hr_prepay'], width, 
        label='Prepayment', color='steelblue', alpha=0.7)
ax.barh(x + width/2, comparison['hr_default'], width, 
        label='Default', color='indianred', alpha=0.7)

ax.axvline(x=1, color='black', linestyle='--', linewidth=0.5)
ax.set_ylabel('Feature')
ax.set_xlabel('Hazard Ratio')
ax.set_title('Cause-Specific Hazard Ratios', fontsize=12)
ax.set_yticks(x)
ax.set_yticklabels(comparison.index)  # label bars in the model's coefficient order
ax.legend()
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'cox_coefficient_comparison_tv.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Interpretation of key coefficients
print("=== Key Coefficient Interpretations ===")

def interpret_coef(name, hr_prepay, hr_default, p_prepay, p_default):
    print(f"\n{name}:")
    
    # Prepayment effect
    if p_prepay < 0.05:
        if hr_prepay > 1:
            print(f"  Prepayment: +1 unit → {(hr_prepay-1)*100:.1f}% higher prepay hazard")
        else:
            print(f"  Prepayment: +1 unit → {(1-hr_prepay)*100:.1f}% lower prepay hazard")
    else:
        print(f"  Prepayment: Not significant (p={p_prepay:.3f})")
    
    # Default effect
    if p_default < 0.05:
        if hr_default > 1:
            print(f"  Default: +1 unit → {(hr_default-1)*100:.1f}% higher default hazard")
        else:
            print(f"  Default: +1 unit → {(1-hr_default)*100:.1f}% lower default hazard")
    else:
        print(f"  Default: Not significant (p={p_default:.3f})")

# Key variables to interpret
key_vars = ['fico_score', 'ltv_r', 'ppi_c_FRMA', 'bal_repaid', 't_del_30d_12m', 'hpi_st_d_t_o']
key_vars = [v for v in key_vars if v in comparison.index]

for var in key_vars:
    interpret_coef(
        var,
        comparison.loc[var, 'hr_prepay'],
        comparison.loc[var, 'hr_default'],
        comparison.loc[var, 'p_prepay'],
        comparison.loc[var, 'p_default']
    )

---

## Model Summary

In [None]:
print("=" * 70)
print("CAUSE-SPECIFIC COX MODELS WITH TIME-VARYING COVARIATES - SUMMARY")
print("=" * 70)

print(f"\nData:")
print(f"  Training loan-months: {len(train_df):,}")
print(f"  Training unique loans: {train_df['loan_sequence_number'].nunique():,}")
print(f"  Tuning loan-months: {len(tune_df):,}")
print(f"  Tuning unique loans: {tune_df['loan_sequence_number'].nunique():,}")

print(f"\nFeatures: {len(available_features)}")
print(f"  Static: {len([f for f in STATIC_FEATURES if f in available_features or f.replace('orig_upb', 'log_upb') in available_features])}")
print(f"  Behavioral (time-varying): {len([f for f in BEHAVIORAL_FEATURES if f in available_features])}")
print(f"  Macro (time-varying): {len([f for f in MACRO_FEATURES if f in available_features])}")

print(f"\nPrepayment Model:")
print(f"  Tuning C-index: {c_index_prepay[0]:.4f}")
print(f"  Training prepayments: {train_prepay['event_prepay'].sum():,}")

print(f"\nDefault Model:")
print(f"  Tuning C-index: {c_index_default[0]:.4f}")
print(f"  Training defaults: {train_default['event_default'].sum():,}")

---

## Save Models

In [None]:
import pickle

# Save models
with open(MODELS_DIR / 'cox_prepay_tv.pkl', 'wb') as f:
    pickle.dump(ctv_prepay, f)
    
with open(MODELS_DIR / 'cox_default_tv.pkl', 'wb') as f:
    pickle.dump(ctv_default, f)

# Save feature list
with open(MODELS_DIR / 'cox_features.pkl', 'wb') as f:
    pickle.dump(available_features, f)

# Save comparison table
comparison.to_csv(MODELS_DIR / 'cox_coefficient_comparison.csv')

print(f"Models saved to {MODELS_DIR}:")
print(f"  - cox_prepay_tv.pkl (CoxTimeVaryingFitter)")
print(f"  - cox_default_tv.pkl (CoxTimeVaryingFitter)")
print(f"  - cox_features.pkl")
print(f"  - cox_coefficient_comparison.csv")

---

## Next Steps

### Completed
- ✅ Loaded loan-month panel data with time-varying covariates
- ✅ Fit cause-specific Cox model for prepayment
- ✅ Fit cause-specific Cox model for default
- ✅ Evaluated models on tuning set (C-index)
- ✅ Compared coefficient effects between models
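
For readers unfamiliar with the C-index used in the evaluation step, the following is a simplified sketch of Harrell's concordance for right-censored data. It is a pure-Python illustration with toy inputs, it ignores pairs with tied times, and it is not the implementation behind `concordance_index_censored`, which the notebook actually calls.

```python
def harrell_c(times, events, risks):
    """Harrell's C for right-censored data (illustrative O(n^2) version)."""
    concordant = tied = comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1   # earlier event has the higher risk score
                elif risks[i] == risks[j]:
                    tied += 1         # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return (concordant + 0.5 * tied) / comparable

times = [2, 5, 7, 9]
events = [1, 0, 1, 0]
risks = [0.9, 0.4, 0.1, 0.6]
print(harrell_c(times, events, risks))  # 0.75
```

Three of the four comparable pairs are ordered correctly; the loan with the event at time 7 has a lower risk score than the loan still active at time 9, which is the one discordant pair.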

### Key Findings
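- Time-varying covariates let the hazards respond to monthly changes in borrower behavior and macro conditions
- Behavioral variables (delinquency history: `t_del_30d_12m`, `t_del_60d_12m`) strongly predict default
- Prepayment incentive (`ppi_c_FRMA`, loan rate minus current mortgage rate) drives prepayment hazard
- HPI changes (`hpi_st_d_t_o`, `hpi_st_log12m`) affect both prepayment and default risks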
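
When reading effects like these off the coefficient table, keep in mind that a hazard ratio refers to a one-unit change in the covariate. The sketch below rescales a coefficient to a more meaningful step; the coefficient value is hypothetical and is not a fitted result from this notebook.

```python
import math

def hazard_ratio(beta, delta=1.0):
    """Hazard ratio implied by a covariate change of `delta` units."""
    return math.exp(beta * delta)

beta_fico = -0.002                     # hypothetical coefficient, not a fitted value
hr_1 = hazard_ratio(beta_fico)         # per 1 FICO point
hr_20 = hazard_ratio(beta_fico, 20)    # per 20 FICO points
print(f"{(hr_1 - 1) * 100:.2f}% per point, {(hr_20 - 1) * 100:.2f}% per 20 points")
```

Because the model is log-linear, the 20-point hazard ratio equals the one-point ratio raised to the 20th power rather than 20 times the one-point percentage change.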

### Future Work
1. **Cross-validation**: Use folds 0-9 for k-fold cross-validation
2. **Fine-Gray models**: Subdistribution hazard for cumulative incidence
3. **Deep learning**: DeepSurv or DeepHit for non-linear effects
4. **Calibration**: Assess predicted vs actual cumulative incidence
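
As a starting point for the calibration item, the sketch below shows how cumulative incidence could be assembled from cause-specific hazards in a discrete monthly setting. It assumes the inputs are monthly conditional event probabilities that sum to at most 1; obtaining them from the fitted Cox models would additionally require a baseline hazard estimate, which is not covered here. The function name and the toy hazards are illustrative.

```python
import numpy as np

def cumulative_incidence(h_prepay, h_default):
    """Cumulative incidence per cause from monthly cause-specific hazards."""
    h_prepay = np.asarray(h_prepay, dtype=float)
    h_default = np.asarray(h_default, dtype=float)
    # Probability of still being active at the start of each month
    active = np.cumprod(1.0 - h_prepay - h_default)
    surv_prev = np.concatenate(([1.0], active[:-1]))
    # Incidence in month t = P(active at start of t) * hazard of that cause in t
    cif_prepay = np.cumsum(surv_prev * h_prepay)
    cif_default = np.cumsum(surv_prev * h_default)
    return cif_prepay, cif_default

cif_p, cif_d = cumulative_incidence([0.10, 0.10], [0.02, 0.02])
print(cif_p, cif_d)
```

The two cumulative incidences plus the probability of remaining active sum to one at every horizon, which is a useful check before comparing predicted and observed incidence.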