# Data Preparation: Blumenstock et al. (2022) Replication

This notebook prepares data following the methodology from:

> Blumenstock, G., Lessmann, S., & Seow, H-V. (2022). Deep learning for survival and competing risk modelling. *Journal of the Operational Research Society*, 73(1), 26-38.

## Dataset 2: Post-Crisis Period (2010-2025)

**Variables from Table 2:**

### Loan-Level Variables (9)
| Variable | Description |
|----------|-------------|
| `int_rate` | Initial interest rate |
| `orig_upb` | Original unpaid balance |
| `fico_score` | Initial FICO score |
| `dti_r` | Initial debt-to-income ratio |
| `ltv_r` | Initial loan-to-value ratio |
| `bal_repaid` | Current repaid balance in percent |
| `t_act_12m` | No. of times not being delinquent in last 12 months |
| `t_del_30d_12m` | No. of times being 30 days delinquent in last 12 months |
| `t_del_60d_12m` | No. of times being 60 days delinquent in last 12 months |

### Macroeconomic Variables (12 listed for Dataset 2)
| Variable | Description |
|----------|-------------|
| `hpi_st_d_t_o` | HPI difference between origination and today (state) |
| `ppi_c_FRMA` | Current prepayment incentive |
| `TB10Y_d_t_o` | Treasury rate difference |
| `FRMA30Y_d_t_o` | 30Y FRM difference |
| `ppi_o_FRMA` | Prepayment incentive at origination |
| `hpi_st_log12m` | HPI 12-month log return (state) |
| `hpi_r_st_us` | Ratio of state HPI to national HPI |
| `st_unemp_r12m` | Unemployment 12-month log return (state) |
| `st_unemp_r3m` | Unemployment 3-month log return (state) |
| `TB10Y_r12m` | Treasury rate 12-month return |
| `T10Y3MM` | Yield spread (10Y - 3M) |
| `T10Y3MM_r12m` | Yield spread 12-month return |
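
The table combines two kinds of transformation: k-month log returns and differences between the observation month and the origination month. The sketch below shows both on a made-up monthly index for a single state. The helper name `log_return` and all values are illustrative assumptions, not taken from the paper.

```python
import numpy as np
import pandas as pd


def log_return(series: pd.Series, months: int) -> pd.Series:
    """k-month log return: log(x_t / x_{t-k}). Hypothetical helper."""
    return np.log(series / series.shift(months))


# Made-up monthly index values for one state (100, 101, ..., 114)
idx = pd.period_range("2020-01", periods=15, freq="M")
hpi = pd.Series(np.linspace(100.0, 114.0, 15), index=idx)

# Analogous to hpi_st_log12m: undefined for the first 12 months
hpi_log12m = log_return(hpi, 12)

# Analogous to hpi_st_d_t_o for a loan originated in 2020-01
orig_month = pd.Period("2020-01", freq="M")
hpi_d_t_o = hpi - hpi.loc[orig_month]
```

The first 12 values of a 12-month return are necessarily missing, which is one reason the macro series should start at least a year before the earliest vintage.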

### Event Definitions
- **Default (k=2)**: Loan turning 3-month delinquent for the first time
- **Prepayment (k=1)**: Loan repaid completely and unexpectedly
- **Censored (k=0)**: Active loan without event
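
The three definitions above imply one terminal record per loan: the first default month if there is one, otherwise the first prepayment month, otherwise the last observed month. A minimal sketch on a toy panel follows. The column names `loan_id` and `is_prepay` and all values are assumptions made for the example.

```python
import pandas as pd

# Toy monthly panel: delinquency status in months past due, plus a prepayment flag
panel = pd.DataFrame({
    "loan_id": ["A"] * 4 + ["B"] * 3 + ["C"] * 3,
    "loan_age": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3],
    "delinquency_status": [0, 1, 2, 3, 0, 0, 0, 0, 1, 0],
    "is_prepay": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
}).sort_values(["loan_id", "loan_age"])

# Candidate event code per row; default (k=2) overrides prepayment (k=1)
panel["event_code"] = 0
panel.loc[panel["is_prepay"] == 1, "event_code"] = 1
panel.loc[panel["delinquency_status"] >= 3, "event_code"] = 2

# Loans with an event: prefer defaults, then take the earliest month
events = panel[panel["event_code"] > 0]
first_event = (
    events.sort_values(
        ["loan_id", "event_code", "loan_age"], ascending=[True, False, True]
    )
    .groupby("loan_id")
    .head(1)
)

# Loans without any event are censored at their last observed month
no_event = ~panel["loan_id"].isin(first_event["loan_id"])
censored = panel[no_event].groupby("loan_id").tail(1)

terminal = (
    pd.concat([first_event, censored])
    .sort_values("loan_id")
    .reset_index(drop=True)
)
```

Here loan A defaults at age 4, loan B prepays at age 3, and loan C is censored at age 3 even though it was once 30 days delinquent.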

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from glob import glob
import warnings
warnings.filterwarnings('ignore')

# Import column definitions
import sys
sys.path.insert(0, '..')
from src.data.columns import (
    ORIGINATION_COLUMNS, ORIGINATION_DTYPES,
    PERFORMANCE_COLUMNS, PERFORMANCE_DTYPES,
    ZERO_BALANCE_CODE_MAP
)

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 200)

print("Imports complete.")

## Configuration

Following the paper's Dataset 2 setup (2010-2025).

In [None]:
# Data paths
RAW_DATA_DIR = Path('../data/raw')
PROCESSED_DATA_DIR = Path('../data/processed')
EXTERNAL_DATA_DIR = Path('../data/external')

# Dataset 2: Post-crisis period (2010-2025)
VINTAGES = list(range(2010, 2026))

# Sampling strategy from paper:
# - 11 random subsamples of 10,000 each
# - 10 for cross-validation, 1 for hyperparameter tuning
SAMPLE_SIZE_PER_FOLD = 10000
N_FOLDS = 11

# Default definition: 3-month delinquent for the first time
DEFAULT_DELINQUENCY_THRESHOLD = 3

print(f"Dataset 2 Period: {min(VINTAGES)}-{max(VINTAGES)}")
print(f"Sample size per fold: {SAMPLE_SIZE_PER_FOLD:,}")
print(f"Number of folds: {N_FOLDS}")

## Step 1: Load Macroeconomic Data
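
The state-level files are wide (one column per state), while the loan data needs one row per state and month. One way to reshape, sketched here on made-up values with the `<STATE>_unemployment` column naming used in this notebook, is to melt first and then compute returns within each state:

```python
import numpy as np
import pandas as pd

# Made-up wide table: one unemployment column per state
wide = pd.DataFrame({
    "year_month": pd.period_range("2020-01", periods=4, freq="M"),
    "CA_unemployment": [4.0, 4.0, 8.0, 8.0],
    "TX_unemployment": [3.0, 3.0, 3.0, 6.0],
})

# Wide -> long: one row per (state, month)
long = wide.melt(
    id_vars="year_month", var_name="state_col", value_name="state_unemployment"
)
long["property_state"] = long["state_col"].str.replace(
    "_unemployment", "", regex=False
)
long = long.sort_values(["property_state", "year_month"]).reset_index(drop=True)

# 3-month log return, lagging within each state so series never mix
lagged = long.groupby("property_state")["state_unemployment"].shift(3)
long["st_unemp_r3m"] = np.log(long["state_unemployment"] / lagged)
```

Computing the lag after melting keeps every derived variable in a single column, so no per-state merge is needed.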

In [None]:
# Load national macro data
macro_national = pd.read_parquet(EXTERNAL_DATA_DIR / 'fred_monthly_panel.parquet')
macro_national.index.name = 'date'
macro_national = macro_national.reset_index()
macro_national['date'] = pd.to_datetime(macro_national['date'])
macro_national['year_month'] = macro_national['date'].dt.to_period('M')

# Calculate additional variables needed for paper
# TB10Y_r12m: 10-year treasury rate 12-month return
macro_national['TB10Y_r12m'] = macro_national['DGS10'].pct_change(12)

# T10Y3MM: Yield spread (10Y - 3M)
if 'DGS3MO' in macro_national.columns:
    macro_national['T10Y3MM'] = macro_national['DGS10'] - macro_national['DGS3MO']
else:
    # 3-month rate not available: fall back to the 10Y-2Y spread as a proxy
    print("Warning: DGS3MO not found; using T10Y2Y as a proxy for T10Y3MM.")
    macro_national['T10Y3MM'] = macro_national['T10Y2Y']

# T10Y3MM_r12m: 12-month change in the yield spread, computed as a difference
# because the spread can be zero or negative
macro_national['T10Y3MM_r12m'] = macro_national['T10Y3MM'].diff(12)

print(f"National macro data: {macro_national.shape}")
print(f"Date range: {macro_national['date'].min()} to {macro_national['date'].max()}")

In [None]:
# Load state-level unemployment
state_unemp = pd.read_parquet(EXTERNAL_DATA_DIR / 'state_unemployment.parquet')
state_unemp.index.name = 'date'
state_unemp = state_unemp.reset_index()
state_unemp['date'] = pd.to_datetime(state_unemp['date'])
state_unemp['year_month'] = state_unemp['date'].dt.to_period('M')

# Calculate returns for each state
state_unemp_returns = state_unemp.copy()
for col in state_unemp.columns:
    if '_unemployment' in col:
        state = col.replace('_unemployment', '')
        # 12-month log return
        state_unemp_returns[f'{state}_unemp_r12m'] = np.log(
            state_unemp[col] / state_unemp[col].shift(12)
        )
        # 3-month log return
        state_unemp_returns[f'{state}_unemp_r3m'] = np.log(
            state_unemp[col] / state_unemp[col].shift(3)
        )

# Melt to long format
state_cols = [c for c in state_unemp.columns if '_unemployment' in c]
state_unemp_long = state_unemp.melt(
    id_vars=['date', 'year_month'],
    value_vars=state_cols,
    var_name='state_col',
    value_name='state_unemployment'
)
state_unemp_long['property_state'] = state_unemp_long['state_col'].str.replace('_unemployment', '')

# Add returns: melt each return type to long format and merge it once
for suffix, return_type in [('_unemp_r12m', 'st_unemp_r12m'), ('_unemp_r3m', 'st_unemp_r3m')]:
    ret_cols = [c for c in state_unemp_returns.columns if c.endswith(suffix)]
    temp = state_unemp_returns.melt(
        id_vars=['year_month'], value_vars=ret_cols,
        var_name='property_state', value_name=return_type
    )
    temp['property_state'] = temp['property_state'].str.replace(suffix, '', regex=False)
    state_unemp_long = state_unemp_long.merge(
        temp, on=['year_month', 'property_state'], how='left'
    )

state_unemp_long = state_unemp_long[['year_month', 'property_state', 'state_unemployment', 
                                      'st_unemp_r12m', 'st_unemp_r3m']].drop_duplicates()

print(f"State unemployment: {state_unemp_long.shape}")
print(state_unemp_long.head())

In [None]:
# Load state-level HPI
state_hpi = pd.read_parquet(EXTERNAL_DATA_DIR / 'state_hpi.parquet')
state_hpi.index.name = 'date'
state_hpi = state_hpi.reset_index()
state_hpi['date'] = pd.to_datetime(state_hpi['date'])
state_hpi['year_month'] = state_hpi['date'].dt.to_period('M')

# Calculate 12-month log return for each state
for col in state_hpi.columns:
    if '_hpi' in col and 'yoy' not in col:
        state = col.replace('_hpi', '')
        state_hpi[f'{state}_hpi_log12m'] = np.log(
            state_hpi[col] / state_hpi[col].shift(12)
        )

# National HPI proxy: unweighted mean of the state indices
# (an official national index, e.g. from FHFA, could be substituted)
hpi_cols = [c for c in state_hpi.columns if c.endswith('_hpi') and len(c) <= 6]
state_hpi['national_hpi'] = state_hpi[hpi_cols].mean(axis=1)

# Melt to long format
state_hpi_long = state_hpi.melt(
    id_vars=['date', 'year_month', 'national_hpi'],
    value_vars=hpi_cols,
    var_name='state_col',
    value_name='state_hpi'
)
state_hpi_long['property_state'] = state_hpi_long['state_col'].str.replace('_hpi', '')

# Add log returns: melt all states at once and merge a single time
log_cols = [c for c in state_hpi.columns if c.endswith('_hpi_log12m')]
hpi_returns_long = state_hpi.melt(
    id_vars=['year_month'], value_vars=log_cols,
    var_name='property_state', value_name='hpi_st_log12m'
)
hpi_returns_long['property_state'] = hpi_returns_long['property_state'].str.replace(
    '_hpi_log12m', '', regex=False
)
state_hpi_long = state_hpi_long.merge(
    hpi_returns_long, on=['year_month', 'property_state'], how='left'
)

# Calculate ratio of state HPI to national HPI
state_hpi_long['hpi_r_st_us'] = state_hpi_long['state_hpi'] / state_hpi_long['national_hpi']

state_hpi_long = state_hpi_long[['year_month', 'property_state', 'state_hpi', 'national_hpi',
                                  'hpi_st_log12m', 'hpi_r_st_us']].drop_duplicates()

print(f"State HPI: {state_hpi_long.shape}")
print(state_hpi_long.head())

## Step 2: Load and Process Loan Data

Process the Freddie Mac data using the paper's variable definitions.
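
The behavioral variables (`t_act_12m`, `t_del_30d_12m`, `t_del_60d_12m`) are trailing 12-month counts computed separately for each loan. A minimal sketch on toy data, where `loan_id` and the helper `rolling_count` are names assumed for the example:

```python
import pandas as pd

# Toy performance panel, sorted by loan and age
perf = pd.DataFrame({
    "loan_id": ["A"] * 5 + ["B"] * 2,
    "loan_age": [1, 2, 3, 4, 5, 1, 2],
    "delinquency_status": [0, 1, 0, 2, 1, 0, 0],
}).sort_values(["loan_id", "loan_age"])


def rolling_count(flag: pd.Series, window: int = 12) -> pd.Series:
    """Count flagged months in the trailing window, separately per loan."""
    return (
        flag.astype(int)
        .groupby(perf["loan_id"])
        .transform(lambda s: s.rolling(window, min_periods=1).sum())
    )


perf["t_act_12m"] = rolling_count(perf["delinquency_status"] == 0)
perf["t_del_30d_12m"] = rolling_count(perf["delinquency_status"] == 1)
perf["t_del_60d_12m"] = rolling_count(perf["delinquency_status"] == 2)
```

With `min_periods=1`, loans younger than 12 months get counts over the months observed so far, so the counts are bounded by loan age.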

In [None]:
def load_origination_data(vintage: int) -> pd.DataFrame:
    """Load origination data for a vintage."""
    pattern = f'sample_{vintage}/sample_orig_{vintage}.txt'
    files = list(RAW_DATA_DIR.glob(f'**/{pattern}'))
    
    if not files:
        return pd.DataFrame()
    
    df = pd.read_csv(
        files[0], sep='|', names=ORIGINATION_COLUMNS,
        dtype=ORIGINATION_DTYPES, na_values=['', ' ']
    )
    return df


def load_performance_data(vintage: int) -> pd.DataFrame:
    """Load performance (monthly) data for a vintage."""
    pattern = f'sample_{vintage}/sample_svcg_{vintage}.txt'
    files = list(RAW_DATA_DIR.glob(f'**/{pattern}'))
    
    if not files:
        return pd.DataFrame()
    
    df = pd.read_csv(
        files[0], sep='|', names=PERFORMANCE_COLUMNS,
        dtype=PERFORMANCE_DTYPES, na_values=['', ' ']
    )
    return df


print("Load functions defined.")

In [None]:
def process_vintage_blumenstock(vintage: int) -> pd.DataFrame:
    """
    Process a vintage following Blumenstock et al. (2022) methodology.
    
    Creates loan-level survival data with:
    - Terminal record per loan
    - Behavioral features from last 12 months
    - Paper's variable definitions
    """
    print(f"Processing vintage {vintage}...")
    
    # Load data
    orig_df = load_origination_data(vintage)
    perf_df = load_performance_data(vintage)
    
    if orig_df.empty or perf_df.empty:
        return pd.DataFrame()
    
    # Parse reporting period
    perf_df['reporting_date'] = pd.to_datetime(
        perf_df['monthly_reporting_period'].astype(str), format='%Y%m'
    )
    perf_df['year_month'] = perf_df['reporting_date'].dt.to_period('M')
    
    # Parse delinquency status (unknown codes 'X'/'XX' and any other
    # non-numeric codes are treated as current, i.e. 0)
    perf_df['delinquency_status'] = pd.to_numeric(
        perf_df['current_loan_delinquency_status'].replace({'X': '0', 'XX': '0'}),
        errors='coerce'
    ).fillna(0).astype(int)
    
    # Sort by loan and time
    perf_df = perf_df.sort_values(['loan_sequence_number', 'loan_age'])
    
    # === Calculate behavioral variables (rolling 12-month) ===
    perf_df['is_current'] = (perf_df['delinquency_status'] == 0).astype(int)
    perf_df['is_30d_del'] = (perf_df['delinquency_status'] == 1).astype(int)
    perf_df['is_60d_del'] = (perf_df['delinquency_status'] == 2).astype(int)
    
    # Rolling counts over last 12 months
    grouped = perf_df.groupby('loan_sequence_number')
    perf_df['t_act_12m'] = grouped['is_current'].transform(
        lambda x: x.rolling(12, min_periods=1).sum()
    )
    perf_df['t_del_30d_12m'] = grouped['is_30d_del'].transform(
        lambda x: x.rolling(12, min_periods=1).sum()
    )
    perf_df['t_del_60d_12m'] = grouped['is_60d_del'].transform(
        lambda x: x.rolling(12, min_periods=1).sum()
    )
    
    # === Determine event type ===
    # Default: first month the loan is 90+ days (3+ months) delinquent
    perf_df['is_default'] = (perf_df['delinquency_status'] >= DEFAULT_DELINQUENCY_THRESHOLD).astype(int)
    # Group again here because `is_default` was added after `grouped` was created
    perf_df['first_default'] = perf_df.groupby('loan_sequence_number')['is_default'].transform(
        lambda x: (x.cumsum() == 1) & (x == 1)
    ).astype(int)
    
    # Prepayment: zero balance code '01' with more than 6 months left to maturity
    # (a missing remaining term is treated as 360 months, i.e. not at maturity)
    perf_df['is_prepay'] = (
        (perf_df['zero_balance_code'] == '01') & 
        (perf_df['remaining_months_to_maturity'].fillna(360) > 6)
    ).astype(int)
    
    # === Calculate balance repaid ===
    orig_upb = orig_df.set_index('loan_sequence_number')['orig_upb']
    perf_df['orig_upb_lookup'] = perf_df['loan_sequence_number'].map(orig_upb)
    perf_df['bal_repaid'] = (
        (perf_df['orig_upb_lookup'] - perf_df['current_actual_upb'].fillna(0)) / 
        perf_df['orig_upb_lookup']
    ) * 100
    perf_df['bal_repaid'] = perf_df['bal_repaid'].clip(0, 100)
    
    # === Get terminal record for each loan ===
    # Determine event at terminal record
    def get_terminal_event(group):
        """Get terminal record with event type."""
        last_row = group.iloc[-1].copy()
        
        # Check for default (first 90+ delinquency)
        default_rows = group[group['first_default'] == 1]
        if len(default_rows) > 0:
            last_row = default_rows.iloc[0].copy()
            last_row['event_code'] = 2  # Default
            return last_row
        
        # Check for prepayment
        prepay_rows = group[group['is_prepay'] == 1]
        if len(prepay_rows) > 0:
            last_row = prepay_rows.iloc[0].copy()
            last_row['event_code'] = 1  # Prepay
            return last_row
        
        # Censored
        last_row['event_code'] = 0
        return last_row
    
    print(f"  Getting terminal records...")
    terminal_df = perf_df.groupby('loan_sequence_number').apply(get_terminal_event)
    terminal_df = terminal_df.reset_index(drop=True)
    
    # === Merge with origination data ===
    orig_cols = [
        'loan_sequence_number', 'credit_score', 'orig_ltv', 'orig_dti',
        'orig_upb', 'orig_interest_rate', 'orig_loan_term',
        'first_payment_date', 'property_state'
    ]
    orig_subset = orig_df[[c for c in orig_cols if c in orig_df.columns]].copy()
    orig_subset['vintage_year'] = vintage
    
    # Parse first payment date, used here as a proxy for the origination month
    orig_subset['first_payment_date'] = pd.to_datetime(
        orig_subset['first_payment_date'].astype(str), format='%Y%m', errors='coerce'
    )
    orig_subset['orig_year_month'] = orig_subset['first_payment_date'].dt.to_period('M')
    
    terminal_df = terminal_df.merge(orig_subset, on='loan_sequence_number', how='left')
    
    print(f"  Loans: {len(terminal_df):,}")
    print(f"  Events: Prepay={sum(terminal_df['event_code']==1):,}, "
          f"Default={sum(terminal_df['event_code']==2):,}, "
          f"Censored={sum(terminal_df['event_code']==0):,}")
    
    return terminal_df


print("Process function defined.")

In [None]:
# Process all vintages in Dataset 2
all_loans = []

for vintage in VINTAGES:
    df = process_vintage_blumenstock(vintage)
    if not df.empty:
        all_loans.append(df)

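# Combine (fail early with a clear message if no vintage could be loaded)
if not all_loans:
    raise RuntimeError(f"No vintages could be loaded from {RAW_DATA_DIR}")
print("\nCombining all vintages...")
loans_df = pd.concat(all_loans, ignore_index=True)
print(f"Total loans: {len(loans_df):,}")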

## Step 3: Merge Macroeconomic Variables

Add the paper's macroeconomic variables, measured at each loan's observation time.
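
Each loan needs the macro value at its observation month and, for the difference variables, the value at its origination month. A minimal sketch with made-up rates follows. Passing `validate="many_to_one"` is an optional safeguard assumed here: it makes pandas raise if the macro table contains duplicate months, which would otherwise silently duplicate loans.

```python
import pandas as pd

loans = pd.DataFrame({
    "loan_id": ["A", "B"],
    "year_month": pd.PeriodIndex(["2021-06", "2021-09"], freq="M"),
    "orig_year_month": pd.PeriodIndex(["2020-01", "2020-06"], freq="M"),
})
rates = pd.DataFrame({
    "year_month": pd.PeriodIndex(
        ["2020-01", "2020-06", "2021-06", "2021-09"], freq="M"
    ),
    "MORTGAGE30US": [3.6, 3.2, 3.0, 2.9],  # made-up values
})

# Observation-time value
loans = loans.merge(rates, on="year_month", how="left", validate="many_to_one")

# Origination-time value: same table, joined on the origination month
orig_rates = rates.rename(
    columns={"year_month": "orig_year_month", "MORTGAGE30US": "orig_MORTGAGE30US"}
)
loans = loans.merge(
    orig_rates, on="orig_year_month", how="left", validate="many_to_one"
)

# Difference between observation time and origination
loans["FRMA30Y_d_t_o"] = loans["MORTGAGE30US"] - loans["orig_MORTGAGE30US"]
```

A left join keeps every loan, so missing macro months show up as NaN and can be checked through the coverage printouts.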

In [None]:
# Merge state unemployment data
print("Merging state unemployment...")
loans_df = loans_df.merge(
    state_unemp_long,
    on=['year_month', 'property_state'],
    how='left'
)
print(f"  Coverage: {loans_df['st_unemp_r12m'].notna().mean():.1%}")

# Merge state HPI data  
print("Merging state HPI...")
loans_df = loans_df.merge(
    state_hpi_long,
    on=['year_month', 'property_state'],
    how='left'
)
print(f"  Coverage: {loans_df['hpi_st_log12m'].notna().mean():.1%}")

# Merge national macro data
print("Merging national macro...")
macro_cols = ['year_month', 'MORTGAGE30US', 'DGS10', 'TB10Y_r12m', 'T10Y3MM', 'T10Y3MM_r12m']
macro_subset = macro_national[[c for c in macro_cols if c in macro_national.columns]].copy()
loans_df = loans_df.merge(macro_subset, on='year_month', how='left')
print(f"  Coverage: {loans_df['MORTGAGE30US'].notna().mean():.1%}")

In [None]:
# Get origination-time values for difference calculations
print("Calculating origination-time differences...")

# Get origination-time state HPI
orig_hpi = state_hpi_long[['year_month', 'property_state', 'state_hpi']].rename(
    columns={'year_month': 'orig_year_month', 'state_hpi': 'orig_state_hpi'}
)
loans_df = loans_df.merge(orig_hpi, on=['orig_year_month', 'property_state'], how='left')

# hpi_st_d_t_o: Difference of HPI between origination and today (state-level)
loans_df['hpi_st_d_t_o'] = loans_df['state_hpi'] - loans_df['orig_state_hpi']

# Get origination-time macro rates
orig_macro = macro_national[['year_month', 'MORTGAGE30US', 'DGS10']].rename(
    columns={'year_month': 'orig_year_month', 'MORTGAGE30US': 'orig_MORTGAGE30US', 'DGS10': 'orig_DGS10'}
)
loans_df = loans_df.merge(orig_macro, on='orig_year_month', how='left')

# ppi_c_FRMA: Current prepayment incentive
loans_df['ppi_c_FRMA'] = loans_df['orig_interest_rate'] - loans_df['MORTGAGE30US']

# ppi_o_FRMA: Prepayment incentive at origination
loans_df['ppi_o_FRMA'] = loans_df['orig_interest_rate'] - loans_df['orig_MORTGAGE30US']

# TB10Y_d_t_o: Difference of 10-year treasury rate
loans_df['TB10Y_d_t_o'] = loans_df['DGS10'] - loans_df['orig_DGS10']

# FRMA30Y_d_t_o: Difference of 30-year FRM average
loans_df['FRMA30Y_d_t_o'] = loans_df['MORTGAGE30US'] - loans_df['orig_MORTGAGE30US']

print("Difference variables calculated.")

## Step 4: Rename Variables to Paper's Conventions

In [None]:
# Rename to paper's variable names
rename_map = {
    # Loan-level
    'orig_interest_rate': 'int_rate',
    'credit_score': 'fico_score',
    'orig_dti': 'dti_r',
    'orig_ltv': 'ltv_r',
    # Duration
    'loan_age': 'duration',
}

loans_df = loans_df.rename(columns=rename_map)

# Define final variable sets (from paper Table 2)
LOAN_LEVEL_VARS = [
    'int_rate',           # Initial interest rate
    'orig_upb',           # Original unpaid balance
    'fico_score',         # Initial FICO score
    'dti_r',              # Initial debt-to-income ratio
    'ltv_r',              # Initial loan-to-value ratio
    'bal_repaid',         # Current repaid balance in percent
    't_act_12m',          # Times not delinquent in last 12 months
    't_del_30d_12m',      # Times 30 days delinquent in last 12 months
    't_del_60d_12m',      # Times 60 days delinquent in last 12 months
]

MACRO_VARS = [
    'hpi_st_d_t_o',       # HPI difference (state)
    'ppi_c_FRMA',         # Current prepayment incentive
    'TB10Y_d_t_o',        # Treasury rate difference
    'FRMA30Y_d_t_o',      # 30Y FRM difference
    'ppi_o_FRMA',         # Prepayment incentive at origination
    'hpi_st_log12m',      # HPI 12-month log return (state)
    'hpi_r_st_us',        # Ratio of state HPI to national HPI
    'st_unemp_r12m',      # Unemployment 12-month log return (state)
    'st_unemp_r3m',       # Unemployment 3-month log return (state)
    'TB10Y_r12m',         # Treasury rate 12-month return
    'T10Y3MM',            # Yield spread (10Y - 3M)
    'T10Y3MM_r12m',       # Yield spread 12-month return
]

ALL_VARS = LOAN_LEVEL_VARS + MACRO_VARS

print(f"Loan-level variables: {len(LOAN_LEVEL_VARS)}")
print(f"Macro variables: {len(MACRO_VARS)}")
print(f"Total: {len(ALL_VARS)}")

In [None]:
# Check variable coverage
print("=== Variable Coverage ===")
for var in ALL_VARS:
    if var in loans_df.columns:
        coverage = loans_df[var].notna().mean()
        print(f"{var}: {coverage:.1%}")
    else:
        print(f"{var}: MISSING")

## Step 5: Create Subsamples for Cross-Validation

Following the paper: 11 random subsamples of 10,000 loans each.
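
A minimal sketch of drawing disjoint random subsamples, assuming a single simple random sample is split evenly into folds. The helper name `assign_random_folds` and the toy data are hypothetical, and which fold is reserved for hyperparameter tuning is left open.

```python
import numpy as np
import pandas as pd


def assign_random_folds(
    df: pd.DataFrame, n_folds: int, fold_size: int, seed: int = 42
) -> pd.DataFrame:
    """Draw n_folds disjoint random subsamples of fold_size rows each."""
    n_total = n_folds * fold_size
    if len(df) < n_total:
        raise ValueError(f"Need {n_total} rows, but only {len(df)} are available")
    # Sampling without replacement shuffles the rows, so a round-robin
    # assignment over the shuffled order yields random, equally sized folds
    sampled = df.sample(n=n_total, random_state=seed).reset_index(drop=True)
    sampled["fold"] = np.arange(n_total) % n_folds
    return sampled


# Toy data: 100 loans with a mix of event codes
toy = pd.DataFrame({
    "loan_id": range(100),
    "event_code": np.tile([0, 0, 1, 2], 25),
})
folds = assign_random_folds(toy, n_folds=11, fold_size=5)
```

Because no stratification is applied, rare events such as defaults can be unevenly spread, so the per-fold event counts are worth inspecting.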

In [None]:
# Filter to complete cases
required_cols = ['duration', 'event_code'] + [v for v in ALL_VARS if v in loans_df.columns]
loans_complete = loans_df.dropna(subset=[c for c in required_cols if c in loans_df.columns])

print(f"Complete cases: {len(loans_complete):,} / {len(loans_df):,} ({len(loans_complete)/len(loans_df):.1%})")

# Event distribution
print("\nEvent distribution:")
print(loans_complete['event_code'].value_counts().sort_index())
print("\n0=Censored, 1=Prepay, 2=Default")

In [None]:
# Create random subsamples
# A single simple random sample is drawn and split into N_FOLDS folds below.
# No stratification is applied, so the per-fold event counts are checked afterwards.
loans_for_sampling = loans_complete.copy()

# Create 11 subsamples
n_samples = min(N_FOLDS * SAMPLE_SIZE_PER_FOLD, len(loans_for_sampling))
if n_samples < len(loans_for_sampling):
    # Sample from the data
    sampled_df = loans_for_sampling.sample(n=n_samples, random_state=42)
else:
    sampled_df = loans_for_sampling

# Assign fold numbers
sampled_df = sampled_df.reset_index(drop=True)
sampled_df['fold'] = sampled_df.index % N_FOLDS

print(f"Total sampled: {len(sampled_df):,}")
print(f"\nSamples per fold:")
print(sampled_df['fold'].value_counts().sort_index())

In [None]:
# Check event distribution per fold
print("=== Event Distribution per Fold ===")
for fold in range(N_FOLDS):
    fold_data = sampled_df[sampled_df['fold'] == fold]
    prepay = (fold_data['event_code'] == 1).sum()
    default = (fold_data['event_code'] == 2).sum()
    censored = (fold_data['event_code'] == 0).sum()
    print(f"Fold {fold}: n={len(fold_data):,}, prepay={prepay:,}, default={default:,}, censored={censored:,}")

## Step 6: Save Processed Data

In [None]:
# Select final columns (columns missing from sampled_df are filtered out below)
final_cols = (
    ['loan_sequence_number', 'vintage_year', 'fold']  # Identifiers
    + ['duration', 'event_code']                      # Survival data
    + LOAN_LEVEL_VARS                                 # Loan-level variables
    + MACRO_VARS                                      # Macro variables
    + ['property_state', 'year_month']                # Additional useful columns
)

# Remove duplicates and filter
final_cols = list(dict.fromkeys(final_cols))
final_cols = [c for c in final_cols if c in sampled_df.columns]

final_df = sampled_df[final_cols].copy()
print(f"Final dataset: {len(final_df):,} rows, {len(final_cols)} columns")

In [None]:
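# Save to parquet (create the output directory if it does not exist yet)
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)
output_path = PROCESSED_DATA_DIR / 'blumenstock_dataset2.parquet'
final_df.to_parquet(output_path, index=False)
print(f"Saved to {output_path}")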

# Also save variable lists for reference
var_config = {
    'loan_level_vars': LOAN_LEVEL_VARS,
    'macro_vars': MACRO_VARS,
    'all_vars': ALL_VARS,
}
import json
with open(PROCESSED_DATA_DIR / 'blumenstock_variables.json', 'w') as f:
    json.dump(var_config, f, indent=2)
print("Variable config saved.")

In [None]:
# Final summary
print("=" * 60)
print("BLUMENSTOCK DATASET 2 SUMMARY")
print("=" * 60)
print(f"\nPeriod: {final_df['vintage_year'].min()}-{final_df['vintage_year'].max()}")
print(f"Total observations: {len(final_df):,}")
print(f"Number of folds: {final_df['fold'].nunique()}")

print(f"\n=== Event Distribution ===")
event_counts = final_df['event_code'].value_counts().sort_index()
for code, count in event_counts.items():
    event_name = {0: 'Censored', 1: 'Prepayment', 2: 'Default'}.get(code, 'Other')
    pct = count / len(final_df) * 100
    print(f"  {event_name} (k={code}): {count:,} ({pct:.1f}%)")

print(f"\n=== Variable Summary ===")
print(f"Loan-level variables: {len([v for v in LOAN_LEVEL_VARS if v in final_df.columns])}")
print(f"Macro variables: {len([v for v in MACRO_VARS if v in final_df.columns])}")

print(f"\n=== Duration Statistics ===")
print(final_df['duration'].describe())

## Next Steps

Data is ready for experiments:

1. **Notebook 05**: Cause-Specific Cox (CSC)
2. **Notebook 06**: Fine-Gray Model (FGR)
3. **Notebook 07**: Random Survival Forest (RSF)
4. **Notebook 08**: Model Comparison

### Experiments (from paper)

| Experiment | Variables | Variable list |
|------------|-----------|---------------|
| Exp 4.1 | Loan-level only | `LOAN_LEVEL_VARS` |
| Exp 4.2 | Macro only | `MACRO_VARS` |
| Exp 4.3 | All variables | `ALL_VARS` |
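
One possible way to select the feature set for an experiment is sketched below. The `EXPERIMENTS` mapping and the `select_features` helper are hypothetical, and the variable lists are shortened for the example.

```python
import pandas as pd

# Shortened lists for illustration; the full lists are defined in Step 4
LOAN_LEVEL_VARS = ["int_rate", "fico_score"]
MACRO_VARS = ["ppi_c_FRMA", "T10Y3MM"]

EXPERIMENTS = {
    "4.1": LOAN_LEVEL_VARS,
    "4.2": MACRO_VARS,
    "4.3": LOAN_LEVEL_VARS + MACRO_VARS,
}


def select_features(df: pd.DataFrame, experiment: str) -> pd.DataFrame:
    """Return the survival targets plus the features of one experiment."""
    cols = EXPERIMENTS[experiment]
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns for experiment {experiment}: {missing}")
    return df[["duration", "event_code"] + cols]


# Toy frame with made-up values
toy = pd.DataFrame({
    "duration": [12, 30],
    "event_code": [1, 0],
    "int_rate": [4.5, 3.9],
    "fico_score": [720, 680],
    "ppi_c_FRMA": [0.8, 0.1],
    "T10Y3MM": [1.2, 1.5],
})
X_macro = select_features(toy, "4.2")
```

Raising on missing columns makes a mismatch between the saved dataset and the variable lists visible immediately.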