# Data Preparation: Blumenstock et al. (2022) Replication

This notebook prepares data following the methodology from:

> Blumenstock, G., Lessmann, S., & Seow, H-V. (2022). Deep learning for survival and competing risk modelling. *Journal of the Operational Research Society*, 73(1), 26-38.

## Dataset 2: Post-Crisis Period (2010-2025)

**Variables from Table 2:**

### Loan-Level Variables (9)
| Variable | Description |
|----------|-------------|
| `int_rate` | Initial interest rate |
| `orig_upb` | Original unpaid balance |
| `fico_score` | Initial FICO score |
| `dti_r` | Initial debt-to-income ratio |
| `ltv_r` | Initial loan-to-value ratio |
| `bal_repaid` | Current repaid balance in percent |
| `t_act_12m` | No. of times not being delinquent in last 12 months |
| `t_del_30d_12m` | No. of times being 30 days delinquent in last 12 months |
| `t_del_60d_12m` | No. of times being 60 days delinquent in last 12 months |

### Macroeconomic Variables (12 for Dataset 2)
| Variable | Description |
|----------|-------------|
| `hpi_st_d_t_o` | HPI difference between origination and today (state) |
| `ppi_c_FRMA` | Current prepayment incentive |
| `TB10Y_d_t_o` | Treasury rate difference |
| `FRMA30Y_d_t_o` | 30Y FRM difference |
| `ppi_o_FRMA` | Prepayment incentive at origination |
| `hpi_st_log12m` | HPI 12-month log return (state) |
| `hpi_r_st_us` | Ratio of state HPI to national HPI |
| `st_unemp_r12m` | Unemployment 12-month log return (state) |
| `st_unemp_r3m` | Unemployment 3-month log return (state) |
| `TB10Y_r12m` | Treasury rate 12-month return |
| `T10Y3MM` | Yield spread (10Y - 3M) |
| `T10Y3MM_r12m` | Yield spread 12-month return |
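
The table combines three kinds of transformation: a difference between the observation month and the origination month (`*_d_t_o`), a 12-month log return, and a 12-month simple return. A minimal sketch on a made-up monthly series follows; the `toy` values and the chosen months are illustrative assumptions, not data from the paper.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly index values (not real HPI data).
toy = pd.Series(
    [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0,
     107.0, 108.0, 109.0, 110.0, 111.0, 112.0],
    index=pd.period_range("2010-01", periods=13, freq="M"),
)

# Difference between "today" and origination (the *_d_t_o pattern).
orig_month = pd.Period("2010-01", freq="M")
obs_month = pd.Period("2011-01", freq="M")
diff_t_o = toy.loc[obs_month] - toy.loc[orig_month]

# 12-month log return (the pattern behind hpi_st_log12m and st_unemp_r12m).
log_ret_12m = np.log(toy / toy.shift(12))

# 12-month simple return, which is what pct_change(12) computes.
simple_ret_12m = toy.pct_change(12)
```

The first 12 months of either return series are `NaN` because no value exists 12 months earlier.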

### Event Definitions
- **Default (k=2)**: Loan becomes 3 months delinquent for the first time
- **Prepayment (k=1)**: Loan repaid completely and unexpectedly
- **Censored (k=0)**: Active loan without event
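
The event coding above can be sketched for a single loan history. The function name, the list-based input format, and counting duration from month 1 are assumptions made for illustration; the notebook itself works on the Freddie Mac performance table and uses `loan_age` as duration.

```python
from typing import Sequence, Tuple

CENSORED, PREPAY, DEFAULT = 0, 1, 2


def classify_loan(delinq_status: Sequence[int], prepaid_at_end: bool) -> Tuple[int, int]:
    """Return (event_code, duration_in_months) for one loan history.

    delinq_status holds the number of months delinquent per reporting month
    (hypothetical input format); prepaid_at_end flags a full, unscheduled
    repayment in the last reported month.
    """
    for month, status in enumerate(delinq_status, start=1):
        if status >= 3:  # first month at 3+ months delinquent
            return DEFAULT, month
    if prepaid_at_end:
        return PREPAY, len(delinq_status)
    return CENSORED, len(delinq_status)
```

In this sketch a default takes precedence over a later prepayment, and the clock stops at the first month the loan reaches the threshold.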

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from glob import glob
import warnings
warnings.filterwarnings('ignore')

# Import column definitions
import sys
sys.path.insert(0, '..')
from src.data.columns import (
    ORIGINATION_COLUMNS, ORIGINATION_DTYPES,
    PERFORMANCE_COLUMNS, PERFORMANCE_DTYPES,
    ZERO_BALANCE_CODE_MAP
)

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 200)

print("Imports complete.")

Imports complete.


## Configuration

Following the paper's Dataset 2 setup (2010-2025).

In [None]:
# Data paths
RAW_DATA_DIR = Path('../data/raw')
PROCESSED_DATA_DIR = Path('../data/processed')
EXTERNAL_DATA_DIR = Path('../data/external')

# Dataset 2: Post-crisis period (2010-2025)
VINTAGES = list(range(2010, 2026))

# Sampling strategy from paper:
# - 11 random subsamples of 10,000 each
# - 10 for cross-validation, 1 for hyperparameter tuning
SAMPLE_SIZE_PER_FOLD = 10000
N_FOLDS = 11

# Default definition: 3-month delinquent for the first time
DEFAULT_DELINQUENCY_THRESHOLD = 3

print(f"Dataset 2 Period: {min(VINTAGES)}-{max(VINTAGES)}")
print(f"Sample size per fold: {SAMPLE_SIZE_PER_FOLD:,}")
print(f"Number of folds: {N_FOLDS}")

# === VERIFY REQUIRED FILES EXIST ===
print("\n=== Checking Required Files ===")
required_files = {
    'National macro data': EXTERNAL_DATA_DIR / 'fred_monthly_panel.parquet',
    'State unemployment': EXTERNAL_DATA_DIR / 'state_unemployment.parquet',
    'State HPI': EXTERNAL_DATA_DIR / 'state_hpi.parquet',
}

missing_files = []
for name, path in required_files.items():
    if path.exists():
        print(f"  ✓ {name}: {path}")
    else:
        print(f"  ✗ {name}: MISSING - {path}")
        missing_files.append(name)

if missing_files:
    print(f"\n⚠️  WARNING: {len(missing_files)} required file(s) missing!")
    print("   Run: python -m src.data.download_fred --include-states")
    print("   Some macro variables will not be calculated.")
else:
    print("\n✓ All required files present.")

## Step 1: Load Macroeconomic Data

In [None]:
# Load national macro data
macro_path = EXTERNAL_DATA_DIR / 'fred_monthly_panel.parquet'

if macro_path.exists():
    macro_national = pd.read_parquet(macro_path)
    macro_national.index.name = 'date'
    macro_national = macro_national.reset_index()
    macro_national['date'] = pd.to_datetime(macro_national['date'])
    macro_national['year_month'] = macro_national['date'].dt.to_period('M')

    # Verify required columns exist
    required_macro_cols = ['MORTGAGE30US', 'DGS10', 'DGS3MO']
    missing_cols = [c for c in required_macro_cols if c not in macro_national.columns]
    if missing_cols:
        print(f"⚠️  WARNING: Missing columns in macro data: {missing_cols}")
    
    # Calculate additional variables needed for paper
    # TB10Y_r12m: 10-year treasury rate 12-month return
    if 'DGS10' in macro_national.columns:
        macro_national['TB10Y_r12m'] = macro_national['DGS10'].pct_change(12)

    # T10Y3MM: Yield spread (need 3-month rate)
    if 'DGS10' in macro_national.columns and 'DGS3MO' in macro_national.columns:
        macro_national['T10Y3MM'] = macro_national['DGS10'] - macro_national['DGS3MO']
        # T10Y3MM_r12m: Yield spread 12-month return
        macro_national['T10Y3MM_r12m'] = macro_national['T10Y3MM'].pct_change(12)

    print(f"✓ National macro data loaded: {macro_national.shape}")
    print(f"  Date range: {macro_national['date'].min()} to {macro_national['date'].max()}")
    print(f"  Columns: {list(macro_national.columns)[:10]}...")
else:
    print(f"✗ ERROR: National macro data not found at {macro_path}")
    print("  Run: python -m src.data.download_fred")
    macro_national = None
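
One caveat with `pct_change` on a yield spread is that the spread can be zero or negative, which makes the percentage return infinite or flips its sign. The sketch below uses made-up spread values and a lag of 1 for brevity (the cell above uses a lag of 12); replacing infinities with `NaN` is one possible treatment, not something the paper prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical yield-spread values in percentage points.
spread = pd.Series([0.5, 0.0, -0.2, 0.4])

# Simple return relative to the previous value.
ret = spread.pct_change(1)

# Division by the zero base yields an infinite value; the last entry is
# negative even though the spread rose, because the base was negative.
ret_clean = ret.replace([np.inf, -np.inf], np.nan)
```

Because `dropna` does not remove infinite values, a cleaning step like this matters if the variable later feeds a complete-case filter.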

In [None]:
# Load state-level unemployment
unemp_path = EXTERNAL_DATA_DIR / 'state_unemployment.parquet'

if unemp_path.exists():
    state_unemp = pd.read_parquet(unemp_path)
    state_unemp.index.name = 'date'
    state_unemp = state_unemp.reset_index()
    state_unemp['date'] = pd.to_datetime(state_unemp['date'])
    state_unemp['year_month'] = state_unemp['date'].dt.to_period('M')

    # Melt to long format
    state_cols = [c for c in state_unemp.columns if '_unemployment' in c]
    
    if len(state_cols) == 0:
        print("⚠️  WARNING: No unemployment columns found in state data")
        state_unemp_long = pd.DataFrame()
    else:
        state_unemp_long = state_unemp.melt(
            id_vars=['date', 'year_month'],
            value_vars=state_cols,
            var_name='state_col',
            value_name='state_unemployment'
        )
        state_unemp_long['property_state'] = state_unemp_long['state_col'].str.replace('_unemployment', '')

        # Calculate returns by state (need to sort first)
        state_unemp_long = state_unemp_long.sort_values(['property_state', 'year_month'])
        
        # Group by state and calculate 12- and 3-month log returns
        def calc_state_returns(group):
            group = group.copy()
            group['st_unemp_r12m'] = np.log(group['state_unemployment'] / group['state_unemployment'].shift(12))
            group['st_unemp_r3m'] = np.log(group['state_unemployment'] / group['state_unemployment'].shift(3))
            return group
        
        state_unemp_long = state_unemp_long.groupby('property_state', group_keys=False).apply(calc_state_returns)
        
        state_unemp_long = state_unemp_long[['year_month', 'property_state', 'state_unemployment', 
                                              'st_unemp_r12m', 'st_unemp_r3m']].drop_duplicates()

        print(f"✓ State unemployment loaded: {state_unemp_long.shape}")
        print(f"  States: {state_unemp_long['property_state'].nunique()}")
        print(f"  Sample: {state_unemp_long['property_state'].unique()[:5].tolist()}...")
else:
    print(f"✗ ERROR: State unemployment data not found at {unemp_path}")
    print("  Run: python -m src.data.download_fred --include-states")
    state_unemp_long = pd.DataFrame(columns=['year_month', 'property_state', 'st_unemp_r12m', 'st_unemp_r3m'])
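
The per-state log returns can also be computed without `groupby().apply()`, by shifting within each group and dividing. This is a sketch of an equivalent formulation on a toy panel; the state codes `AA` and `BB` and all rates are made up.

```python
import numpy as np
import pandas as pd

# Toy long-format panel with two hypothetical states.
panel = pd.DataFrame({
    "property_state": ["AA"] * 5 + ["BB"] * 5,
    "year_month": list(pd.period_range("2010-01", periods=5, freq="M")) * 2,
    "state_unemployment": [8.0, 8.2, 8.4, 8.0, 7.6, 5.0, 5.1, 5.0, 4.9, 4.8],
}).sort_values(["property_state", "year_month"])

# Lag within each state so values never leak across state boundaries.
lagged = panel.groupby("property_state")["state_unemployment"].shift(3)
panel["st_unemp_r3m"] = np.log(panel["state_unemployment"] / lagged)
```

The first three months of each state are `NaN`, including the first rows of `BB`, which confirms the lag does not reach back into `AA`.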

In [None]:
# Load state-level HPI
hpi_path = EXTERNAL_DATA_DIR / 'state_hpi.parquet'

if hpi_path.exists():
    state_hpi = pd.read_parquet(hpi_path)
    state_hpi.index.name = 'date'
    state_hpi = state_hpi.reset_index()
    state_hpi['date'] = pd.to_datetime(state_hpi['date'])
    state_hpi['year_month'] = state_hpi['date'].dt.to_period('M')

    # Find HPI columns (2-letter state code + _hpi)
    hpi_cols = [c for c in state_hpi.columns if c.endswith('_hpi') and len(c) <= 6]
    
    if len(hpi_cols) == 0:
        print("⚠️  WARNING: No HPI columns found in state data")
        state_hpi_long = pd.DataFrame()
    else:
        # Approximate national HPI as the unweighted mean across states
        state_hpi['national_hpi'] = state_hpi[hpi_cols].mean(axis=1)

        # Melt to long format
        state_hpi_long = state_hpi.melt(
            id_vars=['date', 'year_month', 'national_hpi'],
            value_vars=hpi_cols,
            var_name='state_col',
            value_name='state_hpi'
        )
        state_hpi_long['property_state'] = state_hpi_long['state_col'].str.replace('_hpi', '')

        # Sort by state and time for proper rolling calculations
        state_hpi_long = state_hpi_long.sort_values(['property_state', 'year_month'])

        # Calculate log returns by state
        def calc_hpi_returns(group):
            group = group.copy()
            group['hpi_st_log12m'] = np.log(group['state_hpi'] / group['state_hpi'].shift(12))
            return group
        
        state_hpi_long = state_hpi_long.groupby('property_state', group_keys=False).apply(calc_hpi_returns)

        # Calculate ratio of state HPI to national HPI
        state_hpi_long['hpi_r_st_us'] = state_hpi_long['state_hpi'] / state_hpi_long['national_hpi']

        state_hpi_long = state_hpi_long[['year_month', 'property_state', 'state_hpi', 'national_hpi',
                                          'hpi_st_log12m', 'hpi_r_st_us']].drop_duplicates()

        print(f"✓ State HPI loaded: {state_hpi_long.shape}")
        print(f"  States: {state_hpi_long['property_state'].nunique()}")
        print(f"  Sample: {state_hpi_long['property_state'].unique()[:5].tolist()}...")
else:
    print(f"✗ ERROR: State HPI data not found at {hpi_path}")
    print("  Run: python -m src.data.download_fred --include-states")
    state_hpi_long = pd.DataFrame(columns=['year_month', 'property_state', 'state_hpi', 'national_hpi', 
                                            'hpi_st_log12m', 'hpi_r_st_us'])

## Step 2: Load and Process Loan Data

Process the Freddie Mac data using the paper's variable definitions.

In [12]:
def load_origination_data(vintage: int) -> pd.DataFrame:
    """Load origination data for a vintage."""
    pattern = f'sample_{vintage}/sample_orig_{vintage}.txt'
    files = list(RAW_DATA_DIR.glob(f'**/{pattern}'))
    
    if not files:
        return pd.DataFrame()
    
    df = pd.read_csv(
        files[0], sep='|', names=ORIGINATION_COLUMNS,
        dtype=ORIGINATION_DTYPES, na_values=['', ' ']
    )
    return df


def load_performance_data(vintage: int) -> pd.DataFrame:
    """Load performance (monthly) data for a vintage."""
    pattern = f'sample_{vintage}/sample_svcg_{vintage}.txt'
    files = list(RAW_DATA_DIR.glob(f'**/{pattern}'))
    
    if not files:
        return pd.DataFrame()
    
    df = pd.read_csv(
        files[0], sep='|', names=PERFORMANCE_COLUMNS,
        dtype=PERFORMANCE_DTYPES, na_values=['', ' ']
    )
    return df


print("Load functions defined.")

Load functions defined.


In [13]:
def process_vintage_blumenstock(vintage: int) -> pd.DataFrame:
    """
    Process a vintage following Blumenstock et al. (2022) methodology.
    
    Creates loan-level survival data with:
    - Terminal record per loan
    - Behavioral features from last 12 months
    - Paper's variable definitions
    """
    print(f"Processing vintage {vintage}...")
    
    # Load data
    orig_df = load_origination_data(vintage)
    perf_df = load_performance_data(vintage)
    
    if orig_df.empty or perf_df.empty:
        return pd.DataFrame()
    
    # Parse reporting period
    perf_df['reporting_date'] = pd.to_datetime(
        perf_df['monthly_reporting_period'].astype(str), format='%Y%m'
    )
    perf_df['year_month'] = perf_df['reporting_date'].dt.to_period('M')
    
    # Parse delinquency status. 'X'/'XX' and any other non-numeric code
    # are mapped to 0, i.e. treated as current.
    perf_df['delinquency_status'] = pd.to_numeric(
        perf_df['current_loan_delinquency_status'].replace({'X': '0', 'XX': '0'}),
        errors='coerce'
    ).fillna(0).astype(int)
    
    # Sort by loan and time
    perf_df = perf_df.sort_values(['loan_sequence_number', 'loan_age'])
    
    # === Calculate behavioral variables (rolling 12-month) ===
    perf_df['is_current'] = (perf_df['delinquency_status'] == 0).astype(int)
    perf_df['is_30d_del'] = (perf_df['delinquency_status'] == 1).astype(int)
    perf_df['is_60d_del'] = (perf_df['delinquency_status'] == 2).astype(int)
    
    # Rolling counts over last 12 months
    grouped = perf_df.groupby('loan_sequence_number')
    perf_df['t_act_12m'] = grouped['is_current'].transform(
        lambda x: x.rolling(12, min_periods=1).sum()
    )
    perf_df['t_del_30d_12m'] = grouped['is_30d_del'].transform(
        lambda x: x.rolling(12, min_periods=1).sum()
    )
    perf_df['t_del_60d_12m'] = grouped['is_60d_del'].transform(
        lambda x: x.rolling(12, min_periods=1).sum()
    )
    
    # === Determine event type ===
    # Default: first time reaching 90+ days delinquent
    perf_df['is_default'] = (perf_df['delinquency_status'] >= DEFAULT_DELINQUENCY_THRESHOLD).astype(int)
    # Group on the current frame here, since `grouped` was created before `is_default` existed
    perf_df['first_default'] = perf_df.groupby('loan_sequence_number')['is_default'].transform(
        lambda x: (x.cumsum() == 1) & (x == 1)
    ).astype(int)
    
    # Prepayment: zero balance code '01' with more than 6 months remaining to maturity
    # (a missing remaining term is treated as 360 months)
    perf_df['is_prepay'] = (
        (perf_df['zero_balance_code'] == '01') &
        (perf_df['remaining_months_to_maturity'].fillna(360) > 6)
    ).astype(int)
    
    # === Calculate balance repaid ===
    orig_upb = orig_df.set_index('loan_sequence_number')['orig_upb']
    perf_df['orig_upb_lookup'] = perf_df['loan_sequence_number'].map(orig_upb)
    perf_df['bal_repaid'] = (
        (perf_df['orig_upb_lookup'] - perf_df['current_actual_upb'].fillna(0)) / 
        perf_df['orig_upb_lookup']
    ) * 100
    perf_df['bal_repaid'] = perf_df['bal_repaid'].clip(0, 100)
    
    # === Get terminal record for each loan ===
    # Determine event at terminal record
    def get_terminal_event(group):
        """Get terminal record with event type."""
        last_row = group.iloc[-1].copy()
        
        # Check for default (first 90+ delinquency)
        default_rows = group[group['first_default'] == 1]
        if len(default_rows) > 0:
            last_row = default_rows.iloc[0].copy()
            last_row['event_code'] = 2  # Default
            return last_row
        
        # Check for prepayment
        prepay_rows = group[group['is_prepay'] == 1]
        if len(prepay_rows) > 0:
            last_row = prepay_rows.iloc[0].copy()
            last_row['event_code'] = 1  # Prepay
            return last_row
        
        # Censored
        last_row['event_code'] = 0
        return last_row
    
    print(f"  Getting terminal records...")
    terminal_df = perf_df.groupby('loan_sequence_number').apply(get_terminal_event)
    terminal_df = terminal_df.reset_index(drop=True)
    
    # === Merge with origination data ===
    orig_cols = [
        'loan_sequence_number', 'credit_score', 'orig_ltv', 'orig_dti',
        'orig_upb', 'orig_interest_rate', 'orig_loan_term',
        'first_payment_date', 'property_state'
    ]
    orig_subset = orig_df[[c for c in orig_cols if c in orig_df.columns]].copy()
    orig_subset['vintage_year'] = vintage
    
    # Parse first payment date (used here as a proxy for the origination month)
    orig_subset['first_payment_date'] = pd.to_datetime(
        orig_subset['first_payment_date'].astype(str), format='%Y%m', errors='coerce'
    )
    orig_subset['orig_year_month'] = orig_subset['first_payment_date'].dt.to_period('M')
    
    terminal_df = terminal_df.merge(orig_subset, on='loan_sequence_number', how='left')
    
    print(f"  Loans: {len(terminal_df):,}")
    print(f"  Events: Prepay={sum(terminal_df['event_code']==1):,}, "
          f"Default={sum(terminal_df['event_code']==2):,}, "
          f"Censored={sum(terminal_df['event_code']==0):,}")
    
    return terminal_df


print("Process function defined.")

Process function defined.
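
The two core patterns in `process_vintage_blumenstock`, the trailing 12-month counts and the first-default flag, can be checked on a toy history. The two loans and their statuses below are made up for illustration.

```python
import pandas as pd

# Toy performance history for two hypothetical loans; status = months delinquent.
perf = pd.DataFrame({
    "loan_sequence_number": ["A"] * 6 + ["B"] * 4,
    "loan_age": [0, 1, 2, 3, 4, 5, 0, 1, 2, 3],
    "delinquency_status": [0, 1, 2, 3, 4, 0, 0, 0, 1, 0],
}).sort_values(["loan_sequence_number", "loan_age"])

# Months spent exactly 30 days delinquent within a trailing 12-month window.
perf["is_30d_del"] = (perf["delinquency_status"] == 1).astype(int)
perf["t_del_30d_12m"] = (
    perf.groupby("loan_sequence_number")["is_30d_del"]
    .transform(lambda s: s.rolling(12, min_periods=1).sum())
)

# Flag only the first month at 3+ months delinquent.
perf["is_default"] = (perf["delinquency_status"] >= 3).astype(int)
cum_defaults = perf.groupby("loan_sequence_number")["is_default"].cumsum()
perf["first_default"] = ((cum_defaults == 1) & (perf["is_default"] == 1)).astype(int)
```

Loan `A` stays above the threshold for two months, but only the first of them is flagged; loan `B` never defaults.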


In [14]:
# Process all vintages in Dataset 2
all_loans = []

for vintage in VINTAGES:
    df = process_vintage_blumenstock(vintage)
    if not df.empty:
        all_loans.append(df)

# Combine
print("\nCombining all vintages...")
loans_df = pd.concat(all_loans, ignore_index=True)
print(f"Total loans: {len(loans_df):,}")

Processing vintage 2010...
  Getting terminal records...
  Loans: 50,000
  Events: Prepay=44,437, Default=1,614, Censored=3,949
Processing vintage 2011...
  Getting terminal records...
  Loans: 50,000
  Events: Prepay=42,938, Default=1,528, Censored=5,534
Processing vintage 2012...
  Getting terminal records...
  Loans: 50,000
  Events: Prepay=37,822, Default=1,698, Censored=10,480
Processing vintage 2013...
  Getting terminal records...
  Loans: 50,000
  Events: Prepay=37,060, Default=1,882, Censored=11,058
Processing vintage 2014...
  Getting terminal records...
  Loans: 50,000
  Events: Prepay=39,244, Default=1,919, Censored=8,837
Processing vintage 2015...
  Getting terminal records...
  Loans: 50,000
  Events: Prepay=36,516, Default=2,023, Censored=11,461
Processing vintage 2016...
  Getting terminal records...
  Loans: 50,000
  Events: Prepay=33,910, Default=2,345, Censored=13,745
Processing vintage 2017...
  Getting terminal records...
  Loans: 50,000
  Events: Prepay=35,261, De

## Step 3: Merge Macroeconomic Variables

Add the paper's macroeconomic variables as of each loan's observation time.

In [None]:
# Merge macroeconomic data at observation time
print("=== Merging Macro Data at Observation Time ===\n")

n_before = len(loans_df)

# Merge state unemployment data
if not state_unemp_long.empty:
    print("Merging state unemployment...")
    loans_df = loans_df.merge(
        state_unemp_long,
        on=['year_month', 'property_state'],
        how='left'
    )
    coverage = loans_df['st_unemp_r12m'].notna().mean()
    print(f"  ✓ st_unemp_r12m coverage: {coverage:.1%}")
    print(f"  ✓ st_unemp_r3m coverage: {loans_df['st_unemp_r3m'].notna().mean():.1%}")
else:
    print("⚠️  Skipping state unemployment merge (data not loaded)")
    loans_df['st_unemp_r12m'] = np.nan
    loans_df['st_unemp_r3m'] = np.nan

# Merge state HPI data  
if not state_hpi_long.empty:
    print("\nMerging state HPI...")
    loans_df = loans_df.merge(
        state_hpi_long,
        on=['year_month', 'property_state'],
        how='left'
    )
    print(f"  ✓ state_hpi coverage: {loans_df['state_hpi'].notna().mean():.1%}")
    print(f"  ✓ hpi_st_log12m coverage: {loans_df['hpi_st_log12m'].notna().mean():.1%}")
    print(f"  ✓ hpi_r_st_us coverage: {loans_df['hpi_r_st_us'].notna().mean():.1%}")
else:
    print("⚠️  Skipping state HPI merge (data not loaded)")
    loans_df['state_hpi'] = np.nan
    loans_df['national_hpi'] = np.nan
    loans_df['hpi_st_log12m'] = np.nan
    loans_df['hpi_r_st_us'] = np.nan

# Merge national macro data
if macro_national is not None:
    print("\nMerging national macro...")
    macro_cols = ['year_month', 'MORTGAGE30US', 'DGS10', 'TB10Y_r12m', 'T10Y3MM', 'T10Y3MM_r12m']
    macro_cols = [c for c in macro_cols if c in macro_national.columns]
    macro_subset = macro_national[macro_cols].copy()
    loans_df = loans_df.merge(macro_subset, on='year_month', how='left')
    print(f"  ✓ MORTGAGE30US coverage: {loans_df['MORTGAGE30US'].notna().mean():.1%}")
    print(f"  ✓ DGS10 coverage: {loans_df['DGS10'].notna().mean():.1%}")
    if 'TB10Y_r12m' in loans_df.columns:
        print(f"  ✓ TB10Y_r12m coverage: {loans_df['TB10Y_r12m'].notna().mean():.1%}")
    if 'T10Y3MM' in loans_df.columns:
        print(f"  ✓ T10Y3MM coverage: {loans_df['T10Y3MM'].notna().mean():.1%}")
else:
    print("⚠️  Skipping national macro merge (data not loaded)")
    loans_df['MORTGAGE30US'] = np.nan
    loans_df['DGS10'] = np.nan
    loans_df['TB10Y_r12m'] = np.nan
    loans_df['T10Y3MM'] = np.nan
    loans_df['T10Y3MM_r12m'] = np.nan

# Verify no rows were lost
n_after = len(loans_df)
if n_after != n_before:
    print(f"\n⚠️  WARNING: Row count changed from {n_before:,} to {n_after:,} during merge!")
else:
    print(f"\n✓ Row count unchanged: {n_after:,}")
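
The row-count check above detects duplicated lookup keys only after the merges have run. As a complementary guard, `DataFrame.merge` accepts a `validate` argument that raises as soon as the lookup side has repeated keys. A minimal sketch with made-up frames:

```python
import pandas as pd
from pandas.errors import MergeError

loans = pd.DataFrame({
    "loan_id": ["L1", "L2"],
    "year_month": pd.PeriodIndex(["2015-01", "2015-02"], freq="M"),
})

# A lookup table with an accidental duplicate month (made-up values).
macro_dup = pd.DataFrame({
    "year_month": pd.PeriodIndex(["2015-01", "2015-01", "2015-02"], freq="M"),
    "DGS10": [1.9, 1.9, 2.0],
})

try:
    loans.merge(macro_dup, on="year_month", how="left", validate="many_to_one")
    duplicate_detected = False
except MergeError:
    duplicate_detected = True

# After de-duplicating the lookup, the merge passes and keeps one row per loan.
macro_ok = macro_dup.drop_duplicates(subset="year_month")
merged = loans.merge(macro_ok, on="year_month", how="left", validate="many_to_one")
```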

In [None]:
# Get origination-time values for difference calculations
print("=== Calculating Origination-Time Differences ===\n")

# Track which variables were successfully created
created_vars = []
failed_vars = []

# === 1. HPI difference (state-level) ===
if not state_hpi_long.empty and 'state_hpi' in loans_df.columns:
    print("Calculating hpi_st_d_t_o (HPI difference)...")
    
    # Get origination-time state HPI
    orig_hpi = state_hpi_long[['year_month', 'property_state', 'state_hpi']].rename(
        columns={'year_month': 'orig_year_month', 'state_hpi': 'orig_state_hpi'}
    )
    loans_df = loans_df.merge(orig_hpi, on=['orig_year_month', 'property_state'], how='left')
    
    # hpi_st_d_t_o: Difference of HPI between origination and today (state-level)
    loans_df['hpi_st_d_t_o'] = loans_df['state_hpi'] - loans_df['orig_state_hpi']
    
    coverage = loans_df['hpi_st_d_t_o'].notna().mean()
    print(f"  ✓ hpi_st_d_t_o coverage: {coverage:.1%}")
    if coverage > 0.9:
        created_vars.append('hpi_st_d_t_o')
    else:
        failed_vars.append(('hpi_st_d_t_o', f'{coverage:.1%} coverage'))
else:
    print("⚠️  Cannot calculate hpi_st_d_t_o (state HPI not available)")
    loans_df['hpi_st_d_t_o'] = np.nan
    failed_vars.append(('hpi_st_d_t_o', 'state HPI not loaded'))

# === 2. Interest rate and mortgage differences ===
if macro_national is not None and 'MORTGAGE30US' in loans_df.columns:
    print("\nCalculating prepayment incentives and rate differences...")
    
    # Get origination-time macro rates
    orig_macro = macro_national[['year_month', 'MORTGAGE30US', 'DGS10']].rename(
        columns={'year_month': 'orig_year_month', 'MORTGAGE30US': 'orig_MORTGAGE30US', 'DGS10': 'orig_DGS10'}
    )
    loans_df = loans_df.merge(orig_macro, on='orig_year_month', how='left')
    
    # Check if we have the required columns
    has_int_rate = 'orig_interest_rate' in loans_df.columns
    has_mortgage = 'MORTGAGE30US' in loans_df.columns and 'orig_MORTGAGE30US' in loans_df.columns
    has_treasury = 'DGS10' in loans_df.columns and 'orig_DGS10' in loans_df.columns
    
    # ppi_c_FRMA: Current prepayment incentive (loan rate - current mortgage rate)
    if has_int_rate and has_mortgage:
        loans_df['ppi_c_FRMA'] = loans_df['orig_interest_rate'] - loans_df['MORTGAGE30US']
        coverage = loans_df['ppi_c_FRMA'].notna().mean()
        print(f"  ✓ ppi_c_FRMA coverage: {coverage:.1%}")
        if coverage > 0.9:
            created_vars.append('ppi_c_FRMA')
        else:
            failed_vars.append(('ppi_c_FRMA', f'{coverage:.1%}'))
    else:
        loans_df['ppi_c_FRMA'] = np.nan
        failed_vars.append(('ppi_c_FRMA', 'missing required columns'))

    # ppi_o_FRMA: Prepayment incentive at origination (loan rate - orig mortgage rate)
    if has_int_rate and has_mortgage:
        loans_df['ppi_o_FRMA'] = loans_df['orig_interest_rate'] - loans_df['orig_MORTGAGE30US']
        coverage = loans_df['ppi_o_FRMA'].notna().mean()
        print(f"  ✓ ppi_o_FRMA coverage: {coverage:.1%}")
        if coverage > 0.9:
            created_vars.append('ppi_o_FRMA')
        else:
            failed_vars.append(('ppi_o_FRMA', f'{coverage:.1%}'))
    else:
        loans_df['ppi_o_FRMA'] = np.nan
        failed_vars.append(('ppi_o_FRMA', 'missing required columns'))

    # TB10Y_d_t_o: Difference of 10-year treasury rate (today - origination)
    if has_treasury:
        loans_df['TB10Y_d_t_o'] = loans_df['DGS10'] - loans_df['orig_DGS10']
        coverage = loans_df['TB10Y_d_t_o'].notna().mean()
        print(f"  ✓ TB10Y_d_t_o coverage: {coverage:.1%}")
        if coverage > 0.9:
            created_vars.append('TB10Y_d_t_o')
        else:
            failed_vars.append(('TB10Y_d_t_o', f'{coverage:.1%}'))
    else:
        loans_df['TB10Y_d_t_o'] = np.nan
        failed_vars.append(('TB10Y_d_t_o', 'missing DGS10'))

    # FRMA30Y_d_t_o: Difference of 30-year FRM average (today - origination)
    if has_mortgage:
        loans_df['FRMA30Y_d_t_o'] = loans_df['MORTGAGE30US'] - loans_df['orig_MORTGAGE30US']
        coverage = loans_df['FRMA30Y_d_t_o'].notna().mean()
        print(f"  ✓ FRMA30Y_d_t_o coverage: {coverage:.1%}")
        if coverage > 0.9:
            created_vars.append('FRMA30Y_d_t_o')
        else:
            failed_vars.append(('FRMA30Y_d_t_o', f'{coverage:.1%}'))
    else:
        loans_df['FRMA30Y_d_t_o'] = np.nan
        failed_vars.append(('FRMA30Y_d_t_o', 'missing MORTGAGE30US'))
        
else:
    print("⚠️  Cannot calculate rate differences (macro data not available)")
    for var in ['ppi_c_FRMA', 'ppi_o_FRMA', 'TB10Y_d_t_o', 'FRMA30Y_d_t_o']:
        loans_df[var] = np.nan
        failed_vars.append((var, 'macro data not loaded'))

# === Summary ===
print(f"\n=== Derived Variables Summary ===")
print(f"✓ Successfully created: {len(created_vars)}")
for var in created_vars:
    print(f"    - {var}")

if failed_vars:
    print(f"\n⚠️  Failed or low coverage: {len(failed_vars)}")
    for var, reason in failed_vars:
        print(f"    - {var}: {reason}")
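
The origination-time differences rely on joining the same lookup table twice: once keyed on the observation month and once, after renaming, keyed on the origination month. A toy sketch of that pattern with made-up rates and loans:

```python
import pandas as pd

rates = pd.DataFrame({
    "year_month": pd.period_range("2015-01", periods=4, freq="M"),
    "MORTGAGE30US": [3.7, 3.8, 3.9, 3.6],  # made-up rates
})

loans = pd.DataFrame({
    "loan_id": ["L1", "L2"],
    "orig_interest_rate": [4.0, 3.5],
    "orig_year_month": pd.PeriodIndex(["2015-01", "2015-02"], freq="M"),
    "year_month": pd.PeriodIndex(["2015-04", "2015-03"], freq="M"),
})

# Join 1: market rate at observation time.
loans = loans.merge(rates, on="year_month", how="left")

# Join 2: same lookup, renamed so it keys on the origination month.
orig_rates = rates.rename(
    columns={"year_month": "orig_year_month", "MORTGAGE30US": "orig_MORTGAGE30US"}
)
loans = loans.merge(orig_rates, on="orig_year_month", how="left")

loans["ppi_c_FRMA"] = loans["orig_interest_rate"] - loans["MORTGAGE30US"]
loans["ppi_o_FRMA"] = loans["orig_interest_rate"] - loans["orig_MORTGAGE30US"]
loans["FRMA30Y_d_t_o"] = loans["MORTGAGE30US"] - loans["orig_MORTGAGE30US"]
```

A positive `ppi_c_FRMA` means the loan's rate is above the current market rate, so refinancing would lower the borrower's rate.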

## Step 4: Rename Variables to Paper's Conventions

In [13]:
# Rename to paper's variable names
rename_map = {
    # Loan-level
    'orig_interest_rate': 'int_rate',
    'credit_score': 'fico_score',
    'orig_dti': 'dti_r',
    'orig_ltv': 'ltv_r',
    # Duration
    'loan_age': 'duration',
}

loans_df = loans_df.rename(columns=rename_map)

# Define final variable sets (from paper Table 2)
LOAN_LEVEL_VARS = [
    'int_rate',           # Initial interest rate
    'orig_upb',           # Original unpaid balance
    'fico_score',         # Initial FICO score
    'dti_r',              # Initial debt-to-income ratio
    'ltv_r',              # Initial loan-to-value ratio
    'bal_repaid',         # Current repaid balance in percent
    't_act_12m',          # Times not delinquent in last 12 months
    't_del_30d_12m',      # Times 30 days delinquent in last 12 months
    't_del_60d_12m',      # Times 60 days delinquent in last 12 months
]

MACRO_VARS = [
    'hpi_st_d_t_o',       # HPI difference (state)
    'ppi_c_FRMA',         # Current prepayment incentive
    'TB10Y_d_t_o',        # Treasury rate difference
    'FRMA30Y_d_t_o',      # 30Y FRM difference
    'ppi_o_FRMA',         # Prepayment incentive at origination
    'hpi_st_log12m',      # HPI 12-month log return (state)
    'hpi_r_st_us',        # Ratio of state HPI to national HPI
    'st_unemp_r12m',      # Unemployment 12-month log return (state)
    'st_unemp_r3m',       # Unemployment 3-month log return (state)
    'TB10Y_r12m',         # Treasury rate 12-month return
    'T10Y3MM',            # Yield spread (10Y - 3M)
    'T10Y3MM_r12m',       # Yield spread 12-month return
]

ALL_VARS = LOAN_LEVEL_VARS + MACRO_VARS

print(f"Loan-level variables: {len(LOAN_LEVEL_VARS)}")
print(f"Macro variables: {len(MACRO_VARS)}")
print(f"Total: {len(ALL_VARS)}")

Loan-level variables: 9
Macro variables: 12
Total: 21


In [None]:
# Check variable coverage with detailed status
print("=" * 70)
print("VARIABLE COVERAGE CHECK")
print("=" * 70)

all_ok = True
coverage_threshold = 0.90  # Warn if below 90%

print("\n=== Loan-Level Variables (9) ===")
for var in LOAN_LEVEL_VARS:
    if var in loans_df.columns:
        coverage = loans_df[var].notna().mean()
        status = "✓" if coverage >= coverage_threshold else "⚠️"
        if coverage < coverage_threshold:
            all_ok = False
        print(f"  {status} {var}: {coverage:.1%}")
    else:
        print(f"  ✗ {var}: MISSING")
        all_ok = False

print("\n=== Macro Variables (12) ===")
# Mark critical derived variables
critical_vars = ['hpi_st_d_t_o', 'ppi_c_FRMA', 'TB10Y_d_t_o', 'FRMA30Y_d_t_o', 'ppi_o_FRMA']

for var in MACRO_VARS:
    critical_marker = " [CRITICAL]" if var in critical_vars else ""
    if var in loans_df.columns:
        coverage = loans_df[var].notna().mean()
        status = "✓" if coverage >= coverage_threshold else "⚠️"
        if coverage < coverage_threshold:
            all_ok = False
        
        # Show sample values for critical variables
        if var in critical_vars and coverage > 0:
            sample_vals = loans_df[var].dropna().head(3).tolist()
            print(f"  {status} {var}: {coverage:.1%}{critical_marker}")
            print(f"      Sample values: {[round(v, 3) for v in sample_vals]}")
        else:
            print(f"  {status} {var}: {coverage:.1%}{critical_marker}")
    else:
        print(f"  ✗ {var}: MISSING{critical_marker}")
        all_ok = False

# Final status
print("\n" + "=" * 70)
if all_ok:
    print("✓ ALL VARIABLES OK (coverage >= 90%)")
else:
    print("⚠️  SOME VARIABLES HAVE ISSUES - Check warnings above")
print("=" * 70)
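
The check above counts a value as present when it is not `NaN`, so infinite values (for example from a percentage return with a zero base) still count toward coverage. A stricter variant could count only finite values. The helper below is a hypothetical sketch that assumes numeric columns; `coverage_report` is not part of the notebook.

```python
import numpy as np
import pandas as pd


def coverage_report(df: pd.DataFrame, columns: list, threshold: float = 0.90) -> pd.DataFrame:
    """Summarise the share of finite, non-missing values per column."""
    rows = []
    for col in columns:
        if col not in df.columns:
            rows.append({"variable": col, "coverage": 0.0, "status": "missing"})
            continue
        values = pd.to_numeric(df[col], errors="coerce")
        coverage = float(np.isfinite(values).mean())
        status = "ok" if coverage >= threshold else "low"
        rows.append({"variable": col, "coverage": coverage, "status": status})
    return pd.DataFrame(rows)


# Toy frame: column "a" has one NaN and one infinite value, "c" does not exist.
toy = pd.DataFrame({"a": [1.0, np.nan, np.inf, 2.0], "b": [1.0, 2.0, 3.0, 4.0]})
report = coverage_report(toy, ["a", "b", "c"])
```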

## Step 5: Create Subsamples for Cross-Validation

Following the paper: 11 random subsamples of 10,000 loans each.

In [15]:
# Filter to complete cases
required_cols = ['duration', 'event_code'] + [v for v in ALL_VARS if v in loans_df.columns]
loans_complete = loans_df.dropna(subset=[c for c in required_cols if c in loans_df.columns])

print(f"Complete cases: {len(loans_complete):,} / {len(loans_df):,} ({len(loans_complete)/len(loans_df):.1%})")

# Event distribution
print("\nEvent distribution:")
print(loans_complete['event_code'].value_counts().sort_index())
print("\n0=Censored, 1=Prepay, 2=Default")

Complete cases: 774,608 / 774,950 (100.0%)

Event distribution:
event_code
0    326436
1    422695
2     25477
Name: count, dtype: int64

0=Censored, 1=Prepay, 2=Default


In [16]:
# Create random subsamples.
# Sampling below is simple random (not stratified), so the event mix per fold
# is verified in the next cell rather than enforced here.
from sklearn.model_selection import StratifiedShuffleSplit  # not used in this cell

loans_for_sampling = loans_complete.copy()

# Create 11 subsamples
n_samples = min(N_FOLDS * SAMPLE_SIZE_PER_FOLD, len(loans_for_sampling))
if n_samples < len(loans_for_sampling):
    # Sample from the data
    sampled_df = loans_for_sampling.sample(n=n_samples, random_state=42)
else:
    sampled_df = loans_for_sampling

# Assign fold numbers
sampled_df = sampled_df.reset_index(drop=True)
sampled_df['fold'] = sampled_df.index % N_FOLDS

print(f"Total sampled: {len(sampled_df):,}")
print(f"\nSamples per fold:")
print(sampled_df['fold'].value_counts().sort_index())

Total sampled: 110,000

Samples per fold:
fold
0     10000
1     10000
2     10000
3     10000
4     10000
5     10000
6     10000
7     10000
8     10000
9     10000
10    10000
Name: count, dtype: int64
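
The cell above draws a simple random sample and assigns folds round-robin, so the event mix per fold varies slightly. If an identical event mix in every fold were wanted, one option is to deal rows into folds within each event type. This is an alternative sketch, not what the notebook does; `assign_stratified_folds` is a hypothetical helper.

```python
import pandas as pd


def assign_stratified_folds(df: pd.DataFrame, label_col: str, n_folds: int, seed: int = 42) -> pd.DataFrame:
    """Shuffle rows, then deal them into folds round-robin within each label."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    fold = shuffled.groupby(label_col).cumcount() % n_folds
    return shuffled.assign(fold=fold).sort_index()


# Toy data: 44 censored, 55 prepaid, 11 defaulted loans.
toy = pd.DataFrame({"event_code": [0] * 44 + [1] * 55 + [2] * 11})
folds = assign_stratified_folds(toy, "event_code", n_folds=11)
counts = folds.groupby(["fold", "event_code"]).size().unstack()
```

With these toy counts every fold receives exactly 4 censored, 5 prepaid, and 1 defaulted loan.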


In [17]:
# Check event distribution per fold
print("=== Event Distribution per Fold ===")
for fold in range(N_FOLDS):
    fold_data = sampled_df[sampled_df['fold'] == fold]
    prepay = (fold_data['event_code'] == 1).sum()
    default = (fold_data['event_code'] == 2).sum()
    censored = (fold_data['event_code'] == 0).sum()
    print(f"Fold {fold}: n={len(fold_data):,}, prepay={prepay:,}, default={default:,}, censored={censored:,}")

=== Event Distribution per Fold ===
Fold 0: n=10,000, prepay=5,470, default=304, censored=4,226
Fold 1: n=10,000, prepay=5,334, default=328, censored=4,338
Fold 2: n=10,000, prepay=5,483, default=326, censored=4,191
Fold 3: n=10,000, prepay=5,430, default=334, censored=4,236
Fold 4: n=10,000, prepay=5,397, default=337, censored=4,266
Fold 5: n=10,000, prepay=5,513, default=300, censored=4,187
Fold 6: n=10,000, prepay=5,384, default=343, censored=4,273
Fold 7: n=10,000, prepay=5,487, default=315, censored=4,198
Fold 8: n=10,000, prepay=5,450, default=313, censored=4,237
Fold 9: n=10,000, prepay=5,480, default=318, censored=4,202
Fold 10: n=10,000, prepay=5,408, default=331, censored=4,261
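
To judge how evenly the rare default event is spread, the printed counts can be converted to per-fold shares. The sketch below re-enters the counts from the output above by hand; the names `fold_counts` and `fold_shares` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Counts copied from the printed per-fold output above
fold_counts = pd.DataFrame(
    {
        "prepay": [5470, 5334, 5483, 5430, 5397, 5513, 5384, 5487, 5450, 5480, 5408],
        "default": [304, 328, 326, 334, 337, 300, 343, 315, 313, 318, 331],
        "censored": [4226, 4338, 4191, 4236, 4266, 4187, 4273, 4198, 4237, 4202, 4261],
    }
)
fold_counts.index.name = "fold"

# Row-wise shares: each fold's counts divided by that fold's total
fold_shares = fold_counts.div(fold_counts.sum(axis=1), axis=0)

default_share = fold_shares["default"]
print(fold_shares.round(4))
print(f"Default share range: {default_share.min():.2%} to {default_share.max():.2%}")
```

With these counts the default share ranges from 3.00% (fold 5) to 3.43% (fold 6).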


## Step 6: Save Processed Data

In [18]:
# Select final columns
final_cols = [
    # Identifiers
    'loan_sequence_number', 'vintage_year', 'fold',
    # Survival data
    'duration', 'event_code',
    # Loan-level variables
] + [v for v in LOAN_LEVEL_VARS if v in sampled_df.columns] + [
    # Macro variables  
] + [v for v in MACRO_VARS if v in sampled_df.columns] + [
    # Additional useful columns
    'property_state', 'year_month'
]

# Remove duplicates and filter
final_cols = list(dict.fromkeys(final_cols))
final_cols = [c for c in final_cols if c in sampled_df.columns]

final_df = sampled_df[final_cols].copy()
print(f"Final dataset: {len(final_df):,} rows, {len(final_cols)} columns")

Final dataset: 110,000 rows, 28 columns


In [19]:
# Save to parquet
output_path = PROCESSED_DATA_DIR / 'blumenstock_dataset2.parquet'
final_df.to_parquet(output_path, index=False)
print(f"Saved to {output_path}")

# Also save variable lists for reference
import json

var_config = {
    'loan_level_vars': LOAN_LEVEL_VARS,
    'macro_vars': MACRO_VARS,
    'all_vars': ALL_VARS,
}
with open(PROCESSED_DATA_DIR / 'blumenstock_variables.json', 'w') as f:
    json.dump(var_config, f, indent=2)
print("Variable config saved.")

Saved to ../data/processed/blumenstock_dataset2.parquet
Variable config saved.


In [None]:
# === FINAL VERIFICATION: Reload and verify saved dataset ===
print("=" * 70)
print("FINAL VERIFICATION: Checking Saved Dataset")
print("=" * 70)

# Reload the saved dataset
verify_df = pd.read_parquet(output_path)
print(f"\nReloaded dataset: {len(verify_df):,} rows, {len(verify_df.columns)} columns")

# Check all required variables are present and have good coverage
print("\n=== Critical Variable Verification ===")

critical_macro_vars = ['hpi_st_d_t_o', 'ppi_c_FRMA', 'TB10Y_d_t_o', 'FRMA30Y_d_t_o', 'ppi_o_FRMA']
all_critical_ok = True

for var in critical_macro_vars:
    if var in verify_df.columns:
        coverage = verify_df[var].notna().mean()
        mean_val = verify_df[var].mean()
        std_val = verify_df[var].std()
        
        # Test the all-NaN case first; otherwise the low-coverage branch hides it
        if verify_df[var].isna().all():
            print(f"  ✗ {var}: ALL NaN VALUES!")
            all_critical_ok = False
        elif coverage < 0.9:
            print(f"  ⚠️  {var}: {coverage:.1%} coverage (LOW!)")
            all_critical_ok = False
        else:
            print(f"  ✓ {var}: {coverage:.1%} coverage, mean={mean_val:.3f}, std={std_val:.3f}")
    else:
        print(f"  ✗ {var}: MISSING FROM SAVED DATASET!")
        all_critical_ok = False

# Summary statistics for all variables
print("\n=== Quick Statistics ===")
print(f"Total variables: {len(verify_df.columns)}")
print(f"Loan-level vars present: {len([v for v in LOAN_LEVEL_VARS if v in verify_df.columns])}/{len(LOAN_LEVEL_VARS)}")
print(f"Macro vars present: {len([v for v in MACRO_VARS if v in verify_df.columns])}/{len(MACRO_VARS)}")

# Final status
print("\n" + "=" * 70)
if all_critical_ok:
    print("✓ VERIFICATION PASSED: All critical variables present with good coverage")
else:
    print("✗ VERIFICATION FAILED: Some critical variables missing or have low coverage")
    print("  Check the warnings above and re-run data preparation if needed.")
print("=" * 70)
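
The same presence and coverage checks can also be collected into a table, which is easier to filter or save than printed lines. This is a minimal sketch on toy data: the helper name `coverage_report` and the sample values are assumptions, and the 0.9 threshold mirrors the one used in the cell above.

```python
import numpy as np
import pandas as pd


def coverage_report(df, variables, min_coverage=0.9):
    """Summarise presence and non-null share for each requested column."""
    rows = []
    for var in variables:
        present = var in df.columns
        # A missing column is reported with zero coverage
        coverage = float(df[var].notna().mean()) if present else 0.0
        rows.append({
            "variable": var,
            "present": present,
            "coverage": coverage,
            "ok": present and coverage >= min_coverage,
        })
    return pd.DataFrame(rows).set_index("variable")


# Toy frame: one complete column, one mostly missing, one absent (synthetic)
toy = pd.DataFrame({
    "ppi_c_FRMA": [0.1, 0.2, 0.3, 0.4],
    "TB10Y_d_t_o": [np.nan, np.nan, np.nan, 1.0],
})
report = coverage_report(toy, ["ppi_c_FRMA", "TB10Y_d_t_o", "ppi_o_FRMA"])
print(report)
```

A single `report["ok"].all()` then plays the role of the `all_critical_ok` flag.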

In [20]:
# Final summary
print("=" * 60)
print("BLUMENSTOCK DATASET 2 SUMMARY")
print("=" * 60)
print(f"\nPeriod: {final_df['vintage_year'].min()}-{final_df['vintage_year'].max()}")
print(f"Total observations: {len(final_df):,}")
print(f"Number of folds: {final_df['fold'].nunique()}")

print(f"\n=== Event Distribution ===")
event_counts = final_df['event_code'].value_counts().sort_index()
for code, count in event_counts.items():
    event_name = {0: 'Censored', 1: 'Prepayment', 2: 'Default'}.get(code, 'Other')
    pct = count / len(final_df) * 100
    print(f"  {event_name} (k={code}): {count:,} ({pct:.1f}%)")

print(f"\n=== Variable Summary ===")
print(f"Loan-level variables: {len([v for v in LOAN_LEVEL_VARS if v in final_df.columns])}")
print(f"Macro variables: {len([v for v in MACRO_VARS if v in final_df.columns])}")

print(f"\n=== Duration Statistics ===")
print(final_df['duration'].describe())

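============================================================
BLUMENSTOCK DATASET 2 SUMMARY
============================================================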

Period: 2010-2025
Total observations: 110,000
Number of folds: 11

=== Event Distribution ===
  Censored (k=0): 46,615 (42.4%)
  Prepayment (k=1): 59,836 (54.4%)
  Default (k=2): 3,549 (3.2%)

=== Variable Summary ===
Loan-level variables: 9
Macro variables: 12

=== Duration Statistics ===
count    110000.000000
mean         49.535464
std          39.820797
min           0.000000
25%          19.000000
50%          39.000000
75%          68.000000
max         184.000000
Name: duration, dtype: float64


## Next Steps

The processed dataset is ready for the modelling experiments in the following notebooks:

1. **Notebook 05**: Cause-Specific Cox (CSC)
2. **Notebook 06**: Fine-Gray Model (FGR)
3. **Notebook 07**: Random Survival Forest (RSF)
4. **Notebook 08**: Model Comparison

### Experiments (from paper)

| Experiment | Variables | Variable list |
|------------|-----------|---------------|
| Exp 4.1 | Loan-level only | `LOAN_LEVEL_VARS` |
| Exp 4.2 | Macro only | `MACRO_VARS` |
| Exp 4.3 | All variables | `ALL_VARS` |
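
Selecting the covariates for one experiment could look like the sketch below. It uses shortened variable lists and a made-up toy frame in place of the saved parquet file. The helper `experiment_frame`, the `EXPERIMENT_VARS` mapping, and the assumption that `ALL_VARS` is the concatenation of the two lists are illustrative and not taken from the notebook.

```python
import pandas as pd

# Shortened variable lists for illustration; the notebook defines the full ones
LOAN_LEVEL_VARS = ["int_rate", "fico_score", "ltv_r"]
MACRO_VARS = ["ppi_c_FRMA", "T10Y3MM"]
ALL_VARS = LOAN_LEVEL_VARS + MACRO_VARS  # assumption: ALL_VARS is the union

# Columns every survival model needs, regardless of experiment
SURVIVAL_COLS = ["duration", "event_code", "fold"]

EXPERIMENT_VARS = {"4.1": LOAN_LEVEL_VARS, "4.2": MACRO_VARS, "4.3": ALL_VARS}


def experiment_frame(df, experiment):
    """Keep the survival columns plus the covariates for one experiment."""
    return df[SURVIVAL_COLS + EXPERIMENT_VARS[experiment]].copy()


# Tiny made-up frame standing in for the saved dataset
toy = pd.DataFrame({
    "duration": [12, 40, 75],
    "event_code": [1, 0, 2],
    "fold": [0, 1, 2],
    "int_rate": [4.5, 3.9, 5.1],
    "fico_score": [720, 760, 680],
    "ltv_r": [80, 75, 95],
    "ppi_c_FRMA": [0.2, -0.1, 0.4],
    "T10Y3MM": [1.5, 1.2, 0.8],
})

exp41 = experiment_frame(toy, "4.1")
print(list(exp41.columns))
```

Keeping `fold` in every frame lets the downstream notebooks split the data the same way for all three experiments.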