# Create Loan-Month Panel Data for Cox Regression

This notebook creates **loan-month panel data** for cause-specific Cox regression with time-varying covariates.

## Methodology (Blumenstock et al. 2022)

### Sampling Strategy
- **11 folds** (10 for cross-validation, 1 for hyperparameter tuning)
- **10,000 loans per fold** (100 defaults + 9,900 non-defaults)
- **Sampling without replacement** so that folds are disjoint (no loan appears in more than one fold)

### Data Structure
- Each loan contributes multiple rows (one per month observed)
- Time-varying covariates updated each month
- Interval format `(start, stop)` for Cox regression
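
As a minimal sketch of this layout, a loan observed for three months that defaults in the third would contribute rows like the ones below. The loan ID, the values, and the covariate name `rate_gap` are made up for illustration.

```python
import pandas as pd

# Hypothetical loan observed for three months; it defaults in month 3.
# `rate_gap` stands in for any time-varying covariate.
toy_panel = pd.DataFrame({
    "loan_id":  ["A", "A", "A"],
    "start":    [0, 1, 2],           # interval opens at loan age - 1
    "stop":     [1, 2, 3],           # interval closes at loan age
    "rate_gap": [0.10, 0.35, 0.60],  # covariate value during the interval
    "event":    [0, 0, 1],           # 1 only on the row where the event occurs
})

# Every interval covers exactly one month, and consecutive intervals are contiguous.
assert (toy_panel["stop"] - toy_panel["start"] == 1).all()
assert (toy_panel["start"].iloc[1:].values == toy_panel["stop"].iloc[:-1].values).all()
print(toy_panel)
```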

## Phases
1. **Phase 1**: Sample loans and assign them to folds
2. **Phase 2**: Create the loan-month panel for the sampled loans and merge static and time-varying covariates
3. **Phase 3**: Fit cause-specific Cox models on the panel (notebook 05)

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 200)

print("Imports complete.")

In [None]:
# === CONFIGURATION ===

# Data paths
RAW_DATA_DIR = Path('../data/raw')
PROCESSED_DATA_DIR = Path('../data/processed')
EXTERNAL_DATA_DIR = Path('../data/external')

# Sampling configuration (Blumenstock et al.)
N_FOLDS = 11
DEFAULTS_PER_FOLD = 100
NON_DEFAULTS_PER_FOLD = 9_900
LOANS_PER_FOLD = DEFAULTS_PER_FOLD + NON_DEFAULTS_PER_FOLD  # 10,000

# Total needed
TOTAL_DEFAULTS_NEEDED = N_FOLDS * DEFAULTS_PER_FOLD      # 1,100
TOTAL_NON_DEFAULTS_NEEDED = N_FOLDS * NON_DEFAULTS_PER_FOLD  # 108,900
TOTAL_LOANS = N_FOLDS * LOANS_PER_FOLD  # 110,000

# Random seed for reproducibility
RANDOM_SEED = 42

print("=== Sampling Configuration ===")
print(f"Number of folds: {N_FOLDS}")
print(f"Loans per fold: {LOANS_PER_FOLD:,}")
print(f"  - Defaults per fold: {DEFAULTS_PER_FOLD}")
print(f"  - Non-defaults per fold: {NON_DEFAULTS_PER_FOLD:,}")
print(f"\nTotal loans to sample: {TOTAL_LOANS:,}")
print(f"  - Total defaults needed: {TOTAL_DEFAULTS_NEEDED:,}")
print(f"  - Total non-defaults needed: {TOTAL_NON_DEFAULTS_NEEDED:,}")

---

# Phase 1: Sample Loans

Load terminal records and sample loans using stratified sampling:
- 100 defaults per fold (without replacement)
- 9,900 non-defaults per fold (without replacement)
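
The shuffle-then-chunk idea can be sketched on toy pools as follows. The helper `assign_folds`, the pool sizes, and the loan IDs are hypothetical; the cells below apply the same idea with the real configuration.

```python
import numpy as np

def assign_folds(ids, n_folds, per_fold, rng):
    """Shuffle `ids` once, then cut the first n_folds * per_fold into equal chunks."""
    ids = np.asarray(ids)
    if len(ids) < n_folds * per_fold:
        raise ValueError("pool is too small for the requested folds")
    shuffled = rng.permutation(ids)
    return [shuffled[k * per_fold:(k + 1) * per_fold] for k in range(n_folds)]

rng = np.random.default_rng(42)
default_pool = [f"D{i}" for i in range(50)]       # hypothetical loan IDs
non_default_pool = [f"N{i}" for i in range(500)]  # hypothetical loan IDs

# Stratify by outcome: chunk each pool separately, then combine per fold.
default_folds = assign_folds(default_pool, n_folds=3, per_fold=2, rng=rng)
non_default_folds = assign_folds(non_default_pool, n_folds=3, per_fold=18, rng=rng)
folds = [np.concatenate([d, n]) for d, n in zip(default_folds, non_default_folds)]

# Chunks of a single permutation cannot share an ID, so the folds are disjoint.
all_ids = np.concatenate(folds)
assert len(set(all_ids)) == len(all_ids) == 3 * 20
```

Because each pool is permuted once and sliced into non-overlapping ranges, disjointness follows by construction, and the later overlap check acts as a safeguard against duplicate IDs in the input.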

In [None]:
# Load terminal records to identify loan outcomes
print("Loading terminal records...")
terminal_df = pd.read_parquet(PROCESSED_DATA_DIR / 'survival_data_blumenstock.parquet')

print(f"\nLoaded {len(terminal_df):,} loans")
print(f"Vintages: {terminal_df['vintage_year'].min()} - {terminal_df['vintage_year'].max()}")

print("\nEvent distribution:")
event_names = {0: 'Censored', 1: 'Prepay', 2: 'Default'}
for code, count in terminal_df['event_code'].value_counts().sort_index().items():
    pct = count / len(terminal_df) * 100
    print(f"  {event_names.get(code, 'Other')} (k={code}): {count:,} ({pct:.1f}%)")

In [None]:
# Separate loans by terminal event type
defaulted_loans = terminal_df[terminal_df['event_code'] == 2]['loan_sequence_number'].unique()
non_defaulted_loans = terminal_df[terminal_df['event_code'] != 2]['loan_sequence_number'].unique()

print("=== Loan Pools ===")
print(f"Defaulted loans available: {len(defaulted_loans):,}")
print(f"Non-defaulted loans available: {len(non_defaulted_loans):,}")

# Verify we have enough loans
print("\n=== Verification ===")
if len(defaulted_loans) >= TOTAL_DEFAULTS_NEEDED:
    print(f"✓ Sufficient defaults: {len(defaulted_loans):,} >= {TOTAL_DEFAULTS_NEEDED:,} needed")
else:
    print(f"✗ INSUFFICIENT defaults: {len(defaulted_loans):,} < {TOTAL_DEFAULTS_NEEDED:,} needed")
    raise ValueError("Not enough defaulted loans for sampling")

if len(non_defaulted_loans) >= TOTAL_NON_DEFAULTS_NEEDED:
    print(f"✓ Sufficient non-defaults: {len(non_defaulted_loans):,} >= {TOTAL_NON_DEFAULTS_NEEDED:,} needed")
else:
    print(f"✗ INSUFFICIENT non-defaults: {len(non_defaulted_loans):,} < {TOTAL_NON_DEFAULTS_NEEDED:,} needed")
    raise ValueError("Not enough non-defaulted loans for sampling")

In [None]:
# Shuffle loans (without replacement sampling)
print("Shuffling loan pools...")
np.random.seed(RANDOM_SEED)

defaulted_shuffled = np.random.permutation(defaulted_loans)
non_defaulted_shuffled = np.random.permutation(non_defaulted_loans)

print(f"  Defaulted pool shuffled: {len(defaulted_shuffled):,} loans")
print(f"  Non-defaulted pool shuffled: {len(non_defaulted_shuffled):,} loans")

In [None]:
# Create fold assignments by taking sequential chunks (without replacement)
print("\n=== Creating Fold Assignments ===")

fold_assignments = {}  # loan_id -> fold number
fold_loan_lists = []   # List of loan IDs per fold (for verification)

for fold in range(N_FOLDS):
    # Defaults for this fold (sequential chunk)
    d_start = fold * DEFAULTS_PER_FOLD
    d_end = d_start + DEFAULTS_PER_FOLD
    fold_defaults = defaulted_shuffled[d_start:d_end]
    
    # Non-defaults for this fold (sequential chunk)
    nd_start = fold * NON_DEFAULTS_PER_FOLD
    nd_end = nd_start + NON_DEFAULTS_PER_FOLD
    fold_non_defaults = non_defaulted_shuffled[nd_start:nd_end]
    
    # Combine and assign fold number
    fold_loans = np.concatenate([fold_defaults, fold_non_defaults])
    fold_loan_lists.append(fold_loans)
    
    for loan_id in fold_loans:
        fold_assignments[loan_id] = fold
    
    print(f"Fold {fold:2d}: {len(fold_defaults):,} defaults + {len(fold_non_defaults):,} non-defaults = {len(fold_loans):,} loans")

# Collect all sampled loan IDs
sampled_loan_ids = set(fold_assignments.keys())

print(f"\n=== Summary ===")
print(f"Total loans sampled: {len(sampled_loan_ids):,}")
print(f"Expected: {TOTAL_LOANS:,}")
assert len(sampled_loan_ids) == TOTAL_LOANS, "Mismatch in total sampled loans!"

In [None]:
# Verify no overlap between folds (sampling without replacement)
print("=== Verifying No Overlap Between Folds ===")

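fold_sets = [set(loans) for loans in fold_loan_lists]  # build each set once

all_ok = True
for i in range(N_FOLDS):
    for j in range(i + 1, N_FOLDS):
        overlap = fold_sets[i] & fold_sets[j]
        if overlap:
            print(f"✗ Overlap between fold {i} and fold {j}: {len(overlap)} loans")
            all_ok = False

if all_ok:
    print("✓ No overlap between any folds - sampling without replacement verified")
else:
    # Fail loudly, like the pool-size checks above, rather than continue with overlapping folds
    raise ValueError("Folds overlap; sampling without replacement failed")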

In [None]:
# Verify event distribution per fold
print("=== Event Distribution per Fold ===")
print(f"Target: {DEFAULTS_PER_FOLD} defaults (1%) + {NON_DEFAULTS_PER_FOLD:,} non-defaults per fold\n")

# Create a DataFrame of sampled loans with their fold assignments
sampled_df = terminal_df[terminal_df['loan_sequence_number'].isin(sampled_loan_ids)].copy()
sampled_df['fold'] = sampled_df['loan_sequence_number'].map(fold_assignments)

for fold in range(N_FOLDS):
    fold_data = sampled_df[sampled_df['fold'] == fold]
    n_defaults = (fold_data['event_code'] == 2).sum()
    n_prepay = (fold_data['event_code'] == 1).sum()
    n_censored = (fold_data['event_code'] == 0).sum()
    default_pct = n_defaults / len(fold_data) * 100
    
    status = "✓" if n_defaults == DEFAULTS_PER_FOLD else "✗"
    print(f"{status} Fold {fold:2d}: n={len(fold_data):,}, defaults={n_defaults} ({default_pct:.1f}%), prepay={n_prepay:,}, censored={n_censored:,}")

# Overall summary
total_defaults = (sampled_df['event_code'] == 2).sum()
total_prepay = (sampled_df['event_code'] == 1).sum()
total_censored = (sampled_df['event_code'] == 0).sum()

print(f"\n=== Overall ===")
print(f"Total defaults: {total_defaults:,} ({total_defaults/len(sampled_df)*100:.1f}%)")
print(f"Total prepay: {total_prepay:,} ({total_prepay/len(sampled_df)*100:.1f}%)")
print(f"Total censored: {total_censored:,} ({total_censored/len(sampled_df)*100:.1f}%)")

In [None]:
# Check vintage distribution in sampled data
print("=== Vintage Distribution in Sampled Data ===")

vintage_dist = sampled_df.groupby('vintage_year').agg({
    'loan_sequence_number': 'count',
    'event_code': lambda x: (x == 2).sum()  # Count defaults
}).rename(columns={'loan_sequence_number': 'n_loans', 'event_code': 'n_defaults'})

vintage_dist['default_rate'] = (vintage_dist['n_defaults'] / vintage_dist['n_loans'] * 100).round(2)

print(vintage_dist.to_string())
print(f"\nTotal: {vintage_dist['n_loans'].sum():,} loans, {vintage_dist['n_defaults'].sum():,} defaults")

In [None]:
# Save the sampled loans with fold assignments
print("=== Saving Sampled Loan Data ===")

# Select key columns for the sampled loans reference file
sample_cols = [
    'loan_sequence_number', 'fold', 'vintage_year', 'property_state',
    'event_code', 'duration',
    # Static covariates
    'int_rate', 'orig_upb', 'fico_score', 'dti_r', 'ltv_r',
    # Origination info for macro merging
    'first_payment_date', 'orig_year_month'
]

# Filter to available columns
sample_cols = [c for c in sample_cols if c in sampled_df.columns]
sampled_loans_df = sampled_df[sample_cols].copy()

# Save to parquet
output_path = PROCESSED_DATA_DIR / 'sampled_loans_blumenstock.parquet'
sampled_loans_df.to_parquet(output_path, index=False)

print(f"✓ Saved sampled loans to: {output_path}")
print(f"  Shape: {sampled_loans_df.shape}")
print(f"  Columns: {list(sampled_loans_df.columns)}")

In [None]:
# Save fold assignments as a separate lookup file (for use in panel creation)
import json

# Convert to serializable format
fold_assignments_serializable = {str(k): int(v) for k, v in fold_assignments.items()}

# Save as JSON
fold_path = PROCESSED_DATA_DIR / 'fold_assignments.json'
with open(fold_path, 'w') as f:
    json.dump(fold_assignments_serializable, f)

print(f"✓ Saved fold assignments to: {fold_path}")
print(f"  Total entries: {len(fold_assignments_serializable):,}")

---

## Phase 1 Summary

### Completed
- ✅ Loaded terminal records (774,950 loans)
- ✅ Separated defaulted (25,527) vs non-defaulted (749,423) loans
- ✅ Sampled 100 defaults + 9,900 non-defaults per fold (without replacement)
- ✅ Created 11 folds with 10,000 loans each (110,000 total)
- ✅ Verified no overlap between folds
- ✅ Saved sampled loans to `sampled_loans_blumenstock.parquet`
- ✅ Saved fold assignments to `fold_assignments.json`

### Output Files
| File | Description |
|------|-------------|
| `sampled_loans_blumenstock.parquet` | 110,000 sampled loans with fold assignments and static covariates |
| `fold_assignments.json` | Loan ID → Fold mapping for panel creation |

### Next: Phase 2
Create loan-month panel data for the sampled loans by expanding performance records.

In [None]:
print("="*60)
print("PHASE 1 COMPLETE: Loan Sampling")
print("="*60)
print(f"\nSampled {len(sampled_loan_ids):,} loans across {N_FOLDS} folds")
print(f"  - {DEFAULTS_PER_FOLD} defaults per fold ({TOTAL_DEFAULTS_NEEDED:,} total)")
print(f"  - {NON_DEFAULTS_PER_FOLD:,} non-defaults per fold ({TOTAL_NON_DEFAULTS_NEEDED:,} total)")
print(f"\nOutput files:")
print(f"  - {PROCESSED_DATA_DIR / 'sampled_loans_blumenstock.parquet'}")
print(f"  - {PROCESSED_DATA_DIR / 'fold_assignments.json'}")
print(f"\nReady for Phase 2: Create loan-month panel data")

---

# Phase 2: Create Loan-Month Panel Data

Expand the sampled loans into loan-month panel format:
1. Load raw performance data for each vintage
2. Filter to sampled loans (early filtering for memory efficiency)
3. Calculate behavioral variables (rolling 12-month counts)
4. Determine events per loan-month
5. Create interval format (start, stop) for Cox regression
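
Step 3 can be illustrated on a toy delinquency history. This is a sketch only: the data are made up, and a 3-month window (column `t_del_30d_3m`, a hypothetical name) keeps the example short where the notebook uses 12 months.

```python
import pandas as pd

# Hypothetical delinquency history for two loans (0 = current, 1 = 30 days late).
perf = pd.DataFrame({
    "loan_id": ["A"] * 4 + ["B"] * 3,
    "loan_age": [1, 2, 3, 4, 1, 2, 3],
    "delinquency_status": [0, 1, 1, 0, 0, 0, 1],
})
perf = perf.sort_values(["loan_id", "loan_age"])
perf["is_30d_del"] = (perf["delinquency_status"] == 1).astype(int)

# Trailing-window count within each loan. min_periods=1 gives young loans
# a count from their first month instead of NaN.
perf["t_del_30d_3m"] = (
    perf.groupby("loan_id")["is_30d_del"]
    .transform(lambda s: s.rolling(3, min_periods=1).sum())
    .astype(int)
)
print(perf[["loan_id", "loan_age", "is_30d_del", "t_del_30d_3m"]])
```

The window counts rows, not calendar months, so it matches a calendar window only when each loan has one row per month with no gaps.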

In [None]:
# Load sampled loan IDs and fold assignments from Phase 1
import json

print("=== Loading Phase 1 Outputs ===")

# Load sampled loans reference
sampled_loans_df = pd.read_parquet(PROCESSED_DATA_DIR / 'sampled_loans_blumenstock.parquet')
print(f"Loaded {len(sampled_loans_df):,} sampled loans")

# Load fold assignments
with open(PROCESSED_DATA_DIR / 'fold_assignments.json', 'r') as f:
    fold_assignments = json.load(f)
print(f"Loaded {len(fold_assignments):,} fold assignments")

# Create set of sampled loan IDs for fast lookup
sampled_loan_ids = set(fold_assignments.keys())
print(f"Sampled loan IDs: {len(sampled_loan_ids):,}")

# Vintages to process
VINTAGES = list(range(2010, 2026))
print(f"\nVintages to process: {VINTAGES[0]} - {VINTAGES[-1]}")

In [None]:
# Import column definitions
import sys
sys.path.insert(0, '..')
from src.data.columns import (
    ORIGINATION_COLUMNS, ORIGINATION_DTYPES,
    PERFORMANCE_COLUMNS, PERFORMANCE_DTYPES,
)

# Default definition: 3-month delinquent (90+ days)
DEFAULT_DELINQUENCY_THRESHOLD = 3

def load_performance_data(vintage: int) -> pd.DataFrame:
    """Load performance (loan-month) data for a vintage."""
    pattern = f'sample_{vintage}/sample_svcg_{vintage}.txt'
    files = list(RAW_DATA_DIR.glob(f'**/{pattern}'))
    
    if not files:
        return pd.DataFrame()
    
    df = pd.read_csv(
        files[0], sep='|', names=PERFORMANCE_COLUMNS,
        dtype=PERFORMANCE_DTYPES, na_values=['', ' ']
    )
    return df


def load_origination_data(vintage: int) -> pd.DataFrame:
    """Load origination data for a vintage."""
    pattern = f'sample_{vintage}/sample_orig_{vintage}.txt'
    files = list(RAW_DATA_DIR.glob(f'**/{pattern}'))
    
    if not files:
        return pd.DataFrame()
    
    df = pd.read_csv(
        files[0], sep='|', names=ORIGINATION_COLUMNS,
        dtype=ORIGINATION_DTYPES, na_values=['', ' ']
    )
    return df


print("Data loading functions defined.")

In [None]:
def process_vintage_panel(vintage: int, sampled_loan_ids: set, fold_assignments: dict) -> pd.DataFrame:
    """
    Process a vintage into loan-month panel format for sampled loans only.
    
    Returns DataFrame with one row per loan-month, including:
    - Interval format (start, stop) for Cox regression
    - Behavioral variables (rolling 12-month counts)
    - Event indicators
    """
    print(f"\nProcessing vintage {vintage}...")
    
    # Load performance data
    perf_df = load_performance_data(vintage)
    if perf_df.empty:
        print(f"  No performance data found for vintage {vintage}")
        return pd.DataFrame()
    
    initial_rows = len(perf_df)
    initial_loans = perf_df['loan_sequence_number'].nunique()
    
    # EARLY FILTER: Keep only sampled loans (critical for memory efficiency)
    # .copy() so the column assignments below modify an independent frame, not a view
    perf_df = perf_df[perf_df['loan_sequence_number'].isin(sampled_loan_ids)].copy()
    
    if len(perf_df) == 0:
        print(f"  No sampled loans found in vintage {vintage}")
        return pd.DataFrame()
    
    filtered_loans = perf_df['loan_sequence_number'].nunique()
    print(f"  Filtered: {initial_loans:,} → {filtered_loans:,} loans ({len(perf_df):,} loan-months)")
    
    # Add fold assignment
    perf_df['fold'] = perf_df['loan_sequence_number'].map(fold_assignments)
    
    # Parse reporting period
    perf_df['reporting_date'] = pd.to_datetime(
        perf_df['monthly_reporting_period'].astype(str), format='%Y%m'
    )
    perf_df['year_month'] = perf_df['reporting_date'].dt.to_period('M')
    
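    # Parse delinquency status.
    # NOTE: 'X'/'XX' and any other non-numeric code are coerced to 0 (current),
    # so those months count toward t_act_12m below.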
    perf_df['delinquency_status'] = pd.to_numeric(
        perf_df['current_loan_delinquency_status'].replace({'X': '0', 'XX': '0'}),
        errors='coerce'
    ).fillna(0).astype(int)
    
    # Sort by loan and time (required for rolling calculations)
    perf_df = perf_df.sort_values(['loan_sequence_number', 'loan_age'])
    
    # === Calculate behavioral variables ===
    # Binary indicators for each month
    perf_df['is_current'] = (perf_df['delinquency_status'] == 0).astype(int)
    perf_df['is_30d_del'] = (perf_df['delinquency_status'] == 1).astype(int)
    perf_df['is_60d_del'] = (perf_df['delinquency_status'] == 2).astype(int)
    
    # Rolling 12-month counts (per loan)
    grouped = perf_df.groupby('loan_sequence_number')
    perf_df['t_act_12m'] = grouped['is_current'].transform(
        lambda x: x.rolling(12, min_periods=1).sum()
    )
    perf_df['t_del_30d_12m'] = grouped['is_30d_del'].transform(
        lambda x: x.rolling(12, min_periods=1).sum()
    )
    perf_df['t_del_60d_12m'] = grouped['is_60d_del'].transform(
        lambda x: x.rolling(12, min_periods=1).sum()
    )
    
    # === Determine events ===
    # Default: first time reaching 90+ days delinquent
    perf_df['is_default'] = (perf_df['delinquency_status'] >= DEFAULT_DELINQUENCY_THRESHOLD).astype(int)
    perf_df['cumsum_default'] = grouped['is_default'].transform('cumsum')
    perf_df['first_default'] = ((perf_df['cumsum_default'] == 1) & (perf_df['is_default'] == 1)).astype(int)
    
    # Prepayment: zero balance code = 01
    perf_df['is_prepay'] = (perf_df['zero_balance_code'] == '01').astype(int)
    
    # === Create interval format (start, stop) ===
    # Filter out loan_age <= 0 (origination month with no elapsed time)
    # These rows cannot contribute to the risk set and would create invalid start < 0
    n_before_filter = len(perf_df)
    perf_df = perf_df[perf_df['loan_age'] > 0].copy()
    n_dropped = n_before_filter - len(perf_df)
    if n_dropped > 0:
        print(f"  Dropped {n_dropped:,} rows with loan_age <= 0")
    
    perf_df['start'] = perf_df['loan_age'] - 1
    perf_df['stop'] = perf_df['loan_age']
    
    # Event indicator (1 if event occurs at this stop time)
    perf_df['event'] = 0
    perf_df.loc[perf_df['first_default'] == 1, 'event'] = 1
    perf_df.loc[perf_df['is_prepay'] == 1, 'event'] = 1
    
    # Event code (for cause-specific models)
    perf_df['event_code'] = 0  # censored
    perf_df.loc[perf_df['is_prepay'] == 1, 'event_code'] = 1  # prepay
    perf_df.loc[perf_df['first_default'] == 1, 'event_code'] = 2  # default
    
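    # === Remove observations AFTER the event ===
    # A loan should not contribute to the risk set after its first event.
    # Recompute the groupby because perf_df was filtered above.
    grouped = perf_df.groupby('loan_sequence_number')
    perf_df['cumsum_event'] = grouped['event'].transform('cumsum')
    # Keep rows with no event strictly before them, i.e. up to and including the first event.
    # (cumsum_event <= 1 alone would also keep event-free rows that follow the first event.)
    perf_df = perf_df[(perf_df['cumsum_event'] - perf_df['event']) == 0]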
    
    # Add vintage year
    perf_df['vintage_year'] = vintage
    
    print(f"  Final: {perf_df['loan_sequence_number'].nunique():,} loans, {len(perf_df):,} loan-months")
    
    return perf_df


print("Processing function defined.")

In [None]:
# Process all vintages and combine into panel
print("=" * 60)
print("PROCESSING ALL VINTAGES")
print("=" * 60)

panel_dfs = []
vintage_stats = []

for vintage in VINTAGES:
    panel_vintage = process_vintage_panel(vintage, sampled_loan_ids, fold_assignments)
    
    if not panel_vintage.empty:
        n_loans = panel_vintage['loan_sequence_number'].nunique()
        n_months = len(panel_vintage)
        n_events = panel_vintage['event'].sum()
        
        vintage_stats.append({
            'vintage': vintage,
            'loans': n_loans,
            'loan_months': n_months,
            'events': n_events,
            'avg_duration': n_months / n_loans if n_loans > 0 else 0
        })
        
        panel_dfs.append(panel_vintage)

# Combine all vintages
print("\n" + "=" * 60)
print("COMBINING VINTAGES")
print("=" * 60)

panel_df = pd.concat(panel_dfs, ignore_index=True)

print(f"\nTotal loan-months: {len(panel_df):,}")
print(f"Total unique loans: {panel_df['loan_sequence_number'].nunique():,}")
print(f"Total events: {panel_df['event'].sum():,}")

In [None]:
# Display vintage statistics
print("=== Vintage Statistics ===")
stats_df = pd.DataFrame(vintage_stats)
stats_df['avg_duration'] = stats_df['avg_duration'].round(1)
print(stats_df.to_string(index=False))

print(f"\nTotal: {stats_df['loans'].sum():,} loans, {stats_df['loan_months'].sum():,} loan-months")

## Merge Origination Data (Static Covariates)

Add static loan characteristics from origination files.
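
Since the panel has many rows per loan and the origination table should have one, a left merge must leave the row count unchanged. A minimal sketch on made-up data, using the `validate` argument of `DataFrame.merge` as one possible guard (the cell below relies on `drop_duplicates` instead):

```python
import pandas as pd

# Hypothetical panel (many rows per loan) and origination table (one row per loan).
panel = pd.DataFrame({
    "loan_sequence_number": ["A", "A", "B"],
    "loan_age": [1, 2, 1],
})
orig = pd.DataFrame({
    "loan_sequence_number": ["A", "B"],
    "credit_score": [720, 680],
})

# validate="many_to_one" raises pandas.errors.MergeError if `orig` has duplicate keys,
# so the left merge cannot silently multiply loan-month rows.
merged = panel.merge(orig, on="loan_sequence_number", how="left", validate="many_to_one")
assert len(merged) == len(panel)
print(merged)
```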

In [None]:
# Load and merge origination data for static covariates
print("=== Merging Origination Data ===")

# Load origination data for all vintages
orig_dfs = []
for vintage in VINTAGES:
    orig_df = load_origination_data(vintage)
    if not orig_df.empty:
        # Filter to sampled loans
        orig_df = orig_df[orig_df['loan_sequence_number'].isin(sampled_loan_ids)]
        if not orig_df.empty:
            orig_df['vintage_year'] = vintage
            orig_dfs.append(orig_df)

orig_all = pd.concat(orig_dfs, ignore_index=True)
print(f"Loaded origination data for {len(orig_all):,} loans")

# Select columns to merge
orig_cols = [
    'loan_sequence_number',
    'credit_score',      # fico_score
    'orig_ltv',          # ltv_r
    'orig_dti',          # dti_r
    'orig_upb',          # orig_upb
    'orig_interest_rate', # int_rate
    'first_payment_date',
    'property_state',
]

# Filter to available columns
orig_cols = [c for c in orig_cols if c in orig_all.columns]
orig_subset = orig_all[orig_cols].drop_duplicates(subset=['loan_sequence_number'])

print(f"Origination columns to merge: {orig_cols}")

# Merge with panel
n_before = len(panel_df)
panel_df = panel_df.merge(orig_subset, on='loan_sequence_number', how='left')
n_after = len(panel_df)

print(f"Merged: {n_before:,} → {n_after:,} rows")

# Rename columns to match Blumenstock variable names
rename_map = {
    'credit_score': 'fico_score',
    'orig_ltv': 'ltv_r',
    'orig_dti': 'dti_r',
    'orig_interest_rate': 'int_rate',
}
panel_df = panel_df.rename(columns=rename_map)

# Parse first payment date (used as the origination month for macro merging)
panel_df['first_payment_date'] = pd.to_datetime(
    panel_df['first_payment_date'].astype(str), format='%Y%m', errors='coerce'
)
panel_df['orig_year_month'] = panel_df['first_payment_date'].dt.to_period('M')

# Calculate bal_repaid (percent of balance repaid)
# NOTE: a missing current_actual_upb is treated as 0, i.e. as fully repaid
panel_df['bal_repaid'] = (
    (panel_df['orig_upb'] - panel_df['current_actual_upb'].fillna(0)) / 
    panel_df['orig_upb']
) * 100
panel_df['bal_repaid'] = panel_df['bal_repaid'].clip(0, 100)

print(f"\n✓ Static covariates merged")

## Merge Macroeconomic Data (Time-Varying)

Add time-varying macro variables:
- National: Treasury rates, mortgage rates, yield spreads
- State-level: HPI, unemployment
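
For the state-level series, the pattern is to reshape a wide table (one column per state) into long form, compute within-state changes, and join on month and state. The sketch below uses made-up values and a 1-month lag to stay short, where the notebook uses 12 months; `hpi_log1m` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one column per state, one row per month.
wide = pd.DataFrame({
    "year_month": pd.period_range("2020-01", periods=3, freq="M"),
    "CA_hpi": [100.0, 110.0, 121.0],
    "TX_hpi": [200.0, 200.0, 220.0],
})

# Reshape to one row per (month, state), then compute a within-state log change.
hpi_long = wide.melt(id_vars="year_month", var_name="state_col", value_name="state_hpi")
hpi_long["property_state"] = hpi_long["state_col"].str.replace("_hpi", "", regex=False)
hpi_long = hpi_long.sort_values(["property_state", "year_month"])
hpi_long["hpi_log1m"] = hpi_long.groupby("property_state")["state_hpi"].transform(
    lambda s: np.log(s / s.shift(1))
)

# Attach the state-month value to each loan-month observed in that state and month.
loans = pd.DataFrame({
    "loan_id": ["A", "B"],
    "property_state": ["CA", "TX"],
    "year_month": pd.PeriodIndex(["2020-02", "2020-03"], freq="M"),
})
out = loans.merge(
    hpi_long[["year_month", "property_state", "hpi_log1m"]],
    on=["year_month", "property_state"],
    how="left",
)
print(out)
```

Sorting by state and month before the grouped `shift` matters: the lag is positional, so unsorted rows would pair the wrong months.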

In [None]:
# Load macroeconomic data
print("=== Loading Macro Data ===")

# National macro data
macro_path = EXTERNAL_DATA_DIR / 'fred_monthly_panel.parquet'
if macro_path.exists():
    macro_national = pd.read_parquet(macro_path)
    macro_national.index.name = 'date'
    macro_national = macro_national.reset_index()
    macro_national['date'] = pd.to_datetime(macro_national['date'])
    macro_national['year_month'] = macro_national['date'].dt.to_period('M')
    
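    # Calculate derived variables (rows must be in date order for pct_change)
    macro_national = macro_national.sort_values('date').reset_index(drop=True)
    if 'DGS10' in macro_national.columns:
        macro_national['TB10Y_r12m'] = macro_national['DGS10'].pct_change(12)
    if 'DGS10' in macro_national.columns and 'DGS3MO' in macro_national.columns:
        macro_national['T10Y3MM'] = macro_national['DGS10'] - macro_national['DGS3MO']
        # The spread can be zero or negative, so its relative change can be infinite
        # or sign-flipped; map infinities to NaN so they do not reach the model.
        macro_national['T10Y3MM_r12m'] = (
            macro_national['T10Y3MM'].pct_change(12).replace([np.inf, -np.inf], np.nan)
        )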
    
    print(f"✓ National macro: {len(macro_national)} months")
else:
    print(f"✗ National macro not found: {macro_path}")
    macro_national = None

# State unemployment
unemp_path = EXTERNAL_DATA_DIR / 'state_unemployment.parquet'
if unemp_path.exists():
    state_unemp = pd.read_parquet(unemp_path)
    state_unemp.index.name = 'date'
    state_unemp = state_unemp.reset_index()
    state_unemp['date'] = pd.to_datetime(state_unemp['date'])
    state_unemp['year_month'] = state_unemp['date'].dt.to_period('M')
    
    # Melt to long format
    state_cols = [c for c in state_unemp.columns if '_unemployment' in c]
    state_unemp_long = state_unemp.melt(
        id_vars=['date', 'year_month'],
        value_vars=state_cols,
        var_name='state_col',
        value_name='state_unemployment'
    )
    state_unemp_long['property_state'] = state_unemp_long['state_col'].str.replace('_unemployment', '')
    
    # Calculate returns
    state_unemp_long = state_unemp_long.sort_values(['property_state', 'year_month'])
    state_unemp_long['st_unemp_r12m'] = state_unemp_long.groupby('property_state')['state_unemployment'].transform(
        lambda x: np.log(x / x.shift(12))
    )
    state_unemp_long['st_unemp_r3m'] = state_unemp_long.groupby('property_state')['state_unemployment'].transform(
        lambda x: np.log(x / x.shift(3))
    )
    
    state_unemp_long = state_unemp_long[['year_month', 'property_state', 'state_unemployment', 
                                          'st_unemp_r12m', 'st_unemp_r3m']].drop_duplicates()
    print(f"✓ State unemployment: {state_unemp_long['property_state'].nunique()} states")
else:
    print(f"✗ State unemployment not found: {unemp_path}")
    state_unemp_long = pd.DataFrame()

# State HPI
hpi_path = EXTERNAL_DATA_DIR / 'state_hpi.parquet'
if hpi_path.exists():
    state_hpi = pd.read_parquet(hpi_path)
    state_hpi.index.name = 'date'
    state_hpi = state_hpi.reset_index()
    state_hpi['date'] = pd.to_datetime(state_hpi['date'])
    state_hpi['year_month'] = state_hpi['date'].dt.to_period('M')
    
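    # State columns are named '<state>_hpi' with a state code of at most two characters (len <= 6)
    hpi_cols = [c for c in state_hpi.columns if c.endswith('_hpi') and len(c) <= 6]
    # National proxy: unweighted mean of the state indices
    state_hpi['national_hpi'] = state_hpi[hpi_cols].mean(axis=1)

    # Melt to long format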
    state_hpi_long = state_hpi.melt(
        id_vars=['date', 'year_month', 'national_hpi'],
        value_vars=hpi_cols,
        var_name='state_col',
        value_name='state_hpi'
    )
    state_hpi_long['property_state'] = state_hpi_long['state_col'].str.replace('_hpi', '')
    
    # Calculate returns
    state_hpi_long = state_hpi_long.sort_values(['property_state', 'year_month'])
    state_hpi_long['hpi_st_log12m'] = state_hpi_long.groupby('property_state')['state_hpi'].transform(
        lambda x: np.log(x / x.shift(12))
    )
    state_hpi_long['hpi_r_st_us'] = state_hpi_long['state_hpi'] / state_hpi_long['national_hpi']
    
    state_hpi_long = state_hpi_long[['year_month', 'property_state', 'state_hpi', 'national_hpi',
                                      'hpi_st_log12m', 'hpi_r_st_us']].drop_duplicates()
    print(f"✓ State HPI: {state_hpi_long['property_state'].nunique()} states")
else:
    print(f"✗ State HPI not found: {hpi_path}")
    state_hpi_long = pd.DataFrame()

In [None]:
# Merge macro data with panel (by observation year_month)
print("=== Merging Macro Data at Observation Time ===")
n_before = len(panel_df)

# Merge national macro
if macro_national is not None:
    macro_cols = ['year_month', 'MORTGAGE30US', 'DGS10', 'TB10Y_r12m', 'T10Y3MM', 'T10Y3MM_r12m']
    macro_cols = [c for c in macro_cols if c in macro_national.columns]
    panel_df = panel_df.merge(macro_national[macro_cols], on='year_month', how='left')
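    # Guard the coverage report: MORTGAGE30US may have been filtered out of macro_cols above
    if 'MORTGAGE30US' in panel_df.columns:
        print(f"  ✓ National macro merged: MORTGAGE30US coverage = {panel_df['MORTGAGE30US'].notna().mean():.1%}")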

# Merge state unemployment
if not state_unemp_long.empty:
    panel_df = panel_df.merge(
        state_unemp_long[['year_month', 'property_state', 'st_unemp_r12m', 'st_unemp_r3m']],
        on=['year_month', 'property_state'],
        how='left'
    )
    print(f"  ✓ State unemployment merged: coverage = {panel_df['st_unemp_r12m'].notna().mean():.1%}")

# Merge state HPI
if not state_hpi_long.empty:
    panel_df = panel_df.merge(
        state_hpi_long[['year_month', 'property_state', 'state_hpi', 'national_hpi', 'hpi_st_log12m', 'hpi_r_st_us']],
        on=['year_month', 'property_state'],
        how='left'
    )
    print(f"  ✓ State HPI merged: coverage = {panel_df['state_hpi'].notna().mean():.1%}")

n_after = len(panel_df)
print(f"\nRows: {n_before:,} → {n_after:,}")

In [None]:
# Calculate origination-relative differences
print("=== Calculating Origination-Time Differences ===")

# Get origination-time macro values
if macro_national is not None:
    orig_macro = macro_national[['year_month', 'MORTGAGE30US', 'DGS10']].rename(
        columns={'year_month': 'orig_year_month', 
                 'MORTGAGE30US': 'orig_MORTGAGE30US', 
                 'DGS10': 'orig_DGS10'}
    )
    panel_df = panel_df.merge(orig_macro, on='orig_year_month', how='left')
    
    # ppi_c_FRMA: Current prepayment incentive (int_rate - current_mortgage_rate)
    panel_df['ppi_c_FRMA'] = panel_df['int_rate'] - panel_df['MORTGAGE30US']
    
    # ppi_o_FRMA: Prepayment incentive at origination
    panel_df['ppi_o_FRMA'] = panel_df['int_rate'] - panel_df['orig_MORTGAGE30US']
    
    # TB10Y_d_t_o: Treasury rate difference (today - origination)
    panel_df['TB10Y_d_t_o'] = panel_df['DGS10'] - panel_df['orig_DGS10']
    
    # FRMA30Y_d_t_o: Mortgage rate difference (today - origination)
    panel_df['FRMA30Y_d_t_o'] = panel_df['MORTGAGE30US'] - panel_df['orig_MORTGAGE30US']
    
    print(f"  ✓ ppi_c_FRMA coverage: {panel_df['ppi_c_FRMA'].notna().mean():.1%}")
    print(f"  ✓ TB10Y_d_t_o coverage: {panel_df['TB10Y_d_t_o'].notna().mean():.1%}")

# Get origination-time HPI
if not state_hpi_long.empty:
    orig_hpi = state_hpi_long[['year_month', 'property_state', 'state_hpi']].rename(
        columns={'year_month': 'orig_year_month', 'state_hpi': 'orig_state_hpi'}
    )
    panel_df = panel_df.merge(orig_hpi, on=['orig_year_month', 'property_state'], how='left')
    
    # hpi_st_d_t_o: HPI difference (today - origination)
    panel_df['hpi_st_d_t_o'] = panel_df['state_hpi'] - panel_df['orig_state_hpi']
    
    print(f"  ✓ hpi_st_d_t_o coverage: {panel_df['hpi_st_d_t_o'].notna().mean():.1%}")

print("\n✓ Origination-relative differences calculated")

## Select Final Columns and Verify Panel

In [None]:
# Define final columns for the panel
FINAL_COLUMNS = [
    # Identifiers
    'loan_sequence_number', 'fold', 'vintage_year', 'property_state',
    
    # Time indices
    'loan_age', 'start', 'stop', 'year_month',
    
    # Event indicators
    'event', 'event_code',
    
    # Static covariates
    'int_rate', 'orig_upb', 'fico_score', 'dti_r', 'ltv_r',
    
    # Time-varying behavioral
    'bal_repaid', 't_act_12m', 't_del_30d_12m', 't_del_60d_12m',
    
    # Time-varying macro
    'hpi_st_d_t_o', 'ppi_c_FRMA', 'TB10Y_d_t_o', 'FRMA30Y_d_t_o',
    'ppi_o_FRMA', 'hpi_st_log12m', 'hpi_r_st_us',
    'st_unemp_r12m', 'st_unemp_r3m', 
    'TB10Y_r12m', 'T10Y3MM', 'T10Y3MM_r12m',
]

# Filter to available columns
available_cols = [c for c in FINAL_COLUMNS if c in panel_df.columns]
missing_cols = [c for c in FINAL_COLUMNS if c not in panel_df.columns]

print("=== Column Selection ===")
print(f"Requested: {len(FINAL_COLUMNS)} columns")
print(f"Available: {len(available_cols)} columns")
if missing_cols:
    print(f"Missing: {missing_cols}")

# Select final columns
panel_final = panel_df[available_cols].copy()
print(f"\nFinal panel shape: {panel_final.shape}")

In [None]:
# Verify panel integrity
print("=== Panel Verification ===")

# Check loan counts per fold
print("\nLoans per fold:")
loans_per_fold = panel_final.groupby('fold')['loan_sequence_number'].nunique()
for fold, n_loans in loans_per_fold.items():
    print(f"  Fold {fold}: {n_loans:,} loans")

# Check events per fold
print("\nEvents per fold:")
terminal_events = panel_final[panel_final['event'] == 1].groupby(['fold', 'event_code']).size().unstack(fill_value=0)
print(terminal_events)

# Check variable coverage
print("\n=== Variable Coverage ===")
LOAN_VARS = ['int_rate', 'orig_upb', 'fico_score', 'dti_r', 'ltv_r',
             'bal_repaid', 't_act_12m', 't_del_30d_12m', 't_del_60d_12m']
MACRO_VARS = ['hpi_st_d_t_o', 'ppi_c_FRMA', 'TB10Y_d_t_o', 'FRMA30Y_d_t_o',
              'ppi_o_FRMA', 'hpi_st_log12m', 'hpi_r_st_us', 
              'st_unemp_r12m', 'st_unemp_r3m', 'TB10Y_r12m', 'T10Y3MM', 'T10Y3MM_r12m']

print("\nLoan-level variables:")
for var in LOAN_VARS:
    if var in panel_final.columns:
        coverage = panel_final[var].notna().mean()
        print(f"  {var}: {coverage:.1%}")

print("\nMacro variables:")
for var in MACRO_VARS:
    if var in panel_final.columns:
        coverage = panel_final[var].notna().mean()
        print(f"  {var}: {coverage:.1%}")

In [None]:
# Save the loan-month panel
print("=== Saving Loan-Month Panel ===")

output_path = PROCESSED_DATA_DIR / 'loan_month_panel.parquet'
panel_final.to_parquet(output_path, index=False)

print(f"✓ Saved to: {output_path}")
print(f"  Shape: {panel_final.shape}")
print(f"  File size: {output_path.stat().st_size / 1e6:.1f} MB")

---

## Phase 2 Summary

### Completed
- ✅ Processed 16 vintages (2010-2025)
- ✅ Filtered to sampled loans (early filtering for memory efficiency)
- ✅ Calculated behavioral variables (rolling 12-month counts)
- ✅ Created interval format (start, stop) for Cox regression
- ✅ Merged static covariates from origination data
- ✅ Merged time-varying macro data (national + state-level)
- ✅ Calculated origination-relative differences
- ✅ Saved loan-month panel

### Output
| File | Description |
|------|-------------|
| `loan_month_panel.parquet` | ~5.5M loan-month records with time-varying covariates |

### Data Structure
Each row represents one loan-month with:
- **Interval format**: `(start, stop)` for Cox regression
- **Event indicator**: 1 if event occurs at `stop`, 0 otherwise
- **Static covariates**: Fixed at origination
- **Time-varying covariates**: Updated each month
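
One way to check this structure, sketched on a made-up loan history: rows after the first event are dropped by counting the events that occurred strictly before each row. The column names match the panel, but the data are illustrative.

```python
import pandas as pd

# Hypothetical loan: defaults at month 2, keeps reporting, prepays at month 4.
raw = pd.DataFrame({
    "loan_sequence_number": ["A"] * 4,
    "start": [0, 1, 2, 3],
    "stop":  [1, 2, 3, 4],
    "event": [0, 1, 0, 1],
})

# Events strictly before the current row = cumulative events minus this row's own event.
prior_events = raw.groupby("loan_sequence_number")["event"].cumsum() - raw["event"]
panel = raw[prior_events == 0].copy()

# Structural checks: one-month intervals, at most one event per loan,
# and any event sits on the loan's last row.
assert (panel["stop"] - panel["start"] == 1).all()
assert panel.groupby("loan_sequence_number")["event"].sum().le(1).all()
last_rows = panel.groupby("loan_sequence_number").tail(1)
assert panel["event"].sum() == last_rows["event"].sum()
print(panel)
```

A plain `cumsum <= 1` filter would not be enough here: the event-free month 3 also has a cumulative count of 1 and would stay in the risk set.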

### Next Steps
Use `loan_month_panel.parquet` in notebook 05 for cause-specific Cox regression with time-varying covariates.
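
In a cause-specific setup, each cause gets its own event indicator, and intervals that end in the competing event are treated as censored for that cause. A minimal sketch, assuming the `event_code` coding used above (0 = censored, 1 = prepay, 2 = default); the indicator column names are hypothetical, and model fitting is left to notebook 05.

```python
import pandas as pd

# Hypothetical terminal rows for three loans.
panel = pd.DataFrame({
    "loan_sequence_number": ["A", "B", "C"],
    "start": [11, 23, 35],
    "stop": [12, 24, 36],
    "event_code": [0, 1, 2],
})

# One indicator per cause; the competing event counts as censoring for that cause.
panel["event_prepay"] = (panel["event_code"] == 1).astype(int)
panel["event_default"] = (panel["event_code"] == 2).astype(int)
print(panel)
```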

In [None]:
print("=" * 60)
print("PHASE 2 COMPLETE: Loan-Month Panel Created")
print("=" * 60)
print(f"\nPanel statistics:")
print(f"  Total loan-months: {len(panel_final):,}")
print(f"  Total loans: {panel_final['loan_sequence_number'].nunique():,}")
print(f"  Folds: {panel_final['fold'].nunique()}")
print(f"  Vintages: {panel_final['vintage_year'].min()} - {panel_final['vintage_year'].max()}")
print(f"\nEvents:")
print(f"  Prepayments: {(panel_final[panel_final['event']==1]['event_code']==1).sum():,}")
print(f"  Defaults: {(panel_final[panel_final['event']==1]['event_code']==2).sum():,}")
print(f"\nOutput file: {output_path}")
print(f"\nReady for Phase 3 (notebook 05): Cause-Specific Cox with time-varying covariates")