Based on the Origination Data File and Monthly Performance Data File schemas:

Target Variable (to be derived):

- Current Loan Delinquency Status: A code for how far behind the borrower is on loan payments as of the end of the monthly reporting period. Used to derive the target (e.g., delinquent if the code is greater than 0). Guide notes: 0 = Current, 1 = 30-59 days, 2 = 60-89 days, ..., RA = Repayment Plan, RF = REO, 999 = Unknown. Because the field mixes numeric and letter codes, it is read as text and must be mapped explicitly before any numeric comparison.
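The mapping from status code to binary target can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the function name `derive_delinquency_flag` is hypothetical, and treating the letter codes RA and RF as delinquent and 999 as missing is an assumption based on the guide notes listed here.

```python
import pandas as pd

def derive_delinquency_flag(status, delinquent_codes=("RA", "RF"), unknown_code=999):
    """Map raw delinquency status codes to a nullable 0/1 flag.

    Assumptions (not confirmed by the source data):
    - numeric codes > 0 mean delinquent, 0 means current
    - letter codes in `delinquent_codes` count as delinquent
    - `unknown_code`, blanks, and unrecognized codes stay missing
    """
    # Normalize everything to stripped text; missing values become "".
    codes = status.astype(object).where(status.notna(), "").astype(str).str.strip()
    # Non-numeric codes (e.g. "RA") and blanks become NaN here.
    numeric = pd.to_numeric(codes, errors="coerce")

    flag = pd.Series(pd.NA, index=status.index, dtype="Int64")
    flag[numeric == 0] = 0
    flag[(numeric > 0) & (numeric != unknown_code)] = 1
    flag[codes.isin(list(delinquent_codes))] = 1
    return flag
```

Rows whose flag is missing can then be dropped or reviewed before fitting, which keeps the decision about unknown statuses explicit.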


Predictor Variables (Features):

From Origination Data:

- Credit Score: The standardized credit score used to evaluate the borrower during the loan origination process. Lower scores indicate higher risk. Guide notes: FICO score, masked as 300 for <300, 850 for >850, or 9999 for missing.
- Original Loan-to-Value (LTV): The ratio of the original loan amount to the property value at origination. Higher ratios increase default risk. Guide notes: Rounded to nearest integer, 999 for missing.
- Original Combined Loan-to-Value (CLTV): The ratio of the original loan amount and any subordinate lien amount to the property value at origination. Higher ratios increase default risk. Guide notes: Rounded to nearest integer, 999 for missing.
- Original Debt-to-Income (DTI) Ratio: The ratio of the borrower's total monthly debt payments to gross monthly income at origination. Higher DTI suggests financial strain. Guide notes: Rounded to nearest integer, 999 for missing or not considered.
- Original Interest Rate: The interest rate on the loan as stated on the note at the time the loan was originated. Higher rates may lead to higher payments and defaults. Guide notes: Reported to the nearest eighth of a percent.
- Original Loan Term: The number of months in which the loan is scheduled to be repaid. Longer terms may reduce monthly payments but increase long-term risk. Guide notes: In months, e.g., 360 for 30-year loans.
- Number of Borrowers: The number of borrowers who are obligated to repay the mortgage note. Multiple borrowers may reduce risk. Guide notes: 99 for missing.
- Property State: The two-letter postal abbreviation for the state in which the property is located. Captures regional economic factors. Guide notes: U.S. states only.
- Occupancy Status: The classification for the property occupancy status at the time the loan was originated. Investment properties have higher risk. Guide notes: O = Owner Occupied, S = Second Home, I = Investment Property, 9 = Unknown.
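Several of the origination fields above use sentinel codes for missing values (9999, 999, 99). Left in place, those codes would be treated as real magnitudes by a regression. A minimal sketch of masking them, assuming the sentinel values listed in the guide notes above; the `SENTINELS` mapping and the `mask_sentinels` helper are hypothetical names introduced for illustration:

```python
import pandas as pd

# Sentinel codes taken from the guide notes above (assumed to apply as listed).
SENTINELS = {
    "Credit Score": 9999,
    "Original Loan-to-Value (LTV)": 999,
    "Original Combined Loan-to-Value (CLTV)": 999,
    "Original Debt-to-Income (DTI) Ratio": 999,
    "Number of Borrowers": 99,
}

def mask_sentinels(df, sentinels=SENTINELS):
    """Return a copy of df with sentinel codes replaced by missing values."""
    out = df.copy()
    for col, code in sentinels.items():
        if col not in out.columns:
            continue  # tolerate files that lack a column
        values = out[col].astype("Int64")
        is_sentinel = values.eq(code).fillna(False).astype(bool)
        out[col] = values.mask(is_sentinel)  # sentinel rows become <NA>
    return out
```

Masking before imputation or scaling keeps a code such as 9999 from distorting the credit score distribution.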


From Performance Data:

- Loan Age: The number of scheduled monthly payments that have elapsed since the loan was originated. Helps capture loan seasoning. Guide notes: In months, 999 for missing.
- Remaining Months to Legal Maturity: The number of months remaining until the loan is scheduled to mature. Shorter terms may indicate higher risk near maturity. Guide notes: In months, 999 for missing.
- Current Actual UPB: The unpaid principal balance of the loan as of the end of the monthly reporting period. Higher UPB may correlate with defaults. Guide notes: Rounded to nearest $1,000, 000000 for zero balance.
- Current Interest Rate: The interest rate on the loan as of the end of the monthly reporting period. Adjustments can affect affordability. Guide notes: Reported to the nearest eighth of a percent, 99.999 for missing.


Rationale for Selection: 

These variables cover borrower creditworthiness, loan affordability, property details, and ongoing performance, which are key drivers of default risk. The target is derived from 'Current Loan Delinquency Status' as a binary flag (1 for delinquent, 0 for current).


Key Identifiers and Supporting Fields:

- Loan Sequence Number: A unique identifier for each loan, critical for merging and tracking across origination and performance data. Guide notes: 12-character alphanumeric, masked for privacy.
- Original Loan-to-Value (LTV): Already listed among the predictors above; retained here because it provides additional context to Original Combined Loan-to-Value (CLTV). Guide notes: Rounded to nearest integer, 999 for missing.
- First Payment Date: The date of the first scheduled payment, offering a temporal anchor for loan age and performance. Guide notes: Format YYYYMMDD, parsed as datetime64[ns].
- Maturity Date: Extracted and parsed as a date alongside First Payment Date in the code below.
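To illustrate how Loan Sequence Number ties the two files together, here is a minimal sketch of a keyed merge. It assumes one origination row per loan and many performance rows per loan; the function name and the toy frames in the usage note are hypothetical, not the notebook's actual merge step.

```python
import pandas as pd

KEY = "Loan Sequence Number"

def merge_origination_performance(orig, perf):
    """Attach origination attributes to each monthly performance row.

    validate="many_to_one" raises pandas.errors.MergeError if the
    origination frame contains duplicate loan identifiers, and
    indicator=True adds a "_merge" column that flags performance rows
    with no matching origination record ("left_only").
    """
    return perf.merge(orig, on=KEY, how="left",
                      validate="many_to_one", indicator=True)
```

For example, merging a performance frame with rows for loans A1, A1, and C3 against an origination frame holding A1 and B2 yields three rows, with the C3 row marked `left_only`.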

In [42]:
import pandas as pd
import os
from pathlib import Path

In [43]:
# Set up file paths
mbs_risk_path = Path('/Users/dr/Documents/GitHub/MBS_RiskManagement')
data_path = mbs_risk_path / 'data' / 'extracted_data' / 'splits'

In [44]:
# Input files
split_files = [
    data_path / 'merged_loans_part_1.csv',
    data_path / 'merged_loans_part_2.csv', 
    data_path / 'merged_loans_part_3.csv'
]

In [45]:
# Output file
output_file = mbs_risk_path / 'regression_data.csv'

In [46]:
# Selected columns for logistic regression (from origination and performance data)
selected_columns = [
    'Loan Sequence Number', 'Credit Score', 'Original Combined Loan-to-Value (CLTV)', 
    'Original Loan-to-Value (LTV)', 'Original Debt-to-Income (DTI) Ratio', 'Original Interest Rate', 
    'Original Loan Term', 'Number of Borrowers', 'Property State', 'Occupancy Status', 
    'Loan Age', 'Remaining Months to Legal Maturity', 'Current Actual UPB', 
    'Current Interest Rate', 'Current Loan Delinquency Status', 'First Payment Date', 'Maturity Date'
]

In [47]:
# Define data types for proper parsing
dtypes = {
    'Loan Sequence Number': 'object',
    'Credit Score': 'Int64',
    'Original Combined Loan-to-Value (CLTV)': 'Int64',
    'Original Loan-to-Value (LTV)': 'Int64', 
    'Original Debt-to-Income (DTI) Ratio': 'Int64',
    'Original Interest Rate': 'float64',
    'Original Loan Term': 'Int64',
    'Number of Borrowers': 'Int64',
    'Property State': 'object',
    'Occupancy Status': 'object',
    'Loan Age': 'Int64',
    'Remaining Months to Legal Maturity': 'Int64',
    'Current Actual UPB': 'float64',
    'Current Interest Rate': 'float64',
    'Current Loan Delinquency Status': 'object'
}

In [48]:
# Date columns to parse
date_columns = ['First Payment Date', 'Maturity Date']
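If the automatic date parsing proves unreliable, the two date columns can be converted explicitly after loading. A minimal sketch, assuming the YYYYMMDD layout stated in the notes for First Payment Date; the helper name `parse_yyyymmdd` is hypothetical:

```python
import pandas as pd

def parse_yyyymmdd(values):
    """Parse YYYYMMDD values (stored as int or str) into datetime64[ns].

    Values that do not match the format become NaT instead of raising.
    """
    return pd.to_datetime(values.astype(str), format="%Y%m%d", errors="coerce")
```

Counting the NaT values in the result gives a quick check on how many dates failed to match the assumed format.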

In [49]:
print("Starting to extract and consolidate logistic regression data from split files...")

Starting to extract and consolidate logistic regression data from split files...


In [50]:
# Read all split files, extracting only the selected columns
all_data_frames = []

for i, file_path in enumerate(split_files, 1):
    print(f"Processing file {i}/{len(split_files)}: {file_path.name}")
    
    try:
        # Check if file exists
        if not file_path.exists():
            print(f"  ✗ File not found: {file_path}")
            continue
            
        # Read CSV with specified dtypes and date parsing
        # (infer_datetime_format is omitted: it is deprecated as of pandas 2.0)
        df = pd.read_csv(
            file_path,
            usecols=lambda x: x in selected_columns,  # Only read needed columns
            dtype=dtypes,
            parse_dates=date_columns,
            low_memory=False
        )
        
        print(f"  ✓ Successfully read {len(df):,} rows with {len(df.columns)} selected columns")
        
        # Check for missing columns
        missing_cols = set(selected_columns) - set(df.columns)
        if missing_cols:
            print(f"  ⚠ Missing columns in this file: {missing_cols}")
        
        all_data_frames.append(df)
        
    except Exception as e:
        print(f"  ✗ Error reading {file_path.name}: {e}")
        # Try to read without usecols to see what columns are available
        try:
            test_df = pd.read_csv(file_path, nrows=1)
            available_cols = set(test_df.columns)
            selected_cols_set = set(selected_columns)
            print(f"  Available columns: {len(available_cols)}")
            print(f"  Missing selected columns: {selected_cols_set - available_cols}")
        except Exception as e2:
            print(f"  Could not read file to check columns: {e2}")

Processing file 1/3: merged_loans_part_1.csv
  ✗ File not found: /Users/dr/Documents/GitHub/MBS_RiskManagement/data/extracted_data/splits/merged_loans_part_1.csv
Processing file 2/3: merged_loans_part_2.csv
  ✗ File not found: /Users/dr/Documents/GitHub/MBS_RiskManagement/data/extracted_data/splits/merged_loans_part_2.csv
Processing file 3/3: merged_loans_part_3.csv
  ✗ File not found: /Users/dr/Documents/GitHub/MBS_RiskManagement/data/extracted_data/splits/merged_loans_part_3.csv
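
All three split files were reported missing in the output above, so the loop finishes without appending anything to all_data_frames. One way to surface this earlier is to check the inputs before reading. The sketch below is a hedged illustration; `find_missing_inputs` and `require_inputs` are hypothetical helpers, not part of the notebook.

```python
from pathlib import Path

def find_missing_inputs(paths):
    """Return the subset of paths that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]

def require_inputs(paths):
    """Raise FileNotFoundError listing every missing input file."""
    missing = find_missing_inputs(paths)
    if missing:
        listing = "\n".join(f"  {p}" for p in missing)
        raise FileNotFoundError(f"{len(missing)} input file(s) not found:\n{listing}")
```

Calling `require_inputs(split_files)` ahead of the read loop would stop the run with one message naming every missing file, rather than continuing with an empty list.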


In [51]:
# Combine all dataframes
if all_data_frames:
    print("\nCombining all dataframes...")
    consolidated_df = pd.concat(all_data_frames, ignore_index=True)
    print(f"✓ Consolidated dataset shape: {consolidated_df.shape}")
    
    # Display basic info about the dataset
    print("\nDataset overview:")
    print(f"Total rows: {len(consolidated_df):,}")
    print(f"Total columns: {len(consolidated_df.columns)}")
    
    # Check for missing columns in final dataset
    final_cols = set(consolidated_df.columns)
    missing_final_cols = set(selected_columns) - final_cols
    if missing_final_cols:
        print(f"⚠ Columns not found in any file: {missing_final_cols}")
    
    # Check for missing values
    print("\nMissing values per column:")
    missing_values = consolidated_df.isnull().sum()
    for col in selected_columns:
        if col in consolidated_df.columns:
            missing_count = missing_values[col]
            if missing_count > 0:
                print(f"  {col}: {missing_count:,} missing values ({missing_count/len(consolidated_df)*100:.2f}%)")