
### Target Variable (to be derived):

- *Current Loan Delinquency Status*: The number of days the borrower is delinquent in making loan payments as of the end of the monthly reporting period. Used to derive the target (e.g., delinquent if >0). Guide notes: 0 = Current, 1 = 30-59 days, 2 = 60-89 days, ..., RA = Repayment Plan, RF = REO, 999 = Unknown.


### Predictor Variables (Features):

#### From Origination Data:

- *Credit Score*: The standardized credit score used to evaluate the borrower during the loan origination process. Lower scores indicate higher risk. Guide notes: FICO score, masked as 300 for <300, 850 for >850, or 9999 for missing.
- *Original Combined Loan-to-Value (CLTV)*: The ratio of the original loan amount and any subordinate lien amount to the property value at origination. Higher ratios increase default risk. Guide notes: Rounded to nearest integer, 999 for missing.
- *Original Debt-to-Income (DTI) Ratio*: The ratio of the borrower's total monthly debt payments to gross monthly income at origination. Higher DTI suggests financial strain. Guide notes: Rounded to nearest integer, 999 for missing or not considered.
- *Original Interest Rate*: The interest rate on the loan as stated on the note at the time the loan was originated. Higher rates may lead to higher payments and defaults. Guide notes: Reported to the nearest eighth of a percent.
- *Original Loan Term*: The number of months in which the loan is scheduled to be repaid. Longer terms may reduce monthly payments but increase long-term risk. Guide notes: In months, e.g., 360 for 30-year loans.
- *Number of Borrowers*: The number of borrowers who are obligated to repay the mortgage note. Multiple borrowers may reduce risk. Guide notes: 99 for missing.
- *Property State*: The two-letter postal abbreviation for the state in which the property is located. Captures regional economic factors. Guide notes: U.S. states only.
- *Occupancy Status*: The classification for the property occupancy status at the time the loan was originated. Investment properties have higher risk. Guide notes: O = Owner Occupied, S = Second Home, I = Investment Property, 9 = Unknown.


#### From Performance Data:

- *Loan Age*: The number of scheduled monthly payments that have elapsed since the loan was originated. Helps capture loan seasoning. Guide notes: In months, 999 for missing.
- *Remaining Months to Legal Maturity*: The number of months remaining until the loan is scheduled to mature. Shorter terms may indicate higher risk near maturity. Guide notes: In months, 999 for missing.
- *Current Actual UPB*: The unpaid principal balance of the loan as of the end of the monthly reporting period. Higher UPB may correlate with defaults. Guide notes: Rounded to nearest $1,000, 000000 for zero balance.
- *Current Interest Rate*: The interest rate on the loan as of the end of the monthly reporting period. Adjustments can affect affordability. Guide notes: Reported to the nearest eighth of a percent, 99.999 for missing.


Rationale for Selection: 

These variables cover borrower creditworthiness, loan affordability, property details, and ongoing performance, which are key drivers of default risk. The target is derived from 'Current Loan Delinquency Status' as a binary flag (1 for delinquent, 0 for current).


Key Identifiers:

- *Loan Sequence Number*: A unique identifier for each loan, critical for merging and tracking across origination and performance data. Guide notes: 12-character alphanumeric, masked for privacy.
- *Original Loan-to-Value (LTV)*: The ratio of the original loan amount to the property value at origination, providing additional context to Original Combined Loan-to-Value (CLTV). Guide notes: Rounded to nearest integer, 999 for missing.
- *First Payment Date*: The date of the first scheduled payment, offering a temporal anchor for loan age and performance. Guide notes: Format YYYYMMDD, parsed as datetime64[ns].

In [106]:
import pandas as pd
import os
import numpy as np

In [107]:
# Directory containing the data files
data_dir = '/Users/dr/Documents/GitHub/MBS_RiskManagement/'
extracted_data = os.path.join(data_dir, 'extracted_data')

In [108]:
# Ensure extracted_data directory exists
os.makedirs(extracted_data, exist_ok=True)

In [109]:
# List of merged loan data files (adjust filenames as needed)
merged_files = [
    os.path.join(extracted_data, 'splits/merged_loans_part_1.csv'),
    os.path.join(extracted_data, 'splits/merged_loans_part_2.csv'),
    os.path.join(extracted_data, 'splits/merged_loans_part_3.csv')
]

In [110]:
# Selected column names for logistic regression
selected_columns = [
    'Credit Score',
    'Original Loan-to-Value (LTV)',
    'Original Combined Loan-to-Value (CLTV)',
    'Original Debt-to-Income (DTI) Ratio',
    'Original Interest Rate',
    'Original UPB',
    'Current Actual UPB',
    'Loan Age',
    'Remaining Months to Legal Maturity',
    'Estimated Loan-to-Value (ELTV)',
    'Current Loan Delinquency Status',
    'Number of Borrowers',
    'Property State',
    'Current Deferred UPB',
    'Current Interest Rate',
    'Occupancy Status',
    'Original Loan Term',
    'First Payment Date',
]

In [111]:
# Read and combine data
dataframes = []
for file in merged_files:
    if os.path.exists(file):
        # Read the first column (Loan Sequence Number) separately
        loan_seq_df = pd.read_csv(file, usecols=[0], header=None, low_memory=False)
        loan_seq_df.columns = ['Loan Sequence Number']
        
        # Read the selected columns with header
        selected_df = pd.read_csv(file, usecols=selected_columns, low_memory=False)
        
        # Combine the two DataFrames by index
        df = pd.concat([loan_seq_df, selected_df], axis=1)
        dataframes.append(df)
    else:
        print(f"File not found: {file}")

In [112]:
# Concatenate all dataframes
combined_df = pd.concat(dataframes, ignore_index=True)

In [113]:
# Derive binary target 'Default' from 'Current Loan Delinquency Status'
if 'Current Loan Delinquency Status' in combined_df.columns:
    combined_df['Current Loan Delinquency Status'] = pd.to_numeric(combined_df['Current Loan Delinquency Status'], errors='coerce')
    combined_df['Default'] = np.where(combined_df['Current Loan Delinquency Status'].fillna(0) > 0, 1, 0)
    print("Binary target 'Default' added.")
else:
    print("Target column 'Current Loan Delinquency Status' not found in the data.")

Binary target 'Default' added.


In [114]:
# Save the combined dataset
output_file = os.path.join(extracted_data, 'regression_data.csv')
combined_df.to_csv(output_file, index=False)
print(f"Regression data saved as {output_file}")

Regression data saved as /Users/dr/Documents/GitHub/MBS_RiskManagement/extracted_data/regression_data.csv


In [115]:
combined_df.head()

Unnamed: 0,Loan Sequence Number,Credit Score,First Payment Date,Occupancy Status,Original Combined Loan-to-Value (CLTV),Original Debt-to-Income (DTI) Ratio,Original UPB,Original Loan-to-Value (LTV),Original Interest Rate,Property State,Original Loan Term,Number of Borrowers,Current Actual UPB,Current Loan Delinquency Status,Loan Age,Remaining Months to Legal Maturity,Current Interest Rate,Current Deferred UPB,Estimated Loan-to-Value (ELTV),Default
0,,629.0,2014-05-01,P,77.0,45.0,324000.0,71.0,3.875,KY,180.0,2.0,0.0,0.0,74.0,106.0,3.875,0.0,50.0,0
1,F14Q10000001,770.0,2014-04-01,P,89.0,30.0,65000.0,89.0,3.375,NY,180.0,2.0,0.0,0.0,40.0,140.0,3.375,0.0,999.0,0
2,F14Q10000002,674.0,2014-03-01,P,89.0,999.0,182000.0,76.0,3.375,MI,180.0,1.0,0.0,0.0,75.0,105.0,3.375,0.0,999.0,0
3,F14Q10000003,717.0,2014-04-01,I,77.0,41.0,107000.0,77.0,5.25,RI,360.0,2.0,84852.01,0.0,132.0,228.0,5.25,0.0,21.0,0
4,F14Q10000004,813.0,2014-05-01,P,95.0,32.0,165000.0,95.0,4.125,IA,360.0,1.0,0.0,3.0,47.0,313.0,4.125,0.0,999.0,1


In [116]:
combined_df['Default'].value_counts()

Default
0    20112740
1      210382
Name: count, dtype: int64