Here is your task

The risk manager has collected data on the loan borrowers. The data is in tabular format, with each row giving details of a borrower, including their income, total loans outstanding, and a few other metrics. There is also a column indicating whether the borrower has previously defaulted on a loan. You must use this data to build a model that, given the details of any such loan, predicts the probability that the borrower will default (also known as PD, the probability of default). Assuming a recovery rate of 10%, this probability can then be used to give the expected loss on a loan.

You should produce a function that can take in the properties of a loan and output the expected loss.
You can explore any technique ranging from a simple regression or a decision tree to something more advanced. You can also use multiple methods and provide a comparative analysis.
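
The link between PD, recovery rate, and expected loss can be made concrete with a small worked example. This is a minimal sketch, assuming the loss scales with the exposure at default (taken here to be the outstanding loan amount, which the task text does not state explicitly). The function name and the figures are illustrative, not part of the task.

```python
def expected_loss(pd_prob: float, exposure: float, recovery_rate: float = 0.10) -> float:
    """Expected loss = PD * (1 - recovery rate) * exposure at default."""
    if not 0.0 <= pd_prob <= 1.0:
        raise ValueError("pd_prob must lie between 0 and 1")
    if not 0.0 <= recovery_rate <= 1.0:
        raise ValueError("recovery_rate must lie between 0 and 1")
    # (1 - recovery_rate) is the fraction of the exposure lost on default
    return pd_prob * (1.0 - recovery_rate) * exposure


# Illustrative numbers only: a 20% PD on a 10,000 loan with 10% recovery
loss = expected_loss(pd_prob=0.20, exposure=10_000)
print(f"Expected loss: {loss:.2f}")  # prints "Expected loss: 1800.00"
```

With a 10% recovery rate, 90% of the exposure is lost on default, so the expected loss is 0.9 times the PD times the exposure.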

In [17]:
import pandas as pd

df = pd.read_csv(r'C:\Users\darsh\Documents\theforage\JPMorgan Quantitative Research\Task_3_and_4_Loan_Data.csv')
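
Before modelling, it may help to confirm that the target column exists, is binary, and to see how imbalanced it is. The sketch below uses a tiny made-up frame rather than the real CSV; the helper `summarize_target` and the column names `income` and `default` are assumptions for illustration.

```python
import pandas as pd


def summarize_target(frame: pd.DataFrame, target: str = "default") -> dict:
    """Return basic sanity checks on a binary target column."""
    if target not in frame.columns:
        raise KeyError(f"Column {target!r} not found; available: {list(frame.columns)}")
    observed = set(frame[target].dropna().unique())
    if not observed <= {0, 1}:
        raise ValueError(f"Expected a 0/1 target, found values: {sorted(observed)}")
    return {
        "rows": len(frame),
        "missing_target": int(frame[target].isna().sum()),
        "default_rate": float(frame[target].mean()),
    }


# Made-up rows, only to show the shape of the check
toy = pd.DataFrame({
    "income": [52_000, 38_000, 21_000, 75_000],
    "default": [0, 0, 1, 0],
})
summary = summarize_target(toy)
print(summary)
```

On the real data the same call would be `summarize_target(df)`, assuming the target column is indeed named `default`.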

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score

# df was loaded from the CSV in the previous cell.

# Name of the target column; change this if your file uses a different name
TARGET = 'default'
RECOVERY_RATE = 0.1  # 10% recovery

# Step 1: Preprocessing
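def preprocess_and_split(df):
    """Split df into train/test sets and build an unfitted preprocessor."""
    # Note: identifier columns (e.g. a customer ID), if present, are kept as
    # features here; consider dropping them before calling this function.
    X = df.drop(columns=[TARGET])
    y = df[TARGET]

    # Separate numerical and categorical features
    num_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
    cat_features = X.select_dtypes(include=['object', 'category']).columns.tolist()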

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Preprocessing pipelines
    num_pipeline = Pipeline([
        ('scaler', StandardScaler())
    ])

    cat_pipeline = Pipeline([
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ])

    preprocessor = ColumnTransformer([
        ('num', num_pipeline, num_features),
        ('cat', cat_pipeline, cat_features)
    ])

    return X_train, X_test, y_train, y_test, preprocessor

# Step 2: Train Models
from sklearn.base import clone

def train_models(X_train, y_train, preprocessor):
    """Fit one preprocessing + classifier pipeline per candidate model."""
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
    }

    fitted_models = {}
    for name, model in models.items():
        pipeline = Pipeline([
            # Clone so each pipeline fits its own copy of the preprocessor
            ('preprocessor', clone(preprocessor)),
            ('classifier', model)
        ])
        pipeline.fit(X_train, y_train)
        fitted_models[name] = pipeline

    return fitted_models

# Step 3: Evaluate Models
def evaluate_models(models, X_test, y_test):
    for name, model in models.items():
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
        acc = accuracy_score(y_test, y_pred)
        print(f"{name} - AUC: {auc:.3f}, Accuracy: {acc:.3f}")

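# Step 4: Define Prediction Function for Expected Loss
def make_expected_loss_function(model, exposure_key=None):
    """Build a function mapping a loan's properties to its PD and expected loss.

    exposure_key is optional (an addition, not part of the task text): it names
    the input field holding the exposure at default, e.g. the outstanding loan
    amount. Without it, the loss is reported per unit of exposure.
    """
    def expected_loss(input_dict):
        input_df = pd.DataFrame([input_dict])
        pd_prob = float(model.predict_proba(input_df)[0, 1])
        # Loss given default is 1 - recovery rate
        loss = pd_prob * (1 - RECOVERY_RATE)
        if exposure_key is not None:
            loss *= input_dict[exposure_key]
        return {
            'Probability of Default': pd_prob,
            'Expected Loss': loss
        }
    return expected_loss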

# Sample usage
X_train, X_test, y_train, y_test, preprocessor = preprocess_and_split(df)
models = train_models(X_train, y_train, preprocessor)
evaluate_models(models, X_test, y_test)

# Choose best model
best_model = models['Gradient Boosting']  # for example
expected_loss_fn = make_expected_loss_function(best_model)

Logistic Regression - AUC: 1.000, Accuracy: 0.996
Random Forest - AUC: 1.000, Accuracy: 0.996
Gradient Boosting - AUC: 1.000, Accuracy: 0.996
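
An AUC of 1.000 for all three models on a single split is unusually high, so it may be worth cross-checking with stratified cross-validation and looking for a feature that leaks the target. Because the PD is used as a probability rather than a ranking, a calibration measure such as the Brier score is also informative. The sketch below runs on synthetic stand-in data from `make_classification`, not on the loan data, so its scores say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the loan data): roughly 20% positives
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# AUC on each of the five held-out folds
auc_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="roc_auc")

# Out-of-fold predicted probabilities, used to judge calibration
oof_prob = cross_val_predict(clf, X_demo, y_demo, cv=cv, method="predict_proba")[:, 1]
brier = brier_score_loss(y_demo, oof_prob)

print(f"AUC per fold: {auc_scores.round(3)}")
print(f"Mean AUC: {auc_scores.mean():.3f}, Brier score: {brier:.3f}")
```

On the real data, the same calls could be applied to any of the fitted pipelines in place of `clf`, with `X` and `y` taken from `df`. A lower Brier score indicates probabilities closer to the observed outcomes.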

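
To show how a fitted model turns a loan's properties into an expected loss, here is a self-contained usage sketch. Everything in it is an assumption for illustration: the training rows are randomly generated, the column names `income` and `total_debt_outstanding` are hypothetical, the helper `expected_loss_for` is not part of the notebook, and the exposure is taken to be the outstanding debt.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

RECOVERY_RATE = 0.10
FEATURES = ["income", "total_debt_outstanding"]  # hypothetical column names

# Randomly generated training rows (NOT the loan data)
rng = np.random.default_rng(0)
n = 400
train = pd.DataFrame({
    "income": rng.uniform(20_000, 100_000, n),
    "total_debt_outstanding": rng.uniform(0, 30_000, n),
})
# Synthetic rule: a higher debt-to-income ratio makes default more likely
ratio = train["total_debt_outstanding"] / train["income"]
defaulted = ((ratio + rng.normal(0, 0.05, n)) > 0.3).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(train[FEATURES], defaulted)


def expected_loss_for(loan: dict, exposure_key: str = "total_debt_outstanding") -> float:
    """Expected loss = PD * (1 - recovery rate) * exposure for one loan."""
    row = pd.DataFrame([loan])[FEATURES]
    pd_prob = model.predict_proba(row)[0, 1]
    return float(pd_prob * (1 - RECOVERY_RATE) * loan[exposure_key])


loan = {"income": 45_000, "total_debt_outstanding": 20_000}
print(f"Expected loss: {expected_loss_for(loan):.2f}")
```

The notebook's own `expected_loss_fn` is called the same way, with a dictionary whose keys match the feature columns of the training data.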