# Loan Default Prediction: Deep Learning vs Offline Reinforcement Learning

## Project Overview

**Business Context:** As research scientists at a fintech company, we aim to improve the loan approval process to maximize financial returns while managing default risk.

**Objective:** Build and compare two approaches:
1. **Supervised Deep Learning Model**: Predicts the probability of loan default
2. **Offline Reinforcement Learning Agent**: Learns a policy to maximize expected financial return

**Dataset:** Historical loan data from 2007-2018 including applicant details and loan outcomes

---

## Project Structure

### Task 1: Exploratory Data Analysis & Preprocessing
- Data exploration and feature understanding
- Feature engineering and selection
- Data cleaning and preprocessing

### Task 2: Deep Learning Classification Model
- Binary classification: Fully Paid (0) vs Defaulted (1)
- Multi-Layer Perceptron with PyTorch
- Evaluation: AUC-ROC, F1-Score
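
As a minimal sketch of the binary target described above, the raw `loan_status` strings could be mapped to 0/1 as follows. Which statuses count as "defaulted" is an assumption made here for illustration (`Charged Off` and `Default`); the helper name `encode_target` is hypothetical, and unresolved statuses such as `Current` are left missing so they can be filtered out.

```python
import pandas as pd

# Assumption: the grouping of raw statuses below is a modeling choice,
# not something the project outline specifies.
PAID_STATUSES = {"Fully Paid"}
DEFAULT_STATUSES = {"Charged Off", "Default"}


def encode_target(status: pd.Series) -> pd.Series:
    """Map loan_status strings to 0 (paid) or 1 (defaulted); others become <NA>."""
    mapping = {s: 0 for s in PAID_STATUSES}
    mapping.update({s: 1 for s in DEFAULT_STATUSES})
    # Statuses absent from the mapping (e.g. "Current") map to NaN, then <NA>.
    return status.map(mapping).astype("Int64")


statuses = pd.Series(["Fully Paid", "Charged Off", "Current", "Default"])
labels = encode_target(statuses)
print(labels)
```

Rows whose label is missing have no final outcome yet, so one option is to drop them before training.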

### Task 3: Offline Reinforcement Learning Agent
- Frame loan approval as an offline RL problem
- Define state, action, and reward
- Train using a modern offline RL algorithm
- Evaluation: Estimated Policy Value
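
To make the state, action, and reward idea concrete, here is one possible one-step formulation. Everything in it is an assumption for illustration rather than the project's actual definition: the action is deny (0) or approve (1), a denied loan earns 0, an approved loan that is repaid earns the loan amount times the interest rate, and an approved loan that defaults loses the full principal. The function name `loan_reward` is hypothetical.

```python
import numpy as np


def loan_reward(action, loan_amnt, int_rate, defaulted):
    """Reward per loan under a hypothetical one-step formulation.

    action:    0 = deny, 1 = approve
    loan_amnt: principal
    int_rate:  interest rate in percent (e.g. 12.5)
    defaulted: 1 if the loan defaulted, 0 if it was fully paid
    """
    action = np.asarray(action)
    loan_amnt = np.asarray(loan_amnt, dtype=float)
    int_rate = np.asarray(int_rate, dtype=float)
    defaulted = np.asarray(defaulted)

    profit_if_paid = loan_amnt * int_rate / 100.0  # simplified interest income
    loss_if_default = -loan_amnt                   # assume the full principal is lost
    approve_reward = np.where(defaulted == 1, loss_if_default, profit_if_paid)
    # Denying a loan carries no gain and no loss in this sketch.
    return np.where(action == 1, approve_reward, 0.0)


rewards = loan_reward(
    action=[1, 1, 0],
    loan_amnt=[10000, 10000, 5000],
    int_rate=[10.0, 10.0, 20.0],
    defaulted=[0, 1, 1],
)
print(rewards)
```

The asymmetry between a small gain on repayment and a large loss on default is what can make a reward-maximizing policy differ from a pure default classifier.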

### Task 4: Analysis & Comparison
- Compare model predictions and policies
- Analyze business implications
- Propose future improvements

---

## 1. Import Required Libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning - Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Machine Learning - Metrics
from sklearn.metrics import (
    roc_auc_score, f1_score, precision_score, recall_score,
    confusion_matrix, classification_report, roc_curve, auc
)

# Deep Learning - PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Plotting configuration
# Note: the 'seaborn-v0_8-*' style names require matplotlib >= 3.6
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 10

print("✓ All libraries imported successfully!")
print(f"PyTorch Version: {torch.__version__}")
print(f"Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

## 2. Load and Explore the Dataset

The dataset is large (~1.6 GB), so we first preview a few thousand rows to inspect the structure, then load only the columns we need.
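
If memory is still tight, another option is to read the file in chunks and downcast numeric columns as they arrive. The sketch below illustrates the idea on a tiny in-memory CSV with made-up rows; the helper `load_in_chunks` and the sample data are hypothetical, and only `usecols` and `chunksize` of `pd.read_csv` are relied on.

```python
import io

import pandas as pd

# Tiny stand-in for the real file (hypothetical rows, for illustration only).
CSV_TEXT = (
    "loan_amnt,int_rate,loan_status,unused_col\n"
    "10000,13.5,Fully Paid,a\n"
    "5000,7.9,Charged Off,b\n"
    "20000,18.2,Fully Paid,c\n"
    "7500,10.1,Current,d\n"
    "12000,15.0,Fully Paid,e\n"
)


def load_in_chunks(source, columns, chunksize):
    """Read only `columns`, `chunksize` rows at a time, downcasting floats."""
    parts = []
    for chunk in pd.read_csv(source, usecols=columns, chunksize=chunksize):
        # float32 halves the memory of float64 columns.
        float_cols = chunk.select_dtypes(include="float64").columns
        parts.append(chunk.astype({col: "float32" for col in float_cols}))
    return pd.concat(parts, ignore_index=True)


df_small = load_in_chunks(
    io.StringIO(CSV_TEXT),
    columns=["loan_amnt", "int_rate", "loan_status"],
    chunksize=2,
)
print(df_small.dtypes)
```

With the real file, `source` would be the CSV path and `chunksize` something on the order of hundreds of thousands of rows.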

In [None]:
# Define file path
DATA_PATH = Path("shodhAI_dataset/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv")

# First, let's load a preview to understand the structure
print("Loading dataset preview...")
df_preview = pd.read_csv(DATA_PATH, nrows=5000)

print(f"Dataset preview shape: {df_preview.shape}")
print(f"Total columns: {df_preview.shape[1]}")
print("\nFirst 20 column names:")
print(df_preview.columns.tolist()[:20])

In [None]:
# Check the target variable distribution
print("Loan Status Distribution (preview):")
print(df_preview['loan_status'].value_counts())
print("\nPercentage:")
# Multiply before rounding to avoid floating-point noise in the output
print((df_preview['loan_status'].value_counts(normalize=True) * 100).round(2))

### Feature Selection Strategy

Based on domain knowledge and expected predictive power for loan default, we select the following features:

**Loan Characteristics:**
- `loan_amnt`: Loan amount
- `term`: Loan term (36/60 months)
- `int_rate`: Interest rate
- `installment`: Monthly payment
- `grade`, `sub_grade`: Loan risk grade

**Borrower Financial Profile:**
- `annual_inc`: Annual income
- `dti`: Debt-to-income ratio
- `revol_bal`, `revol_util`: Revolving credit balance and utilization rate
- `total_acc`, `open_acc`: Total and currently open credit accounts

**Credit History:**
- `delinq_2yrs`: Delinquencies in the past two years
- `pub_rec`, `pub_rec_bankruptcies`: Derogatory public records and bankruptcies
- `inq_last_6mths`: Credit inquiries in the last six months
- `fico_range_low`, `fico_range_high`: Lower and upper bounds of the FICO score range

**Other:**
- `emp_length`: Employment length
- `home_ownership`: Home ownership status
- `verification_status`: Income verification
- `purpose`: Loan purpose
- `loan_status`: **TARGET VARIABLE**
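
Some of these features may arrive as text rather than numbers. Assuming `term` is stored as strings like ` 36 months` and `emp_length` as strings like `10+ years` or `< 1 year` (an assumption about the raw format, not verified here), a minimal parsing sketch could look like this. The helper names `parse_term` and `parse_emp_length` are hypothetical.

```python
import pandas as pd


def parse_term(term: pd.Series) -> pd.Series:
    """Extract the number of months from strings such as ' 36 months'."""
    return term.astype(str).str.extract(r"(\d+)", expand=False).astype(float)


def parse_emp_length(emp: pd.Series) -> pd.Series:
    """Map '< 1 year' to 0, '10+ years' to 10, 'n years' to n; missing stays NaN."""
    years = emp.str.extract(r"(\d+)", expand=False).astype(float)
    # '< 1 year' contains the digit 1 but means less than one year, so use 0.
    return years.where(~emp.str.startswith("<", na=False), 0.0)


term_months = parse_term(pd.Series([" 36 months", " 60 months"]))
emp_years = parse_emp_length(pd.Series(["10+ years", "< 1 year", "3 years", None]))
print(term_months.tolist())
print(emp_years.tolist())
```

Treating `10+ years` as exactly 10 caps the feature, which is a simplification worth keeping in mind when interpreting it.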

In [None]:
# Define selected features
SELECTED_FEATURES = [
    # Loan characteristics
    'loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade',
    
    # Borrower financial profile
    'annual_inc', 'dti', 'revol_bal', 'revol_util', 'total_acc', 'open_acc',
    
    # Credit history
    'delinq_2yrs', 'pub_rec', 'pub_rec_bankruptcies', 'inq_last_6mths',
    'fico_range_low', 'fico_range_high',
    
    # Other features
    'emp_length', 'home_ownership', 'verification_status', 'purpose',
    
    # Target
    'loan_status'
]

# Check which features are available in the dataset
available_features = [col for col in SELECTED_FEATURES if col in df_preview.columns]
missing_features = [col for col in SELECTED_FEATURES if col not in df_preview.columns]

print(f"Available features: {len(available_features)}/{len(SELECTED_FEATURES)}")
if missing_features:
    print(f"Missing features: {missing_features}")
    
print(f"\nLoading full dataset with {len(available_features)} selected features...")
print("This may take 1-2 minutes...")

In [None]:
# Load the full dataset with selected features
df = pd.read_csv(DATA_PATH, usecols=available_features, low_memory=False)

print("✓ Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / (1024**2):.2f} MB")

# Display basic info
print("\nDataset Info:")
df.info()

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Target variable distribution
print("=" * 80)
print("LOAN STATUS DISTRIBUTION")
print("=" * 80)

status_counts = df['loan_status'].value_counts()
status_pct = df['loan_status'].value_counts(normalize=True) * 100

print("\nAbsolute counts:")
print(status_counts)
print("\nPercentage:")
print(status_pct.round(2))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Count plot
status_counts.plot(kind='bar', ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Loan Status Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Loan Status')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

# Pie chart
axes[1].pie(status_pct, labels=status_pct.index, autopct='%1.1f%%', 
            startangle=90, colors=sns.color_palette('pastel'))
axes[1].set_title('Loan Status Percentage', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Statistical summary of numerical features
print("=" * 80)
print("NUMERICAL FEATURES - STATISTICAL SUMMARY")
print("=" * 80)
print(df.describe())

In [None]:
# Missing values analysis
missing_pct = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({
    'Column': missing_pct.index,
    'Missing_Percentage': missing_pct.values
}).sort_values('Missing_Percentage', ascending=False)

print("Top 10 columns with missing values:")
print(missing_df.head(10))
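
One way to act on these percentages is to drop columns that are mostly empty and impute the rest. The sketch below shows this on a toy frame; the 50% threshold, the median imputation, and the helper name `split_by_missingness` are all assumptions chosen for illustration, not the notebook's actual cleaning rules.

```python
import numpy as np
import pandas as pd


def split_by_missingness(frame, drop_threshold_pct=50.0):
    """Return (columns to drop, columns to impute) based on missing percentage."""
    missing_pct = frame.isnull().mean() * 100
    to_drop = missing_pct[missing_pct > drop_threshold_pct].index.tolist()
    to_impute = missing_pct[
        (missing_pct > 0) & (missing_pct <= drop_threshold_pct)
    ].index.tolist()
    return to_drop, to_impute


# Toy data: 'dti' is 25% missing, 'mostly_empty' is 75% missing.
toy = pd.DataFrame({
    "dti": [10.0, np.nan, 30.0, 20.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "loan_amnt": [1000, 2000, 3000, 4000],
})

to_drop, to_impute = split_by_missingness(toy)
cleaned = toy.drop(columns=to_drop)
# Median imputation for the remaining numeric columns with gaps.
cleaned[to_impute] = cleaned[to_impute].fillna(cleaned[to_impute].median())
print(cleaned)
```

In a train/test workflow, the medians would be computed on the training split only to avoid leakage.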

## 4. Data Cleaning and Preprocessing