# Loan Approval Decision Support Analysis

## Objective

The goal of this project is to evaluate classification models to support loan approval decisions while balancing predictive performance, interpretability, and financial risk. Rather than automating approvals, this analysis focuses on understanding model tradeoffs and how predictions could inform real-world decision-making.


## Context
Loan approval decisions involve tradeoffs between approving qualified applicants and minimizing default risk. In practice, predictive models are often used as decision-support tools to flag applications for approval or manual review rather than as fully automated systems.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

## Data Overview
The dataset contains historical loan application records with applicant demographic, financial, and loan-related features. The target variable indicates whether a loan was approved or rejected.

Prior to modeling, the data was reviewed for missing values, categorical variables were encoded, and features were prepared for use in classification models.

The target variable is moderately imbalanced, which is important to consider when interpreting model performance.

In [110]:
df = pd.read_csv('data/Loan-Approval-Prediction.csv')
df.head()

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [None]:
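print(df.shape)  # (rows, columns); a bare df.shape is not displayed unless it is the last line
df.info()        # dtypes and non-null counts per column
df.describe()    # summary statistics for the numeric columns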

The dataset contains 480 records with a mix of numeric and categorical features relevant to loan approval decisions.

In [None]:
df['Loan_Status'].value_counts().plot(kind='bar', color=['skyblue', 'salmon'])
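
The bar chart shows the imbalance visually; the class shares can also be printed as numbers. The sketch below is illustrative only: the `class_balance` helper and the toy series are assumptions standing in for `df['Loan_Status']`, not part of the original notebook.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class in a target column."""
    counts = target.value_counts()
    shares = target.value_counts(normalize=True).round(3)
    return pd.DataFrame({"count": counts, "share": shares})


# Toy stand-in for df['Loan_Status']; the values are made up for illustration.
toy_status = pd.Series(["Y", "Y", "Y", "N", "Y", "N", "Y"], name="Loan_Status")
balance = class_balance(toy_status)
print(balance)
```

On the real data, the same call would be `class_balance(df['Loan_Status'])`.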

## Data Preparation
To prepare the data for modeling, categorical variables were encoded as integers, rows with missing values were dropped, and records with an applicant income above 30,000 were excluded. The dataset was then split into input features (X) and the target variable (y), with the `Loan_ID` identifier removed, so that models only used information available at the time of the loan decision.


In [None]:
# Encode binary and ordinal categories as integers
df['Loan_Status'] = df['Loan_Status'].map({'N': 0, 'Y': 1}).astype(int)
df['Gender'] = df['Gender'].map({'Female': 0, 'Male': 1})
df['Married'] = df['Married'].map({'No': 0, 'Yes': 1})
df['Self_Employed'] = df['Self_Employed'].map({'No': 0, 'Yes': 1})
df['Education'] = df['Education'].map({'Not Graduate': 0, 'Graduate': 1}).astype(int)
df['Dependents'] = df['Dependents'].map({'0': 0, '1': 1, '2': 2, '3+': 3})
df['Property_Area'] = df['Property_Area'].map({'Rural': 0, 'Semiurban': 1, 'Urban': 2}).astype(int)

# Drop every row that still contains a missing value
df = df.dropna()

# Exclude records with very high applicant income
df = df[df['ApplicantIncome'] <= 30000]
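
Because `dropna()` removes every row with at least one missing value, it can be worth checking how many rows are affected before dropping them. The following is a minimal sketch on toy data; the `missing_report` helper and the toy values are assumptions for illustration and do not come from the original notebook.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    missing = frame.isna().sum()
    return missing[missing > 0].sort_values(ascending=False)


# Toy frame with a few gaps; the values are illustrative only.
toy = pd.DataFrame({
    "Gender": [1, np.nan, 0, 1],
    "LoanAmount": [np.nan, 128.0, 66.0, np.nan],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

report = missing_report(toy)
rows_before = len(toy)
rows_after = len(toy.dropna())

print(report)
print(f"dropna() keeps {rows_after} of {rows_before} rows")
```

If a large share of rows would be lost, imputation could be considered instead, although that is a different technique from the one used in this analysis.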

In [None]:
# Loan_ID is an identifier, not a predictor, so it is excluded from the features
X = df.drop(columns=["Loan_Status", "Loan_ID"])
y = df["Loan_Status"]

In [None]:
rs = 123  # fixed seed so the split and the models are reproducible

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=rs
)
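
Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in the training and test sets, which matters when the target is imbalanced. A small sketch on synthetic data shows one way to check this; the toy feature and the 70/30 class mix are assumptions, not values from the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 70 approvals (1) and 30 rejections (0), one random feature.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"feature": rng.normal(size=100)})
toy_y = pd.Series([1] * 70 + [0] * 30, name="Loan_Status")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=123
)

# With 0/1 labels, the mean is the share of the positive class.
train_share = y_tr.mean()
test_share = y_te.mean()
print(f"approval share: train={train_share:.2f}, test={test_share:.2f}")
```

On the real split, the same check would compare `y_train.mean()` with `y_test.mean()`.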

## Modeling and Evaluation
Multiple classification models were evaluated to assess their suitability for supporting loan approval decisions. Rather than focusing solely on overall accuracy, evaluation emphasized class-specific performance metrics and error tradeoffs relevant to financial risk.


### Logistic Regression
Logistic regression was used as a baseline model due to its interpretability and common use in credit decisioning.


In [None]:
lr = LogisticRegression(
    random_state=rs,
    penalty='l2',
    solver='lbfgs',
    max_iter=1000
)
lr.fit(X_train, y_train)
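
The notebook imports `MinMaxScaler` but fits the logistic regression on unscaled features. One possible variation, shown here only as a sketch on synthetic data, wraps the scaler and the model in a pipeline so that coefficients are on a comparable scale when read for interpretation. The synthetic features, the target rule, and the `scaled_lr` name are assumptions for illustration, not the method used in this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the loan features (values are made up).
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "ApplicantIncome": rng.uniform(1000, 10000, size=n),
    "Credit_History": rng.integers(0, 2, size=n).astype(float),
})
# Assumed rule for the toy target: approval follows Credit_History exactly.
toy_y = toy_X["Credit_History"].astype(int)

# The scaler is fitted inside the pipeline, so it only sees the data passed to fit().
scaled_lr = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(max_iter=1000, random_state=123),
)
scaled_lr.fit(toy_X, toy_y)

# Coefficients on the scaled features, ordered by absolute size.
coefs = pd.Series(
    scaled_lr.named_steps["logisticregression"].coef_[0],
    index=toy_X.columns,
).sort_values(key=lambda s: s.abs(), ascending=False)
print(coefs)
```

A positive coefficient pushes the prediction toward approval (class 1) and a negative one toward rejection.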

In [None]:
lr_preds = lr.predict(X_test)
print(classification_report(y_test, lr_preds))

In [None]:
confusion_matrix(y_test, lr_preds)
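
With the encoding used here (0 = rejected, 1 = approved), the four cells of the confusion matrix map directly onto the error types discussed in this analysis. The sketch below uses made-up labels to show the unpacking; only `confusion_matrix` itself comes from the notebook.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = approved, 0 = rejected.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 1, 0, 1]

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

false_approvals = int(fp)   # predicted approved, actually rejected
false_rejections = int(fn)  # predicted rejected, actually approved

print(f"false approvals:  {false_approvals}")
print(f"false rejections: {false_rejections}")
```

The same unpacking applies to `confusion_matrix(y_test, lr_preds)` and `confusion_matrix(y_test, rf_preds)`.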

**Interpretation**

The logistic regression model leaned toward approval: it produced fewer false rejections (applications predicted as rejected that were actually approved) but more false approvals (applications predicted as approved that were actually rejected). While this behavior may support higher approval rates, it also introduces increased financial risk.


### Random Forest Classifier

A Random Forest model was evaluated to compare performance against logistic regression. While less interpretable, tree-based models can capture non-linear relationships and interactions between features, which may improve predictive performance in complex decision settings.

In [None]:
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=rs
)

rf.fit(X_train, y_train)


In [None]:
rf_preds = rf.predict(X_test)

print(classification_report(y_test, rf_preds))

In [None]:
confusion_matrix(y_test, rf_preds)

**Interpretation**

The Random Forest model reduced false approvals compared to logistic regression, indicating improved risk control. However, this came at the cost of increased false rejections.
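
Although a Random Forest is harder to interpret than logistic regression, its impurity-based feature importances give a rough view of which inputs drive the predictions. The sketch below uses synthetic data in which the target depends only on `Credit_History`; that rule, the toy values, and the `forest` name are assumptions for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the loan features (values are made up).
rng = np.random.default_rng(0)
n = 300
toy_X = pd.DataFrame({
    "ApplicantIncome": rng.uniform(1000, 10000, size=n),
    "LoanAmount": rng.uniform(50, 300, size=n),
    "Credit_History": rng.integers(0, 2, size=n),
})
# Assumed rule for the toy target: approval follows Credit_History exactly.
toy_y = toy_X["Credit_History"]

forest = RandomForestClassifier(n_estimators=100, random_state=123)
forest.fit(toy_X, toy_y)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=toy_X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can overstate features with many distinct values, so they are best read as a first indication rather than a definitive ranking.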

### Model Comparison and Risk Tradeoffs

The logistic regression model favored approval rates and minimized false rejections, while the Random Forest model adopted a more conservative risk posture by reducing false approvals. Model selection should therefore be guided by organizational risk tolerance rather than overall accuracy alone.

In practice, logistic regression may suit scenarios that prioritize approval rates and customer experience, while the Random Forest model may better align with risk-averse lending strategies focused on minimizing financial loss.
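
The balance between false approvals and false rejections can also be shifted within a single model by changing the probability threshold used to predict approval, instead of relying on the default of 0.5. The sketch below illustrates this on synthetic data from `make_classification`; the dataset, the thresholds, and the `error_counts` helper are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, moderately imbalanced data: about 70% of labels are 1 (approved).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, weights=[0.3, 0.7], random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=123
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
approve_prob = model.predict_proba(X_te)[:, 1]  # probability of class 1


def error_counts(threshold: float) -> dict:
    """Count both error types when approving at or above the given threshold."""
    preds = (approve_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, preds, labels=[0, 1]).ravel()
    return {
        "threshold": threshold,
        "false_approvals": int(fp),
        "false_rejections": int(fn),
    }


results = [error_counts(t) for t in (0.3, 0.5, 0.7)]
for row in results:
    print(row)
```

Raising the threshold can only keep or reduce the number of false approvals, at the cost of keeping or increasing false rejections, so the threshold is one concrete way to express risk tolerance.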


## Practical Application

In a real-world setting, this analysis could support loan officers by flagging applications for additional review rather than automating approval decisions. Interpretable models may be preferred for regulatory transparency, while more complex models can enhance internal risk assessment.
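
One way to picture such a decision-support workflow is to route each application by its predicted approval probability, sending only the uncertain middle band to a loan officer. This is a hypothetical sketch: the `route_application` function and the 0.4 and 0.7 cutoffs are assumptions chosen for illustration, not values derived from this analysis.

```python
def route_application(approve_prob: float, low: float = 0.4, high: float = 0.7) -> str:
    """Map a predicted approval probability to a suggested next step."""
    if approve_prob >= high:
        return "likely approve"
    if approve_prob < low:
        return "likely reject"
    # Scores between the two cutoffs are too uncertain to act on directly.
    return "manual review"


# Hypothetical model scores for three applications.
example_probs = [0.9, 0.55, 0.2]
routes = [route_application(p) for p in example_probs]

for prob, route in zip(example_probs, routes):
    print(f"{prob:.2f} -> {route}")
```

In practice the cutoffs would be set from the organization's risk tolerance and review capacity, and the scores would come from a fitted model's `predict_proba` output.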