
# **Project Plan: Loan Approval Prediction Using Logistic Regression**

## **1. Introduction to the Problem**
Loan approval is a crucial process for financial institutions. Traditionally, lenders assess applications based on factors like **income, credit score, and employment status** to determine whether an applicant qualifies for a loan.

This project aims to build a **logistic regression model** to predict whether a loan application will be **approved or rejected** based on applicant features. By automating this process, banks can make data-driven lending decisions faster and more efficiently.
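For reference, the standard form of a logistic regression classifier can be written as below. The notation is introduced here for illustration and does not come from the dataset: $\mathbf{x}$ is the vector of applicant features, $\beta_0$ and $\boldsymbol{\beta}$ are the learned intercept and coefficients, and 0.5 is the conventional default threshold, not a value fixed by this plan.

```latex
P(y = 1 \mid \mathbf{x}) = \sigma(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x})
  = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x})}},
\qquad
\hat{y} =
\begin{cases}
1 \ (\text{approved}) & \text{if } P(y = 1 \mid \mathbf{x}) \ge 0.5 \\
0 \ (\text{rejected}) & \text{otherwise}
\end{cases}
```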

### **Dataset**
- Source: [Kaggle Loan Approval Dataset](https://www.kaggle.com/datasets/rohit265/loan-approval-dataset)
- Features include: **Credit Score, Income, Loan Amount, Employment Status, Loan Term**
- Target Variable: **Loan Approval (1 = Approved, 0 = Rejected)**

**Hypothesis:** Credit score and income will be the most important predictors of loan approval.

## **2. Related Work**
Logistic regression is widely used in financial risk assessment. Previous research indicates that machine learning models, including logistic regression, have achieved roughly **80-85% accuracy** in loan approval prediction, although the figure depends on the dataset and the features used.

Studies have compared **logistic regression, decision trees, and random forests**, showing that while tree-based models can provide slightly higher accuracy, logistic regression offers better interpretability.

By analyzing **feature importance** through the fitted model coefficients, we will check which factors most strongly influence loan approval decisions.
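One possible way to read feature importance from a logistic regression is to standardize the inputs and rank features by the absolute value of their coefficients. The sketch below is illustrative only: it uses synthetic data, and the column names (`credit_score`, `income`, `loan_term`) and the data-generating rule are assumptions, not the Kaggle schema or any result from this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names are hypothetical, not the real dataset's.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "credit_score": rng.normal(650, 80, n),
    "income": rng.normal(60_000, 15_000, n),
    "loan_term": rng.integers(12, 360, n).astype(float),
})

# Approval is simulated from credit score and income only; loan_term is pure noise.
z = 0.02 * (demo["credit_score"] - 650) + 0.00005 * (demo["income"] - 60_000)
approved = (rng.random(n) < 1 / (1 + np.exp(-z))).astype(int)

# Standardize so that coefficient magnitudes are comparable across features.
X_std = StandardScaler().fit_transform(demo)
model = LogisticRegression(max_iter=1000).fit(X_std, approved)

# Rank features by absolute coefficient; exp(coef) is the odds ratio per 1 SD increase.
importance = (
    pd.DataFrame({
        "feature": demo.columns,
        "coefficient": model.coef_[0],
        "odds_ratio": np.exp(model.coef_[0]),
    })
    .assign(abs_coef=lambda d: d["coefficient"].abs())
    .sort_values("abs_coef", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```

Ranking by coefficient magnitude is only meaningful when the features share a scale, which is why the inputs are standardized before fitting.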

## **3. Proposed Methodology**
The project will follow a **standard machine learning pipeline**:

1. **Data Preprocessing**
   - Handle missing values.
   - Encode categorical variables (e.g., Employment Status).
   - Scale numerical features (e.g., Income, Loan Amount).
   - Create derived features like **Debt-to-Income Ratio**.

2. **Model Training**
   - Train **logistic regression** using `scikit-learn`.
   - Split dataset into **80% training, 20% testing**.

3. **Evaluation Metrics**
   - **Accuracy, Precision, Recall, Confusion Matrix**.
   - **ROC Curve & AUC Score** to assess model performance.
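The derived-feature step above could look roughly like the following. This is a minimal sketch under stated assumptions: the helper `add_debt_to_income` and the column names `loan_amount` and `income` are hypothetical, and the ratio of loan amount to income is used as a simple proxy, since a true debt-to-income ratio would need the applicant's recurring debt payments.

```python
import pandas as pd

def add_debt_to_income(df, debt_col="loan_amount", income_col="income"):
    """Return a copy of df with a debt_to_income column (NaN where income <= 0)."""
    out = df.copy()
    # Mask non-positive incomes as NaN so the division never produces inf.
    income = out[income_col].where(out[income_col] > 0)
    out["debt_to_income"] = out[debt_col] / income
    return out

# Tiny made-up example, not real applicant data.
sample = pd.DataFrame({
    "loan_amount": [10_000, 25_000, 5_000],
    "income": [50_000, 0, 20_000],
})
result = add_debt_to_income(sample)
print(result)
```

Rows with missing ratios would then go through the same missing-value handling as the other numerical columns.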

### **Code**

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load dataset
file_path = 'loan_approval_data.csv'  # Ensure the file is in the working directory or provide the full path
df = pd.read_csv(file_path)

# Display the first few rows
print("Dataset Preview:")
display(df.head())

# Basic information about dataset
print("\nDataset Info:")
df.info()

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

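# Handle missing values: fill categorical columns with the mode, numerical columns with the median.
# Assign the result back instead of calling fillna(inplace=True) on df[col]; the in-place
# form acts on a temporary object under pandas Copy-on-Write and can leave df unchanged.
for col in df.columns:
    if df[col].dtype == 'object':  # Categorical columns
        df[col] = df[col].fillna(df[col].mode()[0])
    else:  # Numerical columns
        df[col] = df[col].fillna(df[col].median())

# Encode categorical variables using one-hot encoding.
# Note: a string-valued target column would also be encoded and renamed here,
# so check df.columns before selecting the target in later cells.
df = pd.get_dummies(df, drop_first=True)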

# Summary statistics
print("\nSummary Statistics:")
display(df.describe())

# Save the cleaned dataset
df.to_csv("loan_approval_data_cleaned.csv", index=False)
print("\nPreprocessed dataset saved as 'loan_approval_data_cleaned.csv'.")



## **4. Experiment Setup**
### **Dataset & Features**
- **Input Features:** Credit Score, Income, Loan Amount, Loan Term, Employment Status.
- **Target:** Loan Approval (Binary: Approved = 1, Rejected = 0).

### **Tools & Libraries**
- `pandas` and `numpy` for data processing.
- `scikit-learn` for model training and evaluation.
- `matplotlib` and `seaborn` for visualization.

### **Workflow**
1. **Exploratory Data Analysis (EDA)**: Check distributions and relationships.
2. **Feature Engineering**: Encode categorical features, standardize numeric features.
3. **Model Training**: Fit logistic regression model.
4. **Evaluation**: Compute accuracy, precision, recall, confusion matrix, and ROC curve.
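For the ROC curve in the evaluation step, the curve and its AUC are computed from predicted probabilities, not from hard 0/1 predictions. The sketch below illustrates this on synthetic data from `make_classification`. All variable names are placeholders and the numbers it prints say nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the loan dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class; ROC analysis sweeps a threshold over these scores.
proba = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)

print(f"AUC: {auc:.3f} over {len(thresholds)} thresholds")
```

The `fpr` and `tpr` arrays can be passed directly to `matplotlib` to draw the curve.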

In [None]:
# Import required libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define target variable and features
target_column = 'Loan_Status'  # Ensure this matches the actual column name in the dataset

# Verify if the target column exists
if target_column in df.columns:
    X = df.drop(columns=[target_column])  # Features
    y = df[target_column]  # Target variable
else:
    raise ValueError(f"Target column '{target_column}' not found in dataset.")

# Perform train-test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

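# Standardize numerical features; the scaler is fit on the training split only to avoid leakage
scaler = StandardScaler()
numerical_cols = X_train.select_dtypes(include=['float64', 'int64']).columns

# Work on explicit copies so the assignments below do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])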

print(f"Training data shape: {X_train.shape}, Testing data shape: {X_test.shape}")

## **5. Expected Results**
- The **logistic regression model** should achieve an **accuracy of roughly 80-85%**, in line with the range reported in related work.
- **Credit Score and Income** will likely be the most influential features.
- The **Confusion Matrix** will help analyze false approvals and false rejections.
- **ROC Curve & AUC Score** should confirm strong classification ability.

If the dataset is imbalanced, we may adjust the decision threshold or use class weighting to improve precision and recall.
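The two adjustments mentioned above can be sketched as follows. This is an illustration on synthetic imbalanced data, not a result from the loan dataset: the helper `predict_at`, the 0.3 threshold, and the 85/15 class split are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 85% class 0 and 15% class 1.
X_imb, y_imb = make_classification(
    n_samples=2000, n_features=6, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0, stratify=y_imb
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Option 1: reweight classes inversely to their frequency during training.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

def predict_at(model, X, threshold):
    """Label a sample positive when its predicted probability meets the threshold."""
    return (model.predict_proba(X)[:, 1] >= threshold).astype(int)

# Option 2: keep the plain model but lower the decision threshold.
results = {}
for t in (0.5, 0.3):
    pred = predict_at(plain, X_te, t)
    results[t] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(f"threshold={t}: precision={results[t][0]:.2f}, recall={results[t][1]:.2f}")

weighted_recall = recall_score(y_te, weighted.predict(X_te))
print(f"class_weight='balanced' at 0.5: recall={weighted_recall:.2f}")
```

Lowering the threshold can only add predicted positives, so recall cannot decrease, while precision usually falls. Which trade-off is acceptable depends on the relative cost of false approvals and false rejections.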


In [None]:
# Import required libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, classification_report
)
from sklearn.model_selection import GridSearchCV

# Define logistic regression model
log_reg = LogisticRegression(solver='liblinear', max_iter=1000, random_state=42)

# Perform hyperparameter tuning using GridSearchCV
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}  # Regularization strength
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best model from GridSearchCV
best_model = grid_search.best_estimator_

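# Make predictions on the test set
y_pred = best_model.predict(X_test)
# Predicted probability of the positive class, needed for a threshold-independent ROC AUC
y_proba = best_model.predict_proba(X_test)[:, 1]

# Compute evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# ROC AUC must be computed from scores, not hard 0/1 labels
roc_auc = roc_auc_score(y_test, y_proba)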

# Display model performance
print(f"Best Hyperparameter (C): {grid_search.best_params_['C']}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC AUC Score: {roc_auc:.2f}")

# Print classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

## **6. Conclusion**
This project aims to **demonstrate how logistic regression can be used to predict loan approvals**. The final deliverable will include:
- A **trained logistic regression model**.
- **Evaluation metrics and visualizations** (Confusion Matrix, ROC Curve).
- **Insights on the most important features** affecting loan approval.

Since this is an **individual project**, I will handle all tasks myself.