
# **Project Plan: Loan Approval Prediction Using Logistic Regression**

## **1. Introduction to the Problem**
Loan approval is a crucial process for financial institutions. Traditionally, lenders assess applications based on factors like **income, credit score, and employment status** to determine whether an applicant qualifies for a loan.

This project aims to build a **logistic regression model** to predict whether a loan application will be **approved or rejected** based on applicant features. By automating this process, banks can make data-driven lending decisions faster and more efficiently.

### **Dataset**
- Source: [Kaggle Loan Approval Dataset](https://www.kaggle.com/datasets/rohit265/loan-approval-dataset)
- Features include: **Credit Score, Income, Loan Amount, Employment Status, Loan Term**
- Target Variable: **Loan Approval (1 = Approved, 0 = Rejected)**

**Hypothesis:** Credit score and income will be the most important predictors of loan approval.

## **2. Related Work**
Logistic regression is widely used in financial risk assessment. Prior work on loan approval prediction reports accuracies of roughly **80-85%** for machine learning models, including logistic regression, although results vary with the dataset and the features used.

Studies have compared **logistic regression, decision trees, and random forests**, showing that while tree-based models can provide slightly higher accuracy, logistic regression offers better interpretability.

By analyzing **feature importance**, here measured by the fitted model's coefficients on standardized features, we will check which factors most strongly influence loan approval decisions.
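
As a rough illustration of how coefficient-based feature importance could be read off a logistic regression, the sketch below fits a model on synthetic data. The column names, the data-generating rule, and all numeric values are assumptions made up for this example. They are not taken from the Kaggle dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns and values, not the Kaggle dataset)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "credit_score": rng.normal(650, 80, n),
    "income": rng.normal(60_000, 15_000, n),
    "loan_amount": rng.normal(20_000, 5_000, n),
})

# Approval is generated to depend mostly on credit score, then on income
logit = 0.02 * (demo["credit_score"] - 650) + 0.00003 * (demo["income"] - 60_000)
approved = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(demo)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, approved)

# Rank features by the absolute value of their standardized coefficient
importance = (
    pd.Series(clf.coef_[0], index=demo.columns)
    .sort_values(key=np.abs, ascending=False)
)
print(importance)
```

Without standardization, coefficients are expressed per unit of each feature, so a feature measured in dollars would look far less important than one measured in score points.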

## **3. Proposed Methodology**
The project will follow a **standard machine learning pipeline**:

1. **Data Preprocessing**
   - Handle missing values.
   - Encode categorical variables (e.g., Employment Status).
   - Scale numerical features (e.g., Income, Loan Amount).
   - Create derived features like **Debt-to-Income Ratio**.

2. **Model Training**
   - Train **logistic regression** using `scikit-learn`.
   - Split dataset into **80% training, 20% testing**.

3. **Evaluation Metrics**
   - **Accuracy, Precision, Recall, Confusion Matrix**.
   - **ROC Curve & AUC Score** to assess model performance.
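
The preprocessing and training steps above could be combined into a single `scikit-learn` pipeline, roughly as sketched below. This is a minimal sketch on synthetic data: the column names are hypothetical, and `loan_to_income` is used as a stand-in for a debt-to-income ratio on the assumption that no separate debt column is available.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names and values are hypothetical
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "credit_score": rng.normal(650, 80, n),
    "income": rng.normal(60_000, 15_000, n),
    "loan_amount": rng.normal(20_000, 5_000, n),
    "employment_status": rng.choice(["employed", "self_employed", "unemployed"], n),
})
toy.loc[rng.choice(n, 20, replace=False), "income"] = np.nan  # inject missing values
toy["approved"] = (toy["credit_score"] + rng.normal(0, 40, n) > 640).astype(int)

# Derived feature: loan amount relative to income (proxy for debt-to-income)
toy["loan_to_income"] = toy["loan_amount"] / toy["income"]

numeric_cols = ["credit_score", "income", "loan_amount", "loan_to_income"]
categorical_cols = ["employment_status"]

# Impute then scale numeric columns; impute then one-hot encode categorical columns
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# 80/20 split, stratified so both parts keep the same approval rate
X_toy = toy.drop(columns=["approved"])
y_toy = toy["approved"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

pipeline.fit(X_tr, y_tr)
test_accuracy = pipeline.score(X_te, y_te)
print(f"Held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

Keeping the imputer and scaler inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into training.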

### **Code Placeholder** (to be implemented later)

In [None]:
# Load dataset (placeholder)
import pandas as pd
df = pd.read_csv('loan_approval_data.csv')  # Replace with actual file path
df.head()

## **4. Experiment Setup**
### **Dataset & Features**
- **Input Features:** Credit Score, Income, Loan Amount, Loan Term, Employment Status.
- **Target:** Loan Approval (Binary: Approved = 1, Rejected = 0).

### **Tools & Libraries**
- `pandas` and `numpy` for data processing.
- `scikit-learn` for model training and evaluation.
- `matplotlib` and `seaborn` for visualization.

### **Workflow**
1. **Exploratory Data Analysis (EDA)**: Check distributions and relationships.
2. **Feature Engineering**: Transform categorical features, normalize numeric features.
3. **Model Training**: Fit logistic regression model.
4. **Evaluation**: Compute accuracy, precision, recall, confusion matrix, and ROC curve.
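
For the EDA step, a few quick checks along the following lines may be enough to start with. The DataFrame here is synthetic and its column names are hypothetical. With the real data, the same calls would be applied to the loaded dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and values are hypothetical
rng = np.random.default_rng(1)
n = 300
eda_df = pd.DataFrame({
    "credit_score": rng.normal(650, 80, n),
    "income": rng.normal(60_000, 15_000, n),
    "employment_status": rng.choice(["employed", "unemployed"], n),
})
eda_df["approved"] = (eda_df["credit_score"] > 640).astype(int)

# Share of approved vs. rejected applications (reveals class imbalance)
class_balance = eda_df["approved"].value_counts(normalize=True)

# Number of missing values per column
missing_counts = eda_df.isna().sum()

# Correlation of each numeric feature with the target
numeric = eda_df.select_dtypes("number")
target_corr = numeric.corr()["approved"].drop("approved").sort_values(ascending=False)

# Approval rate within each employment category
approval_by_status = eda_df.groupby("employment_status")["approved"].mean()

print(class_balance, missing_counts, target_corr, approval_by_status, sep="\n\n")
```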

In [None]:
# Train-Test Split (placeholder)
from sklearn.model_selection import train_test_split
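# Assumes Loan_Status is the target column, encoded as 1 = approved, 0 = rejected
X = pd.get_dummies(df.drop(columns=['Loan_Status']), drop_first=True)  # one-hot encode categorical columns
y = df['Loan_Status']
# Stratify so the training and test sets keep the same approval rate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)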

## **5. Expected Results**
- The **logistic regression model** should achieve an **accuracy of 80-85%**, in line with the range reported in related work.
- **Credit Score and Income** will likely be the most influential features.
- The **Confusion Matrix** will help analyze false approvals and false rejections.
- **ROC Curve & AUC Score** should show an AUC well above 0.5, the value expected from random guessing.

If the dataset is imbalanced, we may adjust the decision threshold or use class weighting to improve precision and recall.
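
Both options for handling imbalance can be tried with a few lines of `scikit-learn`. The sketch below uses a synthetic dataset in which about 10% of samples belong to class 1. The 0.3 threshold is an arbitrary value chosen for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1
X_imb, y_imb = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_a, X_b, y_a, y_b = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=0
)

# Baseline model with the default 0.5 decision threshold
plain = LogisticRegression(max_iter=1000).fit(X_a, y_a)
recall_plain = recall_score(y_b, plain.predict(X_b))

# Option 1: class weighting, which penalizes minority-class errors more heavily
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_a, y_a)
recall_weighted = recall_score(y_b, weighted.predict(X_b))

# Option 2: lower the decision threshold of the baseline model
threshold = 0.3  # illustrative value only
proba = plain.predict_proba(X_b)[:, 1]
y_pred_low = (proba >= threshold).astype(int)
recall_low = recall_score(y_b, y_pred_low)
precision_low = precision_score(y_b, y_pred_low, zero_division=0)

print(f"Recall (plain):     {recall_plain:.2f}")
print(f"Recall (weighted):  {recall_weighted:.2f}")
print(f"Recall (thr=0.3):   {recall_low:.2f}, precision: {precision_low:.2f}")
```

Lowering the threshold can only add positive predictions, so recall cannot decrease, but precision usually drops. In practice the threshold would be chosen on a validation set rather than the test set.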


In [None]:
# Model Training & Evaluation (placeholder)
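from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale features before fitting, as planned in the methodology.
# Assumes X_train is fully numeric and has no missing values.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)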

# Predictions
y_pred = model.predict(X_test)

# Compute accuracy, precision, and recall
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)

print(f'Accuracy: {acc:.2f}, Precision: {prec:.2f}, Recall: {rec:.2f}')
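
The remaining metrics from the plan, the confusion matrix and ROC AUC, could be computed along these lines. This sketch trains on synthetic data so that it runs on its own, and all variable names are chosen for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the loan dataset
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

clf_eval = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
y_hat = clf_eval.predict(X_hold)                 # hard 0/1 predictions
y_score = clf_eval.predict_proba(X_hold)[:, 1]   # predicted probability of class 1

# Confusion matrix layout for binary labels: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_hold, y_hat)
tn, fp, fn, tp = cm.ravel()

# AUC uses the probabilities, not the hard predictions
auc = roc_auc_score(y_hold, y_score)
fpr, tpr, thresholds = roc_curve(y_hold, y_score)

print(cm)
print(f"False positives: {fp}, false negatives: {fn}, AUC: {auc:.2f}")
```

With 1 = approved, false positives correspond to false approvals and false negatives to false rejections. The `fpr` and `tpr` arrays can be passed to `matplotlib` to draw the ROC curve.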

## **6. Conclusion**
This project aims to **demonstrate how logistic regression can be used to predict loan approvals**. The final deliverable will include:
- A **trained logistic regression model**.
- **Evaluation metrics and visualizations** (Confusion Matrix, ROC Curve).
- **Insights on the most important features** affecting loan approval.

Since this is an **individual project**, I will complete all tasks myself.