# Loan Approval Prediction using Machine Learning

## Objective

The goal of this project is to build an end-to-end machine learning pipeline to predict whether a loan application will be approved or rejected based on applicant demographic, financial, and credit-related features.

Two classification models are used:
- Logistic Regression
- Decision Tree Classifier

The project includes data preprocessing, model training, evaluation, and exporting trained models for deployment.


## 1. Import Required Libraries

In [88]:
import pandas as pd
import numpy as np

# Visualization (optional but good practice)
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Model persistence
import joblib


## 2. Load the Dataset

In [89]:
df = pd.read_csv("loan_approval_dataset.csv")  # rename if needed
df.head()


|  | person_age | person_gender | person_education | person_income | person_emp_exp | person_home_ownership | loan_amnt | loan_intent | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | credit_score | previous_loan_defaults_on_file | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | female | Master | 71948 | 0 | RENT | 35000 | PERSONAL | 16.02 | 0.49 | 3 | 561 | No | 1 |
| 1 | 21 | female | High School | 12282 | 0 | OWN | 1000 | EDUCATION | 11.14 | 0.08 | 2 | 504 | Yes | 0 |
| 2 | 25 | female | High School | 12438 | 3 | MORTGAGE | 5500 | MEDICAL | 12.87 | 0.44 | 3 | 635 | No | 1 |
| 3 | 23 | female | Bachelor | 79753 | 0 | RENT | 35000 | MEDICAL | 15.23 | 0.44 | 2 | 675 | No | 1 |
| 4 | 24 | male | Master | 66135 | 1 | RENT | 35000 | MEDICAL | 14.27 | 0.53 | 4 | 586 | No | 1 |


## 3. Dataset Overview

In [90]:
df.shape

(45000, 14)

In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45000 entries, 0 to 44999
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   person_age                      45000 non-null  int64  
 1   person_gender                   45000 non-null  object 
 2   person_education                45000 non-null  object 
 3   person_income                   45000 non-null  int64  
 4   person_emp_exp                  45000 non-null  int64  
 5   person_home_ownership           45000 non-null  object 
 6   loan_amnt                       45000 non-null  int64  
 7   loan_intent                     45000 non-null  object 
 8   loan_int_rate                   45000 non-null  float64
 9   loan_percent_income             45000 non-null  float64
 10  cb_person_cred_hist_length      45000 non-null  int64  
 11  credit_score                    45000 non-null  int64  
 12  previous_loan_defaults_on_file  

### Check Missing Values

In [92]:
df.isnull().sum()

person_age               0
person_gender            0
person_education         0
person_income            0
person_emp_exp           0
person_home_ownership    0
loan_amnt                0
loan_intent              0
loan_int_rate            0
loan_percent_income      0


### Observations
- The dataset contains both numerical and categorical features.
- The target variable is `loan_status` (1 = approved, 0 = rejected).
- The outputs above show no missing values, so the preprocessing pipeline includes no imputation step.


## 4. Separate Features and Target

In [93]:
X = df.drop("loan_status", axis=1)
y = df["loan_status"]

## 5. Identify Feature Types

In [94]:
categorical_features = [
    "person_gender",
    "person_education",
    "person_home_ownership",
    "loan_intent",
    "previous_loan_defaults_on_file"
]

numerical_features = [
    "person_age",
    "person_income",
    "person_emp_exp",
    "loan_amnt",
    "loan_int_rate",
    "loan_percent_income",
    "cb_person_cred_hist_length",
    "credit_score"
]


### Feature Groups
- **Categorical features** are encoded using One-Hot Encoding.
- **Numerical features** are scaled using StandardScaler.


## 6. Preprocessing Pipeline

In [95]:
numeric_transformer = Pipeline(steps=[
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)


## 7. Train-Test Split

In [96]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


- 80% training data (36,000 rows)
- 20% testing data (9,000 rows)
- Stratifying on `y` keeps the approved/rejected ratio the same in both splits
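
The effect of `stratify` can be checked on synthetic labels. The sketch below uses placeholder data (`X_demo`, `y_demo` are hypothetical names) with a 7:2 class ratio, chosen to mirror the 7000 vs 2000 support reported for the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 700 samples of class 0 and 200 of class 1.
y_demo = np.array([0] * 700 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,
)

# The share of class 1 should match across the full set and both splits.
print("full :", y_demo.mean())
print("train:", y_tr.mean())
print("test :", y_te.mean())
```

Without `stratify`, the class-1 share in each split would vary with the random seed.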


## 8. Logistic Regression Model

*   Build Pipeline

In [97]:
lr_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])


*  Train Model



In [98]:
lr_pipeline.fit(X_train, y_train)

*   Evaluate




In [99]:
lr_preds = lr_pipeline.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, lr_preds))
print(classification_report(y_test, lr_preds))

Logistic Regression Accuracy: 0.8993333333333333
              precision    recall  f1-score   support

           0       0.93      0.94      0.94      7000
           1       0.79      0.75      0.77      2000

    accuracy                           0.90      9000
   macro avg       0.86      0.84      0.85      9000
weighted avg       0.90      0.90      0.90      9000



## 9. Decision Tree Model

*   Build Pipeline

In [100]:
dt_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(
        random_state=42,
        max_depth=10
    ))
])

*   Train Model



In [101]:
dt_pipeline.fit(X_train, y_train)

*   Evaluate

In [102]:
dt_preds = dt_pipeline.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, dt_preds))
print(classification_report(y_test, dt_preds))

Decision Tree Accuracy: 0.9214444444444444
              precision    recall  f1-score   support

           0       0.93      0.97      0.95      7000
           1       0.89      0.73      0.81      2000

    accuracy                           0.92      9000
   macro avg       0.91      0.85      0.88      9000
weighted avg       0.92      0.92      0.92      9000



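## 10. Model Comparison
- Logistic Regression provides a strong, interpretable baseline: about 0.90 test accuracy, with precision 0.79 and recall 0.75 on class 1.
- The Decision Tree can capture non-linear relationships and reaches about 0.92 test accuracy, with higher precision on class 1 (0.89) but slightly lower recall (0.73).
- Both models score higher on the majority class 0 than on class 1, so per-class precision and recall are more informative than accuracy alone.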
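
A confusion matrix makes this kind of per-class comparison explicit. The sketch below is an illustration on synthetic data from `make_classification`, not the loan dataset, so its numbers will not match the reports above. The classifier settings mirror the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem (an assumption made for illustration).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.78, 0.22], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42, max_depth=10),
}

matrices = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Rows are true labels, columns are predicted labels, in the order [0, 1].
    matrices[name] = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1])
    tn, fp, fn, tp = matrices[name].ravel()
    print(f"{name}: TN={tn} FP={fp} FN={fn} TP={tp}")
```

Reading the false positives and false negatives side by side shows which kind of error each model makes more often, which a single accuracy figure hides.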


## 11. Export Models Using joblib

In [103]:
joblib.dump(lr_pipeline, "logistic_regression_pipeline.pkl")
joblib.dump(dt_pipeline, "decision_tree_pipeline.pkl")

['decision_tree_pipeline.pkl']

The saved pipelines include:
- Preprocessing (scaling + encoding)
- Trained classifier

These files will be used directly in the backend application.
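
Because preprocessing is stored inside each pipeline, a consumer only needs to pass a raw DataFrame with the original column names. The sketch below shows that round trip under stated assumptions: it trains a small stand-in pipeline on made-up rows and writes it to a temporary directory, so it does not depend on the files above. How the actual backend loads the files is not specified here.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; only the column names come from the dataset.
toy_X = pd.DataFrame({
    "credit_score": [560, 500, 640, 680, 590, 610],
    "previous_loan_defaults_on_file": ["No", "Yes", "No", "No", "Yes", "No"],
})
toy_y = [1, 0, 1, 1, 0, 1]

toy_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["credit_score"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["previous_loan_defaults_on_file"]),
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(toy_X, toy_y)

# Save and reload, as a separate process would do with the exported file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw, unscaled, unencoded input.
new_applicant = pd.DataFrame([
    {"credit_score": 620, "previous_loan_defaults_on_file": "No"}
])
prediction = restored.predict(new_applicant)
probabilities = restored.predict_proba(new_applicant)
print(prediction, probabilities)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that produced them, so it is worth pinning the version in the consuming environment.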


## 12. Save Feature Names for Reference

In [104]:
feature_names = (
    numerical_features +
    list(
        lr_pipeline.named_steps["preprocessor"]
        .named_transformers_["cat"]
        .named_steps["encoder"]
        .get_feature_names_out(categorical_features)
    )
)

joblib.dump(feature_names, "feature_names.pkl")


['feature_names.pkl']