# 📌 Loan Approval Prediction using Machine Learning

👤 **Author:** Emmanuel J. Generale  
📅 **Date:** June 2025  
💻 **Platform:** Google Colab

## 📘 Project Description
This project develops a machine learning model that predicts whether a loan application should be approved, based on the applicant's personal and financial information. I use Logistic Regression, Decision Trees, Random Forests, and Gradient Boosting to apply what I learned in Andrew Ng's Machine Learning Specialization. The dataset is a real-world, anonymized loan dataset from Kaggle.

## 🎯 Project Objectives

- Apply what I've learned in Course 1 & 2 of Andrew Ng's Machine Learning Specialization

- Practice real-world data cleaning and preprocessing, including handling missing values and encoding categorical features

- Build and evaluate multiple classification models, including Logistic Regression, Random Forest, and Gradient Boosting

- Evaluate model performance using test-set accuracy

- Explore feature importance

- Perform hyperparameter tuning

- Document and share the entire ML workflow for personal learning and portfolio
- Learn how to save and reuse machine learning models using joblib


In [None]:
from google.colab import files
uploaded = files.upload()


In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('train_u6lujuX_CVtuZ9i.csv')

# Displaying the first 5 rows to check contents
df.head()


# Inspect the Dataset Structure

**Check the basic structure and missing values.**

● To know which columns need cleaning

● To understand the data types, so that appropriate preprocessing steps can be chosen



In [None]:
# Info about columns, data types, and non-null counts
df.info()

# Count how many missing values per column
df.isnull().sum()


**One thing I've learned is that missing values shouldn't be ignored, because they can:**

- Break certain models (scikit-learn's logistic regression, for example, raises an error on NaN inputs)
- Reduce accuracy
- Bias the results if not handled properly

The missing values here mean that some applicants didn't fill out those fields, so we need to fill or fix them before training the model.



# Data Cleaning & Preprocessing

# 🎯 Goal:

● Handle missing values

● Prepare data for modeling (clean, consistent, and usable)

**Fill Missing Values:**
Let's handle the columns one by one.



In [None]:
# Fill missing values by assigning the result back to each column.
# (fillna(..., inplace=True) on df['col'] warns in recent pandas versions
# and does not update df under copy-on-write.)

# Categorical / discrete columns: fill with the most frequent value (mode)
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed',
            'Loan_Amount_Term', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numeric column: fill with the median (less sensitive to outliers)
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())



**Why do I use the mode to fill missing values in categorical columns?**

When a categorical value is missing, the most common category is a reasonable guess.
**Example: if 80% of applicants are "Male", a missing value is most likely "Male" as well.**
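
As a small illustration of mode imputation, the sketch below fills the gaps in a made-up `Gender` column with its most frequent value. The data is a toy example, not the loan dataset, and the variable names are only for this sketch.

```python
import pandas as pd

# Toy column (made-up values, not from the loan dataset)
gender = pd.Series(['Male', 'Male', None, 'Female', 'Male', None])

# mode() returns a Series because ties are possible; [0] takes the first mode
most_common = gender.mode()[0]

# Assign the result to a new Series rather than filling in place
gender_filled = gender.fillna(most_common)

print("Most common value:", most_common)
print("Missing after fill:", gender_filled.isnull().sum())
```

One side effect to keep in mind: filling with the mode makes the majority category slightly more dominant than it was in the observed data.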



In [None]:
# Drop the identifier column, which is not a feature
# axis=1 drops a column; inplace=True modifies df directly
df.drop('Loan_ID', axis=1, inplace=True)

**Numeric Column → Fill with Median**

In [None]:
# Fill LoanAmount with the median (less sensitive to outliers).
# This is a no-op if LoanAmount was already filled in the earlier cell.
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())

# Recheck for missing values
df.isnull().sum()


## Encode Categorical Variables

## 🎯 Goal:

● Machine learning models work with numbers, not text, so we need to convert text categories like "Male", "Yes", and "Graduate" into numeric values.

## Let's list all categorical columns first:

In [None]:
df.select_dtypes(include='object').columns

## Encode the Target Column

Loan_Status is our target: it's "Y" for approved and "N" for rejected.

**Label-Encode the Binary Columns:**
These columns have only two unique values (binary):

- Gender: Male / Female

- Married: Yes / No

- Education: Graduate / Not Graduate

- Self_Employed: Yes / No

Let's convert them:


In [None]:
#Target column
df['Loan_Status'] = df['Loan_Status'].map({'Y': 1, 'N': 0})

#Other Binary Columns
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df['Married'] = df['Married'].map({'Yes': 1, 'No': 0})
df['Education'] = df['Education'].map({'Graduate': 1, 'Not Graduate': 0})
df['Self_Employed'] = df['Self_Employed'].map({'Yes': 1, 'No': 0})
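
One thing worth checking after a dictionary-based `.map()` is whether any value fell outside the dictionary, since pandas turns those into NaN without raising an error. The sketch below uses a made-up column with one unexpected spelling to show the effect; it is an illustration, not part of the original workflow.

```python
import pandas as pd

# Toy column with one unexpected spelling ('yes' in lowercase)
married = pd.Series(['Yes', 'No', 'yes', 'Yes'])

encoded = married.map({'Yes': 1, 'No': 0})

# Values missing from the dictionary become NaN, so count them after mapping
n_unmapped = encoded.isnull().sum()
print(encoded.tolist())
print("Unmapped values:", n_unmapped)
```

Running `df.isnull().sum()` again after encoding is a cheap way to catch this on the real data.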


**One-Hot Encode Non-Binary Columns:**

● Dependents: ('0', '1', '2', '3+')

● Property_Area: ('Urban', 'Rural', 'Semiurban')

We'll use pd.get_dummies() for one-hot encoding:

In [None]:
# dtype=int gives 0/1 columns instead of booleans; drop_first avoids a redundant column
df = pd.get_dummies(df, columns=['Dependents', 'Property_Area'], drop_first=True, dtype=int)

df.info()


☑ All features are now numeric, so we're ready to split the data and start modeling.
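
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame that contains only the three `Property_Area` categories listed above. The frame and the `dtype=int` choice are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the three Property_Area categories
toy = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semiurban', 'Urban']})

# drop_first=True drops the first category in sorted order ('Rural' here),
# which then acts as the reference level encoded as all zeros
dummies = pd.get_dummies(toy, columns=['Property_Area'], drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

The dropped category loses no information: a row with zeros in every remaining dummy column belongs to it.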

# 🚀 Train-Test Split & Baseline Model (Logistic Regression)

## 🎯 Goal:

● Split the dataset into training and testing sets.

● Train a Logistic Regression model (the baseline classifier).

● Evaluate its basic performance.





**Split Into Training and Test Sets:**

We will remove the Loan_Status column from the feature set (X).

This prevents the model from "cheating" by seeing the answer during training.



In [None]:
# Separate input features (X) and target variable (y)
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

from sklearn.model_selection import train_test_split

# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

*`test_size=0.2` → 20% of the data is set aside for evaluation.*

*`random_state=42` → makes the split reproducible.*
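
The two parameters can be checked on stand-in data. The sketch below assumes 614 rows (the dataset size mentioned later in this notebook) with arbitrary values, and shows the resulting split sizes and that repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 614 rows; the values themselves are arbitrary
X_demo = np.arange(614).reshape(-1, 1)
y_demo = np.arange(614) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print("Train rows:", len(X_tr), "Test rows:", len(X_te))

# Repeating the call with the same random_state reproduces the split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = np.array_equal(X_te, X_te2)
print("Same test rows:", same_split)
```

With a test set of this size, a single prediction moves accuracy by roughly 0.8 percentage points (1/123), so small gaps between models should be read with caution.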



**Scale the Features:**

Some features (like ApplicantIncome and LoanAmount) have very large values compared to others (like the 0/1 binary features). Because logistic regression is sensitive to feature scale, we'll use StandardScaler to standardize all feature columns:



In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize scaler
scaler = StandardScaler()

# Fit on training data and transform both train and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
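
For intuition about what the scaler does, the sketch below reproduces `StandardScaler` by hand on made-up income-like values: each value is shifted by the training mean and divided by the training standard deviation. The arrays and names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up income-like values on a large scale
train = np.array([[2000.0], [4000.0], [6000.0], [8000.0]])
test = np.array([[5000.0]])

scaler_demo = StandardScaler().fit(train)

# StandardScaler computes z = (x - mean) / std using training statistics only
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaler_demo.transform(test))
print(manual)
```

Fitting on the training set only, as the cell above does, keeps information about the test set out of the preprocessing step.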


**Train a Logistic Regression Model:**

In [None]:
from sklearn.linear_model import LogisticRegression

# Create model with higher iteration limit
model = LogisticRegression(max_iter=1000)

# Train on scaled data
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)


**Evaluate:**

In [None]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy (scaled):", accuracy)


*An accuracy of ~78.9% from the Logistic Regression baseline is a solid starting point, especially for a real-world dataset. Next, let's see whether more powerful models can improve on it.*
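
As a reminder of what the accuracy number measures, here is a minimal sketch on made-up labels. It also computes the accuracy of always predicting the most common class, which is a useful reference point that is not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration
y_true_demo = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Accuracy is the fraction of predictions that match the true labels
manual_acc = (y_true_demo == y_pred_demo).mean()
sk_acc = accuracy_score(y_true_demo, y_pred_demo)

# Reference point: always predicting the most common class
majority_acc = max(y_true_demo.mean(), 1 - y_true_demo.mean())
print(manual_acc, sk_acc, majority_acc)
```

On the real data, the same comparison could be made with `y_test.mean()` to see how far the models sit above a trivial predictor.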

# Advanced Models: Decision Tree, Random Forest, and Gradient Boosting

In this part, I also apply what I learned in **Course 2 of Andrew Ng's Machine Learning Specialization.**

These are tree-based models that often perform better than logistic regression, especially with non-linear patterns.
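
To illustrate what "non-linear patterns" means here, the sketch below builds synthetic XOR-style data, where no single straight line separates the classes, and compares a logistic regression with a decision tree. The data and names are assumptions for illustration and unrelated to the loan dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic XOR-style data: the label is 1 when the two features have different signs
rng = np.random.default_rng(0)
X_xor = rng.uniform(-1, 1, size=(1000, 2))
y_xor = ((X_xor[:, 0] > 0) ^ (X_xor[:, 1] > 0)).astype(int)

Xa, Xb, ya, yb = train_test_split(X_xor, y_xor, test_size=0.3, random_state=0)

lin_acc = LogisticRegression().fit(Xa, ya).score(Xb, yb)
tree_acc = DecisionTreeClassifier(random_state=0).fit(Xa, ya).score(Xb, yb)
print("Logistic Regression:", lin_acc)
print("Decision Tree:", tree_acc)
```

When the real data lacks this kind of interaction, the advantage of tree-based models can shrink or disappear.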

**✅ Decision Tree Classifier:**

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize and train
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Predict and evaluate
y_pred_dt = dt_model.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))


**✅ Random Forest Classifier:**

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and evaluate
y_pred_rf = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))


**✅ Gradient Boosting:**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize and train
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

# Predict and evaluate
y_pred_gb = gb_model.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))


## 📊 Model Performance Summary
| Model               | Test Accuracy | Notes                                  |
| ------------------- | ------------- | -------------------------------------- |
| Logistic Regression | **78.86%** ✅ | baseline model                         |
| Decision Tree       | 69.11% ❌     | overfitting, weaker generalization     |
| Random Forest       | 77.24% ✅     | good, but slightly lower than baseline |
| Gradient Boosting   | **78.05%** ✅ | almost tied with baseline              |


## 🧠 Observations:

● ✅ Logistic Regression performed best overall, though only by a small margin. This suggests that a roughly linear decision boundary already captures most of the signal in these features.

● ❌ Decision Tree underperformed. This is common, since a single unpruned tree tends to overfit.

● ✅ Random Forest & Gradient Boosting did well but didn't outperform logistic regression, possibly because:

• The dataset is small (only 614 rows)

• There's not much non-linear pattern to exploit
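
A common way to see overfitting is to compare training and test accuracy. The sketch below does this on synthetic noisy data of similar size (an assumption, not the loan dataset) for an unrestricted tree and for a tree limited to depth 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data with 614 rows; flip_y adds label noise
X_syn, y_syn = make_classification(n_samples=614, n_features=11, n_informative=4,
                                   flip_y=0.2, random_state=0)
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Unrestricted tree: grows until every training sample is classified correctly
full_tree = DecisionTreeClassifier(random_state=0).fit(Xs_tr, ys_tr)
train_full, test_full = full_tree.score(Xs_tr, ys_tr), full_tree.score(Xs_te, ys_te)

# Shallow tree: limited capacity, so it cannot memorize the noise
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xs_tr, ys_tr)
train_shallow, test_shallow = shallow_tree.score(Xs_tr, ys_tr), shallow_tree.score(Xs_te, ys_te)

print(f"Full tree:    train={train_full:.3f}, test={test_full:.3f}")
print(f"Depth-3 tree: train={train_shallow:.3f}, test={test_shallow:.3f}")
```

The same check on `dt_model` with `X_train` and `X_test` would show whether the gap between training and test accuracy explains the weaker result in the table.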





# 🔹 Feature Importance:
We will now look at which features contribute most to loan approval decisions.

**I'll do this using:**

● Random Forest (good balance of accuracy + interpretability)

● A simple bar chart to visualize the result

**📈 Get Feature Importances from Random Forest:**

In [None]:
# Get feature importances from trained Random Forest model
import pandas as pd
import numpy as np

importances = rf_model.feature_importances_
feature_names = X.columns

# Create a DataFrame for easy sorting
feat_imp_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
        })

# Sort by importance (descending)
feat_imp_df = feat_imp_df.sort_values(by='Importance', ascending=False)
print(feat_imp_df)


**📈 Visualize with Bar Plot:**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feat_imp_df, palette='viridis')
plt.title('Feature Importance (Random Forest)')
plt.tight_layout()
plt.show()


**✅ Interpreting the Output:**

**The most important feature is Credit_History**, meaning it has the strongest influence on the model's approval predictions.

**The least important features are** Self_Employed and Dependents_2, meaning they have little impact.
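
When reading these numbers, it helps to know that scikit-learn normalizes `feature_importances_` so they sum to 1, which makes each value a share of the total. The sketch below shows this on synthetic data in which only the first two columns determine the label; the data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on columns 0 and 1; columns 2-4 are noise
rng = np.random.default_rng(0)
X_fi = rng.normal(size=(500, 5))
y_fi = (X_fi[:, 0] + X_fi[:, 1] > 0).astype(int)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fi, y_fi)
imp = rf_demo.feature_importances_

# Importances are normalized, so each value is a share of the total
print("Importances:", imp.round(3), "Sum:", imp.sum())

top_two = set(np.argsort(imp)[-2:].tolist())
print("Top two feature indices:", top_two)
```

Impurity-based importances describe what the model relied on, not a causal effect on loan approval.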

**Feature Importance from Gradient Boosting**

Comparing Gradient Boosting's feature importances with Random Forest's will:

* Confirm which features are consistently important

* Spot any differences in how the models "think"



In [None]:
# Extract feature importances from the Gradient Boosting model
gb_importances = gb_model.feature_importances_

# Create DataFrame for comparison
gb_feat_imp_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': gb_importances
        }).sort_values(by='Importance', ascending=False)

# Show the table
print(gb_feat_imp_df)


In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=gb_feat_imp_df, palette='coolwarm')
plt.title('Feature Importance (Gradient Boosting)')
plt.tight_layout()
plt.show()


**✅ Merge Feature Importances:**


In [None]:
# Merge Random Forest and Gradient Boosting importances
merged_feat_imp = pd.DataFrame({
    'Feature': X.columns,
    'Random Forest': rf_model.feature_importances_,
    'Gradient Boosting': gb_model.feature_importances_
})

# Sort by average importance
merged_feat_imp['Average'] = (merged_feat_imp['Random Forest'] + merged_feat_imp['Gradient Boosting']) / 2
merged_feat_imp = merged_feat_imp.sort_values(by='Average', ascending=False).drop(columns='Average')

print(merged_feat_imp)


**📈 Side-by-Side Bar Plot using Seaborn**

In [None]:
# Melt the DataFrame for easier plotting
melted = pd.melt(merged_feat_imp, id_vars='Feature',
            value_vars=['Random Forest', 'Gradient Boosting'],
            var_name='Model', value_name='Importance')

plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', hue='Model', data=melted)
plt.title('Feature Importance Comparison: Random Forest vs Gradient Boosting')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()


## 🔍 What I noticed:

**The top features are consistent across both models** ✅

**One model, Gradient Boosting, relies more heavily on a specific feature than the other** ✅

**There are features that the Gradient Boosting model almost ignores** ✅



# Hyperparameter Tuning (Random Forest)

**🎯 To improve our model performance.**

We'll use RandomizedSearchCV first. According to GPT, it's faster and more practical than GridSearchCV when you're starting out, especially with limited resources (like running on a phone or basic Colab), because it tries only a fixed number of randomly sampled combinations instead of every one.

We'll tune the Random Forest model.

**Import the Tools & Define Hyperparameter Grid:**

Here are some commonly tuned hyperparameters for Random Forest:



In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Define the hyperparameter space
param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
    'max_features': ['sqrt', 'log2']
                    }
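
To see why a randomized search is cheaper, the sketch below counts how many model fits an exhaustive grid search would need for a search space modeled on the one above (here with two `max_features` options, an assumption for this sketch), compared with a randomized search that samples 20 combinations under 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Search space modeled on the notebook's Random Forest space
space = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
}

n_combinations = len(ParameterGrid(space))
cv_folds = 5   # assumed number of cross-validation folds
n_iter = 20    # assumed number of sampled combinations

grid_fits = n_combinations * cv_folds    # exhaustive grid search
random_fits = n_iter * cv_folds          # randomized search
print(n_combinations, grid_fits, random_fits)
```

The trade-off is that a randomized search may miss the best combination, so the result depends on `n_iter` and `random_state`.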


**Perform Randomized Search:**

In [None]:
# Initialize the base model
rf = RandomForestClassifier(random_state=42)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=20,           # Try 20 random combinations
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
    random_state=42,
    n_jobs=-1            # Use all CPU cores (if supported)
)

# Fit the search on training data
random_search.fit(X_train, y_train)

# Best model
best_rf = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)


**Evaluate Tuned Model:**

In [None]:
# Predict using best model
y_pred_best_rf = best_rf.predict(X_test)

# Evaluate
print("Tuned Random Forest Accuracy:", accuracy_score(y_test, y_pred_best_rf))


## Hyperparameter Tuning for Gradient Boosting
Let's now tune the Gradient Boosting Classifier (GBM) using RandomizedSearchCV.




**Define Hyperparameter Grid:**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
param_grid_gbm = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5, 6],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.6, 0.8, 1.0]
}



**Perform Randomized Search:**

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Initialize base GBM model
gbm = GradientBoostingClassifier(random_state=42)

# Set up randomized search
random_search_gbm = RandomizedSearchCV(
    estimator=gbm,
    param_distributions=param_grid_gbm,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
        )

# Fit search on training data
random_search_gbm.fit(X_train, y_train)


#Best model
best_gbm = random_search_gbm.best_estimator_
print("Best Parameters for GBM:", random_search_gbm.best_params_)


**Evaluate the Tuned GBM:**

In [None]:
y_pred_best_gbm = best_gbm.predict(X_test)
print("Tuned Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_best_gbm))


## 📊 Same Accuracy: 78.86% (RF & GBM after tuning)

**🔍 Observation:**

✅ Both tuned models reached the same test accuracy. This suggests they make similar predictions on this test set, although equal accuracy does not guarantee identical predictions.

❌ No accuracy improvement over Logistic Regression. The baseline, also at ~78.9%, was already very strong.

## Save the Final Model:

In [None]:
import joblib

# Save the trained Gradient Boosting model
joblib.dump(best_gbm, 'loan_approval_model.pkl')


In [None]:
# To load and use it later:
loaded_model = joblib.load('loan_approval_model.pkl')

# Make predictions
loaded_model.predict(X_test)


In [None]:
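import os
import shutil

# Copy the notebook to /content under a new name (this copies; it does not rename).
# The source filename is assumed to match the notebook's actual name.
src = 'Loan_Approval_Prediction_Project.ipynb'
if os.path.exists(src):
    shutil.copy(src, '/content/loan_approval_notebook.ipynb')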
