# Loan Approval Prediction Analysis

This notebook analyzes a loan approval dataset with several machine learning models. We first preprocess the data so that it is clean and ready for modeling: we handle missing values, reduce the influence of outliers, and address multicollinearity. We then train several classifiers to predict whether a loan will be approved or rejected, and choose the best model based on performance metrics.

### Objectives
1. Handle missing values, outliers, and multicollinearity in the dataset.
2. Train multiple machine learning models: Logistic Regression, SVM, Decision Tree, Random Forest, and Gradient Boosting.
3. Evaluate the models using KPIs such as Precision, Recall, F1-Score, and processing time.
4. Select the best model for predicting loan approvals.

## Step 1: Data Preprocessing
In this section, we handle data cleaning and preprocessing, including missing values, outlier detection, and multicollinearity.

In [1]:
# Import the required libraries (the dataset is loaded in the next cell)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
import time
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
# Load the dataset
df = pd.read_excel('Bank_Personal_Loan_Modelling.xlsx', sheet_name='Data')
df

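| | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |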


In [12]:
# Check for missing values; if any are found, fill them with the column mean
missing_values = df.isnull().sum()
if missing_values.sum() > 0:
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy='mean')
    df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
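Mean imputation replaces each missing entry with the mean of the observed values in the same column. The sketch below reproduces that idea with plain NumPy on a small made-up array; the values and the positions of the gaps are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical two-column array with gaps (NaN marks a missing value)
data = np.array([
    [25.0, 49.0],
    [45.0, np.nan],
    [np.nan, 11.0],
    [35.0, 100.0],
])

# Column means computed over the observed (non-NaN) entries only
col_means = np.nanmean(data, axis=0)

# Replace each NaN with the mean of its own column (broadcast across rows)
filled = np.where(np.isnan(data), col_means, data)

print(col_means)
print(filled)
```

Because the mean is computed per column, this strategy only makes sense for numeric features; applying it to a categorical code or to the target column would produce values that do not correspond to any real category.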

In [13]:
# Apply a log(1 + x) transformation to compress large Income and CCAvg values (reduces the influence of outliers)
df['Income_log'] = np.log1p(df['Income'])
df['CCAvg_log'] = np.log1p(df['CCAvg'])
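To see why `np.log1p` helps with extreme values, the following sketch compares the skewness of a small right-skewed sample before and after the transformation. The sample values are illustrative only, and the `skewness` helper is a hypothetical function written for this example, not part of the notebook.

```python
import numpy as np

def skewness(values):
    """Biased sample skewness: third central moment divided by variance ** 1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

# Illustrative right-skewed values with one large observation
income = np.array([11, 15, 24, 34, 40, 45, 49, 49, 100, 224], dtype=float)

# log1p(x) = log(1 + x); it is defined at x = 0, unlike log(x)
income_log = np.log1p(income)

raw_skew = skewness(income)
log_skew = skewness(income_log)
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")

# The transformation is invertible with expm1, so original values can be recovered
recovered = np.expm1(income_log)
```

The transformation does not remove any rows; it only compresses the upper tail so that large values pull less on scale-sensitive models.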

In [10]:
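# Address multicollinearity: drop Experience and the raw Income and CCAvg columns
# (the log-transformed Income_log and CCAvg_log columns are kept instead)
df_cleaned = df.drop(columns=['Experience', 'Income', 'CCAvg'])
# Build the feature matrix without the identifier columns (ID, ZIP Code) and the target
X_cleaned = df_cleaned.drop(columns=['ID', 'Personal Loan', 'ZIP Code'])
y_cleaned = df_cleaned['Personal Loan']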
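The cell above drops columns without showing a diagnostic. One common way to check for multicollinearity is the variance inflation factor (VIF), defined for feature j as 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. The sketch below is a minimal NumPy version run on synthetic data; the `vif_scores` helper, the column names, and the value ranges are assumptions made for illustration, and the strong Age/Experience relationship is built into the synthetic data rather than measured from the dataset.

```python
import numpy as np

def vif_scores(X):
    """Return the VIF of each column: total / residual sum of squares
    from regressing that column on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(np.sum((target - target.mean()) ** 2))
        # ss_tot / ss_res equals 1 / (1 - R^2); infinite if perfectly collinear
        scores.append(float("inf") if ss_res == 0.0 else ss_tot / ss_res)
    return scores

# Synthetic data: experience is constructed to track age closely
rng = np.random.default_rng(0)
age = rng.uniform(23, 67, size=500)
experience = age - 25 + rng.normal(0, 1, size=500)
income = rng.uniform(8, 224, size=500)

vifs = vif_scores(np.column_stack([age, experience, income]))
for name, value in zip(["Age", "Experience", "Income"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

A frequently used rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity; dropping one column of a highly correlated pair usually brings the other back to a low value.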

In [14]:
# Split the cleaned data (70% train, 30% test), then standardize using statistics fitted on the training set only
X_train_cleaned, X_test_cleaned, y_train_cleaned, y_test_cleaned = train_test_split(X_cleaned, y_cleaned, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_cleaned_scaled = scaler.fit_transform(X_train_cleaned)
X_test_cleaned_scaled = scaler.transform(X_test_cleaned)
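The scaler is fitted on the training split and then reused on the test split, so no information from the test data leaks into the preprocessing. The sketch below spells out the same arithmetic with NumPy, using a few Age and Income values like those shown in the data preview; the split into `train` and `test` arrays here is hypothetical.

```python
import numpy as np

# Hypothetical training rows (Age, Income) and one held-out test row
train = np.array([[25.0, 49.0], [45.0, 34.0], [39.0, 11.0], [35.0, 100.0]])
test = np.array([[35.0, 45.0]])

# Fit: per-column mean and population standard deviation (ddof=0) from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)

# Transform: apply the training statistics to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately zero for each column
print(test_scaled)                # generally not zero-mean, since test rows were not used for fitting
```

Standardization matters most for the scale-sensitive models in this notebook (Logistic Regression and SVM); tree-based models are largely unaffected by it.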

## Step 2: Model Training and Evaluation
In this step, we initialize and train multiple models, then evaluate their performance using Precision, Recall, F1-Score, and Processing Time.

In [15]:
# Initialize and train models
models = {
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier()
}
results_cleaned = []
for name, model in models.items():
    start_time = time.time()
    model.fit(X_train_cleaned_scaled, y_train_cleaned)
    y_pred_cleaned = model.predict(X_test_cleaned_scaled)
    end_time = time.time()
    results_cleaned.append({
        'Model': name,
        'Precision': precision_score(y_test_cleaned, y_pred_cleaned),
        'Recall': recall_score(y_test_cleaned, y_pred_cleaned),
        'F1-Score': f1_score(y_test_cleaned, y_pred_cleaned),
        'Processing Time (s)': end_time - start_time
    })
results_cleaned_df = pd.DataFrame(results_cleaned)
results_cleaned_df

| | Model | Precision | Recall | F1-Score | Processing Time (s) |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.884298 | 0.681529 | 0.769784 | 0.014810 |
| 1 | SVM | 0.957627 | 0.719745 | 0.821818 | 0.169665 |
| 2 | Decision Tree | 0.927632 | 0.898089 | 0.912621 | 0.005119 |
| 3 | Random Forest | 0.986207 | 0.910828 | 0.947020 | 0.355463 |
| 4 | Gradient Boosting | 0.972973 | 0.917197 | 0.944262 | 0.459594 |
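For reference, the three quality metrics reported here can be computed directly from counts of true positives, false positives, and false negatives. The sketch below does this in plain Python for a small made-up set of labels, then checks that the F1-Score reported for Random Forest is the harmonic mean of its reported Precision and Recall. The `precision_recall_f1` helper and the toy labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)

# Consistency check against the reported Random Forest row
rf_precision, rf_recall = 0.986207, 0.910828
rf_f1 = 2 * rf_precision * rf_recall / (rf_precision + rf_recall)
print(round(rf_f1, 4))
```

Precision penalizes approving loans that should have been rejected, while Recall penalizes missing loans that should have been approved; F1 balances the two, which is why it is a reasonable single number for ranking the models.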


## Step 3: Model Performance Summary
The following table summarizes the performance of each model. We evaluate the models based on Precision, Recall, F1-Score, and Processing Time.

In [16]:
# Display the results
results_cleaned_df

| | Model | Precision | Recall | F1-Score | Processing Time (s) |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.884298 | 0.681529 | 0.769784 | 0.014810 |
| 1 | SVM | 0.957627 | 0.719745 | 0.821818 | 0.169665 |
| 2 | Decision Tree | 0.927632 | 0.898089 | 0.912621 | 0.005119 |
| 3 | Random Forest | 0.986207 | 0.910828 | 0.947020 | 0.355463 |
| 4 | Gradient Boosting | 0.972973 | 0.917197 | 0.944262 | 0.459594 |


## Conclusion
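Based on these results, the **Random Forest** model performed best, with the highest F1-Score (0.947) and Precision (0.986) at a processing time of about 0.36 s. Gradient Boosting was a close second (F1-Score 0.944) and achieved the highest Recall (0.917). The preprocessing steps, including the log transformation for outliers and the handling of multicollinearity, prepared the data for these models; because no model was trained without them, their contribution to performance is not quantified here.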