# Loan Approval Prediction Analysis


This notebook contains the full analysis for loan approval prediction using various machine learning models:
- **Data Preprocessing**: Checking for multicollinearity (VIF), handling outliers.
- **Model Training**: Logistic Regression, SVM, Decision Tree, Random Forest, and Gradient Boosting.
- **Model Evaluation**: Precision, Recall, F1-Score, and Processing Time (wall-clock seconds).
- **Feature Importance**: Identifying key factors influencing loan approval.
    

## Step 1: Data Preprocessing

In [1]:

# Importing libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
import time

# Load the dataset
data = pd.read_excel('/content/Bank_Personal_Loan_Modelling(1).xlsx', sheet_name='Data')

# Checking for missing values
data.isnull().sum()


| Column | Missing values |
|---|---|
| ID | 0 |
| Age | 0 |
| Experience | 0 |
| Income | 0 |
| ZIP Code | 0 |
| Family | 0 |
| CCAvg | 0 |
| Education | 0 |
| Mortgage | 0 |
| Personal Loan | 0 |


### Multicollinearity Check (VIF Calculation)

In [2]:

# Selecting the numeric features for the VIF check (ID, ZIP Code, and Education are left out)
X = data[['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Mortgage']]

# Adding a constant for VIF calculation
X = sm.add_constant(X)

# Calculating VIF for each feature
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data


|   | Feature | VIF |
|---|---|---|
| 0 | const | 443.244127 |
| 1 | Age | 87.610422 |
| 2 | Experience | 87.602915 |
| 3 | Income | 1.810854 |
| 4 | Family | 1.031369 |
| 5 | CCAvg | 1.72263 |
| 6 | Mortgage | 1.04589 |
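
The VIF of a feature equals 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below recomputes this with plain NumPy to illustrate why two nearly collinear columns both get a large VIF, and why dropping one of them brings the other back to about 1. The `age`, `experience`, and `income` arrays are synthetic stand-ins (an assumption for illustration), not the bank dataset, and the helper `vif` is hypothetical rather than part of the notebook.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        # Auxiliary regression of column i on an intercept plus the other columns
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic data (assumption): experience is almost a linear function of age
rng = np.random.default_rng(0)
age = rng.uniform(23, 67, size=2000)
experience = age - 25 + rng.normal(0, 1, size=2000)
income = rng.lognormal(4, 0.6, size=2000)

vif_all = vif(np.column_stack([age, experience, income]))
vif_reduced = vif(np.column_stack([age, income]))  # experience dropped
print("with experience:   ", vif_all)
print("without experience:", vif_reduced)
```

In this synthetic setup, `age` and `experience` both show a very large VIF while `income` stays near 1, which mirrors the pattern in the table above.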


### Handling Outliers

In [3]:

# Dampen extreme values in the skewed Income and CCAvg columns with a log transform (log1p is also defined at zero)
data['Income_log'] = np.log1p(data['Income'])
data['CCAvg_log'] = np.log1p(data['CCAvg'])
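
The cell above applies the transform without counting outliers. One possible check, sketched here under assumptions, is to count points outside the 1.5 × IQR fences before and after `np.log1p`. The `income` array is synthetic log-normal data standing in for a right-skewed column, and `iqr_outlier_count` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def iqr_outlier_count(values):
    """Count points outside the 1.5 * IQR fences (Tukey's rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Synthetic right-skewed data (assumption), not the real Income column
rng = np.random.default_rng(0)
income = rng.lognormal(mean=4.0, sigma=0.6, size=5000)

raw_outliers = iqr_outlier_count(income)
log_outliers = iqr_outlier_count(np.log1p(income))
print("outliers before log1p:", raw_outliers)
print("outliers after log1p: ", log_outliers)
```

On the real data the same helper could be called on `data['Income']` and `data['Income_log']` to see how much the transform actually helps.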

## Step 2: Model Training and Evaluation

In [4]:

# Dropping 'Experience' (VIF of about 87.6, nearly collinear with Age) and using the log-transformed Income and CCAvg
X = data[['Age', 'Income_log', 'Family', 'CCAvg_log', 'Mortgage']]
y = data['Personal Loan']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardizing the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

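# Fit a model, predict on the test set, and return precision, recall, F1-Score,
# and the elapsed wall-clock time in seconds (covers fit, predict, and scoring)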
def evaluate_model(model, model_name):
    start_time = time.time()
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    processing_time = time.time() - start_time
    return model_name, precision, recall, f1, processing_time

# Defining the models
models = {
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier()
}

# Storing results
model_results = []

# Training and evaluating the models
for model_name, model in models.items():
    result = evaluate_model(model, model_name)
    model_results.append(result)

# Creating a DataFrame to summarize the results
results_df = pd.DataFrame(model_results, columns=['Model', 'Precision', 'Recall', 'F1-Score', 'Processing Time'])
results_df


|   | Model | Precision | Recall | F1-Score | Processing Time |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.824561 | 0.447619 | 0.580247 | 0.132725 |
| 1 | SVM | 0.962963 | 0.495238 | 0.654088 | 0.49845 |
| 2 | Decision Tree | 0.826087 | 0.72381 | 0.771574 | 0.020354 |
| 3 | Random Forest | 0.940299 | 0.6 | 0.732558 | 0.51445 |
| 4 | Gradient Boosting | 0.907692 | 0.561905 | 0.694118 | 0.422536 |
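
The F1-Score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick sanity check, the sketch below recomputes F1 from the precision and recall values reported above and ranks the models. The numbers are copied from the results table; the helper name `f1_from` is hypothetical.

```python
# (precision, recall, reported F1) copied from the results table
results = {
    "Logistic Regression": (0.824561, 0.447619, 0.580247),
    "SVM": (0.962963, 0.495238, 0.654088),
    "Decision Tree": (0.826087, 0.72381, 0.771574),
    "Random Forest": (0.940299, 0.6, 0.732558),
    "Gradient Boosting": (0.907692, 0.561905, 0.694118),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

recomputed = {name: f1_from(p, r) for name, (p, r, _) in results.items()}
best_by_f1 = max(recomputed, key=recomputed.get)
best_by_precision = max(results, key=lambda name: results[name][0])

for name, value in recomputed.items():
    print(f"{name}: recomputed F1 = {value:.6f}, reported = {results[name][2]}")
print("best by F1-Score:", best_by_f1)
print("best by precision:", best_by_precision)
```

For the table above, this ranks Decision Tree first by F1-Score and SVM first by precision.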


## Step 3: Feature Importance

In [5]:

# Analyze feature importance with Gradient Boosting
# (note: Decision Tree, not Gradient Boosting, had the highest F1-Score in the results above)
best_model = GradientBoostingClassifier()
best_model.fit(X_train_scaled, y_train)

# Extracting feature importance
importances = best_model.feature_importances_
feature_names = X.columns

# Creating a DataFrame for feature importance
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

feature_importance_df


|   | Feature | Importance |
|---|---|---|
| 1 | Income_log | 0.482075 |
| 2 | Family | 0.318956 |
| 3 | CCAvg_log | 0.133619 |
| 0 | Age | 0.037018 |
| 4 | Mortgage | 0.028332 |
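
scikit-learn normalizes impurity-based importances so that they sum to 1, which means each value can be read as a share of the total. The short sketch below uses the values copied from the output above to compute the cumulative share by rank; the variable names are illustrative only.

```python
# Importances copied from the feature importance output
importances = {
    "Income_log": 0.482075,
    "Family": 0.318956,
    "CCAvg_log": 0.133619,
    "Age": 0.037018,
    "Mortgage": 0.028332,
}

total = sum(importances.values())
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)

# Running share of total importance, from most to least important feature
cumulative = []
running = 0.0
for name, value in ranked:
    running += value
    cumulative.append((name, round(running / total, 3)))

top3_share = cumulative[2][1]
print(cumulative)
print("share of top three features:", top3_share)
```

The top three features account for roughly 93% of the total importance, with Age and Mortgage sharing the remainder.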


## Conclusions


Based on the analysis:
1. **Significant Variables**: The three most important features influencing loan approval were Income_log, Family, and CCAvg_log.
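2. **Direction of Influence**: Income_log had by far the largest importance (0.48), while Age (0.04) and Mortgage (0.03) had the least. These importances are unsigned, so they rank features by strength of influence but do not show whether an effect is positive or negative.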
3. **Best KPI**: F1-Score is the best metric for this analysis, balancing both precision and recall.
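4. **Best Model**: Decision Tree had the highest F1-Score (0.77) and recall (0.72), making it the best-performing model by the chosen KPI. SVM had the highest precision (0.96), and Gradient Boosting (F1-Score 0.69) ranked third by F1-Score.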
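
Tree-based feature importances are non-negative, so on their own they cannot show the direction of a feature's effect. One possible way to check direction, sketched here on synthetic data, is to inspect the signs of logistic regression coefficients fitted on standardized features. The feature names `helps`, `hurts`, and `noise` and the data-generating process are assumptions made for illustration; they are not taken from the bank dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 3000
helps = rng.normal(size=n)  # hypothetical feature that raises approval odds
hurts = rng.normal(size=n)  # hypothetical feature that lowers approval odds
noise = rng.normal(size=n)  # hypothetical feature with no real effect

# Synthetic labels drawn from a known logistic model
logit = 2.0 * helps - 1.5 * hurts - 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X = StandardScaler().fit_transform(np.column_stack([helps, hurts, noise]))
model = LogisticRegression().fit(X, y)

# A positive coefficient raises the predicted odds of class 1, a negative one lowers them
coef = dict(zip(["helps", "hurts", "noise"], model.coef_[0]))
print(coef)
```

Applied to this notebook, the same idea would mean reading `coef_` from the fitted Logistic Regression model on `X_train_scaled`, bearing in mind that the logistic model had the lowest F1-Score of the five.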
    