<a href="https://colab.research.google.com/github/baisura/Loan-Approval-Prediction/blob/main/Loan_predicition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Loan Approval Prediction with Machine Learning**

# STEP 1: Importing Libraries

In [6]:
import pandas as pd
import numpy as np

# STEP 2: Reading the Data

In [8]:
df = pd.read_csv('Loan Predication_dataset.csv')
print(df.head())

    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural           N  
2             1.0   

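# STEP 3: Removing the 'Loan_ID' column and rows with missing values

In [None]:
# Loan_ID is only a row identifier, so it is dropped before modelling
df = df.drop(columns='Loan_ID')
# Discard rows that contain missing values
df = df.dropna()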

# STEP 4: Check for missing (null) values in a DataFrame

In [None]:
df.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0


In [None]:
print(df.describe())

       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
count       480.000000         480.000000  480.000000        480.000000   
mean       5364.231250        1581.093583  144.735417        342.050000   
std        5668.251251        2617.692267   80.508164         65.212401   
min         150.000000           0.000000    9.000000         36.000000   
25%        2898.750000           0.000000  100.000000        360.000000   
50%        3859.000000        1084.500000  128.000000        360.000000   
75%        5852.500000        2253.250000  170.000000        360.000000   
max       81000.000000       33837.000000  600.000000        480.000000   

       Credit_History  
count      480.000000  
mean         0.854167  
std          0.353307  
min          0.000000  
25%          1.000000  
50%          1.000000  
75%          1.000000  
max          1.000000  


# STEP 5: Fill missing values in categorical columns with mode
Cleans the data: Fills the missing entries (NaNs) in the categorical columns 'Gender', 'Married', 'Dependents' and 'Self_Employed'.

Uses the mode: Each missing value is replaced with the most frequent value in its own column.

Prepares for modelling: Many scikit-learn estimators cannot handle NaNs, so the columns must be complete before training.

Safe assignment: Each filled column is assigned back through df.loc[:, column] rather than by calling fillna(..., inplace=True) on a column selection, which avoids pandas' SettingWithCopyWarning and FutureWarning.

In [9]:
print("--- Before filling missing values ---")
print(df[['Gender', 'Married', 'Dependents', 'Self_Employed']].isnull().sum())



# Fill 'Gender' missing values with its mode
df.loc[:, 'Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

# Fill 'Married' missing values with its mode
df.loc[:, 'Married'] = df['Married'].fillna(df['Married'].mode()[0])

# Fill 'Dependents' missing values with its mode
df.loc[:, 'Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])

# Fill 'Self_Employed' missing values with its mode
df.loc[:, 'Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])

print("\n--- After filling missing values (corrected) ---")
print(df[['Gender', 'Married', 'Dependents', 'Self_Employed']].isnull().sum())

--- Before filling missing values ---
Gender           13
Married           3
Dependents       15
Self_Employed    32
dtype: int64

--- After filling missing values (corrected) ---
Gender           0
Married          0
Dependents       0
Self_Employed    0
dtype: int64
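The mode-filling idea can also be tried in isolation. The following is a minimal sketch on a toy DataFrame; the values and the `toy` name are made up for illustration and are not taken from the loan dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in two categorical columns
toy = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "Married": ["Yes", np.nan, "Yes", "No"],
})

# Fill each column with its own most frequent value;
# mode() ignores NaN by default and [0] picks the first mode
for col in ["Gender", "Married"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Looping over the column names keeps the four near-identical fill statements in one place, which may be easier to maintain if more categorical columns are added.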


In [None]:
# df is the DataFrame loaded in STEP 2

print("--- Before filling missing numerical/credit history values ---")
print(df[['LoanAmount', 'Loan_Amount_Term', 'Credit_History']].isnull().sum())

# Assign through df.loc to avoid SettingWithCopyWarning and FutureWarning

# Fill missing values in LoanAmount with the median
df.loc[:, 'LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())

# Fill missing values in Loan_Amount_Term with the mode
df.loc[:, 'Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])

# Fill missing values in Credit_History with the mode
df.loc[:, 'Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])

print("\n--- After filling missing numerical/credit history values (corrected) ---")
print(df[['LoanAmount', 'Loan_Amount_Term', 'Credit_History']].isnull().sum())

--- Before filling missing numerical/credit history values ---
LoanAmount          0
Loan_Amount_Term    0
Credit_History      0
dtype: int64

--- After filling missing numerical/credit history values (corrected) ---
LoanAmount          0
Loan_Amount_Term    0
Credit_History      0
dtype: int64
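One plausible reason for filling 'LoanAmount' with the median rather than the mean is that the median is less affected by a few very large loans. The sketch below illustrates this on made-up numbers; the series and its values are assumptions, not data from the notebook.

```python
import pandas as pd

# Hypothetical loan amounts with one very large value and one gap
loan_amount = pd.Series([100.0, 120.0, 128.0, 141.0, 600.0, None])

mean_value = loan_amount.mean()      # pulled upward by the 600 entry
median_value = loan_amount.median()  # middle of the observed values

# Fill the gap with the median, as the cell above does for LoanAmount
filled = loan_amount.fillna(median_value)

print(mean_value, median_value)
print(filled)
```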


# STEP 6: Exploratory Data Analysis

Visualizes loan status: The first chart shows how the applications split between the two 'Loan_Status' values, 'Y' (approved) and 'N' (rejected).

Uses Plotly Express: The chart is drawn with plotly.express (imported as px), which builds an interactive figure from a single function call.

Clear labeling: The slices are named after the 'Loan_Status' values themselves, so no extra mapping is needed.

Includes percentages: Plotly pie charts print each slice's share of the total on the slice by default, which gives a quick view of the class balance.




In [None]:
import plotly.express as px

loan_status_count = df['Loan_Status'].value_counts()
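# Pass the counts explicitly so the slice sizes reflect the number of applications
fig_loan_status = px.pie(loan_status_count,
                         names=loan_status_count.index,
                         values=loan_status_count.values,
                         title='Loan Approval Status')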
fig_loan_status.show()
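The class shares that a pie chart displays can also be computed as plain numbers. This is a small sketch on a made-up 'Loan_Status' series; the `status` name and its values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical target column: 3 approvals and 1 rejection
status = pd.Series(["Y", "Y", "N", "Y"], name="Loan_Status")

counts = status.value_counts()                      # absolute counts
shares = status.value_counts(normalize=True) * 100  # percentages

print(counts)
print(shares.round(1))
```

Having the percentages as numbers makes it easy to note the class imbalance before choosing evaluation metrics.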

# STEP 7: Distribution of the gender column

In [None]:
gender_count = df['Gender'].value_counts()
fig_gender = px.bar(gender_count,
                    x=gender_count.index,
                    y=gender_count.values,
                    title='Gender Distribution')
fig_gender.show()

# STEP 8: Distribution of the marital status column

In [None]:
married_count = df['Married'].value_counts()
fig_married = px.bar(married_count,
                     x=married_count.index,
                     y=married_count.values,
                     title='Marital Status Distribution')
fig_married.show()

# STEP 9: Distribution of the education column

In [None]:
education_count = df['Education'].value_counts()
fig_education = px.bar(education_count,
                       x=education_count.index,
                       y=education_count.values,
                       title='Education Distribution')
fig_education.show()

# STEP 10: Distribution of the Applicant Income column

In [None]:
fig_applicant_income = px.histogram(df, x='ApplicantIncome',
                                    title='Applicant Income Distribution')
fig_applicant_income.show()

# STEP 11: Relationship between the income of the loan applicant and the loan status

In [None]:
fig_income = px.box(df, x='Loan_Status',
                    y='ApplicantIncome',
                    color="Loan_Status",
                    title='Loan_Status vs ApplicantIncome')
fig_income.show()

# STEP 12: Remove the outliers from the Applicant Income column

In [None]:
# Calculate the IQR
Q1 = df['ApplicantIncome'].quantile(0.25)
Q3 = df['ApplicantIncome'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df = df[(df['ApplicantIncome'] >= lower_bound) & (df['ApplicantIncome'] <= upper_bound)]
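As a worked example of these bounds, the 'ApplicantIncome' quartiles printed by df.describe() earlier (25% = 2898.75, 75% = 5852.50) can be plugged into the same formulas. This sketch assumes the DataFrame has not changed between that output and this step.

```python
# Quartiles of ApplicantIncome copied from the df.describe() output above
q1 = 2898.75
q3 = 5852.50
iqr = q3 - q1                   # 2953.75

lower_bound = q1 - 1.5 * iqr    # -1531.875
upper_bound = q3 + 1.5 * iqr    # 10283.125

print(lower_bound, upper_bound)
```

Under that assumption the lower bound is negative, while the smallest income in the same output is 150, so only rows with incomes above roughly 10283 would be removed.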

# STEP 13: Relationship between the income of the loan co-applicant and the loan status

In [None]:
fig_coapplicant_income = px.box(df,
                                x='Loan_Status',
                                y='CoapplicantIncome',
                                color="Loan_Status",
                                title='Loan_Status vs CoapplicantIncome')
fig_coapplicant_income.show()

# STEP 14: Remove the outliers from the Co-applicant Income column

In [None]:
# Calculate the IQR
Q1 = df['CoapplicantIncome'].quantile(0.25)
Q3 = df['CoapplicantIncome'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df = df[(df['CoapplicantIncome'] >= lower_bound) & (df['CoapplicantIncome'] <= upper_bound)]
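The IQR filter is applied twice with identical logic, so it could be wrapped in a small helper. The function name `remove_iqr_outliers` and the toy incomes below are hypothetical and not part of the original notebook.

```python
import pandas as pd

def remove_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return the rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends, matching the >= and <= checks above
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

# Hypothetical incomes; 81000 lies far above the rest
toy = pd.DataFrame({"ApplicantIncome": [2500, 3000, 3500, 4000, 4500, 81000]})
cleaned = remove_iqr_outliers(toy, "ApplicantIncome")
print(cleaned)
```

With such a helper, the two removal steps would reduce to one call each, for example `df = remove_iqr_outliers(df, 'CoapplicantIncome')`.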

# STEP 15: Relationship between the loan amount and the loan status

In [None]:
fig_loan_amount = px.box(df, x='Loan_Status',
                         y='LoanAmount',
                         color="Loan_Status",
                         title='Loan_Status vs LoanAmount')
fig_loan_amount.show()

# STEP 16: Relationship between credit history and loan status


In [None]:
fig_credit_history = px.histogram(df, x='Credit_History', color='Loan_Status',
                                  barmode='group',
                                  title='Loan_Status vs Credit_History')
fig_credit_history.show()

# STEP 17: Relationship between the property area and the loan status

In [None]:
fig_property_area = px.histogram(df, x='Property_Area', color='Loan_Status',
                                 barmode='group',
                                 title='Loan_Status vs Property_Area')
fig_property_area.show()

# Data Preparation and Training the Loan Approval Prediction Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# STEP 1: Convert categorical columns into numerical ones

In [None]:
cat_cols = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']
df = pd.get_dummies(df, columns=cat_cols)
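To see what this encoding produces, here is a minimal sketch on a made-up two-column frame; the `toy` data is an assumption chosen only to resemble the loan columns.

```python
import pandas as pd

# Hypothetical rows with one numerical and one categorical column
toy = pd.DataFrame({
    "LoanAmount": [128.0, 66.0, 120.0],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

# Each category of Property_Area becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["Property_Area"])
print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one indicator set per original column. For linear models, passing `drop_first=True` to `pd.get_dummies` is a common option to drop the redundant indicator, though the notebook does not use it.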

# STEP 2: Separate the features (X) from the target (y)

In [None]:
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

# STEP 3: Split the data into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
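Because approvals outnumber rejections in this dataset, one optional variation is a stratified split, which keeps the class proportions the same in both sets. The sketch below uses made-up data and the `stratify` parameter of `train_test_split`; it is not what the notebook cell above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 15 approvals and 5 rejections
X_toy = pd.DataFrame({"feature": range(20)})
y_toy = pd.Series(["Y"] * 15 + ["N"] * 5)

# stratify=y_toy preserves the 3:1 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_te.value_counts())
```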

# STEP 4: Scale the numerical columns using StandardScaler

In [None]:
scaler = StandardScaler()
numerical_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
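The scaler is fitted on the training set only and then reused on the test set, so no information from the test data leaks into the preprocessing. A minimal sketch with made-up numbers (the arrays and the `toy_scaler` name are assumptions) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets
train = np.array([[1000.0], [2000.0], [3000.0]])
test = np.array([[4000.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = toy_scaler.transform(test)        # reuses the training mean and std

print(toy_scaler.mean_, toy_scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

After scaling, the training column has mean 0 and standard deviation 1, while the test value is expressed in training-set standard deviations.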


Creating and Training an SVM Model

In [None]:
from sklearn.svm import SVC
model = SVC(random_state=42)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
print(y_pred)

['Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y'
 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N'
 'N' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y'
 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y'
 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y']


Copy X_test and add the predicted values as a new column

In [None]:
# X_test is already a DataFrame; copy it so the model inputs stay unchanged
X_test_df = X_test.copy()
X_test_df['Loan_Status_Predicted'] = y_pred
print(X_test_df.head())

     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
606        -0.435271           0.855642    0.799126          0.262114   
101         0.293605           1.782738    0.387171          0.262114   
255        -0.595391          -0.919042    2.334594          0.262114   
47          0.787604          -0.919042    0.256095          0.262114   
516        -1.126768           0.239472   -0.324387          2.134359   

     Credit_History  Gender_Female  Gender_Male  Married_No  Married_Yes  \
606        0.429731          False         True       False         True   
101        0.429731          False         True        True        False   
255        0.429731           True        False        True        False   
47         0.429731          False         True       False         True   
516        0.429731           True        False       False         True   

     Dependents_0  ...  Dependents_2  Dependents_3+  Education_Graduate  \
606         False  ...       

**Model Training and Evaluation:**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    classification_report
)


rf_model = RandomForestClassifier(random_state=42)
# liblinear is a reasonable solver choice for small datasets
lr_model = LogisticRegression(random_state=42, solver='liblinear')



models = {
    "Random Forest Classifier": rf_model,
    "Logistic Regression": lr_model
}

for name, model in models.items():
    print(f"\n--- Training and Evaluating {name} ---")


    model.fit(X_train, y_train)


    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]


    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred, pos_label='Y'):.4f}")
    print(f"Recall: {recall_score(y_test, y_pred, pos_label='Y'):.4f}")
    print(f"F1-Score: {f1_score(y_test, y_pred, pos_label='Y'):.4f}")
    print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba):.4f}")

    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))


--- Training and Evaluating Random Forest Classifier ---
Accuracy: 0.7791
Precision: 0.7922
Recall: 0.9531
F1-Score: 0.8652
ROC AUC Score: 0.6200

Confusion Matrix:
[[ 6 16]
 [ 3 61]]

Classification Report:
              precision    recall  f1-score   support

           N       0.67      0.27      0.39        22
           Y       0.79      0.95      0.87        64

    accuracy                           0.78        86
   macro avg       0.73      0.61      0.63        86
weighted avg       0.76      0.78      0.74        86
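
The random forest figures above can be reproduced by hand from its confusion matrix, which helps in reading the report. In scikit-learn's layout the rows are the actual classes and the columns the predicted classes, in the order N, Y; the sketch below treats 'Y' as the positive class, as the metric calls do.

```python
# Random forest confusion matrix copied from the output above
# rows = actual (N, Y), columns = predicted (N, Y)
tn, fp = 6, 16
fn, tp = 3, 61

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

The 16 false positives (rejected applications predicted as approved) explain why recall for class N is only 0.27 even though overall accuracy looks reasonable.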


--- Training and Evaluating Logistic Regression ---
Accuracy: 0.8140
Precision: 0.8077
Recall: 0.9844
F1-Score: 0.8873
ROC AUC Score: 0.6818

Confusion Matrix:
[[ 7 15]
 [ 1 63]]

Classification Report:
              precision    recall  f1-score   support

           N       0.88      0.32      0.47        22
           Y       0.81      0.98      0.89        64

    accuracy                           0.81        86
   macro avg       0.84      0.65      0.6