# Loan Approval Prediction Machine Learning for Production
We are going to solve a loan approval prediction problem. This is a binary classification task: given an application, we need to predict whether the loan will be approved or not. Classification is a predictive modeling problem in which a class label is predicted for a given example of input data.

In this case, the company wants to automate the loan eligibility process in real time, based on the details customers provide when filling out an online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.

To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that these customers can be targeted specifically.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer,accuracy_score
from imblearn.combine import SMOTEENN

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
df = pd.read_csv("C:/Users/Dickson/Downloads/Data Science tutorials/Data science project/Datasets/loan-approval-train.csv")

In [5]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


Categorical Columns: Gender (Male/Female), Married (Yes/No), Dependents (possible values: 0, 1, 2, 3+), Education (Graduate/Not Graduate), Self_Employed (No/Yes), Credit_History (stored as 1.0/0.0), Property_Area (Rural/Semiurban/Urban), and Loan_Status (Y/N), which is the target variable.

Numerical Columns: ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term. Loan_ID is a string identifier rather than a numerical feature.
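
The split between categorical and numerical columns can be checked against the dtypes pandas infers. The following is a minimal sketch on three rows copied from the df.head() output above (a subset of the columns; the name sample is just for illustration). Note that Credit_History is stored as a number even though it is treated as a category, so the dtype alone does not settle the question.

```python
import pandas as pd

# Three rows copied from the df.head() output above (subset of columns)
sample = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "Gender": ["Male", "Male", "Male"],
    "ApplicantIncome": [5849, 4583, 3000],
    "CoapplicantIncome": [0.0, 1508.0, 0.0],
    "LoanAmount": [None, 128.0, 66.0],
    "Credit_History": [1.0, 1.0, 1.0],
    "Loan_Status": ["Y", "N", "Y"],
})

# Columns stored as numbers vs. everything else (strings)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```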

# Data Preprocessing
Only the training file is loaded here, so all preprocessing is applied to the single DataFrame df.

Dropping the unwanted identifier column:

In [6]:
df.drop('Loan_ID', axis = 1, inplace = True)

Identifying missing values:

In [7]:
df.isna().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [8]:
df.dtypes

Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

Imputing the missing values:

Fill the missing values in the categorical columns with each column's mode (its most frequent value):

In [9]:
# Fill each categorical column with its most frequent value (mode() ignores NaN)
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])

In [10]:
df.isnull().sum()

Gender                0
Married               0
Dependents            0
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History        0
Property_Area         0
Loan_Status           0
dtype: int64
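
For production use, one option is to learn the fill values once from the training data and reuse them on incoming applications, so that new rows are filled the same way. This is a rough sketch of that idea, not the notebook's own code: the frames train and new_rows and the dictionary fill_values are made-up names with toy values.

```python
import numpy as np
import pandas as pd

# Toy training data and toy incoming rows (values invented for illustration)
train = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
})
new_rows = pd.DataFrame({
    "Gender": ["Female", np.nan],
    "Credit_History": [np.nan, 0.0],
})

# Learn the fill values once, from the training data only
fill_values = {col: train[col].mode()[0] for col in train.columns}

# Apply the same fill values to both frames
train_filled = train.fillna(fill_values)
new_filled = new_rows.fillna(fill_values)

print(fill_values)
print(new_filled)
```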

The categorical columns are now complete; only LoanAmount and Loan_Amount_Term still contain missing values.

Next, we will use the iterative imputer to fill the missing values of LoanAmount and Loan_Amount_Term.

In [11]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

from sklearn.ensemble import RandomForestRegressor

data = df.loc[:, ['LoanAmount','Loan_Amount_Term']]

# Run the imputer with a Random Forest estimator
imp = IterativeImputer(estimator=RandomForestRegressor(), max_iter=10, random_state=0)
data = pd.DataFrame(imp.fit_transform(data), columns=data.columns)
# combine_first fills the NaNs in df from data; it also sorts the columns alphabetically
df = df.combine_first(data)
df.shape

(614, 12)

In [12]:
df.isnull().sum()

ApplicantIncome      0
CoapplicantIncome    0
Credit_History       0
Dependents           0
Education            0
Gender               0
LoanAmount           0
Loan_Amount_Term     0
Loan_Status          0
Married              0
Property_Area        0
Self_Employed        0
dtype: int64
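
As the output above shows, combine_first returns the columns in alphabetical order. If the original column order matters, the imputed values can instead be assigned back to the two columns directly. Here is a small sketch of that variant on a toy frame; the frame toy, its values, and the n_estimators setting are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy frame with invented values
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "LoanAmount": [128.0, 66.0, np.nan, 141.0, 120.0, 95.0],
    "Loan_Amount_Term": [360.0, 360.0, 180.0, np.nan, 360.0, 300.0],
})
num_cols = ["LoanAmount", "Loan_Amount_Term"]

imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=10,
    random_state=0,
)

# Assign the imputed values back column by column; the column order is unchanged
toy[num_cols] = imp.fit_transform(toy[num_cols])

print(toy)
```

Observed values are left as they were; only the two missing cells are filled.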

Now that all the missing values have been imputed, we go on to map the categorical variables to integers.

In [13]:
# Encode the binary categorical columns as 0/1
df['Gender'] = df['Gender'].map({"Male": 0, "Female": 1}).astype(int)
df['Married'] = df['Married'].map({"No": 0, "Yes": 1}).astype(int)
df['Education'] = df['Education'].map({"Not Graduate": 0, "Graduate": 1}).astype(int)
df['Self_Employed'] = df['Self_Employed'].map({"No": 0, "Yes": 1}).astype(int)
df['Credit_History'] = df['Credit_History'].astype(int)

In [14]:
# Encode the multi-level categorical columns as integer codes
df['Property_Area'] = df['Property_Area'].map({"Urban": 0, "Rural": 1, "Semiurban": 2}).astype(int)
df['Dependents'] = df['Dependents'].map({"0": 0, "1": 1, "2": 2, "3+": 3}).astype(int)

In [15]:
df.sample(5)

Unnamed: 0,ApplicantIncome,CoapplicantIncome,Credit_History,Dependents,Education,Gender,LoanAmount,Loan_Amount_Term,Loan_Status,Married,Property_Area,Self_Employed
67,10750,0.0,1,1,1,0,312.0,360.0,Y,1,0,0
442,4707,1993.0,1,3,0,0,148.0,360.0,Y,0,2,0
518,4683,1915.0,1,0,1,0,185.0,360.0,N,0,2,0
90,2958,2900.0,1,0,1,0,131.0,360.0,Y,1,2,0
476,6700,1750.0,1,2,1,0,230.0,300.0,Y,1,2,0


In [16]:
df['Loan_Status'] = df['Loan_Status'].map({"N":0,"Y":1}).astype(int)

In [17]:
df.dtypes

ApplicantIncome        int64
CoapplicantIncome    float64
Credit_History         int32
Dependents             int32
Education              int32
Gender                 int32
LoanAmount           float64
Loan_Amount_Term     float64
Loan_Status            int32
Married                int32
Property_Area          int32
Self_Employed          int32
dtype: object
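
One thing to watch with the mapping step: Series.map returns NaN for any value that is missing from the dictionary, and the following astype(int) then fails with an error that does not name the offending value. The sketch below wraps the mapping in a small helper that reports unmapped values by name. The helper encode_column is hypothetical and not part of the notebook.

```python
import pandas as pd

def encode_column(series, mapping):
    """Map categories to integers, failing loudly on unexpected values."""
    encoded = series.map(mapping)
    # Values that were not found in the mapping come back as NaN
    unknown = series[encoded.isna()].unique().tolist()
    if unknown:
        raise ValueError(f"Unmapped values in {series.name!r}: {unknown}")
    return encoded.astype(int)

area_mapping = {"Urban": 0, "Rural": 1, "Semiurban": 2}

area = pd.Series(["Urban", "Rural", "Semiurban"], name="Property_Area")
codes = encode_column(area, area_mapping)
print(codes.tolist())

# A spelling that is not in the mapping is reported by name
try:
    encode_column(pd.Series(["Semi-Urban"], name="Property_Area"), area_mapping)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```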

# Building Machine Learning Model:
Creating X (input variables) and y (target variable) from the preprocessed DataFrame df.

In [18]:
# normalize=True returns proportions instead of counts
df['Loan_Status'].value_counts(normalize=True)

1    0.687296
0    0.312704
Name: Loan_Status, dtype: float64

In [19]:
X = df.drop("Loan_Status", axis = 1)
y= df['Loan_Status']

In [20]:
X.shape

(614, 11)

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test =train_test_split(X,y,test_size=0.3)

In [22]:
X_train.shape

(429, 11)

In [23]:
X_test.shape

(185, 11)
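
The split above is unseeded, so the rows in each split (and the scores that follow) change from run to run. A possible variant, sketched here on toy arrays rather than the loan data, fixes the seed and stratifies on the target so both splits keep the roughly 70/30 class balance. The seed value 42 and the names X_toy and y_toy are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with roughly the class balance reported above: 70 approved, 30 rejected
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,
    stratify=y_toy,   # keep the class proportions in both splits
    random_state=42,  # arbitrary seed so the split is repeatable
)

# Share of approved loans in each split
print(y_tr.mean(), y_te.mean())
```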

# Using ML algorithm for training

# Logistic Regression

In [24]:
from sklearn.model_selection import cross_val_score

log_crf = LogisticRegression()
# 3-fold cross-validated accuracy on the training split
cross_val_score(log_crf, X_train, y_train, scoring=make_scorer(accuracy_score), cv=3)

array([0.7972028 , 0.85314685, 0.8041958 ])

In [25]:
pred = log_crf.fit(X_train, y_train).predict(X_test)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, pred)

0.7891891891891892

In [26]:
y_test.head(10)

212    1
27     1
414    0
143    1
573    0
26     1
209    0
539    1
257    0
599    1
Name: Loan_Status, dtype: int32

In [27]:
pred[:10]

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1])
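
Comparing the two outputs above, the model predicts class 1 for nine of the ten rows, including three of the four rows whose true label is 0. Accuracy alone hides this. As a small illustration using only the ten values printed above, a confusion matrix and per-class recall make it visible:

```python
from sklearn.metrics import confusion_matrix, recall_score

# The ten true labels and predictions printed above
y_true = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)  # [[1 3]
           #  [0 6]]

# Recall per class: the share of each true class that was predicted correctly
recall_rejected = recall_score(y_true, y_pred, pos_label=0)  # 1 of 4 -> 0.25
recall_approved = recall_score(y_true, y_pred, pos_label=1)  # 6 of 6 -> 1.0
print(recall_rejected, recall_approved)
```

Ten rows are far too few to draw conclusions from; the same calls can be applied to the full y_test and pred.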

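As the class proportions below show, the dataset is imbalanced: about 69% of the loans are approved.

Hence, we can try SMOTEENN, which combines SMOTE oversampling with Edited Nearest Neighbours (ENN) cleaning.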

In [28]:
df['Loan_Status'].value_counts(normalize=True)

1    0.687296
0    0.312704
Name: Loan_Status, dtype: float64

In [29]:
sm = SMOTEENN()
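# fit_resample replaces the older fit_sample, which recent imbalanced-learn releases no longer provide
X_resampled, y_resampled = sm.fit_resample(X, y)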

In [30]:
xr_train,xr_test,yr_train,yr_test=train_test_split(X_resampled, y_resampled,test_size=0.3)

In [31]:
log_crf_sm = LogisticRegression()

cross_val_score(log_crf_sm,xr_train,yr_train, scoring=make_scorer(accuracy_score), cv=3)

array([0.84375   , 0.796875  , 0.66666667])

In [32]:
pred_s = log_crf_sm.fit(xr_train, yr_train).predict(xr_test)
# accuracy_score expects (y_true, y_pred)
accuracy_score(yr_test, pred_s)

0.7349397590361446

SMOTEENN does not seem to improve our model, so we retain our original logistic regression model.

# Saving the model for reuse

In [33]:
import pickle
with open('loan_approval_model.pickle','wb') as f:
    pickle.dump(log_crf,f)
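
To reuse the saved model, it has to be loaded again and given rows with the same columns, in the same order, as the X it was trained on. The following is a minimal round-trip sketch with a tiny stand-in model and a temporary file; the names model, X_demo, and demo_model.pickle are placeholders, and in the notebook the fitted log_crf would take the model's place.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model trained on random data (for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_demo, y_demo)

# Hypothetical temporary path used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo_model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, for example in a serving process: load the model and predict
with open(path, "rb") as f:
    loaded = pickle.load(f)

same_predictions = np.array_equal(model.predict(X_demo), loaded.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and it is safest to load the model with the same scikit-learn version that was used to save it.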

In [34]:
df.head(2)

Unnamed: 0,ApplicantIncome,CoapplicantIncome,Credit_History,Dependents,Education,Gender,LoanAmount,Loan_Amount_Term,Loan_Status,Married,Property_Area,Self_Employed
0,5849,0.0,1,0,1,0,147.112198,360.0,1,0,0,0
1,4583,1508.0,1,1,1,0,128.0,360.0,0,1,1,0
