# Loan predictions

## Problem Statement

We want to automate the loan eligibility process based on the customer details provided in online application forms. You can find the dataset [here](https://drive.google.com/file/d/1h_jl9xqqqHflI5PsuiQd_soNYxzFfjKw/view?usp=sharing). These details include the customer's Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.

|Variable| Description|
|:-------------|:-------------|
|Loan_ID| Unique Loan ID|
|Gender| Male/ Female|
|Married| Applicant married (Y/N)|
|Dependents| Number of dependents|
|Education| Applicant Education (Graduate/ Not Graduate)|
|Self_Employed| Self employed (Y/N)|
|ApplicantIncome| Applicant income|
|CoapplicantIncome| Coapplicant income|
|LoanAmount| Loan amount in thousands|
|Loan_Amount_Term| Term of loan in months|
|Credit_History| credit history meets guidelines|
|Property_Area| Urban/ Semi Urban/ Rural|
|Loan_Status| Loan approved (Y/N)|



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

## 1. Hypothesis Generation

Generating a hypothesis is a major step in the process of analyzing data. This involves understanding the problem and formulating meaningful hypotheses about what could have a real impact on the outcome. This is done BEFORE looking at the data, and we end up with a laundry list of the different analyses we can perform if the data is available.

#### Possible hypotheses
Which applicants are more likely to get a loan?

1. Applicants having a credit history 
2. Applicants with higher applicant and co-applicant incomes
3. Applicants with higher education level
4. Properties in urban areas with high growth prospects

Do more brainstorming and create some hypotheses of your own. Remember that the data might not be sufficient to test all of these, but forming these enables a better understanding of the problem.

## 2. Data Exploration
Let's do some basic data exploration here and come up with some inferences about the data. Go ahead and try to figure out some irregularities and address them in the next section. 

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [3]:

df = pd.read_csv("data.csv") 
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


One of the key challenges in any data set is missing values. Let's start by checking which columns contain missing values.

In [4]:
df.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

Look at some basic statistics for numerical variables.

In [5]:
df.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


1. How many applicants have a `Credit_History`? (`Credit_History` has value 1 for those who have a credit history and 0 otherwise)
2. Is the `ApplicantIncome` distribution in line with your expectation? Similarly, what about `CoapplicantIncome`?
3. Tip: Can you see a possible skewness in the data by comparing the mean to the median, i.e. the 50% figure of a feature.
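
As a rough sketch of the tip above, the ratio of mean to median can be computed directly. The numbers below are copied from the `df.describe()` output; with the real data you would use `df[col].mean()` and `df[col].median()` instead. The name `stats` is only for illustration.

```python
import pandas as pd

# Mean and median (the 50% row) copied from the df.describe() output above
stats = pd.DataFrame(
    {
        "mean": [5403.459283, 1621.245798, 146.412162],
        "median": [3812.5, 1188.5, 128.0],
    },
    index=["ApplicantIncome", "CoapplicantIncome", "LoanAmount"],
)

# A ratio well above 1 suggests a long right tail (a few very large values)
stats["mean_over_median"] = stats["mean"] / stats["median"]
print(stats.round(2))
```

All three ratios are above 1, which is consistent with right-skewed distributions.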



In [6]:
print(f"With credit history: {df[df['Credit_History'] > 0.5]['Credit_History'].count()}")
print(f"Without credit history: {df[df['Credit_History'] < 0.5]['Credit_History'].count()}")

With credit history: 475
Without credit history: 89


In [7]:
from scipy.stats import skewtest

In [8]:
skewtest(df['ApplicantIncome'])

SkewtestResult(statistic=23.464895502137175, pvalue=9.31490691480801e-122)

In [9]:
skewtest(df['CoapplicantIncome'])

SkewtestResult(statistic=24.61339313159598, pvalue=9.080377361881494e-134)

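Both tests return p-values far below any conventional significance level, so neither income column has the skewness of a normal distribution. In the `describe()` output the means sit well above the medians, which points to right skew (a long tail of high incomes).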

Let's discuss the nominal (categorical) variables. Look at the number of unique values in each of them.

In [10]:
df['Gender'].unique()
df['Married'].unique()
df['Dependents'].unique()
df['Education'].unique()
df['Self_Employed'].unique()
df['Property_Area'].unique()
df['Loan_Status'].unique()

array(['Y', 'N'], dtype=object)
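
A notebook cell only displays the value of its last expression, which is why only the `Loan_Status` values appear above. One possible way to see every column at once is `DataFrame.nunique()`. The sketch below uses a small made-up frame (`toy` is hypothetical, not the loan data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for a few categorical columns of the loan data
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "Married": ["No", "Yes", "Yes", "No"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
})

# nunique() counts distinct values per column and ignores missing values by default
n_unique = toy.nunique()
print(n_unique)
```

On the real data the same call would be `df[['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']].nunique()`.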

Explore further using the frequency of different categories in each nominal variable. Exclude the ID for obvious reasons.

In [11]:
df['Gender'].value_counts(dropna=False)

Male      489
Female    112
NaN        13
Name: Gender, dtype: int64

In [12]:
df['Married'].value_counts(dropna=False)

Yes    398
No     213
NaN      3
Name: Married, dtype: int64

In [13]:
df['Dependents'].value_counts(dropna=False)

0      345
1      102
2      101
3+      51
NaN     15
Name: Dependents, dtype: int64

In [14]:
df['Education'].value_counts(dropna=False)

Graduate        480
Not Graduate    134
Name: Education, dtype: int64

In [15]:
df['Self_Employed'].value_counts(dropna=False)

No     500
Yes     82
NaN     32
Name: Self_Employed, dtype: int64

In [16]:
df['Property_Area'].value_counts(dropna=False)

Semiurban    233
Urban        202
Rural        179
Name: Property_Area, dtype: int64

In [17]:
df['Loan_Status'].value_counts(dropna=False)

Y    422
N    192
Name: Loan_Status, dtype: int64

### Distribution analysis

Study the distribution of the numeric variables. Plot a histogram of `ApplicantIncome` and try different numbers of bins.



In [18]:
import plotly.express as px 

In [19]:
fig = px.histogram(df, x='ApplicantIncome', nbins=50)
fig.show()

In [21]:
fig = px.histogram(df, x='CoapplicantIncome', nbins=60)
fig.show()

In [22]:
fig = px.histogram(df, x='LoanAmount', nbins=50)
fig.show()


Look at box plots to understand the distributions.
To see the bulk of the data without the outliers, keep only the rows that lie no more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.

In [23]:
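# Quartiles and IQR of the numeric columns only (numeric_only avoids errors on text columns)
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
# Keep rows whose ApplicantIncome lies within the 1.5 * IQR fences
col = 'ApplicantIncome'
df[df[col].between(Q1[col] - 1.5 * IQR[col], Q3[col] + 1.5 * IQR[col])].describe()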

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,564.0,564.0,544.0,550.0,517.0
mean,4124.723404,1692.294184,133.8125,341.890909,0.841393
std,1926.989806,2979.23233,59.06735,65.761526,0.365663
min,150.0,0.0,9.0,12.0,0.0
25%,2744.0,0.0,100.0,360.0,1.0
50%,3638.5,1405.5,124.0,360.0,1.0
75%,5010.5,2337.0,159.25,360.0,1.0
max,10139.0,41667.0,495.0,480.0,1.0
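
As a worked check of the filter, the fences for `ApplicantIncome` can be recomputed by hand from the quartiles in the first `df.describe()` output (25% = 2877.5, 75% = 5795.0). The variable names are only for illustration.

```python
# ApplicantIncome quartiles copied from the first df.describe() output
q1, q3 = 2877.5, 5795.0
iqr = q3 - q1

# Values outside [lower, upper] are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print(lower, upper)  # -1498.75 10171.25
```

The lower fence is negative, so no low incomes are removed, and the largest remaining income (10139) sits just under the upper fence.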


In [24]:
# fig = px.box(df[~((df[['TotalIncome']] < (Q1 - 1.5 * IQR)) |(df[['TotalIncome']] > (Q3 + 1.5 * IQR))).any(axis=1)],x='Loan_Status', y="TotalIncome")
# fig.show()

In [25]:
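Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1

# Compare the numeric columns only and drop rows with an outlier in any of them
num = df[Q1.index]
df_strip = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]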

In [26]:
fig = px.box(df_strip, y="ApplicantIncome")
fig.show()

In [27]:
fig = px.box(df_strip, y="LoanAmount")
fig.show()

Look at the distribution of income segregated by `Gender` and by `Education`.

In [138]:
fig = px.box(df_strip, x='Gender', y="ApplicantIncome")
fig.show()

In [28]:
fig = px.box(df_strip, x='Education', y="ApplicantIncome")
fig.show()

Look at the histogram and boxplot of LoanAmount

In [29]:
fig = px.box(df_strip, x='Education', y="LoanAmount")
fig.show()

In [31]:
fig = px.box(df_strip, x='Loan_Status', y="ApplicantIncome")
fig.show()

In [34]:
fig = px.box(df_strip, x='Loan_Status', y="LoanAmount")
fig.show()

In [32]:
df_strip.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


There might be some extreme values. Both `ApplicantIncome` and `LoanAmount` require some amount of data munging. `LoanAmount` has missing as well as extreme values, while `ApplicantIncome` has a few extreme values, which demand a deeper understanding.

### Categorical variable analysis

Try to understand the categorical variables in more detail using `pandas.DataFrame.pivot_table` and some visualizations.

In [136]:
pd.pivot_table(df, values=['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount'], 
                            index=['Gender', 'Education'], 
                            aggfunc={'ApplicantIncome': np.mean,
                                     'CoapplicantIncome': np.mean, 
                                     'LoanAmount': np.mean})

Unnamed: 0_level_0,Unnamed: 1_level_0,ApplicantIncome,CoapplicantIncome,LoanAmount
Gender,Education,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,Graduate,4646.467391,1231.206522,129.855556
Female,Not Graduate,4629.7,541.3,111.736842
Male,Graduate,5992.345745,1845.691277,157.99449
Male,Not Graduate,3630.061947,1401.00885,119.654206


In [137]:
pd.pivot_table(df, values=['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount'], 
                            index=['Property_Area', 'Dependents'], 
                            aggfunc={'ApplicantIncome': np.mean,
                                     'CoapplicantIncome': np.mean, 
                                     'LoanAmount': np.mean})

Unnamed: 0_level_0,Unnamed: 1_level_0,ApplicantIncome,CoapplicantIncome,LoanAmount
Property_Area,Dependents,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Rural,0,4757.697248,1690.358899,142.654206
Rural,1,4769.809524,1352.904762,141.947368
Rural,2,5611.37931,1964.689655,173.740741
Rural,3+,11305.888889,1056.555556,186.5
Semiurban,0,4854.669231,1617.884615,133.921875
Semiurban,1,7042.8,1319.3,167.210526
Semiurban,2,4316.162162,1222.27027,140.594595
Semiurban,3+,6566.8,1891.5,196.65
Urban,0,5158.632075,1562.273585,131.111111
Urban,1,5518.878049,1568.121951,158.390244


##### Loan status by education

In [138]:
pd.pivot_table(df, values=['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount'], 
                            index=['Loan_Status', 'Education'], 
                            aggfunc={'ApplicantIncome': np.mean,
                                     'CoapplicantIncome': np.mean, 
                                     'LoanAmount': np.mean})

Unnamed: 0_level_0,Unnamed: 1_level_0,ApplicantIncome,CoapplicantIncome,LoanAmount
Loan_Status,Education,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
N,Graduate,6114.514286,2111.028571,161.38806
N,Not Graduate,3646.442308,1249.903846,122.234043
Y,Graduate,5751.576471,1555.423294,151.093656
Y,Not Graduate,3860.256098,1293.439024,116.1625


## 3. Data Cleaning

This step typically involves imputing missing values and treating outliers. 

### Imputing Missing Values

Missing values may not always be NaNs. For instance, the `Loan_Amount_Term` might be 0, which does not make sense.
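
A minimal sketch of handling such disguised missing values, assuming a zero term should count as missing. The series below is made up; the `describe()` output earlier shows a minimum `Loan_Amount_Term` of 12, so this dataset may not need the step at all.

```python
import numpy as np
import pandas as pd

# Hypothetical loan terms, including a zero that cannot be a real term
term = pd.Series([360.0, 0.0, 180.0, np.nan], name="Loan_Amount_Term")

# Turn the placeholder zero into NaN so that isna() and fillna() can see it
term_clean = term.replace(0, np.nan)
print(term_clean.isna().sum())  # 2
```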



Impute missing values for all columns. Use the values you find most meaningful: mean, mode, median, zero, or perhaps different values for different groups.
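
One way to impute different values for different groups is `groupby(...).transform(...)`, sketched here on a made-up frame. Grouping `LoanAmount` by `Education` is only an illustrative choice, not something the notebook itself does.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the NaNs mark the values to impute
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "LoanAmount": [150.0, np.nan, 170.0, 100.0, 120.0, np.nan],
})

# transform() returns one value per row: the median of that row's group
group_median = toy.groupby("Education")["LoanAmount"].transform("median")
toy["LoanAmount"] = toy["LoanAmount"].fillna(group_median)
print(toy)
```

Each missing amount is replaced by the median of its own education group (160 and 110 here) rather than by a single overall value.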

In [139]:
df.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [141]:
# Fill categorical gaps with the most frequent value (mode)
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed']:
    df[col] = df[col].fillna(df[col].mode()[0])
# Drop rows with no Credit_History instead of guessing it; copy() avoids SettingWithCopyWarning
df = df[df['Credit_History'].notna()].copy()
# Fill numeric gaps with the column mean
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean())
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())

In [142]:
df.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

#### Categorizing co-applicants
It may be a good idea to flag whether an application has a co-applicant, based on whether it reports any co-applicant income.

In [143]:
def cat_coapp(income):
    if (income > 0):
        return 1
    return 0

In [144]:
df['has_co_app'] = df['CoapplicantIncome'].apply(cat_coapp)
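
The same flag can also be computed without a helper function by comparing the whole column at once. This is an alternative sketch, shown on the first four `CoapplicantIncome` values from `df.head()`.

```python
import pandas as pd

# First four CoapplicantIncome values from df.head()
toy = pd.DataFrame({"CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0]})

# Compare the whole column at once, then cast True/False to 1/0
toy["has_co_app"] = (toy["CoapplicantIncome"] > 0).astype(int)
print(toy)
```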

### Extreme values
Try a log transformation to dampen the effect of extreme values. The cells below apply it to `ApplicantIncome`; the same approach can be used for `LoanAmount`. Plot the histogram before and after the transformation.

In [145]:
log_trans = pd.DataFrame(df['ApplicantIncome'])
log_trans['LogIncome'] = df['ApplicantIncome'] + abs(df['ApplicantIncome'].min())

In [146]:
log_trans['LogIncome'] = np.log(log_trans['LogIncome'])

In [147]:
fig = px.histogram(log_trans, x='ApplicantIncome')
fig.show()

In [148]:
fig = px.histogram(log_trans, x='LogIncome')
fig.show()
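
To put a number on what the two histograms show, the skewness can be compared before and after the transformation. The sketch below uses made-up incomes and `np.log1p` (log of 1 + x), which is a swapped-in variant of the shifted `np.log` used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme value
income = pd.Series([150, 2500, 3000, 3800, 5800, 81000], dtype=float)

# log1p compresses large values far more than small ones and is defined at 0
log_income = np.log1p(income)

print(f"skew before: {income.skew():.2f}")
print(f"skew after:  {log_income.skew():.2f}")
```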

Combine both incomes as total income and take a log transformation of the same.

In [149]:
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome'] = np.log(df['TotalIncome'])

## 3a. Analyzing the transformed data
Let's look at how the loans are distributed with pie charts

In [157]:
pd.pivot_table(df, values=['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount'], 
                            index=['Loan_Status', 'Education'], 
                            aggfunc={'ApplicantIncome': np.mean,
                                     'CoapplicantIncome': np.mean, 
                                     'LoanAmount': np.mean})

Unnamed: 0_level_0,Unnamed: 1_level_0,ApplicantIncome,CoapplicantIncome,LoanAmount
Loan_Status,Education,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
N,Graduate,6215.992308,1828.469231,158.641861
N,Not Graduate,3629.897959,1186.755102,122.743714
Y,Graduate,5814.977636,1506.261725,149.555257
Y,Not Graduate,3847.75,1331.402778,116.405233


In [172]:
df['Gender']

0        Male
1        Male
2        Male
3        Male
4        Male
        ...  
609    Female
610      Male
611      Male
612      Male
613    Female
Name: Gender, Length: 564, dtype: object

In [69]:
f_acc = df[df['Gender'] == 'Male'][['Loan_Status', 'Gender', 'Loan_ID']].groupby(['Loan_Status', 'Gender']).count().reset_index()
px.pie(f_acc, values='Loan_ID', names='Loan_Status', title='Loan Approval Rating for Males')

In [70]:
f_acc

Unnamed: 0,Loan_Status,Gender,Loan_ID
0,N,Male,150
1,Y,Male,339


In [179]:
f_acc = df[df['Gender'] == 'Female'][['Loan_Status', 'Gender', 'Loan_ID']].groupby(['Loan_Status', 'Gender']).count().reset_index()
px.pie(f_acc, values='Loan_ID', names='Loan_Status', title='Loan Approval Rating for Females')

Female loan acceptance rate: 64.4%

Male loan acceptance rate: 69.1%

In [92]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler

In [136]:
# Percentage of approved (Y) and rejected (N) loans within each Dependents group;
# normalize='columns' divides by the group totals instead of hard-coding them
loan_dep = pd.crosstab(df['Loan_Status'], df['Dependents'], normalize='columns') * 100
# Show approvals first
loan_dep = loan_dep.loc[['Y', 'N']]
loan_dep

Unnamed: 0,0,1,2,3+
Y,68.985507,64.705882,75.247525,64.705882
N,31.014493,35.294118,24.752475,35.294118


In [137]:
px.bar(loan_dep.transpose())

In [107]:
# 'index' is not a column, so pass the index itself as the slice names
px.pie(loan_dep, names=loan_dep.index, values='0', title='0 Dependents')

In [88]:
px.pie(loan_dep, names=loan_dep.index, values='1', title='1 Dependent')

In [46]:
df[['Loan_Status', 'Dependents', 'Loan_ID']].groupby(['Loan_Status', 'Dependents']).count().unstack()#.reset_index()

Unnamed: 0_level_0,Loan_ID,Loan_ID,Loan_ID,Loan_ID
Dependents,0,1,2,3+
Loan_Status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
N,107,36,25,18
Y,238,66,76,33


In [150]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,has_co_app,TotalIncome
0,LP001002,Male,No,0,Graduate,No,5849,0.0,145.088398,360.0,1.0,Urban,Y,0,8.674026
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,1,8.714568
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,0,8.006368
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,1,8.505323
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,0,8.699515


In [184]:
f_acc = df[df['Education'] == 'Graduate'][['Loan_Status', 'Education', 'Loan_ID']].groupby(['Loan_Status', 'Education']).count().reset_index()
px.pie(f_acc, values='Loan_ID', names='Loan_Status', title='Loan Approval Rating for Graduates')

In [185]:
f_acc = df[df['Education'] == 'Not Graduate'][['Loan_Status', 'Education', 'Loan_ID']].groupby(['Loan_Status', 'Education']).count().reset_index()
px.pie(f_acc, values='Loan_ID', names='Loan_Status', title='Loan Approval Rating for Non-Graduates')

0: 67.3%
1: 60%
2: 73.7%
3: 66.7%

In [187]:
f_acc = df[df['Self_Employed'] == 'Yes'][['Loan_Status', 'Self_Employed', 'Loan_ID']].groupby(['Loan_Status', 'Self_Employed']).count().reset_index()
px.pie(f_acc, values='Loan_ID', names='Loan_Status', title='Loan Approval Rating for Self Employed')

In [188]:
f_acc = df[df['Self_Employed'] == 'No'][['Loan_Status', 'Self_Employed', 'Loan_ID']].groupby(['Loan_Status', 'Self_Employed']).count().reset_index()
px.pie(f_acc, values='Loan_ID', names='Loan_Status', title='Loan Approval Rating for Non Self Employed')
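
Instead of reading one pie chart per group, the approval rate for each category can be computed in one step. The sketch below uses made-up rows (`toy` is hypothetical); on the real data the same pattern would be `df['Loan_Status'].eq('Y').groupby(df['Self_Employed']).mean()`.

```python
import pandas as pd

# Hypothetical applications
toy = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "No"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

# The mean of a boolean column is the share of True values, i.e. the approval rate
approved = toy["Loan_Status"].eq("Y")
rate = approved.groupby(toy["Self_Employed"]).mean().mul(100).round(1)
print(rate)
```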

## 4. Building a Predictive Model

In [128]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [129]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,TotalIncome,has_co_app
0,LP001002,Male,No,0,Graduate,No,5849,0.0,145.088398,360.0,1.0,Urban,Y,8.674026,0
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,8.714568,1
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,8.006368,0
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,8.505323,1
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,8.699515,0


In [130]:
X = df.drop(['Loan_ID', 'ApplicantIncome', 'CoapplicantIncome'], axis=1)
X = pd.get_dummies(X, drop_first=True)

In [131]:
X

Unnamed: 0,LoanAmount,Loan_Amount_Term,Credit_History,TotalIncome,has_co_app,Gender_Male,Married_Yes,Dependents_1,Dependents_2,Dependents_3+,Education_Not Graduate,Self_Employed_Yes,Property_Area_Semiurban,Property_Area_Urban,Loan_Status_Y
0,145.088398,360.0,1.0,8.674026,0,1,0,0,0,0,0,0,0,1,1
1,128.000000,360.0,1.0,8.714568,1,1,1,1,0,0,0,0,0,0,0
2,66.000000,360.0,1.0,8.006368,0,1,1,0,0,0,0,1,0,1,1
3,120.000000,360.0,1.0,8.505323,1,1,1,0,0,0,1,0,0,1,1
4,141.000000,360.0,1.0,8.699515,0,1,0,0,0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,71.000000,360.0,1.0,7.972466,0,0,0,0,0,0,0,0,0,0,1
610,40.000000,180.0,1.0,8.320205,0,1,1,0,0,1,0,0,0,0,1
611,253.000000,360.0,1.0,9.025456,1,1,1,1,0,0,0,0,0,1,1
612,187.000000,360.0,1.0,8.933664,0,1,1,0,1,0,0,0,0,1,1


In [132]:
X_train, X_test, y_train, y_test = train_test_split(X.drop(['Loan_Status_Y'], axis=1),X['Loan_Status_Y'], test_size=0.3, random_state=42)

In [133]:
clf = RandomForestClassifier()

In [134]:
clf.fit(X_train, y_train)

RandomForestClassifier()

In [135]:
y_pred = clf.predict(X_test)

In [136]:
from sklearn.metrics import accuracy_score

In [137]:
base_acc = accuracy_score(y_test, y_pred)

In [138]:
base_acc

0.8647058823529412

Try a randomized hyperparameter search to improve the results.

In [139]:
param_grid = {
    'n_estimators': [50, 100, 150, 200, 250, 300, 350, 400],
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15, 20, 25, None],
    # 'auto' means the same as 'sqrt' for classifiers; recent scikit-learn releases no longer accept it
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10]
}

In [140]:
from sklearn.model_selection import RandomizedSearchCV

In [141]:
rf_random = RandomizedSearchCV(estimator=clf, param_distributions=param_grid, n_iter=50, cv=3, random_state=42)

In [142]:
rf_random.fit(X_train, y_train)

RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(), n_iter=50,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': [5, 10, 15, 20, 25, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [50, 100, 150, 200, 250,
                                                         300, 350, 400]},
                   random_state=42)

In [143]:
rf_random.best_params_

{'n_estimators': 50,
 'min_samples_split': 10,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': None,
 'criterion': 'gini'}

In [144]:
clf = RandomForestClassifier(**rf_random.best_params_)  # reuse the best parameters found by the search
clf.fit(X_train, y_train)

RandomForestClassifier(min_samples_split=10, n_estimators=50)

In [145]:
y_pred = clf.predict(X_test)

In [146]:
tuned_acc = accuracy_score(y_test, y_pred)

In [147]:
tuned_acc

0.8529411764705882

In [150]:
print(f'Improvement over default random forest classifier: {tuned_acc - base_acc}')

Improvement over default random forest classifier: -0.01176470588235301
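
The gap of about 0.012 corresponds to roughly two predictions on a test set of about 170 rows (30% of the 564 remaining rows), so a single split cannot really separate the two models. A possible way to get a steadier comparison is cross-validation. The sketch below uses synthetic data from `make_classification` rather than the loan data, and the fixed `random_state` values are assumptions added for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared feature matrix and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

default_rf = RandomForestClassifier(random_state=42)
tuned_rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=42)

# Five accuracy scores per model, one for each held-out fold
default_scores = cross_val_score(default_rf, X_demo, y_demo, cv=5, scoring="accuracy")
tuned_scores = cross_val_score(tuned_rf, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"default: {default_scores.mean():.3f} +/- {default_scores.std():.3f}")
print(f"tuned:   {tuned_scores.mean():.3f} +/- {tuned_scores.std():.3f}")
```

If the two means differ by less than the spread across folds, the tuned settings are probably not a real improvement.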


## 5. Using Pipeline
If you haven't used pipelines before, convert your data preparation, feature engineering, and modeling steps into a `Pipeline`. This will help with deployment.

The goal here is to create the pipeline that will take one row of our dataset and predict the probability of being granted a loan.

`pipeline.predict(x)`
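
A minimal sketch of such a pipeline, assuming median and most-frequent imputation plus one-hot encoding in front of a random forest. The training rows, the column selection, and the names `train`, `numeric`, `categorical`, and `preprocess` are all made up for illustration and are not the notebook's actual preparation steps. Note that `predict` returns the class label, while `predict_proba` returns the probabilities.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows shaped like the loan data (values are made up)
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "Education": ["Graduate", "Graduate", "Not Graduate",
                  "Graduate", "Not Graduate", "Graduate"],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 4100],
    "CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0, 0.0, 900.0],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 141.0, 110.0],
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "Loan_Status": ["Y", "N", "N", "Y", "N", "Y"],
})
numeric = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Credit_History"]
categorical = ["Gender", "Education"]

# Numeric columns: fill gaps with the median.
# Categorical columns: fill gaps with the mode, then one-hot encode.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(train[numeric + categorical], train["Loan_Status"])

# Predict for a single row; the double brackets keep it a one-row DataFrame
row = train[numeric + categorical].iloc[[0]]
proba = pipeline.predict_proba(row)
classes = list(pipeline.named_steps["model"].classes_)
print(f"P(approved) = {proba[0, classes.index('Y')]:.2f}")
```

Because the imputers and the encoder are fitted inside the pipeline, a raw application row with missing values can be passed straight to `predict` or `predict_proba` at deployment time.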

## 6. Deploy your model to the cloud and test it with Postman, Bash, or Python