# Classification Problem: Loan Prediction

In this notebook we perform the preprocessing steps that come before the predictive modeling task.

## Preliminary and Preprocessing

__Import Libraries__

In [1]:
# essentials 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# preprocessing 
from sklearn.preprocessing import StandardScaler

# model selection 
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV

# pipeline
from sklearn.pipeline import Pipeline

# ML algorithms 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

__Load dataset__

In [2]:
dataset = pd.read_csv('train.csv')

Take a look at the first five rows of the data.

In [3]:
dataset.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [4]:
dataset.shape

(614, 13)

The dataset contains 614 rows and 13 columns (including the identifier column Loan_ID and the label column Loan_Status). Let's get more information about the columns in this dataset.

In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB


Looks like we need to preprocess our data before throwing it into machine learning models. Several columns contain null values. Let's see which ones.

In [6]:
print("True if contains null values, False otherwise:\n{}\n".format(dataset.isnull().any()))
print("Column with null values:\n{}".format(list(dataset.columns[dataset.isnull().any().values])))

True if contains null values, False otherwise:
Loan_ID              False
Gender                True
Married               True
Dependents            True
Education            False
Self_Employed         True
ApplicantIncome      False
CoapplicantIncome    False
LoanAmount            True
Loan_Amount_Term      True
Credit_History        True
Property_Area        False
Loan_Status          False
dtype: bool

Column with null values:
['Gender', 'Married', 'Dependents', 'Self_Employed', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']


In [7]:
dataset.loc[:, dataset.isnull().any().values].isnull().sum()

Gender              13
Married              3
Dependents          15
Self_Employed       32
LoanAmount          22
Loan_Amount_Term    14
Credit_History      50
dtype: int64
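To put these counts in context, it can help to view them as a share of all rows. The following is a minimal sketch on a small made-up frame; the `toy` data and the `missing_summary` helper are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    counts = counts[counts > 0]
    return pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    }).sort_values("n_missing", ascending=False)

# Hypothetical toy data standing in for the loan dataset
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 141.0],
    "Education": ["Graduate"] * 4,
})
summary = missing_summary(toy)
print(summary)
```

Applied to the real data, the same helper would be called as `missing_summary(dataset)`.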

It looks like we have to inspect the columns one by one to understand them better and decide which imputation method to use.

### Imputing column 'Married'

In [8]:
dataset[dataset['Married'].isnull()]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
104,LP001357,Male,,,Graduate,No,3816,754.0,160.0,360.0,1.0,Urban,Y
228,LP001760,Male,,,Graduate,No,4758,0.0,158.0,480.0,1.0,Semiurban,Y
435,LP002393,Female,,,Graduate,No,10047,0.0,,240.0,1.0,Semiurban,Y


In [9]:
dataset['Married'].value_counts()

Yes    398
No     213
Name: Married, dtype: int64

Rather than dropping these rows, I impute them with 2 Yes and 1 No, since there are more Yes than No.

In [10]:
dataset[dataset['Married'].isnull()]['Married']

104    NaN
228    NaN
435    NaN
Name: Married, dtype: object

In [11]:
dataset['Married'].fillna({104 : 'Yes', 228 : 'Yes', 435 : 'No'}, inplace=True)

In [12]:
dataset['Married'].isnull().any()

False

No missing values in column 'Married'
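The `fillna(..., inplace=True)` call on a selected column relies on pandas modifying the original frame through that selection, which newer pandas versions may not do. Here is a sketch of an equivalent that assigns the result back explicitly. The toy frame is made up; only its index labels mirror the three rows above.

```python
import numpy as np
import pandas as pd

# Toy frame; the index labels mimic the three rows with missing 'Married'
toy = pd.DataFrame(
    {"Married": ["Yes", np.nan, "No", np.nan, np.nan]},
    index=[0, 104, 1, 228, 435],
)

# Map each missing row's index label to the value we want there
fill_values = pd.Series({104: "Yes", 228: "Yes", 435: "No"})

# Assign the filled column back instead of relying on inplace=True
toy["Married"] = toy["Married"].fillna(fill_values)
print(toy)
```

Because `fillna` aligns the replacement values on the index, only the rows named in `fill_values` are touched.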

### Imputing column 'Gender'

In [13]:
dataset[dataset['Gender'].isnull()]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
23,LP001050,,Yes,2,Not Graduate,No,3365,1917.0,112.0,360.0,0.0,Rural,N
126,LP001448,,Yes,3+,Graduate,No,23803,0.0,370.0,360.0,1.0,Rural,Y
171,LP001585,,Yes,3+,Graduate,No,51763,0.0,700.0,300.0,1.0,Urban,Y
188,LP001644,,Yes,0,Graduate,Yes,674,5296.0,168.0,360.0,1.0,Rural,Y
314,LP002024,,Yes,0,Graduate,No,2473,1843.0,159.0,360.0,1.0,Rural,N
334,LP002103,,Yes,1,Graduate,Yes,9833,1833.0,182.0,180.0,1.0,Urban,Y
460,LP002478,,Yes,0,Graduate,Yes,2083,4083.0,160.0,360.0,,Semiurban,Y
467,LP002501,,Yes,0,Graduate,No,16692,0.0,110.0,360.0,1.0,Semiurban,Y
477,LP002530,,Yes,2,Graduate,No,2873,1872.0,132.0,360.0,0.0,Semiurban,N
507,LP002625,,No,0,Graduate,No,3583,0.0,96.0,360.0,1.0,Urban,N


In [14]:
dataset['Gender'].value_counts()

Male      489
Female    112
Name: Gender, dtype: int64

In [15]:
dataset[dataset['Gender'].isnull()]['Gender']

23     NaN
126    NaN
171    NaN
188    NaN
314    NaN
334    NaN
460    NaN
467    NaN
477    NaN
507    NaN
576    NaN
588    NaN
592    NaN
Name: Gender, dtype: object

There are 13 missing values, and the ratio of female to male is approximately 1:4 (112 to 489). Here we fill the first 8 missing values with Male and the remaining 5 with Female. Note that a strictly proportional split would be closer to 10 or 11 Male and 2 or 3 Female.

In [16]:
dataset[dataset['Gender'].isnull()]['Gender'][:8].fillna('Male')

23     Male
126    Male
171    Male
188    Male
314    Male
334    Male
460    Male
467    Male
Name: Gender, dtype: object

In [17]:
dataset[dataset['Gender'].isnull()]['Gender'][8:].fillna('Female')

477    Female
507    Female
576    Female
588    Female
592    Female
Name: Gender, dtype: object

In [18]:
dataset['Gender'].fillna(dataset[dataset['Gender'].isnull()]['Gender'].iloc[:8].fillna('Male'), inplace=True)

In [19]:
dataset[dataset['Gender'].isnull()]['Gender']

477    NaN
507    NaN
576    NaN
588    NaN
592    NaN
Name: Gender, dtype: object

In [20]:
dataset['Gender'].fillna('Female', inplace=True)

In [21]:
dataset['Gender'].isnull().any()

False

No missing values.
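The two-step fill above can also be written as one small helper that works on index labels. This is only a sketch under the assumption that the index is unique; `fill_split` is a hypothetical name and the toy column is made up.

```python
import numpy as np
import pandas as pd

def fill_split(series, first_value, n_first, rest_value):
    """Fill the first n_first missing entries with first_value, the rest with rest_value."""
    missing_idx = series.index[series.isnull()]
    filled = series.copy()
    filled.loc[missing_idx[:n_first]] = first_value
    filled.loc[missing_idx[n_first:]] = rest_value
    return filled

# Toy column with five missing entries (made-up data)
gender = pd.Series(["Male", np.nan, np.nan, "Female", np.nan, np.nan, np.nan])
filled = fill_split(gender, "Male", 3, "Female")
print(filled.tolist())
```

On the real data the corresponding call would be `fill_split(dataset['Gender'], 'Male', 8, 'Female')`, with the result assigned back to the column.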

### Imputing column 'Dependents'

In [22]:
dataset[dataset['Dependents'].isnull()]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
102,LP001350,Male,Yes,,Graduate,No,13650,0.0,,360.0,1.0,Urban,Y
104,LP001357,Male,Yes,,Graduate,No,3816,754.0,160.0,360.0,1.0,Urban,Y
120,LP001426,Male,Yes,,Graduate,No,5667,2667.0,180.0,360.0,1.0,Rural,Y
226,LP001754,Male,Yes,,Not Graduate,Yes,4735,0.0,138.0,360.0,1.0,Urban,N
228,LP001760,Male,Yes,,Graduate,No,4758,0.0,158.0,480.0,1.0,Semiurban,Y
293,LP001945,Female,No,,Graduate,No,5417,0.0,143.0,480.0,0.0,Urban,N
301,LP001972,Male,Yes,,Not Graduate,No,2875,1750.0,105.0,360.0,1.0,Semiurban,Y
332,LP002100,Male,No,,Graduate,No,2833,0.0,71.0,360.0,1.0,Urban,Y
335,LP002106,Male,Yes,,Graduate,Yes,5503,4490.0,70.0,,1.0,Semiurban,Y
346,LP002130,Male,Yes,,Not Graduate,No,3523,3230.0,152.0,360.0,0.0,Rural,N


In [23]:
dataset['Dependents'].value_counts()

0     345
1     102
2     101
3+     51
Name: Dependents, dtype: int64

Here we simply impute with the mode, which is '0'.

In [24]:
dataset['Dependents'].fillna('0', inplace=True)

In [25]:
dataset['Dependents'].isnull().any()

False
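Instead of reading the mode off `value_counts()` and typing it in by hand, it can be computed directly. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Dependents": ["0", "1", np.nan, "0", "3+"],
    "Self_Employed": ["No", np.nan, "No", "Yes", "No"],
})

for col in ["Dependents", "Self_Employed"]:
    most_common = toy[col].mode()[0]  # mode() returns a Series; take the first entry
    toy[col] = toy[col].fillna(most_common)
print(toy)
```

If a column has several equally common values, `mode()` returns all of them, and this sketch simply takes the first.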

### Imputing column 'Self_Employed'

In [26]:
dataset[dataset['Self_Employed'].isnull()]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
11,LP001027,Male,Yes,2,Graduate,,2500,1840.0,109.0,360.0,1.0,Urban,Y
19,LP001041,Male,Yes,0,Graduate,,2600,3500.0,115.0,,1.0,Urban,Y
24,LP001052,Male,Yes,1,Graduate,,3717,2925.0,151.0,360.0,,Semiurban,N
29,LP001087,Female,No,2,Graduate,,3750,2083.0,120.0,360.0,1.0,Semiurban,Y
30,LP001091,Male,Yes,1,Graduate,,4166,3369.0,201.0,360.0,,Urban,N
95,LP001326,Male,No,0,Graduate,,6782,0.0,,360.0,,Urban,N
107,LP001370,Male,No,0,Not Graduate,,7333,0.0,120.0,360.0,1.0,Rural,N
111,LP001387,Female,Yes,0,Graduate,,2929,2333.0,139.0,360.0,1.0,Semiurban,Y
114,LP001398,Male,No,0,Graduate,,5050,0.0,118.0,360.0,1.0,Semiurban,Y
158,LP001546,Male,No,0,Graduate,,2980,2083.0,120.0,360.0,1.0,Rural,Y


In [27]:
dataset['Self_Employed'].value_counts()

No     500
Yes     82
Name: Self_Employed, dtype: int64

We impute it with the column's mode, 'No'.

In [28]:
dataset['Self_Employed'].fillna('No', inplace=True)

In [29]:
dataset['Self_Employed'].isnull().any()

False

### Imputing column 'LoanAmount'

In [30]:
dataset[dataset['LoanAmount'].isnull()]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
35,LP001106,Male,Yes,0,Graduate,No,2275,2067.0,,360.0,1.0,Urban,Y
63,LP001213,Male,Yes,1,Graduate,No,4945,0.0,,360.0,0.0,Rural,N
81,LP001266,Male,Yes,1,Graduate,Yes,2395,0.0,,360.0,1.0,Semiurban,Y
95,LP001326,Male,No,0,Graduate,No,6782,0.0,,360.0,,Urban,N
102,LP001350,Male,Yes,0,Graduate,No,13650,0.0,,360.0,1.0,Urban,Y
103,LP001356,Male,Yes,0,Graduate,No,4652,3583.0,,360.0,1.0,Semiurban,Y
113,LP001392,Female,No,1,Graduate,Yes,7451,0.0,,360.0,1.0,Semiurban,Y
127,LP001449,Male,No,0,Graduate,No,3865,1640.0,,360.0,1.0,Rural,Y
202,LP001682,Male,Yes,3+,Not Graduate,No,3992,0.0,,180.0,1.0,Urban,N


We impute it with the column's mean (about 146.41).

In [31]:
dataset['LoanAmount'].fillna(dataset['LoanAmount'].mean(), inplace=True)

In [32]:
dataset['LoanAmount'].isnull().any()

False

### Imputing column 'Loan_Amount_Term'

In [33]:
dataset[dataset['Loan_Amount_Term'].isnull()]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
19,LP001041,Male,Yes,0,Graduate,No,2600,3500.0,115.0,,1.0,Urban,Y
36,LP001109,Male,Yes,0,Graduate,No,1828,1330.0,100.0,,0.0,Urban,N
44,LP001136,Male,Yes,0,Not Graduate,Yes,4695,0.0,96.0,,1.0,Urban,Y
45,LP001137,Female,No,0,Graduate,No,3410,0.0,88.0,,1.0,Urban,Y
73,LP001250,Male,Yes,3+,Not Graduate,No,4755,0.0,95.0,,0.0,Semiurban,N
112,LP001391,Male,Yes,0,Not Graduate,No,3572,4114.0,152.0,,0.0,Rural,N
165,LP001574,Male,Yes,0,Graduate,No,3707,3166.0,182.0,,1.0,Rural,Y
197,LP001669,Female,No,0,Not Graduate,No,1907,2365.0,120.0,,1.0,Urban,Y
223,LP001749,Male,Yes,0,Graduate,No,7578,1010.0,175.0,,1.0,Semiurban,Y
232,LP001770,Male,No,0,Not Graduate,No,3189,2598.0,120.0,,1.0,Rural,Y


We impute it with the column's median.

In [34]:
dataset['Loan_Amount_Term'].fillna(dataset['Loan_Amount_Term'].median(), inplace=True)

In [35]:
dataset['Loan_Amount_Term'].isnull().any()

False
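The notebook uses the mean for LoanAmount and the median for Loan_Amount_Term. The two can differ noticeably when a column is skewed, because a few large values pull the mean upward. A small sketch with made-up numbers illustrates the difference:

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed amounts with one large value and two gaps
amounts = pd.Series([100.0, 120.0, 130.0, np.nan, 700.0, np.nan])

mean_filled = amounts.fillna(amounts.mean())
median_filled = amounts.fillna(amounts.median())

print("mean used:  ", amounts.mean())
print("median used:", amounts.median())
```

In this toy example the single large value more than doubles the fill value when the mean is used, so the median is often the more cautious choice for skewed columns.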

### Imputing column 'Credit_History'

In [36]:
dataset[dataset['Credit_History'].isnull()]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
16,LP001034,Male,No,1,Not Graduate,No,3596,0.0,100.0,240.0,,Urban,Y
24,LP001052,Male,Yes,1,Graduate,No,3717,2925.0,151.0,360.0,,Semiurban,N
30,LP001091,Male,Yes,1,Graduate,No,4166,3369.0,201.0,360.0,,Urban,N
42,LP001123,Male,Yes,0,Graduate,No,2400,0.0,75.0,360.0,,Urban,Y
79,LP001264,Male,Yes,3+,Not Graduate,Yes,3333,2166.0,130.0,360.0,,Semiurban,Y
83,LP001273,Male,Yes,0,Graduate,No,6000,2250.0,265.0,360.0,,Semiurban,N
86,LP001280,Male,Yes,2,Not Graduate,No,3333,2000.0,99.0,360.0,,Semiurban,Y
95,LP001326,Male,No,0,Graduate,No,6782,0.0,146.412162,360.0,,Urban,N
117,LP001405,Male,Yes,1,Graduate,No,2214,1398.0,85.0,360.0,,Urban,Y
125,LP001443,Female,No,0,Graduate,No,3692,0.0,93.0,360.0,,Rural,Y


In [37]:
dataset['Credit_History'].value_counts()

1.0    475
0.0     89
Name: Credit_History, dtype: int64

We impute it with the column's mode, 1.0.

In [38]:
dataset['Credit_History'].fillna(1, inplace=True)

In [39]:
dataset['Credit_History'].isnull().any()

False

Let's check whether any missing values are left.

In [40]:
dataset.isnull().any().any()

False

The missing values are gone. What about outliers? Let's find out.

In [41]:
dataset.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

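Here we only consider the numerical columns. Note that ApplicantIncome (int64) is missing from the output below; the 'int' selector appears not to have matched int64 on the machine where this was run.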

In [46]:
dataset.select_dtypes(include=['int', 'float']).describe()

Unnamed: 0,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,614.0,614.0
mean,1621.245798,146.412162,342.410423,0.855049
std,2926.248369,84.037468,64.428629,0.352339
min,0.0,9.0,12.0,0.0
25%,0.0,100.25,360.0,1.0
50%,1188.5,129.0,360.0,1.0
75%,2297.25,164.75,360.0,1.0
max,41667.0,700.0,480.0,1.0
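ApplicantIncome is an int64 column, yet it does not appear in the summary above. A likely cause, stated here as an assumption, is that `'int'` resolves to a platform-dependent integer width. A sketch of a selector that does not depend on the platform, using a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "ApplicantIncome": pd.Series([5849, 4583, 3000], dtype="int64"),
    "CoapplicantIncome": [0.0, 1508.0, 0.0],
    "Property_Area": ["Urban", "Rural", "Urban"],
})

# 'number' matches every numeric dtype regardless of platform integer width
numeric = toy.select_dtypes(include="number")
print(numeric.describe())
```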


Nothing weird in these summary statistics, is there? In the CoapplicantIncome column, the maximum (41667) is very large compared with the mean (about 1621) and even the third quartile (2297.25). That still makes sense: a few people have high incomes, while most fall in the low-to-medium range. So we just leave it as it is.
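For a more systematic look than reading `describe()` by eye, one common rule of thumb flags values that lie more than 1.5 times the interquartile range beyond the quartiles. This is a sketch of that rule, not something the notebook applies; `iqr_outliers` is a hypothetical helper and the incomes are made up.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up incomes with one very large value
income = pd.Series([0.0, 0.0, 1188.5, 1500.0, 2297.25, 41667.0])
mask = iqr_outliers(income)
print(income[mask])
```

A flagged value is not necessarily an error; as noted above, a high income can be perfectly legitimate.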

Now we save the data for the next phase, which is predictive modeling.

In [47]:
dataset.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,146.412162,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [48]:
dataset.to_csv('clean.csv')
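By default `to_csv` also writes the row index, which shows up as an extra `Unnamed: 0` column when the file is read back. If that is not wanted in the modeling phase, `index=False` avoids it. A sketch using an in-memory buffer and made-up rows:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Loan_ID": ["LP001002", "LP001003"], "Loan_Status": ["Y", "N"]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```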

__Note:__ <br>

Here I use fairly simple techniques to fill the missing values. There are other techniques as well, such as creating new features that flag the missing values, dropping rows or columns with missing values entirely, or filling with other aggregate functions. A good next step is to try the different techniques and keep the one that gives the best predictive modeling performance.
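As an illustration of the first alternative, a new feature can record where a value was missing before it is filled in. This is a minimal sketch on made-up data; the column name `Credit_History_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Credit_History": [1.0, np.nan, 0.0, np.nan]})

# Record where the value was missing before filling it in
toy["Credit_History_missing"] = toy["Credit_History"].isnull().astype(int)
toy["Credit_History"] = toy["Credit_History"].fillna(1.0)
print(toy)
```

The indicator lets a model treat "was missing" as information in its own right, which may or may not help on this dataset.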