<a href="https://colab.research.google.com/github/akhil14shukla/IME672A-Course-Project/blob/master/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing the required libraries

In [33]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import *
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Loading the CSV data

In [34]:
df = pd.read_csv("https://raw.githubusercontent.com/akhil14shukla/IME672A-Course-Project/master/hmeq.csv")
# Separate the target column (BAD) from the feature columns
y = df["BAD"]
df = df.drop(columns=["BAD"])

In [35]:
# sns.pairplot(df)

Understanding the Data

In [36]:
print(df.dtypes)

BAD          int64
LOAN         int64
MORTDUE    float64
VALUE      float64
REASON      object
JOB         object
YOJ        float64
DEROG      float64
DELINQ     float64
CLAGE      float64
NINQ       float64
CLNO       float64
DEBTINC    float64
dtype: object


Most attributes are already numeric; only two, REASON and JOB, are strings (dtype `object`).

In [37]:
# Number of missing values in each attribute
print(df.isna().sum())
# Number of rows/tuples where more than 3 attributes are missing
(df.isna().sum(axis=1) > 3).sum()

BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64


339

We can consider dropping the 339 tuples in which more than 3 attributes are missing; the cells below keep them.
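If such rows were to be dropped, one way to do it might look like the sketch below. It runs on a small made-up frame (`toy`, with invented values) rather than on `df`, and the threshold name `max_missing` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HMEQ features (values are made up).
toy = pd.DataFrame({
    "LOAN":    [1100, 1500, 1700, 2000],
    "MORTDUE": [25860.0, np.nan, 97800.0, np.nan],
    "VALUE":   [39025.0, np.nan, 112000.0, 62250.0],
    "YOJ":     [10.5, np.nan, 3.0, 16.0],
    "DEROG":   [0.0, np.nan, 0.0, 0.0],
})

max_missing = 3  # hypothetical threshold: keep rows with at most 3 missing attributes

# Count missing attributes per row and keep the rows under the threshold.
missing_per_row = toy.isna().sum(axis=1)
kept = toy[missing_per_row <= max_missing]

# Same filter via dropna: require at least (n_columns - max_missing) non-null values.
kept_alt = toy.dropna(thresh=toy.shape[1] - max_missing)
```

Here only the second row, which has four missing attributes, is removed. If rows are dropped from `df`, the same rows would also need to be dropped from `y` so that features and target stay aligned.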

Meaning of the null values in the dataset, and how we will fill them:<br><br>
REASON - The reason the person is taking the loan. There are two available values: debt consolidation (`DebtCon`) and home improvement (`HomeImp`). We assume that a missing value means the reason was something other than these two options, so we fill the null values with _"Other reason"_.

In [38]:
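# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df["REASON"] = df["REASON"].fillna("Other reason")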

In [39]:
df[df["MORTDUE"].isna()]

Unnamed: 0,BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
3,1,1500,,,Other reason,,,,,,,,
9,1,2000,,62250.0,HomeImp,Sales,16.0,0.0,0.0,115.800000,0.0,13.0,
24,1,2400,,17180.0,HomeImp,Other,,0.0,0.0,14.566667,3.0,4.0,
40,1,3000,,8800.0,HomeImp,Other,2.0,0.0,1.0,77.766667,0.0,3.0,
41,1,3000,,33000.0,HomeImp,Other,1.0,0.0,1.0,23.300000,1.0,2.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5880,0,53700,,84205.0,HomeImp,Other,,0.0,0.0,339.665615,0.0,7.0,22.639940
5883,0,53800,,81322.0,HomeImp,Self,9.0,0.0,0.0,171.447555,0.0,22.0,24.709060
5884,0,53900,,91309.0,HomeImp,Other,,0.0,0.0,349.795748,0.0,6.0,22.061330
5930,1,72300,,85000.0,DebtCon,Other,1.0,0.0,0.0,117.166667,9.0,23.0,


In [40]:
print(df["JOB"].isna().sum())
print(df["JOB"].value_counts())
# We can fill the missing values with the mode, i.e. "Other", or sample them according to the distribution of the non-null values.
df["JOB"] = df["JOB"].fillna(df["JOB"].mode()[0])
print(df["JOB"].isna().sum())

279
Other      2388
ProfExe    1276
Office      948
Mgr         767
Self        193
Sales       109
Name: JOB, dtype: int64
0
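The alternative mentioned in the comment above, filling according to the distribution of the non-null values, is not what the notebook does. A minimal sketch of it, assuming a toy `JOB` series with made-up entries and an arbitrary seed, could be:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed so the fill is repeatable

# Toy stand-in for df["JOB"] (made-up values, two of them missing).
toy_job = pd.Series(
    ["Other", "Other", None, "ProfExe", None, "Office", "Other"], name="JOB"
)

# Empirical distribution of the observed (non-null) categories.
probs = toy_job.value_counts(normalize=True)

# Draw one category per missing entry, weighted by observed frequency.
mask = toy_job.isna()
draws = rng.choice(probs.index.to_numpy(), size=int(mask.sum()), p=probs.to_numpy())

filled = toy_job.copy()
filled[mask] = draws
```

Unlike mode imputation, this keeps the category proportions roughly unchanged, at the cost of adding randomness to the filled values.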


In [41]:
# Temporarily fill the remaining null values with 0
df.fillna(0,inplace=True)
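Filling with 0 is described above as temporary. One common replacement, shown here only as a sketch on a made-up frame, is to fill each numeric column with its median and keep an indicator column recording which values were missing. The `_missing` suffix is a naming choice made for this example.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values standing in for two numeric HMEQ columns.
toy = pd.DataFrame({
    "DEBTINC": [36.1, np.nan, 35.6, 34.3, np.nan],
    "CLAGE":   [94.4, 121.8, np.nan, 93.3, 149.5],
})

# Fix the list of columns up front so the new indicator columns are not revisited.
numeric_cols = list(toy.select_dtypes(include="number").columns)

for col in numeric_cols:
    # Record which rows were originally missing, then fill with the column median.
    toy[col + "_missing"] = toy[col].isna().astype(int)
    toy[col] = toy[col].fillna(toy[col].median())
```

When a validation set is held out, the medians would normally be computed on the training rows only and then reused on the validation rows.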

In [43]:
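# One-hot encode the two categorical attributes as 0/1 columns, then drop the originals
df = df.join(pd.get_dummies(df["JOB"], dtype=int))
df = df.join(pd.get_dummies(df["REASON"], dtype=int))
df = df.drop(columns=["JOB", "REASON"])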

In [44]:
df

Unnamed: 0,LOAN,MORTDUE,VALUE,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC,Mgr,Office,Other,ProfExe,Sales,Self,DebtCon,HomeImp,Other reason
0,1100,25860.0,39025.0,10.5,0.0,0.0,94.366667,1.0,9.0,0.000000,0,0,1,0,0,0,0,1,0
1,1300,70053.0,68400.0,7.0,0.0,2.0,121.833333,0.0,14.0,0.000000,0,0,1,0,0,0,0,1,0
2,1500,13500.0,16700.0,4.0,0.0,0.0,149.466667,1.0,10.0,0.000000,0,0,1,0,0,0,0,1,0
3,1500,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0,0,1,0,0,0,0,0,1
4,1700,97800.0,112000.0,3.0,0.0,0.0,93.333333,0.0,14.0,0.000000,0,1,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5955,88900,57264.0,90185.0,16.0,0.0,0.0,221.808718,0.0,16.0,36.112347,0,0,1,0,0,0,1,0,0
5956,89000,54576.0,92937.0,16.0,0.0,0.0,208.692070,0.0,15.0,35.859971,0,0,1,0,0,0,1,0,0
5957,89200,54045.0,92924.0,15.0,0.0,0.0,212.279697,0.0,15.0,35.556590,0,0,1,0,0,0,1,0,0
5958,89800,50370.0,91861.0,14.0,0.0,0.0,213.892709,0.0,16.0,34.340882,0,0,1,0,0,0,1,0,0


Building the Model

In [45]:
# Splitting the dataset into a training set and a held-out validation set (scikit-learn's default is 75% / 25%)
x_train, x_test, y_train, y_test = train_test_split(df,y)
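The split above is random and unseeded, so the reported accuracies will vary between runs. A sketch of a repeatable, class-balanced variant follows. It uses made-up stand-ins (`X_toy`, `y_toy`) for `df` and `y`, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for df and y (made-up values; 20% of rows have BAD == 1).
X_toy = pd.DataFrame({"LOAN": np.arange(100), "YOJ": np.arange(100) % 7})
y_toy = pd.Series([1] * 20 + [0] * 80, name="BAD")

x_tr, x_val, y_tr, y_val = train_test_split(
    X_toy, y_toy,
    test_size=0.25,   # matches the default used when no size is given
    stratify=y_toy,   # keep the BAD ratio the same in both parts
    random_state=0,   # arbitrary seed, chosen only for repeatability
)
```

Stratifying on the target can matter when one class is much rarer than the other, since a purely random split may leave the validation set with a different class ratio.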

In [46]:
# Training Decision Tree Model
dtree = DecisionTreeClassifier()
dtree.fit(x_train,y_train)

DecisionTreeClassifier()

Evaluating the model on the held-out validation set (labelled "CV" in the output) and comparing with the training set

In [52]:
print("Accuracy on Training Dataset : ",dtree.score(x_train,y_train))
print("Accuracy on CV Dataset : ",dtree.score(x_test,y_test))

Accuracy on Training Dataset :  1.0
Accuracy on CV Dataset :  0.8617449664429531
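A training accuracy of 1.0 next to a lower held-out accuracy suggests the unrestricted tree is fitting the training rows exactly. One way to probe this, sketched below on synthetic data rather than on the HMEQ frame, is k-fold cross-validation over a few `max_depth` values. The depths tried and the seed are arbitrary choices for this example, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the HMEQ features and target.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)

cv_means = {}
for depth in (None, 3, 5):  # None means the tree grows until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X_syn, y_syn, cv=5)  # 5-fold accuracy
    cv_means[depth] = scores.mean()
    print(f"max_depth={depth}: mean CV accuracy = {cv_means[depth]:.3f}")
```

Comparing the mean scores across depths gives a rough indication of whether a shallower tree generalises better than the fully grown one.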
