# Lending Club Data - Decision Tree

This notebook works with the Lending Club loan data, which is large, imbalanced, and has many features of mixed data types. For modelling, I take the default indicator ('default_ind') as the target variable and try to predict whether a loan will default. The data is also cleaned along the way.

1. Importing Libraries & Importing Data
2. Data Cleaning
3. Creating Dummies
4. Modelling-Decision Tree

# Importing Libraries & Importing Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os

#Reading Dataset
dataset= pd.read_csv('XYZCorp_LendingData.txt',sep="\t")

#Printing Dataset
dataset



Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,default_ind
0,1077501,1296599,5000.0,5000.0,4975.000000,36 months,10.65,162.87,B,B2,...,,,,,,,,,,0
1,1077430,1314167,2500.0,2500.0,2500.000000,60 months,15.27,59.83,C,C4,...,,,,,,,,,,1
2,1077175,1313524,2400.0,2400.0,2400.000000,36 months,15.96,84.33,C,C5,...,,,,,,,,,,0
3,1076863,1277178,10000.0,10000.0,10000.000000,36 months,13.49,339.31,C,C1,...,,,,,,,,,,0
4,1075358,1311748,3000.0,3000.0,3000.000000,60 months,12.69,67.79,B,B5,...,,,,,,,,,,0
5,1075269,1311441,5000.0,5000.0,5000.000000,36 months,7.90,156.46,A,A4,...,,,,,,,,,,0
6,1069639,1304742,7000.0,7000.0,7000.000000,60 months,15.96,170.08,C,C5,...,,,,,,,,,,0
7,1072053,1288686,3000.0,3000.0,3000.000000,36 months,18.64,109.43,E,E1,...,,,,,,,,,,0
8,1071795,1306957,5600.0,5600.0,5600.000000,60 months,21.28,152.39,F,F2,...,,,,,,,,,,1
9,1071570,1306721,5375.0,5375.0,5350.000000,60 months,12.69,121.45,B,B5,...,,,,,,,,,,1


Making a deep copy of the dataframe so that the original stays untouched and I do not have to re-read the entire dataset from disk (this saves loading time, at the cost of holding a second copy in memory).

In [2]:
loan= dataset.copy(deep=True)
loan.head()


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,default_ind
0,1077501,1296599,5000.0,5000.0,4975.0,36 months,10.65,162.87,B,B2,...,,,,,,,,,,0
1,1077430,1314167,2500.0,2500.0,2500.0,60 months,15.27,59.83,C,C4,...,,,,,,,,,,1
2,1077175,1313524,2400.0,2400.0,2400.0,36 months,15.96,84.33,C,C5,...,,,,,,,,,,0
3,1076863,1277178,10000.0,10000.0,10000.0,36 months,13.49,339.31,C,C1,...,,,,,,,,,,0
4,1075358,1311748,3000.0,3000.0,3000.0,60 months,12.69,67.79,B,B5,...,,,,,,,,,,0


# Data Cleaning

We do not need 'id' and 'member_id': they are identifiers with no real predictive power, so we can drop them.

In [3]:

loan.drop(['id','member_id'],axis=1,inplace=True)

The column 'emp_title' has a lot of unique entries and no real predictive power, so we drop it.

In [4]:
loan.drop(['emp_title'],axis=1,inplace=True)

'emp_length' has missing values, so the string 'n/a' is replaced with NaN and the nulls are then filled with zero.
'emp_length' is categorical data, but its intervals are easy to define, so we can convert it into numerical data by stripping everything except the digits.

In [5]:
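# treat the literal string 'n/a' as missing everywhere in the frame
loan.replace('n/a',np.nan,inplace=True)
# missing employment length is encoded as 0 years
loan['emp_length']=loan['emp_length'].fillna(value=0)

# keep only the digits, e.g. '10+ years' -> '10', '< 1 year' -> '1'
loan['emp_length']=loan['emp_length'].replace(to_replace='[^0-9]+', value='', regex=True)
loan['emp_length']=loan['emp_length'].astype(int)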
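To make the effect of this conversion concrete, here is a minimal sketch on a toy Series. The sample strings are assumptions chosen to resemble typical 'emp_length' entries; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy values in the style of 'emp_length' (assumed for illustration only).
emp_length_demo = pd.Series(['10+ years', '< 1 year', '3 years', 'n/a', np.nan])

# Same idea as the cell above: 'n/a' -> NaN, NaN -> zero, then keep digits only.
cleaned = emp_length_demo.replace('n/a', np.nan)
cleaned = cleaned.fillna('0')  # the string '0' keeps the column uniformly textual
cleaned = cleaned.str.replace('[^0-9]+', '', regex=True).astype(int)

print(cleaned.tolist())  # [10, 1, 3, 0, 0]
```

Note that '< 1 year' ends up as 1, the same value as '1 year' would get, so the two categories are merged by this simplification.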


The column 'pymnt_plan' indicates that the loan is not finalised yet and the borrower has been advised a payment plan, so it has no predictive power.
The columns 'title' and 'desc' are arbitrary free text from the applicant, so they are not very useful for prediction.
Thus, all three columns are dropped.

In [6]:
loan.drop(['pymnt_plan','desc','title'],axis=1,inplace=True)

Removing all columns in which 30% or more of the values are missing,
as columns with this much missing data will not help modelling or exploration.

In [7]:
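#function for removing columns with too many missing values

def removeNv(dataframe,axis=1,percent=0.3):
    # 'axis' is kept for call compatibility; only columns are ever dropped
    df=dataframe.copy()
    ishape=df.shape
    # fraction of missing values per column
    colnames=(df.isnull().sum()/len(df))
    # names of the columns at or above the missing-value threshold
    colnames=list(colnames[colnames.values>=percent].index)
    df.drop(labels=colnames,axis=1,inplace=True)
    print("Number ofColumns dropped \t:",len(colnames))
    print("\n old dataset rows column ",ishape,"\n New data rows columns",df.shape)
    return df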

In [8]:
#calling the above function

loan=removeNv(loan,axis=1,percent=0.3)

Number ofColumns dropped 	: 20

 old dataset rows column  (855969, 67) 
 New data rows columns (855969, 47)
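The thresholding logic can be checked on a small frame. This is a sketch with made-up columns 'a', 'b' and 'c' (hypothetical names), using `isnull().mean()` as a shorthand for the same per-column missing fraction.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 0%, 25% and 50% missing values per column.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [1.0, np.nan, 3.0, 4.0],
    'c': [np.nan, np.nan, 3.0, 4.0],
})

# Fraction of missing values per column (equivalent to isnull().sum() / len(df)).
missing_fraction = toy.isnull().mean()

# Columns at or above the 30% threshold are dropped.
to_drop = missing_fraction[missing_fraction >= 0.3].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_fraction.to_dict())  # {'a': 0.0, 'b': 0.25, 'c': 0.5}
print(to_drop)                     # ['c']
```

Only 'c' crosses the threshold, so 'a' and 'b' survive even though 'b' still contains a null that has to be handled later.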


After removing these columns, we are left with 47 columns.

Some of the remaining columns still have missing values, especially the date columns. So we fix the data types first and then handle the missing data.

The date columns (stored as objects) are converted into an integer year or month, because I do not want to blow up the number of feature columns by one-hot encoding them. Null values are filled with the most frequent date in each column.

In [9]:
# month (1-12) for these three columns
loan['earliest_cr_line']= pd.to_datetime(loan['earliest_cr_line'].fillna('2001-08-01')).dt.month
loan['last_pymnt_d']= pd.to_datetime(loan['last_pymnt_d'].fillna('2016-01-01')).dt.month
loan['last_credit_pull_d']= pd.to_datetime(loan['last_credit_pull_d'].fillna('2016-01-01')).dt.month
# four-digit year for these two columns
loan['issue_d']= pd.to_datetime(loan['issue_d']).dt.year
loan['next_pymnt_d']= pd.to_datetime(loan['next_pymnt_d'].fillna('2016-02-01')).dt.year


After filling the null values in the code above, we extract the month from 'earliest_cr_line', 'last_pymnt_d' and 'last_credit_pull_d', and the year from 'issue_d' and 'next_pymnt_d'.
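The fill dates in the cell above are hardcoded. As a rough sketch of how such a value could be derived instead, the snippet below takes the most frequent entry of a toy column and then extracts month and year. The date strings are assumptions; the format of the real columns is not shown in this notebook.

```python
import pandas as pd

# Toy date column with one missing entry (values are illustrative only).
dates = pd.Series(['2016-01-01', '2015-12-01', None, '2016-01-01'])

# mode() ignores nulls by default; [0] picks the most frequent value.
most_common = dates.mode()[0]

# Fill the gap, parse to datetimes, then pull out the integer parts.
filled = pd.to_datetime(dates.fillna(most_common))
months = filled.dt.month
years = filled.dt.year

print(most_common)      # 2016-01-01
print(months.tolist())  # [1, 12, 1, 1]
print(years.tolist())   # [2016, 2015, 2016, 2016]
```

Keeping only the month or only the year discards the rest of the date, which is the trade-off accepted here to avoid extra feature columns.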

Some attributes, namely 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'grade', 'sub_grade', 'total_rec_int', 'total_rec_late_fee', 'recoveries' and 'collection_recovery_fee', are dropped because they are detail attributes about the loan after it was granted.

In [11]:
loan.drop(['out_prncp','out_prncp_inv','total_pymnt','total_pymnt_inv','total_rec_prncp','grade','sub_grade'],axis=1,inplace=True)

In [12]:
loan.drop(['total_rec_int','total_rec_late_fee','recoveries','collection_recovery_fee'],axis=1,inplace=True)

In [13]:
loan.shape

(855969, 36)

After removing the columns that are not needed for prediction, we are left with 36 columns.

In [15]:
loan.drop(['zip_code','purpose','addr_state'],axis=1,inplace=True)

Columns like 'zip_code', 'purpose' and 'addr_state' are mainly useful for visualisation rather than prediction, so they are dropped above.
Some other columns still have null values. We replace these with 0 rather than dropping the columns, since, judging by their descriptions, they should be of high feature importance.

In [17]:
loan.fillna(0,inplace=True)

### Dummies
One-hot encoding the remaining categorical variables.

In [18]:
loan = pd.get_dummies(loan)
print(loan.shape)

(855969, 43)
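For readers unfamiliar with `pd.get_dummies`, here is a small sketch of what it does to a frame that mixes numeric and text columns. The three rows reuse 'loan_amnt' and 'term' values printed earlier in this notebook; the frame itself is a toy.

```python
import pandas as pd

# Toy frame: one numeric column and one categorical (text) column.
toy = pd.DataFrame({
    'loan_amnt': [5000.0, 2500.0, 2400.0],
    'term': ['36 months', '60 months', '36 months'],
})

# Numeric columns pass through; each category becomes its own indicator column.
encoded = pd.get_dummies(toy)

print(list(encoded.columns))  # ['loan_amnt', 'term_36 months', 'term_60 months']
print(encoded.shape)          # (3, 3)
```

Each text column adds one indicator column per distinct value, which explains the growth in the column count after this step.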


### Target Variable

In [19]:
loan['default_ind'].value_counts()

0    809502
1     46467
Name: default_ind, dtype: int64
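The counts above quantify the class imbalance. A quick arithmetic check, using only the two numbers printed by `value_counts()`:

```python
# Class counts copied from the value_counts() output above.
n_non_default = 809502
n_default = 46467

total = n_non_default + n_default
default_rate = n_default / total

print(total)                   # 855969
print(round(default_rate, 4))  # 0.0543
```

So only about 5.4% of the loans are defaults, roughly one default for every 17 non-default loans.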

# Modelling
## Decision Tree

#### Splitting data into train and test 

In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(loan.drop('default_ind',axis=1),loan['default_ind'],test_size=0.20,random_state=101)
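The split above is purely random. With a target this imbalanced, one option (not used in this notebook) is to pass `stratify` so that both parts keep the same default rate. The sketch below shows this on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins, not the loan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 3 features, 5% positive class.
rng = np.random.default_rng(101)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 50 + [0] * 950)

# stratify=y_demo keeps the class proportions equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=101, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())  # 0.05 0.05
```

Whether stratification changes the results reported below has not been checked here; it mainly removes one source of run-to-run variation in the class mix.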

In [21]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [22]:
dt_c = DecisionTreeClassifier()

dt_c.fit(X_train,y_train)

y_predict = dt_c.predict(X_test)

print("DecisionTreeClassifier",metrics.accuracy_score(y_test,y_predict),'\n')
print(metrics.classification_report(y_test, y_predict))

DecisionTreeClassifier 0.9782352185240137 

              precision    recall  f1-score   support

           0       0.99      0.99      0.99    161867
           1       0.80      0.81      0.80      9327

   micro avg       0.98      0.98      0.98    171194
   macro avg       0.89      0.90      0.90    171194
weighted avg       0.98      0.98      0.98    171194
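Because the classes are imbalanced, the accuracy is easier to read next to a majority-class baseline, that is, a model that always predicts "no default". The sketch below derives that baseline from the support column of the report above; no new results are computed from the data.

```python
# Support values copied from the classification report above.
support_non_default = 161867
support_default = 9327

n_test = support_non_default + support_default

# Accuracy of always predicting class 0 on this test set.
baseline_accuracy = support_non_default / n_test

# Accuracy printed by the decision tree cell above.
reported_accuracy = 0.9782352185240137

print(n_test)                       # 171194
print(round(baseline_accuracy, 4))  # 0.9455
```

The tree's accuracy of about 0.978 is therefore an improvement of roughly three percentage points over the trivial baseline, and the precision and recall for class 1 (0.80 and 0.81) are the more informative numbers for default prediction.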

