# Study of Lending Club loans

## Introduction

Lending Club is an online credit marketplace, facilitating personal loans, business loans, and financing for elective medical procedures. Lending Club has provided datasets that contain complete loan data for all loans issued through the specific time period stated, including the current loan status (Current, Late, Fully Paid, etc.) , latest payment information and the financial and credit rating information about the loan applicant collected by Lending Club.

## Objective

 The Capstone project aims to learn the payment patterns of applicants and build a predictive model that will predict the repayment class of the loan. For this study, the loan repayment type has been classified into three as follows
 
  - **Good loans** - The installments will be always on time and the loan will be fully paid on or before the end of loan term.
  - **Bad loans** - The loan will end up in Charge Off and LendingClub will end up in spending for collection and recoveries process.
  - **Delinquent loans** - The loan installments will be delayed one or more times but not likely to end up in charge off.
  
## Data

 We will be considering the LendingClub dataset for the year 2014. At a high level, the dataset contains the below attribute groups. 
 
  - *Applicant Identity* - Basic Identity information about the applicant like home state, employment length, home-ownership type etc
  - *Applicant Income* - The reported annual income as provided by the applicant and the verification status
  - *Loan* - Attributes directly related to the loan like the loan purpose, amount, interest rate, status, payment of various fees etc
  - *Credit History* - Attributes collected about the applicant's credit history like no. of bankcard accounts, mortgage accounts, credit balances, debt to income ratio etc
  - *Delinquency History* - Attributes collected about the applicant's previous delinquent payments like no. of chargeoffs / collections, no. of accounts past due payment etc.
  
  
  - The *loan_status* attribute can be one among the following.
      - 'Fully Paid'
      - 'Current'
      - 'In Grace Period'
      - 'Late (16-30 days)'
      - 'Late (31-120 days)'
      - 'Default'
      - 'Charged Off'
  
    The detailed list of attributes can be viewed [ here ] ('https://github.com/anurekhat/Capstone/blob/master/LendingClub/LCDataDictionary2014.xlsx '). 
  

###  Data Clean ups, filtering and transformations
 
 
  - The below attributes have been dropped for further analysis as they are manual text entries that serves more as just informational. 
      - *emp_title*
      - *desc*
      - *title*
      
      
  - There are two available terms for loans - 36 months and 60 months. For this study, we consider only the 36 months term loans as the 60 months term loans initiated in 2014 will be mostly still running as on 2017 and including them may skew the features of 'Charged Off' loans. 
  
  
  - The loan records with *total_rec_late_fee* as a non-zero value have been marked as delinquent. i.e If the applicant has ever paid a late fee, that loan is marked as delinquent. Please see the explanation on **Classification** below
  
  - The below date attributes have been converted to the appropriate duration attribute, considering Apr 2017 as the reference date
      - *issue_d* converted to *mo_sin_loan_funded* (months since loan was funded)
      - *earliest_cr_line* converted to *mo_sin_earliest_cr_line* (months since the earliest credit line was opened)
      - *last_pymnt_d* converted to *mo_sin_last_pymnt* (months since the last payment)
      - *last_credit_pull_d* converted to *mo_sin_last_credit_pull* (months since the last credit pull by LC)
      
  - **Imputing Missing Values**
      - *last_pymnt_d* (*mo_sin_last_pymnt*) : This attribute is expected to be null when there has not been any installments that were paid. The date is set to May 2017 so that the duration attribute holds the value -1 and these records are considered separate by the model estimator
      - *last_credit_pull_d* (*mo_sin_last_credit_pull*) : This attribute is expected to be null when LC has not pulled the credit info of the applicant. The date is set to May 2017 so that the duration attribute holds the value -1 and these records are considered separate by the model estimator
      - *mths_since_last_delinq* : This attribute is expected to be null when the applicant has no prior delinquencies. Imputing these to -1 to set these apart
      - *mths_since_last_record* : This attribute is expected to be null when the applicant has no public records. Imputing these to -1 to set these apart
      - *mths_since_last_major_derog* : This attribute is expected to be null when the applicant has no previous derogatory records. Imputing these to -1 to set these apart
      - *mo_sin_old_il_acct* : This attribute is expected to be null when the applicant has no bank installment accounts. Imputing these to -1 to set these apart
      - *mths_since_recent_bc_dlq* : This attribute is expected to be null when there's no previous bankcard delinquency. Imputing these to -1 to set these apart
      - *mths_since_recent_inq* : This attribute is expected to be null when there's no previous inquiries. Imputing these to -1 to set these apart
      - *mths_since_recent_revol_delinq* : This attribute is expected to be null when there's no previous revolving account delinquencies. Imputing these to -1 to set these apart
      - *num_tl_120dpd_2m* : This attribute is expected to be null when there's no accounts currently 120 days past due. Imputing these to 0
      - *mths_since_recent_bc* : This attribute is expected to be null when info not available. Imputing these to the median
      - *bc_open_to_buy* : This attribute is expected to be null when info is not available. Imputing these to the median
      - *bc_util* : This attribute is expected to be null when info is not available. Imputing these to the median
      - *avg_cur_bal* : This attribute is expected to be null when info is not available. Imputing these to the median
      - *percent_bc_gt_75* : This attribute is expected to be null when info is not available. Imputing these to the median
      - *revol_util* : This attribute is expected to be null when info is not available. Imputing these to the median
      
  - **Categorical features** 
      Categorical features have been converted to the 'One Hot Encoding' format using pandas `get_Dummies()` method
      
  - The following attributes are not being considered as the 'X' independent variables in the model as these are not expected to impact the outcome variable of loan_class
      - *term* : As it will be always 36 months
      - *zip_code* : The 3 digit coded value is not expected to impact outcome
      - *loan_status* : This is more part of the outcome than an input variable
      - *application_type* : As it will be always 'INDIVIDUAL'
      - *id* : Identifier 
      - *next_pymnt_d* : Not expected to impact outcome

      
      
#### Classification
   - All loan records where *loan_status* is 'Fully Paid' or 'Current' and *total_rec_late_fee* is zero are classified as **Good**
      
      
   - All loan records with *loan_status* 'Charged Off' are classified as **Bad**
      
      
   - All loan records that have a non-zero value for *total_rec_late_fee* or loan_status is one among { 'In Grace Period', 'Late (16-30 days)', 'Late (31-120 days)', 'Default'} is classified as **Delinquent**. Note that there could be some >Charged Off loans that could have been also delinquent. The Charged off status will take priority in such cases and those loans will be classified as **Bad** 
      
  
## Methodology


The cleaned up dataset has been divided into training (70%) and test (30%) datasets. Models with the below algorithm have been trained and tested with the dataset.

    - Logistic Regression
    - Linear Support Vector Classifier (LinearSVC)
    - Random Forest Classifier
    - XG Boost
    
  The below table summarizes the scores from the different models
  

Algorithm | Score | Learning | Test
--- | --- | --- | --- 
Logistic Regression| Accuracy | 0.9608 | 0.9612
Logistic Regression| F1 |  | 0.9537
LinearSVC| Accuracy| 0.9813|0.9824
LinearSVC| F1| |0.9800
RandomForest| Accuracy |0.9991 |0.9811
RandomForest|F1| |0.9792
xgBoost| Accuracy|0.9887 |0.9878
xgBoost| F1 | |0.9871


## Testing on 2012-2013 dataset

The XG Boost model was found to have the better scores based on the results from the 2014 test dataset. The LendingClub's 2012-2013 dataset has been used to further test the xgboost model. The results are as below.

    - Accuracy score :  0.974159099333
    - F1 score :  0.975060440745

 Class|   F1 |  Precision |   Recall | Total
          ---| --- | --- | --- | ---
        Bad | 0.907538 |   0.837017 | 0.991035 | 11043
        Delinquent |  0.955626  | 0.991776 | 0.922018 |  1308
        Good|  0.985482 |  0.998751 | 0.972562 | 73985

    - Confusion Matrix

    [[10944    10    89]
 
     [  101  1206     1]
 
     [ 2030     0 71955]]
 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
import statsmodels.stats.weightstats as wst
%matplotlib inline
pd.options.display.max_columns = None
pd.options.display.max_rows = None
plt.rcParams['figure.figsize'] = (10.0, 8.0)

In [2]:
# Read the dataset.

# Define a function for string % convertion to a float ratio
def p2f(x):
    perc = None
    if (x != None) :
        if (type(x) is str) :
            x = x.strip('%')
        perc = float(x)/100
    return perc

loan_data = pd.read_csv('./LoanStats2014.csv', skiprows=1, skipfooter=2, parse_dates = [15, 26, 45,47, 48], 
                        infer_datetime_format = True, engine = 'python', converters = {'int_rate' : p2f, 'revol_util' : p2f})

In [3]:
#Removing the attribute columns that have all null values. Only the attributes listed in the above DataDictionary will remain.
loan_data.dropna(axis = 1, how = 'all', inplace= True)

#Since id values are not provided in the dataset, create id variable from the index.
loan_data['id'] = loan_data.index


In [4]:
loan_data['term_mnths'] = loan_data.term.str.strip('months').astype(int)

In [5]:
len(loan_data)

235629

In [6]:
#Filtering out the 36 months term loans
loan_data = loan_data[loan_data.term_mnths == 36]
len(loan_data)

162570

In [7]:
#Dropping textual columns from the dataset
loan_data.drop(['emp_title', 'desc', 'title'], axis = 1, inplace = True)

In [8]:
#Marking loans that had at least one late payment
loan_data['late_fee_rec_indicator'] = (loan_data.total_rec_late_fee > 0)


In [9]:
#loans across different loan statuses
#sns.countplot(x = 'loan_status', hue = 'late_fee_rec_indicator', data = loan_data)

In [10]:
# Classify the loans as 'Good' ,'Bad', 'Delinquent'

def classify_loan(row) :
    val = ''
    if ((row.loan_status in ['Fully Paid', 'Current']) & (row.late_fee_rec_indicator == False) ) :
        val = 'Good'
    elif row.loan_status in ['Charged Off'] :
        val = 'Bad'
    elif ( (row.loan_status not in ['Fully Paid', 'Current', 'Charged Off']) | 
         row.late_fee_rec_indicator == True) :
        val = 'Delinquent'
    return val

loan_data['loan_class'] = loan_data.apply(classify_loan, axis = 1)

In [11]:
pd.DataFrame(loan_data.groupby('loan_class').size())

Unnamed: 0_level_0,0
loan_class,Unnamed: 1_level_1
Bad,19380
Delinquent,5018
Good,138172


In [None]:
#pd.DataFrame( loan_data.groupby(['loan_class', 'loan_status','late_fee_rec_indicator']).size())

In [12]:
#Transform date fields into appropriate duration units
from datetime import datetime
loans = loan_data.copy()
ref_datetime = datetime.strptime('2017-04-02', '%Y-%m-%d')
# Issue date
loans['mo_sin_loan_funded'] = ((ref_datetime - loans['issue_d']) / np.timedelta64(1, 'M')).astype(int)

# Earliest credit line
loans['mo_sin_earliest_cr_line'] = ((ref_datetime - loans['earliest_cr_line']) / np.timedelta64(1, 'M')).astype(int)

In [13]:
#Imputing missing values and converting to duration units
#Last payment date is null when no installments have been made. Setting the nulls to May 2017, so that the calculated duration
# field will have the value -1
loans['last_pymnt_d'] = loans['last_pymnt_d'].fillna(pd.to_datetime('2017-05-02'))
loans['mo_sin_last_pymnt'] = ((ref_datetime - loans['last_pymnt_d']) / np.timedelta64(1, 'M')).astype(int)

#last_credit_pull_d
#Setting the nulls to May 2017, so that the calculated duration field will have the value -1
loans['last_credit_pull_d'] = loans['last_credit_pull_d'].fillna(pd.to_datetime('2017-05-02'))
loans['mo_sin_last_credit_pull'] = ((ref_datetime - loans['last_credit_pull_d']) / np.timedelta64(1, 'M')).astype(int)

In [14]:
# Imputing missing values

#mths_since_last_delinq
# this will be null when the applicant has no prior delinquencies. Imputing these to -1 to set these apart.
loans['mths_since_last_delinq'] = loans['mths_since_last_delinq'].fillna(-1)

#mths_since_last_record
# this will be null when the applicant has no public records. Imputing these to -1 to set these apart.
loans['mths_since_last_record'] = loans['mths_since_last_record'].fillna(-1)

#mths_since_last_major_derog
# this will be null when the applicant has no previous derogatory records. Imputing these to -1 to set these apart.
loans['mths_since_last_major_derog'] = loans['mths_since_last_major_derog'].fillna(-1)

#mo_sin_old_il_acct
# this will be null when the applicant has no bank installment accounts. Imputing these to -1 to set these apart.
loans['mo_sin_old_il_acct'] = loans['mo_sin_old_il_acct'].fillna(-1)

#mths_since_recent_bc_dlq
# this will be null when there's no previous bankcard delinquency. Imputing these to -1 to set these apart.
loans['mths_since_recent_bc_dlq'] = loans['mths_since_recent_bc_dlq'].fillna(-1)

#mths_since_recent_inq
# this will be null when there's no previous inquiries. Imputing these to -1 to set these apart.
loans['mths_since_recent_inq'] = loans['mths_since_recent_inq'].fillna(-1)

#mths_since_recent_revol_delinq
# this will be null when there's no previous revolving account delinquencies. Imputing these to -1 to set these apart.
loans['mths_since_recent_revol_delinq'] = loans['mths_since_recent_revol_delinq'].fillna(-1)

#num_tl_120dpd_2m
# this will be null when there's no accounts currently 120 days past due. Imputing these to 0.
loans['num_tl_120dpd_2m'] = loans['num_tl_120dpd_2m'].fillna(0)

#mths_since_recent_bc
# this will be null when info not available. Imputing these to the median
loans['mths_since_recent_bc'] = loans['mths_since_recent_bc'].fillna(loans.mths_since_recent_bc.median())

#bc_open_to_buy
# this will be null when info is not available. Imputing these to the median
loans['bc_open_to_buy'] = loans['bc_open_to_buy'].fillna(loans.bc_open_to_buy.median())

#bc_util
# this will be null when info is not available. Imputing these to the median
loans['bc_util'] = loans['bc_util'].fillna(loans.bc_util.median())

#avg_cur_bal
# this will be null when info is not available. Imputing these to the median
loans['avg_cur_bal'] = loans['avg_cur_bal'].fillna(loans.avg_cur_bal.median())

#percent_bc_gt_75
# this will be null when info is not available. Imputing these to the median
loans['percent_bc_gt_75'] = loans['percent_bc_gt_75'].fillna(loans.percent_bc_gt_75.median())

#revol_util
# this will be null when info is not available. Imputing these to the median
loans['revol_util'] = loans['revol_util'].fillna(loans.revol_util.median())

In [15]:
#Drop the now unused columns
loans.drop(['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'], axis = 1, inplace = True)


In [16]:
# Convert categorical features to One Hot Encode format using Pandas getDummies
categorical_features = ['grade', 'sub_grade', 'home_ownership','emp_length', 'verification_status', 'pymnt_plan', 'purpose', 'addr_state',
                       'initial_list_status']
loans_with_dummies = pd.get_dummies(columns = categorical_features, data= loans)

In [17]:
# Implement logistic regression with Grid Search Cross Validation for tuning hyper parameters

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Identifying X values - Removing all derived columns, date columns, id and constants
X = loans_with_dummies.drop(['term', 'zip_code','loan_status', 'application_type','id','term_mnths', 'late_fee_rec_indicator', 
                             'loan_class', 'next_pymnt_d' ], axis = 1)
loans_with_dummies['loan_class'] = loans_with_dummies.loan_class.astype('category')
Y = loans_with_dummies.loan_class.cat.codes


In [18]:
# Split the data into a training and test set.
Xlr, Xtestlr, ylr, ytestlr = train_test_split(X.values, Y.values, test_size = 0.30, random_state = 5 )

print(Xlr.shape)
print (ylr.shape)

print(Xtestlr.shape)
print(ytestlr.shape)

(113799, 201)
(113799,)
(48771, 201)
(48771,)


## Logistic Regression Model

In [19]:


#parameters = { 'C' : [ 0.1, 1, 10, 100]}
parameters = {'C' : [1]}
#Using the class_weight parameter as balanced so that the algo adjusts for the less frequent loan classes of bad and delinquent
lr = LogisticRegression(class_weight= 'balanced')
clf = GridSearchCV(lr, parameters)

In [20]:

# Fit the model on the best model
clf.fit(Xlr, ylr)
print ('Best Params',clf.best_params_)

Best Params {'C': 1}


In [21]:
# Accuracy score
from sklearn.metrics import accuracy_score

# Print the accuracy from the test data.
print('Accuracy score from learning dataset',accuracy_score(ylr, clf.predict(Xlr) ))

# Print the accuracy from the test data.
print('Accuracy score from testing dataset',accuracy_score(ytestlr, clf.predict(Xtestlr)))

Accuracy score from learning dataset 0.960869603424
Accuracy score from testing dataset 0.961247462631


In [22]:
# Log loss
from sklearn.metrics import log_loss

# Print the log loss from the test data.
print('Log loss from learning dataset',log_loss(ylr, clf.predict_proba(Xlr)))

# Print the log loss from the test data.
print('Log loss from testing dataset',log_loss(ytestlr, clf.predict_proba(Xtestlr)))


Log loss from learning dataset 0.25464750644
Log loss from testing dataset 0.256346470061


In [23]:
# precision recall f1 score
from sklearn.metrics import precision_recall_fscore_support

print ('Learing dataset: \n')
prec, recall, f1, support = precision_recall_fscore_support(ylr, clf.predict(Xlr))
prf_lr = pd.DataFrame({'Precision': prec, 'Recall': recall, 'F1': f1, 'Total': support})
print (prf_lr)

print ('Test dataset: \n')
prec, recall, f1, support = precision_recall_fscore_support(ytestlr, clf.predict(Xtestlr))

prf_test = pd.DataFrame({'Precision': prec, 'Recall': recall, 'F1': f1, 'Total': support})
print (prf_test)

Learing dataset: 

         F1  Precision    Recall  Total
0  0.994838   0.999926  0.989802  13630
1  0.119616   0.214967  0.082862   3536
2  0.977231   0.965819  0.988917  96633
Test dataset: 

         F1  Precision    Recall  Total
0  0.995896   1.000000  0.991826   5750
1  0.126126   0.212121  0.089744   1482
2  0.977495   0.967107  0.988108  41539


In [24]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

print ('Learing dataset: \n')
print (confusion_matrix(ylr, clf.predict(Xlr)))

print ('Test dataset: \n')
print (confusion_matrix(ytestlr, clf.predict(Xtestlr)))

Learing dataset: 

[[13491     0   139]
 [    0   293  3243]
 [    1  1070 95562]]
Test dataset: 

[[ 5703     0    47]
 [    0   133  1349]
 [    0   494 41045]]


In [25]:
from sklearn.metrics import f1_score

f1_score(ytestlr, clf.predict(Xtestlr), average = 'weighted')

0.95379369703207106

## SVM Model

 Re-using the learning set and test set created earlier using train_test_split (70% | 30%)

In [26]:
from sklearn import svm
from sklearn import preprocessing

#Scale X variables
Xlr_svm = preprocessing.scale(Xlr)
Xtestlr_svm = preprocessing.scale(Xtestlr)

# dual = False as n_samples > n_features
#class_weight = balanced so that the algo adjusts for the less frequent loan classes of bad and delinquent
#using the default ovr classifier for the multi class outcome
#parameters = { 'C': [0.1, 1, 10]}
parameters = {'C' : [1]}
clf_svm = GridSearchCV(svm.LinearSVC(dual = False, class_weight= 'balanced'), parameters, n_jobs = -1)



In [27]:
clf_svm.fit(Xlr_svm, ylr )

GridSearchCV(cv=None, error_score='raise',
       estimator=LinearSVC(C=1.0, class_weight='balanced', dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
       fit_params={}, iid=True, n_jobs=-1, param_grid={'C': [1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [28]:
print(clf_svm.best_params_)
pd.DataFrame(clf_svm.cv_results_)

{'C': 1}


Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_C,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,462.778926,0.122827,0.981063,0.981327,1,{'C': 1},1,0.981415,0.981058,0.981204,0.98156,0.98057,0.981362,138.169983,0.023181,0.000359,0.000206


In [29]:
# Accuracy score

# Print the accuracy from the test data.
print('Accuracy score from learning dataset',accuracy_score(ylr, clf_svm.predict(Xlr_svm) ))

# Print the accuracy from the test data.
print('Accuracy score from testing dataset',accuracy_score(ytestlr, clf_svm.predict(Xtestlr_svm)))

Accuracy score from learning dataset 0.981379449732
Accuracy score from testing dataset 0.982469090238


In [30]:
print ('Learing dataset: \n')
prec, recall, f1, support = precision_recall_fscore_support(ylr, clf_svm.predict(Xlr_svm))
prf_lr = pd.DataFrame({'Precision': prec, 'Recall': recall, 'F1': f1, 'Total': support})
print (prf_lr)

print ('Test dataset: \n')
prec, recall, f1, support = precision_recall_fscore_support(ytestlr, clf_svm.predict(Xtestlr_svm))

prf_test = pd.DataFrame({'Precision': prec, 'Recall': recall, 'F1': f1, 'Total': support})
print (prf_test)

Learing dataset: 

         F1  Precision    Recall  Total
0  0.999119   1.000000  0.998239  13630
1  0.639023   0.815905  0.525170   3536
2  0.989124   0.982638  0.995695  96633
Test dataset: 

         F1  Precision    Recall  Total
0  0.998868   1.000000  0.997739   5750
1  0.632219   0.886724  0.491228   1482
2  0.989851   0.981949  0.997882  41539


In [31]:
# Confusion Matrix


print ('Learing dataset: \n')
print (confusion_matrix(ylr, clf_svm.predict(Xlr_svm)))

print ('Test dataset: \n')
print (confusion_matrix(ytestlr, clf_svm.predict(Xtestlr_svm)))

Learing dataset: 

[[13606     3    21]
 [    0  1857  1679]
 [    0   416 96217]]
Test dataset: 

[[ 5737     5     8]
 [    0   728   754]
 [    0    88 41451]]


In [32]:


f1_score(ytestlr, clf_svm.predict(Xtestlr_svm), average = 'weighted')

0.9800467734710463

## Random Forest Model

 Re-using the learning set and test set created earlier using train_test_split (70% | 30%)

In [33]:
from sklearn.ensemble import RandomForestClassifier

parameters = {'n_estimators' : [10, 20, 30],
             'max_features': ['auto'],
             'class_weight' :['balanced']}
clf_rf = GridSearchCV( RandomForestClassifier(), parameters, n_jobs = -1)



In [34]:
clf_rf.fit(Xlr, ylr)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [10, 20, 30], 'max_features': ['auto'], 'class_weight': ['balanced']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [36]:
#pd.DataFrame(clf_rf.cv_results_)

In [37]:
clf_rf.score(Xtestlr, ytestlr)

0.98066473929179221

In [38]:
# Accuracy score

# Print the accuracy from the test data.
print('Accuracy score from learning dataset',accuracy_score(ylr, clf_rf.predict(Xlr) ))

# Print the accuracy from the test data.
print('Accuracy score from testing dataset',accuracy_score(ytestlr, clf_rf.predict(Xtestlr)))

Accuracy score from learning dataset 0.999499116864
Accuracy score from testing dataset 0.980664739292


In [39]:
#Precision, Recall, F1
print ('Learing dataset: \n')
prec, recall, f1, support = precision_recall_fscore_support(ylr, clf_rf.predict(Xlr))
prf_lr = pd.DataFrame({'Precision': prec, 'Recall': recall, 'F1': f1, 'Total': support})
print (prf_lr)

print ('Test dataset: \n')
prec, recall, f1, support = precision_recall_fscore_support(ytestlr, clf_rf.predict(Xtestlr))

prf_test = pd.DataFrame({'Precision': prec, 'Recall': recall, 'F1': f1, 'Total': support})
print (prf_test)

Learing dataset: 

         F1  Precision   Recall  Total
0  1.000000    1.00000  1.00000  13630
1  0.991875    1.00000  0.98388   3536
2  0.999705    0.99941  1.00000  96633
Test dataset: 

         F1  Precision    Recall  Total
0  0.977991   0.998550  0.958261   5750
1  0.692776   0.975490  0.537112   1482
2  0.988902   0.978439  0.999591  41539


In [40]:
# Confusion Matrix


print ('Learing dataset: \n')
print (confusion_matrix(ylr, clf_rf.predict(Xlr)))

print ('Test dataset: \n')
print (confusion_matrix(ytestlr, clf_rf.predict(Xtestlr)))

Learing dataset: 

[[13630     0     0]
 [    0  3479    57]
 [    0     0 96633]]
Test dataset: 

[[ 5510     6   234]
 [    5   796   681]
 [    3    14 41522]]


In [41]:
#F1 score
f1_score(ytestlr, clf_rf.predict(Xtestlr), average = 'weighted')

0.97861689594227819

## XG Boost

 Re-using the learning set and test set created earlier using train_test_split (70% | 30%)

In [42]:
import xgboost as xgb

dtrain = xgb.DMatrix(Xlr, label = ylr)
#clf_xg = xgb.xgboost()
dtest = xgb.DMatrix(Xtestlr, label = ytestlr)

In [43]:
param = {'booster': 'gbtree','silent':0, 'objective':'multi:softmax', 'num_class' : 3 , 'eta':0.3}
param['scale_pos_weight']= 0.176
param['max_delta_step'] = 5
param['min_samples_leaf'] = 15
param['eval_metric'] = 'merror'

evallist  = [(dtest,'eval'), (dtrain,'train')]

num_round = 15
bst = xgb.train( param, dtrain, num_round, evallist )


[0]	eval-merror:0.026594	train-merror:0.025642
[1]	eval-merror:0.02278	train-merror:0.022144
[2]	eval-merror:0.02073	train-merror:0.020475
[3]	eval-merror:0.01751	train-merror:0.017223
[4]	eval-merror:0.017264	train-merror:0.017215
[5]	eval-merror:0.016875	train-merror:0.016301
[6]	eval-merror:0.015932	train-merror:0.015009
[7]	eval-merror:0.014168	train-merror:0.013445
[8]	eval-merror:0.013328	train-merror:0.012531
[9]	eval-merror:0.012835	train-merror:0.012215
[10]	eval-merror:0.012548	train-merror:0.011889
[11]	eval-merror:0.012528	train-merror:0.011854
[12]	eval-merror:0.012528	train-merror:0.011635
[13]	eval-merror:0.012179	train-merror:0.011327
[14]	eval-merror:0.012179	train-merror:0.011248


In [44]:
# Accuracy score
Xlr_xg = xgb.DMatrix(Xlr)
Xtestlr_xg = xgb.DMatrix(Xtestlr)
# Print the accuracy from the test data.
print('Accuracy score from learning dataset',accuracy_score(ylr, bst.predict(Xlr_xg) ))

# Print the accuracy from the test data.
print('Accuracy score from testing dataset',accuracy_score(ytestlr, bst.predict(Xtestlr_xg)))

Accuracy score from learning dataset 0.988752097997
Accuracy score from testing dataset 0.987820631113


In [45]:
#Precision, Recall, F1
print ('Learing dataset: \n')
prec, recall, f1, support = precision_recall_fscore_support(ylr, bst.predict(Xlr_xg))
prf_xg = pd.DataFrame({'Precision': prec, 'Recall': recall, 'F1': f1, 'Total': support})
print (prf_xg)

print ('Test dataset: \n')
prec, recall, f1, support = precision_recall_fscore_support(ytestlr, bst.predict(Xtestlr_xg))

prf_test_xg = pd.DataFrame({'Precision': prec, 'Recall': recall, 'F1': f1, 'Total': support})
print (prf_test_xg)

Learing dataset: 

         F1  Precision    Recall  Total
0  0.988258   0.994796  0.981805  13630
1  0.839370   0.985883  0.730769   3536
2  0.993553   0.987997  0.999172  96633
Test dataset: 

         F1  Precision    Recall  Total
0  0.987317   0.993137  0.981565   5750
1  0.817439   0.965961  0.708502   1482
2  0.993129   0.987667  0.998652  41539


In [46]:
# Confusion Matrix


print ('Learing dataset: \n')
print (confusion_matrix(ylr, bst.predict(Xlr_xg)))

print ('Test dataset: \n')
print (confusion_matrix(ytestlr, bst.predict(Xtestlr_xg)))

Learing dataset: 

[[13382    20   228]
 [    7  2584   945]
 [   63    17 96553]]
Test dataset: 

[[ 5644     9    97]
 [   11  1050   421]
 [   28    28 41483]]


In [47]:
#F1 score
f1_score(ytestlr, bst.predict(Xtestlr_xg), average = 'weighted')

0.98710517452665747

## Testing the best model in the 2012-2013 dataset

In [48]:
loans_1213 = pd.read_csv('./LoanStats2012to2013.csv', skiprows=1, skipfooter=2, parse_dates = [15, 26, 45,47, 48], 
                        infer_datetime_format = True, engine = 'python', converters = {'int_rate' : p2f, 'revol_util' : p2f})

In [49]:
#Removing the attribute columns that have all null values. Only the attributes listed in the above DataDictionary will remain.
loans_1213.dropna(axis = 1, how = 'all', inplace= True)

#Since id values are not provided in the dataset, create id variable from the index.
loans_1213['id'] = loans_1213.index

In [50]:
loans_1213['term_mnths'] = loans_1213.term.str.strip('months').astype(int)
len(loans_1213)

188181

In [51]:
#Filtering out the 36 months term loans
loans_1213 = loans_1213[loans_1213.term_mnths == 36]
len(loans_1213)

143892

In [52]:
#Dropping textual columns from the dataset
loans_1213.drop(['emp_title', 'desc', 'title'], axis = 1, inplace = True)

In [53]:
#Marking loans that had at least one late payment
loans_1213['late_fee_rec_indicator'] = (loans_1213.total_rec_late_fee > 0)

In [54]:
loans_1213['loan_class'] = loans_1213.apply(classify_loan, axis = 1)

In [55]:
pd.DataFrame(loans_1213.groupby('loan_class').size())

Unnamed: 0_level_0,0
loan_class,Unnamed: 1_level_1
Bad,18268
Delinquent,2178
Good,123446


In [56]:
ref_datetime = datetime.strptime('2017-04-02', '%Y-%m-%d')
# Issue date
loans_1213['mo_sin_loan_funded'] = ((ref_datetime - loans_1213['issue_d']) / np.timedelta64(1, 'M')).astype(int)

# Earliest credit line
loans_1213['mo_sin_earliest_cr_line'] = ((ref_datetime - loans_1213['earliest_cr_line']) / np.timedelta64(1, 'M')).astype(int)

In [57]:
#Imputing missing values and converting to duration units
#Last payment date is null when no installments have been made. Setting the nulls to May 2017, so that the calculated duration
# field will have the value -1
loans_1213['last_pymnt_d'] = loans_1213['last_pymnt_d'].fillna(pd.to_datetime('2017-05-02'))
loans_1213['mo_sin_last_pymnt'] = ((ref_datetime - loans_1213['last_pymnt_d']) / np.timedelta64(1, 'M')).astype(int)

#last_credit_pull_d
#Setting the nulls to May 2017, so that the calculated duration field will have the value -1
loans_1213['last_credit_pull_d'] = loans_1213['last_credit_pull_d'].fillna(pd.to_datetime('2017-05-02'))
loans_1213['mo_sin_last_credit_pull'] = ((ref_datetime - loans_1213['last_credit_pull_d']) / np.timedelta64(1, 'M')).astype(int)

In [58]:
# Imputing missing values

#mths_since_last_delinq
# this will be null when the applicant has no prior delinquencies. Imputing these to -1 to set these apart.
loans_1213['mths_since_last_delinq'] = loans_1213['mths_since_last_delinq'].fillna(-1)

#mths_since_last_record
# this will be null when the applicant has no public records. Imputing these to -1 to set these apart.
loans_1213['mths_since_last_record'] = loans_1213['mths_since_last_record'].fillna(-1)

#mths_since_last_major_derog
# this will be null when the applicant has no previous derogatory records. Imputing these to -1 to set these apart.
loans_1213['mths_since_last_major_derog'] = loans_1213['mths_since_last_major_derog'].fillna(-1)

#mo_sin_old_il_acct
# this will be null when the applicant has no bank installment accounts. Imputing these to -1 to set these apart.
loans_1213['mo_sin_old_il_acct'] = loans_1213['mo_sin_old_il_acct'].fillna(-1)

#mths_since_recent_bc_dlq
# this will be null when there's no previous bankcard delinquency. Imputing these to -1 to set these apart.
loans_1213['mths_since_recent_bc_dlq'] = loans_1213['mths_since_recent_bc_dlq'].fillna(-1)

#mths_since_recent_inq
# this will be null when there's no previous inquiries. Imputing these to -1 to set these apart.
loans_1213['mths_since_recent_inq'] = loans_1213['mths_since_recent_inq'].fillna(-1)

#mths_since_recent_revol_delinq
# this will be null when there's no previous revolving account delinquencies. Imputing these to -1 to set these apart.
loans_1213['mths_since_recent_revol_delinq'] = loans_1213['mths_since_recent_revol_delinq'].fillna(-1)

#num_tl_120dpd_2m
# this will be null when there's no accounts currently 120 days past due. Imputing these to 0.
loans_1213['num_tl_120dpd_2m'] = loans_1213['num_tl_120dpd_2m'].fillna(0)

#mths_since_recent_bc
# this will be null when info not available. Imputing these to the median
loans_1213['mths_since_recent_bc'] = loans_1213['mths_since_recent_bc'].fillna(loans_1213.mths_since_recent_bc.median())

#bc_open_to_buy
# this will be null when info is not available. Imputing these to the median
loans_1213['bc_open_to_buy'] = loans_1213['bc_open_to_buy'].fillna(loans_1213.bc_open_to_buy.median())

#bc_util
# this will be null when info is not available. Imputing these to the median
loans_1213['bc_util'] = loans_1213['bc_util'].fillna(loans_1213.bc_util.median())

#avg_cur_bal
# this will be null when info is not available. Imputing these to the median
loans_1213['avg_cur_bal'] = loans_1213['avg_cur_bal'].fillna(loans_1213.avg_cur_bal.median())

#percent_bc_gt_75
# this will be null when info is not available. Imputing these to the median
loans_1213['percent_bc_gt_75'] = loans_1213['percent_bc_gt_75'].fillna(loans_1213.percent_bc_gt_75.median())

#revol_util
# this will be null when info is not available. Imputing these to the median
loans_1213['revol_util'] = loans_1213['revol_util'].fillna(loans_1213.revol_util.median())

In [59]:
#Drop the now unused columns
loans_1213.drop(['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'], axis = 1, inplace = True)


In [60]:
# Convert categorical features to One Hot Encode format using Pandas getDummies
categorical_features = ['grade', 'sub_grade', 'home_ownership','emp_length', 'verification_status', 'pymnt_plan', 'purpose', 'addr_state',
                       'initial_list_status']
loans_1213_with_dummies = pd.get_dummies(columns = categorical_features, data= loans_1213)

In [61]:
# Identifying X values - Removing all derived columns, date columns, id and constants
X_1213 = loans_1213_with_dummies.drop(['term', 'zip_code','loan_status', 'application_type','id','term_mnths', 'late_fee_rec_indicator', 
                             'loan_class', 'next_pymnt_d' ], axis = 1)
loans_1213_with_dummies['loan_class'] = loans_1213_with_dummies.loan_class.astype('category')
Y_1213 = loans_1213_with_dummies.loan_class.cat.codes


In [62]:
 # Split the data into a training and test set.
Xlr_1213, Xtestlr_1213, ylr_1213, ytestlr_1213 = train_test_split(X_1213.values, Y_1213.values, test_size = 0.6, random_state = 5 )

print(Xlr_1213.shape)
print (ylr_1213.shape)

print(Xtestlr_1213.shape)
print(ytestlr_1213.shape)

(57556, 201)
(57556,)
(86336, 201)
(86336,)


In [63]:
#Run the test on the xgboost model

# Accuracy score
X_1213_xg = xgb.DMatrix(Xtestlr_1213)
# Print the accuracy from the test data.
print('Accuracy score : ',accuracy_score(ytestlr_1213, bst.predict(X_1213_xg) ))

print('F1 score : ', f1_score(ytestlr_1213, bst.predict(X_1213_xg), average = 'weighted'))

#Precision, Recall, F1

prec, recall, f1, support = precision_recall_fscore_support(ytestlr_1213, bst.predict(X_1213_xg))

prf_test_xg = pd.DataFrame({'Precision': prec, 'Recall': recall, 'F1': f1, 'Total': support})
print ('\n',prf_test_xg)

print('\n',confusion_matrix(ytestlr_1213, bst.predict(X_1213_xg)))

Accuracy score :  0.974159099333
F1 score :  0.975060440745

          F1  Precision    Recall  Total
0  0.907538   0.837017  0.991035  11043
1  0.955626   0.991776  0.922018   1308
2  0.985482   0.998751  0.972562  73985

 [[10944    10    89]
 [  101  1206     1]
 [ 2030     0 71955]]


In [None]:
#Run the test on the RandomForest model

# Accuracy score

print('Accuracy score : ',accuracy_score(ytestlr_1213, clf_rf.predict(Xtestlr_1213) ))

print('F1 score : ', f1_score(ytestlr_1213, clf_rf.predict(Xtestlr_1213), average = 'weighted'))

#Precision, Recall, F1

prec, recall, f1, support = precision_recall_fscore_support(ytestlr_1213, clf_rf.predict(Xtestlr_1213))

prf_test_xg = pd.DataFrame({'Precision': prec, 'Recall': recall, 'F1': f1, 'Total': support})
print ('\n',prf_test_xg)

print('\n',confusion_matrix(ytestlr_1213, clf_rf.predict(Xtestlr_1213)))