## Decision Tree (C5.0) To Predict Risky Bank Loans

##### Step-1: Import data
The analysis uses the 'credit.csv' dataset, available from the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml). The dataset describes loans obtained through a credit agency in Germany.

In [9]:
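# Pick credit.csv interactively; swap file.choose() for a file path in a non-interactive script
credit <- read.csv(file.choose(), header = TRUE)
#str(credit)
# The outcome must be a factor for classification
credit$default <- as.factor(credit$default)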

##### Step-2: Explore data

In [15]:
head(credit)

,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,...,property,age,installment_plan,housing,existing_credits,job,dependents,telephone,foreign_worker,default
1,unknown,6,critical,car (new),250,> 1000 DM,1 - 4 yrs,2,female,none,...,real estate,41,bank,own,2,unskilled resident,1,none,yes,1
2,1 - 200 DM,9,repaid,car (new),276,< 100 DM,1 - 4 yrs,4,married male,none,...,real estate,22,none,rent,1,unskilled resident,1,none,yes,1
3,< 0 DM,6,critical,radio/tv,338,501 - 1000 DM,> 7 yrs,4,single male,none,...,other,52,none,own,2,skilled employee,1,none,yes,1
4,< 0 DM,12,fully repaid this bank,retraining,339,< 100 DM,> 7 yrs,4,married male,none,...,other,45,bank,own,1,unskilled resident,1,none,yes,1
5,< 0 DM,6,repaid,domestic appliances,343,< 100 DM,0 - 1 yrs,4,female,none,...,real estate,27,none,own,1,skilled employee,1,none,yes,1
6,unknown,6,critical,car (new),362,101 - 500 DM,1 - 4 yrs,4,female,none,...,other,52,none,own,2,unskilled resident,1,none,yes,1


The 'default' variable is coded 1 for loans that did not default and 2 for loans that defaulted. In this dataset, 30 percent of the loans went into default.

In [75]:
rbind(table(credit$default), prop.table(table(credit$default)))

1,2
700.0,300.0
0.7,0.3


##### Step-3: Prepare data 
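Use 80% of the data for training and the remaining 20% for testing. Shuffle the rows before splitting so that both subsets are random samples of the full data; the rows printed above appear to be ordered by loan amount, so taking the first 800 rows unshuffled would bias the split.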

In [39]:
set.seed(100)
g <- runif(nrow(credit))
creditr <- credit[order(g),]
credittrain <- creditr[1:800, ]
credittest <- creditr[801:1000,]
rbind(prop.table(table(credittrain$default)), prop.table(table(credittest$default)))

1,2
0.71375,0.28625
0.645,0.355
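
A random shuffle leaves the class balance of each subset to chance: here the test set ends up with 35.5% defaults against 28.6% in training. A minimal sketch of one alternative, a stratified split in base R, is shown below. It is not part of the original analysis. The data frame `toy` is a synthetic stand-in for `credit` with the same 700/300 class mix, and the helper name `stratified_split` is hypothetical.

```r
# Hypothetical stand-in for the credit data: 700 non-defaults (1), 300 defaults (2)
toy <- data.frame(id = 1:1000,
                  default = factor(rep(c(1, 2), times = c(700, 300))))

# Illustrative helper: sample the same fraction of row indices within each class
stratified_split <- function(y, frac = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(frac * length(idx)))]
  }))
  sort(unname(train_idx))
}

set.seed(100)
train_idx <- stratified_split(toy$default, frac = 0.8)
toytrain <- toy[train_idx, ]
toytest  <- toy[-train_idx, ]

# Compare the class mix of the two subsets
rbind(train = prop.table(table(toytrain$default)),
      test  = prop.table(table(toytest$default)))
```

Because the sampling is done separately within each class, both subsets keep the 70/30 mix of the full data by construction.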


##### Step-4: Train the model

In [76]:
#install.packages("C50")
library(C50)                # the C50 package provides the C5.0 decision tree algorithm
# Column 21 is the outcome (default), so it is excluded from the predictors
m1 <- C50::C5.0(credittrain[,-21], credittrain$default)
m1


Call:
C5.0.default(x = credittrain[, -21], y = credittrain$default)

Classification Tree
Number of samples: 800 
Number of predictors: 20 

Tree size: 59 

Non-standard options: attempt to group attributes


**Decision tree model parameters**

In [46]:
summary(m1)


Call:
C5.0.default(x = credittrain[, -21], y = credittrain$default)


C5.0 [Release 2.07 GPL Edition]  	Tue Jan 26 16:01:13 2016
-------------------------------

Class specified by attribute `outcome'

Read 800 cases (21 attributes) from undefined.data

Decision tree:

checking_balance in {> 200 DM,unknown}: 1 (373/46)
checking_balance in {< 0 DM,1 - 200 DM}:
:...credit_history in {fully repaid,fully repaid this bank}:
    :...housing = rent: 2 (13)
    :   housing in {for free,own}:
    :   :...other_debtors in {co-applicant,guarantor}: 1 (3)
    :       other_debtors = none:
    :       :...housing = for free: 2 (10/1)
    :           housing = own:
    :           :...purpose in {car (new),car (used),education,others,
    :               :           repairs}: 2 (10/1)
    :               purpose in {domestic appliances,radio/tv,
    :               :           retraining}: 1 (5/1)
    :               purpose = business:
    :               :...job in {mangement self-employed,skille

##### Step-5: Evaluate the model performance on test data

In [60]:
p1 <- predict(m1, credittest)
table(credittest$default, p1, dnn = c("Actual default", "Predicted default"))

              Predicted default
Actual default   1   2
             1 110  19
             2  39  32

In [61]:
library(gmodels)
CrossTable(credittest$default, p1, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, 
           dnn = c('actual default', 'predicted default'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  200 

 
               | predicted default 
actual default |         1 |         2 | Row Total | 
---------------|-----------|-----------|-----------|
             1 |       110 |        19 |       129 | 
               |     0.550 |     0.095 |           | 
---------------|-----------|-----------|-----------|
             2 |        39 |        32 |        71 | 
               |     0.195 |     0.160 |           | 
---------------|-----------|-----------|-----------|
  Column Total |       149 |        51 |       200 | 
---------------|-----------|-----------|-----------|

 


Interpretations:
* The model correctly classified 71% of the test loans (142 of 200)
* 85% of non-defaults are correctly predicted (110 of 129)
* Only 45% of defaults are correctly predicted by the model (32 of 71)
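
The three figures in this list can be recomputed directly from the counts in the confusion matrix. The sketch below assumes only those four counts; the variable names (`cm`, `non_default_recall`, `default_recall`) are illustrative and do not appear in the original notebook.

```r
# Counts copied from the test-set confusion matrix above (rows = actual, cols = predicted)
cm <- matrix(c(110, 19,
                39, 32),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("1", "2"), predicted = c("1", "2")))

accuracy           <- sum(diag(cm)) / sum(cm)        # share of all loans classified correctly
non_default_recall <- cm["1", "1"] / sum(cm["1", ])  # actual non-defaults predicted as 1
default_recall     <- cm["2", "2"] / sum(cm["2", ])  # actual defaults predicted as 2

round(c(accuracy = accuracy,
        non_default_recall = non_default_recall,
        default_recall = default_recall), 3)
# accuracy 0.710, non_default_recall 0.853, default_recall 0.451
```

Dividing by the row totals rather than the table total is what turns the cell counts into per-class rates.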

##### Step-6: Improving model performance

**Using boosting (AdaBoost-like)**

In [77]:
# trials = 10 requests up to 10 boosting iterations
m2 <- C50::C5.0(credittrain[,-21], credittrain$default, trials = 10)
p2 <- predict(m2, credittest)
prop.table(table(credittest$default, p2, dnn = c('actual default', 'predicted default')))

              predicted default
actual default     1     2
             1 0.570 0.075
             2 0.175 0.180

Interpretations:
* Overall model accuracy improved to 75% (0.570 + 0.180)
* 88% of non-defaults are correctly predicted (0.570 / 0.645), compared with 85% for the single tree
* 51% of defaults are correctly predicted (0.180 / 0.355), compared with 32 of 71 (45%) for the single tree

**Using random forest**

In [78]:
library(randomForest)
# ntree = 1000 grows 1000 trees; no seed is set here, so results vary slightly between runs
m3 <- randomForest::randomForest(credittrain[,-21], credittrain$default, ntree = 1000)
p3 <- predict(m3, credittest)
prop.table(table(credittest$default, p3, dnn = c('actual default', 'predicted default')))

              predicted default
actual default     1     2
             1 0.605 0.040
             2 0.210 0.145

Interpretations:
* Overall model accuracy is 75% (0.605 + 0.145), up from 71% for the single tree
* 94% of non-defaults are correctly predicted (0.605 / 0.645)
* Only 41% of defaults are correctly predicted (0.145 / 0.355), compared with 32 of 71 (45%) for the single tree
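
To compare the three models side by side, the proportion tables printed above can be row-normalised so that each diagonal entry becomes a per-class hit rate. The following is a minimal sketch that uses only the cell proportions shown in the outputs; the helper names `as_table` and `summarise_model` are hypothetical and not part of the original notebook.

```r
# Build a labelled 2x2 table from four cell proportions (rows = actual, cols = predicted)
as_table <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE,
         dimnames = list(actual = c("1", "2"), predicted = c("1", "2")))
}

# Cell proportions copied from the CrossTable and prop.table outputs above
models <- list(
  single_tree   = as_table(c(0.550, 0.095, 0.195, 0.160)),
  boosted       = as_table(c(0.570, 0.075, 0.175, 0.180)),
  random_forest = as_table(c(0.605, 0.040, 0.210, 0.145))
)

# Row-normalise each table so the diagonal holds the per-class hit rates
summarise_model <- function(tab) {
  by_class <- prop.table(tab, margin = 1)
  c(accuracy           = sum(diag(tab)),
    non_default_recall = by_class["1", "1"],
    default_recall     = by_class["2", "2"])
}

comparison <- t(sapply(models, summarise_model))
round(comparison, 2)
```

Reading across the rows of `comparison` shows the trade-off: a model can raise overall accuracy while catching fewer of the actual defaults, which is the costlier error for a lender.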