# Logistic Regression: Predicting the Defaulters

In [2]:
BankData = read.csv("D:/Imarticus/BankCreditCard.csv")

### Check the number of rows and columns

In [3]:
nrow(BankData)

In [4]:
ncol(BankData)

In [5]:
colnames(BankData)

### Describing the dataset


In [6]:
str(BankData)

'data.frame':	30000 obs. of  25 variables:
 $ Customer.ID           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Credit_Amount         : num  20000 220000 90000 50000 50000 50000 500000 200000 240000 20000 ...
 $ Gender                : int  2 2 2 2 1 1 1 2 2 1 ...
 $ Academic_Qualification: int  2 2 2 2 2 1 1 2 3 3 ...
 $ Marital               : int  1 2 2 1 1 2 2 2 1 2 ...
 $ Age_Years             : int  24 26 34 37 57 37 29 23 28 35 ...
 $ Repayment_Status_Jan  : int  2 0 0 0 0 0 0 0 0 0 ...
 $ Repayment_Status_Feb  : int  2 2 0 0 0 0 0 0 0 0 ...
 $ Repayment_Status_March: int  0 0 0 0 0 0 0 0 2 0 ...
 $ Repayment_Status_April: int  0 0 0 0 0 0 0 0 0 0 ...
 $ Repayment_Status_May  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Repayment_Status_June : int  0 2 0 0 0 0 0 0 0 0 ...
 $ Jan_Bill_Amount       : num  3933 3683 39339 46990 8637 ...
 $ Feb_Bill_Amount       : num  3103 1735 14037 48333 5570 ...
 $ March_Bill_Amount     : num  689 2682 23559 49292 35835 ...
 $ April_Bill_Amount     : num  0 3272 24

### Check Class Bias (Imbalanced Classification Problems)

In [7]:
table(BankData$Default_Payment)


    0     1 
23364  6636 

### There is a clear class bias: the proportion of events (defaults) is much smaller than the proportion of non-events. Sampling the observations in approximately equal proportions can help produce better models
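One common way to get roughly equal class proportions is to downsample the majority class. The sketch below shows the idea in base R on a small synthetic data frame; `toy`, `balanced`, and the seed are hypothetical names and values chosen for illustration, not objects from this notebook. It is an illustration only: the cells that follow keep the original class proportions.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible

# Synthetic stand-in with a similar imbalance (78% non-events, 22% events)
toy <- data.frame(
  x = rnorm(1000),
  Default_Payment = factor(rep(c(0, 1), times = c(780, 220)))
)

# Size of the smaller class
n_min <- min(table(toy$Default_Payment))

# Draw n_min row indices from each class, then combine them
idx_by_class <- split(seq_len(nrow(toy)), toy$Default_Payment)
keep <- unlist(lapply(idx_by_class, function(i) sample(i, n_min)))
balanced <- toy[keep, ]

# Both classes now have the same number of rows
table(balanced$Default_Payment)
```

Downsampling discards majority-class rows, so it trades some information for balance; it would normally be applied to the training sample only.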

### Converting columns to factors according to data description

In [8]:
cols.to.factors = c("Gender", "Academic_Qualification", "Marital",
                    "Repayment_Status_Jan", "Repayment_Status_Feb",
                    "Repayment_Status_March", "Repayment_Status_April",
                    "Repayment_Status_May", "Repayment_Status_June",
                    "Default_Payment")

In [9]:
BankData[cols.to.factors] = lapply(BankData[cols.to.factors], factor)

In [10]:
str(BankData)

'data.frame':	30000 obs. of  25 variables:
 $ Customer.ID           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Credit_Amount         : num  20000 220000 90000 50000 50000 50000 500000 200000 240000 20000 ...
 $ Gender                : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 1 2 2 1 ...
 $ Academic_Qualification: Factor w/ 6 levels "1","2","3","4",..: 2 2 2 2 2 1 1 2 3 3 ...
 $ Marital               : Factor w/ 4 levels "0","1","2","3": 2 3 3 2 2 3 3 3 2 3 ...
 $ Age_Years             : int  24 26 34 37 57 37 29 23 28 35 ...
 $ Repayment_Status_Jan  : Factor w/ 7 levels "0","1","2","3",..: 3 1 1 1 1 1 1 1 1 1 ...
 $ Repayment_Status_Feb  : Factor w/ 7 levels "0","1","2","3",..: 3 3 1 1 1 1 1 1 1 1 ...
 $ Repayment_Status_March: Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 3 1 ...
 $ Repayment_Status_April: Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Repayment_Status_May  : Factor w/ 6 levels "0","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Repayment_Status_June : F

In [11]:
summary(BankData)

  Customer.ID    Credit_Amount     Gender    Academic_Qualification Marital  
 Min.   :    1   Min.   :  20000   1:11888   1:10585                0:   54  
 1st Qu.: 7501   1st Qu.:  50000   2:18112   2:14030                1:13659  
 Median :15000   Median : 220000             3: 4917                2:15964  
 Mean   :15000   Mean   : 192917             4:  123                3:  323  
 3rd Qu.:22500   3rd Qu.: 270000             5:  280                         
 Max.   :30000   Max.   :2000000             6:   65                         
                                                                             
   Age_Years     Repayment_Status_Jan Repayment_Status_Feb
 Min.   :21.00   0:23182              0:25562             
 1st Qu.:28.00   1: 3688              1:   28             
 Median :34.00   2: 2667              2: 3927             
 Mean   :35.49   3:  322              3:  326             
 3rd Qu.:41.00   4:   76              4:   99             
 Max.   :79.00   5:   

### Check for missing values

In [12]:
col_name = colnames(BankData)[apply(BankData, 2, function(n) any(is.na(n)))]

In [13]:
if(length(col_name) > 0) print("Nulls present") else print("No Nulls")

[1] "No Nulls"


In [14]:
col_name = colnames(BankData)[apply(BankData, 2, function(n) any(n == ""))]
if(length(col_name) > 0) print("Blanks present") else print("No Blanks")

[1] "No Blanks"
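The same two checks can also be written column by column with `sapply()`, which avoids the conversion to a character matrix that `apply()` performs on a data frame. The sketch below uses a small made-up data frame (`toy`, `na_cols`, and `blank_cols` are hypothetical names) that deliberately contains one missing value and one blank string.

```r
# Made-up data with one NA and one empty string
toy <- data.frame(
  id = 1:4,
  amount = c(100, NA, 250, 300),
  status = c("paid", "", "late", "paid"),
  stringsAsFactors = FALSE
)

# Columns that contain at least one NA
na_cols <- names(toy)[sapply(toy, function(col) any(is.na(col)))]

# Columns that contain at least one empty string (NAs are ignored here)
blank_cols <- names(toy)[sapply(toy, function(col) any(col == "", na.rm = TRUE))]

na_cols
blank_cols
```

Passing `na.rm = TRUE` to `any()` keeps a missing value from turning the blank check itself into `NA`.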


### Level Encoding

In [15]:
levels(BankData$Gender)[levels(BankData$Gender) == "1"]= "Male" 
levels(BankData$Gender)[levels(BankData$Gender) == "2"]="Female" 

In [16]:
str(BankData$Gender)

 Factor w/ 2 levels "Male","Female": 2 2 2 2 1 1 1 2 2 1 ...


In [17]:
levels(BankData$Academic_Qualification)[levels(BankData$Academic_Qualification) == "1"] = "Undergraduate" 
levels(BankData$Academic_Qualification)[levels(BankData$Academic_Qualification) == "2"] = "Graduate" 
levels(BankData$Academic_Qualification)[levels(BankData$Academic_Qualification) == "3"] = "Postgraduate" 
levels(BankData$Academic_Qualification)[levels(BankData$Academic_Qualification) == "4"] = "Professional" 
levels(BankData$Academic_Qualification)[levels(BankData$Academic_Qualification) == "5"] = "Others" 
levels(BankData$Academic_Qualification)[levels(BankData$Academic_Qualification) == "6"] = "Unknown"

In [18]:
str(BankData$Academic_Qualification)

 Factor w/ 6 levels "Undergraduate",..: 2 2 2 2 2 1 1 2 3 3 ...
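As an alternative to renaming one level at a time, the codes can be mapped to labels in a single `factor()` call. This is a sketch on a short made-up vector (`codes` and `qualification` are hypothetical names); the code-to-label mapping is the one used in the cell above.

```r
# Made-up integer codes, following the 1..6 coding used above
codes <- c(2, 2, 1, 3, 6, 5, 4)

# Map every code to its label in one step
qualification <- factor(
  codes,
  levels = 1:6,
  labels = c("Undergraduate", "Graduate", "Postgraduate",
             "Professional", "Others", "Unknown")
)

levels(qualification)
as.character(qualification)
```

Because `levels` and `labels` are matched by position, the whole mapping is visible in one place.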


### Randomly shuffling the dataset


In [19]:
grp = runif(nrow(BankData)) 
BankData = BankData[order(grp),]
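A shuffle like this gives a different row order on every run. If a repeatable order is wanted, one option is to fix the random seed first and permute the row indices with `sample()`. The sketch below assumes an arbitrary seed and a small made-up data frame (`toy` and `shuffled` are hypothetical names).

```r
set.seed(123)  # assumed seed; any fixed value makes the shuffle repeatable

toy <- data.frame(id = 1:10, value = letters[1:10])

# sample(n) returns a random permutation of 1..n
shuffled <- toy[sample(nrow(toy)), ]

shuffled$id
```

Every row appears exactly once in `shuffled`; only the order changes.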

### Create train and test samples

In [20]:
library(caret) 
library(ggplot2)

"package 'caret' was built under R version 3.6.3"
Loading required package: lattice
Loading required package: ggplot2
"package 'ggplot2' was built under R version 3.6.3"

In [22]:
train.rows = createDataPartition(y=BankData$Default_Payment,p=0.7,list = FALSE) 
train.data = BankData[train.rows,] 
# 70% of the data goes here

table(train.data$Default_Payment)/length(train.data$Default_Payment)


        0         1 
0.7787724 0.2212276 

In [23]:
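# Note: this draws a second, independent 30% sample, so some rows can also be in train.data;
# BankData[-train.rows, ] would give a test set disjoint from the training rows
test.rows = createDataPartition(y=BankData$Default_Payment,p=0.3,list = FALSE) 
test.data = BankData[test.rows,] 

# 30% of the data goes here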

table(test.data$Default_Payment)/length(test.data$Default_Payment)



        0         1 
0.7788024 0.2211976 
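For a test sample that shares no rows with the training sample, the held-out rows can be taken as the complement of the training indices. The sketch below shows a stratified 70/30 split in base R on synthetic data; `toy`, `toy_train`, `toy_test`, and the seed are hypothetical and only stand in for the notebook's objects.

```r
set.seed(2024)  # assumed seed, for reproducibility of the sketch

# Synthetic stand-in with a 78% / 22% class split
toy <- data.frame(
  id = 1:1000,
  Default_Payment = factor(rep(c(0, 1), times = c(780, 220)))
)

# Sample 70% of the row indices within each class (stratified)
idx_by_class <- split(seq_len(nrow(toy)), toy$Default_Payment)
train_idx <- unlist(lapply(idx_by_class,
                           function(i) sample(i, round(0.7 * length(i)))))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]  # everything not used for training

# No row is shared, and the class proportions are preserved
length(intersect(toy_train$id, toy_test$id))
prop.table(table(toy_train$Default_Payment))
prop.table(table(toy_test$Default_Payment))
```

With the objects in this notebook, the analogous held-out set would be `BankData[-train.rows, ]`.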

In [24]:
nrow(train.data) 
nrow(test.data)

### Build the Logistic regression model

In [25]:
# Fit on all other columns; note that "~ ." also includes Customer.ID, a row identifier
glm_full_model = glm(Default_Payment ~ ., family = "binomial", data = train.data)

In [26]:
summary(glm_full_model)


Call:
glm(formula = Default_Payment ~ ., family = "binomial", data = train.data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2616  -0.6031  -0.5165  -0.3408   3.3025  

Coefficients:
                                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)                        -3.432e+00  6.692e-01  -5.128 2.93e-07 ***
Customer.ID                        -1.925e-07  2.182e-06  -0.088 0.929700    
Credit_Amount                      -1.343e-06  1.776e-07  -7.562 3.98e-14 ***
GenderFemale                       -1.468e-01  3.835e-02  -3.828 0.000129 ***
Academic_QualificationGraduate     -3.946e-02  4.420e-02  -0.893 0.371909    
Academic_QualificationPostgraduate -6.529e-02  5.906e-02  -1.106 0.268936    
Academic_QualificationProfessional -9.892e-01  4.323e-01  -2.288 0.022136 *  
Academic_QualificationOthers       -1.175e+00  2.877e-01  -4.085 4.41e-05 ***
Academic_QualificationUnknown      -7.848e-01  5.040e-01  -1.557 0.119462    
Marital1 

## Predicting probabilities obtained from the model

### Predict the Y-values

In [27]:
predict_full_model = predict(glm_full_model,test.data,type = "response")

### Note: type = "response" makes predict() return predicted probabilities instead of log-odds
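The relation between the two prediction scales can be checked on a small example. The sketch below fits a logistic regression on R's built-in `mtcars` data, which is used only as a stand-in and is not part of this analysis, and compares `type = "link"` with `type = "response"`.

```r
# Small logistic regression on a built-in dataset (illustration only)
fit <- glm(am ~ wt, family = "binomial", data = mtcars)

log_odds <- predict(fit, mtcars, type = "link")      # linear predictor (log-odds)
prob     <- predict(fit, mtcars, type = "response")  # probabilities in [0, 1]

# The probabilities are the inverse logit of the log-odds
all.equal(prob, plogis(log_odds))
range(prob)
```

`plogis()` is the logistic function 1 / (1 + exp(-x)), so applying it to the log-odds reproduces the `type = "response"` values.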

In [28]:
head(predict_full_model) 
table(train.data$Default_Payment)
# Label a customer as a defaulter (1) when the predicted probability exceeds 0.5
predictions_full_model = ifelse(predict_full_model <= 0.5, 0, 1)



    0     1 
16355  4646 

In [29]:
library(caret) 
library(e1071)

"package 'e1071' was built under R version 3.6.3"

### Build the confusion matrix

In [30]:
table(predicted = predictions_full_model,actual = test.data$Default_Payment) 
caret::confusionMatrix(as.factor(predictions_full_model),as.factor(test.data$Default_Payment))

         actual
predicted    0    1
        0 6675 1280
        1  335  711

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6675 1280
         1  335  711
                                          
               Accuracy : 0.8206          
                 95% CI : (0.8125, 0.8285)
    No Information Rate : 0.7788          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3726          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9522          
            Specificity : 0.3571          
         Pos Pred Value : 0.8391          
         Neg Pred Value : 0.6797          
             Prevalence : 0.7788          
         Detection Rate : 0.7416          
   Detection Prevalence : 0.8838          
      Balanced Accuracy : 0.6547          
                                          
       'Positive' Class : 0               
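The headline statistics can be recomputed by hand from the four cell counts printed above, which makes it easier to see which class each one refers to. In this sketch the counts are copied from the confusion matrix, while the variable names are hypothetical. Because the positive class is 0 (non-defaulters), the reported Sensitivity describes non-defaulters, and the share of actual defaulters that the model catches is the value reported as Specificity.

```r
# Counts copied from the confusion matrix above (rows: predicted, columns: actual)
cm <- matrix(c(6675, 335, 1280, 711), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)

# With class 0 as "positive": sensitivity is the share of actual 0s predicted as 0
sensitivity_class0 <- cm["0", "0"] / sum(cm[, "0"])

# Share of actual defaulters (class 1) predicted as 1; reported above as Specificity
recall_defaulters <- cm["1", "1"] / sum(cm[, "1"])

# Share of predicted defaulters that really defaulted; reported above as Neg Pred Value
precision_defaulters <- cm["1", "1"] / sum(cm["1", ])

balanced_accuracy <- (sensitivity_class0 + recall_defaulters) / 2

round(c(accuracy = accuracy,
        sensitivity_class0 = sensitivity_class0,
        recall_defaulters = recall_defaulters,
        precision_defaulters = precision_defaulters,
        balanced_accuracy = balanced_accuracy), 4)
```

Read this way, the model identifies about 36% of the actual defaulters at the 0.5 cutoff, even though overall accuracy is about 82%.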
                        