# Example: Credit Card Default

We have seen this example in class. Our task is to fit logistic regression models and interpret the results.

---
* Model the default (binary variable) using various predictors
    * Model 1: default ~ balance
    * Model 2: default ~ student
    * Model 3: default ~ balance + income + student
    
* Discuss the results
    * What can you say about the coefficient for student in Models 2 and 3? Can you explain the difference?
    * Using Model 1, what is our estimated probability of default for someone with a balance of USD1000? How about USD2000?
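
For reference when answering the probability question: a logistic regression with a single predictor models the probability of default as shown below. The symbols $\beta_0$ (intercept) and $\beta_1$ (coefficient on balance) are notation introduced here for illustration; they do not appear in the notebook itself.

```latex
\[
p(\text{default} = 1 \mid \text{balance})
  = \frac{e^{\beta_0 + \beta_1 \cdot \text{balance}}}
         {1 + e^{\beta_0 + \beta_1 \cdot \text{balance}}},
\qquad
\log\frac{p}{1 - p} = \beta_0 + \beta_1 \cdot \text{balance}.
\]
```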

In [2]:
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

In [8]:
df = pd.read_csv("default_data.csv", index_col=0)
df.head(10)

Unnamed: 0,default,student,balance,income
1,No,No,729.526495,44361.625074
2,No,Yes,817.180407,12106.1347
3,No,No,1073.549164,31767.138947
4,No,No,529.250605,35704.493935
5,No,No,785.655883,38463.495879
6,No,Yes,919.58853,7491.558572
7,No,No,825.513331,24905.226578
8,No,Yes,808.667504,17600.451344
9,No,No,1161.057854,37468.529288
10,No,No,0.0,29275.268293


In [9]:
# manually add an intercept column (sm.Logit does not add one automatically)
df['intercept'] = 1.0

# encode the categorical Yes/No variables as 1/0
df['default_num'] = np.where(df['default'] == 'Yes', 1, 0)
df['student_num'] = np.where(df['student'] == 'Yes', 1, 0)

display(df.head(10))

Unnamed: 0,default,student,balance,income,intercept,default_num,student_num
1,No,No,729.526495,44361.625074,1.0,0,0
2,No,Yes,817.180407,12106.134700,1.0,0,1
3,No,No,1073.549164,31767.138947,1.0,0,0
4,No,No,529.250605,35704.493935,1.0,0,0
5,No,No,785.655883,38463.495879,1.0,0,0
6,No,Yes,919.588530,7491.558572,1.0,0,1
7,No,No,825.513331,24905.226578,1.0,0,0
8,No,Yes,808.667504,17600.451344,1.0,0,1
9,No,No,1161.057854,37468.529288,1.0,0,0
10,No,No,0.000000,29275.268293,1.0,0,0


In [5]:
# Model 1: default ~ balance
cols_to_keep = ['balance', 'intercept']
logit  = sm.Logit(df['default_num'], df[cols_to_keep])
result = logit.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.079823
         Iterations 10


Dep. Variable:,default_num,No. Observations:,10000.0
Model:,Logit,Df Residuals:,9998.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 05 Nov 2018",Pseudo R-squ.:,0.4534
Time:,14:47:22,Log-Likelihood:,-798.23
converged:,True,LL-Null:,-1460.3
,,LLR p-value:,6.233e-290

,coef,std err,z,P>|z|,[0.025,0.975]
balance,0.0055,0.000,24.952,0.000,0.005,0.006
intercept,-10.6513,0.361,-29.491,0.000,-11.359,-9.943
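
To answer the probability question for Model 1, one option is to plug the fitted coefficients into the logistic function by hand. The sketch below hard-codes the rounded values from the summary table above (0.0055 and -10.6513), so its results are approximate, and the helper name `default_probability` is hypothetical. With the fitted `result` object still in scope, calling `result.predict` on rows of the form `[balance, 1.0]` should give the unrounded equivalents.

```python
import math

# Rounded coefficients copied from the Model 1 summary table.
BETA_BALANCE = 0.0055
BETA_INTERCEPT = -10.6513


def default_probability(balance):
    """Estimated P(default) for a given balance under Model 1."""
    log_odds = BETA_INTERCEPT + BETA_BALANCE * balance
    return 1.0 / (1.0 + math.exp(-log_odds))


for balance in (1000, 2000):
    print(f"balance = {balance}: p = {default_probability(balance):.4f}")
```

With these rounded coefficients the estimate is roughly 0.6% at a balance of USD1000 and roughly 59% at USD2000.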


In [6]:
# Model 2: default ~ student
cols_to_keep = ['student_num','intercept']
logit  = sm.Logit(df['default_num'], df[cols_to_keep])
result = logit.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.145434
         Iterations 7


Dep. Variable:,default_num,No. Observations:,10000.0
Model:,Logit,Df Residuals:,9998.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 05 Nov 2018",Pseudo R-squ.:,0.004097
Time:,14:47:32,Log-Likelihood:,-1454.3
converged:,True,LL-Null:,-1460.3
,,LLR p-value:,0.0005416

,coef,std err,z,P>|z|,[0.025,0.975]
student_num,0.4049,0.115,3.520,0.000,0.179,0.630
intercept,-3.5041,0.071,-49.554,0.000,-3.643,-3.366
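
The Model 2 coefficients can be read on the odds scale or converted into a probability for each group. This is a minimal sketch using the rounded values from the table above; the variable and function names are illustrative.

```python
import math

# Rounded coefficients copied from the Model 2 summary table.
BETA_STUDENT = 0.4049
BETA_INTERCEPT = -3.5041


def sigmoid(log_odds):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# exp(coefficient) is the multiplicative change in the odds of default
# when student_num goes from 0 to 1.
odds_ratio = math.exp(BETA_STUDENT)

p_non_student = sigmoid(BETA_INTERCEPT)
p_student = sigmoid(BETA_INTERCEPT + BETA_STUDENT)

print(f"odds ratio (student vs non-student): {odds_ratio:.3f}")
print(f"P(default | non-student): {p_non_student:.4f}")
print(f"P(default | student):     {p_student:.4f}")
```

In this model, which ignores balance and income, students have about 1.5 times the odds of default: an estimated probability of about 4.3% against about 2.9% for non-students.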


In [7]:
# Model 3: default ~ balance + income + student
cols_to_keep = ['balance','income','student_num','intercept']
logit  = sm.Logit(df['default_num'], df[cols_to_keep])
result = logit.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.078577
         Iterations 10


Dep. Variable:,default_num,No. Observations:,10000.0
Model:,Logit,Df Residuals:,9996.0
Method:,MLE,Df Model:,3.0
Date:,"Mon, 05 Nov 2018",Pseudo R-squ.:,0.4619
Time:,14:47:33,Log-Likelihood:,-785.77
converged:,True,LL-Null:,-1460.3
,,LLR p-value:,3.257e-292

,coef,std err,z,P>|z|,[0.025,0.975]
balance,0.0057,0.000,24.737,0.000,0.005,0.006
income,3.033e-06,8.2e-06,0.370,0.712,-1.3e-05,1.91e-05
student_num,-0.6468,0.236,-2.738,0.006,-1.110,-0.184
intercept,-10.8690,0.492,-22.079,0.000,-11.834,-9.904
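
The student coefficient is negative once balance and income are in the model. One way to see what that means is to compare a student and a non-student at the same balance and income. In the sketch below, the balance of 1500 and income of 40000 are arbitrary illustration values, the coefficients are the rounded ones from the table above, and `model3_probability` is a hypothetical helper.

```python
import math

# Rounded coefficients copied from the Model 3 summary table.
BETA_BALANCE = 0.0057
BETA_INCOME = 3.033e-06
BETA_STUDENT = -0.6468
BETA_INTERCEPT = -10.8690


def model3_probability(balance, income, is_student):
    """Estimated P(default) under Model 3 for one individual."""
    log_odds = (
        BETA_INTERCEPT
        + BETA_BALANCE * balance
        + BETA_INCOME * income
        + BETA_STUDENT * (1 if is_student else 0)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Arbitrary illustration values: same balance and income, different status.
p_student = model3_probability(1500, 40000, is_student=True)
p_non_student = model3_probability(1500, 40000, is_student=False)

print(f"student:     {p_student:.4f}")
print(f"non-student: {p_non_student:.4f}")
```

Holding balance and income fixed, the student has the lower estimated probability (roughly 5.5% against 10% with these rounded values), whereas Model 2 compares the two groups without holding balance fixed. Checking whether students tend to carry higher balances, for instance with `df.groupby('student')['balance'].mean()`, is one way to probe why the two models disagree.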
