# Module 03

## Session 03 Supervised Learning Classification

## Logistic Regression

What to do in this chapter:
1. Build a logistic regression model
> * target: default
> * features: employ, debtinc, creddebt, othdebt
2. Interpret the result
3. Validate the model using accuracy on a 20% test set

------------------------

# 1. Build a logistic regression model

## Data

In [45]:
import pandas as pd
import numpy as np

In [47]:
bankloan = pd.read_csv('./datasets/bankloan.csv')
bankloan

Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41,3,17,12,176,9.3,11.359392,5.008608,1
1,27,1,10,6,31,17.3,1.362202,4.000798,0
2,40,1,15,14,55,5.5,0.856075,2.168925,0
3,41,1,15,14,120,2.9,2.658720,0.821280,0
4,24,2,2,0,28,17.3,1.787436,3.056564,1
...,...,...,...,...,...,...,...,...,...
695,36,2,6,15,27,4.6,0.262062,0.979938,1
696,29,2,6,4,21,11.5,0.369495,2.045505,0
697,33,1,15,3,32,7.6,0.491264,1.940736,0
698,45,1,19,22,77,8.4,2.302608,4.165392,0


note:
1. age : customer age
2. ed : education
3. employ : employment duration (years)
4. address : duration of stay
5. income : income per month
6. debtinc : debt as a percentage of income
7. creddebt : debt in thousands of dollars
8. othdebt : other debt in thousands of dollars

In [50]:
feature = ['employ', 'debtinc', 'creddebt', 'othdebt']
target = 'default'

X = bankloan[feature]
y = bankloan[target]

In [51]:
X.describe()

Unnamed: 0,employ,debtinc,creddebt,othdebt
count,700.0,700.0,700.0,700.0
mean,8.388571,10.260571,1.553553,3.058209
std,6.658039,6.827234,2.117197,3.287555
min,0.0,0.4,0.011696,0.045584
25%,3.0,5.0,0.369059,1.044178
50%,7.0,8.6,0.854869,1.987567
75%,12.0,14.125,1.901955,3.923065
max,31.0,41.3,20.56131,27.0336


----------------------

## Model: Logistic Regression

In [52]:
import statsmodels.api as sm

In [53]:
model = sm.Logit(y, sm.add_constant(X))
result = model.fit()

Optimization terminated successfully.
         Current function value: 0.411165
         Iterations 7


In [54]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                default   No. Observations:                  700
Model:                          Logit   Df Residuals:                      695
Method:                           MLE   Df Model:                            4
Date:                Fri, 09 Jul 2021   Pseudo R-squ.:                  0.2844
Time:                        16:10:29   Log-Likelihood:                -287.82
converged:                       True   LL-Null:                       -402.18
Covariance Type:            nonrobust   LLR p-value:                 2.473e-48
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2302      0.236     -5.210      0.000      -1.693      -0.767
employ        -0.2436      0.029     -8.456      0.000      -0.300      -0.187
debtinc        0.0885      0.021      4.200      0.0

output:
1. LLR p-value: 2.473e-48
2. p-value:
    * const: 0.000
    * employ: 0.000
    * debtinc: 0.000
    * creddebt: 0.000
    * othdebt: 0.94
3. coef (only for variables with p-value below 0.05):
    * employ: -0.2436
    * debtinc: 0.0885
    * creddebt: 0.5041

In this model, default = 1 denotes a high-risk account (bad payment).
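The headline figures in the summary are tied together by simple formulas. The sketch below recomputes them from the rounded log-likelihoods printed above, assuming that `Pseudo R-squ.` is McFadden's pseudo R-squared and that `Current function value` is the average negative log-likelihood per observation. The closed form used for the p-value is the chi-square survival function for exactly 4 degrees of freedom. Small differences from the printed values are expected because the inputs are rounded.

```python
import math

# Values copied from the Logit summary printed above (rounded as shown there).
n_obs = 700
ll_model = -287.82   # Log-Likelihood
ll_null = -402.18    # LL-Null (intercept-only model)

# McFadden pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - ll_model / ll_null

# Average negative log-likelihood per observation
mean_nll = -ll_model / n_obs

# Likelihood-ratio statistic comparing the fitted model with the null model
llr_stat = 2 * (ll_model - ll_null)

# Chi-square survival function for 4 degrees of freedom (Df Model = 4):
# P(X > x) = exp(-x/2) * (1 + x/2)
half = llr_stat / 2
llr_pvalue = math.exp(-half) * (1 + half)

print("pseudo R-squared:", round(pseudo_r2, 4))
print("mean negative log-likelihood:", round(mean_nll, 6))
print("LLR statistic:", round(llr_stat, 2))
print("LLR p-value:", llr_pvalue)
```

The recomputed values should land close to the printed `0.2844`, `0.411165` and `2.473e-48`.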

interpretation:
1. LLR p-value: 2.473e-48 < 0.05 (reject the null hypothesis), meaning at least one variable significantly affects the risk of default
2. p-value:
    * const: 0.000 < 0.05 (reject), meaning the model needs an intercept
    * employ: 0.000 < 0.05 (reject), the association between employ and the risk of default is significant and negative (employ value range: 0-31)
    * debtinc: 0.000 < 0.05 (reject), the association between debtinc and the risk of default is significant and positive (debtinc value range: 0.4-41.3%)
    * creddebt: 0.000 < 0.05 (reject), the association between creddebt and the risk of default is significant and positive (creddebt value range: 0.012-20.56, in thousands of dollars)
    * othdebt: 0.94 > 0.05 (fail to reject), not significant for the risk of default
3. coef (only for variables with p-value below 0.05):
    * employ: -0.2436, when employment duration increases by 5 years, the odds of default decrease by a factor of about 3.38, i.e. they are multiplied by exp(-0.2436*5), about 0.30 (holding the other variables constant)
    * debtinc: 0.0885, when debtinc increases by 5 percentage points, the odds of default are multiplied by about 1.56 (holding the other variables constant)
    * creddebt: 0.5041, when creddebt increases by 5000 dollars, the odds of default are multiplied by about 12.43 (holding the other variables constant)

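Interpreting the coefficients:<br>
exponentiate the coefficient to obtain an odds ratio<br>
coef --> Odds Ratio (OR) = exp(beta*(c-a)), where the variable changes from a to c and the other variables are held constant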

In [57]:
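# employ (the coefficient is -0.2436; the sign is dropped here,
# so the result is the factor by which the odds decrease)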
c = 20
a = 15

OR_employ = np.exp(0.2436*(c-a))
print(OR_employ)

3.380420128015566


In [58]:
# debtinc
c = 20
a = 15

OR_debtinc = np.exp(0.0885*(c-a))
print(OR_debtinc)

1.5565938428137092


In [59]:
# creddebt
c = 20
a = 15

OR_creddebt = np.exp(0.5041*(c-a))
print(OR_creddebt)

12.434812515742879
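The three cells above can be folded into one helper that keeps the sign of each coefficient, so that a value below 1 means the odds go down and a value above 1 means they go up. This is a minimal sketch: `odds_ratio` and the `coefs` dictionary are names introduced here for illustration, and the coefficients are copied by hand from the summary rather than read from `result.params`.

```python
import math

def odds_ratio(beta, a, c):
    """Odds ratio when a predictor moves from a to c, other predictors held fixed."""
    return math.exp(beta * (c - a))

# Coefficients copied from the Logit summary (significant variables only).
coefs = {"employ": -0.2436, "debtinc": 0.0885, "creddebt": 0.5041}

# Same 5-unit change as in the cells above (a = 15, c = 20).
ratios = {name: odds_ratio(beta, a=15, c=20) for name, beta in coefs.items()}

for name, value in ratios.items():
    direction = "decrease" if value < 1 else "increase"
    print(f"{name}: OR = {value:.4f} (odds {direction})")
```

For `employ` the signed odds ratio is the reciprocal of the 3.38 printed above, which is the same statement read in the other direction.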


---------------------

## Multicollinearity

In [60]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [61]:
def calc_vif(X):
    
    vif = pd.DataFrame()
    vif['variable'] = X.columns
    vif['vif'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    
    return vif

In [63]:
calc_vif(X)

Unnamed: 0,variable,vif
0,employ,2.222753
1,debtinc,3.045977
2,creddebt,2.816577
3,othdebt,4.116876


Interpretation:

The VIF values range from about 2.2 to 4.1, all below the commonly used rule-of-thumb threshold of 5, so the result indicates no serious multicollinearity.
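One caveat: `calc_vif(X)` was called on `X` without a constant column. To my understanding, `variance_inflation_factor` regresses each column on the remaining columns exactly as they are passed, so without a constant the auxiliary regressions have no intercept and the resulting VIFs can differ from the usual (centered) definition. The sketch below is a NumPy-only reimplementation of the centered VIF on synthetic data, not the statsmodels function itself. `centered_vif` and the simulated columns are assumptions made for illustration.

```python
import numpy as np

def centered_vif(X):
    """VIF per column, with an intercept in every auxiliary regression."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Regress column i on an intercept plus all other columns.
        design = np.column_stack([np.ones(n_rows), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data: x1 and x2 are independent, x3 is almost their sum.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + 0.1 * rng.normal(size=500)

vif_independent = centered_vif(np.column_stack([x1, x2]))
vif_collinear = centered_vif(np.column_stack([x1, x2, x3]))

print("independent columns:", vif_independent)
print("with a near-collinear column:", vif_collinear)
```

With the notebook's own tools, a comparable check would be to pass `sm.add_constant(X)` to `calc_vif` and ignore the row for `const`.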

---------------------

## Validation

Validation on the 20% test data

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [75]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size = 0.2,
    random_state = 2020
)

In [76]:
sm_logit_train = sm.Logit(y_train, sm.add_constant(X_train))
result_train = sm_logit_train.fit()

Optimization terminated successfully.
         Current function value: 0.408607
         Iterations 7


In [77]:
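# predicted probability of default for each test account
y_prob = result_train.predict(sm.add_constant(X_test))

# predicted class: 1 (default) if the probability exceeds 0.5, otherwise 0 (non-default)
y_class = np.where(y_prob > 0.5, 1, 0)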

In [78]:
print('Accuracy: ', accuracy_score(y_test, y_class)*100, '%')

Accuracy:  82.14285714285714 %


interpretation:

On the 20% test set, the model classified about 82% of accounts correctly (roughly 82 out of every 100 accounts).
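Accuracy on its own depends on the 0.5 cutoff and on how many accounts actually default, so it helps to compare it with a majority-class baseline and to look at the confusion counts. The sketch below uses small made-up arrays (`y_true` and `y_prob` are hypothetical, not taken from `bankloan.csv`) to show the calculation.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.10, 0.20, 0.30, 0.60, 0.05, 0.40, 0.80, 0.45, 0.70, 0.20])

# Same rule as in the notebook: class 1 when the probability exceeds 0.5.
y_pred = np.where(y_prob > 0.5, 1, 0)

# Accuracy: share of predictions that match the true label.
accuracy = np.mean(y_pred == y_true)

# Baseline: always predict the most frequent class.
majority_class = np.bincount(y_true).argmax()
baseline_accuracy = np.mean(y_true == majority_class)

# Confusion counts.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

# Recall: share of actual defaulters that the model catches.
recall = tp / (tp + fn)

print("accuracy:", accuracy)
print("baseline accuracy:", baseline_accuracy)
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("recall:", recall)
```

Applied to the notebook's own `y_test` and `y_class`, the same comparison would show how much of the 82% comes from the model rather than from class imbalance.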