# Binary Logistic Regression: Predict Default Risk 

## Data: Bankloan  
Analyze the data in bankloan.csv:  
* Build a logistic regression model
    * target: default
    * features: employ, debtinc, creddebt, othdebt
* Interpret the results
* Validate the predictions on a 20% test set

In [1]:
# library
import pandas as pd
import numpy as np

In [2]:
bankloan= pd.read_csv('bankloan.csv')
bankloan.head()

Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41,3,17,12,176,9.3,11.359392,5.008608,1
1,27,1,10,6,31,17.3,1.362202,4.000798,0
2,40,1,15,14,55,5.5,0.856075,2.168925,0
3,41,1,15,14,120,2.9,2.65872,0.82128,0
4,24,2,2,0,28,17.3,1.787436,3.056564,1


In [3]:
fitur=['employ','debtinc','creddebt','othdebt']
target='default'

In [4]:
x= bankloan[fitur]
y=bankloan[target]

In [6]:
x.describe()

Unnamed: 0,employ,debtinc,creddebt,othdebt
count,700.0,700.0,700.0,700.0
mean,8.388571,10.260571,1.553553,3.058209
std,6.658039,6.827234,2.117197,3.287555
min,0.0,0.4,0.011696,0.045584
25%,3.0,5.0,0.369059,1.044178
50%,7.0,8.6,0.854869,1.987567
75%,12.0,14.125,1.901955,3.923065
max,31.0,41.3,20.56131,27.0336


## Modelling

In [8]:
import statsmodels.api as sm

In [9]:
sm_logit=sm.Logit(y, sm.add_constant(x))
result=sm_logit.fit()

Optimization terminated successfully.
         Current function value: 0.411165
         Iterations 7


In [11]:
# check for multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(x):
    vif=pd.DataFrame()
    vif['variables']=x.columns
    vif['VIF']=[variance_inflation_factor(x.values, i) for i in range (x.shape[1])]
    return vif

In [12]:
calc_vif(x)  # no multicollinearity problem: every VIF is below 5

Unnamed: 0,variables,VIF
0,employ,2.222753
1,debtinc,3.045977
2,creddebt,2.816577
3,othdebt,4.116876
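
One caveat worth noting: `variance_inflation_factor` regresses each column on the others exactly as they are passed in, so without a constant column the values above are uncentered VIFs. As a rough cross-check, the sketch below computes VIF from its definition, 1 / (1 - R²), with an intercept in every auxiliary regression. It uses plain NumPy rather than statsmodels, and the data are synthetic (the columns `a`, `b`, `c` are hypothetical), so the numbers are illustrative only.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ beta
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Hypothetical synthetic data: c is almost a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)  # large values for a and c, close to 1 for b
```

On the bankloan features, the same function could be called as `vif_with_intercept(x.values)` and compared with the table above.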


In [10]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                default   No. Observations:                  700
Model:                          Logit   Df Residuals:                      695
Method:                           MLE   Df Model:                            4
Date:                Wed, 15 Dec 2021   Pseudo R-squ.:                  0.2844
Time:                        10:57:34   Log-Likelihood:                -287.82
converged:                       True   LL-Null:                       -402.18
Covariance Type:            nonrobust   LLR p-value:                 2.473e-48
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2302      0.236     -5.210      0.000      -1.693      -0.767
employ        -0.2436      0.029     -8.456      0.000      -0.300      -0.187
debtinc        0.0885      0.021      4.200      0.0

## Interpretation

$$\text{odds} = \frac{p}{1-p} = e^{\beta_0 + \beta_i X_i}$$
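
The link between log-odds, odds, and probability can be sketched with the coefficients reported in this notebook. The snippet below is only an illustration: it plugs the first row of `bankloan.head()` into the linear predictor by hand, using the rounded coefficients from the summary and interpretation cells, so the result will differ slightly from what `result.predict` returns.

```python
import math

# Rounded coefficients as reported in this notebook
coef = {"const": -1.2302, "employ": -0.2436, "debtinc": 0.0885,
        "creddebt": 0.5041, "othdebt": -0.004}

# Feature values of the first row shown by bankloan.head()
row = {"employ": 17, "debtinc": 9.3, "creddebt": 11.359392, "othdebt": 5.008608}

# Linear predictor (log-odds), then odds, then probability
log_odds = coef["const"] + sum(coef[name] * value for name, value in row.items())
odds = math.exp(log_odds)
probability = odds / (1 + odds)

print(f"log-odds={log_odds:.4f}, odds={odds:.4f}, probability={probability:.4f}")
```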

In [16]:
# B1: employ, coef = -0.2436
# negative coefficient: the larger employ is, the lower the probability of default

print(np.exp(-0.2436))  # odds ratio for a 1-year increase in employ
np.exp(0.2436)  # its reciprocal: the factor by which the odds shrink

# when years of employment (`employ`) increase by 1, the odds of default are multiplied by 0.78,
# i.e. they shrink by a factor of about 1.28 (roughly a 22% decrease), other features held constant


0.7838010920039364


1.2758338948511923

In [18]:
# probability of default when years employed = 10 (intercept and other features left out, i.e. treated as 0)

np.exp(10*-0.2436)/(1+np.exp(10*-0.2436))

0.08046838884058552

In [19]:
# probability of default when years employed = 11 (same simplification as above)
np.exp(11*-0.2436)/(1+np.exp(11*-0.2436))

0.06418789934222706

In [22]:
# ratio of default probabilities: employed 10 years / employed 11 years
# (a probability ratio, so it is not equal to the odds ratio 1.2758)
(np.exp(10*-0.2436)/(1+np.exp(10*-0.2436)))/(np.exp(11*-0.2436)/(1+np.exp(11*-0.2436)))

1.2536379857448938
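
The gap between 1.2536 and 1.2758 comes from comparing probabilities rather than odds. The sketch below illustrates this under an assumption: `rest` stands for a hypothetical value of the remaining part of the linear predictor (intercept plus other features). The odds ratio stays at exp(0.2436) whatever `rest` is, while the probability ratio moves with it.

```python
import math

B_EMPLOY = -0.2436  # employ coefficient from the summary

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-z))

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

results = []
# rest = 0.0 reproduces the cells above; the other values are hypothetical
for rest in (0.0, -1.2302, 1.0):
    p10 = sigmoid(rest + 10 * B_EMPLOY)
    p11 = sigmoid(rest + 11 * B_EMPLOY)
    results.append((rest, p10 / p11, odds(p10) / odds(p11)))

for rest, prob_ratio, odds_ratio in results:
    print(f"rest={rest:+.4f}  probability ratio={prob_ratio:.4f}  odds ratio={odds_ratio:.4f}")
```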

In [29]:
# B2 (debtinc)
np.exp(0.0885)
# when debtinc increases by 1 point, the odds of default are multiplied by 1.0925

1.0925342526104793

In [30]:
# B3 (creddebt)
np.exp(0.5041*5)
# when creddebt increases by 5 points, the odds of default are multiplied by 12.435
# (for a 1-point increase the factor is exp(0.5041), about 1.655)

# othdebt is not significant, so it is not interpreted

12.434812515742879

In [25]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                default   No. Observations:                  700
Model:                          Logit   Df Residuals:                      695
Method:                           MLE   Df Model:                            4
Date:                Wed, 15 Dec 2021   Pseudo R-squ.:                  0.2844
Time:                        11:16:53   Log-Likelihood:                -287.82
converged:                       True   LL-Null:                       -402.18
Covariance Type:            nonrobust   LLR p-value:                 2.473e-48
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2302      0.236     -5.210      0.000      -1.693      -0.767
employ        -0.2436      0.029     -8.456      0.000      -0.300      -0.187
debtinc        0.0885      0.021      4.200      0.0

LLR test (LLR p-value), analogous to the F-test in linear regression:  
b1, b2, b3, b4: p-value <= 0.05, so we have enough evidence that at least one of the features affects the target.  
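
As a sanity check, the LLR statistic can be recomputed from the two log-likelihoods in the summary. This is a minimal sketch using only the standard library: it relies on the closed-form upper tail of a chi-square distribution with 4 degrees of freedom, so the helper `chi2_sf_df4` (a name made up here) is valid only for this model's Df Model = 4. Because the log-likelihoods are rounded, the p-value should land near, but not exactly on, the reported 2.473e-48.

```python
import math

ll_model = -287.82  # Log-Likelihood from the summary
ll_null = -402.18   # LL-Null from the summary

# Likelihood-ratio statistic: twice the gain in log-likelihood
llr_stat = 2 * (ll_model - ll_null)

def chi2_sf_df4(x):
    """Upper-tail probability of a chi-square variable with 4 degrees of freedom."""
    half = x / 2
    return math.exp(-half) * (1 + half)

p_value = chi2_sf_df4(llr_stat)
print(f"LLR statistic = {llr_stat:.2f}, p-value = {p_value:.3e}")
```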

Partial test: Wald test (analogous to the t-test)
* b0 (constant): p-value <= 0.05, so the model needs an intercept
* b1 (employ): p-value <= 0.05, so we have enough evidence that a change in employ affects the likelihood of default
* b2 (debtinc): p-value <= 0.05, so we have enough evidence that a change in debtinc affects the likelihood of default
* b3 (creddebt): p-value <= 0.05, so we have enough evidence that a change in creddebt affects the likelihood of default
* b4 (othdebt): p-value > 0.05, so we **do not have enough evidence** that a change in othdebt affects the likelihood of default  
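
The z and P>|z| columns behind these tests can be approximated by hand: the Wald statistic is the coefficient divided by its standard error, and the two-sided p-value comes from the standard normal distribution. The sketch below uses the rounded values printed in the summary, so the z values differ slightly from the table (for example about -8.40 instead of -8.456 for employ). `wald_test` is a helper name introduced here for illustration.

```python
import math

def wald_test(coef, std_err):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = coef / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Rounded coef and std err values from the summary table
z_const, p_const = wald_test(-1.2302, 0.236)
z_employ, p_employ = wald_test(-0.2436, 0.029)

print(f"const:  z={z_const:.3f}, p={p_const:.2e}")
print(f"employ: z={z_employ:.3f}, p={p_employ:.2e}")
```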

Parameter Estimates (coef):
* B1 (employ) = -0.2436: longer employment lowers the likelihood of default. When years of employment (`employ`) increase by 1, the odds of default are multiplied by 0.78, i.e. they shrink by a factor of about 1.28.

* B2 (debtinc) = 0.0885: a higher debt-to-income ratio raises the likelihood of default. When the debt-to-income ratio rises by 1 point, the odds of default are multiplied by 1.0925.

* B3 (creddebt) = 0.5041: higher creddebt raises the likelihood of default. When creddebt increases by 5 points, the odds of default are multiplied by 12.435.

* B4 (othdebt) = -0.004: not interpreted because the test result is not significant
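
The figures in the list above can be reproduced in one place. The sketch below is a small convenience, not part of the original analysis: `odds_ratio` and `pct_change_in_odds` are helper names introduced here, and the confidence bounds for employ are simply the exponentiated [0.025, 0.975] limits printed in the summary.

```python
import math

# Significant coefficients as reported in the summary
coefs = {"employ": -0.2436, "debtinc": 0.0885, "creddebt": 0.5041}

def odds_ratio(coef, delta=1.0):
    """Multiplicative change in the odds when the feature rises by delta."""
    return math.exp(coef * delta)

def pct_change_in_odds(coef, delta=1.0):
    """Same change expressed as a percentage of the original odds."""
    return (odds_ratio(coef, delta) - 1) * 100

rows = {name: (odds_ratio(b), pct_change_in_odds(b)) for name, b in coefs.items()}
for name, (ratio, pct) in rows.items():
    print(f"{name:9s} odds ratio={ratio:.4f}  change in odds={pct:+.1f}%")

# 5-point increase in creddebt, as used in the interpretation
creddebt_5 = odds_ratio(coefs["creddebt"], delta=5)

# 95% interval for the employ odds ratio, from the summary's coefficient bounds
employ_ci = (math.exp(-0.300), math.exp(-0.187))
print(f"creddebt +5: {creddebt_5:.3f}, employ odds ratio CI: {employ_ci}")
```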

**It is strongly recommended to make predictions only within the observed range of each feature.**

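Coefficient of Determination (Pseudo R-squared)  
Pseudo R-squared = 0.2844 (28.44%): this is McFadden's measure, 1 - LL/LL-Null, so the fitted model improves the log-likelihood by 28.44% over an intercept-only model. It is often read loosely as the model explaining 28.44% of the variation in the likelihood of default, but it is not a share of variance in the way R-squared is in linear regression.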
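
The reported value can be checked directly from the summary. The short sketch below recomputes McFadden's pseudo R-squared and, as a side check, the average negative log-likelihood per observation, which appears to be what the optimizer prints as "Current function value".

```python
ll_model = -287.82  # Log-Likelihood from the summary
ll_null = -402.18   # LL-Null (intercept-only model)
n_obs = 700         # No. Observations

mcfadden_r2 = 1 - ll_model / ll_null
avg_neg_loglik = -ll_model / n_obs

print(f"McFadden pseudo R-squared: {mcfadden_r2:.4f}")
print(f"Average negative log-likelihood: {avg_neg_loglik:.6f}")
```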

# Model Validation
Train: 80%  
Test: 20%

In [33]:
# !pip install -U scikit-learn

In [32]:
from sklearn.model_selection import train_test_split  # splits the data at random
from sklearn.metrics import accuracy_score  # model evaluation metric

In [42]:
y.value_counts(normalize=True)  # class proportions in the full data

0    0.738571
1    0.261429
Name: default, dtype: float64

In [50]:
# split the data

x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    stratify=y,  # keep the class proportions of y in both splits
                                                    test_size=0.2,
                                                    random_state=2020)  # make the shuffle reproducible

In [51]:
y_train.value_counts(normalize=True)  # with stratify, the class 1 / class 0 ratio in train and test matches the original data

0    0.739286
1    0.260714
Name: default, dtype: float64
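
To see why the training proportions match the original data so closely, here is a minimal sketch of the idea behind a stratified split, written with the standard library only. It is not scikit-learn's implementation: `stratified_split` is a hypothetical helper that shuffles each class separately and sends about 20% of each to the test set. The class counts (517 and 183) are derived from the proportions shown above.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2020):
    """Split indices so each class contributes about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 700 labels with the same class balance as `default`
labels = [0] * 517 + [1] * 183
train_idx, test_idx = stratified_split(labels)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(train_share, 4), round(test_share, 4))
```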

In [53]:
x_test.shape

(140, 4)

In [54]:
x_train.shape

(560, 4)

In [57]:
sm_logit_train= sm.Logit(y_train,sm.add_constant(x_train))
result = sm_logit_train.fit()  # the model learns from the training data

Optimization terminated successfully.
         Current function value: 0.411145
         Iterations 7


In [61]:
y_predict_proba = result.predict(sm.add_constant(x_test))  # predicted probability of default on the test data
y_predict_proba.head()

118    0.584807
309    0.308932
339    0.308407
686    0.398796
639    0.236440
dtype: float64

In [63]:
y_predict_class = np.where(y_predict_proba > 0.5, 1, 0)
# a predicted probability above 0.5 goes to class 1 (default), otherwise to class 0 (no default)
y_predict_class

array([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 0, 0, 0])
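
The 0.5 cut-off is a choice, not a property of the model. The sketch below applies two thresholds to the five probabilities shown by `y_predict_proba.head()`; the alternative threshold of 0.3 is a hypothetical value picked only to show how the predicted classes shift.

```python
# First five predicted probabilities from y_predict_proba.head()
probabilities = [0.584807, 0.308932, 0.308407, 0.398796, 0.236440]

def classify(probs, threshold=0.5):
    """Assign class 1 (default) when the probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]

at_050 = classify(probabilities)                 # threshold used in the notebook
at_030 = classify(probabilities, threshold=0.3)  # hypothetical lower threshold

print("threshold 0.5:", at_050)
print("threshold 0.3:", at_030)
```

A lower threshold flags more customers as likely defaulters, which tends to catch more true defaults at the cost of more false alarms.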

In [65]:
print('model accuracy on the test set:', accuracy_score(y_test, y_predict_class))

model accuracy on the test set: 0.8214285714285714


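The model's accuracy is 82.14%, meaning it gets roughly 8 out of 10 test-set predictions right. For comparison, about 74% of customers in this data do not default, so always predicting class 0 would already be right about 74% of the time.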
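
Because the classes are imbalanced, accuracy alone can hide how well defaults are detected. The sketch below shows how accuracy, precision, and recall come out of the four confusion-matrix counts. The labels are a hypothetical toy example, not the notebook's `y_test`, and `confusion_counts` is a helper written for this illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

# Hypothetical toy labels: 7 non-defaulters and 3 defaulters
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true_toy, y_pred_toy)
accuracy = (tp + tn) / len(y_true_toy)
precision = tp / (tp + fp)  # share of predicted defaults that were real
recall = tp / (tp + fn)     # share of real defaults that were caught

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

On the real test set, the same quantities could be obtained by passing `y_test` and `y_predict_class` to this helper.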

# Exercise  

Use the titanic dataset from seaborn
* target: `survived`
* features: pclass, sex, age, fare
* fill in missing values, if any
* check for multicollinearity
* split the data into train and test sets, 80:20
* interpret the model summary
* check the model's accuracy on the test data

In [66]:
import seaborn as sns
sns.load_dataset('titanic')

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
