# Binary Logistic Regression

## Framework  

Problem: How to predict default risk of the new applicant so we can allocate loan efficiently and increase profit from loan ?  
Data:  
* What is being predicted ? default risk of the new applicant  
* What is needed in prediction ? Demographical, Transaction behaviour, income, ect  

ML Objective: Maximize (profit - potential revenue lost)   
Action: Do not allocate loan to a customer when the risk is too high, higher than 50%  
Value: Profit increade



## Bankloan 

Analyze data bankloan.csv    
* build a logistics regression model  
    * target : default
    * features : employ, debtinc, creddebt, othdebt  
        
Interpret The Result  
Validate the model using accuracy in 20% testing data   


In [1]:
import pandas as pd
import numpy as np

In [3]:
bankloan=pd.read_csv('bankloan.csv')
bankloan

Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41,3,17,12,176,9.3,11.359392,5.008608,1
1,27,1,10,6,31,17.3,1.362202,4.000798,0
2,40,1,15,14,55,5.5,0.856075,2.168925,0
3,41,1,15,14,120,2.9,2.658720,0.821280,0
4,24,2,2,0,28,17.3,1.787436,3.056564,1
...,...,...,...,...,...,...,...,...,...
695,36,2,6,15,27,4.6,0.262062,0.979938,1
696,29,2,6,4,21,11.5,0.369495,2.045505,0
697,33,1,15,3,32,7.6,0.491264,1.940736,0
698,45,1,19,22,77,8.4,2.302608,4.165392,0


In [4]:
fitur=['employ', 'debtinc', 'creddebt', 'othdebt']
target='default'

In [5]:
x=bankloan[fitur]
y=bankloan[target]

In [6]:
x.describe()

Unnamed: 0,employ,debtinc,creddebt,othdebt
count,700.0,700.0,700.0,700.0
mean,8.388571,10.260571,1.553553,3.058209
std,6.658039,6.827234,2.117197,3.287555
min,0.0,0.4,0.011696,0.045584
25%,3.0,5.0,0.369059,1.044178
50%,7.0,8.6,0.854869,1.987567
75%,12.0,14.125,1.901955,3.923065
max,31.0,41.3,20.56131,27.0336


### Modelling

In [7]:
import statsmodels.api as sm

In [8]:
sm_logit= sm.Logit(y,sm.add_constant(x))
result=sm_logit.fit()

Optimization terminated successfully.
         Current function value: 0.411165
         Iterations 7


In [9]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                default   No. Observations:                  700
Model:                          Logit   Df Residuals:                      695
Method:                           MLE   Df Model:                            4
Date:                Thu, 01 Jul 2021   Pseudo R-squ.:                  0.2844
Time:                        10:11:23   Log-Likelihood:                -287.82
converged:                       True   LL-Null:                       -402.18
Covariance Type:            nonrobust   LLR p-value:                 2.473e-48
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2302      0.236     -5.210      0.000      -1.693      -0.767
employ        -0.2436      0.029     -8.456      0.000      -0.300      -0.187
debtinc        0.0885      0.021      4.200      0.0

## Collinearity

In [10]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [11]:
calc_vif(x) #no problem with multicollinearity

Unnamed: 0,variables,VIF
0,employ,2.222753
1,debtinc,3.045977
2,creddebt,2.816577
3,othdebt,4.116876


### Model Intepretation

odd= exp^(bo+bi*Xi)  
exp^(-1.2302+ bi*Xi)

In [14]:
#B1 (employ)
print(np.exp(-0.2436*1)) # odd (2 tahun)/odd(1 tahun), di mana odd( 2 tahun) < odd(1 tahun)
np.exp(0.2436*1)  # odd (1 tahun)/odd(2 tahun)

# ketika tahun bekerja meningkat 1 tahun dan variabel lainnya tetap, maka kemungkinan untuk gagal bayar menurun 1.28 kali.

0.7838010920039364


1.2758338948511923

In [20]:
np.exp(-1.2302-(0.2436*1))/np.exp(-1.2302-(0.2436*2))

1.2758338948511923

In [15]:
# B2 (debtinc)
np.exp(0.0885*1)
# ketika ratio antara hutang dan pendapatan meningkat 1 poin, maka kemungkinan untuk gagal bayar meningkat 1.093 kali.

1.0925342526104793

In [17]:
# B3 (creddebt)  
np.exp(0.5041*5)  
# ketika jumlah creddebt bertambah 5 USD, maka kemungkinan untuk gagal bayar meningkat 12.43 kali

12.434812515742879

LLR-test (LLR p-value):  
B1, B2, B3, B4: p-val<=0.05, artinya kita punya cukup bukti bahwa ada setidaknya satu fitur yang memiliki pengaruh signifikan terhadap nilai kemungkinan gagal bayar.  

Wald Test(P>|z|):  
1. B0 (constant): p-val <=0.05, artinya kita butuh B0 di model kita
2. B1 (employ): p-val<=0.05, artinya kita punya cukup bukti bahwa employ mempengaruhi nilai kemungkinan gagal bayar
3. B2 (debtinc): p-val<=0.05, artinya kita punya cukup bukti bahwa ratio income dan debt mempengaruhi nilai kemungkinan gagal bayar
4. B3 (creddebt): p-val<=0.05, artinya kita punya cukup bukti bahwa nilai creddebt mempengaruhi nilai kemungkinan gagal bayar
5. B4 (othdebt): p-value>0.05, artinya kita **tidak punya cukup bukti** bahwa nilai athdebt mempengaruhi nilai kemungkinan gagal bayar.  

Parameter Estimate(coef):  
1. B1= -0.2436: Tahun bekerja menurunkan kemungkinan gagal bayar. Ketika tahun bekerja meningkat 1 tahun dan variabel lainnya tetap, maka kemungkinan untuk gagal bayar menurun 1.28 kali.    
2. B2= 0.0885: Ratio debt dan income meningkatkan kemungkinan gagal bayar. Ketika ratio antara hutang dan pendapatan meningkat 1 poin, maka kemungkinan untuk gagal bayar meningkat 1.093 kali.    
3. B3= 0.5041: Nilai creddebt memingkatkan kemungkinan gagal bayar. Ketika jumlah creddebt bertambah 5 USD, maka kemungkinan untuk gagal bayar meningkat 12.43 kali.  
4. B4= -0.0040: tidak diintepretasikan, karena hasil uji tidak signifikan.  
#sangat disarankan untuk melakukan prediksi di dalam rentang observasi data (nilai max dan min setiap fitur).  

Coefficient Determination (Pseudo R-squ.):  
Pseudo R-squared= 28.44%, Model yang kita buat mampu menjelaskan 28.44% dari varians kemungkinan gagal bayar.

### Validation
Train data=80%  
Test data= 20%

In [21]:
from sklearn.model_selection import train_test_split #dipakai untuk membagi data secara acak
from sklearn.metrics import accuracy_score

In [23]:
#splitting data
x_train, x_test, y_train, y_test= train_test_split(
    x,
    y,
    stratify=y,
    test_size=0.2,
    random_state=2020
)

In [38]:
x.shape

(700, 4)

In [36]:
x_train.shape

(560, 4)

In [37]:
x_test.shape

(140, 4)

In [26]:
sm_logit_train = sm.Logit(y_train,sm.add_constant(x_train))
result_train = sm_logit.fit()

Optimization terminated successfully.
         Current function value: 0.411145
         Iterations 7


In [29]:
y_predict_proba= result_train.predict(sm.add_constant(x_test))
y_predict_proba #isinya probability gagal bayar

118    0.584807
309    0.308932
339    0.308407
686    0.398796
639    0.236440
         ...   
597    0.544007
58     0.012081
467    0.102066
148    0.060111
681    0.214943
Length: 140, dtype: float64

In [30]:
y_predict_class=np.where(y_predict_proba>.5,1,0)
y_predict_class

array([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 0, 0, 0])

In [31]:
print('model accuracy score in test dataset:', accuracy_score(y_test,y_predict_class))

model accuracy score in test dataset: 0.8214285714285714


Akurasi 82.14%, artinya model yang kita buat mampu memprediksi secara tepat 8 dari 10 kali

## Latihan

Gunakan data titanic dari seaborn.  
* target=survived  
* fitur= 'pclass','sex','age','fare'  



* isi missing value  
* Cek multicollinearity
* split data menjadi train dan test (80:20)
* buat intepretasi hasil summary
* cek akurasi model terhadap data test

In [87]:
import seaborn as sns

In [88]:
df=sns.load_dataset('titanic')

In [89]:
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [90]:
df[['survived','pclass','sex','age','fare']]

Unnamed: 0,survived,pclass,sex,age,fare
0,0,3,male,22.0,7.2500
1,1,1,female,38.0,71.2833
2,1,3,female,26.0,7.9250
3,1,1,female,35.0,53.1000
4,0,3,male,35.0,8.0500
...,...,...,...,...,...
886,0,2,male,27.0,13.0000
887,1,1,female,19.0,30.0000
888,0,3,female,,23.4500
889,1,1,male,26.0,30.0000


In [91]:
x= df[['pclass','sex','age','fare']]
y=df['survived']

In [92]:
from scipy.stats import normaltest
normaltest(df['age'].dropna())

NormaltestResult(statistic=18.105032952089758, pvalue=0.00011709599657350757)

In [93]:
x['age']=x['age'].fillna(x['age'].median())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x['age']=x['age'].fillna(x['age'].median())


In [94]:
x['age'].median()

28.0

In [95]:
x.isna().sum()

pclass    0
sex       0
age       0
fare      0
dtype: int64

In [96]:
x=pd.get_dummies(x,columns=['sex'], drop_first=True)
x

Unnamed: 0,pclass,age,fare,sex_male
0,3,22.0,7.2500,1
1,1,38.0,71.2833,0
2,3,26.0,7.9250,0
3,1,35.0,53.1000,0
4,3,35.0,8.0500,1
...,...,...,...,...
886,2,27.0,13.0000,1
887,1,19.0,30.0000,0
888,3,28.0,23.4500,0
889,1,26.0,30.0000,1


In [74]:
calc_vif(x)

Unnamed: 0,variables,VIF
0,pclass,3.728285
1,age,4.057797
2,fare,1.426615
3,sex_male,2.901604


In [75]:
#splitting data
x_train, x_test, y_train, y_test= train_test_split(
    x,
    y,
    stratify=y,
    test_size=0.2,
    random_state=2020
)

In [81]:
sm_logit = sm.Logit(y_train,sm.add_constant(x_train))
result = sm_logit.fit()

Optimization terminated successfully.
         Current function value: 0.448448
         Iterations 6


In [82]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:               survived   No. Observations:                  712
Model:                          Logit   Df Residuals:                      707
Method:                           MLE   Df Model:                            4
Date:                Thu, 01 Jul 2021   Pseudo R-squ.:                  0.3264
Time:                        11:37:39   Log-Likelihood:                -319.29
converged:                       True   LL-Null:                       -473.99
Covariance Type:            nonrobust   LLR p-value:                 1.020e-65
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.5490      0.563      8.087      0.000       3.447       5.652
pclass        -1.0983      0.149     -7.364      0.000      -1.391      -0.806
age           -0.0356      0.008     -4.273      0.0

In [83]:
y_predict_proba= result.predict(sm.add_constant(x_test))
y_predict_class=np.where(y_predict_proba>.5,1,0)

In [84]:
print('model accuracy score in test dataset:', accuracy_score(y_test,y_predict_class))

model accuracy score in test dataset: 0.7932960893854749


In [100]:
from sklearn.linear_model import LogisticRegression

In [102]:
model=LogisticRegression()
model.fit(x_train, y_train)

LogisticRegression()

In [105]:
accuracy_score(y_test,model.predict(x_test))

0.7932960893854749