# Building a logistic regression #

Let's look at how to build a logistic regression model using ```statsmodels``` and ```sklearn```. As before, the former returns a more descriptive output, while the latter integrates easily with cross-validation.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import confusion_matrix as cm
from sklearn.model_selection import train_test_split

# we can use the credit dataset
credit = pd.read_csv("credit_regress.csv")
credit.rename(columns={'VAR1_A14': 'No_account', 'VAR2': 'Duration'}, inplace=True)

# Again, we need to add a constant
credit = sm.add_constant(credit)
print(credit.head())

   const  Unnamed: 0  Bad_2  ID  Duration  VAR5  VAR8  VAR11  VAR13  VAR16  \
0    1.0           0      0   1         6  1169     4      4     67      2   
1    1.0           1      1   2        48  5951     2      2     22      1   
2    1.0           2      0   3        12  2096     2      3     49      1   
3    1.0           3      0   4        42  7882     2      4     45      1   
4    1.0           4      1   5        24  4870     3      4     53      2   

   ...  VAR12_A124  VAR14_A142  VAR14_A143  VAR15_A152  VAR15_A153  \
0  ...           0           0           1           1           0   
1  ...           0           0           1           1           0   
2  ...           0           0           1           1           0   
3  ...           0           0           1           0           1   
4  ...           1           0           1           0           1   

   VAR17_A172  VAR17_A173  VAR17_A174  VAR19_A192  VAR20_A202  
0           0           1           0         



In [2]:
# we will use two variables, and will split the data into training and testing
X=credit[["const","No_account", "Duration"]]
Y=credit[["Bad_2"]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=2) 

# This time, we use the Logit model class
logreg = sm.Logit(Y_train, X_train)

result = logreg.fit()
# predicted probabilities of Bad_2 = 1 for the test set
prob1 = result.predict(X_test)

print(result.summary())

Optimization terminated successfully.
         Current function value: 0.535760
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                  Bad_2   No. Observations:                  700
Model:                          Logit   Df Residuals:                      697
Method:                           MLE   Df Model:                            2
Date:                Tue, 12 Nov 2019   Pseudo R-squ.:                  0.1064
Time:                        16:29:04   Log-Likelihood:                -375.03
converged:                       True   LL-Null:                       -419.70
Covariance Type:            nonrobust   LLR p-value:                 3.983e-20
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.1121      0.185     -5.999      0.000      -1.476      -0.749
No_account    -1.5785      0.

We can replicate this with ```sklearn``` to some extent:

In [3]:
from sklearn.linear_model import LogisticRegression

# Note: X_train still includes the 'const' column added for statsmodels
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, Y_train.values.ravel())

print('Intercept')
print(logreg.intercept_)
print('Coefficients')
print(logreg.coef_)
print('Odds Ratios')
print(np.exp(logreg.coef_))

Intercept
[-0.55222617]
Coefficients
[[-0.55222617 -1.51749913  0.03313594]]
Odds Ratios
[[0.57566685 0.21925954 1.03369104]]


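The coefficients vary slightly. Two things are worth noting. First, ```X_train``` still contains the ```const``` column, so ```sklearn``` splits the intercept between ```intercept_``` and the coefficient on ```const``` (-0.552 each, about -1.10 in total, close to the -1.11 reported by ```statsmodels```). Second, the remaining differences are most likely because ```sklearn``` applies L2 regularisation by default (```C=1.0```), whereas ```statsmodels``` fits an unpenalised maximum-likelihood model; the solver may also play a small part.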

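The coefficients give the effect of each predictor on the logit (the log-odds). The odds ratios, obtained by exponentiating the coefficients, give the multiplicative effect on the odds. The intercept is usually of little interest.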
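To make the distinction between logit, odds and probability concrete, here is a minimal sketch that plugs the ```sklearn``` coefficients printed above into the logistic formula. The function name `describe_applicant` and the 24-month loan are illustrative assumptions, not part of the original analysis. Because the ```const``` column was included in the fit, its coefficient is added to the intercept here.

```python
import math

# Coefficients copied from the sklearn output above.
INTERCEPT = -0.55222617
COEF_CONST = -0.55222617       # weight on the 'const' column, which is always 1
COEF_NO_ACCOUNT = -1.51749913
COEF_DURATION = 0.03313594


def describe_applicant(no_account, duration):
    """Return (logit, odds, probability) of Bad_2 = 1 for one applicant.

    Hypothetical helper for illustration only.
    """
    logit = (INTERCEPT + COEF_CONST * 1.0
             + COEF_NO_ACCOUNT * no_account
             + COEF_DURATION * duration)
    odds = math.exp(logit)               # odds = exp(log-odds)
    probability = odds / (1.0 + odds)    # p = odds / (1 + odds)
    return logit, odds, probability


# Hypothetical 24-month loan, with and without a checking account.
for no_account in (0, 1):
    logit, odds, prob = describe_applicant(no_account, duration=24)
    print(f"No_account={no_account}: logit={logit:.3f}, odds={odds:.3f}, p={prob:.3f}")
```

The difference in logit between the two applicants equals the coefficient, and the ratio of their odds equals the odds ratio, regardless of the loan duration. The difference in probability, by contrast, depends on the values of the other predictors.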

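Not having a checking account decreases the logit by 1.52 (```sklearn```) or 1.58 (```statsmodels```) compared with having one. Equivalently, the odds are multiplied by a factor of about 0.21 to 0.22 ($e^{-1.58} \approx 0.21$, $e^{-1.52} \approx 0.22$). In other words, the odds of Default/Bad are between 4.5 and 5 times lower for those without a checking account.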

For Duration, an increase of one month (one unit) means an increase in the logit of 0.03, or an increase in the odds by a factor of 1.03. A convenient way of interpreting the effect on odds for a numeric variable is to think of the percentage deviation from 1, i.e. $(e^\beta - 1) \times 100$ indicates the percentage increase or decrease in the odds due to a one-unit change in the predictor. So for Duration, a one-month increase leads to an increase in odds of about 3%. Therefore, loans with a longer duration carry higher risk.
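The percentage formula can be checked with a few lines of Python. This is a small sketch using the ```sklearn``` coefficients printed earlier; the helper name `pct_change_in_odds` and its `delta` parameter are assumptions introduced for illustration.

```python
import math


def pct_change_in_odds(beta, delta=1.0):
    """Percentage change in odds for a change of `delta` units in a predictor.

    Implements (exp(beta * delta) - 1) * 100. Hypothetical helper.
    """
    return (math.exp(beta * delta) - 1.0) * 100.0


coef_duration = 0.03313594     # from the sklearn output above
coef_no_account = -1.51749913  # from the sklearn output above

print(f"Duration, +1 month:   {pct_change_in_odds(coef_duration):.1f}%")
print(f"Duration, +12 months: {pct_change_in_odds(coef_duration, 12):.1f}%")
print(f"No_account, 0 -> 1:   {pct_change_in_odds(coef_no_account):.1f}%")
```

Note that the effect compounds multiplicatively: a 12-month increase changes the odds by more than 12 times the one-month percentage.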

In [4]:
# To check how predictive accuracy varies across different samples, one can use the cross_validate function

from sklearn.model_selection import cross_validate
from sklearn.metrics import roc_auc_score, recall_score, precision_score

classifier = LogisticRegression(solver='liblinear')
# metrics you want to have computed
metrics = ['roc_auc','recall','precision']

# To show train metrics, we add the extra return_train_score parameter
outcomes = cross_validate(classifier, X_train, Y_train.values.ravel(), scoring=metrics, cv=5, return_train_score=True)

for metric in outcomes.keys():
    print(metric+" value: "+str(outcomes[metric]))

fit_time value: [0.00363302 0.00283551 0.00261807 0.00263596 0.02295828]
score_time value: [0.00572824 0.00438762 0.00404811 0.00400805 0.00737143]
test_roc_auc value: [0.64841463 0.730625   0.708375   0.729125   0.77790404]
train_roc_auc value: [0.73562813 0.71790501 0.72047354 0.72009994 0.70346273]
test_recall value: [0.17073171 0.225      0.175      0.2        0.2       ]
train_recall value: [0.20625    0.1863354  0.20496894 0.19254658 0.19254658]
test_precision value: [0.5        0.5625     0.58333333 0.8        0.57142857]
train_precision value: [0.62264151 0.6        0.6        0.55357143 0.59615385]
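The per-fold arrays are easier to compare once summarised. The sketch below copies the scores printed above into plain lists and reports the mean and standard deviation per metric; the dictionary name `fold_scores` is an assumption for illustration. With the actual ```outcomes``` dictionary, the same summary could be computed directly from its arrays.

```python
from statistics import mean, stdev

# Per-fold scores copied from the cross_validate output above.
fold_scores = {
    "test_roc_auc": [0.64841463, 0.730625, 0.708375, 0.729125, 0.77790404],
    "train_roc_auc": [0.73562813, 0.71790501, 0.72047354, 0.72009994, 0.70346273],
    "test_recall": [0.17073171, 0.225, 0.175, 0.2, 0.2],
    "train_recall": [0.20625, 0.1863354, 0.20496894, 0.19254658, 0.19254658],
    "test_precision": [0.5, 0.5625, 0.58333333, 0.8, 0.57142857],
    "train_precision": [0.62264151, 0.6, 0.6, 0.55357143, 0.59615385],
}

# Mean and sample standard deviation across the five folds.
summary = {name: (mean(values), stdev(values)) for name, values in fold_scores.items()}

for name, (avg, sd) in summary.items():
    print(f"{name:16s} mean={avg:.3f}  sd={sd:.3f}")
```

In this run the mean test and train AUC are both about 0.72, so there is little sign of overfitting, while the mean test recall is only about 0.19, meaning that most bad loans are missed by the classifier's default predictions.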
