# Logistic regression

Code Created by Luis Enrique Acevedo Galicia

Date: 2019-09-03

Here I present a simple way to fit a logistic regression. The file data_bank_train.csv contains bank data on interest rates and on whether people got the credit.

# The Libraries

In [24]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats
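# Compatibility shim: older statsmodels releases call stats.chisqprob, which newer SciPy versions no longer provide
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)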

# The data 

In [25]:
data = pd.read_csv('data_bank_train.csv')
data.head()

,Unnamed: 0,interest,credits,Month 1,Month 2,previous,duration,Result
0,0,1.808904,0,1,0,0,129,no
1,1,1.040052,0,0,2,1,286,yes
2,2,6.587448,0,1,0,0,179,no
3,3,5.58672,0,0,0,0,698,yes
4,4,6.584736,0,1,0,0,169,no


# Preprocessing data

In [26]:
# Remove the redundant row-count column
data = data.drop('Unnamed: 0', axis=1)
# Encode the target as 1 (yes) / 0 (no)
data['Result'] = data['Result'].map({'yes': 1, 'no': 0})
data.head()

,interest,credits,Month 1,Month 2,previous,duration,Result
0,1.808904,0,1,0,0,129,0
1,1.040052,0,0,2,1,286,1
2,6.587448,0,1,0,0,179,0
3,5.58672,0,0,0,0,698,1
4,6.584736,0,1,0,0,169,0


# Learning about this data set

In [27]:
data.describe()

,interest,credits,Month 1,Month 2,previous,duration,Result
count,518.0,518.0,518.0,518.0,518.0,518.0,518.0
mean,3.845312,0.034749,0.266409,0.388031,0.127413,394.177606,0.5
std,2.545081,0.183321,0.442508,0.814527,0.333758,344.29599,0.500483
min,0.86106,0.0,0.0,0.0,0.0,21.0,0.0
25%,1.413969,0.0,0.0,0.0,0.0,167.0,0.0
50%,1.987896,0.0,0.0,0.0,0.0,278.5,0.5
75%,6.721014,0.0,1.0,0.0,0.0,494.75,1.0
max,6.73932,1.0,1.0,5.0,1.0,2665.0,1.0


# The Regression

In [28]:
# Independent variables ('previous' is not used)
IND_V = ['interest', 'credits', 'Month 1', 'Month 2', 'duration']

X = data[IND_V]
y = data['Result']

In [30]:
# The regression model (no constant column is added, so it is fitted without an intercept)

Rlogit = sm.Logit(y,X)
RTS_logit = Rlogit.fit()
RTS_logit.summary2()

Optimization terminated successfully.
         Current function value: 0.341242
         Iterations 7


Model:,Logit,Pseudo R-squared:,0.508
Dependent Variable:,Result,AIC:,363.5265
Date:,2019-02-09 19:19,BIC:,384.7763
No. Observations:,518,Log-Likelihood:,-176.76
Df Model:,4,LL-Null:,-359.05
Df Residuals:,513,LLR p-value:,1.2499e-77
Converged:,1.0000,Scale:,1.0
No. Iterations:,7.0000,,

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
interest,-0.6073,0.0587,-10.3386,0.0000,-0.7224,-0.4922
credits,2.2836,1.0678,2.1385,0.0325,0.1907,4.3765
Month 1,-1.8300,0.3102,-5.8990,0.0000,-2.4380,-1.2220
Month 2,0.4454,0.1726,2.5807,0.0099,0.1071,0.7837
duration,0.0069,0.0007,10.3428,0.0000,0.0056,0.0082
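
Because the model is linear in the log-odds, exponentiating a coefficient gives the multiplicative change in the odds of `Result = 1` for a one-unit increase in that variable. The following is a minimal sketch, using the rounded coefficients copied by hand from the table above; the names `coefs` and `odds_ratios` are illustrative only and are not part of the original notebook.

```python
import math

# Coefficients copied (rounded) from the summary table above
coefs = {
    'interest': -0.6073,
    'credits': 2.2836,
    'Month 1': -1.8300,
    'Month 2': 0.4454,
    'duration': 0.0069,
}

# exp(coefficient) is the odds ratio for a one-unit increase in the variable
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.4f}")
```

Ratios below 1 (here `interest` and `Month 1`) lower the odds and ratios above 1 raise them. With the fitted model at hand, `np.exp(RTS_logit.params)` should give the same quantities at full precision.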


In [33]:
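def CMatrix(data, real_values, Logmodel):
    # Predict the probabilities with the fitted model
    PRD_values = Logmodel.predict(data)
    # Bin edges: [0, 0.5) counts as class 0, [0.5, 1] as class 1
    BINS = np.array([0, 0.5, 1])
    # 2D histogram: rows are the actual classes, columns the predicted classes
    HG = np.histogram2d(real_values, PRD_values, bins=BINS)[0]
    # Accuracy: share of observations on the diagonal
    ACR = (HG[0, 0] + HG[1, 1]) / HG.sum()
    # Return the confusion matrix and the accuracy
    return HG, ACR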
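
The binning in `CMatrix` amounts to thresholding each predicted probability at 0.5 and counting how often each (actual, predicted) pair occurs. As a rough illustration, the same tally can be written in plain Python. The helper `confusion_counts` and the toy data below are invented for this sketch and do not appear in the original notebook.

```python
def confusion_counts(real_values, predicted_probs, threshold=0.5):
    """Tally a 2x2 table: rows are actual classes, columns are predicted classes."""
    table = [[0, 0], [0, 0]]
    for actual, prob in zip(real_values, predicted_probs):
        # Probabilities at or above the threshold count as class 1
        predicted = 1 if prob >= threshold else 0
        table[int(actual)][predicted] += 1
    return table

# Toy data, invented for illustration only
toy_real = [0, 0, 1, 1, 1]
toy_probs = [0.1, 0.7, 0.4, 0.9, 0.5]

toy_table = confusion_counts(toy_real, toy_probs)
# Accuracy is the share of observations on the diagonal
toy_accuracy = (toy_table[0][0] + toy_table[1][1]) / len(toy_real)
print(toy_table, toy_accuracy)  # [[1, 1], [1, 2]] 0.6
```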

In [36]:
CMatrix(X,y,RTS_logit)

(array([[214.,  45.],
        [ 26., 233.]]), 0.862934362934363)

# Testing the model

In [37]:
# Load the test data and preprocess it the same way as the training data
data_new = pd.read_csv('data_bank_test.csv')
data_new = data_new.drop(['Unnamed: 0'], axis=1)
data_new['Result'] = data_new['Result'].map({'yes': 1, 'no': 0})
data_new.head()

,interest,credits,Month 1,Month 2,previous,duration,Result
0,1.780428,0,1,0,0,499,0
1,6.727116,0,0,0,0,144,0
2,6.584736,0,1,0,0,104,0
3,5.58672,0,0,0,0,1480,1
4,6.729828,0,0,0,0,48,0


In [39]:
# Target and independent variables for the test set
y_new = data_new['Result']
X_new = data_new[IND_V]
X_new.head()

,interest,credits,Month 1,Month 2,duration
0,1.780428,0,1,0,499
1,6.727116,0,0,0,144
2,6.584736,0,1,0,104
3,5.58672,0,0,0,1480
4,6.729828,0,0,0,48
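
To make concrete what `predict` computes for one of these rows, the sketch below applies the logistic function by hand to the first test row. It assumes the rounded coefficients from the regression summary (so the result is only approximate) and relies on the model having no intercept, since no constant column was added to `X`. The names `coefs`, `row`, `log_odds` and `probability` are illustrative only.

```python
import math

# Rounded coefficients copied from the regression summary (the model has no intercept)
coefs = {
    'interest': -0.6073,
    'credits': 2.2836,
    'Month 1': -1.8300,
    'Month 2': 0.4454,
    'duration': 0.0069,
}

# First row of X_new as displayed above
row = {'interest': 1.780428, 'credits': 0, 'Month 1': 1, 'Month 2': 0, 'duration': 499}

# Linear predictor: sum of coefficient * value over all variables
log_odds = sum(coefs[name] * row[name] for name in coefs)

# The logistic function maps the log-odds to a probability between 0 and 1
probability = 1.0 / (1.0 + math.exp(-log_odds))

# Same 0.5 cut-off that CMatrix uses through its bin edges
predicted_class = 1 if probability >= 0.5 else 0
print(round(log_odds, 3), round(probability, 3), predicted_class)
```

With the fitted model, `RTS_logit.predict(X_new.iloc[[0]])` should return a value close to this hand-computed probability; small differences come from the rounding of the coefficients.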


In [40]:
# Confusion matrix and accuracy on the test set
CMatrix(X_new, y_new, RTS_logit)

(array([[92., 19.],
        [13., 98.]]), 0.8558558558558559)
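
Accuracy alone hides how the errors are split between the two classes. As a hedged sketch, the function below derives a few standard rates from the two confusion matrices printed in this notebook, assuming the layout produced by `CMatrix` (rows are actual classes, columns are predicted classes). The helper name `rates` is hypothetical and not part of the original code.

```python
def rates(matrix):
    """Return accuracy, sensitivity, specificity and precision for a 2x2 table
    laid out as [[TN, FP], [FN, TP]] (rows actual, columns predicted)."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'sensitivity': tp / (tp + fn),   # share of actual 1s predicted as 1
        'specificity': tn / (tn + fp),   # share of actual 0s predicted as 0
        'precision': tp / (tp + fp),     # share of predicted 1s that are actual 1s
    }

# Counts copied from the CMatrix outputs in this notebook
train_matrix = [[214, 45], [26, 233]]
test_matrix = [[92, 19], [13, 98]]

train_rates = rates(train_matrix)
test_rates = rates(test_matrix)

for name in train_rates:
    print(f"{name}: train={train_rates[name]:.3f} test={test_rates[name]:.3f}")
```

The training and test accuracies are close (about 0.863 and 0.856), which suggests that the model generalizes reasonably well to the new data.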