# Logistic Regression

This notebook develops and tests a logistic regression model on data from the banking industry.

Create a logistic regression based on the bank data provided. The data come from the marketing campaigns of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

Note that the first column of the dataset is the index.

Source: S. Moro, P. Cortez and P. Rita (2014): A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31

## Import the relevant libraries

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
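from scipy import stats
sns.set()  # apply seaborn's default plot styling
# Workaround for older statsmodels releases whose summary() calls
# scipy.stats.chisqprob, which recent SciPy versions no longer provide.
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)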

## Load the data

Load the ‘bank-data-train.csv’ dataset.

In [None]:
raw_data = pd.read_csv('bank-data-train.csv')
raw_data

Variable description: 
<i>interest_rate</i> indicates the 3-month interest rate between banks. 
<i>duration</i> indicates the time since the last contact was made with a given consumer. 
<i>previous</i> shows whether the last marketing campaign was successful with this customer. 
<i>march</i> and <i>may</i> are Boolean variables that indicate when the call was made to the specific customer. 
<i>credit</i> shows whether the customer has enough credit to avoid defaulting.

Objective: Analyze whether the bank's marketing strategy was successful. The outcome variable y needs to be transformed into Boolean (0/1) values in order to run the regressions.

In [None]:
data = raw_data.copy()
data = data.drop(['Unnamed: 0'], axis = 1)
data['y'] = data['y'].map({'yes':1, 'no':0})
data
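
One thing worth checking after the mapping step: `Series.map` with a dictionary turns any label that is not a key into `NaN` instead of raising an error. The sketch below uses a small made-up column (not the bank data) to show how such unmapped values could be counted, and how normalizing the text first might avoid them.

```python
import pandas as pd

# Toy column standing in for the notebook's 'y'; values are illustrative only.
toy = pd.DataFrame({'y': ['yes', 'no', 'Yes', 'no']})

mapped = toy['y'].map({'yes': 1, 'no': 0})

# Labels missing from the dictionary ('Yes' here) become NaN silently.
n_unmapped = int(mapped.isna().sum())
print("unmapped labels:", n_unmapped)

# Normalizing case and whitespace before mapping avoids this particular mismatch.
normalized = toy['y'].str.strip().str.lower().map({'yes': 1, 'no': 0})
print(normalized.tolist())
```

If the count is zero on the real data, the mapping covered every label and the regression can proceed.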

In [None]:
data.describe()

### Declare the dependent and independent variables

Use 'duration' as the independent variable.

In [None]:
y = data['y']
x1 = data['duration']

### Simple Logistic Regression

Run the regression and graph the scatter plot

In [None]:
x = sm.add_constant(x1)
reg_log = sm.Logit(y,x)
results_log = reg_log.fit()
results_log.summary()
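
The summary reports the constant and the `duration` coefficient on the log-odds scale. As a rough sketch of how those numbers translate into probabilities, the snippet below applies the logistic function to placeholder coefficients. The values of `const` and `b_duration` are assumptions chosen for illustration, not output from this notebook; the fitted values would come from `results_log.params`.

```python
import numpy as np

def logistic(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder coefficients for illustration only (NOT the fitted values);
# after fitting, read the real ones from results_log.params.
const, b_duration = -1.7, 0.005

durations = np.array([0.0, 340.0, 1000.0])
probs = logistic(const + b_duration * durations)
print(probs)

# Each additional unit of duration multiplies the odds by exp(b_duration).
odds_ratio = np.exp(b_duration)
print(odds_ratio)
```

With a positive coefficient the predicted probability rises with duration, and it crosses 0.5 where the log-odds equal zero.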

In [None]:
plt.scatter(x1,y)
plt.xlabel('Duration', fontsize = 20)
plt.ylabel('Subscription', fontsize = 20)
plt.show()

## Expand the model

Switch to a multivariate logistic regression model. 
Add the ‘interest_rate’, ‘march’, ‘credit’ and ‘previous’ predictors to the model, alongside ‘duration’, and run the regression again. 

### Declare the independent variable(s)

In [None]:
estimators=['interest_rate','credit','march','previous','duration']

X1_all = data[estimators]
y = data['y']

In [None]:
X_all = sm.add_constant(X1_all)
reg_logit = sm.Logit(y,X_all)
results_logit = reg_logit.fit()
results_logit.summary2()

### Confusion Matrix

Find the confusion matrix of the model and estimate its accuracy. 

In [None]:
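def confusion_matrix(data, actual_values, model):
    """Return the confusion matrix and accuracy of a fitted Logit model."""
    # Predicted probabilities from the fitted model
    pred_values = model.predict(data)
    # Bin edges: values in [0, 0.5) count as 0, values in [0.5, 1] count as 1
    bins = np.array([0, 0.5, 1])
    # Rows are actual classes, columns are predicted classes
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Accuracy is the share of observations on the diagonal
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    return cm, accuracy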

In [None]:
confusion_matrix(X_all,y,results_logit)
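
To make the `np.histogram2d` trick in the function above more concrete, here is a small self-contained sketch on made-up labels and probabilities (not the bank data). The bin edges `[0, 0.5, 1]` act as a 0.5 threshold on the predicted probabilities; rows index the actual class and columns the predicted class.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only.
actual = np.array([0, 0, 1, 1, 1, 0])
predicted_prob = np.array([0.1, 0.7, 0.8, 0.4, 0.9, 0.2])

# Same bin edges for both axes: [0, 0.5) is class 0, [0.5, 1] is class 1.
bins = np.array([0, 0.5, 1])
cm = np.histogram2d(actual, predicted_prob, bins=bins)[0]

# cm[0, 0]: true negatives,  cm[0, 1]: false positives
# cm[1, 0]: false negatives, cm[1, 1]: true positives
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(cm)
print(accuracy)
```

Note that a predicted probability of exactly 0.5 lands in the second bin, so it is counted as a predicted 1.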

## Test the model

Load the test data from the 'bank-data-test.csv' file.

### Load new data 

In [None]:
raw_data2 = pd.read_csv('bank-data-test.csv')
data_test = raw_data2.copy()

data_test = data_test.drop(['Unnamed: 0'], axis = 1)

In [None]:
data_test['y'] = data_test['y'].map({'yes':1, 'no':0})
data_test

### Declare the dependent and the independent variables

In [None]:
y_test = data_test['y']

X1_test = data_test[estimators]
X_test = sm.add_constant(X1_test)

Determine the test confusion matrix and the test accuracy and compare them with the train confusion matrix and the train accuracy.

In [None]:
confusion_matrix(X_test, y_test, results_logit)

In [None]:
confusion_matrix(X_all,y, results_logit)

Looking at the test accuracy, we see a number that is a tiny bit lower: 86.04%, compared to 86.29% for the train accuracy. 

In general, we expect the test accuracy to be somewhat lower than the train accuracy, because the model was fitted to the training data. If the test accuracy comes out higher, this is usually due to chance in how the data were split.
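
One rough way to judge whether a gap this small matters is to compare it with the binomial standard error of the test accuracy. The sketch below does that using the two accuracies reported above; `n_test` is a hypothetical test-set size picked for illustration, and should be replaced with `len(y_test)` in practice.

```python
import math

def accuracy_std_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

train_acc, test_acc = 0.8629, 0.8604  # accuracies reported above
n_test = 200                           # hypothetical size (assumption); use len(y_test)

se_test = accuracy_std_error(test_acc, n_test)
gap = train_acc - test_acc
print(f"gap = {gap:.4f}, test standard error = {se_test:.4f}")
```

Under this assumed sample size the gap is much smaller than one standard error, which is consistent with treating the two accuracies as practically equal.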