In this project, I will build a classification model that uses three features (default, housing, and loan) to predict consumer response (a binary variable) to a marketing call from the bank. Two classification methods will be used:

1. Logistic Regression
2. Naive Bayes Classification

Both methods will be evaluated by the area under the ROC curve (AUC), estimated with 10-fold cross-validation.

Based on the classification results, I will advise management on which machine learning method to use to guide the telephone marketing campaign, and suggest a target audience for it.

### Data Preparation

In [1]:
# Setting the random seed to generate reproducible results

RANDOM_SEED = 1

In [2]:
# Importing base packages Numpy and Pandas

import numpy as np
import pandas as pd

# Importing warnings filter
from warnings import simplefilter

# Ignoring all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [3]:
# Importing bank.csv data into a DataFrame

bank = pd.read_csv('bank.csv', sep = ';')

In [4]:
# Examining the shape - number of rows and columns, of bank DataFrame

print(bank.shape)

(4521, 17)


In [5]:
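# Dropping rows with null values, as we do not want to work with incomplete data.
# dropna() returns a new DataFrame, so the result is assigned back to bank.

bank = bank.dropna()
print(bank.head().to_string(line_width=78))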

   age          job  marital  education default  balance housing loan  \
0   30   unemployed  married    primary      no     1787      no   no   
1   33     services  married  secondary      no     4789     yes  yes   
2   35   management   single   tertiary      no     1350     yes   no   
3   30   management  married   tertiary      no     1476     yes  yes   
4   59  blue-collar  married  secondary      no        0     yes   no   

    contact  day month  duration  campaign  pdays  previous poutcome  \
0  cellular   19   oct        79         1     -1         0  unknown   
1  cellular   11   may       220         1    339         4  failure   
2  cellular   16   apr       185         1    330         1  failure   
3   unknown    3   jun       199         4     -1         0  unknown   
4   unknown    5   may       226         1     -1         0  unknown   

  response  
0       no  
1       no  
2       no  
3       no  
4       no  


In [6]:
# Shape of input data after dropping data with missing values

print(bank.shape)

(4521, 17)


#### Note: The shape of the bank DataFrame is the same as before dropna(), which means there were no rows with missing data.
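
A more direct check is to count missing values per column rather than compare shapes. The sketch below uses a small hypothetical DataFrame (`toy`) in place of `bank` so that it runs on its own; the same calls would apply to `bank`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bank DataFrame, with one missing value
toy = pd.DataFrame({
    "default": ["no", "yes", "no"],
    "balance": [1787.0, np.nan, 1350.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() returns a new DataFrame; the original is unchanged unless reassigned
toy_clean = toy.dropna()
print(toy.shape, toy_clean.shape)
```

Note that `dropna()` only has a lasting effect when its result is assigned to a variable.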

In [7]:
# Displaying the list of column names

list(bank.columns.values)

['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'contact',
 'day',
 'month',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'response']

In [8]:
# Displaying the first 5 records of the DataFrame (the default for head())

print(bank.head().to_string(line_width=78))

   age          job  marital  education default  balance housing loan  \
0   30   unemployed  married    primary      no     1787      no   no   
1   33     services  married  secondary      no     4789     yes  yes   
2   35   management   single   tertiary      no     1350     yes   no   
3   30   management  married   tertiary      no     1476     yes  yes   
4   59  blue-collar  married  secondary      no        0     yes   no   

    contact  day month  duration  campaign  pdays  previous poutcome  \
0  cellular   19   oct        79         1     -1         0  unknown   
1  cellular   11   may       220         1    339         4  failure   
2  cellular   16   apr       185         1    330         1  failure   
3   unknown    3   jun       199         4     -1         0  unknown   
4   unknown    5   may       226         1     -1         0  unknown   

  response  
0       no  
1       no  
2       no  
3       no  
4       no  


In [9]:
# Mapping dictionary that converts text no/yes to integer 0/1

convert_to_binary = {'no' : 0, 'yes' : 1}

# Mapping values in a binary variable for having credit in default

default = bank['default'].map(convert_to_binary)

# Mapping values in a binary variable for having a mortgage or housing loan

housing = bank['housing'].map(convert_to_binary)

# Mapping values in a binary variable for having a personal loan

loan = bank['loan'].map(convert_to_binary)

# Defining binary variable for response variable using the mapping dictionary

response = bank['response'].map(convert_to_binary)

# Creating an array using three explanatory and one response variables.
# Transposing the data using .T to restructure the array as per our requirement

model_data = np.array([np.array(default), np.array(housing), np.array(loan), 
    np.array(response)]).T

# Shape of the model data

print(model_data.shape)

(4521, 4)
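
`Series.map` with a dictionary returns `NaN` for any value that is not one of its keys, so a quick check that the mapping covered every label can be useful. The sketch below uses a hypothetical toy Series containing one unexpected label; it is an illustration, not part of the original analysis.

```python
import pandas as pd

convert_to_binary = {"no": 0, "yes": 1}

# Hypothetical column with one label the dictionary does not cover
toy_col = pd.Series(["no", "yes", "unknown", "no"])
mapped = toy_col.map(convert_to_binary)

# Labels missing from the dictionary become NaN after mapping
unmapped_count = int(mapped.isna().sum())
print("Unmapped values:", unmapped_count)
```

A count of zero on the real columns would confirm that every value was either `no` or `yes`.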


### Classification of Data using Logistic Regression and Naive Bayes Classification

In [10]:
# Specifying the set of classifiers being evaluated

from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

In [11]:
# Shape of the additive model X input and y response

print('\nData dimensions:', model_data.shape)


Data dimensions: (4521, 4)


#### Training and Test Datasets

In [12]:
# Creating arrays with test cases

my_default = np.array([1, 1, 1, 1, 0, 0, 0, 0], np.int32)
my_housing = np.array([1, 1, 0, 0, 1, 1, 0, 0], np.int32)
my_loan = np.array([1, 0, 1, 0, 1, 0, 1, 0], np.int32)

# Reshaping the array as per our requirements and creating the test dataset

my_X_test = np.vstack([my_default, my_housing, my_loan]).T

# Displaying contents of the Test dataset

print("my_X_test:\n", my_X_test)

# X_train is the data(feature) array containing training data
# y_train is the target values for the training data

X_train = model_data[:, 0:model_data.shape[1]-1]
y_train = model_data[:, model_data.shape[1]-1]

# Displaying contents of the Training datasets

print("\nX_train:\n", X_train)
print("\ny_train:\n", y_train)

# Displaying value counts for the Training dataset
# to see the distribution of responses

uniqTr, countsTr = np.unique(y_train, return_counts=True)
print("\ny_train unique value counts:\n", dict(zip(uniqTr, countsTr)))

my_X_test:
 [[1 1 1]
 [1 1 0]
 [1 0 1]
 [1 0 0]
 [0 1 1]
 [0 1 0]
 [0 0 1]
 [0 0 0]]

X_train:
 [[0 0 0]
 [0 1 1]
 [0 1 0]
 ...
 [0 0 0]
 [0 0 0]
 [0 1 1]]

y_train:
 [0 0 0 ... 0 0 0]

y_train unique value counts:
 {0: 4000, 1: 521}


#### Logistic Regression

In [13]:
# Fit logistic regression to full data set
# clf is the LogisticRegression estimator object

clf = LogisticRegression()

clf.fit(X_train, y_train)

# Predicting specific test cases using the test dataset created above

y_my_test_proba = clf.predict_proba(my_X_test)

y_my_test_predict = clf.predict(my_X_test)

# Create DataFrame for displaying test cases and predicted probabilities

my_targeting_df = pd.DataFrame(np.hstack([my_X_test, y_my_test_proba,
                                          np.vstack(y_my_test_predict)]))

# Giving labels to columns in the DataFrame

my_targeting_df.columns = ['default', 'housing', 'loan', 
                           'predict_NO', 'predict_YES', 'prediction']

# Displaying the prediction DataFrame

print('\n\nLogistic regression model predictions for test cases:')
print(my_targeting_df)

# Testing the accuracy of the Training Dataset using Logistic Regression model

print('\n\nOverall training set accuracy:', clf.score(X_train, y_train))



Logistic regression model predictions for test cases:
   default  housing  loan  predict_NO  predict_YES  prediction
0      1.0      1.0   1.0    0.945729     0.054271         0.0
1      1.0      1.0   0.0    0.892349     0.107651         0.0
2      1.0      0.0   1.0    0.900786     0.099214         0.0
3      1.0      0.0   0.0    0.811988     0.188012         0.0
4      0.0      1.0   1.0    0.953277     0.046723         0.0
5      0.0      1.0   0.0    0.906588     0.093412         0.0
6      0.0      0.0   1.0    0.914016     0.085984         0.0
7      0.0      0.0   0.0    0.834890     0.165110         0.0


Overall training set accuracy: 0.8847600088476001


#### Note: All the predictions for this test set are for a negative response, because no predicted probability of a positive response reaches 0.5. This could be because our training dataset is imbalanced, with 4000 negative responses and 521 positive responses.
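
For a binary classifier, `predict` returns the class with the larger predicted probability, which amounts to a 0.5 cutoff on `predict_YES`. The sketch below re-enters the `predict_YES` column from the table above and confirms that no test case reaches that cutoff; the array names are assumptions for illustration.

```python
import numpy as np

# predict_YES values copied from the logistic regression table above
lr_predict_yes = np.array([0.054271, 0.107651, 0.099214, 0.188012,
                           0.046723, 0.093412, 0.085984, 0.165110])

# Stack into [P(no), P(yes)] columns, the layout predict_proba returns
proba = np.column_stack([1.0 - lr_predict_yes, lr_predict_yes])

# Picking the more probable class reproduces the all-zero predictions
labels = proba.argmax(axis=1)
print("Largest predicted probability of yes:", lr_predict_yes.max())
print("Predicted classes:", labels)
```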

#### Naive Bayes Classification

In [14]:
# Fit Naive Bayes Classification to full data set
# clfNB is the BernoulliNB Naive Bayes estimator object
# X_train is the data(feature) array containing training data (defined above)
# y_train is the target values for the training data (defined above)

clfNB = BernoulliNB()
clfNB.fit(X_train, y_train)

# Predicting specific test cases using the test dataset created above

y_my_test_probaNB = clfNB.predict_proba(my_X_test)

y_my_test_predictNB = clfNB.predict(my_X_test)

# Create DataFrame for displaying test cases and predicted probabilities

my_targetingNB_df = pd.DataFrame(np.hstack([my_X_test, y_my_test_probaNB,
                                            np.vstack(y_my_test_predictNB)]))

# Giving labels to columns in the DataFrame

my_targetingNB_df.columns = ['default', 'housing', 'loan', 
                           'predict_NO', 'predict_YES', 'prediction']

# Displaying the prediction DataFrame

print('\n\nNaive Bayes Classification model predictions for test cases:')
print(my_targetingNB_df)

# Testing the accuracy of the Training Dataset using Naive Bayes model

print('\n\nOverall training set accuracy:', clfNB.score(X_train, y_train))



Naive Bayes Classification model predictions for test cases:
   default  housing  loan  predict_NO  predict_YES  prediction
0      1.0      1.0   1.0    0.947911     0.052089         0.0
1      1.0      1.0   0.0    0.896225     0.103775         0.0
2      1.0      0.0   1.0    0.904384     0.095616         0.0
3      1.0      0.0   0.0    0.817810     0.182190         0.0
4      0.0      1.0   1.0    0.953537     0.046463         0.0
5      0.0      1.0   0.0    0.906885     0.093115         0.0
6      0.0      0.0   1.0    0.914286     0.085714         0.0
7      0.0      0.0   0.0    0.835042     0.164958         0.0


Overall training set accuracy: 0.8847600088476001


#### Note: As with Logistic Regression, all the predictions for this test set are for a negative response. This could be because our training dataset is imbalanced, with 4000 negative responses and 521 positive responses.

The overall training set accuracy is the same for both models, and it equals the share of negative responses in the training data (4000 of 4521, about 0.8848): each model predicts a negative response for every row. Possible explanations are that the training dataset is too small to produce an observable difference, that the dataset is not diverse enough, or that we need to include more features in the model. Let's use cross-validation to compare the two models by ROC AUC.
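
The shared accuracy figure can be checked against a baseline that always predicts a negative response. A short calculation from the class counts printed earlier (4000 negative, 521 positive):

```python
# Class counts from the y_train value counts printed earlier
n_negative = 4000
n_positive = 521

# Accuracy of a rule that always predicts "no"
baseline_accuracy = n_negative / (n_negative + n_positive)
print(round(baseline_accuracy, 6))
```

The result agrees with the training set accuracy reported for both models, which is consistent with both of them predicting "no" for every row.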

### Cross-Validation of the Models using Area under the ROC Curve
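
For reference, scikit-learn can run this kind of evaluation in a single call. The sketch below is a compact illustration using `cross_val_score` with `scoring="roc_auc"`; it runs on synthetic stand-in data (random binary features and a made-up response rule), not on the bank data, and all names in it are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)

# Synthetic stand-in data: three binary features and a binary response
X_demo = rng.integers(0, 2, size=(1000, 3))
logit = -0.8 * X_demo[:, 1] - 0.7 * X_demo[:, 2]  # made-up effect sizes
y_demo = (rng.random(1000) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Shuffled 10-fold design; random_state applies because shuffle=True
cv = KFold(n_splits=10, shuffle=True, random_state=1)

demo_scores = {}
for name, model in [("Naive_Bayes", BernoulliNB()),
                    ("Logistic_Regression", LogisticRegression())]:
    # scoring="roc_auc" computes the area under the ROC curve on each held-out fold
    demo_scores[name] = cross_val_score(model, X_demo, y_demo,
                                        cv=cv, scoring="roc_auc")
    print(name, "mean AUC:", demo_scores[name].mean())
```

The explicit loop used in this notebook does the same work step by step and makes the per-fold results easier to inspect.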

In [15]:
# Specifying the cross-validation ROC scoring code 
# from Scikit Learn documentation

from sklearn.metrics import roc_auc_score

# Specifying the k-fold cross-validation design

from sklearn.model_selection import KFold

# Setting the names list containing names of the Classification methods
names = ["Naive_Bayes", "Logistic_Regression"]

# Setting classifiers list with Naive Bayes and 
# Logistic Regression classifier functions
classifiers = [BernoulliNB(alpha=1.0, binarize=0.5, 
                           class_prior = [0.5, 0.5], fit_prior=False), 
               LogisticRegression()]

# Specifying the folds for validation - 10-fold cross-validation
N_FOLDS = 10

# Setting up numpy array for storing results
cv_results = np.zeros((N_FOLDS, len(names)))

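# Setting the KFold object using the number of folds.
# With shuffle=False the folds are consecutive blocks of rows, so no random
# seed is needed; recent scikit-learn versions raise an error if random_state
# is set while shuffle=False.

kf = KFold(n_splits = N_FOLDS, shuffle=False)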
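
Because the response is imbalanced, an unstratified split can put different numbers of positive cases in each fold. One possible alternative, not used in this analysis, is `StratifiedKFold`, which keeps the class ratio roughly constant across folds. The sketch below uses a synthetic label vector with the same class counts as the bank data; the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the same class counts as the bank response (4000 no, 521 yes)
y_demo = np.array([0] * 4000 + [1] * 521)
X_demo = np.zeros((y_demo.size, 1))  # placeholder features; only y drives the stratification

# random_state is only meaningful when shuffle=True
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Count the positive cases that land in each test fold
positives_per_fold = [int(y_demo[test_idx].sum())
                      for _, test_idx in skf.split(X_demo, y_demo)]
print(positives_per_fold)
```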

In [16]:
# Checking the splitting process by looking at fold observation counts
# Initializing the fold count and looping through the splits of model data
# The variable index_for_fold will be used as the results array index

index_for_fold = 0  

# This loop will run 10 times, once for each fold

for train_index, test_index in kf.split(model_data):
    print('\nFold index:', index_for_fold,
          '------------------------------------------')

    # Generating Training and Test datasets using train and 
    # test indexes generated from KFold Split
    # 0:model_data.shape[1]-1 slices for explanatory variables
    # model_data.shape[1]-1 is the index for the response variable   

    X_train = model_data[train_index, 0:model_data.shape[1]-1]
    X_test = model_data[test_index, 0:model_data.shape[1]-1]
    y_train = model_data[train_index, model_data.shape[1]-1]
    y_test = model_data[test_index, model_data.shape[1]-1]   
 
    # Initializing the method count and looping through the methods 
    # (Logistic Regression and Naive Bayes)
    # This loop will run twice, once for each method, per fold
    # The variable index_for_method will be used as results array index
    
    index_for_method = 0
    for name, clf in zip(names, classifiers):
    
        # Fitting the classifier method on the training set for this fold
        
        clf.fit(X_train, y_train)  
        
        # Evaluating the test set for this fold by predicting values 
        # for Test dataset
        
        y_test_predict = clf.predict_proba(X_test)
        
        # Getting the ROC_AUC for test dataset
        
        fold_method_result = roc_auc_score(y_test, y_test_predict[:,1]) 
        
        # Displaying the Name of the method and AUC ROC for this fold
        
        print('Method Name:', name)
        print('Area under ROC curve:', fold_method_result)
        
        # Storing the result in an array having fold X method 
        # dimension (10 X 2)
        
        cv_results[index_for_fold, index_for_method] = fold_method_result
        
        # Incrementing method number array index
        
        index_for_method += 1
        
    # Incrementing fold number array index
  
    index_for_fold += 1

# Generating a DataFrame using the results array created by KFold methods

cv_results_df = pd.DataFrame(cv_results)

# Setting the column names for the DataFrame

cv_results_df.columns = names

# Displaying results from the results DataFrame

print('\n----------------------------------------------')
print('Average results from ', N_FOLDS, '-fold cross-validation\n',
      '\nMethod                 Area under ROC Curve', sep = '')     
print(cv_results_df.mean())

cv_results_df


Fold index: 0 ------------------------------------------
Method Name: Naive_Bayes
Area under ROC curve: 0.5878522062732588
Method Name: Logistic_Regression
Area under ROC curve: 0.5878522062732588

Fold index: 1 ------------------------------------------
Method Name: Naive_Bayes
Area under ROC curve: 0.633727778967659
Method Name: Logistic_Regression
Area under ROC curve: 0.633727778967659

Fold index: 2 ------------------------------------------
Method Name: Naive_Bayes
Area under ROC curve: 0.6575354454752319
Method Name: Logistic_Regression
Area under ROC curve: 0.6575354454752319

Fold index: 3 ------------------------------------------
Method Name: Naive_Bayes
Area under ROC curve: 0.6373879576644552
Method Name: Logistic_Regression
Area under ROC curve: 0.6355648959335594

Fold index: 4 ------------------------------------------
Method Name: Naive_Bayes
Area under ROC curve: 0.5743993962517295
Method Name: Logistic_Regression
Area under ROC curve: 0.5743993962517295

Fold index:

   Naive_Bayes  Logistic_Regression
0     0.587852             0.587852
1     0.633728             0.633728
2     0.657535             0.657535
3     0.637388             0.635565
4     0.574399             0.574399
5     0.574684             0.584238
6     0.562512             0.562512
7     0.731144             0.731144
8     0.573533             0.573533
9     0.577827             0.576820
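
Since the printed output above is cut off before the averaged results, the per-method means can be recomputed from the fold table. The values below are copied from that table (rounded to six decimals), so the means are approximate.

```python
# Fold-level AUC values copied from the table above
naive_bayes_auc = [0.587852, 0.633728, 0.657535, 0.637388, 0.574399,
                   0.574684, 0.562512, 0.731144, 0.573533, 0.577827]
logistic_auc = [0.587852, 0.633728, 0.657535, 0.635565, 0.574399,
                0.584238, 0.562512, 0.731144, 0.573533, 0.576820]

# Simple arithmetic mean over the ten folds
mean_nb = sum(naive_bayes_auc) / len(naive_bayes_auc)
mean_lr = sum(logistic_auc) / len(logistic_auc)

print("Naive_Bayes mean AUC:        ", round(mean_nb, 4))
print("Logistic_Regression mean AUC:", round(mean_lr, 4))
```

The two means differ by less than 0.001.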


The similarity of the ROC AUC scores for the two models may be due to several factors:

1. The training dataset is too small to generate an observable difference.
2. The dataset is not diverse enough (it had 4000 negative responses and 521 positive responses).
3. We need to include more features in the model.
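
One further way to look at the near-identical scores: ROC AUC depends only on how a model ranks the cases, and with three binary features there are just eight distinct feature combinations to rank. The sketch below compares the ordering of the eight test cases under the two full-data fits, using the `predict_YES` columns printed earlier; the array names are assumptions for illustration.

```python
import numpy as np

# predict_YES columns copied from the two prediction tables above (rows 0-7)
lr_yes = np.array([0.054271, 0.107651, 0.099214, 0.188012,
                   0.046723, 0.093412, 0.085984, 0.165110])
nb_yes = np.array([0.052089, 0.103775, 0.095616, 0.182190,
                   0.046463, 0.093115, 0.085714, 0.164958])

# Row indices sorted from least to most likely to respond
lr_order = np.argsort(lr_yes)
nb_order = np.argsort(nb_yes)

print("Logistic regression order:", lr_order)
print("Naive Bayes order:        ", nb_order)
print("Same ranking:", np.array_equal(lr_order, nb_order))
```

When two models order the cases identically, their AUC on the same fold is identical, which would account for the folds with matching scores. The folds that differ (3, 5, and 9) suggest the ordering shifted slightly when the models were refit on those training subsets.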

Based on the test results and the average of the 10-fold cross-validation (a mean AUC of about 0.612 for Logistic Regression versus about 0.611 for Naive Bayes, computed from the fold table above), I think Logistic Regression performed marginally better than Naive Bayes, so management should consider using Logistic Regression.

Both models predict that the following groups are more likely than the others to respond positively (predicted probability above 10%) to the telephone marketing campaign:

1. People with credit in default, no housing loan, and no personal loan.
2. People without credit in default, no housing loan, and no personal loan.
3. People with credit in default and a housing loan, but no personal loan.
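
The targeting rule above can be expressed as a small filter over the eight segments. This sketch re-enters the logistic regression `predict_YES` values from the earlier table and applies the 10% cutoff; the DataFrame and variable names are assumptions for illustration.

```python
import pandas as pd

# Eight segments and logistic regression predict_YES values from the earlier table
segments = pd.DataFrame({
    "default": [1, 1, 1, 1, 0, 0, 0, 0],
    "housing": [1, 1, 0, 0, 1, 1, 0, 0],
    "loan":    [1, 0, 1, 0, 1, 0, 1, 0],
    "predict_YES": [0.054271, 0.107651, 0.099214, 0.188012,
                    0.046723, 0.093412, 0.085984, 0.165110],
})

# Keep segments above the 10% cutoff, most promising first
targets = (segments[segments["predict_YES"] > 0.10]
           .sort_values("predict_YES", ascending=False))
print(targets)
```

The cutoff is a judgment call; a different value would change which segments are selected.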