### Partition the Data

In [1]:
import pandas as pd
import numpy as np
import random

from sklearn.model_selection import train_test_split

In [2]:
bank = pd.read_csv("/Users/antran/Google Drive/USD_Teaching/ADS-502/WebsiteDataSets/bank-additional.csv", sep=';')

Again, for simplicity and demonstration purposes, we select only a few predictors along with the target variable.

In [3]:
bank = bank[['job', 'marital', 'housing', 'loan', 'y']]

To partition the data set, we will use the `train_test_split()` function. Let's hold out 25% of our data as the test set.

In [4]:
bank_train, bank_test = train_test_split(bank, 
                                         test_size = 0.25,
                                         random_state = 7)

To confirm that the data set was partitioned correctly, you can compare the
number of rows in the original, training, and test data sets using the `shape` attribute.

In [5]:
print('Original number of instances before partitioning: ', bank.shape[0],
      '\nNumber of instances in Training set: ', bank_train.shape[0],
      '\nNumber of instances in Test set: ', bank_test.shape[0])

Original number of instances before partitioning:  4119 
Number of instances in Training set:  3089 
Number of instances in Test set:  1030


In [6]:
print('Proportion of training instances: ', bank_train.shape[0]/bank.shape[0]*100,
      '\nProportion of test instances: ', bank_test.shape[0]/bank.shape[0]*100)

Proportion of training instances:  74.99393056567128 
Proportion of test instances:  25.00606943432872


### Balance the Training Data Set

First, we identify how many records in `bank_train` have the less common value,
`yes`, for the response variable `y`, using the `value_counts()` method.

In [7]:
bank_train['y'].value_counts()

no     2751
yes     338
Name: y, dtype: int64

In [8]:
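# Look up the count by label rather than by position, so the result does not depend on sort order
ratio = bank_train['y'].value_counts()['yes'] / bank_train.shape[0] * 100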
ratio

10.942052444156685

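The count of `yes` responses will change depending on the partition. Say we want to increase the percentage of `yes` responses to 30%, that is, p = 0.3. With 3089 training records, 338 of which are `yes`, we need x additional `yes` records such that (338 + x) / (3089 + x) = 0.3. Solving gives x = (0.3 × 3089 − 338) / (1 − 0.3) ≈ 841, so we need to resample about 840 records whose response is `yes` and add
them to our training data set. To begin resampling, we isolate the records that we want to resample.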

In [9]:
to_resample = bank_train.loc[bank_train['y'] == "yes"]

Next, we need to sample from our records of interest

In [10]:
our_resample = to_resample.sample(n = 840, replace = True)

Finally, we add the resampled records to our original training data set

In [11]:
bank_train_rebal = pd.concat([bank_train, our_resample], axis=0)

To check that the desired percentage of `yes` responses was obtained, examine
the table of the response variable.

In [12]:
bank_train_rebal['y'].value_counts()

no     2751
yes    1178
Name: y, dtype: int64

In [13]:
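# Look up the count by label rather than by position, so the result does not depend on sort order
ratio = bank_train_rebal['y'].value_counts()['yes'] / bank_train_rebal.shape[0] * 100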
ratio

29.98218376177144
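The number of records to resample does not have to be estimated by eye. The sketch below solves (rare + x) / (records + x) = p for x, using the counts reported above (3089 training records, 338 `yes` responses). The function name `records_to_resample` is a hypothetical helper introduced here for illustration; it is not part of pandas or sklearn.

```python
def records_to_resample(n_records: int, n_rare: int, p: float) -> int:
    """Return how many rare records to add so they make up a fraction p.

    Solves (n_rare + x) / (n_records + x) = p for x and rounds to the
    nearest whole record. Returns 0 if the rare class already exceeds p.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    x = (p * n_records - n_rare) / (1 - p)
    return max(0, round(x))


# Counts taken from the training partition shown above.
n_extra = records_to_resample(n_records=3089, n_rare=338, p=0.3)
print("Records to resample:", n_extra)

# Resulting share of `yes` responses after adding n_extra records.
share = (338 + n_extra) / (3089 + n_extra)
print("Resulting proportion of yes:", share)
```

The notebook rounds this figure down to 840, which is why the rebalanced percentage comes out slightly under 30%.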

### Naive Bayes

In [14]:
from sklearn.naive_bayes import MultinomialNB

Again, a good practice is to split predictors and target variables into different objects

In [15]:
X_train = bank_train_rebal[['job', 'marital', 'housing', 'loan']]
y_train = bank_train_rebal['y']

As before, sklearn does not automatically handle categorical variables. This means we need to convert our categorical variables into dummy-variable versions of themselves before we can run the algorithm.

In [16]:
X_train_processed = pd.DataFrame()

for var in X_train.columns:
    dummies = pd.get_dummies(X_train[var])
    X_train_processed = pd.concat([X_train_processed, dummies], axis=1) 

Finally, we run the Naïve Bayes algorithm

In [17]:
nb = MultinomialNB().fit(X_train_processed, y_train)

To make predictions on the test data using the Naïve Bayes model, we first need to set up the X variables in the test data set the exact same way as we did for the training dataset

In [18]:
X_test = bank_test[['job', 'marital', 'housing', 'loan']]
y_test = bank_test['y']

In [19]:
X_test_processed = pd.DataFrame()

for var in X_test.columns:
    dummies = pd.get_dummies(X_test[var])
    X_test_processed = pd.concat([X_test_processed, dummies], axis=1) 

In [20]:
predictions = nb.predict(X_test_processed)
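One caveat when dummies are built separately for the training and test sets: if a category happens to be absent from one partition, the two frames end up with different columns, and the model cannot score the test set. A minimal sketch of one way to guard against this is shown below, assuming small made-up frames (`X_train_demo`, `X_test_demo`) that stand in for the real data. It calls `pd.get_dummies()` on the whole frame, which prefixes each dummy with its source column name, and then uses `reindex()` to force the test columns to match the training columns.

```python
import pandas as pd

# Hypothetical miniature frames standing in for X_train and X_test.
# The test frame is missing the 'student' job and the 'unknown' housing value.
X_train_demo = pd.DataFrame({"job": ["admin.", "technician", "student"],
                             "housing": ["yes", "no", "unknown"]})
X_test_demo = pd.DataFrame({"job": ["technician", "admin."],
                            "housing": ["no", "yes"]})

# Encoding the whole frame prefixes each dummy with its column name
# (for example 'housing_yes'), so 'yes' from different columns cannot collide.
train_dummies = pd.get_dummies(X_train_demo, dtype=int)
test_dummies = pd.get_dummies(X_test_demo, dtype=int)

# Align the test columns to the training columns: categories missing from
# the test set become all-zero columns, and the column order matches training.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(test_aligned)
```

If the same idea were applied to the notebook, both the training and test frames would need to be encoded this way so that the column names the model was fitted on match the ones it is asked to predict on.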

### Model evaluation

In [21]:
from sklearn.metrics import confusion_matrix

In [22]:
cm = confusion_matrix(y_test, predictions)

In [23]:
cm

array([[899,  18],
       [109,   4]])

In [24]:
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP = cm[1][1]

print('TN: ', TN,
      '\nFP: ', FP,
      '\nFN: ', FN,
      '\nTP: ', TP)

TN:  899 
FP:  18 
FN:  109 
TP:  4


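$$ \text{Accuracy} = \frac{TN + TP}{TN + TP + FP + FN} $$

$$ \text{Error Rate} = 1 - \text{Accuracy} $$

$$ \text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN} $$

$$ \text{Specificity} = \frac{TN}{TN + FP} $$

$$ \text{Precision} = \frac{TP}{TP + FP} $$

$$ F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

$$ F_2 = 5 \times \frac{\text{Precision} \times \text{Recall}}{4 \times \text{Precision} + \text{Recall}} $$

$$ F_{0.5} = 1.25 \times \frac{\text{Precision} \times \text{Recall}}{0.25 \times \text{Precision} + \text{Recall}} $$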
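As a sketch of how the formulas above translate into code, the block below evaluates each measure from the confusion matrix counts reported earlier (TN = 899, FP = 18, FN = 109, TP = 4). The helper `f_beta` is a hypothetical function written for this illustration; it uses the general form (1 + β²) × Precision × Recall / (β² × Precision + Recall), of which F1, F2, and F0.5 are special cases.

```python
# Counts copied from the confusion matrix above.
TN, FP, FN, TP = 899, 18, 109, 4


def f_beta(precision: float, recall: float, beta: float) -> float:
    """General F-beta score; beta > 1 weights recall more, beta < 1 precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when there are no true positives
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


accuracy = (TN + TP) / (TN + TP + FP + FN)
error_rate = 1 - accuracy
sensitivity = TP / (TP + FN)   # also called recall
specificity = TN / (TN + FP)
precision = TP / (TP + FP)

f1 = f_beta(precision, sensitivity, beta=1)
f2 = f_beta(precision, sensitivity, beta=2)
f05 = f_beta(precision, sensitivity, beta=0.5)

for name, value in [("Accuracy", accuracy), ("Error rate", error_rate),
                    ("Sensitivity", sensitivity), ("Specificity", specificity),
                    ("Precision", precision), ("F1", f1), ("F2", f2),
                    ("F0.5", f05)]:
    print(f"{name}: {value:.4f}")
```

Note how the high accuracy sits alongside a very low sensitivity: the model labels most test records `no`, so it rarely catches the `yes` cases.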