Assignment 5: Naive Bayes Classifier

The problems in this assignment are based on Exercise 8.1 in Chapter 8 of Data Mining for Business Analytics.

Context: Develop a model to predict whether a new customer will accept a loan offer.

Data: The file UniversalBank.csv contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.

In this exercise we focus on two predictors, Online (whether or not the customer is an active user of online banking services) and CreditCard (whether or not the customer holds a credit card issued by the bank), and on the outcome Personal Loan.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import math
from sklearn.naive_bayes import MultinomialNB

Data preparation: Remove all unnecessary columns from the dataset and convert Online and CreditCard to categories. Split the data into training (60%) and validation (40%) sets (use random_state=1).

In [2]:
# Load the data
bank_df = pd.read_csv("dmba/UniversalBank.csv")

# Verify data is loaded correctly
print("Shape", bank_df.shape)  # determine data frame dimensions
bank_df.head(5)  # view the first 5 observations

Shape (5000, 14)


   ID  Age  Experience  Income  ZIP Code  Family  CCAvg  Education  Mortgage  Personal Loan  Securities Account  CD Account  Online  CreditCard
0   1   25           1      49     91107       4    1.6          1         0              0                   1           0       0           0
1   2   45          19      34     90089       3    1.5          1         0              0                   1           0       0           0
2   3   39          15      11     94720       1    1.0          1         0              0                   0           0       0           0
3   4   35           9     100     94112       1    2.7          2         0              0                   0           0       0           0
4   5   35           8      45     91330       4    1.0          2         0              0                   0           0       0           1


In [3]:
# Keep only Online, CreditCard and Personal Loan; copy so later assignments modify an independent frame
bank_df = bank_df[["Online", "CreditCard", "Personal Loan"]].copy()

# Convert Online and CreditCard to categories
bank_df["Online"] = bank_df["Online"].astype("category")
bank_df["CreditCard"] = bank_df["CreditCard"].astype("category")

# Re-verify the reduced data frame
print("Shape", bank_df.shape)  # determine data frame dimensions
bank_df.tail(5)  # view the last 5 observations

Shape (5000, 3)


      Online  CreditCard  Personal Loan
4995       1           0              0
4996       1           0              0
4997       0           0              0
4998       1           0              0
4999       1           1              0


In [4]:
#y = bank_df["Personal Loan"]
#X = bank_df.drop(columns=["Personal Loan"])

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
#print("The shape of the training set is", X_train.shape, ", test set shape is", X_test.shape)

train_df, test_df = train_test_split(bank_df, test_size=0.4, random_state=1)
print("The shapes of the training and test sets are", train_df.shape, "and", test_df.shape, "respectively.")

The shapes of the training and test sets are (3000, 3) and (2000, 3) respectively.


Question 1 (2 points) Create a pivot table for the training data with Online as a column variable, CreditCard as a row variable, and Personal Loan as a secondary row variable. The values inside the cells should convey the count (number of records).

In [5]:
train_df.pivot_table(index=['CreditCard', 'Personal Loan'], columns=['Online'], aggfunc=len)

Online                      0     1
CreditCard Personal Loan           
0          0              792  1117
0          1               73   126
1          0              327   477
1          1               39    49


Question 2 (2 points) Consider the task of classifying a customer who owns a bank credit card and is actively using online banking services. Looking at the pivot table that you created, what is the probability that this customer will accept the loan offer?

In [6]:
# The probability that a customer accepts the loan, given that they hold a bank
# credit card and are active online, is P(Loan = 1 | CC = 1, Online = 1).
# From the pivot table: 49 acceptors out of 49 + 477 customers with CC = 1 and Online = 1.

prob_CCandOnline = 49 / (49 + 477)
print("The probability that a customer accepts a loan, given that they hold a credit card and are active online,")
print("is P(Loan = 1 | CC = 1, Online = 1) = %s" % prob_CCandOnline)

The probability that a customer accepts a loan, given that they hold a credit card and are active online,
is P(Loan = 1 | CC = 1, Online = 1) = 0.09315589353612168
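
As a cross-check, the same probability can be obtained from Bayes' rule applied to the joint event (CC=1, Online=1), with no independence assumption. The sketch below is illustrative only: the counts are copied by hand from the Question 1 pivot table, and the variable names are hypothetical rather than part of the original notebook.

```python
# Counts copied by hand from the Question 1 pivot table (training set, 3000 records).
n_total = 3000
n_loan1 = 73 + 126 + 39 + 49      # all loan acceptors
n_cc1_online1_loan1 = 49          # CC = 1, Online = 1, Loan = 1
n_cc1_online1 = 49 + 477          # CC = 1, Online = 1, any Loan value

p_loan1 = n_loan1 / n_total
p_joint_given_loan1 = n_cc1_online1_loan1 / n_loan1
p_joint = n_cc1_online1 / n_total

# Bayes' rule:
# P(Loan=1 | CC=1, Online=1) = P(CC=1, Online=1 | Loan=1) * P(Loan=1) / P(CC=1, Online=1)
p_via_bayes = p_joint_given_loan1 * p_loan1 / p_joint
p_direct = n_cc1_online1_loan1 / n_cc1_online1

print(f"via Bayes' rule: {p_via_bayes:.4f}")
print(f"direct ratio:    {p_direct:.4f}")
```

Both routes reduce to the same ratio of counts, which is why reading the probability straight off the pivot table is the exact Bayes estimate for this cell.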


Question 3 (6 points) Create two separate pivot tables for the training data. One will have Personal Loan (rows) as a function of Online (columns) and the other will have Personal Loan (rows) as a function of CreditCard. Compute the probabilities below (report three decimals). Note: P(A|B) means "the probability of A given B".

    P(CC=1|Loan=1) = the proportion of credit card holders among the loan acceptors
    P(Online=1|Loan=1)
    P(Loan=1) = the proportion of loan acceptors
    P(CC=1|Loan=0)
    P(Online=1|Loan=0)
    P(Loan=0)

(CreditCard abbreviated as CC, Personal Loan abbreviated as Loan)

In [7]:
# Using Table 8.4 pages 216 and 217
pd.set_option("display.precision", 3)  # full option name; the bare "precision" pattern is ambiguous in newer pandas

predictors = ["CreditCard", "Online"]
outcome = ["Personal Loan"]

# The probability of a personal loan
print("Reporting to three decimals, the probability of a personal loan is \n %s" 
      % (train_df['Personal Loan'].value_counts() / len(train_df)))

for predictor in predictors:
    # construct the frequency table
    df = train_df[['Personal Loan', predictor]]
    freqTable = df.pivot_table(index='Personal Loan', columns=predictor, aggfunc=len)

    # divide each row by the sum of the row to get conditional probabilities
    propTable = freqTable.apply(lambda x: x / sum(x), axis=1)
    print(propTable)
    print()

pd.reset_option("display.precision")

Reporting to three decimals, the probability of a personal loan is 
 0    0.904
1    0.096
Name: Personal Loan, dtype: float64
CreditCard         0      1
Personal Loan              
0              0.704  0.296
1              0.693  0.307

Online             0      1
Personal Loan              
0              0.412  0.588
1              0.390  0.610



Reporting to three decimals:
P(CC=1|Loan=1)     = 0.307 the proportion of credit card holders among the loan acceptors
P(Online=1|Loan=1) = 0.610
P(Loan=1)          = 0.096 the proportion of loan acceptors
P(CC=1|Loan=0)     = 0.296
P(Online=1|Loan=0) = 0.588
P(Loan=0)          = 0.904
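
The six quantities can also be derived directly from the cell counts of the Question 1 pivot table, which makes it easy to confirm the rounded values above. This is a minimal sketch under the assumption that those counts were copied correctly; the `counts` dictionary and the `count` helper are hypothetical names introduced here for illustration.

```python
# (CreditCard, Personal Loan, Online) -> number of training records,
# copied by hand from the Question 1 pivot table.
counts = {
    (0, 0, 0): 792, (0, 0, 1): 1117,
    (0, 1, 0): 73,  (0, 1, 1): 126,
    (1, 0, 0): 327, (1, 0, 1): 477,
    (1, 1, 0): 39,  (1, 1, 1): 49,
}


def count(cc=None, loan=None, online=None):
    """Sum the cells that match every condition that is not None."""
    return sum(
        n for (c, l, o), n in counts.items()
        if (cc is None or c == cc)
        and (loan is None or l == loan)
        and (online is None or o == online)
    )


n_total = count()
probs = {
    "P(CC=1|Loan=1)": count(cc=1, loan=1) / count(loan=1),
    "P(Online=1|Loan=1)": count(online=1, loan=1) / count(loan=1),
    "P(Loan=1)": count(loan=1) / n_total,
    "P(CC=1|Loan=0)": count(cc=1, loan=0) / count(loan=0),
    "P(Online=1|Loan=0)": count(online=1, loan=0) / count(loan=0),
    "P(Loan=0)": count(loan=0) / n_total,
}

for name, value in probs.items():
    print(f"{name:<20} = {value:.3f}")
```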

Question 4 (2 points) Compute the naive Bayes probability P(Loan=1|CC=1,Online=1). Note: Use the quantities that you computed in the previous question. Refer to the naive Bayes formula (8.3) in the book.

In [8]:
pd.set_option("display.precision", 3)

# Naive Bayes formula (8.3):
# P(Loan=1|CC=1,Online=1) =
#     P(Loan=1) * P(CC=1|Loan=1) * P(Online=1|Loan=1)
#     / [P(Loan=1) * P(CC=1|Loan=1) * P(Online=1|Loan=1) + P(Loan=0) * P(CC=1|Loan=0) * P(Online=1|Loan=0)]
prob_L1_CC1_On1 = (0.096 * 0.307 * 0.610) / ((0.096 * 0.307 * 0.610) + (0.904 * 0.296 * 0.588))

print("P(Loan=1|CC=1,Online=1) is equal to %s." % prob_L1_CC1_On1)

pd.reset_option("display.precision")

P(Loan=1|CC=1,Online=1) is equal to 0.10254503559808173.
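
The calculation above plugs in probabilities rounded to three decimals. The sketch below wraps the same formula in a small function and evaluates it twice, once with the rounded inputs and once with unrounded ratios of the Question 1 pivot-table counts, to show how much the rounding matters. The function name `naive_bayes_posterior` and its parameter names are hypothetical, and the count ratios assume the pivot table was transcribed correctly.

```python
def naive_bayes_posterior(prior1, cc1, online1, prior0, cc0, online0):
    """Naive Bayes estimate of P(Loan=1 | CC=1, Online=1) for two binary predictors.

    prior1, prior0   : P(Loan=1), P(Loan=0)
    cc1, cc0         : P(CC=1 | Loan=1), P(CC=1 | Loan=0)
    online1, online0 : P(Online=1 | Loan=1), P(Online=1 | Loan=0)
    """
    numerator = prior1 * cc1 * online1
    return numerator / (numerator + prior0 * cc0 * online0)


# Inputs rounded to three decimals, as reported in Question 3
p_rounded = naive_bayes_posterior(0.096, 0.307, 0.610, 0.904, 0.296, 0.588)

# Unrounded inputs, formed as ratios of the training-set counts
p_unrounded = naive_bayes_posterior(
    287 / 3000, 88 / 287, 175 / 287,
    2713 / 3000, 804 / 2713, 1594 / 2713,
)

print(f"rounded inputs:   {p_rounded:.4f}")
print(f"unrounded inputs: {p_unrounded:.4f}")
```

The two results differ only in the fourth decimal place, so the rounded inputs are adequate for this exercise.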


Question 5 (2 points) Of the two values that you computed earlier (computed in Q2 and Q4), which is a more accurate estimate of P(Loan=1|CC=1,Online=1)?

In [9]:
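# The Q2 value is the exact (complete) Bayes estimate: it is computed directly from the
# training records with CC = 1 and Online = 1 and does not assume that CreditCard and
# Online are conditionally independent given Personal Loan, as naive Bayes (Q4) does.
print("The value calculated for P(Loan=1|CC=1,Online=1) in Q2 of %s is more accurate." % prob_CCandOnline)

The value calculated for P(Loan=1|CC=1,Online=1) in Q2 of 0.09315589353612168 is more accurate.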


Question 6 (6 points) In Python, run naive Bayes on the training data. Use data points that match the condition CreditCard=1,Online=1 to find the predicted probability for P(Loan=1|CC=1,Online=1). 

In [10]:
# Following the example in Table 8.4: one-hot encode the two predictors only.
# Personal Loan is the outcome, so it must not appear among the features.
X = pd.get_dummies(train_df[["Online", "CreditCard"]], prefix_sep="---", dtype=int)
y = train_df["Personal Loan"].astype("category")

# split the training data once more into fitting and hold-out subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=1)

# run naive Bayes
nb = MultinomialNB(alpha=0.01)
nb.fit(X_train, y_train)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [12]:
X_train.head()

      Online---0  Online---1  CreditCard---0  CreditCard---1
4829           0           1               0               1
3863           0           1               1               0
2633           0           1               1               0
811            1           0               1               0
4771           0           1               1               0


In [11]:
predProb_train = nb.predict_proba(X_train)
predicted = pd.concat([X_train, pd.DataFrame(predProb_train, index=X_train.index)], axis=1)
# predProb_test = nb.predict_proba(X_test)

# Column names containing "---" are not valid attribute names
# (predicted.CreditCard---1 parses as a subtraction), so use bracket indexing.
matches = (predicted["CreditCard---1"] == 1) & (predicted["Online---1"] == 1)
predicted[matches].head()
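
As a self-contained cross-check for Question 6, the sketch below rebuilds the 3000 training records from the Question 1 pivot-table counts, fits MultinomialNB on all of them, and queries the predicted probability for a customer with CreditCard=1 and Online=1. It is an illustration under stated assumptions, not the notebook's own run: the counts are copied by hand, the model is fitted on the full training set rather than on a further 60/40 split, and the one-hot column order is chosen here for convenience.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# (CreditCard, Personal Loan, Online) -> number of training records,
# copied by hand from the Question 1 pivot table.
counts = {
    (0, 0, 0): 792, (0, 0, 1): 1117,
    (0, 1, 0): 73,  (0, 1, 1): 126,
    (1, 0, 0): 327, (1, 0, 1): 477,
    (1, 1, 0): 39,  (1, 1, 1): 49,
}

rows, labels = [], []
for (cc, loan, online), n in counts.items():
    # One-hot columns: Online---0, Online---1, CreditCard---0, CreditCard---1
    row = [1 - online, online, 1 - cc, cc]
    rows.extend([row] * n)
    labels.extend([loan] * n)

X = np.array(rows)
y = np.array(labels)

nb = MultinomialNB(alpha=0.01).fit(X, y)

# Query: Online = 1 and CreditCard = 1
query = np.array([[0, 1, 0, 1]])
loan1_column = list(nb.classes_).index(1)
p_loan = nb.predict_proba(query)[0, loan1_column]

print(f"P(Loan=1 | CC=1, Online=1) from MultinomialNB: {p_loan:.4f}")
```

Because every record contributes exactly one active dummy per predictor, the multinomial likelihoods are the per-class conditional proportions scaled by a constant that cancels in the posterior, so the result should sit very close to the hand-computed naive Bayes value from Question 4 (the small `alpha` smoothing shifts it only slightly).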