# Predictive Analytics 1 - Machine Learning Tools - Using Python
Instructor(s) - Peter Gedeck

## Solution: Assignment 5 - Naive Bayes Classifier

In [1]:
# %matplotlib inline
import math
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

DATA = Path('.').resolve().parent / 'data'

## Data preparation
Remove all unnecessary columns from the dataset and convert _Online_ and _CreditCard_ to categories. Split the data into training (60%) and validation (40%) sets (use <code>random_state=1</code>).

In [2]:
# Load the data
bank_df = pd.read_csv(DATA / 'UniversalBank.csv')

# Consider only the required variables and reorder the columns at the same time
bank_df = bank_df[['Online', 'CreditCard', 'Personal Loan']]
bank_df.Online = bank_df.Online.astype('category')
bank_df.CreditCard = bank_df.CreditCard.astype('category')
bank_df.head()

   Online  CreditCard  Personal Loan
0       0           0              0
1       0           0              0
2       0           0              0
3       0           0              0
4       0           1              0


Split the dataset into training and validation sets.

In [3]:
train_df, valid_df = train_test_split(bank_df, test_size=0.4, random_state=1)
print('Training set:', train_df.shape, 'Validation set:', valid_df.shape)

Training set: (3000, 3) Validation set: (2000, 3)


## Question 1
Create a pivot table for the training data with Online as a column variable, CreditCard as a row variable, and Personal Loan as a secondary row variable. The values inside the cells should convey the count (number of records).

In [4]:
train_df.pivot_table(index=['CreditCard', 'Personal Loan'], 
                     columns=['Online'], aggfunc=len)

Online                       0     1
CreditCard Personal Loan
0          0               792  1117
           1                73   126
1          0               327   477
           1                39    49
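
As a cross-check, the same count table can also be produced with `pd.crosstab`, which tallies records directly. The sketch below does not load `UniversalBank.csv`; it assumes the eight cell counts shown above and expands them back into a small stand-in frame. The names `counts`, `demo_df`, and `table` are illustrative and not part of the original solution.

```python
import pandas as pd

# Cell counts copied from the pivot table above:
# (CreditCard, Personal Loan, Online) -> number of records
counts = {
    (0, 0, 0): 792, (0, 0, 1): 1117,
    (0, 1, 0): 73,  (0, 1, 1): 126,
    (1, 0, 0): 327, (1, 0, 1): 477,
    (1, 1, 0): 39,  (1, 1, 1): 49,
}

# Expand the counts into one row per record (illustrative stand-in for train_df)
rows = [key for key, n in counts.items() for _ in range(n)]
demo_df = pd.DataFrame(rows, columns=['CreditCard', 'Personal Loan', 'Online'])

# Rows: CreditCard, then Personal Loan; columns: Online; cells: record counts
table = pd.crosstab(index=[demo_df['CreditCard'], demo_df['Personal Loan']],
                    columns=demo_df['Online'])
print(table)
```

With the real training data, `demo_df` would simply be replaced by `train_df`.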


## Question 2
Consider the task of classifying a customer who owns a bank credit card and is actively using online banking services. Looking at the pivot table that you created, what is the probability that this customer will accept the loan offer?

Use the pivot table created in Question 1 for the answer

There are 477 + 49 = 526 records where Online = 1 and CC = 1.
49 of them accept the loan, so the conditional probability is 49/526 = 0.0932.

In [5]:
p11 = 49 / (477 + 49)
print('Count based probability P(Loan = 1|CC = 1, Online = 1) = ', p11)

Count based probability P(Loan = 1|CC = 1, Online = 1) =  0.09315589353612168
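
The same count-based calculation applies to every combination of the two predictors, not only CC = 1 and Online = 1. The following sketch assumes the counts from the Question 1 pivot table; the dictionary `cells` and the helper `exact_probability` are hypothetical names introduced for illustration.

```python
from fractions import Fraction

# (CreditCard, Online) -> (non-acceptors, acceptors), from the Question 1 pivot table
cells = {
    (0, 0): (792, 73),
    (0, 1): (1117, 126),
    (1, 0): (327, 39),
    (1, 1): (477, 49),
}

def exact_probability(cc, online):
    """P(Loan = 1 | CC = cc, Online = online) computed from raw counts."""
    rejected, accepted = cells[(cc, online)]
    return Fraction(accepted, accepted + rejected)

for cc, online in cells:
    p = exact_probability(cc, online)
    print(f'P(Loan = 1 | CC = {cc}, Online = {online}) = {float(p):.4f}')
```

Using `Fraction` keeps the ratio exact until it is formatted for display.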


## Question 3
Create two separate pivot tables for the training data. One will have Personal Loan (rows) as a function of Online (columns) and the other will have Personal Loan (rows) as a function of CreditCard (columns).

Compute the probabilities below (report three decimals). Note: P(A|B) means "the probability of A given B".

1. P(CreditCard = 1|Loan = 1) = the proportion of credit card holders among the loan acceptors
2. P(Online = 1|Loan = 1)
3. P(Loan = 1) = the proportion of loan acceptors
4. P(CC = 1|Loan = 0)
5. P(Online = 1|Loan = 0)
6. P(Loan = 0)

<small>(<em>CreditCard</em> abbreviated as CC, <em>Personal Loan</em> abbreviated as Loan)</small>

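Pivot tables for Loan (rows) as a function of each predictor (columns): first CreditCard, then Online. Here we can use the `pivot_table` method of the pandas data frame and then divide each row by its total to turn counts into conditional probabilities.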

In [6]:
predictors = ['CreditCard', 'Online']

print(train_df['Personal Loan'].value_counts() / len(train_df))
print()

for predictor in predictors:
    # construct the frequency table
    df = train_df[['Personal Loan', predictor]]
    freqTable = df.pivot_table(index='Personal Loan', columns=predictor, aggfunc=len)

    # divide each row by the sum of the row to get conditional probabilities
    propTable = freqTable.apply(lambda x: x / sum(x), axis=1)
    print(propTable)
    print()

0    0.904333
1    0.095667
Name: Personal Loan, dtype: float64

CreditCard            0         1
Personal Loan                    
0              0.703649  0.296351
1              0.693380  0.306620

Online                0         1
Personal Loan                    
0              0.412459  0.587541
1              0.390244  0.609756



1. P(CreditCard = 1|Loan = 1) = 0.306620
2. P(Online = 1|Loan = 1) = 0.609756
3. P(Loan = 1) = 0.095667
4. P(CC = 1|Loan = 0) = 0.296351
5. P(Online = 1|Loan = 0) = 0.587541
6. P(Loan = 0) = 0.904333
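
These six values can also be recovered by hand from the Question 1 counts, which is a useful sanity check. The sketch below assumes those counts; `counts` and `count_where` are hypothetical names, not part of the original notebook.

```python
# (CreditCard, Personal Loan, Online) -> count, from the Question 1 pivot table
counts = {
    (0, 0, 0): 792, (0, 0, 1): 1117,
    (0, 1, 0): 73,  (0, 1, 1): 126,
    (1, 0, 0): 327, (1, 0, 1): 477,
    (1, 1, 0): 39,  (1, 1, 1): 49,
}
n_total = sum(counts.values())

def count_where(cc=None, loan=None, online=None):
    """Number of records matching the given values (None means 'any value')."""
    return sum(k for (c, l, o), k in counts.items()
               if cc in (None, c) and loan in (None, l) and online in (None, o))

for loan in (1, 0):
    n_loan = count_where(loan=loan)
    print(f'P(Loan = {loan}) = {n_loan / n_total:.3f}')
    print(f'P(CC = 1 | Loan = {loan}) = {count_where(cc=1, loan=loan) / n_loan:.3f}')
    print(f'P(Online = 1 | Loan = {loan}) = {count_where(online=1, loan=loan) / n_loan:.3f}')
```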

## Question 4 (2 points)
Compute the naive Bayes probability P(Loan=1|CC=1,Online=1). Note: Use the quantities that you computed in the previous question. Refer to the naive Bayes formula (8.3) in the book.

```
P(Loan=1|CC=1,Online=1) = 
   P(Loan=1) * P(CC=1|Loan=1) * P(Online=1|Loan=1) / 
   [P(Loan=1) * [P(CC=1|Loan=1) * P(Online=1|Loan=1)] + 
    P(Loan=0) * [P(CC=1|Loan=0) * P(Online=1|Loan=0)]]
```

In [7]:
# P(Loan = 1) * P(CC = 1 | Loan = 1) * P(Online = 1 | Loan = 1)
p1 = 0.095667 * 0.306620 * 0.609756
# P(Loan = 0) * P(CC = 1 | Loan = 0) * P(Online = 1 | Loan = 0)
p2 = 0.904333 * 0.296351 * 0.587541

print('Naive Bayes probability P(Loan = 1|CC = 1, Online = 1) = ', p1 / (p1 + p2))

Naive Bayes probability P(Loan = 1|CC = 1, Online = 1) =  0.1020046248320646
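
The calculation above generalizes to any number of classes and predictors: multiply each class prior by its conditional probabilities, then normalize. A minimal sketch follows, assuming the unrounded ratios implied by the Question 1 counts (for example 88/287 instead of 0.306620); the function name `naive_bayes_posterior` is hypothetical.

```python
def naive_bayes_posterior(priors, likelihoods):
    """Return {class: posterior}.

    priors:      {class: P(class)}
    likelihoods: {class: [P(x_j | class) for each observed predictor value]}
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for p in likelihoods[cls]:
            score *= p          # naive assumption: predictors multiply within a class
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Ratios derived from the Question 1 counts (287 acceptors, 2713 non-acceptors)
priors = {1: 287 / 3000, 0: 2713 / 3000}
likelihoods = {
    1: [88 / 287, 175 / 287],      # P(CC = 1 | Loan = 1), P(Online = 1 | Loan = 1)
    0: [804 / 2713, 1594 / 2713],  # P(CC = 1 | Loan = 0), P(Online = 1 | Loan = 0)
}
posterior = naive_bayes_posterior(priors, likelihoods)
print(round(posterior[1], 4))
```

Working from unrounded ratios avoids the small rounding error that comes from typing six-decimal values by hand.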


## Question 5
Of the two values that you computed earlier (computed in Q2 and Q4), which is a more accurate estimate of P(Loan=1|CC=1,Online=1)?

The value obtained from the crossed pivot table (Q2, 0.0932) is the more accurate estimate, since it does not make the simplifying assumption that CreditCard and Online are independent of each other within each loan class (conditional independence). Using the exact counts is feasible in this case because there are only two binary predictors, so every combination of values has ample data.
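
To see where the two estimates diverge, one can compare the observed joint proportion P(CC = 1, Online = 1 | Loan) with the product that naive Bayes substitutes for it. The sketch below assumes the Question 1 counts; all variable names are illustrative.

```python
# Per-class counts derived from the Question 1 pivot table
n_loan = {1: 287, 0: 2713}      # records per class
n_cc = {1: 88, 0: 804}          # CC = 1 within each class
n_online = {1: 175, 0: 1594}    # Online = 1 within each class
n_both = {1: 49, 0: 477}        # CC = 1 and Online = 1 within each class

joint = {}
product = {}
for loan in (1, 0):
    # Observed joint proportion within the class
    joint[loan] = n_both[loan] / n_loan[loan]
    # What naive Bayes uses instead: product of the two marginals
    product[loan] = (n_cc[loan] / n_loan[loan]) * (n_online[loan] / n_loan[loan])
    print(f'Loan = {loan}: joint = {joint[loan]:.4f}, product = {product[loan]:.4f}')
```

For acceptors the product is larger than the observed joint proportion, and for non-acceptors it is slightly smaller, which is why the naive Bayes estimate comes out higher than the count-based one.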

## Question 6
In Python, run naive Bayes on the training data. Use data points that match the condition <em>CreditCard=1,Online=1</em> to find the predicted probability for P(Loan=1|CC=1,Online=1).

Change the types of variables to categories and use one-hot encoding for the independent variables.

In [8]:
train_df = pd.get_dummies(train_df, prefix_sep='_')
train_df['Personal Loan'] = train_df['Personal Loan'].astype('category')
train_df.head()

      Personal Loan  Online_0  Online_1  CreditCard_0  CreditCard_1
4522              0         1         0             1             0
2851              0         0         1             1             0
2313              0         0         1             0             1
982               0         1         0             0             1
1164              1         0         1             1             0


In [9]:
predictors = ['Online_0', 'Online_1', 'CreditCard_0', 'CreditCard_1']
nb = MultinomialNB(alpha=0.01)
nb.fit(train_df[predictors], train_df['Personal Loan'])

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

Predict probabilities and check the probability of class "1" in the rows where Online = 1 and CreditCard = 1.

In [10]:
# Pass the predictor columns explicitly, in the same order used for fitting
predProb = nb.predict_proba(train_df[predictors])
predicted = pd.concat([train_df, pd.DataFrame(predProb, index=train_df.index)], axis=1)

matches = (predicted.Online_1 == 1) & (predicted.CreditCard_1 == 1)
predicted[matches].head()

      Personal Loan  Online_0  Online_1  CreditCard_0  CreditCard_1         0         1
2313              0         0         1             0             1  0.897993  0.102007
1918              1         0         1             0             1  0.897993  0.102007
4506              0         0         1             0             1  0.897993  0.102007
586               0         0         1             0             1  0.897993  0.102007
3591              0         0         1             0             1  0.897993  0.102007


This gives `P(Loan=1|Online=1,CC=1) = 0.1020`, which agrees with the hand-computed naive Bayes value from Question 4.
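
The tiny difference between the model's 0.102007 and the hand calculation comes from smoothing. The sketch below tries to reproduce the model's number in plain Python, assuming `MultinomialNB` estimates each feature probability as `(N_cj + alpha) / (N_c + alpha * n_features)`, where `N_c` is the sum of all feature counts in class `c`, and uses empirical class frequencies as priors. The per-class column sums are derived from the Question 1 counts; all names are illustrative.

```python
alpha = 0.01
n_features = 4  # Online_0, Online_1, CreditCard_0, CreditCard_1

# Number of training records per class
class_counts = {0: 2713, 1: 287}

# Per-class column sums of the one-hot features, derived from the Question 1 counts
feature_counts = {
    0: {'Online_0': 1119, 'Online_1': 1594, 'CreditCard_0': 1909, 'CreditCard_1': 804},
    1: {'Online_0': 112, 'Online_1': 175, 'CreditCard_0': 199, 'CreditCard_1': 88},
}

def smoothed(cls, feature):
    """Smoothed multinomial estimate of P(feature | cls) (assumed formula)."""
    total = sum(feature_counts[cls].values())
    return (feature_counts[cls][feature] + alpha) / (total + alpha * n_features)

n_total = sum(class_counts.values())
scores = {
    cls: (class_counts[cls] / n_total)
         * smoothed(cls, 'Online_1') * smoothed(cls, 'CreditCard_1')
    for cls in class_counts
}
posterior_1 = scores[1] / sum(scores.values())
print(round(posterior_1, 6))
```

Each one-hot row contains exactly two ones, so every multinomial feature probability is about half of the corresponding conditional probability. That common factor appears in both classes and cancels on normalization, leaving the ordinary naive Bayes posterior apart from the small effect of `alpha`.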