# Naive Bayes Classifier - Personal Loan Acceptance

This notebook is a solution to Problem 8.1 in Chapter 8 of the following book.

Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python, First Edition.

Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel

© 2020 John Wiley & Sons, Inc. Published 2020 by John Wiley & Sons, Inc.

##  Chapter 8, Problem 8.1

Personal Loan Acceptance. The file UniversalBank.csv contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (=9.6%) accepted the personal loan that was offered to them in the earlier campaign. In this exercise, we focus on two predictors: Online (whether or not the customer is an active user of online banking services) and Credit Card (abbreviated CC below) (does the customer hold a credit card issued by the bank), and the outcome Personal Loan (abbreviated Loan below). Partition the data into training (60%) and validation (40%) sets.

a. Create a pivot table for the training data with Online as a column variable, CC as a row variable, and Loan as a secondary row variable. The values inside the table should convey the count. Use the pandas dataframe methods melt() and pivot().

b. Consider the task of classifying a customer who owns a bank credit card and is actively using online banking services. Looking at the pivot table, what is the probability that this customer will accept the loan offer? [This is the probability of loan acceptance (Loan = 1) conditional on having a bank credit card (CC = 1) and being an active user of online banking services (Online = 1).]

c. Create two separate pivot tables for the training data. One will have Loan (rows) as a function of Online (columns) and the other will have Loan (rows) as a function of CC.

d. Compute the following quantities [P(A | B) means “the probability of A given B”]: 

    i. P(CC = 1 | Loan = 1) (the proportion of credit card holders among the loan acceptors)

    ii. P(Online = 1 | Loan = 1) 

    iii. P(Loan = 1) (the proportion of loan acceptors)

    iv. P(CC = 1 | Loan = 0)

    v. P(Online = 1 | Loan = 0)

    vi. P(Loan = 0)

e. Use the quantities computed above to compute the naive Bayes probability P(Loan = 1 | CC = 1, Online = 1).

f. Compare this value with the one obtained from the pivot table in (b). Which is a more accurate estimate?

g. Which of the entries in this table are needed for computing P(Loan = 1 | CC = 1, Online = 1)? In Python, run naive Bayes on the data. Examine the model output on training data, and find the entry that corresponds to P(Loan = 1 | CC = 1, Online = 1). Compare this to the number you obtained in (e).

## Importing Libraries

In [89]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

Printing versions of libraries

In [90]:
print('pandas version: {}'.format(pd.__version__))
print('sklearn version: {}'.format(sklearn.__version__))

pandas version: 1.5.3
sklearn version: 1.2.1


## Loading Dataset

In [91]:
df = pd.read_csv('UniversalBank.csv')
df.head()

,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


Converting to categorical

In this exercise we are concerned with only three variables: Online, CreditCard, and Personal Loan. We therefore convert these three variables to categorical.

In [92]:
df.Online = df.Online.astype('category')
df.CreditCard = df.CreditCard.astype('category')
df['Personal Loan'] = df['Personal Loan'].astype('category')

## Partitioning Data

Partitioning the data into training (60%) and validation (40%) data.

In [93]:
# split the data frame into training (60%) and validation (40%) sets with a fixed random_state
train_df, valid_df = train_test_split(df[['CreditCard', 'Personal Loan', 'Online']], test_size=0.4, random_state = 1)
display(train_df.head())

,CreditCard,Personal Loan,Online
4522,0,0,0
2851,0,0,1
2313,1,0,1
982,1,0,0
1164,0,1,1


## Create a pivot table for the training data

##### a. Creating a pivot table for the training data with Online as a column variable, CC as a row variable, and Loan as a secondary row variable. The values inside the table will convey the count. Using the pandas dataframe methods melt() and pivot().

In [94]:
# Melt Online into long form: the new 'Online' column holds the variable name
# and 'Count' holds each customer's 0/1 Online flag
train_df_melted = train_df.melt(id_vars=['CreditCard', 'Personal Loan'], var_name='Online', value_name='Count')
display(train_df_melted)

# Pivot the melted dataframe; summing the 0/1 flags gives the number of
# Online = 1 customers for each (CreditCard, Personal Loan) combination
train_df_pivoted = train_df_melted.pivot_table(index=['CreditCard', 'Personal Loan'], columns='Online', values='Count', aggfunc='sum')
# Display the pivot table
display(train_df_pivoted)

,CreditCard,Personal Loan,Online,Count
0,0,0,Online,0
1,0,0,Online,1
2,1,0,Online,1
3,1,0,Online,0
4,0,1,Online,1
...,...,...,...,...
2995,0,0,Online,1
2996,0,0,Online,1
2997,1,0,Online,1
2998,0,0,Online,1


CreditCard,Personal Loan,Online = 1 (count)
0,0,1117
0,1,126
1,0,477
1,1,49
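
Because the melted `Count` column holds the 0/1 Online flag, summing it gives only the number of Online = 1 customers in each row, so the table above has a single column. A minimal sketch of a table with separate Online = 0 and Online = 1 columns follows. It swaps in `pd.crosstab` for `melt()` and `pivot_table()`, and it runs on a small made-up frame (`toy` and its values are hypothetical, not the UniversalBank data).

```python
import pandas as pd

# Hypothetical stand-in for train_df; the values are invented for illustration
toy = pd.DataFrame({
    'CreditCard':    [0, 0, 1, 1, 1, 0, 1, 0],
    'Personal Loan': [0, 1, 0, 1, 1, 0, 0, 0],
    'Online':        [1, 0, 1, 1, 0, 0, 1, 1],
})

# Rows: CreditCard, then Personal Loan; columns: Online; cells: customer counts
counts = pd.crosstab(index=[toy['CreditCard'], toy['Personal Loan']],
                     columns=toy['Online'])
print(counts)
```

With both Online columns present, the probability asked for in (b) can be read off a single column: the Loan = 1 cell divided by the sum of the two CC = 1 cells in the Online = 1 column.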


##### b. Considering the task of classifying a customer who owns a bank credit card and is actively using online banking services. Looking at the pivot table created above, calculating the probability that this customer will accept the loan offer. This is the probability of loan acceptance (Loan = 1) conditional on having a bank credit card (CC = 1) and being an active user of online banking services (Online = 1).

In [95]:
# Cross-check the pivot table above by counting the matching rows directly.
total_count = len(train_df[(train_df['CreditCard'] == 1) & (train_df['Online'] == 1)].index)
print(total_count)
personal_loan_n_count = len(train_df[(train_df['CreditCard'] == 1) & (train_df['Online'] == 1) & (train_df['Personal Loan'] == 0)].index)
print(personal_loan_n_count)
personal_loan_y_count = len(train_df[(train_df['CreditCard'] == 1) & (train_df['Online'] == 1) & (train_df['Personal Loan'] == 1)].index)
print(personal_loan_y_count)

# Probability that the customer described above accepts the loan offer:
# probability = (number of favourable outcomes) / (total number of possible outcomes).
# The favourable count is the numerator and the total count is the denominator.
numerator = 49 # CC = 1, Loan = 1 entry of the pivot table above
denominator = 49 + 477 # both CC = 1 entries of the pivot table above (Loan = 1 plus Loan = 0)
probability = numerator/denominator
print('Probability:', probability)

526
477
49
Probability: 0.09315589353612168


Therefore, the probability that this customer will accept the loan offer is 49/526 ≈ 0.093.

##### c. Creating two separate pivot tables for the training data. One will have Loan (rows) as a function of Online (columns) and the other will have Loan (rows) as a function of CC.

In [96]:
# Loan (rows) by Online: melt Online into long form, then sum the 0/1 flags held in 'Count'
train_df_melted_1 = train_df[['Personal Loan', 'Online']].melt(id_vars=['Personal Loan'], var_name='Online', value_name='Count')
train_df_pivot_1 = train_df_melted_1.pivot_table(index=['Personal Loan'], columns='Online', values='Count', aggfunc='sum')
display(train_df_pivot_1.head())

# Cross-check the pivot table by counting the Online = 1 rows directly
temp_df = train_df[['Personal Loan', 'Online']]
print(len(temp_df[(temp_df['Personal Loan'] == 0) & (temp_df['Online'] == 1)].index))
print(len(temp_df[(temp_df['Personal Loan'] == 1) & (temp_df['Online'] == 1)].index))

# Loan (rows) by CreditCard: same approach, summing the 0/1 CreditCard flags
train_df_melted_2 = train_df[['Personal Loan', 'CreditCard']].melt(id_vars=['Personal Loan'], var_name='CreditCard', value_name='Count')
train_df_pivot_2 = train_df_melted_2.pivot_table(index=['Personal Loan'], columns='CreditCard', values='Count', aggfunc='sum')
display(train_df_pivot_2.head())

# Cross-check the pivot table by counting the CreditCard = 1 rows directly
temp_df = train_df[['Personal Loan', 'CreditCard']]
print(len(temp_df[(temp_df['Personal Loan'] == 0) & (temp_df['CreditCard'] == 1)].index))
print(len(temp_df[(temp_df['Personal Loan'] == 1) & (temp_df['CreditCard'] == 1)].index))


Personal Loan,Online = 1 (count)
0,1594
1,175


1594
175


Personal Loan,CreditCard = 1 (count)
0,804
1,88


804
88
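
Dividing these counts by the number of customers in each Loan class gives the conditional proportions of each predictor within that class. As a sketch, `pd.crosstab` with `normalize='index'` performs that division in one step and returns P(Online | Loan) for every cell. The `toy` frame below is hypothetical data, used only to keep the example runnable.

```python
import pandas as pd

# Hypothetical stand-in for train_df[['Personal Loan', 'Online']]
toy = pd.DataFrame({
    'Personal Loan': [0, 0, 0, 0, 1, 1, 1, 0],
    'Online':        [1, 0, 1, 1, 1, 0, 1, 0],
})

# Raw counts: Loan in rows, Online in columns
counts = pd.crosstab(toy['Personal Loan'], toy['Online'])

# Row-normalised proportions: each row sums to 1, so cell (l, o) is P(Online = o | Loan = l)
props = pd.crosstab(toy['Personal Loan'], toy['Online'], normalize='index')

print(counts)
print(props)
```

The same call with `toy['CreditCard']` as the column argument would give P(CC | Loan), assuming such a column exists in the frame.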


##### d. Computing the following quantities. [P(A | B) means “the probability of A given B”]:

##### i. P(CC = 1 | Loan = 1) (the proportion of credit card holders among the loan acceptors)

#####     ii. P(Online = 1 | Loan = 1)

#####     iii. P(Loan = 1) (the proportion of loan acceptors)

#####     iv. P(CC = 1 | Loan = 0)

#####     v. P(Online = 1 | Loan = 0)

#####     vi. P(Loan = 0)

In [97]:
#P(CC = 1 | Loan = 1) (the proportion of credit card holders among the loan acceptors)
prob_CC_1_given_Loan_1 = len(train_df[(train_df['Personal Loan'] == 1) & (train_df['CreditCard'] == 1)].index)/len(train_df[(train_df['Personal Loan'] == 1)].index)
print('P(CC = 1 | Loan = 1) (the proportion of credit card holders among the loan acceptors):', prob_CC_1_given_Loan_1)

P(CC = 1 | Loan = 1) (the proportion of credit card holders among the loan acceptors): 0.30662020905923343


In [98]:
#P(Online = 1 | Loan = 1)
prob_Online_1_given_Loan_1 = len(train_df[(train_df['Personal Loan'] == 1) & (train_df['Online'] == 1)].index)/len(train_df[(train_df['Personal Loan'] == 1)].index)
print('P(Online = 1 | Loan = 1):', prob_Online_1_given_Loan_1)

P(Online = 1 | Loan = 1): 0.6097560975609756


In [99]:
#P(Loan = 1) (the proportion of loan acceptors)
prob_Loan_1 = len(train_df[(train_df['Personal Loan'] == 1)].index)/len(train_df.index)
print('P(Loan = 1) (the proportion of loan acceptors):', prob_Loan_1)

P(Loan = 1) (the proportion of loan acceptors): 0.09566666666666666


In [100]:
#P(CC = 1 | Loan = 0)
prob_CC_1_given_Loan_0 = len(train_df[(train_df['Personal Loan'] == 0) & (train_df['CreditCard'] == 1)].index)/len(train_df[(train_df['Personal Loan'] == 0)].index)
print('P(CC = 1 | Loan = 0):', prob_CC_1_given_Loan_0)

P(CC = 1 | Loan = 0): 0.2963509030593439


In [101]:
#P(Online = 1 | Loan = 0)
prob_Online_1_given_Loan_0 = len(train_df[(train_df['Personal Loan'] == 0) & (train_df['Online'] == 1)].index)/len(train_df[(train_df['Personal Loan'] == 0)].index)
print('P(Online = 1 | Loan = 0):', prob_Online_1_given_Loan_0)

P(Online = 1 | Loan = 0): 0.5875414670106893


In [102]:
#P(Loan = 0)
prob_Loan_0 = len(train_df[(train_df['Personal Loan'] == 0)].index)/len(train_df.index)
print('P(Loan = 0):', prob_Loan_0)

P(Loan = 0): 0.9043333333333333


##### e. Using the quantities computed above to compute the naive Bayes probability P(Loan = 1 | CC = 1, Online = 1).

The naive Bayes formula for this probability is:

$$
P(\text{Loan}=1 \mid \text{CC}=1, \text{Online}=1) = \frac{P(\text{CC}=1 \mid \text{Loan}=1)\,P(\text{Online}=1 \mid \text{Loan}=1)\,P(\text{Loan}=1)}{P(\text{CC}=1 \mid \text{Loan}=1)\,P(\text{Online}=1 \mid \text{Loan}=1)\,P(\text{Loan}=1) + P(\text{CC}=1 \mid \text{Loan}=0)\,P(\text{Online}=1 \mid \text{Loan}=0)\,P(\text{Loan}=0)}
$$


In [104]:
numerator = (prob_CC_1_given_Loan_1 * prob_Online_1_given_Loan_1 * prob_Loan_1)
denominator = (prob_CC_1_given_Loan_1 * prob_Online_1_given_Loan_1 * prob_Loan_1) + (prob_CC_1_given_Loan_0 * prob_Online_1_given_Loan_0 * prob_Loan_0)
p_Loan_1_given_CC_1_Online_1 = numerator/denominator
print('P(Loan = 1 | CC = 1, Online = 1): ', p_Loan_1_given_CC_1_Online_1)

P(Loan = 1 | CC = 1, Online = 1):  0.10200430617247218
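
The calculation above can be wrapped in a small function that works for any number of classes and predictors. This is a minimal sketch: `naive_bayes_posterior` is a hypothetical helper, not part of the notebook, and the class totals 287 and 2713 are inferred from P(Loan = 1) = 0.0957 on 3000 training rows. The other counts come from the pivot tables in (c).

```python
def naive_bayes_posterior(priors, likelihoods, target):
    """Return P(target | evidence) under the naive Bayes assumption.

    priors:      {class: P(class)}
    likelihoods: {class: [P(observed feature value | class), ...]}
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for p in likelihoods[cls]:
            score *= p          # multiply in each conditional probability
        scores[cls] = score
    # Normalise so the scores over all classes sum to 1
    return scores[target] / sum(scores.values())


# Class totals (inferred): 287 acceptors and 2713 non-acceptors out of 3000
priors = {1: 287 / 3000, 0: 2713 / 3000}
likelihoods = {
    1: [88 / 287, 175 / 287],      # P(CC = 1 | Loan = 1), P(Online = 1 | Loan = 1)
    0: [804 / 2713, 1594 / 2713],  # P(CC = 1 | Loan = 0), P(Online = 1 | Loan = 0)
}

posterior = naive_bayes_posterior(priors, likelihoods, target=1)
print(posterior)
```

The dictionary layout is only one possible design; it keeps the prior and the per-class likelihoods visibly separate, which mirrors the numerator and denominator terms of the formula.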


##### f. Compare this value with the one obtained from the pivot table in (b). Which is a more accurate estimate?

The value obtained from the pivot table in (b), 0.093, is the more accurate estimate. It is the exact conditional proportion observed in the training data, whereas the naive Bayes value (0.102) relies on the assumption that CC and Online are independent of each other given the class label. Because we do not know how well that assumption holds, we treat the naive Bayes estimate as an approximation of the pivot-table value.
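
One way to see where the gap comes from is to check the independence assumption against the training counts. The sketch below compares the observed joint proportion P(CC = 1, Online = 1 | Loan) with the product of the two marginal proportions that naive Bayes uses in its place. The counts are taken from the pivot tables in (a) and (c); the class totals 287 and 2713 are inferred from P(Loan = 1) and the 3000 training rows rather than printed in the notebook.

```python
# Training counts within each Loan class (keys: Loan = 1, Loan = 0)
n_loan = {1: 287, 0: 2713}      # class totals (inferred, see note above)
n_cc = {1: 88, 0: 804}          # customers with CC = 1
n_online = {1: 175, 0: 1594}    # customers with Online = 1
n_both = {1: 49, 0: 477}        # customers with CC = 1 and Online = 1

observed = {}  # joint proportion seen in the data
assumed = {}   # product of marginals, as naive Bayes assumes
for loan in (1, 0):
    observed[loan] = n_both[loan] / n_loan[loan]
    assumed[loan] = (n_cc[loan] / n_loan[loan]) * (n_online[loan] / n_loan[loan])
    print(f"Loan = {loan}: observed {observed[loan]:.4f}, product {assumed[loan]:.4f}")
```

For Loan = 1 the product overstates the observed joint proportion, and for Loan = 0 it slightly understates it. Both effects push the naive Bayes estimate above the pivot-table value.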

##### g. Which of the entries in this table are needed for computing P(Loan = 1 | CC = 1, Online = 1)? In Python, run naive Bayes on the data. Examine the model output on training data, and find the entry that corresponds to P(Loan = 1 | CC = 1, Online = 1). Compare this to the number you obtained in (e).

Of the three pivot tables above, the following entries are needed for computing P(Loan = 1 | CC = 1, Online = 1). The remaining quantities, such as the class totals behind P(Loan = 1) and P(Loan = 0), have to be calculated separately.

    i. Third table (Loan by CC), value in the second row and second column (88)
    ii. Second table (Loan by Online), value in the second row and second column (175)
    iii. Third table (Loan by CC), value in the first row and second column (804)
    iv. Second table (Loan by Online), value in the first row and second column (1594)

In [105]:
X = pd.get_dummies(df[['CreditCard', 'Online']])
y = df['Personal Loan']
classes = list(y.cat.categories)

Splitting the dataset into training and validation sets

In [106]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.40, random_state=1)

Running naive Bayes 

In [107]:
nb = MultinomialNB(alpha=0.01)
nb.fit(X_train, y_train)

Predicting probabilities 

In [108]:
predProb_train = nb.predict_proba(X_train)

Predicting class membership

In [109]:
y_train_pred = nb.predict(X_train)

In [110]:
# classify a specific customer by searching in the dataset
# for a customer with the same predictor values
temp_df = pd.concat([pd.DataFrame({'actual': y_train, 'predicted': y_train_pred}),
    pd.DataFrame(predProb_train, index=y_train.index)], axis=1)
 

mask = ((X_train.CreditCard_1 == 1) & (X_train.Online_1 == 1))
        
temp_df[mask].head(1)

Unnamed: 0,actual,predicted,0,1
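2313,0,0,0.897993,0.102007

The column labelled 1 is the model's P(Loan = 1 | CC = 1, Online = 1) = 0.102007. It agrees with the 0.102004 computed in (e) to five decimal places; the small remaining difference is consistent with the smoothing introduced by alpha=0.01.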
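
Instead of searching the training set for a matching row, the fitted model can also be queried with a hand-built one-row frame. The sketch below does this on a small made-up data set (`toy` is hypothetical, not the UniversalBank data) and compares the result with a hand calculation in the style of part (e). It assumes the one-hot column order produced by `pd.get_dummies` and uses a very small `alpha` so that smoothing has almost no effect.

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Hypothetical data with the same three columns as train_df
toy = pd.DataFrame({
    'CreditCard':    [0, 0, 1, 1, 1, 0, 1, 0, 1, 0],
    'Online':        [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    'Personal Loan': [0, 1, 0, 1, 1, 0, 0, 0, 0, 1],
})

# One-hot encode: CreditCard_0, CreditCard_1, Online_0, Online_1
X = pd.get_dummies(toy[['CreditCard', 'Online']],
                   columns=['CreditCard', 'Online'], dtype=int)
y = toy['Personal Loan']

nb = MultinomialNB(alpha=1e-6)
nb.fit(X, y)

# One-row query for CC = 1 and Online = 1, columns in training order
query = pd.DataFrame([[0, 1, 0, 1]], columns=X.columns)
loan_col = list(nb.classes_).index(1)   # position of class Loan = 1
model_prob = nb.predict_proba(query)[0, loan_col]

# Hand calculation: prior times the two conditional proportions, per class
acc = toy[toy['Personal Loan'] == 1]
rej = toy[toy['Personal Loan'] == 0]
num = acc['CreditCard'].mean() * acc['Online'].mean() * len(acc) / len(toy)
den = num + rej['CreditCard'].mean() * rej['Online'].mean() * len(rej) / len(toy)
manual_prob = num / den

print(model_prob, manual_prob)
```

Looking up `nb.classes_` rather than hard-coding column 1 keeps the example correct even if the class labels are ordered differently.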
