# Naive Bayes Classifier

#### Import libraries

In [1]:
import os
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from dmba import classificationSummary

# Set SEED
SEED = 1

# Problem 8.1 Personal Loan Acceptance

The file _UniversalBank.csv_ contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign. In this exercise, we focus on two predictors: Online (whether or not the customer is an active user of online banking services) and Credit Card (abbreviated CC below; whether or not the customer holds a credit card issued by the bank), and the outcome Personal Loan (abbreviated Loan below).

Partition the data into training (60%) and validation (40%) sets.

**Create a dataframe for the `UniversalBank.csv` data**

Only include the following columns:

- Online
- CreditCard
- Personal_Loan
  
Display the dataframe info and first 5 rows of data.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [2]:
cols = ['Online', 'CreditCard', 'Personal_Loan']
bank_df = pd.read_csv(os.path.join('..', 'data', 'UniversalBank.csv'))
bank_df.columns = [c.replace(' ','_') for c in bank_df.columns]
bank_df = bank_df[cols]
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Online         5000 non-null   int64
 1   CreditCard     5000 non-null   int64
 2   Personal_Loan  5000 non-null   int64
dtypes: int64(3)
memory usage: 117.3 KB


In [3]:
bank_df.head()

,Online,CreditCard,Personal_Loan
0,0,0,0
1,0,0,0
2,0,0,0
3,0,0,0
4,0,1,0


**Split dataset into training and validation sets.**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [4]:
train_df, valid_df = train_test_split(bank_df, test_size=0.4, random_state=SEED)
print('Training Set:', train_df.shape, 'Validation Set:', valid_df.shape)

Training Set: (3000, 3) Validation Set: (2000, 3)


### 8.1.a.

Create a table for the training data with CreditCard and Online as row variables, and Personal Loan as a column variable. The values inside the table should convey the count.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [5]:
ct = pd.crosstab(index=[train_df['CreditCard'], train_df['Online']], 
                 columns=train_df['Personal_Loan'])
ct 

CreditCard,Online,Personal_Loan=0,Personal_Loan=1
0,0,792,73
0,1,1117,126
1,0,327,39
1,1,477,49


### 8.1.b.

Consider the task of classifying a customer who owns a bank credit card and is actively using online banking services. Looking at the table, what is the probability that this customer will accept the loan offer? (This is the probability of loan acceptance (Loan = 1) conditional on having a bank credit card (CC = 1) and being an active user of online banking services (Online = 1)).

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [6]:
p1 = ct.loc[(1, 1), 1]
p2 = ct.loc[(1, 1), 0]

print('Count based probability P(Loan = 1|CC = 1, Online = 1) = ', p1 / (p2 + p1))

Count based probability P(Loan = 1|CC = 1, Online = 1) =  0.09315589353612168


### 8.1.c.

Create two separate tables for the training data. One will have Loan (rows) as a function of Online (columns) and the other will have Loan (rows) as a function of CC. Also, show the percentages of customers that accepted the personal loan.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [7]:
lp = train_df['Personal_Loan'].value_counts(normalize=True)
lp0 = lp[0]
print(f'P(Loan = 0) = {lp0}')
lp1 = lp[1]
print(f'P(Loan = 1) = {lp1}')

P(Loan = 0) = 0.9043333333333333
P(Loan = 1) = 0.09566666666666666


In [8]:
cc_n = pd.crosstab(index=train_df['CreditCard'], columns=train_df['Personal_Loan'], 
                   normalize='columns')
cc_n

CreditCard,Personal_Loan=0,Personal_Loan=1
0,0.703649,0.69338
1,0.296351,0.30662


In [9]:
ol_n = pd.crosstab(index=train_df['Online'], columns=train_df['Personal_Loan'], 
                   normalize='columns')
ol_n

Online,Personal_Loan=0,Personal_Loan=1
0,0.412459,0.390244
1,0.587541,0.609756


### 8.1.d.

Compute the following quantities [P(A | B) means “the probability of A given B”]:

<ul>
<li>i. P(CC = 1 | Loan = 1) (the proportion of credit card holders among the loan acceptors)</li>
<li>ii. P(Online = 1|Loan = 1)</li>
<li>iii. P(Loan = 1) = the proportion of loan acceptors</li>
<li>iv. P(CC = 1|Loan = 0)</li>
<li>v.  P(Online = 1|Loan = 0)</li>
<li>vi. P(Loan = 0)</li>
</ul>

<ul>
    <li>i. P(CC = 1|Loan = 1) = 0.306620</li>
    <li>ii. P(Online = 1|Loan = 1) = 0.609756</li>
    <li>iii. P(Loan = 1) = 0.095667</li> 
    <li>iv. P(CC = 1|Loan = 0) = 0.296351</li> 
    <li>v. P(Online = 1|Loan = 0) = 0.587541</li> 
    <li>vi. P(Loan = 0) = 0.904333</li>
</ul>
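
As a cross-check, the six quantities can also be derived directly from the joint counts in the table from 8.1.a. The following is a minimal sketch using only the standard library; the dictionary `counts` and the helper names `class_total` and `cond_prob` are illustrative assumptions, not part of the original notebook.

```python
# Joint training counts copied from the table in 8.1.a:
# (CreditCard, Online) -> (count with Loan = 0, count with Loan = 1)
counts = {
    (0, 0): (792, 73),
    (0, 1): (1117, 126),
    (1, 0): (327, 39),
    (1, 1): (477, 49),
}


def class_total(loan):
    """Number of training records with the given Loan value."""
    return sum(cell[loan] for cell in counts.values())


def cond_prob(position, loan):
    """P(predictor = 1 | Loan = loan); position 0 is CC, position 1 is Online."""
    hits = sum(cell[loan] for key, cell in counts.items() if key[position] == 1)
    return hits / class_total(loan)


n_records = class_total(0) + class_total(1)

quantities = {
    'P(CC = 1 | Loan = 1)': cond_prob(0, 1),
    'P(Online = 1 | Loan = 1)': cond_prob(1, 1),
    'P(Loan = 1)': class_total(1) / n_records,
    'P(CC = 1 | Loan = 0)': cond_prob(0, 0),
    'P(Online = 1 | Loan = 0)': cond_prob(1, 0),
    'P(Loan = 0)': class_total(0) / n_records,
}

for name, value in quantities.items():
    print(f'{name} = {value:.6f}')
```

The printed values should agree with the normalized crosstabs in 8.1.c to six decimals, since both are computed from the same 3000 training records.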

### 8.1.e.

Use the quantities computed above to compute the naive Bayes probability P(Loan = 1 | CC = 1, Online = 1).

```
P(Loan=1|CC=1,Online=1) = 
   P(Loan=1) * P(CC=1|Loan=1) * P(Online=1|Loan=1) / 
   [P(Loan=1) * [P(CC=1|Loan=1) * P(Online=1|Loan=1)] + 
    P(Loan=0) * [P(CC=1|Loan=0) * P(Online=1|Loan=0)]]
```

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [10]:
# P(Loan = 1) * P(CC = 1 | Loan = 1) * P(Online = 1 | Loan = 1)
# p1 = 0.095667 * 0.306620 * 0.609756
p1 = lp1 * cc_n.loc[1, 1] * ol_n.loc[1, 1]
# P(Loan = 0) * P(CC = 1 | Loan = 0) * P(Online = 1 | Loan = 0)
# p2 = 0.904333 * 0.296351 * 0.587541
p2 = lp0 * cc_n.loc[1, 0] * ol_n.loc[1, 0]

print('Naive Bayes probability P(Loan = 1|CC = 1, Online = 1) = ', p1 / (p1 + p2))

Naive Bayes probability P(Loan = 1|CC = 1, Online = 1) =  0.10200430617247219
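
The same calculation can be written as a small general-purpose function. This is a sketch under stated assumptions: the function name `naive_bayes_posterior` and its argument layout are hypothetical, and the inputs are the rounded values listed in 8.1.d rather than the full-precision pandas results.

```python
def naive_bayes_posterior(priors, likelihoods, target):
    """Posterior probability of class `target` under the naive Bayes assumption.

    priors:      {class: P(class)}
    likelihoods: {class: [P(x_1 | class), P(x_2 | class), ...]}
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for p in likelihoods[cls]:
            score *= p  # multiply in each conditional probability
        scores[cls] = score
    # Normalize so the class scores sum to 1
    return scores[target] / sum(scores.values())


# Rounded inputs taken from 8.1.d
priors = {1: 0.095667, 0: 0.904333}
likelihoods = {
    1: [0.306620, 0.609756],  # P(CC = 1 | Loan = 1), P(Online = 1 | Loan = 1)
    0: [0.296351, 0.587541],  # P(CC = 1 | Loan = 0), P(Online = 1 | Loan = 0)
}

posterior = naive_bayes_posterior(priors, likelihoods, target=1)
print(f'P(Loan = 1 | CC = 1, Online = 1) = {posterior:.4f}')  # prints 0.1020
```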


### 8.1.f.

Compare this value with the one obtained from the pivot table in (b). Which is a more accurate estimate?

The value obtained from the crossed table in (b), 0.0932, is the more accurate estimate because it is computed directly from the training records with CC = 1 and Online = 1. The naive Bayes estimate from (e), 0.1020, relies on the simplifying assumption that CC and Online are conditionally independent given the loan outcome. The exact calculation is feasible in this case because there are few variables and few categories to consider, and thus there are ample data for all possible combinations.
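
One way to see where the two estimates diverge is to compare, within each loan class, the observed share of customers with both CC = 1 and Online = 1 against the product of the two separate conditional probabilities. The sketch below uses the counts from the table in 8.1.a; the variable names are illustrative assumptions.

```python
# (CreditCard, Online) -> (count with Loan = 0, count with Loan = 1), from 8.1.a
counts = {
    (0, 0): (792, 73),
    (0, 1): (1117, 126),
    (1, 0): (327, 39),
    (1, 1): (477, 49),
}

joint = {}    # observed P(CC = 1, Online = 1 | Loan)
product = {}  # P(CC = 1 | Loan) * P(Online = 1 | Loan), as naive Bayes assumes

for loan in (0, 1):
    n_class = sum(cell[loan] for cell in counts.values())
    p_cc = sum(cell[loan] for (cc, online), cell in counts.items() if cc == 1) / n_class
    p_online = sum(cell[loan] for (cc, online), cell in counts.items() if online == 1) / n_class
    joint[loan] = counts[(1, 1)][loan] / n_class
    product[loan] = p_cc * p_online
    print(f'Loan = {loan}: joint = {joint[loan]:.4f}, product = {product[loan]:.4f}')
```

For Loan = 1 the product (about 0.187) overstates the observed joint share (about 0.171), while for Loan = 0 it slightly understates it (about 0.174 versus 0.176). Both deviations push the naive Bayes estimate above the count-based one.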

### 8.1.g.

Which of the entries in this table are needed for computing P(Loan = 1 | CC = 1, Online = 1)? In Python, run naive Bayes on the data. Examine the model output on training data, and find the entry that corresponds to P(Loan = 1 | CC = 1, Online = 1). Compare this to the number you obtained in (e).

In Python, run naive Bayes on the training data. Use data points that match the condition <em>CreditCard=1,Online=1</em> to find the predicted probability for P(Loan=1|CC=1,Online=1).

Change the types of variables to categories and use one-hot-encoding for the independent variables.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [11]:
train_df[cols[:2]] = train_df[cols[:2]].astype('category')
train_df2 = pd.get_dummies(train_df, prefix_sep='_')
train_df2['Personal_Loan'] = train_df2['Personal_Loan'].astype('category')
train_df2.head()

,Personal_Loan,Online_0,Online_1,CreditCard_0,CreditCard_1
4522,0,True,False,True,False
2851,0,False,True,True,False
2313,0,False,True,False,True
982,0,True,False,False,True
1164,1,False,True,True,False


In [12]:
predictors = ['Online_0', 'Online_1', 'CreditCard_0', 'CreditCard_1']
nb = MultinomialNB(alpha=0.01)
nb.fit(train_df2[predictors], train_df2['Personal_Loan'])

Predict probabilities and check for the probability of "1" in the row where Online = 1 and CreditCard = 1

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [13]:
predProb = nb.predict_proba(train_df2[predictors])
predicted = pd.concat([train_df2, pd.DataFrame(predProb, index=train_df2.index)], axis=1)

matches = (predicted.Online_1 == 1) & (predicted.CreditCard_1 == 1)
predicted[matches].head()

,Personal_Loan,Online_0,Online_1,CreditCard_0,CreditCard_1,0,1
2313,0,False,True,False,True,0.897993,0.102007
1918,1,False,True,False,True,0.897993,0.102007
4506,0,False,True,False,True,0.897993,0.102007
586,0,False,True,False,True,0.897993,0.102007
3591,0,False,True,False,True,0.897993,0.102007


This gives `P(Loan=1|Online=1,CC=1) = 0.1020`
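
The model value (0.102007) differs from the hand calculation in (e) (0.102004) only in the sixth decimal. A plausible explanation is the smoothing parameter `alpha=0.01`. The sketch below reproduces the model value by hand, assuming the additive smoothing formula documented for scikit-learn's `MultinomialNB`, (N_ci + alpha) / (N_c + alpha * n_features), where N_c is the total feature count in class c. The counts come from the table in 8.1.a, and all variable names are illustrative.

```python
ALPHA = 0.01
N_FEATURES = 4  # Online_0, Online_1, CreditCard_0, CreditCard_1

# Class sizes and per-class counts of Online = 1 and CreditCard = 1 (from 8.1.a)
class_counts = {0: 2713, 1: 287}
online_1 = {0: 1117 + 477, 1: 126 + 49}
cc_1 = {0: 327 + 477, 1: 39 + 49}


def smoothed_score(loan):
    """Prior times smoothed feature probabilities for a row with Online_1 = CreditCard_1 = 1."""
    # Each record sets exactly two dummy columns to 1, so the total
    # feature count within a class is twice the class size.
    denom = 2 * class_counts[loan] + ALPHA * N_FEATURES
    prior = class_counts[loan] / sum(class_counts.values())
    p_online = (online_1[loan] + ALPHA) / denom
    p_cc = (cc_1[loan] + ALPHA) / denom
    return prior * p_online * p_cc


scores = {loan: smoothed_score(loan) for loan in (0, 1)}
smoothed = scores[1] / (scores[0] + scores[1])
print(f'Smoothed P(Loan = 1 | CC = 1, Online = 1) = {smoothed:.6f}')
```

With `ALPHA = 0` the factor-of-two denominators cancel between the classes and the expression reduces to the unsmoothed formula used in (e).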

# Problem 8.2 Automobile Accidents

The file _accidentsFull.csv_ contains information on 42,183 actual automobile accidents in 2001 in the United States that involved one of three levels of injury: NO INJURY, INJURY, or FATALITY. For each accident, additional information is recorded, such as day of week, weather conditions, and road type. A firm might be interested in developing a system for quickly classifying the severity of an accident based on initial reports and associated data in the system (some of which rely on GPS-assisted reporting).

**Create a dataframe for the `accidentsFull.csv` data**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [14]:
accidents_df = pd.read_csv(os.path.join('..', 'data', 'accidentsFull.csv'))
accidents_df.head()

,HOUR_I_R,ALCHL_I,ALIGN_I,STRATUM_R,WRK_ZONE,WKDY_I_R,INT_HWY,LGTCON_I_R,MANCOL_I_R,PED_ACC_R,...,SUR_COND,TRAF_CON_R,TRAF_WAY,VEH_INVL,WEATHER_R,INJURY_CRASH,NO_INJ_I,PRPTYDMG_CRASH,FATALITIES,MAX_SEV_IR
0,0,2,2,1,0,1,0,3,0,0,...,4,0,3,1,1,1,1,0,0,1
1,1,2,1,0,0,1,1,3,2,0,...,4,0,3,2,2,0,0,1,0,0
2,1,2,1,0,0,1,0,3,2,0,...,4,1,2,2,2,0,0,1,0,0
3,1,2,1,1,0,0,0,3,2,0,...,4,1,2,2,1,0,0,1,0,0
4,1,1,1,0,0,1,0,3,2,0,...,4,0,2,3,1,0,0,1,0,0


Our goal here is to predict whether an accident just reported will involve an injury (MAX_SEV_IR = 1 or 2) or will not (MAX_SEV_IR = 0). 

**For this purpose, create a dummy variable called INJURY that takes the value “yes” if MAX_SEV_IR = 1 or 2, and otherwise “no.”**

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [15]:
accidents_df['INJURY'] = np.where(accidents_df['MAX_SEV_IR']>0, 'yes', 'no')
accidents_df['INJURY'].value_counts()

INJURY
yes    21462
no     20721
Name: count, dtype: int64

### 8.2.a.

Using the information in this dataset, if an accident has just been reported and no further information is available, what should the prediction be? (INJURY = Yes or No?) Why?

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [16]:
accidents_df['INJURY'].value_counts(normalize=True) 

INJURY
yes    0.508783
no     0.491217
Name: proportion, dtype: float64

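About 50.9% of the recorded accidents involve an injury, so with no further information the prediction should be INJURY = yes, the majority class.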
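
With no predictor information, the naive rule simply predicts the most frequent class. A minimal sketch using the class counts printed above (the variable names are illustrative assumptions):

```python
from collections import Counter

# Class counts taken from the value_counts() output above
injury_counts = Counter({'yes': 21462, 'no': 20721})

total = sum(injury_counts.values())
prediction, n_majority = injury_counts.most_common(1)[0]
share = n_majority / total

print(f'Naive rule prediction: INJURY = {prediction} (share = {share:.4f})')
```

Because the two classes are nearly balanced, this rule is right only slightly more often than a coin flip, which gives a low baseline for any classifier built on the predictors.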

### 8.2.c.

Let us now return to the entire dataset. Partition the data into training (60%) and validation (40%).

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [17]:
# predictors and outcome
predictors = ['HOUR_I_R', 'ALIGN_I', 'WRK_ZONE', 'WKDY_I_R', 'INT_HWY', 'LGTCON_I_R', 'PROFIL_I_R', 'SPD_LIM',
              'SUR_COND', 'TRAF_CON_R', 'TRAF_WAY', 'WEATHER_R']
outcome = 'INJURY'
# get dummies
X = pd.get_dummies(accidents_df[predictors])
y = accidents_df['INJURY'].astype('category')
classes = list(y.cat.categories)
# partition the data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.40, random_state=SEED)

print('Training Set:', X_train.shape, 'Validation Set:', X_valid.shape)

Training Set: (25309, 12) Validation Set: (16874, 12)


### 8.2.c.i

Assuming that no information or initial reports about the accident itself are available at the time of prediction (only location characteristics, weather conditions, etc.), which predictors can we include in the analysis? (Use the data descriptions page from www.dataminingbook.com.)

None of the following predictors is specific to the accident itself. They describe either calendar time or road and environmental conditions:
HOUR_I_R, ALIGN_I, WRK_ZONE, WKDY_I_R, INT_HWY, LGTCON_I_R, PROFIL_I_R, SPD_LIM, SUR_COND, TRAF_CON_R, TRAF_WAY and WEATHER_R.

### 8.2.c.ii.

Run a naive Bayes classifier on the complete training set with the relevant predictors (and INJURY as the response). Note that all predictors are categorical. Show the confusion matrix.

<h4 style="color:blue"> Write Your Code Below: </h4>

<h3 style="color:teal"> Expected Output: </h3>

In [18]:
# fit the model
accidents_nb = MultinomialNB(alpha=0.01)
accidents_nb.fit(X_train, y_train)
# predict probabilities for training and validation sets
predProb_train = accidents_nb.predict_proba(X_train)
predProb_valid = accidents_nb.predict_proba(X_valid)
# predict class memberships for training and validation data
y_train_pred = accidents_nb.predict(X_train)
y_valid_pred = accidents_nb.predict(X_valid)

In [19]:
# confusion matrix
# training
print('training data\n')
classificationSummary(y_train, y_train_pred, class_names=classes)
# validation 
print('\nvalidation data\n')
classificationSummary(y_valid, y_valid_pred, class_names=classes)

training data

Confusion Matrix (Accuracy 0.5291)

       Prediction
Actual   no  yes
    no 4197 8195
   yes 3724 9193

validation data

Confusion Matrix (Accuracy 0.5288)

       Prediction
Actual   no  yes
    no 2838 5491
   yes 2460 6085
