# Bernoulli vs Multinomial random variables?
This Notebook compares bernoulli and multinomial Naive Bayes classifiers in terms of their prediction accuracy when it comes to text-classification.
The MultinomialNB algorithm implements my own binomial-prediction algorithm, which improves the prediction accuracy as we shall see...

An SMS spam dataset is going to be used for classification & prediction.

In [113]:
import numpy as np
import pandas as pd
from BernoulliNB import BernoulliNB
from MultinomialNB import MultinomialNB

## Load data

In [114]:
df = pd.read_csv('spam.csv', encoding='ISO-8859-1')
df = df.loc[:, ['v1', 'v2']].sample(frac=1)
df['v1'] = df['v1'].map({'ham': 0, 'spam': 1})

In [115]:
df.head()

Unnamed: 0,v1,v2
1795,0,I hope your alright babe? I worry that you mig...
277,0,"Awesome, I'll see you in a bit"
2926,0,Ok... U enjoy ur shows...
901,0,How is it possible to teach you. And where.
2075,0,Must come later.. I normally bathe him in da a...


## Create train-test splits

In [116]:
X = []
Y = []
for _, row in df.iterrows():
    X.append(row['v2'])
    Y.append(row['v1'])
Y = np.array(Y)

In [117]:
total_samples = len(df)

train_size = 2/3
num_train = int(len(X)*train_size)

X_train, X_test = X[:num_train], X[num_train:]
Y_train, Y_test = Y[:num_train], Y[num_train:]

## Train both classifiers

In [118]:
NB_Bernoulli = BernoulliNB()
NB_Multinomial = MultinomialNB()

NB_Bernoulli.train(X_train, Y_train, text_data=True)
NB_Multinomial.train(X_train, Y_train, text_data=True)

## Test both classifiers

In [139]:
n_tests = len(X_test)

bern_correct = 0
multi_correct = 0

label = ["ok", "spam"]

for i in range(n_tests):
    y = Y[i]
    
    y_hat_bern, _ = NB_Bernoulli.predict(X[i])
    y_hat_multi = NB_Multinomial.predict(X[i])
    
    if y_hat_bern == y:
        bern_correct += 1
    else:
        print("Bernoulli error: classified", label[y_hat_bern], "instead of", label[y])
        print("Text:", X[i], "\n")
    if y_hat_multi == y:
        multi_correct += 1
    else:
        print("Mulitnomial error: classified", label[y_hat_multi], "instead of", label[y])
        print("Text:", X[i], "\n")

Mulitnomial error: classified spam instead of ok
Text: Tiwary to rcb.battle between bang and kochi. 

Mulitnomial error: classified spam instead of ok
Text: Gibbs unsold.mike hussey 

Mulitnomial error: classified ok instead of spam
Text: Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1 

Mulitnomial error: classified spam instead of ok
Text: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune 

Mulitnomial error: classified spam instead of ok
Text: Hi Shanil,Rakhesh here.thanks,i have exchanged the uncut diamond stuff.leaving back. Excellent service by Dino and Prem. 

Bernoulli error: classified ok instead of spam
Text: Hello darling how are you today? I would love to have a chat, why dont you tell me what you look like and what you are in to sexy? 

Mulitnomial error: class

In [140]:
print("Bernoulli accuracy:", bern_correct / n_tests)
print("Multinomial accuracy:", multi_correct / n_tests)

Bernoulli accuracy: 0.9978471474703983
Multinomial accuracy: 0.9790096878363832


Both assumptions of bernoulli- and multinomial random variables yield very accurate prediction results. The BernoulliNB classifiers outperforms the MultinomialNB classifier by a slight margin.

Now lets switch on binomial-prediction and change the smoothing parameter, which defaults to 1 (Laplace-Smoothing) and see if this affects the prediction accuracy of the MultinomialNB classifier.

In [141]:
bern_correct = 0
multi_correct = 0

for i in range(n_tests):
    y = Y[i]
    
    y_hat_bern, _ = NB_Bernoulli.predict(X[i])
    y_hat_multi = NB_Multinomial.predict(X[i], smoothing=1, binomial=True)
    
    if y_hat_bern == y:
        bern_correct += 1
    else:
        print("Bernoulli error: classified", label[y_hat_bern], "instead of", label[y])
        print("Text:", X[i], "\n")
    if y_hat_multi == y:
        multi_correct += 1
    else:
        print("Mulitnomial error: classified", label[y_hat_multi], "instead of", label[y])
        print("Text:", X[i], "\n")

Mulitnomial error: classified spam instead of ok
Text: Tiwary to rcb.battle between bang and kochi. 

Mulitnomial error: classified spam instead of ok
Text: Gibbs unsold.mike hussey 

Mulitnomial error: classified ok instead of spam
Text: Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1 

Mulitnomial error: classified spam instead of ok
Text: Hi Shanil,Rakhesh here.thanks,i have exchanged the uncut diamond stuff.leaving back. Excellent service by Dino and Prem. 

Bernoulli error: classified ok instead of spam
Text: Hello darling how are you today? I would love to have a chat, why dont you tell me what you look like and what you are in to sexy? 

Mulitnomial error: classified ok instead of spam
Text: Hello darling how are you today? I would love to have a chat, why dont you tell me what you look like and what you are in to sexy? 

Mulitnomial error: classified spam instead of ok
Text:

In [124]:
print("Bernoulli accuracy:", bern_correct / n_tests)
print("Multinomial accuracy:", multi_correct / n_tests)

Bernoulli accuracy: 0.9978471474703983
Multinomial accuracy: 0.9827771797631862


Binomial prediction indeed increases the performance of the MultinomialNB classifier! Now lets tune the smoothing parameter and show the misclassified samples.

In [138]:
bern_correct = 0
multi_correct = 0

label = ["ok", "spam"]

for i in range(n_tests):
    y = Y[i]
    
    y_hat_bern, _ = NB_Bernoulli.predict(X[i])
    y_hat_multi = NB_Multinomial.predict(X[i], smoothing=0.1, binomial=True)
    
    if y_hat_bern == y:
        bern_correct += 1
    else:
        print("Bernoulli error: classified", label[y_hat_bern], "instead of", label[y])
        print("Text:", X[i], "\n")
    if y_hat_multi == y:
        multi_correct += 1
    else:
        print("Mulitnomial error: classified", label[y_hat_multi], "instead of", label[y])
        print("Text:", X[i], "\n")

Mulitnomial error: classified ok instead of spam
Text: Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1 

Bernoulli error: classified ok instead of spam
Text: Hello darling how are you today? I would love to have a chat, why dont you tell me what you look like and what you are in to sexy? 

Mulitnomial error: classified ok instead of spam
Text: Hello darling how are you today? I would love to have a chat, why dont you tell me what you look like and what you are in to sexy? 

Mulitnomial error: classified ok instead of spam
Text: How come it takes so little time for a child who is afraid of the dark to become a teenager who wants to stay out all night? 

Mulitnomial error: classified spam instead of ok
Text: Nokia phone is lovly.. 

Bernoulli error: classified spam instead of ok
Text: Customer place i will call you. 

Bernoulli error: classified spam instead of ok
Text: Yavnt tried ye

In [136]:
print("Bernoulli accuracy:", bern_correct / n_tests)
print("Multinomial accuracy:", multi_correct / n_tests)

Bernoulli accuracy: 0.9978471474703983
Multinomial accuracy: 0.9967707212055974


As the smoothing parameter approaches 0, both classifiers achieve the same accuracy. This is, as one can see in the misclassified examples, probably close to bayes error-rate. One can even see that there are texts which the BernoulliNB classifier misclassifies, but the MultinomialNB doesn't! This suggests that the Multinomial algorithm is not strictly underperforming.