# Bernoulli vs Multinomial random variables?
This Notebook compares bernoulli and multinomial Naive Bayes classifiers in terms of their prediction accuracy when it comes to text-classification.
The MultinomialNB algorithm implements my own binomial-prediction algorithm, which improves the prediction accuracy as we shall see...

An SMS spam dataset is going to be used for classification & prediction.

In [64]:
import numpy as np
import pandas as pd
from BernoulliNB import BernoulliNB
from MultinomialNB import MultinomialNB
# compare custom algorithm to sklearn's
from sklearn.naive_bayes import MultinomialNB as MultinomialSkl
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

## Load data

In [4]:
df = pd.read_csv('spam.csv', encoding='ISO-8859-1')
df = df.loc[:, ['v1', 'v2']].sample(frac=1)
df['v1'] = df['v1'].map({'ham': 0, 'spam': 1})

In [5]:
df.head()

Unnamed: 0,v1,v2
1225,0,"sir, you will receive the account no another 1..."
5036,0,(You didn't hear it from me)
3030,0,gonna let me know cos comes bak from holiday ...
4385,0,", im .. On the snowboarding trip. I was wonder..."
2655,0,Great! I have to run now so ttyl!


## Create train-test splits

In [6]:
X = []
Y = []
for _, row in df.iterrows():
    X.append(row['v2'])
    Y.append(row['v1'])
Y = np.array(Y)

In [7]:
total_samples = len(df)

train_size = 2/3
num_train = int(len(X)*train_size)

X_train, X_test = X[:num_train], X[num_train:]
Y_train, Y_test = Y[:num_train], Y[num_train:]

## Train classifiers

In [75]:
NB_Bernoulli = BernoulliNB()
NB_Multinomial = MultinomialNB()
NB_Multinomial_Skl = MultinomialSkl()

NB_Bernoulli.train(X_train, Y_train, text_data=True)
NB_Multinomial.train(X_train, Y_train, text_data=True)

# transform data & fit sklearn classifier
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = NB_Multinomial_Skl.fit(X_train_tfidf, Y_train)

## Test custom classifiers

In [16]:
n_tests = len(X_test)

bern_correct = 0
multi_correct = 0

label = ["ok", "spam"]

for i in range(n_tests):
    y = Y[i]
    
    y_hat_bern, _ = NB_Bernoulli.predict(X[i])
    y_hat_multi = NB_Multinomial.predict(X[i])
    
    if y_hat_bern == y:
        bern_correct += 1
    if y_hat_multi == y:
        multi_correct += 1

In [17]:
print("Bernoulli accuracy:", bern_correct / n_tests)
print("Multinomial accuracy:", multi_correct / n_tests)

Bernoulli accuracy: 0.9978471474703983
Multinomial accuracy: 0.9698600645855759


Both assumptions of bernoulli- and multinomial random variables yield very accurate prediction results. The BernoulliNB classifiers outperforms the MultinomialNB classifier by a slight margin.

Now lets switch on binomial-prediction and change the smoothing parameter, which defaults to 1 (Laplace-Smoothing) and see if this affects the prediction accuracy of the MultinomialNB classifier.

In [76]:
bern_correct = 0
multi_correct = 0

for i in range(n_tests):
    y = Y[i]
    
    y_hat_bern, _ = NB_Bernoulli.predict(X[i])
    y_hat_multi = NB_Multinomial.predict(X[i], smoothing=1, binomial=True)
    
    if y_hat_bern == y:
        bern_correct += 1
    #else:
    #    print("Bernoulli error: classified", label[y_hat_bern], "instead of", label[y])
    #    print("Text:", X[i], "\n")
    if y_hat_multi == y:
        multi_correct += 1
    #else:
    #    print("Mulitnomial error: classified", label[y_hat_multi], "instead of", label[y])
    #    print("Text:", X[i], "\n")

In [77]:
print("Bernoulli accuracy:", bern_correct / n_tests)
print("Multinomial accuracy:", multi_correct / n_tests)

Bernoulli accuracy: 0.9978471474703983
Multinomial accuracy: 0.9741657696447793


Binomial prediction indeed increases the performance of the MultinomialNB classifier! 

Now lets tune the smoothing parameter and compare the accuracy of the custom algorithm with binomial-prediction to the sklearn algorithm.

In [78]:
bern_correct = 0
multi_correct = 0
multi_skl_correct = 0

label = ["ok", "spam"]

# sklearn prediction
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

y_hat_skl = clf.predict(X_test_tfidf)

for i in range(n_tests):
    y = Y[i]
    
    y_hat_bern, _ = NB_Bernoulli.predict(X[i])
    y_hat_multi = NB_Multinomial.predict(X[i], smoothing=0.1, binomial=True)
    
    if y_hat_bern == y:
        bern_correct += 1
    else:
        print("Bernoulli error: classified", label[y_hat_bern], "instead of", label[y])
        print("Text:", X[i], "\n")
    if y_hat_multi == y:
        multi_correct += 1
    else:
        print("Mulitnomial error: classified", label[y_hat_multi], "instead of", label[y])
        print("Text:", X[i], "\n")
    
    if y_hat_skl[i] == y:
        multi_skl_correct += 1

Mulitnomial error: classified spam instead of ok
Text: Have you laid your airtel line to rest? 

Bernoulli error: classified spam instead of ok
Text: Customer place i will call you. 

Bernoulli error: classified spam instead of ok
Text: Have you heard from this week? 

Mulitnomial error: classified spam instead of ok
Text: Unlimited texts. Limited minutes. 

Bernoulli error: classified spam instead of ok
Text: Are you free now?can i call now? 

Bernoulli error: classified spam instead of ok
Text: what is your account number? 



In [80]:
print("Bernoulli accuracy:", bern_correct / n_tests)
print("Multinomial accuracy:", multi_correct / n_tests)
print("Multinomial Sklearn accuracy:", multi_skl_correct / n_tests)

Bernoulli accuracy: 0.9978471474703983
Multinomial accuracy: 0.9989235737351991
Multinomial Sklearn accuracy: 0.7976318622174381


As the smoothing parameter approaches 0, both classifiers achieve about the same accuracy. This is as one can see in the misclassified examples probably close to bayes error-rate. One can see that there are texts which the BernoulliNB classifier misclassifies, but the MultinomialNB doesn't and vice versa.