## Multinomial and Bernoulli Naive Bayes

To understand Multinomial and Bernoulli Naive Bayes, we will start with a small example and walk through the end-to-end process. In another notebook, we will build a full-fledged email spam classifier.

To start with, let's take a set of movie reviews and classify each one into one of two classes: positive (*Pos*) or negative (*Neg*). Each review represents one document. In real-world cases, a document can be any piece of text such as an email, a news article, a book review, a tweet, etc.
The analysis and the algorithm involved don't depend on the type of document we use.

The notebook is divided into the following sections:
1. Importing and preprocessing data
2. Building the model: Multinomial Naive Bayes
3. Building the model: Bernoulli Naive Bayes

### 1. Importing and Preprocessing Data

Let us first look at the reviews and their classes. The training reviews are in the file movie_review_train.csv, and the test reviews are in the file movie_review_test.csv.

In [None]:
import numpy as np
import pandas as pd
import sklearn

# training data
train_docs = pd.read_csv('/content/sample_data/popularity.csv')
train_docs

In [None]:
import numpy as np
import pandas as pd
import sklearn

# training data
train_docs = pd.read_csv('/content/sample_data/movie_review_train.csv')
train_docs

Unnamed: 0,class,text
0,Pos,a common complaint amongst film critics is ...
1,Pos,whew this film oozes energy the kind of b...
2,Pos,steven spielberg s amistad which is bas...
3,Pos,he has spent his entire life in an awful litt...
4,Pos,being that it is a foreign language film with...
...,...,...
1595,Neg,if anything stigmata should be taken as...
1596,Neg,john boorman s zardoz is a goofy cinemati...
1597,Neg,the kids in the hall are an acquired taste ...
1598,Neg,there was a time when john carpenter was a gr...


As you can see, there are 1,600 documents (reviews), each labelled either "Pos" or "Neg".

In [None]:
# convert label to a numerical variable
train_docs['Label'] = train_docs['class'].map({'Neg':0, 'Pos':1})
train_docs

Unnamed: 0,class,text,Label
0,Pos,a common complaint amongst film critics is ...,1
1,Pos,whew this film oozes energy the kind of b...,1
2,Pos,steven spielberg s amistad which is bas...,1
3,Pos,he has spent his entire life in an awful litt...,1
4,Pos,being that it is a foreign language film with...,1
...,...,...,...
1595,Neg,if anything stigmata should be taken as...,0
1596,Neg,john boorman s zardoz is a goofy cinemati...,0
1597,Neg,the kids in the hall are an acquired taste ...,0
1598,Neg,there was a time when john carpenter was a gr...,0


Let's now split the dataframe into the features X (the review text) and the labels y.

In [None]:
# select the text and the numeric label by column name;
# positional indexing on train_docs.values would pick the
# 'class' and 'text' columns instead
X_train = train_docs['text'].values
y_train = train_docs['Label'].values
y_train = y_train.astype('int') # sklearn needs y as integers

print("X_train")
print(X_train)
print("y_train")
print(y_train)

X_train
[' a common complaint amongst film critics is   why aren t there more literate scripts available      quiz show gives signs of hope that the art of writing isn t dead in hollywood and that we need not only look to independent films for thoughtful content    paul attanasio s script takes what could have been a tepid thriller   the quiz show scandals of the late 50s   and delivers a telling parable about the emptiness of the post war american dream and the golden bubble that surrounds and protects tv networks and their sponsors    the film is riddled with telling symbols   e   g    a  58 chrysler   a radio announcement of sputnik   but is never heavy handed    deft direction by robert redford and keen performances by ralph fiennes   john turturro and rob morrow dovetail perfectly with the carefully honed script    redford departs from the usually overlight     cable tv quality   sets and camera work so common in recent 20th century period pieces    quiz show perfectly captures th

### Creating the Bag of Words Representation

We now have to convert the data into a format which can be used for training the model. We'll use the **bag of words representation** for each sentence (document).

Imagine breaking every document in X into individual words and putting them all in a bag. We then pick the unique words from the bag one by one and build a dictionary (vocabulary) of unique words.

This is called **vectorization of words**. scikit-learn provides the class ```CountVectorizer()``` to vectorize the words.


In [None]:
# create an object of CountVectorizer() class
from sklearn.feature_extraction.text import CountVectorizer
# help(CountVectorizer)

In [None]:
vec = CountVectorizer()

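Here ```vec``` is an object of class ```CountVectorizer()```. Its ```fit()``` method learns the vocabulary of a corpus of documents, mapping each unique token to a column index. The ```transform()``` method, used later, converts documents into a matrix of token counts.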

In [None]:
# fit the vectorizer on training data
vec.fit(X_train)
vec.vocabulary_


{'common': 6284,
 'complaint': 6360,
 'film': 11832,
 'critics': 7378,
 'aren': 1810,
 'literate': 18603,
 'scripts': 27918,
 'available': 2319,
 'quiz': 25250,
 'gives': 13262,
 'signs': 28809,
 'hope': 15074,
 'art': 1928,
 'writing': 35521,
 'isn': 16771,
 'dead': 7904,
 'hollywood': 14963,
 'need': 21281,
 'look': 18770,
 'independent': 15988,
 'films': 11851,
 'thoughtful': 32060,
 'content': 6771,
 'paul': 23045,
 'attanasio': 2173,
 'script': 27912,
 'takes': 31486,
 'tepid': 31839,
 'thriller': 32090,
 'scandals': 27629,
 'late': 18017,
 '50s': 313,
 'delivers': 8215,
 'telling': 31765,
 'parable': 22815,
 'emptiness': 10340,
 'post': 24178,
 'war': 34707,
 'american': 1334,
 'dream': 9554,
 'golden': 13449,
 'bubble': 4285,
 'surrounds': 31156,
 'protects': 24849,
 'tv': 33056,
 'networks': 21378,
 'sponsors': 29948,
 'riddled': 26743,
 'symbols': 31370,
 '58': 328,
 'chrysler': 5614,
 'radio': 25314,
 'announcement': 1534,
 'sputnik': 30053,
 'heavy': 14532,
 'handed': 14185,

```CountVectorizer()``` has converted the documents into a set of unique words, sorted alphabetically and indexed.


**Stop Words**

The vocabulary contains trivial words such as 'and', 'is', 'of', etc. These words don't really make any difference in classifying a document. They are called **stop words**, and we would like to get rid of them.

We can remove them by passing the parameter stop_words='english' while instantiating ```CountVectorizer()```. In the cell below we also pass min_df=.03 and max_df=.8, which drop words that appear in fewer than 3% or in more than 80% of the documents:

In [None]:
# fitting the vectorizer on training data again
# removing the stop words this time
vec = CountVectorizer(stop_words='english',min_df=.03,max_df = .8)
vec.fit(X_train)
vec.vocabulary_

{'common': 264,
 'critics': 323,
 'aren': 78,
 'available': 101,
 'gives': 618,
 'hope': 693,
 'art': 81,
 'writing': 1632,
 'isn': 753,
 'dead': 342,
 'hollywood': 690,
 'need': 970,
 'look': 853,
 'films': 549,
 'content': 287,
 'paul': 1037,
 'script': 1248,
 'takes': 1429,
 'thriller': 1471,
 'late': 805,
 'delivers': 357,
 'telling': 1449,
 'post': 1089,
 'war': 1571,
 'american': 59,
 'dream': 413,
 'tv': 1512,
 'radio': 1145,
 'heavy': 673,
 'direction': 386,
 'robert': 1205,
 'performances': 1043,
 'john': 766,
 'rob': 1204,
 'perfectly': 1041,
 'usually': 1537,
 'quality': 1136,
 'sets': 1272,
 'camera': 192,
 'work': 1618,
 'recent': 1165,
 'century': 213,
 'period': 1044,
 'pieces': 1056,
 'years': 1638,
 'old': 1001,
 'images': 716,
 'true': 1502,
 'era': 464,
 'generation': 605,
 'gone': 625,
 '15': 4,
 'world': 1623,
 'themes': 1461,
 'good': 626,
 'life': 831,
 'family': 508,
 'match': 894,
 'father': 521,
 'fame': 506,
 'audience': 99,
 'appear': 72,
 'familiar': 507,
 

Notice that the vocabulary has shrunk from more than 35,000 words to 1,643. Another way of printing the 'vocabulary' is as follows:

In [None]:
# printing feature names

len(vec.vocabulary_.keys())
print(vec.get_feature_names_out())
print(len(vec.get_feature_names_out()))

['000' '10' '100' ... 'york' 'young' 'younger']
1643


So our final dictionary is made of 1,643 words (after discarding the stop words and the very rare and very common words). Now, to do classification, we need to represent all the documents with these words (or tokens) as features.

Every document will be converted into a *feature vector* recording how many times each of these words occurs in that document. Let's convert each of our training documents into a feature vector.

In [None]:
# another way of representing the features
X_transformed = vec.transform(X_train)
X_transformed

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 217396 stored elements and shape (1600, 1643)>

You can see X_transformed is a 1600 x 1643 **sparse matrix**. It has one row for each of our 1,600 documents and one column for each of the 1,643 words of the dictionary which we just created. Let us print X_transformed.

In [None]:
print(X_transformed)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 217396 stored elements and shape (1600, 1643)>
  Coords	Values
  (0, 4)	1
  (0, 59)	2
  (0, 72)	1
  (0, 78)	1
  (0, 81)	1
  (0, 99)	3
  (0, 101)	1
  (0, 192)	1
  (0, 211)	1
  (0, 213)	1
  (0, 220)	1
  (0, 264)	2
  (0, 287)	1
  (0, 316)	1
  (0, 323)	1
  (0, 328)	1
  (0, 340)	1
  (0, 342)	1
  (0, 357)	1
  (0, 386)	1
  (0, 413)	2
  (0, 445)	2
  (0, 464)	1
  (0, 503)	1
  (0, 506)	1
  :	:
  (1599, 1247)	1
  (1599, 1258)	2
  (1599, 1267)	1
  (1599, 1271)	1
  (1599, 1331)	1
  (1599, 1335)	1
  (1599, 1339)	1
  (1599, 1366)	3
  (1599, 1371)	1
  (1599, 1375)	1
  (1599, 1379)	1
  (1599, 1413)	1
  (1599, 1421)	1
  (1599, 1429)	2
  (1599, 1447)	1
  (1599, 1501)	1
  (1599, 1533)	1
  (1599, 1550)	1
  (1599, 1555)	1
  (1599, 1556)	1
  (1599, 1570)	1
  (1599, 1579)	2
  (1599, 1589)	1
  (1599, 1609)	1
  (1599, 1616)	1


This representation can be understood as follows:

Consider the first four rows of the output: (0, 4), (0, 59), (0, 72) and (0, 78). They say that the first document (index 0) contains the words at vocabulary indices 4 ('15'), 59 ('american'), 72 ('appear') and 78 ('aren'). The right-hand column gives the count: 'american' appears twice, and the other three words appear once each.

Similarly, consider the entry (1599, 1429) near the bottom. It says that the last document (index 1599) contains the word at index 1429 ('takes') twice.

In real problems, you often work with large documents and vocabularies, and each document contains only a few of the words in the vocabulary. So it would be a waste of space to store the document-term matrix as a dense array or a typical dataframe, since most entries would be zero. Operations such as matrix products and additions are also usually much faster on sparse matrices when most entries are zero. That's why we use sparse matrices to store the data.
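
A rough way to see the saving is to compare the bytes used by a mostly-zero dense array with the bytes used by its CSR version. The sketch below uses a randomly filled matrix of arbitrary, assumed size (1000 x 2000 with at most 5000 non-zero entries); the exact numbers depend on these choices.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical mostly-zero count matrix
dense = np.zeros((1000, 2000), dtype=np.int64)
rows = rng.integers(0, 1000, size=5000)
cols = rng.integers(0, 2000, size=5000)
dense[rows, cols] = 1

csr = sparse.csr_matrix(dense)

# CSR stores only the non-zero values plus two index arrays
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print("dense bytes :", dense.nbytes)
print("sparse bytes:", sparse_bytes)
```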


Let us convert this sparse matrix into a more easily interpretable array:

In [None]:
# converting transformed matrix back to an array
# note the high number of zeros
X_transformed.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

To make the dataset more readable, let us examine the vocabulary and the document-term matrix together in a pandas dataframe. The way to convert a matrix into a dataframe is ```pd.DataFrame(matrix, columns=columns)```.


In [None]:
# converting matrix to dataframe
pd.DataFrame(X_transformed.toarray(),
             columns=vec.get_feature_names_out())

Unnamed: 0,000,10,100,13,15,1995,1996,1997,1998,1999,...,written,wrong,wrote,yeah,year,years,yes,york,young,younger
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,3,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1595,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,1,0,0,0
1596,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
1597,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1598,0,0,0,0,0,2,1,0,0,0,...,1,1,0,0,0,0,0,0,0,0


This table shows how many times a particular word occurs in each document. In other words, this is a frequency table of the words.

A corpus of documents can thus be represented by a matrix with one row per document and one column per
token (e.g. word) occurring in the corpus.

Let's now import and transform the test data as well.

In [None]:
# test data
test_docs = pd.read_csv('/content/sample_data/movie_review_test.csv')
test_docs

Unnamed: 0,class,text
0,Pos,films adapted from comic books have had plent...
1,Pos,every now and then a movie comes along from a...
2,Pos,you ve got mail works alot better than it des...
3,Pos,jaws is a rare film that grabs your atte...
4,Pos,moviemaking is a lot like being the general m...
...,...,...
395,Neg,one of the first films of 1999 is this mtv pi...
396,Neg,susan granger s review of america s sweethe...
397,Neg,susan granger s review of jeepers creepers ...
398,Neg,this independent film written and directed by...


In [None]:
# convert label to a numerical variable
test_docs['Label_Test'] = test_docs['class'].map({'Pos':1, 'Neg':0})
test_docs

Unnamed: 0,class,text,Label_Test
0,Pos,films adapted from comic books have had plent...,1
1,Pos,every now and then a movie comes along from a...,1
2,Pos,you ve got mail works alot better than it des...,1
3,Pos,jaws is a rare film that grabs your atte...,1
4,Pos,moviemaking is a lot like being the general m...,1
...,...,...,...
395,Neg,one of the first films of 1999 is this mtv pi...,0
396,Neg,susan granger s review of america s sweethe...,0
397,Neg,susan granger s review of jeepers creepers ...,0
398,Neg,this independent film written and directed by...,0


In [None]:
# select the text and the numeric label by column name;
# positional indexing on test_docs.values would pick the
# 'class' and 'text' columns instead
X_test = test_docs['text'].values
y_test = test_docs['Label_Test'].values

# print("X_test")
# print(X_test)
# print("y_test")
print(y_test)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [None]:
# transform the test data
# note that you *never* fit on test data, only on training data
# and only transform the test data
X_test_transformed = vec.transform(X_test)
print(X_test_transformed)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 51663 stored elements and shape (400, 1643)>
  Coords	Values
  (0, 11)	1
  (0, 18)	2
  (0, 23)	1
  (0, 29)	1
  (0, 31)	1
  (0, 53)	1
  (0, 73)	1
  (0, 94)	1
  (0, 110)	1
  (0, 118)	1
  (0, 133)	1
  (0, 134)	1
  (0, 140)	1
  (0, 141)	1
  (0, 151)	3
  (0, 152)	1
  (0, 173)	1
  (0, 176)	1
  (0, 177)	1
  (0, 188)	2
  (0, 189)	1
  (0, 195)	1
  (0, 197)	1
  (0, 203)	1
  (0, 206)	1
  :	:
  (399, 1314)	2
  (399, 1325)	1
  (399, 1354)	1
  (399, 1371)	1
  (399, 1375)	2
  (399, 1382)	1
  (399, 1399)	2
  (399, 1411)	1
  (399, 1416)	1
  (399, 1418)	1
  (399, 1430)	1
  (399, 1463)	2
  (399, 1471)	1
  (399, 1475)	1
  (399, 1514)	2
  (399, 1516)	1
  (399, 1542)	1
  (399, 1546)	2
  (399, 1549)	1
  (399, 1609)	3
  (399, 1632)	1
  (399, 1633)	3
  (399, 1637)	1
  (399, 1638)	1
  (399, 1641)	2


In [None]:
# convert to non-sparse array
X_test=X_test_transformed.toarray()
X_test

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 2, 0]])

Let us summarise all we have done till now:

- ```vec.fit(X_train)``` learns the vocabulary of the training data
- ```vec.transform(X_train)``` uses the fitted vocabulary to build a document-term matrix from the training data
- ```vec.transform(X_test)``` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)

### 2. Building the Model: Multinomial Naive Bayes
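
Multinomial Naive Bayes works on word counts: for each class it estimates how likely each word is (with additive smoothing, `alpha`), then combines the class prior with the word likelihoods of a new document. The sketch below reproduces that calculation by hand on a made-up count matrix and compares it with `MultinomialNB`. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts: 3 documents, 3 vocabulary words
X = np.array([[2, 1, 0],
              [1, 0, 1],
              [0, 2, 3]])
y = np.array([1, 1, 0])
x_new = np.array([[1, 1, 0]])     # document to classify

alpha = 1.0                        # Laplace smoothing
log_post = []
for c in np.unique(y):
    Xc = X[y == c]
    word_counts = Xc.sum(axis=0)
    # smoothed P(word | class)
    log_word_prob = np.log((word_counts + alpha) /
                           (word_counts.sum() + alpha * X.shape[1]))
    log_prior = np.log(len(Xc) / len(X))
    # log P(class) + sum over words of count * log P(word | class)
    log_post.append(log_prior + (x_new @ log_word_prob)[0])

log_post = np.array(log_post)
manual = np.exp(log_post - log_post.max())
manual /= manual.sum()             # normalise to probabilities

sk = MultinomialNB(alpha=alpha).fit(X, y).predict_proba(x_new)[0]
print(manual)
print(sk)
```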

In [None]:
# building a multinomial NB model
from sklearn.naive_bayes import MultinomialNB

# instantiate NB class
mnb=MultinomialNB()

# fitting the model on training data
mnb.fit(X_transformed, y_train)

# note that we are using the sparse matrix X_transformed,
# though you can also use the non-sparse version
# mnb.fit(X_transformed.toarray(), y_train)

# predicting probabilities of test data
# one row per test document: column 0 is class 0 (Neg), column 1 is class 1 (Pos)
proba = mnb.predict_proba(X_test)

In [None]:
# probability of each class (test data)
print("probability of test documents belonging to class Neg" , proba[:,0])
print("probability of test documents belonging to class Pos" , proba[:,1])


### 3. Building the Model: Bernoulli Naive Bayes
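
Bernoulli Naive Bayes models only whether a word is present in a document, not how often it occurs. scikit-learn's `BernoulliNB` has a `binarize` parameter (default `0.0`) that maps every count above the threshold to 1, so passing raw counts or pre-binarised data should give the same model. The sketch below checks this on a made-up count matrix; the data and names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical word-count matrix: 4 documents, 3 vocabulary words
X_counts = np.array([[3, 0, 1],
                     [2, 1, 0],
                     [0, 4, 1],
                     [0, 2, 0]])
y = np.array([1, 1, 0, 0])

# presence/absence version of the same data
X_binary = (X_counts > 0).astype(int)

bnb_counts = BernoulliNB().fit(X_counts, y)   # binarised internally
bnb_binary = BernoulliNB().fit(X_binary, y)

same = np.allclose(bnb_counts.predict_proba(X_counts),
                   bnb_binary.predict_proba(X_binary))
print(same)
```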

In [None]:
from sklearn.naive_bayes import BernoulliNB

# instantiating bernoulli NB class
bnb=BernoulliNB()

# fitting the model
bnb.fit(X_transformed, y_train)

# also works
# bnb.fit(X_transformed.toarray(), y_train)

# predicting probability of test data
prob_bnb = bnb.predict_proba(X_test)
# print("probability of test documents belonging to class Neg" , prob_bnb[:,0])
# print("probability of test documents belonging to class Pos" , prob_bnb[:,1])


In [None]:
# accuracy
from sklearn import metrics

# predict class
y_pred_class = bnb.predict(X_test_transformed)

# # predict probability
# y_pred_proba =bnb.predict_proba(X_test)

y_test = y_test.astype(int)
accuracy = metrics.accuracy_score(y_test, y_pred_class)
print(f"Accuracy: {accuracy}")

Accuracy: 0.79


In [None]:
metrics.confusion_matrix(y_test, y_pred_class)

array([[177,  23],
       [ 61, 139]])
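
The confusion matrix can be unpacked into the usual summary metrics. In scikit-learn's layout, rows are the true classes and columns the predicted classes, in sorted label order (0 = Neg, 1 = Pos). The sketch below hard-codes the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# confusion matrix copied from the output above
cm = np.array([[177, 23],
               [61, 139]])

# row 0 = true Neg, row 1 = true Pos; column 0 = predicted Neg, column 1 = predicted Pos
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)     # share of positive reviews correctly identified
specificity = tn / (tn + fp)     # share of negative reviews correctly identified

print("accuracy   :", accuracy)
print("sensitivity:", sensitivity)
print("specificity:", specificity)
```

The accuracy computed this way agrees with the 0.79 reported by `metrics.accuracy_score`, and the split shows that the model identifies negative reviews more reliably than positive ones.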

In the next notebook, we will use Multinomial and Bernoulli Naive Bayes to solve an interesting real problem - classifying SMSes as spam or ham. We'll also see how to decide the optimal cutoff probability and evaluate the model.
