
**Multinomial and Bernoulli Naive Bayes**

To understand Multinomial and Bernoulli Naive Bayes, we will start with a small example and walk through the end-to-end process. In another notebook, we will build a full-fledged email spam classifier.

To start with, let's take a set of movie reviews and classify each one into one of two classes: positive (Pos) or negative (Neg). Each review represents one document. In real-world cases, a document can be any piece of text, such as an email, a news article, a book review, or a tweet. The analysis and the algorithm do not depend on the type of document we use.

The notebook is divided into the following sections:

1. Importing and preprocessing data
2. Building the model: Multinomial Naive Bayes
3. Building the model: Bernoulli Naive Bayes

In [14]:
import numpy as np
import pandas as pd
import sklearn

# training data
train_docs = pd.read_csv('https://raw.githubusercontent.com/chitra-psg/imdb/main/movie_review_train.csv') 
train_docs

Unnamed: 0,class,text
0,Pos,a common complaint amongst film critics is ...
1,Pos,whew this film oozes energy the kind of b...
2,Pos,steven spielberg s amistad which is bas...
3,Pos,he has spent his entire life in an awful litt...
4,Pos,being that it is a foreign language film with...
...,...,...
1595,Neg,if anything stigmata should be taken as...
1596,Neg,john boorman s zardoz is a goofy cinemati...
1597,Neg,the kids in the hall are an acquired taste ...
1598,Neg,there was a time when john carpenter was a gr...


In [15]:
# convert the label to a numerical variable (Neg -> 0, Pos -> 1)
train_docs['class'] = train_docs['class'].map({'Neg':0, 'Pos':1})
train_docs

Unnamed: 0,class,text
0,1,a common complaint amongst film critics is ...
1,1,whew this film oozes energy the kind of b...
2,1,steven spielberg s amistad which is bas...
3,1,he has spent his entire life in an awful litt...
4,1,being that it is a foreign language film with...
...,...,...
1595,0,if anything stigmata should be taken as...
1596,0,john boorman s zardoz is a goofy cinemati...
1597,0,the kids in the hall are an acquired taste ...
1598,0,there was a time when john carpenter was a gr...
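
One detail of `Series.map` with a dictionary is worth keeping in mind: any label that is not a key in the dictionary becomes `NaN` rather than raising an error. The sketch below illustrates this on a hypothetical three-row frame (the names `toy`, `label_map`, and `n_unmapped` are made up for illustration and are not part of the notebook), so a quick check for missing values after mapping can catch misspelled or unexpected labels.

```python
import pandas as pd

# hypothetical toy frame; the third label is deliberately lowercase
toy = pd.DataFrame({
    "class": ["Pos", "Neg", "pos"],
    "text": ["good film", "bad film", "fine film"],
})

label_map = {"Neg": 0, "Pos": 1}
toy["label"] = toy["class"].map(label_map)

# labels that are not keys of label_map come back as NaN
n_unmapped = int(toy["label"].isna().sum())
print(toy)
print("unmapped labels:", n_unmapped)
```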


In [16]:
# convert the df to a numpy array 
train_array = train_docs.values

# split X and y
X_train = train_array[:,1]
y_train = train_array[:,0]
y_train = y_train.astype('int') # sklearn needs y as integers

print("X_train")
print(X_train)
print("y_train")
print(y_train)

X_train
[' a common complaint amongst film critics is   why aren t there more literate scripts available      quiz show gives signs of hope that the art of writing isn t dead in hollywood and that we need not only look to independent films for thoughtful content    paul attanasio s script takes what could have been a tepid thriller   the quiz show scandals of the late 50s   and delivers a telling parable about the emptiness of the post war american dream and the golden bubble that surrounds and protects tv networks and their sponsors    the film is riddled with telling symbols   e   g    a  58 chrysler   a radio announcement of sputnik   but is never heavy handed    deft direction by robert redford and keen performances by ralph fiennes   john turturro and rob morrow dovetail perfectly with the carefully honed script    redford departs from the usually overlight     cable tv quality   sets and camera work so common in recent 20th century period pieces    quiz show perfectly captures th
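
The cast to `int` is needed because `.values` on a frame that mixes numeric and text columns returns a single array of dtype `object`. As a possible alternative, the columns can be selected by name, which keeps each column's own dtype and does not depend on column order. This is a minimal sketch on a hypothetical two-row frame (`toy_docs`, `X_toy`, and `y_toy` are illustrative names), not the notebook's own code.

```python
import pandas as pd

# hypothetical two-row frame standing in for train_docs
toy_docs = pd.DataFrame({
    "class": [1, 0],
    "text": ["a fine film", "a dull film"],
})

# .values on a frame with mixed column types yields an object array
mixed = toy_docs.values
print(mixed.dtype)

# selecting by column name keeps the integer dtype of the label column
X_toy = toy_docs["text"].to_numpy()
y_toy = toy_docs["class"].to_numpy()
print(y_toy.dtype)
```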

In [17]:
# import the CountVectorizer class (an instance is created in the next cell)
from sklearn.feature_extraction.text import CountVectorizer 
#help(CountVectorizer)

In [18]:
vec = CountVectorizer()

In [19]:
# fit the vectorizer on training data 
vec.fit(X_train)
len(vec.vocabulary_)

36162
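
To see what `fit` learns, it can help to run it on a handful of short strings. The sketch below uses three made-up documents (`toy_docs` and `toy_vec` are illustrative names) and the default `CountVectorizer` settings, which lowercase the text and keep tokens of two or more characters. The resulting `vocabulary_` maps each distinct token to a column index, assigned in alphabetical order.

```python
from sklearn.feature_extraction.text import CountVectorizer

# three made-up documents, not taken from the review data
toy_docs = [
    "the film was great",
    "the plot was weak",
    "great film great cast",
]

toy_vec = CountVectorizer()
toy_vec.fit(toy_docs)

# vocabulary_ maps token -> column index
vocab = toy_vec.vocabulary_
for token, index in sorted(vocab.items(), key=lambda item: item[1]):
    print(index, token)
```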

In [20]:
# printing feature names
print(vec.get_feature_names_out())
print(len(vec.get_feature_names_out()))

['00' '000' '007' ... 'zus' 'zwick' 'zwigoff']
36162


In [21]:
# fitting the vectorizer on training data again
# removing the stop words this time
vec = CountVectorizer(stop_words='english')
vec.fit(X_train)
len(vec.vocabulary_)

35858

In [22]:
# printing feature names
print(vec.get_feature_names_out())
print(len(vec.get_feature_names_out()))

['00' '000' '007' ... 'zus' 'zwick' 'zwigoff']
35858
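
Removing stop words shrank the vocabulary from 36162 to 35858 terms. The same effect can be seen on a small scale with the sketch below, which fits two vectorizers on made-up documents and lists the tokens that `stop_words='english'` drops. The names `toy_docs`, `all_terms`, `kept_terms`, and `removed` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up documents, not taken from the review data
toy_docs = [
    "the film was great",
    "the plot was weak",
    "great film and great cast",
]

# vocabulary with and without scikit-learn's built-in English stop list
all_terms = set(CountVectorizer().fit(toy_docs).vocabulary_)
kept_terms = set(CountVectorizer(stop_words="english").fit(toy_docs).vocabulary_)

removed = sorted(all_terms - kept_terms)
print("kept:   ", sorted(kept_terms))
print("removed:", removed)
```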


In [23]:
# transform the training documents into a sparse document-term matrix of word counts
X_transformed = vec.transform(X_train)
X_transformed

<1600x35858 sparse matrix of type '<class 'numpy.int64'>'
	with 394677 stored elements in Compressed Sparse Row format>

In [24]:
print(X_transformed)

  (0, 57)	1
  (0, 225)	1
  (0, 313)	4
  (0, 328)	1
  (0, 1334)	2
  (0, 1534)	1
  (0, 1676)	1
  (0, 1810)	1
  (0, 1928)	1
  (0, 2058)	1
  (0, 2173)	1
  (0, 2229)	3
  (0, 2277)	1
  (0, 2319)	1
  (0, 2997)	1
  (0, 3604)	1
  (0, 4200)	1
  (0, 4285)	1
  (0, 4573)	1
  (0, 4582)	1
  (0, 4690)	1
  (0, 4838)	1
  (0, 4839)	1
  (0, 4869)	2
  (0, 5131)	1
  :	:
  (1599, 30670)	1
  (1599, 30808)	1
  (1599, 30928)	1
  (1599, 31107)	1
  (1599, 31162)	1
  (1599, 31198)	1
  (1599, 31483)	1
  (1599, 31486)	2
  (1599, 31717)	1
  (1599, 31761)	1
  (1599, 31954)	1
  (1599, 32120)	1
  (1599, 32544)	1
  (1599, 32894)	1
  (1599, 33323)	1
  (1599, 33801)	1
  (1599, 33949)	1
  (1599, 34330)	1
  (1599, 34400)	1
  (1599, 34424)	1
  (1599, 34706)	1
  (1599, 34788)	2
  (1599, 34965)	1
  (1599, 35351)	1
  (1599, 35396)	1


In [25]:
# converting transformed matrix back to an array
# note the high number of zeros
X_transformed.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
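
The number of zeros can be quantified from the figures printed earlier: 394677 stored elements in a 1600 x 35858 matrix. The arithmetic below is a rough sketch of the matrix density and of the memory a dense `int64` array of that shape would need; the variable names are illustrative. For an actual SciPy sparse matrix `X`, the stored-element count is available as `X.nnz`.

```python
# figures taken from the sparse matrix summary printed above
n_docs, n_features = 1600, 35858
stored = 394677

# fraction of cells that hold a non-zero count
density = stored / (n_docs * n_features)

# size of the equivalent dense array at 8 bytes per int64 cell
dense_bytes = n_docs * n_features * 8

print(f"density: {density:.4%}")
print(f"dense size: {dense_bytes / 1e6:.0f} MB")
```

Fewer than 1% of the cells are non-zero, which is why the vectorizer returns a sparse matrix and why `toarray()` is best reserved for inspection on small data.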

In [26]:
# converting matrix to dataframe
pd.DataFrame(X_transformed.toarray(), 
             columns=vec.get_feature_names_out())

Unnamed: 0,00,000,007,00s,03,04,05,05425,10,100,...,zucker,zuehlke,zuko,zukovsky,zulu,zundel,zurg,zus,zwick,zwigoff
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1595,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1596,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1597,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1598,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
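
The table above holds word counts, which is the kind of feature a multinomial model works with. A Bernoulli model instead looks only at whether a word is present or absent in a document. As a hedged sketch of that difference, `CountVectorizer` accepts a `binary=True` option that caps every non-zero count at 1. The documents and variable names below are made up for illustration, and this is only one way to obtain presence/absence features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up documents, not taken from the review data
toy_docs = ["great film great cast", "weak plot"]

# word counts: "great" appears twice in the first document
count_X = CountVectorizer().fit_transform(toy_docs).toarray()

# presence/absence: every non-zero count becomes 1
binary_X = CountVectorizer(binary=True).fit_transform(toy_docs).toarray()

print(count_X)
print(binary_X)
```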
