# Making a Naive Bayes classifier in Python

We start with a toy example of how to do machine learning with scikit-learn, adapted from the scikit-learn tutorial.

In [1]:
# Here is the toy data
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]

y = [0, 1]  # classes of each sample

In [2]:
# This will train a Naive Bayes classifier on the training data in X,
# using the gold labels in y
from sklearn.naive_bayes import MultinomialNB
# When we make a Naive Bayes classifier without changing any
# of the default parameter values, it does add-one smoothing:
# that's what alpha=1.0 means.
# class_prior=None means that we're not setting P(c)
# for the outcome classes c ahead of seeing the data,
# and fit_prior=True means that we're computing P(c)
# from the training data.
nb = MultinomialNB()
nb.fit(X, y)

MultinomialNB()
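
To make the add-one smoothing concrete, here is a minimal sketch that recomputes the smoothed feature probabilities by hand and compares them with what the fitted model stores in `feature_log_prob_`. The names `alpha` and `manual_feature_probs` are ours, introduced only for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# the same toy data as above, as numpy arrays so we can index by class
X = np.array([[1, 2, 3],
              [11, 12, 13]])
y = np.array([0, 1])

nb = MultinomialNB()  # defaults: alpha=1.0, fit_prior=True, class_prior=None
nb.fit(X, y)

alpha = 1.0  # add-one smoothing
manual_feature_probs = []
for c in nb.classes_:
    # total count of each feature over all samples of class c
    counts = X[y == c].sum(axis=0)
    # add alpha to every count, then normalize so the row sums to 1
    smoothed = (counts + alpha) / (counts.sum() + alpha * X.shape[1])
    manual_feature_probs.append(smoothed)
manual_feature_probs = np.array(manual_feature_probs)

print(manual_feature_probs)
print(np.exp(nb.feature_log_prob_))  # should show the same numbers
```

For class 0, for example, the counts 1, 2, 3 become 2, 3, 4, and are divided by 6 + 3 = 9.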

In [3]:
# Prediction on the training data:
# The model reproduces the training labels, predicting a label of 0
# for the first datapoint, and 1 for the second datapoint.
# But then, testing on the training data does not tell us much.
nb.predict(X)

array([0, 1])

In [4]:
# Here is data that the model has not previously seen
testdata = [
    [4, 5, 6], 
    [14, 15, 16]]

# The model predicts label 1 for both datapoints.
nb.predict(testdata)

array([1, 1])
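
If we want to know how confident the model is, `predict_proba` returns one probability per class rather than just a label. A small sketch, refitting the toy model so that the block runs on its own:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = [[1, 2, 3],
     [11, 12, 13]]
y = [0, 1]
testdata = [[4, 5, 6],
            [14, 15, 16]]

nb = MultinomialNB().fit(X, y)

# one row per test datapoint, one column per class (class 0, class 1)
proba = nb.predict_proba(testdata)
print(np.round(proba, 3))

# the predicted label is simply the column with the highest probability
print(proba.argmax(axis=1))
```

For the first test datapoint the two classes come out almost tied, so its label of 1 is a very close call.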

Now we move to an NLP task: the mini sentiment analysis task from chapter 4 of Jurafsky and Martin, 3rd edition. Here is the data:

In [5]:
# the mini corpus from J&M version 3
jm_corpus = [
    'just plain boring',
    'entirely predictable and lacks energy',
    'no surprises and very few laughs',
    'very powerful',
    'the most fun film of the summer'
    ]
# gold labels on these documents
jm_y = [0, 0, 0, 1, 1]


scikit-learn needs its data in a different format: each document becomes a vector with one entry per feature (in our case, per word type), holding the value of that feature. In our case, these values are the word counts.

scikit-learn comes with a number of predefined feature extractors that transform raw data into this input format. There is even one specifically for word counts, the `CountVectorizer`. Here it is:

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
# make an object that does the feature extraction
counting_feature_extractor = CountVectorizer()

# fit_transform does two things: 
# 1. it checks which words are in this corpus, and assumes that these are 
#   all the words/features the model will pay attention to.
# 2. It transforms the corpus into a matrix of counts
jm_X = counting_feature_extractor.fit_transform(jm_corpus)
jm_X

<5x20 sparse matrix of type '<class 'numpy.int64'>'
	with 22 stored elements in Compressed Sparse Row format>

In [7]:
# Here is a list of all the words that the feature extractor 
# has seen:
print(", ".join(counting_feature_extractor.get_feature_names_out()))

and, boring, energy, entirely, few, film, fun, just, lacks, laughs, most, no, of, plain, powerful, predictable, summer, surprises, the, very
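
It is worth knowing how `CountVectorizer` splits text into words. With its default settings it lowercases the text and keeps only tokens of at least two characters, so punctuation and one-letter words disappear. The sketch below shows this on a made-up sentence (not part of the J&M corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses
# to turn one document into a list of tokens
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("A film I liked, Very much!")
print(tokens)  # ['film', 'liked', 'very', 'much']
```

The J&M corpus is already lowercase and has no punctuation or one-letter words, so none of this changes the counts here.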


In [8]:
# Let's take a closer look at the matrix of counts.
# The only word that ever appears twice within a single document
# is "the", in the last document.
jm_X.toarray()

array([[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 2, 0]])

In [9]:
# Now we can train a Naive Bayes model on this data as before
nb_jm = MultinomialNB()
nb_jm.fit(jm_X, jm_y)

MultinomialNB()

In [10]:
# We test this on the test corpus
# (consisting of just one document)
# from the Jurafsky and Martin example
jm_testcorpus = [ 'predictable with no fun']

# To transform the datapoint into a feature representation,
# we just call "transform", not "fit_transform":
# The feature extractor has been fitted to the training data,
# and is now applied to the test data,
# meaning that it only pays attention to the words
# that appeared in the training data. 
jm_testX = counting_feature_extractor.transform(jm_testcorpus)
jm_testX

<1x20 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [11]:
# inspecting the matrix of counts for the test data
jm_testX.toarray()

array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]])
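
Note that the test document has four words but only three nonzero counts: "with" never appeared in the training data, so it has no column. A short sketch that makes this visible through the fitted `vocabulary_` mapping (the name `vec` is ours):

```python
from sklearn.feature_extraction.text import CountVectorizer

jm_corpus = [
    'just plain boring',
    'entirely predictable and lacks energy',
    'no surprises and very few laughs',
    'very powerful',
    'the most fun film of the summer'
    ]

vec = CountVectorizer()
vec.fit(jm_corpus)

test_doc = 'predictable with no fun'
for word in test_doc.split():
    # vocabulary_ maps each known word to its column index
    column = vec.vocabulary_.get(word)
    if column is None:
        print(word, "-> not in the vocabulary, ignored")
    else:
        print(word, "-> column", column)

# the columns that actually receive a count
nonzero_columns = vec.transform([test_doc]).toarray()[0].nonzero()[0]
print(nonzero_columns)
```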

In [12]:
# Let's see what the model predicts for this test datapoint
nb_jm.predict(jm_testX)

array([0])
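
We can check this prediction by hand, along the lines of the worked example in Jurafsky and Martin: multiply the class prior by the add-one smoothed probability of each known test word. This is a sketch with our own variable names, refitting everything so that it runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

jm_corpus = [
    'just plain boring',
    'entirely predictable and lacks energy',
    'no surprises and very few laughs',
    'very powerful',
    'the most fun film of the summer'
    ]
y = np.array([0, 0, 0, 1, 1])

vec = CountVectorizer()
X = vec.fit_transform(jm_corpus).toarray()
x_test = vec.transform(['predictable with no fun']).toarray()[0]

scores = []
for c in (0, 1):
    prior = np.mean(y == c)                    # P(c)
    counts = X[y == c].sum(axis=0)             # word counts in class c
    word_probs = (counts + 1) / (counts.sum() + X.shape[1])  # add-one smoothing
    # words with count 0 in the test document contribute a factor of 1
    scores.append(prior * np.prod(word_probs ** x_test))
scores = np.array(scores)

print("P(c) * P(doc|c):", scores)
print("normalized:     ", scores / scores.sum())
print("scikit-learn:   ", MultinomialNB().fit(X, y).predict_proba([x_test])[0])
```

The score for the negative class is 3/5 × (2 × 2 × 1)/34³, and for the positive class 2/5 × (1 × 1 × 2)/29³, so negative wins.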

## Inspecting what the model has learned

Naive Bayes classifiers are quite interpretable: We can look at the class prior probabilities P(c) to see which classes are most likely overall:

In [13]:
import numpy as np

# asking the model for log probabilities
class_logprobs = nb_jm.class_log_prior_
# turning them back into probabilities
class_probs = list(np.exp(class_logprobs))
# making a list of class names
classes = ["neg", "pos"]

for classno in range(2):
    print("class no.", classno, "is", classes[classno], "with prob.", class_probs[classno])



class no. 0 is neg with prob. 0.6000000000000001
class no. 1 is pos with prob. 0.4
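
These numbers are just the relative frequencies of the labels: three of the five training documents are negative. The sketch below checks this, and also shows what happens if we pass `fit_prior=False`, in which case the model uses a uniform prior instead. Variable names are ours.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

jm_corpus = [
    'just plain boring',
    'entirely predictable and lacks energy',
    'no surprises and very few laughs',
    'very powerful',
    'the most fun film of the summer'
    ]
jm_y = [0, 0, 0, 1, 1]
jm_X = CountVectorizer().fit_transform(jm_corpus)

# relative frequency of each label in the training data
relative_freqs = np.bincount(jm_y) / len(jm_y)

# priors learned with the default fit_prior=True
fitted_priors = np.exp(MultinomialNB().fit(jm_X, jm_y).class_log_prior_)

# priors when we tell the model not to learn them from the data
uniform_priors = np.exp(
    MultinomialNB(fit_prior=False).fit(jm_X, jm_y).class_log_prior_)

print(relative_freqs, fitted_priors, uniform_priors)
```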


And we can look at the feature probabilities to see which features the model found to be most indicative of each class.

In [23]:
# asking the model for log probabilities
feature_logprobs = nb_jm.feature_log_prob_
# turning them back into probabilities
feature_probs = np.exp(feature_logprobs)
# getting a list of all word types from the feature extractor
features = counting_feature_extractor.get_feature_names_out()

for classno in range(2):
    print("\nclass", classes[classno])
    
    # extracting row no. [classno]
    thisclass_feature_probs = list(feature_probs[classno])
    # combining each feature with its probability into a tuple,
    # probability first
    features_and_probs = list(zip(thisclass_feature_probs, features))
    # and sorting those tuples by probability, highest first
    features_and_probs.sort(reverse = True)
    
    for feature_prob, feature in features_and_probs:
        print("\t", feature, ":", round(feature_prob, 4))
    


class neg
	 and : 0.0882
	 very : 0.0588
	 surprises : 0.0588
	 predictable : 0.0588
	 plain : 0.0588
	 no : 0.0588
	 laughs : 0.0588
	 lacks : 0.0588
	 just : 0.0588
	 few : 0.0588
	 entirely : 0.0588
	 energy : 0.0588
	 boring : 0.0588
	 the : 0.0294
	 summer : 0.0294
	 powerful : 0.0294
	 of : 0.0294
	 most : 0.0294
	 fun : 0.0294
	 film : 0.0294

class pos
	 the : 0.1034
	 very : 0.069
	 summer : 0.069
	 powerful : 0.069
	 of : 0.069
	 most : 0.069
	 fun : 0.069
	 film : 0.069
	 surprises : 0.0345
	 predictable : 0.0345
	 plain : 0.0345
	 no : 0.0345
	 laughs : 0.0345
	 lacks : 0.0345
	 just : 0.0345
	 few : 0.0345
	 entirely : 0.0345
	 energy : 0.0345
	 boring : 0.0345
	 and : 0.0345


As you can see, it's the function words that have some of the highest probabilities -- because they simply occur a lot. So removing stop words is not a bad idea.

## Other feature extractors

scikit-learn has many other feature extractors. There is also one that assumes that each data point is given as a Python dictionary of the form

```{feature1:value1, feature2:value2, ...}```

so that a whole dataset is given as a list of such dictionaries. 

We can transform our data into this format ourselves if we like, using NLTK's language processing functions for the preprocessing:

In [15]:
import nltk

jm_corpus_countdict = [ ]
for document in jm_corpus:
    doc_preprocessed = nltk.word_tokenize(document)
    doc_countdict = nltk.FreqDist(doc_preprocessed)
    jm_corpus_countdict.append(doc_countdict)
    
jm_corpus_countdict

[FreqDist({'just': 1, 'plain': 1, 'boring': 1}),
 FreqDist({'entirely': 1, 'predictable': 1, 'and': 1, 'lacks': 1, 'energy': 1}),
 FreqDist({'no': 1, 'surprises': 1, 'and': 1, 'very': 1, 'few': 1, 'laughs': 1}),
 FreqDist({'very': 1, 'powerful': 1}),
 FreqDist({'the': 2, 'most': 1, 'fun': 1, 'film': 1, 'of': 1, 'summer': 1})]

In [16]:
# We now use a feature extractor that can deal with
# data points that are count dictionaries
from sklearn.feature_extraction import DictVectorizer
dict_feature_extractor = DictVectorizer()
jm_X2 = dict_feature_extractor.fit_transform(jm_corpus_countdict)

In [17]:
jm_X2.toarray()

array([[0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0.],
       [1., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 1., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0.,
        1., 0., 2., 0.]])

In [18]:
# we can now again train a classifier
nb_jm2 = MultinomialNB()
nb_jm2.fit(jm_X2, jm_y)

MultinomialNB()

In [19]:
# and we preprocess the test corpus in the same way
jm_testcorpus_countdict = [ ]
for document in jm_testcorpus:
    doc_preprocessed = nltk.word_tokenize(document) 
    doc_dict = nltk.FreqDist(doc_preprocessed)
    jm_testcorpus_countdict.append(doc_dict)
    
jm_testcorpus_countdict

[FreqDist({'predictable': 1, 'with': 1, 'no': 1, 'fun': 1})]

In [20]:
# transforming this data into a matrix
jm_testX2 = dict_feature_extractor.transform(jm_testcorpus_countdict)

In [21]:
# and handing it to the Naive Bayes classifier to classify
nb_jm2.predict(jm_testX2)

array([0])

The classifier has labeled the single test datapoint correctly: It has negative sentiment. 

But interestingly, if we remove stopwords from this data, it will mis-classify the test datapoint as having positive sentiment -- because "no" is a stopword!

**Try it for yourself**: Here is the first step, the preprocessing of the training data, using variable names that start with "sw" for "stopwords". Can you do the rest? You will need to:

* preprocess the training data into dictionaries of counts (already done)
* transform the training data into a matrix of feature values
* train a Naive Bayes classifier with the data
* preprocess the test data into a dictionary of counts
* transform the test data into a matrix of feature values
* use `predict()` to see what class the Naive Bayes classifier assigns to the test data

In [22]:
from nltk.corpus import stopwords
en_stopwords = stopwords.words("english")

sw_jm_corpus_countdict = [ ]
for document in jm_corpus:
    doc_preprocessed = [w for w in nltk.word_tokenize(document) 
                        if w not in en_stopwords]
    sw_jm_corpus_countdict.append(nltk.FreqDist(doc_preprocessed))
    
sw_jm_corpus_countdict

[FreqDist({'plain': 1, 'boring': 1}),
 FreqDist({'entirely': 1, 'predictable': 1, 'lacks': 1, 'energy': 1}),
 FreqDist({'surprises': 1, 'laughs': 1}),
 FreqDist({'powerful': 1}),
 FreqDist({'fun': 1, 'film': 1, 'summer': 1})]