# Implementing Naive Bayes

We're going to walk through an implementation of a Naive Bayes classifier, and in doing so get comfortable with some basic aspects of machine learning on collections of text.

In the first week of this course we explored basic input and output using Python, and also reviewed some string operations that we can use to "tokenize" text (divide it into words or word-like tokens) and "normalize" it (for instance, by rendering everything lowercase). We could continue using those basic Python functions to convert the texts we use into numbers.
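
As a reminder of what that might look like by hand, here is a minimal sketch using only the standard library. The regular expression is an assumption on my part (lowercased tokens of two or more word characters); it's one reasonable choice among many, not the only way to tokenize.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text, then return word-like tokens of 2+ characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

review = "It was pathetic. The worst part was the boxing scenes."
tokens = simple_tokenize(review)     # "normalize" and "tokenize" in one step
token_counts = Counter(tokens)       # map each token to its number of occurrences

print(tokens)
print(token_counts["was"], token_counts["the"])
```

Counting tokens this way gives us numbers, but we would still have to line the counts up across documents ourselves.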

But since normalizing and tokenizing text is a very common operation, there are Python libraries that take care of it for us. Using them will simplify our code. Standard libraries also reduce the distortion that can creep in when (say) a model is trained on text tokenized one way and then applied to data tokenized a different way.

### Using CountVectorizer to turn texts into a pandas dataframe

One library we'll use a lot is [scikit-learn](https://scikit-learn.org/stable/about.html). In import statements it's abbreviated ```sklearn```; the package you install with pip is named ```scikit-learn```.


In [120]:
# !pip install scikit-learn   # only uncomment and run if needed
                        # in other words, if you get an error when you attempt
                        # to import sklearn below

In [121]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
import glob, math
from pathlib import Path

In [122]:
sample_texts = ["It was pathetic. The worst part was the boxing scenes.",
         "No plot twists or great scenes.",
         "and satire, and great plot twists",
         "Great scenes; great film."]
text_titles = ['reviewA', 'reviewB', 'reviewC', 'reviewD']

count_vectorizer = CountVectorizer()
count_vectors = count_vectorizer.fit_transform(sample_texts)

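# get_feature_names_out() needs scikit-learn 1.0 or later;
# older versions called this method get_feature_names()
vector_frame = pd.DataFrame(count_vectors.toarray(), index = text_titles,
                            columns = count_vectorizer.get_feature_names_out())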
vector_frame

,and,boxing,film,great,it,no,or,part,pathetic,plot,satire,scenes,the,twists,was,worst
reviewA,0,1,0,0,1,0,0,1,1,0,0,1,2,0,2,1
reviewB,0,0,0,1,0,1,1,0,0,1,0,1,0,1,0,0
reviewC,2,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0
reviewD,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,0


In [154]:
vector_frame.sum(axis = 'rows')

and         2
boxing      1
film        1
great       4
it          1
no          1
or          1
part        1
pathetic    1
plot        2
satire      1
scenes      3
the         2
twists      2
was         2
worst       1
dtype: int64

We'll call this a term-doc matrix; typically words ("features") are columns and documents are rows.

A "vector" is essentially a list of numbers. Next week we'll talk about the geometrical interpretation that makes it possible to interpret a list of numbers as a line in space.

### Relative frequencies (normalized by doc length)

Suppose we wanted to have the relative frequency of each word, expressed as a fraction of all the words in its document. That representation has a lot of advantages, since it factors out document length and provides, essentially, a unigram probability model.

Before proceeding, pause for a moment and think about what we would need to do mathematically to generate relative frequencies. Then you'll understand the following code.

In [123]:
rowsums = vector_frame.sum(axis = 'columns')   # change to 'rows' and see what happens
rowsums

reviewA    10
reviewB     6
reviewC     6
reviewD     4
dtype: int64

For future reference, ```axis = 'columns'``` is often abbreviated ```axis = 1``` and ```axis = 'rows'``` is 0.
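
If the two directions are easy to mix up, a tiny frame makes them concrete. This is just an illustrative sketch; the frame and its names are made up.

```python
import pandas as pd

toy = pd.DataFrame([[1, 2, 3],
                    [4, 5, 6]],
                   index=['doc1', 'doc2'], columns=['a', 'b', 'c'])

per_document = toy.sum(axis=1)   # like axis='columns': collapses the columns, one total per row
per_word = toy.sum(axis=0)       # like axis='rows': collapses the rows, one total per column

print(per_document)
print(per_word)
```

The axis you name is the one that gets collapsed, which is why summing along 'columns' gives one number per row.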

In [124]:
vector_frame.divide(rowsums, axis = 'rows')  # change to 'columns' and see what happens :(

,and,boxing,film,great,it,no,or,part,pathetic,plot,satire,scenes,the,twists,was,worst
reviewA,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.1,0.1,0.0,0.0,0.1,0.2,0.0,0.2,0.1
reviewB,0.0,0.0,0.0,0.166667,0.0,0.166667,0.166667,0.0,0.0,0.166667,0.0,0.166667,0.0,0.166667,0.0,0.0
reviewC,0.333333,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.166667,0.0,0.0
reviewD,0.0,0.0,0.25,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0
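
One quick sanity check on relative frequencies is that every row should now sum to one. Here's a small sketch on made-up counts (the frame below is hypothetical, not our review data):

```python
import numpy as np
import pandas as pd

toy_counts = pd.DataFrame([[1, 0, 3],
                           [2, 2, 0]],
                          index=['doc1', 'doc2'],
                          columns=['great', 'plot', 'scenes'])

row_totals = toy_counts.sum(axis=1)                # words per document
toy_freqs = toy_counts.divide(row_totals, axis=0)  # match row_totals to rows by index label

print(toy_freqs)
print(np.allclose(toy_freqs.sum(axis=1), 1.0))     # each document's frequencies add up to 1
```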


### Load folders of files into a dataframe

We're using [a dataset of movie reviews developed by Bo Pang, Lillian Lee, and Shivakumar Vaidhyanathan.](https://www.cs.cornell.edu/people/pabo/movie-review-data/)

Our first task is to load them into a dataframe.

It's worth stepping through this to make sure you understand what's happening.

In [125]:
negative_dir = '../../data/review_polarity/txt_sentoken/neg'
positive_dir = '../../data/review_polarity/txt_sentoken/pos'
neg_paths = glob.glob(f'{negative_dir}/*.txt')
pos_paths = glob.glob(f'{positive_dir}/*.txt')

all_paths = neg_paths + pos_paths      # notice the order

all_classes = [0] * 1000 + [1] * 1000  # notice the same order
                                       # if it's not clear what's in that list, inspect
                                       # by using len() and saying all_classes[-10 : ]

We now have a list of 2000 paths to files, paired with a list of class labels that are either zero (negative sentiment) or one (positive sentiment).
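
The pairing only works if the two lists stay in the same order and have the same length. As a hedged sketch, here is one way to build labels from the lengths of the path lists instead of typing 1000 by hand. The file names below are invented stand-ins for the real glob results.

```python
# Hypothetical file names standing in for the real glob results
neg_paths_demo = ['neg/cv000.txt', 'neg/cv001.txt', 'neg/cv002.txt']
pos_paths_demo = ['pos/cv000.txt', 'pos/cv001.txt']

paths_demo = neg_paths_demo + pos_paths_demo
labels_demo = [0] * len(neg_paths_demo) + [1] * len(pos_paths_demo)  # same order as the paths

assert len(paths_demo) == len(labels_demo)   # fail loudly if the lists ever drift apart

for path, label in zip(paths_demo, labels_demo):
    print(label, path)
```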

In [155]:
count_vectorizer = CountVectorizer(input = 'filename',   # notice that we're now setting this up
                                  max_features = 5000)   # to automatically read a list of paths
                                                       # and also only taking the top 5000 words
    
word_counts = count_vectorizer.fit_transform(all_paths)  # that line does all the work!

titles = [Path(text).stem for text in all_paths]
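# get_feature_names_out() needs scikit-learn 1.0 or later;
# older versions called this method get_feature_names()
count_df = pd.DataFrame(word_counts.toarray(), index = titles,
                      columns = count_vectorizer.get_feature_names_out())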

count_df = count_df.assign(class_label = all_classes)  # adding a column for class_label
count_df.head()

,000,10,100,11,12,13,13th,14,15,16,...,younger,your,yourself,youth,zane,zany,zero,zeta,zone,class_label
cv676_22202,0,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
cv839_22807,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
cv155_7845,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
cv465_23401,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
cv398_17047,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [157]:
len(all_paths)

2000

We now have a term-doc matrix for the top 5000 words in 2000 movie reviews, along with a column that indicates whether each review was negative or positive.
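
To see what "top words" means here, we can try ```max_features``` on the four sample reviews from earlier, where it's easy to check by eye. This is only a sketch: I've picked ```max_features = 2``` because "great" (4 occurrences) and "scenes" (3) are unambiguously the two most frequent words.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["It was pathetic. The worst part was the boxing scenes.",
        "No plot twists or great scenes.",
        "and satire, and great plot twists",
        "Great scenes; great film."]

small_vectorizer = CountVectorizer(max_features=2)   # keep only the 2 most frequent words
small_counts = small_vectorizer.fit_transform(docs)

kept_words = sorted(small_vectorizer.vocabulary_)    # vocabulary_ maps word -> column index
print(kept_words)
print(small_counts.toarray())                        # one row per document, one column per kept word
```

Every other word is simply dropped from the matrix, which is what happens to rarer words in our reviews when we ask for 5000 features.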

### Generating train and test sets

Now let's divide this into train and test sets. By convention we're going to call the matrix of feature values *X* (it gets a capital because the names of matrices are by convention capital letters). The vector of class labels, we'll call *y.* We're about to learn a function that predicts *y* from *X.* To be super-fancy we can refer to our predictions as $\hat{y}$.

In [127]:
neg_counts = count_df.loc[count_df['class_label'] == 0, ]
pos_counts = count_df.loc[count_df['class_label'] == 1, ]

train_X = pd.concat([neg_counts.iloc[0:800, : ], pos_counts.iloc[0:800, : ]], axis = 'rows')
train_y = train_X['class_label']
train_X = train_X.drop('class_label', axis = 'columns')  # we don't want this as a feature. Why not?
train_X.shape

(1600, 5000)

Notice that we just took the first 800 elements of the negative and positive dataframes as our training set. (That's four-fifths of the data). We're trusting that the data is already well-randomized, so all the reviews of action movies aren't at the end, etc.
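
If we didn't want to trust the existing order, we could shuffle before splitting. Here's a minimal sketch on a made-up frame; the column names and the 0.8 cutoff mirror what we did above, but everything else is hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({'word_a': range(10),
                       'class_label': [0] * 5 + [1] * 5})

# frac=1 returns every row in random order; random_state makes the shuffle repeatable
shuffled = toy_df.sample(frac=1, random_state=0)

train_parts, test_parts = [], []
for label, group in shuffled.groupby('class_label'):
    cutoff = int(len(group) * 0.8)           # four-fifths of each class for training
    train_parts.append(group.iloc[:cutoff])
    test_parts.append(group.iloc[cutoff:])

toy_train = pd.concat(train_parts)
toy_test = pd.concat(test_parts)
print(toy_train.shape, toy_test.shape)
```

Splitting each class separately keeps the classes balanced in both sets, just as taking 800 negative and 800 positive reviews does.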

#### a quick thought experiment

I wrote a line of code that drops the class_label from train_X.

    train_X = train_X.drop('class_label', axis = 'columns')
    
What would happen if I forgot to do that, and we trained a naive Bayes model to predict $\hat{y}$ on train_X with that extra column? What would you expect our accuracy to be?

Now write some code that generates a test set. We'll need both a matrix of feature values and a vector of class labels.

In [128]:
# Lines that create a test_X and a test_y

test_X = pd.concat([neg_counts.iloc[800: , : ], pos_counts.iloc[800: , : ]], axis = 'rows')
test_y = test_X['class_label']
test_X = test_X.drop('class_label', axis = 'columns')
test_X.shape

(400, 5000)

At the end your shape should be (400, 5000).

### Applying Naive Bayes can be simple, if we just want results

If our goal is simply to get predictions, that's easy. Scikit-learn has [several forms of Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) built in. We'll use [Multinomial Naive Bayes.](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)

In [158]:
bayes = MultinomialNB(alpha = 1)
bayes.fit(train_X, train_y)

MultinomialNB(alpha=1)

In [159]:
yhat = bayes.predict(test_X)
yhat[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [160]:
sum(yhat == test_y) / len(test_y)

0.83
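
A single accuracy number hides which kind of mistake the model makes. As a sketch, a cross-tabulation of actual against predicted labels breaks that down. The labels below are invented for illustration; they are not our real results.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and predictions, not the real results
true_labels = pd.Series([0, 0, 0, 1, 1, 1, 1, 0])
predicted = np.array([0, 1, 0, 1, 1, 0, 1, 0])

accuracy = (true_labels == predicted).mean()   # fraction of matches
confusion = pd.crosstab(true_labels, predicted,
                        rownames=['actual'], colnames=['predicted'])

print(accuracy)
print(confusion)
```

The diagonal of the table holds the correct predictions; the off-diagonal cells separate negative reviews mistaken for positive ones from the reverse.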

### But let's implement naive Bayes ourselves, to understand it. First, training.

Let's review the pseudo-code from Jurafsky and Martin:

![pseudocode for the train function](pseudocode_train.png)

Because the probabilities we're dealing with are very, very small, and get smaller as you multiply them, naive Bayes is conventionally implemented by adding logarithms instead of multiplying probabilities. To confirm that this is (or should be!) the same thing, check it out:

In [132]:
logsum = math.log(0.1) + math.log(0.1)

product = 0.1 * 0.1

print(product)
print(math.exp(logsum))

0.010000000000000002
0.010000000000000004


Floating-point math isn't perfect, but that's basically the same thing.
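
To see why the logarithms matter in practice, imagine a review of 500 words where each word has probability 0.001. Those numbers are made up for illustration, but they're not far from the scale we'll be working at.

```python
import math

p = 0.001        # an illustrative per-word probability
n_words = 500    # an illustrative review length

product = 1.0
log_sum = 0.0
for _ in range(n_words):
    product *= p              # multiplying probabilities directly
    log_sum += math.log(p)    # adding their logarithms instead

print(product)   # too small for a float to represent, so it collapses to zero
print(log_sum)   # still a perfectly ordinary negative number
```

Once the product has underflowed to zero for both classes, there is nothing left to compare. The sum of logs keeps the information we need.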

Okay, let's generate the class priors, expressed as logarithms.

The code for this is actually extremely simple. *Notice that we do not have to construct a for-loop at all, because pandas takes care of it for us!* We just count the number of instances in each class, and divide by the total number of instances.

In [133]:
def train_bayes(X, y):
    '''
    This function only performs the first part of the training:
    it generates a class prior for each class, which takes the form
    of a pandas Series
    '''
    
    priors = y.value_counts() / y.shape[0]
    
    logpriors = np.log(priors)
    
    return logpriors
        

In [150]:
train_y.value_counts() / 1600

0    0.5
1    0.5
Name: class_label, dtype: float64

In [134]:
# Use the function we just defined to generate a class prior, and print
# it out to see what it looks like.

# Then play around with the math here, step by step,
# to understand why that works. Test .value_counts() and shape,
# and then divide, etc

# To confirm that we're getting the right value, look at this:

math.log(800/ 1600)

# the difference between math.log() and np.log() is that
# the numpy version automatically *broadcasts* the function
# across a vector

-0.6931471805599453
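
Our classes happen to be perfectly balanced, so both priors are 0.5. To see the same math on unbalanced classes, here's a small sketch with made-up labels. It uses ```value_counts(normalize = True)```, which does the division for us.

```python
import math
import numpy as np
import pandas as pd

toy_y = pd.Series([0, 0, 0, 1], name='class_label')   # three negatives, one positive

toy_priors = toy_y.value_counts(normalize=True)   # counts divided by the total
toy_logpriors = np.log(toy_priors)                # np.log broadcasts across the Series

print(toy_priors)
print(toy_logpriors)
```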

The next part is almost equally simple. Again, no for-loop! Remember how we summed up the rows to normalize frequencies? Now we can sum columns to find the total number of times a word appears in a class. Pandas does that on the whole dataframe at once and saves us from writing a loop, which would tend to be slower as well as more verbose. Here is what we need to do, step by step.

```For each class c:```

    1) Create a dataframe by selecting rows with class == c. We can use train_y to do that. This creates what Jurafsky and Martin call ```bigdoc```.

    2) Sum the *columns* of that class frame to create a vector of wordcounts for the class. Now we have ```count(w, c)```. We'll call that a classvector.

    3) Add one to all the elements of the classvector (Laplace smoothing).
    
    4) Divide the smoothed classvector by its own sum, producing a vector of smoothed likelihoods.

    5) Take the np.log() of the smoothed likelihoods.

    6) Package the two vectors in a single data frame or dictionary. Voila! You have your loglikelihoods.
   
Let's work through that step by step. First, turn train_X into two classvectors. There are ways to do even the two classes without a loop, but let's not get too fancy with it right now; it's okay to loop across the classes and store the resulting vectors in a dictionary.

This takes us up to step three, which is a good place to pause and inspect the results to see what we've actually got.


In [169]:
# WORKSPACE for TRAINING

counts = dict()

for c in [0, 1]:        # instead of [0,1] you can say np.unique(train_y) if you prefer
    
    class_frame = train_X.loc[train_y == c, : ]
    colsums = class_frame.sum(axis = 'rows')
    counts[c] = colsums
    

At this point we have raw counts of how often each word appears in negative and in positive reviews.

In [170]:
counts[0] # negative movie reviews

000      54
10      159
100      33
11       15
12       22
       ... 
zane      4
zany     13
zero     32
zeta     24
zone     16
Length: 5000, dtype: int64

We can smooth those vectors by adding one to every count, so that a word never seen in one class still gets a small nonzero probability there (otherwise its logarithm would be negative infinity). Then, if we divide by the total number of words in the class, we'll have probabilities we can call "likelihoods" (the probability of an outcome, given a set of parameter values, can also be viewed the other way around, as the likelihood of the parameter values if we've observed the outcome). Here the class label plays the role of the "parameter value" and the words are the observed outcome.
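
Here's what add-one smoothing does on a toy class vector. The three words and their counts are invented for illustration.

```python
import pandas as pd

# Hypothetical word counts for one class; 'boring' never appears in it
toy_class_counts = pd.Series({'great': 3, 'boring': 0, 'plot': 1})

unsmoothed = toy_class_counts / toy_class_counts.sum()
smoothed = (toy_class_counts + 1) / (toy_class_counts + 1).sum()

print(unsmoothed)   # 'boring' gets probability zero
print(smoothed)     # 'boring' now gets 1/7, and the vector still sums to 1
```

Without smoothing, a single unseen word would rule out a class entirely, no matter what the rest of the review said.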

In [171]:
likelihoods = dict()

for c in [0, 1]:
    counts[c] = counts[c] + 1
    likelihoods[c] = counts[c] / np.sum(counts[c])
    
ratios = likelihoods[0] / likelihoods[1]
ratios.sort_values(inplace = True)

In [172]:
ratios[0:10]

shrek        0.017515
ordell       0.023851
gattaca      0.027341
mulan        0.029116
ripley       0.035031
leila        0.035031
sweetback    0.036161
mallory      0.038654
lambeau      0.038654
gladiator    0.040035
dtype: float64

In [173]:
ratios[-10 :]

jawbreaker    15.693681
jakob         15.693681
magoo         16.254170
webb          17.375147
memphis       20.177590
palmetto      23.540522
bats          23.540522
sphere        28.024431
brenner       35.871271
jolie         41.476157
dtype: float64

In [174]:
ratios['exciting']

0.8668890556884498

In [175]:
ratios['boring']

4.757317988534175

## The finished train function

In [176]:
def train_bayes(X, y):
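    '''
    This function performs the full training step. It returns the
    log priors for each class (a pandas Series) and a dictionary that
    maps each class to its vector of Laplace-smoothed log likelihoods.
    '''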
    
    priors = y.value_counts() / y.shape[0]
    
    logpriors = np.log(priors)
    
    loglikelihoods = dict()
    
    for c in np.unique(y):
    
        class_frame = X.loc[y == c, : ]   # use the arguments X and y, not the globals
        colsums = class_frame.sum(axis = 'rows') + 1
        likelihoods = colsums / np.sum(colsums)
        loglikelihoods[c] = np.log(likelihoods)   
    
    return logpriors, loglikelihoods

In [178]:
logpriors, loglikelihoods = train_bayes(train_X, train_y)

In [207]:
# Inspect to see what you've got:

loglikelihoods[1][495:505]

blatantly     -11.134372
bleak          -9.712986
blend         -10.441225
blind          -9.861406
block          -9.989240
blockbuster   -10.135843
blond         -10.777697
blonde        -10.307694
blood          -8.649465
bloody         -9.583775
dtype: float64
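
Earlier I mentioned that there are ways to handle the classes without a loop. For the curious, here's one possible sketch using ```groupby```. The function name is hypothetical, and note one difference from ```train_bayes```: it returns the log likelihoods as a dataframe with one row per class, rather than a dictionary.

```python
import numpy as np
import pandas as pd

def train_bayes_groupby(X, y):
    """Hypothetical loop-free variant of train_bayes."""
    logpriors = np.log(y.value_counts(normalize=True))
    class_counts = X.groupby(y).sum() + 1     # one row per class, with add-one smoothing
    likelihoods = class_counts.divide(class_counts.sum(axis=1), axis=0)
    return logpriors, np.log(likelihoods)

# Made-up data: two positive documents and one negative document
toy_X = pd.DataFrame([[2, 0], [1, 1], [0, 3]],
                     index=['d1', 'd2', 'd3'], columns=['good', 'bad'])
toy_y = pd.Series([1, 1, 0], index=toy_X.index)

toy_logpriors, toy_loglik = train_bayes_groupby(toy_X, toy_y)
print(np.exp(toy_loglik))   # back to ordinary probabilities, easier to check by eye
```

With this layout, ```toy_loglik.loc[c]``` gives the same kind of vector that ```loglikelihoods[c]``` gives in the version above.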

## Now write a function that uses our Naive Bayes model to make predictions about test_X

Here's a reminder of the function we need:

![pseudocode for the test function](pseudocode_test.png)

This becomes quite simple when you realize that there's no need to do a loop across "positions." We've already counted words, and adding ```loglikelihood[word]``` to the running total each time you see a word is the same as adding ```loglikelihood[word] * occurrences_of_word``` once.

Well, each row in ```test_X``` holds the number of occurrences of each word. So we need to multiply each entry in that row by the corresponding entry in the loglikelihood vector, and then sum the results. In pandas, the multiplication symbol automatically does elementwise multiplication.

In [181]:
vector_example = pd.Series([0,1,2,3,4], index = ['a', 'b', 'c', 'd', 'e'])
vector_example

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [182]:
vector_example * 2

a    0
b    2
c    4
d    6
e    8
dtype: int64

In [204]:
vector_example * vector_example    # note elementwise multiplication

a     0
b     1
c     4
d     9
e    16
dtype: int64
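
Multiplying elementwise and then summing is exactly a dot product. Here's a small sketch with invented numbers showing that the two ways of writing it agree.

```python
import numpy as np
import pandas as pd

# Hypothetical counts for one document and log likelihoods for one class
word_counts_row = pd.Series([2, 0, 1], index=['great', 'boring', 'plot'])
toy_loglik = np.log(pd.Series([0.5, 0.2, 0.3], index=['great', 'boring', 'plot']))

multiply_then_sum = np.sum(toy_loglik * word_counts_row)   # the form we'll use below
dot_product = word_counts_row.dot(toy_loglik)              # the same thing in one call

print(multiply_then_sum, dot_product)
```

Notice that pandas matches the two Series up by index label, so each count is paired with the log likelihood for the same word.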

In [186]:
def test_bayes(logpriors, loglikelihoods, test_X):
    '''
    Tests a model that is passed in as "logpriors" and
    "loglikelihoods" against the data passed in as test_X,
    and returns a vector of predictions.
    '''

    predictions = []
    
    for idx in test_X.index:
        class_probabilities = dict()
        for c in [0, 1]:
            probability_sum = logpriors[c]
            this_row = test_X.loc[idx, : ]
            probability_sum = probability_sum + np.sum(loglikelihoods[c] * this_row) 
            class_probabilities[c] = probability_sum
            
        if class_probabilities[1] > class_probabilities[0]:   # this is the argmax part
            predictions.append(1)                          # we choose the class with the larger probability
        else:
            predictions.append(0)
    
    return predictions
        

In [187]:
predictions = test_bayes(logpriors, loglikelihoods, test_X)

In [188]:
sum(predictions == test_y) / len(predictions)

0.83
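
Since each document's score is a dot product, we could in principle score every document and class at once with a matrix multiplication, and skip both loops in ```test_bayes```. Here's a hedged sketch; the function name and toy data are hypothetical, and it assumes the model has the same shape that ```train_bayes``` returns (a Series of log priors and a dictionary of log likelihood vectors).

```python
import numpy as np
import pandas as pd

def predict_vectorized(logpriors, loglikelihoods, X):
    """Hypothetical matrix version of test_bayes."""
    loglik_frame = pd.DataFrame(loglikelihoods)   # rows are words, columns are classes
    scores = X.dot(loglik_frame) + logpriors      # one row per document, one column per class
    return scores.idxmax(axis=1)                  # argmax: the class with the largest score

# Made-up model and test documents
toy_logpriors = np.log(pd.Series({0: 0.5, 1: 0.5}))
toy_loglik = {0: np.log(pd.Series({'good': 0.2, 'bad': 0.8})),
              1: np.log(pd.Series({'good': 0.7, 'bad': 0.3}))}
toy_test = pd.DataFrame([[3, 0], [0, 2]],
                        index=['d1', 'd2'], columns=['good', 'bad'])

toy_predictions = predict_vectorized(toy_logpriors, toy_loglik, toy_test)
print(toy_predictions)
```

The loop version is easier to read against the pseudocode, which is why we wrote it that way first. The matrix version does the same arithmetic.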