## Naive Bayes Lab
Nov 12, 2018

Your names:

Optional 2-min response (complete at end of lab): one thing you learned/reinforced, one (or more) questions you have:

### Part 1
Use pen(cil) and paper for this part, and enter your responses in this notebook

We will build a naive Bayes sentiment classifier for movie reviews using words as features and we will employ add-1 smoothing. Sentiment labels are positive (+) and negative (-). Our corpus has five training instances and one test instance:

Training Set:

    - just plain boring 
    - entirely predictable and lacks energy
    - no surprises and very few laughs 
    + very powerful 
    + the most fun film of the summer 

Test Set:

    ? predictable no fun 

In a naive Bayes model with words as features, the most probable class *c* for a test instance is:
<img src="formula.png" alt="formula" style="width: 300px;"/>

**1.1** What class labels are in the set C?

positive (+) and negative (-)

**1.2** For the test sentence, what is w ?

w ranges over the 3 words of the test sentence: predictable, no, fun

**1.3** Compute the prior for the two classes + and -.

P( 'pos' ) = 0.4 = 2/5
P( 'neg' ) = 0.6 = 3/5
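
As a quick check on the hand computation, the priors can be sketched in a few lines of Python. The names `train_labels` and `priors` are chosen here for illustration and are not part of the lab template.

```python
from collections import Counter
from fractions import Fraction

# Class label of each of the five training reviews, in corpus order.
train_labels = ['neg', 'neg', 'neg', 'pos', 'pos']

# Prior of a class = number of reviews with that label / total number of reviews.
label_counts = Counter(train_labels)
priors = {label: Fraction(n, len(train_labels))
          for label, n in label_counts.items()}

for label, p in priors.items():
    print(label, p, float(p))
```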

**1.4** Now compute the likelihoods for each word given the class (leave in the form of fractions). Remember to add 1 to the count of each word given each class, and to add the vocabulary size |V| to each denominator.


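With add-1 smoothing, P( w | c ) = (count(w, c) + 1) / (N_c + |V|), where N_c is the number of word tokens in class c (9 for 'pos', 14 for 'neg') and |V| = 20 is the vocabulary size.

P( predictable | 'pos' ) = (0+1)/(9+20) = 1/29
P( no | 'pos' ) = (0+1)/(9+20) = 1/29
P( fun | 'pos' ) = (1+1)/(9+20) = 2/29

P( predictable | 'neg' ) = (1+1)/(14+20) = 2/34
P( no | 'neg' ) = (1+1)/(14+20) = 2/34
P( fun | 'neg' ) = (0+1)/(14+20) = 1/34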
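
The add-1 likelihoods can also be checked programmatically. The following is a minimal sketch, assuming the usual multinomial estimate (count + 1) / (tokens in class + |V|); the names `training`, `counts`, `vocab`, and `likelihood` are made up for this example. Note that `Fraction` reduces its result, so 2/34 prints as 1/17.

```python
from collections import Counter
from fractions import Fraction

training = [
    ('just plain boring', 'neg'),
    ('entirely predictable and lacks energy', 'neg'),
    ('no surprises and very few laughs', 'neg'),
    ('very powerful', 'pos'),
    ('the most fun film of the summer', 'pos'),
]

# Token counts per class, and the vocabulary shared by both classes.
counts = {'neg': Counter(), 'pos': Counter()}
for text, label in training:
    counts[label].update(text.split())
vocab = set(counts['neg']) | set(counts['pos'])

def likelihood(word, label):
    """Add-1 estimate: (count(word, label) + 1) / (tokens in label + |V|)."""
    total_tokens = sum(counts[label].values())
    return Fraction(counts[label][word] + 1, total_tokens + len(vocab))

for label in ('pos', 'neg'):
    for word in 'predictable no fun'.split():
        print("P( {} | '{}' ) = {}".format(word, label, likelihood(word, label)))
```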

**1.5** Now compute whether the model would predict the sentence in the test set to be of class positive or negative (okay to leave in fractions or to use a calculator).


For each class, multiply the prior by the likelihoods of the three test words (a product, not a sum):

pos = 2/5 × 1/29 × 1/29 × 2/29 ≈ 3.3 × 10^-5
neg = 3/5 × 2/34 × 2/34 × 1/34 ≈ 6.1 × 10^-5 (bigger)

negative
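
The full decision rule (prior times the product of the word likelihoods, then pick the larger score) can be sketched as follows. This is an illustrative sketch rather than the lab's reference solution; `score`, `scores`, and `prediction` are names chosen for this example, and `math.prod` requires Python 3.8 or later.

```python
from collections import Counter
from fractions import Fraction
from math import prod

training = [
    ('just plain boring', 'neg'),
    ('entirely predictable and lacks energy', 'neg'),
    ('no surprises and very few laughs', 'neg'),
    ('very powerful', 'pos'),
    ('the most fun film of the summer', 'pos'),
]
test_words = 'predictable no fun'.split()

# Token counts per class and the shared vocabulary.
counts = {}
for text, label in training:
    counts.setdefault(label, Counter()).update(text.split())
vocab = {w for c in counts.values() for w in c}

def score(label):
    """P(label) times the product of add-1 likelihoods of the test words."""
    n_label_docs = sum(1 for _, l in training if l == label)
    prior = Fraction(n_label_docs, len(training))
    total_tokens = sum(counts[label].values())
    likelihoods = [Fraction(counts[label][w] + 1, total_tokens + len(vocab))
                   for w in test_words]
    return prior * prod(likelihoods)

scores = {label: score(label) for label in counts}
prediction = max(scores, key=scores.get)

for label, s in scores.items():
    print(label, s, float(s))
print('prediction:', prediction)
```

In practice the product is usually replaced by a sum of log probabilities to avoid underflow on longer documents; with only three words, exact fractions are easier to compare against the pencil-and-paper result.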

**1.6** What would the answer be without add-1 smoothing?

In [None]:
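# without smoothing both products are 0: P( fun | 'neg' ) = 0/14, and
# P( predictable | 'pos' ) = P( no | 'pos' ) = 0/9, so the two classes tie at 0
# and the model has no basis for choosing between them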
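
To see why smoothing matters here, one can list which test words have a raw count of zero in each class; any such word drives the unsmoothed product for that class to zero. A small sketch, with `zero_count` as a name invented for this example:

```python
from collections import Counter

training = [
    ('just plain boring', 'neg'),
    ('entirely predictable and lacks energy', 'neg'),
    ('no surprises and very few laughs', 'neg'),
    ('very powerful', 'pos'),
    ('the most fun film of the summer', 'pos'),
]
test_words = 'predictable no fun'.split()

# Raw (unsmoothed) token counts per class.
counts = {}
for text, label in training:
    counts.setdefault(label, Counter()).update(text.split())

# Test words never seen in a class get likelihood 0 without smoothing,
# which makes the whole product for that class 0.
zero_count = {label: [w for w in test_words if counts[label][w] == 0]
              for label in counts}

for label, unseen in zero_count.items():
    print(label, 'has zero count for:', unseen)
```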

### Part 2

Use NLTK to train a naive Bayes classifier.

In [15]:
import random
import nltk

In [16]:
# create a list of words for each sentence in the training set
# the first one is provided
sent1 = 'just plain boring'.split()
sent2 = 'entirely predictable and lacks energy'.split()
sent3 = 'no surprises and very few laughs'.split()
sent4 = 'very powerful'.split()
sent5 = 'the most fun film of the summer'.split()

In [17]:
# make a variable called 'words' that appends all five lists of words
words = []
words.extend(sent1)
words.extend(sent2)
words.extend(sent3)
words.extend(sent4)
words.extend(sent5)

print(words)

['just', 'plain', 'boring', 'entirely', 'predictable', 'and', 'lacks', 'energy', 'no', 'surprises', 'and', 'very', 'few', 'laughs', 'very', 'powerful', 'the', 'most', 'fun', 'film', 'of', 'the', 'summer']


In [18]:
# make a (word list, label) tuple for each review in the training set
# the first one is provided
rev1 = (sent1, 'neg')
rev2 = (sent2, 'neg')
rev3 = (sent3, 'neg')
rev4 = (sent4, 'pos')
rev5 = (sent5, 'pos')

In [19]:
# make a list called 'revs' that contains the five tuples
revs = [rev1, rev2, rev3, rev4, rev5]

In [25]:
# run this, inspect the output, make sure you understand the output
all_word_freqs = nltk.FreqDist(w.lower() for w in words)
print(all_word_freqs)
# for w in words:
#     print(w)

<FreqDist with 20 samples and 23 outcomes>
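
The summary line above reports 20 distinct word types ("samples") and 23 word tokens ("outcomes"). In NLTK 3, `FreqDist` is built on `collections.Counter`, so the same numbers can be reproduced with the standard library alone. This sketch uses the names `freqs` and `repeated`, which are not part of the lab template:

```python
from collections import Counter

words = ('just plain boring entirely predictable and lacks energy '
         'no surprises and very few laughs very powerful '
         'the most fun film of the summer').split()

freqs = Counter(w.lower() for w in words)

n_types = len(freqs)             # distinct word types ("samples")
n_tokens = sum(freqs.values())   # total word tokens ("outcomes")
repeated = [w for w, n in freqs.items() if n > 1]  # words that occur more than once

print(n_types, n_tokens, repeated)
```

The three-token gap between the two numbers comes from the three words that each appear twice in the training data.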


In [29]:
# this function takes a document (as a list of words) and returns one boolean
# feature per word in the training data: True if the word occurs in the document
document = 'predictable no fun'.split()
def document_features(document):
    document_words = set(document)
    features = {}
    for word in all_word_freqs:
        features['contains({})'.format(word)] = (word in document_words)
    return features

print(document_features(document))

{'contains(just)': False, 'contains(plain)': False, 'contains(boring)': False, 'contains(entirely)': False, 'contains(predictable)': True, 'contains(and)': False, 'contains(lacks)': False, 'contains(energy)': False, 'contains(no)': True, 'contains(surprises)': False, 'contains(very)': False, 'contains(few)': False, 'contains(laughs)': False, 'contains(powerful)': False, 'contains(the)': False, 'contains(most)': False, 'contains(fun)': True, 'contains(film)': False, 'contains(of)': False, 'contains(summer)': False}


What do you expect document_features to return when you pass in the test document ('predictable no fun')?

In [None]:
# enter answer here
# it returns True for contains(predictable), contains(no), and contains(fun),
# since all three words occur in the training data, and False for the other 17 features

In [None]:
# now call document_features on the test document, print the output
test_document = 'predictable no fun'.split()
print(document_features(test_document))
# the output is the same dictionary printed by the cell above: only
# contains(predictable), contains(no), and contains(fun) are True

In [39]:
# if you have completed the steps above, this code cell should output the 
# same class that you computed in Part 1
train_set = [(document_features(d), c) for (d,c) in revs]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.classify(document_features('predictable no fun'.split()))

'neg'
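
The NLTK classifier agrees with the Part 1 label, but it is not the same model: it works on the boolean `contains(...)` features (including the words that are absent from the document), and `NaiveBayesClassifier.train` by default smooths with `ELEProbDist` (adding 0.5) rather than add-1. To look at the class probabilities behind the label, `prob_classify` can be used. The sketch below rebuilds the training data so that it runs on its own; `vocab`, `test_features`, and `dist` are names chosen for this example.

```python
import nltk

revs = [
    ('just plain boring'.split(), 'neg'),
    ('entirely predictable and lacks energy'.split(), 'neg'),
    ('no surprises and very few laughs'.split(), 'neg'),
    ('very powerful'.split(), 'pos'),
    ('the most fun film of the summer'.split(), 'pos'),
]
vocab = sorted({w for words, _ in revs for w in words})

def document_features(document):
    # One boolean feature per vocabulary word: does the document contain it?
    document_words = set(document)
    return {'contains({})'.format(w): (w in document_words) for w in vocab}

train_set = [(document_features(d), c) for d, c in revs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

test_features = document_features('predictable no fun'.split())
dist = classifier.prob_classify(test_features)  # probability distribution over labels

for label in dist.samples():
    print(label, round(dist.prob(label), 3))
print('prediction:', dist.max())
```

The probabilities printed here are normalized over the two classes, so they will not equal the unnormalized scores computed by hand in Part 1.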

In [53]:
# copy the last line of the above cell and try out other test sentences
# classifier.classify(document_features('i hate this'.split()))
# classifier.classify(document_features('beautiful'.split()))
classifier.classify(document_features('fun and powerful'.split()))
# classifier.classify(document_features('fun in summer'.split()))

'pos'

Does the behavior match your expectations?

Only partly. It matches my expectations for negative sentences but not always for positive ones.
I believe this might be because we have more training data for the negative class (three reviews) than for the positive class (two).
When a sentence uses words that appeared in the positive training data, it is classified correctly, but other positive sentences are not.

If time remains, try creating new train and test sets, or see https://www.nltk.org/book/ch06.html section 1.3, to explore creating a naive Bayes model using NLTK's movie review data set. It has 2,000 labeled reviews (1,000 positive and 1,000 negative).