Lecture 7:  Naive Bayes Scavenger Hunt
===============

10/2/2023, CS 4120 Natural Language Processing, Muzny

Today, you'll be investigating three Naive Bayes classifiers that have already been built and trained for you. Your mission is to determine which classifier is which. They are all binary classifiers.


- one of these classifiers is an authorship attributor (the two labels **do** correspond to two specific authors)
- one of these classifiers is a language identifier (the two labels **do** correspond to two specific languages)
- one of these classifiers is a sentiment analyser (the two labels **do** correspond to positive and negative)

Remember, for a given new, unlabeled document, they will calculate:

$$ P(\text{feature}_1, \text{feature}_2, \text{feature}_3, \ldots, \text{feature}_n \mid c)\,P(c) $$

Here the features for a document are a "bag of words" and $c$ is a candidate class. Each classifier then selects the class for which this product is largest as the label of the new document.
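
The selection step can be sketched in a few lines. This is a minimal illustration, not how the provided classifiers are implemented: the priors and per-word likelihoods below are made-up numbers, and `log_score` and `classify` are hypothetical helper names. Under the naive independence assumption the joint likelihood factors into a product of per-word terms, and summing logs instead of multiplying probabilities avoids numerical underflow on long documents.

```python
import math

# Hypothetical parameters (NOT taken from the lecture's classifiers).
priors = {'0': 0.5, '1': 0.5}
likelihoods = {
    '0': {'good': 0.1, 'bad': 0.4},
    '1': {'good': 0.4, 'bad': 0.1},
}


def log_score(words, label):
    """log P(c) + sum_i log P(word_i | c); words with no estimate are skipped."""
    score = math.log(priors[label])
    for word in words:
        if word in likelihoods[label]:
            score += math.log(likelihoods[label][word])
    return score


def classify(words):
    """Pick the class whose (log) score is largest."""
    return max(priors, key=lambda label: log_score(words, label))


print(classify("good good bad".split()))
```

With these numbers, class '1' scores 0.5 × 0.4 × 0.4 × 0.1 = 0.008 against 0.5 × 0.1 × 0.1 × 0.4 = 0.002 for class '0', so '1' is selected.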


Supporting files
-------------------
(Download these from Canvas)
1. `classifier1.pickle`
2. `classifier2.pickle`
3. `classifier3.pickle`


Task 1: Which Classifier is Which?
-------------------------
We have given you 3 Naïve Bayes classifiers. All three of these are binary classifiers that choose between the label '0' or '1' (these are strings). __They all also use a bag-of-words as features.__


Your first job is to conduct experiments to determine two things:
1. Which classifier is which?
2. What specific classes do you believe that they are choosing between? (what are better labels for each classifier than '0' and '1'?)
    1. Note: this is a __difficult__ task, especially for authorship attribution. It is of utmost importance that you consider the particular data set that they might have been trained on. They were all trained using some of [nltk's available corpora](http://www.nltk.org/nltk_data/).
        1. For authorship attribution, try to determine the style of text that the two classes are looking for, but don't spend more than 5 - 10 minutes on this task. :)
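
One way to organize these experiments is to run a small batch of probe sentences, one per hypothesis, through every classifier and compare the results side by side. The sketch below is only a suggestion: `probe` is a hypothetical helper, the probe sentences are made up, and `CoinFlipStub` is a stand-in so the example runs without the pickle files. In the notebook you would pass the loaded `classifiers` list instead.

```python
def word_feats(words):
    """Bag-of-words features: every word maps to True."""
    return {word: True for word in words}


def probe(classifiers, sentences):
    """Return (sentence, [(label, P('1')) for each classifier]) for every sentence."""
    rows = []
    for sentence in sentences:
        feats = word_feats(sentence.split())
        results = []
        for clf in classifiers:
            dist = clf.prob_classify(feats)
            results.append((clf.classify(feats), round(dist.prob('1'), 3)))
        rows.append((sentence, results))
    return rows


class CoinFlipStub:
    """Hypothetical stand-in for a loaded classifier (always 50/50, label '0')."""

    class _Dist:
        def prob(self, label):
            return 0.5

    def prob_classify(self, feats):
        return self._Dist()

    def classify(self, feats):
        return '0'


# Made-up probes: sentiment-laden, non-English, and stylistically marked text.
probes = [
    "what a wonderful , delightful film",
    "a dull and terrible waste of time",
    "el perro come en la casa",
    "thou art a villain",
]
for sentence, results in probe([CoinFlipStub()], probes):
    print(sentence, results)
```

If the pickled objects are nltk `NaiveBayesClassifier` instances (an assumption; the notebook does not say so explicitly), calling `show_most_informative_features(10)` on each one may also reveal which words drive each label.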

In [1]:
# load your trained classifiers from pickled files
# (we've already trained your classifiers for you)
import pickle
#import nltk  # not necessary, but you can uncomment if you want

# add more imports here as you would like

In [2]:

def word_feats(words):
    """
    This function converts a list of words so that they are featurized
    for nltk's format for bag-of-words
    Parameters:
    words - list of words where each element is a single word 
    Returns: dict mapping every word to True
    """
    return {word: True for word in words}

# context managers close each file even if unpickling raises an error
with open('classifier1.pickle', 'rb') as f:
    classifier1 = pickle.load(f)

with open('classifier2.pickle', 'rb') as f:
    classifier2 = pickle.load(f)

with open('classifier3.pickle', 'rb') as f:
    classifier3 = pickle.load(f)

# in a list, if you find that helpful to use
classifiers = [classifier1, classifier2, classifier3]

In [3]:
# Here's an example of how to run a test sentence through the classifiers
# edit at your leisure
test = "this is a test sentence"
# you can either split on whitespace or use nltk's word_tokenize
featurized = word_feats(test.split()) 

for classifier in classifiers:
    dist = classifier.prob_classify(featurized)  # probability distribution over the labels
    print(dist.samples())  # will tell you what classes are available
    print(dist.prob('0'))  # get the probability for class '0'
    print(dist.prob('1'))  # get the probability for class '1'
    print(classifier.classify(featurized))  # just get the label that it wants to assign
    print()

dict_keys(['0', '1'])
0.6325082240556184
0.3674917759443814
0

dict_keys(['1', '0'])
0.9999999621751425
3.7824855426585115e-08
0

dict_keys(['0', '1'])
0.867841315914037
0.1321586840859627
0



In [28]:
# TODO: put in as many experiments as you'd like here (and feel free to add more cells as needed)
# we recommend testing a variety of sentences. You can make these up or get them from sources
# on the web
sentence = "Hamlet"
# you can either split on whitespace or use nltk's word_tokenize
featurized = word_feats(sentence.split()) 

for classifier in classifiers:
    dist = classifier.prob_classify(featurized)  # probability distribution over the labels
    print(dist.samples())  # will tell you what classes are available
    print(dist.prob('0'))  # get the probability for class '0'
    print(dist.prob('1'))  # get the probability for class '1'
    print(classifier.classify(featurized))  # just get the label that it wants to assign
    print()


dict_keys(['0', '1'])
0.5
0.5
1

dict_keys(['1', '0'])
0.47432905484247373
0.5256709451575262
1

dict_keys(['0', '1'])
0.05203965097074141
0.9479603490292587
1



Answer the questions outlined at the beginning of this task here:

1. Which classifier is which?
    1. classifier1 is Sentiment
    1. classifier2 is Language
    1. classifier3 is Authorship
2. What specific classes do you believe that they are choosing between?
    1. classifier1's '0' label should be Negative and its '1' label should be Positive
    1. classifier2's '0' label should be English and its '1' label should be Spanish
    1. classifier3's '0' label should be Shakespeare and its '1' label should be Jane Austen

Answer the following questions about Naïve Bayes classifiers in general:

1. If a Naïve Bayes classifier for sentiment was trained on a certain corpus (hand-labeled sentences from Shakespeare's plays, for instance) using BoW as features, but then evaluated on a test set of IMDB movie reviews, what do you think its performance might be? __YOUR ANSWER HERE__

2. Justify your answer by comparing/contrasting with other possible test sets that you might evaluate this classifier against. __YOUR ANSWER HERE__

Task 2: Calculating Naive Bayes Probabilities
----

Given the following training data, and assuming that you are using a bag of words as your features, what is the value of $P(c = 0 | x = \texttt{I have two dogs and one fluffy cat})$? (Don't take the argmax here; this is the innards of equation 4.8/4.9 from the text.)


Use a multinomial (in terms of features) Naive Bayes classifier with Laplace smoothing.
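
As a reminder, Laplace (add-one) smoothing estimates each word likelihood as $P(w \mid c) = \frac{\text{count}(w, c) + 1}{\sum_{w'} \text{count}(w', c) + |V|}$, where $|V|$ is the size of the training vocabulary. The sketch below applies it to a made-up toy corpus that is deliberately different from the exercise data. The helper names are hypothetical, and dropping test words that never appear in training is an assumption (a common convention), not something this notebook specifies.

```python
from collections import Counter


def laplace_likelihood(word, class_words, vocab):
    """Add-one smoothed P(word | class) for a multinomial model."""
    counts = Counter(class_words)
    return (counts[word] + 1) / (len(class_words) + len(vocab))


def class_score(doc_words, class_words, vocab, prior):
    """P(c) times the product of smoothed P(w | c) over in-vocabulary words."""
    score = prior
    for word in doc_words:
        if word in vocab:  # assumption: words unseen in training are ignored
            score *= laplace_likelihood(word, class_words, vocab)
    return score


# Hypothetical toy corpus (not the exercise's training data).
train = {'a': "red red blue".split(), 'b': "blue green".split()}
vocab = set(train['a']) | set(train['b'])  # {'red', 'blue', 'green'}

print(laplace_likelihood("red", train['a'], vocab))    # (2 + 1) / (3 + 3)
print(laplace_likelihood("green", train['a'], vocab))  # (0 + 1) / (3 + 3)
print(class_score("red purple".split(), train['a'], vocab, prior=0.5))
```

Note that "green" never occurs in class 'a' yet still receives a nonzero likelihood, which is the point of smoothing: a single unseen word no longer drives the whole product to zero.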

In [None]:
# calculate the size of your vocabulary here
# you'll also need the number of words for each class
# (use the len() function to get this)
words_0 = "cats are good that is one fuzzy cat a fuzzy cat is not a fluffy dog".split()
words_1 = "dogs are happy I like my fluffy dog".split()


In [None]:
# just do the math by hand in this cell

In [None]:
# now we'll do it in a more programmatic fashion
x = "I have two dogs and one fluffy cat".split()

# calculate the priors from the training data in this cell
N_0 = 3
N_1 = 2

# YOUR CODE HERE


print("Probability of class 0 for:", x)
# print out the probability here

In [None]:
# now calculate the probability of class 1 as well



In [None]:
# verify that you can change the text of x to whatever 
# you want to get new (correct) probabilities


# bonus challenge: make this into a function!
