# Supervised learning
### Experimental set-up and Naive Bayes for Polarity Detection

CSI4106 Artificial Intelligence  
Fall 2018  
Caroline Barrière

***


The supervised classification task tackled in this notebook is **polarity detection**, which is one possible activity within the quite popular trend of *Opinion Mining* in AI.  Many companies want to know whether there are positive or negative reviews about them.  Reviews can be on hotels, restaurants, movies, customer service of any kind, etc.

This notebook will help you better understand an ***experimental set-up*** for supervised machine learning, including the notions of training set, test set, evaluation, and bias.  The notebook also introduces the notion of comparative evaluation.  To say whether a method is good or not, we often compare it to a *baseline* approach.  

This notebook makes use of a really nice and popular machine learning package, called **scikit-learn** (http://scikit-learn.org/stable/).  It contains many pre-coded machine learning algorithms which you can call.  To use this package, you must install it.  At the command prompt, type *pip install scikit-learn* to install the package (in code, it is imported under the name *sklearn*).  

In this notebook we will use the Naive Bayes implementation, but we will probably explore other ML algorithms included in scikit-learn in future notebooks.

***


***HOMEWORK***:  
Go through the notebook by running each cell, one at a time. Look for (**TO DO**) for the tasks that you need to perform.  
Make sure you *sign* (type your name) the notebook at the end. Once you're done, submit your notebook.

***


**1. Polarity detection**  

In polarity detection, we use two classes: positive and negative.  This is different from sentiment analysis, for example, in which the classes might be (sad, happy, anxious, angry, etc.).  It's also more restricted than *rating*, in which we would like to assign a value (0..5) to evaluate a particular service.  So, the polarity detection task aims to assign either *positive* or *negative* to a statement.

**2. Application domain:  Movie reviews**  

Polarity detection could be used on reviews of anything.  In this notebook, we wish to apply polarity detection within the domain of movies.  Movie viewers are asked for comments about movies.  For our experimental set-up, we will first build a *test set* of reviews.  We manually determine the polarity of these reviews.  We **SHOULD NOT** use this test set to build our model later on.

In [1]:
# let's establish a few test sentences. 
# For each sentence, we have a corresponding tag: "Neg" or "Pos"

test_reviews = ["Can't believe I wasted my time on this", 
                "Really awful",
                "Actors were good",
                "So boring",
                "Stayed at the edge of my seat",
                "I never like any movie",
                "Argghhhh I hated it"]

test_tags = ["Neg", "Neg", "Pos", "Neg", "Pos", "Neg", "Neg"]

**3. Bias:  Available resources**  

For polarity detection, some researchers have established lists of positive and negative words.  The ones used in this notebook have been downloaded from [here](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) (a website on Opinion Mining by renowned researcher Bing Liu) and stored locally.  The files *positive-words.txt* and *negative-words.txt* are in the Jupyter Notebook module in Brightspace.  Make sure you place these files in the same directory as your notebook; otherwise, specify the path to the files on your disk in the code below.

As we discussed in class, using any external resource introduces somewhat of a *bias* in the study of a problem, although in this particular case the lists themselves have been compiled from data by other researchers.

In [2]:
# Read the positive words

with open("positive-words.txt") as f:
    posWords = f.readlines()
# keep lines that start with a letter (skips header and blank lines); strip() removes the newline
posWords = [p.strip() for p in posWords if p[0].isalpha()] 

# print the first 50 words
print(posWords[:50])

['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation', 'accolade', 'accolades', 'accommodative', 'accomodative', 'accomplish', 'accomplished', 'accomplishment', 'accomplishments', 'accurate', 'accurately', 'achievable', 'achievement', 'achievements', 'achievible', 'acumen', 'adaptable', 'adaptive', 'adequate', 'adjustable', 'admirable', 'admirably', 'admiration', 'admire', 'admirer', 'admiring', 'admiringly', 'adorable', 'adore', 'adored', 'adorer', 'adoring', 'adoringly', 'adroit', 'adroitly', 'adulate', 'adulation', 'adulatory', 'advanced', 'advantage', 'advantageous']


In [5]:
# Read the negative words

with open("negative-words.txt", encoding = "ISO-8859-1") as f:
    negWords = f.readlines()
# keep lines that start with a letter (skips header and blank lines); strip() removes the newline
negWords = [p.strip() for p in negWords if p[0].isalpha()] 

print(negWords[:50])

['abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted', 'aborts', 'abrade', 'abrasive', 'abrupt', 'abruptly', 'abscond', 'absence', 'absent-minded', 'absentee', 'absurd', 'absurdity', 'absurdly', 'absurdness', 'abuse', 'abused', 'abuses', 'abusive', 'abysmal', 'abysmally', 'abyss', 'accidental', 'accost', 'accursed', 'accusation', 'accusations', 'accuse', 'accuses', 'accusing', 'accusingly', 'acerbate', 'acerbic', 'acerbically', 'ache', 'ached', 'aches', 'achey', 'aching', 'acrid', 'acridly', 'acridness', 'acrimonious', 'acrimoniously']
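
The two loading cells above follow the same pattern, so they could be folded into one helper.  What follows is only a minimal sketch: `load_lexicon` is a hypothetical name (not part of the notebook), it assumes the same one-word-per-line files and the same filter as above (keep lines that start with a letter), and the demo reads a tiny stand-in file rather than the real lexicon.

```python
import os
import tempfile


def load_lexicon(path, encoding="ISO-8859-1"):
    """Return the set of words found in a one-word-per-line lexicon file.

    Hypothetical helper: like the cells above, it keeps only the lines
    that start with a letter, which skips header and blank lines.
    """
    words = set()
    with open(path, encoding=encoding) as f:
        for line in f:
            word = line.strip()              # drops the newline and stray spaces
            if word and word[0].isalpha():
                words.add(word)
    return words


# Tiny stand-in file so that the sketch runs without the real lexicon.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="ISO-8859-1") as tmp:
    tmp.write("; header comment\n\na+\nabound\nabounds\n")
    demo_path = tmp.name

demo_words = load_lexicon(demo_path)
os.remove(demo_path)
print(sorted(demo_words))   # ['a+', 'abound', 'abounds']
```

Returning a set rather than a list makes the later `word in lexicon` tests faster, which matters with lexicons of several thousand words.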


**4. Baseline approach**  

Before we evaluate the performances of a supervised learning approach, we can start by establishing a very simple baseline approach.  It's always good to start simple.  A baseline allows us to measure whether the additional complexity of the various models we develop is worth it or not.

The *baseline algorithm* we will use simply counts the number of positive and negative words in the review and outputs the category corresponding to the maximum (in case of a tie, it outputs *Neg*).  This approach DOES NOT LEARN anything.  It just uses a particular *reasoning* (strategy at test time).  You might be surprised to find out how many *AI start-ups* within the area of Opinion Mining use this kind of simple approach.  

In [6]:
# first let's define methods to count positive and negative words

def countPos(text):
    count = 0
    for t in text.split():
        if t in posWords:
            count += 1
    return count

def countNeg(text):
    count = 0
    for t in text.split():
        if t in negWords:
            count += 1
    return count

In [7]:
# simple counting algorithm as baseline approach to polarity detection
def baselinePolarity(review):
    numPos = countPos(review)
    numNeg = countNeg(review)
    if numPos > numNeg:
        return "Pos"   
    else:
        return "Neg"   

In [8]:
# Test the baseline method
print(baselinePolarity("This was a really good movie"))

Pos
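
Note that the counting functions compare raw, whitespace-separated tokens with the lexicon, so "Good" or "good!" would not match the entry "good".  Here is a hedged sketch of a variant that lowercases the review and strips punctuation around each token.  The two small lexicons and the names `tokens` and `baseline_polarity` are stand-ins invented for this illustration; the decision rule is the same as in `baselinePolarity`.

```python
import string

# Tiny stand-in lexicons (assumption: the real ones come from the files above).
POS_WORDS = {"good", "great", "amazing"}
NEG_WORDS = {"boring", "awful", "stupid"}


def tokens(text):
    """Lowercase the text and strip punctuation around each token."""
    return [t.strip(string.punctuation) for t in text.lower().split()]


def baseline_polarity(review):
    """Same rule as baselinePolarity: "Pos" only if positives outnumber negatives."""
    toks = tokens(review)
    num_pos = sum(t in POS_WORDS for t in toks)
    num_neg = sum(t in NEG_WORDS for t in toks)
    return "Pos" if num_pos > num_neg else "Neg"


print(baseline_polarity("Really GOOD movie!"))   # Pos
print(baseline_polarity("So boring..."))         # Neg
```

One trade-off: stripping punctuation also changes tokens such as "a+", which is an entry in the positive list, so this normalization is not free.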


**5. Evaluation**  
We saw in class that there could be multiple ways of evaluating an algorithm.  In the case of classification, a common evaluation method is simply to count the *number of wrong choices*.

To test our *baseline algorithm*, we use the test set defined earlier and count the number of wrong assignments.

In [15]:
# Let's establish the polarity for each review

nbWrong = 0
for i in range(len(test_reviews)):
    polarity = baselinePolarity(test_reviews[i])
    print(test_reviews[i] + " -- " + polarity)
    if (polarity != test_tags[i]):
        nbWrong += 1

print('\nThere are %s wrong assignments' %nbWrong)    

Can't believe I wasted my time on this -- Neg
Really awful -- Neg
Actors were good -- Pos
So boring -- Neg
Stayed at the edge of my seat -- Neg
I never like any movie -- Pos
Argghhhh I hated it -- Neg

There are 2 wrong assignments
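
The number of wrong assignments is easier to compare across test sets of different sizes once it is turned into a proportion.  As a small sketch, the code below recounts the errors from the output above and derives the accuracy; `count_wrong` is a hypothetical helper name, and the predictions are copied by hand from the printed output.

```python
test_tags = ["Neg", "Neg", "Pos", "Neg", "Pos", "Neg", "Neg"]
# Baseline predictions, copied from the output above.
baseline_predictions = ["Neg", "Neg", "Pos", "Neg", "Neg", "Pos", "Neg"]


def count_wrong(gold, predicted):
    """Number of positions where the prediction differs from the gold tag."""
    return sum(g != p for g, p in zip(gold, predicted))


nb_wrong = count_wrong(test_tags, baseline_predictions)
accuracy = 1 - nb_wrong / len(test_tags)
print(f"{nb_wrong} wrong out of {len(test_tags)}, accuracy = {accuracy:.2f}")
# 2 wrong out of 7, accuracy = 0.71
```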


**(TO DO - Q1)** Make a small test set with another 6 sentences of your choice.  Then rerun the baseline polarity algorithm, and determine how many wrong assignments it makes.  For the number of wrong assignments, you can simply copy the code from the previous cell and slightly modify it.

In [23]:
# new test set
my_reviews = ["favourite movie ever",
             "2 hours of my life I'm never getting back",
             "So funny",
             "Mind blown",
             "the movie was dope",
             "stupid movie"]
my_tags = ["Pos","Neg","Pos","Pos","Pos","Neg"]

# test baseline polarity algorithm
print(baselinePolarity("favourite movie ever!"))

nbWrong1 = 0
for i in range(len(my_reviews)):
    polarity = baselinePolarity(my_reviews[i])
    print(my_reviews[i] + " -- " + polarity)
    if (polarity != my_tags[i]):
        nbWrong1 += 1

print('\nThere are %s wrong assignments' %nbWrong1) 

Neg
favourite movie ever -- Neg
2 hours of my life I'm never getting back -- Neg
So funny -- Neg
Mind blown -- Neg
the movie was dope -- Neg
stupid movie -- Neg

There are 4 wrong assignments


#### 6. Supervised learning method

We will now train a supervised learning model for polarity detection.  If you haven't done it already, you will need to install the package **scikit-learn** (imported as *sklearn*), using *pip install scikit-learn*.  

***6.1 Training data***  

In supervised learning, we need training data.  This training data must be *different* from, but *representative* of, the eventual test data.

Let's define a *training set*.  Usually a training set should be as large and varied as possible.  Training sets are very valuable, but they are costly to obtain, as they require tagging (human annotation) to generate them.

In [24]:
# small training set... normally we require hundreds of sentences

train_reviews = ["this movie was stupid, so stupid",
                  "I hated this movie",
                  "I loved it",
                  "What a waste of time",
                  "Amazing!",
                  "What a pity",
                  "Very good movie indeed.  Glad I saw it."]

train_tags = ["Neg", "Neg", "Pos", "Neg", "Pos", "Neg", "Pos"]

***6.2 Pre-processing of input data*** 

This Machine Learning package, *scikit-learn*, is somewhat particular in the way the data must be formatted to be used by the training algorithms.  So, we must perform some preprocessing on the sentences above.  Luckily *scikit-learn* provides some pre-defined functions for doing text pre-processing.  

Each sentence is transformed into a set of word counts, in which each word is referred to by its index in a dictionary.  The dictionary is built from the words in the sentences: its keys are the words, and each value is the index assigned to that word.

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

# The CountVectorizer builds a dictionary of all words (count_vect.vocabulary_), 
# and generates a sparse matrix (train_counts) that represents each sentence
# by the counts of its words, each word being identified by its index in the dictionary.
# The words in the dictionary are the words found in train_reviews.

count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_reviews)

To understand what the code above does, first let's print the vocabulary gathered from the sentences in train_reviews.  

In [26]:
# print the vocabulary (dictionary of words)
print(count_vect.vocabulary_)

{'this': 13, 'movie': 7, 'was': 16, 'stupid': 12, 'so': 11, 'hated': 3, 'loved': 6, 'it': 5, 'what': 18, 'waste': 17, 'of': 8, 'time': 14, 'amazing': 0, 'pity': 9, 'very': 15, 'good': 2, 'indeed': 4, 'glad': 1, 'saw': 10}


You can interpret the output above as: 

'saw':10  to mean that the word 'saw' has been assigned index 10  
'time':14 to mean that the word 'time' has been assigned index 14
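
To see where these indices come from, here is a hand-rolled sketch that rebuilds the same dictionary in plain Python.  It assumes the CountVectorizer defaults (lowercasing, tokens of at least two word characters, indices assigned in alphabetical order); it is an illustration of the idea, not the library's own code.

```python
import re

train_reviews = ["this movie was stupid, so stupid",
                 "I hated this movie",
                 "I loved it",
                 "What a waste of time",
                 "Amazing!",
                 "What a pity",
                 "Very good movie indeed.  Glad I saw it."]

# Assumption: mimic the default tokenization (two or more word characters).
TOKEN = re.compile(r"\b\w\w+\b")

words = set()
for review in train_reviews:
    words.update(TOKEN.findall(review.lower()))

# Indices follow the alphabetical order of the words.
vocabulary = {word: index for index, word in enumerate(sorted(words))}

print(vocabulary["saw"], vocabulary["time"])   # 10 14
print("i" in vocabulary, "a" in vocabulary)    # False False
```

The second line of output shows why one-letter words such as "I" and "a" are absent from the vocabulary printed above.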

Then, let's print the *train_counts*.  

In [27]:
# print the content of the training examples in terms of frequency of words (each word represented by its index)
print(train_counts)

  (0, 11)	1
  (0, 12)	2
  (0, 16)	1
  (0, 7)	1
  (0, 13)	1
  (1, 3)	1
  (1, 7)	1
  (1, 13)	1
  (2, 5)	1
  (2, 6)	1
  (3, 14)	1
  (3, 8)	1
  (3, 17)	1
  (3, 18)	1
  (4, 0)	1
  (5, 9)	1
  (5, 18)	1
  (6, 10)	1
  (6, 1)	1
  (6, 4)	1
  (6, 2)	1
  (6, 15)	1
  (6, 5)	1
  (6, 7)	1


You can interpret each line above as:  

(0, 11) 1  -- sentence 0 (in train_reviews) contains word 11 once (11 is the index, in count_vect.vocabulary_, of the word 'so')  
(0, 12) 2  -- sentence 0 (in train_reviews) contains word 12 twice (12 is the index, in count_vect.vocabulary_, of the word 'stupid')  

So train_counts contains, for each sentence, the bag of words (BOW) associated with that sentence, in the form of a list of (index, count) pairs, each index corresponding to a word.
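
As a quick check of this reading, the sketch below inverts the vocabulary (index to word) and decodes the entries of sentence 0.  The vocabulary and the row are copied by hand from the outputs above; `index_to_word` and `row0` are names introduced only for this illustration.

```python
# Vocabulary copied from the output of count_vect.vocabulary_ above.
vocabulary = {'this': 13, 'movie': 7, 'was': 16, 'stupid': 12, 'so': 11,
              'hated': 3, 'loved': 6, 'it': 5, 'what': 18, 'waste': 17,
              'of': 8, 'time': 14, 'amazing': 0, 'pity': 9, 'very': 15,
              'good': 2, 'indeed': 4, 'glad': 1, 'saw': 10}

# Invert the dictionary: index -> word.
index_to_word = {index: word for word, index in vocabulary.items()}

# Entries of sentence 0, copied from the printed train_counts: {index: count}.
row0 = {11: 1, 12: 2, 16: 1, 7: 1, 13: 1}

decoded = {index_to_word[index]: count for index, count in row0.items()}
print(decoded)
# {'so': 1, 'stupid': 2, 'was': 1, 'movie': 1, 'this': 1}
```

This matches sentence 0, "this movie was stupid, so stupid", with the word order lost, which is exactly what a bag of words is.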

***6.3 Naive Bayes learning***

With the data preprocessed, we are ready to try the Naive Bayes algorithm provided by scikit-learn.  That algorithm requires the training data to be represented as word counts (*train_counts*), which is why we did the pre-processing above.

It's as easy as calling *fit*, as you see below, to train the model.  But you know what's underneath!  It estimates prior probabilities for the classes (Neg, Pos) and class-conditional probabilities (likelihoods) of the words (features) given each class, e.g. P(stupid|Pos) or P(stupid|Neg).  All these probabilities are combined through Bayes' theorem to obtain the posterior probability of each class given a review.  

**(TO-DO - Q2)** What are the prior probabilities of the Pos and Neg classes using the training set above.  (this does not require coding, just do it manually).

Answers  

P(Pos) = 3/7   (number of Pos docs / total number of docs)  
P(Neg) = 4/7   (number of Neg docs / total number of docs)
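
The manual answer can be double-checked with a few lines of Python.  This is only a verification sketch that counts the tags of the training set defined earlier and uses exact fractions.

```python
from collections import Counter
from fractions import Fraction

train_tags = ["Neg", "Neg", "Pos", "Neg", "Pos", "Neg", "Pos"]

# Prior of a class = number of training docs of that class / total number of docs.
tag_counts = Counter(train_tags)
priors = {tag: Fraction(n, len(train_tags)) for tag, n in tag_counts.items()}

print(priors["Pos"], priors["Neg"])   # 3/7 4/7
```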

In [28]:
# Test of a naive bayes algorithm, the "fit" is the training
from sklearn.naive_bayes import MultinomialNB

# Training the model
clf = MultinomialNB().fit(train_counts, train_tags)   
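
To make "what's underneath" concrete, here is a hand computation of two of the word probabilities that the training step estimates.  It is a sketch under stated assumptions: add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of MultinomialNB, and the default CountVectorizer tokenization.  The helper `likelihood` is a name made up for this illustration, not a scikit-learn function.

```python
import re
from collections import Counter
from fractions import Fraction

train_reviews = ["this movie was stupid, so stupid",
                 "I hated this movie",
                 "I loved it",
                 "What a waste of time",
                 "Amazing!",
                 "What a pity",
                 "Very good movie indeed.  Glad I saw it."]
train_tags = ["Neg", "Neg", "Pos", "Neg", "Pos", "Neg", "Pos"]

TOKEN = re.compile(r"\b\w\w+\b")   # assumed tokenization (two or more word characters)

# Word counts per class, plus the overall vocabulary.
counts = {"Pos": Counter(), "Neg": Counter()}
for review, tag in zip(train_reviews, train_tags):
    counts[tag].update(TOKEN.findall(review.lower()))
vocabulary = set(counts["Pos"]) | set(counts["Neg"])


def likelihood(word, tag, alpha=1):
    """P(word | tag) with add-alpha smoothing over the vocabulary."""
    total = sum(counts[tag].values())
    return Fraction(counts[tag][word] + alpha, total + alpha * len(vocabulary))


print(likelihood("stupid", "Neg"))   # 3/34
print(likelihood("stupid", "Pos"))   # 1/29
```

The word 'stupid' occurs twice among the 15 word tokens of the Neg reviews and never among the 10 tokens of the Pos reviews; with 19 words in the vocabulary, smoothing gives (2+1)/(15+19) and (0+1)/(10+19).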

***6.4 Evaluation***

Let's first look at how the model performs on the training set, on which it learned.  To apply the model for classification (prediction), we use the *predict* method below.

In [29]:
# Testing on training set
predicted = clf.predict(train_counts)
for doc, category in zip(train_reviews, predicted):   # zip allows to go through two lists simultaneously
    print('%r => %s' % (doc, category))

'this movie was stupid, so stupid' => Neg
'I hated this movie' => Neg
'I loved it' => Pos
'What a waste of time' => Neg
'Amazing!' => Pos
'What a pity' => Neg
'Very good movie indeed.  Glad I saw it.' => Pos


Unsurprisingly, on the training set we get 100%, no mistakes....  But we should test on a real **test set**.  We have two test sets, one that I created above (test_reviews) and the one that you created (my_reviews).  

**(TO_DO - Q3)** Test the trained model on these two test sets.  Write the code below to do so.  Before testing, each test set must be transformed through the preprocessing steps, so their format is compatible with the learner.

In [44]:
# Pre-process test set test_reviews
test_reviews_counts = count_vect.transform(test_reviews)
# predict the results
testPredicted = clf.predict(test_reviews_counts)
# print the results
for doc, category in zip(test_reviews, testPredicted):   # zip allows to go through two lists simultaneously
    print('%r => %s' % (doc, category))


# Pre-process the test set my_reviews
my_reviews_counts = count_vect.transform(my_reviews)

# predict the results
myPredicted = clf.predict(my_reviews_counts)
# print the results
for doc, category in zip(my_reviews, myPredicted):   # zip allows to go through two lists simultaneously
    print('%r => %s' % (doc, category))

"Can't believe I wasted my time on this" => Neg
'Really awful' => Neg
'Actors were good' => Pos
'So boring' => Neg
'Stayed at the edge of my seat' => Neg
'I never like any movie' => Neg
'Argghhhh I hated it' => Pos
'favourite movie ever' => Neg
"2 hours of my life I'm never getting back" => Neg
'So funny' => Neg
'Mind blown' => Neg
'the movie was dope' => Neg
'stupid movie' => Neg
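
Some of these predictions look odd at first, for instance 'Argghhhh I hated it' => Pos.  The sketch below redoes the Naive Bayes computation by hand to show where that comes from.  It assumes add-one smoothing and the default tokenization, and it skips words that are not in the training vocabulary (as `transform` does); `log_score` and `predict` are hypothetical helpers, not scikit-learn calls.

```python
import math
import re
from collections import Counter

train_reviews = ["this movie was stupid, so stupid",
                 "I hated this movie",
                 "I loved it",
                 "What a waste of time",
                 "Amazing!",
                 "What a pity",
                 "Very good movie indeed.  Glad I saw it."]
train_tags = ["Neg", "Neg", "Pos", "Neg", "Pos", "Neg", "Pos"]

TOKEN = re.compile(r"\b\w\w+\b")   # assumed tokenization

counts = {tag: Counter() for tag in set(train_tags)}
for review, tag in zip(train_reviews, train_tags):
    counts[tag].update(TOKEN.findall(review.lower()))
vocabulary = set().union(*counts.values())
priors = {tag: train_tags.count(tag) / len(train_tags) for tag in counts}


def log_score(review, tag):
    """log P(tag) + sum of log P(word | tag), with add-one smoothing."""
    total = sum(counts[tag].values())
    score = math.log(priors[tag])
    for word in TOKEN.findall(review.lower()):
        if word in vocabulary:   # out-of-vocabulary words contribute nothing
            score += math.log((counts[tag][word] + 1) / (total + len(vocabulary)))
    return score


def predict(review):
    """Class with the highest score."""
    return max(sorted(counts), key=lambda tag: log_score(review, tag))


print(predict("Argghhhh I hated it"))   # Pos
print(predict("the movie was dope"))    # Neg
```

In 'Argghhhh I hated it', the word 'hated' points to Neg, but 'it' occurs twice in the Pos training reviews and never in the Neg ones, which outweighs it.  In 'the movie was dope', only 'movie' and 'was' are in the vocabulary, and both are more frequent in the Neg reviews.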


**(TO_DO - Q4)** A common **Evaluation Measure**, related to the *number of wrong choices*, is called **Recall**.  Recall is the number of good assignments for a particular class, divided by the number of test examples of that class.  For example, if the test set contains 5 positive examples and the algorithm only found 2, then the recall for the class Pos is 2/5.  Write a small method below that will calculate a class's recall.  It will receive three parameters: (1) the set of correct tags (e.g. (Pos, Neg, Pos)), (2) the predictions (e.g. (Pos, Pos, Neg)) and (3) the class of interest (e.g. Pos).  It will return the recall of that class (e.g. 50%).

In [56]:
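# Recall of one class: correct predictions for that class / test examples of that class
def recall(goodTags, predictions, classOfInterest):
    count = 0   # examples of the class that were predicted correctly
    div = 0     # examples of the class in the test set
    for i in range(len(goodTags)):
        if goodTags[i] == classOfInterest:
            div += 1
            if goodTags[i] == predictions[i]:
                count += 1
    # Avoid a division by zero when the class does not occur in goodTags
    return count / div if div > 0 else 0.0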
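
For comparison, here is a more compact sketch of the same measure, checked against the example given in the question (tags (Pos, Neg, Pos), predictions (Pos, Pos, Neg), class Pos, expected recall 50%).  The name `class_recall` is used only to avoid clashing with the method above, and returning 0.0 for a class that is absent from the test set is an assumption of this sketch.

```python
def class_recall(gold_tags, predictions, class_of_interest):
    """Fraction of the examples of class_of_interest that were predicted correctly."""
    # Predictions made for the examples whose gold tag is the class of interest.
    relevant = [p for g, p in zip(gold_tags, predictions) if g == class_of_interest]
    if not relevant:
        return 0.0   # assumption: report 0 when the class is absent from the test set
    return sum(p == class_of_interest for p in relevant) / len(relevant)


print(class_recall(["Pos", "Neg", "Pos"], ["Pos", "Pos", "Neg"], "Pos"))   # 0.5
```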
   

**(TO DO - Q5)** Use the recall method to calculate the recall on the two test sets.  Print those recalls.

In [57]:
# recall
print(recall(test_tags, testPredicted, "Pos"))
print(recall(test_tags, testPredicted, "Neg"))
print(recall(my_tags,myPredicted,"Pos"))
print(recall(my_tags,myPredicted,"Neg"))


0.5
0.8
0.0
1.0


#### 7. Discussion

**(TO DO - Q6)** Is the Naive Bayes approach performing much better than the baseline approach?  Present and discuss the results below.  Give two suggestions to help the Naive Bayes approach within the context of our experiment of polarity detection for movie reviews.

Comparative results: Baseline: 2 wrong out of 7 (test_reviews), 4 wrong out of 6 (my_reviews). NB: 2 wrong out of 7 (test_reviews), 4 wrong out of 6 (my_reviews). The results didn't improve using NB. With such a small data set it is hard for the NB method to learn. 

Three suggestions:  
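1) Have a bigger training and testing data set. With more data, the estimated probabilities of a word being positive or negative will be more reliable.

2) Allow a neutral option. Sometimes a review has both positive and negative aspects, or the reviewer is simply giving a neutral review. If the positive and negative probabilities are equal (or close), the review can be labelled neutral.

3) Handle unseen words, for example with a smoothing technique such as Laplace (add-one) smoothing. I used words such as "dope" that do not appear in the training features. Without smoothing, a word never seen with a class gets a zero probability, which makes the whole product of probabilities equal 0; adding 1 to all the feature counts avoids this. Note that scikit-learn's MultinomialNB already applies add-one smoothing by default (alpha=1.0), and that CountVectorizer.transform ignores words outside the training vocabulary, so here "dope" is simply dropped; enlarging the training vocabulary is what would help with such words.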


**(TO-DO - OPTIONAL)**  We only used two classes, Positive and Negative.  But it would be nice to also introduce a Neutral class for when the review is neither positive nor negative, just neutral.  Perform a new experimental set-up in which you train and test the Naive Bayes classifier with 3 values.  You can also change the baseline classifier to predict three values.  If you do this optional task, do it below; do not change the cells above.
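
As a possible starting point for the baseline part of this task, here is a hedged sketch of a three-way counting baseline.  It assumes that a tie between positive and negative counts (including zero of each) is what "neutral" means; the small lexicons and the name `baseline_polarity3` are stand-ins invented for the illustration.

```python
# Tiny stand-in lexicons (assumption: replace with posWords / negWords from above).
POS_WORDS = {"good", "great", "amazing"}
NEG_WORDS = {"boring", "awful", "stupid"}


def baseline_polarity3(review):
    """Counting baseline with three outcomes: "Pos", "Neg" or "Neutral"."""
    toks = review.lower().split()
    num_pos = sum(t in POS_WORDS for t in toks)
    num_neg = sum(t in NEG_WORDS for t in toks)
    if num_pos > num_neg:
        return "Pos"
    if num_neg > num_pos:
        return "Neg"
    return "Neutral"   # tie, including reviews with no lexicon word at all


print(baseline_polarity3("good actors but boring plot"))   # Neutral
print(baseline_polarity3("a great movie"))                 # Pos
print(baseline_polarity3("so stupid"))                     # Neg
```

For the Naive Bayes part, the training set would also need reviews tagged as neutral, since the classifier can only predict classes it has seen during training.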

#### Signature

I, Felix Singerman, declare that the answers provided in this notebook are my own.