# CSC421 Assignment 3 - Part II Naive Bayes Classification (5 points) #
### Author: George Tzanetakis 

This notebook is based on the supporting material for topics covered in **Chapter 13 Quantifying Uncertainty** and **Chapter 20 - Statistical Learning Method** from the book *Artificial Intelligence: A Modern Approach.* This part does NOT rely on the provided code, so you can complete it using just basic Python. 

```
Misunderstanding of probability may be the greatest of all impediments
to scientific literacy.

Gould, Stephen Jay
```



# Introduction 


Text categorization is the task of assigning a given document to one of a fixed set of categories on the basis of the text it contains. Naive Bayes models are often used for this task. In these models, the query variable is the document category, and the effect variables are the presence/absence of each word in the language; the assumption is that words occur independently in documents within a given category (conditional independence), with frequencies determined by the document category. Download the following file: http://www.cs.cornell.edu/People/pabo/movie-review-data/review_polarity.tar.gz containing a dataset of movie reviews, classified as negative or positive, that has been used for text mining. You will see that there are two folders, one for the positive and one for the negative category, and each contains multiple text files with the reviews. You can find more information about the dataset at: 
http://www.cs.cornell.edu/People/pabo/movie-review-data/


Our goal will be to build a simple Naive Bayes classifier for this dataset. More complicated approaches using term frequency and inverse document frequency weighting and many more words are possible, but the basic concepts are the same. The goal is to understand the whole process, so DO NOT use existing machine learning packages; build the classifier from scratch.

Our feature vector representation for each text file will simply be a binary vector that shows which of the following words are present in the text file: Awful, Bad, Boring, Dull, Effective, Enjoyable, Great, Hilarious. For example, the text file cv996_11592.txt would be represented as (0, 0, 0, 0, 1, 0, 1, 0) because it contains Effective and Great but none of the other words.
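
As an illustration of this representation, here is a minimal sketch that turns a piece of text into the binary vector. The names `KEYWORDS` and `to_feature_vector` and the sample sentence are made up for this example, and it assumes words are separated by whitespace.

```python
KEYWORDS = ["awful", "bad", "boring", "dull",
            "effective", "enjoyable", "great", "hilarious"]

def to_feature_vector(text, keywords=KEYWORDS):
    """Return a 0/1 list: 1 if the keyword occurs as a whitespace-separated token."""
    tokens = set(text.lower().split())  # lowercase so "Great" matches "great"
    return [int(word in tokens) for word in keywords]

# Made-up text, not an actual review from the dataset.
sample = "an effective thriller with a great cast"
print(to_feature_vector(sample))  # [0, 0, 0, 0, 1, 0, 1, 0]
```

A word counts at most once, however often it appears, because only presence or absence is recorded.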

# Question 2A (Minimum) (CSC421 - 1 point, CSC581C - 0 points) 

Write code that parses the text files and calculates the probabilities for each dictionary word given the review polarity.

In [10]:
# YOUR CODE GOES HERE 
from os import listdir
import numpy as np

def parseFiles():
    keyWords = ["awful", "bad","boring","dull","effective","enjoyable","great","hilarious"]
    input_neg = "./review_polarity/txt_sentoken/neg/"
    input_pos = "./review_polarity/txt_sentoken/pos/"
    
    neg_names = [(input_neg+f) for f in listdir(input_neg)]
    pos_names = [(input_pos+f) for f in listdir(input_pos)]
    
    num_neg = len(neg_names)
    num_pos = len(pos_names)
    
    neg_polarities = np.zeros((num_neg,len(keyWords)))
    pos_polarities = np.zeros((num_pos,len(keyWords)))
    
    
    for n in range(0,num_neg):
        with open(neg_names[n],'r') as f:
            words = f.read().split()
            words = [w.lower() for w in words] # change every word to lowercase for easier parsing
        neg_polarities[n,:] = [int(w in words) for w in keyWords]
    
    for n in range(0,num_pos):
        with open(pos_names[n],'r') as f:
            words = f.read().split()
            words = [w.lower() for w in words] # change every word to lowercase for easier parsing
        pos_polarities[n,:] = [int(w in words) for w in keyWords]
    
    
    probGivenNeg = np.divide(np.sum(neg_polarities,0,int),num_neg)
    probGivenPos = np.divide(np.sum(pos_polarities,0,int),num_pos)
    
    probPos = num_pos / (num_pos + num_neg)
    probNeg = num_neg / (num_pos + num_neg)
    
    return probGivenNeg, probGivenPos, probNeg, probPos, neg_polarities, pos_polarities
    

probGivenNeg, probGivenPos, probNeg, probPos, neg_polarities, pos_polarities = parseFiles()
print("Probability of each word given a positive review: " + str(probGivenPos))
print("Probability of each word given a negative review: " + str(probGivenNeg))

Probability of each word given a positive review: [0.019 0.255 0.048 0.023 0.12  0.095 0.408 0.125]
Probability of each word given a negative review: [0.101 0.505 0.169 0.091 0.046 0.053 0.286 0.05 ]


# Question 2B (Minimum) (CSC421 - 1 point, CSC581C - 0 points) 


Explain how the probability estimates for each dictionary word given the review polarity can be combined to form a Naive Bayes classifier. You can look up the Bernoulli Bayes model for this simple model, where only the presence/absence of a word is modeled.

Your answer should be a description of the process with equations and a specific example as markdown text, NOT Python code. You will write the code in the next question. 

**\# YOUR MARKDOWN TEXT GOES HERE**

With Naive Bayes, we are assuming attributes are conditionally independent given the class value. As such, naive Bayes tells us  

\begin{align}
P(c_1|e_1,e_2,...,e_n) = \alpha P(e_1|c_1)...P(e_n|c_1)P(c_1)
\end{align}  

Where $E=e_1,e_2,...,e_n$ are the evidence values (one per feature), and $C=c_1,c_2,...,c_m$ are the classes.  

In our case the classes are the types of reviews, i.e. C={pos,neg}. The evidence consists of one presence/absence value for each of the 8 dictionary words, i.e. E={Awful, Bad, Boring, Dull, Effective, Enjoyable, Great, Hilarious}.  
  
If we have the probability estimates for each dictionary word given the review polarity, e.g. P(Awful | pos), we can build a Naive Bayes classifier based on the information above.
  
We can calculate the probability of each word occurring given the polarity directly from our data, as well as the prior probability of a review being positive or negative. Given a new review to classify, we first determine whether each dictionary word occurs in it, which builds our evidence. We then use the following:  
\begin{align}
P(c|E) &= P(c|e_1,e_2,...,e_n)\\
&= \alpha P(c,e_1,e_2,...,e_n)\\
&= \alpha P(e_1,e_2,...,e_n|c)P(c)\\
&= \alpha P(e_1|c)P(e_2|c)...P(e_n|c)P(c)
\end{align} 
  
We would calculate this for $c=pos$ and $c=neg$, and the higher probability indicates the polarity assigned to the review. Note that $\alpha = 1/P(e_1,e_2,...,e_n)$ is just a normalization factor that is the same for both classes. We can calculate it to get a probability in [0,1] for each class; however, comparing the unnormalized probabilities directly gives the same classification.  
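
Because each $e_i$ is binary, each factor can be written out explicitly. As a sketch, writing $p_{i,c}$ (a symbol introduced here for convenience) for the estimated probability that word $i$ is present in a review of class $c$:

\begin{align}
P(e_i|c) = p_{i,c}^{e_i}(1-p_{i,c})^{1-e_i}
\end{align}

So a word that is present ($e_i=1$) contributes $p_{i,c}$, and a word that is absent ($e_i=0$) contributes $1-p_{i,c}$.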
  
To give a concrete example, suppose the evidence for a review gives the following: $[Awful, Bad, Boring, Dull, Effective, Enjoyable, Great, Hilarious] = [1,1,0,1,0,0,0,1]$  

We want to know if this review is positive or negative.  
First, we will calculate $P(pos | e_1,e_2,...,e_8)$. This will give $P(pos | e_1,e_2,...,e_8) = \alpha P(e_1|pos)P(e_2|pos)...P(e_8|pos)P(pos)$.  

Next, we will calculate $P(neg | e_1,e_2,...,e_8)$. This will give $P(neg | e_1,e_2,...,e_8) = \alpha P(e_1|neg)P(e_2|neg)...P(e_8|neg)P(neg)$.  

In both cases, we will get a value multiplied by $\alpha$ (we don't know $\alpha$ in advance). If we want $\alpha$, we add the two unnormalized values and take the reciprocal of the sum, since the two posteriors must sum to 1. The class with the higher value is the more probable classification.
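
To make the example concrete, the following sketch evaluates both products for the evidence vector above, using the rounded probabilities printed in the Question 2A output. The equal priors of 0.5 are an assumption (they hold when the dataset has as many positive as negative reviews), and the helper name `unnormalized_score` is made up for this illustration.

```python
import numpy as np

# Word-presence probabilities as printed in the Question 2A output (rounded).
p_pos = np.array([0.019, 0.255, 0.048, 0.023, 0.12, 0.095, 0.408, 0.125])
p_neg = np.array([0.101, 0.505, 0.169, 0.091, 0.046, 0.053, 0.286, 0.05])
prior_pos = prior_neg = 0.5  # assumption: equally many positive and negative reviews

# [Awful, Bad, Boring, Dull, Effective, Enjoyable, Great, Hilarious]
evidence = np.array([1, 1, 0, 1, 0, 0, 0, 1])

def unnormalized_score(evidence, p_word, prior):
    """prior * product of p for present words and (1 - p) for absent words."""
    factors = np.where(evidence == 1, p_word, 1.0 - p_word)
    return prior * factors.prod()

score_pos = unnormalized_score(evidence, p_pos, prior_pos)
score_neg = unnormalized_score(evidence, p_neg, prior_neg)
alpha = 1.0 / (score_pos + score_neg)  # normalization factor

print("P(pos | e) =", alpha * score_pos)
print("P(neg | e) =", alpha * score_neg)
print("prediction:", "pos" if score_pos > score_neg else "neg")
```

With these rounded estimates the negative score is the larger of the two, so this evidence vector would be classified as negative.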

# Question 2C (Expected) 1 point 

Write Python code for classifying a particular test instance (in our case a movie review) following a Bernoulli Bayes approach. Your code should calculate the likelihood that the review is positive given the corresponding conditional probabilities for each dictionary word, as well as the likelihood that the review is negative given the corresponding conditional probabilities for each dictionary word. Check that your code works by providing a few example predictions. Your code should be written from "scratch" and only use numpy/scipy, not machine learning libraries like scikit-learn or tensorflow. 


In [8]:
# YOUR CODE GOES HERE 

"""
Need to classify a particular test instance following a Bernoulli Bayes 
approach.

-calculate 
(1) the likelihood the review is pos given the corresponding 
conditional probs for each dictionary word as well as 
(2) the likelihood the review is negative given the corresponding 
conditional probs for each dictionary word.
- check that your code works by providing a few examples for prediction.
"""

def calcLikelihood(evidence, probGivenNeg, probGivenPos, probNeg, probPos):
    
    posLikelihood = 1
    negLikelihood = 1
    for n in range(0,len(evidence)):
        if evidence[n] == 1:
            posLikelihood *= probGivenPos[n]
            negLikelihood *= probGivenNeg[n]
        else:
            posLikelihood *= (1-probGivenPos[n])
            negLikelihood *= (1-probGivenNeg[n])
    posLikelihood *= probPos
    negLikelihood *= probNeg
    
    if posLikelihood > negLikelihood:
        classifier = "pos"
    elif negLikelihood > posLikelihood:
        classifier = "neg"
    else:
        classifier = "equal"
    
    return classifier

probGivenNeg, probGivenPos, probNeg, probPos, neg_polarities, pos_polarities = parseFiles()
keyWords = ["awful", "bad","boring","dull","effective","enjoyable","great","hilarious"]

review1 = "this is an awful and boring movie."
words = review1.split()
evidence = [int(w in words) for w in keyWords]
classifier = calcLikelihood(evidence, probGivenNeg, probGivenPos, probNeg, probPos)
print("review 1: " + classifier)

review2 = "enjoyable movie with great humour. was a bit dull at times however."
words = review2.split()
evidence = [int(w in words) for w in keyWords]
classifier = calcLikelihood(evidence, probGivenNeg, probGivenPos, probNeg, probPos)
print("review 2: " + classifier)



review 1: neg
review 2: pos
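
The same decision can also be made by summing log-probabilities instead of multiplying probabilities, which avoids numerical underflow once the dictionary grows beyond a handful of words. The sketch below is an alternative formulation, not the code used above; `classify_log` is a hypothetical name, and it assumes every word probability lies strictly between 0 and 1 (otherwise a logarithm of zero would occur).

```python
import numpy as np

def classify_log(evidence, p_given_neg, p_given_pos, prior_neg, prior_pos):
    """Bernoulli Naive Bayes decision computed with log-probabilities."""
    e = np.asarray(evidence, dtype=float)
    p_neg = np.asarray(p_given_neg, dtype=float)
    p_pos = np.asarray(p_given_pos, dtype=float)

    # log P(c) + sum_i [ e_i * log p_i + (1 - e_i) * log(1 - p_i) ]
    log_pos = np.log(prior_pos) + np.sum(e * np.log(p_pos) + (1 - e) * np.log1p(-p_pos))
    log_neg = np.log(prior_neg) + np.sum(e * np.log(p_neg) + (1 - e) * np.log1p(-p_neg))

    if log_pos > log_neg:
        return "pos"
    if log_neg > log_pos:
        return "neg"
    return "equal"

# Rounded probabilities as printed in the Question 2A output; equal priors assumed.
p_pos = [0.019, 0.255, 0.048, 0.023, 0.12, 0.095, 0.408, 0.125]
p_neg = [0.101, 0.505, 0.169, 0.091, 0.046, 0.053, 0.286, 0.05]

print(classify_log([1, 0, 1, 0, 0, 0, 0, 0], p_neg, p_pos, 0.5, 0.5))  # awful, boring
print(classify_log([0, 0, 0, 0, 1, 1, 1, 1], p_neg, p_pos, 0.5, 0.5))  # four positive words
```

Because the logarithm is monotonic, comparing the log scores gives the same label as comparing the products.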


# QUESTION 2D (Expected) 1 point

Calculate the classification accuracy and confusion matrix that you would obtain using the whole data set for both training and testing. Do not use machine learning libraries like scikit-learn or tensorflow for this; use only basic numpy/scipy functionality. 

In [6]:
# YOUR CODE GOES HERE

def calcClassificationAccuracy():
    probGivenNeg, probGivenPos, probNeg, probPos, neg_polarities, pos_polarities = parseFiles()
    keyWords = ["awful", "bad","boring","dull","effective","enjoyable","great","hilarious"]
    
#     loop through positive polarities and count correct/incorrect
    correctPos = 0
    incorrectPos = 0
    correctNeg = 0
    incorrectNeg = 0
    for n in range(0,len(pos_polarities)):
        classifier = calcLikelihood(pos_polarities[n,:], probGivenNeg, probGivenPos, probNeg, probPos)
        if classifier == "pos":
            correctPos += 1
        else:
            incorrectPos += 1

#     loop through negative 
    for n in range(0,len(neg_polarities)):
        classifier = calcLikelihood(neg_polarities[n,:], probGivenNeg, probGivenPos, probNeg, probPos)
        if classifier == "neg":
            correctNeg += 1
        else:
            incorrectNeg += 1
            
    accuracy = (correctPos + correctNeg) / (correctPos + incorrectPos + correctNeg + incorrectNeg)
    
    print("Accuracy: " + str(accuracy))
    
    print("%15s | %15s | %15s" % (" ","Pos Review", "Neg Review"))
    print("----------------|-----------------|----------------")
    print("%15s | %15d | %15d" % ("Pos Class",correctPos,incorrectNeg))
    print("%15s | %15d | %15d" % ("Neg Class",incorrectPos,correctNeg))

calcClassificationAccuracy()

Accuracy: 0.674
                |      Pos Review |      Neg Review
----------------|-----------------|----------------
      Pos Class |             756 |             408
      Neg Class |             244 |             592
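
As a quick consistency check, the reported accuracy can be recomputed from the confusion matrix itself: correct predictions sit on the diagonal. The sketch below hard-codes the four counts from the table above; the variable names are chosen for this illustration.

```python
import numpy as np

# Rows: predicted class (pos, neg); columns: actual review polarity (pos, neg).
confusion = np.array([[756, 408],
                      [244, 592]])

accuracy = np.trace(confusion) / confusion.sum()   # diagonal = correct predictions
reviews_per_polarity = confusion.sum(axis=0)       # column sums: actual pos, actual neg

print(accuracy)              # 0.674
print(reviews_per_polarity)  # [1000 1000]
```

The column sums confirm that each polarity contributes the same number of reviews, so the class priors are equal.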


# QUESTION 2E (Advanced) 1 point 

One can consider the Naive Bayes classifier a generative model that can generate binary feature vectors using the associated probabilities from the training data. The idea is similar to how we do direct sampling in Bayesian Networks and depends on generating random numbers from a discrete distribution. Describe how you would generate random movie reviews consisting solely of the words from the dictionary using your model. Show 5 examples of randomly generated positive reviews and 5 examples of randomly generated negative reviews. Each example should consist of a subset of the words in the dictionary. Hint: use probabilities to generate both the presence and absence of a word.

In [7]:
# YOUR CODE GOES HERE 

"""
Describe how you would generate random movie reviews consisting solely of the words 
from the dictionary using your model.

Ans: 
From the previous calculations, we have probabilities of a keyword being present 
for both positive and negative reviews. To determine the subset of words in a randomly
generated review, we generate a random number between 0 and 1 for each word, and if the 
word's probability is greater than the random number, we add the word to the sublist. 
If not, the word is absent. We do this using the probability of a word occurring in a 
positive or negative review, depending on the type of review we want.
"""

import random

def generateReview(reviewType):
    probGivenNeg, probGivenPos, probNeg, probPos, neg_polarities, pos_polarities = parseFiles()
    keyWords = ["awful", "bad","boring","dull","effective","enjoyable","great","hilarious"]
    
    rands = np.random.rand(len(keyWords))
    
    if reviewType == "pos":
        words = [w for (i,w) in enumerate(keyWords) if probGivenPos[i]>rands[i]]
    else:
        words = [w for (i,w) in enumerate(keyWords) if probGivenNeg[i]>rands[i]]
    random.shuffle(words)
    return " ".join(words)

i = 0
while i < 5:
    review = generateReview("pos")
    if review == "":  # want sentences with at least 1 word
        continue
    print("Positive review #" + str(i) + ": " + str(review))
    i += 1

print()
i = 0
while i < 5:
    review = generateReview("neg")
    if review == "":  # want sentences with at least 1 word
        continue
    print("Negative review #" + str(i) + ": " + str(review))
    i += 1

Positive review #0: great bad
Positive review #1: hilarious
Positive review #2: hilarious
Positive review #3: hilarious great
Positive review #4: great

Negative review #0: great effective enjoyable awful
Negative review #1: dull bad
Negative review #2: bad dull
Negative review #3: bad
Negative review #4: bad
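
One way to sanity-check this kind of sampler is to draw many vectors and confirm that the fraction containing each word approaches the probability it was generated from. The sketch below does this for the positive class using the rounded probabilities printed in the Question 2A output. The function name `sample_vectors`, the seed, and the sample size are arbitrary choices for this illustration.

```python
import numpy as np

KEYWORDS = ["awful", "bad", "boring", "dull",
            "effective", "enjoyable", "great", "hilarious"]
# Rounded P(word | pos) as printed in the Question 2A output.
p_pos = np.array([0.019, 0.255, 0.048, 0.023, 0.12, 0.095, 0.408, 0.125])

rng = np.random.default_rng(0)  # fixed seed (arbitrary) so the run is reproducible

def sample_vectors(p_word, n_samples, rng):
    """Draw n_samples binary vectors; column i is 1 with probability p_word[i]."""
    return (rng.random((n_samples, len(p_word))) < p_word).astype(int)

samples = sample_vectors(p_pos, 100_000, rng)
print(np.round(samples.mean(axis=0), 3))  # empirical frequencies, close to p_pos

# Turn one sampled vector into a "review" (may be empty if no word was drawn).
first = samples[0]
print(" ".join(w for w, present in zip(KEYWORDS, first) if present))
```

Unlike the loop above, this sketch keeps empty vectors; discarding them slightly raises the frequency of every word among the reviews that remain.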


# QUESTION 2F (ADVANCED) (CSC421 - 0 points, CSC581C - 2 points)

Check the associated README file and see what convention is used for the 10-fold cross-validation. Calculate the classification accuracy and confusion matrix using the recommended 10-fold cross-validation. Again do NOT use 
ML libraries such as scikit-learn or tensorflow and just use numpy/scipy. 

In [None]:
# YOUR CODE GOES HERE 
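
As a starting point, the general shape of k-fold cross-validation for this model can be sketched on synthetic data. Everything here is an assumption made for illustration: the function names, the two made-up features, and in particular the fold assignment, which should be replaced by the convention described in the README. Ties between the two classes are resolved in favour of the negative class by `argmax`.

```python
import numpy as np

def fit(X, y):
    """Estimate P(word present | class) and class priors; y uses 0 = neg, 1 = pos."""
    p_word = np.array([X[y == c].mean(axis=0) for c in (0, 1)])  # shape (2, n_words)
    priors = np.array([(y == c).mean() for c in (0, 1)])
    return p_word, priors

def predict(X, p_word, priors):
    """Return the class with the larger unnormalized posterior for each row of X."""
    scores = np.stack([
        priors[c] * np.where(X == 1, p_word[c], 1 - p_word[c]).prod(axis=1)
        for c in (0, 1)
    ], axis=1)
    return scores.argmax(axis=1)

def cross_validate(X, y, fold_ids):
    """Hold out each fold once. Confusion matrix: rows = predicted, columns = actual."""
    confusion = np.zeros((2, 2), dtype=int)
    for fold in np.unique(fold_ids):
        test = fold_ids == fold
        p_word, priors = fit(X[~test], y[~test])   # train on the other folds
        predicted = predict(X[test], p_word, priors)
        np.add.at(confusion, (predicted, y[test]), 1)
    return np.trace(confusion) / confusion.sum(), confusion

# Synthetic stand-in data (NOT the movie reviews).
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 500)
true_p = np.array([[0.6, 0.1], [0.1, 0.6]])       # made-up word probabilities per class
X = (rng.random((1000, 2)) < true_p[y]).astype(int)
fold_ids = np.tile(np.arange(10), 100)            # hypothetical fold assignment

accuracy, confusion = cross_validate(X, y, fold_ids)
print(accuracy)
print(confusion)
```

The key point is that the probabilities are re-estimated inside the loop from the training folds only, so no held-out review influences the model that classifies it.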