# CSC421 Assignment 3 - Part II Naive Bayes Classification (5 points) #
### Author: George Tzanetakis 

This notebook is based on the supporting material for topics covered in **Chapter 13 Quantifying Uncertainty** and **Chapter 20 - Statistical Learning Method** from the book *Artificial Intelligence: A Modern Approach.* This part does NOT rely on the provided code, so you can complete it using just basic Python. 

```
Misunderstanding of probability may be the greatest of all impediments
to scientific literacy.

Gould, Stephen Jay
```



# Introduction 


Text categorization is the task of assigning a given document to one of a fixed set of categories on the basis of the text it contains. Naive Bayes models are often used for this task. In these models, the query variable is the document category, and the effect variables are the presence/absence of each word in the language; the assumption is that words occur independently in documents within a given category (conditional independence), with frequencies determined by document category.

Download the following file: http://www.cs.cornell.edu/People/pabo/movie-review-data/review_polarity.tar.gz. It contains a dataset that has been used for text mining, consisting of movie reviews classified as negative or positive. You will see that there are two folders, one for the positive and one for the negative category, and each contains multiple text files with the reviews. You can find more information about the dataset at: 
http://www.cs.cornell.edu/People/pabo/movie-review-data/


Our goal will be to build a simple Naive Bayes classifier for this dataset. More complicated approaches using term frequency and inverse document frequency weighting and many more words are possible, but the basic concepts are the same. The goal is to understand the whole process, so DO NOT use existing machine learning packages; build the classifier from scratch.

Our feature vector representation for each text file will simply be a binary vector that shows which of the following words are present in the text file: Awful, Bad, Boring, Dull, Effective, Enjoyable, Great, Hilarious. For example, the text file cv996 11592.txt would be represented as (0, 0, 0, 0, 1, 0, 1, 0) because it contains Effective and Great but none of the other words.
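The representation described above can be sketched in a few lines. This is only an illustration under stated assumptions: `encode` is a hypothetical helper, the sample sentence is made up, and tokens are obtained by lowercasing and splitting on whitespace.

```python
# Dictionary words in the order used by the feature vector
words = ["awful", "bad", "boring", "dull",
         "effective", "enjoyable", "great", "hilarious"]

def encode(text):
    """Return a binary tuple marking which dictionary words occur in text."""
    tokens = set(text.lower().split())  # whitespace tokenization (assumption)
    return tuple(int(w in tokens) for w in words)

# Made-up sentence containing "effective" and "great" only
print(encode("an effective thriller with a great cast"))  # (0, 0, 0, 0, 1, 0, 1, 0)
```

Only presence matters here: a word that appears several times still contributes a single 1.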

# Question 2A (Minimum) (CSC421 - 1 point, CSC581C - 0 points) 

Write code that parses the text files and calculates the probabilities for each dictionary word given the review polarity.

In [82]:
import os
import numpy as np
from scipy import stats

In [46]:
dictionary = {
    "awful": 0,
    "bad": 1,
    "boring": 2, 
    "dull": 3, 
    "effective": 4,
    "enjoyable": 5,
    "great": 6,
    "hilarious": 7,
}
# YOUR CODE GOES HERE 

def encode_review(path):
    """Return the binary presence vector of the dictionary words for one review file."""
    encoding = np.zeros(len(dictionary), dtype=int)
    with open(path) as f:
        # Splits on single spaces only, so a token with a newline attached
        # to it does not match a dictionary key
        for word in f.read().split(" "):
            if word in dictionary:
                encoding[dictionary[word]] = 1
    return encoding

negative_reviews = ["reviews/neg/" + f_name for f_name in os.listdir("reviews/neg/")]
positive_reviews = ["reviews/pos/" + f_name for f_name in os.listdir("reviews/pos/")]

# One row per review, one column per dictionary word
negative_encodings = np.array([encode_review(path) for path in negative_reviews])
positive_encodings = np.array([encode_review(path) for path in positive_reviews])

# P(word present | class): fraction of the reviews in the class that contain the word
neg_probs = np.sum(negative_encodings, axis=0) / len(negative_reviews)
pos_probs = np.sum(positive_encodings, axis=0) / len(positive_reviews)

print("negative probs",neg_probs)
print("positive probs",pos_probs)


negative probs [0.099 0.503 0.166 0.09  0.046 0.053 0.282 0.048]
positive probs [0.019 0.254 0.048 0.023 0.12  0.094 0.405 0.125]


# Question 2B (Minimum) (CSC421 - 1 point, CSC581C - 0 points) 


Explain how the probability estimates for each dictionary word given the review polarity can be combined to form a Naive Bayes classifier. You can look up Bernoulli Bayes model for this simple model where only presence/absence of a word is modeled.

Your answer should be a description of the process with equations and a specific example as markdown text, NOT Python code. You will write the code in the next question. 

## YOUR MARKDOWN TEXT GOES HERE 
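The Naive Bayes classifier takes the binary encoding vector X = (x_1, x_2, ..., x_8) of a given text file and scores each class. By Bayes' rule and the conditional independence assumption, P(C=pos|X) is proportional to P(C=pos) * P(x_1 | C=pos) * P(x_2 | C=pos) * ... * P(x_8 | C=pos), and likewise for C=neg. In the Bernoulli model, P(x_i | C) is the estimated probability that word i is present when x_i = 1, and one minus that probability when x_i = 0. The dataset has the same number of positive and negative reviews, so the priors are equal and can be dropped when comparing; the predicted class is the one with the larger product. The word probabilities can be retrieved from the models I defined above, which are represented as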
```Python
neg_probs=[0.099, 0.503, 0.166, 0.09,  0.046, 0.053, 0.282, 0.048]
```
In this notation, `neg_probs[i-1]` = P(x_i = 1 | C=neg), because Python indexes from 0; `pos_probs` is defined the same way for C=pos.

#### Example
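X = (1,0,1,1,0,0,0,1)

Words that are present contribute their probability, and words that are absent contribute one minus their probability. With equal priors, each posterior is proportional to the product:

P(C=pos|X) ∝ pos_probs[0] * (1 - pos_probs[1]) * pos_probs[2] * pos_probs[3] * (1 - pos_probs[4]) * (1 - pos_probs[5]) * (1 - pos_probs[6]) * pos_probs[7]

P(C=neg|X) ∝ neg_probs[0] * (1 - neg_probs[1]) * neg_probs[2] * neg_probs[3] * (1 - neg_probs[4]) * (1 - neg_probs[5]) * (1 - neg_probs[6]) * neg_probs[7]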
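
Plugging the estimates printed in Question 2A into the example X = (1,0,1,1,0,0,0,1) gives the worked evaluation below. It assumes equal priors of 0.5 for each class (the dataset has the same number of reviews per class); present words contribute their probability and absent words contribute one minus it. The values are rounded.

```latex
\begin{aligned}
P(X \mid C=\text{pos}) &= 0.019 \times (1-0.254) \times 0.048 \times 0.023 \times (1-0.120) \times (1-0.094) \times (1-0.405) \times 0.125 \approx 9.28 \times 10^{-7} \\
P(X \mid C=\text{neg}) &= 0.099 \times (1-0.503) \times 0.166 \times 0.090 \times (1-0.046) \times (1-0.053) \times (1-0.282) \times 0.048 \approx 2.29 \times 10^{-5} \\
P(C=\text{neg} \mid X) &= \frac{0.5 \times 2.29 \times 10^{-5}}{0.5 \times 2.29 \times 10^{-5} + 0.5 \times 9.28 \times 10^{-7}} \approx 0.96
\end{aligned}
```

Since the negative score is larger, this example vector would be classified as a negative review.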

# Question 2C (Expected) 1 point 

Write Python code for classifying a particular test instance (in our case a movie review) following a Bernoulli Bayes approach. Your code should calculate the likelihood that the review is positive given the corresponding conditional probabilities for each dictionary word, as well as the likelihood that the review is negative given the corresponding conditional probabilities for each dictionary word. Check that your code works by providing a few example cases of prediction. Your code should be written from "scratch" and only use numpy/scipy, not machine learning libraries like scikit-learn or tensorflow. 


In [80]:
# YOUR CODE GOES HERE 
def classify(X):
    """Return 0 if X is classified as positive, 1 if it is classified as negative."""
    pos_score = 1
    neg_score = 1
    
    # Compute class scores: a present word contributes its probability,
    # an absent word contributes one minus its probability
    for word,prob in zip(X,neg_probs):
        if word == 1: neg_score *= prob
        else: neg_score *= (1 - prob)            
    for word,prob in zip(X,pos_probs):
        if word == 1: pos_score *= prob
        else: pos_score *= (1 - prob)
    
    # Return the index of the class with the higher score
    # (0 = positive, 1 = negative; equal priors, so the prior is omitted)
    return np.argmax([pos_score, neg_score])
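
The question also asks for a few example predictions. The sketch below is one way to produce them; it is not the notebook's original cell. It hard-codes the probabilities printed in Question 2A, assumes equal class priors, and uses a hypothetical variant `classify_log` that sums log-probabilities instead of multiplying probabilities. The decision is the same, but sums of logs are less prone to underflow when the dictionary is large.

```python
import numpy as np

# Probabilities printed in Question 2A (order: awful, bad, boring, dull,
# effective, enjoyable, great, hilarious)
pos_probs = np.array([0.019, 0.254, 0.048, 0.023, 0.12, 0.094, 0.405, 0.125])
neg_probs = np.array([0.099, 0.503, 0.166, 0.09, 0.046, 0.053, 0.282, 0.048])

def classify_log(x):
    """Return 0 (positive) or 1 (negative) using Bernoulli log-likelihoods."""
    x = np.asarray(x)
    # log P(x_i | C): log p if the word is present, log(1 - p) if absent
    pos_ll = np.sum(np.where(x == 1, np.log(pos_probs), np.log(1 - pos_probs)))
    neg_ll = np.sum(np.where(x == 1, np.log(neg_probs), np.log(1 - neg_probs)))
    return int(np.argmax([pos_ll, neg_ll]))  # equal priors assumed

examples = {
    "effective + great": (0, 0, 0, 0, 1, 0, 1, 0),
    "awful + bad + boring + dull": (1, 1, 1, 1, 0, 0, 0, 0),
    "awful + boring + dull + hilarious": (1, 0, 1, 1, 0, 0, 0, 1),
}
labels = ["positive", "negative"]
for name, x in examples.items():
    print(name, "->", labels[classify_log(x)])
```

With these estimates the first vector is labeled positive and the other two negative.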
            



# QUESTION 2D (Expected ) 1 point

Calculate the classification accuracy and confusion matrix that you would obtain using the whole data set for both training and testing. Do not use machine learning libraries like scikit-learn or tensorflow for this, only basic numpy/scipy functionality. 

In [81]:
# YOUR CODE GOES HERE
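# Rows are the true class, columns the predicted class (0 = positive, 1 = negative)
confusion_matrix = np.zeros(shape=(2,2))

for X in positive_encodings:
    confusion_matrix[0][classify(X)] += 1
    
for X in negative_encodings:
    confusion_matrix[1][classify(X)] += 1

print(confusion_matrix)

# Row-normalized confusion matrix: each row sums to 1
confusion_rates = confusion_matrix / confusion_matrix.sum(axis=1, keepdims=True)

print("Row-normalized confusion matrix: \n",confusion_rates)

# Overall accuracy: correctly classified reviews (the diagonal) over all reviews
accuracy = np.trace(confusion_matrix) / np.sum(confusion_matrix)
print("Accuracy:", accuracy)

[[756. 244.]
 [411. 589.]]
Row-normalized confusion matrix: 
 [[0.756 0.244]
 [0.411 0.589]]
Accuracy: 0.6725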


# QUESTION 2E (Advanced) 1 point 

One can consider the Naive Bayes classifier a generative model that can generate binary feature vectors using the associated probabilities from the training data. The idea is similar to how we do direct sampling in Bayesian Networks and depends on generating random numbers from a discrete distribution. Describe how you would generate random movie reviews consisting solely of the words from the dictionary using your model. Show 5 examples of randomly generated positive reviews and 5 examples of randomly generated negative reviews. Each example should consist of a subset of the words in the dictionary. Hint: use probabilities to generate both the presence and absence of a word.

In [96]:
# YOUR CODE GOES HERE 
def make_review(probabilities):
    terms = ["awful","bad","boring","dull","effective","enjoyable","great","hilarious"]

    # Normalize the per-word probabilities into a single discrete distribution
    # over the 8 terms, then draw 10 terms from it (with replacement)
    rv = stats.rv_discrete(name = 'Review', 
                           values = (np.arange(len(terms)), probabilities / np.sum(probabilities)))
    numeric_samples = rv.rvs(size=10)
    mapped_samples = [terms[x] for x in numeric_samples]
    return mapped_samples

print("Negative Reviews:")
for _ in range(5):
    print(make_review(neg_probs))
print()
print("Positive Reviews:")
for _ in range(5):
    print(make_review(pos_probs))

Negative Reviews:
['bad', 'boring', 'awful', 'great', 'bad', 'boring', 'awful', 'bad', 'great', 'bad']
['hilarious', 'bad', 'enjoyable', 'boring', 'boring', 'bad', 'bad', 'bad', 'bad', 'effective']
['hilarious', 'bad', 'great', 'bad', 'bad', 'bad', 'bad', 'bad', 'great', 'enjoyable']
['awful', 'bad', 'awful', 'bad', 'bad', 'bad', 'bad', 'bad', 'bad', 'bad']
['bad', 'bad', 'boring', 'great', 'great', 'hilarious', 'bad', 'effective', 'bad', 'dull']

Positive Reviews:
['enjoyable', 'hilarious', 'effective', 'great', 'great', 'effective', 'great', 'enjoyable', 'bad', 'great']
['hilarious', 'dull', 'bad', 'great', 'enjoyable', 'great', 'enjoyable', 'bad', 'awful', 'enjoyable']
['great', 'great', 'dull', 'effective', 'bad', 'bad', 'great', 'enjoyable', 'hilarious', 'great']
['bad', 'hilarious', 'great', 'great', 'effective', 'great', 'enjoyable', 'bad', 'enjoyable', 'effective']
['hilarious', 'hilarious', 'great', 'bad', 'hilarious', 'hilarious', 'great', 'great', 'great', 'great']
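
The sampler above draws ten words with replacement from a single normalized distribution, so words can repeat and the absence of a word is never sampled. An alternative that follows the hint more closely is to make one independent Bernoulli draw per dictionary word, which yields a subset of the dictionary. The sketch below illustrates this under stated assumptions: `sample_review` is a hypothetical helper, the probabilities are the ones printed in Question 2A, and the seed is arbitrary.

```python
import numpy as np

terms = ["awful", "bad", "boring", "dull",
         "effective", "enjoyable", "great", "hilarious"]
# Probabilities printed in Question 2A
neg_probs = np.array([0.099, 0.503, 0.166, 0.09, 0.046, 0.053, 0.282, 0.048])
pos_probs = np.array([0.019, 0.254, 0.048, 0.023, 0.12, 0.094, 0.405, 0.125])

def sample_review(probabilities, rng):
    """Draw one review: word i is present with probability probabilities[i]."""
    present = rng.random(len(probabilities)) < probabilities  # one Bernoulli draw per word
    return [term for term, keep in zip(terms, present) if keep]

rng = np.random.default_rng(421)  # arbitrary seed, only for reproducibility
print("Negative Reviews:")
for _ in range(5):
    print(sample_review(neg_probs, rng))
print()
print("Positive Reviews:")
for _ in range(5):
    print(sample_review(pos_probs, rng))
```

Because most of the word probabilities are small, many sampled reviews will contain only one or two words, and some will be empty.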


# QUESTION 2F (ADVANCED) (CSC421 - 0 points, CSC581C - 2 points)

Check the associated README file and see what convention is used for the 10-fold cross-validation. Calculate the classification accuracy and confusion matrix using the recommended 10-fold cross-validation. Again do NOT use 
ML libraries such as scikit-learn or tensorflow and just use numpy/scipy. 

In [50]:
# YOUR CODE GOES HERE 
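
One possible shape for the cross-validation loop is sketched below. It rests on several assumptions that should be checked against the README: file names are taken to look like `cv996_11592.txt`, and the fold is taken to be the hundreds digit of the `cv` index (so `cv900` to `cv999` form one fold). `fold_of`, `train`, `predict`, and `cross_validate` are hypothetical helpers, and the data here is synthetic, generated from the probabilities printed in Question 2A, so the numbers it prints are not results for the real dataset.

```python
import numpy as np

def fold_of(filename):
    """Assumed convention: 'cv123_45678.txt' belongs to fold 1 (index // 100)."""
    return int(filename[2:5]) // 100

def train(encodings):
    """Estimate P(word present | class) as the column-wise mean."""
    return encodings.mean(axis=0)

def predict(x, pos_probs, neg_probs):
    """Bernoulli Naive Bayes with equal priors; 0 = positive, 1 = negative."""
    pos = np.prod(np.where(x == 1, pos_probs, 1 - pos_probs))
    neg = np.prod(np.where(x == 1, neg_probs, 1 - neg_probs))
    return int(np.argmax([pos, neg]))

def cross_validate(pos_enc, pos_folds, neg_enc, neg_folds, n_folds=10):
    """Train on 9 folds, test on the held-out fold, and pool the counts."""
    confusion = np.zeros((2, 2))  # rows: true class, columns: predicted class
    for k in range(n_folds):
        pos_probs = train(pos_enc[pos_folds != k])
        neg_probs = train(neg_enc[neg_folds != k])
        for x in pos_enc[pos_folds == k]:
            confusion[0, predict(x, pos_probs, neg_probs)] += 1
        for x in neg_enc[neg_folds == k]:
            confusion[1, predict(x, pos_probs, neg_probs)] += 1
    accuracy = np.trace(confusion) / confusion.sum()
    return confusion, accuracy

# Synthetic stand-in for the real encodings (NOT the movie review data)
rng = np.random.default_rng(0)
names = [f"cv{i:03d}_00000.txt" for i in range(1000)]  # made-up file names
folds = np.array([fold_of(name) for name in names])
true_pos = np.array([0.019, 0.254, 0.048, 0.023, 0.12, 0.094, 0.405, 0.125])
true_neg = np.array([0.099, 0.503, 0.166, 0.09, 0.046, 0.053, 0.282, 0.048])
pos_enc = (rng.random((1000, 8)) < true_pos).astype(int)
neg_enc = (rng.random((1000, 8)) < true_neg).astype(int)

confusion, accuracy = cross_validate(pos_enc, folds, neg_enc, folds)
print(confusion)
print("Accuracy:", accuracy)
```

To use it on the real data, the synthetic arrays would be replaced by the encodings from Question 2A together with a fold index derived from each file name.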