## Naive Bayes Sentiment Polarity Classifier

This notebook outlines our sentiment polarity model, which takes movie reviews and predicts whether each one is positive or negative based on the language used.

In [1]:
import glob
import os
import math
import re
import csv
import random

## Data Input

This function reads in the data, cleans each review (lowercasing and stripping punctuation and newlines), adds it to the training or test set, and labels it as positive (1) or negative (0).

The data contains 1000 positive and 1000 negative reviews. Files whose name has `9` as its third character are held out, so the data is split into 900 positive and 900 negative reviews for training and 100 positive and 100 negative reviews for testing (a 90/10 split).

In [2]:
def clean_text(text):
    #lowercase, strip punctuation and remove newlines
    text = text.lower()
    text = re.sub(r'[^\w\s]','',text)
    return text.replace("\n", "")

def train_test():
    train = []
    test = []

    #each path is relative to the directory entered by the previous os.chdir call
    sources = [('review_polarity/txt_sentoken/neg/', 0),          #negative data
               ('../../../review_polarity/txt_sentoken/pos/', 1)] #positive data

    for path, label in sources:
        os.chdir(path)
        for file in os.listdir():
            with open(file, 'r') as f:
                text = clean_text(f.read())
            #files whose third character is '9' form the test set
            if file[2]=='9':
                test.append([file, text, label])
            #everything else is training data
            else:
                train.append([file, text, label])

    return train, test
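
To make the split rule and the text cleaning concrete, here is a minimal sketch that applies the same steps to made-up inputs. The filenames and the sample sentence are hypothetical, and the helper names `normalise` and `is_test_file` are introduced only for this illustration. The sketch assumes files are named in the form `cvNNN_....txt`, so that the character at index 2 is the first digit of the review number.

```python
import re

def normalise(text):
    """Lowercase, drop punctuation, and remove newlines (the cleaning steps used when loading the data)."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text.replace("\n", "")

def is_test_file(filename):
    """A file is held out for testing when its third character is '9'."""
    return filename[2] == '9'

# Hypothetical filenames, for illustration only
names = ['cv000_29416.txt', 'cv899_17812.txt', 'cv900_10800.txt', 'cv999_14636.txt']
split = {name: ('test' if is_test_file(name) else 'train') for name in names}

# Hypothetical review text
cleaned = normalise("A great film ! \nIt's well-acted .")
tokens = cleaned.split()
```

Note that stripping punctuation merges contractions and hyphenated words into single tokens, so `It's` becomes `its` and `well-acted` becomes `wellacted`.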

In [3]:
def count_classes(train):
    pos = 0
    neg = 0
    #loop through each review
    for review in train:
        #find incidence where sentiment is positive
        if review[2]==1:
            pos+=1
        else:
            neg+=1
            
    return pos, neg, pos+neg

## Create word frequency dictionaries

We build positive, negative, and total word frequency dictionaries from the training data. The Naive Bayes function then uses them to compute the log likelihood of each word.

In [4]:
def count_words(reviews):
    pos_word_frequencies = {}
    neg_word_frequencies = {}
    total_vocab_frequencies = {}

    for review in reviews:        
        #find positive reviews
        if review[2]==1:
            words = review[1].strip().split()
            for key in words:
                if key in pos_word_frequencies:
                    pos_word_frequencies[key] += 1
                else:
                    pos_word_frequencies[key] = 1

                if key in total_vocab_frequencies:
                    total_vocab_frequencies[key] += 1
                else:
                    total_vocab_frequencies[key] = 1
                    
        #negative reviews
        else:
            words = review[1].strip().split()
            for key in words:
                if key in neg_word_frequencies:
                    neg_word_frequencies[key] += 1
                else:
                    neg_word_frequencies[key] = 1

                if key in total_vocab_frequencies:
                    total_vocab_frequencies[key] += 1
                else:
                    total_vocab_frequencies[key] = 1

    return pos_word_frequencies, neg_word_frequencies, total_vocab_frequencies
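
The same counting can also be expressed with `collections.Counter` from the standard library. The sketch below is an alternative illustration, not the notebook's implementation: the function name `count_words_counter` and the two toy reviews are hypothetical, and each row is assumed to follow the `[filename, text, label]` layout used above.

```python
from collections import Counter

def count_words_counter(reviews):
    """Return (positive, negative, total) word counts for labelled reviews."""
    pos, neg = Counter(), Counter()
    for review in reviews:
        text, label = review[1], review[2]
        # Label 1 is positive; anything else is treated as negative
        target = pos if label == 1 else neg
        target.update(text.strip().split())
    # Adding two Counters sums the counts word by word
    total = pos + neg
    return pos, neg, total

# Toy data, for illustration only
toy_reviews = [
    ['a.txt', 'good good film', 1],
    ['b.txt', 'bad film', 0],
]
pos_counts, neg_counts, total_counts = count_words_counter(toy_reviews)
```

A `Counter` returns 0 for missing keys, which removes the need for the explicit `if key in ...` checks.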


## Naive Bayes

Naive Bayes is a probabilistic classifier which returns the most probable class $\hat{c}$ (the one with maximum posterior probability) for a given document $d$.

$$\hat{c} = \underset{c\in C}{\operatorname{argmax}}P(c|d)$$

In this case we are interested in whether the document belongs to the Positive (1) or Negative (0) class.

Bayes' rule allows us to express the probability of a class given a document in terms of quantities we can estimate from the training data, based on the following equation:

$$P(x|y)= \frac{P(y|x)P(x)}{P(y)}$$

which for our model becomes 

$$P(c|d)= \frac{P(d|c)P(c)}{P(d)}$$

where $P(d|c)$ is our likelihood and $P(c)$ is our prior. Since $P(d)$ is the same for every class, it does not affect the argmax and can be dropped.

Naive Bayes operates under the assumption that the word probabilities are conditionally independent of one another given the class, and may therefore be multiplied together. In this model we work in log space, where that product becomes a sum, so our classes will be predicted using the final equation below:

$$ c_{NB} = \underset{c\in C}{\operatorname{argmax}}\left[\log P(c)+\sum_{i\in positions}\log P(w_{i}|c)\right] $$
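
The reason for working in log space can be seen numerically: multiplying many small probabilities underflows to zero in floating point, while summing their logarithms stays finite. A minimal sketch, using an arbitrary made-up likelihood of 0.001 for every word in a 200-word document:

```python
import math

# Hypothetical per-word likelihoods, for illustration only
word_probs = [0.001] * 200

# Direct product: the true value 1e-600 is smaller than any positive float
product = 1.0
for p in word_probs:
    product *= p

# Sum of logs: the same quantity on a log scale, with no underflow
log_sum = sum(math.log(p) for p in word_probs)
```

Here `product` ends up as exactly `0.0`, which would make every class score identical, whereas `log_sum` remains an ordinary negative number that can be compared across classes.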


The prior is calculated as:

$$\hat{P}(c)= \frac{N_{c}}{N_{doc}}$$
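
As a small worked example, the sketch below computes the log prior for a balanced training set. The count of 900 documents per class is an assumption made for illustration.

```python
import math

# Assumed document counts per class
n_pos, n_neg = 900, 900
n_doc = n_pos + n_neg

# Log priors, matching the log-space scoring used by the classifier
log_prior_pos = math.log(n_pos / n_doc)
log_prior_neg = math.log(n_neg / n_doc)
```

With balanced classes both priors equal $\log 0.5$, so they add the same constant to each class score and the decision rests entirely on the word likelihoods.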

We use Laplace (add-one) smoothing, so that a word never seen with a class does not receive zero probability, to give the following equation:

$$\hat{P}(w_{i}|c)=\frac{count(w_{i},c)+1}{(\sum_{w \in V}count(w,c))+|V|}$$
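
The smoothed estimate can be checked by hand on a toy example. In the sketch below the three-word vocabulary, the counts, and the helper name `smoothed_likelihood` are all hypothetical.

```python
# Toy vocabulary and positive-class counts, for illustration only
vocab = {'good', 'bad', 'film'}
pos_counts = {'good': 3, 'film': 2}   # 'bad' never occurs in the positive class

def smoothed_likelihood(word, class_counts, vocab):
    """Add-one estimate: (count(w, c) + 1) / (sum of counts in c + |V|)."""
    total = sum(class_counts.values())
    return (class_counts.get(word, 0) + 1) / (total + len(vocab))

probs = {w: smoothed_likelihood(w, pos_counts, vocab) for w in vocab}
```

With 5 positive tokens and $|V| = 3$, the denominator is 8, so `good` gets 4/8, `film` gets 3/8, and the unseen word `bad` gets 1/8 rather than 0. The three estimates still sum to 1.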

In [5]:
def naive_bayes(train, test):

    #training phase
    pos_loglikelihood = {}
    neg_loglikelihood = {}
    count_pos, count_neg, total_docs = count_classes(train)
    prior_pos = math.log(count_pos/total_docs)
    prior_neg = math.log(count_neg/total_docs)
    pos_dict, neg_dict, full_vocab_dict = count_words(train)

    #add positive log likelihoods
    pos_sum_counts = sum(pos_dict.values())
    for word in pos_dict:
        pos_loglikelihood[word] = math.log((pos_dict[word]+1)/(pos_sum_counts+len(full_vocab_dict)))

    neg_sum_counts = sum(neg_dict.values())
    #add negative log likelihoods
    for word in neg_dict:
        neg_loglikelihood[word] = math.log((neg_dict[word]+1)/(neg_sum_counts+len(full_vocab_dict)))

    #add likelihood for words in vocab but not in one of the dictionaries
    pos_keys = pos_dict.keys()
    neg_keys = neg_dict.keys()
    all_vocab_keys = full_vocab_dict.keys()

    #find words present in main vocab but not in positive vocab and vice versa
    pos_diff = all_vocab_keys - pos_keys
    neg_diff = all_vocab_keys - neg_keys
    
    for word in pos_diff:
        pos_loglikelihood[word] = math.log(1/(pos_sum_counts + len(full_vocab_dict)))
             
    for word in neg_diff:
        neg_loglikelihood[word] = math.log(1/(neg_sum_counts + len(full_vocab_dict)))

    #testing phase
    accurate_prediction = 0
    for review in test:
        
        pos_prob, neg_prob = 0, 0
        pos_prob+= prior_pos
        neg_prob+= prior_neg
        words = review[1].strip().split()
        for word in words:
            if word in full_vocab_dict:
                if word in pos_loglikelihood:
                    pos_prob += pos_loglikelihood[word]
                if word in neg_loglikelihood:
                    neg_prob += neg_loglikelihood[word]

        if pos_prob > neg_prob:
            review.append(1)
        else:
            review.append(0)
            
        if review[2]==review[3]:
            accurate_prediction+=1
        
    total_test_reviews = len(test)
    accuracy = 'The Accuracy of the Naive Bayes sentiment polarity classifier is: ' + str((accurate_prediction / total_test_reviews)*100) + '%'

    return accuracy, test

## Sample results

To review the performance of the classifier, we want to be able to sample the correct and incorrect results. The function below allows us to select 5 correct predictions and 5 incorrect predictions at random.

In [6]:
def sample_results(predictions):
    correct = []
    incorrect = []
    for row in predictions:
        if row[2]==row[3]:
            correct.append(row)
        else:
            incorrect.append(row)

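    #sample at most 5 of each, since random.sample raises ValueError if fewer are available
    correct_5 = random.sample(correct, min(5, len(correct)))
    incorrect_5 = random.sample(incorrect, min(5, len(incorrect)))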

    return correct_5, incorrect_5
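
`random.sample` raises `ValueError` when asked for more items than the list holds, and it returns a different selection on every run unless the generator is seeded. A minimal sketch of a guarded, reproducible variant follows. The helper name `sample_up_to`, the seed value, and the toy rows are hypothetical choices for illustration, not part of the notebook.

```python
import random

def sample_up_to(rows, k, seed=None):
    """Return up to k rows chosen at random; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

# Toy prediction rows in [filename, text, label, predicted] form
rows = [['f1', 'text', 1, 0], ['f2', 'text', 0, 1], ['f3', 'text', 1, 0]]

picked = sample_up_to(rows, 5, seed=0)
picked_again = sample_up_to(rows, 5, seed=0)
```

Using a private `random.Random` instance keeps the seed from affecting other code that relies on the module-level generator.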


## Output of reviews and predictions for investigation

The function below writes the selected correctly and incorrectly classified reviews to the `ANALYSIS.md` file, formatted for investigation, so that comments can be added on why each prediction was correct or incorrect. The relative path assumes the working directory left behind by `train_test()`.

In [7]:
def output_samples(good, bad):
        with open('../../../ANALYSIS.md', 'w') as f:
            f.write('# Analysis of Sentiment Classifier \n \n')

            f.write('## Correct Predictions: \n \n')
            for review in good:
                sentiment = 'Positive' if review[2] == 1 else 'Negative'
                predicted = 'Positive' if review[3] == 1 else 'Negative'
                f.write('### Filename: '+ review[0] + '\n### Sentiment: '+ sentiment + '\n### Predicted: '+predicted +' \n \n### Review: \n')
                f.write(review[1] + '\n \n')

            f.write('\n \n \n \n \n ## Incorrect Predictions: \n \n')
            for review in bad:
                sentiment = 'Positive' if review[2] == 1 else 'Negative'
                predicted = 'Positive' if review[3] == 1 else 'Negative'
                f.write('### Filename: '+ review[0] + '\n### Sentiment: '+ sentiment+ '\n### Predicted: '+predicted +' \n \n### Review: \n') 
                f.write(review[1] + '\n \n')

            f.write('## Conclusion')


## Running the code

The cell below calls the functions to do the following:

- Train the model on the training data
- Test the model predictions on the test data
- Output the accuracy of the model
- Output a selected sample of predictions to `ANALYSIS.md` for review

In [8]:
train, test = train_test()
results, predictions = naive_bayes(train, test)
print(results)
correct, incorrect = sample_results(predictions)
output_samples(correct, incorrect)

The Accuracy of the Naive Bayes sentiment polarity classifier is: 83.5%
