## TODO

- Get the general framework working (reading from the original dataset, etc.)
- Use a decision tree to get baseline model results with the manual word lists
- Get the unigram model working, including the feature vector
- Use all three/two classifiers in Proposal to work with unigram model and get results with 
- Choose one of the other testing models and get it to work with the above classifiers. Repeat this until all tests are done
- Look at the README file in the dataset and write the info from there in the report

In [1]:
import os
import pandas as pd

In [2]:
def load_reviews(directory):
    """
    Reads every review file in directory and groups the texts into three folds by filename prefix:
    below 'cv233', from 'cv233' up to (but not including) 'cv466', and 'cv466' onwards
    """
    folds = [[], [], []]
    for f in os.listdir(directory):
        fold = f[:5]
        if fold < 'cv233':
            fold_index = 0
        elif fold < 'cv466':
            fold_index = 1
        else:
            fold_index = 2
        with open(os.path.join(directory, f), 'r', encoding='latin-1') as fin:
            folds[fold_index].append(fin.read().strip())  # We remove the ending newline characters
    return folds


# The 'neg_org' and 'pos_org' subfolders hold the negative and positive reviews, one file per review
negative_reviews = load_reviews('./Datasets/neg_org/')
positive_reviews = load_reviews('./Datasets/pos_org/')

In [3]:
# Although Section 3 gives a figure for the corpus size, Figure 1 actually uses only 700 positive and 700 negative 
# reviews, which gives the 50% baseline value for the results

human1_pos = ['dazzling', 'brilliant', 'phenomenal', 'excellent', 'fantastic']
human1_neg = ['suck', 'terrible', 'awful', 'unwatchable', 'hideous']

human2_pos = ['gripping', 'mesmerizing', 'riveting', 'spectacular', 'cool', 'awesome', 'thrilling', 'badass', 'excellent', 
              'moving', 'exciting']
human2_neg = ['bad', 'cliched', 'sucks', 'boring', 'stupid', 'slow']

human3_pos = ['love', 'wonderful', 'best', 'great', 'superb', 'still', 'beautiful']
human3_neg = ['bad', 'worst', 'stupid', 'waste', 'boring', '?', '!']


def convert_str_to_dict(review):
    """
    Returns a dictionary of tokens in review along with the frequency of those tokens
    """
    freq = {}
    for word in review.split():
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1
    return freq


def return_sentiment_tag_counts(pos_tokens, neg_tokens, review):
    """
    Given a list of tokens we use to denote positive and negative sentiment, we return a dictionary of values such that 
    {sentiment_token: frequency of token in review} for both positive and negative sentiments
    """
    pos = {w:0 for w in pos_tokens}
    neg = {w:0 for w in neg_tokens}
    review_tokens = convert_str_to_dict(review)
    
    for p in pos_tokens:
        if p in review_tokens:
            pos[p] += review_tokens[p]
    for n in neg_tokens:
        if n in review_tokens:
            neg[n] += review_tokens[n]
    
    return pos, neg


def get_statistics(pos_tokens, neg_tokens):
    """
    Returns (accuracy, tied percentage) over all negative and positive reviews. A review is classified by whichever 
    word list has more hits in it; ties are counted separately and never count as correct
    """
    ties = 0
    total = 0  # Counted from the data rather than hard-coded as 1400
    
    neg_correct = 0
    for neg_fold in negative_reviews:
        for neg_review in neg_fold:
            pos_dict, neg_dict = return_sentiment_tag_counts(pos_tokens, neg_tokens, neg_review)
            pos_count = sum(pos_dict.values())
            neg_count = sum(neg_dict.values())
            total += 1
            if pos_count == neg_count:
                ties += 1
            elif neg_count > pos_count:
                neg_correct += 1

    pos_correct = 0
    for pos_fold in positive_reviews:
        for pos_review in pos_fold:
            pos_dict, neg_dict = return_sentiment_tag_counts(pos_tokens, neg_tokens, pos_review)
            pos_count = sum(pos_dict.values())
            neg_count = sum(neg_dict.values())
            total += 1
            if pos_count == neg_count:
                ties += 1
            elif pos_count > neg_count:
                pos_correct += 1
            
    total_correct = pos_correct + neg_correct
    accuracy = (total_correct / total) * 100
    tied_percentage = (ties / total) * 100
    
    return accuracy, tied_percentage

acc1, tied1 = get_statistics(human1_pos, human1_neg)
acc2, tied2 = get_statistics(human2_pos, human2_neg)
acc3, tied3 = get_statistics(human3_pos, human3_neg)

In [4]:
data = [('Human 1', round(acc1, 1), round(tied1, 1)), 
        ('Human 2', round(acc2, 1), round(tied2, 1)), 
        ('Human 3', round(acc3, 1), round(tied3, 1))]
summary = pd.DataFrame(data, columns=['Proposed word list', 'Accuracy', 'Ties'])
summary

   Proposed word list  Accuracy  Ties
0             Human 1      19.5  75.0
1             Human 2      41.3  39.4
2             Human 3      60.4  15.5


In [125]:
r = negative_reviews[1][37]
return_sentiment_tag_counts(human1_pos, human1_neg, r)

({'brilliant': 0,
  'dazzling': 0,
  'excellent': 0,
  'fantastic': 0,
  'phenomenal': 0},
 {'awful': 0, 'hideous': 0, 'suck': 0, 'terrible': 0, 'unwatchable': 0})

### Why we can't reproduce the results:

They break ties with a policy that 'maximises the accuracy' (paragraph below Figure 2), so some of the reviews counted under Ties also count towards their accuracy. We are not told how that policy works, so we cannot reproduce their accuracy results. Our accuracy above treats every tie as incorrect.
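
To get a feel for how much a tie-breaking policy can move the accuracy, here is a minimal sketch that assigns every tie to one fixed class. This is only one guess at a policy and not necessarily what they did. `count_hits` and `accuracy_with_tie_policy` are hypothetical helpers, the toy reviews are made up, and the function takes flat lists of reviews (the fold lists above would need flattening first).

```python
from collections import Counter


def count_hits(tokens, review):
    """Total occurrences of any word from tokens in a whitespace-tokenised review."""
    freq = Counter(review.split())
    return sum(freq[t] for t in tokens)


def accuracy_with_tie_policy(pos_tokens, neg_tokens, pos_reviews, neg_reviews, tie_label):
    """Accuracy (%) when every tied review is assigned to tie_label ('pos' or 'neg')."""
    correct = 0
    for true_label, reviews in (('pos', pos_reviews), ('neg', neg_reviews)):
        for review in reviews:
            pos_count = count_hits(pos_tokens, review)
            neg_count = count_hits(neg_tokens, review)
            if pos_count > neg_count:
                predicted = 'pos'
            elif neg_count > pos_count:
                predicted = 'neg'
            else:
                predicted = tie_label  # Assumed policy: all ties go to one class
            correct += predicted == true_label
    return 100 * correct / (len(pos_reviews) + len(neg_reviews))


# Made-up toy data: two of the three positive reviews and one of the two negative reviews are ties
toy_pos = ['a brilliant film', 'quiet and odd', 'slow but moving']
toy_neg = ['an awful mess', 'nothing happens here']

acc_ties_to_pos = accuracy_with_tie_policy(['brilliant'], ['awful'], toy_pos, toy_neg, 'pos')
acc_ties_to_neg = accuracy_with_tie_policy(['brilliant'], ['awful'], toy_pos, toy_neg, 'neg')
print(acc_ties_to_pos, acc_ties_to_neg)
```

On the toy data the two policies give different accuracies because the ties are not split evenly between the classes, which is why the choice of policy matters when the tie percentage is high.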

Open question: why do our tie percentages differ from theirs?
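
One thing that could be checked here is tokenisation: `convert_str_to_dict` splits on whitespace only, so a word with attached punctuation or different casing would not match a list entry. Whether that actually explains the difference is an untested assumption. The sketch below compares the tie rate under plain whitespace tokens with the rate under a normalised variant. `normalised_tokens`, `is_tie` and `tie_rate` are hypothetical helpers and the sample reviews are made up.

```python
import string
from collections import Counter


def whitespace_tokens(review):
    """Same tokenisation as convert_str_to_dict: split on whitespace only."""
    return review.split()


def normalised_tokens(review):
    """Lowercase and strip surrounding punctuation; pure punctuation tokens like '?' are kept."""
    tokens = []
    for tok in review.lower().split():
        stripped = tok.strip(string.punctuation)
        tokens.append(stripped or tok)
    return tokens


def is_tie(pos_tokens, neg_tokens, tokens):
    """True when the positive and negative word lists have the same number of hits."""
    freq = Counter(tokens)
    return sum(freq[t] for t in pos_tokens) == sum(freq[t] for t in neg_tokens)


def tie_rate(pos_tokens, neg_tokens, reviews, tokenizer):
    """Percentage of reviews that are ties under the given tokenizer."""
    ties = sum(is_tie(pos_tokens, neg_tokens, tokenizer(r)) for r in reviews)
    return 100 * ties / len(reviews)


# Made-up sample: punctuation and casing hide two of the sentiment words from a plain split()
sample = ['Brilliant!', 'it was awful.', 'fine']
raw_rate = tie_rate(['brilliant'], ['awful'], sample, whitespace_tokens)
norm_rate = tie_rate(['brilliant'], ['awful'], sample, normalised_tokens)
print(raw_rate, norm_rate)
```

If the two rates come out the same on the real reviews, tokenisation can be ruled out as the cause.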