# Homework 1: Sentiment Analysis with Naïve Bayes
#### CSCI 3832 Natural Language Processing


1. Lemmas and inflected forms, hyponyms/hypernyms, the distributional hypothesis
2. Tokenization, vocabularies, and feature extraction for a Naive Bayes model 

*Yevin Kim, kimyevin17@gmail.com*

## Section 1: Free Response Questions


**Question 1: Write down the lemmas of the following inflected forms:**
1. walked
2. taught
3. best
4. are
5. running

*Your answer here*
1. walk
2. teach
3. good
4. be
5. run

**Question 2: Write down 3 hyponyms of the following words:**
1. dog
2. food
3. profession

*Your answer here*
1. Pomeranian, Poodle, German Shepherd
2. pizza, pasta, bread
3. doctor, teacher, baker

**Question 3: In your own words, describe:**
1. The distributional hypothesis (see lecture on distributional semantics)
2. How is the distributional hypothesis relevant to NLP systems?

*Your answer here*
: 1. In distributional semantics, a word's meaning is characterized by the contexts in which it occurs rather than by the word in isolation. The distributional hypothesis, which is grounded in this view, states that words with similar meanings tend to occur in similar contexts.
: 2. The distributional hypothesis matters for NLP systems because it gives them a way to model relationships between words from raw text. For example, by analysing the neighbouring words (the context) of a given word, we can build a distributional representation of that word as a vector. Vectors are convenient for computation: words can be compared numerically, for instance by measuring how similar their vectors are.
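
As a rough illustration of the answer above, the following sketch builds count-based context vectors and compares them with cosine similarity. The toy corpus, the window size of 2, and the helper names `context_vectors` and `cosine` are assumptions made for this example; they do not come from the course materials.

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus, invented purely for illustration.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell sharply today",
]
vecs = context_vectors(toy_corpus)
print(cosine(vecs["cat"], vecs["dog"]))     # close to 1: the two words share most contexts
print(cosine(vecs["cat"], vecs["stocks"]))  # 0.0: no shared context words
```

Words that appear in similar contexts ("cat" and "dog") end up with similar vectors, while a word from an unrelated context ("stocks") does not.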

## Section 2: Sentiment Analysis with Naive Bayes

In this section, our goal is to classify a set of movie reviews as positive or negative. For our dataset, we'll use the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/). To get started, download the dataset from the link, and extract it to where your notebook is. Next, we'll load the data and look at a couple of examples. 

*Important: for any project which involves creating or training models, you can **only** do your exploratory data analysis on the training set. Looking at the test set in any way can invalidate your results!*

In [1]:
import os

data_dir = 'aclImdb/'

pos_train_dir = data_dir + 'train/pos/'
neg_train_dir = data_dir + 'train/neg/'

def read_folder(folder):
    examples = []
    for fname in os.listdir(folder):
        with open(os.path.join(folder, fname), encoding='utf8') as f:
            examples.append(f.readline().strip())
    return examples

pos_examples = read_folder(pos_train_dir)
neg_examples = read_folder(neg_train_dir)

print('Number of positive examples: {}\nNumber of negative examples: {}\n\n'.format(len(pos_examples), len(neg_examples)))

print('Sample positive example: {}\n\n'.format(pos_examples[0]))
print('Sample negative example: {}'.format(neg_examples[0]))



Number of positive examples: 12500
Number of negative examples: 12500


Sample positive example: Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!


Sample negative example: Story of a man who has unnatural feelings for a pig. Starts out with 

Now that we've loaded the data, let's create our vocabulary. While we want our vocabulary to cover the whole training set, we'll keep them separate to see if there are any words which are frequently found in one or the other class -- these words might be informative features for classification! 

The simplest way to create a vocabulary is to split on spaces:

In [2]:
pos_words = []  # A list of all space separated tokens found across all positive examples. (Contains duplicates)
neg_words = []

pos_vocab = set()  # A set of *unique* space-separated tokens found across all positive examples. (No duplicates)
neg_vocab = set()

In [3]:
''' 
Your code here. For each class (positive/negative) find both the list of types and tokens for each class. 
To separate each example into separate words, split the example on spaces. 
'''

# positive reviews
for example in pos_examples:
    tokens = example.split(' ')
    pos_words.extend(tokens)
    pos_vocab.update(tokens)

# negative reviews
for example in neg_examples:
    tokens = example.split(' ')
    neg_words.extend(tokens)
    neg_vocab.update(tokens)

print('Sample words from positive examples: {}\n\n'.format(pos_words[:10]))
print('Sample words from negative examples: {}'.format(neg_words[:10]))

# just for sanity check
print(len(pos_words))
print(len(pos_vocab))

Sample words from positive examples: ['Bromwell', 'High', 'is', 'a', 'cartoon', 'comedy.', 'It', 'ran', 'at', 'the']


Sample words from negative examples: ['Story', 'of', 'a', 'man', 'who', 'has', 'unnatural', 'feelings', 'for', 'a']
2958696
178873


In [4]:
# Sanity check

print(len(pos_words))
print(len(pos_vocab))

assert len(pos_words) == 2958696
assert len(pos_vocab) == 178873

2958696
178873


Now let's calculate word frequencies for each class. (Hint: use the Python Counter class)

In [5]:
pos_frequencies = [] # A list of tuples of the form (word, count). 
                 # The list should be sorted in descending order, using the count of each tuple as the key

neg_frequencies = []

In [6]:
from collections import Counter

''' 
Your code here. For each class (positive/negative) calculate the frequency of each word and save it in pos_counter
and neg_counter.

Print the top 15 most common words for each class. 

'''

# Calculate word frequency by using 'Counter'
pos_counter = Counter(pos_words)
neg_counter = Counter(neg_words)

# Sort word frequency in descending order
pos_frequencies = pos_counter.most_common(15)
neg_frequencies = neg_counter.most_common(15)

# top 15 most common words in positive reviews
print("Top 15 most common words in positive reviews:")
for word, count in pos_frequencies:
    print(f"{word}: {count}")

# top 15 most common words in negative reviews
print("\nTop 15 most common words in negative reviews:")
for word, count in neg_frequencies:
    print(f"{word}: {count}")

Top 15 most common words in positive reviews:
the: 148413
and: 84270
a: 79427
of: 75341
to: 65209
is: 55358
in: 45794
that: 31941
I: 30927
it: 26987
this: 26021
/><br: 24617
as: 23930
with: 22031
was: 21308

Top 15 most common words in negative reviews:
the: 138612
a: 75665
and: 68381
of: 67629
to: 67359
is: 47870
in: 39782
I: 35043
that: 32615
this: 31177
it: 27440
/><br: 26318
was: 25389
for: 20197
with: 19687


In [7]:
assert pos_frequencies[0] == ('the', 148413)
assert neg_frequencies[0] == ('the', 138612)

Looking at the top 15 words for each class we see two problems:

1. The words are essentially the same for each class, which doesn't give us any information on how to differentiate them.
2. Look at the most frequent tokens. Are there any tokens which aren't words? Any situations where tokens with different surface forms but the same meaning could be repeated (and if so, how might we control for this?)

*Your answer to 2 here*

: Non-word tokens such as "/><br" are among the most frequent tokens. They are artifacts of HTML line breaks (`<br />`) left in the raw reviews. Preprocessing the text to strip markup and other formatting debris before counting would remove them. In addition, tokens with different surface forms but the same meaning (for example, inflected forms of one lemma) can be merged with lemmatization or by grouping synonyms.
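
A minimal sketch of the kind of preprocessing described above. It only handles `<br>` tags, and it also lowercases the text; lowercasing is an extra normalization step assumed here, and the helper name `clean_and_split` and the sample string are hypothetical. Lemmatization is not covered.

```python
import re

def clean_and_split(text):
    """Replace <br> tags with spaces, lowercase, then split on whitespace."""
    no_tags = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    return no_tags.lower().split()

# Invented sample in the style of the raw reviews.
sample = "Great film.<br /><br />The cast is great"

print(sample.split(' '))        # naive split keeps fragments such as '/><br'
print(clean_and_split(sample))  # markup removed, 'Great' and 'great' merged
```

Punctuation is still attached to words after this step ('film.'), so it is only a partial cleanup.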

Instead of looking at the most frequent words, let's instead look at the most frequent words which explicitly do not appear in the other class. 

In [8]:
only_pos_words = [word for word in pos_words if word not in neg_vocab]
only_neg_words = [word for word in neg_words if word not in pos_vocab]

opw_counter = Counter(only_pos_words)
onw_counter = Counter(only_neg_words)

print(opw_counter.most_common()[:50])
print('\n')
print(onw_counter.most_common()[:50])

[('Edie', 82), ('Gundam', 74), ('Antwone', 58), ('/>8/10', 47), ('/>7/10', 46), ('/>10/10', 45), ('Gunga', 44), ('Gypo', 44), ('Din', 43), ('Othello', 41), ('7/10.', 37), ('Blunt', 37), ('Yokai', 37), ('Tsui', 35), ('Blandings', 34), ('Goldsworthy', 32), ('/>9/10', 31), ('Gino', 31), ('Visconti', 30), ('Bernsen', 29), ('Taker', 29), ('Brashear', 29), ('Harilal', 29), ('Clutter', 28), ("Goldsworthy's", 27), ('"Rob', 26), ('Dominick', 25), ('MJ', 25), ('/>7', 24), ('Rosenstrasse', 24), ('Sassy', 24), ('Flavia', 24), ('Ashraf', 23), ('Recommended.', 22), ('Brock', 22), ('vulnerability', 22), ('Sabu', 22), ('Korda', 22), ('Ahmad', 22), ('Stevenson', 22), ('Coop', 22), ('Riff', 22), ('flawless.', 21), ('aunts', 21), ("Gilliam's", 21), ('Solo', 21), ('Kells', 21), ("Capote's", 21), ('Cutter', 21), ('Blackie', 21)]


[('/>4/10', 56), ('/>Avoid', 55), ('2/10', 49), ('*1/2', 45), ('unwatchable.', 43), ('/>3/10', 40), ('Thunderbirds', 40), ('Gamera', 39), ('steaming', 35), ('Wayans', 33), ('Slat

We begin to see some words we would expect to denote a negative review, but not so much for the positive reviews. Why might this be the case? What types of tokens are found in positive reviews but not in negative reviews?

*Your answer here*

: First, tokens found only in positive reviews are often proper nouns, such as a person's name or a title (for example, Edie or Gundam). This is likely because reviewers tend to name what they liked more often than what they disliked, so a review containing such markers is more likely to be positive.

Tokens that explicitly signal a favourable rating, like "8/10," "7/10," "10/10," and "9/10," are naturally more prevalent in positive reviews.

Furthermore, the positive-only tokens are mostly neutral vocabulary, whereas the negative-only tokens include more emotionally charged language.
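
The rating tokens in the output above appear in several surface forms ('/>8/10', '7/10.'). One possible way to recover the numeric rating from such tokens is a regular expression; this is a sketch, and the pattern and the helper name `extract_rating` are assumptions for illustration.

```python
import re

# Matches one or two digits followed by "/10" anywhere inside a token.
RATING = re.compile(r"(\d{1,2})/10")

def extract_rating(token):
    """Return N from an 'N/10' pattern inside the token, or None if absent."""
    match = RATING.search(token)
    if match is None:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 10 else None

for token in ["/>8/10", "7/10.", "2/10", "*1/2", "flawless."]:
    print(token, extract_rating(token))
```

Tokens on other scales, such as '*1/2', are deliberately left unmatched here.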


In [9]:
# Let's now make our combined vocabulary
space_vocab = list(pos_vocab.union(neg_vocab))
print('Length of space separated vocab: {}'.format(len(space_vocab)))
print(space_vocab[:50])

Length of space separated vocab: 281137
['dispels', 'filler;', 'godmother,', 'Shoot', '(Last', 'Hearty', 'who),that', 'Ibsen;', '"controversial"', 'replaced', 'god-like', 'Clan.', 'Schnappmann', '(Lemercier', 'Somers', 'entrée,', 'True,', 'seven-foot', 'half-hour', 'Phoenixville,', 'Chance.Fox', 'traitorous', 'shimmering,', 'Quite.', 'EVERYBODY---', 'five-scene', 'Simpsons"),', 'Fawcett...Until', 'Gulf', 'world-weariness', '(peeking', 'jeans,', 'SOON.Each', '(starving', "Millionaire's", 'rites.', '19.', 'TV...anything', 'cardboard', 'ending\x97in', 'Aristotle.', 'was),', 'killings', 'timeless.', 'slathered', 'Ratio', 'life..<br', 'motifs', 'taming', 'nuptials.<br']


Looking at some words from our vocab, what issue do we find by only splitting on spaces?

*Your answer here*

: First, tokens are often fused with adjacent punctuation, for example 'Dan.', 'theme).', and 'gimmick!'. Such forms enter the vocabulary as separate types even though they duplicate the underlying word.

Also, as mentioned earlier, HTML tags and markup should not be included in an ideal vocabulary.

It is also undesirable for the vocabulary to contain tokens that combine digits with letters or punctuation, such as '1977;', '1983.Back', and '65-minute'.
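
To make the punctuation problem concrete, here is a small sketch that compares splitting on spaces with a simple regular-expression tokenizer. The pattern and the name `simple_tokenize` are assumptions for illustration; a grammar-aware tokenizer handles contractions and hyphenated words differently.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Invented sample reusing the fused tokens mentioned above.
sample = "Dan. loved the theme). What a gimmick!"

space_types = set(sample.split(' '))
regex_types = set(simple_tokenize(sample))

print(sorted(space_types))  # contains 'Dan.', 'theme).', 'gimmick!'
print(sorted(regex_types))  # contains 'Dan', 'theme', 'gimmick' plus separate punctuation
```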

Now, rather than naively splitting on spaces, we can use tools which are informed about English grammar rules to create a cleaner tokenization.

In [10]:
from nltk.tokenize import word_tokenize

pos_examples_tokenized = [word_tokenize(ex) for ex in pos_examples]
neg_examples_tokenized = [word_tokenize(ex) for ex in neg_examples]

print(pos_examples_tokenized[0])
print(neg_examples_tokenized[0])

['Bromwell', 'High', 'is', 'a', 'cartoon', 'comedy', '.', 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', ',', 'such', 'as', '``', 'Teachers', "''", '.', 'My', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'Bromwell', 'High', "'s", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', '``', 'Teachers', "''", '.', 'The', 'scramble', 'to', 'survive', 'financially', ',', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', "'", 'pomp', ',', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', ',', 'all', 'remind', 'me', 'of', 'the', 'schools', 'I', 'knew', 'and', 'their', 'students', '.', 'When', 'I', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', ',', 'I', 'immediately', 'recalled', '.........', 'at', '..........', 'High', '.', 'A', 'classic', 'line',

Looking at the first example we can see that things like apostrophes, periods, "n'ts" and ellipses are better handled.

Let's begin defining features for our model. The simplest features are simply if a word exists or not -- however, this will be very slow if we decide to use the whole vocabulary. Instead, let's create these features for the top 100 most common words. 

In [11]:
all_tokenized_words = [word for ex in pos_examples_tokenized for word in ex] + \
    [word for ex in neg_examples_tokenized for word in ex]

atw_counter = Counter(all_tokenized_words)
top100 = [tup[0] for tup in atw_counter.most_common(100)] # A list of the top 100 most frequent words

print(top100)

['the', ',', '.', 'and', 'a', 'of', 'to', 'is', '/', '>', '<', 'br', 'in', 'I', 'it', 'that', "'s", 'this', 'was', 'The', 'as', 'with', 'movie', 'for', 'film', ')', '(', 'but', "n't", "''", '``', 'on', 'you', 'are', 'not', 'have', 'his', 'be', 'he', '!', 'one', 'at', 'by', 'all', 'an', 'who', 'they', 'from', 'like', 'It', 'her', 'so', 'or', 'about', 'has', 'just', 'out', '?', 'do', 'This', 'some', 'good', 'more', 'very', 'would', 'what', 'there', 'up', 'can', 'which', 'when', 'time', 'she', 'had', 'if', 'only', 'really', 'story', 'were', 'their', 'even', 'see', 'no', 'my', 'me', 'does', "'", 'did', ':', '-', 'than', '...', 'much', 'been', 'could', 'into', 'get', 'will', 'we', 'other']


Use the following block to define your own features for the NB model.

In [12]:
# Define features here

def top100_word_features(example): # 100 features, 1 for each word in the top 100 most frequent words
    # `example` is a raw string here, so `word in example` is a substring test, not a token match
    return {word : 1 if word in example else 0 for word in top100}

''' Define your own methods here, which take in a single example, and return a feature value (could be a 0/1 truth value, or a count)
    Some ideas:
        Look at the length of examples, is there a difference between positive and negative examples?
        Are there specific words that could be very indicative? They may not be in the top 100. 
'''

def word_feature(example):
    sens_words = ["happy", "exciting", "joyful", "fantastic", "awesome", "terrific", "sad", "disappointing", "frustrating", "terrible", "horrible"]
    return {word : 1 if word in example else 0 for word in sens_words}

def length_feature(example):
    # One feature per space-separated token: 1 if the token has at least 7 characters, else 0
    words = example.split(' ')
    feature_dict = {word: 1 if len(word) >= 7 else 0 for word in words}
    return feature_dict

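def score_feature(example):
    score_list = ["0/10", "1/10", "2/10", "3/10", "4/10", "5/10", "6/10", "7/10", "8/10", "9/10", "10/10"]
    tokens = set(example.split(' '))  # compare whole tokens; iterating over the string itself would yield single characters
    return {score : 1 if score in tokens else 0 for score in score_list}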

def create_feature_dictionary(example):
    features = {}
    for feat in [top100_word_features, word_feature, length_feature]: #Once you've created your methods, add them to this list
        features.update(feat(example))
    return features
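
Two details of the feature functions above are worth illustrating: `word in example` on a raw string is a substring test, and `dict.update` overwrites keys shared between feature functions. The sketch below shows one possible alternative that matches whole tokens and namespaces its keys; the helper `presence_features` and its `prefix` parameter are hypothetical, not part of the assignment.

```python
def presence_features(example, vocabulary, prefix="has"):
    """Binary features: 1 if a vocabulary word occurs as a whole token of the example."""
    tokens = set(example.split())
    # Prefixed keys cannot collide with keys produced by other feature functions.
    return {f"{prefix}({word})": int(word in tokens) for word in vocabulary}

review = "What great acting"

print('a' in review)                               # True: 'a' is a substring of "What"
print(presence_features(review, ['a', 'great']))   # {'has(a)': 0, 'has(great)': 1}
```

Switching to token matching changes the feature values, so any accuracy measured with the original functions would need to be re-measured.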


Now that we've defined our features for our model, we can create our final dataset, which will consist of extracted features and the example label. 

We'll also create a *validation* split by taking 20% of the training dataset. Remember, we never use the test set to make modeling decisions (in this case, decisions about features). Experiment with multiple models that make use of different combinations of features. Measure their performance on the validation split to figure out which features are the most helpful (use the show_most_informative_features function). When you've found your final model, evaluate its performance on the held out data. 
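
For intuition about what the classifier does with these feature dictionaries, here is a minimal Naive Bayes sketch with add-alpha smoothing. It is a simplified illustration, not NLTK's implementation (NLTK's estimator and handling of missing features may differ), and the names `train_nb` and `predict_nb` and the toy data are assumptions.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(labeled_features, alpha=1.0):
    """Collect label counts and per-label counts of each (feature, value) pair."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> Counter of values
    values_seen = defaultdict(set)       # feature name -> set of observed values
    for features, label in labeled_features:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen, alpha

def predict_nb(model, features):
    """Return the label with the highest log prior plus summed log likelihoods."""
    label_counts, value_counts, values_seen, alpha = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = log(count / total)  # log P(label)
        for name, value in features.items():
            if name not in values_seen:
                continue  # skip features never seen during training
            seen = value_counts[(label, name)]
            # Smoothed estimate of P(feature = value | label)
            numerator = seen[value] + alpha
            denominator = sum(seen.values()) + alpha * len(values_seen[name])
            score += log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data in the same (feature dict, label) format; 1 = positive, 0 = negative.
toy_train = [
    ({'great': 1, 'awful': 0}, 1),
    ({'great': 1, 'awful': 0}, 1),
    ({'great': 0, 'awful': 1}, 0),
    ({'great': 0, 'awful': 1}, 0),
]
toy_model = train_nb(toy_train)
print(predict_nb(toy_model, {'great': 1, 'awful': 0}))  # 1
print(predict_nb(toy_model, {'great': 0, 'awful': 1}))  # 0
```

The "naive" part is the sum over features: each feature contributes independently to the score given the label.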

In [13]:
from nltk.classify import NaiveBayesClassifier
import random

# Convert training examples to a set of features.
train = [(create_feature_dictionary(ex), 0) for ex in neg_examples] + \
                [(create_feature_dictionary(ex), 1) for ex in pos_examples]

random.seed(42)
random.shuffle(train)

split_percent = .2

cutoff = int(split_percent * len(train))

validation_set = train[:cutoff]
training_set = train[cutoff:]

model = NaiveBayesClassifier.train(training_set)

In [14]:
from nltk.classify.util import accuracy

print('Validation accuracy: {}'.format(accuracy(model, validation_set)))
model.show_most_informative_features(10)

Validation accuracy: 0.862
Most Informative Features
                   Avoid = 0                   0 : 1      =     43.5 : 1.0
                    4/10 = 0                   0 : 1      =     37.7 : 1.0
                     Uwe = 0                   0 : 1      =     27.1 : 1.0
                  10/10. = 0                   1 : 0      =     25.6 : 1.0
                 Matthau = 1                   1 : 0      =     25.6 : 1.0
                  awful. = 0                   0 : 1      =     25.4 : 1.0
                Terrible = 1                   0 : 1      =     25.1 : 1.0
                  Highly = 0                   1 : 0      =     24.9 : 1.0
               Excellent = 1                   1 : 0      =     24.5 : 1.0
                 stinker = 1                   0 : 1      =     24.4 : 1.0


Describe the sets of features you've considered, and note down their performance below. What is the final set of features you found?

*Your answer here*

: With only the default 'top100_word_features', the validation accuracy is 0.6112.

With 'word_feature', a list of 11 emotional words chosen to help separate positive from negative reviews, accuracy improved slightly to 0.6348.

With 'length_feature', which creates one feature per space-separated token and sets it to 1 when the token has at least seven characters, accuracy rose substantially to 0.8616. Because these features are keyed by the token itself, they also record which words occur in a review, which likely explains much of the gain.

Finally, I tested 'score_feature', intended as a straightforward way to separate positive and negative reviews by their numeric ratings. However, it reached an accuracy of only 0.6216.

Of the combinations tried, the highest validation accuracy, 0.862, came from adding both 'word_feature' and 'length_feature'. The remaining combinations scored lower.

Hence, the final feature set is 'top100_word_features', 'word_feature', and 'length_feature'.

Finally, test your model on the test set. 

In [15]:
# Load and process test data
pos_test_examples = read_folder(data_dir + 'test/pos/')
neg_test_examples = read_folder(data_dir + 'test/neg/')

test_set = [(create_feature_dictionary(ex), 0) for ex in neg_test_examples] + \
                [(create_feature_dictionary(ex), 1) for ex in pos_test_examples]

In [16]:
print('Test set accuracy: {}'.format(accuracy(model, test_set)))

# Note that we're looking at accuracy -- this is not always the most reliable metric and other choices like F1 might be more informative. 


Test set accuracy: 0.84004
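
As the comment in the cell above notes, accuracy is not always the most informative metric. The following sketch computes precision, recall, and F1 for the positive class from two label lists; the function name `precision_recall_f1` and the toy labels are assumptions for illustration.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall, and F1 for the class labelled `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, invented for illustration: 1 = positive, 0 = negative.
gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, f1)
```

In this notebook, the gold labels could be read from `test_set`, and predictions could be obtained with the classifier's `classify_many` method applied to the feature dictionaries.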
