# Lab 1: Classification! (and some n-gram math)

9/25/2023, Felix Muzny, Ankit Ramakrishnan, Nidhi Bodar, Harshitha Somala

Agenda
------
+ Detecting the end of a sentence
    - Rule-based classifier
+ Detecting the sentiment of a sentence
    - Rule-based classifier (counting words)
    - Measuring Accuracy, Precision, Recall (evaluating a classifier)
+ N-gram Math (getting started on things for HW 3)


Looking ahead, we'll be focusing on *classification* for much of the next several weeks. Classification can take several forms. Here are some vocabulary terms to get you started:

- __classifier__: a model that takes data (text, in NLP) as input and outputs a category
- __binary classification__: a model that takes input and outputs *one of two* categories (e.g. "positive" or "negative")
- __multinomial classification__: a model that takes input and outputs *one of many* categories (e.g. "positive", "neutral", or "negative"; or a language model that chooses one token from the entire vocabulary)


- __rule-based classifier__: a classifier that functions based on rules that humans come up with (e.g. "the end of a sentence is when there is a "." ")
- __statistical classifier__: a classifier that functions based on counts (statistics) that it has gathered or based on running an algorithm to automatically train parameters on a given data set. 
    
In this lab, you'll be building rule-based classifiers and evaluating them. We'll learn about our first statistical classifier next lecture.

All tasks have equal weight.

# Task 0: Who is in your group?

Dave Budhram, Akshay Dupuguntla, Mario Gonzalez

# Task 1: Detecting the end of a sentence


A classifier is, in essence, a function that takes some data $x$ and assigns some label $y$ to it. For a binary classifier, we can model this as a function that takes a data point $x$ and returns either `True` or `False`.

Later in this class we'll learn about how to build classifiers that automatically learn how to do this, but we'll start where NLP started—writing some rule-based classifiers.

In [1]:
def classify_sentence_end(text: str, target_index: int) -> bool: 
    """
    Classify whether or not a *location* is the end of a sentence within
    a given text
    Parameters:
        text - string piece of text
        target_index - int candidate location
    returns true if the target index is the end of a sentence. 
    False otherwise. 
    """
    # TODO: write a simple, rule-based classifier that
    # decides whether or not a specific location is the 
    # end of a sentence
    end_strings = [".", "?", "!"]
    return text[target_index] in end_strings

# look at the code in the cell below to see example usage

In [2]:
# example text
# feel free to go through different examples

# This is the given example text
"""Stocks were up as advancing issues outpaced declining issues 
          on the NYSE by 1.5 to 1. Large- and small-cap stocks were both strong, 
          while the S.&P. 500 index gained 0.46% to finish at 2,457.59. Among 
          individual stocks, the two top percentage gainers in the S.&P. 500 
          were Incyte Corporation and Gilead Sciences Inc."""

example = "Stocks were up as advancing issues outpaced declining issues on the NYSE by 1.5 to 1. Large- and small-cap stocks were both strong, while the S.&P. 500 index gained 0.46% to finish at 2,457.59. Among individual stocks, the two top percentage gainers in the S.&P. 500 were Incyte Corporation and Gilead Sciences Inc."

# this code will go through and
# build up a string based on the sentence
# decisions that your classifier comes up with
# it will put "****" between the sentences
# you do not need to modify any code here
so_far = ""
for index in range(len(example)):
    # see how the classify_sentence_end function is called!
    result = classify_sentence_end(example, index)
    so_far += example[index]
    if result:
        print(so_far)
        print("****")
        so_far = ""
        
print(so_far)

Stocks were up as advancing issues outpaced declining issues on the NYSE by 1.
****
5 to 1.
****
 Large- and small-cap stocks were both strong, while the S.
****
&P.
****
 500 index gained 0.
****
46% to finish at 2,457.
****
59.
****
 Among individual stocks, the two top percentage gainers in the S.
****
&P.
****
 500 were Incyte Corporation and Gilead Sciences Inc.
****



1. How many sentences are detected using your end of sentence classifier? 10 sentences
2. Where did your end of sentence classifier make a mistake? It split at every period, so it wrongly ended sentences inside decimal numbers (1.5, 0.46, 2,457.59) and inside the abbreviation S.&P.
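
One hedged way to reduce these mistakes is to look at the characters that follow the punctuation mark. The sketch below is an assumption, not the lab's reference solution: the name `classify_sentence_end_v2` and the sample text are made up for illustration. It treats `.`, `?`, or `!` as a sentence end only when it is the last character of the text or is followed by a space and an uppercase letter.

```python
def classify_sentence_end_v2(text: str, target_index: int) -> bool:
    """Hypothetical refinement of the rule above (not the reference solution)."""
    if text[target_index] not in ".?!":
        return False
    # look at up to two characters after the punctuation mark
    following = text[target_index + 1:target_index + 3]
    # punctuation at the very end of the text ends a sentence
    if following == "":
        return True
    # otherwise require a space and then an uppercase letter
    return len(following) == 2 and following[0] == " " and following[1].isupper()


# made-up sample text that mixes decimals and an abbreviation
sample = "The index gained 0.46% to finish at 2,457.59. Among stocks in the S.&P. 500, one rose."
sentence_ends = [i for i in range(len(sample)) if classify_sentence_end_v2(sample, i)]
print(len(sentence_ends))
```

This rule still fails on titles such as "Mr. Smith", where a period is followed by a space and a capital letter, so it reduces the errors rather than eliminating them.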

Task 2: Determining Sentiment
----

In [3]:
# we'll use nltk to access the reviews that we want to classify eventually
import nltk
import nltk.corpus as corpus

In [4]:
def load_word_list(filename):
    """
    Loads a lexicon from a plain text file in the format of one word per line.
    Parameters:
    filename (str): path to file

    Returns:
    list: list of words
    """
    with open(filename, 'r', encoding="utf-8") as f:
        # skip the header content
        for line in f:
            if line.strip() == "":
                break
        # read the rest of the lines into a list
        return [line.strip() for line in f]
    # otherwise return an empty list
    return []

In [5]:
# load in the positive and negative word lists here
# TODO: the paths to your negative/positive word files here
neg_lex = load_word_list("./negative_words.txt")
pos_lex = load_word_list("./positive_words.txt")

# TODO: How many words are in each list?
print(len(neg_lex))
print(len(pos_lex))


# TODO: Use python's list slicing to look at the first 10 elements in each list
neg_first_ten = neg_lex[:10]
pos_first_ten = pos_lex[:10]

print(neg_first_ten)
print(pos_first_ten)


4783
2006
['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted']
['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation']


In [6]:
# TODO: which words are in both the positive and the negative lists?
in_both_lists = [x for x in neg_lex if x in pos_lex]
print(in_both_lists)


['envious', 'enviously', 'enviousness']
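
The same overlap can be found with set intersection, which avoids rescanning `pos_lex` for every negative word. A minimal sketch, using tiny made-up stand-in lists rather than the real lexicon files:

```python
# toy stand-ins for neg_lex and pos_lex (hypothetical values, not the real files)
neg_words = ["abnormal", "envious", "wasted"]
pos_words = ["acclaim", "envious", "good"]

# set membership is a hash lookup, so the intersection takes time roughly
# proportional to the sizes of the two lists
overlap = sorted(set(neg_words) & set(pos_words))
print(overlap)
```

Converting the lexicons to sets once should also speed up `token in pos_lexicon` checks inside a classifier, since membership in a list scans the whole list.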


Now, we'll create our rule-based classifier! We have access to the word lists that you loaded and anything else you know about the world (reflect on how you as a human being can tell if a review is positive/negative). Your classifier need not be perfect, but it should be reasonable (don't just say everything is positive!).

In [7]:
def rule_based_classify(tokens, pos_lexicon, neg_lexicon, verbose = False):
    """
    This function classifies a given tokenized text as positive or negative
    based on the provided lexicons.
    Parameters:
    tokens (list) - list of strings tokenized words in the text to classify
    pos_lexicon (list) - list of strings words in the positive word lexicon
    neg_lexicon (list) - list of strings words in the negative word lexicon
    verbose (boolean) - flag indicating whether or not to print verbose (debugging) output. 
            Default value False.
    Returns:
    string "pos" if the list of tokens is positive overall, "neg" if they are negative overall.
    """
    # TODO: implement this function! This is our classifier.
    count = 0
    for token in tokens:
        if token in pos_lexicon:
            count += 1
            if verbose:
                print(token + " is positive")
        if token in neg_lexicon:
            count -= 1
            if verbose:
                print(token + " is negative")
    if verbose:
        print("Count so far: " + str(count))
    return "pos" if (count >= 0) else "neg"

In [8]:
# now, we'll test out your classifier!
# Here are two example movie reviews.
nltk.download('movie_reviews')
movies = corpus.movie_reviews

# load in a single negative review
negative_toks = movies.words('neg/cv001_19502.txt')
# uncomment the text below to see the contents of the review
# neg_text = " ".join(negative_toks)
# print(neg_text)

# load in a single positive review
positive_toks = movies.words('pos/cv992_11962.txt')
# pos_text = " ".join(positive_toks)


# TODO:
# call your rule_based_classify on these example reviews.
first_classifier = rule_based_classify(negative_toks, pos_lex, neg_lex, True)
second_classifier = rule_based_classify(positive_toks, pos_lex, neg_lex, True)

print("negative review is " + first_classifier)
print("positive review is " + second_classifier)

# Does our classification function label them correctly? Why or why not?
# take a look at the contents of the reviews
# The classifier labels the positive review correctly but mislabels the negative review as positive,
# because the negative review contains many seemingly positive words (e.g. "like", "good", "pretty").

happy is positive
bastard is negative
damn is negative
bug is negative
virus is negative
like is positive
flashy is positive
likes is positive
work is positive
wasted is negative
well is positive
like is positive
good is positive
like is positive
pretty is positive
sunken is negative
Count so far: 4
unexpected is negative
scary is negative
smart is positive
cheesy is negative
criticism is negative
wonder is positive
laughable is negative
trash is negative
well is positive
well is positive
intelligent is positive
critics is negative
complaining is negative
lack is negative
realistic is positive
bright is positive
handsome is positive
succeed is positive
well is positive
succeeded is positive
perfect is positive
cold is negative
rocky is negative
unfortunately is negative
loose is negative
creeps is negative
shrug is negative
blame is negative
strong is positive
lie is negative
led is positive
respectable is positive
attack is negative
wise is positive
stronger is positive
stronger is po

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/davebudhram/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


1. What labels does your classifier assign these two reviews? The classifier labels both reviews as positive ("pos").
2. Are these correct? Only partly: the positive review is labeled correctly, but the negative review is mislabeled as positive.

Task 3: How good is your sentiment classifier?
-----

Given the movies dataset from `nltk`, how many of the reviews does your classifier classify correctly?

We'll look at three different metrics: __accuracy__, __precision__, and __recall__.

__accuracy__: what you think of when you think of correctness.
$$ \frac{\texttt{number correct}}{\texttt{total number}}$$

Precision and recall require differentiating between the ways in which the classifier can be correct or incorrect.

- __true positive__: an example whose gold label is positive and that the classifier labels as positive
- __true negative__: an example whose gold label is negative and that the classifier labels as negative
- __false positive__: an example whose gold label is negative and that the classifier labels as positive
- __false negative__: an example whose gold label is positive and that the classifier labels as negative
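
These four definitions can be turned into a small counting helper. The sketch below is illustrative only: `confusion_counts` and the toy label lists are assumptions, and it treats `"pos"` as the positive class.

```python
def confusion_counts(gold: list, predicted: list, positive: str = "pos") -> dict:
    """Count tp, tn, fp, fn for parallel lists of gold and predicted labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for gold_label, pred_label in zip(gold, predicted):
        if pred_label == positive:
            # predicted positive: correct only if the gold label agrees
            counts["tp" if gold_label == positive else "fp"] += 1
        else:
            # predicted negative: correct only if the gold label is also negative
            counts["tn" if gold_label != positive else "fn"] += 1
    return counts


# toy labels, made up for illustration
gold = ["pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "neg", "pos", "neg", "neg"]
print(confusion_counts(gold, predicted))
```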

In [9]:
import random
# you can use numpy's random functionality if you'd like to
import numpy as np

In [10]:
# To see the available file ids, this is one way that we can access them.
# This will give you a list of neg/positive file ids.
print(len(movies.fileids('neg')))
# choose 100 random items without replacement from a list
print(random.sample(movies.fileids('neg'), 100))
print(len(movies.fileids('pos')))

1000
['neg/cv114_19501.txt', 'neg/cv033_25680.txt', 'neg/cv683_13047.txt', 'neg/cv545_12848.txt', 'neg/cv435_24355.txt', 'neg/cv566_8967.txt', 'neg/cv976_10724.txt', 'neg/cv787_15277.txt', 'neg/cv918_27080.txt', 'neg/cv989_17297.txt', 'neg/cv254_5870.txt', 'neg/cv039_5963.txt', 'neg/cv151_17231.txt', 'neg/cv280_8651.txt', 'neg/cv669_24318.txt', 'neg/cv916_17034.txt', 'neg/cv759_15091.txt', 'neg/cv287_17410.txt', 'neg/cv187_14112.txt', 'neg/cv451_11502.txt', 'neg/cv092_27987.txt', 'neg/cv681_9744.txt', 'neg/cv093_15606.txt', 'neg/cv324_7502.txt', 'neg/cv906_12332.txt', 'neg/cv360_8927.txt', 'neg/cv822_21545.txt', 'neg/cv407_23928.txt', 'neg/cv196_28898.txt', 'neg/cv538_28485.txt', 'neg/cv903_18981.txt', 'neg/cv215_23246.txt', 'neg/cv497_27086.txt', 'neg/cv181_16083.txt', 'neg/cv632_9704.txt', 'neg/cv733_9891.txt', 'neg/cv496_11185.txt', 'neg/cv973_10171.txt', 'neg/cv115_26443.txt', 'neg/cv138_13903.txt', 'neg/cv946_20084.txt', 'neg/cv300_23302.txt', 'neg/cv426_10976.txt', 'neg/cv957_905

In [11]:
# TODO:
# Write code that uses your classifier to classify 100 randomly chosen
# negative reviews and 100 randomly chosen positive reviews
# count the number of true positives, true negatives, false positives, and false negatives

# to get the tokens associated with a certain file id,
# tokens = movies.words(file_id)

# takes a long time to run if you loop over all fileids as opposed to just
# 100 randomly chosen ones
# make sure you don't classify the same review twice!
# (it takes us about 10 seconds to classify 200 reviews on a 2020 macbook air)
pos_list = random.sample(movies.fileids('pos'), 100)
neg_list = random.sample(movies.fileids('neg'), 100)

true_postive_count = 0
true_negative_count = 0
false_postive_count = 0
false_negative_count = 0

for file in pos_list:
    tokens = movies.words(file)
    result = rule_based_classify(tokens, pos_lex, neg_lex)
    if (result == "pos"):
        true_postive_count += 1
    else:
        false_negative_count += 1


for file in neg_list:
    tokens = movies.words(file)
    result = rule_based_classify(tokens, pos_lex, neg_lex)
    if (result == "neg"):
        true_negative_count += 1
    else:
        false_postive_count += 1


    
# TODO: print out the number of true positives, false positives,
# false negatives, and true negatives

print("True positive counts:" + str(true_postive_count))
print("False positive counts:" + str(false_postive_count))
print("False negative counts:" + str(false_negative_count))
print("True negative counts:" + str(true_negative_count))


True positive counts:62
False positive counts:28
False negative counts:38
True negative counts:72


Here are the equations for accuracy, precision, and recall in terms of what we've just been counting. $tp$ means true positive, $fp$ means false positive, $fn$ means false negative, and $tn$ means true negative.

$$ accuracy = \frac{tp + tn}{tp + fp + fn + tn}$$

$$ precision = \frac{tp}{tp + fp}$$

$$ recall = \frac{tp}{tp + fn}$$

You can think of precision as "how many of my positive guesses were correct?" and recall as "how many of the positive examples did I find?" 😄
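
As a sketch, the three formulas can be wrapped in one helper. The function name `evaluate` is hypothetical, and returning 0.0 when a denominator is zero is one possible convention (an assumption here), chosen so that a classifier that never predicts the positive class does not raise a `ZeroDivisionError`.

```python
def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, and recall from the four counts."""
    def safe_div(numerator: int, denominator: int) -> float:
        # assumed convention: report 0.0 when the denominator is zero
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
    }


# the counts printed in the run above
metrics = evaluate(tp=62, fp=28, fn=38, tn=72)
print(metrics)
```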

In [12]:
# TODO: calculate and print accuracy
accuracy = (true_postive_count + true_negative_count) / (true_postive_count + false_postive_count + false_negative_count + true_negative_count)
print(accuracy)

0.67


In [13]:
# TODO: calculate and print precision
precision = (true_postive_count) / (true_postive_count + false_postive_count)
print(precision)


0.6888888888888889


In [14]:
# TODO: calculate and print recall
recall = (true_postive_count) / (true_postive_count + false_negative_count)
print(recall)

0.62


Task 4: n-gram math
----

Your final task in this lab is to do some math that will help you with your n-gram language model homework. Remember in HW 1 how you implemented a `count_list` function? Some of you were clever with how you implemented it, but let's look at a less clever implementation.

In [15]:
import time
from collections import Counter

def count_list(ls: list) -> dict:
    counts = {}
    for item in ls:
        # we're not going to be clever about counting here,
        # no conditionals, no sets, nothing
        counts[item] = ls.count(item)
    return counts

# see the difference between the following two items
example = [random.randint(0, 100) for i in range(2000)]
start = time.time()
count_list(example)
end = time.time()
print("That took:", end - start, "seconds!")

# this takes a very similar amount of time to count_dict from HW 1
start = time.time()
Counter(example)
end = time.time()
print("That took:", end - start, "seconds!")

That took: 0.029832124710083008 seconds!
That took: 0.0001220703125 seconds!
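
The gap between the two timings comes from how often the list is scanned: `ls.count(item)` walks the entire list once per element, so the work grows quadratically with the list length, while a single pass is enough. A minimal single-pass sketch (the name `count_list_single_pass` is made up here and is not the HW 1 solution):

```python
from collections import Counter


def count_list_single_pass(ls: list) -> dict:
    """Count occurrences by visiting each item exactly once."""
    counts = {}
    for item in ls:
        # dict.get supplies 0 the first time an item is seen
        counts[item] = counts.get(item, 0) + 1
    return counts


tokens = ["<s>", "I", "love", "dogs", "</s>", "<s>", "I", "love", "cats", "</s>"]
counts = count_list_single_pass(tokens)
print(counts["love"])
# the result agrees with collections.Counter
print(counts == Counter(tokens))
```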


In [16]:
# TODO: put your create_ngrams (or make_ngrams) function here!
def make_ngrams(tokens: list, n: int) -> list:
    """Creates n-grams for the given token sequence.
    Args:
    tokens (list): a list of tokens as strings
    n (int): the length of n-grams to create

    Returns:
    list: list of tuples of strings, each tuple being one of the individual n-grams
    """
    # TODO: implement this function!
    # slide a window of length n over the tokens; building each n-gram with
    # tuple(...) avoids shadowing the built-in name `tuple` with a local variable
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

In [17]:
# TODO: calculate the bigram score of the following sequence of tokens
# for this example, we'll use a "vanilla" scoring technique
# no Laplace smoothing, no unknown tokens
training_data = ["<s>", "I", "love", "dogs", "</s>", "<s>", "I", "love", "cats", "</s>", "<s>", "I", "love", "dinosaurs", "</s>"]

# TODO: call your create_ngrams function to get your bigrams
bigrams = make_ngrams(training_data, 2)
count_bigrams = count_list(bigrams)



to_score = ["<s>", "I", "love", "cats", "</s>"]
start = time.time()
training_data_counts = count_list(training_data)

# BEGIN SCORING SECTION
# start probability at one so that we can multiply the probability of
# each subsequent next token with it
total_prob = 1
for i in range(1, len(to_score)):
    # TODO: YOUR SCORE CALCULATION CODE HERE
    curr_count = training_data_counts[to_score[i - 1]]
    curr_bigram = (to_score[i-1], to_score[i])
    bigram_count = count_bigrams[curr_bigram]
    total_prob *= bigram_count / curr_count

    

# END SCORING SECTION
end = time.time()

# print your final probability
print("Final probability:", total_prob)
print("That took", end - start, "seconds!")


Final probability: 0.3333333333333333
That took 7.414817810058594e-05 seconds!
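
To see where this value comes from, the product can be written out by hand. Each factor is a bigram count divided by the count of the bigram's first token, with $C(\cdot)$ denoting a count read off `training_data`:

$$ P = \frac{C(\langle s\rangle\ \text{I})}{C(\langle s\rangle)} \cdot \frac{C(\text{I love})}{C(\text{I})} \cdot \frac{C(\text{love cats})}{C(\text{love})} \cdot \frac{C(\text{cats}\ \langle /s\rangle)}{C(\text{cats})} = \frac{3}{3} \cdot \frac{3}{3} \cdot \frac{1}{3} \cdot \frac{1}{1} = \frac{1}{3} $$

Only the "love cats" factor is below 1, because "love" is followed by "cats" in just one of its three occurrences.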


In [18]:
# Finally, pretend that we had a lot more data
training_data = ["<s>", "I", "love", "dogs", "</s>", "<s>", "I", "love", "cats", "</s>", "<s>", "I", "love", "dinosaurs", "</s>"]
# this is the amount of training data in the berp set
training_data = training_data * 3778

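# TODO: call your create_ngrams function here
bigrams2 = make_ngrams(training_data, 2)
# gather the counts for the larger data set once, before any scoring;
# Counter (imported above) counts in a single pass, unlike count_list
training_data_counts = Counter(training_data)
count_bigrams = Counter(bigrams2)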


print("Number of training tokens:", len(training_data))
start = time.time()
# and what if we had 3000 sentences to score?
for example_num in range(3000):
    # TODO: COPY AND PASTE YOUR SCORING CODE HERE (between "BEGIN SCORING SECTION" and "END SCORING SECTION")
    # (remove any print statements that you have)
    # (make sure it is appropriately indented)
    total_prob = 1
    for i in range(1, len(to_score)):
        # TODO: YOUR SCORE CALCULATION CODE HERE
        curr_count = training_data_counts[to_score[i - 1]]
        curr_bigram = (to_score[i-1], to_score[i])
        bigram_count = count_bigrams[curr_bigram]
        total_prob *= bigram_count / curr_count


    
    

end = time.time()
print("That took", end - start, "seconds!")

Number of training tokens: 56670
That took 0.00432586669921875 seconds!


What's the moral of the story? If you perform your counts at the same time you score, you'll be doing the same work over and over again, which will result in a significantly slower model!

Make sure that you're gathering the counts that you need in `train` and only performing scoring calculations (as opposed to also counting things) in `score`.

This is particularly important when using larger data sets! (berp is not that big)
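
The split between counting and scoring can be sketched as a tiny class. This is an illustration under assumptions, not the HW 3 starter code: the class name `TinyBigramModel` and its method bodies are made up, and it uses the same "vanilla" scoring as above (no smoothing, no unknown tokens), so it assumes every context token was seen in training.

```python
from collections import Counter


class TinyBigramModel:
    """Hypothetical minimal bigram model; counts in train, arithmetic in score."""

    def __init__(self):
        self.unigram_counts = Counter()
        self.bigram_counts = Counter()

    def train(self, tokens: list) -> None:
        # all counting happens here, exactly once
        self.unigram_counts = Counter(tokens)
        self.bigram_counts = Counter(zip(tokens, tokens[1:]))

    def score(self, tokens: list) -> float:
        # only lookups and multiplication; no pass over the training data
        prob = 1.0
        for prev, curr in zip(tokens, tokens[1:]):
            prob *= self.bigram_counts[(prev, curr)] / self.unigram_counts[prev]
        return prob


training = ["<s>", "I", "love", "dogs", "</s>", "<s>", "I", "love", "cats", "</s>",
            "<s>", "I", "love", "dinosaurs", "</s>"]
model = TinyBigramModel()
model.train(training)
print(model.score(["<s>", "I", "love", "cats", "</s>"]))
```

With this layout, scoring thousands of sentences costs only dictionary lookups, regardless of how large the training data is.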