# NLP HWs 4.1 Text Classification with NB

Across these three homeworks, you will add one new model or input representation each week and use it to perform text classification on the same datasets.


In [1]:
# util contains data loading functions and classes
from util import load_data, Dataset
import numpy as np
from tqdm import tqdm # show progress bar

# HW4.1 Naive Bayes



## PART I: Implement NB from scratch

### Details about the Triage dataset

The documents in our dataset are either text messages, social media (Twitter) posts, or snippets from news articles. In addition to the specific events listed above, the dataset contains a number of news articles spanning dozens of different disasters. All messages have been translated and annotated by humans on the crowdsourcing platform CrowdFlower (now branded under Appen). However, some of the translations are imperfect, and you may encounter words in other languages. Unfortunately, NLP researchers often have to work with messy data. If you are curious about the crowdsourced translation effort for messages from Haiti in particular, see this paper (https://nlp.stanford.edu/pubs/munro2010translation.pdf).

<b>Your task is to classify each document as being aid-related, class AID, or not aid-related, class NOT.</b> Aid-related messages include individuals' requests for food, water, shelter, etc. The AID class also includes news reports about dire situations and disaster relief efforts.

<b>Training and Validation sets.</b> The data is divided into a training set, a development (validation) set, and a test set. Recall that the training set is used to learn your model, that is, to compute its statistics. These statistics are then used to classify the documents in the development and test sets. For this assignment, you should train on the training set and evaluate your model on both the training and development sets.



### Dataset exploration

We use classes defined in ```util.py``` to load data and labels. Take a look at that module for a deeper understanding of what each class contains. Here are some example usages to get you started.

In [2]:
# load data
from util import load_data
dataset = load_data("./data/triage")

# explore the dataset class
# dataset contains dataset.train and dataset.dev
train_data = dataset.train

# train_data is a list of Example objects (defined in util.py); there are 21046 training examples
print(type(train_data))
print(type(train_data[0]))
print(len(train_data))

# you can do the same to explore dataset.dev
# you should use only dataset.train for training
# and you can test your model on both train and dev

<class 'list'>
<class 'util.Example'>
21046


In [3]:
# development (validation) set
dev_data = dataset.dev

print(type(dev_data))
print(type(dev_data[0]))
print(len(dev_data))

<class 'list'>
<class 'util.Example'>
2573


In [4]:
# look at each example, let's look at the first one
first_data_point = train_data[0]

# each example has two parts: the words and label. 
print("words:",first_data_point.words)
print("label:",first_data_point.label)

# look at another example
fifth_data_point = train_data[5]

# each example has two parts: the words and label. 
print("words:",fifth_data_point.words)
print("label:",fifth_data_point.label)

words: ['thisisjoej', 'earthquake', 'after', 'shock', 'hahaha', 'did', 'i', 'do', 'it', 'right', 'joe']
label: 0
words: ['thanks', 'to', 'an', 'umbrella', 'grant', 'from', 'the', 'us', 'agency', 'of', 'international', "development's", 'office', 'of', 'foreign', 'disaster', 'assistance', 'usaid', 'ofda', 'umcor', 'is', 'reaching', 'the', 'most', 'vulnerable', 'to', 'provide', '60', '000', 'emergency', 'kits', 'that', 'include', 'water', 'glucose', 'and', 'biscuits', 'as', 'well', 'as', 'help', 'build', '4', '400', 'emergency', 'shelters', 'and', '800', 'duplex', 'toilets', 'for', 'displaced', 'people', 'fleeing', 'to', 'vavuniya']
label: 1


### Implementing NB

In our textbook SLP3, Chapter 4 (https://web.stanford.edu/~jurafsky/slp3/4.pdf), Section 4.2 describes training and testing an NB model. In this exercise, follow Section 4.2 and the algorithm outlined in Figure 4.2 to implement the Naive Bayes algorithm and perform text classification on the dataset provided in the ```data``` folder. I've included screenshots of the algorithm outlines below. You can read the textbook for a more detailed description.

#### Implement train function using this algorithm from Figure 4.2 in SLP3

<img src="img/function2.png" alt="Drawing" style="width: 500px;"/>

#### Implement inference function using this algorithm from Figure 4.2 in SLP3
<img src="img/function1.png" alt="Drawing" style="width: 500px;"/>
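
For reference, the quantities that the two functions below compute can be summarized as follows. This is a sketch in the notation of this notebook (`N_c`, `N_doc`, `V`), where $\mathrm{count}(w, c)$ is the number of occurrences of word $w$ in all training documents of class $c$; the textbook remains the authoritative presentation.

$$
\log \hat{P}(c) = \log \frac{N_c}{N_{doc}}
$$

$$
\log \hat{P}(w \mid c) = \log \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
$$

$$
c_{NB} = \underset{c \in C}{\arg\max} \left[ \log \hat{P}(c) + \sum_{w \in \text{test\_doc},\; w \in V} \log \hat{P}(w \mid c) \right]
$$

The $+1$ in the numerator and the $+|V|$ in the denominator are the add-one (Laplace) smoothing terms, and test words outside $V$ are skipped.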

In [5]:
def trainNB(dataset:Dataset,C=[0,1]) -> (dict, dict, set):
    """
    implement this function according to the algorithm outlined above. 

    for classes C, 1 is AID, 0 is NOT, as described above. 

    return log_prior, log_likelihood, and V as specified in the algorithm.
    """
    # pass
    log_prior = dict() # {0: -0.532, 1: -0.885}
    log_likelihood = dict()
    
    N_doc = len(dataset.train) # number of documents in D, should be 21046
    
    V = set() # vocabulary of D
    bigdoc = {0: list(), 1: list()}
    for doc in tqdm(dataset.train, desc ="Calculating V & bigdoc"):
        bigdoc[doc.label].extend(doc.words) # should use extend not append
        for word in doc.words:
            V.add(word)

    for c in tqdm(C, desc ="Calculating log_prior & log_likelihood"):
        N_c = sum(1 for doc in dataset.train if doc.label == c) # number of documents from D in class c
        log_prior[c] = np.log(N_c / N_doc)
        
        for w in V:
            count_w = bigdoc[c].count(w) # number of occurrences of w in bigdoc[c]
            log_likelihood[(w, c)] = np.log((count_w + 1) / (len(bigdoc[c]) + len(V)))
    
    return log_prior, log_likelihood, V
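
The training loop above calls `bigdoc[c].count(w)` once per vocabulary word, which rescans the whole class document each time. This is probably why the run recorded in this notebook takes several minutes. One possible alternative, sketched below, counts every word once with `collections.Counter`. The names `train_nb_counter` and `toy_train` are hypothetical, and `Example` here is only a stand-in for `util.Example`, assumed to expose `words` and `label`. Note that the sketch assumes every class has at least one training document.

```python
import math
from collections import Counter, namedtuple

# Hypothetical stand-in for util.Example so the sketch runs on its own.
Example = namedtuple("Example", ["words", "label"])


def train_nb_counter(train, classes=(0, 1)):
    """Add-one-smoothed Naive Bayes training with one Counter per class."""
    n_doc = len(train)
    vocab = {w for doc in train for w in doc.words}
    log_prior, log_likelihood = {}, {}
    for c in classes:
        docs_c = [doc for doc in train if doc.label == c]
        log_prior[c] = math.log(len(docs_c) / n_doc)
        # Count every token of class c in a single pass.
        counts = Counter(w for doc in docs_c for w in doc.words)
        denom = sum(counts.values()) + len(vocab)
        for w in vocab:
            # Counter returns 0 for words never seen in class c.
            log_likelihood[(w, c)] = math.log((counts[w] + 1) / denom)
    return log_prior, log_likelihood, vocab


# Toy data (made up for illustration only).
toy_train = [
    Example(["need", "water", "and", "food"], 1),
    Example(["need", "shelter"], 1),
    Example(["nice", "weather", "today"], 0),
]
log_prior, log_likelihood, vocab = train_nb_counter(toy_train)
print(log_likelihood[("need", 1)], log_likelihood[("need", 0)])
```

The returned dictionaries have the same shape as those from `trainNB`, so they should work with `NB_inference` unchanged.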

In [6]:
def NB_inference(test_doc:list, log_prior:dict, log_likelihood:dict, C:list, V:set) -> int:
    """
    implement this function to make an inference on a test example. it should return an integer, 0 or 1, these are the two possible classes in the dataset. 
    
    the test_doc argument is represented in the Example class using the words attribute, e.g., in above example in dataset exploration, the test_doc input would be first_data_point.words, which is a list of words

    the other arguments of this function, log_prior, log_likelihood, C, and V are all seen above in the trainNB() function. 
    
    """
    # pass
    sum_C = [0, 0]
    for c in C:
        sum_C[c] = log_prior[c]
        for w in test_doc:
            if w in V:
                sum_C[c] += log_likelihood[(w, c)]
    return int(np.argmax(sum_C)) # index of the highest score, i.e. class 0 or class 1 (assumes C == [0, 1])
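
The inference function adds log probabilities instead of multiplying raw probabilities. A small illustration of why, using made-up likelihood values: the product of many small numbers underflows to zero in floating point, while the sum of their logs stays finite and comparable across classes.

```python
import math

# 1000 hypothetical word likelihoods of 1e-4 each (made-up values).
probs = [1e-4] * 1000

product = math.prod(probs)                  # underflows to 0.0
log_sum = sum(math.log(p) for p in probs)   # finite, roughly -9210.34

print(product, log_sum)
```

If both classes underflowed to 0.0, the argmax would be meaningless, so working in log space is the safer choice for long documents.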

### training NB classifier



In [7]:
# example inference pipeline usage to evaluate your classifier
# you can run this as it is, or you are free to add more things to it. 

def testNB(split, log_prior, log_likelihood, V, C):
    """
    argument: split can be dataset.train or dataset.dev
    """
    inferences = []
    for d in split: # each document
        result = NB_inference(d.words, log_prior, log_likelihood, C, V)
        inferences.append(result)
    preds = np.array(inferences) # predicted
    gts = np.array([d.label for d in split]) # actual
    assert(len(preds) == len(gts))
    print("accuracy", sum(preds == gts) / len(gts))

dataset = load_data("./data/triage")
C = [0,1]
log_prior, log_likelihood, V = trainNB(dataset)

# evaluate your model on dev and train
testNB(dataset.dev, log_prior, log_likelihood, V, C)
testNB(dataset.train, log_prior, log_likelihood, V, C)


Calculating V & bigdoc: 100%|██████████| 21046/21046 [00:00<00:00, 212438.56it/s]
Calculating log_prior & log_likelihood: 100%|██████████| 2/2 [04:41<00:00, 140.68s/it]


accuracy 0.7329965021375826
accuracy 0.82946878266654
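
Accuracy figures like these are easier to interpret next to a simple reference point. The sketch below computes a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent training label. The helper name `majority_baseline_accuracy` and the toy labels are hypothetical; on the real data you could pass `[d.label for d in dataset.train]` and `[d.label for d in dataset.dev]`.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in eval_labels if y == majority) / len(eval_labels)


# Toy labels (made up for illustration only).
toy_train_labels = [0, 0, 0, 1, 1]
toy_dev_labels = [0, 1, 1, 0]
baseline = majority_baseline_accuracy(toy_train_labels, toy_dev_labels)
print("baseline accuracy", baseline)  # 0.5 on the toy labels
```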


## Tips

1. Training the model can take more than a minute (the run shown above took close to five minutes), so it is a good idea to use a progress bar to track training progress. You can use https://github.com/tqdm/tqdm

2. The expected accuracy is about 73% on dev data and about 83% on train data. If your numbers are close to these, you should be good to go.

3. ```util.py``` contains more functions and classes that are currently not used in this notebook. If you want to make use of them, feel free to do so.

## PART II: Extra credit (worth an extra 20% of this assignment)

Use the ```english.stop``` file in the ```data``` folder to remove stop words, then train again to see whether your accuracy improves.

In [8]:
# get the stop words

file_path = "./data/english.stop"

with open(file_path, "r") as file:
    stop_words = set(file.read().splitlines())

In [9]:
# new_trainNB function to remove stop words

def new_trainNB(dataset:Dataset,C=[0,1]) -> (dict, dict, set):
    """
    same as trainNB above, except that stop words (the module-level stop_words set) are removed from each document before counting.

    for classes C, 1 is AID, 0 is NOT, as described above. 

    return log_prior, log_likelihood, and V as specified in the algorithm.
    """
    log_prior = dict()
    log_likelihood = dict()
    
    N_doc = len(dataset.train) # number of documents in D, should be 21046
    
    V = set() # vocabulary of D
    bigdoc = {0: list(), 1: list()}
    for doc in tqdm(dataset.train, desc ="Calculating V & bigdoc"):
        
        no_stopwords = [w for w in doc.words if w not in stop_words] # keep the words that are not stop words
        
        bigdoc[doc.label].extend(no_stopwords) # stop words removed
        for word in no_stopwords:
            V.add(word)

    for c in tqdm(C, desc ="Calculating log_prior & log_likelihood"):
        N_c = sum(1 for doc in dataset.train if doc.label == c) # number of documents from D in class c
        log_prior[c] = np.log(N_c / N_doc)
        
        for w in V:
            count_w = bigdoc[c].count(w) # number of occurrences of w in bigdoc[c]
            log_likelihood[(w, c)] = np.log((count_w + 1) / (len(bigdoc[c]) + len(V)))
    
    return log_prior, log_likelihood, V
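
An alternative to duplicating the training function is to filter the examples first and then reuse the original `trainNB`, which only reads `dataset.train`. The sketch below illustrates the idea under stated assumptions: `Example` is a hypothetical stand-in for `util.Example`, `strip_stop_words` is a hypothetical helper, and `SimpleNamespace` is used only to mimic an object with a `train` attribute, since the constructor of `util.Dataset` is not shown here.

```python
from collections import namedtuple
from types import SimpleNamespace

# Hypothetical stand-in for util.Example so the sketch runs on its own.
Example = namedtuple("Example", ["words", "label"])


def strip_stop_words(examples, stop_words):
    """Return new examples with stop words removed; labels are unchanged."""
    return [
        Example([w for w in ex.words if w not in stop_words], ex.label)
        for ex in examples
    ]


# Toy data and a toy stop list (made up for illustration only).
toy_examples = [
    Example(["we", "need", "water"], 1),
    Example(["it", "is", "sunny"], 0),
]
toy_stop_words = {"we", "it", "is"}

filtered = strip_stop_words(toy_examples, toy_stop_words)
filtered_dataset = SimpleNamespace(train=filtered)  # mimics dataset.train
print(filtered)
```

The original examples are left untouched, so the filtered and unfiltered models can be trained and compared side by side.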

In [10]:
dataset = load_data("./data/triage")
C = [0,1]
log_prior, log_likelihood, V = new_trainNB(dataset) # use the new function for training
# NB_inference and testNB are unchanged: stop words are not in V, so NB_inference skips them at test time

# evaluate your model on dev and train
testNB(dataset.dev, log_prior, log_likelihood, V, C)
testNB(dataset.train, log_prior, log_likelihood, V, C)

Calculating V & bigdoc: 100%|██████████| 21046/21046 [00:00<00:00, 192474.69it/s]
Calculating log_prior & log_likelihood: 100%|██████████| 2/2 [02:52<00:00, 86.11s/it]


accuracy 0.7306645938593082
accuracy 0.8446735721752352


With stop words removed, dev accuracy stays at about 73% (73.07% vs. 73.30%, slightly lower) and train accuracy rises to about 84% (84.47% vs. 82.95%, slightly higher). Removing stop words therefore has little effect on accuracy in this case.