# edX Natural Language Processing Foundations – Assignment 3: Text Classification

Hello everyone, this assignment notebook covers the **Naive Bayes Classifier**. There are some code-completion tasks in this notebook. For each of these tasks, please write your answer (i.e., your lines of code) between the lines marked "Your code starts here" and "Your code ends here". The space between these two lines does not reflect the required or expected number of lines of code.

When you work on this notebook, you can insert additional code cells (e.g., for testing) or markdown cells (e.g., to keep track of your thoughts). However, please remove all those additional cells again before submission. Thanks!

**Important:**
* Remember to rename and save this Jupyter notebook as **A3_edXusername.ipynb** (e.g., **A3_bobsmith.ipynb**) before submission! Failure to do so will yield a penalty of 1 Point.
* Remember to rename and save the script file **A3_script.py** as **A3_edXusername.py** (e.g., **A3_bobsmith.py**) before submission! Failure to do so will yield a penalty of 1 Point.

Please also add your name and edX id in the code cell below. This is just to make any identification of your notebook doubly sure.

In [30]:
edx_username = 'XIONG_WEITAO'  # e.g., bobsmith, you can check this in edX `Account Settings`.

## Overview

Text classification is an extremely important task. For example, **spam detection** is one essential application that aims to assign an email to one of the two classes: *spam* or *not-spam*.

There are many popular methods to perform text classification, which can be divided into two categories: **Discriminative** methods and **Generative** methods. Discriminative methods like logistic regression concentrate on learning good features to discriminate between classes, while Generative methods focus on how a given class can produce the observed data. As [Richard Feynman](https://en.wikipedia.org/wiki/Richard_Feynman) said: *What I cannot create, I do not understand.*

In this assignment, you need to implement one of the simplest classifiers, the **Naive Bayes Classifier**, which belongs to the family of generative methods. The basic idea of this classifier is to answer the question: given a document consisting of a set of words, what is the probability that it belongs to a certain class?

## Setting up the Notebook

In [31]:
# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Making all the required imports:

In [32]:
import util

**Important:** This notebook also requires you to complete code in a separate `.py` script file. This keeps the notebook cleaner and simplifies testing your implementations for us. As you need to rename the file `A3_script.py`, you also need to edit the import statement below accordingly.

In [33]:
from A3_XIONG_WEITAO import MyMultinomialNaiveBayes
# from A3_BobSmith_123456 import MyMultinomialNaiveBayes # <-- you will need to rename this accordingly

## Prepare Dataset
In this assignment, we use the **GoEmotions** dataset, which is a human-annotated dataset of Reddit comments extracted from popular English-language subreddits and labeled with emotion categories.

To simplify our assignment, we filter the original dataset and perform emotion classification over only five labels: "approval", "joy", "anger", "sadness" and "confusion".

For more information about the original GoEmotions dataset, you can refer to the [original paper](https://arxiv.org/pdf/2005.00547.pdf) and the [Google Blog](https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html).

In [34]:
# Note: directly run this code block.
train_texts, train_labels, test_texts, test_labels = util.prepare_corpus()
display_num = 5  # you can change this value to view more data.
print("Examples:")
for train_text, train_label in zip(train_texts[:display_num], train_labels[:display_num]):
    print("[Input]", train_text, " --> ", "[Label]", train_label)

go_emotions dataset [train] part statistics:  {'anger': 1025, 'approval': 1873, 'sadness': 817, 'confusion': 858, 'joy': 853}
Train dataset size:  5426
go_emotions dataset [test] part statistics:  {'approval': 236, 'confusion': 97, 'joy': 93, 'anger': 131, 'sadness': 102}
Text dataset size:  659
Examples:
[Input] What a load of old shite.  -->  [Label] anger
[Input] People on reddit love trying to change the definition of words to suit their points  -->  [Label] approval
[Input] i see it in a different way, but i understand why you see it like that  -->  [Label] approval
[Input] This is on a video about electric guitars what the fuck?  -->  [Label] anger
[Input] No I actually really like that I don’t have to pump my own gas.  -->  [Label] approval


## Preprocess Dataset
After loading the dataset, we need to preprocess the raw texts. The preprocessing steps are `Lemmatization`, `Stop words Removal`, `Punctuation Removal` and `Lowercasing`.
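
To make these steps concrete, here is a minimal standard-library sketch. It is not the implementation behind `util.preprocess_texts`: the function name `simple_preprocess` and the tiny `STOP_WORDS` set are hypothetical, and lemmatization is deliberately left out because it needs an external NLP library.

```python
import string

# Tiny hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "of", "what", "the", "is", "on", "this", "about"}


def simple_preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [token for token in no_punct.split() if token not in STOP_WORDS]


print(simple_preprocess("What a load of old shite."))  # ['load', 'old', 'shite']
```

A real pipeline would typically use a much larger stop-word list and a lemmatizer, so its output can differ from this sketch on other inputs.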

In [35]:
# Note: directly run this code block.
processed_train_texts = util.preprocess_texts(train_texts)
processed_test_texts = util.preprocess_texts(test_texts)
print("Examples:")
for train_text, processed_train_text in zip(train_texts[:display_num], processed_train_texts[:display_num]):
    print(train_text, " --> ", processed_train_text)

Examples:
What a load of old shite.  -->  ['load', 'old', 'shite']
People on reddit love trying to change the definition of words to suit their points  -->  ['people', 'reddit', 'love', 'try', 'change', 'definition', 'word', 'suit', 'point']
i see it in a different way, but i understand why you see it like that  -->  ['see', 'different', 'way', 'understand', 'see', 'like']
This is on a video about electric guitars what the fuck?  -->  ['video', 'electric', 'guitar', 'fuck']
No I actually really like that I don’t have to pump my own gas.  -->  ['actually', 'really', 'like', 'pump', 'gas']


## Naive Bayes Classifier

Let's first recall what we have learned in the course.

Given a document $d$, we want the classifier to assign it one class label $\hat{c}$.

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}}  P(c|d)$$

Now, we can use **Bayes' theorem** to transform the above equation into a formulation that is easier to work with. Before that, do you still remember what Bayes' theorem says?

$$P(x|y)=\frac{P(y|x)P(x)}{P(y)}$$

Accordingly, we can get:

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} P(c|d)=\underset{c \in C}{\operatorname{argmax}} \frac{P(d|c) P(c)}{P(d)}$$

We can drop the denominator $P(d)$ since it is the same for every class and therefore does not affect the argmax:

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} P(c|d)=\underset{c \in C}{\operatorname{argmax}} P(d|c) P(c)$$

Can you give an intuitive explanation for the above equation?
Do you still remember that Naive Bayes is one of the generative methods? Why is it generative?

The above equation shows how a document is generated. First, we sample a class $c$. Second, given the sampled class $c$, we sample the document $d$.
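
The two sampling steps can be sketched in a few lines of Python. This is only an illustration: the probability tables `class_prior` and `word_given_class` are invented toy values, and the sketch draws the words of a document independently given the class, which is the simplification Naive Bayes makes.

```python
import random

# Hypothetical toy parameters, invented for illustration only.
class_prior = {"joy": 0.5, "anger": 0.5}
word_given_class = {
    "joy": {"love": 0.6, "great": 0.3, "hate": 0.1},
    "anger": {"love": 0.1, "great": 0.1, "hate": 0.8},
}


def generate_document(num_words, rng):
    """Sample a class from P(c), then sample each word from P(w|c)."""
    classes = list(class_prior)
    c = rng.choices(classes, weights=[class_prior[k] for k in classes])[0]
    words = list(word_given_class[c])
    weights = [word_given_class[c][w] for w in words]
    doc = rng.choices(words, weights=weights, k=num_words)
    return c, doc


rng = random.Random(0)  # fixed seed so repeated runs give the same sample
label, doc = generate_document(5, rng)
print(label, doc)
```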

Now, let's concentrate on $\underset{c \in C}{\operatorname{argmax}} P(d|c) P(c)$. In this equation, we call $P(d|c)$ the **likelihood** and $P(c)$ the **prior probability** of class $c$.

We can represent the document $d$ as a sequence of words $w_1, w_2, ..., w_n$, so the equation becomes:

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} P(w_1, w_2, ..., w_n|c) P(c)$$

It is still difficult to compute $P(w_1, w_2, ..., w_n|c)$ since there is an enormous number of possible combinations of $w_1, w_2, ..., w_n$.

Therefore, **Naive Bayes makes two simplifying assumptions**.

* Bag of Words assumption: word order does not matter.
* Conditional Independence assumption: the word probabilities $P(w_i|c)$ are independent of each other given the class $c$.

Based on the above two assumptions, we can get:

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} P(c) \prod_{i \in [1:n]} P(w_i|c)$$

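Have we reached the final equation? Not quite. Each probability is a number between zero and one, so a product of many probabilities, such as $\prod_{i \in [1:n]} P(w_i|c)$, quickly becomes extremely small and can underflow to zero in floating-point arithmetic. Hence, we use the sum of log-probabilities rather than the product of probabilities. Because the logarithm is monotonically increasing, this does not change which class attains the maximum: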

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} \log P(c) + \sum_{i \in [1:n]} \log P(w_i|c)$$

Congratulations, we have reached the final equation for Naive Bayes!
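
A small experiment shows why the log form matters numerically. The numbers are hypothetical: we pretend a document has 400 words and every word has probability $0.001$ under some class.

```python
import math

# Hypothetical per-word probabilities: 400 words, each with P(w_i|c) = 0.001.
word_probs = [1e-3] * 400

# Direct product: the true value is 1e-1200, far below the smallest
# positive float64, so the result underflows to exactly 0.0.
product = 1.0
for p in word_probs:
    product *= p

# Sum of logs: the same quantity on a log scale, easily representable.
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # 0.0
print(log_sum)  # roughly -2763.1
```

Once the product has collapsed to zero for every class, the classes can no longer be compared, whereas the log scores remain distinct.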

In [36]:
nb_classifier = MyMultinomialNaiveBayes()

### Build vocab
Firstly, given the preprocessed dataset, we need to build a vocabulary to cover the words we care about.
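
As a rough illustration of the idea (not the required solution for `build_vocabulary`), the sketch below keeps every word that occurs at least `cutoff_freq` times and assigns it an index. The function name `build_vocabulary_sketch`, the "at least" interpretation of the cutoff, and the alphabetical index order are all assumptions made for this example.

```python
from collections import Counter


def build_vocabulary_sketch(tokenized_texts, cutoff_freq=2):
    """Map each word occurring at least `cutoff_freq` times to an index."""
    counts = Counter(word for text in tokenized_texts for word in text)
    # Assumption: "at least cutoff_freq" (>=); sorted for a stable index order.
    kept = sorted(word for word, n in counts.items() if n >= cutoff_freq)
    return {word: index for index, word in enumerate(kept)}


toy_texts = [["load", "old"], ["old", "video"], ["video", "old", "guitar"]]
vocab = build_vocabulary_sketch(toy_texts, cutoff_freq=2)
print(vocab)  # {'old': 0, 'video': 1}
```

Dropping rare words keeps the vocabulary small and removes words whose counts are too sparse to give reliable probability estimates.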

In [37]:
# Note: you need to finish the code completion task for `build_vocabulary`, then you can run this code block
# Note: account for 10 points
cutoff_freq = 2 # you can change this by yourself.
nb_classifier.build_vocabulary(processed_train_texts, cutoff_freq=cutoff_freq)

Vocab size: 1900


### Vectorize Datasets
Given the vocab, we need to transform each text in our dataset into one vector. Recall that one important assumption in Naive Bayes is the `Bag of Words assumption`, which means each document is represented as a `Bag of Words vector`.
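
One possible way to build such a vector is sketched below. It is an illustration rather than the expected `texts2vec` solution: the helper name `text_to_vector` is hypothetical, and it assumes that the vector stores word counts and that out-of-vocabulary words are simply ignored.

```python
def text_to_vector(tokens, vocab):
    """Return a count vector with one position per vocabulary word."""
    vector = [0] * len(vocab)
    for token in tokens:
        index = vocab.get(token)
        if index is not None:  # assumption: skip out-of-vocabulary words
            vector[index] += 1
    return vector


vocab = {"like": 0, "see": 1, "way": 2}
tokens = ["see", "different", "way", "understand", "see", "like"]
print(text_to_vector(tokens, vocab))  # [1, 2, 1]
```

Note that the word order of the original text is lost: only how often each vocabulary word appears is kept.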

In [38]:
# Note: you need to finish the code completion task for `texts2vec`, then you can run this code block
# Note: account for 20 points
vectorized_train_texts = nb_classifier.texts2vec(processed_train_texts)
vectorized_test_texts = nb_classifier.texts2vec(processed_test_texts)
if vectorized_test_texts:
    print("Example:")
    print(vectorized_train_texts[0])
    print("Vocab size:", nb_classifier.vocabulary_size)
    print("Vector size:", len(vectorized_train_texts[0]))

Example:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

So far we have completed the preparation of the dataset.

### Core Part
Now, let's move to the core implementation of Naive Bayes.

Please keep in mind that our goal is to implement the equation:

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} \log P(c) + \sum_{i \in [1:n]} \log P(w_i|c)$$

For the **first** term $P(c)$, we can simply use the relative frequency of the class in the training dataset.

$$P(c)=\frac{N_c}{N_{doc}}$$

where $N_c$ is the number of documents with class label $c$, and $N_{doc}$ is the total number of documents in the training dataset.

For the **second** term $P(w_i|c)$, we compute the fraction of times the word $w_i$ appears among all words in all documents of class $c$.

$$P(w_i|c)=\frac{\operatorname{count}(w_i, c)}{\sum_{w \in vocab}\operatorname{count}(w, c)}$$

Both estimates above reduce to counting: the number of documents per class ($N_c$), the total number of documents ($N_{doc}$), and the number of times each word occurs in each class ($\operatorname{count}(w_i, c)$). Thus, **you need to implement a function `fit`**, which performs these counts and stores the results in the `self.count` and `self.classes` attributes.
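
The counting itself can be sketched as follows. This is a simplified illustration, not the required `fit` implementation: the function `count_statistics` and its return layout are hypothetical and do not mirror the structure of `self.count`.

```python
from collections import defaultdict


def count_statistics(vectors, labels):
    """Count documents per class and word occurrences per class."""
    class_doc_counts = defaultdict(int)
    class_word_counts = {}
    for vector, label in zip(vectors, labels):
        class_doc_counts[label] += 1
        if label not in class_word_counts:
            class_word_counts[label] = [0] * len(vector)
        for i, n in enumerate(vector):
            class_word_counts[label][i] += n
    return dict(class_doc_counts), class_word_counts


# Toy data: three documents over a three-word vocabulary.
vectors = [[1, 0, 2], [0, 1, 0], [1, 1, 0]]
labels = ["joy", "anger", "joy"]
doc_counts, word_counts = count_statistics(vectors, labels)

prior_joy = doc_counts["joy"] / len(labels)                # N_c / N_doc = 2 / 3
p_w0_joy = word_counts["joy"][0] / sum(word_counts["joy"])  # count(w_0, joy) / total = 2 / 5
print(prior_joy, p_w0_joy)
```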

Before we start to train the `nb_classifier`, let's see what we have prepared so far.

In [39]:
# Note: directly run this code block.
for vectorized_train_text, train_label in zip(vectorized_train_texts[:display_num], train_labels[:display_num]):
    print(vectorized_train_text, " --> ", train_label)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

Now, it's time to start the training process.

In [40]:
# Note: you need to finish the code completion task for `fit`, then you can run this code block
# Note: account for 30 points
nb_classifier.fit(vectorized_train_texts, train_labels)

Finish Model Training.


Let's try some examples.

In [41]:
# Once you have a trained `nb_classifier`, you can try these examples; otherwise, the following code will throw errors.
print("P(joy) = N_joy / N_doc = {} / {} = {}".format(nb_classifier.count["joy"]["total_data"],
      nb_classifier.count["total_data"], nb_classifier.count["joy"]["total_data"]/nb_classifier.count["total_data"]))
print("P(anger) = N_anger / N_doc = {} / {} = {}".format(nb_classifier.count["anger"]["total_data"],
      nb_classifier.count["total_data"], nb_classifier.count["anger"]["total_data"]/nb_classifier.count["total_data"]))

P(joy) = N_joy / N_doc = 853 / 5426 = 0.15720604496866936
P(anger) = N_anger / N_doc = 1025 / 5426 = 0.18890527091780318


Given the processed test dataset `vectorized_test_texts`, the multinomial Naive Bayes classifier assigns a class label to each test datapoint. Thus, we provide you with a `predict` function, which iterates over the datapoints and returns a prediction for each of them.

Specifically, for each test datapoint, the function `predict_single` predicts its best class.


The key to predicting a label for a test datapoint is to calculate:

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} \log P(c) + \sum_{i \in [1:n]} \log P(w_i|c)$$

Therefore, we need a function `probability` to calculate the log-probability score for each $c \in C$. Afterwards, we can decide the final label for the test datapoint by comparing these scores.
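
The comparison step is just an argmax over the per-class scores. A minimal sketch, with a hypothetical helper `predict_from_scores` and made-up scores:

```python
def predict_from_scores(log_scores):
    """Return the class whose log score is largest."""
    return max(log_scores, key=log_scores.get)


# Made-up log scores for one document, for illustration only.
log_scores = {"joy": -12.0, "anger": -15.5, "approval": -10.25}
print(predict_from_scores(log_scores))  # approval
```

Log scores are negative, so the "largest" score is the one closest to zero.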

Is there any problem with the above estimate? What if the word $w_i$ never occurs in training documents with label $c$? Then $P(w_i|c)$ becomes 0, and $\log 0$ is undefined (it tends to $-\infty$), so a single unseen word rules out class $c$ no matter what the other words say.

Can you think of some ways to solve this problem?

Maybe something like the smoothing we used in Language Modeling?

Yes! The simplest solution is add-one smoothing. With add-one smoothing, we get:

$$P(w_i|c)=\frac{\operatorname{count}(w_i, c)+1}{\sum_{w \in vocab}\operatorname{count}(w, c)+|vocab|}$$

where $|vocab|$ is the number of words in the vocabulary.
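
The smoothed score for one class can be sketched as below. This is an illustration under stated assumptions, not the expected `probability` solution: the function `smoothed_log_score` and its arguments are hypothetical, documents are count vectors, and `class_word_counts[i]` holds $\operatorname{count}(w_i, c)$.

```python
import math


def smoothed_log_score(vector, class_word_counts, class_docs, total_docs):
    """log P(c) + sum_i log P(w_i|c), with add-one smoothed P(w_i|c)."""
    vocab_size = len(class_word_counts)
    total_words = sum(class_word_counts)
    score = math.log(class_docs / total_docs)  # log prior
    for count_in_doc, count_in_class in zip(vector, class_word_counts):
        if count_in_doc:
            p = (count_in_class + 1) / (total_words + vocab_size)
            score += count_in_doc * math.log(p)  # each occurrence contributes once
    return score


# Toy example: the third word never occurs in the class (count 0),
# yet the smoothed score stays finite.
score = smoothed_log_score([1, 0, 1], [2, 1, 0], class_docs=2, total_docs=4)
print(score)  # log(0.5) + log(3/6) + log(1/6) = log(1/24), about -3.178
```

Without the `+ 1` in the numerator, the third word would have probability 0 and `math.log` would raise a `ValueError`.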

In [42]:
# Note: you need to finish the code completion task for `probability`, then you can run this code block
# Note: account for 20 points
if vectorized_test_texts:
    log_prob = nb_classifier.probability(vectorized_test_texts[0], "joy")
    print(log_prob)

-17.296522040096455


Now, given a vectorized test representation, we can give it a prediction!

In [43]:
# Note: you need to finish the code completion task for `predict`, then you can run this code block
# Note: account for 10 points
if vectorized_test_texts:
    prediction = nb_classifier.predict_single(vectorized_test_texts[0])
    print("Prediction: ", prediction)

Prediction:  approval


Let's now predict labels for all the test data!

In [44]:
# Note: directly run this code block.
predictions = None
if vectorized_test_texts:
    predictions = nb_classifier.predict(vectorized_test_texts)
    print("Predictions: ", predictions)

Predictions:  ['approval', 'approval', 'approval', 'approval', 'approval', 'confusion', 'anger', 'approval', 'approval', 'anger', 'sadness', 'sadness', 'approval', 'approval', 'anger', 'approval', 'approval', 'approval', 'approval', 'approval', 'approval', 'approval', 'confusion', 'approval', 'anger', 'approval', 'approval', 'anger', 'approval', 'sadness', 'approval', 'sadness', 'sadness', 'approval', 'joy', 'approval', 'approval', 'approval', 'anger', 'joy', 'sadness', 'anger', 'joy', 'sadness', 'approval', 'anger', 'confusion', 'approval', 'sadness', 'sadness', 'approval', 'confusion', 'sadness', 'confusion', 'approval', 'sadness', 'approval', 'approval', 'sadness', 'sadness', 'approval', 'joy', 'approval', 'approval', 'approval', 'joy', 'joy', 'confusion', 'approval', 'anger', 'approval', 'sadness', 'confusion', 'anger', 'approval', 'approval', 'sadness', 'anger', 'sadness', 'confusion', 'anger', 'confusion', 'sadness', 'sadness', 'approval', 'sadness', 'confusion', 'approval', 'app

So far, you have successfully used the classifier to provide predictions for the test data.
Do you want to know how your classifier performs?
To this end, you need to implement a `score` function to calculate the accuracy.
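
Accuracy is the fraction of predictions that match the gold labels. A minimal sketch, using a hypothetical standalone function `accuracy` rather than the `score` method you are asked to complete:

```python
def accuracy(predictions, gold_labels):
    """Fraction of positions where the prediction equals the gold label."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        return 0.0  # assumption: define accuracy on empty input as 0.0
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


print(accuracy(["joy", "anger", "joy", "approval"],
               ["joy", "joy", "joy", "approval"]))  # 0.75
```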

In [45]:
# Note: you need to finish the code completion task for `score`, then you can run this code block
# Note: account for 10 points
if predictions:
    acc = nb_classifier.score(predictions, test_labels)
    print("Your MultinomialNaiveBayes classifier achieves an accuracy score of:", acc)

Your MultinomialNaiveBayes classifier achieves an accuracy score of: 0.6889226100151745


## Summary

In this assignment, we focused on a generative classification method called the **Naive Bayes Classifier**. The classifier is built on Bayes' theorem. We made two assumptions, the Bag of Words assumption and the Conditional Independence assumption, to arrive at the final Naive Bayes Classifier.
Congratulations on finishing this assignment!