# PA1.2 Naive Bayes for Text Classification

### Introduction

In this notebook, you will be implementing a Naive Bayes model to classify sentences based off their emotions.

The Naive Bayes model is a probabilistic model that uses Bayes' Theorem to calculate the probability of a label given some observed features. In this case, we will be using the Naive Bayes model to calculate the probability of a sentence belonging to a certain emotion given the words in the sentence.

For reference and additional details, please go through [Chapter 4](https://web.stanford.edu/~jurafsky/slp3/4.pdf) of the SLP3 book.


### Instructions

- Follow along with the notebook, filling out the necessary code where instructed.

- <span style="color: red;">Read the Submission Instructions, Plagiarism Policy, and Late Days Policy in the attached PDF.</span>

- <span style="color: red;">Make sure to run all cells for credit.</span>

- <span style="color: red;">Do not remove any pre-written code.</span>

- <span style="color: red;">You must attempt all parts.</span>

In [1]:
# import all required libraries here
import numpy as np 
import pandas as pd 
import regex as re 
import matplotlib.pyplot as mpl 
import scipy as sp 
import os
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import random
from transformers import AutoTokenizer
from nltk.corpus import stopwords
import datasets
from datasets import load_dataset, load_dataset_builder

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


## Loading and Preprocessing the Dataset

We will be working with the [dair-ai/emotion](https://huggingface.co/datasets/dair-ai/emotion) dataset. This contains 6 classes of emotions: `joy`, `sadness`, `anger`, `fear`, `love`, and `surprise`.

Instead of downloading the dataset manually, we will be using the [`datasets`](https://huggingface.co/docs/datasets) library to download the dataset for us. This is a library in the HuggingFace ecosystem that allows us to easily download and use datasets for NLP tasks. Outside of just downloading the dataset, it also provides a standard interface for accessing the data, which makes it easy to use with other libraries like Pandas and PyTorch. You can take a look at the huge list of datasets available [here](https://huggingface.co/datasets).

In the following cells,

1. Load in the dataset (It should already be split into train, validation, and test sets.)

2. Define a dictionary mapping the emotion labels to integers. You can find these on the dataset page linked above.

3. Format each split of the dataset into a Pandas DataFrame. The columns should be `text` and `label`, where `text` is the sentence and `label` is the emotion label.

In [2]:
# code here
emotion_dataset = load_dataset('emotion')

Using the latest cached version of the module from C:\Users\user\.cache\huggingface\modules\datasets_modules\datasets\emotion\cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd (last modified on Sat Feb  3 22:15:15 2024) since it couldn't be found locally at emotion, or remotely on the Hugging Face Hub.


In [3]:
int_to_emotion = {
    0: 'sadness',
    1: 'joy',
    2: 'love',
    3: 'anger',
    4: 'fear',
    5: 'surprise'
}

emotion_to_int = {
    'sadness': 0,
    'joy': 1,
    'love': 2,
    'anger': 3,
    'fear': 4,
    'surprise': 5
}

In [4]:
dfs = []
for split in emotion_dataset.keys():
    print(split)
    df = pd.DataFrame(emotion_dataset[split], columns=['text', 'label'])
    df['label'] = df['label'].map(int_to_emotion)
    dfs.append(df)
print('---------------')
train_df = dfs[0]
print(train_df.head())
val_df = dfs[1]
print(val_df.head())
test_df = dfs[2]
print(test_df.head())

train
validation
test
---------------
                                                text    label
0                            i didnt feel humiliated  sadness
1  i can go from feeling so hopeless to so damned...  sadness
2   im grabbing a minute to post i feel greedy wrong    anger
3  i am ever feeling nostalgic about the fireplac...     love
4                               i am feeling grouchy    anger
                                                text    label
0  im feeling quite sad and sorry for myself but ...  sadness
1  i feel like i am still looking at a blank canv...  sadness
2                     i feel like a faithful servant     love
3                  i am just feeling cranky and blue    anger
4  i can have for a treat or if i am feeling festive      joy
                                                text    label
0  im feeling rather rotten so im not very ambiti...  sadness
1          im updating my blog because i feel shitty  sadness
2  i never make her separate fro

Now that we've gotten a feel for the dataset, we might want to do some cleaning or preprocessing before continuing. For example, we might want to remove punctuation and other alphanumeric characters, lowercase all the text, strip away extra whitespace, and remove stopwords.

In the cell below, write a function that does exactly the following described above. You can use the `re` library to help you with this. You can also use the `nltk` library to help you with removing stopwords.

Once you are done, you can simply `apply` this function to the `text` column of the dataset to get the preprocessed text.

In [5]:
# code here
def preprocess_text(text):
    """Preprocesses text by removing punctuation, converting to lowercase,
       stripping whitespace, and removing stopwords.

    Args:
        text (str): The text to preprocess.

    Returns:
        str: The preprocessed text.
    """

    # Remove punctuation and other non-alphanumeric characters
    text = re.sub(r"[^\w\s]", "", text)

    # Convert to lowercase
    text = text.lower()

    # Strip whitespace
    text = text.strip()

    # Tokenize the text
    words = text.split()

    # Remove stopwords
    stop_words = set(stopwords.words('english')) 
    words = [word for word in words if word not in stop_words]

    # Join the words back into a string
    text = " ".join(words)

    return text

In [6]:
train_df['text'] = train_df['text'].apply(preprocess_text)
val_df['text'] = val_df['text'].apply(preprocess_text)
test_df['text'] = test_df['text'].apply(preprocess_text)


### Vectorizing sentences with Bag of Words

Now that we have loaded in our data, we will need to vectorize our sentences - this is necessary to be able to numericalize our inputs before feeding them into our model. 

We will be using a Bag of Words approach to vectorize our sentences. This is a simple approach that counts the number of times each word appears in a sentence. 

The element at index $\text{i}$ of the vector will be the number of times the $\text{i}^{\text{th}}$ word in our vocabulary appears in the sentence. So, for example, if our vocabulary is `["the", "cat", "sat", "on", "mat"]`, and our sentence is `"the cat sat on the mat"`, then our vector will be `[2, 1, 1, 1, 1]`.

You will now create a `BagOfWords` class to vectorize our sentences. This will involve creating

1. A vocabulary from our corpus

2. A mapping from words to indices in our vocabulary

3. A function to vectorize a sentence in the fashion described above

It may help you to define something along the lines of a `fit` and a `vectorize` method.

In [7]:
# code here
class BagOfWords:
    def __init__(self):
        self.vocabulary = None
        self.word_to_idx = None

    def fit(self, corpus):
        """Creates vocabulary and word-to-index mapping from a corpus of text.

        Args:
            corpus (list of str): A list of sentences.
        """

        # Create a list of all unique words in the corpus
        vocabulary = set()
        # corpus_list = corpus.split(' ')
        for sentence in corpus:
            # vocabulary.update(word) # update func takes uniion and returns result to the set on which func was called
            words = sentence.split(" ")
            vocabulary.update(words)

        # Create a mapping from words to indices
        self.vocabulary = list(vocabulary)
        self.word_to_idx = {word: i for i, word in enumerate(self.vocabulary)}

    def vectorize(self, sentence):
        """Vectorizes a sentence using the Bag of Words approach.

        Args:
            sentence (str): The sentence to vectorize.

        Returns:
            numpy.ndarray: A vector of word counts, where the element at index i
            represents the number of times the i-th word in the vocabulary appears
            in the sentence.
        """

        word_counts = np.zeros(len(self.vocabulary))
        for word in sentence.split():
            if word in self.word_to_idx:
                word_counts[self.word_to_idx[word]] += 1
        return word_counts


For a sanity check, you can manually set the vocabulary of your `BagOfWords` object to the vocabulary of the example above, and check that the vectorization of the sentence is correct.

Once you have implemented the `BagOfWords` class, fit it to the training data, and vectorize the training, validation, and test data.

In [8]:
# code here
# corp = ['the','cat','sat','on','the','mat']
# sent1 = "the cat sat on the mat"
# classifier = BagOfWords()
# classifier.fit(corp)
# print(classifier.vocabulary)
# print(classifier.vectorize(sent1))


# Create the BagOfWords object
bow = BagOfWords()

# Fit the BagOfWords to the training data (corrected the typo "traning")
training_text = emotion_dataset["train"]["text"]  # Access text data correctly
bow.fit(training_text)
vectorized_dic = {}
# Vectorize the training, validation, and test data
for split in emotion_dataset.keys():
    # Get the text data for the current split
    text_data = emotion_dataset[split]["text"]

    # Vectorize the text data
    vectorized_data = np.array([bow.vectorize(text) for text in text_data])

    # Store the vectorized data in a new column
    # emotion_dataset[split]["vectorized_text"] = vectorized_data
    # emotion_dataset[split] = emotion_dataset[split].add_column("vectorized_text", vectorized_data)
    vectorized_dic[split] = vectorized_data

print("Vectorized data:")
for split, data in vectorized_dic.items():
    print(split, '\n')
    print(data)
    print('\n')
    print('\n')


# print(emotion_dataset["train"].head())



Vectorized data:
train 

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]




validation 

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]




test 

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]






In [9]:
# print(bow.vocabulary)
print(training_text[1])

i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake


In [10]:
# print(vectorized_dic['train'][1].count(1))
xxx = vectorized_dic['train'][77]
print(train_df['text'][2])
print(np.count_nonzero(xxx == 1.0))
print(len(xxx))

im grabbing minute post feel greedy wrong
13
15212


## Naive Bayes

### From Scratch

Now that we have vectorized our sentences, we can implement our Naive Bayes model. Recall that the Naive Bayes model is based off of the Bayes Theorem:

$$
P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)}
$$

What we really want is to find the class $c$ that maximizes $P(c \mid x)$, so we can use the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(x \mid c)P(c)
$$

We can then use the Naive Bayes assumption to simplify this:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

Where $x_i$ is the $i^{\text{th}}$ word in our sentence.

All of these probabilities can be estimated from our training data. We can estimate $P(c)$ by counting the number of times each class appears in our training data, and dividing by the total number of training examples. We can estimate $P(x_i \mid c)$ by counting the number of times the $i^{\text{th}}$ word in our vocabulary appears in sentences of class $c$, and dividing by the total number of words in sentences of class $c$.

It would help to apply logarithms to the above equation so that we translate the product into a sum, and avoid underflow errors. This will give us the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c)
$$

You will now implement this algorithm. It would help to go through [this chapter from SLP3](https://web.stanford.edu/~jurafsky/slp3/4.pdf) to get a better understanding of the model - **it is recommended base your implementation off the pseudocode that has been provided on Page 6**. You can either make a `NaiveBayes` class, or just implement the algorithm across two functions.

<span style="color: red;"> For this part, the only external library you will need is `numpy`. You are not allowed to use anything else.</span>

In [11]:
# code here
import numpy as np

class NaiveBayes:
    def __init__(self, vocabulary, classes):
        self.vocabulary = vocabulary
        self.classes = classes
        self.class_priors = None
        self.word_cond_probs = None

    def fit(self, X, y):
        """
        Fits the Naive Bayes model to the training data.

        Args:
            X (numpy.ndarray): A 2D array of shape (n_samples, n_features), where each
                row represents a vectorized sentence and each column represents a word
                in the vocabulary.
            y (numpy.ndarray): A 1D array of shape (n_samples,), where each element
                represents the class label for the corresponding sentence in X.
        """

        # Calculate class priors (P(c))
        self.class_priors = np.bincount(y) / len(y)

        # Initialize word conditional probabilities (P(x_i | c)) with Laplace smoothing
        self.word_cond_probs = np.ones((len(self.vocabulary), len(self.classes)))
        for i in range(len(self.vocabulary)):
            for j in range(len(self.classes)):
                class_words = X[y == j, i] # selects the occurences of word i in samples of class j
                self.word_cond_probs[i, j] = (np.sum(class_words) + 1) / (np.sum(y == j) + len(self.vocabulary)) # dividing by length of vocab each time

    def predict(self, X):
        """
        Predicts the class labels for a set of sentences.

        Args:
            X (numpy.ndarray): A 2D array of shape (n_samples, n_features), where each
                row represents a vectorized sentence and each column represents a word
                in the vocabulary.

        Returns:
            numpy.ndarray: A 1D array of shape (n_samples,), where each element
                represents the predicted class label for the corresponding sentence in X.
        """

        # Calculate log probabilities for each class
        log_probs = np.zeros((X.shape[0], len(self.classes)))
        for c in range(len(self.classes)):
            class_prob = np.log(self.class_priors[c])
            log_probs[:, c] = self.class_priors[c]
            for i in range(X.shape[1]):
                log_probs[:, c] += np.log(self.word_cond_probs[i, c]) * X[:, i] # count of occurrences of the word in the sample

        # Predict the class with the highest log probability
        return np.argmax(log_probs, axis=1)



X_train, y_train = vectorized_dic['train'], train_df['label'].map(emotion_to_int)
X_test = vectorized_dic['test']

# Create the Naive Bayes model
classes = [0,1,2,3,4,5]

model = NaiveBayes(bow.vocabulary, classes)

# Train the model
model.fit(X_train, y_train)

# # Predict on the test data
# predictions = model.predict(X_test)


Now use your implementation to train a Naive Bayes model on the training data, and generate predictions for the Validation Set.

Report the Accuracy, Precision, Recall, and F1 score of your model on the validation data. Also display the Confusion Matrix. You are allowed to use `sklearn.metrics` for this.

In [12]:
# code here
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix



X_val, y_val = vectorized_dic['validation'], val_df['label'].map(emotion_to_int)
# Predict on the validation data
predictions = model.predict(X_val)

# Evaluate model performance
accuracy = accuracy_score(y_val, predictions)
precision = precision_score(y_val, predictions, average='macro')
recall = recall_score(y_val, predictions, average='macro')
f1 = f1_score(y_val, predictions, average='macro')

# Print the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

# Calculate and display the confusion matrix
confusion_mat = confusion_matrix(y_val, predictions)
print("Confusion Matrix:\n", confusion_mat)


Accuracy: 0.5695
Precision: 0.6970694724981793
Recall: 0.3096600645794386
F1-score: 0.2714764787482508
Confusion Matrix:
 [[410 140   0   0   0   0]
 [  3 701   0   0   0   0]
 [  7 170   1   0   0   0]
 [ 61 199   0  15   0   0]
 [ 44 155   0   1  12   0]
 [ 19  62   0   0   0   0]]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### Using `sklearn`

Now that you have implemented your own Naive Bayes model, you will use the `sklearn` library to train a Naive Bayes model on the same data. Alongside this, you will use their implementation of the Bag of Words model, the `CountVectorizer` class, to vectorize your sentences.

You can use the `MultinomialNB` class to train a Naive Bayes model. Go through the relevant documentation to figure out how to use it, and how it differs from the model you implemented.

When you finish training your model, report the same metrics as above on the Validation Set.

In [13]:
# code here
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Assuming you have access to training and validation data (text and labels)
training_text = train_df['text']
validation_text = val_df['text']
training_labels = train_df['label'].map(emotion_to_int)
validation_labels = val_df['label'].map(emotion_to_int)

# Create the CountVectorizer object
vectorizer = CountVectorizer()

# Vectorize the training and validation text
training_features = vectorizer.fit_transform(training_text)
validation_features = vectorizer.transform(validation_text)

# Create the MultinomialNB model
model = MultinomialNB()

# Train the model
model.fit(training_features, training_labels)

# Make predictions on the validation data
predictions = model.predict(validation_features)

# Evaluate model performance
accuracy = accuracy_score(validation_labels, predictions)
precision = precision_score(validation_labels, predictions, average='macro')
recall = recall_score(validation_labels, predictions, average='macro')
f1 = f1_score(validation_labels, predictions, average='macro')

# Print the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

# Calculate and display the confusion matrix
confusion_mat = confusion_matrix(validation_labels, predictions)
print("Confusion Matrix:\n", confusion_mat)


Accuracy: 0.788
Precision: 0.8440655233193844
Recall: 0.608369864944128
F1-score: 0.6484652213208238
Confusion Matrix:
 [[519  20   1   5   5   0]
 [ 30 669   3   2   0   0]
 [ 38  77  60   2   1   0]
 [ 52  30   0 189   4   0]
 [ 49  24   0   8 129   2]
 [ 31  30   0   1   9  10]]
