<img src='data/images/section-notebook-header.png' />

# Naive Bayes Classifier -- Implementation from Scratch

The Naive Bayes classifier is a simple probabilistic machine learning algorithm used for classification tasks, particularly in text classification and spam filtering. It is based on Bayes' theorem and the assumption of conditional independence between features. The Naive Bayes classifier calculates the probability of a data instance belonging to a particular class by applying Bayes' theorem, which states: 

* P(Class|Features) = (P(Features|Class) * P(Class)) / P(Features)

Here, P(Class|Features) is the probability of the class given the observed features, P(Features|Class) is the probability of the observed features given the class, P(Class) is the prior probability of the class, and P(Features) is the probability of the observed features.
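As a quick numeric illustration of the formula, the following sketch plugs made-up numbers into Bayes' theorem for a spam filter. All probabilities and variable names here are hypothetical values chosen for illustration; they are not taken from any dataset.

```python
# Hypothetical numbers for illustration only
p_spam = 0.2               # P(Class): prior probability that a mail is spam
p_word_given_spam = 0.5    # P(Features|Class): "offer" occurs in a spam mail
p_word_given_ham = 0.05    # "offer" occurs in a non-spam mail

# P(Features), obtained by summing over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(Class|Features)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print("P(spam | 'offer') = {:.3f}".format(p_spam_given_word))  # 0.714
```

Although only 20% of all mails are assumed to be spam, observing the word raises the probability of spam to roughly 71% in this example.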

The "naive" assumption in Naive Bayes comes from assuming that the features are conditionally independent given the class, meaning that, once the class is known, the presence or absence of one feature carries no information about the presence or absence of other features. Despite this assumption, Naive Bayes often performs well in practice, especially with large feature spaces. The Naive Bayes classifier calculates the probability of each class for a given data instance and assigns the instance to the class with the highest probability. It can handle both binary and multiclass classification problems.

Naive Bayes classifiers are relatively fast to train and make predictions, and they require a small amount of training data to estimate the necessary probabilities. However, they may suffer from the "zero probability problem" if a feature has not been observed in the training data with a particular class. This issue can be addressed using techniques like Laplace smoothing or other smoothing methods.

Overall, the Naive Bayes classifier is a simple yet effective probabilistic algorithm for text classification and other classification tasks, particularly when dealing with large feature spaces and limited training data. The goal of this tutorial is to perform text classification "from scratch", i.e., without using an implementation of a classifier provided by existing packages such as sklearn. This gives a better intuition for how the classifier works.

## Setting up the Notebook

### Import all Required Packages

Note that we import only auxiliary methods from `sklearn` (for the evaluation), but not any implementation of the Naive Bayes classifier.

In [1]:
import numpy as np
import pandas as pd
import random

from nltk import bigrams
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

from src.nlputil import preprocess_text

---

## Prepare Dataset

We use the same dataset as in the "Text Classification" notebook. Hence, we perform the same steps in terms of 

* Loading the files
* Preprocessing the sentences
* Generating training and test data

all in the code cell below. If anything is unclear, check out the "Text Classification" notebook for more details.

In [2]:
# Load files using pandas
df_sent_pos = pd.read_csv('data/corpora/sentence-polarity-dataset/sentence-polarity.pos', sep='\t', header=None)
df_sent_neg = pd.read_csv('data/corpora/sentence-polarity-dataset/sentence-polarity.neg', sep='\t', header=None)

# Create a list for all sentences and add the sentences from both files
sentences = []
sentences.extend(df_sent_neg[0].tolist())
sentences.extend(df_sent_pos[0].tolist())

# Preprocess sentences (by default, we only lowercase all letters and remove stopwords and punctuation)
sentences_preprocessed = [''] * len(sentences)
for idx, sent in enumerate(sentences):
    sentences_preprocessed[idx] = preprocess_text(sent)

# Create a list for all labels
polarities = []
polarities.extend([0]*len(df_sent_neg))
polarities.extend([1]*len(df_sent_pos))

# Convert from lists to numpy arrays
sentences = np.array(sentences_preprocessed)
polarities = np.array(polarities)

# Shuffle sentences and labels
combined = list(zip(sentences, polarities))
random.seed(1) # (optional)
random.shuffle(combined)
# split the "zipped" list into the two lists of sentences and labels/polarities
sentences[:], polarities[:] = zip(*combined)

# Let's go for an 80%/20% split -- you can change the value and see its effects
train_test_ratio = 0.8

# Calculate the size of the training data (the size of the test data is also implicitly given)
train_set_size = int(train_test_ratio * len(sentences))

# Split data and labels into training and test data with respect to the size of the training data
X_train, X_test = sentences[:train_set_size], sentences[train_set_size:]
y_train, y_test = polarities[:train_set_size], polarities[train_set_size:]

print("Size of training set: {}".format(len(X_train)))
print("Size of test: {}".format(len(X_test)))

Size of training set: 8529
Size of test: 2133


---

## Implementing a Naive Bayes Classifier

The Naive Bayes Classifier classifies documents - given by a set of words $\{w_1, w_2, ..., w_n\}$ - by calculating the conditional probabilities

$$P(c_i\ |\  w_1,w_2,...,w_n)$$

for all classes $c_i$, and picking the class with the highest conditional probability. For example, one would assume that $P(c_{pos}\ |\  \text{happy},\text{luck},...,\text{vacation})$ has a higher value than $P(c_{pos}\ |\  \text{accident},\text{bad},...,\text{traffic})$.

Using Bayes' Theorem, we can write:

$$P(c_i\ |\  w_1,w_2,...,w_n) = \frac{P(w_1,w_2,...,w_n \ |\ c_i)  \cdot P(c_i)}{P(w_1,w_2,...,w_n)}$$

$P(c_i)$ is called the prior probability of class $c_i$ and simply reflects the distribution of the different classes in the set of documents. For example, if our dataset of positive and negative documents (i.e., 2 classes) contains 55% positive sentences, $P(c_{pos})=0.55$ and $P(c_{neg})=0.45$.

We can further simplify this calculation. In the end, we are only interested in which probability is higher, $P(c_{pos}\ |\  w_1,w_2,...,w_n)$ or $P(c_{neg}\ |\  w_1,w_2,...,w_n)$. The absolute values are not important. As such, we can ignore the denominator $P(w_1,w_2,...,w_n)$ since it does not depend on the class $c_i$. We can therefore write:

$$P(c_i\ |\  w_1,w_2,...,w_n) \propto P(w_1,w_2,...,w_n \ |\ c_i)  \cdot P(c_i)$$

Note that we can no longer use "$=$", since $P(c_i\ |\  w_1,w_2,...,w_n)$ is only proportional to the product on the right-hand side.
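To see that dropping the denominator does not change the decision, here is a minimal sketch with made-up numbers. The priors and likelihoods are hypothetical; only the comparison between the two ways of scoring matters.

```python
# Hypothetical values for a single document
priors = {'pos': 0.55, 'neg': 0.45}            # P(c_i)
likelihoods = {'pos': 0.002, 'neg': 0.0005}    # P(w_1,...,w_n | c_i)

# Unnormalized scores: P(w_1,...,w_n | c_i) * P(c_i)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The denominator P(w_1,...,w_n) is the same for every class
evidence = sum(scores.values())
posteriors = {c: scores[c] / evidence for c in scores}

best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)

print(best_by_score, best_by_posterior)  # pos pos
```

Dividing all scores by the same positive number rescales them so that they sum to 1, but it cannot change their order.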

In general, $P(w_1,w_2,...,w_n \ |\ c_i)$ is difficult to calculate. This is where the "Naive Bayes" assumption comes in - that is, we assume that all words $w_1,w_2,...,w_n$ are independent of each other given the class $c_i$. In general, this assumption does not hold. For example, documents containing "birthday" often also contain "happy". But it turns out that in practice this assumption often has little effect on the classification results. We can now write:

$$P(c_i\ |\  w_1,w_2,...,w_n) \propto P(w_1\ |\ c_i)  \cdot P(w_2\ |\ c_i)\cdot ...  \cdot P(w_n\ |\ c_i)  \cdot P(c_i) =  P(c_i)\cdot \prod P(w_i\ |\ c_i)$$

with $P(w_i\ |\ c_i)$ being the probability of finding the word $w_i$ in a document of class $c_i$. In other words, we can say:

$$P(w_i\ |\ c_i) = \frac{\#occurrences\ of\ w_i\ in\ c_i}{\#words\ in\ c_i}$$

These values are easy to calculate for a given set of documents. It's all about counting words, that's it.

While this works fine in theory, in practice, two concerns need to be addressed. Firstly, most $P(w_i\ |\ c_i)$ are very small probabilities. Thus if a document contains hundreds or even thousands of words, hundreds or thousands of small numbers need to be multiplied. The result is then too small to be represented as a floating-point number and is rounded to 0 (underflow). To avoid this, we calculate the **log probability**. Since the logarithm is a strictly increasing function, it does not change which class gets the highest score. Using the rules of logarithms (the logarithm of a product is the sum of the logarithms), we can write:

$$\log{\Big(P(c_i)\cdot \prod P(w_i\ |\ c_i)\Big)} = \log{P(c_i)} + \sum \log{P(w_i\ |\ c_i)}$$
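The underflow problem can be reproduced in a few lines. The sketch below assumes a hypothetical document of 1000 words that all have the same made-up probability; it compares multiplying the probabilities directly with adding their logarithms.

```python
import math

# Hypothetical document: 1000 words, each with P(w_i | c_i) = 0.0001
probs = [1e-4] * 1000

# Multiplying the raw probabilities underflows to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Adding the log probabilities stays a finite (large negative) number
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)
```

Once every class score has collapsed to 0.0, the classes can no longer be compared, whereas the sums of log probabilities remain comparable.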

Another problem arises if $P(w_i\ |\ c_i) = 0$, i.e., word $w_i$ never appeared in class $c_i$. This can easily happen if the classifier gets a document with words it has never seen before. From a mathematical perspective this is a problem since $\log{0}$ is undefined. And even without considering the logarithm, $P(w_i\ |\ c_i) = 0$ would dominate the formula and would make $P(c_i\ |\  w_1,w_2,...,w_n) = 0$ no matter how common all other words are. The solution is to assign even unseen words a small probability greater than 0. Without going into details, a common approach is *Laplace Smoothing*, which results in:

$$P(w_i\ |\ c_i) = \frac{(\#occurrences\ of\ w_i\ in\ c_i)\ + 1}{(\#words\ in\ c_i) + (\#words\ in\ vocabulary)}$$
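A small worked example may help to see the effect of the smoothing. The counts, the vocabulary size, and the helper name `smoothed_prob()` below are all made up for illustration.

```python
# Hypothetical counts for one class c_i (all numbers are made up)
class_token_counts = {'good': 3, 'movie': 2, 'plot': 1}  # occurrences per word in c_i
vocabulary_size = 10                                     # words in the vocabulary
total_words = sum(class_token_counts.values())           # words in c_i (here: 6)

def smoothed_prob(token):
    # (occurrences + 1) / (words in class + words in vocabulary)
    return (class_token_counts.get(token, 0) + 1) / (total_words + vocabulary_size)

print(smoothed_prob('good'))    # seen in c_i:   (3 + 1) / (6 + 10) = 0.25
print(smoothed_prob('boring'))  # unseen in c_i: (0 + 1) / (6 + 10) = 0.0625
```

The word that never occurred in the class still gets a probability greater than 0, so its logarithm is defined and it no longer forces the whole product to 0.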

Summing up, we need to calculate 3 types of information:
- size of the vocabulary
- the log probabilities $\log{P(c_i)}$
- the number of occurrences of all words $w_i$ in the different classes $c_i$

**Important:** To keep it simple, we only consider unigrams here -- that is, each $w_i$ represents a single word/token. While supporting bigrams, trigrams, etc. is a straightforward extension, it would make the code somewhat more complex. Here we focus on simplicity to maximize understanding.

The following three variables will store this information:

In [3]:
vocabulary = set()
log_class_priors = {}
token_counts = { 'pos': {}, 'neg': {} }

The following auxiliary method `get_token_counts()` takes a list of tokens as input and returns the number of occurrences of each term/token in the token list.

In [4]:
def get_token_counts(token_list):
    token_counts = {}
    for token in token_list:
        token_counts[token] = token_counts.get(token, 0.0) + 1.0
    return token_counts

Let's run an example to see the output of the method `get_token_counts()`:

In [5]:
token_list = X_train[-1].split() # Data is preprocessed; split() is good enough to tokenize a sentence.

print(get_token_counts(token_list))

{'lee': 1.0, 'marvelously': 1.0, 'compelling': 1.0, 'present': 1.0, 'brown': 2.0, 'catalyst': 1.0, 'struggle': 1.0, 'black': 1.0, 'manhood': 1.0, 'restrictive': 1.0, 'chaotic': 1.0, 'america': 1.0, 'sketchy': 1.0, 'nevertheless': 1.0, 'gripping': 1.0, 'portrait': 1.0, 'jim': 1.0, 'celebrated': 1.0, 'wonder': 1.0, 'spotlight': 1.0}
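The counting done by `get_token_counts()` corresponds to what Python's standard `collections.Counter` provides. The sketch below repeats the helper so that it runs on its own and compares both on a made-up token list; the only difference is that the helper stores floats while `Counter` stores integers.

```python
from collections import Counter

def get_token_counts(token_list):
    # Same logic as the helper defined above
    token_counts = {}
    for token in token_list:
        token_counts[token] = token_counts.get(token, 0.0) + 1.0
    return token_counts

# Hypothetical example sentence (already preprocessed)
tokens = 'brown bear meets brown fox'.split()

counter_counts = dict(Counter(tokens))
print(counter_counts)  # {'brown': 2, 'bear': 1, 'meets': 1, 'fox': 1}

# 2 == 2.0 in Python, so both dictionaries compare as equal
print(counter_counts == get_token_counts(tokens))  # True
```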


The `fit()` method does the actual calculation of the 3 types of required information.

In [6]:
def fit(X, y):
    num_data_items = len(X)
    
    # Calculate the log prior probabilities, i.e., the log of the fraction of positive and negative documents
    log_class_priors['pos'] = np.log(sum(1 for label in y if label == 1) / num_data_items)
    log_class_priors['neg'] = np.log(sum(1 for label in y if label == 0) / num_data_items)
    
    # The whole loop essentially just counts the words for each class
    for doc, label in zip(X, y):
        polarity_class = 'pos' if label == 1 else 'neg'
        # Get token counts for the current document
        counts = get_token_counts(doc.split())
        for token, count in counts.items():
            # Remember vocabulary so we can handle unknown tokens later
            # It's a set, so no harm to add multiple times
            vocabulary.add(token)
            # If the token is not yet in the dictionary, initialize count with 0
            if token not in token_counts[polarity_class]:
                token_counts[polarity_class][token] = 0
            # Update token count
            token_counts[polarity_class][token] += count   

Let's train the classifier.

In [7]:
fit(X_train, y_train)

The following cell illustrates the result of the calculation.

In [8]:
print("Priors (log probabilities): {}".format(log_class_priors))
print("Priors (probabilities): {}".format({k:np.exp(v) for k, v in log_class_priors.items()}))
print()

# Some example results
token_1 = "good"
token_2 = "bad"

print('Number of occurrences of "{}" in class POSITIVE: {}'.format(token_1, token_counts['pos'].get(token_1, 0.0)))
print('Number of occurrences of "{}" in class NEGATIVE: {}'.format(token_1, token_counts['neg'].get(token_1, 0.0)))
print('Number of occurrences of "{}" in class POSITIVE: {}'.format(token_2, token_counts['pos'].get(token_2, 0.0)))
print('Number of occurrences of "{}" in class NEGATIVE: {}'.format(token_2, token_counts['neg'].get(token_2, 0.0)))

Priors (log probabilities): {'pos': -0.6930299403933299, 'neg': -0.693264434473429}
Priors (probabilities): {'pos': 0.5000586235197562, 'neg': 0.4999413764802439}

Number of occurrences of "good" in class POSITIVE: 156.0
Number of occurrences of "good" in class NEGATIVE: 137.0
Number of occurrences of "bad" in class POSITIVE: 22.0
Number of occurrences of "bad" in class NEGATIVE: 165.0


Not surprisingly, the word "good" is more likely to occur in a positive document. That "good" is still common in negative documents can be caused by documents containing phrases like "not good" or "not so good". If we consider only single words, information about negation is lost.
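One way to retain some of this information is to use bigrams as additional tokens. The following is a minimal sketch in plain Python on a made-up sentence; it assumes that the preprocessing keeps the word "not", which may not be the case if stopwords are removed.

```python
# Hypothetical preprocessed sentence; assumes "not" survived preprocessing
tokens = 'plot not good'.split()

# Pair each token with its successor to form bigram tokens
bigram_tokens = ['_'.join(pair) for pair in zip(tokens, tokens[1:])]

print(bigram_tokens)  # ['plot_not', 'not_good']
```

A token such as `not_good` could then be counted like any other token, so it would contribute to the negative class instead of being split into "not" and "good".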

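Next, the method `predict()` actually calculates the class score $\log{P(c_i)} + \sum \log{P(w_i\ |\ c_i)}$, i.e., the logarithm of the product $P(c_i)\cdot \prod P(w_i\ |\ c_i)$ derived above. If you look closely at the individual lines, you can easily identify each part of the calculation. Note that each distinct token of a document contributes once to the score, regardless of how often it occurs in the document.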

In [None]:
def predict(X):
    y_pred = []
    
    # Loop over all test samples and predict the class label for each sample
    for doc in X:
        # Initialize class scores (i.e., log probabilities)
        pos_score, neg_score = 0, 0
        # Get the number of occurrences of each token in the document
        counts = get_token_counts(doc.split())
        for token, _ in counts.items():
            # Ignore unknown tokens
            if token not in vocabulary: 
                continue
                
            # Add Laplace smoothing
            log_w_given_pos = np.log( (token_counts['pos'].get(token, 0.0) + 1) / (sum(token_counts['pos'].values()) + len(vocabulary)) )
            log_w_given_neg = np.log( (token_counts['neg'].get(token, 0.0) + 1) / (sum(token_counts['neg'].values()) + len(vocabulary)) )
 
            # Update class scores
            pos_score += log_w_given_pos # Since we are dealing with log probabilities here
            neg_score += log_w_given_neg # we need to add (and not multiply) the values
 
        # Include priors in class scores
        pos_score += log_class_priors['pos'] # Since we are dealing with log probabilities here
        neg_score += log_class_priors['neg'] # we need to add (and not multiply) the values
 
        if pos_score > neg_score:
            y_pred.append(1)
        else:
            y_pred.append(0)
            
    # Return list of predicted class labels
    return y_pred

Let's test the method `predict()` on simple example samples. Note that the method expects a list/array as input, so we still have to wrap a single sample in a list with 1 element.

In [None]:
sample = ['nice movie happy end']  # should be 1 (positive)
# sample = ['boring flick']  # should be 0 (negative)

print("Final prediction: {}".format(predict(sample)))

We are now finally able to evaluate our classifier by running `predict()` over our test data `X_test`.

**Important:** Running the code cell below might take a couple of seconds even for such a small dataset. Using `sklearn`'s implementation of the Naive Bayes classifier would likely be several orders of magnitude faster. This is simply because our implementation is not optimized for performance in any way; for example, `predict()` recomputes the total number of words per class for every single token. The focus of the implementation is on understanding the basic intuition and steps of the Naive Bayes classifier. Implementations provided by popular packages such as `sklearn` are highly optimized, and those are of course the ones you should use in practice.
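One avoidable cost in `predict()` is that the smoothing denominator, `sum(token_counts[...].values()) + len(vocabulary)`, is computed again for every token although it never changes after training. The following self-contained sketch shows the idea of precomputing it once. It uses a tiny made-up training set, and all names (`X_toy`, `toy_counts`, `predict_fast()`, ...) are hypothetical; it is not a drop-in replacement for the cells in this notebook.

```python
import math

# Hypothetical toy training data (already preprocessed)
X_toy = ['nice movie happy end', 'great fun', 'boring flick', 'bad boring plot']
y_toy = [1, 1, 0, 0]

# Count tokens per class, mirroring what fit() does
toy_counts = {'pos': {}, 'neg': {}}
toy_vocabulary = set()
for doc, label in zip(X_toy, y_toy):
    cls = 'pos' if label == 1 else 'neg'
    for token in doc.split():
        toy_counts[cls][token] = toy_counts[cls].get(token, 0) + 1
        toy_vocabulary.add(token)

# Log priors: fraction of documents per class
toy_log_priors = {'pos': math.log(y_toy.count(1) / len(y_toy)),
                  'neg': math.log(y_toy.count(0) / len(y_toy))}

# Precompute the smoothing denominators once instead of once per token
denominators = {cls: sum(toy_counts[cls].values()) + len(toy_vocabulary)
                for cls in toy_counts}

def predict_fast(X):
    y_pred = []
    for doc in X:
        scores = dict(toy_log_priors)      # start from the log priors
        for token in set(doc.split()):     # each distinct token counts once
            if token not in toy_vocabulary:
                continue                   # ignore unknown tokens
            for cls in scores:
                count = toy_counts[cls].get(token, 0)
                scores[cls] += math.log((count + 1) / denominators[cls])
        y_pred.append(1 if scores['pos'] > scores['neg'] else 0)
    return y_pred

print(predict_fast(['happy movie', 'boring plot']))  # [1, 0]
```

The scores are the same as with the per-token computation; only the repeated summation over all token counts is avoided.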

In [None]:
y_pred = predict(X_test)

Let's see how good our classifier is:

In [None]:
print(classification_report(y_test, y_pred))
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

The result should show an accuracy and F1 score of around 0.77. If you compare this result with the ones obtained in the "Text Classification" notebook using the same dataset, you can see that we are in the same ballpark. While this is not proof of correctness, it suggests that our implementation of the Naive Bayes classifier works as intended.
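For readers who want to see what these numbers measure, here is a minimal sketch that computes accuracy, precision, recall, and F1 score by hand for the positive class. The label lists are made up for illustration and are unrelated to the results above.

```python
# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy: {:.3f}".format(accuracy))  # Accuracy: 0.750
print("F1 (positive class): {:.3f}".format(f1))  # F1 (positive class): 0.750
```

Note that `classification_report()` lists precision, recall, and F1 score for each class separately as well as their averages, whereas this sketch only covers the positive class.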

---

## Summary

The Naive Bayes classifier is a popular and simple probabilistic machine learning algorithm used for classification tasks, particularly in text classification and spam filtering. It operates based on Bayes' theorem and assumes conditional independence between features. Despite its simplicity and the "naive" assumption, Naive Bayes often performs well in practice, especially with large feature spaces. It is fast to train and make predictions and requires a small amount of training data.

When compared to other models, Naive Bayes has several distinct characteristics. Firstly, it assumes independence between features, which can be a limitation when features are dependent on each other. This assumption might cause the model to miss out on valuable correlations in the data. However, Naive Bayes can still perform well in practice, especially when the assumption holds approximately or when there are many irrelevant features.

Additionally, Naive Bayes is computationally efficient and has low memory requirements. It can handle high-dimensional data and large feature spaces effectively. This makes it suitable for text classification tasks, where the number of possible features (words or n-grams) can be enormous. Compared to more complex models like deep neural networks or support vector machines, Naive Bayes is less prone to overfitting, especially when the training data is limited. It can provide good results with smaller datasets, making it useful in scenarios where large labeled datasets are not available.

However, Naive Bayes has its limitations. Its independence assumption might not hold true in all cases. Additionally, without smoothing, it assigns zero probability to any feature that was never observed together with a class in the training data. This issue can be addressed using smoothing techniques like Laplace smoothing. In summary, the Naive Bayes classifier is a simple and efficient algorithm that performs well in text classification tasks and when dealing with large feature spaces. While it makes the naive assumption of feature independence, it often delivers competitive results and is especially useful with limited training data.