<img src="data/images/div/lecture-notebook-header.png" />

# Naive Bayes Classifier

The Naive Bayes (NB) classifier -- here more specifically: Multinomial Naive Bayes (MNB) classifier -- is a probabilistic machine learning model based on Bayes' theorem. It's particularly well-suited for text classification tasks, especially when dealing with word frequencies or occurrence counts within text documents. Here's an overview of the Multinomial Naive Bayes classifier and its application in text classification:

* **Bayes' Theorem:** The classifier is based on Bayes' theorem, which calculates the probability of a label (class) given observed features (words in the document). In text classification, it computes the probability of a document belonging to a particular category or sentiment class based on the occurrence of words in the document.

* **Multinomial Distribution:** MNB assumes that the features (word counts or frequencies) follow a multinomial distribution. It works well with discrete features, such as word counts in text documents.

* **Naive Bayes Assumption:** The "naive" assumption in MNB refers to the independence assumption between features (words in this context). It assumes that the presence or absence of each word in the document is independent of the presence or absence of other words, given the class label. While this assumption might not hold true in reality, MNB often performs well in practice, especially for text classification tasks.

* **Text Classification:** In text classification, MNB uses the frequency of words (bag-of-words model) or other features derived from text (like TF-IDF - Term Frequency-Inverse Document Frequency) to build a probabilistic model. It calculates the likelihood of each word occurring in a particular class based on the training data. When a new document is encountered, it uses Bayes' theorem to calculate the probability of the document belonging to each class and selects the class with the highest probability as the predicted label.

* **Sparse Data Handling:** MNB copes well with the high-dimensional and sparse datasets typical in text classification. It often works well even with relatively small training datasets and handles a large number of features (words) efficiently; because it estimates only one parameter per word and class, it is less prone to overfitting than more flexible models.

Overall, the Multinomial Naive Bayes classifier is a simple yet effective probabilistic model for text classification tasks. Its ease of implementation, efficiency with sparse data, and reasonable performance, especially in tasks like document classification, spam filtering, and sentiment analysis, make it a popular choice in the field of natural language processing.
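
To make the points above concrete, here is a minimal sketch of a Multinomial Naive Bayes classifier written by hand. It is not the scikit-learn implementation; the toy corpus, the function names `train_mnb()` and `predict_mnb()`, and the choice of add-one (Laplace) smoothing with `alpha=1.0` are assumptions made purely for illustration. The sketch works in log space, so the product of word likelihoods becomes a sum.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (tokenized document, class label)
toy_docs = [
    ("great fun great cast".split(), 1),
    ("fun and moving".split(), 1),
    ("dull and boring".split(), -1),
    ("boring plot dull cast".split(), -1),
]

def train_mnb(docs, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts."""
    class_doc_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = sorted({w for tokens, _ in docs for w in tokens})
    # log P(class): fraction of training documents in each class
    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    # log P(word | class): smoothed relative frequency of each word within the class
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_mnb(tokens, log_prior, log_likelihood):
    """Return the class with the highest (unnormalized) log posterior and all scores."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get), scores

toy_log_prior, toy_log_likelihood = train_mnb(toy_docs)
toy_label, toy_scores = predict_mnb("great cast".split(), toy_log_prior, toy_log_likelihood)
print(toy_label, toy_scores)
```

Because each repeated word contributes its log likelihood once per occurrence, word counts enter the score directly, which is the multinomial part of the model.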

## Setting up the Notebook

### Required packages

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn import metrics
from sklearn.pipeline import Pipeline

from tqdm import tqdm

## Preparing the Data

For this notebook, we use a simple dataset for sentiment classification. This dataset consists of 10,662 sentences, where 50% of the sentences are labeled 1 (positive), and 50% of the sentences are labeled -1 (negative).

### Loading Sentence/Label Pairs from File

In [None]:
sentences, labels = [], []

with open("data/datasets/sentence-polarities/sentence-polarities.csv") as file:
    for line in file:
        line = line.strip()
        sentence, label = line.split("\t")
        sentences.append(sentence)
        labels.append(int(label))  
        
print("Total number of sentences: {}".format(len(sentences)))
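
As a small sanity check, the class balance stated above can be verified by counting the labels. The sketch below is self-contained and therefore uses a hypothetical three-line sample in the same `sentence<TAB>label` layout instead of the real file; the names `sample`, `toy_sentences`, and `toy_labels` are assumptions. It also shows one possible way to skip blank or malformed lines rather than failing on them.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the same "sentence<TAB>label" layout as the dataset file
sample = "a gripping , funny film\t1\ndull and lifeless\t-1\n\nsmart and moving\t1\n"

toy_sentences, toy_labels = [], []
for row in csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE):
    if len(row) != 2:
        continue  # skip blank or malformed lines instead of raising an error
    toy_sentences.append(row[0])
    toy_labels.append(int(row[1]))

# Number of sentences per class label
toy_balance = Counter(toy_labels)
print(toy_balance)
```

Applied to the real `labels` list, the same `Counter(labels)` call should report equally many sentences for both classes if the description of the dataset holds.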

### Create Training & Test Set

To evaluate any classifier, we need to split our dataset into a training and a test set. The function `train_test_split()` makes this very easy; it also shuffles the dataset by default, which matters for this example because the dataset file is ordered with all positive sentences coming first. In the example below, we set the size of the test set to 20%.


In [None]:
# Split sentences and labels into training and test set with a test set size of 20%
sentences_train, sentences_test, labels_train, labels_test = train_test_split(sentences, labels, test_size=0.2, random_state=42)

# We can directly convert the numerical class labels from lists to numpy arrays
y_train = np.asarray(labels_train)
y_test = np.asarray(labels_test)

print("Size of training set: {}".format(len(sentences_train)))
print("Size of test set: {}".format(len(sentences_test)))
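
Random shuffling only keeps the class proportions approximately equal in both splits. If exact proportions matter, `train_test_split()` accepts a `stratify` argument. The following sketch illustrates this on hypothetical toy data (the names `toy_sentences` and `toy_labels` are assumptions) that is ordered like the dataset file, with all positive examples first.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data, ordered like the file: positives first, then negatives
toy_sentences = ["pos {}".format(i) for i in range(10)] + ["neg {}".format(i) for i in range(10)]
toy_labels = [1] * 10 + [-1] * 10

# stratify=toy_labels preserves the 50/50 class ratio in both splits
s_train, s_test, l_train, l_test = train_test_split(
    toy_sentences, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(Counter(l_train), Counter(l_test))
```

For a large and balanced dataset like ours, the unstratified split used above is usually close enough, so this is an optional refinement.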

## Training & Testing a Naive Bayes Classifier

Let's first have a look at how to train a Naive Bayes classifier with the minimum number of steps. For this, we pick some reasonable values for the vectorizer and use the default values of the [`MultinomialNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) classifier.

In [None]:
# Create the document-term matrix using unigrams only
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=10000)

X_train = tfidf_vectorizer.fit_transform(sentences_train)
X_test = tfidf_vectorizer.transform(sentences_test)
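
To get a feeling for what the vectorizer produces, the sketch below applies a `TfidfVectorizer` with default settings to three hypothetical sentences (the names `toy_corpus`, `toy_vectorizer`, and `X_toy` are assumptions). The result is a sparse matrix with one row per document and one column per vocabulary term. Note that the default tokenizer lowercases the text and ignores single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
toy_corpus = ["a great movie", "a dull movie", "great cast , great fun"]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_corpus)

# One row per document, one column per term in the learned vocabulary
print(X_toy.shape)
# Mapping from each term to its column index
print(sorted(toy_vectorizer.vocabulary_))
```

The vectorizer must be fitted on the training data only; the test data is merely transformed with the vocabulary and IDF weights learned from the training data, as in the cell above.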

Using the training data, we can train a Naive Bayes classifier with a single line of code:


In [None]:
model = MultinomialNB().fit(X_train, y_train)

Once trained, we can predict the class labels for the document vectors in our test set.

In [None]:
y_pred = model.predict(X_test)

`y_pred` now contains the 2,133 predicted labels that we can compare with the ground truth labels from the test set. scikit-learn provides methods to easily calculate all the important metrics we covered in the lecture. Since we only have two class labels (i.e., binary classification), we do not have to set the `average` parameter to indicate micro or macro averaging; with the default `average="binary"`, the scores are reported for the positive class (label 1).

In [None]:
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)

print("Precision: {:.3f}".format(precision))
print("Recall:   {:.3f}".format(recall))
print("F1 score: {:.3f}".format(f1))
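
For reference, the three scores can also be computed by hand from the counts of true positives, false positives, and false negatives. The following sketch does this in plain Python for hypothetical label lists; the helper `binary_scores()` and the toy lists are assumptions for illustration, not part of scikit-learn.

```python
def binary_scores(y_true, y_pred, pos_label=1):
    """Compute precision, recall, and f1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical ground truth and predictions
toy_true = [1, 1, 1, -1, -1, -1, 1, -1]
toy_pred = [1, 1, -1, -1, 1, -1, 1, -1]

p, r, f = binary_scores(toy_true, toy_pred)
print("Precision: {:.3f}, Recall: {:.3f}, F1 score: {:.3f}".format(p, r, f))
```

Here, 3 of the 4 predicted positives are correct and 3 of the 4 actual positives are found, so all three scores are 0.75.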

scikit-learn also provides a method `classification_report()` for a more detailed description of the results, showing the precision, recall, and f1 scores broken down for each class.

In [None]:
print(metrics.classification_report(y_test, y_pred))

## Hyperparameter Tuning

The Naive Bayes Classifier -- compared to, e.g., the K-Nearest Neighbor Classifier -- has no fundamentally intrinsic parameter that needs to be chosen wisely. If you check the documentation of [`MultinomialNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) you will see some input parameters. However, they are not as fundamental as, say, `n_neighbors` for the [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).


### Selecting the Best Maximum N-Gram Size

In the case of the Naive Bayes Classifier, the size of the n-grams used as input typically has the largest effect on the results. In the basic example above, we used only unigrams. Now let's see how the result changes if we change the maximum n-gram size when vectorizing our input data.


In [None]:
min_ngram_size = 1
max_ngram_size = 5

# One run per maximum n-gram size (both bounds are inclusive)
num_runs = max_ngram_size - min_ngram_size + 1

# List of (max n-gram size, mean f1 score) tuples
results = []

with tqdm(total=num_runs) as pbar:
    for ngram in range(min_ngram_size, max_ngram_size+1):
        # Create the document-term matrix with n-grams of size 1 to ngram
        tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngram), max_features=20000)
        X_train = tfidf_vectorizer.fit_transform(sentences_train)
        # Evaluate the model with 10-fold cross validation on the training data only
        model = MultinomialNB()
        scores = cross_val_score(model, X_train, y_train, cv=10, scoring="f1")
        mean_score = np.mean(scores)
        results.append((ngram, mean_score))
        pbar.update(1)

With the f1 scores for the different values for `max_ngram_size`, we can quickly plot those results.

In [None]:
plt.figure()
plt.plot([s[0] for s in results], [s[1] for s in results], lw=3)
font_axes = {'family':'serif','color':'black','size':16}
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Max N-Gram Size", fontdict=font_axes)
plt.ylabel("F1 Score", fontdict=font_axes)
plt.tight_layout()
plt.show()

While the best value for the maximum n-gram size is at 2, keep in mind that the f1 score actually doesn't change too much; see the scale of the y-axis. The main reason for this is that, for example, a maximum n-gram size of 3 still contains all unigrams and bigrams. This is the most common approach in practice. However, feel free to also set the lower bound of `ngram_range` (which the code above always fixes to 1) to a larger value and see how it affects the results.
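
The observation that larger maximum n-gram sizes only add features can be illustrated with a small sketch. The helper `word_ngrams()` below is a simplified stand-in written for this illustration; it splits on whitespace only and is not how `TfidfVectorizer` tokenizes text internally.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n as strings."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

toy_tokens = "not a good movie".split()

up_to_2 = word_ngrams(toy_tokens, 1, 2)  # unigrams and bigrams
up_to_3 = word_ngrams(toy_tokens, 1, 3)  # unigrams, bigrams, and trigrams

print(up_to_2)
print(up_to_3)
# Every feature available with max size 2 is still available with max size 3
print(set(up_to_2) <= set(up_to_3))
```

With a lower bound of 2, in contrast, all unigrams would be dropped, which changes the feature set much more substantially.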

Of course, all these results and observations only hold true for this specific data set and might significantly differ for other ones.


## Pipelines & Grid Search

Hyperparameter tuning is an important step, but the previous example has shown that it can be quite tedious. There, we tuned only a single parameter, which required one loop over its candidate values. If we wanted to try all combinations of values for $N$ parameters at the same time, we would need $N$ nested loops. Luckily, scikit-learn makes this much easier using [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

Since the parameters we would like to tune refer to 2 different components -- the vectorizer and the classifier -- we also need a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to combine both components into a single model. Let's do this first:


In [None]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

Now we can define the search space by providing the set of values for the hyperparameters we want to consider. Note that the identifiers of the parameters are a combination of the name of the step in the pipeline (here: `tfidf` and `nb`) and the name of the parameter in the respective class, joined by a double underscore. For example, `tfidf__max_df` refers to the `max_df` parameter of the `TfidfVectorizer`.

In [None]:
parameters = {
    'tfidf__max_df': (0.75, 1.0),
    'tfidf__max_features': (5000, 10000),
    'tfidf__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'nb__alpha': (0.5, 1.0, 1.5)
}
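
If you are unsure which identifiers are valid, you can ask the pipeline itself via `get_params()`. The sketch below builds a pipeline with the same two steps under the hypothetical name `demo_pipeline` and lists all tunable parameter names that belong to one of the steps.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same two steps as the pipeline above, under a hypothetical name
demo_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# All parameter names of the pipeline; step parameters use the "<step>__<parameter>" pattern
tunable = sorted(demo_pipeline.get_params().keys())
print([name for name in tunable if name.startswith(("tfidf__", "nb__"))])
```

Any of the printed names can be used as a key in the search space dictionary.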

Now we can use `GridSearchCV` to check all possible combinations of parameter values. We kept the number of possible values rather small to avoid overly long run times here. Note that for any parameter not listed above (e.g., `min_df` of the vectorizer) the default value is used.

In [None]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=2, cv=5)

grid_search = grid_search.fit(sentences_train, labels_train)
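
To estimate the run time in advance, it helps to count how many models the grid search has to train. The sketch below enumerates the combinations with `itertools.product`, using a copy of the search space under the hypothetical name `search_space`. Each combination is trained once per cross-validation fold; with the default `refit=True`, `GridSearchCV` additionally trains one final model on the complete training data.

```python
from itertools import product

# Copy of the search space defined above
search_space = {
    'tfidf__max_df': (0.75, 1.0),
    'tfidf__max_features': (5000, 10000),
    'tfidf__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'nb__alpha': (0.5, 1.0, 1.5)
}
num_folds = 5  # matches cv=5 in the grid search

# Cartesian product of all candidate values
combinations = list(product(*search_space.values()))
num_fits = len(combinations) * num_folds

print("Parameter combinations: {}".format(len(combinations)))
print("Model fits during cross validation: {}".format(num_fits))
```

Adding just one more parameter with three candidate values would triple both numbers, which is why the search space should be chosen with care.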

Once the `GridSearchCV` has checked all possible parameter combinations, we can read out the best combination as follows:

In [None]:
print(grid_search.best_params_)

With these best parameter values -- note that those might not really be the best values as we selected just some alternatives for this example -- we compute the final scores by vectorizing our data and training the Naive Bayes classifier using those parameters. We train the classifier using the complete training data and evaluate it over the test data. Note that the cell below hard-codes example values and uses the default `alpha`; replace them with the values printed above if they differ. Appreciate that we used the test data only this one time for the final results.

In [None]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10000, max_df=0.75)

X_train = tfidf_vectorizer.fit_transform(sentences_train)
X_test = tfidf_vectorizer.transform(sentences_test)

model = MultinomialNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)

print("Precision: {:.3f}".format(precision))
print("Recall:   {:.3f}".format(recall))
print("F1 score: {:.3f}".format(f1))
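
As an alternative to copying the best parameter values by hand, a fitted `GridSearchCV` object exposes the attribute `best_estimator_`: with the default `refit=True`, this is the complete pipeline retrained on all training data using the best parameter combination. The following sketch demonstrates this on hypothetical toy data with a deliberately tiny search space; all names prefixed with `toy_` are assumptions made to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only to keep the sketch self-contained
toy_train = ["great fun", "great cast", "moving and fun", "smart and great",
             "dull plot", "boring cast", "dull and boring", "boring and slow"]
toy_y = [1, 1, 1, 1, -1, -1, -1, -1]
toy_test = ["great and smart", "dull and slow"]

toy_search = GridSearchCV(
    Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())]),
    {"tfidf__ngram_range": [(1, 1), (1, 2)], "nb__alpha": [0.5, 1.0]},
    cv=2,
)
toy_search.fit(toy_train, toy_y)

# best_estimator_ is the whole pipeline, refit on all training data with the best parameters
toy_pred = toy_search.best_estimator_.predict(toy_test)
print(toy_search.best_params_, toy_pred)
```

In the notebook, the analogous call would be `grid_search.best_estimator_.predict(sentences_test)`, which takes the raw sentences because the vectorizer is part of the pipeline.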

---

## Summary

In this notebook, we looked at the Naive Bayes classifier. Of course, with packages like scikit-learn, it is very easy to train a classifier with very few lines of code. We saw that the Naive Bayes classifier is in some sense very easy to train as it does not feature any very fundamental parameters that need to be tuned. Despite its simplicity, this model can still provide good results for text classification and can serve as a simple baseline to compare against more sophisticated models for text classification.