# Bayesian Learning

In this notebook you will learn how to implement the Naive Bayes classifier in Python and how to use the version
implemented in scikit-learn.

## The Dataset

In this notebook you will be working with the
[Twenty Newsgroups dataset](https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups).
This dataset consists of 20,000 messages taken from 20 newsgroups. The aim of this classification task is to predict
which newsgroup each message came from.

The newsgroups are:
<PRE>
    alt.atheism
    comp.graphics
    comp.os.ms-windows.misc
    comp.sys.ibm.pc.hardware
    comp.sys.mac.hardware
    comp.windows.x
    misc.forsale
    rec.autos
    rec.motorcycles
    rec.sport.baseball
    rec.sport.hockey
    sci.crypt
    sci.electronics
    sci.med
    sci.space
    soc.religion.christian
    talk.politics.guns
    talk.politics.mideast
    talk.politics.misc
    talk.religion.misc
</PRE>

The messages are typical postings and thus have headers including subject lines,
signature files, and quoted portions of other messages.

We will download this dataset directly from the UCI repository. Note that this dataset is made of
20 folders, one per newsgroup, containing
1,000 files each. Each file is a message. You can open these files with an editor.

In [None]:
from urllib.request import urlretrieve

dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroups-mld/20_newsgroups.tar.gz'
urlretrieve(dataset_url, '20_newsgroups.tar.gz')

The file is compressed, therefore we will now uncompress it.

In [None]:
import tarfile

with tarfile.open('20_newsgroups.tar.gz') as f:
    f.extractall('.')

We will now loop across the files in the folder and create a dataset list
containing tuples made of message texts (the content of the file) and labels (the folder name).
For each message we will also remove the header.

In [None]:
import glob
import os

dataset = []
# loop across the newsgroups files
for path in glob.glob('20_newsgroups/**/*'):
    folder, y, document_id = path.split(os.sep)
    with open(path, 'r', encoding='latin-1') as file:
        # splitting and rejoining on a double newline drops the first block, i.e. the header of each message
        x = '\n\n'.join(file.read().split('\n\n')[1:])
        dataset.append((x, y))
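
To make the header-stripping idiom concrete, here is a minimal sketch on a made-up message (the string `raw` is hypothetical and not taken from the dataset). It assumes, as the loop above does, that the header ends at the first blank line.

```python
# Hypothetical message, used only to illustrate the header-stripping idiom.
raw = (
    "From: alice@example.com\n"
    "Subject: Hello\n"
    "\n"
    "First paragraph of the body.\n"
    "\n"
    "Second paragraph of the body."
)

# Everything before the first blank line is the header; drop it and
# re-join the remaining blocks so blank lines inside the body survive.
body = '\n\n'.join(raw.split('\n\n')[1:])
print(body)
```

Only the first block is removed, so paragraph breaks inside the message body are preserved.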

Let's now print the values of the target variable, and the number of messages per class.

In [None]:
classes = {}
for x, y in dataset:
    if y not in classes:
        classes[y] = 0
    classes[y] += 1

print('#classes: ', len(classes))

for clazz in classes:
    print(clazz, classes[clazz])
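
The same tally can be written more compactly with `collections.Counter`. The sketch below uses a hypothetical three-item list (`toy_dataset`) in place of the real `dataset`, so the names and values are illustrative only.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for the real dataset list.
toy_dataset = [
    ('msg a', 'sci.space'),
    ('msg b', 'rec.autos'),
    ('msg c', 'sci.space'),
]

# Counter tallies the labels in a single pass.
toy_classes = Counter(y for _, y in toy_dataset)

print('#classes: ', len(toy_classes))
for clazz, n in toy_classes.most_common():
    print(clazz, n)
```

`most_common()` lists the classes from the most to the least frequent, which makes class imbalance easy to spot.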

Let's now have a look at the content of one example.

In [None]:
x, y = dataset[101]

print('Message:')
print(x)

print('Label:')
print(y)

We will now randomize the dataset and perform the train-test split.

In [None]:
import random
import numpy as np

# suppress=True makes numpy print floats in fixed-point instead of scientific notation (this affects printing only, not the stored values)
np.set_printoptions(suppress=True)

# shuffles the list in place
random.shuffle(dataset)

xs, ys = np.split(dataset, [1], axis=1)
xs = xs.reshape(-1)
ys = ys.reshape(-1)

We will now select 80% of the dataset for training and 20% for testing.

In [None]:
n_train = len(xs) * 80 // 100
xs_train, xs_test = np.split(xs, [n_train], axis=0)
ys_train, ys_test = np.split(ys, [n_train], axis=0)

print('training set shape:\t', xs_train.shape)
print('test set shape:\t\t', xs_test.shape)
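
The shuffle above is not seeded, so the split (and every accuracy figure that follows) changes from run to run. A minimal sketch of a repeatable shuffle and 80/20 split, using only the standard library, could look like this. The toy list and the seed value `42` are arbitrary assumptions made for illustration.

```python
import random

# Hypothetical toy data standing in for the (text, label) list.
toy = [(f'message {i}', f'class {i % 2}') for i in range(10)]

# A dedicated, seeded generator makes the shuffle repeatable
# (the seed value 42 is an arbitrary choice).
rng = random.Random(42)
rng.shuffle(toy)

# Same 80/20 arithmetic as in the cell above.
n_train = len(toy) * 80 // 100
train, test = toy[:n_train], toy[n_train:]

print(len(train), len(test))
```

Using a separate `random.Random` instance keeps the seed local, so other code relying on the global generator is unaffected.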

### Preprocessing

Since we are working with text, and the learners we will use later require examples made of
categorical values, we will now convert each text into a bag of words: the set of distinct lowercase tokens it contains.

In [None]:
import re

def preprocess(text):
    tokens = re.split(r'[^\w]', text) # split on every non-word character
    tokens = [t.lower() for t in tokens if t]
    res = set(tokens)
    return res

xs_train_prep = [preprocess(x) for x in xs_train]
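
As a quick illustration of what `preprocess` returns, the sketch below applies the same logic to a made-up sentence (the sentence is hypothetical). The function is repeated so the snippet runs on its own.

```python
import re

def preprocess(text):
    # Same logic as the cell above: split on non-word characters,
    # lowercase, and keep each distinct token once.
    tokens = re.split(r'[^\w]', text)
    return {t.lower() for t in tokens if t}

# Hypothetical sentence used only for illustration.
print(sorted(preprocess("Linux is beautiful, and Linux is free!")))
# → ['and', 'beautiful', 'free', 'is', 'linux']
```

Punctuation disappears, case is ignored, and repeated words are counted once, so each message is reduced to the set of tokens it contains.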

## The Naive Bayes Classifier

We will now implement the Naive Bayes Classifier in Python from scratch.
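
For reference, the standard Naive Bayes decision rule can be written as follows. The symbols are introduced here for illustration: $c$ is a class, $t$ a token, $x_q$ the set of tokens of the query message, $N$ the number of training messages, $n_c$ the number of training messages in class $c$, and $n_{t,c}$ the number of messages of class $c$ that contain $t$.

$$
\hat{y} = \arg\max_{c} \; P(c) \prod_{t \in x_q} P(t \mid c),
\qquad
P(c) \approx \frac{n_c}{N},
\qquad
P(t \mid c) \approx \frac{n_{t,c}}{n_c}
$$

Since $N$ is the same for every class, replacing $P(c)$ with the raw count $n_c$ does not change which class attains the maximum.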

In [None]:
from collections import defaultdict

class NaiveBayes:

    def __init__(self):
        # keep the count of the number of times a token appears for a given class
        self.count_token_given_class = defaultdict(lambda: defaultdict(int))
        # keep the count of the number of times an example belongs to a class
        self.count_class = defaultdict(int)
        # keep the count of the number of times a token is in an example
        self.count_token = defaultdict(int)

    def add_example(self, x, y):
        """
        Add one example to the list of training examples.
        :param x: The set of tokens
        :param y: The label associated to this example
        """
        self.count_class[y] += 1
        count = self.count_token_given_class[y]
        for token in x:
            # defaultdict(int) starts missing keys at 0, so no membership check is needed
            count[token] += 1
            self.count_token[token] += 1

    def add_examples(self, xs, ys):
        """
        Add a list of examples to the list of training examples.
        :param xs: A list of token sets
        :param ys: A list of labels associated to the examples
        """
        for x, y in zip(xs, ys):
            self.add_example(x, y)

    def classify(self, x_q):
        scores = {}
        for clazz in self.count_class:
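            # class count: proportional to P(class), which is enough for the argmax
            p_class = self.count_class[clazz]
            count = self.count_token_given_class[clazz]
            p_tokens = 1.0
            for token in x_q:
                # P(token | class): fraction of this class's examples that contain the token;
                # .get returns 0 for unseen tokens without inserting them into the defaultdict
                p_token_given_class = count.get(token, 0) / self.count_class[clazz]
                p_tokens *= p_token_given_class
            scores[clazz] = p_class * p_tokens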

        max_score = -float('inf')
        max_class = None
        for clazz, score in scores.items():
            if score > max_score:
                max_score = score
                max_class = clazz

        return max_class

We now train this classifier.

In [None]:
nb_clf = NaiveBayes()
nb_clf.add_examples(xs_train_prep, ys_train)

We can now classify any set of tokens.

In [None]:
nb_clf.classify({'linux', 'is', 'beautiful'})

We will now evaluate this classifier by computing its accuracy on the training and test sets.

In [None]:
def accuracy(ys, ys_hat):
    res = 0
    for y, y_hat in zip(ys, ys_hat):
        if y == y_hat:
            res += 1
    res /= len(ys)
    return res

ys_train_pred = []
for x in xs_train_prep:
    y_hat = nb_clf.classify(x)
    ys_train_pred.append(y_hat)

# preprocess test set
xs_test_prep = [preprocess(x) for x in xs_test]

ys_test_pred = []
for x in xs_test_prep:
    y_hat = nb_clf.classify(x)
    ys_test_pred.append(y_hat)

print('Train accuracy of NB', accuracy(ys_train, ys_train_pred))
print('Test accuracy of NB', accuracy(ys_test, ys_test_pred))

The performance of this Naive Bayes classifier is quite poor: if the tested example
contains a token that never appeared in the training set,
the score of every class is 0 regardless of the other tokens, and
the classifier returns an arbitrary class (the first one it iterates over).

In [None]:
nb_clf.classify({'linux', 'is', 'beautiful', 'djsklajdklsajdkl'})

In order to avoid this issue we need to smooth the probabilities using the m-estimate.
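
As a reminder, the m-estimate of a conditional probability is usually written as follows. The symbols are introduced here for illustration: $n_{t,c}$ is the number of training messages of class $c$ that contain token $t$, $n_c$ is the number of training messages of class $c$, $p$ is a prior estimate of the probability, and $m$ is the equivalent sample size that weights the prior.

$$
P(t \mid c) \approx \frac{n_{t,c} + m\,p}{n_c + m}
$$

With $m = 0$ this reduces to the unsmoothed ratio $n_{t,c} / n_c$. With $m > 0$ a token never seen in class $c$ still gets a small non-zero probability: for example, assuming $n_{t,c} = 0$, $n_c = 800$, $m = 1$ and $p = 1/20$, the estimate is $0.05 / 801 \approx 6.2 \times 10^{-5}$.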

In [None]:
class MEstimateNaiveBayes(NaiveBayes):

    def __init__(self, m = 1):
        super().__init__()
        self.m = m

    def classify(self, x_q):
        scores = {}
        # prior estimate p of the m-estimate, here uniform over the classes
        p = 1.0/len(self.count_class)
        for clazz in self.count_class:
            n_class = self.count_class[clazz]
            logp_class = np.log(n_class)
            count = self.count_token_given_class[clazz]
            logp_tokens = 0.0
            for token in x_q:
                # m-estimate in log space: (n_token_class + m * p) / (n_class + m);
                # .get returns 0 for unseen tokens without inserting them into the defaultdict
                num_token_given_class = np.log(count.get(token, 0) + p * self.m)
                den_token_given_class = - np.log(n_class + self.m)
                logp_tokens += num_token_given_class + den_token_given_class
            scores[clazz] = logp_class + logp_tokens

        max_score = -float('inf')
        max_class = None
        for clazz, score in scores.items():
            if score > max_score:
                max_score = score
                max_class = clazz

        return max_class
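
The class above adds logarithms instead of multiplying probabilities. A short sketch of why this matters: multiplying many small numbers underflows to zero in floating point, while the sum of their logarithms stays finite. The probabilities below are hypothetical values chosen only to trigger the effect.

```python
import math

# Hypothetical per-token probabilities: 1000 tokens, each with P = 1e-5.
probs = [1e-5] * 1000

# Direct product: the true value (1e-5000) is far below the smallest float.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: same information, but representable.
log_sum = sum(math.log(p) for p in probs)

print(product)   # underflows to 0.0
print(log_sum)   # finite, roughly -11512.9
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest probability score.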

We now train this classifier.

In [None]:
mnb_clf = MEstimateNaiveBayes()
mnb_clf.add_examples(xs_train_prep, ys_train)

We now perform the same test as before.

In [None]:
mnb_clf.classify({'linux', 'is', 'beautiful'})

In [None]:
mnb_clf.classify({'linux', 'is', 'beautiful', 'xzydsads'})

And evaluate the model.

In [None]:
ys_train_pred = []
for x in xs_train_prep:
    y_hat = mnb_clf.classify(x)
    ys_train_pred.append(y_hat)

ys_test_pred = []
for x in xs_test_prep:
    y_hat = mnb_clf.classify(x)
    ys_test_pred.append(y_hat)

print('Train accuracy of MNB', accuracy(ys_train, ys_train_pred))
print('Test accuracy of MNB', accuracy(ys_test, ys_test_pred))

Try to change the `m` parameter value to get a better result.

## Naive Bayes in Scikit-Learn

We will now build the same kind of classifier using the implementation provided by scikit-learn,
the `MultinomialNB` model.

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB()


In order to use this classifier we need to convert each message into a vector using the bag-of-words approach.
Scikit-learn provides the `CountVectorizer` class, which converts a collection of texts into a
matrix of token counts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

We now vectorize the training set by using the method `fit_transform`.

In [None]:
xs_train_prep = vectorizer.fit_transform(xs_train)
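
To give an intuition for the matrix of token counts, here is a rough pure-Python sketch of the idea on a made-up two-document corpus. It is not scikit-learn's implementation: it only assumes the default behaviour of lowercasing the text and keeping tokens of two or more word characters, and the names `docs`, `vocabulary` and `matrix` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, used only for illustration.
docs = ["Linux is beautiful", "Linux is free, free as in freedom"]

# Assumed to mirror CountVectorizer's defaults: lowercase the text and
# keep runs of two or more word characters as tokens.
token_re = re.compile(r'\b\w\w+\b')
tokenized = [token_re.findall(d.lower()) for d in docs]

# One column per distinct token, in alphabetical order.
vocabulary = sorted({t for tokens in tokenized for t in tokens})

# One row per document, holding the count of each vocabulary token.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[t] for t in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Unlike the `preprocess` function used earlier, this representation keeps how many times each token occurs (`free` is counted twice in the second document) rather than only whether it occurs.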


We can now train the Naive Bayes classifier.

In [None]:
nb_clf.fit(xs_train_prep, ys_train)

We now evaluate this classifier by plotting its confusion matrix, which is
often used to describe the performance of a
classification model on a test set.
In the plot produced by the next cell, each row represents the instances of a predicted class, while each column represents
the instances of an actual class.

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import confusion_matrix

xs_test_prep = vectorizer.transform(xs_test)
ys_test_pred = nb_clf.predict(xs_test_prep)

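# pass the labels explicitly so that the tick labels match the order of the matrix
labels = nb_clf.classes_
mat = confusion_matrix(ys_test, ys_test_pred, labels=labels)

fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=labels, yticklabels=labels, ax=ax)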
plt.xlabel('true label')
plt.ylabel('predicted label')
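
To see how the entries of a confusion matrix are obtained, here is a minimal pure-Python sketch on made-up labels (the three classes and the two label lists are hypothetical). It follows the convention of scikit-learn's `confusion_matrix`, where rows are true classes and columns are predicted classes, and then transposes the result as the plotting code above does.

```python
# Hypothetical true and predicted labels, used only for illustration.
labels = ['rec.autos', 'sci.med', 'sci.space']
y_true = ['sci.space', 'sci.space', 'rec.autos', 'sci.med', 'sci.space']
y_pred = ['sci.space', 'sci.med', 'rec.autos', 'sci.med', 'sci.space']

index = {label: i for i, label in enumerate(labels)}

# mat[i][j] counts examples whose true class is labels[i] and whose
# predicted class is labels[j] (rows = true, columns = predicted).
mat = [[0] * len(labels) for _ in labels]
for t, p in zip(y_true, y_pred):
    mat[index[t]][index[p]] += 1

# Transposing swaps the roles: rows = predicted, columns = true.
mat_t = [list(col) for col in zip(*mat)]

for row in mat:
    print(row)
```

The diagonal holds the correct predictions. The single off-diagonal entry records the one `sci.space` message that was predicted as `sci.med`.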

From this matrix we can see which pairs of classes are confused most often.

We now report the precision, recall, and F1 score of each class, together with the overall accuracy.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(ys_test, ys_test_pred))

Interpret the results. What is the difference between the macro avg and the weighted avg rows?