Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in every place marked `YOUR CODE HERE`, removing the `raise NotImplementedError()` placeholder. Also fill in your student ID below:

In [None]:
STUDENT_ID = "128"

---

## Introductory code

In [1]:
import nltk
import math
nltk.download('punkt')
from nltk import punkt

docs = []
docs.append(['just plain boring', 'entirely predictable and lacks energy', 'no surprises and very few laughs'])
docs.append(['very powerful', 'the most fun film of the summer'])
test_doc = 'the film was predictable with no fun'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Implement Multinomial NB Extensions

We will build a class covering three variations of Naive Bayes: (a) Multinomial NB, (b) Binary Multinomial NB, and (c) Multivariate Bernoulli NB. Variant (a) is given. You must complete (b), activated with the constructor argument `mode=Mode.BINARY_MULTINOMIAL`, and (c), activated with `mode=Mode.BERNOULLI`.
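
The three variants differ mainly in what they count. As a rough illustration only, the sketch below shows what each variant "sees" in one training document. It uses plain `str.split` instead of `nltk.word_tokenize`, and `feature_views` is a hypothetical helper name that is not part of the assignment.

```python
from collections import Counter


def feature_views(doc, vocabulary):
    """Hypothetical helper: show what each NB variant sees in one document."""
    tokens = doc.split()  # assumption: whitespace tokenization is enough here
    return {
        # (a) Multinomial: every occurrence of a word counts.
        "multinomial": Counter(tokens),
        # (b) Binary multinomial: each word counts at most once per document.
        "binary_multinomial": Counter(set(tokens)),
        # (c) Bernoulli: presence or absence of every vocabulary word.
        "bernoulli": {word: word in tokens for word in vocabulary},
    }


views = feature_views("the most fun film of the summer", ["the", "fun", "boring"])
print(views["multinomial"]["the"])         # 2
print(views["binary_multinomial"]["the"])  # 1
print(views["bernoulli"]["boring"])        # False
```

Note that only the Bernoulli view records words that are absent from the document, which is why that variant scores every vocabulary word at prediction time.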

In [11]:
from collections import Counter
from enum import Enum

# Download the tokenizer data that nltk.word_tokenize needs.
nltk.download('punkt_tab')


class Mode(Enum):
    MULTINOMIAL = 1
    BINARY_MULTINOMIAL = 2
    BERNOULLI = 3

class NB():
    def __init__(self, mode: Mode):
        """
        Initializes the Naive Bayes classifier.
        :param mode: Mode of Naive Bayes (MULTINOMIAL, BINARY_MULTINOMIAL, BERNOULLI)
        """
        self.mode = mode

    def fit(self, docs):
        """
        Fits the model to the given documents.
        :param docs: List of lists, where each inner list contains documents for a class.
        """
        self.num_classes = len(docs)
        self.priors = [len(c) / sum(len(c) for c in docs) for c in docs]
        self.vocabulary = set()
        self.counts = []

        for class_docs in docs:
            tokenized_docs = [
                set(nltk.word_tokenize(doc)) if self.mode != Mode.MULTINOMIAL else nltk.word_tokenize(doc)
                for doc in class_docs
            ]
            class_tokens = [token for doc in tokenized_docs for token in doc]
            self.vocabulary.update(class_tokens)
            self.counts.append(Counter(class_tokens))

        self.vocabulary = list(self.vocabulary)
        vocab_size = len(self.vocabulary)

        self.probs = [
            {
                token: (class_count.get(token, 0) + 1) / (sum(class_count.values()) + vocab_size)
                for token in self.vocabulary
            } if self.mode != Mode.BERNOULLI else {
                token: (class_count.get(token, 0) + 1) / (len(docs[i]) + 2)
                for token in self.vocabulary
            }
            for i, class_count in enumerate(self.counts)
        ]

    def predict_proba(self, doc):
        """
        Predicts the probabilities for the given document.
        :param doc: The document to classify (string).
        :return: List of probabilities for each class.
        """
        tokens = set(nltk.word_tokenize(doc)) if self.mode != Mode.MULTINOMIAL else nltk.word_tokenize(doc)
        scores = []

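        for i in range(self.num_classes):
            score = math.log(self.priors[i])
            if self.mode == Mode.BERNOULLI:
                # Bernoulli scores every vocabulary word, whether present or absent.
                for token in self.vocabulary:
                    prob = self.probs[i][token]
                    score += math.log(prob if token in tokens else 1 - prob)
            else:
                # Multinomial variants score only the document's in-vocabulary tokens.
                # In MULTINOMIAL mode `tokens` is a list, so repeats count each time.
                for token in tokens:
                    if token in self.probs[i]:
                        score += math.log(self.probs[i][token])
            scores.append(score)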

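        # Subtract the maximum log score before exponentiating to avoid underflow.
        max_score = max(scores)
        exp_scores = [math.exp(s - max_score) for s in scores]
        total_score = sum(exp_scores)
        return [e / total_score for e in exp_scores]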

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
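
For reference, the add-one smoothed estimates computed in `fit` can be written as follows. The notation is introduced here for illustration: $\mathrm{count}(w, c)$ is the number of occurrences of word $w$ in class $c$ (after per-document deduplication in the binary variant), $V$ is the vocabulary, $N_c$ is the number of documents in class $c$, and $N_c(w)$ is the number of those documents that contain $w$.

$$
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
\quad \text{(multinomial and binary multinomial)}
$$

$$
\hat{P}(w \mid c) = \frac{N_c(w) + 1}{N_c + 2}
\quad \text{(Bernoulli)}
$$

The Bernoulli denominator adds 2 rather than $|V|$ because each word has only two outcomes per document: present or absent.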


In [12]:
"""Testing that the class returns the correct result in the provided standard multinomial mode"""
mnb = NB(mode=Mode.MULTINOMIAL)
mnb.fit(docs)
assert round(mnb.predict_proba('the film was predictable with no fun')[0], 6) == 0.184151
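
The expected value can be checked by hand without NLTK. The sketch below assumes counts read off the training data manually: 14 tokens in class 0, 9 tokens in class 1, and 20 distinct words overall. The variable names are illustrative only.

```python
import math

V = 20                   # vocabulary size, counted by hand (assumption)
n_tokens = [14, 9]       # total tokens per class, counted by hand (assumption)
priors = [3 / 5, 2 / 5]  # 3 documents in class 0, 2 in class 1

# Per-class counts of the in-vocabulary test words: the, film, predictable, no, fun.
# "was" and "with" never occur in training, so they are skipped.
counts = [[0, 0, 1, 1, 0], [2, 1, 0, 0, 1]]

# Joint score per class: prior times the add-one smoothed word probabilities.
joint = [
    priors[c] * math.prod((k + 1) / (n_tokens[c] + V) for k in counts[c])
    for c in range(2)
]
posterior_class0 = joint[0] / sum(joint)
print(round(posterior_class0, 6))  # 0.184151
```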

In [13]:
"""Testing that the class implementation returns the correct results for the binary multinomial version"""
mnb = NB(mode=Mode.BINARY_MULTINOMIAL)
mnb.fit(docs)
assert round(mnb.predict_proba('the film was predictable with no fun')[0], 6) == 0.221239
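
A hand check for the binary variant, under the same hand-counted assumptions as the multinomial case: the only training document with a repeated word is "the most fun film of the summer", so deduplication removes one "the" from class 1 and leaves class 0 untouched.

```python
import math

V = 20                   # vocabulary size is unchanged by deduplication
n_tokens = [14, 8]       # class 1 drops from 9 to 8 tokens (assumption: hand count)
priors = [3 / 5, 2 / 5]

# Deduplicated counts of: the, film, predictable, no, fun.
# Only the class 1 count of "the" changes, from 2 to 1.
counts = [[0, 0, 1, 1, 0], [1, 1, 0, 0, 1]]

joint = [
    priors[c] * math.prod((k + 1) / (n_tokens[c] + V) for k in counts[c])
    for c in range(2)
]
posterior_class0 = joint[0] / sum(joint)
print(round(posterior_class0, 6))  # 0.221239
```

Class 0 becomes somewhat more likely than in the multinomial case because class 1 loses part of its evidence for "the".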

In [14]:
"""Testing that the class implementation returns the correct results for the Bernoulli version"""
mnb = NB(mode=Mode.BERNOULLI)
mnb.fit(docs)
assert round(mnb.predict_proba('the film was predictable with no fun')[0], 6) == 0.121536
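
The Bernoulli value can also be reproduced by hand. The sketch below assumes document frequencies counted manually from the training data and grouped by whether the word appears in the test document. All names are illustrative only.

```python
import math

priors = [3 / 5, 2 / 5]
n_docs = [3, 2]

# Document frequencies of the 5 vocabulary words present in the test document
# (the, film, predictable, no, fun), per class.
present_df = [[0, 0, 1, 1, 0], [1, 1, 0, 0, 1]]
# Document frequencies of the 15 vocabulary words absent from the test document.
# Class 0: ten words in 1 doc, "and" in 2 docs, four words in 0 docs.
# Class 1: five words in 1 doc, ten words in 0 docs.
absent_df = [[1] * 10 + [2] + [0] * 4, [1] * 5 + [0] * 10]


def smoothed(df, n):
    """Add-one smoothed probability that a word appears in a document."""
    return (df + 1) / (n + 2)


joint = [
    priors[c]
    * math.prod(smoothed(df, n_docs[c]) for df in present_df[c])
    * math.prod(1 - smoothed(df, n_docs[c]) for df in absent_df[c])
    for c in range(2)
]
posterior_class0 = joint[0] / sum(joint)
print(round(posterior_class0, 6))  # 0.121536
```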