## Bernoulli Naive Bayes Theory

Naive Bayes is a probabilistic classifier that relies on supervised learning. Similar to logistic regression, it aims to assign new observations, represented as feature vectors $\mathbf{x} \in \mathbb{R}^n$, to one of a set of discrete classes. In the case of binary classification, the output is a scalar value $y \in \{0, 1\}$.

A key distinction between logistic regression and Naive Bayes lies in their modeling approaches: logistic regression is a **discriminative** classifier, while Naive Bayes is a **generative** classifier. These two paradigms represent different frameworks for building machine learning models.

To illustrate this difference, consider the task of distinguishing images of dogs from images of cats:

- A **generative model** attempts to learn the underlying distribution of each class. In principle, such a model could generate a sample (e.g., draw a dog). Given a new image, the model evaluates which class distribution—dog or cat—better explains the observation, and assigns the corresponding label.
- A **discriminative model**, on the other hand, focuses only on finding boundaries between classes. For instance, if all dogs in the dataset wear collars and none of the cats do, the model may rely solely on this feature to separate the classes. In this case, the model has little to say about the actual characteristics of cats or dogs beyond this distinction.

There are three common variants of Naive Bayes classifiers, which differ in how they represent features:

- **Bernoulli Naive Bayes**: works with binary features, modeling the presence or absence of a feature.
- **Multinomial Naive Bayes**: models features as frequency counts (e.g., word counts in text classification).
- **Gaussian Naive Bayes**: assumes continuous features that follow a Gaussian distribution.

In this notebook, we focus on the **Bernoulli Naive Bayes classifier**. While the specific feature representations differ among the variants, the fundamental concepts of Naive Bayes apply to all of them.

### Bayes' Theorem

The Naive Bayes classifier derives its name from **Bayes' Theorem**, combined with a simplifying (or "naive") assumption about feature independence. Before applying it to classification, let us first recall Bayes' Theorem, a fundamental result in probability theory:

$$
P(H|E) = \frac{P(E|H)P(H)}{P(E)}
$$

Here:

- $H$ represents a hypothesis, and $E$ represents the observed evidence.
- $P(H|E)$ is the **posterior probability** of the hypothesis $H$ given the evidence $E$.
- $P(E|H)$ is the **likelihood** of $H$ given a fixed $E$, i.e., the probability of observing $E$ given that $H$ is true.
- $P(H)$ is the **prior probability** of the hypothesis $H$, independent of any evidence.
- $P(E)$ is the **marginal probability** of the evidence $E$.

The essential idea is that evidence does not determine beliefs in isolation. Instead, it updates prior beliefs. In other words, to compute the posterior probability $P(H|E)$, both the prior $P(H)$ and the likelihood $P(E|H)$ must be taken into account.
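As a small numerical illustration of this update, the sketch below applies Bayes' Theorem to a binary hypothesis. The function name `posterior` and all numbers are illustrative assumptions, not values taken from any dataset.

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """Return P(H|E) via Bayes' Theorem for a binary hypothesis H."""
    # Marginal P(E) by the law of total probability
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1.0 - prior_h)
    return p_e_given_h * prior_h / p_e

# Hypothetical numbers: P(H) = 0.2, P(E|H) = 0.6, P(E|not H) = 0.05
p_h_given_e = posterior(prior_h=0.2, p_e_given_h=0.6, p_e_given_not_h=0.05)
print(round(p_h_given_e, 3))  # 0.75
```

Even though the evidence is twelve times more likely under $H$ than under its negation, the low prior keeps the posterior at 0.75 rather than close to 1.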

### Model

To understand the Naive Bayes classifier, let us consider the classic **spam vs. ham** example. Given an email, the task is to classify it as spam or not spam (ham).

The first step is to represent an email as a feature vector $\mathbf{x}$. We construct an $n$-dimensional dictionary of the most frequent words in the dataset. In **Bernoulli Naive Bayes**, the features are binary, so each entry in $\mathbf{x} \in \{0, 1\}^n$ indicates the **presence** ($1$) or **absence** ($0$) of a particular word in the email.
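The mapping from an email to a binary feature vector can be sketched as follows. The helper `to_binary_vector`, the five-word dictionary, and the message are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
import numpy as np

def to_binary_vector(message, vocabulary):
    """Map a message to x in {0,1}^n: x[j] = 1 if vocabulary[j] occurs in it."""
    tokens = set(message.lower().split())
    return np.array([1 if word in tokens else 0 for word in vocabulary])

# Hypothetical 5-word dictionary and message
vocabulary = ["free", "win", "meeting", "prize", "tomorrow"]
x = to_binary_vector("Win a FREE prize now win win", vocabulary)
print(x)  # [1 1 0 1 0]
```

Note that "win" occurs three times but still contributes a single `1`: the Bernoulli representation records presence only, not frequency.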

The classifier assigns the email to the class $\hat{y}$ with the maximum posterior probability given the feature vector:

$$
\hat{y} = \underset{y}{\arg\max} \; p(y|\mathbf{x})
$$

Using Bayes’ Theorem, we can rewrite the posterior probability as:

$$
p(y|\mathbf{x}) = \frac{p(\mathbf{x}|y)p(y)}{p(\mathbf{x})}
$$

Thus, to compute $p(y|\mathbf{x})$, we need to model the **likelihood** $p(\mathbf{x}|y)$ and the **prior probability** $p(y)$. Since the marginal probability $p(\mathbf{x})$ does not depend on $y$ and remains constant across classes, it can be ignored in this case.

Modeling $p(\mathbf{x}|y)$ directly would require on the order of $2^n$ parameters per class, since a binary vector $\mathbf{x}$ of length $n$ can take $2^n$ distinct values. Applying the chain rule, the likelihood can be written as:

$$
p(\mathbf{x}|y) = p(x_1, x_2, \dots, x_n|y) = p(x_1|x_2, \dots, x_n, y) \, p(x_2|x_3, \dots, x_n, y) \dots p(x_n|y)
$$

For large $n$, the number of parameters grows exponentially and estimation becomes intractable. Naive Bayes resolves this by making the **conditional independence assumption**: the features are assumed to be independent given $y$. Under this assumption:

$$
p(\mathbf{x}|y) = p(x_1, x_2, \dots, x_n|y) = \prod_{j=1}^n p(x_j|y)
$$

In other words, once the class $y$ is known, the probability of a word appearing in the email is assumed to be independent of the other words. While this assumption rarely holds exactly (words in real emails are correlated), it is often accurate enough to yield effective predictions.

Combining these components, the classification rule becomes:

$$
\hat{y} = \underset{y}{\arg\max} \; \prod_{j=1}^n p(x_j|y) \, p(y)
$$

To avoid numerical underflow and to simplify computations, it is common to perform the calculation in log space:

$$
\hat{y} = \underset{y}{\arg\max} \; \sum_{j=1}^n \log p(x_j|y) + \log p(y)
$$
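The underflow issue is easy to reproduce. The sketch below uses a hypothetical setting of 2000 features, each contributing a factor of 0.5, and compares the direct product with the sum of logs.

```python
import numpy as np

# Hypothetical: 2000 features, each with per-feature probability 0.5
probs = np.full(2000, 0.5)

direct_product = np.prod(probs)   # 0.5**2000 is below the float64 range
log_sum = np.sum(np.log(probs))   # 2000 * log(0.5), easily representable

print(direct_product)  # 0.0
print(log_sum)         # about -1386.29
```

Once both class scores underflow to `0.0` they can no longer be compared, whereas their logarithms remain well separated.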

This formulation highlights the distinction between **generative** and **discriminative** models. A generative model like Naive Bayes makes use of the likelihood term, expressing how the features of an email are generated given its class (spam or ham). A discriminative model, in contrast, focuses directly on estimating $p(y|\mathbf{x})$, potentially by assigning weights to features that maximize classification accuracy without modeling how the data is generated.

#### Estimation

We now turn to the estimation of parameters for the Bernoulli Naive Bayes model. The key parameters are defined as follows:

$$
\begin{align*}
\phi_{(j\mid y=1)} &= p(x_j=1\mid y=1) \\ 
\phi_{(j\mid y=0)} &= p(x_j=1\mid y=0) \\ 
\phi_{(y)} &= p(y=1)
\end{align*}
$$

These parameters can be interpreted as:

- $\phi_{(j\mid y=1)}$: the probability that word $j$ appears in an email, given the email is spam.  
- $\phi_{(j\mid y=0)}$: the probability that word $j$ appears in an email, given the email is not spam.  
- $\phi_{(y)}$: the prior probability that a randomly chosen email is spam.  

To estimate these parameters, we maximize the **joint likelihood**:

$$
\mathcal{L}(\phi_{(y)}, \phi_{(j\mid y)}) = \prod_{i=1}^{m} p(\mathbf{x}^{(i)}, y^{(i)}; \phi_{(y)}, \phi_{(j\mid y)}).
$$

The standard approach is **maximum likelihood estimation (MLE)**. Before deriving the results formally, we can already anticipate the following intuitive estimates:

$$
\begin{align*}
\phi_{(y)} &= \frac{\sum_{i=1}^{m} 1\{y^{(i)}=1\}}{m} \\[6pt]
\phi_{(j\mid y=1)} &= \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1,\, y^{(i)}=1\}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}} \\[6pt]
\phi_{(j\mid y=0)} &= \frac{\sum_{i=1}^{m} 1\{x_j^{(i)}=1,\, y^{(i)}=0\}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}}
\end{align*}
$$

These formulas have straightforward interpretations:

- $\phi_{(y)}$: the fraction of spam emails in the dataset.  
- $\phi_{(j\mid y=1)}$: among all spam emails, the fraction containing word $j$.  
- $\phi_{(j\mid y=0)}$: among all non-spam emails, the fraction containing word $j$.  
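These fractions can be computed directly by counting. The following sketch uses a made-up toy dataset of six emails and three dictionary words; the variable names are illustrative.

```python
import numpy as np

# Hypothetical toy data: rows are emails, columns are dictionary words
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam, 0 = not spam

phi_y = np.mean(y == 1)                  # fraction of spam emails
phi_j_given_1 = X[y == 1].mean(axis=0)   # per-word fraction among spam
phi_j_given_0 = X[y == 0].mean(axis=0)   # per-word fraction among non-spam

print(phi_y)           # 0.5
print(phi_j_given_1)   # 2/3, 1/3, 2/3
print(phi_j_given_0)   # 1/3, 1/3, 1/3
```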

##### Derivation

The likelihood can be factored as:

$$
\mathcal{L}(\phi_{(y)}, \phi_{(j\mid y)}) = \prod_{i=1}^{m} p(y^{(i)};\phi_{(y)}) \, p(\mathbf{x}^{(i)} \mid y^{(i)};\phi_{(j\mid y)}).
$$

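For $y^{(i)} \in \{0,1\}$, the distribution of $y$ is Bernoulli. From here on, we use the shorthand $\phi_y$ for $\phi_{(y)}$, and $\phi_{j|1}$, $\phi_{j|0}$ for $\phi_{(j\mid y=1)}$, $\phi_{(j\mid y=0)}$: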

$$
p(y^{(i)};\phi_y)=\phi_y^{y^{(i)}}(1-\phi_y)^{1-y^{(i)}}.
$$

For a single feature $x_j^{(i)}$:

$$
p(x_j^{(i)}\mid y^{(i)};\phi_{j\mid y})=
\begin{cases}
\phi_{j|1}^{\,x_j^{(i)}}(1-\phi_{j|1})^{1-x_j^{(i)}}, & \text{if }y^{(i)}=1,\\[6pt]
\phi_{j|0}^{\,x_j^{(i)}}(1-\phi_{j|0})^{1-x_j^{(i)}}, & \text{if }y^{(i)}=0.
\end{cases}
$$

This can be written more compactly using indicator functions, which activate the left or the right part of the equation depending on the current $y^{(i)}$:

$$
p(x_j^{(i)}\mid y^{(i)}) =
\big(\phi_{j|1}^{\,x_j^{(i)}}(1-\phi_{j|1})^{1-x_j^{(i)}}\big)^{1\{y^{(i)}=1\}}
\big(\phi_{j|0}^{\,x_j^{(i)}}(1-\phi_{j|0})^{1-x_j^{(i)}}\big)^{1\{y^{(i)}=0\}}.
$$

The full joint likelihood, writing $\Phi$ for the collection of all parameters, becomes:

$$
\mathcal{L}(\Phi)
=\prod_{i=1}^m \Big[
\phi_y^{y^{(i)}}(1-\phi_y)^{1-y^{(i)}}
\prod_{j=1}^n 
\big(\phi_{j|1}^{\,x_j^{(i)}}(1-\phi_{j|1})^{1-x_j^{(i)}}\big)^{1\{y^{(i)}=1\}}
\big(\phi_{j|0}^{\,x_j^{(i)}}(1-\phi_{j|0})^{1-x_j^{(i)}}\big)^{1\{y^{(i)}=0\}}
\Big].
$$

Taking the log yields the log-likelihood, which, with a slight abuse of notation, we continue to denote by $\mathcal{L}$:

$$
\mathcal{L}(\Phi)
= \sum_{i=1}^m \Big[ y^{(i)}\log\phi_y + (1-y^{(i)})\log(1-\phi_y)\Big]
+ \sum_{j=1}^n \sum_{i=1}^m \Bigg\{
\begin{aligned}
&1\{y^{(i)}=1\}\big[x_j^{(i)}\log\phi_{j|1} + (1-x_j^{(i)})\log(1-\phi_{j|1})\big] \\
&+\,1\{y^{(i)}=0\}\big[x_j^{(i)}\log\phi_{j|0} + (1-x_j^{(i)})\log(1-\phi_{j|0})\big]
\end{aligned}
\Bigg\}.
$$

##### Useful counts

Define:

$$
N_1 \;=\; \sum_{i=1}^m 1\{y^{(i)}=1\},\qquad N_0 = m-N_1 = \sum_{i=1}^m 1\{y^{(i)}=0\}.
$$

$$
S_{j,1} \;=\; \sum_{i=1}^m 1\{y^{(i)}=1\}\,x_j^{(i)} \;=\; \sum_{i=1}^m 1\{x_j^{(i)}=1,\;y^{(i)}=1\},
$$

$$
S_{j,0} \;=\; \sum_{i=1}^m 1\{y^{(i)}=0\}\,x_j^{(i)} \;=\; \sum_{i=1}^m 1\{x_j^{(i)}=1,\;y^{(i)}=0\}.
$$

With these, the log-likelihood simplifies into separate parts:

- For $\phi_y$:

$$
\mathcal{L}(\phi_y) = N_1\log\phi_y + (m-N_1)\log(1-\phi_y).
$$

- For $\phi_{j|1}$:

$$
\mathcal{L}(\phi_{j|1}) = S_{j,1}\log\phi_{j|1} + (N_1-S_{j,1})\log(1-\phi_{j|1}).
$$

- For $\phi_{j|0}$:

$$
\mathcal{L}(\phi_{j|0}) = S_{j,0}\log\phi_{j|0} + (N_0-S_{j,0})\log(1-\phi_{j|0}).
$$


##### Maximization

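We differentiate each part with respect to its parameter and set the derivative to zero. The case of $\phi_y$ is worked out in detail; note that the chain rule applied to $\log(1-\phi_y)$ contributes a factor of $-1$, which flips the sign of the second term: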

$$
\frac{d\mathcal{L}}{d\phi_y} = N_1 \cdot \frac{1}{\phi_y} + (m-N_1) \cdot \frac{1}{1-\phi_y} \cdot (-1) = \frac{N_1}{\phi_y} - \frac{m-N_1}{1-\phi_y}=0
$$

$$
\begin{align*}
\frac{N_1}{\phi_y} &= \frac{m-N_1}{1-\phi_y} \\
N_1(1-\phi_y) &= (m-N_1)\phi_y \\
N_1 - N_1\phi_y &= m\phi_y - N_1\phi_y \\
N_1 &= m\phi_y \\
\phi_y &= \frac{N_1}{m} = \frac{\sum_{i=1}^m 1\{y^{(i)}=1\}}{m}.
\end{align*} 
$$

For $\phi_{j|1}$:

$$
\frac{d\mathcal{L}}{d\phi_{j|1}} = \frac{S_{j,1}}{\phi_{j|1}} - \frac{N_1-S_{j,1}}{1-\phi_{j|1}} = 0
\quad\Rightarrow\quad \phi_{j|1}=\frac{S_{j,1}}{N_1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1, y^{(i)}=1\}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}.
$$

For $\phi_{j|0}$:

$$
\frac{d\mathcal{L}}{d\phi_{j|0}} = \frac{S_{j,0}}{\phi_{j|0}} - \frac{N_0-S_{j,0}}{1-\phi_{j|0}} = 0
\quad\Rightarrow\quad \phi_{j|0}=\frac{S_{j,0}}{N_0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1, y^{(i)}=0\}}{\sum_{i=1}^m 1\{y^{(i)}=0\}}.
$$

In summary, the maximum likelihood estimates of the parameters coincide with the intuitive fractions derived earlier. They represent relative frequencies of word occurrences within spam and non-spam classes, along with the overall proportion of spam in the dataset.
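The closed-form result for the prior can also be checked numerically. The sketch below evaluates the part of the log-likelihood that depends on the prior parameter on a grid of candidate values and compares the maximizer with the fraction of spam emails. The counts (30 spam emails out of 100) and the grid resolution are arbitrary assumptions.

```python
import numpy as np

m, N1 = 100, 30  # hypothetical counts: 30 spam emails out of 100

def log_likelihood(phi):
    """Part of the log-likelihood that depends on the prior parameter."""
    return N1 * np.log(phi) + (m - N1) * np.log(1 - phi)

grid = np.linspace(0.001, 0.999, 999)   # candidate values in steps of 0.001
phi_best = grid[np.argmax(log_likelihood(grid))]

print(phi_best, N1 / m)  # both approximately 0.3
```

The same check works for the per-word parameters by substituting the corresponding word and class counts.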

#### Laplace Smoothing

Up to this point, the parameter estimation procedure for Bernoulli Naive Bayes is almost complete, but there is a crucial issue to address.  

Consider the case where a particular word never appears in emails of one class (e.g., spam). This would cause the numerator of the estimate to be $0$, resulting in the parameter estimate for that feature also being $0$. Consequently, the likelihood (and therefore the posterior) would be $0$ as well. In practice, this means that if a word like *“chicken”* never appears in spam emails during training, then any email containing *“chicken”* would automatically be classified as non-spam, regardless of the other features.  

Beyond corrupting predictions, this also introduces a computational issue: taking the logarithm of $0$ is undefined, which can cause errors during implementation.  

From a logical standpoint, the problem arises because the absence of an event in the training set does not imply that the event has zero probability of occurring in the future. This is known as the **zero-frequency problem**.  

A standard solution is to apply a smoothing technique. For Bernoulli Naive Bayes (and categorical Naive Bayes in general), **Laplace smoothing** is commonly used. The adjusted estimate is:

$$
\phi_{(j \mid y=1)} = \frac{\sum_{i=1}^{m} 1\{x_{j}^{(i)}=1, y^{(i)}=1\} + l}{\sum_{i=1}^{m} 1\{y^{(i)}=1\} + lK}
$$

Here:
- $l$ is the smoothing parameter,  
- $K$ is the number of distinct values a feature can take (not the number of label classes).  

If $l = 0$, we recover the maximum likelihood estimator (MLE). Setting $l = 1$ corresponds to Laplace smoothing.  

The same adjustment is applied to $\phi_{(j \mid y=0)}$.  
In the case of Bernoulli Naive Bayes with binary features, this means adding $1$ to the numerator and $2$ to the denominator.

##### Example of Laplace Smoothing

Suppose we are estimating the probability that the word *“chicken”* appears in a spam email.  

- In the training set, we have $N_1 = 100$ emails labeled as spam.  
- The word *“chicken”* never appears in any of these spam emails.  

Without smoothing, the maximum likelihood estimate would be:

$$
\phi_{(\text{chicken} \mid y=1)} = \frac{0}{100} = 0
$$

This would lead to the zero-frequency problem described earlier.

Now let us apply **Laplace smoothing** with $l=1$ and $K=2$ (since the feature is binary: present or absent). The formula becomes:

$$
\phi_{(\text{chicken} \mid y=1)} = \frac{0 + 1}{100 + 2} = \frac{1}{102} \approx 0.0098
$$

So instead of assigning zero probability, we assign a small but nonzero probability.  

This adjustment ensures that the model does not automatically rule out spam emails containing *“chicken”*, even though the training data did not include such examples.
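The calculation above can be wrapped in a small helper. The function name `smoothed_estimate` is hypothetical; its defaults correspond to Laplace smoothing of a binary feature.

```python
def smoothed_estimate(count, total, l=1, K=2):
    """Smoothed estimate (count + l) / (total + l * K)."""
    return (count + l) / (total + l * K)

print(smoothed_estimate(0, 100, l=0))   # 0.0, the unsmoothed MLE
print(smoothed_estimate(0, 100))        # 1/102, about 0.0098
print(smoothed_estimate(100, 100))      # 101/102, about 0.9902
```

The last call shows the symmetric case: a word that appears in every spam email no longer receives probability exactly $1$, so $\log(1-\phi)$ also stays finite.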

#### Training

The parameter estimation procedures discussed earlier are carried out during the **training phase** using the training dataset.  

This process differs significantly from models such as logistic regression. In logistic regression, training involves iterative optimization methods (e.g., gradient descent) to fit the parameters. By contrast, the Naive Bayes algorithm estimates each parameter independently through straightforward counting procedures.  

This non-iterative process makes Naive Bayes computationally very efficient, which is one of its key advantages.

#### Prediction

The goal of the Naive Bayes classifier in the prediction phase is to assign a new observation to one of the possible classes. In the spam detection example, this means classifying an email as either spam or non-spam.  

The decision rule is:

$$
\hat{y} = \underset{y}{\arg\max} \; \sum_{j=1}^n \log p(x_j \mid y) + \log p(y)
$$

During prediction, we replace the probabilities with the parameter estimates obtained in the training phase:

$$
\hat{y} = \underset{y}{\arg\max} \; \sum_{j=1}^n \Bigl(x_j \log \phi_{j \mid y} + (1-x_j)\log(1-\phi_{j \mid y})\Bigr) + \log \phi_y
$$

Breaking this down, we compute the **log-scores** for each class:

$$
\begin{align*}
s_1 &= \sum_{j=1}^n \Bigl(x_j\log\phi_{j \mid 1} + (1-x_j)\log(1-\phi_{j \mid 1})\Bigr) + \log \phi_y \\
s_0 &= \sum_{j=1}^n \Bigl(x_j\log\phi_{j \mid 0} + (1-x_j)\log(1-\phi_{j \mid 0})\Bigr) + \log (1 - \phi_y)
\end{align*}
$$

The predicted label $\hat{y}$ is the class corresponding to the larger score.  

If we want to recover the actual **posterior probabilities** that the email is spam or non-spam, we must normalize the scores:

$$
\begin{align*}
p(y=1 \mid \mathbf{x}) &= \frac{e^{s_1}}{e^{s_0} + e^{s_1}} = \frac{1}{1 + e^{s_0 - s_1}}\\
p(y=0 \mid \mathbf{x}) &= 1 - p(y=1 \mid \mathbf{x})
\end{align*}
$$

This normalization step is equivalent to reintroducing $p(\mathbf{x})$ into Bayes’ theorem.
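In practice the log-scores are large negative numbers, so exponentiating them directly underflows. A minimal sketch of a numerically safer normalization is given below; it subtracts the larger score before exponentiating, which leaves the ratio unchanged. The function name and the example scores are hypothetical.

```python
import numpy as np

def posterior_spam(s1, s0):
    """p(y=1|x) from log-scores, computed as e^{s1} / (e^{s0} + e^{s1})."""
    # Subtract the larger score first; the common factor cancels in the ratio
    s_max = max(s1, s0)
    e1, e0 = np.exp(s1 - s_max), np.exp(s0 - s_max)
    return e1 / (e0 + e1)

# Hypothetical log-scores; np.exp(-1050.0) on its own underflows to 0.0
p1 = posterior_spam(s1=-1050.0, s0=-1052.0)
print(p1)       # about 0.881
print(1 - p1)   # about 0.119
```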

### Evaluation Metrics

As with most classification problems, we evaluate the performance of the Naive Bayes classifier using the standard set of metrics, which were discussed in more detail in the previous notebooks: **Accuracy**, **Precision**, **Recall**, **F1 Score**.
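As a brief reminder of how these metrics are obtained, the sketch below computes them from true and predicted labels for the positive class (spam, label `1`). The helper name and the example labels are made up, and the function assumes at least one positive label and one positive prediction.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp)   # assumes at least one positive prediction
    recall = tp / (tp + fn)      # assumes at least one positive label
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)  # 0.75, then 2/3 three times
```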

## Bernoulli Naive Bayes Implementation from Scratch

In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import re
from spellchecker import SpellChecker

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

We begin by loading the SMS Spam Collection dataset, which contains messages labeled as either spam or ham (non-spam). This dataset serves as a practical example for the Bernoulli Naive Bayes classifier we discussed in the theory section. Each message is represented as a text string, and the corresponding label indicates whether it is spam or not.

In [None]:
data_path = "../datasets/sms_spam/SMSSpamCollection.txt"
spam = pd.read_csv(data_path,delimiter="\t",header=None)
spam.columns = ['class', 'message']
spam.head()

This section performs preprocessing on the raw SMS messages to prepare them for modeling. The steps include lowercasing, replacing emails, URLs, phone numbers, currency symbols, and numbers with placeholder tokens, removing punctuation, tokenizing, correcting spelling errors, and stemming words while removing stop words.  

In this context, **stemming** means reducing words to their root form (e.g., "running" becomes "run") to consolidate similar word forms. Both a stemmed and a non-stemmed version of each message are stored.  

**Notes:**  
- The `nltk.download('stopwords')` command only needs to be run once per system.  
- The entire preprocessing step can be skipped if you prefer to directly use the previously cleaned dataset saved as a CSV file.

In [None]:
nltk.download('stopwords')

In [None]:
# Build the spell checker once; constructing it for every message is slow
spell = SpellChecker()

def spell_corrected(msg):
    """Return the token list with likely misspellings replaced by their best correction."""
    corrected_words = {}
    # Look up a correction only for tokens that are not in the dictionary
    for word in spell.unknown(msg):
        correct_spell = spell.correction(word)  # may be None if no candidate is found
        if correct_spell and correct_spell != word:
            corrected_words[word] = correct_spell

    # Keep tokens without a correction unchanged
    return [corrected_words.get(word, word) for word in msg]

In [None]:
corpus_stemmed = []
corpus_not_stemmed = []

ps = PorterStemmer()
# Build the stop-word set once rather than once per token
stop_words = set(stopwords.words('english'))

for msg in spam['message']:
    # Lowercase, then replace special patterns with placeholder tokens
    msg = msg.lower()
    # Email addresses -> 'emailaddr'
    msg = re.sub(r'[\w\-.]+?@\w+?\.\w{2,4}', ' emailaddr ', msg)
    # URLs -> 'httpaddr'
    msg = re.sub(r'(http[s]?\S+)|(\w+\.[a-z]{2,4}\S*)', 'httpaddr', msg)
    # Money symbols -> 'moneysymb'
    msg = re.sub(r'£|\$', ' moneysymb ', msg)
    # Phone numbers -> 'phonenumber'
    msg = re.sub(r' [0-9]{4}(-)?[0-9]{3}(-)?[0-9]{4} ', ' phonenumber ', msg)
    # Remaining numbers -> 'number'
    msg = re.sub(r'\d+(\.\d+)?', ' number ', msg)
    # Expand the abbreviation 'u' to 'you'
    msg = re.sub(r' u ', ' you ', msg)

    # Remove all punctuation
    msg = re.sub(r'[^\w\d\s]', ' ', msg)
    # Tokenize on whitespace and correct spelling
    msg = msg.split()
    msg = spell_corrected(msg)
    corpus_not_stemmed.append(' '.join(msg))
    # Drop stop words, then stem the remaining tokens
    msg = [ps.stem(word) for word in msg if word not in stop_words]
    corpus_stemmed.append(' '.join(msg))

In [None]:
spam['stemmed'] = corpus_stemmed
spam['not_stemmed'] = corpus_not_stemmed

In [None]:
spam.to_csv("../datasets/sms_spam/sms_spam_cleaned.csv",sep=";")

Here we load the previously cleaned SMS spam dataset from the CSV file. After inspecting the first few rows and general dataset information, any missing entries in the stemmed or non-stemmed message columns are replaced with empty strings. Finally, the distribution of classes is printed to examine the proportion of spam versus non-spam messages in the dataset.

In [8]:
spam = pd.read_csv("../datasets/sms_spam/sms_spam_cleaned.csv", delimiter=";", index_col=0)
spam.head()

| | class | message | stemmed | not_stemmed |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | go wrong point crazi avail bug n great world l... | go until wrong point crazy available only in b... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar joke | ok lar joking if you on |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | free entri number wili come win fa cup final t... | free entry in number a wily come to win fa cup... |
| 3 | ham | U dun say so early hor... U c already then say... | u dun say earli c alreadi say | u dun say so early for you c already then say |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | nah think goe us live around though | nah i don t think he goes to us he lives aroun... |


In [9]:
spam.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5572 entries, 0 to 5571
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   class        5572 non-null   object
 1   message      5572 non-null   object
 2   stemmed      5562 non-null   object
 3   not_stemmed  5570 non-null   object
dtypes: object(4)
memory usage: 217.7+ KB


In [10]:
spam['stemmed'] = spam['stemmed'].fillna("")
spam['not_stemmed'] = spam['not_stemmed'].fillna("")

In [11]:
spam['class'].value_counts()

class
ham     4825
spam     747
Name: count, dtype: int64

In this step, we convert the preprocessed text messages into a numerical feature representation using `CountVectorizer`. Each unique word in the dataset becomes a feature (column), and each message is represented as a binary vector indicating the **presence (1) or absence (0)** of each word. This binary representation is suitable for the Bernoulli Naive Bayes classifier, which models features as being either present or absent.  

After the transformation, the feature matrix `X` has shape `(5572, 4905)`, indicating there are 5,572 messages and 4,905 unique words in the dataset.

In [12]:
cv = CountVectorizer(binary=True)
X = cv.fit_transform(spam['stemmed']).toarray()
feat = cv.get_feature_names_out()

In [13]:
X.shape

(5572, 4905)

Here, the class labels are converted into a binary numerical format suitable for modeling. Messages labeled as **ham** (non-spam) are assigned a value of `0`, while **spam** messages are assigned a value of `1`. This creates the target vector `y` for training the Bernoulli Naive Bayes classifier.

In [14]:
y = np.where(spam['class'] == 'ham', 0, 1)

In this section, the dataset is first **shuffled** to randomize the order of the messages, which helps ensure that the training and test sets are representative and not biased by any ordering in the original data.  

Next, the dataset is **split** into training and test sets according to a specified ratio (`split_ratio = 0.75`). This means 75% of the data is used for training the Bernoulli Naive Bayes model, and the remaining 25% is reserved for evaluating its performance.

In [15]:
def shuffle_data(X, y):
    shuffle_indices = np.random.permutation(len(X))
    X, y = X[shuffle_indices], y[shuffle_indices]

    return X, y

In [16]:
X, y = shuffle_data(X, y)

In [17]:
def split_dataset(X, y, split_ratio):
    split_size = int(len(X) * split_ratio)
    X_train = X[:split_size]
    y_train = y[:split_size]
    X_test = X[split_size:]
    y_test = y[split_size:]

    return X_train, y_train, X_test, y_test

In [18]:
split_ratio = 0.75

In [19]:
X_train, y_train, X_test, y_test = split_dataset(X, y, split_ratio)

This section implements a **Bernoulli Naive Bayes classifier from scratch**.  

- The `fit` method estimates the parameters from the training data:  
  - `likelihoods_1` and `likelihoods_0` store the probability of each word being present given that the message is spam (`1`) or ham (`0`), applying Laplace smoothing.  
  - `prior` stores the overall probability of a message being spam in the training set.  

- The `predict` method computes the **log-joint likelihoods** for each class, adds the log prior, and assigns each message to the class with the higher score.  

Finally, the model is instantiated and fitted to the training data.

In [20]:
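class BernoulliNaiveBayes:
    def fit(self, X, y):
        # Laplace smoothing with l = 1 and K = 2, the number of values a binary feature can take
        n_feature_values = 2
        n_spam = np.sum(y)
        n_ham = len(y) - n_spam
        # Smoothed estimates of p(x_j = 1 | y) for every word j
        self.likelihoods_1 = (np.sum(X[y == 1], axis=0) + 1) / (n_spam + n_feature_values)
        self.likelihoods_0 = (np.sum(X[y == 0], axis=0) + 1) / (n_ham + n_feature_values)
        # Prior p(y = 1): fraction of spam messages in the training set
        self.prior = n_spam / len(y)

    def predict(self, X):
        # Per-message log-likelihood under each class: sum over words of log p(x_j | y)
        joint_log_likelihood_1 = np.sum(X * np.log(self.likelihoods_1) + (1 - X) * np.log(1 - self.likelihoods_1), axis=1)
        joint_log_likelihood_0 = np.sum(X * np.log(self.likelihoods_0) + (1 - X) * np.log(1 - self.likelihoods_0), axis=1)
        # Adding the log prior gives the unnormalized log posterior
        log_posterior_1 = joint_log_likelihood_1 + np.log(self.prior)
        log_posterior_0 = joint_log_likelihood_0 + np.log(1 - self.prior)
        # Predict spam (1) where its score is strictly larger; ties go to ham (0)
        y_pred = (log_posterior_1 > log_posterior_0).astype(int)

        return y_pred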

In [21]:
model = BernoulliNaiveBayes()
model.fit(X_train, y_train)

In this step, the fitted Bernoulli Naive Bayes model is used to **predict the class labels** of the test set. The predictions are then evaluated using standard classification metrics: precision, recall, F1-score, and accuracy.

In [22]:
y_pred = model.predict(X_test)

In [23]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1205
           1       0.97      0.89      0.93       188

    accuracy                           0.98      1393
   macro avg       0.98      0.94      0.96      1393
weighted avg       0.98      0.98      0.98      1393



## Conclusion

In this notebook, we implemented a Bernoulli Naive Bayes classifier from scratch and applied it to a simplified SMS spam detection example. While the model achieved high accuracy, this serves mainly as a demonstration; in a real-world scenario, additional exploratory data analysis, feature engineering, and model tuning would likely be necessary to achieve optimal performance.

## References  

Jurafsky, D., & Martin, J. H. (2020). *Speech and language processing* (Draft of January 20, 2020, Chapter 4: Naive Bayes and Sentiment Classification). Retrieved from https://web.stanford.edu/~jurafsky/slp3/old_dec20/ed3book_dec302020.pdf

Weinberger, K. (2018). *Lecture note 05: Bayes Classifier and Naive Bayes*. Cornell University CS4780: Machine Learning for Intelligent Systems. Retrieved from https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote05.html  

Collins, M. (2012). *The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm*. Columbia University. Retrieved from https://www.cs.columbia.edu/~mcollins/em.pdf  

Sanderson, G. (2019). *Bayes' theorem* [Video]. 3Blue1Brown. Retrieved from https://www.3blue1brown.com/lessons/bayes-theorem  

Ng, A. (2018). *Lecture 5 - GDA & Naive Bayes* [Video]. YouTube. Stanford CS229: Machine Learning. Retrieved from https://www.youtube.com/watch?v=nt63k3bfXS0&list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU&index=5

Ng, A. (2018). *Lecture 6 - Support Vector Machines* [Video]. YouTube. Stanford CS229: Machine Learning. Retrieved from https://www.youtube.com/watch?v=lDwow4aOrtg&list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU&index=6  
