# Naïve Bayes Classifier
This notebook builds a multinomial Naïve Bayes text classifier for spam detection.
Each code section is paired with the corresponding formula from the reference PDF.

## Imports
We begin by importing the necessary libraries.

In [13]:
import numpy as np # type: ignore
from collections import defaultdict
import re

## Dataset
We define the sample dataset, including emails and their labels.

In [14]:
emails = ["Win a free lottery now", "Buy money online cheap", 
          "You won free cash prize", "Meet me at 5 pm", "Hello, how are you?"]
labels = [1, 1, 1, 0, 0]  # 1 = Spam, 0 = Not Spam

## Tokenization
We define a function to tokenize the emails by extracting words.

In [15]:
def tokenize(text):
    """Convert text into a list of lowercase words."""
    return re.findall(r'\b\w+\b', text.lower())
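
As a quick check of the tokenizer, the sketch below redefines `tokenize` so that it runs on its own and applies it to two of the sample emails. The variable names `tokens_greeting` and `tokens_meeting` are illustrative only and do not appear elsewhere in the notebook.

```python
import re

def tokenize(text):
    """Convert text into a list of lowercase words."""
    return re.findall(r'\b\w+\b', text.lower())

# Punctuation is dropped and case is folded.
tokens_greeting = tokenize("Hello, how are you?")
# \w also matches digits, so "5" survives as a token.
tokens_meeting = tokenize("Meet me at 5 pm")

print(tokens_greeting)  # ['hello', 'how', 'are', 'you']
print(tokens_meeting)   # ['meet', 'me', 'at', '5', 'pm']
```

Because `\w` matches letters, digits, and underscores, numbers are kept as tokens while commas and question marks are discarded.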

## Building Vocabulary and Counting Words
We count occurrences of words in spam and non-spam emails.

Formula (Multinomial Naïve Bayes):

$$ P(x_i | c) = \frac{\text{count}(x_i, c) + \alpha}{\sum \text{count}(w, c) + \alpha V} $$

Where:


$ P(x_i | c) $ is the probability of word $x_i$ appearing in class $c$ (Spam or Not Spam).

$ \text{count}(x_i, c) $ is the number of times word $x_i$ appears in emails belonging to class $c$ (Spam or Not Spam).

$ \sum \text{count}(w, c) $ is the total count of all words $w$ in class $c$, summing over all words in that class.

$\alpha$ is the Laplace smoothing parameter (it prevents zero probabilities for words never seen in a class).

$V$ is the total number of unique words across both classes (vocabulary size).



## Mapping Formula Components to Code


\begin{array}{|c|c|}
\hline
\textbf{Formula Component} & \textbf{Code Representation} \\
\hline
P(x_i | c) & \texttt{word\_probs[label][word]} \text{ (computed later)} \\
\hline
\text{count}(x_i, c) & \texttt{word\_counts[label][word]} \\
\hline
\sum \text{count}(w, c) & \texttt{sum(word\_counts[label].values())} \\
\hline
V \text{ (vocabulary size)} & \texttt{len(vocab)} \\
\hline
\alpha \text{ (Laplace smoothing)} & \texttt{alpha = 1} \\
\hline
\end{array}



In [16]:
vocab = set()
word_counts = {0: defaultdict(int), 1: defaultdict(int)}
class_counts = {0: 0, 1: 0}

for email, label in zip(emails, labels):
    words = tokenize(email)
    class_counts[label] += 1
    for word in words:
        vocab.add(word)
        word_counts[label][word] += 1
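
To make the quantities in the formula concrete, here is a minimal, self-contained sketch that recomputes the per-class counts for the sample dataset. It uses `collections.Counter` instead of `defaultdict`, and the names `counts_by_class`, `tokens_per_class`, and `vocab_size` are assumptions made for this illustration, not names used in the notebook.

```python
import re
from collections import Counter

emails = ["Win a free lottery now", "Buy money online cheap",
          "You won free cash prize", "Meet me at 5 pm", "Hello, how are you?"]
labels = [1, 1, 1, 0, 0]  # 1 = Spam, 0 = Not Spam

def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

# One Counter per class; a Counter returns 0 for unseen words without inserting them.
counts_by_class = {0: Counter(), 1: Counter()}
for email, label in zip(emails, labels):
    counts_by_class[label].update(tokenize(email))

# Denominator term: total number of word tokens in each class.
tokens_per_class = {c: sum(counter.values()) for c, counter in counts_by_class.items()}
# V: number of distinct words across both classes.
vocab_size = len(set(counts_by_class[0]) | set(counts_by_class[1]))

print(tokens_per_class)  # {0: 9, 1: 14}
print(vocab_size)        # 21
print(counts_by_class[1]["free"], counts_by_class[0]["free"])  # 2 0
```

The word "free" occurs twice in spam and never in non-spam, which is exactly the case where smoothing matters later on.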

## Convert Vocabulary to List

In [17]:
vocab = list(vocab)
total_words = len(vocab)  # vocabulary size V (unique words), not the total token count

## Calculate Priors
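We compute the prior probability of each class as the fraction of training emails carrying that label. This is the $P(A)$ term in Bayes' theorem, with $A$ the class and $B$ the words of the email.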

Bayes' Theorem:

$$ P(A | B) = \frac{P(B | A) P(A)}{P(B)} $$

In [18]:
total_samples = len(emails)
P_spam = class_counts[1] / total_samples
P_not_spam = class_counts[0] / total_samples
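
The same priors can be computed for any number of classes directly from the label list. The following is a small sketch under that assumption; `label_counts`, `n_samples`, and `priors` are hypothetical names introduced here.

```python
from collections import Counter

labels = [1, 1, 1, 0, 0]  # 1 = Spam, 0 = Not Spam

# Count how many training emails carry each label.
label_counts = Counter(labels)
n_samples = len(labels)

# Prior P(c) = (emails labeled c) / (all emails).
priors = {c: n / n_samples for c, n in label_counts.items()}

print(priors)  # {1: 0.6, 0: 0.4}
```

With three spam emails out of five, the classifier starts out leaning toward Spam before it has looked at a single word.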

## Compute Likelihoods with Laplace Smoothing
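Without smoothing, a word that never appears in a class would get probability zero for that class and cancel out every other word in the email. To prevent this, we apply Laplace smoothing.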

Formula:

$$ P(x_i | c) = \frac{\text{count}(x_i, c) + \alpha}{\sum \text{count}(w, c) + \alpha V} $$

In [19]:
alpha = 1  # Laplace (add-one) smoothing
word_probs = {0: {}, 1: {}}

for label in (0, 1):
    # The denominator is the same for every word in a class, so compute it once.
    denom = sum(word_counts[label].values()) + alpha * total_words
    for word in vocab:
        word_probs[label][word] = (word_counts[label][word] + alpha) / denom
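
As a worked instance of the smoothing formula, the sketch below evaluates $P(\text{free} | c)$ for both classes using exact fractions, so the numerator and denominator stay visible. It rebuilds the counts so that it is self-contained; the helper `smoothed_likelihood` and the use of `fractions.Fraction` are choices made for this illustration only.

```python
import re
from collections import Counter
from fractions import Fraction

emails = ["Win a free lottery now", "Buy money online cheap",
          "You won free cash prize", "Meet me at 5 pm", "Hello, how are you?"]
labels = [1, 1, 1, 0, 0]  # 1 = Spam, 0 = Not Spam

def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

counts = {0: Counter(), 1: Counter()}
for email, label in zip(emails, labels):
    counts[label].update(tokenize(email))

vocab = set(counts[0]) | set(counts[1])
alpha = 1

def smoothed_likelihood(word, label):
    """P(word | label) with add-alpha smoothing, as an exact fraction."""
    total = sum(counts[label].values())
    return Fraction(counts[label][word] + alpha, total + alpha * len(vocab))

p_free_spam = smoothed_likelihood("free", 1)      # (2 + 1) / (14 + 21)
p_free_not_spam = smoothed_likelihood("free", 0)  # (0 + 1) / (9 + 21)
print(p_free_spam, p_free_not_spam)  # 3/35 1/30

# Sanity check: the smoothed probabilities of one class sum to 1 over the vocabulary.
total_mass = sum(smoothed_likelihood(w, 1) for w in vocab)
print(total_mass)  # 1
```

Even though "free" never occurs in a non-spam email, its smoothed probability there is 1/30 rather than zero.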

## Naïve Bayes Prediction Function
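We score each class in log space, which avoids numerical underflow from multiplying many small probabilities.

Formula:

$$ \log P(c | x_1, ..., x_n) = \log P(c) + \sum_{i=1}^{n} \log P(x_i | c) - \log P(x_1, ..., x_n) $$

The last term is the same for every class, so the code omits it and compares only $\log P(c) + \sum_{i=1}^{n} \log P(x_i | c)$.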

In [20]:
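def predict(email):
    """Return 1 (Spam) or 0 (Not Spam) for the given email text."""
    words = tokenize(email)
    # Start from the log priors.
    log_prob_spam = np.log(P_spam)
    log_prob_not_spam = np.log(P_not_spam)

    for word in words:
        # Words never seen in training are skipped.
        if word in vocab:
            log_prob_spam += np.log(word_probs[1][word])
            log_prob_not_spam += np.log(word_probs[0][word])

    # argmax over the two classes; ties go to Not Spam
    return 1 if log_prob_spam > log_prob_not_spam else 0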
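
The log scores only rank the classes. If an actual probability is wanted, they can be normalized afterwards. The sketch below is one way to do that with the log-sum-exp trick, using only the standard library. The functions `log_scores` and `posterior` are hypothetical helpers and not part of the notebook; like the notebook's prediction function, they skip words outside the vocabulary.

```python
import math
import re
from collections import Counter

emails = ["Win a free lottery now", "Buy money online cheap",
          "You won free cash prize", "Meet me at 5 pm", "Hello, how are you?"]
labels = [1, 1, 1, 0, 0]  # 1 = Spam, 0 = Not Spam

def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

counts = {0: Counter(), 1: Counter()}
for email, label in zip(emails, labels):
    counts[label].update(tokenize(email))
vocab = set(counts[0]) | set(counts[1])
alpha = 1

def log_scores(email):
    """Return {label: log P(c) + sum_i log P(x_i | c)}, skipping unseen words."""
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log(labels.count(label) / len(labels))  # log prior
        for word in tokenize(email):
            if word in vocab:
                p = (counts[label][word] + alpha) / (total + alpha * len(vocab))
                score += math.log(p)
        scores[label] = score
    return scores

def posterior(email):
    """Normalize the log scores into probabilities via log-sum-exp."""
    scores = log_scores(email)
    peak = max(scores.values())  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
    return {label: math.exp(s - log_norm) for label, s in scores.items()}

probs = posterior("Get free cash now")
print(round(probs[1], 3))  # 0.919
```

Under these assumptions, "get" is ignored because it is not in the vocabulary, and the remaining three words push the spam probability to roughly 0.92. An email made up entirely of unseen words would fall back to the priors (0.6 for Spam, 0.4 for Not Spam).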

## Testing the Classifier

In [21]:
new_email = "Get free cash now"
prediction = predict(new_email)
print(f"New Email: '{new_email}'")
print("Prediction:", "Spam" if prediction == 1 else "Not Spam")

New Email: 'Get free cash now'
Prediction: Spam
