In [1]:
import os 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline 
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, f1_score
)

# **Multinomial Naive Bayes From Scratch**

**Naive Bayes Classifier**

The Naive Bayes Classifier is a *supervised ML* algorithm used for classification.

It belongs to the family of **probabilistic classifiers**: it makes predictions by calculating the probability that a given sample belongs to each class.

It is commonly used for text classification tasks such as:
- Email Spam Filtering
- Document Classification
- Sentiment Analysis

The classifier is called "naive" because of a core assumption it makes (explained below).

___
**Bayes' Theorem**
- The theorem lets us "flip" a conditional probability. Instead of asking "What is the probability of the words given the class?", it asks:
> **What is the probability of the class given the words?**

*The Formula*
$$
P(C|X) = \frac {P(X|C)\ P(C)}{P(X)}
$$

**Meaning of these Terms**:
- **Posterior Probability**: P(C|X)
  - This is what we want to find.
  - **Example**: Given the sentence "The study was randomized", what is the probability that it is a `METHODS` document?
- **Likelihood**: P(X|C)
  - "If we assume that the class is C, what is the probability of seeing this exact sentence X?"
  - **Example**: "If we look only at `METHODS` documents, what is the probability of seeing 'The study was randomized'?"
- **Prior Probability**: P(C)
  - "How common is class C overall, before looking at the sentence?"
  - **Example**: What percentage of all documents in our training set are `METHODS` documents?
- **Evidence**: P(X)
  - "What is the overall probability of seeing this sentence X, across all classes?"

> For classification, we want the class C with the highest P(C|X). Since P(X) is the same for all classes, we can ignore it when comparing them. We only need the class that maximizes the numerator:
$$
  Score(C) = P(X|C) \cdot P(C) \propto P(C|X)
$$
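
As a small numeric illustration of why the evidence can be dropped, the sketch below compares two classes using made-up priors and likelihoods. All numbers are hypothetical and chosen only to show the arithmetic; they do not come from any dataset.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
priors = {"METHODS": 0.3, "RESULTS": 0.7}           # P(C)
likelihoods = {"METHODS": 0.02, "RESULTS": 0.001}   # P(X|C) for one sentence X

# Unnormalized scores: P(X|C) * P(C)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The evidence P(X) is the sum of the scores over all classes.
evidence = sum(scores.values())
posteriors = {c: scores[c] / evidence for c in scores}

# Dividing every score by the same P(X) cannot change which class is largest.
best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
```

Both `best_by_score` and `best_by_posterior` pick the same class, which is why the classifier only needs the numerator.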

**Why Naive?**

Calculating the likelihood P(X|C) directly is extremely difficult. It means estimating the probability of the **exact** word sequence "The study was randomized", and most full sentences appear rarely or never in the training data.

To make this tractable, Naive Bayes makes one big "**naive**" assumption: **all features (words) are conditionally independent given the class**.

This means that:
1. Within a given class, the word "study" appearing has no effect on the probability of the word "randomized" appearing.
2. The order of words does not matter.

This is why the text model is often called a **Bag-of-Words** model: it treats the sentence as just a bag of words, ignoring grammar and word order.

This assumption simplifies the likelihood calculation from one giant, intractable probability into a product of small, easy ones:

$$
P(X|C) = P(w_1, w_2, \dots, w_n | C) \approx P(w_1|C) \cdot P(w_2|C) \cdots P(w_n|C)
$$
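
The factorization above can be sketched in a few lines. The per-word probabilities in `p_word_given_methods` are hypothetical values invented for this example, and splitting on whitespace is an assumed tokenization.

```python
import math

# Hypothetical per-word probabilities P(w|C) for the class METHODS.
p_word_given_methods = {"the": 0.05, "study": 0.02, "was": 0.04, "randomized": 0.01}

sentence = "The study was randomized"
words = sentence.lower().split()   # assumed tokenization: lowercase + whitespace

# Naive assumption: P(X|C) is approximated by the product of per-word terms.
likelihood = math.prod(p_word_given_methods[w] for w in words)

# Bag-of-words: reordering the words gives the same product.
shuffled = ["randomized", "the", "was", "study"]
likelihood_shuffled = math.prod(p_word_given_methods[w] for w in shuffled)
```

Because multiplication is commutative, `likelihood` and `likelihood_shuffled` agree (up to floating-point rounding), which is the bag-of-words property in action.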

**Multinomial Naive Bayes Classifier**

What we use in the code:
1. **Log-Probabilities**
  - Why: Multiplying many small probabilities produces an extremely small number that can underflow to 0 in floating point, wiping out the calculation.
  - Solution: The *log function* turns multiplication into addition, which is numerically more stable: `log(ab) = log(a) + log(b)`
  - Formula:
  $$
    \log(Score(C)) = \log(P(C)) + \sum_{i=1}^{n} \log(P(w_i|C))
  $$

  - If a word appears multiple times in the document, sum over the distinct words and weight each term by its count:
  $$
    \log(Score(C)) = \log(P(C)) + \sum_{i=1}^{n} count(w_i) \cdot \log(P(w_i|C))
  $$

2. **Zero-Frequency Problem**
  - What if the word "patient" appears in a new document, but it never appeared in any `METHODS` documents during training?
    - P("patient"|METHODS) = 0
    - log(0) = -∞
  - This single word would make the entire score for the `METHODS` class negative infinity, disqualifying it even if all other words were a perfect match.
  - **Solution**: Laplace (or additive) smoothing: we add a small value α (α = 1.0 in our code) to every word count. This pretends every word in the vocabulary has appeared α extra times in every class, which ensures that no probability is ever 0.
  - **Formula**:
  $$
  P(w_i|C) = \frac{count(w_i, C) + \alpha}{(Total\ Words\ in\ C) + (\alpha \cdot Total\ Vocabulary\ Size)}
  $$
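
Both techniques above can be illustrated with a short standard-library sketch. The probabilities and word counts are hypothetical, and `smoothed_log_prob` is a helper name introduced here for illustration, not a function from the notebook's implementation.

```python
import math

# 1. Underflow: the product of many small probabilities collapses to 0.0,
#    while the sum of their logs stays finite.
probs = [1e-3] * 500                        # 500 hypothetical word probabilities
product = math.prod(probs)                  # true value 1e-1500 underflows to 0.0
log_sum = sum(math.log(p) for p in probs)   # roughly 500 * log(1e-3)

# 2. Laplace smoothing for one class on a toy vocabulary (hypothetical counts).
counts = {"study": 3, "randomized": 1, "patient": 0}   # "patient" never seen
alpha = 1.0


def smoothed_log_prob(word, counts, alpha):
    """Return log P(word|C) using additive smoothing."""
    total_words = sum(counts.values())
    vocab_size = len(counts)
    return math.log((counts[word] + alpha) / (total_words + alpha * vocab_size))


# Every word, including the unseen "patient", gets a finite log-probability.
log_probs = {w: smoothed_log_prob(w, counts, alpha) for w in counts}
```

With these counts the unseen word "patient" receives probability (0 + 1) / (4 + 1 · 3) = 1/7 instead of 0, and the smoothed probabilities still sum to 1.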


**The Prediction Process**
1. Calculate Priors (from fit): For each class, find **P(C)** = "How common is this class?". Store **log(P(C))**.
2. Calculate Likelihoods (from fit): For each class, and for every word in the vocabulary, find **P(w|C)** = "How common is this word in this class?" (using Laplace smoothing). Store **log(P(w|C))**.
3. Calculate Score (in predict):
  - Start with the prior: $\text{Score} = \log(P(C))$.
  - Loop through the words in the new document.
  - For each word, add its weighted likelihood: $\text{Score} += \text{count}(\text{word}) \cdot \log(P(\text{word}|C))$.
4. Make Decision: Compare the final log-scores for all classes (e.g., Score(METHODS), Score(RESULTS)). The class with the highest score wins. This is called the Maximum A Posteriori (MAP) estimate.
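
The steps above can be sketched as a small class. This is a minimal illustration under stated assumptions, not the notebook's actual implementation: the class name `TinyMultinomialNB`, the lowercase whitespace tokenization, the choice to skip out-of-vocabulary words at prediction time, and the four training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes over whitespace-tokenized text."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing strength

    def fit(self, documents, labels):
        class_doc_counts = Counter(labels)
        word_counts = defaultdict(Counter)  # class -> word -> count
        for doc, label in zip(documents, labels):
            word_counts[label].update(doc.lower().split())

        self.classes_ = sorted(class_doc_counts)
        self.vocab_ = sorted({w for c in word_counts.values() for w in c})

        # Step 1: log priors, log P(C)
        n_docs = len(labels)
        self.log_prior_ = {
            c: math.log(class_doc_counts[c] / n_docs) for c in self.classes_
        }

        # Step 2: smoothed log likelihoods, log P(w|C)
        self.log_likelihood_ = {}
        for c in self.classes_:
            denom = sum(word_counts[c].values()) + self.alpha * len(self.vocab_)
            self.log_likelihood_[c] = {
                w: math.log((word_counts[c][w] + self.alpha) / denom)
                for w in self.vocab_
            }
        return self

    def log_scores(self, document):
        # Step 3: prior plus count-weighted log likelihoods
        counts = Counter(document.lower().split())
        scores = {}
        for c in self.classes_:
            score = self.log_prior_[c]
            for word, n in counts.items():
                # Assumption: words never seen in training are skipped.
                if word in self.log_likelihood_[c]:
                    score += n * self.log_likelihood_[c][word]
            scores[c] = score
        return scores

    def predict(self, document):
        # Step 4: MAP decision, the class with the highest log-score
        scores = self.log_scores(document)
        return max(scores, key=scores.get)


# Hypothetical toy training set.
train_docs = [
    "the study was randomized",
    "patients were randomized into two groups",
    "the treatment group improved significantly",
    "mortality was significantly lower",
]
train_labels = ["METHODS", "METHODS", "RESULTS", "RESULTS"]

model = TinyMultinomialNB(alpha=1.0).fit(train_docs, train_labels)
prediction = model.predict("the study was randomized")
```

On this toy data, `prediction` is `METHODS`. Storing the log-probabilities in `fit` means `predict` only performs additions, which matches the split between steps 1 and 2 (training) and steps 3 and 4 (prediction).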