# Naïve Bayes for NLP

> Naïve Bayes is a probabilistic classifier that predicts the class with the highest posterior probability, assuming features are conditionally independent given the class.

## The Classification Goal

Given:

* Feature vector **x = (x₁, x₂, …, xₘ)**
* Classes **κ ∈ {1, …, K}**

We want:

$$
h(x) = \arg\max_{\kappa} P(\kappa \mid x)
$$

## Step 1: Apply Bayes’ Theorem

$$
P(\kappa \mid x) = \frac{P(x \mid \kappa) P(\kappa)}{P(x)}
$$

* **P(x)** does not depend on the class **κ**, so it can be **dropped** without changing the arg max
* So we maximize:

$$
\arg\max_{\kappa} P(x \mid \kappa) P(\kappa)
$$
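A quick numeric check makes this concrete. The sketch below uses made-up priors and likelihoods for two hypothetical classes (`spam`, `ham`) and compares the class chosen from the unnormalized score with the class chosen from the full posterior.

```python
# Hypothetical numbers for a two-class problem; none of these come from real data.
priors = {"spam": 0.4, "ham": 0.6}          # P(class)
likelihoods = {"spam": 0.05, "ham": 0.01}   # P(x | class) for one fixed input x

# Unnormalized scores: P(x | class) * P(class)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The evidence P(x) is the sum of the unnormalized scores over all classes.
evidence = sum(scores.values())
posteriors = {c: scores[c] / evidence for c in scores}

best_unnormalized = max(scores, key=scores.get)
best_posterior = max(posteriors, key=posteriors.get)
print(best_unnormalized, best_posterior)
```

Dividing every score by the same positive constant cannot change which one is largest, so both selections agree.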

## Step 2: The “Naïve” Assumption

**Conditional independence of features**:

$$
P(x \mid \kappa) = \prod_{i=1}^{m} P(x_i \mid \kappa)
$$

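This massively simplifies learning. For **m** binary features and **K** classes:

* Without the assumption → on the order of **K · 2ᵐ** parameters (one per feature combination per class)
* With the assumption → about **K · m** parameters (one per feature per class)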

This works surprisingly well for **Bag-of-Words text data**.
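To put numbers on the parameter savings, the sketch below counts free parameters under the simplifying assumption that every feature is binary. The helper names are illustrative and do not come from any library.

```python
def full_joint_params(m, k):
    """Free parameters needed to model P(x | class) with no independence assumption.

    Each class needs one probability per feature combination (2**m of them),
    minus one because the probabilities must sum to 1.
    """
    return k * (2 ** m - 1)


def naive_bayes_params(m, k):
    """Free parameters under conditional independence: one P(x_i = 1 | class) per feature."""
    return k * m


for m in (10, 20, 30):
    print(m, full_joint_params(m, k=2), naive_bayes_params(m, k=2))
```

Even at 30 binary features the full joint table already needs billions of entries per class, while a real vocabulary has tens of thousands of words.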

## Step 3: Log Trick (Important)

Taking logs avoids numerical underflow and simplifies the math. Because log is monotonically increasing, the arg max is unchanged:

$$
\arg\max_{\kappa} \left[\log P(\kappa) + \sum_{i=1}^{m} \log P(x_i \mid \kappa) \right]
$$

➡️ Turns products into sums  
➡️ The score becomes **linear in the features** whenever each log-likelihood term is linear in **xᵢ**, as in the Bag-of-Words model
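The underflow problem is easy to reproduce. The sketch below multiplies 100 hypothetical per-feature likelihoods of `1e-5` each (made-up values) and compares the result with the sum of their logs.

```python
import math

# Hypothetical per-feature likelihoods; a long document easily has this many factors.
probs = [1e-5] * 100

# Direct product: the true value is 1e-500, far below the smallest positive float.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: stays comfortably within floating-point range.
log_sum = sum(math.log(p) for p in probs)

print(product)   # the product has underflowed
print(log_sum)   # a finite negative number
```

Once every class score underflows to zero, the arg max is meaningless, which is why practical implementations work in log space throughout.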

## Bag-of-Words + Naïve Bayes

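If the features are **word counts**, so that **xᵢ** is the number of times word **wᵢ** occurs in the document, each occurrence contributes one factor of **P(wᵢ | κ)**: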

$$
P(x_i \mid \kappa) = P(w_i \mid \kappa)^{x_i}
$$

So:

$$
\log P(x_i \mid \kappa) = x_i \log P(w_i \mid \kappa)
$$
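Put together, the class score is a prior term plus a weighted sum of word counts. The sketch below scores one document against two hypothetical classes. The three-word vocabulary, the probabilities, and the `log_score` helper are all made up for illustration.

```python
import math


def log_score(counts, log_prior, log_word_probs):
    """Return log P(class) + sum_i x_i * log P(w_i | class)."""
    return log_prior + sum(x * lw for x, lw in zip(counts, log_word_probs))


# Hypothetical vocabulary, in order: ["free", "meeting", "money"]
counts = [2, 0, 1]  # the document contains "free" twice and "money" once

# Made-up per-class word probabilities (each tuple sums to 1).
log_probs = {
    "spam": [math.log(p) for p in (0.5, 0.1, 0.4)],
    "ham": [math.log(p) for p in (0.1, 0.7, 0.2)],
}
log_priors = {"spam": math.log(0.5), "ham": math.log(0.5)}

scores = {c: log_score(counts, log_priors[c], log_probs[c]) for c in log_probs}
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

Words with a count of zero contribute nothing to the sum, so only the words that actually occur in the document need to be visited, which suits sparse text data.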

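This is why **Naïve Bayes fits BoW counts so naturally**: the class score is just a weighted sum of word counts. TF-IDF weights are not counts, so feeding them to this model is a heuristic, although one that is common in practice.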

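## Parameter Estimation (Just Counting!)

Training needs no optimization, only counting. The prior **P(κ)** is the fraction of training documents in class **κ**, and each word probability is a ratio of counts:
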
$$
P(w_i \mid \kappa) =
\frac{\text{count of } w_i \text{ in class } \kappa}
{\text{total words in class } \kappa}
$$
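As a minimal sketch of this counting step, the function below estimates the word probabilities for a single class from tokenized documents. The toy documents and the name `estimate_word_probs` are hypothetical.

```python
from collections import Counter


def estimate_word_probs(docs):
    """Unsmoothed estimate of P(w | class) from token lists belonging to ONE class."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy training documents for a hypothetical "spam" class.
spam_docs = [["free", "money", "free"], ["money", "now"]]
spam_probs = estimate_word_probs(spam_docs)
print(spam_probs)
```

Any word that does not occur in `spam_docs` is simply absent from the result, which amounts to an estimated probability of zero.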

## Laplace Smoothing (Very Important)

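**Problem**

If a word never appears in the training data for a class, its estimated probability is zero. That single zero wipes out the whole product, or equivalently sends the log-score to negative infinity:

$$
P(w_i \mid \kappa) = 0 \Rightarrow \log(0) = -\infty
$$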

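**Solution: Laplace smoothing**

$$
P(w_i \mid \kappa) =
\frac{N_{i,\kappa} + \alpha}
{\sum_j N_{j,\kappa} + \alpha M}
$$

Here $N_{i,\kappa}$ is the count of word $w_i$ in class $\kappa$, $M$ is the vocabulary size (the number of features, written $m$ earlier), and $\alpha > 0$ is the smoothing strength; $\alpha = 1$ is classic Laplace smoothing.

* Prevents zero probabilities
* Keeps every class possible, even for documents containing words unseen in that class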
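The smoothed estimate can be sketched in a few lines. The function name `smoothed_word_probs`, the toy documents, and the four-word vocabulary below are assumptions for illustration. Tokens outside the vocabulary are ignored so that the probabilities sum to 1.

```python
from collections import Counter


def smoothed_word_probs(docs, vocabulary, alpha=1.0):
    """Additively smoothed estimate of P(w | class) for every word in the vocabulary.

    docs: token lists belonging to ONE class.
    alpha: smoothing strength (alpha=1.0 is classic Laplace smoothing).
    """
    vocab = set(vocabulary)
    counts = Counter(t for doc in docs for t in doc if t in vocab)
    denom = sum(counts.values()) + alpha * len(vocab)
    # Counter returns 0 for unseen words, so they get alpha / denom.
    return {w: (counts[w] + alpha) / denom for w in vocab}


vocabulary = ["free", "money", "now", "meeting"]
spam_docs = [["free", "money", "free"], ["money", "now"]]
probs = smoothed_word_probs(spam_docs, vocabulary)
print(probs)
```

The word `meeting` never occurs in the toy spam documents, yet it receives a small positive probability, so its log is finite.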

## Why Naïve Bayes Works Well for NLP

Pros:
- Handles **high-dimensional sparse data**
- Extremely **fast to train**
- Works well even with **small datasets**
- Naturally probabilistic and interpretable

Cons:
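- Assumes words are independent given the class, which is often false in natural language
- Ignores word order and context under BoW, so it is typically weaker than deep models on tasks that hinge on semantics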

## Common NLP Applications

* **Spam detection**
* **Sentiment analysis**
* **Topic classification**
* **Document categorization**
* **Information retrieval ranking**

## Summary

* Naïve Bayes chooses the class with **maximum posterior probability**
* Independence is assumed **given the class**
* Fits **BoW counts** naturally; **TF-IDF** weights are a common heuristic substitute
* Uses **counts, not gradient descent**
* Laplace smoothing prevents zero probabilities
* In log space, the decision boundary is linear in the word-count features
