# 📘 Understanding the Naïve Bayes Classifier

## 📌 1. Introduction
The **Naïve Bayes Classifier** is a **probabilistic algorithm** based on **Bayes' Theorem** that assumes features are **conditionally independent** given the class. It is widely used for **spam detection, sentiment analysis, and medical diagnosis**.

---

## 📌 2. Bayes' Theorem
Bayes’ theorem describes the relationship between conditional probabilities:

$$
P(C|X) = \frac{P(X|C) P(C)}{P(X)}
$$

Where:
- $ P(C|X) $ = **Posterior Probability**: Probability of class $ C $ given data $ X $.
- $ P(X|C) $ = **Likelihood**: Probability of data $ X $ given class $ C $.
- $ P(C) $ = **Prior Probability**: Probability of class $ C $ occurring.
- $ P(X) $ = **Evidence (Normalization Factor)**: Marginal probability of data $ X $, summed over all classes: $ P(X) = \sum_{C} P(X|C) P(C) $.
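As a quick numeric illustration of the formula, here is a minimal sketch with two classes (spam and ham) and a single observation $ X $ (a word appearing in an email). All the probabilities are made-up values chosen only for the example, and the `posterior` helper is a hypothetical name.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)."""
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only.
p_spam = 0.3              # prior P(C = spam)
p_word_given_spam = 0.6   # likelihood P(X | spam)
p_word_given_ham = 0.05   # likelihood P(X | ham)

# Evidence P(X): law of total probability over both classes.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
p_ham_given_word = posterior(1 - p_spam, p_word_given_ham, p_word)
print(p_spam_given_word, p_ham_given_word)
```

Because the evidence is computed by summing over every class, the two posteriors add up to 1.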

---

## 📌 3. Applying the Naïve Independence Assumption
For **multiple features** $ (X_1, X_2, ..., X_n) $, we assume the features are **conditionally independent given the class** $ C $:

$$
P(X|C) = P(X_1, X_2, ..., X_n | C) = P(X_1 | C) \cdot P(X_2 | C) \cdot ... \cdot P(X_n | C)
$$

Using this, Bayes' Theorem simplifies to:

$$
P(C|X) = \frac{P(C) \prod_{i=1}^{n} P(X_i | C)}{P(X)}
$$

Since **$ P(X) $ does not depend on the class**, it is the same for every class, so comparing classes only requires the **numerator**.
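The following sketch checks this claim on a toy case: it computes the numerator for two classes, then normalizes by the evidence, and confirms that the winning class is the same either way. The priors and per-feature likelihoods are hypothetical values.

```python
from math import prod

# Hypothetical priors P(C) and per-feature likelihoods P(X_i | C).
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": [0.6, 0.4], "ham": [0.05, 0.5]}

# Numerator of Bayes' theorem under the independence assumption.
scores = {c: priors[c] * prod(likelihoods[c]) for c in priors}

# P(X) is the sum of the numerators over all classes.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(best_by_score, best_by_posterior)
```

Dividing every score by the same positive number cannot change their order, which is why the normalization step can be skipped when only the predicted label is needed.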

---

## 📌 4. Classification Rule
To classify a new data point, we compute a score **proportional to the posterior probability** for every class and select the class with the highest score:

$$
\hat{C} = \arg\max_{C} P(C) \prod_{i=1}^{n} P(X_i | C)
$$

Steps:
1. Compute the **prior probability** $ P(C) $.
2. Compute the **likelihood** $ P(X_i | C) $ for each feature.
3. Multiply the prior by all the likelihoods to get a score **proportional to the posterior probability**.
4. Choose the class with the highest score.
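The four steps can be sketched for categorical features as follows. This is a minimal illustration, not a production implementation: the `train` and `predict` names and the toy weather data are assumptions made for the example, probabilities are estimated by plain counting, and no smoothing is applied, so an unseen feature value gives a class a score of zero.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Estimate P(C) and the counts needed for P(X_i = v | C)."""
    class_counts = Counter(labels)
    priors = {c: k / len(labels) for c, k in class_counts.items()}
    # value_counts[c][i][v] = number of class-c rows whose feature i equals v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[c][i][v] += 1
    return priors, value_counts, class_counts

def predict(row, priors, value_counts, class_counts):
    """Return the arg-max class and the unnormalized score of every class."""
    scores = {}
    for c, prior in priors.items():
        score = prior                                          # step 1
        for i, v in enumerate(row):
            score *= value_counts[c][i][v] / class_counts[c]   # steps 2-3
        scores[c] = score
    return max(scores, key=scores.get), scores                 # step 4

# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool")]
labels = ["no", "no", "yes", "yes", "yes"]

model = train(rows, labels)
label, scores = predict(("rainy", "mild"), *model)
print(label, scores)
```

Here the class "no" never co-occurs with "rainy" in the training data, so its score collapses to zero; smoothing is the usual remedy for that.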

---

## 📌 5. Types of Naïve Bayes Models

### **A. Gaussian Naïve Bayes (For Continuous Data)**
For **continuous features**, we assume a **Gaussian (Normal) distribution**:

$$
P(X_i | C) = \frac{1}{\sqrt{2 \pi \sigma_{C,i}^2}} \exp \left( -\frac{(X_i - \mu_{C,i})^2}{2 \sigma_{C,i}^2} \right)
$$

Where:
- $ \mu_{C,i} $ = Mean of feature $ X_i $ for class $ C $.
- $ \sigma_{C,i}^2 $ = Variance of feature $ X_i $ for class $ C $.
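A minimal sketch of this likelihood for one feature and one class is shown below. The sample values are hypothetical, and using the population variance (`statistics.pvariance`) is an assumption of the example; practical implementations often also add a small floor to the variance to avoid division by zero.

```python
import math
from statistics import mean, pvariance

def gaussian_pdf(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical training values of one feature within one class.
feature_values = [170.0, 175.0, 180.0]

mu = mean(feature_values)            # per-class, per-feature mean
var = pvariance(feature_values, mu)  # per-class, per-feature variance

near = gaussian_pdf(175.0, mu, var)  # value at the class mean
far = gaussian_pdf(190.0, mu, var)   # value far from the class mean
print(near, far)
```

Note that these are density values, not probabilities, so they can exceed 1 when the variance is small; only their relative size across classes matters for classification.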

---

### **B. Multinomial Naïve Bayes (For Text Data)**
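For **discrete count data** (e.g., word counts in text classification), we use **Multinomial Naïve Bayes**. Here $ X_i $ is the number of times word $ w_i $ occurs in the document, and $ P(w_i | C) $ is the probability of drawing word $ w_i $ in class $ C $:

$$
P(X|C) \propto \prod_{i=1}^{n} P(w_i | C)^{X_i}
$$

The omitted multinomial coefficient depends only on $ X $, not on $ C $, so it does not affect the classification.

Using **Laplace Smoothing** (so that a word never seen in class $ C $ does not get zero probability), where $ n $ is the vocabulary size:

$$
P(w_i | C) = \frac{\text{count of } w_i \text{ in } C + 1}{\sum_{j=1}^{n} \text{count of } w_j \text{ in } C + n}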
$$
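The smoothed estimate can be sketched as below for a single class. The function name `smoothed_word_probs`, the `alpha` parameter (with `alpha=1` giving Laplace smoothing), and the tiny vocabulary are all assumptions made for illustration.

```python
from collections import Counter

def smoothed_word_probs(docs, vocab, alpha=1):
    """Per-word probabilities for one class, with add-alpha smoothing."""
    counts = Counter(word for doc in docs for word in doc if word in vocab)
    total = sum(counts.values())
    denom = total + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

# Hypothetical tokenized spam documents and vocabulary.
vocab = ["free", "money", "meeting"]
spam_docs = [["free", "money", "free"], ["money"]]

probs = smoothed_word_probs(spam_docs, vocab)
print(probs)
```

The word "meeting" never appears in the spam documents, yet it receives a small non-zero probability, and the probabilities over the vocabulary still sum to 1.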

---

### **C. Bernoulli Naïve Bayes (For Binary Data)**
For **binary features** (e.g., word presence or absence):

$$
P(X_i | C) = P_i^{X_i} (1 - P_i)^{(1 - X_i)}
$$

where $ X_i \in \{0, 1\} $ and $ P_i = P(X_i = 1 | C) $ is the probability that feature $ i $ is present in class $ C $.
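A short sketch of the resulting class likelihood, taking the product of this expression over all features, is given below. The per-feature probabilities are hypothetical, and `bernoulli_likelihood` is a name chosen for the example.

```python
def bernoulli_likelihood(x, p):
    """Product over features of p_i if x_i == 1, else (1 - p_i)."""
    result = 1.0
    for x_i, p_i in zip(x, p):
        result *= p_i if x_i == 1 else (1.0 - p_i)
    return result

# Hypothetical P(feature present | spam) for three vocabulary words.
p_spam = [0.8, 0.1, 0.5]
x = [1, 0, 1]  # words 1 and 3 present, word 2 absent

likelihood = bernoulli_likelihood(x, p_spam)  # 0.8 * (1 - 0.1) * 0.5
print(likelihood)
```

Unlike the multinomial model, an absent feature still contributes a factor of $ (1 - P_i) $, so the absence of a word counts as evidence.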

---

## 📌 6. Using Log Probabilities for Stability
Multiplying many small probabilities **can cause numerical underflow** (the product rounds to zero in floating point), so we take the **logarithm**, which turns the product into a sum:

$$
\log P(C|X) = \log P(C) + \sum_{i=1}^{n} \log P(X_i | C) - \log P(X)
$$

For classification, the $ \log P(X) $ term is the same for every class and can be dropped:

$$
\hat{C} = \arg\max_{C} \left( \log P(C) + \sum_{i=1}^{n} \log P(X_i | C) \right)
$$
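The sketch below illustrates the problem and the fix on a hypothetical case with 200 features. The likelihood values and the `log_score` helper are assumptions for the example, and the helper assumes every probability is strictly positive (for instance, after smoothing).

```python
import math

def log_score(prior, likelihoods):
    """log P(C) + sum_i log P(X_i | C); assumes all probabilities are > 0."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Hypothetical per-feature likelihoods for two classes, 200 features each.
likelihoods_a = [1e-3] * 200
likelihoods_b = [2e-3] * 200

# Direct multiplication underflows to 0.0 for both classes, so they tie.
direct_a = 0.5 * math.prod(likelihoods_a)
direct_b = 0.5 * math.prod(likelihoods_b)

# Log scores stay finite and still separate the two classes.
log_a = log_score(0.5, likelihoods_a)
log_b = log_score(0.5, likelihoods_b)
best = "b" if log_b > log_a else "a"
print(direct_a, direct_b, log_a, log_b, best)
```

Since the logarithm is strictly increasing, the class with the largest log score is the same class that would win on the original product, had it been representable.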

---

## 📌 7. Summary of Naïve Bayes Models
| Model Type | Data Type | Likelihood Assumption |
|------------|----------|----------------------|
| **Gaussian Naïve Bayes** | Continuous (e.g., real numbers) | Normal Distribution |
| **Multinomial Naïve Bayes** | Discrete counts (e.g., word counts) | Multinomial Distribution |
| **Bernoulli Naïve Bayes** | Binary (e.g., word presence) | Bernoulli Distribution |

---

## 📌 8. When to Use Naïve Bayes?
✅ **Use Naïve Bayes when:**
- **Speed is important** (it is very fast).
- **Feature independence is reasonable**.
- **Text classification** (spam detection, sentiment analysis).
- **Medical diagnosis** (predicting diseases from symptoms).

❌ **Avoid Naïve Bayes when:**
- **Features are highly correlated**.
- **Very small datasets**, where the per-class likelihoods cannot be estimated reliably (smoothing helps but does not fully fix this).
- **Complex relationships** between features (deep learning or ensemble methods often work better).