# Naive Bayes Classifier – Detailed Theory

---

## 1. What is Naive Bayes?

Naive Bayes is a **probabilistic supervised learning algorithm** used for **classification**.  
It is based on **Bayes' Theorem** and makes the simplifying assumption that the **features are conditionally independent given the class**.

Despite this strong assumption, Naive Bayes performs very well in many real-world tasks, especially:
- Text classification
- Spam detection
- Document categorization
- Medical diagnosis (baseline models)

---

## 2. Bayes' Theorem

Bayes' theorem describes how to update the probability of a hypothesis when new evidence is observed.

$$
P(y \mid X) = \frac{P(X \mid y) \cdot P(y)}{P(X)}
$$

Where:
- $P(y \mid X)$ → **Posterior probability** (probability of class given data)
- $P(X \mid y)$ → **Likelihood**
- $P(y)$ → **Prior probability of the class**
- $P(X)$ → **Evidence** (constant for all classes)
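
As a small numeric illustration of the formula, the sketch below plugs in made-up values for two classes. The class names and all probabilities are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical numbers, chosen only for illustration.
prior = {"spam": 0.3, "ham": 0.7}          # P(y)
likelihood = {"spam": 0.8, "ham": 0.1}     # P(X | y) for one observed X

# Evidence P(X), obtained by summing P(X | y) * P(y) over all classes.
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior P(y | X) for each class.
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```

The posteriors sum to 1 because the evidence term normalizes the per-class products.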

---

## 3. Classification Using Bayes' Theorem

In classification, we compute the posterior probability for **each class** and choose the class with the highest probability.

$$
\hat{y} = \arg\max_y P(X \mid y) \cdot P(y)
$$

Since $P(X)$ is the same for all classes, it can be ignored during comparison.
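
The point about ignoring $P(X)$ can be checked numerically. This is a minimal sketch with hypothetical priors and likelihoods for three classes; it only shows that normalizing does not change which class wins.

```python
import numpy as np

# Hypothetical values for three classes.
priors = np.array([0.5, 0.3, 0.2])            # P(y)
likelihoods = np.array([0.02, 0.10, 0.05])    # P(X | y)

joint = likelihoods * priors                  # P(X | y) * P(y)
posterior = joint / joint.sum()               # divide by P(X)

# Dividing by a positive constant keeps the largest entry in the same place.
pred_unnormalized = int(np.argmax(joint))
pred_normalized = int(np.argmax(posterior))
print(pred_unnormalized, pred_normalized)
```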

---

## 4. The "Naive" Independence Assumption

Naive Bayes assumes that **features are conditionally independent given the class**.

$$
P(X \mid y) = P(x_1, x_2, ..., x_n \mid y)
$$

Naive assumption:

$$
P(X \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
$$

This simplifies computation greatly and makes the algorithm fast and scalable.

> Even when this assumption is not perfectly true, Naive Bayes often works surprisingly well.
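
To give a sense of how much the assumption simplifies the model, the sketch below counts free parameters per class. It assumes binary features purely to make the counting simple, and the function names are hypothetical.

```python
def full_joint_params(n_features):
    """Free parameters per class for an unrestricted joint over binary features."""
    return 2 ** n_features - 1


def naive_bayes_params(n_features):
    """Free parameters per class when features are conditionally independent."""
    return n_features


for n in (5, 10, 20):
    print(n, full_joint_params(n), naive_bayes_params(n))
```

The unrestricted joint grows exponentially with the number of features, while the naive factorization grows linearly.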

---

## 5. Gaussian Naive Bayes

Gaussian Naive Bayes is used when features are **continuous** (real-valued), such as:
- Height
- Weight
- Sepal length (Iris dataset)

It assumes each feature follows a **normal (Gaussian) distribution** within each class.

### Gaussian Probability Density Function

$$
P(x \mid y) =
\frac{1}{\sqrt{2\pi\sigma_y^2}}
\exp\left(
-\frac{(x - \mu_y)^2}{2\sigma_y^2}
\right)
$$

Where:
- $\mu_y$ → Mean of feature for class $y$
- $\sigma_y^2$ → Variance of feature for class $y$

Each feature has its **own mean and variance per class**.
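
The density above can be written as a short NumPy function. This is a sketch that assumes a strictly positive variance; the mean and variance values are hypothetical.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Gaussian density of x for a given mean and (positive) variance."""
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


# Hypothetical class statistics for a single feature.
density_at_mean = gaussian_pdf(5.0, mean=5.0, var=0.04)
density_off_mean = gaussian_pdf(5.5, mean=5.0, var=0.04)
print(density_at_mean, density_off_mean)
```

Note that the value at the mean is greater than 1 here. A density is not a probability, so this is expected when the variance is small, and it does not affect the comparison between classes.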

---

## 6. Training Phase (What the Model Learns)

For each class $y$, the model computes:

1. **Prior probability**
$$
P(y) = \frac{\text{Number of samples in class } y}{\text{Total samples}}
$$

2. **Mean of each feature**
$$
\mu_{y,i} = \text{mean of feature } i \text{ for class } y
$$

3. **Variance of each feature**
$$
\sigma_{y,i}^2 = \text{variance of feature } i \text{ for class } y
$$

No iterative optimization is needed. Training is very fast.
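
The three quantities can be computed in a single pass over the classes. The sketch below uses a tiny hypothetical dataset with two features. It uses `np.var` with its default settings, which gives the population variance (dividing by the number of samples).

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 features, 2 classes.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.4, 4.4], [2.8, 3.6]])
y_toy = np.array([0, 0, 1, 1, 1])

params = {}
for cls in np.unique(y_toy):
    X_c = X_toy[y_toy == cls]                       # rows belonging to this class
    params[int(cls)] = {
        "prior": X_c.shape[0] / X_toy.shape[0],     # P(y)
        "mean": X_c.mean(axis=0),                   # one mean per feature
        "var": X_c.var(axis=0),                     # one variance per feature
    }

print(params)
```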

---

## 7. Prediction Phase

For a new sample $X = (x_1, x_2, ..., x_n)$:

1. Compute likelihood for each feature using Gaussian PDF
2. Multiply all feature likelihoods
3. Multiply by class prior
4. Choose the class with maximum posterior probability

$$
P(y \mid X) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$
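
The four steps can be sketched directly, without logarithms, for a single sample. All parameter values below are hypothetical, and this version is meant for illustration only because it multiplies raw densities.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Gaussian density; broadcasts over arrays of means and variances."""
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


# Hypothetical learned parameters: 2 classes (rows) x 2 features (columns).
priors = np.array([0.4, 0.6])
means = np.array([[1.0, 2.0], [3.0, 4.0]])
variances = np.array([[0.5, 0.5], [0.5, 0.5]])

x_new = np.array([1.2, 2.1])

per_feature = gaussian_pdf(x_new, means, variances)   # step 1: shape (2, 2)
likelihoods = per_feature.prod(axis=1)                # step 2: product over features
scores = priors * likelihoods                         # step 3: multiply by prior
predicted = int(np.argmax(scores))                    # step 4: pick the best class
print(scores, predicted)
```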

---

## 8. Log Probabilities (Numerical Stability)

Multiplying many small probabilities can cause numerical underflow.

To avoid this, we use logarithms:

$$
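\log P(y \mid X) = \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) - \log P(X)
$$

The term $\log P(X)$ is the same for every class, so it can be dropped when comparing classes. Because the logarithm is monotonically increasing, it does not change the class ordering, and summing logs is numerically stable.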
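
The underflow problem is easy to reproduce. The sketch below assumes 1000 features whose likelihoods are all the hypothetical value `1e-5`, and compares the direct product with the sum of logs in 64-bit floats.

```python
import numpy as np

# 1000 hypothetical per-feature likelihoods, each equal to 1e-5.
p = np.full(1000, 1e-5)

direct_product = np.prod(p)       # true value is 1e-5000, far below float64 range
log_sum = np.sum(np.log(p))       # stays finite

print(direct_product)             # underflows to 0.0
print(log_sum)
```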

---

## 9. Predicted Probabilities

Naive Bayes can output **class probabilities**, not just class labels.

$$
P(y = k \mid X)
$$

These probabilities:
- Provide confidence of predictions
- Are useful in decision-making systems
- Help understand model behavior
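
When the scores are kept in log space, they still have to be normalized to obtain probabilities. One common way to do this is to subtract the maximum log-score before exponentiating. The sketch below assumes hypothetical log-scores, and the helper name `log_scores_to_proba` is made up for this example.

```python
import numpy as np


def log_scores_to_proba(log_scores):
    """Turn unnormalized log-posteriors into probabilities that sum to 1."""
    log_scores = np.asarray(log_scores, dtype=float)
    shifted = log_scores - log_scores.max()   # largest entry becomes 0
    unnormalized = np.exp(shifted)            # safe: all values are in (0, 1]
    return unnormalized / unnormalized.sum()


# Hypothetical log-scores that are too negative to exponentiate directly.
log_scores = np.array([-1050.0, -1052.0, -1060.0])

naive = np.exp(log_scores)                    # every entry underflows to 0.0
proba = log_scores_to_proba(log_scores)
print(naive, proba)
```

Shifting by a constant does not change the normalized result, because the same factor cancels in the numerator and the denominator.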

---

## 10. Variants of Naive Bayes

### 1. Gaussian Naive Bayes
- Continuous features
- Assumes normal distribution
- Used in Iris, medical data

### 2. Multinomial Naive Bayes
- Discrete counts
- Word frequency in text
- Used in NLP and spam detection

### 3. Bernoulli Naive Bayes
- Binary features (0/1)
- Presence or absence of a word

---

## 11. Advantages of Naive Bayes

- Very fast training and prediction
- Works well with small datasets
- Handles high-dimensional data
- Performs well in text classification
- Simple and interpretable

---

## 12. Limitations of Naive Bayes

- Strong independence assumption
- Cannot model feature interactions
- Poor probability calibration
- Gaussian assumption may not fit real data perfectly
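
The link between the independence assumption and poor calibration can be illustrated by feeding the model the same feature several times. In this sketch all parameters are hypothetical, and the duplicated columns stand in for perfectly correlated features.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def posterior(x, priors, means, variances):
    """Normalized Gaussian NB posterior for one sample, computed in log space."""
    log_scores = np.log(priors) + np.log(gaussian_pdf(x, means, variances)).sum(axis=1)
    unnormalized = np.exp(log_scores - log_scores.max())
    return unnormalized / unnormalized.sum()


# Hypothetical model: 2 classes, 1 feature.
priors = np.array([0.5, 0.5])
means = np.array([[0.0], [1.0]])
variances = np.array([[1.0], [1.0]])
x = np.array([0.8])

p_single = posterior(x, priors, means, variances)

# Same information, but the feature is copied three times.
p_dup = posterior(
    np.repeat(x, 3),
    priors,
    np.repeat(means, 3, axis=1),
    np.repeat(variances, 3, axis=1),
)
print(p_single, p_dup)
```

The duplicated version is more confident in class 1 even though no new information was added, because each copy is counted as independent evidence.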

---

## 13. When to Use Naive Bayes

Use Naive Bayes when:
- You need a fast baseline model
- Dataset is small or high-dimensional
- Features are approximately independent
- Text classification tasks

Avoid when:
- Features are highly correlated
- Complex decision boundaries are required

---

## 14. Summary

- Naive Bayes is a probabilistic classifier
- Based on Bayes' theorem
- Assumes conditional independence
- Gaussian NB handles continuous features
- Simple, fast, and surprisingly effective

---

## 15. Next Learning Steps

- Implement Multinomial Naive Bayes for text
- Compare Naive Bayes vs Logistic Regression
- Study probability calibration
- Apply Naive Bayes on real NLP datasets

---


In [11]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X,y = iris.data, iris.target
class_names = iris.target_names

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [14]:
class GaussianNaiveBayes:
    def fit(self,X,y):
        self.classes = np.unique(y)
        self.mean    = {}
        self.var     = {}
        self.prior   = {}
        
        for cls in self.classes:
            X_c = X[y == cls]
            self.mean[cls] = X_c.mean(axis=0)
            self.var[cls]  = X_c.var(axis=0)
            self.prior[cls]= X_c.shape[0]/X.shape[0]
            
    def gaussian_pdf(self,x,mean,var):
        eps = 1e-6
        coeff = 1/np.sqrt(2*np.pi*var + eps)
        exponent = np.exp(-((x-mean)**2)/(2*var+eps))
        return coeff*exponent
    
    def predict_proba(self,X):
        probabilities = []
        
        for x in X:
            class_probs = []
            for cls in self.classes:
                prior = np.log(self.prior[cls])
                likelihood = np.sum(
                    np.log(self.gaussian_pdf(x,self.mean[cls],self.var[cls]))
                )
                posterior = prior + likelihood
                class_probs.append(posterior)
                
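            # Convert log-scores to probabilities; subtracting the maximum
            # before exponentiating avoids underflow to all zeros
            class_probs = np.array(class_probs)
            class_probs = np.exp(class_probs - class_probs.max())
            class_probs = class_probs/np.sum(class_probs)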
            probabilities.append(class_probs)
        return np.array(probabilities)
    
    def predict(self,X):
        probs = self.predict_proba(X)
        # Map the winning column index back to the original class label
        return self.classes[np.argmax(probs,axis=1)]
    
# Train the Model
gnb = GaussianNaiveBayes()
gnb.fit(X_train,y_train)
# Make Predictions
y_pred = gnb.predict(X_test)
# Observe Predicted Probabilities 
probs = gnb.predict_proba(X_test)

for i in range(5):
    print(f"Sample {i+1}")
    for cls, p in zip(class_names,probs[i]):
        print(f" P({cls}) = {p:.6f}")
    print("Predict class:",class_names[y_pred[i]])
    print()

Sample 1
 P(setosa) = 0.000000
 P(versicolor) = 0.995636
 P(virginica) = 0.004364
Predict class: versicolor

Sample 2
 P(setosa) = 1.000000
 P(versicolor) = 0.000000
 P(virginica) = 0.000000
Predict class: setosa

Sample 3
 P(setosa) = 0.000000
 P(versicolor) = 0.000000
 P(virginica) = 1.000000
Predict class: virginica

Sample 4
 P(setosa) = 0.000000
 P(versicolor) = 0.977593
 P(virginica) = 0.022407
Predict class: versicolor

Sample 5
 P(setosa) = 0.000000
 P(versicolor) = 0.870021
 P(virginica) = 0.129979
Predict class: versicolor



### 🔹 1. Multinomial Naive Bayes (Text / Count Data)
When to use:
- Word counts
- Document-term matrices
- Bag-of-Words / TF (not TF-IDF ideally)

### Theory 
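$$P(y \mid x) \propto P(y) \prod_{i} P(x_i \mid y)$$

For multinomial NB, the per-feature probability is estimated from counts with Laplace (additive) smoothing:
$$P(x_i \mid y) = \frac{\text{count}(x_i, y) + \alpha}{\sum_{j} \text{count}(x_j, y) + \alpha n}$$

Here $\text{count}(x_i, y)$ is the total count of feature $i$ over the training samples of class $y$, $n$ is the number of features, and $\alpha$ is the smoothing parameter. At prediction time, a feature that occurs $c_i$ times in the sample contributes $c_i \log P(x_i \mid y)$ to the class log-score.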
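
A short worked example of the smoothed estimate, assuming the usual additive form in which $\alpha$ is added to every feature count and $\alpha n$ to the class total. The counts are hypothetical.

```python
import numpy as np

alpha = 1.0

# Hypothetical summed counts for one class over a 3-word vocabulary.
counts = np.array([5.0, 0.0, 1.0])
n_features = counts.shape[0]

unsmoothed = counts / counts.sum()
smoothed = (counts + alpha) / (counts.sum() + alpha * n_features)

print(unsmoothed)   # the second word gets probability 0
print(smoothed)     # every word gets a non-zero probability
```

Without smoothing, a word never seen in a class has probability zero, and its log is negative infinity, so a single occurrence of that word would rule the class out entirely.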

In [6]:
import numpy as np

class MultinomialNaiveBayes:
    def __init__(self,alpha=1.0):
        self.alpha = alpha # Laplace smoothing
        
    def fit(self,X,y):
        self.classes = np.unique(y)
        self.feature_log_prob = {}
        self.class_log_prior = {}
        
        for cls in self.classes:
            X_c = X[y == cls]
            class_count = X_c.shape[0]
            
            # Prior
            self.class_log_prior[cls] = np.log(class_count/X.shape[0])
            
            # Feature probabilities with Laplace smoothing
            feature_count = X_c.sum(axis=0) + self.alpha
            total_count = feature_count.sum()
            
            self.feature_log_prob[cls] = np.log(feature_count/total_count)
        
    def predict(self,X):
        predictions = []
        
        for x in X:
            class_scores = []
            for cls in self.classes:
                score = (
                    self.class_log_prior[cls] 
                    + np.sum(x*self.feature_log_prob[cls])
                )
                class_scores.append(score)
                
            predictions.append(self.classes[np.argmax(class_scores)])
            
        return np.array(predictions)
    
X = np.array([
    [3,0,1],
    [2,0,0],
    [1,3,0],
    [0,2,1]
])

y = np.array([0,0,1,1])
mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.fit(X,y)  # the model must be fitted before calling predict

print(mnb.predict(X))


In [8]:
import numpy as np

class MultinomialNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing 
        
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        self.feature_log_prob = {}
        self.class_log_prior = {}
        
        for cls in self.classes:
            X_c = X[y == cls]
            
            # 1. Calculate Class Prior: P(c)
            # Log(count of samples in class / total samples)
            self.class_log_prior[cls] = np.log(X_c.shape[0] / n_samples)
            
            # 2. Calculate Feature Likelihoods with Laplace Smoothing: P(w|c)
            # Numerator: Count of each feature in class + alpha
            feature_counts = X_c.sum(axis=0) + self.alpha
            # Denominator: Total count of all features in class + (alpha * n_features)
            total_count = feature_counts.sum()
            
            # Store log probabilities to prevent underflow
            self.feature_log_prob[cls] = np.log(feature_counts / total_count)
        
    def predict(self, X):
        # We calculate: log(P(c)) + sum(feature_counts * log(P(w|c)))
        class_scores = []
        
        for cls in self.classes:
            # Vectorized calculation using matrix multiplication (@)
            # This calculates the score for all rows in X simultaneously
            prior = self.class_log_prior[cls]
            likelihood = X @ self.feature_log_prob[cls]
            class_scores.append(prior + likelihood)
        
        # Convert list to array (shape: n_classes, n_samples)
        # argmax along axis 0 gives the index of the highest score per sample
        class_scores = np.array(class_scores)
        return self.classes[np.argmax(class_scores, axis=0)]

# --- Testing the implementation ---
X = np.array([
    [3, 0, 1], # Sample 0
    [2, 0, 0], # Sample 1
    [1, 3, 0], # Sample 2
    [0, 2, 1]  # Sample 3
])

y = np.array([0, 0, 1, 1])

mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.fit(X, y)

predictions = mnb.predict(X)
print(f"Predictions: {predictions}")
# Expected Output: [0 0 1 1]

Predictions: [0 0 1 1]


### 3️⃣ Bernoulli Naive Bayes (From Scratch)
Used for:
- Binary features
- Spam detection, keyword presence

Likelihood:

$$
P(x_i \mid y) =
\begin{cases}
p_{iy}, & \text{if } x_i = 1 \\
1 - p_{iy}, & \text{if } x_i = 0
\end{cases}
$$
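
The two cases of the likelihood can be evaluated for one sample as follows. The probabilities $p_{iy}$ below are hypothetical values for a single class.

```python
import numpy as np

# Hypothetical P(x_i = 1 | y) for three binary features of one class.
p = np.array([0.9, 0.2, 0.5])
x = np.array([1, 0, 1])

# Pick p where the feature is present and 1 - p where it is absent.
per_feature = np.where(x == 1, p, 1 - p)

# The same quantity in log space, written as a single expression.
log_likelihood = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

print(per_feature, np.exp(log_likelihood))
```

Unlike the multinomial model, an absent feature still contributes a factor of $1 - p_{iy}$, so the absence of a word counts as evidence.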

#### 🔹 Bernoulli NB Code (From Scratch)

In [9]:
class BernoulliNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.feature_prob = {}
        self.class_log_prior = {}

        for cls in self.classes:
            X_c = X[y == cls]
            self.class_log_prior[cls] = np.log(X_c.shape[0] / X.shape[0])

            # Probability of feature being 1
            feature_count = X_c.sum(axis=0)
            self.feature_prob[cls] = (
                feature_count + self.alpha
            ) / (X_c.shape[0] + 2 * self.alpha)

    def predict(self, X):
        predictions = []

        for x in X:
            class_scores = []
            for cls in self.classes:
                log_prob = self.class_log_prior[cls]
                prob = self.feature_prob[cls]

                log_prob += np.sum(
                    x * np.log(prob) + (1 - x) * np.log(1 - prob)
                )
                class_scores.append(log_prob)

            predictions.append(self.classes[np.argmax(class_scores)])

        return np.array(predictions)

    
X = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0]
])

y = np.array([0, 0, 1, 1])

bnb = BernoulliNaiveBayes(alpha=1.0)
bnb.fit(X, y)

print(bnb.predict(X))


[0 0 1 1]
