# Day 54 – Naive Bayes Classifier

## Introduction

In this notebook, I explore the **Naive Bayes Classifier**, a family of efficient supervised learning algorithms based on **Bayes' Theorem** and the assumption that features are independent of one another given the class. Its simplicity and speed make it a popular choice for many tasks, especially text classification and spam filtering.

I begin with the theory behind conditional probability, Bayes’ Theorem, and how Naive Bayes works, followed by its real-world applications. Then I implement and compare the three main types of Naive Bayes classifiers:  
- **GaussianNB** (for continuous features)  
- **MultinomialNB** (for count-based features)  
- **BernoulliNB** (for binary features)  

Each classifier is tested under different preprocessing approaches — without scaling, with StandardScaler, and with Normalizer — to observe how scaling impacts their performance.  

By the end of this notebook, it becomes clear which Naive Bayes variants are sensitive to scaling, why MultinomialNB fails with StandardScaler, and how each algorithm behaves under different scenarios, giving practical insights into choosing the right variant for a dataset.

---


## 1. The Foundation: Conditional Probability and Bayes' Theorem

To understand Naive Bayes, we must first grasp two core concepts from probability.

### Conditional Probability

Conditional Probability is the likelihood of an event occurring, given that another event has already happened. We write it as $P(A|B)$, which means "the probability of event A given event B."

* **Example**: The probability of a student getting a high score on a test ($A$) given that they studied for many hours ($B$).

Mathematically:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

- $P(A|B)$ = Probability of A given B  
- $P(A \cap B)$ = Probability that both A and B occur  
- $P(B)$ = Probability of B  

Example: Suppose 30% of people like coffee, and among coffee lovers, 60% also like tea.  
- Here, $P(\text{Tea}|\text{Coffee}) = 0.6$.
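
As a quick check of the coffee and tea example, the definition can be rearranged to get the joint probability and then used again to recover the conditional one. This is only a small arithmetic sketch; the variable names are mine.

```python
# Numbers taken from the coffee/tea example above.
p_coffee = 0.30            # P(Coffee)
p_tea_given_coffee = 0.60  # P(Tea | Coffee)

# Rearranging P(A|B) = P(A ∩ B) / P(B) gives the joint probability.
p_coffee_and_tea = p_tea_given_coffee * p_coffee

# Dividing the joint probability by P(Coffee) recovers the conditional.
recovered = p_coffee_and_tea / p_coffee

print(round(p_coffee_and_tea, 2))  # 0.18
print(round(recovered, 2))         # 0.6
```

So 18% of all people like both coffee and tea, even though 60% of coffee lovers like tea.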

### Bayes' Theorem

Bayes’ Theorem provides a way to **update probabilities** when new evidence (data) is observed. It computes a conditional probability that is hard to measure directly from other probabilities that are already known, which lets us revise our belief about an event as evidence arrives.

The formula is:

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$

* $P(A|B)$: The **posterior probability** (what we want to find). The probability of event A happening given event B has occurred.
* $P(B|A)$: The **likelihood**. The probability of event B happening given event A is true.
* $P(A)$: The **prior probability**. The initial probability of event A happening.
* $P(B)$: The **evidence**. The probability of event B happening.
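
To make the four terms concrete, here is a minimal sketch that applies the formula to a two-hypothesis case (A versus not A). The spam numbers are hypothetical and chosen only for illustration; the function name `bayes_posterior` is also my own.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) when the only alternatives are A and not-A."""
    # Evidence P(B), expanded with the law of total probability.
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical numbers: 20% of emails are spam, the word "free" appears
# in 50% of spam emails and in 5% of non-spam emails.
posterior = bayes_posterior(prior_a=0.20,
                            likelihood_b_given_a=0.50,
                            likelihood_b_given_not_a=0.05)
print(round(posterior, 3))  # 0.714
```

Under these assumed numbers, seeing the word "free" raises the probability of spam from the 20% prior to about 71%.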

---

## 2. The Naive Bayes Algorithm

The Naive Bayes algorithm applies Bayes' Theorem to a classification problem. It uses a "naive" assumption to simplify the complex calculations of conditional probability.

### The "Naive" Assumption
The algorithm assumes that all features in a dataset are **independent of each other once the class is known** (conditional independence). In other words, within a given class, it assumes that the value of one feature (e.g., a person's age) tells us nothing about the value of another feature (e.g., their salary). While this assumption is rarely true in the real world, it greatly simplifies the model and makes it very fast to train and apply.

**Why "Naive"?**  
- It assumes that **all features are independent given the class**.  
- In real life, this is rarely true, but the algorithm still works surprisingly well.
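
Written out, the assumption looks like this. The symbols are my own notation rather than the notebook's: $y$ is a class and $x_1, \dots, x_n$ are the feature values of one sample. Bayes' Theorem gives

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

and the naive assumption replaces the joint likelihood with a product of per-feature terms:

$$P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$

Since the denominator is the same for every class, the predicted class is

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$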

### How it works
The algorithm calculates the probability of each class for a given set of features and then assigns the class with the highest probability to the new data point.

* For example, in a spam filter, Naive Bayes calculates:
    * The probability that an email is spam, given its words.
    * The probability that an email is not spam, given its words.
* It then classifies the email as spam or not spam based on which of these two probabilities is higher.

### Steps in Naive Bayes classification
1. Calculate prior probability for each class.  
2. Calculate conditional probability for each feature given the class.  
3. Apply Bayes’ theorem to compute posterior probability for each class.  
4. Choose the class with the highest posterior probability.
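
The four steps above can be sketched from scratch on a tiny word-presence dataset. Everything here is illustrative: the emails are made up, and the add-one (Laplace) smoothing is an addition of mine so that unseen word/class pairs do not get probability zero.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy training data: (words in the email, label).
train = [
    ({"free", "win", "money"}, "spam"),
    ({"free", "offer"}, "spam"),
    ({"meeting", "tomorrow"}, "not spam"),
    ({"project", "meeting", "notes"}, "not spam"),
    ({"free", "lunch", "tomorrow"}, "not spam"),
]
vocabulary = sorted(set().union(*(words for words, _ in train)))

# Step 1: prior probability of each class.
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Step 2: P(word present | class), with add-one smoothing.
word_counts = defaultdict(Counter)
for words, label in train:
    word_counts[label].update(words)
cond = {
    c: {w: (word_counts[c][w] + 1) / (class_counts[c] + 2) for w in vocabulary}
    for c in class_counts
}


def predict(words):
    # Step 3: log-posterior (up to a shared constant) for each class.
    scores = {}
    for c in class_counts:
        score = math.log(priors[c])
        for w in vocabulary:
            p = cond[c][w]
            score += math.log(p if w in words else 1 - p)
        scores[c] = score
    # Step 4: choose the class with the highest posterior.
    return max(scores, key=scores.get)


print(predict({"free", "money"}))     # spam
print(predict({"meeting", "notes"}))  # not spam
```

Working in log space avoids multiplying many small probabilities together, which is a common implementation choice.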

---

## 3. Real-World Example of Naive Bayes

**Spam Email Detection:**
- Features: words present in an email.  
- Classes: *Spam* or *Not Spam*.  
- The algorithm calculates probabilities like:  
  - $P(\text{Spam} \mid \text{Word} = \text{"Free"})$  
  - $P(\text{NotSpam} \mid \text{Word} = \text{"Free"})$  
- If the probability of Spam is higher, the email is classified as Spam.  

Naive Bayes is widely used in:
- Text classification (spam filtering, sentiment analysis, document categorization)  
- Medical diagnosis  
- Real-time predictions where speed is crucial  

---

## 4. Types of Naive Bayes

There are **three main variants** of Naive Bayes, chosen based on the type of data:

1. **Gaussian Naive Bayes**  
   - Assumes features are continuous and follow a **normal (Gaussian) distribution**.  
   - Common for numeric features like age, salary, measurements.

2. **Multinomial Naive Bayes**  
   - Used for **discrete counts** (e.g., word counts in text classification).  
   - Feature values must be non-negative; integer counts are typical, though fractional values such as TF-IDF weights can also work in practice.  
   - Very popular in NLP applications.

3. **Bernoulli Naive Bayes**  
   - Features are assumed to be **binary** (0 or 1).  
   - Example: presence or absence of a word in text.  
   - Suitable when features are indicators rather than counts.

The choice of Naive Bayes classifier depends on the nature of your data's features.


| Classifier | Data Type | Key Characteristic | Common Use Case |
| :--- | :--- | :--- | :--- |
| **Gaussian Naive Bayes** | Continuous | Assumes features follow a **Gaussian (Normal) distribution**. | Continuous numerical data, like age, height, or salary. |
| **Multinomial Naive Bayes** | Discrete | Used for **count-based data**. | Text classification, where features are word counts or frequencies. |
| **Bernoulli Naive Bayes** | Binary | Used for **binary features**. | Document classification, where a feature indicates whether a word is present (1) or not (0). |
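
As a small illustration of matching the variant to the feature type, the sketch below fits each scikit-learn class on a toy array of the kind it is designed for. The arrays and labels are hypothetical and far too small to say anything about accuracy.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy labels shared by all three feature sets.
y_toy = np.array([0, 0, 1, 1])

# Continuous measurements (e.g. age, salary) -> GaussianNB.
X_continuous = np.array([[22.0, 21000.0], [25.0, 24000.0],
                         [47.0, 88000.0], [52.0, 95000.0]])
# Non-negative counts (e.g. word counts) -> MultinomialNB.
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Binary indicators (e.g. word present or absent) -> BernoulliNB.
X_binary = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])

pairs = [
    ("GaussianNB", GaussianNB(), X_continuous),
    ("MultinomialNB", MultinomialNB(), X_counts),
    ("BernoulliNB", BernoulliNB(), X_binary),
]

predictions = {}
for name, model, X_toy in pairs:
    model.fit(X_toy, y_toy)
    predictions[name] = model.predict(X_toy)
    print(name, predictions[name])
```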

---

## 5. Advantages and Limitations

**Advantages:**
- Simple, fast, and efficient for large datasets.
- Works well with text classification problems (spam, sentiment).
- Requires less training data compared to many other algorithms.

**Limitations:**
- Strong independence assumption rarely holds in real-world data.
- Struggles with highly correlated features.
- Performance depends heavily on the correct choice of variant (Gaussian, Bernoulli, Multinomial).
- MultinomialNB requires non-negative features and BernoulliNB treats every feature as binary, which limits their use on raw continuous data.  

---

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

## Load the dataset

In [2]:
dataset = pd.read_csv(r"C:\Users\Arman\Downloads\dataset\logit classification.csv")

In [3]:
dataset

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0
...,...,...,...,...,...
395,15691863,Female,46,41000,1
396,15706071,Male,51,23000,1
397,15654296,Female,50,20000,1
398,15755018,Male,36,33000,0


## Feature Selection
### Split into features (X) and target (y)
- X: Features (Age, EstimatedSalary)
- y: Target (Purchased)

In [4]:
X = dataset[["Age", "EstimatedSalary"]].values
y = dataset["Purchased"].values

## Splitting the dataset into the Training set and Test set

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

## Feature Scaling

## Apply StandardScaler

In [6]:
sc = StandardScaler() 
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

## Apply Normalizer

In [7]:
sc_norm = Normalizer()
X_train_norm = sc_norm.fit_transform(X_train)
X_test_norm = sc_norm.transform(X_test)
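
Unlike StandardScaler, which rescales each column, `Normalizer` rescales each row (sample) to unit length, using the L2 norm by default. The sketch below, using the first two Age/EstimatedSalary rows from the dataset preview, suggests what that does when one feature is much larger than the other. It is only an illustration, not part of the original experiment.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# First two (Age, EstimatedSalary) rows from the dataset preview.
rows = np.array([[19.0, 19000.0],
                 [35.0, 20000.0]])

normalized = Normalizer().fit_transform(rows)

# The same result by hand: divide each row by its own L2 norm.
manual = rows / np.linalg.norm(rows, axis=1, keepdims=True)

print(normalized)
print(np.allclose(normalized, manual))
```

Because the salary dominates each row's norm, the second column ends up very close to 1 for every sample and the first column becomes roughly Age / EstimatedSalary, so most of the original scale information is lost.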

## Training and Evaluating Naive Bayes Models

## BernoulliNB

### With Scaling
#### StandardScaler

In [8]:
# Training the  model on the Training set
classifier1 = BernoulliNB() 
classifier1.fit(X_train_sc, y_train)

# Predicting the Test set results
y_pred1 = classifier1.predict(X_test_sc)

# Evaluation of the model
print("BernoulliNB with StandardScaler")
print("Accuracy:", accuracy_score(y_test, y_pred1))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred1))
print("Classification Report:\n", classification_report(y_test, y_pred1))

BernoulliNB with StandardScaler
Accuracy: 0.79
Confusion Matrix:
 [[63  5]
 [16 16]]
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.93      0.86        68
           1       0.76      0.50      0.60        32

    accuracy                           0.79       100
   macro avg       0.78      0.71      0.73       100
weighted avg       0.79      0.79      0.78       100



#### Normalizer

In [9]:
# Training the  model on the Training set
classifier2 = BernoulliNB() 
classifier2.fit(X_train_norm, y_train)

# Predicting the Test set results
y_pred2 = classifier2.predict(X_test_norm)

# Evaluation of the model
print("BernoulliNB with NormalizerScaler")
print("Accuracy:", accuracy_score(y_test, y_pred2))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred2))
print("Classification Report:\n", classification_report(y_test, y_pred2))

BernoulliNB with NormalizerScaler
Accuracy: 0.68
Confusion Matrix:
 [[68  0]
 [32  0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.68      1.00      0.81        68
           1       0.00      0.00      0.00        32

    accuracy                           0.68       100
   macro avg       0.34      0.50      0.40       100
weighted avg       0.46      0.68      0.55       100



### Without Scaling

In [10]:
# Training the  model on the Training set
classifier3 = BernoulliNB() 
classifier3.fit(X_train, y_train)

# Predicting the Test set results
y_pred3 = classifier3.predict(X_test)

# Evaluation of the model
print("BernoulliNB without Scaling")
print("Accuracy:", accuracy_score(y_test, y_pred3))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred3))
print("Classification Report:\n", classification_report(y_test, y_pred3))

BernoulliNB without Scaling
Accuracy: 0.68
Confusion Matrix:
 [[68  0]
 [32  0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.68      1.00      0.81        68
           1       0.00      0.00      0.00        32

    accuracy                           0.68       100
   macro avg       0.34      0.50      0.40       100
weighted avg       0.46      0.68      0.55       100



## GaussianNB


### With Scaling
#### StandardScaler

In [11]:
# Training the  model on the Training set
classifier4 = GaussianNB() 
classifier4.fit(X_train_sc, y_train)

# Predicting the Test set results
y_pred4 = classifier4.predict(X_test_sc)

# Evaluation of the model
print("GaussianNB with StandardScaler")
print("Accuracy:", accuracy_score(y_test, y_pred4))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred4))
print("Classification Report:\n", classification_report(y_test, y_pred4))

GaussianNB with StandardScaler
Accuracy: 0.9
Confusion Matrix:
 [[65  3]
 [ 7 25]]
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.96      0.93        68
           1       0.89      0.78      0.83        32

    accuracy                           0.90       100
   macro avg       0.90      0.87      0.88       100
weighted avg       0.90      0.90      0.90       100



#### Normalizer

In [12]:
# Training the  model on the Training set
classifier5 = GaussianNB() 
classifier5.fit(X_train_norm, y_train)

# Predicting the Test set results
y_pred5 = classifier5.predict(X_test_norm)

# Evaluation of the model
print("GaussianNB with NormalizerScaler")
print("Accuracy:", accuracy_score(y_test, y_pred5))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred5))
print("Classification Report:\n", classification_report(y_test, y_pred5))

GaussianNB with NormalizerScaler
Accuracy: 0.7
Confusion Matrix:
 [[62  6]
 [24  8]]
Classification Report:
               precision    recall  f1-score   support

           0       0.72      0.91      0.81        68
           1       0.57      0.25      0.35        32

    accuracy                           0.70       100
   macro avg       0.65      0.58      0.58       100
weighted avg       0.67      0.70      0.66       100



### Without Scaling

In [13]:
# Training the  model on the Training set
classifier6 = GaussianNB() 
classifier6.fit(X_train, y_train)

# Predicting the Test set results
y_pred6 = classifier6.predict(X_test)

# Evaluation of the model
print("GaussianNB without Scaling")
print("Accuracy:", accuracy_score(y_test, y_pred6))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred6))
print("Classification Report:\n", classification_report(y_test, y_pred6))

GaussianNB without Scaling
Accuracy: 0.9
Confusion Matrix:
 [[65  3]
 [ 7 25]]
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.96      0.93        68
           1       0.89      0.78      0.83        32

    accuracy                           0.90       100
   macro avg       0.90      0.87      0.88       100
weighted avg       0.90      0.90      0.90       100



## MultinomialNB

### With Scaling
#### StandardScaler

In [14]:
# # Training the  model on the Training set
# classifier7 = MultinomialNB() 
# classifier7.fit(X_train_sc, y_train)

# # Predicting the Test set results
# y_pred7 = classifier7.predict(X_test_sc)

# # Evaluation of the model
# print("MultinomialNB with StandardScaler")
# print("Accuracy:", accuracy_score(y_test, y_pred7))
# print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred7))
# print("Classification Report:\n", classification_report(y_test, y_pred7))

>  **Note on MultinomialNB with StandardScaler:**

MultinomialNB expects **non-negative feature values** (counts or frequencies). StandardScaler, however, transforms each feature into z-scores centered at 0, so every below-average value becomes **negative**. scikit-learn rejects negative inputs to MultinomialNB with a `ValueError`, which is why the cell above is commented out. To use MultinomialNB, keep the features as non-negative counts or discretize them into bins instead of standardizing them.
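
The sketch below reproduces the failure on synthetic data and then tries the binning option mentioned above. The data is a hypothetical stand-in for the Age/EstimatedSalary columns, and `KBinsDiscretizer` with five uniform bins is just one possible way to discretize; neither choice comes from the original experiment.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Hypothetical stand-in for the Age / EstimatedSalary columns.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(18, 60, size=40),
    rng.integers(15000, 150000, size=40),
]).astype(float)
y_demo = (X_demo[:, 0] > 40).astype(int)

# Standardizing centers each column at 0, so some values go negative.
X_std = StandardScaler().fit_transform(X_demo)
print("min after StandardScaler:", X_std.min())

try:
    MultinomialNB().fit(X_std, y_demo)
    raised = False
except ValueError as err:
    raised = True
    print("MultinomialNB rejected the input:", err)

# One alternative (an assumption, not the notebook's method): bin each
# feature and one-hot encode the bins, which keeps every value >= 0.
binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="uniform")
X_binned = binner.fit_transform(X_demo)
model = MultinomialNB().fit(X_binned, y_demo)
print("min after binning:", X_binned.min())
```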


#### Normalizer

In [15]:
# Training the  model on the Training set
classifier8 = MultinomialNB() 
classifier8.fit(X_train_norm, y_train)

# Predicting the Test set results
y_pred8 = classifier8.predict(X_test_norm)

# Evaluation of the model
print("MultinomialNB with NormalizerScaler")
print("Accuracy:", accuracy_score(y_test, y_pred8))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred8))
print("Classification Report:\n", classification_report(y_test, y_pred8))

MultinomialNB with NormalizerScaler
Accuracy: 0.68
Confusion Matrix:
 [[68  0]
 [32  0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.68      1.00      0.81        68
           1       0.00      0.00      0.00        32

    accuracy                           0.68       100
   macro avg       0.34      0.50      0.40       100
weighted avg       0.46      0.68      0.55       100



### Without Scaling

In [16]:
# Training the  model on the Training set
classifier9 = MultinomialNB() 
classifier9.fit(X_train, y_train)

# Predicting the Test set results
y_pred9 = classifier9.predict(X_test)

# Evaluation of the model
print("MultinomialNB without Scaling")
print("Accuracy:", accuracy_score(y_test, y_pred9))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred9))
print("Classification Report:\n", classification_report(y_test, y_pred9))

MultinomialNB without Scaling
Accuracy: 0.59
Confusion Matrix:
 [[49 19]
 [22 10]]
Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.72      0.71        68
           1       0.34      0.31      0.33        32

    accuracy                           0.59       100
   macro avg       0.52      0.52      0.52       100
weighted avg       0.58      0.59      0.58       100



---
## Comparison of Naive Bayes Variants

In this notebook, I explored three types of **Naive Bayes classifiers** — **BernoulliNB, GaussianNB, and MultinomialNB** — applied with different preprocessing techniques (no scaling, StandardScaler, Normalizer).

**Key findings:**
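- **GaussianNB** performed the best (≈90% accuracy, strong precision/recall), with identical results with and without StandardScaler. This makes sense: the features are continuous and roughly align with Gaussian assumptions, and because GaussianNB estimates a separate mean and variance for each feature within each class, a per-feature linear rescaling such as StandardScaler does not change its predictions in any meaningful way. Normalizer, by contrast, reduced its accuracy to ≈70%.
- **BernoulliNB** produced a useful model only with StandardScaler (≈79% accuracy). With no scaling or with Normalizer, it predicted only the majority class (0) and failed to identify any positives. scikit-learn's `BernoulliNB` binarizes inputs at a threshold of 0 by default, so all-positive raw or normalized features collapse to 1, while standardized features split around their mean. BernoulliNB is more suitable for genuinely binary features or for features binarized at a meaningful threshold.
- **MultinomialNB** performed poorly on continuous features: ≈59% accuracy without scaling and ≈68% with Normalizer, the latter only by predicting the majority class. It is designed for count/frequency data (e.g., text classification) rather than raw continuous values.
- **Preprocessing impact:** StandardScaler maintained GaussianNB's performance and improved BernoulliNB's, but cannot be combined with MultinomialNB. Normalizer generally harmed results: it rescales each sample (row) to unit length, and since EstimatedSalary is far larger than Age, every row ends up close to the same vector, which removes most of the signal.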
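
The BernoulliNB behaviour is easier to see by applying the thresholding step by hand. The sketch below uses five Age/EstimatedSalary rows from the dataset preview and mimics, with plain NumPy, what a threshold of 0 does to raw versus standardized values. It assumes the default `binarize=0.0` and is an illustration rather than the library's internal code.

```python
import numpy as np

# Five (Age, EstimatedSalary) rows from the dataset preview.
X_raw = np.array([[19.0, 19000.0],
                  [35.0, 20000.0],
                  [26.0, 43000.0],
                  [46.0, 41000.0],
                  [51.0, 23000.0]])

threshold = 0.0  # BernoulliNB's default `binarize` value

# Raw features are all positive, so every entry becomes 1.
raw_bits = (X_raw > threshold).astype(int)

# z-scores (population standard deviation, as StandardScaler uses).
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
# Standardized features split into below-mean (0) and above-mean (1).
std_bits = (X_std > threshold).astype(int)

print(raw_bits.tolist())  # [[1, 1], [1, 1], [1, 1], [1, 1], [1, 1]]
print(std_bits.tolist())  # [[0, 0], [0, 0], [0, 1], [1, 1], [1, 0]]
```

With every raw feature mapped to 1, the model has nothing to distinguish samples by and falls back on the class prior, which matches the majority-class predictions seen above.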

**Takeaways:**
- For continuous numerical features → **GaussianNB** is the best choice.
- BernoulliNB requires explicit binarization of features to be useful.
- MultinomialNB should be applied to count-based data, not continuous features.
- Always check confusion matrices and class-wise metrics, as accuracy alone can be misleading.

---

## Summary

In this notebook, I studied and implemented the **Naive Bayes Classifier**, a simple yet powerful probabilistic classification algorithm.  
I explored the mathematical foundation starting with conditional probability, Bayes’ Theorem, and the working of the Naive Bayes algorithm.  
I then applied three different variants of Naive Bayes — **GaussianNB, MultinomialNB, and BernoulliNB** — on the dataset under three scenarios:  
- Without scaling  
- With StandardScaler  
- With Normalizer  

Through experiments, I observed how each variant performs and how preprocessing techniques affect their accuracy. The results highlighted the importance of selecting the right variant based on data type and choosing appropriate preprocessing steps.  

---

## Key Takeaways
- **GaussianNB**:  
  Works best for continuous data. In these experiments StandardScaler left its accuracy unchanged (≈90%), while Normalizer reduced it to ≈70%.  

- **MultinomialNB**:  
  Suitable for count-based or frequency data (e.g., text classification).  
  Does **not work with StandardScaler**, because standardization produces negative values, which violates its requirement of non-negative inputs.  

- **BernoulliNB**:  
  Best for binary/boolean features.  
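  Preprocessing matters a great deal here, because scikit-learn's `BernoulliNB` binarizes inputs at a threshold of 0 by default: standardized features split around their mean (≈79% accuracy), whereas raw or normalized positive features all become 1 and the model predicts only the majority class (68%).  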

- **Scaling Matters**:  
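  In these experiments GaussianNB was unaffected by StandardScaler but degraded by Normalizer, BernoulliNB became useful only after StandardScaler, and MultinomialNB could not accept standardized (negative) inputs at all.  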

- **Practical Insight**:  
  The choice of Naive Bayes variant should depend on the type of data (continuous, counts, or binary), and preprocessing should be applied carefully to avoid invalid inputs.  