In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

# Naive Bayes

Naive Bayes is a classification algorithm that predicts the probability of an input's class based on its features. It's based on Bayes' theorem, which calculates the probability of a hypothesis being true given the evidence. It's known for simplicity, speed, and effectiveness, particularly in text classification tasks. 

[Naive Bayes, Clearly Explained!!!](https://youtu.be/O2L2Uv9pdDA?si=r1I-t3QSuMnGW18W)  
[Gaussian Naive Bayes, Clearly Explained!!!](https://youtu.be/H3EjCKtlVog?si=3n4x1aZ6gHI1JJPW)

![](https://databasecamp.de/wp-content/uploads/naive-bayes-overview-1024x709.png)

## Intuition

Naive Bayes relies on conditional probability. It assumes that the features are independent of one another given the class label. This simplifies the calculations and makes the method computationally efficient, especially for large datasets. For categorical features, the conditional probabilities $P(X_i|y)$ are estimated from counts; for numerical features, they are estimated from a distribution (typically a Gaussian) fitted to each feature within each class. Despite its simplifying *naive* assumption, Naive Bayes often performs well, particularly with text data, and its speed and simplicity make it suitable for real-time applications. While not without limitations, it serves as a solid baseline for many classification tasks.

**Prior Probability**:

   $$P(y) = \frac{{\text{Number of samples with target/class } y}}{{\text{Total number of samples}}}$$

**Conditional Probability of Feature Given Target/Class**:

   $$P(X_i|y) = \frac{{\text{Number of samples with feature } X_i \text{ and target/class } y}}{{\text{Number of samples with target/class } y}}$$

**Posterior Probability using Bayes' Theorem**:

   $$P(y|X_{\text{test}}) = \frac{{P(X_{\text{test}}|y) \cdot P(y)}}{{P(X_{\text{test}})}}$$

- where $P(X_{\text{test}}|y)$ is calculated using the Naive Bayes assumption of feature independence: 

   $$P(X_{\text{test}}|y) = \prod_{i=1}^n P(X_i|y) = P(X_1|y) \times P(X_2|y) \times \ldots \times P(X_n|y)$$
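
The formulas above can be traced by hand on a small example. The sketch below uses a hypothetical eight-row table (the rows and the helper names `prior`, `conditional`, and `posterior` are invented for illustration) and computes the prior, the per-feature conditionals, and the normalised posterior using only counts.

```python
from collections import Counter

# Hypothetical toy data: (review, education, purchased)
rows = [
    ("Good", "UG", "Yes"),
    ("Good", "PG", "Yes"),
    ("Poor", "PG", "Yes"),
    ("Good", "PG", "Yes"),
    ("Poor", "UG", "No"),
    ("Poor", "UG", "No"),
    ("Good", "UG", "No"),
    ("Poor", "PG", "No"),
]

class_counts = Counter(r[-1] for r in rows)
n_samples = len(rows)

def prior(y):
    # P(y): share of samples that belong to class y
    return class_counts[y] / n_samples

def conditional(feature_index, value, y):
    # P(X_i | y): share of class-y samples whose feature i equals value
    matches = sum(1 for r in rows if r[feature_index] == value and r[-1] == y)
    return matches / class_counts[y]

def posterior(x):
    # Numerator of Bayes' theorem for each class: P(y) * prod_i P(X_i | y)
    scores = {}
    for y in class_counts:
        score = prior(y)
        for i, value in enumerate(x):
            score *= conditional(i, value, y)
        scores[y] = score
    # P(X_test) under the naive model is the sum of the numerators
    evidence = sum(scores.values())
    return {y: s / evidence for y, s in scores.items()}

post = posterior(("Good", "PG"))
print(round(post["Yes"], 3))  # 0.9
```

Here $P(\text{Yes}) = 4/8$, $P(\text{Good}|\text{Yes}) = 3/4$ and $P(\text{PG}|\text{Yes}) = 3/4$, so the numerator for "Yes" is $0.28125$ against $0.03125$ for "No", which normalises to $0.9$.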

## Algorithm


1. **Input Data**: Receive labeled training data consisting of features $X$ and corresponding class labels $y$ for classification or continuous target variable $y$ for regression.
2. **Training Phase**:
   - Calculate prior probabilities of each class $P(y)$ or probability density function of $y$.
   - Compute conditional probabilities of each feature given the class $P(X_i|y)$.
3. **Prediction**:
   - For classification: Given a new input $X_{\text{test}}$, calculate $P(y|X_{\text{test}})$ using Bayes' theorem. Choose the class with the highest probability as the predicted class for $X_{\text{test}}$.
   - For regression: Given a new input $X_{\text{test}}$, calculate the conditional probability density function $P(y|X_{\text{test}})$ using Bayes' theorem. Estimate the expected value of the target variable $y$ using the conditional probability density function.
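
The classification steps above can be sketched from scratch in a few lines. This is a minimal illustration, not a reference implementation: the class name `TinyCategoricalNB` is hypothetical, and two details are assumptions that the steps above do not mention, namely additive (Laplace) smoothing through `alpha` and summing log-probabilities instead of multiplying raw probabilities.

```python
import math
from collections import Counter, defaultdict

class TinyCategoricalNB:
    """Hypothetical minimal Naive Bayes classifier for categorical features."""

    def __init__(self, alpha=1.0):
        # Additive smoothing (an assumption): avoids log(0) for unseen combinations
        self.alpha = alpha

    def fit(self, X, y):
        n_features = len(X[0])
        self.class_counts_ = Counter(y)          # used for the priors P(y)
        self.n_samples_ = len(y)
        # value_counts_[i][c][v] = number of class-c samples with feature i == v
        self.value_counts_ = [defaultdict(Counter) for _ in range(n_features)]
        self.categories_ = [set() for _ in range(n_features)]
        for row, label in zip(X, y):
            for i, value in enumerate(row):
                self.value_counts_[i][label][value] += 1
                self.categories_[i].add(value)
        return self

    def _log_score(self, row, label):
        # log P(y) + sum_i log P(X_i | y), with smoothed conditionals
        n_class = self.class_counts_[label]
        score = math.log(n_class / self.n_samples_)
        for i, value in enumerate(row):
            count = self.value_counts_[i][label][value]
            n_categories = len(self.categories_[i])
            score += math.log((count + self.alpha) / (n_class + self.alpha * n_categories))
        return score

    def predict(self, X):
        # Choose the class with the highest (log) posterior numerator
        return [max(self.class_counts_, key=lambda c: self._log_score(row, c)) for row in X]

# Usage on hypothetical data: (review, education) -> purchased
X_toy = [("Good", "UG"), ("Good", "PG"), ("Poor", "PG"), ("Good", "PG"),
         ("Poor", "UG"), ("Poor", "UG"), ("Good", "UG"), ("Poor", "PG")]
y_toy = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

toy_model = TinyCategoricalNB().fit(X_toy, y_toy)
predictions = toy_model.predict([("Good", "PG"), ("Poor", "UG")])
print(predictions)  # ['Yes', 'No']
```

The denominator $P(X_{\text{test}})$ is the same for every class, so it can be skipped when only the most probable class is needed.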

## Types of Input Features

1. Categorical Features
2. Numerical Features

### Categorical Features

In [2]:
link = 'https://raw.githubusercontent.com/daaanishhh002/MachineLearning/main/Datasets/education.csv'
df = pd.read_csv(link)

df.sample(5)

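|    | age | gender | review | education | purchased |
|----|-----|--------|--------|-----------|-----------|
| 7  | 60  | Female | Poor   | School    | Yes       |
| 49 | 25  | Female | Good   | UG        | No        |
| 18 | 19  | Male   | Good   | School    | No        |
| 47 | 38  | Female | Good   | PG        | Yes       |
| 11 | 74  | Male   | Good   | UG        | Yes       |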


In [12]:
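# Features: every column except the last; target: the last column (purchased)
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [15]:
# One-hot encode the string columns (gender, review, education); age stays numeric
X_encoded = pd.get_dummies(X,drop_first=True)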

In [16]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X_encoded,y,test_size=0.2,random_state=2002)

In [17]:
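from sklearn.naive_bayes import CategoricalNB
cnb = CategoricalNB()

# CategoricalNB reads every column as integer category codes,
# so each distinct age value is treated as its own category here
cnb.fit(X_train,y_train)
# Mean accuracy on the held-out 20%
cnb.score(X_test,y_test)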

0.4
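
`CategoricalNB` expects each column to hold integer category codes (0, 1, 2, ...), so one alternative to one-hot encoding is to ordinal-encode the string columns first. The sketch below tries this on a small hypothetical frame that borrows the column names above; the rows are invented for illustration, `age` is left out, and no claim is made about how this would score on the real dataset.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows; only the column names mirror the dataset above
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "review": ["Poor", "Good", "Good", "Poor", "Poor", "Good"],
    "education": ["School", "UG", "PG", "UG", "PG", "School"],
    "purchased": ["Yes", "No", "Yes", "No", "Yes", "No"],
})
toy_features = toy[["gender", "review", "education"]]
toy_target = toy["purchased"]

# Map each string category to an integer code 0..k-1, column by column
encoder = OrdinalEncoder()
toy_codes = encoder.fit_transform(toy_features).astype(int)

toy_cnb = CategoricalNB()
toy_cnb.fit(toy_codes, toy_target)

# Reuse the fitted encoder for new rows so the codes stay consistent
new_rows = encoder.transform(toy_features.iloc[:2]).astype(int)
toy_pred = toy_cnb.predict(new_rows)
print(toy_pred)
```

A numerical column such as `age` would need to be binned into categories first, or handled by a separate model such as `GaussianNB`.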

In [18]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

In [19]:
gnb = GaussianNB()
bnb = BernoulliNB()
mnb = MultinomialNB()

In [20]:
gnb.fit(X_train,y_train)
bnb.fit(X_train,y_train)
mnb.fit(X_train,y_train)

In [25]:
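# A notebook cell displays only its last expression, so the value shown below
# is mnb's score; wrap each call in print() to see all three
gnb.score(X_test,y_test)
bnb.score(X_test,y_test)
mnb.score(X_test,y_test)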

0.3
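
To compare several variants side by side, the scores can be collected in a dictionary and printed together. The sketch below is only an illustration of that pattern: it runs on hypothetical synthetic binary features, not on the education dataset, and the names `X_demo`, `y_demo`, and `demo_scores` are invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical synthetic data: 200 samples, 6 binary (one-hot style) features
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 6))
# The label is 1 when either of the first two features is 1
y_demo = X_demo[:, 0] | X_demo[:, 1]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_models = {
    "GaussianNB": GaussianNB(),
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
}

# Fit each model and record its mean accuracy on the held-out split
demo_scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in demo_models.items()}

for name, score in demo_scores.items():
    print(f"{name}: {score:.2f}")
```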

### Numerical Features

In [31]:
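from sklearn.datasets import load_iris

# Iris: 150 samples, 4 numerical features, 3 classes
iris = load_iris()

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# GaussianNB models each feature within a class as a normal distribution
model = GaussianNB()
model.fit(X_train, y_train)
model.score(X_test,y_test)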

0.9777777777777777
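
For numerical features, Gaussian Naive Bayes replaces the count-based $P(X_i|y)$ with a normal density whose mean and variance are estimated per class. The sketch below shows that idea for a single feature in plain Python. The data and the names `samples`, `gaussian_pdf`, and `gaussian_posterior` are hypothetical, and the sketch leaves out the small variance smoothing term that scikit-learn's `GaussianNB` adds.

```python
import math
from statistics import mean, pvariance

# Hypothetical one-feature data, grouped by class
samples = {
    "short": [1.0, 1.2, 1.4, 1.6],
    "long": [4.0, 4.5, 5.0, 5.5],
}

def gaussian_pdf(x, mu, var):
    # Normal density with mean mu and variance var, evaluated at x
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

n_total = sum(len(values) for values in samples.values())

# Per class: (mean, population variance, prior)
params = {
    label: (mean(values), pvariance(values), len(values) / n_total)
    for label, values in samples.items()
}

def gaussian_posterior(x):
    # Numerator of Bayes' theorem per class: P(y) * density of x under class y
    scores = {
        label: class_prior * gaussian_pdf(x, mu, var)
        for label, (mu, var, class_prior) in params.items()
    }
    evidence = sum(scores.values())
    return {label: s / evidence for label, s in scores.items()}

gauss_post = gaussian_posterior(1.5)
print(max(gauss_post, key=gauss_post.get))  # short
```

With several features, the per-feature densities are multiplied together, exactly as the conditional probabilities are in the categorical case.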