# Naive Bayes Classifier

* Naive Bayes is a supervised learning algorithm based on **Bayes’ theorem** and used for classification problems.
* It is widely used in **text classification**, where the training data is high-dimensional.
* It is one of the simplest classification algorithms and is often effective in practice; its models train quickly and make **quick predictions**.
* It is a **probabilistic classifier**: it predicts the class with the highest estimated probability given the observed features.
* Popular applications of Naive Bayes include **spam filtering, sentiment analysis, and article classification**.

## Bayes’ Theorem

Bayes’ theorem gives the probability of an event given that another event has already occurred. It is stated mathematically as:

$$P(A|B) = \frac{P(A)P(B|A)}{P(B)} $$

where $A$ and $B$ are events and $P(B) \neq 0$.

* We want the probability of event $A$ given that event $B$ is true. Event $B$ is also called the **evidence**.
* $P(A)$ is the **prior probability** of $A$, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, event $B$).
* $P(B)$ is the **marginal probability**: the probability of the evidence.
* $P(A|B)$ is the **posterior probability** of $A$, i.e. the probability of the event after the evidence is seen.
* $P(B|A)$ is the **likelihood**, i.e. the probability of observing the evidence $B$ given that the hypothesis $A$ is true.
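
To make these terms concrete, here is a small worked example in Python. All numbers are made up for illustration: assume 20% of messages are spam, and the word "free" appears in 60% of spam messages and 5% of ham messages.

```python
# Hypothetical numbers, chosen only for illustration
p_spam = 0.2                 # prior P(A): the message is spam
p_ham = 1 - p_spam           # prior of the complementary event
p_free_given_spam = 0.6      # likelihood P(B|A): "free" appears in a spam message
p_free_given_ham = 0.05      # likelihood of "free" in a ham message

# Marginal probability P(B), via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Posterior P(A|B) from Bayes' theorem
p_spam_given_free = p_spam * p_free_given_spam / p_free

print(f"P(free) = {p_free:.2f}")
print(f"P(spam | free) = {p_spam_given_free:.2f}")
```

Under these assumed numbers, seeing the word "free" raises the probability of spam from the prior of 0.20 to a posterior of 0.75.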

For our dataset, we can apply Bayes’ theorem in the following way:

$$P(y|X) = \frac{P(y)P(X|y)}{P(X)} $$

where $y$ is the class variable and $X$ is a feature vector of size $n$:
$X=(x_1,x_2,x_3,\dots,x_n)$

## Naive assumption

Now it's time to add the naive assumption to Bayes’ theorem: the features are independent of one another. This lets us split the evidence into independent parts.

Recall that if any two events $A$ and $B$ are independent, then

$P(A,B) = P(A)P(B)$, or equivalently $P(A\cap B) = P(A)P(B)$
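
Applied to the features, the assumption that Naive Bayes relies on is independence of the features *given the class*. As a sketch of the intermediate step, the class-conditional likelihood of the whole feature vector then factorizes as:

$$
P\left(x_1, x_2, \ldots, x_n \mid y\right) = P\left(x_1 \mid y\right) P\left(x_2 \mid y\right) \cdots P\left(x_n \mid y\right) = \prod_{i=1}^n P\left(x_i \mid y\right)
$$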



## Model

Hence, we arrive at the result:
$$
P\left(y \mid x_1, \ldots, x_n\right)=\frac{P\left(x_1 \mid y\right) P\left(x_2 \mid y\right) \ldots P\left(x_n \mid y\right) P(y)}{P\left(x_1\right) P\left(x_2\right) \ldots P\left(x_n\right)}
$$
which can be expressed as:
$$
P\left(y \mid x_1, \ldots, x_n\right)=\frac{P(y) \prod_{i=1}^n P\left(x_i \mid y\right)}{P\left(x_1\right) P\left(x_2\right) \ldots P\left(x_n\right)}
$$

Since the denominator does not depend on $y$, it is constant for a given input, and we can drop it:
$$
P\left(y \mid x_1, \ldots, x_n\right) \propto P(y) \prod_{i=1}^n P\left(x_i \mid y\right)
$$

Now we can build the classifier. We evaluate this expression for every possible value of the class variable $y$ and pick the class with the maximum value. Mathematically:
$$
\hat{y}=\operatorname{argmax}_y P(y) \prod_{i=1}^n P\left(x_i \mid y\right)
$$

So, finally, we are left with the task of calculating $P(y)$ and $P(x_i|y)$.
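
As a minimal from-scratch sketch of this rule, the code below estimates $P(y)$ and $P(x_i|y)$ as relative frequencies and then takes the argmax. The toy dataset and the helper names (`likelihood`, `predict`) are made up for illustration and are not part of the spam example that follows.

```python
from collections import Counter, defaultdict

# Toy training set (made up): each row is (outlook, windy), the label is "play"
X_toy = [
    ("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
    ("rainy", "no"), ("sunny", "no"), ("rainy", "yes"),
]
y_toy = ["yes", "no", "no", "yes", "yes", "yes"]

# P(y): relative frequency of each class
class_counts = Counter(y_toy)
priors = {c: n / len(y_toy) for c, n in class_counts.items()}

# P(x_i | y): relative frequency of each feature value within a class
value_counts = defaultdict(Counter)   # key: (feature index, class)
for row, label in zip(X_toy, y_toy):
    for i, value in enumerate(row):
        value_counts[(i, label)][value] += 1

def likelihood(i, value, label):
    return value_counts[(i, label)][value] / class_counts[label]

def predict(row):
    # Score each class with P(y) * prod_i P(x_i | y), then take the argmax
    scores = {}
    for label in priors:
        score = priors[label]
        for i, value in enumerate(row):
            score *= likelihood(i, value, label)
        scores[label] = score
    return max(scores, key=scores.get), scores

label, scores = predict(("sunny", "yes"))
print(label, scores)
```

For the input `("sunny", "yes")`, the scores are 1/12 for class `yes` and 1/6 for class `no`, so the sketch predicts `no` even though `yes` has the larger prior.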

## Example: Spam Email Detection

* Dataset: spam.csv
* This is a CSV file containing 5572 messages and their respective labels (`ham` or `spam`) for spam classification.
* Source: https://www.kaggle.com/code/satarupadeb/na-ve-bayes-classification-spam-email-detection/input

### Load the dataset

```python
import numpy as np
import pandas as pd

# Read the data from the .csv file
data = pd.read_csv('spam.csv', encoding='latin-1')
```

```python
# Display the first 3 rows
data.head(3)
```

| Unnamed: 0 | Label | Text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |


### Train Test Split
Let us split the dataset into train and test datasets.

```python
from sklearn.model_selection import train_test_split

# Separate features (X) and target labels (y)
X = data['Text']
y = data['Label']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
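
As a side note, the labels in this dataset are imbalanced (far more ham than spam). One possible variation, which the original notebook does not use, is to pass `stratify=y` so that both splits keep roughly the same class ratio. The sketch below uses a small made-up DataFrame rather than `spam.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced toy data: 80 ham, 20 spam
toy = pd.DataFrame({
    "Text": [f"message {i}" for i in range(100)],
    "Label": ["ham"] * 80 + ["spam"] * 20,
})

# stratify keeps the ham/spam ratio (80/20) the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Text"], toy["Label"], test_size=0.2, random_state=42, stratify=toy["Label"]
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```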

**Question**: Define the target variable and the features from the dataset. Is it possible to consider the column 'Text' as the feature vector?

### Feature Extraction
* Feature extraction refers to the process of transforming raw data into numerical features that can be processed while preserving the information in the original data set.
* The **sklearn.feature_extraction** module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.

#### Text feature extraction

* We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors.
* The main vectorizers provided by `scikit-learn` are:
> 1. **CountVectorizer:** Converts a collection of text documents to a matrix of token counts.
> 2. **TfidfVectorizer:** Converts a collection of raw documents to a matrix of TF-IDF features.
> 3. **HashingVectorizer:** Converts a collection of text documents to a matrix of token occurrences using the hashing trick.
> 4. **DictVectorizer:** Transforms lists of feature-value mappings (dicts) to vectors; unlike the other three, it is not specific to text.

To learn more about these, refer to:

https://scikit-learn.org/stable/modules/feature_extraction.html

```python
# Example of count vectorization for feature extraction
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the training data (X_train)
X_train_vectorized = vectorizer.fit_transform(X_train)
```

```python
print(y_train)
```

```
1978     ham
3989    spam
3935     ham
4078     ham
4086    spam
        ... 
3772     ham
5191     ham
5226     ham
5390     ham
860      ham
Name: Label, Length: 4457, dtype: object
```


### Different Naive Bayes algorithms in sklearn


*  **BernoulliNB**: The Bernoulli model is suitable when our feature vectors are binary, meaning they can only take two values (usually 0 and 1). In the context of text classification with a 'bag of words' model, the 1s represent "word occurs in the document," and the 0s represent "word does not occur in the document." This model is useful when we want to represent presence or absence of certain features in our data.

* **CategoricalNB**: Naive Bayes classifier for categorical features.

* **ComplementNB**: The Complement Naive Bayes classifier described in Rennie et al. (2003).

* **GaussianNB**: Gaussian Naive Bayes assumes that the continuous features describing the data (such as measurements) follow a normal distribution within each class. Most values cluster around the class mean, and fewer values lie far from it.

* **MultinomialNB**: The multinomial model is used for discrete counts. For example, in text classification, instead of just checking whether a word occurs in a document (as in Bernoulli), we count how many times it appears. It's like counting how many times a specific outcome (word) is observed over several trials (words in the document).

To learn more about these, refer to:
1. https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes
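
To illustrate the difference between the Bernoulli and multinomial models, here is a sketch on a tiny made-up count matrix. `BernoulliNB` binarizes its input by default (`binarize=0.0`), so it only sees presence or absence, whereas `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up word counts for 4 documents over a 3-word vocabulary
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 2, 1],
    [0, 4, 0],
])
y_labels = np.array(["spam", "spam", "ham", "ham"])
X_binary = (X_counts > 0).astype(int)   # presence/absence only

# BernoulliNB binarizes internally, so counts and 0/1 input give the same model
bnb_counts = BernoulliNB().fit(X_counts, y_labels)
bnb_binary = BernoulliNB().fit(X_binary, y_labels)
same = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)
print("BernoulliNB ignores count magnitude:", same)

# MultinomialNB uses the counts, so the two inputs give different estimates
mnb_counts = MultinomialNB().fit(X_counts, y_labels)
mnb_binary = MultinomialNB().fit(X_binary, y_labels)
differs = not np.allclose(mnb_counts.feature_log_prob_, mnb_binary.feature_log_prob_)
print("MultinomialNB depends on counts:", differs)
```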

```python
# Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB

# Train the Bernoulli Naive Bayes classifier
# (BernoulliNB binarizes the count features by default: any count > 0 becomes 1)
classifier = BernoulliNB()
classifier.fit(X_train_vectorized, y_train)
```

```python
# Alternative: Multinomial Naive Bayes (uncomment to train it instead of BernoulliNB)
# from sklearn.naive_bayes import MultinomialNB
#
# classifier = MultinomialNB()
# classifier.fit(X_train_vectorized, y_train)
```

### Predictions

* To predict whether a new email is spam or ham, we first need to vectorize the new email using the same vectorization technique that we applied to the training dataset.

```python
# Transform the test data (X_test)
X_test_vectorized = vectorizer.transform(X_test)
```

```python
# Make predictions on the test data
y_pred = classifier.predict(X_test_vectorized)
```

### Model evaluation

```python
from sklearn.metrics import classification_report

classification_rep = classification_report(y_test, y_pred)

print("Classification Report:")
print(classification_rep)
```

```
Classification Report:
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99       965
        spam       1.00      0.81      0.90       150

    accuracy                           0.97      1115
   macro avg       0.99      0.91      0.94      1115
weighted avg       0.98      0.97      0.97      1115
```
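
To clarify how the precision and recall columns are derived, here is a sketch that computes them from a confusion matrix on made-up labels, treating `spam` as the positive class. Read the same way, the spam recall of 0.81 in the report above means that roughly 19% of the spam messages in the test set were labeled ham.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels
y_true_toy = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
y_pred_toy = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham", "spam"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["ham", "spam"])
tn, fp, fn, tp = cm.ravel()   # treating "spam" as the positive class

precision = tp / (tp + fp)    # of the messages flagged as spam, how many are spam
recall = tp / (tp + fn)       # of the true spam messages, how many were caught

print(cm)
print(precision, recall)

# The same values from scikit-learn's helpers
sk_precision = precision_score(y_true_toy, y_pred_toy, pos_label="spam")
sk_recall = recall_score(y_true_toy, y_pred_toy, pos_label="spam")
print(sk_precision, sk_recall)
```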

# Pipeline
A sequence of data transformers with an optional final predictor.

* `Pipeline` allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Read the data from the .csv file
data = pd.read_csv('spam.csv', encoding='latin-1')

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(data['Text'], data['Label'], test_size=0.2, random_state=42)

# Pipeline of CountVectorizer followed by a MultinomialNB classifier
spam_filter = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Train on count vectors with Naive Bayes
spam_filter.fit(X_train, y_train)

# Make predictions on the test data
y_pred = spam_filter.predict(X_test)

classification_rep = classification_report(y_test, y_pred)

print("Classification Report:")
print(classification_rep)
```

```
Classification Report:
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       965
        spam       0.99      0.89      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
```
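
One convenience of a pipeline is that it accepts raw strings at prediction time, because the vectorization step runs inside it. A minimal sketch on made-up messages (the texts and the name `toy_filter` are illustrative, not from `spam.csv`):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages and labels
texts = [
    "win a free prize now", "free entry to win cash",
    "are we meeting for lunch", "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

toy_filter = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_filter.fit(texts, labels)

# The pipeline vectorizes raw strings itself before classifying them
new_messages = ["free cash prize", "lunch at home"]
predictions = toy_filter.predict(new_messages)
print(predictions)
print(toy_filter.predict_proba(new_messages).round(3))
```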



# Laplace smoothing

Laplace smoothing, also known as additive smoothing, is a technique used in Naive Bayes (NB) classifiers to handle the problem of zero probability in categorical data. It's particularly useful when dealing with text classification where the dataset may not be comprehensive enough to include all possible words in the training sample.

Here's how it works:
- **Zero Probability Issue**: Without smoothing, if a word or feature never occurs together with a given class in the training data, its estimated probability given that class is zero. Because the class score is a product of such probabilities, a single zero makes the whole posterior for that class zero, regardless of the other features.
- **Additive Smoothing**: To avoid this, Laplace smoothing adds a small positive value (usually 1) to the count of each word for every class, whether or not the word has been observed with that class.
- **Updated Probabilities**: The probability estimates are then renormalized accordingly. This ensures that no probability is exactly zero, so a single unseen word-class combination cannot rule out a class.
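
The usual form of the smoothed estimate is $\hat{P}(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha |V|}$, where $N_{wc}$ is the count of word $w$ in class $c$, $N_c$ is the total word count in class $c$, and $|V|$ is the vocabulary size. The sketch below applies it to made-up counts; the dictionary and the helper `smoothed_prob` are illustrative only.

```python
# Made-up word counts observed in the "spam" class
spam_counts = {"free": 3, "win": 2, "meeting": 0}   # "meeting" never seen in spam
vocab_size = len(spam_counts)
total = sum(spam_counts.values())

def smoothed_prob(word, alpha):
    # (count + alpha) / (total count + alpha * vocabulary size)
    return (spam_counts[word] + alpha) / (total + alpha * vocab_size)

for alpha in (0.0, 1.0):
    probs = {w: round(smoothed_prob(w, alpha), 3) for w in spam_counts}
    print(f"alpha={alpha}: {probs}")
```

With `alpha=0.0` the unseen word gets probability 0; with `alpha=1.0` it gets 1/8 = 0.125, and the three probabilities still sum to 1.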





```
class sklearn.naive_bayes.MultinomialNB(*, alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None)
```
**Parameters:**

**alpha**: float or array-like of shape (n_features,), default=1.0

Additive (Laplace/Lidstone) smoothing parameter (set alpha=0 and force_alpha=True, for no smoothing).

**force_alpha**: bool, default=True

If False and alpha is less than 1e-10, it will set alpha to 1e-10. If True, alpha will remain unchanged. This may cause numerical errors if alpha is too close to 0.
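
As a sketch of how `alpha` affects the fitted model, the code below trains `MultinomialNB` on a small made-up count matrix with two values of `alpha`, then checks the `alpha=1.0` estimates against the additive smoothing formula by hand. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up counts: 2 documents per class over a 3-word vocabulary
X_counts = np.array([[3, 1, 0], [0, 1, 0], [0, 0, 2], [0, 1, 3]])
y_labels = np.array(["spam", "spam", "ham", "ham"])

for alpha in (1.0, 0.1):
    model = MultinomialNB(alpha=alpha).fit(X_counts, y_labels)
    # feature_log_prob_ holds log P(x_i | y); rows follow model.classes_
    print(alpha, model.classes_, np.exp(model.feature_log_prob_).round(3))

# Check the alpha=1.0 estimates for "spam" against the smoothing formula
model = MultinomialNB(alpha=1.0).fit(X_counts, y_labels)
spam_row = list(model.classes_).index("spam")
spam_word_counts = X_counts[y_labels == "spam"].sum(axis=0)   # per-word totals in spam
by_hand = (spam_word_counts + 1.0) / (spam_word_counts.sum() + 1.0 * X_counts.shape[1])
matches = np.allclose(np.exp(model.feature_log_prob_[spam_row]), by_hand)
print(matches)
```

A smaller `alpha` keeps the estimates closer to the raw relative frequencies, while a larger `alpha` pulls them toward a uniform distribution over the vocabulary.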



