# Naive Bayes

Naive Bayes is a probabilistic model based on Bayes' theorem (hence the name) and is used in a wide variety of classification tasks.

## Bayes' Theorem

Recall Bayes' Theorem from probability theory:

$$ \text{Posterior} = \frac{\text{Likelihood} \cdot \text{Prior}}{\text{Evidence}} $$

or

$$ P(y|X) = \frac{P(X|y)P(y)}{P(X)} $$

Here $X$ is the multi-dimensional input data (say, a Bag of Words representation of a set of documents), whereas $y$ is the corresponding set of labels (say, each document's sentiment).
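As a small numeric illustration of the formula above, the sketch below computes the posterior for a two-class sentiment problem. The class names and all probability values are made up for illustration; they do not come from any dataset.

```python
def posterior(likelihoods, priors):
    """Return P(y|X) for every class y, given P(X|y) and P(y)."""
    # Numerator of Bayes' theorem: likelihood * prior for each class.
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    # Evidence P(X): the numerator summed over all classes.
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


# Hypothetical values for a single document X.
priors = {"positive": 0.6, "negative": 0.4}        # P(y)
likelihoods = {"positive": 0.02, "negative": 0.05}  # P(X|y)

post = posterior(likelihoods, priors)
print(post)
```

Although the positive class has the larger prior here, the negative class ends up with the larger posterior because its likelihood is higher.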

<img align="center" src="../assets/sentiment.png" width="80%">

## Naive Assumption

Note that the likelihood term $P(X|y)$ in Bayes' theorem can get very complicated to compute for high-dimensional data, where $X$ may be composed of many features.

Naive Bayes makes a **simplifying naive assumption** that the features are independent of each other, given the class. In other words,

$$ P(X|y) = \prod_{i=1}^{D} P(x_i|y) $$

<img src="../assets/naive_assumption.png" width="50%" align="center">

This assumption is naive because, in most cases, features are not independent. It is especially unrealistic for text data, where the presence of a word in a document is highly correlated with the presence of other words in the document. Hence the popular adage in NLP, due to Firth: _"You shall know a word by the company it keeps"_.

However, the naive assumption makes the computation of the likelihood term tractable and still yields remarkably good results in practice.
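To get a feel for why the factorized form is so much cheaper, the sketch below counts the free parameters needed per class with and without the naive assumption. It assumes binary features (for example, word present or absent), which is a simplification chosen only for this illustration; the function names are hypothetical.

```python
def joint_table_params(num_features):
    # A full joint distribution over D binary features has 2^D outcomes.
    # The probabilities must sum to 1, leaving 2^D - 1 free parameters.
    return 2 ** num_features - 1


def naive_bayes_params(num_features):
    # Under the naive assumption, each binary feature needs only one
    # parameter per class: P(x_i = 1 | y).
    return num_features


D = 20
full = joint_table_params(D)
naive = naive_bayes_params(D)
print(f"D={D}: full joint needs {full} parameters per class, naive needs {naive}")
```

The full table grows exponentially with the number of features, while the naive factorization grows linearly.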

The Naive Bayes model is a **generative model**, since it makes an assumption about how the data is generated. It assumes that the data is generated by first sampling a class $y$ from the prior distribution $P(y)$ and then sampling each feature $x_i$ from the likelihood distribution $P(x_i|y)$.
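The generative story can be sketched as a tiny sampler: draw a class from the prior, then draw each word independently from that class's word distribution. The vocabulary, the probabilities, and the function name `sample_document` are all hypothetical, and treating each word position as one independent draw is an assumption made for this illustration.

```python
import random


def sample_document(priors, word_probs, length, rng):
    """Sample (label, words) following the Naive Bayes generative story."""
    # Step 1: sample a class y from the prior P(y).
    classes = list(priors)
    y = rng.choices(classes, weights=[priors[c] for c in classes], k=1)[0]
    # Step 2: sample each word independently from P(word | y).
    vocab = list(word_probs[y])
    weights = [word_probs[y][w] for w in vocab]
    words = rng.choices(vocab, weights=weights, k=length)
    return y, words


# Hypothetical parameters for a two-class sentiment model.
priors = {"positive": 0.5, "negative": 0.5}
word_probs = {
    "positive": {"good": 0.5, "bad": 0.1, "movie": 0.4},
    "negative": {"good": 0.1, "bad": 0.5, "movie": 0.4},
}

rng = random.Random(0)  # seeded so the run is repeatable
label, words = sample_document(priors, word_probs, length=5, rng=rng)
print(label, words)
```

Documents sampled this way have no word order or grammar, which is exactly the information the naive assumption discards.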

## Bag of Words (BOW) as Multinomial Data

Recall that the multinomial distribution counts the number of times each outcome occurs when there are $k$ possible outcomes and $N$ independent trials. Also recall that the BOW representation of text is a count of how many times each word occurs in a document.

The BOW representation of text is therefore multinomial data, where each word in the vocabulary is a possible outcome and the length of the document is the number of trials.
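A minimal sketch of this correspondence: the count vector of a document has one entry per vocabulary word (the outcomes), and its entries sum to the document length (the trials). The three-word vocabulary, the example sentence, and the helper name `bag_of_words` are hypothetical, and whitespace tokenization is assumed for simplicity.

```python
from collections import Counter


def bag_of_words(document, vocabulary):
    """Return the count of each vocabulary word in the document."""
    # Lowercase and split on whitespace (a deliberately simple tokenizer).
    counts = Counter(document.lower().split())
    # Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]


vocabulary = ["good", "bad", "movie"]  # k = 3 possible outcomes
x = bag_of_words("Good movie good", vocabulary)
n_trials = sum(x)                      # N = document length

print(x, n_trials)
```

Words outside the vocabulary would simply be dropped by this helper, so in that case the number of trials is the count of in-vocabulary tokens.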

So we can use the multinomial distribution to model the likelihood term $P(x_i|y)$ in the prediction rule. The parameters of the multinomial distribution are estimated using Maximum Likelihood Estimation (MLE). 

The Multinomial Naive Bayes classifier takes its name from this assumption that the features follow a multinomial distribution. It is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts.
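For the multinomial model, the maximum likelihood estimates reduce to relative frequencies: the prior of a class is the fraction of documents with that label, and the probability of a word given a class is that word's share of all word occurrences in the class. The sketch below assumes a tiny hand-made dataset over a hypothetical three-word vocabulary (`good`, `bad`, `movie`) and assumes every class has at least one word occurrence; `fit_mle` is an illustrative name, not a library function.

```python
def fit_mle(X, y):
    """Estimate P(y) and P(word | y) by relative frequency (MLE).

    X is a list of BOW count vectors, y the matching list of labels.
    """
    priors, word_probs = {}, {}
    for c in sorted(set(y)):
        # Keep only the count vectors of documents labelled c.
        rows = [x for x, label in zip(X, y) if label == c]
        # Prior: fraction of documents that belong to class c.
        priors[c] = len(rows) / len(y)
        # Total occurrences of each word across the class's documents.
        totals = [sum(column) for column in zip(*rows)]
        n_words = sum(totals)
        # Likelihood parameters: each word's share of the class total.
        word_probs[c] = [t / n_words for t in totals]
    return priors, word_probs


# Hypothetical training data; columns are counts of good, bad, movie.
X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 3, 0]]
y = ["positive", "positive", "negative", "negative"]

priors, word_probs = fit_mle(X, y)
print(priors)
print(word_probs)
```

Note that a word never observed in a class receives probability zero under plain MLE, which is why smoothing is commonly added in practice.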

## Multinomial Naive Bayes Classifier

The Naive Bayes classifier is simply a **probabilistic classifier** where the prediction $\hat{y}$ is the class $y$ that maximizes the posterior probability $P(y|X)$, i.e.

$$ \hat{y} = \underset{y}{\operatorname{argmax}} P(y|X) $$

In other words, for binary classification, if $y=0$ maximizes $P(y|X)$, then the predicted class is 0. Otherwise, if $y=1$ maximizes $P(y|X)$, then the predicted class is 1.

$P(y|X)$ is proportional to $P(X|y)P(y)$, as per Bayes' theorem, because the evidence $P(X)$ does not depend on $y$. So, we can also write the prediction rule as:

$$ \hat{y} = \underset{y}{\operatorname{argmax}} P(X|y)\cdot P(y) $$

If we make the naive assumption that the features are independent of each other, given the class, then we can write the prediction rule as:

$$ \hat{y} = \underset{y}{\operatorname{argmax}} \left[ \prod_{i=1}^{D} P(x_i|y) \right] \cdot P(y) $$

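Recall our conversation on logarithms: because the logarithm is monotonically increasing, taking the log of the objective does not change which class maximizes it, and it turns the product into a sum, which avoids numerical underflow. So we can also write the prediction rule as: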

$$ \hat{y} = \underset{y}{\operatorname{argmax}} \left[ \sum_{i=1}^{D} \log P(x_i|y) + \log P(y) \right] $$
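The log-space prediction rule can be sketched in a few lines. The parameters below are hypothetical values over a three-word vocabulary (`good`, `bad`, `movie`), and `predict` is an illustrative helper rather than a library function. For count data, this sketch assumes each occurrence of a word contributes one $\log P(\text{word}|y)$ term, so a word's count multiplies its log-probability. The last lines show why the log form is preferred: a product of many small probabilities underflows to zero in floating point, while the sum of their logs stays finite.

```python
import math


def predict(x, priors, word_probs):
    """Return (best_class, scores), where scores[y] is the log of P(X|y)P(y)."""
    scores = {}
    for c in priors:
        # log P(y) plus the sum of count * log P(word | y) over the vocabulary.
        # Words with zero count are skipped, since they contribute nothing.
        scores[c] = math.log(priors[c]) + sum(
            count * math.log(p)
            for count, p in zip(x, word_probs[c])
            if count > 0
        )
    # argmax over classes of the log-score.
    return max(scores, key=scores.get), scores


# Hypothetical parameters; columns are good, bad, movie.
priors = {"positive": 0.5, "negative": 0.5}
word_probs = {
    "positive": [0.5, 0.1, 0.4],
    "negative": [0.1, 0.5, 0.4],
}

x = [2, 0, 1]  # BOW counts of the document "good movie good"
y_hat, scores = predict(x, priors, word_probs)
print(y_hat, scores)

# Underflow demonstration: 200 factors of 0.01 multiply to 0.0 in floats,
# but the equivalent sum of logs is an ordinary finite number.
tiny = [0.01] * 200
product = math.prod(tiny)
log_sum = sum(math.log(p) for p in tiny)
print(product, log_sum)
```

Exponentiating a score recovers the unnormalized posterior $P(X|y)P(y)$; normalizing is unnecessary for prediction because only the ranking of classes matters.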

