# Naive Bayes Classification

Naïve Bayes is a supervised learning algorithm that is based on `Bayes' theorem` and used for solving classification problems.

The name combines two words, Naïve and Bayes, which can be described as:
- <b>Naïve</b>: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features, given the class.

- <b>Bayes</b>: It is called Bayes because it depends on the principle of Bayes' Theorem.

## Bayes Theorem:

Bayes' theorem is used to determine the probability of a hypothesis given observed evidence, by combining prior knowledge about the hypothesis with conditional probability.

The formula for Bayes' theorem is given as:

$$P(A|B)=\dfrac{P(B|A)P(A)}{P(B)}$$

Where,

- <b>P(A|B) is Posterior probability</b>: Probability of hypothesis A given the observed event B.

- <b>P(B|A) is Likelihood probability</b>: Probability of the evidence B given that hypothesis A is true.

- <b>P(A) is Prior Probability</b>: Probability of hypothesis before observing the evidence.

- <b>P(B) is Marginal Probability</b>: Probability of Evidence.
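As a quick numeric illustration of the formula, here is a minimal sketch. The function name `posterior` and all the numbers (a 1% prior, a 90% likelihood, and a 5% chance of the evidence when the hypothesis is false) are hypothetical and chosen only for illustration.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration.
p_a = 0.01              # prior P(A)
p_b_given_a = 0.90      # likelihood P(B|A)
p_b_given_not_a = 0.05  # P(B|not A)

# Marginal P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = posterior(p_b_given_a, p_a, p_b)
print(f"P(A|B) = {p_a_given_b:.4f}")
```

With these assumed inputs the posterior is about 0.15: even strong evidence leaves the hypothesis unlikely when the prior is small.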

### Naive Assumption

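Now it is time to add the naive assumption to Bayes' theorem, which is <b>independence</b> among the features. With the class variable $y$ as the hypothesis and the feature vector $(x_1,...,x_n)$ as the evidence, Bayes' theorem reads:

$$P(y|x_1,...,x_n) = \dfrac{P(x_1,...,x_n|y)P(y)}{P(x_1,...,x_n)}$$

We now split the evidence into its independent parts. If any two events A and B are independent, then

$$P(A,B) = P(A)P(B)$$

Hence, we reach the result: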

$$P(y|x_1,...,x_n) = \dfrac{ P(x_1|y)P(x_2|y)...P(x_n|y)P(y)}{P(x_1)P(x_2)...P(x_n)}$$ 

which can be expressed as:

$$P(y|x_1,...,x_n) = \dfrac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1)P(x_2)...P(x_n)}$$

Now, as the denominator remains constant for a given input, we can remove that term:

$$P(y|x_1,...,x_n)\propto P(y)\prod_{i=1}^{n}P(x_i|y)$$

Now, we need to create a classifier model. For this, we evaluate the expression above for the given set of inputs for every possible value of the class variable y and pick the class with the maximum value. This can be expressed mathematically as:

$$y = \underset{y}{\operatorname{argmax}}\; P(y)\prod_{i=1}^{n}P(x_i|y)$$

So, finally, we are left with the task of calculating $P(y)$ and $P(x_i|y)$.

Please note that $P(y)$ is also called the <b>class probability</b> and $P(x_i|y)$ is called the <b>conditional probability</b>.
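The decision rule above can be sketched in a few lines. The function `predict` and the toy `priors` and `likelihoods` tables are hypothetical and hand-picked for illustration; a real model would estimate them from training data.

```python
from math import prod


def predict(x, priors, likelihoods):
    """Return the class y that maximizes P(y) * prod_i P(x_i | y), plus all scores.

    priors:      {class: P(y)}
    likelihoods: {class: [{value: P(x_i = value | y)} for each feature i]}
    """
    scores = {
        y: priors[y] * prod(likelihoods[y][i][v] for i, v in enumerate(x))
        for y in priors
    }
    return max(scores, key=scores.get), scores


# Hypothetical two-class, two-feature model (all values are made up).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.3, "no": 0.7}],
    "ham": [{"yes": 0.1, "no": 0.9}, {"yes": 0.6, "no": 0.4}],
}

label, scores = predict(("yes", "no"), priors, likelihoods)
print(label, scores)
```

With many features the product of small probabilities can underflow, so implementations often sum log-probabilities instead. The argmax is unchanged because the logarithm is monotonic.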

### Example

Here is a tabular representation of our dataset.

||Outlook|Temperature|Humidity|Windy|Play Golf|
|---|---|---|---|---|---|
|0|Rainy|Hot|High|False|No|
|1|Rainy|Hot|High|True|No|
|2|Overcast|Hot|High|False|Yes|
|3|Sunny|Mild|High|False|Yes|
|4|Sunny|Cool|Normal|False|Yes|
|5|Sunny|Cool|Normal|True|No|
|6|Overcast|Cool|Normal|True|Yes|
|7|Rainy|Mild|High|False|No|
|8|Rainy|Cool|Normal|False|Yes|
|9|Sunny|Mild|Normal|False|Yes|
|10|Rainy|Mild|Normal|True|Yes|
|11|Overcast|Mild|High|True|Yes|
|12|Overcast|Hot|Normal|False|Yes|
|13|Sunny|Mild|High|True|No|

We need to find $P(x_i|y_j)$ for each $x_i$ in X and $y_j$ in y. All these calculations are shown in the tables below:

<img src="naive-bayes-classification.png">

In the figure above, we have calculated $P(x_i|y_j)$ for each $x_i$ in X and $y_j$ in y manually in tables 1-4. For example, the probability that the temperature is cool given that golf is played is $P(temp.=cool|play\;golf = Yes) = 3/9$.

We also need the class probabilities $P(y)$, which are calculated in table 5. For example, $P(play\;golf = Yes) = 9/14$.
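The counting behind these tables can be reproduced with a short sketch. It uses only the Temperature and Play Golf columns of the dataset table; the helper names `prior` and `likelihood` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# (Temperature, Play Golf) pairs for rows 0-13 of the dataset table.
rows = [
    ("Hot", "No"), ("Hot", "No"), ("Hot", "Yes"), ("Mild", "Yes"),
    ("Cool", "Yes"), ("Cool", "No"), ("Cool", "Yes"), ("Mild", "No"),
    ("Cool", "Yes"), ("Mild", "Yes"), ("Mild", "Yes"), ("Mild", "Yes"),
    ("Hot", "Yes"), ("Mild", "No"),
]

class_counts = Counter(label for _, label in rows)  # N(y)
joint_counts = Counter(rows)                        # N(x, y)


def prior(label):
    """P(y) = N(y) / N."""
    return Fraction(class_counts[label], len(rows))


def likelihood(value, label):
    """P(x | y) = N(x, y) / N(y)."""
    return Fraction(joint_counts[(value, label)], class_counts[label])


print("P(Cool | Yes) =", likelihood("Cool", "Yes"))  # Fraction reduces 3/9 to 1/3
print("P(Yes) =", prior("Yes"))
```

The same counting applies to the Outlook, Humidity and Windy columns.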

So now, we are done with our pre-computations and the classifier is ready!

Let us test it on a new set of features (let us call it today):

today = (Sunny, Hot, Normal, False)

So, the probability of playing golf is given by:

$$P(Yes|today) = \dfrac{P(Sunny\;Outlook|Yes)P(Hot\;Temperature|Yes)P(Normal\;Humidity|Yes)P(No\;Wind|Yes)P(Yes)}{P(today)}$$

and the probability of not playing golf is given by:

$$P(No|today) = \dfrac{P(Sunny\;Outlook|No)P(Hot\;Temperature|No)P(Normal\;Humidity|No)P(No\;Wind|No)P(No)}{P(today)}$$

Since $P(today)$ is common to both probabilities, we can ignore it and compute proportional probabilities. Counting rows in the dataset above, 3 of the 9 "Yes" rows and 2 of the 5 "No" rows have a Sunny outlook, so:

$$P(Yes|today) \propto \dfrac{3}{9}\cdot\dfrac{2}{9}\cdot\dfrac{6}{9}\cdot\dfrac{6}{9}\cdot\dfrac{9}{14} \approx 0.0212$$

and

$$P(No|today) \propto \dfrac{2}{5}\cdot\dfrac{2}{5}\cdot\dfrac{1}{5}\cdot\dfrac{2}{5}\cdot\dfrac{5}{14} \approx 0.0046$$

Since the two posteriors must sum to 1,

$$P(Yes|today) + P(No|today) = 1$$

these numbers can be converted into probabilities by normalization:

$$P(Yes|today) = \dfrac{0.0212}{0.0212 + 0.0046} \approx 0.82$$

and

$$P(No|today) = \dfrac{0.0046}{0.0212 + 0.0046} \approx 0.18$$

Since

$$P(Yes|today) > P(No|today)$$

the prediction is ‘Yes’: golf would be played.
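The whole worked example can be checked with a minimal sketch that counts directly from the 14 rows of the dataset table and then normalizes the two scores. The function names `fit` and `predict_proba` are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

# Rows 0-13 of the dataset table: four feature values followed by the class label.
DATA = [
    ("Rainy", "Hot", "High", "False", "No"),
    ("Rainy", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Sunny", "Mild", "High", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Rainy", "Mild", "High", "False", "No"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "High", "True", "No"),
]


def fit(data):
    """Count class frequencies N(y) and per-feature frequencies N(x_i, y)."""
    class_counts = Counter(row[-1] for row in data)
    feature_counts = defaultdict(Counter)  # key: (feature index, class label)
    for *values, label in data:
        for i, value in enumerate(values):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts


def predict_proba(x, class_counts, feature_counts):
    """Return normalized posteriors P(y | x) under the naive assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # P(y)
        for i, value in enumerate(x):
            score *= feature_counts[(i, label)][value] / n_label  # P(x_i | y)
        scores[label] = score
    norm = sum(scores.values())  # normalizing constant shared by all classes
    return {label: s / norm for label, s in scores.items()}


class_counts, feature_counts = fit(DATA)
today = ("Sunny", "Hot", "Normal", "False")
posteriors = predict_proba(today, class_counts, feature_counts)
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

Because the sketch uses raw counts, a feature value never seen with a class (for example Overcast with No) gives that class a score of zero.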

The method that we discussed above is applicable for discrete data. In case of continuous data, we need to make some assumptions regarding the distribution of values of each feature. The different Naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i|y)$.

### Types of Naïve Bayes Model:

The commonly used types of Naive Bayes model are given below:

- <b>Gaussian</b>: In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called Normal distribution. When plotted, it gives a bell shaped curve which is symmetric about the mean of the feature values as shown below:

<p align="center">
<img src="naive-bayes-classification-1.png">
</p>

The likelihood of the features is assumed to be Gaussian, hence, conditional probability is given by:

$$P(x_i|y) = \dfrac{1}{\sqrt{2\pi\sigma _{y}^{2} }} \exp \left (-\frac{(x_i-\mu _{y})^2}{2\sigma _{y}^{2}}  \right )$$

- <b>Multinomial</b>: Feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification.

- <b>Bernoulli</b>: In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence features (i.e. whether a word occurs in a document or not) are used rather than term frequencies (i.e. how often a word occurs in the document).

- <b>Complement Naive Bayes</b>: It is an adaptation of Multinomial NB where the complement of each class is used to calculate the model weights. So, this is suitable for imbalanced data sets and often outperforms the MNB on text classification tasks.

- <b>Categorical Naive Bayes</b>: Categorical Naive Bayes is useful if the features are categorically distributed. To use this algorithm, the categorical variables first have to be encoded in a numeric format, for example with an ordinal encoder.
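To make the Gaussian case concrete, the density formula given for Gaussian Naive Bayes can be sketched as follows. The helper names `gaussian_pdf` and `fit_gaussian` and the sample values are hypothetical. The variance is estimated per class with the population (maximum-likelihood) formula, which is one common choice.

```python
import math
from statistics import fmean, pvariance


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Gaussian density: exp(-(x - mean)^2 / (2 * var)) / sqrt(2 * pi * var)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gaussian(values):
    """Estimate the mean and population variance of one feature for one class."""
    return fmean(values), pvariance(values)


# Hypothetical measurements of a single feature, grouped by class.
samples = {"a": [4.9, 5.1, 5.0, 5.2], "b": [6.3, 6.5, 6.1, 6.7]}
params = {label: fit_gaussian(vals) for label, vals in samples.items()}

# Likelihood P(x_new | y) of a new value under each class.
x_new = 5.3
densities = {label: gaussian_pdf(x_new, m, v) for label, (m, v) in params.items()}
print(densities)
```

Here the new value lies much closer to the mean of class `a`, so its likelihood under `a` is far larger than under `b`. These per-feature likelihoods then take the place of the frequency-based $P(x_i|y)$ in the product.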

The following example trains a Gaussian Naive Bayes classifier on the iris dataset using scikit-learn:

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris dataset
iris = load_iris()

# Store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# Split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# Train the model on the training set
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = gnb.predict(X_test)

# Compare actual response values (y_test) with predicted response values (y_pred)
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)
```

Output:

```
Gaussian Naive Bayes model accuracy(in %): 95.0
```


## References

- [Naive Bayes Classifiers](https://www.geeksforgeeks.org/naive-bayes-classifiers/)

- [Naïve Bayes Classifier Algorithm](https://www.javatpoint.com/machine-learning-naive-bayes-classifier)