# Naive Bayes, Part 1

Derived from:

https://www.machinelearningplus.com/predictive-modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/
https://medium.com/syncedreview/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf
https://towardsdatascience.com/introduction-to-na%C3%AFve-bayes-classifier-fa59e3e24aaf

### What is Naive Bayes?
Naive Bayes is a probabilistic machine learning algorithm based on Bayes' theorem. It assumes that all features are independent of each other given the label. In other words, changing the value of one feature does not directly change the value of any other feature. This assumption is called naive because it is (almost) never true in practice.

For example, if temperature and humidity are your input features, Naive Bayes assumes that they are independent of each other, so changing the temperature does not directly change the humidity. In reality, temperature and humidity are related.
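
Stated as a formula, the assumption lets the probability of seeing all feature values together, given a label $y$, factor into one term per feature. This is a sketch that uses $X_1, \dots, X_n$ for the individual features, a notation introduced only for this illustration:
$$P(X_1, X_2, \dots, X_n|y) = \prod_{i=1}^{n} P(X_i|y)$$
For the two features above, this reads:
$$P(temperature, humidity|y) = P(temperature|y)P(humidity|y)$$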

### Conditional Probability
Before you go into Naive Bayes, you need to understand what Conditional Probability is. Let's start with an example.

When you toss a fair coin, heads and tails each come up with probability 1/2. Mathematically, this is written as
$P(Head)=\frac{1}{2}$ and $P(Tail)=\frac{1}{2}$.

As another example, suppose you pick a card from a standard 52-card deck and you already know that it is an ace. What is the probability that it is a diamond? Because the condition tells us the card is an ace, the population (denominator) is 4, not 52. Only one of the four aces is a diamond, so the probability of getting a diamond given that the card is an ace is 1/4. Mathematically, this is written as $P(Diamond|Ace)=\frac{1}{4}$. This is called conditional probability.
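
The card example can be checked by brute-force counting. The snippet below is a minimal sketch that assumes a standard 52-card deck; the variable names (`deck`, `aces`, and so on) are chosen only for this illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["diamonds", "hearts", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

# Conditioning on "the card is an ace" shrinks the population to the aces.
aces = [card for card in deck if card[0] == "A"]
diamond_aces = [card for card in aces if card[1] == "diamonds"]

p_diamond_given_ace = Fraction(len(diamond_aces), len(aces))
print(p_diamond_given_ace)  # 1/4

# The same value follows from P(Diamond and Ace) / P(Ace).
p_ace = Fraction(len(aces), len(deck))
p_diamond_and_ace = Fraction(len(diamond_aces), len(deck))
print(p_diamond_and_ace / p_ace)  # 1/4
```

Using `Fraction` keeps the results exact, so they can be compared directly with the hand calculation.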

### Naive Bayes by Example 1
Let's start with a simple example of the Naive Bayes algorithm. Suppose you have 100 fruits, each of which is either an apple or an orange. The training data for these fruits is shown in the following table. It has one feature, color, with three possible values: red, green, and orange. It also has one label with two possible values: apple and orange. We denote color as $X$ and fruit as $y$.

| No. | Color  | Fruit  |
|-----|--------|--------|
| 1   | Red    | Apple  |
| 2   | Red    | Apple  |
| 3   | Green  | Apple  |
| 4   | Green  | Orange |
| 5   | Orange | Orange |
| ... | ...    | ...    |
| 100 | Green  | Orange |

The objective of the Naive Bayes classifier is to predict whether a given fruit is an apple or an orange from its color. Suppose the color of a fruit is green ($X=green$). Can you predict which fruit ($y$) it is? In other words, you want to predict $y$ when only $X$ is known.

The idea is to compute two probabilities: the probability that the fruit is an apple and the probability that it is an orange. Whichever fruit gets the higher probability wins. Mathematically, you need to compute both $P(y=apple|X=green)$ and $P(y=orange|X=green)$, then compare the results. If $P(y=apple|X=green) > P(y=orange|X=green)$ then the prediction is apple. If $P(y=apple|X=green) < P(y=orange|X=green)$ then the prediction is orange.

This is the Naive Bayes formula for computing these probabilities:
$$P(y=apple|X=green) = \frac{P(X=green|y=apple)P(y=apple)}{P(X=green)}$$
$$P(y=orange|X=green) = \frac{P(X=green|y=orange)P(y=orange)}{P(X=green)}$$

These probabilities can be calculated from a frequency table, which is derived from the training data. The frequency table for this example is shown below.

|        | Apple | Orange | Total |
|--------|-------|--------|-------|
| Red    | 33    | 0      | 33    |
| Green  | 20    | 7      | 27    |
| Orange | 1     | 39     | 40    |
| Total  | 54    | 46     | 100   |
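
A frequency table like this can be tallied from labelled rows with `collections.Counter`. The sketch below is only an illustration: the full 100-row training set is not listed here, so the rows are rebuilt from the counts in the table, and their order is an assumption.

```python
from collections import Counter

# Counts copied from the frequency table above.
counts = {
    ("Red", "Apple"): 33, ("Red", "Orange"): 0,
    ("Green", "Apple"): 20, ("Green", "Orange"): 7,
    ("Orange", "Apple"): 1, ("Orange", "Orange"): 39,
}

# Rebuild (color, fruit) rows from the counts; the row order is arbitrary.
records = [pair for pair, n in counts.items() for _ in range(n)]

# Tally the rows the way you would with real training data.
freq = Counter(records)                             # cells of the table
color_totals = Counter(color for color, _ in records)  # "Total" column
fruit_totals = Counter(fruit for _, fruit in records)  # "Total" row

print(len(records))              # 100
print(fruit_totals["Apple"])     # 54
print(color_totals["Green"])     # 27
print(freq[("Red", "Orange")])   # 0 (Counter returns 0 for unseen pairs)
```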

#### **Step 1: compute the probabilities for each of the fruits (label)**
Out of 100 fruits, you have 54 apples and 46 oranges. So the respective probabilities are:
$$P(y=apple)=\frac{54}{100}$$
$$P(y=orange)=\frac{46}{100}$$

#### **Step 2: compute the probabilities of the color (feature)**
Out of 100 fruits, 27 are green. So the probability is:
$$P(X=green)=\frac{27}{100}$$

#### **Step 3: compute the conditional probability**
Out of 54 apples, 20 are green. So the probability is:
$$P(X=green|y=apple)=\frac{20}{54}$$
Out of 46 oranges, 7 are green. So the probability is:
$$P(X=green|y=orange)=\frac{7}{46}$$

#### **Step 4: substitute the three probabilities into the Naive Bayes formula**
$$P(y=apple|X=green) = \frac{\frac{20}{54}\frac{54}{100}}{\frac{27}{100}}=\frac{20}{27}$$
$$P(y=orange|X=green) = \frac{\frac{7}{46}\frac{46}{100}}{\frac{27}{100}}=\frac{7}{27}$$

Since $P(y=apple|X=green) > P(y=orange|X=green)$, apple has the higher probability, so apple is our predicted fruit ($y$) given that the color is green ($X=green$).
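
The four steps can be reproduced in a few lines of Python. This is a minimal sketch that uses only the counts from the frequency table; the helper name `posterior` and the dictionary layout are assumptions made for this illustration.

```python
from fractions import Fraction

# Frequency table: freq[fruit][color] = count
freq = {
    "apple":  {"red": 33, "green": 20, "orange": 1},
    "orange": {"red": 0,  "green": 7,  "orange": 39},
}

def posterior(color):
    """Return P(y | X=color) for every fruit, following steps 1 to 4."""
    total = sum(sum(colors.values()) for colors in freq.values())
    # Step 2: P(X=color)
    p_x = Fraction(sum(colors[color] for colors in freq.values()), total)
    result = {}
    for fruit, colors in freq.items():
        n_fruit = sum(colors.values())
        prior = Fraction(n_fruit, total)                # Step 1: P(y)
        likelihood = Fraction(colors[color], n_fruit)   # Step 3: P(X | y)
        result[fruit] = likelihood * prior / p_x        # Step 4
    return result

post = posterior("green")
prediction = max(post, key=post.get)

print(post["apple"])   # 20/27
print(post["orange"])  # 7/27
print(prediction)      # apple
```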

### The Naive Bayes Formula
From the above example, you can rewrite the Naive Bayes formula in a more general form:
$$P(y|X) = \frac{P(X|y)P(y)}{P(X)}$$
where $X$ is the input feature and $y$ is the output label that you want to predict. $P(y|X)$ is called the posterior probability, $P(X|y)$ the likelihood, $P(y)$ the label prior probability, and $P(X)$ the feature prior probability (also known as the evidence).

Notice that in step 4 of the above example, the denominator of both formulas is the same ($\frac{27}{100}$) for a given input. Because it does not affect which label gets the higher value, you can drop that term. So, for the Naive Bayes **<u>classifier</u>** you can simplify the formula to:
$$P(y|X)\propto P(X|y)P(y)$$
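
To see why dropping the denominator is safe, the sketch below compares the unnormalized scores for $X=green$ with the full posteriors. It reuses the numbers from the fruit example; the variable names are chosen only for this illustration.

```python
from fractions import Fraction

# Unnormalized scores P(X=green | y) * P(y), without dividing by P(X=green)
scores = {
    "apple":  Fraction(20, 54) * Fraction(54, 100),
    "orange": Fraction(7, 46) * Fraction(46, 100),
}

# The label with the highest score is the prediction.
prediction = max(scores, key=scores.get)
print(prediction)  # apple

# The scores sum to P(X=green), so dividing by that sum recovers the posteriors.
evidence = sum(scores.values())
posteriors = {fruit: score / evidence for fruit, score in scores.items()}
print(evidence)              # 27/100
print(posteriors["apple"])   # 20/27
print(posteriors["orange"])  # 7/27
```

Dividing every score by the same positive number never changes their order, which is why the classifier can skip that step.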

### Laplace Smoothing
In the above example, $P(X=red|y=orange)$ is zero. That makes sense because none of the 46 oranges is red. However, if you have many input features, their conditional probabilities are multiplied together, so the entire product becomes zero as soon as one of them is zero. This wipes out all the information in the other probabilities. This case is called zero frequency, and it needs to be avoided. You can use Laplace Smoothing to solve the zero-probability problem.

Laplace Smoothing can best be explained by an example. Let's say we want to calculate $P(y=orange|X=red)$. Laplace Smoothing needs to be applied to both $P(X|y)$ and $P(y)$.

#### **Step 1: compute the probabilities for each of the fruit/label**
Without Laplace Smoothing, the probability is
$$P(y=orange)=\frac{46}{100}$$
Laplace Smoothing is usually done by **<u>adding one to the numerator and adding the number of possible values of the label to the denominator</u>**. In this case, the label has two possible values (apple and orange). So, the probability after Laplace Smoothing is
$$P(y=orange)=\frac{46+1}{100+2}=\frac{47}{102}$$

#### **Step 2: compute the conditional probability**
Without Laplace Smoothing, the probability is
$$P(X=red|y=orange)=\frac{0}{46}$$
Laplace Smoothing is also applied to the conditional probability by **<u>adding one to the numerator and adding the number of possible values of the feature to the denominator</u>**. In this case, the feature has three possible values (red, green, and orange). So, the probability after Laplace Smoothing is
$$P(X=red|y=orange)=\frac{0+1}{46+3}=\frac{1}{49}$$
So, Laplace Smoothing gives us a non-zero probability.

#### **Step 3: substitute the two probabilities into the simplified Naive Bayes formula**
$$P(y=orange|X=red)\propto\frac{1}{49}\frac{47}{102}=\frac{47}{4998}\approx 0.0094037615$$
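
The smoothing steps can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the helper names `smoothed_prior` and `smoothed_likelihood` and the constant `ALPHA` are assumptions, and the apple score is computed with the same rule purely for comparison.

```python
from fractions import Fraction

# Frequency table: freq[fruit][color] = count
freq = {
    "apple":  {"red": 33, "green": 20, "orange": 1},
    "orange": {"red": 0,  "green": 7,  "orange": 39},
}
colors = ["red", "green", "orange"]
ALPHA = 1  # add-one (Laplace) smoothing

def smoothed_prior(fruit):
    """P(y) with ALPHA added per possible label value."""
    total = sum(sum(c.values()) for c in freq.values())
    n_fruit = sum(freq[fruit].values())
    return Fraction(n_fruit + ALPHA, total + ALPHA * len(freq))

def smoothed_likelihood(color, fruit):
    """P(X=color | y=fruit) with ALPHA added per possible feature value."""
    n_fruit = sum(freq[fruit].values())
    return Fraction(freq[fruit][color] + ALPHA, n_fruit + ALPHA * len(colors))

# Unnormalized scores for X=red, using the simplified formula
scores = {
    fruit: smoothed_likelihood("red", fruit) * smoothed_prior(fruit)
    for fruit in freq
}

print(smoothed_prior("orange"))              # 47/102
print(smoothed_likelihood("red", "orange"))  # 1/49
print(scores["orange"])                      # 47/4998
print(scores["apple"])                       # 55/171
print(max(scores, key=scores.get))           # apple
```

The orange score is small but no longer zero, so it cannot wipe out the contribution of other features when more of them are multiplied in.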