# Naive Bayes

Naive Bayes is a classification model. It is so called because it uses Bayes' theorem, but makes the "naive" assumption that all the input variables are independent of one another given the class (for example, that the weather being sunny tells us nothing about the day of the week being Wednesday).

### Bayes Theorem

Take time to consider Bayes' theorem more closely. Consider two classes: classical and pop. Imagining these are the only two genres of music, we might assume that out of 100 streams, 80 are pop and only 20 are classical, i.e. $ P(H) = 0.2$, $P(¬H) = 0.8 $, where we take $ P(H) $ to be the probability that a given stream of music is classical.

We then consider the fact that this given music stream is being played in a coffee shop; call this evidence $E$. We might assume that only 1 in 10 classical streams is played in a coffee shop, so knowing that the music is being played in a coffee shop restricts the probability space, altering the chance that the music being played is classical.

Without context, the expected number of classical streams would be 20 in 100, but adding in that the location is a coffee shop restricts us, because only 2 of those 20 classical streams are played in a coffee shop. Similarly, without context we would expect 80 of 100 streams to be not classical, and 72 of those 80 are played in a coffee shop.

We then consider the restricted coffee shop space, which contains $72 + 2 = 74$ streams (noting that the probability of non-classical music playing in a coffee shop, $P(¬H \cap E)$, must be strictly less than the probability of being in a coffee shop, $P(E)$, in all but the trivial case). Intuitively, we know that $P(H|E) < P(H|¬E) $, i.e., we're less likely to hear classical music in a coffee shop than elsewhere:

$$
P(H|E) = \frac{2}{72 + 2} = \frac{P(H \cap E)}{P(H \cap E) + P(¬H \cap E)} = \frac{P(H \cap E)}{P(E)} = \frac{P(E|H)P(H)}{P(E)}
$$


Formally, this is often written:
$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$
or 
$$
\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}
$$
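
As a quick numerical check, the coffee shop example can be run through the formula above. This is a minimal sketch: the function name `posterior` and the variable names are illustrative choices, and the inputs are the counts assumed earlier (20 classical and 80 pop streams, of which 2 and 72 respectively are played in a coffee shop).

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(H|E) for a binary hypothesis H using Bayes' theorem.

    prior                -- P(H)
    likelihood           -- P(E|H)
    likelihood_given_not -- P(E|not H)
    """
    # Evidence P(E) via the law of total probability.
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


p_classical = 0.2                   # P(H): 20 of 100 streams are classical
p_coffee_given_classical = 2 / 20   # P(E|H): 2 of 20 classical streams
p_coffee_given_pop = 72 / 80        # P(E|not H): 72 of 80 pop streams

p_classical_given_coffee = posterior(
    p_classical, p_coffee_given_classical, p_coffee_given_pop
)
print(p_classical_given_coffee)  # equals 2/74, roughly 0.027
```

The result matches the count-based answer of 2 in 74, and is well below the prior of 0.2.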


## Bayes Classifier

Let there be $K$ possible output classes, with the $k^{th}$ class represented by $C_k$.
Let there also be $n$ features (independent variables), such that $\bold{x} = (x_1, x_2, x_3, \dots, x_n)$.

We then consider the probability of a given class for a given input vector. In mathematical notation:
$$
p(C_k|\bold{x})
$$

Using Bayes' theorem, we can rewrite:

$$
p(C_k|\bold{x}) = \frac{p(\bold{x}|C_k)p(C_k)}{p(\bold{x})}
$$

We note here that the denominator does not depend on the class $ C_k $. Since we are generally only interested in which class has the highest probability, the denominator can be treated as a constant.

We therefore focus on the numerator $ p(\bold{x}|C_k)p(C_k) $, which is equivalent to the joint probability model $ p(C_k, x_1, x_2, \dots, x_n) $. Joint probability models are most easily conceptualised as bivariate distributions, such as the outcome of rolling two dice.
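
To make the two-dice picture concrete, the sketch below builds the joint distribution of two fair dice and checks that it factorises as a conditional times a marginal, in the same way that the numerator factorises into a class-conditional term and a prior. The names `joint`, `p_a` and `p_b_given_a` are illustrative, and fair, independent dice are an assumption of the example.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Joint distribution p(a, b) for two fair, independent dice.
joint = {(a, b): Fraction(1, 36) for a, b in product(faces, repeat=2)}

# Marginal p(a): sum the joint over the second die.
p_a = {a: sum(joint[(a, b)] for b in faces) for a in faces}

# Conditional p(b | a) = p(a, b) / p(a).
p_b_given_a = {
    (b, a): joint[(a, b)] / p_a[a] for a, b in product(faces, repeat=2)
}

# The joint factorises as p(a, b) = p(b | a) * p(a).
factorises = all(
    joint[(a, b)] == p_b_given_a[(b, a)] * p_a[a] for a, b in joint
)
print(factorises)
```

Exact fractions are used rather than floats so that the equality check is not affected by rounding.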

The bivariate case generalises to $n$ features. Using the chain rule of probability:

$$
\begin{split} p(C_k, x_1, x_2, \dots, x_n)
&= p(x_1, x_2, \dots, x_n, C_k)\\
&= p(x_1 | x_2, \dots, x_n, C_k)p(x_2, \dots, x_n, C_k) \\
&= p(x_1 | x_2, \dots, x_n, C_k)p(x_2 | x_3, \dots, x_n, C_k)p(x_3, \dots, x_n, C_k) \\
&= p(x_1 | x_2, \dots, x_n, C_k)p(x_2 | x_3, \dots, x_n, C_k)p(x_3 | x_4, \dots, x_n, C_k)\dots p(x_{n-1} | x_n, C_k)p(x_n | C_k)p(C_k)
\end{split}
$$

### Assumptions
Assuming all features are conditionally independent of one another given the class (often an invalid assumption in practice), we get:
$$
p(x_i|x_{i+1}, x_{i+2}, \dots, x_n, C_k) = p(x_i|C_k)
$$

Hence, we can rewrite $ p(C_k, x_1, x_2, \dots, x_n) $:

$$
\begin{split} p(C_k, x_1, x_2, \dots, x_n) 
&= p(C_k)\cdot p(x_1|C_k)\cdot p(x_2|C_k) \dots p(x_n|C_k) \\
&= p(C_k) \prod _{i=1}^{n} p(x_i|C_k)
\end{split}
$$
$$
\therefore p(C_k| x_1, x_2, \dots, x_n) \propto p(C_k) \prod _{i=1}^{n} p(x_i|C_k)
$$

More precisely, substituting this product into Bayes' theorem gives the full posterior $ p(C_k|\bold{x})$:
$$
\begin{split} p(C_k | \bold{x})
&= \frac{p(C_k)p(\bold{x}|C_k)}{p(\bold{x})} \\
&= \frac{p(C_k) \prod _{i=1}^{n} p(x_i|C_k)}{p(\bold{x})}
\end{split}
$$
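
The posterior above can be sketched in a few lines for categorical features. Everything in this example is illustrative: the function name `naive_bayes_posterior`, the feature names `venue` and `time`, and the time-of-day probabilities are invented for the demonstration. Only the priors (0.2 and 0.8) and the coffee shop likelihoods (0.1 and 0.9) come from the music example earlier.

```python
# Priors p(C_k), taken from the music example.
priors = {"classical": 0.2, "pop": 0.8}

# Class-conditional tables p(x_i | C_k). The "time" values are invented.
likelihoods = {
    "classical": {
        "venue": {"coffee_shop": 0.1, "other": 0.9},
        "time": {"day": 0.3, "evening": 0.7},
    },
    "pop": {
        "venue": {"coffee_shop": 0.9, "other": 0.1},
        "time": {"day": 0.6, "evening": 0.4},
    },
}


def naive_bayes_posterior(x, priors, likelihoods):
    """Return p(C_k | x) for every class, assuming conditional independence."""
    numerators = {}
    for c, prior in priors.items():
        score = prior
        for feature, value in x.items():
            # Multiply in p(x_i | C_k) for each observed feature value.
            score *= likelihoods[c][feature][value]
        numerators[c] = score
    # p(x) is the sum of the numerators over all classes.
    evidence = sum(numerators.values())
    return {c: s / evidence for c, s in numerators.items()}


post = naive_bayes_posterior(
    {"venue": "coffee_shop", "time": "evening"}, priors, likelihoods
)
print(post)
```

Because the denominator is the same for every class, dividing by it rescales the numerators so they sum to 1 without changing which class scores highest.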

### Event Models

There are different ways to turn these probabilities into a prediction (i.e., what is done based upon these probabilities). The simplest is the maximum a posteriori (MAP) decision rule, which predicts the class with the highest posterior probability.

Given that all the posterior probabilities have a common denominator, we only need to consider the numerator: $p(C_k) \prod _{i=1}^{n} p(x_i|C_k)$

Since the logarithm is monotonically increasing, the most probable class will also have the largest log probability. However, whereas for the probabilities we take the product of the conditional probabilities, for the logs we take the sum of the logs of these conditional probabilities. Working with logs is also numerically safer, since a product of many small probabilities can underflow to zero.

Hence, we say the predicted class $y$ is the class $C_k$ that maximises this numerator for a given feature vector, i.e.:
$$
\begin{split} y
&= \underset{C_k}{\operatorname{argmax}} \; p(C_k)\cdot p(x_1|C_k)\cdot p(x_2|C_k)\cdot\ldots\cdot p(x_n|C_k) \\
&= \underset{C_k}{\operatorname{argmax}} \; \log{p(C_k)}+\log{p(x_1|C_k)}+\log{p(x_2|C_k)}+\ldots + \log{p(x_n|C_k)}
\end{split}
$$
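
The sketch below illustrates why the log form is preferred in practice. It includes the log prior so that the score matches the numerator $p(C_k) \prod _{i=1}^{n} p(x_i|C_k)$. The function name `map_class_log`, the class labels `a` and `b`, and the likelihood values are all made up for the demonstration.

```python
import math


def map_class_log(priors, feature_likelihoods):
    """Return the MAP class and the per-class log scores.

    feature_likelihoods[c] is a list of p(x_i | c) for the observed features.
    """
    scores = {
        c: math.log(priors[c]) + sum(math.log(p) for p in feature_likelihoods[c])
        for c in priors
    }
    return max(scores, key=scores.get), scores


# Two hypothetical classes with 1000 features each.
priors = {"a": 0.5, "b": 0.5}
feature_likelihoods = {"a": [0.01] * 1000, "b": [0.02] * 1000}

# The direct products underflow to 0.0 in floating point, so the
# two classes cannot be told apart.
products = {c: priors[c] * math.prod(feature_likelihoods[c]) for c in priors}
print(products)

# The log scores stay finite and still rank the classes correctly.
predicted, scores = map_class_log(priors, feature_likelihoods)
print(predicted, scores)
```

With this many features both direct products are exactly zero, while the summed logs remain comparable and pick class `b`.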