# Bayesian Classifiers

### Outline
In this notebook, we discuss a few different (but closely related) classifiers based on Bayes' theorem:
1. Bayes Classifier
2. Naive Bayes Classifier
3. Linear Discriminant Analysis
4. Quadratic Discriminant Analysis

I refer to these as Bayesian classifiers because they are all based on Bayes' theorem, but differ in the assumptions they make about the distribution of the variables in the dataset.

### Bayes' Theorem recalled
In this notebook, we assume that we are given a labelled dataset with *continuous* features $X_1,\dotsc,X_n$, and a categorical target $Y$ which takes values in some finite set $\mathcal{C} = \{c_1,\dotsc,c_k\}$. We assume that there are $m$ instances in our dataset. 

Recall from last class that we denote a row of features as $\vec{x} = (x_1,\dotsc,x_n)$, and we write $\vec{X} = \vec{x}$ to mean that $X_i = x_i$ for $i=1,\dotsc,n$. Then, Bayes' theorem states that for any possible values $c$ and $\vec{x}$ of $Y$ and $\vec{X}$, we have
\begin{equation*}
    P(c|\vec{x}) = \frac{P(\vec{x}|c)P(c)}{P(\vec{x})}.
\end{equation*}

Here, we have the following terminology and concepts:
- **Prior** $P(c)$:
    
    This is the probability $P(Y=c)$ that the target is $c$ before we have observed the features. We either make a reasonable assumption about this probability, or we estimate it from the data, e.g. by taking it to be the fraction of the training data that has target $c$ (recall that $Y$ is categorical here).
- **Likelihood** $P(\vec{x}|c)$:

    This is the conditional probability $P(\vec{X} = \vec{x}|Y=c)$ of observing the features $\vec{x}$ given that the target is $c$. Since our features are continuous, this should be understood as a conditional density.

- **Evidence** $P(\vec{x})$:

    This is the probability $P(\vec{X} = \vec{x})$ of observing the features $\vec{x}$, regardless of the target. 

- **Posterior** $P(c|\vec{x})$:

    This is the probability $P(Y = c|\vec{X} = \vec{x})$ that the target is $c$ given that we have observed the features $\vec{x}$.

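As a small illustration of estimating the prior from data, here is a minimal sketch that takes $P(c)$ to be the fraction of training labels equal to $c$. The helper name `estimate_priors` and the toy labels `y_toy` are made up for this example and are not part of any dataset used in the notebook.

```python
import numpy as np

def estimate_priors(y):
    """Estimate P(Y = c) as the fraction of labels equal to c."""
    classes, counts = np.unique(y, return_counts=True)
    fractions = counts / counts.sum()
    return dict(zip(classes.tolist(), fractions.tolist()))

# Hypothetical labels for m = 8 instances (illustration only).
y_toy = np.array(["a", "b", "a", "c", "a", "b", "a", "c"])

priors = estimate_priors(y_toy)
print(priors)  # class "a" appears 4 times out of 8, "b" and "c" twice each
```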
Note that the denominator $P(\vec{x})$ can be computed as a sum over the possible classes of $Y$:
\begin{equation*}
    P(\vec{x}) = \sum_{c \in \mathcal{C}} P(\vec{x}|c)P(c).
\end{equation*}
From this it becomes quite clear that the posteriors $P(c|\vec{x})$ define a probability distribution over the possible classes $c \in \mathcal{C}$. Indeed, each value is non-negative, and when we sum over all classes, we get $1$:
\begin{align*}
    \sum_{c \in \mathcal{C}} P(c| \vec{x}) & = \sum_{c \in \mathcal{C}} \frac{P(\vec{x}|c)P(c)}{ P(\vec{x}) } = \frac{1}{P(\vec{x})} \sum_{c \in \mathcal{C}} P(\vec{x}|c)P(c) = \frac{P(\vec{x})}{P(\vec{x})} = 1.
\end{align*}
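To make the formulas above concrete, here is a minimal numerical sketch of Bayes' theorem for $k = 3$ classes at a single fixed $\vec{x}$. The prior and likelihood values are invented purely for illustration. The posteriors come out non-negative and sum to $1$ (up to floating-point error), as the computation above says they must.

```python
import numpy as np

# Hypothetical numbers for k = 3 classes at one fixed feature vector x.
prior = np.array([0.5, 0.3, 0.2])        # P(c) for each class
likelihood = np.array([0.1, 0.4, 0.7])   # P(x | c) for each class

joint = likelihood * prior               # numerator: P(x | c) P(c)
evidence = joint.sum()                   # denominator: P(x), summed over classes
posterior = joint / evidence             # Bayes' theorem: P(c | x)

print("evidence:", evidence)
print("posterior:", posterior)
print("sum of posteriors:", posterior.sum())
```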
Quite logically, given a row of features $\vec{x}$, our best bet for which class to predict for $Y$ would be the class $c \in \mathcal{C}$ with the largest posterior probability, since this is (in principle) the most probable class given the features we observed. This is the basic idea behind Bayesian Classifiers. 

Now, let's go over the various classifiers mentioned in the outline.

### 1. Bayes Classifier
This is the most direct classifier in the sense that it makes essentially no assumptions about how the data are distributed. The price is that it requires the true probabilities $P(\vec{x}|c)$ and $P(c)$, which are unknown in practice, so it is essentially impossible to implement exactly. Nevertheless, it is still useful as a theoretical construct, and will help give us some intuition for Bayesian classifiers in general. 

First off, let's be completely clear that a classifier is a function
\begin{equation*}
    F : \mathbb{R}^n \to \mathcal{C},
\end{equation*}
which assigns to any instance $\vec{x} = (x_1,\dotsc,x_n)$ of features a class $F(\vec{x}) \in \mathcal{C}$. 

Now, the Bayes Classifier, which we denote by $F_{\textup{Bayes}}$, does the natural probabilistic thing: it classifies a given $\vec{x}$ into the class which has the maximum posterior probability. Notationally, we write this as follows:
\begin{align*}
    F_{\textup{Bayes}}(\vec{x}) & = \argmax_{c \in \mathcal{C}} P(c \mid \vec{x})\\
    & = \argmax_{c \in \mathcal{C}} \dfrac{ P(\vec{x}| c) P(c) }{ P(\vec{x}) }.
\end{align*}
Note that the denominator is positive and does not depend on the class $c$, so to find the class which maximizes the posterior, we may as well drop the denominator from the right-hand side. This yields the following simplified form of the Bayes Classifier:
\begin{align*}
    F_{\textup{Bayes}}(\vec{x}) & = \argmax_{c \in \mathcal{C}} P(\vec{x}| c) P(c)\\
    & = \argmax_{c \in \mathcal{C}} P(\vec{x},c),
\end{align*}
where $P(\vec{x},c)$ denotes the joint probability that $\vec{X} = \vec{x}$ and $Y = c$. 
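
The simplified form can be sketched in code if we pretend, for the sake of illustration, that the true priors and class-conditional densities are known. The sketch below assumes a single feature ($n = 1$), two classes, and Gaussian class-conditional densities with invented parameters. None of these choices come from the notebook's data, and the names `gaussian_pdf` and `bayes_classifier` are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Assumed-known quantities (illustration only; unknown in practice).
classes = ["c1", "c2"]
priors = np.array([0.7, 0.3])   # P(c)
means = np.array([0.0, 3.0])    # mean of X given each class
stds = np.array([1.0, 1.0])     # standard deviation of X given each class

def bayes_classifier(x):
    """Return the class maximizing P(x | c) P(c), i.e. the joint P(x, c)."""
    joint = gaussian_pdf(x, means, stds) * priors
    return classes[int(np.argmax(joint))]

for x in [0.0, 1.5, 2.0, 3.0]:
    print(x, "->", bayes_classifier(x))
```

With these particular numbers the decision boundary sits slightly to the right of the midpoint $1.5$ between the two means, because the larger prior on the first class pulls the boundary toward the second class.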