# Gaussian Discriminant Analysis
Gaussian discriminant analysis (GDA) rests on the modeling assumption that the <b>class-conditional</b> distributions are multivariate Gaussians, meaning that the data within each class follows its own multivariate normal (MVN) distribution:

$$
p(x|y=c,\theta_c) = N(x|\mu_c,\Sigma_c)
$$

With this assumption, we can build a classifier that assigns a data vector to the class with the maximum posterior probability for that vector:

$$
\begin{align}
\hat{y}(x) &= \operatorname{argmax}_c\ [\text{Likelihood}_c \cdot \text{Prior}_c] \\
           &= \operatorname{argmax}_c\ [p(x|y=c,\theta_c)\,p(y=c|\theta_c)] \\
           &= \operatorname{argmax}_c\ \log[p(x|y=c,\theta_c)\,p(y=c|\theta_c)] \\
           &= \operatorname{argmax}_c\ [\log p(x|y=c,\theta_c) + \log p(y=c|\theta_c)]
\end{align}
$$
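The decision rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `gaussian_log_likelihood` and `predict`, as well as the two-class means, covariances, and priors, are hypothetical values chosen only for the example.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Log of the multivariate normal density N(x | mu, sigma)."""
    d = x.shape[0]
    diff = x - mu
    # Solve sigma @ z = diff rather than forming the explicit inverse.
    maha = diff @ np.linalg.solve(sigma, diff)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def predict(x, mus, sigmas, priors):
    """Return the class index maximizing log-likelihood + log-prior."""
    scores = [
        gaussian_log_likelihood(x, mu, sigma) + np.log(prior)
        for mu, sigma, prior in zip(mus, sigmas, priors)
    ]
    return int(np.argmax(scores))

# Hypothetical two-class problem in 2D (assumed values, not from the text).
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

label_near_origin = predict(np.array([0.1, -0.2]), mus, sigmas, priors)
label_near_second = predict(np.array([2.9, 3.2]), mus, sigmas, priors)
```

Working in log space avoids numerical underflow of the density for points far from a class mean, which is one practical reason for taking the logarithm inside the $argmax$.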

#### Gaussian discriminant classifiers are distance-based classifiers
This might sound surprising, since models are often categorized as probabilistic vs. geometric, parametric vs. non-parametric, etc. However, take a look at the definition of the MVN:

$$
\begin{align}
p(x|y=c,\theta_c) &= N(x|\mu_c,\Sigma_c) \\
                  &= \frac 1 {{(2\pi)}^{\frac D 2} {|\Sigma_c|}^{\frac 1 2}} e^{-\frac 1 2 {(x-\mu_c)}^T \Sigma_c^{-1} (x-\mu_c)}
\end{align}
$$

Since we are taking an $argmax$, we can
1. take the logarithm, which removes the exponential
2. drop terms that are constant across classes, as they do not change the $argmax$ solution

$$
\begin{align}
    \log p(x|y=c,\theta_c) &= -\frac 1 2 {(x-\mu_c)}^T \Sigma_c^{-1} (x-\mu_c) - \frac 1 2 \log|\Sigma_c| - \frac D 2 \log 2\pi \\
    {(x-\mu_c)}^T \Sigma_c^{-1} (x-\mu_c) &= {(x-\mu_c)}^T (\sum_{i=1}^D \frac 1 {\lambda_i} u_i u_i^T) (x-\mu_c) \\
                           &= \sum_{i=1}^D \frac 1 {\lambda_i} {(x-\mu_c)}^T u_i u_i^T (x-\mu_c) \\
                           &= \sum_{i=1}^D \frac {{\tilde{x}_i}^2} {\lambda_i},\ {\tilde{x}_i}=u_i^T (x-\mu_c) \\
                           &= \sum_{i=1}^D \frac 1 {\lambda_i} {\langle u_i, x-\mu_c \rangle}^2 \\
\end{align}
$$

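In this light, we see that the log-likelihood of the data vector $x$ under the MVN distribution decreases with its squared Mahalanobis distance to the class mean $\mu_c$: the offset $x-\mu_c$ is projected onto each eigenvector $u_i$ of the class covariance $\Sigma_c$, and each squared projection is divided by the corresponding eigenvalue $\lambda_i$. The remaining class-dependent terms, the log-determinant of $\Sigma_c$ and the log-prior, are identical for every class when the classes share one covariance matrix and the prior is uniform. In that case, the Gaussian discriminant is also called a <b>nearest centroid classifier</b>.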
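The eigendecomposition step in the derivation above can be checked numerically. The sketch below builds an arbitrary symmetric positive definite covariance (the random matrix, mean, and data vector are assumptions made only for this check) and compares the quadratic form computed directly with the eigenvalue-weighted sum of squared projections.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary symmetric positive definite covariance, mean, and data vector.
a = rng.normal(size=(3, 3))
sigma = a @ a.T + 3.0 * np.eye(3)
mu = rng.normal(size=3)
x = rng.normal(size=3)

# eigh returns eigenvalues lam and orthonormal eigenvectors as columns of u.
lam, u = np.linalg.eigh(sigma)

diff = x - mu

# Direct quadratic form (x - mu)^T Sigma^{-1} (x - mu).
direct = diff @ np.linalg.solve(sigma, diff)

# Spectral form: project onto each eigenvector, square, divide by eigenvalue.
x_tilde = u.T @ diff
spectral = np.sum(x_tilde**2 / lam)
```

Both `direct` and `spectral` should agree up to floating-point error, which is what the chain of equalities claims.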

We can then represent the classification decision as

$$
\begin{align}
\hat{y}(x) &= \operatorname{argmin}_c\ d(x, \mu_c) \\
d(x, \mu_c) &= \sum_{i=1}^D \frac 1 {\lambda_i} {\langle u_i, x - \mu_c \rangle}^2
\end{align}
$$
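As a rough sketch of this decision rule, the example below assumes a single covariance matrix shared by all classes and a uniform prior, so that only the distance term differs between classes. The helper names `mahalanobis_sq` and `nearest_centroid` and all numeric values are hypothetical. The test point is chosen so that the covariance-aware distance and the plain Euclidean distance disagree.

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)^T sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(sigma, diff)

def nearest_centroid(x, mus, sigma):
    """Index of the class mean closest to x under the shared covariance."""
    distances = [mahalanobis_sq(x, mu, sigma) for mu in mus]
    return int(np.argmin(distances))

# Assumed class means and a shared covariance with a large variance
# along the first axis and a small variance along the second.
mus = [np.array([0.0, 0.0]), np.array([4.0, 1.0])]
sigma = np.array([[4.0, 0.0], [0.0, 0.04]])

x = np.array([3.0, 0.0])

label_mahalanobis = nearest_centroid(x, mus, sigma)
# With the identity covariance the distance reduces to squared Euclidean.
label_euclidean = nearest_centroid(x, mus, np.eye(2))
```

Here `x` is closer to the second mean in Euclidean terms, but the small variance along the second axis makes a unit offset in that direction very costly, so the Mahalanobis rule picks the first class.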

TODO ellipse plots

TODO distance check

# Quadratic Discriminant Analysis

In QDA, we work with the likelihood of the data conditioned on the class index. Again, the model assumes that these class-conditional distributions are multivariate Gaussians. Recall that, for $NC$ classes indexed by $c$ and parametrized by the vector $\theta$, the class posterior can be written (via Bayes' rule) as

$$
\begin{align}
    p(y=c|x,\theta) &= \frac {p(x, y=c|\theta)} {\sum_{i=1}^{NC} p(x, y=c_i|\theta)} \\
                    &= \frac {p(y=c|\theta)p(x|y=c,\theta)} {\sum_{i=1}^{NC} p(y=c_i|\theta)p(x|y=c_i,\theta)}
\end{align}
$$
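The posterior above can be computed as follows. This is a minimal sketch under assumed parameters: the function names `log_gaussian` and `class_posterior` and the example means, covariances, and priors are illustrative only. The normalization is done in log space by subtracting the largest log-joint value before exponentiating, which leaves the ratio unchanged but avoids underflow.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """Log of the multivariate normal density N(x | mu, sigma)."""
    d = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def class_posterior(x, mus, sigmas, priors):
    """Posterior p(y=c | x) for every class c, via Bayes' rule."""
    # Numerator terms in log space: log prior + log class-conditional density.
    log_joint = np.array([
        np.log(p) + log_gaussian(x, m, s)
        for m, s, p in zip(mus, sigmas, priors)
    ])
    # Shift by the maximum so the largest term becomes exp(0) = 1.
    joint = np.exp(log_joint - log_joint.max())
    # Dividing by the sum implements the denominator of Bayes' rule.
    return joint / joint.sum()

# Assumed two-class setup (values are not from the text).
mus = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

posterior_midpoint = class_posterior(np.array([1.0, 0.0]), mus, sigmas, priors)
posterior_far = class_posterior(np.array([-5.0, 0.0]), mus, sigmas, priors)
```

At the midpoint between two means with equal covariances and equal priors, both classes are equally likely, while a point far on the side of the first mean is assigned to it with near certainty.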