# Gaussians

The Gaussian distribution is a reasonable model in many situations, which makes it very useful for machine learning. Its probability density function is:

$$
p(x \mid \mu, \sigma^2) = N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( \frac{-(x - \mu)^2}{2 \sigma^2} \right)
$$

The parameters are:
- The mean $\mu$ (location)
- The variance $\sigma^2$ (dispersion)

A Gaussian can be used to describe a random variable whose values cluster around some mean value:

- About 68% of values are within one standard deviation ($\pm \sigma$) of the mean
- About 95% of values are within two standard deviations ($\pm 2\sigma$) of the mean
- About 99.7% of values are within three standard deviations ($\pm 3\sigma$) of the mean
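
The percentages above can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names `gaussian_pdf` and `coverage` are illustrative, and the coverage uses the standard identity that the probability of landing within $k$ standard deviations of the mean is $\operatorname{erf}(k / \sqrt{2})$.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Density N(x; mu, sigma2) of a univariate Gaussian (sigma2 is the variance)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def coverage(k):
    """Probability that a Gaussian value lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    # Prints roughly 0.6827, 0.9545 and 0.9973.
    print(f"within {k} sigma: {coverage(k):.4f}")
```

The result does not depend on $\mu$ or $\sigma^2$, which is why the rule can be stated for any Gaussian.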

## Maximum likelihood estimation

For a dataset $\left\{ x_1, x_2, ..., x_N \right\}$ we can obtain unbiased estimates of the mean and variance with:

$$
\widehat{\mu} = \frac{1}{N} \sum_{n=1}^N x_n
$$

$$
\widehat{\sigma}^2 = \frac{1}{N - 1} \sum_{n=1}^N (x_n - \hat{\mu})^2
$$

The maximum likelihood estimates are similar; only the normalization of the variance changes, from $N - 1$ to $N$:

$$
\widehat{\mu}_{ML} = \frac{1}{N} \sum_{n=1}^N x_n
$$

$$
\widehat{\sigma}^2_{ML} = \frac{1}{N} \sum_{n=1}^N (x_n - \hat{\mu})^2
$$
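
To make the difference between the two variance estimates concrete, here is a small sketch using NumPy. The helper name `estimate_mean_variance` and the sample values are made up for illustration.

```python
import numpy as np


def estimate_mean_variance(x):
    """Return (mean, unbiased variance, maximum likelihood variance) of 1-D data."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu_hat = x.sum() / n
    squared_dev = (x - mu_hat) ** 2
    var_unbiased = squared_dev.sum() / (n - 1)  # divides by N - 1
    var_ml = squared_dev.sum() / n              # divides by N
    return mu_hat, var_unbiased, var_ml


samples = [2.0, 4.0, 4.0, 6.0]  # illustrative data, not from the notes
mu_hat, var_unbiased, var_ml = estimate_mean_variance(samples)
print(mu_hat, var_unbiased, var_ml)
```

In NumPy the same two estimates are available as `np.var(samples, ddof=1)` and `np.var(samples, ddof=0)`. The two values approach each other as $N$ grows.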

To show where the maximum likelihood estimates come from, we will consider the optimization problem for a set $\mathcal{D}$ of training samples.

$$
\max_{\mu, \sigma^2} p(\mathcal{D} \mid \mu, \sigma^2)
$$

If we assume that the samples in $\mathcal{D}$ are drawn independently from the same distribution, we get $L$, the likelihood function:

$$
\begin{align}
p(\mathcal{D} \mid \mu, \sigma^2) &= p(x_1, ..., x_N \mid \mu, \sigma^2) \\
&= p(x_1 \mid \mu, \sigma^2) ... p(x_N \mid \mu, \sigma^2) \\
&= \prod_{n=1}^N p(x_n \mid \mu, \sigma^2) \\
&= L(\mu, \sigma^2 \mid \mathcal{D})
\end{align}
$$

Taking the log of $L$ gives us the log-likelihood function $LL$ (this doesn't change our optimization problem, since log is a monotonically increasing function):

$$
\begin{align}
LL(\mu, \sigma^2 \mid \mathcal{D}) &= \ln L(\mu, \sigma^2 \mid \mathcal{D}) \\
&= \ln \prod_{n=1}^N p(x_n \mid \mu, \sigma^2) \\
&= \sum_{n=1}^N \ln p(x_n \mid \mu, \sigma^2) \\
&= \sum_{n=1}^N \ln \left( \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( \frac{-(x_n - \mu)^2}{2 \sigma^2} \right) \right) \\
&= - \frac{N}{2} \ln (2 \pi) - \frac{N}{2} \ln (\sigma^2) - \sum_{n=1}^N \frac{(x_n - \mu)^2}{2 \sigma^2}
\end{align}
$$
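
As a sanity check on the algebra, the closed form in the last line can be compared against a direct sum of log densities. This is a sketch under the i.i.d. assumption stated above; the function names and data are illustrative.

```python
import numpy as np


def log_likelihood(x, mu, sigma2):
    """Closed-form Gaussian log-likelihood (last line of the derivation)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (
        -0.5 * n * np.log(2.0 * np.pi)
        - 0.5 * n * np.log(sigma2)
        - np.sum((x - mu) ** 2) / (2.0 * sigma2)
    )


def log_likelihood_from_density(x, mu, sigma2):
    """Same quantity, computed as the sum of log p(x_n | mu, sigma2)."""
    x = np.asarray(x, dtype=float)
    density = np.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    return np.sum(np.log(density))


data = [1.0, 2.0, 4.0, 7.0]  # illustrative data, not from the notes
print(log_likelihood(data, 3.0, 2.5))
print(log_likelihood_from_density(data, 3.0, 2.5))  # agrees up to rounding
```

For long datasets the closed form is also the numerically safer choice, because the product of many small densities underflows long before the sum of their logs does.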

To maximize, we take the partial derivative with respect to each parameter and set it equal to zero:

$$
\frac{\partial LL(\mu, \sigma^2 \mid \mathcal{D})}{\partial \mu} = 0 \implies \hat{\mu} = \frac{1}{N} \sum_{n=1}^N x_n
$$

$$
\frac{\partial LL(\mu, \sigma^2 \mid \mathcal{D})}{\partial \sigma^2} = 0 \implies \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^N (x_n - \hat{\mu})^2
$$
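
The intermediate steps can be sketched as follows, differentiating the last line of the log-likelihood term by term. Treating $\sigma^2$ (rather than $\sigma$) as the variable is a choice made here for convenience; differentiating with respect to $\sigma$ leads to the same estimate.

$$
\begin{align}
\frac{\partial LL}{\partial \mu} &= \sum_{n=1}^N \frac{x_n - \mu}{\sigma^2} = 0 \implies \sum_{n=1}^N x_n = N \hat{\mu} \\
\frac{\partial LL}{\partial \sigma^2} &= - \frac{N}{2 \sigma^2} + \sum_{n=1}^N \frac{(x_n - \hat{\mu})^2}{2 \sigma^4} = 0 \implies N \hat{\sigma}^2 = \sum_{n=1}^N (x_n - \hat{\mu})^2
\end{align}
$$

Dividing both results by $N$ gives the maximum likelihood estimates.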

## Extending to $D$ dimensions

A $D$-dimensional vector $\mathbf{x} = (x_1, ..., x_D)^T$ is _multivariate Gaussian_ if it has a pdf of the form:

$$
p(\mathbf{x} \mid \mathbf{\mu}, \mathbf{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\mathbf{\Sigma}|^{1/2}} \exp \left( - \frac{1}{2}(\mathbf{x} - \mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu}) \right)
$$

$\frac{1}{2}(\mathbf{x} - \mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu})$ is referred to as a _quadratic form_.
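
The density can be evaluated directly from this formula. The sketch below assumes $\mathbf{\Sigma}$ is symmetric positive definite; the function name `multivariate_gaussian_pdf` is illustrative. It solves a linear system instead of forming $\mathbf{\Sigma}^{-1}$ explicitly, which is a common numerical preference rather than something the formula requires.

```python
import numpy as np


def multivariate_gaussian_pdf(x, mu, sigma):
    """Density of a D-dimensional Gaussian at x (sigma must be positive definite)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.size
    diff = x - mu
    # (x - mu)^T Sigma^{-1} (x - mu), computed by solving Sigma z = (x - mu).
    quad = diff @ np.linalg.solve(sigma, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm


# At the mean of a 2-D Gaussian with identity covariance the density is 1 / (2 pi).
print(multivariate_gaussian_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2)))
```

With $D = 1$ and $\mathbf{\Sigma} = [\sigma^2]$ this reduces to the univariate density.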

The parameters are:

- $\mathbf{\mu}$ the mean vector, $\mathbf{\mu} = E(\mathbf{x})$
- $\mathbf{\Sigma}$ the covariance matrix, $\mathbf{\Sigma} = E((\mathbf{x} - \mathbf{\mu})(\mathbf{x} - \mathbf{\mu})^T)$

The covariance matrix is a $D \times D$ symmetric matrix, that is, $\mathbf{\Sigma}^T = \mathbf{\Sigma}$. Its elements $\sigma_{ij}$ describe the relationship between two components:

- If $x_j$ is large when $x_i$ is large, then $(x_j - \mu_j)(x_i - \mu_i)$ will tend to be positive
- If $x_j$ is small when $x_i$ is large, then $(x_j - \mu_j)(x_i - \mu_i)$ will tend to be negative

The variance of $x_i$ is $\sigma_i^2$ which is equivalent to the covariance between $x_i$ and $x_i$, that is $\sigma_i^2 = \sigma_{ii}$.

The parameter definitions and their maximum likelihood estimates become:

$$
\begin{align}
\mathbf{\mu} &= E(\mathbf{x})\\
\hat{\mathbf{\mu}}_{ML} &= \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \\
\mathbf{\Sigma} &= E((\mathbf{x} - \mathbf{\mu})(\mathbf{x} - \mathbf{\mu})^T) \\
\hat{\mathbf{\Sigma}}_{ML} &= \frac{1}{N} \sum_{n=1}^N (\mathbf{x}_n - \hat{\mathbf{\mu}}_{ML})(\mathbf{x}_n - \hat{\mathbf{\mu}}_{ML})^T
\end{align}
$$

The covariance matrix is not scale independent. To fix this, we introduce the _correlation matrix_ $R$:

$$
\begin{align}
R &= (\rho_{ij}) \\
\rho_{ij} &= \frac{\sigma_{ij}}{\sqrt{\sigma_{ii} \sigma_{jj}}}
\end{align}
$$

These terms are scale and location independent, that is, for $a, c > 0$:

$$
\rho(x_i, x_j) = \rho(a x_i +b, c x_j + d)
$$
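
The invariance can be illustrated numerically. This sketch uses synthetic data and an illustrative helper `correlation_from_covariance`; the scale factors and offsets are arbitrary positive choices. `np.cov` normalizes by $N - 1$ by default, but the normalization cancels in $\rho_{ij}$, so the choice does not affect the result.

```python
import numpy as np


def correlation_from_covariance(sigma):
    """Convert a covariance matrix into the correlation matrix R = (rho_ij)."""
    sigma = np.asarray(sigma, dtype=float)
    std = np.sqrt(np.diag(sigma))        # sqrt(sigma_ii) for each component
    return sigma / np.outer(std, std)    # sigma_ij / sqrt(sigma_ii * sigma_jj)


rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
data[:, 1] += 0.8 * data[:, 0]           # make the two components correlated

# Rescale and shift each component: x_i -> a * x_i + b with a > 0.
rescaled = data * np.array([3.0, 0.5]) + np.array([10.0, -2.0])

r_original = correlation_from_covariance(np.cov(data, rowvar=False))
r_rescaled = correlation_from_covariance(np.cov(rescaled, rowvar=False))
print(np.allclose(r_original, r_rescaled))  # the covariances differ, the correlations do not
```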

## 2-D Gaussian examples

![](res/gaussian_example_1.png)
![](res/gaussian_example_2.png)
![](res/gaussian_example_3.png)

---

### Example

Calculate $\mathbf{\mu}$ and $\mathbf{\Sigma}$ for the samples $\displaystyle\mathbf{x} : \binom{5}{1}, \binom{5}{2}, \binom{7}{2}, \binom{7}{3}$

First, we compute $\mathbf{\mu}$ using $\mathbf{\mu} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n$:

$$
\begin{align}
\mathbf{\mu} &= \frac{1}{4} \left\{ \begin{bmatrix}5\\1\end{bmatrix} + \begin{bmatrix}5\\2\end{bmatrix} + \begin{bmatrix}7\\2\end{bmatrix} + \begin{bmatrix}7\\3\end{bmatrix} \right\} \\
&= \begin{bmatrix}6\\2\end{bmatrix}
\end{align}
$$

To compute $\mathbf{\Sigma}$, we first compute $\mathbf{x}_n - \mathbf{\mu}$ for each sample:

$$
\mathbf{x}_n - \mathbf{\mu} : \binom{-1}{-1}, \binom{-1}{0}, \binom{1}{0}, \binom{1}{1}
$$

Finally, using $\mathbf{\Sigma} = \frac{1}{N} \sum_{n=1}^N (\mathbf{x}_n - \hat{\mathbf{\mu}}_{ML})(\mathbf{x}_n - \hat{\mathbf{\mu}}_{ML})^T$:

$$
\begin{align}
\mathbf{\Sigma} &= \frac{1}{4} \left\{ 
\begin{bmatrix}-1\\-1\end{bmatrix} \begin{bmatrix}-1 & -1\end{bmatrix}+
\begin{bmatrix}-1\\0\end{bmatrix} \begin{bmatrix}-1 & 0\end{bmatrix}+
\begin{bmatrix}1\\0\end{bmatrix} \begin{bmatrix}1 & 0\end{bmatrix}+
\begin{bmatrix}1\\1\end{bmatrix} \begin{bmatrix}1&1\end{bmatrix}
\right\} \\
&= \begin{bmatrix}1&\frac{1}{2}\\\frac{1}{2}&\frac{1}{2}\end{bmatrix}
\end{align}
$$
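
The hand calculation can be reproduced with a few lines of NumPy. This is a sketch with illustrative variable names; it stores one sample per row and uses the maximum likelihood normalization $\frac{1}{N}$.

```python
import numpy as np

# The four samples from the example, one per row.
samples = np.array([[5.0, 1.0], [5.0, 2.0], [7.0, 2.0], [7.0, 3.0]])

mu_hat = samples.mean(axis=0)                         # [6, 2]
centered = samples - mu_hat                           # rows are x_n - mu
# Sum of outer products (x_n - mu)(x_n - mu)^T, divided by N.
sigma_ml = centered.T @ centered / samples.shape[0]   # [[1, 0.5], [0.5, 0.5]]

print(mu_hat)
print(sigma_ml)
```

`np.cov(samples, rowvar=False, bias=True)` gives the same matrix; without `bias=True`, NumPy divides by $N - 1$ instead.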

Contour plot:

![](res/example_contour.png)

---