# Covariance Matrix

## Vectorizing covariance matrix calculations

The covariance matrix written in this form

$$
\Sigma = \frac{1}{N} \sum_{n=1}^N (\mathbf{x}_n - \mathbf{\mu})(\mathbf{x}_n - \mathbf{\mu})^T
$$

would require a for loop to iterate over each sample. To avoid this, we can replace the summation with matrix operations:

$$
\begin{align}
\Sigma &= \frac{1}{N} \sum_{n=1}^N (\mathbf{x}_n - \mathbf{\mu})(\mathbf{x}_n - \mathbf{\mu})^T \\
&= \frac{1}{N} (\mathbf{x}_1 - \mathbf{\mu}, \dots, \mathbf{x}_N - \mathbf{\mu}) \begin{pmatrix}
\mathbf{x}_1^T - \mathbf{\mu}^T \\
\vdots \\
\mathbf{x}_N^T - \mathbf{\mu}^T \\
\end{pmatrix} \\
&= \frac{1}{N} (X- M_N)^T (X - M_N)
\end{align}
$$

where

$$
\begin{align}
X = \begin{bmatrix}
x_{11} & \cdots & x_{1D} \\
\vdots & \ddots & \vdots \\
x_{N1} & \cdots & x_{ND}
\end{bmatrix}
\quad\quad
M_N &= \begin{bmatrix}
\mu_1 & \cdots & \mu_D \\
\vdots & \ddots & \vdots \\
\mu_1 & \cdots & \mu_D
\end{bmatrix}\\
&= \frac{1}{N} 1_{NN} X
\end{align}
$$

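and $1_{NN}$ is an $N \times N$ matrix whose elements are all $1$. Multiplying $\frac{1}{N} 1_{NN}$ by $X$ replaces every row of $X$ with the vector of column means, so each row of $M_N$ equals $\mathbf{\mu}^T$.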
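
The equivalence between the summation and the matrix form can be checked numerically. The following is a minimal sketch assuming NumPy and randomly generated data; the function names `covariance_loop` and `covariance_vectorized` are illustrative and not part of the notes above.

```python
import numpy as np


def covariance_loop(X):
    """Covariance as an explicit sum of outer products, normalised by 1/N."""
    N, D = X.shape
    mu = X.mean(axis=0)
    sigma = np.zeros((D, D))
    for x_n in X:
        diff = (x_n - mu).reshape(D, 1)  # column vector (x_n - mu)
        sigma += diff @ diff.T           # outer product, shape (D, D)
    return sigma / N


def covariance_vectorized(X):
    """Covariance as (X - M_N)^T (X - M_N) / N with M_N = 1_NN X / N."""
    N = X.shape[0]
    M_N = np.ones((N, N)) @ X / N        # every row holds the column means
    centered = X - M_N
    return centered.T @ centered / N


# Illustrative data: N = 50 samples, D = 3 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

sigma_loop = covariance_loop(X)
sigma_vec = covariance_vectorized(X)
```

Both results should agree with `np.cov(X, rowvar=False, bias=True)`, which also uses the $\frac{1}{N}$ normalisation. In practice `X - X.mean(axis=0)` gives the same centered matrix through broadcasting and avoids building the $N \times N$ matrix of ones.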

## More properties of covariance matrix ($\dagger$)

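We can decompose any covariance matrix into three parts (its eigendecomposition), where the columns of $V$ are the eigenvectors of $\Sigma$ and the diagonal matrix holds the eigenvalues $\lambda_1, \dots, \lambda_D$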

$$
\begin{align}
\Sigma &= V D V^T \\
&= \begin{pmatrix}
v_{11} & \cdots & v_{1D} \\
\vdots & \ddots & \vdots \\
v_{D1} & \cdots & v_{DD}
\end{pmatrix}
\begin{pmatrix}
\lambda_1 &  & 0 \\
 & \ddots &  \\
0 & & \lambda_D
\end{pmatrix}
\begin{pmatrix}
v_{11} & \cdots & v_{D1} \\
\vdots & \ddots & \vdots \\
v_{1D} & \cdots & v_{DD}
\end{pmatrix}
\end{align}
$$

- All eigenvalues in the diagonal matrix will be non-negative
- The determinant $|\Sigma|$ is equal to $\prod_{i=1}^D \lambda_i$
- The sum of variances $\sum_{i=1}^D \sigma_{ii}$ is equal to the sum of eigenvalues $\sum_{i=1}^D \lambda_i$
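
These properties can be verified numerically. The sketch below assumes NumPy and random data; `np.linalg.eigh` is used because it is intended for symmetric matrices and returns orthonormal eigenvectors as the columns of `V`.

```python
import numpy as np

# Illustrative data: N = 200 samples, D = 4 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
sigma = np.cov(X, rowvar=False, bias=True)

# Eigenvalues are returned in ascending order, eigenvectors as columns of V.
eigenvalues, V = np.linalg.eigh(sigma)

# Sigma = V diag(lambda) V^T
reconstructed = V @ np.diag(eigenvalues) @ V.T

# Determinant as the product of eigenvalues, trace as their sum.
det_from_eigs = np.prod(eigenvalues)
trace_from_eigs = np.sum(eigenvalues)
```

Here `det_from_eigs` should match `np.linalg.det(sigma)` and `trace_from_eigs` should match `np.trace(sigma)` up to floating point error.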

From the rank of $\Sigma$ we can obtain the following:

$$
\begin{align}
\operatorname{rank}(\Sigma) = D \rightarrow & \forall_i : \lambda_i > 0 \\
& \forall_{i \ne j} : \mathbf{v}_i \bot \mathbf{v}_j \\
& | \Sigma | > 0\\
\operatorname{rank}(\Sigma) < D \rightarrow & \exists_i : \lambda_i = 0 \\
& \exists_{(i,j)} : \rho(x_i, x_j) \\
& | \Sigma | = 0\\
\end{align}
$$
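
As an illustration of the two cases, the sketch below compares a data set with three independent variables against one whose third variable is the sum of the first two. It assumes NumPy and random data; the tolerance `tol=1e-10` passed to `np.linalg.matrix_rank` is an arbitrary choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 2))

# Full rank case: three independent variables.
X_full = np.column_stack([A, rng.normal(size=100)])
# Rank deficient case: third variable is a linear combination of the others.
X_dep = np.column_stack([A, A[:, 0] + A[:, 1]])

sigma_full = np.cov(X_full, rowvar=False, bias=True)
sigma_dep = np.cov(X_dep, rowvar=False, bias=True)

# Numerical rank; tol is an assumed threshold for "zero" singular values.
rank_full = np.linalg.matrix_rank(sigma_full, tol=1e-10)
rank_dep = np.linalg.matrix_rank(sigma_dep, tol=1e-10)

det_full = np.linalg.det(sigma_full)
det_dep = np.linalg.det(sigma_dep)

# Smallest eigenvalue of the rank deficient matrix (zero up to rounding).
smallest_eig_dep = np.linalg.eigvalsh(sigma_dep)[0]
```

With this data `rank_full` is 3 and `rank_dep` is 2, while `det_dep` and `smallest_eig_dep` are zero up to floating point error.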

## Problems with estimation of covariance matrix

$|\Sigma|$ is close to (or exactly) zero when

- $N$ is not large enough compared to $D$; in particular, $|\Sigma| = 0$ for $N \leq D$
- There is high dependence (correlation) between variables ($|\rho(x_i, x_j)| \approx 1$)

Due to the nature of floating point numbers, computing the inverse $\Sigma^{-1}$ is numerically unstable when $|\Sigma|$ is small. Some solutions are:

- Share $\Sigma$ among classes
- Assume independence among variables (the naive Bayes assumption); this yields a diagonal covariance matrix rather than a full covariance matrix
- Reduce the dimensionality of the data (e.g. PCA)
- Add a small positive constant $\epsilon$ to the diagonal elements, $\Sigma + \epsilon I$
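
Two of these remedies, the diagonal covariance matrix and the added constant $\epsilon$, can be sketched as follows. The example assumes NumPy, random data with $N \leq D$, and an arbitrary value `epsilon = 1e-3`; a suitable value depends on the scale of the data.

```python
import numpy as np

# Fewer samples than variables, so the estimated covariance is singular.
rng = np.random.default_rng(3)
N, D = 5, 10
X = rng.normal(size=(N, D))
sigma = np.cov(X, rowvar=False, bias=True)

# Numerical rank is at most N - 1 because the mean is estimated from the data.
rank_before = np.linalg.matrix_rank(sigma, tol=1e-10)

# Remedy 1: add a small positive constant to the diagonal.
epsilon = 1e-3  # assumed value, chosen only for illustration
sigma_reg = sigma + epsilon * np.eye(D)
sigma_reg_inv = np.linalg.inv(sigma_reg)
cond_after = np.linalg.cond(sigma_reg)

# Remedy 2: keep only the variances (diagonal covariance matrix).
sigma_diag = np.diag(np.diag(sigma))
sigma_diag_inv = np.diag(1.0 / np.diag(sigma))
```

After regularisation every eigenvalue is shifted up by `epsilon`, so `sigma_reg` is invertible. The diagonal matrix is invertible whenever every variable has non-zero variance.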

To share $\Sigma$ among classes $1, \dots, K$, we take the average of the per-class covariance matrices $\Sigma_k$

$$
\Sigma = \frac{1}{K} \sum_{k=1}^K \Sigma_k
$$
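
A minimal sketch of this averaging, assuming NumPy, a data matrix `X` and an integer label vector `y`; the function name `shared_covariance` and the synthetic data are illustrative only.

```python
import numpy as np


def shared_covariance(X, y):
    """Unweighted average of the per-class covariance matrices."""
    classes = np.unique(y)
    per_class = [np.cov(X[y == k], rowvar=False, bias=True) for k in classes]
    return np.mean(per_class, axis=0)


# Illustrative data: K = 3 classes with 30 samples each, D = 3 variables.
rng = np.random.default_rng(4)
X = rng.normal(size=(90, 3))
y = np.repeat([0, 1, 2], 30)

sigma_shared = shared_covariance(X, y)
```

The formula above gives every class the same weight. Weighting each $\Sigma_k$ by its share of the samples is a common variant when class sizes differ, but it is not the form used here.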