# Covariance matrix for images

## Background and Motivation

**Variational Autoencoders (VAE) short summary** 
1. VAEs are models that map the input data to a low-dimensional, stochastic latent space. This encoding step is performed by a deep neural network that outputs the parameters of a normal distribution.
1. A second neural network receives samples from the latent space and maps them back to input space. This generator, or decoder, has an explicit distribution (e.g. Bernoulli, Gaussian).
1. The model is trained by minimizing the divergence between the latent space distribution and a prior (unit normal) plus the negative log-likelihood of the original data under the decoder distribution.

Here we focus on the decoder/generator which is typically assumed to be

$$
p_\theta(x|z) = \mathcal{N}(\mu_\theta(z), I)
$$

where $\mu_\theta(z)$ is an artificial neural network that outputs the mean of this Gaussian distribution.

> The variance is, in general, not modeled and is set to the identity matrix. In this case maximizing the log-likelihood is equivalent to minimizing the Mean Square Error. The next most general formulation is $\mathcal{N}(\mu_\theta(z), I\sigma^2)$, where $\sigma$ is a scalar that is either hard-coded/tuned for the whole dataset [add ref] or learned per data sample [add ref]
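
The link between the identity-covariance Gaussian and the Mean Square Error can be checked numerically. The following is a minimal sketch: the toy dimension `D` and the random vectors standing in for an image `x` and a decoder mean `mu` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                   # number of pixels (hypothetical toy size)
x = rng.normal(size=D)   # stand-in for a flattened observed image
mu = rng.normal(size=D)  # stand-in for the decoder mean mu_theta(z)

# Negative log-likelihood of x under N(mu, I)
const = 0.5 * D * np.log(2 * np.pi)
nll = 0.5 * np.sum((x - mu) ** 2) + const

# Mean square error between x and mu
mse = np.mean((x - mu) ** 2)

# NLL = (D / 2) * MSE + constant, so both are minimized by the same mu
print(np.isclose(nll, 0.5 * D * mse + const))
```

The constant and the factor $D/2$ do not depend on $\mu$, so the two objectives share the same minimizer.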

**A spherical Gaussian is not a good model for images**, because pixels are highly correlated. In practice, VAEs applied to images produce blurry outputs. What researchers usually show is only the mean $\mu_\theta(z)$, since sampling from $\mathcal{N}(\mu_\theta(z), I)$ does not make much sense: it only adds white noise to the mean.

> Correlations between pixels can be modeled with the covariance. But including the covariance increases the cost of the model considerably

We could exploit a middle ground with simple covariance structures that provide correlation between pixels without incurring large costs, such as in [(Dorta et al. 2018)](http://openaccess.thecvf.com/content_cvpr_2018/papers/Dorta_Structured_Uncertainty_Prediction_CVPR_2018_paper.pdf).



## Covariance and precision matrix

The precision matrix is the inverse of the covariance matrix

$$
\Lambda = \Sigma^{-1}
$$

If we have an $N\times N$ image then the covariance (and precision) is $N^2 \times N^2$ (yikes)

Both matrices are square, symmetric and positive definite, but their interpretation is different


### A zero in the precision matrix

> a zero at $\Lambda[i, j]$ means that pixels $i$ and $j$ are conditionally independent given all the other pixels

### A zero in the covariance matrix

> a zero at $\Sigma[i, j]$ means that pixels $i$ and $j$ are (marginally) independent

Let's say we want a one-pixel neighbourhood, i.e. we want pixels $(i,j)$ and $(i+2,j)$ to be conditionally independent given pixel $(i+1,j)$.

Typically a sparse precision matrix translates to a dense covariance matrix as seen in the GRF example below
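
A three-pixel toy case makes the difference between the two kinds of zero concrete. This is a minimal sketch: the matrix values are hypothetical, chosen only so that the precision is positive definite and has a zero linking the two outer pixels.

```python
import numpy as np

# Toy precision for three pixels in a column: 0 - 1 - 2
# Lam[0, 2] = 0, so pixels 0 and 2 are only linked through pixel 1
Lam = np.array([[1.0, 0.4, 0.0],
                [0.4, 1.0, 0.4],
                [0.0, 0.4, 1.0]])
Sigma = np.linalg.inv(Lam)

a = [0, 2]  # outer pixels
b = [1]     # middle pixel, the one we condition on

# Covariance of the outer pixels given the middle one (Schur complement)
cond_cov = (Sigma[np.ix_(a, a)]
            - Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)]) @ Sigma[np.ix_(b, a)])

print(Sigma[0, 2])     # non-zero: the outer pixels are marginally correlated
print(cond_cov[0, 1])  # zero up to round-off: independent given the middle pixel
```

The covariance has no zeros even though the precision does, which is the same pattern that appears at image scale.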

Also the log likelihood of the Normal distribution uses the precision matrix
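
For reference, a worked form of that log-likelihood for a $D$-dimensional Gaussian with $D = N^2$ pixels (the symbol $D$ is introduced here only for illustration):

$$
\log \mathcal{N}(x | \mu, \Lambda^{-1}) = -\frac{D}{2} \log(2\pi) + \frac{1}{2} \log |\Lambda| - \frac{1}{2} (x - \mu)^T \Lambda (x - \mu)
$$

Evaluating it needs $\Lambda$ and its log-determinant but no matrix inversion, whereas the same expression written in terms of $\Sigma$ needs $\Sigma^{-1}$.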

> These reasons motivate us to model the precision instead of the covariance


## [Gaussian random field 1D](https://books.google.cl/books?id=TLBYs-faw-0C&pg=PA2&lpg=PA2&dq=zero+precision+covariance+independence&source=bl&ots=RN2KsIUCo_&sig=ACfU3U32Np2IbjmZkt9FY5RVGeC2L8ZG5Q&hl=es&sa=X&ved=2ahUKEwjppZT_xOPpAhXAILkGHQHyCCMQ6AEwBHoECAoQAQ#v=onepage&q=zero%20precision%20covariance%20independence&f=false)

Let's consider the autoregressive (AR) process with $|\phi| < 1$ and white noise $\epsilon_t$ (with unit variance)

$$
x_{t+1} = x_t \phi + \epsilon_t
$$

This process is Markov: given $x_t$, the next value $x_{t+1}$ is conditionally independent of all earlier values

$$
x_{t+1} | x_t, x_{t-1}, \ldots, x_1 = x_{t+1} | x_t \sim \mathcal{N}(\phi x_t, 1)
$$

The associated precision matrix is sparse (tridiagonal), assuming the process starts from its stationary distribution $x_1 \sim \mathcal{N}(0, 1/(1-\phi^2))$

$$
\Lambda = \begin{pmatrix}
1 & -\phi & 0 & 0 & \ldots & 0 \\
-\phi & 1+ \phi^2 & -\phi & 0 & \ldots & 0 \\
0 & -\phi & 1+ \phi^2 & -\phi &  \ldots & 0 \\
\vdots & \vdots & \vdots & \vdots &  \ddots & \vdots \\
0 & 0 & 0 & 0 &  \ldots & 1 \\
\end{pmatrix}
$$

but the covariance is dense!
$$
\sigma_{ij} = \frac{1}{1-\phi^2} \phi^{|i-j|}
$$
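
The sparse/dense contrast can be verified numerically. This is a minimal sketch, assuming a stationary AR(1) process with unit noise variance, whose tridiagonal precision has $1$ at the two corner diagonal entries, $1+\phi^2$ on the rest of the diagonal and $-\phi$ off the diagonal. The values of `phi` and `T` are hypothetical.

```python
import numpy as np

phi, T = 0.7, 6  # hypothetical AR coefficient and sequence length

# Tridiagonal precision of the stationary AR(1) process
Lam = np.diag(np.full(T, 1.0 + phi**2))
Lam[0, 0] = Lam[-1, -1] = 1.0
idx = np.arange(T - 1)
Lam[idx, idx + 1] = -phi
Lam[idx + 1, idx] = -phi

# Closed-form covariance: phi^{|i-j|} / (1 - phi^2)
i, j = np.indices((T, T))
Sigma = phi ** np.abs(i - j) / (1.0 - phi**2)

# The inverse of the sparse precision matches the dense covariance
print(np.allclose(np.linalg.inv(Lam), Sigma))
print(np.count_nonzero(Lam), "non-zeros in Lam;", np.count_nonzero(Sigma), "in Sigma")
```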



## Modeling the precision matrix

Let's create a sparse precision matrix.

Then we can draw and inspect samples from a zero-mean multivariate Gaussian.

For every pixel we model the precision with its right and lower neighbours, which after symmetrization gives a 4-point neighbourhood.

We have to take care of pixels at the borders (avoid wrap-around).

![Screenshot_2020-06-04%20da%20pdf.png](attachment:Screenshot_2020-06-04%20da%20pdf.png)

Figure and discussion from: http://www.keysers.net/daniel/files/da.pdf
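
One possible way to build the 4-neighbour precision without wrap-around is through Kronecker products of a 1D chain adjacency. This is a minimal sketch, not necessarily the construction intended here: the helper name `grid_precision`, the row-major pixel ordering and the toy values of `N` and `k` are assumptions.

```python
import numpy as np

def grid_precision(N, k):
    """Precision of an N x N image (row-major order): 1 on the diagonal,
    k between each pixel and its 4 neighbours, no wrap-around at borders."""
    path = np.eye(N, k=1) + np.eye(N, k=-1)      # adjacency of a 1D chain of N pixels
    adjacency = (np.kron(np.eye(N), path)        # left/right neighbours
                 + np.kron(path, np.eye(N)))     # upper/lower neighbours
    return np.eye(N * N) + k * adjacency

N, k = 5, 0.2
P = grid_precision(N, k)

# Last pixel of row 0 and first pixel of row 1 are not neighbours
print(P[N - 1, N])

# Smallest eigenvalue; for this construction it equals 1 - 4|k| cos(pi / (N + 1)),
# so P is positive definite only when |k| < 1 / (4 cos(pi / (N + 1)))
min_eig = np.linalg.eigvalsh(P).min()
k_limit = 1.0 / (4.0 * np.cos(np.pi / (N + 1)))
print(min_eig, k_limit)
```

The limit on `k` approaches $1/4$ as $N$ grows, which gives a concrete range in which this matrix can be inverted and sampled from safely.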

In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
from ipywidgets import widgets

N = 20
fig, ax = plt.subplots(1, 3, figsize=(9, 3.5), tight_layout=True)

def update(k, seed):
    np.random.seed(seed)
    # Precision matrix of the flattened (row-major) N x N image, ones on the diagonal
    P = np.eye(N*N)
    i, j = np.indices(P.shape)
    # Right neighbour: skip pairs that straddle two image rows (no wrap-around)
    right = (i == j + 1) & (i % N != 0)
    # Lower neighbour
    lower = (i == j + N)
    # Fill both triangles so that P is symmetric
    P[right | right.T] = k
    P[lower | lower.T] = k
    # Invert once and reuse the covariance for plotting and sampling
    C = np.linalg.inv(P)
    for ax_ in ax:
        ax_.cla()
        ax_.axis('off')
    ax[0].imshow(P[:N*2, :N*2])
    ax[0].set_title('Precision (fragment)')
    ax[1].imshow(C[:N*2, :N*2])
    ax[1].set_title('Covariance (fragment)')
    ax[2].imshow(np.random.multivariate_normal(np.zeros(shape=(N*N)), C).reshape(N, N))
    ax[2].set_title('Image zero-mean MVN')

# P is positive definite only for |k| < 1/(4 cos(pi/(N+1))), about 0.25 for N = 20
widgets.interact(update, k=widgets.FloatSlider(min=-0.25, max=0.25, step=0.01, continuous_update=False),
                 seed=widgets.IntSlider(min=0, max=10, continuous_update=False));


## Cholesky factor

Note that to sample from a multivariate Gaussian we could use the reparameterization trick

$$
x = \mu + L \epsilon
$$

where $L L^T = \Sigma$
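
A minimal sketch of this sampling step follows. The toy mean and covariance are hypothetical values, and the sample size is chosen only so that the empirical check is stable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-pixel mean and covariance
mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.6],
                  [0.2, 0.6, 1.0]])

# Lower-triangular Cholesky factor, L @ L.T == Sigma
L = np.linalg.cholesky(Sigma)

# Reparameterization: x = mu + L eps with eps ~ N(0, I), one sample per column
eps = rng.standard_normal((3, 100_000))
x = mu[:, None] + L @ eps

# Empirical moments should be close to mu and Sigma
emp_mean = x.mean(axis=1)
emp_cov = np.cov(x)
print(np.round(emp_mean, 2))
print(np.round(emp_cov, 2))
```

All the randomness sits in `eps`, so gradients can flow through `mu` and `L` when they are outputs of a network.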

And also note that 

$$
\Lambda = \Sigma^{-1} = (L L^T)^{-1} = (L^T)^{-1} L^{-1} = (L^{-1})^T L^{-1}
$$

And inverting the Cholesky factor is easy, because it is triangular!

What is the cholesky factor of the sparse precision? 
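
A partial answer for the 1D case can be explored numerically. This is a minimal sketch that assumes the tridiagonal precision of a stationary AR(1) process with unit noise variance (diagonal $1, 1+\phi^2, \ldots, 1+\phi^2, 1$ and off-diagonal $-\phi$). The name `C` for the Cholesky factor of $\Lambda$ and the values of `phi` and `T` are hypothetical.

```python
import numpy as np

phi, T = 0.7, 6  # hypothetical AR coefficient and sequence length

# Tridiagonal precision of the stationary AR(1) process
Lam = np.diag(np.full(T, 1.0 + phi**2))
Lam[0, 0] = Lam[-1, -1] = 1.0
idx = np.arange(T - 1)
Lam[idx, idx + 1] = Lam[idx + 1, idx] = -phi

# Cholesky factor of the precision, C @ C.T == Lam
C = np.linalg.cholesky(Lam)

# Everything below the first sub-diagonal is zero: C keeps the band structure
below_band = np.tril(C, k=-2)
print(np.allclose(below_band, 0.0))

# Sampling without forming the covariance: x = C^{-T} eps has covariance Lam^{-1}
rng = np.random.default_rng(0)
eps = rng.standard_normal((T, 50_000))
x = np.linalg.solve(C.T, eps)
emp_cov = np.cov(x)
print(np.round(emp_cov[0, :3], 2))
```

Here `np.linalg.solve` is a general solver used for brevity. A dedicated triangular solver such as `scipy.linalg.solve_triangular` would exploit the structure of `C`.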

### TODO
- WHAT DOES A POSITIVE OR NEGATIVE $\Lambda[i, j]$ MEAN?
- GENERALIZE GRF TO 2D
- FIX WRAP AROUND FOR NEIGHBOURHOODS
- FIX NUMERICAL ERRORS ON INVERSION, FIND LIMIT TO $\Lambda_{ij}$
- MODEL THE CHOLESKY FACTOR OF $\Lambda$ instead of $\Lambda$

### Appendix

- sparse inverse covariance estimation
- How strongly pixels correlate with neighbours: https://dahtah.github.io/imager/image_stats.html
- Covariance of covariance features: https://aimagelab.ing.unimore.it/imagelab/pubblicazioni/2014ICMR_covariance.pdf