## Variational Autoencoder

The variational autoencoder (VA) belongs to the class of generative neural networks. In contrast to a Generative Adversarial Network (GAN), it is trained by maximising a lower bound on the log-likelihood of the data.

Additionally, the VA leverages an encoding scheme that maps an input, $\mathcal{X} \sim P(\mathcal{X} : \theta ^*)$, to some lower-dimensional vector space $\mathcal{z}$. That is, we have two non-linear maps, $\psi$ and $\phi$, which map to and from the latent space respectively. More formally:

$\psi: \mathcal{X} \rightarrow \mathcal{z}$ and $\phi: \mathcal{z} \rightarrow \mathcal{X}$
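To make the two maps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the single dense layer per map, the random weights, the `tanh` and sigmoid activations, and the dimensions 6 and 2. A real VA would use trained, deeper networks.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative only)."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer, activation):
    """Apply activation(W x + b) to the vector x."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

rng = random.Random(0)
input_dim, latent_dim = 6, 2  # hypothetical sizes, chosen for readability
encoder_layer = make_layer(input_dim, latent_dim, rng)
decoder_layer = make_layer(latent_dim, input_dim, rng)

def psi(x):
    """Encoder: maps an input vector to the lower-dimensional latent space."""
    return dense(x, encoder_layer, math.tanh)

def phi(z):
    """Decoder: maps a latent vector back to the input space."""
    return dense(z, decoder_layer, sigmoid)

x = [0.0, 0.2, 0.9, 1.0, 0.4, 0.1]
z = psi(x)
x_hat = phi(z)
print(len(x), len(z), len(x_hat))
```

The sigmoid on the decoder output keeps every reconstructed value in the open interval (0, 1), which is convenient when the inputs are pixel intensities scaled to that range.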

The objective of the VA is to minimize the squared pixel-wise distance between the input and its reconstruction or, more generally, to maximise a likelihood of the data under the decoder. This is to say that the mean squared error (MSE) and the binary cross-entropy are both valid reconstruction cost functions.
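The two reconstruction costs can be sketched as follows. Minimising the MSE corresponds to maximising a Gaussian likelihood with fixed variance, while the binary cross-entropy is the negative log-likelihood of a Bernoulli model per pixel. The function names, the clipping constant `eps` and the example vectors are assumptions made for this illustration.

```python
import math

def mean_squared_error(x, x_hat):
    """Average squared difference between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def binary_cross_entropy(x, x_hat, eps=1e-7):
    """Average Bernoulli negative log-likelihood; x_hat is clipped away from 0 and 1."""
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1.0 - eps)
        total -= a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
    return total / len(x)

x = [0.0, 1.0, 1.0, 0.0]     # target pixel values in [0, 1]
good = [0.1, 0.9, 0.8, 0.2]  # a reconstruction close to x
bad = [0.9, 0.1, 0.2, 0.8]   # a reconstruction far from x

print(mean_squared_error(x, good), mean_squared_error(x, bad))
print(binary_cross_entropy(x, good), binary_cross_entropy(x, bad))
```

Both costs rank the close reconstruction ahead of the distant one. The cross-entropy penalises confident mistakes more heavily than the MSE does.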

What differentiates a VA from an ordinary autoencoder is the restriction placed on the latent space. One imposes an additional cost that pushes the distribution of the latent variable towards a normal distribution with zero mean and unit variance. In practice this is achieved by adding a Kullback-Leibler divergence (KL-divergence) term to the loss function. This divergence is defined as:

$D_{KL}(P||Q) = \int^{\infty}_{-\infty} P(x) \log{\frac{P(x)}{Q(x)}}dx$

The KL-divergence measures how much the distribution $P$ differs from $Q$. It is non-negative, is not symmetric in its arguments, and equals zero when the two distributions are identical.

Of course, this is not an integral (or sum, in the discrete case) we want to compute numerically. It can be shown quite easily (given some patience and some nicely behaved Gaussian integrals) that for two normal distributions, $P$ and $Q$, with means $\mu_1$ and $\mu_2$ and corresponding standard deviations $\sigma_1$ and $\sigma_2$, the KL-divergence takes the following form:

$D_{KL}(P||Q) = \log{\frac{\sigma_2}{\sigma_1}} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2 ^2} - \frac{1}{2}$ 
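As a sanity check, the closed form for two normal distributions can be compared against a direct numerical evaluation of the defining integral. The sketch below uses a simple midpoint rule. The integration limits, the step count and the example parameters are assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kl_closed_form(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(P || Q) for P = N(mu1, sigma1^2) and Q = N(mu2, sigma2^2)."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

def kl_numeric(mu1, sigma1, mu2, sigma2, lo=-20.0, hi=20.0, steps=20000):
    """Midpoint-rule approximation of the integral of P(x) log(P(x) / Q(x))."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        p = normal_pdf(x, mu1, sigma1)
        q = normal_pdf(x, mu2, sigma2)
        if p > 0.0 and q > 0.0:  # skip points where a density underflows to zero
            total += p * math.log(p / q) * dx
    return total

print(kl_closed_form(0.5, 1.2, 0.0, 1.0))
print(kl_numeric(0.5, 1.2, 0.0, 1.0))
print(kl_closed_form(0.0, 1.0, 0.0, 1.0))
```

The first two printed values should agree to several decimal places, and the last call evaluates the divergence between two identical distributions.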

This simplifies further when, as is the case for the VA, the target distribution is the standard normal, $Q \sim \mathcal{N}(0, 1)$, and $P$ is the distribution produced by the encoder:

$D_{KL}(P||Q) = -\log{\sigma_1} + \frac{\sigma_1^2 + \mu_1^2}{2} - \frac{1}{2}$ 
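A minimal sketch of the resulting regulariser follows. It assumes the encoder outputs a mean and a standard deviation for each latent dimension and that the dimensions are independent, so the per-dimension divergences add up. It takes the encoder distribution as the first argument of the divergence and $\mathcal{N}(0, 1)$ as the second, which is the usual convention in VAE implementations. The function name and the example values are hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Sum over latent dimensions of D_KL(N(mu_i, sigma_i^2) || N(0, 1))."""
    return sum(-math.log(s) + (s ** 2 + m ** 2) / 2.0 - 0.5
               for m, s in zip(mu, sigma))

# Hypothetical encoder outputs for a three-dimensional latent space.
mu = [0.0, 0.5, -1.0]
sigma = [1.0, 0.8, 1.5]

print(kl_to_standard_normal(mu, sigma))            # penalty for deviating from N(0, 1)
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # already standard normal
```

In training, this term would be added to the reconstruction cost. How the two are weighted against each other is a design choice that this sketch leaves open.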

Traditionally, VAs have struggled with problems related to the lower-bound approximation and to the context-free prediction of each pixel value. Numerous models have been proposed to remedy these issues, but they will not be covered just yet.

To summarize, the VA has two parts, the encoder and the decoder, as shown in the figures below.

<img src="clustering_cnn_representations/images/encoder.png" width="800" /> 

```python
import keras as ker
import matplotlib.pyplot as plt
import sklearn
```

