# Variational AutoEncoder

**Resources**
* [Jeremy Jordan's Blog](https://www.jeremyjordan.me/variational-autoencoders/)
* [Excellent Medium Article](https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf)

*Note*: Content and resources are taken from the links above.

## AutoEncoders
In a standard autoencoder, the encoder converts input data into an *encoding vector where each dimension represents some learned attribute about the data*. It outputs a *single value* for each dimension.

For example, with a latent vector of dimension 6 trained on faces, an autoencoder might learn descriptive attributes such as skin color or whether or not the person is wearing glasses, in an attempt to describe an observation in a compressed representation.

![AutoEncoder](https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-16-at-10.24.11-PM.png)

### Issues
AutoEncoders are only good at reconstructing the original input data. They do not allow us to generate variations of the data.  

The fundamental problem with autoencoders, for generation, is that the latent space into which they encode their inputs may not be continuous or allow easy interpolation.

Ideally, to generate new images, we randomly sample from the latent space and pass the samples to the decoder. If the space has discontinuities (e.g. gaps between clusters) and we sample from one of them, the decoder has never seen encodings from that region during training, so it will simply generate unrealistic output.

![Distinct Clusters in latent space](https://miro.medium.com/max/1000/1*-i8cp3ry4XS-05OWPAJLPg.png)


## Variational AutoEncoders (VAEs)

VAEs allow us to generate random, new output that looks similar to the training data.

VAEs provide a probability distribution for describing each attribute/feature in the latent space. Instead of describing each latent attribute with a single value, they represent it as a range of possible values described by a probability distribution. VAEs have latent spaces that are continuous, allowing easy random sampling and interpolation.

![VAE](https://www.jeremyjordan.me/content/images/2018/06/Screen-Shot-2018-06-20-at-2.47.56-PM.png)

When decoding, we randomly sample from each latent state distribution to generate a vector as input for the decoder model.

For any sampling of the latent distributions, we expect the decoder model to accurately reconstruct the input, i.e. nearby values of each feature in the latent space should produce similar reconstructions.

![Sampled Decoding](https://www.jeremyjordan.me/content/images/2018/06/Screen-Shot-2018-06-20-at-2.48.42-PM.png)
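
The sampling step described above can be sketched in a few lines. This is a minimal illustration, not a full VAE: it assumes the encoder outputs a mean and a log-variance per latent dimension (a common convention, but an assumption here), and it draws the sample as `mu + sigma * eps`, a formulation usually called the reparameterization trick. The names `sample_latent`, `mu` and `log_var` are hypothetical, and plain Python is used so the snippet runs without dependencies.

```python
import math
import random


def sample_latent(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)  # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)   # noise drawn independently of the encoder outputs
        z.append(m + sigma * eps)
    return z


# Hypothetical encoder outputs for a 3-dimensional latent space.
mu = [0.5, -1.0, 2.0]
log_var = [-2.0, -2.0, -2.0]  # sigma = exp(-1), roughly 0.37

rng = random.Random(0)  # seeded only to make the illustration repeatable
samples = [sample_latent(mu, log_var, rng) for _ in range(3)]
for z in samples:
    print([round(v, 3) for v in z])
```

Each printed vector differs slightly but stays close to `mu`, which is what lets the decoder learn that nearby latent values should map to similar reconstructions.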

### Solution
Suppose that there exists some hidden variable $z$ which generates an observation $\hat{x}$. To infer the characteristics of $z$ given the observation $\hat{x}$, we apply Bayes' rule:

$$p\left({z|\hat{x}}\right) = \frac{{p\left({\hat{x}|z}\right)p\left( z \right)}}{{p\left(\hat{x}\right)}}$$

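Computing the denominator is difficult: $p\left(\hat{x}\right) = \int p\left({\hat{x}|z}\right)p\left(z\right)dz$ requires integrating over every possible value of $z$, which is usually intractable.

Variational inference lets us approximate $p\left({z|\hat{x}}\right)$ with another, tractable distribution $q\left({z|\hat{x}}\right)$. We want $q\left({z|\hat{x}}\right)$ to be as similar as possible to $p\left({z|\hat{x}}\right)$, which we can achieve by minimizing the KL divergence between them:

$$\min KL\left( {q\left( {z|\hat{x}} \right)||p\left( {z|\hat{x}} \right)} \right)$$

Minimizing this divergence is equivalent to maximizing the following objective:

$${E_{q\left( {z|\hat{x}} \right)}}\log p\left( {\hat{x}|z} \right) - KL\left( {q\left( {z|\hat{x}} \right)||p\left( z \right)} \right)$$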
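
The link between the KL minimization and the second expression above may not be obvious. A short derivation, sketched here using only the definition of KL divergence and Bayes' rule, shows how the two relate:

$$\begin{aligned}
KL\left( {q\left( {z|\hat{x}} \right)||p\left( {z|\hat{x}} \right)} \right)
&= {E_{q\left( {z|\hat{x}} \right)}}\left[ \log q\left( {z|\hat{x}} \right) - \log p\left( {z|\hat{x}} \right) \right] \\
&= {E_{q\left( {z|\hat{x}} \right)}}\left[ \log q\left( {z|\hat{x}} \right) - \log p\left( {\hat{x}|z} \right) - \log p\left( z \right) \right] + \log p\left( \hat{x} \right) \\
&= - {E_{q\left( {z|\hat{x}} \right)}}\log p\left( {\hat{x}|z} \right) + KL\left( {q\left( {z|\hat{x}} \right)||p\left( z \right)} \right) + \log p\left( \hat{x} \right)
\end{aligned}$$

Rearranging gives

$$\log p\left( \hat{x} \right) - KL\left( {q\left( {z|\hat{x}} \right)||p\left( {z|\hat{x}} \right)} \right) = {E_{q\left( {z|\hat{x}} \right)}}\log p\left( {\hat{x}|z} \right) - KL\left( {q\left( {z|\hat{x}} \right)||p\left( z \right)} \right)$$

Since $\log p\left( \hat{x} \right)$ does not depend on $q$, making the KL divergence on the left smaller is the same as making the right-hand side larger. The right-hand side is commonly called the evidence lower bound (ELBO), a name the source notes do not use.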


**My Understanding**

The encoder essentially gives us $q(z|x)$. We can use it in place of $q\left({z|\hat{x}}\right)$ if $x$ and $\hat{x}$ are very close to each other, which is what the reconstruction loss encourages.

The neural network is encouraged to keep the learned distribution $q\left({z|\hat{x}}\right)$ close to the prior $p(z)$, which is assumed to be a unit Gaussian distribution. This is achieved by adding the second term of the above equation to the loss function. Because that objective is maximized while a loss is minimized, the KL term enters the loss with a positive sign:

$${\cal L}\left( {x,\hat x} \right) + \sum\limits_j {KL\left( {{q_j}\left( {z|x} \right)||p\left( z \right)} \right)}$$

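The second term sums the KL divergence over the latent dimensions $j$, penalizing encodings whose distributions drift away from the unit Gaussian prior.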
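
As a rough sketch of how the loss above could be computed: if each $q_j(z|x)$ is assumed to be a Gaussian with mean `mu[j]` and log-variance `log_var[j]` (an assumption, as is the choice of squared error for the reconstruction term), the KL divergence to a unit Gaussian has the closed form $\frac{1}{2}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$. The function names below are hypothetical, and plain Python is used so the snippet runs without dependencies.

```python
import math


def kl_to_unit_gaussian(mu, log_var):
    """Sum over latent dimensions j of KL(N(mu_j, sigma_j^2) || N(0, 1))."""
    return sum(
        0.5 * (m * m + math.exp(lv) - lv - 1.0)
        for m, lv in zip(mu, log_var)
    )


def squared_error(x, x_hat):
    """Reconstruction term L(x, x_hat); squared error is one common choice (assumed here)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction loss plus the summed KL term, weighted equally as in the formula."""
    return squared_error(x, x_hat) + kl_to_unit_gaussian(mu, log_var)


# An encoding that already matches the prior incurs no KL penalty.
print(kl_to_unit_gaussian([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Shifting one mean to 1.0 costs 0.5 * 1.0**2 = 0.5.
print(kl_to_unit_gaussian([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

The KL term is zero only when every latent dimension has mean 0 and variance 1, so it pulls all encodings toward the same region of the latent space and discourages the gaps between clusters described earlier.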


### Implementation