# Variational autoencoder (VAE)

A variational autoencoder (VAE) is a generative model that enforces a prior on the latent vector. Instead of mapping an input such as an image to a single point in latent space, the VAE encoder maps it to a normal distribution over that space.

Latent variables are a transformation of the data points into a continuous lower-dimensional space. Intuitively, the latent variables will describe the data in a simpler way.

In stricter mathematical terms, data points $x$ that follow a probability distribution $p(x)$ are mapped into latent variables $z$ that follow a distribution $p(z)$.
Given that idea, we can now define five basic terms:
- The prior distribution $p(z)$ that models the behaviour of the latent variables
- The likelihood $p(x|z)$ that defines how to map latent variables to the data points
- The joint distribution $p(x,z) = p(x|z)p(z)$, which is the product of the likelihood and the prior and essentially describes our model
- The marginal distribution $p(x)$, which is the distribution of the original data and the ultimate goal of the model. It tells us how likely the model is to generate a given data point.
- The posterior distribution $p(z|x)$ which describes the latent variables that can be produced by a specific data point

Suppose we somehow know the likelihood $p(x|z)$, the posterior $p(z|x)$, the marginal $p(x)$, and the prior $p(z)$.
Then, to generate a data point, we can sample $z$ from $p(z)$ and then sample the data point $x$ from $p(x|z)$.
But how do we find all those distributions?
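
The generation step described above (sample $z$ from the prior, then $x$ from the likelihood) can be sketched with a toy model. Everything concrete here is an assumption made for illustration: a one-dimensional standard normal prior, a linear-Gaussian likelihood, and the hypothetical names `sample_prior`, `sample_likelihood`, and `generate`. In a real VAE the likelihood would be parameterized by a decoder network.

```python
import random


def sample_prior(rng):
    """Draw a latent z from an assumed prior p(z) = N(0, 1)."""
    return rng.gauss(0.0, 1.0)


def sample_likelihood(z, rng, weight=2.0, bias=1.0, noise_std=0.1):
    """Draw x from an assumed toy likelihood p(x|z) = N(weight * z + bias, noise_std^2)."""
    return rng.gauss(weight * z + bias, noise_std)


def generate(n, seed=0):
    """Generate n (z, x) pairs: first z ~ p(z), then x ~ p(x|z)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = sample_prior(rng)
        x = sample_likelihood(z, rng)
        samples.append((z, x))
    return samples


samples = generate(5)
for z, x in samples:
    print(f"z = {z:+.3f}  ->  x = {x:+.3f}")
```

Sampling is easy here only because both distributions were chosen by hand. The hard part, addressed next, is learning them from data.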

Maximum likelihood estimation is a well-established technique of estimating the parameters of a probability distribution so that the distribution fits the observed data. This is accomplished by maximizing a likelihood function.

A likelihood function measures the goodness of fit of a statistical model to a sample of data and it is formed from the joint probability distribution of the sample.
Mathematically we have: $$\theta^{ML} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_\theta(x_i)$$
This is a standard optimization problem. In general it cannot be solved analytically, so we use an iterative approach such as gradient descent.
In order to apply gradient descent, we need to calculate the gradient of the marginal log-likelihood function. Using simple calculus and Bayes' rule, we can prove that: $$\nabla_\theta \log p_\theta(x) = \int p_\theta(z|x) \nabla_\theta \log p_\theta(x,z)\,dz$$
In order to compute the gradient, we need the posterior distribution $p_\theta(z|x)$. We are back to the problem of inference: finding the distribution of the latent variables given a data point.
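
For completeness, here is one way to obtain the gradient identity above. It is a sketch that assumes we may exchange differentiation and integration. It uses $p_\theta(x) = \int p_\theta(x,z)\,dz$, the identity $\nabla_\theta p = p\,\nabla_\theta \log p$, and Bayes' rule $p_\theta(z|x) = p_\theta(x,z)/p_\theta(x)$:

$$\begin{aligned}
\nabla_\theta \log p_\theta(x)
&= \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}
= \frac{1}{p_\theta(x)} \int \nabla_\theta p_\theta(x,z)\,dz \\
&= \int \frac{p_\theta(x,z)}{p_\theta(x)} \nabla_\theta \log p_\theta(x,z)\,dz
= \int p_\theta(z|x)\, \nabla_\theta \log p_\theta(x,z)\,dz
\end{aligned}$$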

When exact inference is intractable, we approximate it instead.
So we want to approximate the true posterior $p_\theta(z|x)$ with another distribution $q_\phi(z|x)$, called the variational posterior. We find the variational posterior by optimizing over a family of possible distributions with respect to the variational parameters $\phi$.

But how is the approximation problem actually formulated? We start from the marginal log-likelihood function.
Because the marginal log-likelihood is intractable, we instead optimize a lower bound $L_{\theta,\phi}(x)$ of it, known as a variational lower bound.
It can be proved that the lower bound is:
$$L_{\theta,\phi}(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] \le \log p_\theta(x)$$
This is commonly known as the Evidence Lower Bound (ELBO) and is the most common variational lower bound.
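
One standard way to see why this is a lower bound (a sketch, assuming $q_\phi(z|x) > 0$ wherever $p_\theta(x,z) > 0$) is to multiply and divide by $q_\phi(z|x)$ inside the marginal and apply Jensen's inequality to the concave logarithm:

$$\begin{aligned}
\log p_\theta(x)
&= \log \int p_\theta(x,z)\,dz
= \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x,z)}{q_\phi(z|x)}\right] \\
&\ge \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]
= L_{\theta,\phi}(x)
\end{aligned}$$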
Expanding the ELBO further, we obtain:
$$L_{\theta,\phi}(x) = \log p_\theta(x) - KL(q_\phi(z|x) \,||\, p_\theta(z|x))$$
KL refers to the Kullback-Leibler divergence, which in simple terms measures how different one probability distribution is from a second one.
The Kullback-Leibler divergence is defined as:
$$KL(P \,||\, Q) = \int_{-\infty}^{\infty} p(x) \log\left(\frac{p(x)}{q(x)}\right)dx$$
In the ELBO, this KL term is known as the variational gap. It expresses the difference between the variational posterior and the true posterior, so it is essentially a measure of how good our approximation is. Because the KL divergence is never negative, the ELBO can never exceed $\log p_\theta(x)$.
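
The relation between the ELBO, the log-evidence, and the variational gap can be checked numerically. This is a minimal sketch under assumed values: the latent $z$ is discrete with three states, so the integrals become sums, and the probabilities in `prior`, `likelihood`, and `q` are made up for illustration.

```python
import math

# Toy model: discrete latent z in {0, 1, 2} and one fixed observation x.
prior = [0.5, 0.3, 0.2]        # p(z), assumed values
likelihood = [0.1, 0.6, 0.3]   # p(x|z) evaluated at the observed x, assumed values
q = [0.2, 0.5, 0.3]            # variational posterior q(z|x), assumed values

joint = [p * l for p, l in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                               # p(x)
posterior = [j / evidence for j in joint]           # true posterior p(z|x)


def kl(p_dist, q_dist):
    """KL(P || Q) for discrete distributions given as lists of probabilities."""
    return sum(p * math.log(p / q_) for p, q_ in zip(p_dist, q_dist) if p > 0)


# ELBO = E_q[log p(x, z) - log q(z|x)]
elbo = sum(qz * math.log(j / qz) for qz, j in zip(q, joint))
log_evidence = math.log(evidence)
gap = kl(q, posterior)

print(f"log p(x)       = {log_evidence:.6f}")
print(f"ELBO           = {elbo:.6f}")
print(f"KL(q || post)  = {gap:.6f}")
print(f"log p(x) - KL  = {log_evidence - gap:.6f}")
```

The last two quantities printed should agree up to floating-point error. Setting `q` equal to `posterior` closes the gap, so the ELBO equals the log-evidence.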

So we need to maximize the ELBO with respect to both the model parameters $\theta$ and the variational parameters $\phi$. This means that we need to compute its gradients.
Let's start with the model parameters. Even where an exact gradient could be computed, a much more practical approach is Monte Carlo sampling: we draw a handful of samples from the variational posterior and average the resulting gradients. That way we estimate the gradients instead of calculating them in closed form.
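
Since $q_\phi(z|x)$ does not depend on $\theta$, the gradient of the ELBO with respect to $\theta$ is $\mathbb{E}_{q_\phi(z|x)}[\nabla_\theta \log p_\theta(x,z)]$, which can be estimated by averaging over samples. The sketch below illustrates this on an assumed toy model, $p(z) = N(0,1)$ and $p_\theta(x|z) = N(z + \theta, 1)$, with a fixed Gaussian $q$. The model and the names `grad_theta_log_joint` and `mc_grad_theta` are hypothetical, chosen so that the exact answer is available for comparison.

```python
import random


def grad_theta_log_joint(x, z, theta):
    """d/dtheta of log p_theta(x, z) for the assumed toy model
    p(z) = N(0, 1), p_theta(x|z) = N(z + theta, 1)."""
    return x - z - theta


def mc_grad_theta(x, theta, q_mean, q_std, num_samples, seed=0):
    """Monte Carlo estimate of E_q[grad_theta log p_theta(x, z)],
    with z drawn from q(z|x) = N(q_mean, q_std^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)
        total += grad_theta_log_joint(x, z, theta)
    return total / num_samples


x, theta, q_mean, q_std = 2.0, 0.5, 0.8, 0.5
estimate = mc_grad_theta(x, theta, q_mean, q_std, num_samples=20000)
# Closed form exists only because the toy model is linear-Gaussian.
exact = x - q_mean - theta

print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"Exact gradient:       {exact:.4f}")
```

In practice, VAE training often uses very few samples per data point, relying on minibatch averaging to keep the estimate's noise manageable.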


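When it comes to the variational parameters, things are a little trickier because the ELBO is an expectation over $q_\phi(z|x)$, the very distribution that depends on $\phi$, so the gradient cannot simply be moved inside the expectation. Luckily, we can use the reparameterization trick: for a Gaussian variational posterior we write $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim N(0, I)$, which moves the randomness into $\epsilon$ and makes $z$ a differentiable function of $\phi$. We can then estimate the gradient of the ELBO with respect to the variational parameters and run backpropagation.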
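
The reparameterization trick can be illustrated on a small stand-in problem. This sketch does not use the ELBO itself: it assumes the simple objective $f(z) = z^2$ under $z \sim N(\mu, \sigma^2)$, for which $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$. The function name `reparam_grads` is hypothetical, and the derivatives are written by hand where a deep learning framework would use automatic differentiation.

```python
import random


def reparam_grads(mu, sigma, num_samples, seed=0):
    """Estimate the gradients of E_{z ~ N(mu, sigma^2)}[z^2] w.r.t. mu and sigma
    by writing z = mu + sigma * eps with eps ~ N(0, 1)."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on mu or sigma
        z = mu + sigma * eps        # differentiable function of mu and sigma
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples


mu, sigma = 1.5, 0.7
g_mu, g_sigma = reparam_grads(mu, sigma, num_samples=50000)

print(f"d/dmu    estimate {g_mu:.3f}  exact {2 * mu:.3f}")
print(f"d/dsigma estimate {g_sigma:.3f}  exact {2 * sigma:.3f}")
```

Because the samples of `eps` are independent of the parameters, the gradient passes through `z` by the ordinary chain rule, which is what lets backpropagation reach the encoder parameters in a VAE.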