# Auto-Encoding Variational Bayes

> see the paper [Auto-Encoding Variational Bayes](https://arxiv.org/pdf/1312.6114.pdf), Diederik P Kingma, Max Welling, ICLR 2014

> see slide 1: http://lear.inrialpes.fr/~verbeek/tmp/AEVB.jjv.pdf


> see slide 2: https://www.slideshare.net/mehdidc/auto-encodingvariationalbayes-54478304


## What is a generative model?

- A model of how the data $X$ was generated
- Typically, the goal is to model $p(x)$ or $p(x, y)$
- $y$ can be a set of latent (hidden) variables or, for discriminative problems, a set of output variables
- Note: `latent variables`, as opposed to observable variables, are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called `latent variable models`.

## Training generative models
Typically, we assume a parametric form of the probability density: 
$$p ( x | \theta)$$

Given an i.i.d. dataset $X = ( x_1 , x_2 , \dots, x_N )$, we typically do one of:
- Maximum likelihood (ML): $$\operatorname*{argmax}_\theta p( X | \theta)$$
- Maximum a posteriori (MAP): $$ \operatorname*{argmax}_\theta p ( X | \theta) p(\theta)$$
- Bayesian inference: $$p (\theta | X) = \frac{p (X | \theta) p (\theta)} { \int p (X | \theta) p (\theta) d\theta}$$
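To make the difference between ML and MAP concrete, here is a minimal sketch for a toy case that is not from the paper: estimating the mean $\theta$ of a Gaussian with known variance, under an assumed Gaussian prior $p(\theta) = N(0, 1)$. The function names and the synthetic data are illustrative only.

```python
import random
import statistics

def ml_mean(xs):
    """ML estimate of a Gaussian mean with known variance: the sample mean."""
    return statistics.fmean(xs)

def map_mean(xs, sigma2=1.0, prior_mean=0.0, prior_var=1.0):
    """MAP estimate of the mean under an assumed N(prior_mean, prior_var) prior."""
    n = len(xs)
    precision = n / sigma2 + 1.0 / prior_var
    return (sum(xs) / sigma2 + prior_mean / prior_var) / precision

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(20)]  # synthetic i.i.d. dataset

theta_ml = ml_mean(data)
theta_map = map_mean(data)  # shrunk towards the prior mean 0
```

With these settings the MAP estimate equals the ML estimate scaled by $N / (N + 1)$, so the prior's influence vanishes as the dataset grows.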

## The problem

- let $x$ be the observed variables
- we assume a latent representation $z$ (latent variables, as noted above, are inferred from the observed variables rather than measured directly)
- we define $p_\theta(z)$ and $p_\theta ( x | z )$

We want to design a generative model where: 
- **marginal** $p_\theta( x ) = \int p_\theta( x | z ) p_\theta ( z) dz$  is intractable
- **posterior** $p_\theta(z|x) = \frac{p_\theta (x|z) p_\theta(z)}{p_\theta(x)}$ is intractable
- we have **large datasets**: we want to avoid sampling based training procedures (e.g., MCMC)

## The proposed solution

The authors propose:
- a fast training procedure that estimates the parameters $\theta$: for **data generation**

- an approximation of the posterior $p_\theta (z | x) $ : for **data representation**

- an approximation of the marginal $p_\theta( x )$ : for **model evaluation and as a prior for other tasks**

## Formulation of the problem

The process of generation consists of sampling $z$ from the prior $p_\theta(z)$, then sampling $x$ from $p_\theta( x | z )$.

Let's define:
- a prior over the latent representation $p_\theta(z)$,
- a **decoder**  $p_\theta( x | z )$

We want to maximize the log-likelihood of the data $( x^{( 1 )} , x^{ ( 2 )} , \dots, x^{( N )})$:

$$\log p_\theta ( x^{( 1 )} , x^{ ( 2 )} , \dots, x^{( N )}) = \sum_{i=1}^{N} \log p_\theta(x^{(i)})$$

and to be able to do inference, i.e. to compute (or approximate) the posterior $p_\theta(z | x)$
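As an illustration of the generative process and of the marginal $p_\theta(x) = \int p_\theta(x|z) p_\theta(z) dz$, the sketch below uses a toy model chosen so that the exact marginal is known: $z \sim N(0, 1)$ and $x | z \sim N(z, 1)$, which gives $p(x) = N(0, 2)$. This toy model and the function names are assumptions for illustration; in the setting of the paper the marginal has no closed form.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def sample_x(rng):
    """Ancestral sampling: first z from the prior, then x from the decoder."""
    z = rng.gauss(0.0, 1.0)   # z ~ p(z) = N(0, 1)
    return rng.gauss(z, 1.0)  # x ~ p(x | z) = N(z, 1)

def marginal_mc(x, rng, n_samples=20000):
    """Naive Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)]."""
    total = sum(normal_pdf(x, rng.gauss(0.0, 1.0), 1.0) for _ in range(n_samples))
    return total / n_samples

rng = random.Random(0)
x_obs = 0.7
estimate = marginal_mc(x_obs, rng)
exact = normal_pdf(x_obs, 0.0, 2.0)  # known only because the toy model is Gaussian
```

Sampling from the prior works here because $z$ is one-dimensional; with high-dimensional latents, most prior samples explain $x$ poorly, which is one motivation for learning an approximate posterior.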

## The variational lower bound


<img src="../files/auto-encoding-variational-bayes-fig1.png" alt="drawing" width="700"/>

- We will learn an approximation $q_\phi ( z | x )$ of the intractable posterior $p_\theta ( z | x )$ by maximizing a lower bound of the log-likelihood of the data

- We can write :

$$\log p_\theta (x) = D_{KL}( q_\phi ( z|x) || p_\theta(z|x)) + L(\theta, \phi, x )$$ where:

$$L(\theta, \phi, x ) = \mathbb{E}_{q_\phi ( z|x)} [ \log p_\theta( x, z ) - \log q_\phi( z | x)] $$

- $L(\theta, \phi, x )$ is called the **variational lower bound**, and the goal is to maximize it w.r.t. all the parameters $(\theta, \phi)$
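The decomposition above can be checked in a few lines, using only the definition of the KL divergence and $p_\theta(z|x) = p_\theta(x, z) / p_\theta(x)$. This is a short derivation sketch, not quoted from the paper:

$$\begin{aligned}
D_{KL}( q_\phi ( z|x) || p_\theta(z|x)) &= \mathbb{E}_{q_\phi ( z|x)} [ \log q_\phi( z | x) - \log p_\theta( z | x ) ] \\
&= \mathbb{E}_{q_\phi ( z|x)} [ \log q_\phi( z | x) - \log p_\theta( x, z ) ] + \log p_\theta (x) \\
&= - L(\theta, \phi, x ) + \log p_\theta (x)
\end{aligned}$$

Since the KL divergence is non-negative, $\log p_\theta(x) \geq L(\theta, \phi, x)$, which is why $L$ is a lower bound. The bound is tight when $q_\phi(z|x)$ equals the true posterior.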

## Estimating the lower bound gradients

- We need to compute $\frac{\partial L(\theta, \phi, x )}{\partial \theta}$ and $\frac{\partial L(\theta, \phi, x )}{\partial \phi}$ to apply gradient ascent (we are maximizing $L$)

- For that, we use the **reparametrisation trick**: we sample a noise variable $\epsilon \sim p(\epsilon)$ and apply a deterministic function to it so that we obtain correct samples from $q_\phi ( z | x )$. This moves the randomness into $\epsilon$, which does not depend on $\phi$, so the gradient w.r.t. $\phi$ can be taken inside the expectation. Meaning:

  - if $\epsilon \sim p(\epsilon)$, we find $g$ such that $z = g(x, \phi, \epsilon)$ implies $z \sim q_\phi (z | x)$

  - $g$ can be the **inverse CDF** of $q_\phi ( z | x )$ if $\epsilon$ is uniform on $[0, 1]$

- With the reparametrisation trick we can rewrite $L$:
$$L(\theta, \phi, x ) = \mathbb{E}_{ \epsilon \sim p(\epsilon)} [\log p_\theta( x, g(x, \phi, \epsilon) ) - \log q_\phi ( g(x, \phi, \epsilon) | x)]$$

- We then estimate this expectation and its gradients with **Monte Carlo**, averaging over samples $\epsilon^{(l)} \sim p(\epsilon)$
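The following minimal sketch shows the reparametrisation trick combined with Monte Carlo on a toy objective rather than on $L$ itself: it estimates the gradients of $\mathbb{E}_{z \sim N(\mu, \sigma^2)}[z^2]$ with respect to $\mu$ and $\sigma$. The objective $f(z) = z^2$ and the function name are assumptions chosen because the exact answer is known ($\mathbb{E}[z^2] = \mu^2 + \sigma^2$).

```python
import random

def reparam_gradients(mu, sigma, n_samples=20000, seed=0):
    """Monte Carlo gradients of E_{z ~ N(mu, sigma^2)}[z^2] w.r.t. mu and sigma."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # eps ~ p(eps) = N(0, 1), independent of (mu, sigma)
        z = mu + sigma * eps       # z = g(phi, eps), so z ~ N(mu, sigma^2)
        df_dz = 2.0 * z            # derivative of the toy objective f(z) = z^2
        grad_mu += df_dz           # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule with dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
# Exact gradients for comparison: 2 * mu = 3.0 and 2 * sigma = 1.0
```

In practice an automatic differentiation library would compute the per-sample derivatives; they are written by hand here to make the chain rule through $g$ explicit.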

## A connection with auto-encoders

- Note that $L$ can also be written in this form:

$$ L(\theta, \phi, x ) = - D_{KL} (q_\phi ( z | x ) || p_\theta (z)) + 
\mathbb{E}_{q_\phi( z|x)} [\log p_\theta(x | z)]$$

- We can interpret the first term as a **regularizer**: it forces $q_\phi( z | x )$ to not diverge too much from the prior $p_\theta (z)$

- We can interpret the second term as the (negative) **reconstruction error**: it is the expected log-likelihood of $x$ when it is decoded from $z \sim q_\phi( z | x )$
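For some choices of $q_\phi$ and prior, the regularizer has a closed form and does not need to be sampled. The sketch below assumes, purely for illustration, a one-dimensional $q = N(\mu, \sigma^2)$ and a standard normal prior, and compares the closed-form KL divergence with a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$. The function names are hypothetical.

```python
import math
import random

def kl_closed_form(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) in closed form."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))

def log_normal(z, mean, std):
    """Log-density of N(mean, std^2) evaluated at z."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (z - mean) ** 2 / (2.0 * std ** 2)

def kl_monte_carlo(mu, sigma, n_samples=50000, seed=0):
    """Monte Carlo estimate of the same KL, using samples z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = mu + sigma * rng.gauss(0.0, 1.0)
        total += log_normal(z, mu, sigma) - log_normal(z, 0.0, 1.0)
    return total / n_samples

kl_exact = kl_closed_form(1.0, 0.5)
kl_sampled = kl_monte_carlo(1.0, 0.5)
```

The divergence is zero exactly when $q$ matches the prior ($\mu = 0$, $\sigma = 1$) and grows as $q$ moves away from it, which is the sense in which this term regularizes the encoder.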

## The algorithm

<img src="../files/auto-encoding-variational-bayes-algr1.png" alt="drawing" width="700"/>

## Variational Autoencoders (VAEs)

VAEs are an example of a model trained with the procedure described above to maximize the lower bound.

In VAEs, we choose:
- $p_\theta (z) = N ( 0 , \mathbf {I} )$ 
- $p_\theta ( x | z )$:
  - for real-valued data, a normal distribution: a **neural network decoder** computes $\mu$ and $\sigma$ of this distribution from $z$
  - for binary data, a multivariate Bernoulli distribution: a **neural network decoder** computes the probability of 1 for each component from $z$

- $q_\phi ( z | x ) = N \left( \mu( x ), \sigma^2(x) \mathbf {I}\right)$: a **neural network encoder** computes $\mu$ and $\sigma$ of $q_\phi(z | x )$ from $x$
- $\epsilon \sim N ( 0 , \mathbf {I} )$ and $z = g(x, \phi, \epsilon) = \mu(x) + \sigma(x) \odot \epsilon$, where $\odot$ is the element-wise product
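Putting these choices together, a one-sample estimate of the lower bound for a single binary data point could look like the sketch below. It makes several simplifying assumptions that are not from the paper: a single affine layer stands in for each neural network, the encoder outputs $\log \sigma^2$ instead of $\sigma$, weights are random and untrained, and all names (`elbo_estimate`, `enc`, `dec`) are hypothetical.

```python
import math
import random

def linear(W, b, v):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(w * vj for w, vj in zip(row, v)) + bi for row, bi in zip(W, b)]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def elbo_estimate(x, enc, dec, rng):
    """One-sample estimate of L(theta, phi, x) for a binary vector x."""
    # Encoder: mean and log-variance of q_phi(z | x)
    mu = linear(enc["W_mu"], enc["b_mu"], x)
    log_var = linear(enc["W_lv"], enc["b_lv"], x)
    # Reparametrisation: z = mu + sigma * eps with eps ~ N(0, I)
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    # Decoder: Bernoulli probabilities, clamped away from 0 and 1 for log()
    probs = [min(max(sigmoid(a), 1e-7), 1.0 - 1e-7) for a in linear(dec["W"], dec["b"], z)]
    # Reconstruction term: log p_theta(x | z)
    log_px_z = sum(xi * math.log(p) + (1.0 - xi) * math.log(1.0 - p)
                   for xi, p in zip(x, probs))
    # Regularizer: closed-form KL( q_phi(z | x) || N(0, I) )
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    return log_px_z - kl

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Small random weights, standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

x = [1.0, 0.0, 1.0, 1.0]  # a 4-dimensional binary data point
enc = {"W_mu": rand_matrix(2, 4), "b_mu": [0.0, 0.0],
       "W_lv": rand_matrix(2, 4), "b_lv": [0.0, 0.0]}
dec = {"W": rand_matrix(4, 2), "b": [0.0] * 4}
value = elbo_estimate(x, enc, dec, rng)
```

Training would average this quantity over a minibatch and follow its gradient with respect to the encoder and decoder parameters, typically with an automatic differentiation framework.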