# Variational Inference

The goal of a generative model is to learn a distribution $p(x)$ over the data $x$, also known as the **evidence**. Variational inference approximates an intractable distribution by optimizing over a family of simpler, tractable ones.


For training a generative model, the goal is to maximize the log-likelihood of the data, which can be expressed as:

$$\log p(x)$$

Estimating $p(x)$ directly would require an infeasible amount of data. Instead, we introduce a latent variable $z$ and, by marginalizing over $z$ and applying the product rule, rewrite the log-likelihood as:

$$\log p(x) = \log \int p(x, z) dz = \log \int p(x|z) p(z) dz$$


However, this marginal likelihood is intractable, as it requires integrating over all possible values of the latent variable $z$:
- Even if $p(z)$ and $p(x|z)$ are simple distributions (e.g., Gaussian), their product is often non-Gaussian and high-dimensional as a function of $z$, making the integral difficult to compute.
- $p(x|z)$, the decoder, is usually a complex function approximator (e.g., a neural network), which makes the integral even harder to compute.
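
To make the integral concrete, here is a minimal sketch of a naive Monte Carlo estimate of $p(x) = E_{p(z)}[p(x|z)]$. It assumes a hypothetical one-dimensional toy model, $p(z) = \mathcal{N}(0, 1)$ and $p(x|z) = \mathcal{N}(x; z, 1)$, chosen only because its marginal $p(x) = \mathcal{N}(x; 0, 2)$ is known in closed form. The function names are illustrative and not part of any library.

```python
import math
import random

def normal_pdf(value, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at `value`."""
    return math.exp(-0.5 * ((value - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def marginal_likelihood_mc(x, num_samples, rng):
    """Estimate p(x) = E_{p(z)}[p(x|z)] by sampling z from the prior N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)            # z ~ p(z)
        total += normal_pdf(x, z, 1.0)     # p(x|z) = N(x; z, 1) in this toy model
    return total / num_samples

rng = random.Random(0)
x = 1.5
estimate = marginal_likelihood_mc(x, 50_000, rng)
exact = normal_pdf(x, 0.0, math.sqrt(2.0))  # closed form for this toy model only
print(estimate, exact)
```

In one dimension the estimate is close to the exact value. With a high-dimensional $z$ and a neural-network decoder, most prior samples contribute almost nothing to the sum, which is one way to see why this direct approach does not scale.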



## Variational Trick

Now we introduce a **variational distribution** $q(z|x)$ that approximates the true posterior $p(z|x)$ and is easy to sample from. We can then rewrite the log-likelihood as:

$$\log p(x) = E_{q(z|x)}[\log p(x)]$$

This holds because $\log p(x)$ does not depend on $z$, so it can be pulled out of the expectation (written here as a sum; for continuous $z$ the sum becomes an integral):

$$E_{q(z|x)}[\log p(x)] = \sum_z q(z|x) \log p(x) = \log p(x) \sum_z q(z|x) = \log p(x) \times 1 = \log p(x)$$

By Bayes' rule, $p(x) = \frac{p(x, z)}{p(z|x)}$. Substitute this, then add and subtract $\log q(z|x)$ inside the expectation:

$$\log p(x) = E_{q(z|x)}[\log \frac{p(x, z)}{q(z|x)} + \log \frac{q(z|x)}{p(z|x)}]$$

This gives:

$$ \log p(x) = E_{q(z|x)}[\log \frac{p(x, z)}{q(z|x)}] + E_{q(z|x)}[\log \frac{q(z|x)}{p(z|x)}] $$


The first term is the **variational evidence lower bound** (ELBO):

\begin{align*}
    \mathcal{L}(q) &= E_{q(z|x)}[\log \frac{p(x, z)}{q(z|x)}]\\
    &= E_{q(z|x)}[\log \frac{p(x|z)p(z)}{q(z|x)}] \\
    &= E_{q(z|x)}[\log p(x|z)] + E_{q(z|x)}[\log \frac{p(z)}{q(z|x)}] \\
    &= E_{q(z|x)}[\log p(x|z)] - E_{q(z|x)}[\log \frac{q(z|x)}{p(z)}] \\
    &= E_{q(z|x)}[\log p(x|z)] - KL(q(z|x) || p(z))
\end{align*}

The second term is the **KL divergence** between the variational distribution and the true posterior. It cannot be evaluated directly, because the true posterior $p(z|x)$ is itself intractable.

$$ KL(q(z|x) || p(z|x)) = E_{q(z|x)}[\log \frac{q(z|x)}{p(z|x)}] $$
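
The decomposition $\log p(x) = \mathcal{L}(q) + KL(q(z|x) || p(z|x))$ can be checked numerically on a small discrete example where every quantity is computable. The sketch below assumes a hypothetical latent variable with three states and made-up probabilities; none of the numbers come from a real model.

```python
import math

# Hypothetical toy model: z takes 3 values, x is one fixed observation.
prior = [0.5, 0.3, 0.2]            # p(z)
likelihood = [0.10, 0.60, 0.30]    # p(x|z) for the observed x
q = [0.2, 0.5, 0.3]                # an arbitrary variational distribution q(z|x)

joint = [pz * px_z for pz, px_z in zip(prior, likelihood)]   # p(x, z)
evidence = sum(joint)                                        # p(x)
posterior = [j / evidence for j in joint]                    # p(z|x)

# ELBO = E_q[log p(x, z) / q(z|x)]
elbo = sum(qz * math.log(j / qz) for qz, j in zip(q, joint))
# KL(q(z|x) || p(z|x)) = E_q[log q(z|x) / p(z|x)]
kl_to_posterior = sum(qz * math.log(qz / pz_x) for qz, pz_x in zip(q, posterior))
log_evidence = math.log(evidence)

print(log_evidence, elbo + kl_to_posterior)
```

The two printed values agree up to floating-point error, and the ELBO alone is smaller than $\log p(x)$ because this $q$ differs from the true posterior.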


**Note that KL divergence is always non-negative**:

Consider the function $f(x) = -\log x$, which is convex.
By Jensen's inequality, for a random variable $X$ and a convex function $f$, we have:
$$ E[f(X)] \geq f(E[X]) $$

Applying this with $X = \frac{p(z)}{q(z)}$ and $z \sim q(z)$ gives $E_{q(z)}[-\log X] \geq -\log E_{q(z)}[X]$, that is:

$$ E_{q(z)} [\log(\frac{p(z)}{q(z)})] \leq \log E_{q(z)}[\frac{p(z)}{q(z)}] = \log \int_z q(z) \frac{p(z)}{q(z)} dz = \log 1 = 0 $$

Thus, KL divergence is always non-negative:
$$ KL(q(z) || p(z)) = E_{q(z)}[\log \frac{q(z)}{p(z)}] = - E_{q(z)} [\log \frac{p(z)}{q(z)}] \geq 0 $$
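
As a quick sanity check of the non-negativity result, the following sketch evaluates $KL(q || p)$ on randomly drawn discrete distributions. The helper names and the choice of five states are assumptions made for illustration.

```python
import math
import random

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def random_distribution(size, rng):
    """Draw positive weights and normalize them to sum to 1."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
values = [
    kl_divergence(random_distribution(5, rng), random_distribution(5, rng))
    for _ in range(1000)
]
smallest = min(values)                                        # never below zero
self_kl = kl_divergence([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])     # zero when q equals p
print(smallest, self_kl)
```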

Since the KL term is non-negative, $\log p(x) \geq \mathcal{L}(q)$: **the ELBO is a lower bound on the log-likelihood**, with equality exactly when $q(z|x) = p(z|x)$. Maximizing the ELBO therefore serves as a tractable surrogate for maximizing $\log p(x)$.
This is the **variational trick**: because $\log p(x)$ does not depend on $q$, choosing $q(z|x)$ to maximize the ELBO is the same as minimizing $KL(q(z|x) || p(z|x))$, which pushes $q(z|x)$ toward the true posterior $p(z|x)$.





## Variational Distribution

The **variational distribution** or **encoder** $q(z|x)$ is a simpler distribution that we can sample from. The choice of its form affects how well it approximates the true posterior $p(z|x)$.

### Diagonal Gaussian
A common choice is a **diagonal Gaussian**:
$$ q(z|x) = \mathcal{N}(z; \mu(x), \sigma^2(x)) $$

where $\mu(x)$ and $\sigma^2(x)$ are functions (often neural networks) that output the mean and variance of each latent dimension of $z$ given the input $x$.

With a standard normal prior $p(z) = \mathcal{N}(0, I)$, this choice gives the KL term a closed-form solution:
$$ KL(\mathcal{N}(\mu, \sigma^2) || \mathcal{N}(0, I)) = \frac{1}{2} \sum_i (\mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1) $$

where $i$ indexes the latent dimensions.
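
The closed-form expression can be compared against a Monte Carlo estimate of $E_q[\log q(z) - \log p(z)]$. This is a minimal sketch; parameterizing the encoder output as a log-variance (`log_var`) is a common convention but an assumption here, as are the function names and example values.

```python
import math
import random

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def kl_monte_carlo(mu, log_var, num_samples, rng):
    """Monte Carlo estimate of the same KL: average of log q(z) - log p(z), z ~ q."""
    total = 0.0
    for _ in range(num_samples):
        for m, lv in zip(mu, log_var):
            std = math.exp(0.5 * lv)
            z = rng.gauss(m, std)
            log_q = -0.5 * (math.log(2 * math.pi) + lv + ((z - m) / std) ** 2)
            log_p = -0.5 * (math.log(2 * math.pi) + z * z)
            total += log_q - log_p
    return total / num_samples

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -0.7]     # hypothetical encoder outputs
closed_form = kl_diag_gaussian_to_standard_normal(mu, log_var)
estimate = kl_monte_carlo(mu, log_var, 20_000, rng)
print(closed_form, estimate)
```

The closed form is exact and has no sampling noise, which is why it is preferred whenever the chosen $q$ and prior allow it.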

This choice allows sampling via the reparameterization trick:
$$ z = \mu(x) + \sigma(x) \odot \epsilon $$

where $\epsilon \sim \mathcal{N}(0, I)$ is a standard normal noise vector, and $\odot$ denotes element-wise multiplication. 
This allows gradients to flow through the sampling process, enabling backpropagation during training.
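
To illustrate how gradients pass through a reparameterized sample without relying on an autograd library, the sketch below estimates the gradients of $E[z^2]$ with respect to $\mu$ and $\sigma$ by applying the chain rule by hand. The objective $z^2$ is a hypothetical stand-in chosen because $E[z^2] = \mu^2 + \sigma^2$ gives exact gradients $2\mu$ and $2\sigma$ to compare against.

```python
import random

def reparameterized_sample(mu, sigma, eps):
    """z = mu + sigma * eps: a deterministic function of (mu, sigma) given the noise eps."""
    return mu + sigma * eps

def pathwise_gradients(mu, sigma, num_samples, rng):
    """Estimate d/dmu and d/dsigma of E[z^2] for z ~ N(mu, sigma^2)."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = reparameterized_sample(mu, sigma, eps)
        # Chain rule through the sample: dz/dmu = 1 and dz/dsigma = eps.
        grad_mu += 2.0 * z
        grad_sigma += 2.0 * z * eps
    return grad_mu / num_samples, grad_sigma / num_samples

rng = random.Random(0)
grad_mu, grad_sigma = pathwise_gradients(mu=1.0, sigma=0.5, num_samples=100_000, rng=rng)
print(grad_mu, grad_sigma)   # compare with the analytic values 2 * mu and 2 * sigma
```

In practice an automatic differentiation framework performs the same chain-rule step, with the randomness isolated in $\epsilon$ so that $\mu$ and $\sigma$ remain ordinary differentiable inputs.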

### Mixture of Gaussians
Another choice is a **mixture of Gaussians**:
$$ q(z|x) = \sum_{k=1}^K \pi_k(x) \mathcal{N}(z; \mu_k(x), \sigma_k^2(x)) $$

where $\pi_k(x)$ are the mixing coefficients, $\mu_k(x)$ and $\sigma_k^2(x)$ are the means and variances for each component, and $K$ is the number of components.
This can model multimodal posteriors better than a single Gaussian.

However, the KL divergence between a mixture of Gaussians and the prior has no closed form, so the KL term of the ELBO must be estimated, e.g., with Monte Carlo sampling.
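
A Monte Carlo estimate of the KL term for a mixture can be sketched as follows: sample $z \sim q$, then average $\log q(z) - \log p(z)$. The example assumes a one-dimensional mixture and a standard normal prior; the mixture parameters are hypothetical.

```python
import math
import random

def log_normal_pdf(value, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((value - mean) / std) ** 2

def log_mixture_pdf(value, weights, means, stds):
    """log q(z) for a 1D Gaussian mixture, using log-sum-exp for numerical stability."""
    terms = [math.log(w) + log_normal_pdf(value, m, s) for w, m, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def sample_mixture(weights, means, stds, rng):
    """Pick a component by its mixing weight, then sample from that Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k])

def kl_mixture_to_standard_normal(weights, means, stds, num_samples, rng):
    """Monte Carlo estimate of KL(q || N(0, 1)) with z sampled from the mixture q."""
    total = 0.0
    for _ in range(num_samples):
        z = sample_mixture(weights, means, stds, rng)
        total += log_mixture_pdf(z, weights, means, stds) - log_normal_pdf(z, 0.0, 1.0)
    return total / num_samples

rng = random.Random(0)
kl_bimodal = kl_mixture_to_standard_normal([0.5, 0.5], [-2.0, 2.0], [0.5, 0.5], 20_000, rng)
# With a single component the estimate can be compared to the Gaussian closed form (0.125 here).
kl_single = kl_mixture_to_standard_normal([1.0], [0.5], [1.0], 20_000, rng)
print(kl_bimodal, kl_single)
```

Note that sampling the discrete component index is not directly reparameterizable, so training with a mixture posterior usually needs additional care beyond this estimate.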

### Normalizing Flows
A more flexible approach is to use **normalizing flows**, which transform a simple distribution (like a Gaussian) into a more complex one through a series of invertible transformations.

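$$ z_K = f_K \circ f_{K-1} \circ \dots \circ f_1(z_0), \quad z_0 = \epsilon \sim q_0(z_0|x) $$

where each $f_k$ is an invertible transformation parameterized by $\theta$.

By the change of variables formula, the density of the transformed sample is:
$$ \log q_K(z_K|x) = \log q_0(z_0|x) - \sum_{k=1}^K \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right| $$

Substituting this into the ELBO gives an expectation over the simple base distribution:
$$ \mathcal{L}(q) = E_{q_0(z_0|x)}\left[\log p(x|z_K) + \log p(z_K) - \log q_0(z_0|x) + \sum_{k=1}^K \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|\right] $$
This allows for complex posteriors while keeping the ELBO computable, provided the Jacobian determinants are cheap to evaluate.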
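
The density bookkeeping behind a flow can be sketched with the simplest invertible map, a one-dimensional affine transformation $z_k = a_k z_{k-1} + b_k$, whose Jacobian determinant is just $a_k$. This is an illustrative assumption: practical flows use richer transformations, and the `(scale, shift)` values below are hypothetical.

```python
import math
import random

def log_standard_normal(value):
    """Log-density of N(0, 1)."""
    return -0.5 * (math.log(2 * math.pi) + value * value)

def affine_flow(eps, layers):
    """Push eps through z_k = scale_k * z_{k-1} + shift_k and track log q_K(z_K).

    Each layer subtracts log|det df_k/dz_{k-1}| = log|scale_k| from the log-density.
    """
    z = eps
    log_q = log_standard_normal(eps)     # log q_0(z_0)
    for scale, shift in layers:
        z = scale * z + shift
        log_q -= math.log(abs(scale))
    return z, log_q

layers = [(2.0, 1.0), (0.5, -3.0), (3.0, 0.25)]  # hypothetical (scale, shift) pairs
rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)
z, log_q = affine_flow(eps, layers)

# The composition of affine maps is affine, so q_K is Gaussian and can be checked exactly.
total_scale, total_shift = 1.0, 0.0
for scale, shift in layers:
    total_scale, total_shift = scale * total_scale, scale * total_shift + shift
expected_log_q = log_standard_normal((z - total_shift) / total_scale) - math.log(abs(total_scale))
print(log_q, expected_log_q)
```

Affine layers alone cannot produce a non-Gaussian posterior; they are used here only because the result can be verified in closed form.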

## Autoencoder 

An **autoencoder** is a neural network architecture that learns to encode input data into a lower-dimensional latent space and then decode it back to reconstruct the original data.

It consists of two main components:
1. **Encoder**: Maps the input data $x$ to a latent representation $z$.
2. **Decoder**: Maps the latent representation $z$ back to the data space to reconstruct $x$.

If the encoder is not probabilistic, it simply outputs a deterministic $z = f(x)$, where $f$ is a neural network. The decoder then reconstructs the input as $\hat{x} = g(z)$, where $g$ is another neural network.
The training objective is to minimize the reconstruction error, often using mean squared error (MSE) or cross-entropy loss:
$$ \mathcal{L}_{\text{recon}}(x, \hat{x}) = || x - \hat{x} ||^2 $$

Thus traditional autoencoders:
- are not generative models, as they learn a compressed representation rather than the distribution of the data.
- have no mechanism to sample from the latent space, as they do not define a distribution over $z$.
- have a latent space that may be irregular, with regions that no training input maps to, so a random $z$ often decodes to garbage output.



## Variational Autoencoder (VAE)

To make the autoencoder generative, we introduce a probabilistic encoder that outputs a distribution over the latent space instead of a single point.
The encoder outputs parameters of a variational distribution $q(z|x)$, typically a diagonal Gaussian with mean $\mu(x)$ and variance $\sigma^2(x)$.
A latent variable $z$ is then sampled from this distribution:
$$ z \sim q(z|x) = \mathcal{N}(z; \mu(x), \sigma^2(x)) $$

The decoder then reconstructs the input from the sampled $z$:
$$ \hat{x} = g(z) $$
The training objective is to maximize the ELBO, based on the variational trick.
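
Putting the pieces together, a single-sample estimate of the negative ELBO for one data point might look like the sketch below. It assumes a unit-variance Gaussian decoder (so $-\log p(x|z)$ reduces to half the squared error plus a constant), a standard normal prior, and a log-variance encoder output. The fixed linear `toy_decoder` is a hypothetical stand-in for a neural network.

```python
import math
import random

def negative_elbo(x, mu, log_var, decode, rng):
    """Single-sample estimate of -ELBO = reconstruction term + KL(q(z|x) || N(0, I))."""
    # Reparameterized sample z = mu + sigma * eps.
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    x_hat = decode(z)
    # -log p(x|z) for a unit-variance Gaussian decoder, dropping the additive constant.
    reconstruction = 0.5 * sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    # Closed-form KL between the diagonal Gaussian q(z|x) and the standard normal prior.
    kl = 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))
    return reconstruction + kl

def toy_decoder(z):
    """Hypothetical fixed linear decoder from 2 latent dims to 3 data dims."""
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]

rng = random.Random(0)
x = [0.3, -0.2, 0.1]
loss = negative_elbo(x, mu=[0.2, -0.1], log_var=[-1.0, -1.0], decode=toy_decoder, rng=rng)
print(loss)
```

Minimizing this quantity over the encoder and decoder parameters, averaged over the training data, corresponds to maximizing the ELBO.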

This generative model can be used in several ways:
1. **Reconstruction**: Given an input $x$, the encoder maps it to a latent representation $z$ and the decoder reconstructs $\hat{x}$. The model is trained to minimize the reconstruction error while also ensuring that the latent space follows a prior distribution (e.g., standard normal).
2. **Sampling**: We can generate new data points by sampling $z$ from the prior $p(z)$ (often a standard normal $\mathcal{N}(0, I)$) and passing it through the decoder. This process is random but constrained by the learned latent space structure. Maximizing the ELBO also minimizes the KL divergence between the variational distribution and the prior, which pushes the encoded latent distribution toward the prior (e.g., a standard normal). As a result, at inference time, samples from the prior tend to decode to meaningful outputs that resemble the training data.


For generative tasks, **how can we control the sampling process?**
- **Interpolation**: By sampling two points in the latent space and interpolating between them, we can generate smooth transitions between different data points.

$$ z_{\text{interp}} = \alpha z_1 + (1 - \alpha) z_2 $$

where $\alpha \in [0, 1]$ is a parameter that controls the interpolation.
- **Conditional Generation**: By conditioning the encoder and the decoder on additional information (e.g., class labels), we can generate samples that belong to specific categories. This is done by feeding the additional information as an extra input, allowing the model to learn a conditional distribution over the latent space and to decode a sampled $z$ for a chosen category.
- **VQ-VAE + Transformer**: For text generation, we can use a VQ-VAE to encode text into discrete latent codes and then use a transformer to model the relationships between these codes, allowing for coherent text generation.
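
The interpolation rule from the list above, $z_{\text{interp}} = \alpha z_1 + (1 - \alpha) z_2$, can be sketched in a few lines. Latent codes are represented as plain lists and the function names are illustrative; in a real model each interpolated code would be passed through the decoder.

```python
def interpolate(z1, z2, alpha):
    """Return alpha * z1 + (1 - alpha) * z2, element-wise."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z1, z2)]

def interpolation_path(z1, z2, num_steps):
    """Latent codes from z2 (alpha = 0) to z1 (alpha = 1), inclusive; needs num_steps >= 2."""
    return [interpolate(z1, z2, step / (num_steps - 1)) for step in range(num_steps)]

path = interpolation_path([1.0, 0.0], [0.0, 2.0], num_steps=5)
for z in path:
    print(z)
```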





## Vector Quantized Variational Autoencoder (VQ-VAE)

Instead of using a continuous latent space, VQ-VAE uses a discrete latent space by quantizing the latent representations into a finite set of vectors (codebook). The architecture consists of:
1. **Encoder**: Maps the input data $x$ to a continuous latent representation $z_e = f_e(x)$.
2. **Codebook**: A set of $K$ discrete embedding vectors $e_1, \dots, e_K$ to which the continuous latent representation is quantized. The encoder output is mapped to the nearest codebook vector:
$$ z_q = \text{argmin}_{e_k \in \text{codebook}} || z_e - e_k ||^2 $$
3. **Decoder**: Maps the quantized latent representation $z_q$ back to the data space to reconstruct $x$:
$$ \hat{x} = g(z_q) $$
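
The quantization step is a nearest-neighbor lookup in the codebook. Below is a minimal sketch for a single encoder output, with a hypothetical codebook of $K = 3$ two-dimensional vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(z_e, codebook):
    """Return (index, vector) of the codebook entry nearest to z_e in squared L2 distance."""
    index = min(range(len(codebook)), key=lambda k: squared_distance(z_e, codebook[k]))
    return index, codebook[index]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]   # hypothetical K = 3 entries
index, z_q = quantize([0.8, 1.3], codebook)
print(index, z_q)
```

The returned index is the discrete latent code, and the corresponding vector is what the decoder receives.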



Training involves learning the encoder, the codebook, and the decoder. Written as a loss to be minimized, it can be formulated as follows:

$$ \mathcal{L} = -\mathcal{L}_{ELBO}(q(z|x)) + \mathcal{L}_{codebook} $$

where the first term is the negative ELBO of the VAE, and the second term collects the codebook and commitment losses, which encourage the encoder outputs and the codebook vectors to stay close to each other.



**ELBO loss**

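$$\mathcal{L}_{ELBO} = E_{q(z|x)}[\log p(x|z)] - KL(q(z|x) || p(z))$$

In VQ-VAE, $q(z|x)$ is a deterministic one-hot distribution over the $K$ codebook entries and the prior $p(z)$ is uniform, so the KL term equals the constant $\log K$ and drops out of the training loss, leaving only the reconstruction term.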




These three components are trained jointly, and the training loss is:
$$ \mathcal{L} = || x - \hat{x} ||^2 + || sg(z_e) - e_k ||^2 + \beta || z_e - sg(e_k) ||^2 $$

where the first term is the reconstruction loss, the second term is the codebook loss, which moves the codebook vectors toward the encoder outputs, and the third term is the commitment loss, which keeps the encoder outputs close to the codebook vectors. Here $sg$ denotes the stop-gradient operation: it is the identity in the forward pass and blocks gradients in the backward pass, so each of the last two terms updates only one of $z_e$ and $e_k$. In PyTorch, $sg(e_k)$ can be implemented as `e_k.detach()`.
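
To show which variable each stop-gradient term updates, the sketch below computes the codebook and commitment terms for one vector pair and writes out by hand the gradients that remain once $sg$ is applied. It deliberately avoids an autograd library, omits the reconstruction term, and uses hypothetical values for $\beta$ and the learning rate.

```python
def vq_losses_and_grads(z_e, e_k, beta):
    """Codebook and commitment terms with the gradients that stop-gradient leaves in place."""
    diff = [z - e for z, e in zip(z_e, e_k)]
    squared = sum(d * d for d in diff)
    codebook_loss = squared              # ||sg(z_e) - e_k||^2
    commitment_loss = beta * squared     # beta * ||z_e - sg(e_k)||^2
    # sg(z_e) is a constant in the codebook term, so only e_k receives its gradient.
    grad_e_k = [-2.0 * d for d in diff]
    # sg(e_k) is a constant in the commitment term, so only z_e receives its gradient.
    grad_z_e = [2.0 * beta * d for d in diff]
    return codebook_loss, commitment_loss, grad_e_k, grad_z_e

z_e, e_k, beta, lr = [0.8, 1.3], [1.0, 1.0], 0.25, 0.1   # hypothetical values
codebook_loss, commitment_loss, grad_e_k, grad_z_e = vq_losses_and_grads(z_e, e_k, beta)

new_e_k = [e - lr * g for e, g in zip(e_k, grad_e_k)]   # codebook vector moves toward z_e
new_z_e = [z - lr * g for z, g in zip(z_e, grad_z_e)]   # stands in for an encoder update toward e_k
print(new_e_k, new_z_e)
```

Both terms have the same forward structure and differ only in where the gradient flows, which is exactly what the stop-gradient operator controls.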
