# Variational Inference

The goal of a generative model is to learn a distribution $p(x)$ over the data $x$. Variational inference approximates an intractable distribution with a simpler, tractable one, which turns inference into an optimization problem.


To train a generative model, the goal is to maximize the log-likelihood of the data:

$$\log p(x)$$

Estimating $p(x)$ directly would require an impractical amount of data. Instead, we introduce a latent variable $z$. Marginalizing over $z$ and applying the product rule, we can rewrite the log-likelihood as:

$$\log p(x) = \log \int p(x, z) dz = \log \int p(x|z) p(z) dz$$


However, for most models of interest this integral is intractable, because it ranges over every possible value of $z$.



## Variational Trick

Now we introduce a **variational distribution** $q(z|x)$, a simpler distribution that is easy to sample from. We can then rewrite the log-likelihood as:

$$\log p(x) = E_{q(z|x)}[\log p(x)]$$

This holds because $\log p(x)$ does not depend on $z$, so it can be pulled out of the expectation. Written as a sum for discrete $z$ (the continuous case replaces the sum with an integral):

$$E_{q(z|x)}[\log p(x)] = \sum_z q(z|x) \log p(x) = \log p(x) \sum_z q(z|x) = \log p(x) \times 1 = \log p(x)$$

Now use $p(x) = \frac{p(x, z)}{p(z|x)}$, then multiply and divide by $q(z|x)$ inside the logarithm:

$$\log p(x) = E_{q(z|x)}[\log \frac{p(x, z)}{q(z|x)} + \log \frac{q(z|x)}{p(z|x)}]$$

This gives:

$$ \log p(x) = E_{q(z|x)}[\log \frac{p(x, z)}{q(z|x)}] + E_{q(z|x)}[\log \frac{q(z|x)}{p(z|x)}] $$


The first term is the **evidence lower bound** (ELBO), which we want to maximize:

\begin{align*}
    \mathcal{L}(q) &= E_{q(z|x)}[\log \frac{p(x, z)}{q(z|x)}]\\
    &= E_{q(z|x)}[\log \frac{p(x|z)p(z)}{q(z|x)}] \\
    &= E_{q(z|x)}[\log p(x|z)] + E_{q(z|x)}[\log \frac{p(z)}{q(z|x)}] \\
    &= E_{q(z|x)}[\log p(x|z)] - E_{q(z|x)}[\log \frac{q(z|x)}{p(z)}] \\
    &= E_{q(z|x)}[\log p(x|z)] - KL(q(z|x) || p(z))
\end{align*}
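The last line of this derivation can be estimated in practice by sampling from $q(z|x)$. The following is a minimal sketch under an assumed toy model that does not come from the text: prior $p(z) = \mathcal{N}(0, 1)$, likelihood $p(x|z) = \mathcal{N}(z, 1)$, and a Gaussian $q(z|x) = \mathcal{N}(m, s^2)$. The function names are hypothetical. For this model the evidence is known in closed form, $p(x) = \mathcal{N}(0, 2)$, so the estimate can be compared against the exact $\log p(x)$.

```python
import math
import random

def log_normal_pdf(v, mean, var):
    # Log density of N(mean, var) evaluated at v.
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)

def kl_to_standard_normal(m, s2):
    # Closed-form KL( N(m, s2) || N(0, 1) ), the KL(q(z|x) || p(z)) term.
    return 0.5 * (s2 + m * m - 1.0 - math.log(s2))

def elbo_monte_carlo(x, m, s2, n_samples=20000, seed=0):
    # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)).
    # The expectation is estimated by averaging over samples z ~ q(z|x).
    rng = random.Random(seed)
    s = math.sqrt(s2)
    recon = sum(
        log_normal_pdf(x, rng.gauss(m, s), 1.0) for _ in range(n_samples)
    ) / n_samples
    return recon - kl_to_standard_normal(m, s2)

x = 1.0
# Exact evidence for the assumed model: x ~ N(0, 2).
log_px = log_normal_pdf(x, 0.0, 2.0)
# q set to the true posterior N(x/2, 1/2): the bound should be nearly tight.
elbo_at_posterior = elbo_monte_carlo(x, x / 2.0, 0.5)
# q set to the prior N(0, 1): the bound should be strictly looser.
elbo_at_prior = elbo_monte_carlo(x, 0.0, 1.0)

print("log p(x):", log_px)
print("ELBO with q = posterior:", elbo_at_posterior)
print("ELBO with q = prior:", elbo_at_prior)
```

Up to Monte Carlo noise, the ELBO matches $\log p(x)$ when $q$ equals the true posterior and falls below it otherwise.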

The second term in the decomposition of $\log p(x)$ is the **KL divergence** between the variational distribution and the true posterior $p(z|x)$:

$$ KL(q(z|x) || p(z|x)) = E_{q(z|x)}[\log \frac{q(z|x)}{p(z|x)}] $$
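The decomposition $\log p(x) = \mathcal{L}(q) + KL(q(z|x) || p(z|x))$ can be checked numerically on a model small enough that every quantity is exact. The sketch below assumes a hypothetical binary latent variable and a single fixed observation; the probabilities are made-up illustrative values.

```python
import math

# Hypothetical toy model: binary latent z, one fixed observation x.
p_z = [0.3, 0.7]            # prior p(z)
p_x_given_z = [0.9, 0.2]    # likelihood p(x | z) of the observed x
q = [0.5, 0.5]              # an arbitrary variational distribution q(z | x)

# Exact quantities, computable here only because z takes two values.
p_xz = [p_x_given_z[k] * p_z[k] for k in range(2)]   # joint p(x, z)
p_x = sum(p_xz)                                      # evidence p(x)
posterior = [v / p_x for v in p_xz]                  # true posterior p(z | x)

# First term: ELBO = E_q[log p(x, z) / q(z | x)].
elbo = sum(q[k] * math.log(p_xz[k] / q[k]) for k in range(2))
# Second term: KL(q(z | x) || p(z | x)).
kl_q_post = sum(q[k] * math.log(q[k] / posterior[k]) for k in range(2))
log_px = math.log(p_x)

print("log p(x):", log_px)
print("ELBO + KL:", elbo + kl_q_post)
```

The two printed values agree up to floating-point error for any valid choice of `q`, while the split between the ELBO and the KL term changes with `q`.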

**Note that KL divergence is always non-negative**:

Consider the function $f(x) = -\log x$, which is convex. By Jensen's inequality, for a random variable $X$ and a convex function $f$, we have:
$$ E[f(X)] \geq f(E[X]) $$

Applying this with $X = \frac{p(z)}{q(z)}$, where $z \sim q(z)$, and multiplying both sides by $-1$:

$$ E_{q(z)} [\log(\frac{p(z)}{q(z)})] \leq \log E_{q(z)}[\frac{p(z)}{q(z)}] = \log \int_z q(z) \frac{p(z)}{q(z)} dz = \log 1 = 0 $$

Thus, KL divergence is always non-negative:
$$ KL(q(z) || p(z)) = E_{q(z)}[\log \frac{q(z)}{p(z)}] = - E_{q(z)} [\log \frac{p(z)}{q(z)}] \geq 0 $$
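As an empirical sanity check of this inequality, the sketch below draws random pairs of discrete distributions and computes their KL divergence. The helper names and the choice of five outcomes are illustrative assumptions. This supports the Jensen argument above but does not replace it.

```python
import math
import random

def kl_divergence(q, p):
    # KL(q || p) = sum_z q(z) log(q(z) / p(z)) for discrete distributions.
    # Terms with q(z) = 0 contribute nothing, by the convention 0 log 0 = 0.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def random_distribution(rng, k):
    # Draw k strictly positive weights and normalize them to sum to one.
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
kls = []
for _ in range(1000):
    q = random_distribution(rng, 5)
    p = random_distribution(rng, 5)
    kls.append(kl_divergence(q, p))

min_kl = min(kls)               # smallest value over all random pairs
self_kl = kl_divergence(q, q)   # KL of a distribution with itself

print("smallest KL over random pairs:", min_kl)
print("KL(q || q):", self_kl)
```

None of the sampled pairs gives a negative value, and the divergence is zero when both arguments are the same distribution.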

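Since $\log p(x)$ does not depend on $q$, maximizing the ELBO with respect to $q$ is equivalent to minimizing the KL divergence between the variational distribution and the true posterior $p(z|x)$.

With $KL \geq 0$, we have $\log p(x) \geq \mathcal{L}(q)$: the ELBO is a lower bound on the log-likelihood, so maximizing the ELBO pushes up a lower bound on the log-likelihood. The bound is tight when $q(z|x) = p(z|x)$.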

## Variational Gaussian mixture model