# Kullback-Leibler (KL) divergence

Given two separate probability distributions $p(z)$ and $q(z)$ over the same random variable $Z$ (i.e. $p$ and $q$ are PMFs if $Z$ is discrete and PDFs otherwise), the Kullback-Leibler (KL) divergence measures how different these two distributions are:

$$ D_{KL}(q||p) \equiv D_{KL}(q(z)||p(z)) = \mathbb{E}_{Z \sim q} \Big[ \log \frac{q(z)}{p(z)} \Big] $$

Properties:
* $ D_{KL}(q||p)  \geq 0 $, i.e. the KL-divergence is always non-negative
* $ D_{KL}(q||p) = 0 $ iff $p(z) = q(z)$ (almost everywhere)
* KL-divergence is not symmetric: in general, $D_{KL}(q||p) \neq D_{KL}(p||q)$
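
These properties are easy to check numerically in the discrete case. The snippet below is a minimal sketch, assuming two made-up PMFs over a three-element support; the helper name `kl_divergence` and the example values are illustrative and not part of the original notes. It also assumes $p(z) > 0$ wherever $q(z) > 0$, otherwise the divergence is infinite.

```python
import numpy as np

def kl_divergence(q, p):
    """D_KL(q || p) for two PMFs given as arrays over the same support."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q(z) = 0 contribute 0 by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Hypothetical example distributions over the same three outcomes
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.2, 0.5, 0.3])

kl_qp = kl_divergence(q, p)  # D_KL(q || p)
kl_pq = kl_divergence(p, q)  # D_KL(p || q)
kl_qq = kl_divergence(q, q)  # D_KL(q || q)

print(kl_qp, kl_pq, kl_qq)
```

Both divergences come out positive and different from each other, while the divergence of $q$ from itself is zero.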

## KL-divergence between two Gaussians

Let $p(\mathbf{x}) = N(\mathbf{x}|\mu_p,\Sigma_p)$ and $q(\mathbf{x}) = N(\mathbf{x}|\mu_q,\Sigma_q)$, both $k$-dimensional. Then:

$$ 
D_{KL}(p||q) = 
\frac{1}{2} \Big[ \mathrm{tr}(\Sigma_q^{-1}\Sigma_p) + (\mu_q-\mu_p)^T\Sigma_q^{-1}(\mu_q-\mu_p)-k + \ln \frac{\det\Sigma_q}{\det \Sigma_p} \Big]
$$

If $q(\mathbf{x}) = N(\mathbf{x}|\mathbf{0},\mathbf{I}_k)$, we get:

$$ 
D_{KL}(p||q) = 
\frac{1}{2} \Big[ \mathrm{tr}(\Sigma_p) + \mu_p^T\mu_p-k - \ln (\det \Sigma_p) \Big]
$$

In the scalar case, where $p(x) = N(x|\mu_p,\sigma_p^2)$ and $q(x) = N(x|\mu_q,\sigma_q^2)$, we get:

$$ 
D_{KL}(p||q) = 
\ln \frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2+(\mu_p-\mu_q)^2}{2 \sigma_q^2} - \frac{1}{2}
$$
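
As a sanity check, the three formulas above should agree with each other when they overlap. The following is a sketch under stated assumptions: the function names `kl_gaussians` and `kl_gaussians_scalar` and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def kl_gaussians(mu_p, cov_p, mu_q, cov_q):
    """D_KL(p || q) for p = N(mu_p, cov_p) and q = N(mu_q, cov_q)."""
    mu_p = np.atleast_1d(np.asarray(mu_p, dtype=float))
    mu_q = np.atleast_1d(np.asarray(mu_q, dtype=float))
    cov_p = np.atleast_2d(np.asarray(cov_p, dtype=float))
    cov_q = np.atleast_2d(np.asarray(cov_q, dtype=float))
    k = mu_p.shape[0]
    diff = mu_q - mu_p
    # solve(A, B) computes A^{-1} B without forming the inverse explicitly
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    quad_term = diff @ np.linalg.solve(cov_q, diff)
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (trace_term + quad_term - k + logdet_q - logdet_p)

def kl_gaussians_scalar(mu_p, sigma_p, mu_q, sigma_q):
    """Scalar special case of D_KL(p || q)."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2) - 0.5)

# Scalar case: the general formula with 1x1 covariances (sigma squared)
kl_general_1d = kl_gaussians(1.0, 2.0**2, 0.0, 1.0**2)
kl_scalar = kl_gaussians_scalar(1.0, 2.0, 0.0, 1.0)

# Standard-normal case: q = N(0, I_k)
mu = np.array([0.5, -1.0])
cov = np.array([[2.0, 0.3], [0.3, 1.0]])
kl_general_std = kl_gaussians(mu, cov, np.zeros(2), np.eye(2))
kl_std_formula = 0.5 * (np.trace(cov) + mu @ mu - 2 - np.log(np.linalg.det(cov)))

print(kl_general_1d, kl_scalar)
print(kl_general_std, kl_std_formula)
```

Each printed pair should match up to floating-point error.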


# Evidence Lower Bound (ELBO)

Let $X$ and $Z$ be random variables with joint distribution $p(x,z)$. Then, for any observed value $x$ and for any distribution $q(z)$ over $Z$ that places mass only where $p(x,z) > 0$, we have:

$$ \ln p(x) \geq \mathbb{E}_{q} \Big[ \ln \frac{p(x,z)}{q(z)} \Big] $$

where $p(x)$ is the marginal distribution $p(X)$ evaluated at $X = x$.

The quantity 

$$ \mathcal{L}(q) \equiv \mathbb{E}_{q} \Big[ \ln \frac{p(x,z)}{q(z)} \Big] = \mathbb{E}_{q}[\ln p(x,z)] - \mathbb{E}_{q}[\ln q(z)]$$ 

is the ELBO, also called the variational lower bound.



## Derivation

\begin{align*}
\ln p(x) & = \ln \int p(x,z) dz \\
  & = \ln \int p(x,z) \frac{q(z)}{q(z)} dz \\
  & = \ln \int \frac{p(x,z)}{q(z)} q(z) dz \\
  & = \ln \mathbb{E}_{q} \Big[ \frac{p(x,z)}{q(z)} \Big] \\
  & \geq \mathbb{E}_{q} \Big[ \ln \frac{p(x,z)}{q(z)} \Big] \\
\end{align*}

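The final inequality follows from Jensen's inequality: because $\ln$ is concave, $\ln \mathbb{E}[Y] \geq \mathbb{E}[\ln Y]$ for any positive random variable $Y$.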

## Connection between ELBO and KL-Divergence

\begin{align*}
\mathcal{L}(q) & = \mathbb{E}_{q}[\ln p(x,z)] - \mathbb{E}_{q}[\ln q(z)] \\
  & = \mathbb{E}_{q}[\ln p(x|z)p(z)] - \mathbb{E}_{q}[\ln q(z)] \\
  & = \mathbb{E}_{q}[\ln p(x|z) + \ln p(z)] - \mathbb{E}_{q}[\ln q(z)]  \\
  & = \mathbb{E}_{q}[\ln p(x|z)] + \mathbb{E}_{q}[\ln p(z) - \ln q(z)] \\
  & = \mathbb{E}_{q}[\ln p(x|z)] + \mathbb{E}_{q} \Big[ \ln \frac{p(z)}{q(z)} \Big] \\
  & = \mathbb{E}_{q}[\ln p(x|z)] - \mathbb{E}_{q} \Big[ \ln \frac{q(z)}{p(z)} \Big] \\
  & = \mathbb{E}_{q}[\ln p(x|z)] - D_{KL}(q(z)||p(z)) \\
\end{align*}
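
The identity $\mathcal{L}(q) = \mathbb{E}_{q}[\ln p(x|z)] - D_{KL}(q(z)||p(z))$ can be verified exactly on a small discrete model. This is a minimal sketch, assuming a made-up model in which $Z$ takes three values and $X$ takes two; the prior, likelihood table, observed value, and $q$ are all hypothetical.

```python
import numpy as np

# Hypothetical toy model: Z has 3 states, X has 2 states
prior = np.array([0.5, 0.3, 0.2])        # p(z)
likelihood = np.array([[0.9, 0.1],       # p(x | z), one row per value of z
                       [0.4, 0.6],
                       [0.2, 0.8]])
x = 1                                    # observed value of X
q = np.array([0.1, 0.5, 0.4])            # an arbitrary distribution q(z)

joint_x = prior * likelihood[:, x]       # p(x, z) as a function of z

# ELBO from its definition: E_q[ln p(x, z)] - E_q[ln q(z)]
elbo = np.sum(q * (np.log(joint_x) - np.log(q)))

# The two terms of the decomposition
expected_loglik = np.sum(q * np.log(likelihood[:, x]))  # E_q[ln p(x | z)]
kl_q_prior = np.sum(q * np.log(q / prior))              # D_KL(q(z) || p(z))

print(elbo, expected_loglik - kl_q_prior)
```

The two printed numbers should coincide: the ELBO rewards explaining the data well while penalizing a $q$ that strays from the prior.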

## Connection between ELBO and the log-evidence

The gap between the log-evidence $\ln p(x)$ and the ELBO is equal to the KL-divergence between $q(z)$ and the posterior $p(z|x)$:

\begin{align*}
D_{KL}(q(z)||p(z|x)) & = \mathbb{E}_{q} \Big[ \ln \frac{q(z)}{p(z|x)} \Big] \\
  & = \mathbb{E}_{q}[\ln q(z)] - \mathbb{E}_{q}[\ln p(z|x)] \\
  & = \mathbb{E}_{q}[\ln q(z)] - \mathbb{E}_{q}[\ln \frac{p(x,z)}{p(x)}] \\
  & = \mathbb{E}_{q}[\ln q(z)] - \mathbb{E}_{q}[\ln p(x,z)] + \mathbb{E}_{q}[\ln p(x)]\\
  & = - \mathbb{E}_{q} \Big[\ln \frac{p(x,z)}{q(z)} \Big] + \ln p(x)\\
  & = - \mathcal{L}(q) + \ln p(x)\\
  & = \ln p(x) - \mathcal{L}(q)\\
\end{align*}
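
Because the gap is a KL-divergence, it is non-negative and vanishes exactly when $q(z) = p(z|x)$. The sketch below checks this on a small discrete model where the posterior can be computed by enumeration. The model, the helper names `elbo` and `kl`, and the randomly drawn $q$ are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy model: Z has 3 states, X has 2 states
prior = np.array([0.5, 0.3, 0.2])        # p(z)
likelihood = np.array([[0.9, 0.1],       # p(x | z), one row per value of z
                       [0.4, 0.6],
                       [0.2, 0.8]])
x = 1                                    # observed value of X

joint_x = prior * likelihood[:, x]       # p(x, z) as a function of z
evidence = joint_x.sum()                 # p(x), by summing out z
posterior = joint_x / evidence           # p(z | x), via Bayes' theorem
log_evidence = np.log(evidence)

def elbo(q):
    """E_q[ln p(x, z) / q(z)] for a strictly positive PMF q."""
    return float(np.sum(q * (np.log(joint_x) - np.log(q))))

def kl(q, p):
    """D_KL(q || p) for strictly positive PMFs."""
    return float(np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(3))            # a random distribution over z

gap = log_evidence - elbo(q)
print(gap, kl(q, posterior))             # the two values should match
print(log_evidence, elbo(posterior))     # the bound is tight at the posterior
```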

# Variational Inference

Variational inference approximates a posterior distribution when computing it exactly is intractable. It applies when we have a model with hidden random variables $Z$, observed data $X$, and a posited joint distribution $p(x,z)$ over the hidden and observed variables. Our goal is to compute the posterior distribution $p(z|x)$. Ideally, we would do so using Bayes' theorem:

$$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$

In practice, it is often difficult to compute $p(z|x)$ via Bayes' theorem because the denominator $p(x)$ does not have a closed form. Usually $p(x)$ can only be expressed as an integral that marginalizes over $z$:

$$ p(x) = \int p(x,z) dz $$

In such scenarios, we're often forced to approximate $p(z|x)$ rather than compute it directly. Variational inference is one such approximation technique. Instead of computing $p(z|x)$ exactly via Bayes' theorem, variational inference attempts to find another distribution $q(z)$ that is "close" to $p(z|x)$. Ideally, $q(z)$ is easier to evaluate than $p(z|x)$ and, if $p(z|x)$ and $q(z)$ are similar, then we can use $q(z)$ as a replacement for $p(z|x)$ in any relevant downstream tasks.

The goal of variational inference is to choose a family of tractable distributions $q_{\phi}$ parameterized by $\phi$ (e.g. the normal family with $\phi = (\mu,\sigma)$) and then to find $\hat{\phi}$ such that $q_{\hat{\phi}}(z)$ is as close to $p(z|x)$ as possible.

It would make sense to proceed with this problem by choosing $\hat{\phi}$ to minimize the KL-divergence between $q_{\hat{\phi}}(z)$ and $p(z|x)$:

$$ \hat{\phi} = \underset{\phi}{\operatorname{argmin}} \ D_{KL}(q_{\phi}(z)||p(z|x)) $$

However, this problem involves $p(z|x)$, which is intractable.

Now recall:

$$ D_{KL}(q_{\phi}(z)||p(z|x)) = \ln p(x) - \mathcal{L}(q_{\phi}) $$

where:

$$ \mathcal{L}(q_{\phi}) = \mathbb{E}_{q_{\phi}} \Big[ \ln \frac{p(x,z)}{q_{\phi}(z)} \Big] $$

is the ELBO.

Notice that the term $\ln p(x)$ does not depend on $\phi$, so it is a constant as far as optimizing $\phi$ is concerned. Therefore:

\begin{align*}
\hat{\phi} & = \underset{\phi}{\operatorname{argmin}} \ D_{KL}(q_{\phi}(z)||p(z|x)) \\
  & = \underset{\phi}{\operatorname{argmin}} \ [\ln p(x) - \mathcal{L}(q_{\phi})] \\
  & = \underset{\phi}{\operatorname{argmin}} \ [- \mathcal{L}(q_{\phi})] \\
  & = \underset{\phi}{\operatorname{argmax}} \ \mathcal{L}(q_{\phi}) \\
\end{align*}
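
To see ELBO maximization recover a posterior, it helps to pick a model where the answer is known. The sketch below assumes a hypothetical conjugate model, $p(z) = N(z|0,1)$ and $p(x|z) = N(x|z,1)$, whose exact posterior is $N(z|x/2, 1/2)$ and whose evidence is $p(x) = N(x|0,2)$. The variational family is $q_{\phi}(z) = N(z|m,s^2)$ with $\phi = (m, \ln s)$. Plain gradient ascent, the learning rate, and the observed value are illustrative choices, not something prescribed by the notes above; the ELBO is written as $\mathbb{E}_{q}[\ln p(x|z)] - D_{KL}(q(z)||p(z))$ using the scalar Gaussian KL formula.

```python
import numpy as np

x = 1.5  # a single observed value (hypothetical)

def elbo(m, log_s):
    """Closed-form ELBO for q = N(m, s^2), prior N(0, 1), likelihood N(x | z, 1)."""
    s2 = np.exp(2.0 * log_s)
    # E_q[ln p(x | z)] = -0.5 ln(2 pi) - 0.5 E_q[(x - z)^2]
    expected_loglik = -0.5 * np.log(2.0 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    # D_KL(N(m, s^2) || N(0, 1)) = 0.5 (s^2 + m^2 - 1) - ln s
    kl_to_prior = 0.5 * (s2 + m ** 2 - 1.0) - log_s
    return expected_loglik - kl_to_prior

def elbo_grad(m, log_s):
    """Gradient of the ELBO with respect to (m, ln s)."""
    s2 = np.exp(2.0 * log_s)
    return x - 2.0 * m, 1.0 - 2.0 * s2

# Gradient ascent on the ELBO, starting from q = N(0, 1)
m, log_s = 0.0, 0.0
learning_rate = 0.1
for _ in range(500):
    grad_m, grad_log_s = elbo_grad(m, log_s)
    m += learning_rate * grad_m
    log_s += learning_rate * grad_log_s

s2 = np.exp(2.0 * log_s)
log_evidence = -0.5 * np.log(2.0 * np.pi * 2.0) - x ** 2 / 4.0  # ln N(x | 0, 2)

print(m, s2)                        # compare with x / 2 and 1 / 2
print(elbo(m, log_s), log_evidence)
```

Since the exact posterior lies inside the chosen family here, the optimized $q_{\hat{\phi}}$ should match it and the ELBO should reach $\ln p(x)$. In general the family does not contain the posterior, and a positive gap remains.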

# Sources:

1. https://mpatacchiola.github.io/blog/2021/01/25/intro-variational-inference.html

