# Variational Inference

Variational inference is a method for approximating a posterior distribution whose normalization factor $Z(y)$ is unknown:

$$ 
\begin{align*}
\rho_{\rm post}(\theta | y) = \frac{\rho(\theta, y)}{Z(y)} =  \frac{\rho_{\rm prior}(\theta) \rho(y | \theta)}{Z(y)}
\end{align*}
$$

We consider the case where the conditional probability $\rho(y | \theta)$ and the prior $\rho_{\rm prior}(\theta)$ are easy to evaluate.

## Basic variational inference algorithm

The basic idea of variational inference is to find a simpler distribution $q_{\lambda}(\theta)$, parameterized by $\lambda$, that approximates the original distribution $\rho_{\rm post}(\theta | y)$.

The $KL$ divergence is widely used to measure the discrepancy between these distributions (it is not symmetric, so it is not a distance in the strict sense):
$$
\begin{align*}
KL\Bigl[q_{\lambda}(\theta) \Vert  \rho_{\rm post}(\theta | y)\Bigr] &= \int q_{\lambda}(\theta)  \log \frac{q_{\lambda}(\theta)}{\rho_{\rm post}(\theta | y)} d\theta \\
&= \mathbb{E}_{\theta \sim q_{\lambda}(\theta)}  \Bigl[ \log \frac{q_{\lambda}(\theta)}{\rho_{\rm post}(\theta | y)} \Bigr]
\end{align*}
$$
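The expectation form above suggests a direct Monte Carlo estimate: draw samples from $q_{\lambda}$ and average the log ratio. The following is a minimal sketch, assuming both distributions are one-dimensional Gaussians so that the estimate can be checked against the known closed form. The helper names (`log_normal`, `kl_monte_carlo`, `kl_closed_form`) and the parameter values are illustrative and not part of the original notes.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def kl_monte_carlo(m_q, s_q, m_p, s_p, rng, n_samples=200_000):
    """Estimate KL[q || p] as the sample mean of log q - log p under q."""
    theta = m_q + s_q * rng.standard_normal(n_samples)  # theta ~ q
    return np.mean(log_normal(theta, m_q, s_q) - log_normal(theta, m_p, s_p))

def kl_closed_form(m_q, s_q, m_p, s_p):
    """Closed-form KL[q || p] for two univariate Gaussians."""
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p) ** 2) / (2 * s_p**2) - 0.5

rng = np.random.default_rng(0)
kl_mc = kl_monte_carlo(0.0, 1.0, 1.0, 2.0, rng)   # q = N(0, 1), p = N(1, 4)
kl_exact = kl_closed_form(0.0, 1.0, 1.0, 2.0)
kl_reverse = kl_closed_form(1.0, 2.0, 0.0, 1.0)   # arguments swapped
print(kl_mc, kl_exact, kl_reverse)
```

Swapping the two arguments changes the value, which illustrates that the $KL$ divergence is not symmetric.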

The goal is to obtain an optimal $\lambda$ that minimizes the $KL$ divergence. A natural idea is to use gradient descent, which requires the gradient

$$
\begin{align*}
\nabla_{\lambda} KL\Bigl[q_{\lambda}(\theta) \Vert  \rho_{\rm post}(\theta | y)\Bigr] &= \nabla_{\lambda} \int q_{\lambda}(\theta)  \log \frac{q_{\lambda}(\theta)}{\rho_{\rm post}(\theta | y)} d\theta \\
&=  \int \nabla_{\lambda} q_{\lambda}(\theta)  \Bigl( \log q_{\lambda}(\theta) - \log \rho_{\rm prior}(\theta) - \log \rho (y|\theta)\Bigr) d\theta \\
&=  \mathbb{E}_{\theta \sim q_{\lambda}(\theta)} \Bigl[ \nabla_{\lambda} \log q_{\lambda}(\theta)  \Bigl( \log q_{\lambda}(\theta) - \log \rho_{\rm prior}(\theta) - \log \rho (y|\theta)\Bigr) \Bigr]
\end{align*}
$$

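Here we use the identity $\nabla_{\lambda} q_{\lambda}(\theta) = q_{\lambda}(\theta) \nabla_{\lambda} \log q_{\lambda}(\theta)$ and the fact that $q_{\lambda}$ integrates to one for every $\lambda$, so that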
$$
\begin{align*}
\int \nabla_{\lambda} q_{\lambda}(\theta) d\theta  = 0
\end{align*}
$$
Note that the gradient does not depend on the unknown normalization factor $Z(y)$, and the expectation can be approximated by Monte Carlo methods.
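As an illustration of the gradient formula above, here is a minimal sketch of gradient descent driven by its Monte Carlo estimate. Everything specific in it is an assumption made for the example: a toy conjugate model with prior $\theta \sim N(0, 1)$ and likelihood $y | \theta \sim N(\theta, 1)$ (so the exact posterior is $N(y/2, 1/2)$ and the result can be checked), a Gaussian family $q_{\lambda} = N(\mu, \sigma^2)$ with $\lambda = (\mu, \log \sigma)$, and the step size and sample counts. Subtracting the sample mean of the bracketed term is a common variance-reduction choice, justified by $\int \nabla_{\lambda} q_{\lambda}(\theta) d\theta = 0$; it is not something the derivation requires.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def kl_gradient(mu, log_sigma, y, rng, n_samples=500):
    """Monte Carlo estimate of the KL gradient w.r.t. (mu, log_sigma)."""
    sigma = np.exp(log_sigma)
    theta = mu + sigma * rng.standard_normal(n_samples)  # theta ~ q_lambda
    # Bracketed term: log q - log prior - log likelihood (Z(y) is never needed).
    f = (log_normal(theta, mu, sigma)
         - log_normal(theta, 0.0, 1.0)
         - log_normal(y, theta, 1.0))
    f = f - f.mean()  # baseline: valid because E_q[grad log q] = 0
    # Gradients of log q_lambda(theta) with respect to mu and log_sigma.
    score_mu = (theta - mu) / sigma**2
    score_log_sigma = ((theta - mu) / sigma) ** 2 - 1.0
    return np.mean(score_mu * f), np.mean(score_log_sigma * f)

def fit(y=1.0, steps=1000, lr=0.05, seed=0):
    """Plain gradient descent on the KL divergence."""
    rng = np.random.default_rng(seed)
    mu, log_sigma = 0.0, 0.0
    for _ in range(steps):
        g_mu, g_log_sigma = kl_gradient(mu, log_sigma, y, rng)
        mu -= lr * g_mu
        log_sigma -= lr * g_log_sigma
    return mu, np.exp(log_sigma)

mu_hat, sigma_hat = fit()
print(mu_hat, sigma_hat)  # should be close to 0.5 and sqrt(0.5)
```

In this toy problem the family contains the exact posterior, so the fitted parameters can be compared with $y/2$ and $\sqrt{1/2}$. In general this estimator can have high variance, and the number of samples per step may need tuning.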


### Evidence lower bound
The $KL$ divergence can be written as
$$
\begin{align*}
KL\Bigl[q_{\lambda}(\theta) \Vert  \rho_{\rm post}(\theta | y)\Bigr] &= \mathbb{E}_{\theta \sim q_{\lambda}(\theta)}  \Bigl[ \log \frac{q_{\lambda}(\theta)}{\rho_{\rm post}(\theta | y)} \Bigr] \\
&= \log Z(y) - \mathbb{E}_{\theta \sim q_{\lambda}(\theta)}  \Bigl[ \log \frac{\rho(\theta, y)}{q_{\lambda}(\theta)} \Bigr] 
\end{align*}
$$

The evidence lower bound $ELBO(\lambda)$ is defined as 
$$
\begin{align*}
ELBO(\lambda) = \mathbb{E}_{\theta \sim q_{\lambda}(\theta)}  \Bigl[ \log \frac{\rho(\theta, y)}{q_{\lambda}(\theta)} \Bigr] 
\end{align*}
$$

Since the first term does not depend on $\lambda$, minimizing the $KL$ divergence is equivalent to maximizing $ELBO(\lambda)$, and their gradients with respect to $\lambda$ are equal in magnitude and opposite in sign.

Since the $KL$ divergence is nonnegative and $Z(y) = \rho(y)$, we have 
$$
\begin{align*}
\log \rho(y) \geq ELBO(\lambda) = \mathbb{E}_{\theta \sim q_{\lambda}(\theta)}  \Bigl[ \log \frac{\rho(\theta, y)}{q_{\lambda}(\theta)} \Bigr] 
\end{align*}
$$
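The bound can be checked numerically on a problem where $\log \rho(y)$ is known. The sketch below assumes a toy conjugate model, with prior $\theta \sim N(0, 1)$ and likelihood $y | \theta \sim N(\theta, 1)$, for which $\rho(y)$ is the density of $N(0, 2)$ and the exact posterior is $N(y/2, 1/2)$. The function names and the choice of a Gaussian $q_{\lambda}$ are illustrative assumptions.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def elbo(mu, sigma, y, rng, n_samples=100_000):
    """Monte Carlo estimate of ELBO for q = N(mu, sigma^2)."""
    theta = mu + sigma * rng.standard_normal(n_samples)  # theta ~ q
    log_joint = log_normal(theta, 0.0, 1.0) + log_normal(y, theta, 1.0)
    return np.mean(log_joint - log_normal(theta, mu, sigma))

y = 1.0
rng = np.random.default_rng(0)
log_evidence = log_normal(y, 0.0, np.sqrt(2.0))   # log rho(y), known here
elbo_exact = elbo(y / 2, np.sqrt(0.5), y, rng)    # q equals the posterior
elbo_poor = elbo(-1.0, 2.0, y, rng)               # a deliberately bad q
print(log_evidence, elbo_exact, elbo_poor)
```

When $q_{\lambda}$ equals the posterior, the log ratio is the same constant $\log \rho(y)$ for every sample, so the bound is attained up to floating-point error. For any other $q_{\lambda}$ the gap equals the $KL$ divergence.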

## Mean field approximation

We assume that the parameterized distribution factorizes over the components of $\theta$ as follows:

$$
\begin{align*}
q_{\lambda}(\theta) = \prod_{i=1}^{m}q_{\lambda_i}(\theta_i)
\end{align*}
$$

The $KL$ divergence becomes
$$
\begin{align*}
KL\Bigl[q_{\lambda}(\theta) \Vert  \rho_{\rm post}(\theta | y)\Bigr] 
&= \mathbb{E}_{\theta \sim q_{\lambda}(\theta)}  \Bigl[ \sum_{i=1}^{m} \log q_{\lambda_i}(\theta_i) - \log \rho_{\rm post}(\theta | y) \Bigr]
\end{align*}
$$

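This can be minimized with the coordinate descent method, namely by minimizing over each $\lambda_i$ in turn while holding the others fixed. Singling out one index $i_0$ and writing $\theta_{-i_0}$ for all components of $\theta$ other than $\theta_{i_0}$, the $KL$ divergence can be rewritten as 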

$$
\begin{align*}
KL\Bigl[q_{\lambda}(\theta) \Vert  \rho_{\rm post}(\theta | y)\Bigr] 
&= \mathbb{E}_{\theta \sim q_{\lambda}(\theta)}  \Bigl[ \sum_{i=1}^{m} \log q_{\lambda_i}(\theta_i) - \log \rho_{\rm post}(\theta | y) \Bigr]\\
&= \sum_{i=1}^{m} \mathbb{E}_{\theta_i \sim q_{\lambda_i}(\theta_i)}\log q_{\lambda_i}(\theta_i) - \mathbb{E}_{\theta \sim q_{\lambda}(\theta)} \log \Bigl(\rho_{\rm post}(\theta_{-i_0} | y)\rho_{\rm post}(\theta_{i_0} | \theta_{-i_0} , y) \Bigr) \\
&=  \mathbb{E}_{\theta_{i_0} \sim q_{\lambda_{i_0}}(\theta_{i_0})}\Bigl[ \log q_{\lambda_{i_0}}(\theta_{i_0}) - 
\mathbb{E}_{\theta_{-i_0} \sim q_{\lambda_{-i_0}}(\theta_{-i_0})} \log \rho_{\rm post}(\theta_{i_0} | \theta_{-i_0} , y)
\Bigr] + C
\end{align*}
$$
Here $C$ collects the terms that are independent of $\lambda_{i_0}$. Let us denote 
$$
h_{i_0}(\cdot) = \exp\Bigl( \mathbb{E}_{\theta_{-i_0} \sim q_{\lambda_{-i_0}}(\theta_{-i_0})} \log \rho_{\rm post}(\cdot | \theta_{-i_0} , y) \Bigr)
$$

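Up to an additive constant, the expression above is the $KL$ divergence between $q_{\lambda_{i_0}}$ and the normalized $h_{i_0}$, so the optimal solution satisfies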
$$
q_{\lambda_{i_0}}(\cdot) \propto h_{i_0}(\cdot)
$$
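The coordinate update can be made concrete in a case where $h_{i_0}$ is available in closed form. The sketch below assumes, purely for illustration, that the posterior is a two-dimensional Gaussian $N(\mu, \Sigma)$ with precision matrix $\Lambda = \Sigma^{-1}$. The conditional $\rho_{\rm post}(\theta_1 | \theta_2, y)$ is then Gaussian with mean $\mu_1 - \Lambda_{12}/\Lambda_{11} (\theta_2 - \mu_2)$ and variance $1/\Lambda_{11}$, so each optimal factor is Gaussian with variance $1/\Lambda_{ii}$ and a mean that depends on the current mean of the other factor. The function name `cavi_gaussian` and the numbers are hypothetical.

```python
import numpy as np

def cavi_gaussian(mu, Sigma, sweeps=50):
    """Coordinate updates q_i ∝ h_i for a 2D Gaussian target N(mu, Sigma)."""
    Lam = np.linalg.inv(Sigma)   # precision matrix of the target
    m = np.zeros(2)              # means of the factors q_1 and q_2
    for _ in range(sweeps):
        # Update q_1 with q_2 fixed, then q_2 with the new q_1 fixed.
        m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
        m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])
    var = 1.0 / np.diag(Lam)     # factor variances; they do not change across sweeps
    return m, var

mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
m, var = cavi_gaussian(mu, Sigma)
print(m, var)
```

In this example the factor means converge to the true posterior mean, while the factor variances $1/\Lambda_{ii}$ are smaller than the true marginal variances $\Sigma_{ii}$, a known tendency of the mean field approximation for correlated targets.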

## Stein variational gradient descent

TODO

# References
1. [Lecture 5: Variational Inference (Stanford Canvas)](https://canvas.stanford.edu/files/1780120/download?download_frd=1&verifier=MWyibVq7L4EmRgunWLV7pS7CekAI9MLuTJIHxuCV;Lecture+5.pdf;application/pdf)
2. [Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm](https://proceedings.neurips.cc/paper/2016/file/b3ba8f1bee1238a2f37603d90b58898d-Paper.pdf)