# Variational Inference

The basic idea is to pick an approximation $q(x)$ to the distribution from some tractable family, and then to try to make this approximation as close as possible to the true posterior $p ^ { * } ( \mathbf { x } ) \triangleq p ( \mathbf { x } | \mathcal { D } )$. This reduces inference to an optimization problem. By relaxing the constraints and/or approximating the objective, we can trade accuracy for speed.

## Variational Inference

Suppose $p^∗(x)$ is our true but intractable distribution and $q(x)$ is some approximation, chosen from some tractable family, such as a multivariate Gaussian or a factored distribution. We assume $q$ has some free parameters which we want to optimize so as to make $q$ “similar to” $p^∗$.

KL divergence:
$$\mathbb { K } \mathbb { L } ( q \| p ^ { * } ) = \sum _ { \mathbf { x } } q ( \mathbf { x } ) \log \frac { q ( \mathbf { x } ) } { p ^ { * } ( \mathbf { x } ) }$$

Evaluating $p ^ { * } ( \mathbf { x } ) \triangleq p ( \mathbf { x } | \mathcal { D } )$ pointwise is intractable, since it requires the normalization constant $Z = p(\mathcal{D})$. However, the unnormalized distribution $\tilde { p } ( \mathbf { x } ) \triangleq p ( \mathbf { x } , \mathcal { D } ) = p ^ { * } ( \mathbf { x } ) Z$ is usually tractable to compute.

Objective function:
$$J ( q ) \triangleq \mathbb { K } \mathbb { L } ( q \| \tilde { p } ) = \mathbb { K } \mathbb { L } ( q \| p ^ { * } ) - \log Z$$

Since $Z$ is a constant, minimizing $J(q)$ forces $q$ to become close to $p^*$. Since KL divergence is always non-negative, $J(q)$ is an upper bound on the negative log likelihood (NLL):
$$J ( q ) = \mathbb { K } \mathbb { L } ( q \| p ^ { * } ) - \log Z \geq - \log Z = - \log p ( \mathcal { D } )$$
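The identity and the bound can be checked numerically on a toy discrete distribution. The following is a minimal sketch; the values of `p_tilde` and `q` are arbitrary illustrative choices, not taken from any model in these notes.

```python
import math

# Hypothetical unnormalized target p~(x) over 4 states (arbitrary values).
p_tilde = [2.0, 1.0, 0.5, 0.5]
Z = sum(p_tilde)                      # normalization constant
p_star = [v / Z for v in p_tilde]     # normalized target p*(x)

def kl(q, p):
    """sum_x q(x) log(q(x) / p(x)); p does not have to be normalized."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# An arbitrary approximation q(x).
q = [0.4, 0.3, 0.2, 0.1]

J = kl(q, p_tilde)                    # objective J(q) = KL(q || p~)
gap = kl(q, p_star)                   # KL(q || p*), always >= 0
neg_log_Z = -math.log(Z)              # the quantity J(q) bounds from above

print(J, gap + neg_log_Z, neg_log_Z)
```

The first two printed numbers should agree up to rounding, and both should be no smaller than the third. Setting `q = p_star` makes the bound tight.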

Equivalently, we have the following objective:

$$\begin{aligned} J ( q ) & = \mathbb { E } _ { q } [ \log q ( \mathbf { x } ) - \log p ( \mathbf { x } ) p ( \mathcal { D } | \mathbf { x } ) ] \\ & = \mathbb { E } _ { q } [ \log q ( \mathbf { x } ) - \log p ( \mathbf { x } ) - \log p ( \mathcal { D } | \mathbf { x } ) ] \\ & = \mathbb { E } _ { q } [ - \log p ( \mathcal { D } | \mathbf { x } ) ] + \mathbb { K } \mathbb { L } ( q ( \mathbf { x } ) | | p ( \mathbf { x } ) ) \end{aligned}$$

This is the expected NLL, plus a penalty term that measures how far the approximate posterior is from the exact prior.
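As a sanity check, the decomposition can be verified on a small discrete example. This is only a sketch: the `prior`, `lik`, and `q` values below are made up for illustration.

```python
import math

# Hypothetical discrete model with 3 latent states (arbitrary values).
prior = [0.5, 0.3, 0.2]   # p(x)
lik = [0.1, 0.4, 0.7]     # p(D | x) for the observed data D
q = [0.2, 0.3, 0.5]       # approximate posterior q(x)

# Direct form: J(q) = E_q[log q(x) - log p(x) p(D | x)]
J_direct = sum(
    qi * (math.log(qi) - math.log(pi * li))
    for qi, pi, li in zip(q, prior, lik)
)

# Decomposed form: expected NLL plus KL(q || prior)
expected_nll = -sum(qi * math.log(li) for qi, li in zip(q, lik))
kl_to_prior = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior))
J_decomposed = expected_nll + kl_to_prior

print(J_direct, J_decomposed)
```

Both forms should print the same value up to floating-point rounding.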

## The mean field method
Here we use the mean field approximation to infer the latent variables $z_i$. We assume the parameters $\theta$ of the model are known.

We approximate the posterior by a fully factorized distribution of the form:
$$q ( \mathbf { x } ) = \prod_i q _ { i } \left( \mathbf { x } _ { i } \right)$$

Our goal is to solve this optimization problem:
$$\min _ { q _ { 1 } , \ldots , q _ { D } } \mathbb { K } \mathbb { L } ( q \| p )$$

where we optimize over the parameters of each marginal distribution $q_i$.

This can be solved by coordinate descent, where at each step we make the following update:
$$\log q _ { j } \left( \mathbf { x } _ { j } \right) = \mathbb { E } _ { - q _ { j } } [ \log \tilde { p } ( \mathbf { x } ) ] + \text { const }$$

where $\tilde { p } ( \mathbf { x } ) = p ( \mathbf { x } , \mathcal { D } )$ is the unnormalized posterior and $\mathbb { E } _ { - q _ { j } } [ f ( \mathbf { x } ) ]$ means the expectation of $f(\mathbf{x})$ with respect to all variables except $x_j$. For example,
$$\mathbb { E } _ { - q _ { 2 } } [ f ( \mathbf { x } ) ] = \sum _ { x _ { 1 } } \sum _ { x _ { 3 } } q _ { 1 } \left( x _ { 1 } \right) q _ { 3 } \left( x _ { 3 } \right) f \left( x _ { 1 } , x _ { 2 } , x _ { 3 } \right)$$

When updating $q_j$, we only need to reason about the variables which share a factor with $x_j$, i.e., the terms in $j$’s Markov blanket; the other terms get absorbed into the constant term. Since we are replacing the neighboring values by their mean value, the method is known as mean field.
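The coordinate update can be sketched on a tiny model where every expectation is computed by brute-force enumeration. The model below (three binary variables with made-up biases `b` and couplings `w`) is purely hypothetical, and the enumeration ignores the Markov blanket shortcut for the sake of clarity.

```python
import itertools
import math

D = 3            # number of variables
STATES = (0, 1)  # each variable is binary

def log_p_tilde(x):
    """Hypothetical unnormalized log posterior: biases plus agreement terms."""
    b = [0.5, -0.2, 0.1]
    w = {(0, 1): 1.0, (1, 2): -0.8, (0, 2): 0.6}
    s = sum(b[i] * x[i] for i in range(D))
    s += sum(wij * (1.0 if x[i] == x[j] else -1.0) for (i, j), wij in w.items())
    return s

def update(q, j):
    """Set log q_j(x_j) = E_{-q_j}[log p~(x)] + const, then normalize."""
    others = [i for i in range(D) if i != j]
    logits = []
    for xj in STATES:
        total = 0.0
        for vals in itertools.product(STATES, repeat=D - 1):
            x = [0] * D
            x[j] = xj
            weight = 1.0
            for i, v in zip(others, vals):
                x[i] = v
                weight *= q[i][v]
            total += weight * log_p_tilde(x)
        logits.append(total)
    m = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(l - m) for l in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def objective(q):
    """J(q) = E_q[log q(x) - log p~(x)] for the factorized q."""
    J = 0.0
    for x in itertools.product(STATES, repeat=D):
        qx = math.prod(q[i][x[i]] for i in range(D))
        if qx > 0:
            J += qx * (math.log(qx) - log_p_tilde(x))
    return J

q = [[0.5, 0.5] for _ in range(D)]  # start from uniform marginals
history = [objective(q)]
for sweep in range(20):
    for j in range(D):
        q[j] = update(q, j)
    history.append(objective(q))

log_Z = math.log(sum(math.exp(log_p_tilde(x))
                     for x in itertools.product(STATES, repeat=D)))
print(history[0], history[-1], -log_Z)
```

Each coordinate update minimizes $J$ over one factor with the others held fixed, so the recorded values in `history` should never increase and should stay above $-\log Z$.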



## Variational Bayes and Variational Bayes EM
Now suppose we want to infer the parameters themselves. If we make a fully factorized approximation, $p ( \boldsymbol { \theta } | \mathcal { D } ) \approx \prod _ { k } q \left( \boldsymbol { \theta } _ { k } \right)$, we get variational Bayes (VB).

If we want to infer both latent variables and parameters, and we make an approximation of the form $p \left( \boldsymbol { \theta } , \mathbf { z } _ { 1 : N } | \mathcal { D } \right) \approx q ( \boldsymbol { \theta } ) \prod _ { i } q _ { i } \left( \mathbf { z } _ { i } \right)$, we get a method known as variational Bayes EM (VBEM). The dependency structure is $\mathbf { z } _ { i } \rightarrow \mathbf { x } _ { i } \leftarrow \boldsymbol { \theta }$.

+ In the E step, we update the approximate posterior over the latent variables, $q _ { i } \left( \mathbf { z } _ { i } \right)$, which takes the place of the exact posterior $p \left( \mathbf { z } _ { i } | \mathbf { x } _ { i } , \boldsymbol { \theta } \right)$ used in standard EM (a mean field update).
+ In the M step, instead of computing a point estimate of the parameters $\theta$ as in standard EM, we update the approximate posterior over the parameters, $q ( \boldsymbol { \theta } | \mathcal { D } )$.
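As a sketch, applying the mean field update $\log q_j = \mathbb{E}_{-q_j}[\log \tilde{p}] + \text{const}$ to the factorization $q(\boldsymbol{\theta}) \prod_i q_i(\mathbf{z}_i)$ gives the two alternating updates below. This assumes the data are i.i.d. given $\boldsymbol{\theta}$, so that the joint factorizes as $p(\boldsymbol{\theta}) \prod_i p(\mathbf{x}_i, \mathbf{z}_i | \boldsymbol{\theta})$; that assumption is ours and is not stated above.

$$\begin{aligned} \log q_i(\mathbf{z}_i) & = \mathbb{E}_{q(\boldsymbol{\theta})} [ \log p(\mathbf{x}_i, \mathbf{z}_i | \boldsymbol{\theta}) ] + \text{const} \\ \log q(\boldsymbol{\theta}) & = \log p(\boldsymbol{\theta}) + \sum_i \mathbb{E}_{q_i(\mathbf{z}_i)} [ \log p(\mathbf{x}_i, \mathbf{z}_i | \boldsymbol{\theta}) ] + \text{const} \end{aligned}$$

The first line plays the role of the E step and the second the role of the M step, except that the second returns a distribution over $\boldsymbol{\theta}$.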




## Loopy belief propagation (LBP) for general graphs
Unlike mean field, $q$ is not required to be fully factorized.

Idea: we apply the belief propagation algorithm to the graph, even if it has loops. This method is simple and efficient, and often works well in practice, frequently outperforming mean field.

### LBP on pairwise models
![](../images/22.LBP.png)
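The idea can be sketched with sum-product message passing on a small pairwise model. This is a minimal illustration with synchronous updates and made-up potentials on a 3-node cycle; the names `node_pot`, `edge_pot`, and `loopy_bp` are hypothetical, and the details may differ from the algorithm shown in the figure.

```python
import itertools
import math

K = 2  # number of states per node

# Hypothetical pairwise model on a 3-cycle (arbitrary potentials).
node_pot = {0: [1.0, 2.0], 1: [1.5, 1.0], 2: [1.0, 1.0]}
edge_pot = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],
            (1, 2): [[2.0, 1.0], [1.0, 2.0]],
            (0, 2): [[1.0, 1.5], [1.5, 1.0]]}

def normalize(v):
    total = sum(v)
    return [a / total for a in v]

def loopy_bp(node_pot, edge_pot, num_iters=50):
    """Run synchronous sum-product updates; return approximate marginals."""
    nbrs = {s: [] for s in node_pot}
    for (s, t) in edge_pot:
        nbrs[s].append(t)
        nbrs[t].append(s)

    def psi(s, t, xs, xt):
        # Edge potential psi_st(x_s, x_t), stored in one orientation only.
        if (s, t) in edge_pot:
            return edge_pot[(s, t)][xs][xt]
        return edge_pot[(t, s)][xt][xs]

    # m[(s, t)][x_t] is the message from s to t, initialised to uniform.
    m = {(s, t): [1.0 / K] * K for s in nbrs for t in nbrs[s]}
    for _ in range(num_iters):
        new_m = {}
        for (s, t) in m:
            msg = []
            for xt in range(K):
                total = 0.0
                for xs in range(K):
                    # Product of messages into s from all neighbors except t.
                    incoming = math.prod(m[(u, s)][xs] for u in nbrs[s] if u != t)
                    total += node_pot[s][xs] * psi(s, t, xs, xt) * incoming
                msg.append(total)
            new_m[(s, t)] = normalize(msg)
        m = new_m

    # Belief at s: local potential times all incoming messages.
    return {s: normalize([node_pot[s][xs] *
                          math.prod(m[(u, s)][xs] for u in nbrs[s])
                          for xs in range(K)])
            for s in nbrs}

def exact_marginals(node_pot, edge_pot):
    """Brute-force marginals by enumerating all joint states."""
    marg = {s: [0.0] * K for s in node_pot}
    for x in itertools.product(range(K), repeat=len(node_pot)):
        w = math.prod(node_pot[s][x[s]] for s in node_pot)
        w *= math.prod(edge_pot[(s, t)][x[s]][x[t]] for (s, t) in edge_pot)
        for s in node_pot:
            marg[s][x[s]] += w
    return {s: normalize(v) for s, v in marg.items()}

beliefs = loopy_bp(node_pot, edge_pot)
exact = exact_marginals(node_pot, edge_pot)
for s in sorted(beliefs):
    print(s, beliefs[s], exact[s])
```

On the cycle, the beliefs are only approximations of the exact marginals. If the edge `(0, 2)` is removed, the graph becomes a chain and the same routine should recover the exact marginals.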