<a href="https://colab.research.google.com/github/USCbiostats/PM520/blob/main/Lab_9_Variational_Inference_PtII.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Out of touch, or: Non-Conjugate Variational Inference
Last week, we discussed how to perform Bayesian inference when the exact posterior is computationally intractable. Specifically, variational inference seeks an _approximating_ or _surrogate_ distribution $Q$ that is "close", in the KL sense, to the true posterior distribution. The KL divergence between the two is given by,
$$\newcommand{\data}{\text{Data}}\newcommand{\E}{\mathbb{E}}\newcommand{\ELBO}{\text{ELBO}}
\begin{align*}D_{KL}(Q(\theta | \data) || \Pr(\theta | \data)) &= \E_Q\left[ \log \frac{Q(\theta | \data)}{\Pr(\theta | \data) }\right]\\
&= -\ELBO(\theta) + \log \Pr(\data)\\
\ELBO(\theta) &:= -\E_Q[ \log Q(\theta | \data)] + \E_Q[\log \Pr(\data | \theta)] + \E_Q[\log \Pr(\theta)] \\
  &= \E_Q[\log \Pr(\data | \theta)] - \E_Q\left[ \log \frac{Q(\theta | \data)}{\Pr(\theta)}\right].
\end{align*}$$

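Rather than evaluate $D_{KL}(Q(\theta | \data) || \Pr(\theta | \data))$ directly, which requires the intractable posterior, variational inference (often) focuses on maximizing (and evaluating) the $\ELBO$ term. Because the KL divergence is nonnegative, the $\ELBO$ is a lower bound on the log marginal likelihood $\log \Pr(\data)$, and maximizing it over $Q$ is equivalent to minimizing the KL divergence.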
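
As a sanity check on the identity above, here is a minimal sketch in a toy model where every term has a closed form. The model is an assumption chosen for illustration (it does not come from the lab): $y_i \sim N(\theta, \sigma^2)$ with prior $\theta \sim N(0, \tau^2)$ and surrogate $Q(\theta) = N(m, s^2)$. All function names are hypothetical.

```python
import math

def elbo(y, m, s2, sigma2=1.0, tau2=1.0):
    """Closed-form ELBO for Q(theta) = N(m, s2) in the toy normal-normal model."""
    # E_Q[log Pr(data | theta)], using E_Q[(y_i - theta)^2] = (y_i - m)^2 + s2
    exp_loglik = sum(
        -0.5 * math.log(2 * math.pi * sigma2) - ((yi - m) ** 2 + s2) / (2 * sigma2)
        for yi in y
    )
    # E_Q[log Pr(theta)], using E_Q[theta^2] = m^2 + s2
    exp_logprior = -0.5 * math.log(2 * math.pi * tau2) - (m ** 2 + s2) / (2 * tau2)
    # -E_Q[log Q(theta)] is the entropy of N(m, s2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)
    return exp_loglik + exp_logprior + entropy

def log_evidence(y, sigma2=1.0, tau2=1.0):
    """log Pr(data): marginally, y ~ N(0, sigma2 * I + tau2 * 11^T)."""
    n, sy, syy = len(y), sum(y), sum(yi ** 2 for yi in y)
    logdet = (n - 1) * math.log(sigma2) + math.log(sigma2 + n * tau2)
    quad = (syy - tau2 * sy ** 2 / (sigma2 + n * tau2)) / sigma2
    return -0.5 * (n * math.log(2 * math.pi) + logdet + quad)

def exact_posterior(y, sigma2=1.0, tau2=1.0):
    """Mean and variance of the exact (conjugate) posterior for theta."""
    prec = 1.0 / tau2 + len(y) / sigma2
    return (sum(y) / sigma2) / prec, 1.0 / prec

def kl_normal(m, s2, mp, sp2):
    """KL( N(m, s2) || N(mp, sp2) )."""
    return 0.5 * (math.log(sp2 / s2) + (s2 + (m - mp) ** 2) / sp2 - 1.0)

y = [0.3, -1.2, 2.5, 0.8]       # made-up data
m, s2 = 0.1, 0.7                # an arbitrary (suboptimal) surrogate
mp, sp2 = exact_posterior(y)

gap = log_evidence(y) - elbo(y, m, s2)
kl = kl_normal(m, s2, mp, sp2)
gap_at_posterior = log_evidence(y) - elbo(y, mp, sp2)
print(gap, kl)                  # these two agree up to floating-point error
print(gap_at_posterior)         # approximately zero: the bound is tight at the posterior
```

The gap between $\log \Pr(\text{Data})$ and the ELBO equals the KL divergence for any choice of $(m, s^2)$, and it closes only when $Q$ matches the exact posterior.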

Before proceeding with optimization, we must specify the independence structure across latent variables $\theta_j$, which determines the intermediate surrogates $Q_j$. A common factorization is the mean-field, given by,
$$\newcommand{\indep}{\perp \!\!\!\! \perp}Q(\theta) = \prod_{j=1}^p Q_j(\theta_j),$$ or, intuitively, that $\theta_j \indep \theta_{j'}$ for $j \neq j'$ under $Q$. There are certainly other options for how to factor $Q$ over latent variables (e.g., *structured* mean-field), which trade model and computational complexity against downstream accuracy, but the mean-field is often the simplest place to begin.

Given a factorization for $Q$, coordinate ascent variational inference (CAVI) seeks the optimal $Q_j^*$ for each factor in turn, holding the others fixed. The optimal factor satisfies,
$$\begin{align*}
\log Q_j^*(\theta_j) &= \E_Q\left[\log \Pr(\data | \theta) | \theta_j\right] + \E_Q\left[\log \Pr(\theta) | \theta_j \right].
\end{align*}$$
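Here, we condition on $\theta_j$ and compute expectations with respect to $Q$ over the _other_ variables $\theta_{j'}$. The equality holds up to an additive constant that does not depend on $\theta_j$ and is fixed by normalizing $Q_j^*$.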
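
To make the coordinate update concrete, here is a minimal sketch under a strong simplifying assumption: the log joint, viewed as a function of $\theta = (\theta_1, \theta_2)$, is a bivariate Gaussian with mean $\mu$ and precision matrix $\Lambda$. This toy target is not the lab's regression model, and the function and argument names are hypothetical. Taking the expectation of the log density over $\theta_2$ leaves a quadratic in $\theta_1$, so each optimal factor is Gaussian with variance $1/\Lambda_{jj}$ and a mean that depends on the current mean of the other factor.

```python
def cavi_bivariate_gaussian(mu, prec, n_iters=50):
    """CAVI for a mean-field Q1(theta1) Q2(theta2) against a bivariate Gaussian.

    mu   : (mu1, mu2), mean of the target
    prec : ((L11, L12), (L12, L22)), precision matrix of the target
    Returns ((m1, v1), (m2, v2)), the mean and variance of each factor.
    """
    (l11, l12), (_, l22) = prec
    m1, m2 = 0.0, 0.0  # arbitrary initialization of the factor means
    for _ in range(n_iters):
        # log Q1*(theta1) = E_{Q2}[log p(theta1, theta2)] + const
        m1 = mu[0] - l12 * (m2 - mu[1]) / l11
        # log Q2*(theta2) = E_{Q1}[log p(theta1, theta2)] + const
        m2 = mu[1] - l12 * (m1 - mu[0]) / l22
    # The factor variances do not depend on the other factor at all
    return (m1, 1.0 / l11), (m2, 1.0 / l22)

mu = (1.0, -1.0)
prec = ((2.0, 1.2), (1.2, 2.0))
q1, q2 = cavi_bivariate_gaussian(mu, prec)
print(q1, q2)
```

In this toy case the factor means converge to the target mean, while the factor variances $1/\Lambda_{jj}$ are smaller than the true marginal variances whenever $\Lambda_{12} \neq 0$, which illustrates what the mean-field independence assumption gives up.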

Our derivation of the variational linear regression model seemed to pull the form of each $Q_j$ out of "thin air". Is there a systematic way to identify the functional form of $Q_j$?

## Conditional conjugacy and Exponential Families
Let's suppose that our surrogate distribution for $\theta_j$ is in the exponential family, $Q_j(\theta_j) \propto \exp(\eta_j \cdot T_j(\theta_j))$, where $\eta_j$ are the _natural_ parameters and $T_j(\theta_j)$ are the sufficient statistics, assuming some constant base measure. The normalizing constant is $\exp(-A(\eta_j))$, where $A(\eta_j)$ is the log-partition function. Recall that $\mu_j := \E_Q[T_j(\theta_j)] = \frac{\partial}{\partial \eta_j} A(\eta_j)$ are the _expectation_ parameters for the same exponential family distribution.
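
The relationship between natural and expectation parameters can be checked numerically. The sketch below uses the Bernoulli family as an assumed example (it is not discussed in the lab): $T(x) = x$, $\eta = \log \frac{p}{1 - p}$, and $A(\eta) = \log(1 + e^{\eta})$, so the derivative of $A$ should recover $p$, the expected value of $x$. The function names are hypothetical, and the derivative is approximated with central finite differences rather than computed exactly.

```python
import math

def log_partition(eta):
    """A(eta) = log(1 + exp(eta)) for the Bernoulli family with T(x) = x."""
    return math.log1p(math.exp(eta))

def expectation_parameter(eta, eps=1e-6):
    """Central finite-difference approximation to dA/d(eta)."""
    return (log_partition(eta + eps) - log_partition(eta - eps)) / (2 * eps)

p = 0.3
eta = math.log(p / (1 - p))        # natural parameter (the logit of p)
mu = expectation_parameter(eta)    # should be close to E[T(x)] = p
print(eta, mu)
```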

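Under conditional conjugacy, the complete conditional $\Pr(\theta_j | \theta_{j'}, \data)$ is in this same exponential family, with natural parameters $\eta_j(\theta_{j'}, \data)$ that depend on the other variables and the data. Terms that do not involve $\theta_j$ are absorbed into $O(1)$, so that,
$$\begin{align*}
\log Q_j(\theta_j) &= \E_Q\left[\log \Pr(\data, \theta) | \theta_j\right] + O(1) \\
&= \E_Q[\eta_j(\theta_{j'}, \data) \cdot T_j(\theta_j) - A(\eta_j(\theta_{j'}, \data))] + O(1)\\
&= \E_Q[\eta_j(\theta_{j'}, \data)] \cdot T_j(\theta_j) - \E_Q[A(\eta_j(\theta_{j'}, \data))] + O(1) \Rightarrow\\
Q_j(\theta_j) &\propto \exp\left(\E_Q[\eta_j(\theta_{j'}, \data)] \cdot T_j(\theta_j)\right)
\end{align*}$$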
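
The final line says that the optimal factor stays in the family of the complete conditional, with its natural parameters replaced by their expectations under $Q$. Here is a minimal sketch for one assumed model (not taken from the lab): $y_i \sim N(\mu, \tau^{-1})$ with independent priors $\mu \sim N(\mu_0, \lambda_0^{-1})$ and a Gamma prior on $\tau$. Given $\tau$, the complete conditional for $\mu$ is normal with sufficient statistics $(\mu, \mu^2)$ and natural parameters $(\lambda_0 \mu_0 + \tau \sum_i y_i, \; -\frac{1}{2}(\lambda_0 + n \tau))$. Both are linear in $\tau$, so the update only needs the expected value of $\tau$ under its own factor. The function and argument names are hypothetical.

```python
def update_q_mu(y, expected_tau, mu0=0.0, lam0=1.0):
    """Natural-parameter update for Q(mu) given E_Q[tau].

    Returns ((eta1, eta2), (mean, var)), where (eta1, eta2) pair with the
    sufficient statistics (mu, mu^2) of a normal distribution.
    """
    n, sy = len(y), sum(y)
    # Expected natural parameters of the complete conditional for mu
    eta1 = lam0 * mu0 + expected_tau * sy
    eta2 = -0.5 * (lam0 + n * expected_tau)
    # Convert natural parameters back to the usual mean and variance
    var = -1.0 / (2.0 * eta2)
    mean = eta1 * var
    return (eta1, eta2), (mean, var)

y = [1.0, 2.0, 3.0]     # made-up data
expected_tau = 2.0      # e.g. a / b if Q(tau) is Gamma with shape a and rate b
natural, moments = update_q_mu(y, expected_tau)
print(natural, moments)
```

Working in natural parameters makes the update a direct read-off from the complete conditional. No integral over $\mu$ is ever computed.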