<a href="https://colab.research.google.com/github/USCbiostats/PM520/blob/main/Lab_8_Variational_Inference_PtI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running Up That Hill, or: Intro to Variational (Bayesian) Inference
$\newcommand{\data}{\text{Data}}$
$\newcommand{\E}{\mathbb{E}}$
Recall that in Bayesian inference, we seek to model the uncertainty in our estimates through a _posterior_ distribution. The posterior is derived from [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) as,
$$\Pr(\theta | \data) = \frac{\Pr(\data | \theta) \Pr(\theta)}{\Pr(\data)},$$
where $\Pr(\theta | \data)$ is the [_posterior_ probability](https://en.wikipedia.org/wiki/Posterior_probability) for $\theta$ and reflects our uncertainty in the values that $\theta$ may take on, $\Pr(\data | \theta)$ is our likelihood, $\Pr(\theta)$ is a [_prior_ probability](https://en.wikipedia.org/wiki/Prior_probability) (or _prior_) over $\theta$ and $\Pr(\data)$ is a [_marginal_ probability/likelihood](https://en.wikipedia.org/wiki/Marginal_likelihood) of the data.

Last week, we explored this concept by "brute forcing" the posterior distribution in a simple exercise (e.g., calculating the posterior probability that an individual is sick, given a positive test), and by using a result for Exponential Families that leverages conjugacy (e.g., the posterior probability that a coin lands on "heads").

*What happens if our model does not have a simple or conjugate form?*

Rather than performing inference under an intractable exact posterior $\Pr(\theta | \data)$, we seek to perform inference using a simpler surrogate distribution $Q(\theta | \data)$. But how do we identify a good surrogate, or even quantify how good a proposed surrogate $Q$ is?

Recall that the [KL-Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) captures a notion of "[statistical distance](https://en.wikipedia.org/wiki/Statistical_distance)" between two probability distributions $p$ and $q$ (it is not symmetric in $p$ and $q$, so it is not a true metric). Its definition for discrete variables is,
$$
\begin{align*}
D_{KL}(p || q) &= \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = - \sum_{x \in \mathcal{X}} p(x) \log \frac{q(x)}{p(x)} \\
  &= -\mathbb{E}_{x \sim p}\left[\log \frac{q(x)}{p(x)} \right].
\end{align*}$$

For continuous $x \in \mathbb{R}$, we have,
$$\begin{align*}
D_{KL}(p || q) &= \int_{-\infty}^\infty p(x) \log \frac{p(x)}{q(x)}dx = -\int_{-\infty}^\infty p(x) \log \frac{q(x)}{p(x)}dx \\
  &= -\mathbb{E}_{x \sim p}\left[\log \frac{q(x)}{p(x)} \right].
\end{align*}$$
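
As a small illustration of the discrete definition above, here is a minimal NumPy sketch. The function name `kl_divergence` and the two probability vectors are made up for this example; the sketch assumes both vectors sum to one and that $q(x) > 0$ wherever $p(x) > 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two hypothetical distributions over the same three outcomes
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print("D_KL(p || q):", kl_pq)
print("D_KL(q || p):", kl_qp)
print("D_KL(p || p):", kl_divergence(p, p))
```

Swapping the arguments changes the value, and the divergence of a distribution from itself is zero.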

We can leverage this concept to measure how good a proposed surrogate $Q$ is compared to the true posterior by computing
$$D_{KL}(Q(\theta | \data) || \Pr(\theta | \data)).$$
However, we often don't know the functional form of $\Pr(\theta | \data)$, let alone how to compute it in intractable settings! Where does that leave us?

$$\newcommand{\ELBO}{\text{ELBO}}\begin{align*}
D_{KL}(Q(\theta | \data) || \Pr(\theta | \data)) &= \E_Q\left[ \log \frac{Q(\theta | \data)}{\Pr(\theta | \data) }\right] \\
  &= \E_Q\left[ \log \frac{Q(\theta | \data)\Pr(\data)}{\Pr(\data | \theta) \Pr(\theta)}\right] \\
  &= \E_Q\left[ \log \frac{Q(\theta | \data)}{\Pr(\data | \theta) \Pr(\theta)}\right] + \E_Q[\log \Pr(\data) ]\\
  &= \E_Q\left[ \log \frac{Q(\theta | \data)}{\Pr(\data | \theta) \Pr(\theta)}\right] + \log \Pr(\data) \\
  &= \underbrace{\E_Q[ \log Q(\theta | \data)] - \E_Q[\log \Pr(\data | \theta)] - \E_Q[\log \Pr(\theta)]}_{-\ELBO} + \log \Pr(\data) \geq 0 \\
  &\implies -\ELBO \geq - \log \Pr(\data) \iff \ELBO \leq \log \Pr(\data).
\end{align*}$$

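Because the KL-divergence is non-negative, the $\ELBO$ is a lower bound on the log evidence $\log \Pr(\data)$ (hence the name, "evidence lower bound"). Moreover, $\log \Pr(\data)$ does not depend on $Q$, so maximizing the $\ELBO$ over $Q$ is equivalent to minimizing the KL-divergence between $Q$ and $\Pr(\theta | \data)$.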
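
The identity behind the derivation, $\log \Pr(\data) = \ELBO + D_{KL}(Q(\theta | \data) || \Pr(\theta | \data))$, can be checked numerically on a toy problem. The sketch below is an assumption-laden example, not part of the lab: it restricts $\theta$ (a coin's bias) to a small grid so the exact posterior is computable, and the data counts and the surrogate `q_guess` are invented.

```python
import numpy as np

# Hypothetical toy model: a coin whose bias theta lives on a small grid
theta = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
prior = np.full(theta.size, 1.0 / theta.size)  # uniform prior over the grid
heads, tails = 7, 3                            # made-up data

# log Pr(Data | theta) for one fixed sequence of flips
log_lik = heads * np.log(theta) + tails * np.log1p(-theta)

# Exact posterior and evidence; tractable only because the grid is tiny
joint = np.exp(log_lik) * prior
evidence = joint.sum()
posterior = joint / evidence
log_evidence = float(np.log(evidence))

def elbo(q):
    """E_Q[log Pr(Data | theta)] + E_Q[log Pr(theta)] - E_Q[log Q(theta)]."""
    return float(np.sum(q * (log_lik + np.log(prior) - np.log(q))))

def kl(q, p):
    """D_KL(q || p) for strictly positive probability vectors."""
    return float(np.sum(q * np.log(q / p)))

q_guess = np.array([0.05, 0.15, 0.30, 0.35, 0.15])  # arbitrary surrogate

print("log evidence      :", log_evidence)
print("ELBO(q_guess)     :", elbo(q_guess))
print("ELBO + KL         :", elbo(q_guess) + kl(q_guess, posterior))
print("ELBO(posterior)   :", elbo(posterior))
```

The gap between the $\ELBO$ and the log evidence is exactly the KL-divergence to the posterior, and it closes when the surrogate equals the posterior.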

Given this, a helpful representation of the $\ELBO$ is to re-write it as,
$$\begin{align*}
\ELBO &:= -\E_Q[ \log Q(\theta | \data)] + \E_Q[\log \Pr(\data | \theta)] + \E_Q[\log \Pr(\theta)] \\
  &= \E_Q[\log \Pr(\data | \theta)] - \E_Q\left[ \log \frac{Q(\theta | \data)}{\Pr(\theta)}\right]\\
  &= \E_Q[\log \Pr(\data | \theta)] - D_{KL}(Q(\theta | \data) || \Pr(\theta)).
\end{align*}$$
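
When $\theta$ is continuous, the expectations in the definition of the $\ELBO$ (the first line above) can be approximated by averaging over samples drawn from $Q$. The following is a rough sketch under stated assumptions rather than the lab's method: it uses a coin-flip likelihood with a $\text{Beta}(1, 1)$ prior and a $\text{Beta}$ surrogate, chosen because the exact log evidence is then available for comparison. The helper names (`log_beta_fn`, `log_beta_pdf`, `elbo_mc`) and the data counts are hypothetical.

```python
import math
import numpy as np

def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_beta_pdf(theta, a, b):
    """Log density of Beta(a, b) evaluated at theta in (0, 1)."""
    return (a - 1) * np.log(theta) + (b - 1) * np.log1p(-theta) - log_beta_fn(a, b)

heads, tails = 7, 3          # made-up data: one fixed sequence of 10 flips
prior_a, prior_b = 1.0, 1.0  # assumed Beta(1, 1) prior, i.e. uniform on (0, 1)

def elbo_mc(q_a, q_b, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E_Q[log lik + log prior - log Q], Q = Beta(q_a, q_b)."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(q_a, q_b, size=n_samples)  # samples from the surrogate Q
    log_lik = heads * np.log(theta) + tails * np.log1p(-theta)
    log_prior = log_beta_pdf(theta, prior_a, prior_b)
    log_q = log_beta_pdf(theta, q_a, q_b)
    return float(np.mean(log_lik + log_prior - log_q))

# Exact log evidence, available here thanks to conjugacy
log_evidence = log_beta_fn(prior_a + heads, prior_b + tails) - log_beta_fn(prior_a, prior_b)

elbo_poor = elbo_mc(2.0, 2.0)                           # a surrogate far from the posterior
elbo_exact = elbo_mc(prior_a + heads, prior_b + tails)  # surrogate equal to the posterior

print("log evidence        :", log_evidence)
print("ELBO, Q = Beta(2, 2):", elbo_poor)
print("ELBO, Q = posterior :", elbo_exact)
```

In this conjugate setting the surrogate family contains the true posterior, so the bound can be made tight; in non-conjugate models the best achievable $\ELBO$ generally stays strictly below $\log \Pr(\data)$.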