### Latent Variable Models

We would like to model the probability distribution $p_\theta (x)$ parametrized by $\theta$, where $x$ depends on some latent variable $z$. For instance, if $x$ is an image of a handwritten digit, then $z$ can encode which digit was written or the thickness of the writing. With a prior $p(z)$ on the latent variable,
\\[ p_\theta(x) = \int p_\theta (x | z) \, p(z) \, dz = \mathbb{E}_{p(z)} \left[ p_\theta (x | z) \right].\\]
To find the maximum likelihood estimate $\hat{\theta} = \text{argmax}_{\theta \in \Theta} \; p_\theta(x)$, we need to evaluate the integral over $z$. However, $z$ can be high dimensional, and samples $z^{(m)} \sim p(z)$ drawn from the prior typically land where $p_\theta(x | z)$ is negligible, so we would need many Monte Carlo samples to get a decent approximation of the integral. A better idea is to rely on [importance sampling](https://en.wikipedia.org/wiki/Importance_sampling), where we sample from a proposal distribution $q_\phi (z | x)$ parametrized by $\phi$ instead of directly from $p(z)$:
\\[ p_\theta(x) = \int \frac{p_\theta (x | z) \, p(z)}{q_\phi (z | x)} q_\phi(z | x) \, dz = \mathbb{E}_{q_\phi(z | x)} \left[ \frac{p_\theta (x | z) \, p(z)}{q_\phi (z | x)} \right] \approx \frac{1}{M} \sum_{m=1}^M  \frac{p_\theta (x | z^{(m)}) \, p(z^{(m)})}{q_\phi (z^{(m)} | x)}, \quad z^{(m)} \sim q_\phi (z | x).\\]
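
To make the comparison concrete, here is a minimal sketch on a toy model that is an assumption of this example, not part of the text: $p(z) = \mathcal{N}(0, 1)$ and $p(x | z) = \mathcal{N}(z, 1)$, so that the exact marginal is $p(x) = \mathcal{N}(0, 2)$. The proposal $\mathcal{N}(x/2, 0.8^2)$ is hand-picked to sit near the true posterior, and all function names are hypothetical.

```python
import math
import random


def normal_pdf(v, mean, std):
    """Density of N(mean, std^2) evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def prior_sampling_estimate(x, num_samples, rng):
    """Estimate p(x) by averaging p(x | z) over draws z ~ p(z) = N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, z, 1.0)
    return total / num_samples


def importance_sampling_estimate(x, num_samples, rng, q_mean, q_std):
    """Estimate p(x) by averaging p(x | z) p(z) / q(z | x) over draws z ~ q(z | x)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)
        weight = normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0) / normal_pdf(z, q_mean, q_std)
        total += weight
    return total / num_samples


rng = random.Random(0)
x = 3.0
exact = normal_pdf(x, 0.0, math.sqrt(2.0))  # marginal of the assumed toy model
from_prior = prior_sampling_estimate(x, 2000, rng)
from_proposal = importance_sampling_estimate(x, 2000, rng, q_mean=x / 2.0, q_std=0.8)
print(f"exact={exact:.5f}  prior={from_prior:.5f}  proposal={from_proposal:.5f}")
```

With the same number of samples, the proposal-based estimate should land much closer to the exact value, because the importance weights are nearly constant when the proposal is close to the posterior.
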
An important property of importance sampling is that it provides a lower bound on the log likelihood:
\\[ \log p_\theta(x) = \log \mathbb{E}_{q_\phi(z | x)} \left[ \frac{p_\theta (x | z) \, p(z)}{q_\phi (z | x)} \right] \geq \mathbb{E}_{q_\phi(z | x)} \log \left[ \frac{p_\theta (x | z) \, p(z)}{q_\phi (z | x)} \right] \approx \frac{1}{M} \sum_{m=1}^M \left[ \log p_\theta (x | z^{(m)}) + \log p(z^{(m)}) - \log q_\phi (z^{(m)} | x) \right].\\]
The inequality comes from the concavity of the logarithm and [Jensen's inequality](https://en.wikipedia.org/wiki/Jensen%27s_inequality). Another illuminating way to confirm the inequality is to compute the difference between the left-hand side and the right-hand side:
\\[ \log p_\theta(x) - \mathbb{E}_{q_\phi(z | x)} \log \left[ \frac{p_\theta (x | z) \, p(z)}{q_\phi (z | x)} \right] = \mathbb{E}_{q_\phi(z | x)} \log \left[ \frac{p_\theta(x) \, q_\phi (z | x)} {p_\theta (x | z) \, p(z)} \right] = \mathbb{E}_{q_\phi(z | x)} \log \left[ \frac{q_\phi (z | x)}{p_\theta (z | x)} \right] = \text{KL}(q_\phi (z | x) \, \| \, p_\theta (z | x)) \geq 0. \\]
Here, $p_\theta (z | x) = p_\theta (x | z) \, p(z) / p_\theta(x)$ by Bayes' rule, and $\text{KL}(q_\phi (z | x) \, \| \, p_\theta (z | x))$ denotes the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) between the proposal distribution $q_\phi (z | x)$ and $p_\theta (z | x)$. The divergence is always bounded below by 0, according to [Gibbs' inequality](https://en.wikipedia.org/wiki/Gibbs%27_inequality). In the language of Bayesian inference, $p_\theta (z | x)$ is the posterior distribution if we assume a prior $p(z)$ on the latent space. Since the posterior is often intractable, we use $q_\phi (z | x)$ to approximate it, hence the name approximate posterior.
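
The identity between the gap and the KL divergence can be checked numerically. The sketch below assumes a toy model that is not part of the text: $p(z) = \mathcal{N}(0, 1)$ and $p(x | z) = \mathcal{N}(z, 1)$, for which the marginal is $\mathcal{N}(0, 2)$ and the exact posterior is $\mathcal{N}(x/2, 1/2)$. The proposal $\mathcal{N}(1, 1)$ is deliberately mismatched, and the helper names are hypothetical.

```python
import math
import random


def normal_logpdf(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * ((v - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def lower_bound_estimate(x, q_mean, q_std, num_samples, rng):
    """Monte Carlo average of log p(x | z) + log p(z) - log q(z | x) with z ~ q(z | x)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)
        total += (normal_logpdf(x, z, 1.0) + normal_logpdf(z, 0.0, 1.0)
                  - normal_logpdf(z, q_mean, q_std))
    return total / num_samples


def gaussian_kl(mean_a, std_a, mean_b, std_b):
    """Closed-form KL(N(mean_a, std_a^2) || N(mean_b, std_b^2))."""
    return (math.log(std_b / std_a)
            + (std_a ** 2 + (mean_a - mean_b) ** 2) / (2.0 * std_b ** 2) - 0.5)


rng = random.Random(0)
x, q_mean, q_std = 3.0, 1.0, 1.0
log_px = normal_logpdf(x, 0.0, math.sqrt(2.0))            # exact log marginal
bound = lower_bound_estimate(x, q_mean, q_std, 20000, rng)
kl = gaussian_kl(q_mean, q_std, x / 2.0, math.sqrt(0.5))  # KL(q || exact posterior)
print(f"log p(x)={log_px:.3f}  bound={bound:.3f}  gap={log_px - bound:.3f}  KL={kl:.3f}")
```

Up to Monte Carlo noise, the printed gap should match the closed-form KL value, and the bound should sit below the exact log likelihood.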

### Variational Inference

If the approximation is quite accurate, i.e. $\text{KL}(q_\phi (z | x) \, \| \, p_\theta (z | x)) \approx 0$ where $p_\theta (z | x)$ is the true posterior, maximizing the log likelihood $\log p_\theta(x)$ is approximately equivalent to maximizing the quantity on the right-hand side of the previous inequality, often known as the evidence lower bound (ELBO). It can be decomposed into two interpretable parts:
\\[\mathbb{E}_{q_\phi(z | x)} \log \left[ \frac{p_\theta (x | z) \, p(z)}{q_\phi (z | x)} \right] = \mathbb{E}_{q_\phi(z | x)} \log p_\theta (x | z) -  \mathbb{E}_{q_\phi(z | x)} \log \left[ \frac{q_\phi (z | x)}{p(z)} \right] = \mathbb{E}_{q_\phi(z | x)} \log p_\theta (x | z) - \text{KL} ( q_\phi(z | x) \, \| \, p(z)).\\]
The first term $\mathbb{E}_{q_\phi(z | x)} \log p_\theta (x | z)$ measures how well the input $x$ is reconstructed after being mapped to its latent code $z$ and back to the input space; its negative is the reconstruction error. The second term $\text{KL} ( q_\phi(z | x) \, \| \, p(z))$ tells us how much the approximate posterior $q_\phi(z | x)$ differs from the prior $p(z)$, and it enters the ELBO with a minus sign, so it acts as a regularizer.
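
As a worked instance of the second term, assume (this choice is not made in the text above) a $d$-dimensional diagonal Gaussian approximate posterior $q_\phi(z | x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$. Under these assumptions the KL term has a closed form and needs no sampling:
\\[ \text{KL} ( q_\phi(z | x) \, \| \, p(z)) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right). \\]
It vanishes exactly when $\mu = 0$ and $\sigma = 1$, i.e. when the approximate posterior coincides with the prior.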

We can get rid of the expectation in the ELBO by estimating it with a single sample from the proposal distribution. Given a training dataset $\mathcal{D} = \{x_i\}$, we find estimates for $\theta \in \Theta$ and $\phi \in \Phi$ by maximizing the resulting lower bound on the log likelihood, i.e. $\theta^*, \, \phi^* = \text{argmax}_{\theta \in \Theta, \, \phi \in \Phi} \mathcal{L}(x; \theta, \phi)$ where
\\[ \mathcal{L}(x; \theta, \phi) = \frac{1}{|\mathcal{D}|} \sum_{i = 1}^{|\mathcal{D}|} \log \left[ \frac{p_\theta (x_i | z_i) \, p(z_i)}{q_\phi (z_i | x_i)} \right], \quad z_i \sim q_\phi (z | x_i).\\]

### Stochastic Gradient Estimators

We typically want to use gradient ascent (or gradient descent on the negative objective), or another optimizer in that family, to learn $\theta$ and $\phi$. To do so, we need to be able to compute gradients of $\mathcal{L}(x; \theta, \phi)$ with respect to $\theta$ and $\phi$. Computing $\nabla_\theta \mathcal{L}(x; \theta, \phi)$ is straightforward as long as $p_\theta(x | z)$ is differentiable in $\theta$. Computing $\nabla_\phi \mathcal{L}(x; \theta, \phi)$, however, is not straightforward, because the sampling distribution itself depends on $\phi$. More generally, how do we actually compute $\nabla_\phi \mathbb{E}_{q_\phi(z)} \left[ f (z) \right]$?

The first trick is reparametrization, probably best known from [Kingma and Welling](https://arxiv.org/abs/1312.6114). The idea is to rewrite $\mathbb{E}_{q_\phi(z)} \left[ f(z) \right]$ as $\mathbb{E}_{q'(\epsilon)} \left[ f(g_\phi(\epsilon)) \right]$, where the noise distribution $q'(\epsilon)$ does not depend on $\phi$. For example, if $z \sim \mathcal{N}(\mu, \sigma^2)$ (in this case $\phi = \{\mu, \sigma\}$), then we can let $z = \mu + \epsilon \sigma$ and sample $\epsilon$ from $\mathcal{N}(0, 1)$. Essentially, we let the function $f(z)$ absorb the parameters $\phi$ while freeing the sampling distribution from them, so that the gradient moves inside the expectation: $\nabla_\phi \mathbb{E}_{q'(\epsilon)} \left[ f(g_\phi(\epsilon)) \right] = \mathbb{E}_{q'(\epsilon)} \left[ \nabla_\phi f(g_\phi(\epsilon)) \right]$. This trick only works if $z$ can be written as such a differentiable transformation $g_\phi(\epsilon)$ and $f$ is itself differentiable. If the sampling distribution is discrete (Bernoulli, for example) or the function $f$ is non-differentiable, the trick doesn't apply directly.
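
The Gaussian case can be sketched in a few lines. The test function $f(z) = z^2$ is an assumption chosen for this example because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the exact gradients are $2\mu$ and $2\sigma$; the function name is hypothetical and the derivative of $f$ is written out by hand rather than obtained by automatic differentiation.

```python
import random


def reparam_gradient(mu, sigma, num_samples, rng):
    """Estimate the gradient of E[z^2] w.r.t. (mu, sigma) with z = mu + sigma * eps."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on the parameters
        z = mu + sigma * eps        # z = g_phi(eps)
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples


rng = random.Random(0)
mu, sigma = 1.5, 0.5
grad_mu, grad_sigma = reparam_gradient(mu, sigma, 20000, rng)
print(f"d/dmu: {grad_mu:.3f} (exact {2 * mu})  d/dsigma: {grad_sigma:.3f} (exact {2 * sigma})")
```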

The second trick uses the score function. [The score function](https://en.wikipedia.org/wiki/Score_(statistics)) is the gradient of the log-likelihood with respect to the parameters, here $\nabla_\phi \log q_\phi(z)$. Rewriting the gradient of the expectation with respect to $\phi$ gives us
\\[ \nabla_\phi \mathbb{E}_{q_\phi(z)} \left[ f(z) \right] \stackrel{(1)}{=} \int f(z) \, \nabla_\phi q_\phi(z) dz = \int f(z) \nabla_\phi q_\phi(z) \left(\frac{q_\phi(z)}{q_\phi(z)} \right)  dz \stackrel{(2)}{=} \int f(z) \, \nabla_\phi \log q_\phi(z) \, q_\phi(z)dz =  \mathbb{E}_{q_\phi(z)} \left[ f (z) \nabla_\phi \log q_\phi(z) \right] \stackrel{(3)}{\approx} \, \frac{1}{M} \sum_{m = 1}^M f(z^{(m)}) \nabla_\phi \log q_\phi (z^{(m)}), \quad z^{(m)} \sim q_\phi(z). \\]

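Here, we (1) apply the [Leibniz integral rule](https://en.wikipedia.org/wiki/Leibniz_integral_rule) to exchange the derivative and the integral, (2) use the log-derivative trick $\nabla_\phi q_\phi(z) = q_\phi(z) \, \nabla_\phi \log q_\phi(z)$ to introduce another $q_\phi(z)$ term, and (3) perform a Monte Carlo approximation. Note that although it's tempting to just average the gradients $\nabla_\phi q_\phi(z^{(m)})$ weighted by $f(z^{(m)})$, that's not a valid approximation of the integral: an average over samples $z^{(m)} \sim q_\phi(z)$ estimates an expectation under $q_\phi$, whereas the integrand $f(z) \nabla_\phi q_\phi(z)$ carries no density factor $q_\phi(z)$: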
\\[\nabla_\phi \mathbb{E}_{q_\phi(z)} \left[ f (z) \right] =  \nabla_\phi \int f(z) q_\phi(z) dz = \int f(z) \left( \nabla_\phi q_\phi(z) \right) dz \, \not\approx \, \frac{1}{M} \sum_{m = 1}^M f(z^{(m)}) \nabla_\phi q_\phi(z^{(m)}).\\]
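
The score function estimator also covers the discrete case where reparametrization is unavailable. As a minimal sketch, assume $q_\phi$ is a Bernoulli distribution with success probability $\phi$ and take $f(z) = (z - 0.2)^2$; both choices are illustrative and not from the text. Here $\mathbb{E}_{q_\phi(z)}[f(z)] = \phi f(1) + (1 - \phi) f(0)$, so the exact gradient is $f(1) - f(0)$.

```python
import random


def f(z):
    """Arbitrary test function; it never needs to be differentiated."""
    return (z - 0.2) ** 2


def score(z, phi):
    """Gradient w.r.t. phi of log q_phi(z) for q_phi = Bernoulli(phi)."""
    return 1.0 / phi if z == 1 else -1.0 / (1.0 - phi)


def score_function_gradient(phi, num_samples, rng):
    """Average f(z) * score(z) over draws z ~ Bernoulli(phi)."""
    total = 0.0
    for _ in range(num_samples):
        z = 1 if rng.random() < phi else 0
        total += f(z) * score(z, phi)
    return total / num_samples


rng = random.Random(0)
phi = 0.3
estimate = score_function_gradient(phi, 20000, rng)
exact = f(1) - f(0)
print(f"estimate={estimate:.3f}  exact={exact:.3f}")
```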

The downside of the second trick is that the Monte Carlo approximation often has high variance. Drawing more samples only reduces the variance at a rate of $1/M$, which doesn't scale well. A better idea is to use [control variates](https://en.wikipedia.org/wiki/Control_variates). In this case, a good control variate is $C(x) \nabla_\phi \log q_\phi(z)$, where $C(x)$ does not depend on $z$. Its expectation is zero by the same trick as above: $\mathbb{E}_{q_\phi(z)} \left[ C(x) \nabla_\phi \log q_\phi(z) \right] = C(x) \int \nabla_\phi q_\phi(z) \, dz = C(x) \, \nabla_\phi \int q_\phi(z) \, dz = C(x) \, \nabla_\phi 1 = 0$. Putting everything together, we get a gradient estimator that can have lower variance and works even when no reparametrization is available:

\\[\nabla_\phi \mathbb{E}_{q_\phi(z)} \left[ f (z) \right] = \mathbb{E}_{q_\phi(z)} \left[ f (z) \nabla_\phi \log q_\phi(z) \right] = \mathbb{E}_{q_\phi(z)} \left[ (f(z) - C(x)) \nabla_\phi \log q_\phi(z) \right] \approx \, \frac{1}{M} \sum_{m = 1}^M (f(z^{(m)}) - C(x)) \nabla_\phi \log q_\phi (z^{(m)}).\\]
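
The effect of the control variate can be seen on a small example. The sketch below again assumes a Bernoulli $q_\phi$ with success probability $\phi$, a test function $f(z) = 10 + z$ with a large constant offset, and a hand-picked constant baseline of $10.5$ standing in for $C(x)$; all of these are illustrative assumptions. The exact gradient is $f(1) - f(0) = 1$.

```python
import random
import statistics


def f(z):
    """Test function with a large offset, which inflates the plain estimator's variance."""
    return 10.0 + z


def score(z, phi):
    """Gradient w.r.t. phi of log q_phi(z) for q_phi = Bernoulli(phi)."""
    return 1.0 / phi if z == 1 else -1.0 / (1.0 - phi)


def per_sample_gradients(phi, baseline, num_samples, rng):
    """Return the individual terms (f(z) - baseline) * score(z) for z ~ Bernoulli(phi)."""
    terms = []
    for _ in range(num_samples):
        z = 1 if rng.random() < phi else 0
        terms.append((f(z) - baseline) * score(z, phi))
    return terms


rng = random.Random(0)
phi = 0.3
plain = per_sample_gradients(phi, 0.0, 5000, rng)           # no control variate
with_baseline = per_sample_gradients(phi, 10.5, 5000, rng)  # assumed constant baseline

plain_mean = sum(plain) / len(plain)
baseline_mean = sum(with_baseline) / len(with_baseline)
plain_var = statistics.variance(plain)
baseline_var = statistics.variance(with_baseline)
print(f"plain:    mean={plain_mean:.3f}  variance={plain_var:.2f}")
print(f"baseline: mean={baseline_mean:.3f}  variance={baseline_var:.2f}")
```

Both estimators target the same gradient, but subtracting the baseline removes most of the offset in $f$, so the per-sample variance should drop by orders of magnitude in this setup.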
