# Transport Score Climbing
This tutorial examines a recently developed method for Bayesian inference called Transport Score Climbing (TSC) ([Zhang et al., 2022](https://arxiv.org/abs/2202.01841)). We explain how the method works and how to implement it using FlowTorch. Then, we demonstrate TSC on a simple toy example, replicating Figure 3 from [the paper](https://arxiv.org/pdf/2202.01841.pdf), and compare it to modern black-box variational inference.

[High level explanation of TSC!]

We assume the reader is familiar with *Bayesian statistics*, *variational inference (VI)*, and *Hamiltonian Monte Carlo (HMC)*.

## Background
The setting is that we have a Bayesian model, $p_\phi(x, z)$, and wish to perform approximate Bayesian inference over it, that is, either calculate an approximation to the posterior, $p_\phi(z\mid x)$, or some expectation with respect to it. In our notation, $x$ denotes observed variables, $z$ denotes latent variables, and the model is represented as the product of a prior and likelihood,
$$
p_\phi(x, z) = p_\phi(z)p_\phi(x\mid z).
$$
Optional learnable parameters are denoted by $\phi$; they are often fixed, especially in more traditional Bayesian models.
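
As a small illustration of this factorization, the sketch below scores a toy model in log space. The model itself (a standard normal prior on a scalar $z$ and a unit-variance normal likelihood centred at $z$) and the helper names are assumptions made for this example only; they are not taken from the paper, and there are no learnable parameters $\phi$ here.

```python
import math


def log_normal(value, mean, std):
    """Log-density of N(mean, std^2) evaluated at `value`."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((value - mean) / std) ** 2)


def log_prior(z):
    # Assumed toy prior: z ~ N(0, 1).
    return log_normal(z, 0.0, 1.0)


def log_likelihood(x, z):
    # Assumed toy likelihood: x | z ~ N(z, 1).
    return log_normal(x, z, 1.0)


def log_joint(x, z):
    # ln p(x, z) = ln p(z) + ln p(x | z)
    return log_prior(z) + log_likelihood(x, z)


x_obs = 2.0
print(log_joint(x_obs, 1.0))
```

Working with log-densities turns the product of prior and likelihood into a sum, which is the form both VI and MCMC methods typically consume.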

## Standard Variational Inference
Let us briefly review modern black-box VI, which we refer to from here on as *standard VI*. The idea behind VI in the broader sense is to specify a family of approximations to the posterior,
$$\{q_\psi(z\mid x)\mid\psi\in\Psi\}$$
and learn a member of this family, $q_{\psi^*}$, that is "closest" to the posterior in some sense. Standard VI defines closeness as the reverse KL-divergence,
$$
\textnormal{KL}\{q_\psi(z\mid x)\,||\,p_\phi(z\mid x)\} \triangleq \mathbb{E}_{q_\psi}\left[\ln\frac{q_\psi(z\mid x)}{p_\phi(z\mid x)}\right],
$$
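and minimizes it indirectly: because the KL-divergence itself depends on the intractable posterior, standard VI instead maximizes a *lower bound* on the log-evidence, $\ln p_\phi(x)$ (the ELBO), with respect to $(\psi, \phi)$ via stochastic gradient ascent. For fixed $\phi$, maximizing this bound over $\psi$ is equivalent to minimizing the reverse KL-divergence. In some settings, the data, $x$, is fixed; in others, we take an additional expectation over $X\sim f(\cdot)$ and learn to amortise the cost of inference across related observations from the same distribution, $f(\cdot)$.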

The main benefit of the formulation of standard VI is computational convenience: to perform approximate inference we only have to be able to draw samples from $q_\psi$, score $q_\psi$ and $p_\phi$, and take gradients through the lower-bound objective. These are very general conditions, satisfied by a wide variety of models.
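
To see these ingredients in action, here is a minimal sketch of a Monte Carlo estimate of the lower bound, $\mathbb{E}_{q_\psi}[\ln p_\phi(x, z) - \ln q_\psi(z\mid x)]$, using only sampling from $q$ and scoring of $p$ and $q$. The conjugate Gaussian toy model, the sample size, and all function names are assumptions chosen so that the exact log-evidence and posterior are available for comparison; gradient-based optimization is omitted.

```python
import math
import random


def log_normal(value, mean, std):
    """Log-density of N(mean, std^2) evaluated at `value`."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((value - mean) / std) ** 2)


def log_joint(x, z):
    # Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1).
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)


def elbo_estimate(x, q_mean, q_std, num_samples, rng):
    """Monte Carlo estimate of E_q[ln p(x, z) - ln q(z)] for q = N(q_mean, q_std^2)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)                              # sample from q
        total += log_joint(x, z) - log_normal(z, q_mean, q_std)   # score p and q
    return total / num_samples


x_obs = 2.0
rng = random.Random(0)

# Closed-form quantities for this conjugate model, used only as a reference:
# the evidence is x ~ N(0, 2) and the posterior is z | x ~ N(x / 2, 1 / 2).
log_evidence = log_normal(x_obs, 0.0, math.sqrt(2.0))
post_mean, post_std = x_obs / 2.0, math.sqrt(0.5)

elbo_prior_guess = elbo_estimate(x_obs, 0.0, 1.0, 20000, rng)
elbo_at_posterior = elbo_estimate(x_obs, post_mean, post_std, 20000, rng)

print("log-evidence:          ", log_evidence)
print("ELBO with q = prior:    ", elbo_prior_guess)
print("ELBO with q = posterior:", elbo_at_posterior)
```

The gap between the log-evidence and the bound is exactly the reverse KL-divergence, so the bound is tight when $q$ equals the posterior and strictly looser otherwise.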

## Transport Score Climbing
[Zhang et al., 2022](https://arxiv.org/abs/2202.01841) pose the question: suppose instead we wish to optimize the *forward* KL-divergence,
$$
\textnormal{KL}\{p_\phi(z\mid x)\,||\,q_\psi(z\mid x)\} \triangleq \mathbb{E}_{p_\phi}\left[\ln\frac{p_\phi(z\mid x)}{q_\psi(z\mid x)}\right],
$$
how could this be accomplished? The challenge here is that the expectation is with respect to the posterior and that is the exact thing we are trying to learn!

There are several reasons why we might want to optimize the *forward* KL-divergence rather than the *reverse* one. For instance, optimizing the reverse KL-divergence commonly exhibits "mode-seeking behaviour": the solution fits a single mode and underestimates posterior uncertainty, especially when the variational family is unimodal. In contrast, optimizing the forward KL-divergence commonly exhibits "mode-spreading behaviour": the solution spreads across modes and gives a more realistic estimate of posterior uncertainty.
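
The contrast between the two divergences can be checked numerically on a one-dimensional example. The sketch below is an illustration under assumed settings, not an experiment from the paper: the target is an equal mixture of $N(-3, 1)$ and $N(3, 1)$, and we compare two hand-picked Gaussian candidates. The "wide" candidate matches the target's mean and variance (which is the forward-KL optimum within the Gaussian family), while the "one_mode" candidate sits on a single mixture component. Both divergences are computed with a simple Riemann sum.

```python
import math


def log_normal(value, mean, std):
    """Log-density of N(mean, std^2) evaluated at `value`."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((value - mean) / std) ** 2)


def log_target(z):
    """Assumed bimodal target: equal mixture of N(-3, 1) and N(3, 1)."""
    a = log_normal(z, -3.0, 1.0)
    b = log_normal(z, 3.0, 1.0)
    m = max(a, b)  # log-sum-exp for numerical stability
    return math.log(0.5) + m + math.log(math.exp(a - m) + math.exp(b - m))


def kl(log_a, log_b, lo=-25.0, hi=25.0, step=0.01):
    """KL(a || b) approximated by a Riemann sum on a uniform grid."""
    num_cells = int(round((hi - lo) / step))
    total = 0.0
    for i in range(num_cells + 1):
        z = lo + i * step
        la = log_a(z)
        total += math.exp(la) * (la - log_b(z)) * step
    return total


def log_wide(z):
    # Moment-matched Gaussian: mean 0, variance 1 + 3^2 = 10.
    return log_normal(z, 0.0, math.sqrt(10.0))


def log_one_mode(z):
    # Gaussian concentrated on the right-hand mode only.
    return log_normal(z, 3.0, 1.0)


forward = {"wide": kl(log_target, log_wide), "one_mode": kl(log_target, log_one_mode)}
reverse = {"wide": kl(log_wide, log_target), "one_mode": kl(log_one_mode, log_target)}

print("forward KL(p || q):", forward)
print("reverse KL(q || p):", reverse)
```

The forward KL-divergence heavily penalizes the single-mode candidate for ignoring half of the target's mass, whereas the reverse KL-divergence prefers it over the wide candidate, which places mass in the low-density region between the modes.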

The idea behind TSC is to approximate the expectation in the forward KL-divergence with samples drawn from an MCMC kernel that has the posterior as its stationary distribution. Let us elaborate.



It turns out that one can prove this algorithm minimizes the forward KL-divergence.
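
The idea of replacing the posterior expectation with MCMC samples can be illustrated in a few lines. The following is a deliberately simplified sketch and not the TSC algorithm from the paper: it swaps HMC for a fixed random-walk Metropolis kernel, uses a plain Gaussian $q_\psi$ instead of a transport map, and never feeds the approximation back into the sampler. The toy model, step sizes, and function names are all assumptions for illustration. Since only the $-\ln q_\psi$ term of the forward KL-divergence depends on $\psi$, each update is a stochastic gradient ascent step on $\ln q_\psi(z)$ at the current MCMC sample.

```python
import math
import random


def log_posterior_unnorm(z, x):
    # Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1); additive constants dropped.
    return -0.5 * z ** 2 - 0.5 * (x - z) ** 2


def metropolis_step(z, x, rng, proposal_std=1.0):
    """One random-walk Metropolis step that leaves p(z | x) invariant."""
    proposal = z + proposal_std * rng.gauss(0.0, 1.0)
    log_accept = log_posterior_unnorm(proposal, x) - log_posterior_unnorm(z, x)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return proposal
    return z


def fit_forward_kl(x, num_steps=20000, base_lr=0.05, seed=0):
    """Fit q(z) = N(mu, sigma^2) by following gradients of ln q at MCMC samples."""
    rng = random.Random(seed)
    mu, log_sigma = 0.0, 0.0   # variational parameters psi
    z = 0.0                    # current state of the Markov chain
    for t in range(num_steps):
        z = metropolis_step(z, x, rng)        # approximate posterior sample
        sigma = math.exp(log_sigma)
        resid = (z - mu) / sigma
        lr = base_lr / math.sqrt(1.0 + t)     # decaying step size
        mu += lr * resid / sigma              # d ln q / d mu
        log_sigma += lr * (resid ** 2 - 1.0)  # d ln q / d log_sigma
    return mu, math.exp(log_sigma)


x_obs = 2.0
mu_hat, sigma_hat = fit_forward_kl(x_obs)
print("fitted q:        mean", mu_hat, "std", sigma_hat)
print("exact posterior: mean", x_obs / 2.0, "std", math.sqrt(0.5))
```

For this conjugate model the exact posterior is $N(x/2, 1/2)$, so the fitted mean and standard deviation can be compared directly against the truth. No samples from $q_\psi$ are needed in the update, in contrast to standard VI.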