# Parallel Tempering and HMC

In this problem we explore **parallel tempering**, a Markov chain Monte Carlo (MCMC) technique that uses an ensemble of chains to better sample multi-modal distributions.


### Remembrance of Temperature Past

If you recall from our studies of Simulated Annealing, almost any probability distribution that we're trying to sample (as long as it's a proper probability distribution) can be written as the exponential of a negative "energy" $E(x) = -\log{p(x)}$:

$$p(x) = e^{-E(x)} = e^{-\left(-\log{p(x)}\right)}$$

Given this formulation we can add a temperature parameter, which gives us a family of distributions indexed by temperature, and we can sample from the distribution at any given temperature.  We represent $p(x)$ with respect to the temperature $T$ such that

$$p_{new}(x) \propto e^{-\frac{1}{T} \left(-\log{p(x)}\right)} = p(x)^{1/T}$$

At $T = 1$ we recover our initial probability distribution, but as we discussed in Simulated Annealing, at higher values of $T$ we get "flatter" probability distributions that let us explore the entire range more easily and reach areas of lower probability density.
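
The flattening effect can be checked numerically. The sketch below takes the tempered density to be proportional to $p(x)^{1/T}$ and uses an illustrative two-bump target (the function `log_p` and its modes at $\pm 3$ are assumptions made for this example, not part of the problem). It compares the density in the valley between the modes to the density at a peak for several temperatures.

```python
import numpy as np

def log_p(x):
    """Unnormalized log-density of an illustrative two-bump target (an assumption)."""
    return np.logaddexp(-0.5 * (x + 3.0) ** 2, -0.5 * (x - 3.0) ** 2)

def tempered_log_p(x, T):
    """Log of p(x)**(1/T), up to an additive constant."""
    return log_p(x) / T

def valley_to_peak(T):
    """Tempered density at the valley (x = 0) relative to a peak (x = 3)."""
    return float(np.exp(tempered_log_p(0.0, T) - tempered_log_p(3.0, T)))

# The ratio approaches 1 as T grows, i.e. the valley becomes easier to cross.
ratios = {T: valley_to_peak(T) for T in (1.0, 2.0, 5.0, 10.0)}
for T, r in ratios.items():
    print(f"T = {T:5.1f}   valley/peak = {r:.3f}")
```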

We know from simulated annealing how to use this to find optima, but how do we use this to sample?

### Introduction to the Theory of Parallel Tempering

In order to take advantage of temperatures to sample a distribution the trick is to use data augmentation on steroids.  Instead of sampling from one distribution, sample from an ensemble of M distributions (our original distribution at M temperatures).  At each sampling step, instead of 1 sample you're going to get a vector of M samples.  Let's call that vector $X$ with $X_1$ the samples at temperature $T_1$, $X_2$ the samples at temperature $T_2$, ..., $X_M$ the samples at temperature $T_M$.
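
As a rough picture of the bookkeeping, here is a minimal sketch of the ensemble state. The geometric spacing of temperatures, the helper name `geometric_ladder`, and the values of `M`, `N`, and `T_max` are all assumptions for illustration; the problem does not prescribe a particular ladder.

```python
import numpy as np

def geometric_ladder(M, T_max):
    """Temperatures T_1 = 1 < ... < T_M = T_max, evenly spaced in log(T)."""
    return T_max ** (np.arange(M) / (M - 1))

M, N = 5, 1000                      # number of temperatures, number of steps
temps = geometric_ladder(M, T_max=20.0)

rng = np.random.default_rng(0)
X = rng.normal(size=M)              # X[k]: current state of the chain at temps[k]
trace = np.empty((N, M))            # row q holds the ensemble vector X at step q
trace[0] = X
```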

So far so good, but now we just have an ensemble of samples at different temperatures.  Are we any closer to sampling from our distribution?  Not yet: the key is to be able to swap states between temperatures, or ensembles.


### Part A:  It's all About that Swap, About that Swap

We call it a swap of ensembles because if we consider the ensemble a set of M chains at different temperatures, then after q sampling steps we allow the (q+1)th sample of chain k to come from the qth sample of chain l, and the (q+1)th sample of chain l to come from the qth sample of chain k.

> **Problem A**:   Let's say you're at ensemble k and at sample q.  Given what you know about Metropolis Hastings, what's the acceptance probability of swapping to ensemble l at sample q+1?  To be specific, let $x^q_k$ be the qth sample of the kth ensemble.  What's the acceptance probability of moving from $x^q_k$ to $x^{q+1}_{l}$?


### Part B:  Get Around, Get Around, I Get Around

By incorporating swaps between temperature ensembles we can now let the better mixing and exploration of the higher-temperature states percolate into our samples from the original distribution.  We earlier called parallel tempering "data augmentation on steroids".  How do we recover the samples of our original distribution?

> **Problem B**: If we have an ensemble of M chains and we want N samples, how many samples are generated in order to get our final trace?  How many are we throwing away?  Which chain contains samples of our original distribution?

In theory we can consider swaps between any chains in the ensemble, but in practice we only consider swaps between adjacent chains (assuming chains/ensembles are ordered by temperature) and we use the same order of swaps each time.
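
The adjacent-only schedule can be sketched as follows. The helper names are hypothetical, and `accept_swap` is a placeholder callable standing in for whatever acceptance rule you derive in Problem A; this sketch only shows the order in which pairs are visited and how states are exchanged.

```python
def adjacent_pairs(M):
    """Pairs (k, k+1) of neighbouring chains, visited in a fixed order."""
    return [(k, k + 1) for k in range(M - 1)]

def swap_states(X, k, l):
    """Return a copy of the ensemble state with entries k and l exchanged."""
    X = list(X)
    X[k], X[l] = X[l], X[k]
    return X

def swap_sweep(X, accept_swap):
    """Propose a swap for each adjacent pair in turn.

    accept_swap(X, k, l) -> bool is a placeholder for the acceptance
    decision (Problem A); it is not specified here.
    """
    for k, l in adjacent_pairs(len(X)):
        if accept_swap(X, k, l):
            X = swap_states(X, k, l)
    return X
```

With this ordering a state can move across several temperatures in a single sweep, but only by passing through each neighbour along the way.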

> **Problem C**: Why would we only consider swaps between adjacent ensembles?

> **Problem D**: Write a parallel tempering Metropolis Hastings sampler that considers ensemble swaps after every sampling step as a Python function that takes the number of ensembles (and an initial ensemble state vector) as parameters and swaps between adjacent ensembles only.  Use your sampler to generate 10,000 samples from the probability distribution representing a mixture of two normal distributions X ~ N(3.75, 0.75) + N(6.00, 0.50). Provide convergence diagnostics and sample statistics/visualizations.  Choose a number of ensembles between 3 and 8 and appropriate temperatures. Compare to a normal Metropolis Hastings sampler (feel free to use the one from lecture notes) in terms of convergence and sampling accuracy.

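For reference, one possible way to code the target log-density is sketched below. The problem statement leaves two things open, so this sketch makes two loud assumptions: the components are equally weighted, and the second argument of each N(., .) is a standard deviation. Adjust `w` and the `sigma` values if your reading differs.

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    """Log-density of a univariate normal with mean mu and std dev sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def mixture_logpdf(x, w=0.5):
    """Log-density of w * N(3.75, 0.75) + (1 - w) * N(6.00, 0.50).

    Assumes equal weights by default and that 0.75 and 0.50 are
    standard deviations (both are assumptions).
    """
    a = np.log(w) + normal_logpdf(x, 3.75, 0.75)
    b = np.log(1.0 - w) + normal_logpdf(x, 6.00, 0.50)
    return np.logaddexp(a, b)   # log-sum-exp avoids underflow in the tails

# Sanity check: the density should integrate to roughly 1.
grid = np.linspace(-5.0, 15.0, 20001)
total_mass = float(np.sum(np.exp(mixture_logpdf(grid))) * (grid[1] - grid[0]))
```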

### Part C:  HMC My Tempered Soul

We've talked about parallel tempering in terms of Metropolis Hastings, but let's look at the case of Hamiltonian Monte Carlo (HMC).  When we wrote our Metropolis Hastings version of the parallel tempering sampler, the only factor affecting the trajectory of our samples was the proposal distribution.  In HMC we add an arbitrary Kinetic potential that, together with the leapfrog dynamics, determines the trajectory.
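
As a baseline, here is a minimal sketch of the standard, untempered leapfrog integrator with $K(p) = \frac{1}{2} p^T p$ (that is, at $T = 1$). The function names and the harmonic-oscillator test potential are assumptions for illustration; how temperature enters the dynamics is left to Problem E.

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    """Standard leapfrog for H(q, p) = U(q) + 0.5 * p.p (no tempering)."""
    q = np.array(q, dtype=float)
    p = np.array(p, dtype=float)
    p = p - 0.5 * eps * grad_U(q)        # initial half step for momentum
    for i in range(n_steps):
        q = q + eps * p                  # full step for position
        if i < n_steps - 1:
            p = p - eps * grad_U(q)      # full step for momentum
    p = p - 0.5 * eps * grad_U(q)        # final half step for momentum
    return q, p

def hamiltonian(q, p, U):
    """Total energy: potential plus the usual quadratic kinetic term."""
    return U(q) + 0.5 * np.sum(p ** 2)

# Illustrative potential: U(q) = 0.5 * q.q (a standard normal target).
U = lambda q: 0.5 * np.sum(q ** 2)
grad_U = lambda q: q

q0, p0 = np.array([1.0]), np.array([0.0])
q1, p1 = leapfrog(q0, p0, grad_U, eps=0.1, n_steps=20)
energy_error = abs(hamiltonian(q1, p1, U) - hamiltonian(q0, p0, U))
```

The energy error stays small for a small step size, and running the integrator again from `(q1, -p1)` returns to the starting point, which is the reversibility property the Metropolis correction relies on.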

> **Problem E**: Write out the equations defining the leapfrog dynamics for a chain at ensemble k (i.e. sampling at temperature $T_k$).  Assume the normal choice of Kinetic potential.

> **Problem F**: Write out the acceptance probability of swapping between ensembles k and l in a parallel tempering version of the HMC sampler.  Assume the normal choice of Kinetic potential and that you're at sample q (see Problem A).

> **Problem G**: Write a parallel tempering HMC sampler using the acceptance probability and leapfrog dynamics you specified in Problems E and F.  Feel free to base it on the HMC sampler from lecture notes or lab.  Sample from the bivariate normal mixture distribution X ~ N([0,0], I) + N([10,10], 2I).  Compare to a normal HMC sampler in terms of convergence and sampling accuracy.  Provide convergence diagnostics and sample statistics/visualizations.  Choose a number of ensembles between 3 and 8.


### Part D:  My Temper Does Me no (Extra) Credit

It turns out that you can get the benefits of multiple ensembles in an HMC sampler without using multiple ensembles and chains, because HMC gives you flexibility in determining the dynamics of the system (remember you can choose any Kinetic potential as long as it depends only on momentum and not position).  One possibility is to multiply the momentum by a factor alpha at every step of the first half of your leapfrog trajectory (alpha should generally be slightly higher than 1).  In the second half of the trajectory, divide the momentum by alpha.

"The determinant of the Jacobian matrix for such a tempered trajectory is one, just as for standard HMC, so its endpoint can be used as a proposal without any need to include a Jacobian factor in the acceptance probability." $^1$

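One way to see why the quoted claim is plausible, as a sketch assuming a $d$-dimensional momentum and a trajectory with $n$ multiplications by alpha followed by $n$ divisions: each leapfrog update is a shear with unit Jacobian determinant, each multiplication $p \to \alpha p$ contributes a factor $\alpha^{d}$, and each division $p \to p / \alpha$ contributes $\alpha^{-d}$, so the factors cancel:

$$\left|\det J\right| = \underbrace{1 \cdots 1}_{\text{leapfrog updates}} \cdot \left(\alpha^{d}\right)^{n} \cdot \left(\alpha^{-d}\right)^{n} = 1$$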
> **Problem H** (Extra Credit): Write out the equations defining the leapfrog dynamics.  Assume the normal choice of Kinetic potential.

"Multiplying the momentum by an alpha that is slightly greater than one increases the value of H(q, p) slightly. If H initially had a value typical of the canonical distribution at T = 1, after this multiplication, H will be typical of a value of T that is slightly higher.

Initially, the change in H(q, p) = K(p) + U(q) is due entirely to a change in K(p) as p is made bigger, but subsequent dynamical steps will tend to distribute the increase in H between K and U, producing a more diffuse distribution for q than is seen when T = 1. After many such multiplications of p by alpha, values for q can be visited that are very unlikely in the distribution at T = 1, allowing movement between modes that are separated by low-probability regions. The divisions by alpha in the second half of the trajectory result in H returning to values that are typical for T = 1, but perhaps now in a different mode."

> **Problem I** (Extra Credit): Write an HMC sampler using the acceptance probability and leapfrog dynamics you specified in Problem H.  Feel free to base it on the HMC sampler from lecture notes or lab.  Sample from the bivariate normal mixture distribution X ~ N([0,0], I) + N([10,10], 2I).  Compare to a normal HMC sampler in terms of convergence and sampling accuracy.  Provide convergence diagnostics and sample statistics/visualizations.  Choose a number of ensembles between 3 and 8.

