STA 663 Final Project: “The No-U-Turn Sampler”  
Progress Report 2  
Sarah Normoyle, Gonzalo Bustos 
April 21, 2016  

Headings: 

Abstract  
Background 
Implementation  
Testing  
Optimization  
Application with Data and Model  
Comparison with Stan and other MCMC Algorithms  
Conclusion  
References  

Background:  

For many models, Monte Carlo Markov Chain (MCMC) methods such as Gibbs sampling and the Metropolis Hasting algorithm may not be efficient and may require a long time to converge. By using steps that are evaluated from the first-order gradient of the log posterior, Hamiltonian Monte Carlo (HMC) is an efficient MCMC algorithm that does not use random walk behavior. This paper by Matthew D. Hoffman and Andrew Gelman introduces a new algorithm, called the No-U-Turn Sampler (NUTS) that is an extension of Hamiltonian Monte Carlo. Unlike HMC, NUTS does not require the specification of the parameter for the number of steps, L. In addition, the use of a dual averaging technique is extended from HMC to NUTS in order to avoid the specification of a step size parameter, $\epsilon$. Therefore, unlike HMC, NUTS can be implemented without having to hand-tune both of the two parameters, L and $\epsilon$. In our report, we will implement the Naive NUTS Algorithm and also extend to the NUTS Algorithm with Dual Averaging. We will compare the efficiency and the results of this algorithm to other MCMC algorithms in Stan when used for a specific model and data set.

Explanation of Algorithms and Code

**Leapfrog step** from **Algorithm 1**

**function** Leapfrog$(\theta,r,\varepsilon)$

Set $\tilde{r} \leftarrow r + \frac{\varepsilon}{2} \nabla_\theta \mathcal{L}(\theta)$

Set $\tilde{\theta} \leftarrow \theta + \varepsilon \tilde{r}$

Set $\tilde{r} \leftarrow r + \frac{\varepsilon}{2} \nabla_\theta \mathcal{L}(\tilde{\theta})$

**return** $\tilde{\theta},\tilde{r}$

$\mathcal{L}$ is the logarithm of the joint density of the variables of interest $\theta$. The Leapfrog function of Algorithm 1 implements the Stormer-Verlet ("leapfrog") integrator, which proceeds according to the updates:

$r^{t + \frac{\varepsilon}{2}} = r^t + \frac{\varepsilon}{2} \nabla_\theta \mathcal{L}(\theta^t)$

$\theta^{t + \varepsilon} = \theta^t + \varepsilon r^{t + \frac{\varepsilon}{2}}$

$r^{t + \varepsilon} = r^{t + \frac{\varepsilon}{2}} + \frac{\varepsilon}{2} \nabla_\theta \mathcal{L}(\theta^{t + \varepsilon})$

where $r^t$ and $\theta^t$ denote the values of the momentum and position variables $r$ and $\theta$ at time $t$, $\nabla_\theta$ denotes the gradient with respect to $\theta$ and $\varepsilon$ is the step size parameter.

The performance of Hamiltonian Monte Carlo (HMC) depends strongly on choosing suitable values for $\varepsilon$ and $L$, which is the number of times chosen to run the leapfrog step. If $\varepsilon$ is too large, then the simulation will be inaccurate and yield low acceptance rates. If $\varepsilon$ is too small, then computation will be wasted taking many small steps. If $L$ is too small, then successive samples will be close to one another, resulting in undesirable random walk behavior and slow mixing. If $L$ is too large, then HMC will generate trajectories that loop back and retrace their steps.

The No-U-Turn Sampler (NUTS) is an extension of HMC that eliminates the need to specify a fixed value of $L$, the number of leapfrog steps. It also incorporates schemes for setting $\varepsilon$ based on a dual averaging algorithm.



**Algorithm 3** Efficient NUTS

Given $\theta^0, \varepsilon, \mathcal{L}, M$:

**for** $m=1$ to $M$ **do**

&nbsp;&nbsp;&nbsp;&nbsp; Resample $r^0 \sim \mathcal{N}(0,I)$

&nbsp;&nbsp;&nbsp;&nbsp; Resample $u \sim \text{Uniform}([0, \text{exp}\{\mathcal{L}(\theta^{m-1}) - \frac{1}{2} r^0 \cdot r^0 \}])$

&nbsp;&nbsp;&nbsp;&nbsp; Initialize $\theta^- = \theta^{m-1},~\theta^+ = \theta^{m-1},~r^- = r^0,~r^+ = r^0,~j = 0,~\theta^m=\theta^{m-1},~n=1,~s=1$

&nbsp;&nbsp;&nbsp;&nbsp; **while** $s=1$ **do**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Choose a direction $v_j \sim \text{Uniform}(\{-1,1\})$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **if** $v_j=-1$ **then**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\theta^-,r^-,-,-,\theta',n',s' \leftarrow \text{BuildTree}(\theta^-,r^-,u,v_j,j,\varepsilon)$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **else**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $-,-,\theta^+,r^+,\theta',n',s' \leftarrow \text{BuildTree}(\theta^+,r^+,u,v_j,j,\varepsilon)$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **end if**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **if** $s'=1$ **then**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; With probability $\text{min}\big\{1,\frac{n'}{n}\big\}$, set $\theta^m \leftarrow \theta'$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **end if**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $n \leftarrow n + n'$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $s \leftarrow s' \mathbb{1}[(\theta^+ - \theta^-) \cdot r^- \geq 0] \mathbb{1}[(\theta^+ - \theta^-) \cdot r^+ \geq 0]$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $j \leftarrow j+1$

&nbsp;&nbsp;&nbsp;&nbsp; **end while**

**end for**
<br>
<br>
<br>
**function** BuildTree$(\theta,r,u,v,j,\varepsilon)$

**if** $j=0$ **then**

&nbsp;&nbsp;&nbsp;&nbsp; *Base case - take one leapfrog step in the direction $v$*

&nbsp;&nbsp;&nbsp;&nbsp; $\theta',r' \leftarrow \text{Leapfrog}(\theta,r,v\varepsilon)$

&nbsp;&nbsp;&nbsp;&nbsp; $n' \leftarrow \mathbb{1}[u \leq \text{exp}\{\mathcal{L}(\theta') - \frac{1}{2} r' \cdot r' \}]$

&nbsp;&nbsp;&nbsp;&nbsp; $s' \leftarrow \mathbb{1}[\mathcal{L}(\theta') - \frac{1}{2} r' \cdot r' > \text{log}~u - \Delta_{max}]$

&nbsp;&nbsp;&nbsp;&nbsp; **return** $\theta',r',\theta',r',\theta',n',s'$

**else**

&nbsp;&nbsp;&nbsp;&nbsp; *Recursion - implicitly build the left and right subtrees*

&nbsp;&nbsp;&nbsp;&nbsp; $\theta^-,r^-,\theta^+,r^+,\theta',n',s' \leftarrow \text{BuildTree}(\theta,r,u,v,j-1,\varepsilon)$

&nbsp;&nbsp;&nbsp;&nbsp; **if** $s'=1$ **then**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **if** $v=-1$ **then**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\theta^-,r^-,-,-,\theta'',n'',s'' \leftarrow \text{BuildTree}(\theta^-,r^-,u,v,j-1,\varepsilon)$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **else**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $-,-,\theta^+,r^+,\theta'',n'',s'' \leftarrow \text{BuildTree}(\theta^+,r^+,u,v,j-1,\varepsilon)$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **end if**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; With probability $\frac{n''}{n'+n''}$, set $\theta' \leftarrow \theta''$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $s' \leftarrow s'' \mathbb{1}[(\theta^+ - \theta^-) \cdot r^- \geq 0] \mathbb{1}[(\theta^+ - \theta^-) \cdot r^+ \geq 0]$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $n' \leftarrow n' + n''$

&nbsp;&nbsp;&nbsp;&nbsp; **end if**

&nbsp;&nbsp;&nbsp;&nbsp; **return** $\theta^-,r^-,\theta^+,r^+,\theta',n',s'$

**end if**


Before developing Algorithm 3, Efficient NUTS, the paper develops Algorithm 2, Naive NUTS. Algorithm 2 introduces a slice variable $u$ with conditional distribution $p(u|\theta,r)=\text{Uniform}(u;[0,\text{exp}\{\mathcal{L}(\theta) - \frac{1}{2} r \cdot r \}])$, which renders the conditional distribution $p(\theta,r|u) = \text{Uniform} (\theta,r;\{\theta',r'|\mathcal{L}(\theta) - \frac{1}{2} r \cdot r \} \geq u \})$. After resampling $u|\theta,r$, NUTS uses the leapfrog algorithm to trace out a path forwards and backwards, doing for 1 step, 2 steps, 4 steps, etc. This doubling process builds a balanced binary tree whose leaf-nodes correspond to position-momentum states. The process is halted when the trajectory starts to double back on itself. 

In summary, Algorithm 2 leaves the target distribution $p(\theta) \propto \text{exp}\{\mathcal{L}(\theta)\}$ invariant. It achieves this by resampling the momentum and slice variables $r$ and $u$, simulating a Hamiltonian trajectory forwards and backwards in time until that trajectory either begins retracing its steps or encounters a state with very low probability, selecting a subset of the states encountered on that trajectory that lie within the slice defined by the slice variable $u$, and finally choosing the next position and momentum variables $\theta^m$ and $r$ uniformly at random from the subset of the states encountered.

Algorithm 3 improves Algorithm 2 by breaking out of the recursion as soon as a zero value for the stop indicator $s$ is encountered.

Code Written so far:  
Efficient NUTS sampler  

In [1]:
# from Algorithm 1: HMC, page1353
def Leapfrog(theta, r, eps):
    r_tilde = r + (eps/2) * L_gradient(*theta)
    theta_tilde = theta + eps * r_tilde
    r_tide = r_tilde + (eps/2) * L_gradient(*theta_tilde)
    return theta_tilde, r_tilde

In [2]:
# from Algorithm 3: Efficeint NUTS pg.1364
def BuildTree(theta, r, u, v, j, eps):
    if j == 0:
        # base case, take one leapfrog step in the direction v
        theta_prime, r_prime = Leapfrog(theta, r, v*eps)
        
        n_prime = int(u <= np.exp(L(*theta_prime) - 0.5 * np.dot(r_prime, r_prime)))
        
        s_prime = int(L(*theta_prime) - 0.5 * np.dot(r_prime, r_prime) > np.log(u) - 1000)
        
        return theta_prime, r_prime, theta_prime, r_prime, theta_prime, n_prime, s_prime
    else:
        # recursion, build left and right subtrees
        theta_minus, r_minus, theta_plus, r_plus, theta_prime, n_prime, s_prime = BuildTree(theta, r, u, v, j-1, eps)
        
        if s_prime == 1:
            if v == -1:
                theta_minus, r_minus, _,_, theta_doub_prime, n_doub_prime, s_doub_prime = BuildTree(theta_minus, r_minus, u, v, j-1, eps)
            else:
                _, _, theta_plus, r_plus, theta_doub_prime, n_doub_prime, s_doub_prime = BuildTree(theta_plus, r_plus, u, v, j-1, eps)

            # Use Metropolis-Hastings
            prob = n_doub_prime / (n_prime + n_doub_prime)
            if (np.random.uniform(0,1,1) < prob):
                theta_prime = theta_doub_prime
            
            ind_1 = int(np.dot(theta_plus-theta_minus, r_minus) >= 0)
            ind_2 = int(np.dot(theta_plus-theta_minus, r_plus) >= 0)
            s_prime = s_doub_prime * ind_1 * ind_2
            n_prime = n_prime + n_doub_prime
        
        return theta_minus, r_minus, theta_plus, r_plus, theta_prime, n_prime, s_prime

In [3]:
def efficient_nuts(theta0, eps, L, M, L_gradient):
    # initialize samples matrix
    # put initial theta0 in first row of matrix
    samples = np.empty((M+1, len(theta0)))
    samples[0,:] = theta0
    
    for m in range(1, M+1):
        # resample
        norm_samp = np.random.multivariate_normal(np.repeat(0, len(theta0)), np.identity(len(theta0)), 1)
        r0 = norm_samp.ravel()
        upper = np.exp(L(*samples[m-1,:]) - 0.5*np.dot(r0, r0))
        u = np.random.uniform(0, upper, 1)
        
        # initialize
        theta_minus = samples[m-1,:]
        theta_plus = samples[m-1,:]
        r_minus = r0
        r_plus = r0
        j = 0
        samples[m,:] = samples[m-1,:]
        n = 1
        s = 1
        
        while s == 1:
            v_j = np.random.uniform(-1,1,1)
            if v_j == -1:
                theta_minus, r_minus, _, _, theta_minus, n_minus, s_prime = BuildTree(theta_minus, r_minus, u, v_j, j, eps)
            else:
                _, _, theta_plus, r_plus, theta_prime, n_prime, s_prime = BuildTree(theta_plus, r_plus, u, v_j, j, eps)
            
            if s_prime == 1:
                # Use Metropolis-Hastings
                prob = min(1, n_prime/n)
                if (np.random.uniform(0,1,1) < prob):
                    samples[m,:] = theta_prime
                    
            n = n + n_prime

            boolean_1 = int(np.dot(theta_plus-theta_minus, r_minus) >= 0)
            boolean_2 = int(np.dot(theta_plus-theta_minus, r_plus) >= 0)
            s = s_prime * boolean_1 * boolean_2
            j = j + 1
    return samples

Simple Example:

In [5]:
from scipy.stats import norm
X = np.random.normal(1, 5, size = 50)
n = 100

def L(mu, var):
    log_likelihood = -n/2 * np.log(var) - n/2 * np.log(2*np.pi) - sum((X - mu)**2) / 2*var
    return log_likelihood

def L_gradient(mu, var):
    one = sum(X-mu) / var
    two = -n/2*var + sum((X-mu)**2)/2*var**2
    return np.array([one, two])


In [6]:
# simple example
theta0 = np.array([5,5])
eps = 0.1
M = 20
efficient_nuts(theta0, eps, L, M, L_gradient)

array([[  5.00000000e+00,   5.00000000e+00],
       [  4.98719944e+00,   1.32125375e+01],
       [  4.95468585e+00,   7.85522548e+02],
       [  4.98595771e+00,   2.53360174e+06],
       [  4.96806990e+00,   3.50717315e+12],
       [  5.06259792e+00,   4.67088034e+48],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98],
       [  5.05880907e+00,   1.02908548e+98]])