## Problem 7.1: Writing your own MCMC sampler

Attribution: Maddie coded and wrote up this problem. Everyone discussed the problem together.

In [3]:
import itertools

import numpy as np
import pandas as pd
import scipy.stats as st
import random

import numba

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

First we write a function that takes or rejects a Metropolis-Hastings step.

In [4]:
def mh_step(x, logpost, logpost_current, sigma, args=()):
    """
    Parameters
    ----------
    x : ndarray, shape (n_variables,)
        The present location of the walker in parameter space.
    logpost : function
        The function to compute the log posterior. It has call
        signature `logpost(x, *args)`.
    logpost_current : float
        The current value of the log posterior.
    sigma : ndarray, shape (n_variables, )
        The standard deviations for the proposal distribution.
    args : tuple
        Additional arguments passed to `logpost()` function.

    Returns
    -------
    output : ndarray, shape (n_variables,)
        The position of the walker after the Metropolis-Hastings
        step. If no step is taken, returns the inputted `x`.
    """
    # Get next step
    x_next = np.random.multivariate_normal(x, sigma)

    # Calculate r
    theta_p = np.exp(logpost(x_next, *args))
    theta_i = np.exp(logpost_current)
    r = theta_p / theta_i
    
    # Choose to accept or reject step    
    p = np.random.uniform(0, 1)
    if r >= 1:
        return x_next, 1
    elif p <= r:
        return x_next, 1
    else:
        return x, 0

Now we write a function that uses the previous step function to take samples. We'd eventually like to add some automatic tuning of sigma so that the acceptance rate is approximately 0.4, so we calculate the acceptance rate of steps and return that along with the dataframe of samples if the acceptance rate is within our desired range. If the acceptance rate is not in the desired range, an empty dataframe is returned. 

We calculate the acceptance rate of steps by calculating the ratio of the number of steps taken to the total number of sampling steps. I checked previously to see if the acceptance rate during the burn steps was different than the acceptance rate during the sampling steps, and it's approximately the same, sometimes higher, sometimes lower, but we want to make sure that the acceptance rate isn't biased by the inital burn steps, so we're only using the sampling steps to calculate the acceptance rate.

In [39]:
def mh_sample(logpost, x0, sigma, args=(), n_burn=1000, n_steps=1000,
              variable_names=None, accept_rate_bounds=[0.2, 0.5]):
    """
    Parameters
    ----------
    logpost : function
        The function to compute the log posterior. It has call
        signature `logpost(x, *args)`.
    x0 : ndarray, shape (n_variables,)
        The starting location of a walker in parameter space.
    sigma : ndarray, shape (n_variables, )
        The standard deviations for the proposal distribution.
    args : tuple
        Additional arguments passed to `logpost()` function.
    n_burn : int, default 1000
        Number of burn-in steps.
    n_steps : int, default 1000
        Number of steps to take after burn-in.
    variable_names : list, length n_variables
        List of names of variables. If None, then variable names
        are sequential integers.
    
    Returns
    -------
    output : 
    DataFrame
        The first `n_variables` columns contain the samples.
        Additionally, column 'lnprob' has the log posterior value
        at each sample.
    Float
        Acceptance rate of the sampled steps.
    """
    x = x0
    mu, inv_cov = args
    
    # Steps that will be burned
    for i in range(n_burn):
        logpost_current = logpost(x, *args)
        x, accept = mh_step(x, logpost, logpost_current, sigma, args=(mu, inv_cov))
        
    # Set up empty arrays and variables to store sample info
    n_variables = []
    lnprob = []
    n_accept = 0
    
    # Draw samples
    for i in range(n_steps):
        n_variables.append(x)
        lnprob.append(logpost_current)
        
        logpost_current = logpost(x, *args)
        x, accept = mh_step(x, logpost, logpost_current, sigma, args=(mu, inv_cov))
        n_accept += accept
        
    df = pd.DataFrame(data=n_variables, columns=['x', 'y'])
    df['lnprob'] = lnprob
    
    accept_rate = n_accept / n_steps
    if accept_rate < accept_rate_bounds[0] or accept_rate > accept_rate_bounds[1]:
        print('Current acceptance rate', accept_rate, 'is not in the desired range.')
        return pd.DataFrame(), accept_rate
    
    return df, accept_rate

Here is our log posterior function for this problem.

In [37]:
@numba.jit(nopython=True)
def log_test_distribution(x, mu, inv_cov):
    """
    Unnormalized log posterior of a multivariate Gaussian.
    """
    return -np.dot((x-mu), np.dot(inv_cov, (x-mu))) / 2

Now let's code up a function to tune sigma automatically based on the acceptance rate. We are using the scheme from the developers of PyMC3.

In [25]:
def tune_sigma(accept_rate, sigma):
    '''
    Tunes sigma based on the acceptance rate using the scheme from 
    the developers of PyMC3. Returns new sigma.
    '''
    if accept_rate < 0.001:
        return sigma * 0.1
    elif accept_rate < 0.05:
        return sigma * 0.5
    elif accept_rate < 0.2:
        return sigma * 0.9
    elif accept_rate > 0.5:
        return sigma * 1.1
    elif accept_rate > 0.75:
        return sigma * 2
    elif accept_rate > 0.95:
        return sigma * 10
    else:
        return sigma

In [26]:
def mh_sample_with_tuning(logpost, x0, sigma, args=(), n_burn=1000, n_steps=5000, 
                          variable_names=None, accept_rate_bounds=[0.2, 0.5]):
    '''Take samples and check if the acceptance rate is in desired range.
    If not, tune sigma and take samples again. Returns a dataframe 
    containing the samples and log posterior value at each sample.'''
    # Take samples
    df_samples, accept_rate = mh_sample(logpost, x0, sigma, 
                                        args, n_burn, n_steps, 
                                        variable_names, accept_rate_bounds)
    
    # If acceptance rate isn't in desired range, tune sigma and continue
    while len(df_samples) == 0:
        sigma = tune_sigma(accept_rate, sigma)
        df_samples, accept_rate = mh_sample(logpost, x0, sigma, 
                                            args, n_burn, n_steps, 
                                            variable_names, accept_rate_bounds)
        
    return df_samples

Let's put our given means, covariances, and inverse convariance for this problem into variables for use later.

In [27]:
mu = np.array([10.0, 20])
cov = np.array([[4, -2],[-2, 6]])
inv_cov = np.linalg.inv(cov)

Now let's test our sampler with an arbitrary x0 and sigma.

In [40]:
# Choose arbitrary x0 and sigma
x0 = np.array([10, 5])
sigma = np.array([[50, -3],[-3, 50]])

# Take samples
df_samples = mh_sample_with_tuning(log_test_distribution, 
                                   x0, 
                                   sigma, 
                                   args=(mu, inv_cov))

# Take a look
df_samples.head()

Current acceptance rate 0.1506 is not in the desired range.
Current acceptance rate 0.152 is not in the desired range.
Current acceptance rate 0.1696 is not in the desired range.
Current acceptance rate 0.1792 is not in the desired range.
Current acceptance rate 0.1998 is not in the desired range.
Current acceptance rate 0.1916 is not in the desired range.


Unnamed: 0,x,y,lnprob
0,9.13335,19.896484,-0.122705
1,9.13335,19.896484,-0.122705
2,9.13335,19.896484,-0.122705
3,9.13335,19.896484,-0.122705
4,9.13335,19.896484,-0.122705


Now let's plot to check that our samples are actually drawn from the distribution we expect. Here we plot our MH samples in blue overlayed with the multivariate normal distribution centered around [10,20].

In [35]:
# Plot
p = bokeh.plotting.figure(width=400, height=400,
                          x_axis_label='x', 
                          y_axis_label='y')

# Plot samples
p.circle(df_samples['x'], df_samples['y'], alpha=0.1, legend='MH')

# Overlay multivariate gaussian
x, y = np.random.multivariate_normal(mu, cov, 5000).T
p.circle(x, y, alpha=0.1, color='orange', legend='generated')
bokeh.io.show(p)

Looks like our samples in blue match the multivariate gaussian distribution in orange. We also want to check that the covariance of our samples is similar to the inputted covariance. Let's first check if the mean of our samples is close to [10,20].

In [45]:
np.mean([df_samples['x'], df_samples['y']], axis=1)

array([10.02258216, 19.98171592])

Yes, it is. Now let's check that the covariance is close to [[4, -2],[-2,6]].

In [46]:
np.cov([df_samples['x'], df_samples['y']])

array([[ 4.05570719, -2.01395518],
       [-2.01395518,  5.73105456]])

Yes, the covariance is close.

Now let's plot a corner plot to double check that the sample comes from the distribution we expect.

In [31]:
# For corner plot
df_samples['divergent__'] = 0

# Plot
bokeh.io.show(bebi103.viz.corner(df_samples, pars=['x', 'y']))

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


This looks good. Let's try another example where the acceptance rate starts out too high.

In [47]:
# Choose arbitrary x0 and sigma
x0 = np.array([5, 5])
sigma = np.array([[4, -2],[-2, 2]])

# Take samples
df_samples = mh_sample_with_tuning(log_test_distribution, 
                                   x0, 
                                   sigma, 
                                   args=(mu, inv_cov))

# Take a look
df_samples.head()

Current acceptance rate 0.6504 is not in the desired range.
Current acceptance rate 0.6338 is not in the desired range.
Current acceptance rate 0.6216 is not in the desired range.
Current acceptance rate 0.598 is not in the desired range.
Current acceptance rate 0.5986 is not in the desired range.
Current acceptance rate 0.5856 is not in the desired range.
Current acceptance rate 0.5606 is not in the desired range.
Current acceptance rate 0.5452 is not in the desired range.
Current acceptance rate 0.5182 is not in the desired range.


Unnamed: 0,x,y,lnprob
0,8.85036,17.870405,-0.872415
1,8.85036,17.870405,-0.896595
2,8.85036,17.870405,-0.896595
3,8.85036,17.870405,-0.896595
4,9.775945,18.689124,-0.896595


Again let's plot the samples, where our MH samples are in blue and random samples from the multivariate gaussian are in orange.

In [48]:
# Plot
p = bokeh.plotting.figure(width=400, height=400,
                          x_axis_label='x', 
                          y_axis_label='y')

# Plot samples
p.circle(df_samples['x'], df_samples['y'], alpha=0.1, legend='MH')

# Overlay multivariate gaussian
x, y = np.random.multivariate_normal(mu, cov, 5000).T
p.circle(x, y, alpha=0.1, color='orange', legend='generated')
bokeh.io.show(p)

Let's check to see if the mean of our samples is close to [10,20].

In [49]:
np.mean([df_samples['x'], df_samples['y']], axis=1)

array([ 9.94240127, 20.137614  ])

And check the covariance to see if it's close to [[4, -2],[-2,6]].

In [50]:
np.cov([df_samples['x'], df_samples['y']])

array([[ 3.93052584, -2.05233058],
       [-2.05233058,  5.46336904]])

We've confirmed the covariance is similar, so let's lastly check the corner plot.

In [14]:
# For corner plot
df_samples['divergent__'] = 0

# Plot
bokeh.io.show(bebi103.viz.corner(df_samples, pars=['x', 'y']))

It looks good. We will next be using this sampler in 7.2.

In [51]:
%load_ext watermark

In [52]:
%watermark -v -p numpy,scipy,bokeh,jupyterlab

CPython 3.7.0
IPython 7.1.1

numpy 1.15.4
scipy 1.1.0
bokeh 1.0.1
jupyterlab 0.35.3
