<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Under the Hood of pyMC - Markov Chain Monte Carlo

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import math
from pathlib import Path
import random

import daft
from IPython.display import HTML, Image, YouTubeVideo
import matplotlib.patches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats
from statsmodels.tsa import stattools

In [None]:
sns.set_context("notebook", font_scale=1.7)

import shared.src.utils.util as shared_util

In [None]:
def autocorrplot(samples, nlags=40):
    autocorrelation = stattools.acf(samples, nlags=nlags)
    f, ax = plt.subplots(figsize=(12, 6))
    # acf returns nlags + 1 values (lags 0 through nlags), so match that length
    ax.vlines(np.arange(len(autocorrelation)), 0, autocorrelation, lw=1);
    ax.set_ylabel("Correlation"); ax.set_xlabel("Steps Ahead")
    ax.hlines(0, 0, nlags, color="C0", lw=2);

    
def plot_circle_sampler_trajectory(samples):
    f, ax = plt.subplots(figsize=(12, 12))

    square = matplotlib.patches.Rectangle((-1, -1), 2, 2)
    circle = matplotlib.patches.Circle((0, 0), 1, color="C1")

    ax.add_patch(square); ax.add_patch(circle);
    ax.set_xlim([-1, 1]); ax.set_ylim([-1, 1]);

    ax.plot(samples[:, 0], samples[:, 1], lw=4, marker=".", color="C3", markersize=24);

    
def plot_chain(samples, ax, **step_kwargs):
    n_samples = len(samples)
    ax.step(np.arange(n_samples), samples, **step_kwargs)

    
def plot_k_ahead(samples, k):
    f, ax = plt.subplots(figsize=(12, 12))
    current_xs, next_xs = k_ahead_samples(samples, k)
    # pass x and y by keyword, which newer seaborn versions require
    sns.regplot(x=current_xs, y=next_xs, ax=ax, line_kws={"lw": 4, "color": "C1"});
    ax.set_xlabel("Current Value"); ax.set_ylabel(f"${k}$-Ahead Value")
    correlation, _ = scipy.stats.pearsonr(current_xs, next_xs)
    ax.set_title(f"Correlation is {round(correlation, 2)} at ${k}$ step(s) ahead")

    
def make_markov_chain():
    markov_chain = daft.PGM(shape=[12, 6])

    past = daft.Node("past", "past", 1, 3, scale=4)
    present = daft.Node("present", "present", 6, 3, scale=4)
    future = daft.Node("future", "future", 11, 3, scale=4)
    [markov_chain.add_node(node) for node in [past, present, future]]
    markov_chain.add_edge("past", "present", lw=4, head_width=0.4)
    markov_chain.add_edge("present", "future", lw=4, head_width=0.4)
    markov_chain.render()

    
def handle_scalars(proposal):
    # if the output is a zero-dimensional array
    if np.shape(proposal) == ():
        # convert it to a plain Python scalar
        proposal = np.asarray(proposal).item()
    return proposal

# Today, we will peek under the hood of pyMC and learn more about the mechanics of the modeling and inference approach we've been using this semester.

For more, check out
[this blog post](https://ericmjl.github.io/essays-on-data-science/machine-learning/computational-bayesian-stats/)
covering some of the same material.

For an even deeper dive under the hood,
check out
[Bayesian Methods for Hackers](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers),
especially Chapter 3.

## Any method that uses samples to approximate distributions and averages is a _Monte Carlo_ method.

## Using samples to approximate a posterior or averages on a posterior is called Bayesian Monte Carlo.

## The method we use to draw samples is called _Markov Chain_ sampling.

## Using samples from a Markov Chain to approximate Bayesian posteriors is known as Bayesian Markov Chain Monte Carlo, or _Bayesian MCMC_.

# Any approach to approximating a fixed quantity with a random quantity is a Monte Carlo method.

How do we calculate $\pi$?

There is no end to the digits of $\pi$,
and so if we tried to get it exactly, we'd keep on computing forever.

One method is to use the same "Taylor expansion" idea that gave us polynomial regression: start from the exact formula

$$
\pi = 16\cdot\texttt{atan}(1/5) - 4\cdot\texttt{atan}(1/239)
$$

but replace $\texttt{atan}$ with $a + b\cdot x + c\cdot x^2 \dots$.
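
As a rough sketch of this idea, the cell below plugs the truncated Taylor series $\texttt{atan}(x) \approx x - x^3/3 + x^5/5 - \dots$ into the formula above. The function names `atan_taylor` and `machin_pi` are made up for this illustration.

```python
import math


def atan_taylor(x, n_terms=10):
    """Truncated Taylor series: atan(x) ~ x - x^3/3 + x^5/5 - ..."""
    return sum((-1) ** k * x ** (2 * k + 1) / (2 * k + 1) for k in range(n_terms))


def machin_pi(n_terms=10):
    # plug the polynomial approximation into the arctangent formula for pi
    return 16 * atan_taylor(1 / 5, n_terms) - 4 * atan_taylor(1 / 239, n_terms)


# more terms in the polynomial give a closer approximation
print(machin_pi(n_terms=3), machin_pi(n_terms=10), math.pi)
```

Because the inputs $1/5$ and $1/239$ are small, the higher powers shrink quickly and only a handful of terms are needed.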

Another method is to use optimization.

We pick some function $f$ such that

$$
\max(f(x)) = \pi
$$

and then use methods closely related to `find_MAP` to get close to $\pi$.

This class of methods is my personal favorite.

### $\pi$ is a fixed value, but we can approximate it with the output of a random process.

Consider a circle inscribed inside a square.

$$
\frac{\text{Area of Circle}}{\text{Area of Square}} = \frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}
$$

It looks like this:

In [None]:
f, ax = plt.subplots(figsize=(12, 12))

square = matplotlib.patches.Rectangle((-1, -1), 2, 2)
circle = matplotlib.patches.Circle((0, 0), 1, color="C1")

ax.add_patch(square); ax.add_patch(circle);
ax.set_xlim([-1, 1]); ax.set_ylim([-1, 1]);

If we throw darts at the square and they have an equal chance of hitting at each point,
then

$$
p(\text{dart lands in circle}) = \frac{\text{Area of Circle}}{\text{Area of Square}} = \frac{\pi}{4}
$$

That is, the probability of landing in any given region
is equal to the area of that region as a fraction of the total area.

We have frequently used samples from a probability distribution to approximate
the actual distribution.

In this case, we could simulate the "dart-throwing" process
and then calculate the fraction of darts
that are inside the circle:

$$
p(\text{dart lands in circle}) \approx \frac{1}{n}\sum_{n \text{ darts}} \texttt{is_in_circle(}\text{dart}\texttt{)}
$$

In [None]:
def is_in_circle(xs, ys):
    return np.sqrt((xs ** 2 + ys ** 2)) < 1

This is known as an _indicator function_ in mathematical probability.
It is a function that indicates whether an event occurred.

And so the average value of this function on our samples is
approximately the probability that a dart lands inside the circle:

$$
\frac{1}{n}\sum_{n \text{ darts}} \texttt{is_in_circle(}\text{dart}\texttt{)} \approx p(\text{dart lands in circle}) = \frac{\pi}{4}
$$

which we can slightly rearrange into

$$
\frac{1}{n}\sum_{n \text{ darts}} 4\cdot\texttt{is_in_circle(}\text{dart}\texttt{)} \approx \pi
$$

And so we can approximate $\pi$ just by drawing samples uniformly from a square.

In [None]:
def sample_from_square(n=10000):
    xs = pm.Uniform.dist(lower=-1, upper=1).random(size=n)
    ys = pm.Uniform.dist(lower=-1, upper=1).random(size=n)
    
    return pd.Series(xs), pd.Series(ys)

xs, ys = sample_from_square()

First, we verify that this code actually draws samples approximately uniformly across the square:

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
ax.scatter(xs, ys, alpha=0.1); ax.axis("equal");

Then, we confirm that our `is_in_circle` function works as we'd hope it to:

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
filtered_xs, filtered_ys = xs[is_in_circle(xs, ys)], ys[is_in_circle(xs, ys)]
ax.scatter(xs, ys, alpha=0.1); ax.axis("equal");
ax.scatter(filtered_xs, filtered_ys, alpha=0.5, color="C1"); ax.axis("equal");

Notice that about three quarters of the samples are colored gold and lie inside the circle.

The actual fraction should be close to $\pi/4 \approx 0.785$,
and the fraction will tend to get closer as we draw more samples.

This Monte Carlo approach to approximating $\pi$
is encapsulated in the `monte_carlo_pi` function below. 

In [None]:
def monte_carlo_pi(n=10000):
    xs, ys = sample_from_square(n)
    
    in_circles = pd.Series(is_in_circle(xs, ys))
    
    approximate_pi = (4 * in_circles).mean()
    
    return approximate_pi

We can compare our results to `math.pi`,
the value of $\pi$ used by Python:

In [None]:
monte_carlo_pi(n=10000), math.pi

As we increase the number of samples `n`,
the approximation tends to get better and better.

But at a cost:
drawing more samples takes more time.
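
To get a feel for this trade-off, here is a minimal sketch that repeats the estimate at several sample sizes. It uses plain NumPy instead of pyMC to draw the uniform samples, and the name `numpy_monte_carlo_pi` and the seed are assumptions made for this example. The typical error of a Monte Carlo average shrinks like $1/\sqrt{n}$, so each additional digit of accuracy costs roughly 100 times more samples.

```python
import math

import numpy as np


def numpy_monte_carlo_pi(n, rng):
    # same dart-throwing estimate as above, using NumPy's uniform sampler
    xs = rng.uniform(-1, 1, size=n)
    ys = rng.uniform(-1, 1, size=n)
    return 4 * np.mean(xs ** 2 + ys ** 2 < 1)


rng = np.random.default_rng(0)  # seed chosen arbitrarily

for n in [100, 10_000, 1_000_000]:
    estimate = numpy_monte_carlo_pi(n, rng)
    # standard error of 4 * (average of n indicators with p = pi / 4)
    expected_error = 4 * math.sqrt((math.pi / 4) * (1 - math.pi / 4) / n)
    print(n, estimate, abs(estimate - math.pi), expected_error)
```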

# Monte Carlo methods rely on sampling, but sampling is not always easy.

## Our calculation of $\pi$ relied on our ability to draw random points uniformly from a square.

## What if we wanted to draw points uniformly from the _circle_?

Unlike the square, which can be sampled from easily
by sampling two separate uniform distributions,
it's not obvious how to sample the points inside a circle.

As always with a problem this simple, there are a number of solutions.

1. We might take our samples from the square and filter out only those that landed in the circle.
2. We might sample the clockwise position and the distance from the center.
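
Both of these can be sketched in a few lines of NumPy. The function names are made up for this example. One detail in the second approach is easy to miss: the distance from the center has to be drawn as the square root of a uniform value, because otherwise points pile up near the center.

```python
import numpy as np


def circle_by_rejection(n, rng):
    # option 1: sample the square, keep only the points inside the circle
    points = rng.uniform(-1, 1, size=(n, 2))
    inside = (points ** 2).sum(axis=1) < 1
    return points[inside]  # on average, about pi / 4 of the n points survive


def circle_by_polar(n, rng):
    # option 2: sample an angle and a distance from the center
    angles = rng.uniform(0, 2 * np.pi, size=n)
    # square root of a uniform value, so that area (not radius) is uniform
    radii = np.sqrt(rng.uniform(0, 1, size=n))
    return np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])


rng = np.random.default_rng(0)  # seed chosen arbitrarily
rejection_points = circle_by_rejection(100_000, rng)
polar_points = circle_by_polar(100_000, rng)

# for points uniform on the unit circle, the average distance from the center is 2/3
print(np.linalg.norm(rejection_points, axis=1).mean())
print(np.linalg.norm(polar_points, axis=1).mean())
```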

We'll focus on a solution that generalizes more easily to more complex cases.

One suggestion comes from physical intuition about _diffusion_:

The video below shows food coloring spreading through,
or _diffusing into_,
a dish of water.

Even though all of the dye starts off somewhere close to the center,
given enough time (about 20 minutes in real time),
the dye ends up uniformly spread around the dish.

In [None]:
YouTubeVideo("8raI-uX4WAI", width=800, height=600)

One of the mechanisms of this diffusion has been known for some time.

Lucretius,
one of the _atomists_,
a school of philosophers of the Classical Era
who believed everything to be made of small,
indivisible particles,
described an observation of diffusion in his poem
_On the Nature of Things_:

> Watch carefully whenever shafts of streaming sunlight are allowed to penetrate a dark room. You will observe many minute particles mingling in many ways in every part of the space illuminated by the rays.
...
Such commotion implies the existence of movements of matter that are secret and imperceptible. For you will observe many of those particles, under the impulse of unseen blows, changing course and being forcibly turned back, now this way, now that way, in every direction.

- Titus Lucretius Carus, _De Rerum Natura_, c. 60 BC

The molecules of dye are being battered about by each other and by the molecules of water:
_unseen blows_ change their course, _now this way, now that way, in every direction_.

### We can mimic a diffusion process to draw uniform samples from the circle.

We start by "dropping in our dye" at some initial point.

We imagine putting in just one dye molecule, for computational reasons.

Then we start adding in our collisions.

On each step,
we adjust the position of our dye molecule a small amount:

In [None]:
def collision_result(starting_point):
    return starting_point + 0.2 * np.random.standard_normal(size=2)

All we need to do is make sure we don't leave the circle.

We do this by staying put whenever a collision would take us outside the circle.

In [None]:
def sample_from_circle_diffusion(circle_checker, init, collider, n):
    assert circle_checker(*init)  # make sure we start in the circle
    current = init  # drop in the dye
        
    samples = [current]
    for _ in range(n):
        # simulate a collision
        possible_next = collider(current)
        # make sure the collision didn't take us outside the circle
        if circle_checker(*possible_next):
            # and if it didn't, we've got our new current position
            current = possible_next
        samples.append(current)
    
    return np.array(samples)

The cell below runs this process once, for `n` steps,
and then plots the trajectory of our dye molecule as a red line on top of the circle.

In [None]:
circle_samples = sample_from_circle_diffusion(
    is_in_circle, [0, 0], collision_result, n=30)

plot_circle_sampler_trajectory(circle_samples)

If we run the cell multiple times,
we will see different trajectories each time.

They will share a starting point, in the center.

They will also share some gross features:
for example, sequential points will tend to be close to one another.

At just 30 steps,
it doesn't look like we're drawing uniform samples from the circle.
Increase `n` to 300, then 3000.
You should see the trajectories begin to explore the circle more uniformly.

The cell below draws 10,000 samples in the same fashion,
then plots them as a scatter.

In [None]:
circle_samples_MC = sample_from_circle_diffusion(
    is_in_circle, [0, 0], collision_result, 10000)

f, ax = plt.subplots(figsize=(12, 12))
ax.scatter(circle_samples_MC[:, 0], circle_samples_MC[:, 1], alpha=0.5, color="C1");
ax.set_xlim([-1, 1]); ax.set_ylim([-1, 1]);

This is effectively indistinguishable from a scatter plot of the samples that landed inside the circle
in our Monte Carlo $\pi$ experiment:

In [None]:
# for comparison: the in-circle samples from the Monte Carlo pi experiment

f, ax = plt.subplots(figsize=(12, 12))
ax.scatter(filtered_xs, filtered_ys, alpha=0.5, color="C1");
ax.set_xlim([-1, 1]); ax.set_ylim([-1, 1]);

The typical visualization of a sampler's trajectory
focuses on one variable at a time:

In [None]:
f, axs = plt.subplots(figsize=(12, 12), nrows=2, sharex=True, sharey=True)

plot_chain(pd.Series(circle_samples_MC[:1000, 0]), axs[0], lw=2)
plot_chain(pd.Series(circle_samples_MC[:1000, 1]), axs[1], lw=2);
axs[0].set_ylim([-1, 1]); axs[0].set_ylabel("x position"); axs[1].set_ylabel("y position");
axs[1].set_xlabel("time");

This view is what gave the name _trace_ to the outputs of pyMC:
this is the path traced out by our simulated dye molecule.

Notice the relatively slow motion of the particle:
when it is at an extremely positive or negative value
(close to -1 and 1),
it will tend to stay near that value for tens of steps.
This is visible as a "waviness" of the trajectory.

Change the first `Series` plotted to

```python
pd.Series(circle_samples_MC[:1000, 0]).sample(frac=1)
```

This will shuffle the values,
removing the time dependence.

The "waves" mostly disappear,
and values at one extreme are sometimes followed by values at the other extreme.

Alternatively, we could visualize the in-circle samples from our Monte Carlo $\pi$ experiment,
which also have no dependence on each other:

In [None]:
independent_samples = np.array([filtered_xs, filtered_ys]).T
plot_circle_sampler_trajectory(independent_samples[:30, :])

The samples are spread all around the circle,
with successive points no more likely to be neighbors
than points separated quite a bit in time.

In [None]:
f, axs = plt.subplots(figsize=(12, 12), nrows=2, sharex=True, sharey=True)

plot_chain(independent_samples[:1000, 0], axs[0], lw=2)
plot_chain(independent_samples[:1000, 1], axs[1], lw=2)
axs[0].set_ylim([-1, 1]); axs[0].set_ylabel("x position"); axs[1].set_ylabel("y position");
axs[1].set_xlabel("time");

This relationship can be roughly measured with _correlation_.

First, a code snippet to pull out samples at a given time lag from one another:

In [None]:
def k_ahead_samples(samples, k=1):
    current_xs = samples[:-k]
    k_ahead_xs = samples[k:]
    return current_xs, k_ahead_xs

In [None]:
k_ahead_samples(circle_samples_MC[:10, 0], k=1)  # x positions at a time lag of 1

Then, we plot these samples as x,y pairs
and look for a linear relationship.

In [None]:
plot_k_ahead(circle_samples_MC[:, 0], 1)

For a single step ahead, the correlation is very high.

That is, if you are currently at a given position,
the next sample is very likely to be at a nearby position.

The correlation drops quite a bit if we go further out,
e.g. to 10 steps:

In [None]:
plot_k_ahead(circle_samples_MC[:, 0], k=10)

And at 50 steps, it has gone away:

In [None]:
plot_k_ahead(circle_samples_MC[:, 0], k=50)

The correlation between values as a function of the time-lag between them
is known as the _autocorrelation_.

In [None]:
autocorrplot(circle_samples_MC[:, 0], nlags=50);
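
The quantity being plotted can also be computed by hand. The sketch below is an illustration, not the code `stattools.acf` runs: it simply takes the Pearson correlation between a sequence and a shifted copy of itself, which gives values close to, but not identical to, the `acf` estimate. The "sticky" sequence is a made-up stand-in for a slowly moving sampler.

```python
import numpy as np


def autocorrelation_at_lag(samples, k):
    """Pearson correlation between samples and the samples k steps later."""
    samples = np.asarray(samples, dtype=float)
    if k == 0:
        return 1.0  # every value is perfectly correlated with itself
    return np.corrcoef(samples[:-k], samples[k:])[0, 1]


rng = np.random.default_rng(0)  # seed chosen arbitrarily
independent = rng.standard_normal(5000)

# a made-up "sticky" sequence: each value keeps 90% of the previous one
sticky = np.zeros(5000)
for t in range(1, 5000):
    sticky[t] = 0.9 * sticky[t - 1] + independent[t]

for k in [1, 10, 50]:
    print(k, autocorrelation_at_lag(sticky, k), autocorrelation_at_lag(independent, k))
```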

Autocorrelation is bad for a sampler:
it can be shown that the higher the autocorrelation,
the less independent information each new sample carries,
and so the worse our Monte Carlo estimates will be for a given number of samples.

By contrast, the in-circle samples from our Monte Carlo $\pi$ experiment
are independent.

Even at a time lag of one step, the correlation is close to zero:

In [None]:
plot_k_ahead(independent_samples[:, 0], 1)

This is true for all time lags
(except 0, where the correlation is 1,
because we are correlating a value with itself),
as we can see from the autocorrelation.

In [None]:
autocorrplot(independent_samples[:, 0], nlags=40);

# Diffusions are one kind of _Markov Chain_.

# A Markov Chain is a sequence of random variables where future values are dependent on the past values, but only through the present.

In our case,
times far in the past
(up to 20 or more steps back)
were correlated with the next step.

But if you look closely at the code for generating the samples,
you'll see that there's only a direct dependence on the most recent value.

### Every Markov Chain has the same graphical representation: a _chain_ of nodes.

In [None]:
make_markov_chain()

It is from Markov Chains that the "chains" in a pyMC trace get their name.

The following are examples of Markov Chains:

- **The weather** is _approximately_ a Markov Chain. If it is sunny one day, it's likely to be sunny the next; if it is rainy one day, it is likely to be rainy the next.

- The **position of a fly** buzzing around a room, wandering aimlessly. This is effectively a diffusion.

The following are not examples of Markov Chains:

- The **English language**. A word arbitrarily far in the past can impact the future without impacting the present.

- **Video** streams and films. In a movie, the opening credits can often be used to predict the end credits, e.g. the name of the director, better than can the images that come in between. Or the movie might begin with a flashback that is referenced later in the film.

In some sense, most things we encounter aren't _exactly_ Markov Chains,
though many things can be closely modeled by Markov Chains.
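
The weather example above can be sketched as a two-state Markov Chain. The transition probabilities here are invented for illustration: a sunny day is followed by another sunny day 80% of the time, and a rainy day by another rainy day 60% of the time. Note that the simulation only ever looks at `today` to decide tomorrow's weather.

```python
import numpy as np

# hypothetical probabilities of the weather staying the same from one day to the next
p_stay = {"sunny": 0.8, "rainy": 0.6}
other = {"sunny": "rainy", "rainy": "sunny"}


def simulate_weather(n_days, rng, start="sunny"):
    today = start
    days = [today]
    for _ in range(n_days - 1):
        # tomorrow depends only on today, not on any earlier day
        if rng.uniform() >= p_stay[today]:
            today = other[today]
        days.append(today)
    return days


rng = np.random.default_rng(0)  # seed chosen arbitrarily
days = simulate_weather(50_000, rng)
fraction_sunny = np.mean([day == "sunny" for day in days])
print(days[:10], fraction_sunny)
```

With these numbers, the long-run fraction of sunny days works out to $0.4 / (0.2 + 0.4) = 2/3$, and the simulated fraction should land near that value.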

# The basic algorithm for Markov Chain Monte Carlo is a generalized version of the diffusion used above.

It is called _Metropolis-Hastings_.

Structurally, it looks very similar:

In [None]:
def metropolis_hastings(
    logp,  # where previously we had in_circle, we now have logp
    init,
    proposer,  # where previously we had a "collider", now we have a "proposer"
    n):
    
    # we specify an initial point,
    #  the equivalent of dropping in the dye
    current = init 
    samples = [current]
    for _ in range(n):
        # then we propose the next value
        proposal = proposer(current)
        
        # then we use a criterion to choose whether to keep it or not
        #  and this criterion is based on the log-probability
        current = metropolis_criterion(logp, current, proposal)
        
        samples.append(current)  
    return np.array(samples)

### The biggest difference is that the acceptance criterion is soft, instead of hard.

In [None]:
def metropolis_criterion(logp, current, proposal):
    p_current = np.exp(logp(current))
    p_proposal = np.exp(logp(proposal))
    
    # if the proposal has higher probability than the current value,
    #  the ratio exceeds 1 and we always accept;
    #  otherwise, accept only if the ratio is larger than a random value
    if (p_proposal / p_current) > pm.Uniform.dist().random():
        return proposal
    else:
        return current
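
Exponentiating log-probabilities can underflow to zero when the probabilities are very small. A common workaround, sketched here as an assumption-laden variant and not as what pyMC does internally, is to make the same comparison in log space. The names `metropolis_criterion_log` and `standard_normal_logp` are made up for this example, and NumPy's generator stands in for `pm.Uniform`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily


def metropolis_criterion_log(logp, current, proposal, rng=rng):
    # log of the probability ratio; no exponentials needed
    log_ratio = logp(proposal) - logp(current)
    # 1 - uniform lies in (0, 1], so the logarithm is always finite
    log_u = np.log(1.0 - rng.uniform())
    # same test as (p_proposal / p_current) > u, after taking logs of both sides
    if log_u < log_ratio:
        return proposal
    return current


def standard_normal_logp(x):
    # log-density of a standard Normal, up to an additive constant
    return -0.5 * x ** 2


# far out in the tail, exp(logp) underflows to 0, so the ratio would be 0 / 0
print(np.exp(standard_normal_logp(40.0)))
# the log-space version still accepts a move toward higher probability
print(metropolis_criterion_log(standard_normal_logp, 40.0, 39.0))
```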

We can even use the exact same method to generate proposed updates.

The function below plays the same role as the `collision_result` function, with a smaller default step size:

In [None]:
def gaussian_proposal(value, sd=0.05):
    proposal = pm.Normal.dist(mu=value, sd=sd).random()  # normal with mean centered at value
    proposal = handle_scalars(proposal)  # Unimportant details with shapes and types
    return proposal

### With a little massaging, we can even recreate the diffusion sampler with Metropolis-Hastings.

We have to cheat and use infinity:

In [None]:
def circle_logp(xy):
    # log-probability for uniform on circle
    x, y = xy
    if is_in_circle(x, y):
        return np.log(1 / math.pi)
    else:
        return -np.inf

Any two points inside the circle have the same log-probability,
from the first branch of the `if`-`else`,
and therefore the ratio of their probabilities will be 1,
and the update will always be accepted:

In [None]:
metropolis_criterion(circle_logp, (0, 0), (0.5, 0.35))

Whereas if the proposal is outside of the circle,
the ratio will be 0,
and so the proposal will be rejected:

In [None]:
metropolis_criterion(circle_logp, (0, 0), (1, 1))

And so the results are the same:

In [None]:
circle_samples_metropolis_hastings = metropolis_hastings(
    circle_logp, [0, 0], gaussian_proposal, 10000)

In [None]:
plot_circle_sampler_trajectory(circle_samples_metropolis_hastings[:100, :])

The trajectories look a bit different,
because the standard deviation of the Normal
in the proposal is different,
but the qualitative features are the same.

Namely,
a large number of samples roughly uniformly cover the circle:

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
ax.scatter(circle_samples_metropolis_hastings[:, 0], circle_samples_metropolis_hastings[:, 1],
           alpha=0.5, color="C1");
ax.set_xlim([-1, 1]); ax.set_ylim([-1, 1]);

And from step to step, values are correlated:

In [None]:
plot_k_ahead(circle_samples_metropolis_hastings[:, 0], 2)

# Metropolis-Hastings-type algorithms help us sample from posteriors.

The goal of Markov Chain Monte Carlo is not just to sample uniformly from simple shapes like the circle.

Instead, the goal is to sample from interesting distributions that we can't write down,
like the posterior of a complicated model.

One of the major benefits of the Metropolis-Hastings algorithm comes from the fact that it uses a _ratio_ of probabilities.

First, consider the definition of the posterior from Bayes' Rule:

$$
p(\text{params}\vert\text{data}) = \frac{p(\text{data}\vert\text{params}) p(\text{params})}{p(\text{data})}
$$

The troublesome part of this equation is the denominator.
The numerator is part of our modeling process:
it has the likelihood and the prior (in that order).

Now let's consider comparing two possible values of the params, $A$ and $B$:

$$
p(\text{params = A}\vert\text{data}) = \frac{p(\text{data}\vert\text{params = A}) p(\text{params = A})}{p(\text{data})}
$$

$$
p(\text{params = B}\vert\text{data}) = \frac{p(\text{data}\vert\text{params = B}) p(\text{params = B})}{p(\text{data})}
$$

If we take the ratio of the probabilities,
we get a complicated-looking expression:

$$
\frac{p(\text{params = A}\vert\text{data})}{p(\text{params = B}\vert\text{data})} =
\frac{\frac{p(\text{data}\vert\text{params = A}) p(\text{params = A})}{p(\text{data})}}{\frac{p(\text{data}\vert\text{params = B}) p(\text{params = B})}{p(\text{data})}}
$$

But it simplifies, because $p(\text{data})$ is in the denominator of both the top and bottom:

$$
\frac{p(\text{params = A}\vert\text{data})}{p(\text{params = B}\vert\text{data})} =
\frac{p(\text{data}\vert\text{params = A}) p(\text{params = A})}{p(\text{data}\vert\text{params = B}) p(\text{params = B})}
$$

This is only in terms of the likelihood and prior!

### Metropolis-Hastings only needs a `logp` function

### and that function only needs to give answers accurate up to a constant shift.

pyMC models have just such a `logp` function for the posterior:

In [None]:
with pm.Model() as linear_signal_model:
    signal = pm.Normal("signal", mu=0, sd=1)
    measurement = pm.Normal("measurement", mu=signal, sd=0.1, observed=0.8)

In [None]:
lsm_logp = linear_signal_model.logp
def lsm_logp_metropolis_hastings(value):
    return lsm_logp({"signal": value})

If we evaluate this function on a bunch of points
and then exponentiate, we get the posterior
(up to a proportionality constant given by $p(\text{data})$).

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
signal_values = np.linspace(0, 1.6, num=1000)
ax.plot(signal_values, [np.exp(lsm_logp_metropolis_hastings(x)) for x in signal_values], lw=4);
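
For this particular model we can check the "up to a constant" claim directly, because a Normal prior combined with a Normal likelihood has a closed-form Normal posterior. The sketch below rebuilds the log-prior and log-likelihood with `scipy.stats` instead of calling the pyMC model, so the function and variable names are assumptions made for this example. The ratio of unnormalized posterior values should match the ratio of exact posterior values.

```python
import numpy as np
import scipy.stats

# same numbers as the model above: prior N(0, 1), noise sd 0.1, observed value 0.8
observed, prior_sd, noise_sd = 0.8, 1.0, 0.1


def unnormalized_log_posterior(signal):
    log_prior = scipy.stats.norm(0, prior_sd).logpdf(signal)
    log_likelihood = scipy.stats.norm(signal, noise_sd).logpdf(observed)
    return log_prior + log_likelihood  # missing the -log p(data) term


# closed-form posterior for a Normal prior with a Normal likelihood
posterior_precision = 1 / prior_sd ** 2 + 1 / noise_sd ** 2
posterior_mean = (observed / noise_sd ** 2) / posterior_precision
exact_posterior = scipy.stats.norm(posterior_mean, posterior_precision ** -0.5)

a, b = 0.7, 0.85  # two arbitrary candidate values for the signal
log_ratio_unnormalized = unnormalized_log_posterior(a) - unnormalized_log_posterior(b)
log_ratio_exact = exact_posterior.logpdf(a) - exact_posterior.logpdf(b)
print(log_ratio_unnormalized, log_ratio_exact)
```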

If we hand that `logp` function to `metropolis_hastings`
along with the `gaussian_proposal`,
we will draw samples from the posterior:

In [None]:
posterior_samples_by_hand = metropolis_hastings(lsm_logp_metropolis_hastings, 0, gaussian_proposal, 5000)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
ax.step(np.arange(len(posterior_samples_by_hand)), posterior_samples_by_hand);

Because our initial guess had low posterior probability,
there was a brief period in which our samples were slightly "off".

This period is known as the "burn-in" time of our sampler.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(posterior_samples_by_hand, ax=ax);

The burn-in samples bias our estimate of the posterior,
which we can correct by cutting out those samples.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
ax.plot(signal_values, [np.exp(lsm_logp_metropolis_hastings(x)) for x in signal_values],
        color="C1", lw=2, label="exp(logp) Posterior");
sns.distplot(posterior_samples_by_hand[100:], label="Posterior Samples"); ax.legend();

Notice that the posterior obtained from `logp` directly is off by a multiplicative factor!

It has the right shape, but not quite the right heights.

Note that our samples,
though they are in aggregate drawn from the posterior,
still have temporal dependencies on short time lags:

In [None]:
autocorrplot(posterior_samples_by_hand[100:], nlags=100)

# pyMC combines the model-specification API we've worked with throughout the semester with _very_ sophisticated versions of Metropolis-Hastings.

The particular algorithm most commonly used under the hood in pyMC is called
[*NUTS*](https://stats.stackexchange.com/questions/311813/can-somebody-explain-to-me-nuts-in-english),
the "No U-Turn Sampler".

In [None]:
with linear_signal_model:
    posterior_samples = pm.sample(draws=5000, chains=1)

If we compare these `posterior_samples` to the ones we obtained `by_hand`,
we see a close agreement:

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(posterior_samples_by_hand[100:], label="By Hand");
sns.distplot(posterior_samples["signal"], label="pyMC"); ax.legend();

# The benefit of a library like pyMC is that the algorithms are much faster and more efficient than anything we can write ourselves.

It is the outcome of many researchers and programmers working together for years,
and the results are both highly reliable and highly tuned.

For example,
the autocorrelation is much, much lower:

In [None]:
pm.autocorrplot(posterior_samples, max_lag=15);

Note that this plot uses a pyMC function, `pm.autocorrplot`,
that is the built-in equivalent of the `autocorrplot` function used for our homemade samplers.

And so the samples, at least for this posterior, look _almost_ independent:

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
ax.step(np.arange(len(posterior_samples["signal"])), posterior_samples["signal"]);

There is also no visible period of "burn-in" to worry about:
pyMC runs a tuning phase before sampling and, by default, discards those draws.

pyMC provides its own function for plotting trajectories: `traceplot`.

It takes in a `trace` and plots the posterior on the left and the samples over time on the right.

In [None]:
pm.traceplot(posterior_samples, figsize=(12, 6));

This function is most useful when you have multiple chains in your trace:

In [None]:
with linear_signal_model:
    multi_chain_posterior_samples = pm.sample(draws=1250,
                                              chains=4)  # how many separate markov chains to run?

In [None]:
pm.traceplot(multi_chain_posterior_samples, figsize=(12, 6));

The chains are separated by color in the plot above.

Multiple chains tend to improve posterior sampling performance
and make it easier to test for computational problems,
since chains that disagree with one another signal that something has gone wrong.
By default, pyMC runs at least two chains when sampling.
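
One way to turn "do the chains agree?" into a number is to compare the spread between chains with the spread within each chain. The sketch below is a bare-bones, hand-written version of that idea, often called $\hat{R}$; it is an illustration under simplifying assumptions, not pyMC's implementation. The fake chains are drawn directly from a Normal distribution for the sake of the example.

```python
import numpy as np


def basic_r_hat(chains):
    """chains: array of shape (n_chains, n_draws) for a single variable."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    # average variance within each chain
    within = chains.var(axis=1, ddof=1).mean()
    # variance of the chain means, scaled up by the number of draws
    between = n_draws * chains.mean(axis=1).var(ddof=1)
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)


rng = np.random.default_rng(0)  # seed chosen arbitrarily
agreeing = rng.standard_normal(size=(4, 1250))  # four fake chains, same distribution
disagreeing = agreeing.copy()
disagreeing[0] += 5  # one chain stuck far away from the others

print(basic_r_hat(agreeing), basic_r_hat(disagreeing))
```

Values near 1 suggest the chains are exploring the same distribution, while values well above 1 suggest they are not.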