# -*- coding: utf-8 -*-
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: light
# format_version: '1.5'
# jupytext_version: 1.7.1
# kernelspec:
# display_name: Python 3
# language: python
# name: python3
# ---
# + nbsphinx="hidden"
# %matplotlib inline
# + nbsphinx="hidden"
# %run notebook_setup
# -
# # A quick intro to PyMC3 for exoplaneteers
# [Hamiltonian Monte Carlo (HMC)](https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo) methods haven't been widely used in astrophysics, but they are the standard methods for probabilistic inference using Markov chain Monte Carlo (MCMC) in many other fields.
# *exoplanet* is designed to provide the building blocks for fitting many exoplanet datasets using this technology, and this tutorial presents some of the basic features of the [PyMC3](https://docs.pymc.io/) modeling language and inference engine.
# The [documentation for PyMC3](https://docs.pymc.io/) includes many other tutorials that you should check out to get more familiar with the features that are available.
#
# In this tutorial, we will go through two simple examples of fitting some data using PyMC3.
# The first is the classic fitting a line to data with unknown error bars, and the second is a more relevant example where we fit a radial velocity model to the public radial velocity observations of [51 Peg](https://en.wikipedia.org/wiki/51_Pegasi).
# You can read more about fitting lines to data [in the bible of line fitting](https://arxiv.org/abs/1008.4686) and you can see another example of fitting the 51 Peg data using HMC (this time using [Stan](http://mc-stan.org)) [here](https://dfm.io/posts/stan-c++/).
#
# ## Hello world (AKA fitting a line to data)
#
# My standard intro to a new modeling language or inference framework is to fit a line to data.
# So. Let's do that with PyMC3.
#
# To start, we'll generate some fake data using a linear model.
# Feel free to change the random number seed to try out a different dataset.
# +
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
true_m = 0.5
true_b = -1.3
true_logs = np.log(0.3)
x = np.sort(np.random.uniform(0, 5, 50))
y = true_b + true_m * x + np.exp(true_logs) * np.random.randn(len(x))
plt.plot(x, y, ".k")
plt.ylim(-2, 2)
plt.xlabel("x")
plt.ylabel("y")
# -
# To fit these data, our model will have 3 parameters: the slope $m$, the intercept $b$, and the log of the uncertainty $\log(\sigma)$.
# To start, let's choose broad uniform priors on these parameters:
#
# $$
# \begin{eqnarray}
# p(m) &=& \left\{\begin{array}{ll}
# 1/10 & \mathrm{if}\,-5 < m < 5 \\
# 0 & \mathrm{otherwise} \\
# \end{array}\right. \\
# p(b) &=& \left\{\begin{array}{ll}
# 1/10 & \mathrm{if}\,-5 < b < 5 \\
# 0 & \mathrm{otherwise} \\
# \end{array}\right. \\
# p(\log(\sigma)) &=& \left\{\begin{array}{ll}
# 1/10 & \mathrm{if}\,-5 < \log(\sigma) < 5 \\
# 0 & \mathrm{otherwise} \\
# \end{array}\right.
# \end{eqnarray}
# $$
#
# Then, the log-likelihood function will be
#
# $$
# \log p(\{y_n\}\,|\,m,\,b,\,\log(\sigma)) = -\frac{1}{2}\sum_{n=1}^N \left[\frac{(y_n - m\,x_n - b)^2}{\sigma^2} + \log(2\,\pi\,\sigma^2)\right]
# $$
#
# [**Note:** the second normalization term is needed in this model because we are fitting for $\sigma$ and the second term is *not* a constant.]
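#
# As a sanity check, this model is simple enough to write down as a plain NumPy function. Here is a minimal sketch of the log-posterior (the uniform priors contribute only a constant inside their support, which we drop); we won't need this function below, it's just to make the math concrete:

```python
import numpy as np

def log_prob(m, b, logs, x, y):
    # Uniform(-5, 5) priors: the log-prior is a constant inside the
    # support and -inf outside, so only the bounds check matters
    if any(abs(p) >= 5 for p in (m, b, logs)):
        return -np.inf
    sigma2 = np.exp(2 * logs)
    resid = y - (m * x + b)
    # Gaussian log-likelihood, including the non-constant
    # log(2 * pi * sigma^2) normalization term
    return -0.5 * np.sum(resid**2 / sigma2 + np.log(2 * np.pi * sigma2))
```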
#
# Another way of writing this model that might not be familiar is the following:
#
# $$
# \begin{eqnarray}
# m &\sim& \mathrm{Uniform}(-5,\,5) \\
# b &\sim& \mathrm{Uniform}(-5,\,5) \\
# \log(\sigma) &\sim& \mathrm{Uniform}(-5,\,5) \\
# y_n &\sim& \mathrm{Normal}(m\,x_n+b,\,\sigma)
# \end{eqnarray}
# $$
#
# This is the way that a model like this is often defined in statistics, and it will be useful when we implement our model in PyMC3, so take a moment to make sure that you understand the notation.
#
# Now, let's implement this model in PyMC3.
# The documentation for the distributions available in PyMC3's modeling language can be [found here](https://docs.pymc.io/api/distributions/continuous.html) and these will come in handy as you go on to write your own models.
# +
import pymc3 as pm

with pm.Model() as model:
    # Define the priors on each parameter:
    m = pm.Uniform("m", lower=-5, upper=5)
    b = pm.Uniform("b", lower=-5, upper=5)
    logs = pm.Uniform("logs", lower=-5, upper=5)

    # Define the likelihood. A few comments:
    #  1. For mathematical operations like "exp", you can't use
    #     numpy. Instead, use the mathematical operations defined
    #     in "pm.math".
    #  2. To condition on data, you use the "observed" keyword
    #     argument to any distribution. In this case, we want to
    #     use the "Normal" distribution (look up the docs for
    #     this).
    pm.Normal("obs", mu=m * x + b, sd=pm.math.exp(logs), observed=y)

    # This is how you will sample the model. Take a look at the
    # docs to see the other parameters that are available.
    trace = pm.sample(draws=1000, tune=1000, chains=2, cores=2)
# -
# Now that we have samples, let's make some diagnostic plots.
# The first plot to look at is the trace plot, implemented in ArviZ as `plot_trace`.
# In this plot, you'll see the marginalized distribution for each parameter on the left and the trace plot (parameter value as a function of step number) on the right.
# In each panel, you should see two lines with different colors.
# These are the results of different independent chains and if the results are substantially different in the different chains then there is probably something going wrong.
# +
import arviz as az

with model:
    az.plot_trace(trace, var_names=["m", "b", "logs"])
# -
# It's also good to quantify that "looking substantially different" argument.
# This is implemented as the `summary` function in ArviZ.
# In this table, some of the key columns to look at are `n_eff` and `Rhat`.
# * `n_eff` shows an estimate of the number of effective (or independent) samples for that parameter. In this case, `n_eff` should probably be around 500 per chain (there should have been 2 chains run).
# * `Rhat` shows the [Gelman–Rubin statistic](https://docs.pymc.io/api/diagnostics.html#pymc3.diagnostics.gelman_rubin) and it should be close to 1.
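#
# To build some intuition for `Rhat`, here is a minimal NumPy sketch of the (non-split) Gelman–Rubin statistic for a single parameter. Note that this is a simplified version of the diagnostic; the estimator used in practice by PyMC3/ArviZ is a refined split-chain variant:

```python
import numpy as np

def gelman_rubin(chains):
    # chains has shape (n_chains, n_samples) for one parameter
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(42)

# Two well-mixed chains targeting the same distribution: Rhat ~ 1
print(gelman_rubin(rng.normal(size=(2, 1000))))

# Two chains stuck in different places: Rhat >> 1
print(gelman_rubin(np.stack([rng.normal(0, 1, 1000), rng.normal(10, 1, 1000)])))
```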
with model:
    summary = az.summary(trace, var_names=["m", "b", "logs"])
summary
# The last diagnostic plot that we'll make here is the [corner plot made using corner.py](https://corner.readthedocs.io).
# Recent versions of corner.py can take the trace object directly; we just pass it in along with the list of variable names and the true values to overplot.
# +
import corner  # https://corner.readthedocs.io

names = ["m", "b", "logs"]
_ = corner.corner(
    trace,
    var_names=names,
    truths=dict(zip(names, [true_m, true_b, true_logs])),
)
# -
# **Extra credit:** Here are a few suggestions for things to try out while getting more familiar with PyMC3:
#
# 1. Try initializing the parameters using the `testval` argument to the distributions. Does this improve performance in this case? It will substantially improve performance in more complicated examples.
# 2. Try changing the priors on the parameters. For example, try the "uninformative" prior [recommended by Jake VanderPlas on his blog](http://jakevdp.github.io/blog/2014/06/14/frequentism-and-bayesianism-4-bayesian-in-python/#Prior-on-Slope-and-Intercept).
# 3. What happens as you substantially increase or decrease the simulated noise? Does the performance change significantly? Why?
#
# ## A more realistic example: radial velocity exoplanets
#
# While the example above was cute, it doesn't fully exploit the power of PyMC3 and it doesn't show some of the real issues that you will face when you use PyMC3 as an astronomer.
# To get a better sense of how you might use PyMC3 in Real Life™, let's take a look at a more realistic example: fitting a Keplerian orbit to radial velocity observations.
#
# One of the key aspects of this problem that I want to highlight is the fact that PyMC3 (and the underlying model building framework [Theano (recently renamed to Aesara)](https://aesara.readthedocs.io/)) don't have out-of-the-box support for the root-finding that is required to solve Kepler's equation.
# As part of the process of computing a Keplerian RV model, we must solve the equation:
#
# $$
# M = E - e\,\sin E
# $$
#
# for the eccentric anomaly $E$ given some mean anomaly $M$ and eccentricity $e$.
# There are commonly accepted methods of solving this equation using [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method), but if we want to expose that to PyMC3, we have to define a [custom Theano operation](https://aesara.readthedocs.io/en/latest/extending/index.html) with a custom gradient.
# I won't go into the details of the math (because [I blogged about it](https://dfm.io/posts/stan-c++/)) and I won't go into the details of the implementation (because [you can take a look at it on GitHub](https://github.com/exoplanet-dev/exoplanet/tree/main/exoplanet/theano_ops/kepler.py)).
# So, for this tutorial, we'll use the custom Kepler solver that is implemented as part of *exoplanet* and fit the publicly available radial velocity observations of the famous exoplanetary system 51 Peg using PyMC3.
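#
# To make the root-finding step concrete, here is a simple NumPy sketch of solving Kepler's equation with Newton's method. This is only an illustration (and it provides no gradient, which is the whole point of the custom op); below we'll use the solver built into *exoplanet*:

```python
import numpy as np

def kepler_newton(M, e, tol=1e-12, max_iter=50):
    # Solve M = E - e * sin(E) for the eccentric anomaly E,
    # starting from the initial guess E = M
    E = np.array(M, dtype=float)
    for _ in range(max_iter):
        # Newton step for f(E) = E - e * sin(E) - M
        delta = (E - e * np.sin(E) - M) / (1 - e * np.cos(E))
        E -= delta
        if np.max(np.abs(delta)) < tol:
            break
    return E

M = np.linspace(0, 2 * np.pi, 5)
E = kepler_newton(M, 0.3)
print(np.allclose(E - 0.3 * np.sin(E), M))  # True: Kepler's equation holds
```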
#
# First, we need to download the data from the exoplanet archive:
# +
import requests
import pandas as pd
import matplotlib.pyplot as plt

# Download the dataset from the Exoplanet Archive:
url = "https://exoplanetarchive.ipac.caltech.edu/data/ExoData/0113/0113357/data/UID_0113357_RVC_001.tbl"
r = requests.get(url)
if r.status_code != requests.codes.ok:
    r.raise_for_status()
data = np.array(
    [
        l.split()
        for l in r.text.splitlines()
        if not l.startswith("\\") and not l.startswith("|")
    ],
    dtype=float,
)
t, rv, rv_err = data.T
t -= np.mean(t)

# Plot the observations "folded" on the published period:
# Butler et al. (2006) https://arxiv.org/abs/astro-ph/0607493
lit_period = 4.230785
plt.errorbar(
    (t % lit_period) / lit_period, rv, yerr=rv_err, fmt=".k", capsize=0
)
plt.xlim(0, 1)
plt.ylim(-110, 110)
plt.annotate(
    "period = {0:.6f} days".format(lit_period),
    xy=(1, 0),
    xycoords="axes fraction",
    xytext=(-5, 5),
    textcoords="offset points",
    ha="right",
    va="bottom",
    fontsize=12,
)
plt.ylabel("radial velocity [m/s]")
_ = plt.xlabel("phase")
# -
# Now, here's the implementation of a radial velocity model in PyMC3.
# Some of this will look familiar after the Hello World example, but things are a bit more complicated now.
# Take a minute to take a look through this and see if you can follow it.
# There's a lot going on, so I want to point out a few things to pay attention to:
#
# 1. All of the mathematical operations (for example `exp` and `sqrt`) are being performed using Theano instead of NumPy.
# 2. All of the parameters have initial guesses provided. This is an example where this makes a big difference because some of the parameters (like period) are very tightly constrained.
# 3. Some of the lines are wrapped in `Deterministic` distributions. This can be useful because it allows us to track values as the chain progresses even if they're not parameters. For example, after sampling, we will have a sample for `bkg` (the background RV trend) for each step in the chain. This can be especially useful for making plots of the results.
# 4. Similarly, at the end of the model definition, we compute the RV curve for a single orbit on a fine grid. This can be very useful for diagnosing fits gone wrong.
# 5. For parameters that specify angles (like $\omega$, called `w` in the model below), it can be inefficient to sample the angle directly because its value wraps around at $2\pi$. Instead, it can be better to sample the unit vector specified by the angle or, when combined with the eccentricity, a point in the unit disk. In practice, this can be achieved by sampling a 2-vector from an isotropic Gaussian and normalizing the components by the norm. These parameterizations are implemented in the [pymc3-ext](https://github.com/exoplanet-dev/pymc3-ext) package.
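#
# To see how that last point works in practice, here is a small NumPy sketch of the round trip for the eccentricity/angle parameterization used in the model below (the values of $e$ and $\omega$ here are made up for illustration):

```python
import numpy as np

# Made-up "true" eccentricity and argument of periastron:
e_true, w_true = 0.3, 1.2

# Map to a point inside the unit disk...
hk = np.array(
    [np.sqrt(e_true) * np.cos(w_true), np.sqrt(e_true) * np.sin(w_true)]
)

# ...and map back: no discontinuity as the angle wraps past 2*pi
e = hk[0] ** 2 + hk[1] ** 2
w = np.arctan2(hk[1], hk[0])
print(np.allclose([e, w], [e_true, w_true]))  # True
```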
# +
import pymc3_ext as pmx
import aesara_theano_fallback.tensor as tt
import exoplanet as xo

with pm.Model() as model:

    # Parameters
    logK = pm.Uniform(
        "logK",
        lower=0,
        upper=np.log(200),
        testval=np.log(0.5 * (np.max(rv) - np.min(rv))),
    )
    logP = pm.Uniform(
        "logP", lower=0, upper=np.log(10), testval=np.log(lit_period)
    )
    phi = pm.Uniform("phi", lower=0, upper=2 * np.pi, testval=0.1)

    # Parameterize the eccentricity using:
    #   hk[0] = sqrt(e) * cos(w)
    #   hk[1] = sqrt(e) * sin(w)
    hk = pmx.UnitDisk("hk", testval=np.array([0.01, 0.01]))
    e = pm.Deterministic("e", hk[0] ** 2 + hk[1] ** 2)
    w = pm.Deterministic("w", tt.arctan2(hk[1], hk[0]))
    rv0 = pm.Normal("rv0", mu=0.0, sd=10.0, testval=0.0)
    rvtrend = pm.Normal("rvtrend", mu=0.0, sd=10.0, testval=0.0)

    # Deterministic transformations
    n = 2 * np.pi * tt.exp(-logP)
    P = pm.Deterministic("P", tt.exp(logP))
    K = pm.Deterministic("K", tt.exp(logK))
    cosw = tt.cos(w)
    sinw = tt.sin(w)
    t0 = (phi + w) / n

    # The RV model
    bkg = pm.Deterministic("bkg", rv0 + rvtrend * t / 365.25)
    M = n * t - (phi + w)

    # This is the line that uses the custom Kepler solver
    f = xo.orbits.get_true_anomaly(M, e + tt.zeros_like(M))
    rvmodel = pm.Deterministic(
        "rvmodel", bkg + K * (cosw * (tt.cos(f) + e) - sinw * tt.sin(f))
    )

    # Condition on the observations
    pm.Normal("obs", mu=rvmodel, sd=rv_err, observed=rv)

    # Compute the phased RV signal
    phase = np.linspace(0, 1, 500)
    M_pred = 2 * np.pi * phase - (phi + w)
    f_pred = xo.orbits.get_true_anomaly(M_pred, e + tt.zeros_like(M_pred))
    rvphase = pm.Deterministic(
        "rvphase", K * (cosw * (tt.cos(f_pred) + e) - sinw * tt.sin(f_pred))
    )
# -
# In this case, I've found that it is useful to first optimize the parameters to find the "maximum a posteriori" (MAP) parameters and then start the sampler from there.
# This is useful here because MCMC is not designed to *find* the maximum of the posterior; it's just meant to sample the shape of the posterior.
# The performance of all MCMC methods can be really bad when the initialization isn't good (especially when some parameters are very well constrained).
# To find the maximum a posteriori parameters using PyMC3, you can use the `optimize` function from [pymc3-ext](https://github.com/exoplanet-dev/pymc3-ext):
with model:
    map_params = pmx.optimize()
# Let's make a plot to check that this initialization looks reasonable.
# In the top plot, we're looking at the RV observations as a function of time with the initial guess for the long-term trend overplotted in blue.
# In the lower panel, we plot the "folded" curve, where we have wrapped the observations onto the best-fit period, with the prediction for a single orbit overplotted in orange.
# If this doesn't look good, try adjusting the initial guesses for the parameters and see if you can get a better fit.
#
# **Exercise:** Try changing the initial guesses for the parameters (as specified by the `testval` argument) and see how sensitive the results are to these values. Are there some parameters that are less important? Why is this?
# +
fig, axes = plt.subplots(2, 1, figsize=(8, 8))
period = map_params["P"]
ax = axes[0]
ax.errorbar(t, rv, yerr=rv_err, fmt=".k")
ax.plot(t, map_params["bkg"], color="C0", lw=1)
ax.set_ylim(-110, 110)
ax.set_ylabel("radial velocity [m/s]")
ax.set_xlabel("time [days]")
ax = axes[1]
ax.errorbar(t % period, rv - map_params["bkg"], yerr=rv_err, fmt=".k")
ax.plot(phase * period, map_params["rvphase"], color="C1", lw=1)
ax.set_ylim(-110, 110)
ax.set_ylabel("radial velocity [m/s]")
ax.set_xlabel("phase [days]")
plt.tight_layout()
# -
# Now let's sample the posterior starting from our MAP estimate (here we're using the sampling routine from [pymc3-ext](https://github.com/exoplanet-dev/pymc3-ext) which wraps the PyMC3 function with some better defaults).
with model:
    trace = pmx.sample(
        draws=2000,
        tune=1000,
        start=map_params,
        chains=2,
        cores=2,
        target_accept=0.95,
    )
# As above, it's always a good idea to take a look at the summary statistics for the chain.
# If everything went as planned, there should be more than 1000 effective samples per chain and the Rhat values should be close to 1.
# (Not too bad for less than 30 seconds of run time!)
# +
import arviz as az

with model:
    summary = az.summary(
        trace,
        var_names=["logK", "logP", "phi", "e", "w", "rv0", "rvtrend"],
    )
summary
# -
# Similarly, we can make the corner plot again for this model.
_ = corner.corner(trace, var_names=["K", "P", "e", "w"])
# Finally, the last plot that we'll make here is of the posterior predictive density.
# In this case, this means that we want to look at the distribution of predicted models that are consistent with the data.
# As above, the top plot shows the raw observations as black error bars and the RV trend model is overplotted in blue.
# But, this time, the blue line is actually composed of 25 lines that are samples from the posterior over trends that are consistent with the data.
# In the bottom panel, the orange lines indicate the same 25 posterior samples for the RV curve of one orbit.
# +
fig, axes = plt.subplots(2, 1, figsize=(8, 8))
period = map_params["P"]
ax = axes[0]
ax.errorbar(t, rv, yerr=rv_err, fmt=".k")
ax.set_ylabel("radial velocity [m/s]")
ax.set_xlabel("time [days]")
ax = axes[1]
ax.errorbar(t % period, rv - map_params["bkg"], yerr=rv_err, fmt=".k")
ax.set_ylabel("radial velocity [m/s]")
ax.set_xlabel("phase [days]")
for i in np.random.randint(len(trace) * trace.nchains, size=25):
    axes[0].plot(t, trace["bkg"][i], color="C0", lw=1, alpha=0.3)
    axes[1].plot(
        phase * period, trace["rvphase"][i], color="C1", lw=1, alpha=0.3
    )
axes[0].set_ylim(-110, 110)
axes[1].set_ylim(-110, 110)
plt.tight_layout()