# Lab 9: Hamiltonian Monte Carlo, Coverage Checks, and Posterior Predictive Checks with PyMC

### Lab Date: Wednesday, Apr 16

### Lab Due: Wednesday, Apr 30

## Instructions

Work with your lab group to complete the following notebook. Your work will be reviewed by your peers in two weeks (Wednesday, April 30).

In this lab, you will:
1. **Sampling from an intractable posterior using Hamiltonian Monte Carlo (HMC):**  
   Use PyMC to sample from a Bayesian logistic regression model where the posterior is analytically intractable.

2. **Coverage:**
    Check the accuracy of your posterior sampling procedure by performing a coverage check.

3. **Posterior Predictive Checks (PPC):**  
   Use the posterior samples to perform model checking.

If you are new to working in Python, or in a Jupyter notebook, please ask your lab members for help. If you notice a lab member struggling and you have experience, please offer your help.

Please see this [Ed post](https://edstem.org/us/courses/74615/discussion/6463387) for corrections, questions, and discussion. If you would rather work with your own copy of the files, I have uploaded a zip folder there with the lab materials. 

Corrections to the lab will be pushed directly to this notebook. We will only push corrections to the text, which is set to read only to prevent merge conflicts. In the event of a merge conflict, save your notebook under a different name, and click the link that launches the lab from the schedule on the [stat238 homepage](https://stat238.berkeley.edu/spring-2025/) again. Then, check for discrepancies. If you can't find them, or resolve the conflict, contact us.

## Problem Setup: Bayesian Logistic Regression

We consider a binomial logistic regression model. For groups $j = 1, \dots, J$, we observe:

- $y_j \sim \text{Binomial}(n_j, \theta_j)$
- $\theta_j = \text{logistic}(\alpha + \beta x_j)$

Here:
- $x_j$ is a known covariate for group $j$
- $\alpha, \beta$ are regression coefficients
- $\text{logistic}(z) = \frac{1}{1 + e^{-z}}$ is the inverse logit function
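
Equivalently, the model is linear on the log-odds scale, which is why the logistic function is called the inverse logit. Solving $\theta_j = \text{logistic}(\alpha + \beta x_j)$ for the linear predictor gives:

$$
\operatorname{logit}(\theta_j) = \log\frac{\theta_j}{1 - \theta_j} = \alpha + \beta x_j
$$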

### Prior Distributions
We place weakly informative priors:
- $\alpha \sim t_4(0, 2^2)$  (Student-t with 4 degrees of freedom, mean 0, scale 2)
- $\beta \sim t_4(0, 1)$

These priors have heavier tails than the normal distribution, which makes them robust (cf. Lab 6).
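
To get a rough feel for what "heavier tails" means, the sketch below compares the two-sided tail probability of a Student-t with 4 degrees of freedom against a standard normal. The threshold of 3 scale units is an arbitrary choice made for illustration, and the variable names are not part of the lab.

```python
import scipy.stats as st

# Arbitrary illustrative threshold, measured in units of the prior scale
threshold = 3.0

# Two-sided tail probabilities P(|Z| > threshold)
tail_t4 = 2 * st.t.sf(threshold, df=4)
tail_normal = 2 * st.norm.sf(threshold)

print(f"t_4 tail:    {tail_t4:.4f}")
print(f"normal tail: {tail_normal:.4f}")
print(f"ratio:       {tail_t4 / tail_normal:.1f}")
```

Because the comparison is made in scale units, the same ratio applies to the prior on $\alpha$ (scale 2) and to the prior on $\beta$ (scale 1).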

## 1. Import Required Packages and Define Helper Functions

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
import scipy.stats as st
from scipy.special import expit  # logistic (inverse-logit) function

# Set a random seed for reproducibility
np.random.seed(123)

## 2. Simulate the Dataset

**Q 2.1:** Simulate a dataset using the model:

$y_j \sim \operatorname{Binomial}(n_j, \operatorname{logit}^{-1}(\alpha + \beta\, x_j))$

where $J = 10$, the covariates $x_j$ are drawn from $\text{Uniform}(-1, 1)$, and the sample sizes $n_j$ are drawn from a Poisson with $\lambda = 5$ (reject any zeros). Use $\alpha_{\text{true}} = 0.5$ and $\beta_{\text{true}} = -1.0$ for simulation.

In [2]:
# Set simulation parameters
J = 10
alpha_true = 0.5
beta_true = -1.0

# Draw x_j from Uniform(-1, 1)
x = np.random.uniform(-1, 1, size=J)

# Draw n_j from a truncated Poisson (reject zeros)
n = np.empty(J, dtype=int)
for j in range(J):
    nj = 0
    while nj < 1:
        nj = np.random.poisson(lam=5)
    n[j] = nj

# Compute the probabilities theta_j = logistic(alpha_true + beta_true * x_j)
theta = ... # complete this line

# Draw y_j ~ Binomial(n_j, theta_j)
y = ... # complete this line

print("x:", np.round(x, 2))
print("n:", n)
print("y:", y)

x: [ 0.39 -0.43 -0.55  0.1   0.44 -0.15  0.96  0.37 -0.04 -0.22]
n: [3 4 8 4 8 7 5 9 4 3]
y: [1 3 7 3 3 5 2 5 3 1]


## 3. Build and Sample the Model in PyMC

PyMC is a powerful, open-source library for probabilistic programming in Python that simplifies the process of building and fitting Bayesian models. It provides an intuitive interface for defining complex statistical models and performing inference using state-of-the-art sampling algorithms such as NUTS. If you're new to PyMC or Bayesian modeling, check out the <a href="https://docs.pymc.io/" style="color: blue;">official PyMC documentation</a> for tutorials, examples (e.g. <a href="https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/GLM_linear.html#glm-linear" style="color: blue;">GLM Linear Regression</a>), and comprehensive guidance.

If you get stuck in the documentation, check with the TA during lab or ask for help in office hours.

**Q 3.1:** In the space below:

- Define the Bayesian model using PyMC.
- Use Student‑t priors with 4 degrees of freedom for $\alpha$ (scale = 2) and $\beta$ (scale = 1).
- Define the likelihood: $y_j \sim \operatorname{Binomial}(n_j, \theta_j)$ with $\theta_j = \operatorname{logit}^{-1}(\alpha + \beta\, x_j)$.
- Sample from the posterior using the No-U-Turn Sampler (NUTS), a type of HMC.


In [None]:
with pm.Model() as model:
    # Priors for alpha and beta using Student-t distributions
    alpha = pm.StudentT("alpha", nu=4, mu=0, sigma=2) # an example call to pymc defining a distributional model component
    beta = ...
    
    # Logistic regression: theta = logit^{-1}(alpha + beta * x)
    theta = ... # use the pm.Deterministic function to fix a deterministic model component
    
    # Likelihood: y_j ~ Binomial(n_j, theta_j)
    y_obs = ... # use the observed = argument to fix the observation
    
    # Sample using NUTS (HMC)
    sample_trace = ... # use the pm.sample function

# Display the posterior summary for alpha and beta
print(az.summary(sample_trace, var_names=["alpha", "beta"]))

**Q 3.2:** Adjust the sampling parameters (primarily, the number of steps taken) until you are convinced that the chain has mixed, and that you have enough samples to reliably represent the posterior. Explain the criteria you used in the space below.

*Write your answer here.*
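
One commonly used mixing criterion is the potential scale reduction factor ($\hat{R}$), which compares between-chain and within-chain variance. The sketch below is a simplified, non-split version written in NumPy on synthetic chains; it is an illustration only, and `gelman_rubin_rhat` is a hypothetical helper name. ArviZ reports a more refined rank-normalized version in the `r_hat` column of `az.summary`, next to the effective sample sizes `ess_bulk` and `ess_tail`.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Basic (non-split) R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    # Within-chain variance: average of the per-chain sample variances
    W = chains.var(axis=1, ddof=1).mean()
    # Between-chain variance: n_draws times the variance of the chain means
    B = n_draws * chains.mean(axis=1).var(ddof=1)
    # Pooled estimate of the posterior variance
    var_hat = (n_draws - 1) / n_draws * W + B / n_draws
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)

# Synthetic example: four chains that all explore the same distribution ...
mixed = rng.normal(size=(4, 1000))
# ... and four chains where two are stuck around a different location
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print("R-hat, well-mixed chains:", round(rhat_mixed, 3))
print("R-hat, stuck chains:     ", round(rhat_stuck, 3))
```

Values close to 1 are consistent with the chains having mixed; values well above 1 indicate that the chains disagree with one another.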

## 4. Coverage

**Q 4.1:** Do your 50% posterior intervals for $\alpha$ and $\beta$ (25th to 75th percentiles) contain the true values?

In [None]:
# Write your code here

Our posterior inferences are approximate, since they depend on an MCMC procedure that is guaranteed to sample from the exact posterior only in the limit of infinitely many iterations. To check that your sampling procedure reliably represents the true posterior, perform a coverage check.

**Q 4.2:** That is:

1. Generate a series of $m$ replicate data sets sampled from the full data generating model: sample $\alpha, \beta$ from the prior, resample $x$ and $n$ using 10 observations as before (alternatively, you can check coverage conditional on the same $x$ and $n$ from the main test case), then sample a series of new $y$'s.
2. For each, run your sampling code to generate approximate samples from the posterior. For each set of samples build 65%, 80%, 90%, and 95% credible intervals for $\alpha$ and $\beta$.
3. Track the fraction of replicates for which the sampled $\alpha$ and $\beta$ land in each posterior interval. Return a table comparing the fraction of the time that the replicates should have been contained in the interval (e.g. 65% for the 65% intervals), and the fraction that actually were contained inside the intervals. Choose $m$ large enough so that the observed fractions are sufficiently stable for evaluation.
4. Report the expected standard deviation in the observed fraction of posterior intervals containing the truth. If our posterior sampling procedure is exact, we use an interval that is chosen to contain the truth with probability $p$, and we run $m$ replicates, then the observed fraction of replicates containing the truth is $C/m$ where $C \sim \text{Binomial}(m,p)$. Use the standard deviation of $C/m$ for each $p$ to choose $m$.
5. Based on steps 3 and 4, evaluate whether the discrepancy between the observed fraction contained in each interval, and the predicted fraction, is plausibly explained by randomness in the trials, or is evidence of systematic error in your posterior sampling procedure.
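
For step 4, the binomial variance $\operatorname{Var}(C) = m\,p\,(1-p)$ implies that $C/m$ has standard deviation $\sqrt{p(1-p)/m}$. The sketch below tabulates this for the four interval levels; the candidate values of $m$ and the helper name `coverage_sd` are arbitrary choices for illustration, not recommendations.

```python
import math

def coverage_sd(p, m):
    """Standard deviation of C/m when C ~ Binomial(m, p)."""
    return math.sqrt(p * (1 - p) / m)

levels = [0.65, 0.80, 0.90, 0.95]

# Illustrative replicate counts; each fourfold increase in m halves the standard deviation
for m in (100, 400, 1600):
    sds = [round(coverage_sd(p, m), 3) for p in levels]
    print(f"m = {m:4d}:", dict(zip(levels, sds)))
```

Since each replicate requires a full MCMC run, the choice of $m$ is a trade-off between this standard deviation and compute time.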

In [9]:
# Write your code here

**Q 4.3:** In the space below, reflect on the role of coverage checking in a Bayesian analysis when we don't know the true data generating model. Does a coverage check validate the model? If not, what aspect of the Bayesian process is validated using coverage checks?

*Write your answer here.*

**Q 4.4:** In the space below, explain why we need many replicates from the data generating model to perform a coverage check (why wasn't Q4.1 sufficient by itself)?

*Write your answer here.*

## 5. Posterior Predictive Checks (PPC) and Plotting the Fit

Now, generate posterior predictive samples to perform model checking. 

**Q 5.1:** In the space below:

- Generate $m$ posterior predictive samples conditional on the observed data, sample sizes $n$, and covariates $x$. That is, resample $\alpha, \beta$ from the posterior (use your posterior samples), then generate replicate $y$'s given $\alpha, \beta, x$, and $n$.
- Create a plot comparing the proportions $y_j/n_j$ in the observed data set and the proportions predicted by replication. I would suggest a scatter plot whose horizontal coordinate is the observed proportion, and whose vertical coordinate is the replicated proportion for each replicate. For each $j$, compute an interval that contains 95% of the replicate proportions, then overlay the interval on your scatter plot. Finally, add the line of parity (line of slope one) to check agreement.

Vary $m$ until you believe you have enough replicates to answer the question, "could my data have plausibly been generated by this model?".
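
The replication and interval steps can be sketched with NumPy broadcasting, as below. This is a sketch under stated assumptions: `alpha_draws` and `beta_draws` are placeholder normal draws standing in for real posterior samples, and `x` and `n` are copied from the printed output of the simulation cell. It omits the plotting and is not a substitute for `pm.sample_posterior_predictive`.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)

# Covariates and sample sizes copied from the printed simulation output
x = np.array([0.39, -0.43, -0.55, 0.10, 0.44, -0.15, 0.96, 0.37, -0.04, -0.22])
n = np.array([3, 4, 8, 4, 8, 7, 5, 9, 4, 3])

# ASSUMPTION: placeholder draws, NOT real posterior samples
m = 2000
alpha_draws = rng.normal(0.5, 0.3, size=m)
beta_draws = rng.normal(-1.0, 0.5, size=m)

# One replicate data set per draw: theta_rep has shape (m, J) by broadcasting
theta_rep = expit(alpha_draws[:, None] + beta_draws[:, None] * x[None, :])
y_rep = rng.binomial(n, theta_rep)
prop_rep = y_rep / n

# Per-group interval containing 95% of the replicated proportions
lower, upper = np.percentile(prop_rep, [2.5, 97.5], axis=0)
print("lower:", np.round(lower, 2))
print("upper:", np.round(upper, 2))
```

With small $n_j$ the replicated proportions take only a few distinct values, so the intervals are coarse and points in a scatter plot will overlap heavily.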

In [None]:
# Write your code here (it may help to use the pm.sample_posterior_predictive function)

**Q 5.2:** To show that a PPC can reject misspecified models, regenerate the data and rerun the PPC, but use a different model for generating the data than you do for posterior inference (e.g. try using two different prior parameters, one when you sample the data, and one when you run posterior inference). Vary the models until your PPC procedure would reject the model you used for inference.

In [36]:
# Write your code here (it may help to use the pm.sample_posterior_predictive function)

**Q 5.3:** Explain how you would formalize the graphical PPC described above as a hypothesis test, and, based on your experience in Q 5.2, whether your PPC procedure would make a sensitive (i.e. powerful) test for model misspecification with respect to the parameter you chose. Explain why you believe your procedure would form a sensitive, or insensitive, test for the chosen parameter.

*Write your answer here.*