# Probabilistic Programming and Bayesian Inference (MCMC with PyMC) – Course Outline

This five-class post-graduate course teaches Bayesian inference and probabilistic programming using PyMC, with a focus on Markov Chain Monte Carlo (MCMC) methods. Each one-hour class blends theory with hands-on coding in Jupyter notebooks. Students should have prior knowledge of Bayesian inference and probability theory. Below is the lesson plan for each session, including theoretical concepts, code snippets, and exercises.

## Class 1: Introduction to MCMC and PyMC

**Theory and Concepts:**  
- *Bayesian Inference Refresher:* Recall that Bayesian inference combines prior beliefs and data evidence via Bayes’ theorem to update our beliefs about unknown parameters. We specify a **prior** distribution for parameters (our initial belief) and a **likelihood** for the observed data given those parameters. The result is the **posterior** distribution, which quantifies our updated belief after seeing the data.  
- *Why MCMC for Bayesian Computation:* In Bayesian analysis, the posterior is often analytically intractable because it requires integrating the likelihood over all possible parameter values (the *evidence*). For even moderately complex models, we **cannot compute the posterior in closed form**, so we resort to numerical approximation. **MCMC** provides an efficient way to sample from the posterior distribution without having to compute the evidence explicitly ([statistics - How does MCMC help bayesian inference? - Stack Overflow](https://stackoverflow.com/questions/53964848/how-does-mcmc-help-bayesian-inference#:~:text=I%20understand%20what%20MCMC%20does,from%20any%20complicated%20probability%20distribution)). The core idea is to construct a special **Markov chain** whose equilibrium distribution is the target posterior; by simulating this chain, we obtain samples from the posterior. These samples can be used to approximate expectations and uncertainties of model parameters.  
- *What is PyMC:* PyMC is a Python library for probabilistic programming that makes Bayesian modeling accessible. It allows users to define custom Bayesian models in pure Python and automatically computes posterior results using advanced MCMC algorithms. PyMC’s strengths include an intuitive model specification syntax and **powerful samplers like NUTS (No-U-Turn Sampler)**, a form of Hamiltonian Monte Carlo, which enable fitting complex models with thousands of parameters and minimal tuning. In addition, PyMC integrates with other scientific libraries and supports modern computational backends (NumPy, JAX via PyTensor) for speed and flexibility. We will use PyMC to implement MCMC for our Bayesian models throughout this course.
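
To make the "no evidence needed" point concrete, here is a minimal random-walk Metropolis sketch written in plain NumPy. It is not how PyMC samples (PyMC defaults to NUTS); it only illustrates that an MCMC acceptance rule uses ratios of the unnormalized posterior, so the evidence cancels. The data (7 heads in 10 flips with a Beta(1,1) prior) match the coin example used later in this class; the function names, step size, and number of draws are illustrative assumptions.

```python
import numpy as np

def log_unnormalized_posterior(p, heads=7, flips=10, a=1.0, b=1.0):
    """Log of Beta(a, b) prior times Binomial likelihood, up to an additive constant."""
    if p <= 0.0 or p >= 1.0:
        return -np.inf  # outside the support of the prior
    log_prior = (a - 1) * np.log(p) + (b - 1) * np.log(1 - p)
    log_lik = heads * np.log(p) + (flips - heads) * np.log(1 - p)
    return log_prior + log_lik

def metropolis(n_draws=20000, step=0.2, seed=0):
    """Random-walk Metropolis with a Gaussian proposal (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_draws)
    current = 0.5
    current_lp = log_unnormalized_posterior(current)
    for i in range(n_draws):
        proposal = current + rng.normal(0.0, step)
        proposal_lp = log_unnormalized_posterior(proposal)
        # Accept with probability min(1, posterior ratio); the evidence cancels in the ratio
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples[i] = current
    return samples

samples = metropolis()
posterior_mean = samples[2000:].mean()  # discard early draws as warm-up
print(f"Metropolis estimate of E[p | data]: {posterior_mean:.3f}")
print(f"Exact Beta(8, 4) posterior mean:    {8 / 12:.3f}")
```

Because this model is conjugate, the exact posterior Beta(8, 4) is available as a check; for non-conjugate models there is no such closed form, which is where MCMC earns its keep.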

**Hands-On Practice:** *Setting up PyMC and running a basic model*  
- *Installation:* In the Jupyter notebook, ensure PyMC is installed. You can install it via pip or conda. For example, in a notebook cell:  
  ```python
  !pip install pymc
  ```  
  Verify the installation by importing the library and checking its version:  
  ```python
  import pymc as pm
  print(pm.__version__)
  ```  
- *Defining a Basic Bayesian Model:* We will start with a simple model to illustrate Bayesian inference with MCMC – for instance, inferring the bias of a coin (probability of heads) from observed flips. Our prior belief for the coin’s bias can be uniform Beta(1,1) (i.e. no prior preference), and the likelihood for observed flips follows a Binomial distribution.  
  ```python
  import pymc as pm

  # Data: 10 coin flips, with 7 heads observed
  flips = 10
  heads = 7

  with pm.Model() as coin_model:
      # Prior: assume bias ~ Beta(1,1) (uniform)
      p = pm.Beta("p", alpha=1, beta=1)
      # Likelihood: observed heads out of flips ~ Binomial(flips, p)
      y = pm.Binomial("y", n=flips, p=p, observed=heads)
      # Perform MCMC sampling
      trace = pm.sample(draws=1000, chains=2, random_seed=42)
  ```  
  This code defines a PyMC model and runs an MCMC sampler (`pm.sample()` defaults to NUTS for continuous variables). PyMC automatically constructs the posterior by combining the Beta prior and Binomial likelihood given our data. After sampling, `trace` contains draws from the posterior distribution of the coin’s bias `p`.  
- *Inspecting the Results:* With the `trace` object (an `InferenceData`), we can examine the posterior. For example, we might check the mean of the posterior samples or plot the posterior distribution of `p`. Using ArviZ (an analysis library integrated with PyMC), we can summarize results:  
  ```python
  import arviz as az
  print(az.summary(trace, var_names=["p"]))
  ```  
  This will display statistics such as the posterior mean of `p` and its credible interval. Students should confirm that the posterior for `p` makes sense: with a Beta(1,1) prior and 7 heads in 10 flips, the exact posterior is Beta(8,4), which peaks at 0.7 and has mean 8/12 ≈ 0.67, so the sampled mean should land close to 0.67.  

**Exercise:** After running the provided model, modify the prior and see its effect. For instance, change the prior to a more biased prior like Beta(5,1) (strong belief that the coin is biased toward heads) and re-run the sampler. Compare the posterior results to the original to observe how a stronger prior influences the posterior. Discuss why MCMC still works for sampling the posterior even when the prior is informative (the mechanism of MCMC doesn’t change; it will concentrate more samples in regions favored by both prior and likelihood).
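
One way to check your answers to this exercise is the conjugate update for the Beta-Binomial model, which gives the exact posterior without sampling. The sketch below uses only arithmetic; the helper name `beta_binomial_posterior` is an illustrative assumption, not part of PyMC.

```python
def beta_binomial_posterior(prior_a, prior_b, heads, flips):
    """Conjugate update: Beta(a, b) prior + Binomial data -> Beta posterior."""
    post_a = prior_a + heads
    post_b = prior_b + (flips - heads)
    mean = post_a / (post_a + post_b)
    # This mode formula is valid when both posterior parameters exceed 1
    mode = (post_a - 1) / (post_a + post_b - 2)
    return post_a, post_b, mean, mode

for prior in [(1, 1), (5, 1)]:
    a, b, mean, mode = beta_binomial_posterior(*prior, heads=7, flips=10)
    print(f"prior Beta{prior} -> posterior Beta({a}, {b}): mean={mean:.3f}, mode={mode:.3f}")
```

The uniform prior gives Beta(8, 4) with mean 0.667, while the Beta(5,1) prior gives Beta(12, 4) with mean 0.750: the informative prior pulls the posterior toward heads. Your MCMC summaries should agree with these values up to Monte Carlo error.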

**References (Class 1):** Bayesian inference background, necessity of MCMC for posterior sampling, PyMC overview.

---

## Class 2: Building Bayesian Models in PyMC

**Theory and Concepts:**  
- *Model Components – Priors and Likelihoods:* A Bayesian model is composed of **prior distributions** on parameters and a **likelihood function** for the observed data. In PyMC, we define priors using probability distribution objects (e.g., `pm.Normal`, `pm.Exponential`) and attach observed data to a likelihood distribution via the `observed` argument. For example, `y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=data)` means “Assume observed variable $y$ is Normally distributed with mean `mu` and std `sigma`, and set its value to the actual data.” The combination of the prior and this observed likelihood forms the joint probability model. Intuitively, the observed data *constrain* the parameters during sampling: parameter values that make the observed data likely (high likelihood) will be favored, tempered by the prior belief about those parameters.  
- *Constructing a PyMC Model:* PyMC uses context managers (`with pm.Model() as model:`) to build a model. Inside the `with` block, we declare random variables. Each unobserved variable (parameter) gets a prior, and at least one variable represents the likelihood of observed data (using the `observed=` keyword). PyMC then internally creates the mathematical representation of the posterior. **Example:** For a simple linear regression $y = \alpha + \beta x + \epsilon$ with noise $\epsilon \sim \mathcal{N}(0, \sigma)$, we might assign priors $\alpha \sim \mathcal{N}(0,10)$, $\beta \sim \mathcal{N}(0,10)$, and $\sigma \sim \text{HalfNormal}(10)$ for the noise standard deviation. The likelihood would be $y_i \sim \mathcal{N}(\alpha + \beta x_i,\ \sigma)$ for each observed data point.  
- *Sampling Strategies in PyMC:* Once the model is specified, we use `pm.sample()` to draw samples from the posterior. PyMC will automatically choose an appropriate MCMC sampler. For continuous models, by default it uses the No-U-Turn Sampler (NUTS), an efficient gradient-based MCMC method. For discrete variables or certain scenarios, it may use other samplers (e.g., Metropolis or slice sampling) or a combination. We generally do not need to manually choose the sampler, but it’s useful to know that *NUTS is used under the hood for most of our models*, providing adaptive, efficient exploration of the posterior. After sampling, PyMC returns an `InferenceData` object (from ArviZ) containing the draws for each parameter.  
- *Interpreting PyMC Output:* The MCMC output can be summarized and visualized. Key outputs include the posterior mean, median, and highest posterior density intervals for parameters. We also often examine **trace plots** (parameter values vs. iteration) to see if the chains have mixed well, and **autocorrelation** to gauge how independent samples are. These topics will be covered in depth in Class 3 (MCMC diagnostics). For now, understanding that `trace` contains a large collection of parameter samples which approximate the posterior distribution is enough.  
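
To see what "the combination of the prior and the observed likelihood" amounts to, the sketch below writes out the unnormalized log posterior for the linear regression example by hand in NumPy. It assumes the priors quoted above (Normal(0, 10) on intercept and slope, HalfNormal(10) on the noise standard deviation); the function names and the five toy data points are hypothetical. PyMC builds an equivalent quantity internally, so this is an illustration of the idea rather than PyMC's actual code.

```python
import numpy as np

def normal_logpdf(value, mu, sigma):
    """Log density of Normal(mu, sigma), with sigma a standard deviation."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((value - mu) / sigma) ** 2

def halfnormal_logpdf(value, sigma):
    """Log density of |Z| with Z ~ Normal(0, sigma); zero density below 0."""
    if value < 0:
        return -np.inf
    return np.log(2.0) + normal_logpdf(value, 0.0, sigma)

def log_joint(alpha, beta, sigma, x, y):
    """Log prior plus log likelihood: the log posterior up to a constant."""
    if sigma <= 0:
        return -np.inf
    log_prior = (
        normal_logpdf(alpha, 0.0, 10.0)
        + normal_logpdf(beta, 0.0, 10.0)
        + halfnormal_logpdf(sigma, 10.0)
    )
    log_lik = np.sum(normal_logpdf(y, alpha + beta * x, sigma))
    return log_prior + log_lik

# Hypothetical toy data lying roughly on the line y = 5 + 1.5 x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([5.1, 6.4, 8.2, 9.3, 11.0])

print("near the data-generating line:", log_joint(5.0, 1.5, 1.0, x, y))
print("far from it:                  ", log_joint(0.0, 0.0, 1.0, x, y))
```

Parameter values that explain the data score a much higher log joint density, which is exactly what steers the sampler toward them.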

**Hands-On Practice:** *Implementing a Simple Bayesian Regression in PyMC*  
In this exercise, students will build and sample from a Bayesian linear regression model using PyMC. We will simulate a small dataset and then infer the regression line. Steps and code are as follows:

1. **Simulate Dataset:** Create synthetic data for a simple linear relationship. For example, let’s simulate `x` values and generate `y` with some noise:
   ```python
   import numpy as np
   rng = np.random.default_rng(42)
   # Independent variable
   x = np.linspace(0, 10, 50)
   # True (unknown) parameters for simulation
   true_intercept = 5.0
   true_slope = 1.5
   true_sigma = 2.0
   # Generate linear data with Gaussian noise
   y = true_intercept + true_slope * x + rng.normal(0, true_sigma, size=x.size)
   ```  
   Here `true_intercept=5`, `true_slope=1.5`, and noise std `true_sigma=2`. In a real scenario, `x` and `y` would come from an experiment or observation. Students can visualize the data (e.g., scatter plot of `x` vs `y`) to see the linear trend, but plotting is optional due to environment constraints.
2. **Define the Bayesian Linear Regression Model:** Using PyMC, specify priors for the intercept, slope, and noise, and define the likelihood for observed `y`:
   ```python
   import pymc as pm
   with pm.Model() as linreg_model:
       # Priors for intercept and slope (normal with broad std)
       intercept = pm.Normal("intercept", mu=0, sigma=10)
       slope = pm.Normal("slope", mu=0, sigma=10)
       # Prior for noise standard deviation (half-normal to ensure positivity)
       sigma = pm.HalfNormal("sigma", sigma=10)
       # Linear model for mean
       mu = intercept + slope * x  # x is the numpy array from above
       # Likelihood (observed data y)
       y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
       # Posterior sampling
       trace = pm.sample(draws=2000, tune=1000, chains=2, random_seed=42)
   ```  
   In this code, we assume weakly informative priors (Normal(0,10) for both intercept and slope, allowing a wide range of values). The model likelihood ties these parameters to the observed data `y`. When we call `pm.sample`, PyMC runs two MCMC chains of 2000 draws each (after 1000 tuning steps per chain) using NUTS. It logs which sampler was assigned (“Auto-assigning NUTS sampler…”), shows a progress bar, and reports sampling problems such as divergences when it finishes. After sampling, the `trace` contains posterior samples of `intercept`, `slope`, and `sigma`.  
3. **Examine Posterior Estimates:** Using ArviZ, we can summarize the posterior:
   ```python
   import arviz as az  # already imported in Class 1; repeated so this cell runs on its own
   az.summary(trace, var_names=["intercept", "slope", "sigma"])
   ```  
   This will show the mean and 94% highest posterior density (HPD) interval for each parameter. We expect the posterior for `intercept` to be around 5 and `slope` around 1.5 (since that’s how we generated the data), with `sigma` around 2. The credible intervals should indicate the uncertainty in these estimates. Students should verify that the true values used in simulation fall within the credible intervals of the posterior.  
   We can also inspect trace plots for each parameter to visually confirm that the chains have mixed well (they should resemble “fat fuzzy caterpillars” without obvious trends):  
   ```python
   az.plot_trace(trace, var_names=["intercept", "slope", "sigma"])
   ```  
   *(Plotting is conceptually suggested; in a text-based environment, students can skip actual plotting and rely on summary statistics and diagnostics covered later.)*  
4. **Interpretation:** Based on the posterior, discuss what the results mean. For example, “The posterior mean for the slope is 1.48 with a 94% credible interval [1.2, 1.8], which strongly indicates a positive relationship between x and y. The intercept is around 5.1 [4.0, 6.2], meaning when x=0, y is likely around 5. The sigma posterior ~2.1 [1.7, 2.6] suggests the data’s noise level.” Emphasize how Bayesian regression gives a full distribution for each parameter instead of just point estimates, allowing probabilistic statements about the parameters (e.g., P(slope > 0) ~ 1.0 in this case, indicating high confidence in a positive slope).  
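
A quick sanity check on the posterior summary is to compare it with an ordinary least-squares fit of the same simulated data. This is a classical technique swapped in purely as a cross-check, not a Bayesian result; with priors as wide as Normal(0, 10), the posterior means should land close to these estimates. The variable names prefixed with `ols_` are illustrative.

```python
import numpy as np

# Re-create the simulated dataset from step 1 (same seed and true parameters)
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 5.0 + 1.5 * x + rng.normal(0, 2.0, size=x.size)

# np.polyfit returns coefficients from the highest degree to the lowest
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)

# Residual standard deviation; divide by n - 2 because two parameters were fitted
residuals = y - (ols_intercept + ols_slope * x)
ols_sigma = np.sqrt(np.sum(residuals**2) / (x.size - 2))

print(f"OLS intercept: {ols_intercept:.2f}")
print(f"OLS slope:     {ols_slope:.2f}")
print(f"Residual std:  {ols_sigma:.2f}")
```

If the posterior means from `az.summary` differ sharply from these numbers, that is a hint to re-check the model code or the sampler diagnostics before interpreting anything.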

**Exercise:** Extend the regression model by adding another predictor or making it **hierarchical** if applicable. For example, if you have a second feature `x2` (say, a categorical variable encoded 0/1), include a second slope parameter and see how the inference adjusts. Alternatively, try a different likelihood: modify the model to use a Student-t distribution for `y_obs` instead of Normal to model potential outliers or heavy tails (you would need to add a degrees-of-freedom parameter `nu` to the model). Run the sampler again and compare the results – do the posterior estimates change or become wider/narrower with the new assumptions? This exercise teaches how to customize model components in PyMC and observe the impact.

**References (Class 2):** PyMC model specification and “observed” data usage, PyMC automatic sampler selection.

---

## Class 3: Advanced MCMC Techniques and Diagnostics

**Theory and Concepts:**  
- *Hamiltonian Monte Carlo (HMC) & NUTS:* Traditional MCMC methods (like random-walk Metropolis) can struggle with high-dimensional or complex posteriors, often requiring many correlated samples to explore the space. **Hamiltonian Monte Carlo (HMC)** is a more efficient MCMC technique that uses gradient information from the posterior to propose distant but high-probability moves, reducing random-walk behavior. It treats the parameter space like a physical system with positions (parameters) and momenta, and simulates Hamiltonian dynamics (the “physics” of a particle moving over the posterior surface) to traverse the posterior landscape. The **No-U-Turn Sampler (NUTS)** is an adaptive variant of HMC that eliminates the need to set a trajectory length by automatically stopping when the trajectory starts to turn back on itself (no U-turn). PyMC’s default sampler (NUTS) thus uses gradients to converge faster and explore more efficiently than simple Metropolis. In practical terms, this means PyMC can handle complex models with less manual tuning: NUTS adapts its step size during tuning and exploits the posterior’s geometry to avoid getting stuck.  
- *Convergence Diagnostics:* After running MCMC, it’s crucial to check whether the chains have **converged** to the target posterior distribution. Convergence means the sampler has reached the stationary distribution and is exploring it adequately. Key diagnostics include:  
  - **Trace Plots:** A trace plot shows sampled values of a parameter over iterations for each chain. For a converged chain, the trace should appear like a fat, stationary band with no obvious drift or trend over time. Multiple chains should overlap/mix well, indicating they’ve all converged to the same distribution. If traces show trends or one chain is systematically different, it’s a warning sign of non-convergence.  
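  - **R-hat (Gelman-Rubin statistic):** R-hat (also written $\hat{R}$) compares the variance between chains with the variance within chains for each parameter. More precisely, it is the square root of the ratio of a pooled variance estimate (which combines between-chain and within-chain variance) to the average within-chain variance. If all chains have converged to the same distribution, the between-chain component adds nothing beyond what the within-chain variance already explains, and R-hat will be close to 1.0. An R-hat value noticeably above 1 (e.g., > 1.05; the hands-on section below uses the stricter rule of thumb of 1.01) indicates that the chains have not mixed well and more sampling or tuning is needed. Modern implementations use rank-normalized R-hat for improved reliability.  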
  - **Effective Sample Size (ESS):** Due to autocorrelation in MCMC samples, not all draws carry independent information. ESS is an estimate of the equivalent number of independent samples one has, after accounting for autocorrelation ([Effective sample size for MCMC (Markov chain Monte Carlo)](https://www.johndcook.com/blog/2017/06/27/effective-sample-size-for-mcmc/#:~:text=There%E2%80%99s%20not%20much%20theory%20to,are%20dependent%2C%20they%E2%80%99re%20weakly%20correlated)). For example, 1000 highly correlated draws might have an ESS of only ~100, meaning the chain’s information content is similar to 100 independent samples ([Effective sample size for MCMC (Markov chain Monte Carlo)](https://www.johndcook.com/blog/2017/06/27/effective-sample-size-for-mcmc/#:~:text=The%20idea%20is%20to%20have,are%20dependent%2C%20they%E2%80%99re%20weakly%20correlated)). Higher ESS (relative to total draws) is better. If ESS is low, it suggests high autocorrelation; one might need to run the chain longer or improve the sampler’s efficiency.  
  - **Divergences and Other Diagnostic Flags:** When using HMC/NUTS, a *divergence* is a sign that the sampler encountered a region of parameter space that it struggled with (often due to abrupt changes in the posterior geometry, e.g., funnel-shaped distributions). PyMC will report divergence warnings in the sampler output. Even if R-hat is good, the presence of many divergences indicates potential problems – you may need to reparameterize the model or adjust sampler settings.  
- *Tuning MCMC:* If diagnostics indicate issues, several remedies exist. One can run chains for more iterations (to increase ESS and give chains time to converge), or adjust NUTS hyperparameters like `target_accept` (the desired acceptance probability for HMC steps, default ~0.8). For example, `pm.sample(target_accept=0.95)` can help reduce divergences at the cost of smaller steps. One can also increase the number of chains to better assess convergence. In some cases, reparameterizing the model (changing how parameters are expressed) can vastly improve sampling (this is a complex topic, often model-specific). The goal of tuning is to achieve R-hat ~ 1 and sufficiently large ESS for all parameters, indicating trustworthy posterior estimates.
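
The two numerical diagnostics above can be sketched by hand in NumPy. The version below is the classic Gelman-Rubin R-hat and a simple autocorrelation-based ESS that truncates at the first non-positive lag. Both are deliberately simplified and are not what ArviZ computes (ArviZ uses rank-normalized, split-chain variants), so the numbers will differ somewhat from `az.rhat` and `az.ess`. The function names and the synthetic chains are illustrative assumptions.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Classic R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    within = chains.var(axis=1, ddof=1).mean()           # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)  # B: between-chain variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)

def simple_ess(chain):
    """ESS = n / (1 + 2 * sum of autocorrelations), truncated at the first non-positive lag."""
    chain = np.asarray(chain, dtype=float)
    n = chain.size
    centered = chain - chain.mean()
    variance = np.dot(centered, centered) / n
    rho_sum = 0.0
    for lag in range(1, n):
        rho = np.dot(centered[:-lag], centered[lag:]) / (n * variance)
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

rng = np.random.default_rng(0)

# Four well-mixed chains versus the same chains with one stuck in a shifted region
good_chains = rng.normal(size=(4, 1000))
bad_chains = good_chains + np.array([[0.0], [0.0], [0.0], [3.0]])
rhat_good = gelman_rubin_rhat(good_chains)
rhat_bad = gelman_rubin_rhat(bad_chains)

# A strongly autocorrelated AR(1) chain carries far fewer effective samples
noise = rng.normal(size=1000)
ar_chain = np.zeros(1000)
for t in range(1, 1000):
    ar_chain[t] = 0.9 * ar_chain[t - 1] + noise[t]
ess_iid = simple_ess(good_chains[0])
ess_ar = simple_ess(ar_chain)

print(f"R-hat, well-mixed chains: {rhat_good:.3f}")
print(f"R-hat, one shifted chain: {rhat_bad:.3f}")
print(f"ESS, independent draws:   {ess_iid:.0f} of 1000")
print(f"ESS, AR(1) with rho=0.9:  {ess_ar:.0f} of 1000")
```

The shifted chain pushes R-hat well above 1, and the autocorrelated chain has an ESS that is a small fraction of its 1000 draws, which mirrors the failure modes described above.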

**Hands-On Practice:** *Diagnosing Convergence and Tuning MCMC*  
Using the linear regression model from Class 2 (or a similar model), we will intentionally tweak sampling settings to see the effect on diagnostics and then improve them:

1. **Run MCMC with Default Settings:** If not already done, sample from the model with multiple chains. For example:  
   ```python
   with linreg_model:  # reuse model from Class 2
       trace = pm.sample(draws=1000, tune=500, chains=4, random_seed=42)
   ```  
   This runs 4 chains, 1000 draws each (with 500 tuning steps per chain). After sampling, use ArviZ to compute diagnostics:  
   ```python
   ess = az.ess(trace, var_names=["intercept", "slope", "sigma"])
   rhat = az.rhat(trace, var_names=["intercept", "slope", "sigma"])
   print(ess)   # effective sample size per parameter
   print(rhat)  # should be very close to 1.00
   ```  
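   Examine the output R-hat values for each parameter – they should ideally be very close to 1.00. Also look at ESS (effective sample size): with 4 chains of 1000 draws there are 4000 draws in total, and ESS is usually somewhat lower than that because of autocorrelation (although NUTS can occasionally report an ESS above the number of draws when successive draws are negatively correlated). It should not be drastically low.  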
   *Interpretation:* If R-hat < 1.01 for all parameters and no divergence warnings were issued, the chains likely converged. If R-hat is high or ESS is very low, or if PyMC printed warnings (e.g., about divergences or reaching max tree depth in NUTS), then we have more work to do.  
2. **Examine Trace Plots:**  
   ```python
   az.plot_trace(trace, var_names=["intercept", "slope", "sigma"])
   ```  
   Visually inspect the traces for any non-stationarity. All four chains for each parameter should overlap and look like random fluctuations around a constant mean. If you see one chain stuck in a different region or a clear trend over iterations, that indicates non-convergence. (In a non-graphical environment, students can skip plotting and rely on R-hat, but understanding what the plots would show is important.)  
3. **Tuning the Sampler:** If diagnostics were not satisfactory, try tuning. For instance, if there were divergences or R-hat slightly > 1:  
   - Increase `tune` (the number of tuning, or warm-up, steps, which are discarded by default). E.g., `pm.sample(tune=1000, draws=1500, ...)` so the sampler has more time to adapt.  
   - Adjust `target_accept`. If divergences occurred, set `target_accept=0.95` to take smaller steps (reducing the chance of divergences at the cost of more computation per draw).  
   Example:  
   ```python
   with linreg_model:
       trace_tuned = pm.sample(draws=1500, tune=1000, chains=4, target_accept=0.95, random_seed=42)
   az.summary(trace_tuned, var_names=["intercept", "slope", "sigma"])
   ```  
   Compare the R-hat and ESS from `trace_tuned` to the previous results. You should see R-hat approach 1.00 and ESS increase if tuning was successful. The trade-off is that the sampler took more samples or smaller steps, increasing computation time.  
4. **Alternate Samplers (optional):** For didactic purposes, you can also experiment with forcing a less efficient sampler to see differences. For example:  
   ```python
   with linreg_model:
       step = pm.Metropolis()  # force Metropolis sampler for demonstration
       trace_met = pm.sample(draws=5000, step=step, chains=4)
   ```  
   Check how long this took and what the ESS/R-hat look like. Typically, Metropolis will require many more iterations to achieve the same ESS as NUTS. This exercise highlights why HMC/NUTS is preferred for continuous models. (Note: This is just for exploration – in practice, stick to PyMC’s default unless you have a specific reason.)  
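
To make the gradient-based proposal concrete, here is a minimal sketch of plain HMC with a leapfrog integrator, targeting a two-dimensional standard normal. It uses a fixed number of leapfrog steps, so it is HMC rather than NUTS, and it omits the step-size and mass-matrix adaptation that PyMC performs during tuning. All names and settings (`step_size=0.15`, `n_steps=10`) are illustrative assumptions.

```python
import numpy as np

def potential(q):
    """Negative log density of a standard normal target, up to a constant."""
    return 0.5 * np.dot(q, q)

def grad_potential(q):
    """Gradient of the potential; for a standard normal this is q itself."""
    return q

def leapfrog(q, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * step_size * grad_potential(q)  # initial half step for momentum
    for _ in range(n_steps - 1):
        q += step_size * p                    # full step for position
        p -= step_size * grad_potential(q)    # full step for momentum
    q += step_size * p                        # last full step for position
    p -= 0.5 * step_size * grad_potential(q)  # final half step for momentum
    return q, p

def hmc(n_draws, step_size=0.15, n_steps=10, dim=2, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros(dim)
    draws = np.empty((n_draws, dim))
    n_accepted = 0
    for i in range(n_draws):
        p0 = rng.normal(size=dim)  # resample momentum at every iteration
        q_new, p_new = leapfrog(q, p0, step_size, n_steps)
        h_old = potential(q) + 0.5 * np.dot(p0, p0)
        h_new = potential(q_new) + 0.5 * np.dot(p_new, p_new)
        # Metropolis correction for the integrator's energy error
        if np.log(rng.uniform()) < h_old - h_new:
            q = q_new
            n_accepted += 1
        draws[i] = q
    return draws, n_accepted / n_draws

draws, accept_rate = hmc(5000)
print(f"acceptance rate: {accept_rate:.2f}")
print("sample mean:    ", draws.mean(axis=0).round(2))
print("sample variance:", draws.var(axis=0).round(2))
```

Each proposal travels a long way along a trajectory yet is accepted almost every time, because the leapfrog integrator nearly conserves energy. A random-walk Metropolis proposal of comparable size would be rejected far more often in higher dimensions.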

**Exercise:** Using a more complex model (if time permits), such as a hierarchical model or a model with more parameters, practice applying these diagnostics. For instance, try a simple hierarchical model (two-level regression or the 8-schools problem) and monitor R-hat and ESS. Intentionally set a short run (few draws) and observe diagnostics indicating non-convergence; then progressively increase the draws until convergence criteria are met. This will give you a feel for how to judge if an MCMC run is sufficient. Summarize what R-hat and ESS values you consider acceptable and how you determined the final run length.

**References (Class 3):** Description of NUTS and its advantages, R-hat definition and usage, effective sample size concept ([Effective sample size for MCMC (Markov chain Monte Carlo)](https://www.johndcook.com/blog/2017/06/27/effective-sample-size-for-mcmc/#:~:text=There%E2%80%99s%20not%20much%20theory%20to,are%20dependent%2C%20they%E2%80%99re%20weakly%20correlated)).

---

## Class 4: Case Study – Applying MCMC in PyMC

**Theory and Approach:**  
In this session we’ll apply what we’ve learned to a real-world problem, taking it from problem description through model building to posterior analysis. The focus is on **choosing an appropriate model** for the data and **interpreting the results** in context. Key considerations include:  
- *Understanding the Problem and Data:* We start by examining the context and data-generating process. What are we trying to estimate or infer? What does the dataset look like (variables, distributions, etc.)? This understanding guides our choice of model. For example, if data are counts, a Poisson or Negative Binomial likelihood might be appropriate; if data are proportions, a Beta-Binomial model might be used; if we have groups, we might consider a hierarchical model to share information between groups.  
- *Choosing a Model Structure:* Based on the problem, decide on the model’s structure. This means selecting the likelihood (observation model) and priors. We strive for a model that is complex enough to capture the real phenomena but simple enough to be interpretable and computationally feasible. For instance, if we suspect outliers or heavier-than-normal tails in the data, using a Student-T likelihood instead of Normal can be appropriate. If we have multiple related datasets (like measurements from multiple schools or clinics), a hierarchical model can borrow strength across them.  
- *Implementing the Model in PyMC:* Once the model is sketched out, implementing in PyMC follows the patterns we’ve practiced: define priors, define observed likelihood, then sample. It’s often useful to start with relatively broad (weakly informative) priors unless strong prior information is available. We ensure the model code aligns with the problem (e.g., no data points are left unmodeled, parameters have proper support, etc.). After implementation, we run MCMC to get the posterior. We must check diagnostics (from Class 3) to ensure the results are reliable before interpretation.  
- *Interpreting and Communicating Results:* With a posterior sample in hand, we answer the original question. This could involve reporting the posterior mean and credible interval of a quantity of interest, computing probabilities of certain hypotheses (e.g., probability that a treatment effect is greater than zero), or making predictions. Emphasize estimation over null-hypothesis significance testing: in Bayesian analysis, we directly assess the magnitude and uncertainty of effects rather than rely on p-values. The interpretation should be in context (e.g., “Based on the posterior, we are 95% credible that the treatment increases IQ by between 2 and 5 points on average, relative to control”). Visualizing posterior distributions or predictive checks can also help communicate findings to stakeholders who may not be experts.

**Hands-On Case Study:** *Bayesian Estimation for a Two-Group Comparison (Treatment vs Control)*  
As a concrete case study, we will analyze a dataset comparing two groups – for example, a clinical trial of a new drug vs placebo measuring an outcome (such as an IQ score improvement). This scenario is similar to the "Bayesian Estimation Supersedes the t-Test" example. The goal is to determine how different the two groups are, in a fully Bayesian way.

1. **Data Description:** Suppose we have two sets of observations: `data_treatment` (outcomes for the treatment group, N=47) and `data_control` (outcomes for the control group, N=42). For instance, these could be IQ test score changes for participants. We want to infer the difference in mean outcomes between the treatment and control. Looking at the raw data (e.g., computing means, plotting histograms) might reveal that the distributions are roughly bell-shaped but with some outliers. We decide to use a **robust Bayesian two-sample comparison model**: each group’s data will be modeled with a Student-T distribution (to accommodate potential outliers or heavier tails than a normal distribution), with a separate mean for each group but, for simplicity, a shared scale and shared degrees of freedom across groups. This is a reasonable model if we believe the two groups differ primarily in their means but otherwise have similarly shaped distributions.  
2. **Model Specification:** We define parameters for the group means, a common standard deviation, and the degrees-of-freedom (ν) of the Student-T (which controls tail heaviness). Priors might be set as:  
   - $\mu_{\text{control}} \sim \mathcal{N}(0, 30)$ – prior for control mean (a wide prior covering a broad range of plausible IQ changes).  
   - $\mu_{\text{treatment}} \sim \mathcal{N}(0, 30)$ – prior for treatment mean. We could also express this as $\mu_{\text{treatment}} = \mu_{\text{control}} + \delta$ with a prior on $\delta$, but for simplicity we’ll use separate means and later compute the difference.  
   - $\sigma \sim \text{HalfNormal}(10)$ – prior for the common standard deviation (constrains $\sigma$ to positive values, and we expect typical variation of maybe 0–10 IQ points).  
   - $\nu \sim \text{Exponential}(1/30)$ – prior for degrees of freedom of the Student-T, favoring moderate to high ν (the mean of Exponential(1/30) is 30, so this prior suggests the distribution is not extremely heavy-tailed a priori, but can adapt if data demand).  
   Likelihood:  
   - For each observation in control group: $y_i^{(c)} \sim \text{StudentT}(\nu, \mu_{\text{control}}, \sigma)$.  
   - For each observation in treatment group: $y_j^{(t)} \sim \text{StudentT}(\nu, \mu_{\text{treatment}}, \sigma)$.  
   We assume equal $\sigma$ and $\nu$ for both groups to pool information about variance and tail-heaviness (this is a modeling choice; one could also give each group its own σ).  
3. **PyMC Implementation and Sampling:**  
   ```python
   with pm.Model() as drug_model:
       mu_control = pm.Normal("mu_control", mu=0, sigma=30)
       mu_treatment = pm.Normal("mu_treatment", mu=0, sigma=30)
       sigma = pm.HalfNormal("sigma", sigma=10)
       nu = pm.Exponential("nu", lam=1/30)  # degrees of freedom
       # Likelihoods for each group
       y_control = pm.StudentT("y_control", nu=nu, mu=mu_control, sigma=sigma, observed=data_control)
       y_treatment = pm.StudentT("y_treatment", nu=nu, mu=mu_treatment, sigma=sigma, observed=data_treatment)
       trace = pm.sample(draws=3000, tune=1000, chains=4, random_seed=42)
   ```  
   After running this, check the sampler output for any divergences. Given the moderately informative priors and a decent amount of data, NUTS should handle it, but if there are issues (like divergences due to the $\nu$ parameter, which can be tricky), one might tighten the prior on $\nu$ or increase `target_accept`.  
4. **Diagnostics:** We use the skills from Class 3 to verify convergence:  
   - Compute R-hat for all parameters (`mu_control`, `mu_treatment`, `sigma`, `nu`). We expect values near 1.  
   - Check ESS – with 4 chains * 3000 draws, we have plenty of samples; ESS should be high for all.  
   - Examine trace plots for any signs of non-convergence or multimodality. For example, if the chains for $\nu$ are mixing slowly (which can happen if the data don’t strongly inform ν), we should run longer chains or tighten the prior on ν rather than relax the R-hat threshold, and note that inference on ν is not very precise.  
5. **Posterior Analysis and Interpretation:**  
   - *Group Means:* Look at the posterior means and credible intervals for `mu_control` and `mu_treatment`. For example, we might find $\mu_{\text{control}}$ posterior ~ 0.5 (with 95% CI [-1.5, 2.5]) and $\mu_{\text{treatment}}$ ~ 4.0 (CI [2.0, 6.0]), in units of IQ points improvement. This would suggest the treatment group improved more on average.  
   - *Group Difference:* It’s often useful to directly compute the posterior of the difference $\delta = \mu_{\text{treatment}} - \mu_{\text{control}}$. We can get this from the trace:  
     ```python
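     # Name the difference so it appears as "delta" in the summary table
     diff = (trace.posterior["mu_treatment"] - trace.posterior["mu_control"]).rename("delta")
     az.summary(diff, hdi_prob=0.95)  # 95% interval, matching the text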
     ```  
     Or define `delta = pm.Deterministic("delta", mu_treatment - mu_control)` inside the model before sampling to have PyMC track it. Suppose the posterior for $\delta$ is centered around ~3.5 with a 95% credible interval [1.5, 5.5]. This is the estimated treatment effect (in IQ points). We can also calculate the probability that $\delta > 0$ (i.e., treatment is better than control): this is simply the fraction of posterior samples where `mu_treatment > mu_control`. If none of the 12,000 samples (3000 per chain * 4 chains) show a negative $\delta$, we can say $P(\text{treatment improvement} > 0) \approx 1.0$ (essentially certain given the model).  
   - *Interpretation:* We would report that the drug appears to increase the IQ improvement by about 3–4 points on average compared to placebo, with high confidence in a positive effect. We also interpret $\sigma$ (a common standard deviation of, say, 8–10 points would indicate substantial individual variability) and $\nu$ (if $\nu$ is, say, around 5–10, it indicates heavier tails than normal, meaning a few participants had atypical outcomes, perhaps some big improvements or declines).  
   - *Comparing to Traditional Analysis:* We might note that a classical t-test would give a p-value < 0.05 for difference in means, but the Bayesian approach provides richer information: the estimated magnitude of the difference and its uncertainty, and the probability of the effect being positive. This aligns with the idea that *“estimation rather than hypothesis testing”* is more informative. We directly address “how different are the groups” rather than a yes/no significance decision.  

6. **Communication:** Finally, students should practice explaining the model and results in words, as if to a collaborator or layperson. For example: *“We built a Bayesian model to estimate the average IQ score improvement for those who took the smart drug versus those who took a placebo. Based on our analysis, we’re highly confident the drug has a positive effect. On average, the drug group improved about 3-4 IQ points more than the placebo group. However, there’s individual variation – not everyone responds equally. We used a robust statistical model that can account for outliers, ensuring that a few unusual results don’t skew our conclusions. The results give a high probability (close to 100%) that the drug truly improves scores, which provides strong evidence in favor of the drug’s efficacy.”* This exercise of communication is important for solidifying understanding and for real-world applications.
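
The posterior summaries described in step 5 (mean difference, credible interval, and probability of a positive effect) can be computed from raw draws in a few lines. The sketch below is only an illustration: the draws are synthetic stand-ins generated with the standard library, not output from the drug model, and the helper `equal_tailed_interval` is a hypothetical name. It reports an equal-tailed interval, which can differ slightly from the HDI that ArviZ shows by default.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic stand-ins for posterior draws (NOT real trial results):
# 12,000 draws per parameter, mimicking 4 chains x 3000 draws.
n_draws = 12_000
mu_control = [rng.gauss(0.5, 0.7) for _ in range(n_draws)]
mu_treatment = [rng.gauss(4.0, 0.7) for _ in range(n_draws)]

# Posterior of the difference: subtract draw by draw, so that any
# correlation between the two parameters is carried through.
delta = [t - c for t, c in zip(mu_treatment, mu_control)]


def equal_tailed_interval(samples, prob=0.95):
    """Return the central interval that contains `prob` of the samples."""
    ordered = sorted(samples)
    lo_idx = int((1 - prob) / 2 * len(ordered))
    hi_idx = int((1 + prob) / 2 * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


delta_mean = statistics.fmean(delta)
ci_low, ci_high = equal_tailed_interval(delta)
# Probability of a positive effect = fraction of draws with delta > 0.
prob_positive = sum(d > 0 for d in delta) / len(delta)

print(f"mean difference: {delta_mean:.2f}")
print(f"95% interval: [{ci_low:.2f}, {ci_high:.2f}]")
print(f"P(delta > 0): {prob_positive:.3f}")
```

With a real trace, the same arithmetic applies to the flattened `mu_treatment` and `mu_control` draws.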

**Exercise:** Apply a similar workflow to a different case study. For example, take a dataset of counts (e.g., number of customer purchases per day before and after a marketing campaign) and use a Bayesian model to assess the campaign’s impact. Choose a model (perhaps a Poisson or Negative Binomial for counts) with appropriate priors, run MCMC, and interpret the posterior difference. Alternatively, if you have access to a dataset in your research area, attempt to model it with PyMC – start with a basic model and gradually increase complexity if needed. Ensure to check diagnostics and interpret results, comparing what you learn from the Bayesian approach to what classical methods might provide.

**References (Class 4):** Example of Bayesian approach to two-group comparison ([Bayesian Estimation Supersedes the T-Test — PyMC example gallery](https://www.pymc.io/projects/examples/en/latest/case_studies/BEST.html#:~:text=The%20de%20facto%20standard%20for,specified%20threshold%20value)) (Kruschke, 2013), principles of model building in Bayesian workflow (Gelman’s Bayesian data analysis concepts, *not directly cited but generally referenced*).

---

## Class 5: Model Evaluation and Comparison

**Theory and Concepts:**  
- *Why Evaluate and Compare Models:* In Bayesian analysis, just because we can build a complex model doesn’t mean it’s the best choice. We often consider multiple candidate models for our data (different structures, different predictors, etc.). **Model evaluation** checks how well a model fits the data and how plausible its predictions are, while **model comparison** helps us choose among models. Since Bayesian models can be compared on both their fit to observed data and their predictive performance on new data, we have specialized tools (criteria) for this purpose. Unlike frequentist model comparison which might use AIC/BIC or likelihood ratio tests, Bayesian model comparison often relies on information criteria that are fully Bayesian or cross-validation approaches (and occasionally Bayes Factors, though those require careful prior choices).  
- *Posterior Predictive Checks (PPCs):* Before numerical comparisons, it’s important to do **posterior predictive checks** to evaluate model fit qualitatively. The idea is to simulate data from the fitted model (using the posterior samples of parameters) and compare these simulated (replicated) data to the actual observations. If the model is a good fit, the replicated data should look similar to the real data on relevant features (e.g., means, variances, distribution shape, number of zeroes, etc.). For example, you might simulate 100 datasets from the posterior and overplot their distributions against the observed data – systematic discrepancies indicate the model is not capturing something. PPCs can reveal model mis-specifications: e.g., if your model’s simulated data have far fewer extreme values than the real data, it suggests your model’s tails are too light (perhaps you need a heavier-tailed likelihood or a mixture to capture outliers). PPCs are a **graphical, intuitive way** to validate a model. They complement numerical diagnostics by answering “does the model generate data that resemble what we observed?”.  
- *Information Criteria for Model Comparison:* When comparing models, we often use metrics that estimate out-of-sample predictive performance. Three common criteria in Bayesian analysis are **DIC**, **WAIC**, and **LOO**:  
  - **Deviance Information Criterion (DIC):** An older metric that is similar in spirit to AIC. It uses the deviance (–2 * log-likelihood) at the posterior mean of parameters plus a penalty for model complexity (effective number of parameters). DIC is easy to compute but has some shortcomings (it relies on a point estimate of the posterior and can misbehave for complex posteriors, e.g., multimodal or very non-normal posteriors). DIC is **not fully Bayesian**, as it doesn’t integrate over the posterior uncertainty fully, and is generally being superseded by WAIC and LOO ([WAIC and cross-validation in Stan!](https://statmodeling.stat.columbia.edu/2014/05/26/waic-cross-validation-stan/#:~:text=WAIC%20and%20cross,comes%20as%20a%20posterior)).  
  - **WAIC (Widely Applicable Information Criterion):** A fully Bayesian criterion that sums, over data points, the log of the posterior-averaged likelihood (the log pointwise predictive density) and then adjusts for overfitting by subtracting an effective-number-of-parameters penalty. It approximates Bayesian leave-one-out cross-validation without refitting the model. WAIC is invariant to parameterization and is suitable for a wide range of models (hence “widely applicable”). On the deviance scale (–2 × elpd), a lower WAIC indicates better expected predictive accuracy; ArviZ reports the log scale by default, where a higher `elpd_waic` is better. We often look at differences in WAIC between models: a difference above roughly 10 on the deviance scale is sometimes treated as substantial, but because WAIC comes with a standard error, a difference is best judged relative to the standard error of that difference.  
  - **LOO (Leave-One-Out Cross-Validation, especially PSIS-LOO):** This method estimates model predictive performance by conceptually leaving out each data point in turn, fitting the model to the rest, and evaluating predictive density on the left-out point. Doing this exactly for Bayesian models is expensive, but **Pareto-smoothed importance sampling LOO** (PSIS-LOO) uses the existing posterior draws to approximate this, providing a fast estimate of what the performance would be if we validated on each data point in turn. LOO gives an estimate of the expected log pointwise predictive density (elpd) for new data. As with WAIC on the log scale, a higher elpd is better: lower (more negative) elpd values indicate worse expected predictive accuracy. LOO is often preferred over DIC/WAIC when available, because it has better theoretical properties and diagnostics (it will flag if the approximation is failing for certain points via Pareto k values). In fact, WAIC and LOO are asymptotically equivalent, but LOO with PSIS tends to be more robust for finite samples.  
- *Using WAIC/LOO in Practice:* ArviZ (which PyMC leverages) provides convenient functions to compute these: `az.waic(trace)` and `az.loo(trace)`, given an `InferenceData` object that contains a `log_likelihood` group (in recent PyMC versions, request it with `pm.sample(..., idata_kwargs={"log_likelihood": True})`). More commonly, `az.compare({"model1": trace1, "model2": trace2}, ic="loo")` can compare multiple models’ LOO or WAIC at once. The output will rank models by the chosen criterion and provide differences (Δelpd) and weights (approximate model weights akin to Akaike weights). It’s important to note that these criteria estimate predictive performance; a model with lower WAIC/higher elpd is expected to predict new data better. This doesn’t automatically mean it’s the “true” model, but it’s a pragmatic choice for model selection. Always consider the size of the difference relative to its standard error: if the difference between two models is within 1–2 times the SE of that difference, they are essentially tied in predictive performance.  
- *Model Checking vs. Selection:* Emphasize that before choosing a model, one should check that each candidate model is fitting its data well (using PPCs and diagnostics). A highly complex model may score better on WAIC but could be overfitting if it’s not actually warranted by data patterns. Conversely, a simpler model might have only slightly worse WAIC but be preferable for interpretability or theory. Bayesian model comparison is a tool, not an absolute verdict – it should be used in conjunction with substantive knowledge. Also mention that if models are very different in structure, other techniques like Bayes factors or stacking might be considered, but those are beyond our scope here.  
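
To make the WAIC definition above concrete, here is a from-scratch sketch that computes the log pointwise predictive density (lppd), the effective number of parameters, and the resulting elpd from a matrix of pointwise log-likelihoods. Everything in it is an assumption chosen for illustration: the data are synthetic, the "posterior draws" come from the known conjugate result for a normal mean with known variance and a flat prior, and the function names are hypothetical. In practice `az.waic` does this work for you.

```python
import math
import random
import statistics

rng = random.Random(0)

# Synthetic data and synthetic posterior draws (illustrative only).
sigma = 1.0                                   # known observation noise
y = [rng.gauss(0.0, sigma) for _ in range(50)]
y_bar = statistics.fmean(y)
# With a flat prior, the posterior of the mean is Normal(y_bar, sigma / sqrt(n)).
mu_draws = [rng.gauss(y_bar, sigma / math.sqrt(len(y))) for _ in range(2000)]


def normal_logpdf(x, mu, sd):
    return -0.5 * math.log(2 * math.pi * sd**2) - 0.5 * ((x - mu) / sd) ** 2


def log_mean_exp(values):
    """Numerically stable log(mean(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(log_lik):
    """log_lik[i] holds the log-likelihood of data point i under every draw."""
    # lppd: log of the posterior-averaged likelihood, summed over points.
    lppd = sum(log_mean_exp(col) for col in log_lik)
    # Penalty: posterior variance of the log-likelihood, summed over points.
    p_waic = sum(statistics.variance(col) for col in log_lik)
    return lppd - p_waic, p_waic, lppd


log_lik = [[normal_logpdf(yi, mu, sigma) for mu in mu_draws] for yi in y]
elpd_waic, p_waic, lppd = waic(log_lik)
print(f"lppd={lppd:.1f}  p_waic={p_waic:.2f}  elpd_waic={elpd_waic:.1f}  "
      f"WAIC (deviance scale)={-2 * elpd_waic:.1f}")
```

Since this toy model has a single free parameter, `p_waic` should come out close to 1, which is a useful sanity check on the penalty term.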

**Hands-On Practice:** *Posterior Predictive Checks and Model Comparison on a Case Study*  
We will continue with the previous class’s case study (or another dataset) to demonstrate model evaluation. Suppose we suspect that our model from Class 4 might be too simplistic or too complex, and we want to compare it with an alternative. We’ll also perform PPCs on the chosen model to validate its fit.

1. **Posterior Predictive Check (PPC):** Using the drug trial example from Class 4, let’s perform a PPC on the fitted model (`drug_model`). PyMC can generate posterior predictive samples easily:  
   ```python
   with drug_model:
       ppc = pm.sample_posterior_predictive(trace, var_names=["y_control", "y_treatment"], random_seed=42)
   ```  
   This will draw samples of `y_control` and `y_treatment` from the posterior predictive distribution. Essentially, it simulates possible datasets for each group, given the uncertainty in our parameters. We can then compare these to the actual `data_control` and `data_treatment`. One way to do this is using ArviZ’s `plot_ppc`:  
   ```python
   # In PyMC v4+, `ppc` is already an InferenceData with posterior_predictive and observed_data groups
   az.plot_ppc(ppc, var_names=["y_control"], num_pp_samples=100)
   ```  
   (The resulting plot overlays the distributions of the simulated control outcomes on the distribution of the actual control outcomes.)  
   *What to look for:* Do the simulated and actual data have similar means and spreads? For example, calculate the mean of each simulated dataset and compare to observed mean. If the observed mean lies in the middle of the distribution of simulated means, that’s good. Also check extreme values: does the range of simulated IQ changes cover the range of actual changes? If, say, the worst decline in actual data was -15, but simulated data rarely go below -10, the model might be underestimating tail risk. On the other hand, if everything lines up (simulated and actual quantiles match closely), it indicates the model can replicate data similar to what we saw – a sign of a good fit.  
   For our example, if the model was a Student-T, it likely handles tails decently. If we had used a normal instead and the data had outliers, PPC might have revealed discrepancies (more extreme observed values than the model predicts). Students should summarize any differences observed and think about whether they indicate a problem (and how to address it if so).  
2. **Model Comparison Setup:** Let’s define an alternative model to compare with our current model. For instance, perhaps we want to test if assuming equal variance was okay. We could build a second model where treatment and control have separate $\sigma$ parameters. Or, alternatively, compare a **non-robust model** (using Normal likelihood) to our **robust model** (Student-T). Let’s do the latter for illustration:  
   - **Model 1 (robust):** The Student-T model from Class 4 (`drug_model`).  
   - **Model 2 (normal):** A similar model but using Normal likelihood for both groups instead of StudentT (dropping the ν parameter). The normal model assumes normally distributed outcomes without heavy tails.  
   Define Model 2 in PyMC:  
   ```python
   with pm.Model() as drug_model_normal:
       mu_control = pm.Normal("mu_control", mu=0, sigma=30)
       mu_treatment = pm.Normal("mu_treatment", mu=0, sigma=30)
       sigma_control = pm.HalfNormal("sigma_control", sigma=10)
       sigma_treatment = pm.HalfNormal("sigma_treatment", sigma=10)
       y_control = pm.Normal("y_control", mu=mu_control, sigma=sigma_control, observed=data_control)
       y_treatment = pm.Normal("y_treatment", mu=mu_treatment, sigma=sigma_treatment, observed=data_treatment)
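       # Store the pointwise log-likelihood, which WAIC/LOO need (not stored by default in recent PyMC)
       trace_normal = pm.sample(draws=3000, tune=1000, chains=4, random_seed=42, idata_kwargs={"log_likelihood": True})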
   ```  
   (Here we even allowed each group its own $\sigma$ for fairness, since the Student-T model had a common σ but heavier tails.) Check diagnostics for `trace_normal` – likely it converges fine (normals are easy to sample).  
3. **Compute WAIC / LOO for Both Models:** Now we use ArviZ to calculate information criteria:  
   ```python
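   # Assumes both traces contain a log_likelihood group.
   # Each model has two observed variables, so ArviZ must be told which
   # pointwise log-likelihood to score; here we score the treatment group.
   waic_robust = az.waic(trace, var_name="y_treatment", pointwise=True)
   waic_normal = az.waic(trace_normal, var_name="y_treatment", pointwise=True)
   print(waic_robust.elpd_waic, waic_normal.elpd_waic)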
   ```  
   Or directly compare:  
   ```python
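   # `ic` selects the criterion; `var_name` picks one of the two observed variables
   compare_df = az.compare({"robust_model": trace, "normal_model": trace_normal}, ic="waic", var_name="y_treatment")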
   print(compare_df)
   ```  
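   This will output a DataFrame with one row per model, including the criterion value, its difference from the best model, standard errors, and a weight (which can be loosely interpreted as the relative support for each model for out-of-sample prediction in the set). Column names and the default scale depend on the ArviZ version (recent releases report `elpd_waic`, `elpd_diff`, `se`, and `dse` on the log scale, where higher is better); the illustrative layout below uses deviance-scale WAIC, where lower is better. For example, we might see something like:  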
   |           |   rank |   waic |   waic_se |   p_waic |   d_waic |   weight |  
   |-----------|-------:|-------:|----------:|---------:|---------:|---------:|  
   | robust_model  | 0 | 123.4 | 5.6      | 5.1     | 0.0     | 0.95    |  
   | normal_model  | 1 | 130.2 | 5.2      | 4.0     | 6.8     | 0.05    |  
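   This indicates the robust model has a lower WAIC (better estimated predictive accuracy) by about 6.8 points. Relative to the per-model WAIC SE of ~5, this difference is on the order of ~1.3 SE, which is suggestive rather than decisive. Strictly, the difference should be judged against the standard error of the difference itself (`dse` in recent ArviZ output), which is often smaller than the per-model SE because the pointwise errors of the two models are correlated. The weights suggest ~95% preference for the robust model. We would interpret that the Student-T model is likely better at predicting new data (which might contain outliers) than the normal model.  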
   Additionally, examine the effective number of parameters (`p_waic`). The Student-T model might show a higher p_waic (used more “parameters” to fit the data, due to the extra ν flexibility), which is expected since it’s more complex. However, it paid off in lower WAIC.  
4. **Model Choice:** Based on the comparison, we would choose the robust model in this case. We also confirm via PPCs that the normal model was indeed lacking – we could do a PPC for `drug_model_normal` and likely find it struggles with outliers (maybe predicting too few extreme values). The robust model, while a bit more complex, improved predictive performance. This demonstrates the trade-off that information criteria capture: **goodness of fit vs. complexity**.  
   Emphasize to students that if the WAIC/LOO differences were very small relative to their standard error, one might prefer the simpler model for parsimony. Here the difference favors the robust model and the PPCs point the same way, which supports the use of the more complex model. Also, always consider if the model makes sense scientifically, not just statistically – e.g., if the Normal model had won, one could argue it’s sufficient and easier to interpret (no ν parameter).  
5. **Other Evaluation Tools:** Mention briefly that there are other ways to compare models:  
   - **LOO predictive distributions**: We could examine the distribution of residuals or predictions for left-out points (if one model consistently predicts certain points poorly, that’s informative).  
   - **Bayes Factors**: comparing marginal likelihoods of models, which we won’t compute here due to sensitivity to priors and computational difficulty, but students might encounter in Bayesian literature.  
   - **Cross-validation**: one can manually do k-fold CV in a Bayesian context (which is heavy to compute but conceptually straightforward). WAIC and PSIS-LOO are popular because they approximate this efficiently.  
   - **Calibration checks**: Ensure that predictive intervals indeed contain the observed proportion of data they’re supposed to (e.g., ~90% of data points fall in the model’s 90% predictive interval). This can be done via simulation as well.  

6. **Conclusion:** To wrap up the course, highlight that a Bayesian modeling workflow is iterative: you build a model, check fit (PPCs), possibly compare with alternatives, and refine as needed. The power of PyMC and MCMC is that we can fit very flexible models – but with that power comes the responsibility to verify that our models are appropriate and to avoid overfitting by using principled comparisons. The end goal is a model that *best balances fit and simplicity* and provides insights into the problem at hand. Encourage students to apply these concepts to their own data and continue practicing interpretation and validation of Bayesian models.
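
The tail check described in step 1 (does the model ever generate a decline as extreme as the worst one observed?) can be turned into a number. The sketch below is a minimal illustration under stated assumptions: the observed values are hypothetical, and the posterior draws are a rough large-sample stand-in for a fitted Normal model rather than an MCMC trace. It simulates one replicated dataset per draw and reports how often the replicated minimum is at least as extreme as the observed minimum.

```python
import math
import random
import statistics

rng = random.Random(1)

# Hypothetical observed changes with one extreme decline (not real data).
observed = [-15.0, -3.1, -2.4, -1.8, -1.2, -0.9, -0.5, -0.2, 0.1, 0.4,
            0.6, 0.9, 1.3, 1.6, 2.0, 2.3, 2.7, 3.2, 3.8, 4.5]
n = len(observed)
y_bar = statistics.fmean(observed)
s = statistics.stdev(observed)

# Rough stand-in for posterior draws from a fitted Normal model
# (large-sample approximation; a real analysis would use the MCMC trace).
n_draws = 2000
mu_draws = [rng.gauss(y_bar, s / math.sqrt(n)) for _ in range(n_draws)]
sigma_draws = [abs(rng.gauss(s, s / math.sqrt(2 * n))) for _ in range(n_draws)]

# One replicated dataset per posterior draw; record its minimum.
rep_minima = [
    min(rng.gauss(mu, sigma) for _ in range(n))
    for mu, sigma in zip(mu_draws, sigma_draws)
]

# Posterior predictive p-value for the minimum: how often does the model
# generate a decline at least as extreme as the worst one observed?
p_value = sum(m <= min(observed) for m in rep_minima) / n_draws
print(f"observed min: {min(observed):.1f}")
print(f"median replicated min: {statistics.median(rep_minima):.1f}")
print(f"P(replicated min <= observed min): {p_value:.3f}")
```

A p-value near 0 or 1 for a chosen statistic signals that the model rarely reproduces that feature of the data, which is the kind of discrepancy that motivates a heavier-tailed likelihood.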

**Exercise:** Given a scenario with two or more competing models (for example, linear vs. polynomial regression on a dataset, or a model with a certain interaction effect vs. one without), use PyMC to fit both and then compare them using WAIC or LOO. Perform posterior predictive checks on the preferred model to see if there are any systematic patterns in the residuals that the model doesn’t capture (if so, that might inspire yet another model!). Write a short report on which model you chose and why, including the numerical criteria and substantive reasoning. This will reinforce a full modeling cycle from conception to evaluation.

**References (Class 5):** Posterior predictive check concept, WAIC and LOO definitions and advantages, practical comparison of models using WAIC/LOO (Vehtari et al. 2017).

Markov Chain Monte Carlo (MCMC) algorithms are used to sample from complex probability distributions, especially in Bayesian inference where the posterior is intractable. The main MCMC algorithms are:

### **1. Metropolis-Hastings (MH) Algorithm**
- The **Metropolis-Hastings algorithm** is the foundation of many MCMC methods.
- It proposes a new sample \( \theta' \) based on the current sample \( \theta \) using a proposal distribution \( q(\theta' | \theta) \).
- The new sample is accepted with probability:
  \[
  A = \min \left( 1, \frac{p(\theta') q(\theta | \theta')}{p(\theta) q(\theta' | \theta)} \right)
  \]
  where \( p(\theta) \) is the target distribution (e.g., posterior).
- If \( q(\theta' | \theta) \) is symmetric (e.g., Normal proposal \( q(\theta' | \theta) = \mathcal{N}(\theta, \sigma^2) \)), the algorithm reduces to the **Metropolis algorithm**.
- Strengths: Simple, flexible.
- Weaknesses: Requires manual tuning of proposal distribution; can suffer from slow mixing in high dimensions.
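
As a minimal sketch of the acceptance rule above, the following sampler uses an asymmetric proposal so that the Hastings correction actually matters. The choices are assumptions made for illustration: the target is an unnormalised Gamma(shape=2, rate=1) density, and the proposal is a log-normal random walk, for which the ratio \( q(\theta | \theta') / q(\theta' | \theta) \) simplifies to \( \theta' / \theta \).

```python
import math
import random
import statistics


def log_target(theta):
    """Unnormalised log-density of Gamma(shape=2, rate=1); requires theta > 0."""
    return math.log(theta) - theta


def metropolis_hastings(n_samples, theta0=1.0, step=0.8, seed=0):
    rng = random.Random(seed)
    theta = theta0
    samples, accepted = [], 0
    for _ in range(n_samples):
        # Asymmetric log-normal proposal: theta' = theta * exp(step * z).
        proposal = theta * math.exp(step * rng.gauss(0.0, 1.0))
        # Log acceptance ratio = target ratio + Hastings correction,
        # where q(theta | theta') / q(theta' | theta) = theta' / theta.
        log_ratio = (log_target(proposal) - log_target(theta)
                     + math.log(proposal) - math.log(theta))
        # Accept with probability min(1, exp(log_ratio)).
        if math.log(1.0 - rng.random()) < log_ratio:
            theta = proposal
            accepted += 1
        samples.append(theta)
    return samples, accepted / n_samples


samples, accept_rate = metropolis_hastings(20_000)
kept = samples[2_000:]          # discard burn-in
print(f"sample mean ~ {statistics.fmean(kept):.2f} (exact mean of the target: 2.0)")
print(f"acceptance rate: {accept_rate:.2f}")
```

Dropping the correction term would make the chain target a different distribution, which is an easy mistake to make with non-symmetric proposals.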

### **2. Gibbs Sampling**
- Gibbs sampling is a special case of Metropolis-Hastings where proposals are always accepted.
- It samples each parameter sequentially from its **full conditional distribution** given the other parameters:
  \[
  \theta_i^{(t+1)} \sim p(\theta_i | \theta_1^{(t+1)}, ..., \theta_{i-1}^{(t+1)}, \theta_{i+1}^{(t)}, ..., \theta_n^{(t)})
  \]
- Requires knowing and efficiently sampling from the full conditionals.
- Strengths: Efficient when full conditionals are known analytically.
- Weaknesses: Can mix slowly if variables are highly correlated.
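
A small sketch of the update rule above, assuming a standard bivariate normal target with correlation `rho`, for which both full conditionals are known in closed form: \( x | y \sim \mathcal{N}(\rho y, 1 - \rho^2) \) and symmetrically for \( y | x \). The function names are hypothetical.

```python
import math
import random
import statistics


def gibbs_bivariate_normal(n_samples, rho, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho**2)     # sd of each full conditional
    x, y = 0.0, 0.0
    xs, ys = [], []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, cond_sd)   # x | y ~ Normal(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, cond_sd)   # y | x uses the freshly updated x
        xs.append(x)
        ys.append(y)
    return xs, ys


def correlation(a, b):
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)
    return cov / (statistics.stdev(a) * statistics.stdev(b))


xs, ys = gibbs_bivariate_normal(20_000, rho=0.8)
print(f"sample correlation: {correlation(xs, ys):.2f} (target: 0.80)")
```

Pushing `rho` towards 1 shrinks each conditional step, which is the slow-mixing behaviour noted in the weaknesses above.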

### **3. Hamiltonian Monte Carlo (HMC)**
- **Hamiltonian Monte Carlo (HMC)** improves upon Metropolis-Hastings by using **gradient information** to propose efficient moves.
- It treats the parameter space as a physical system with a position \( \theta \) and momentum \( p \), governed by Hamiltonian dynamics.
- A series of **leapfrog steps** integrate the Hamiltonian equations:
  \[
  p' = p - \frac{\epsilon}{2} \nabla_{\theta} U(\theta), \quad
  \theta' = \theta + \epsilon p', \quad
  p'' = p' - \frac{\epsilon}{2} \nabla_{\theta} U(\theta')
  \]
  where \( U(\theta) \) is the negative log-posterior and \( \epsilon \) is the step size.
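- After \( L \) leapfrog steps, the end state \( (\theta^*, p^*) \) is accepted with Metropolis probability \( \min\left(1, \exp\left(H(\theta, p) - H(\theta^*, p^*)\right)\right) \), where \( H(\theta, p) = U(\theta) + \frac{1}{2} p^\top p \).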
- **No-U-Turn Sampler (NUTS)** is an adaptive version of HMC that automatically chooses the number of leapfrog steps for each trajectory; in practice the step size is tuned during warm-up as well.
- Strengths: Efficient in high dimensions; faster mixing than Metropolis.
- Weaknesses: Requires gradient computation; more computationally intensive.
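
The leapfrog update and the accept/reject step can be sketched for a one-dimensional target. This is an illustration under simplifying assumptions: the target is a standard normal, so \( U(\theta) = \theta^2 / 2 \) and its gradient is \( \theta \), the mass is fixed at 1, and the step size and number of steps are hand-picked rather than adapted as NUTS would do.

```python
import math
import random
import statistics


def U(theta):
    """Negative log-density of N(0, 1), up to a constant."""
    return 0.5 * theta**2


def grad_U(theta):
    return theta


def leapfrog(theta, p, eps, n_steps):
    """Integrate Hamiltonian dynamics with unit mass using the leapfrog scheme."""
    p -= 0.5 * eps * grad_U(theta)          # initial half step for momentum
    for step in range(n_steps):
        theta += eps * p                    # full step for position
        if step < n_steps - 1:
            p -= eps * grad_U(theta)        # full momentum step between positions
    p -= 0.5 * eps * grad_U(theta)          # final half step for momentum
    return theta, p


def hmc(n_samples, eps=0.2, n_steps=8, seed=0):
    rng = random.Random(seed)
    theta, samples, accepted = 0.0, [], 0
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)            # resample momentum each iteration
        theta_new, p_new = leapfrog(theta, p0, eps, n_steps)
        h_old = U(theta) + 0.5 * p0**2
        h_new = U(theta_new) + 0.5 * p_new**2
        # Metropolis correction for the integrator's energy error.
        if math.log(1.0 - rng.random()) < h_old - h_new:
            theta = theta_new
            accepted += 1
        samples.append(theta)
    return samples, accepted / n_samples


samples, accept_rate = hmc(5_000)
print(f"mean={statistics.fmean(samples):.2f}  "
      f"var={statistics.variance(samples):.2f}  accept={accept_rate:.2f}")
```

Because leapfrog nearly conserves the Hamiltonian, almost every proposal is accepted even though each one moves far from the current point.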

### **4. Slice Sampling**
- Slice sampling works by sampling uniformly from the region where the posterior density is above a randomly chosen threshold.
- It introduces an auxiliary variable \( u \) and samples:
  \[
  u \sim \text{Uniform}(0, p(\theta)), \quad \theta' \sim \text{Uniform}(\{ \theta : p(\theta) \geq u \})
  \]
- The sampling region is usually determined by an **expanding and shrinking bracket** around \( \theta \).
- Strengths: Does not require tuning a proposal distribution.
- Weaknesses: Can be inefficient for multimodal distributions.
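
A one-dimensional sketch of the procedure above, using the stepping-out and shrinkage scheme for the bracket. The target (a standard normal) and the initial bracket width are assumptions chosen so the result is easy to check; the code works on the log scale to avoid underflow.

```python
import math
import random
import statistics


def log_p(x):
    """Unnormalised log-density of the target, here a standard normal."""
    return -0.5 * x * x


def slice_sample(n_samples, x0=0.0, width=1.0, seed=0):
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        # Auxiliary threshold: log u with u ~ Uniform(0, p(x)).
        log_u = log_p(x) + math.log(1.0 - rng.random())
        # Step out: place a bracket of size `width` around x, then expand it
        # until both ends fall outside the slice {x : p(x) >= u}.
        left = x - width * rng.random()
        right = left + width
        while log_p(left) > log_u:
            left -= width
        while log_p(right) > log_u:
            right += width
        # Shrink: draw uniformly from the bracket, pulling the bracket in
        # towards x after every rejected candidate.
        while True:
            candidate = rng.uniform(left, right)
            if log_p(candidate) >= log_u:
                x = candidate
                break
            if candidate < x:
                left = candidate
            else:
                right = candidate
        samples.append(x)
    return samples


samples = slice_sample(10_000)
print(f"mean={statistics.fmean(samples):.2f}  var={statistics.variance(samples):.2f}")
```

The `width` argument only affects efficiency, not correctness, which is why slice sampling needs less tuning than a random-walk proposal.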

### **5. Random-Walk Metropolis (RWM)**
- A simple form of Metropolis-Hastings where proposals are drawn from a symmetric normal distribution:
  \[
  \theta' \sim \mathcal{N}(\theta, \sigma^2)
  \]
- Commonly used as a simple baseline, but **step size tuning is crucial** for good performance, and mixing degrades as the dimension grows.
- Strengths: Simple implementation.
- Weaknesses: Slow mixing in correlated or high-dimensional spaces.
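
To see why step size matters, the sketch below runs random-walk Metropolis on a standard normal target with three hand-picked proposal scales and reports the acceptance rate for each. The target and the scales are assumptions chosen for illustration.

```python
import math
import random


def rwm_acceptance_rate(step, n_samples=5_000, seed=0):
    """Run random-walk Metropolis on N(0, 1) and return the acceptance rate."""
    rng = random.Random(seed)
    theta, accepted = 0.0, 0
    for _ in range(n_samples):
        proposal = rng.gauss(theta, step)                 # symmetric proposal
        # log p(theta') - log p(theta) for the standard normal target.
        log_ratio = 0.5 * theta**2 - 0.5 * proposal**2
        if math.log(1.0 - rng.random()) < log_ratio:
            theta = proposal
            accepted += 1
    return accepted / n_samples


for step in (0.1, 2.4, 50.0):
    print(f"step={step:>5}: acceptance rate = {rwm_acceptance_rate(step):.2f}")
```

Both extremes explore poorly: tiny steps are almost always accepted but barely move, while huge steps are almost always rejected, so the chain stays put.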

### **Comparison and Use Cases**
| Algorithm             | Strengths                                      | Weaknesses |
|----------------------|--------------------------------|-------------|
| **Metropolis-Hastings** | General-purpose, works with any proposal | Can mix slowly, requires tuning |
| **Gibbs Sampling** | Efficient when full conditionals are available | Cannot be used if full conditionals are unknown |
| **Hamiltonian Monte Carlo (HMC)** | Fast mixing in high dimensions, uses gradients | Computationally expensive |
| **No-U-Turn Sampler (NUTS)** | Adaptive HMC, requires no manual tuning | Still expensive, requires gradient computation |
| **Slice Sampling** | No need for proposal tuning | Can be inefficient for complex posteriors |
| **Random-Walk Metropolis (RWM)** | Simple to implement | Requires careful tuning, slow in high dimensions |

For practical applications in **PyMC**, the **No-U-Turn Sampler (NUTS)** is the default and is usually the best choice for continuous models.
