# Session 2: Building Models with PyMC and MCMC Fundamentals

In the previous session, we explored the foundations of Bayesian inference, including Bayes' theorem, conjugate priors, and the mechanics of Bayesian updating. While conjugate priors provide elegant closed-form solutions, they are limited to a small set of model families. Real-world problems often require more flexible models that don't have analytical solutions.

This session introduces PyMC, a powerful probabilistic programming framework that enables us to build and analyze complex Bayesian models. We'll learn how to specify models using PyMC's intuitive API, understand the theoretical foundations of Markov Chain Monte Carlo (MCMC) methods that make modern Bayesian computation possible, and work through practical examples that demonstrate the complete modeling workflow.

## Learning Objectives

By the end of this session, you will be able to:

1. **Build probabilistic models in PyMC**: Understand PyMC's core components including distributions, random variables, and the model context
2. **Specify model structure**: Learn how to encode assumptions about data generating processes using priors and likelihoods
3. **Understand MCMC fundamentals**: Grasp why we need MCMC, how it works conceptually, and what makes it powerful for Bayesian inference
4. **Implement complete Bayesian analyses**: Build, fit, and interpret results from real-world models including linear regression

## Why PyMC?

PyMC provides several key advantages for Bayesian modeling:

- **Expressive model specification**: Write models that look like their mathematical notation
- **Automatic differentiation**: No need to derive gradients by hand
- **State-of-the-art samplers**: Access to efficient MCMC algorithms like NUTS (No-U-Turn Sampler)
- **Comprehensive diagnostics**: Built-in tools for assessing convergence and model quality
- **Integration with the PyData ecosystem**: Works seamlessly with NumPy, Pandas, and visualization libraries

Let's begin by setting up our environment and exploring PyMC's fundamental concepts.

In [None]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import polars as pl
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
import scipy.stats as st
from itertools import takewhile
import warnings

warnings.simplefilter("ignore")

# Set random seed for reproducibility
rng = np.random.default_rng(42)

## Part 1: Building Models in PyMC

Probabilistic programming represents a paradigm shift in how we approach statistical modeling. Instead of deriving update equations or coding samplers by hand, we declare the structure of our model and let the framework handle the computational details. PyMC exemplifies this approach by providing an intuitive interface that closely mirrors mathematical notation while leveraging sophisticated algorithms under the hood.

### The Philosophy of Probabilistic Programming

In traditional statistical programming, we often work backwards from the inference algorithm. We might derive the posterior distribution analytically (if we're lucky), implement a specific sampling scheme, or resort to approximations. This approach becomes increasingly difficult as models grow in complexity.

Probabilistic programming flips this workflow. We start by specifying what we know:
- **Prior knowledge** about parameters before seeing data
- **The data generating process** that connects parameters to observations
- **The observed data** itself

From this specification, PyMC automatically constructs the computational graph needed for inference, applies appropriate transformations for constrained parameters, and selects suitable sampling algorithms.

### PyMC's Core Abstractions

PyMC organizes probabilistic models around several key concepts:

1. **The Model Context**: Every PyMC model exists within a context that tracks relationships between variables. This context manager pattern ensures that all model components are properly registered and connected.

2. **Random Variables**: These represent quantities with uncertainty. In Bayesian modeling, parameters are random variables with prior distributions, and data are random variables with likelihood distributions.

3. **Distributions**: PyMC provides a comprehensive library of probability distributions. Each distribution can create random variables when used within a model context.

4. **Deterministic Transformations**: Often we need to transform parameters or compute derived quantities. PyMC tracks these deterministic relationships to maintain the full model structure.

5. **Observed Data**: By marking random variables as observed, we condition the model on actual data, transforming prior distributions into posterior distributions.

Let's explore these concepts through hands-on examples.

### The Distribution Class

At the heart of PyMC lies the `Distribution` class, which encapsulates probability distributions and their properties. Understanding how distributions work in PyMC is crucial for effective model building.

#### Random Variables and Distributions

In probability theory, a random variable is a function that assigns numerical values to outcomes of random phenomena. In PyMC, we create random variables by instantiating distributions within a model context. This seemingly simple act triggers a sophisticated chain of events:

1. **Registration**: The variable is registered with the model's computational graph
2. **Transformation**: If the variable has constrained support (e.g., positive only), PyMC automatically applies bijective transformations to map it to unconstrained space
3. **Tracking**: Dependencies between variables are recorded for efficient computation
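
To make the transformation step concrete, here is a minimal NumPy sketch of the kind of log transform applied to a positive-only variable. It is not PyMC's internal code, and the helper names `half_normal_pdf` and `log_space_pdf` are illustrative only. Moving from $\sigma > 0$ to the unconstrained $y = \log \sigma$ requires multiplying the density by the Jacobian $|d\sigma/dy| = e^y$, so that the transformed density still integrates to one.

```python
import numpy as np

def half_normal_pdf(s, scale=1.0):
    """Density of a HalfNormal(scale) variable, defined for s >= 0."""
    return np.sqrt(2.0 / np.pi) / scale * np.exp(-0.5 * (s / scale) ** 2)

def log_space_pdf(y, scale=1.0):
    """Density of y = log(s): original density times the Jacobian exp(y)."""
    return half_normal_pdf(np.exp(y), scale) * np.exp(y)

# Integrate the transformed density over (effectively) the whole real line
# using the trapezoidal rule
y_grid = np.linspace(-20.0, 5.0, 200_001)
dens = log_space_pdf(y_grid)
total_mass = np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(y_grid))

print(f"Mass of transformed density: {total_mass:.6f}")
```

Without the `np.exp(y)` factor the transformed function would no longer be a valid density, which is why the transformation and its Jacobian always travel together.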

#### Key Properties of Distributions

Every PyMC distribution supports several core operations:

- **Sampling**: Generate random draws from the distribution (for example with `pm.draw`)
- **Log probability**: Evaluate the log probability density/mass at specific values (with `pm.logp`)
- **Moments**: Access theoretical moments like mean and variance
- **Support**: The valid range of values for the distribution

#### Common Arguments

When creating distributions in PyMC, you'll encounter these standard arguments:

- **`name`** (required): A unique string identifier for the variable. This name is used in results, plots, and diagnostics.

- **`shape`**: Specifies the dimensions of the variable. For example, `shape=(3,)` creates a length-3 vector of independent random variables that all follow the same distribution.

- **`dims`**: Named dimensions that provide semantic meaning and integrate with labeled coordinates. This is particularly useful for hierarchical models.

- **`observed`**: Providing data here conditions the distribution, turning it from a prior into a likelihood. This is how we incorporate evidence into our models.

- **`transform`**: While usually handled automatically, you can specify custom transformations for specialized needs.

- **`initval`**: Starting values for sampling algorithms. PyMC usually chooses sensible defaults, but manual specification can help with convergence.

### Working with Distributions

PyMC distributions can be used in two ways:

1. **Inside a model context**: Creates a random variable tracked by the model
2. **Standalone with `.dist()`**: Creates a distribution object for direct use

In [None]:
# Standalone distribution - useful for exploration
x_dist = pm.Normal.dist(mu=0, sigma=1)

# Draw samples from the distribution
samples = pm.draw(x_dist, draws=1000, random_seed=rng)

# Calculate log probability
log_prob = pm.logp(x_dist, 0.5).eval()

print(f"Log probability of x=0.5: {log_prob:.4f}")
px.histogram(samples, title="Samples from Normal(0, 1)").update_layout(
    xaxis_title="Value", yaxis_title="Count", showlegend=False
)

In [None]:
# Inside a model context - creates tracked random variables
with pm.Model() as model:
    # Prior distribution
    mu = pm.Normal('mu', mu=0, sigma=10)
    
    # Another prior
    sigma = pm.HalfNormal('sigma', sigma=5)
    
    # Likelihood (with observed data)
    y = pm.Normal('y', mu=mu, sigma=sigma, observed=[1.2, 0.8, 1.5, 0.9])

# The model now contains these variables
print(f"Model variables: {list(model.values_to_rvs.keys())}")

## Part 2: Model Specification and Structure

PyMC models are built using a context manager pattern. The `Model` context tracks all variables and their relationships, creating a directed acyclic graph (DAG) that represents the probabilistic model.

### Core Components

1. **Stochastic Random Variables**: Variables with uncertainty, defined by probability distributions
   - Unobserved (parameters to estimate)
   - Observed (data)

2. **Deterministic Variables**: Variables computed from other variables with no additional randomness

3. **Factor Potentials**: Terms that modify the joint log-probability

4. **Data Containers**: Special variables for data that might change

In [None]:
# Example: Complete model structure
np.random.seed(42)
true_mu = 5
true_sigma = 2
data = np.random.normal(true_mu, true_sigma, size=100)

with pm.Model() as structured_model:
    # 1. Stochastic variables (priors)
    mu = pm.Normal('mu', mu=0, sigma=10)
    log_sigma = pm.Normal('log_sigma', mu=0, sigma=1)
    
    # 2. Deterministic transformation
    sigma = pm.Deterministic('sigma', pm.math.exp(log_sigma))
    
    # 3. Data container (allows updating data later)
    x_data = pm.Data('x_data', data)
    
    # 4. Likelihood (observed stochastic variable)
    likelihood = pm.Normal('likelihood', mu=mu, sigma=sigma, observed=x_data)
    
    # 5. Factor potential (example: constrain mu to be positive)
    positive_mu = pm.Potential('positive_mu', 
                              pm.math.switch(mu > 0, 0, -np.inf))

# Visualize the model structure
pm.model_to_graphviz(structured_model)

## Part 3: Likelihood Functions and Prior Distributions

In Bayesian modeling, we combine prior beliefs about parameters with data through a likelihood function. PyMC provides a rich set of distributions for both.

### Common Prior Distributions

**For location parameters (means, intercepts):**
- `Normal(mu, sigma)`: When you expect values around a certain point
- `StudentT(nu, mu, sigma)`: Robust to outliers, heavier tails than Normal
- `Uniform(lower, upper)`: Flat prior over a range (use sparingly)

**For scale parameters (standard deviations, variances):**
- `HalfNormal(sigma)`: Positive values with mode at 0
- `HalfCauchy(beta)`: Heavy-tailed, good for hierarchical models
- `Exponential(lam)`: Alternative for positive parameters
- `InverseGamma(alpha, beta)`: Traditional but less recommended

**For probabilities and proportions:**
- `Beta(alpha, beta)`: Flexible distribution on [0, 1]
- `Dirichlet(a)`: Multivariate generalization of Beta

**For counts:**
- `Poisson(mu)`: Count data with equal mean and variance
- `NegativeBinomial(mu, alpha)`: Overdispersed count data

### Common Likelihood Functions

**For continuous data:**
- `Normal(mu, sigma)`: Symmetric errors around a mean
- `StudentT(nu, mu, sigma)`: Heavy-tailed errors, robust to outliers
- `LogNormal(mu, sigma)`: Positive data with right skew
- `Gamma(alpha, beta)`: Positive continuous data
- `Beta(alpha, beta)`: Data bounded between 0 and 1

**For discrete data:**
- `Bernoulli(p)`: Binary outcomes (0/1)
- `Binomial(n, p)`: Number of successes in n trials
- `Poisson(mu)`: Count data
- `NegativeBinomial(mu, alpha)`: Overdispersed counts
- `Categorical(p)`: One of K categories
- `Multinomial(n, p)`: Counts in K categories
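
The difference between `Poisson` and `NegativeBinomial` counts can be checked by simulation. The sketch below uses NumPy rather than PyMC and assumes the `(mu, alpha)` parameterization listed above, under which the variance is $\mu + \mu^2/\alpha$. The values `mu = 5` and `alpha = 2` are arbitrary illustrative choices.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
mu, alpha = 5.0, 2.0
n_draws = 200_000

poisson_draws = demo_rng.poisson(mu, size=n_draws)

# NumPy parameterizes the negative binomial by (n, p); setting n = alpha and
# p = alpha / (alpha + mu) reproduces the (mu, alpha) parameterization
nb_draws = demo_rng.negative_binomial(alpha, alpha / (alpha + mu), size=n_draws)

print(f"Poisson:          mean={poisson_draws.mean():.2f}, var={poisson_draws.var():.2f}")
print(f"NegativeBinomial: mean={nb_draws.mean():.2f}, var={nb_draws.var():.2f}")
print(f"Theoretical NB variance: {mu + mu**2 / alpha:.2f}")
```

Both sets of draws share the same mean, but the negative binomial draws are far more spread out, which is what "overdispersed" means in practice.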

In [None]:
# Example: Exploring different prior distributions
fig = make_subplots(rows=2, cols=2, subplot_titles=[
    'Location Priors', 'Scale Priors', 
    'Probability Priors', 'Count Priors'
])

x_range = np.linspace(-5, 5, 1000)
x_positive = np.linspace(0.01, 5, 1000)
x_prob = np.linspace(0, 1, 1000)
x_count = np.arange(0, 20)

# Distributions created with .dist() have no .logp method in current PyMC;
# pm.logp(dist, value) builds the log-density expression instead

# Location priors
fig.add_trace(go.Scatter(x=x_range, y=pm.logp(pm.Normal.dist(0, 1), x_range).eval(), 
                        name='Normal(0,1)', line=dict(width=2)), row=1, col=1)
fig.add_trace(go.Scatter(x=x_range, y=pm.logp(pm.StudentT.dist(nu=3, mu=0, sigma=1), x_range).eval(), 
                        name='StudentT(3,0,1)', line=dict(width=2, dash='dash')), row=1, col=1)

# Scale priors  
fig.add_trace(go.Scatter(x=x_positive, y=pm.logp(pm.HalfNormal.dist(1), x_positive).eval(), 
                        name='HalfNormal(1)', line=dict(width=2)), row=1, col=2)
fig.add_trace(go.Scatter(x=x_positive, y=pm.logp(pm.HalfCauchy.dist(1), x_positive).eval(), 
                        name='HalfCauchy(1)', line=dict(width=2, dash='dash')), row=1, col=2)

# Probability priors
fig.add_trace(go.Scatter(x=x_prob, y=pm.logp(pm.Beta.dist(2, 2), x_prob).eval(), 
                        name='Beta(2,2)', line=dict(width=2)), row=2, col=1)
fig.add_trace(go.Scatter(x=x_prob, y=pm.logp(pm.Beta.dist(0.5, 0.5), x_prob).eval(), 
                        name='Beta(0.5,0.5)', line=dict(width=2, dash='dash')), row=2, col=1)

# Count priors
fig.add_trace(go.Scatter(x=x_count, y=pm.logp(pm.Poisson.dist(5), x_count).eval(), 
                        mode='markers+lines', name='Poisson(5)'), row=2, col=2)
fig.add_trace(go.Scatter(x=x_count, y=pm.logp(pm.NegativeBinomial.dist(5, 2), x_count).eval(), 
                        mode='markers+lines', name='NegBinom(5,2)', 
                        line=dict(dash='dash')), row=2, col=2)

fig.update_layout(height=600, showlegend=True)
fig.update_yaxes(title_text="Log Probability")
fig.show()

## Part 4: Introduction to MCMC Methods

Now that we understand how to specify models in PyMC, we need to address a fundamental question: how do we actually compute with these models? For all but the simplest cases, the posterior distribution cannot be derived analytically. This is where Markov Chain Monte Carlo (MCMC) methods become essential.

### The Challenge of Bayesian Computation

Recall Bayes' theorem:

$$P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)}$$

The denominator, $P(x) = \int P(x|\theta) P(\theta) \, d\theta$, is called the marginal likelihood or evidence. For most models, this integral is intractable:

1. **High dimensionality**: With multiple parameters, we need to integrate over many dimensions
2. **Complex dependencies**: Parameters often have intricate relationships
3. **Non-standard distributions**: The posterior rarely has a recognizable form
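
In one dimension the evidence integral is still easy to approximate on a grid, which makes the scaling problem visible. The sketch below uses a hypothetical coin-flip dataset (7 successes in 10 trials) with a uniform prior; for that special case the evidence has the closed form $1/(n+1)$. The final loop shows how many evaluations the same grid resolution would need as the number of parameters grows.

```python
import numpy as np
from math import comb

n_trials, n_success = 10, 7          # hypothetical data
theta_grid = np.linspace(0.0, 1.0, 10_001)

prior = np.ones_like(theta_grid)     # Uniform(0, 1) prior density
likelihood = (comb(n_trials, n_success)
              * theta_grid**n_success
              * (1 - theta_grid)**(n_trials - n_success))

# Trapezoidal rule for P(x) = integral of likelihood * prior over theta
integrand = likelihood * prior
evidence = np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(theta_grid))

print(f"Grid evidence:     {evidence:.6f}")
print(f"Analytic evidence: {1 / (n_trials + 1):.6f}")

# Cost of keeping 10,001 grid points per parameter in d dimensions
for d in (1, 2, 5, 10):
    print(f"d={d:>2}: {float(len(theta_grid)) ** d:.2e} grid evaluations")
```

Grid integration works for one or two parameters, but the exponential growth in evaluations rules it out for realistic models.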

### From Integration to Sampling

MCMC methods sidestep the integration problem through a clever insight: instead of computing the posterior distribution directly, we can draw samples from it. With enough samples, we can approximate any quantity of interest:

- **Posterior means**: $E[\theta|x] \approx \frac{1}{n}\sum_{i=1}^n \theta_i$
- **Credible intervals**: Use sample quantiles
- **Posterior probabilities**: Proportion of samples in a region
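
As a small illustration of these three summaries, the sketch below treats direct draws from a Beta(8, 4) distribution as a stand-in for MCMC output. The distribution is chosen only because its exact mean, 8/12, is known; the 94% interval width is likewise an arbitrary choice.

```python
import numpy as np

demo_rng = np.random.default_rng(1)

# Stand-in for sampler output: direct draws from a Beta(8, 4) "posterior"
theta_samples = demo_rng.beta(8, 4, size=100_000)

post_mean = theta_samples.mean()                             # posterior mean
ci_low, ci_high = np.quantile(theta_samples, [0.03, 0.97])   # 94% credible interval
prob_above_half = np.mean(theta_samples > 0.5)               # P(theta > 0.5 | x)

print(f"Posterior mean:        {post_mean:.3f} (exact: {8 / 12:.3f})")
print(f"94% credible interval: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"P(theta > 0.5 | x):    {prob_above_half:.3f}")
```

Every summary is a simple array operation on the samples, which is why the rest of this session focuses on how to obtain the samples in the first place.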

The key challenge is: how do we draw samples from a distribution when we only know it up to a normalizing constant?

### The Markov Chain Monte Carlo Solution

MCMC methods construct a Markov chain whose stationary distribution is the posterior distribution we want to sample from. The "Markov" property means that each sample depends only on the previous sample, not the entire history. The "Monte Carlo" aspect refers to the use of random sampling.
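
The idea of a stationary distribution is easiest to see in a tiny discrete example. The two-state transition matrix below is hypothetical and unrelated to any posterior; the sketch only shows that repeatedly applying the transition rule drives an arbitrary starting distribution toward the one distribution that the chain leaves unchanged.

```python
import numpy as np

# Hypothetical 2-state chain: row i holds P(next state | current state = i)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

dist = np.array([0.0, 1.0])   # start with all probability on state 1
for _ in range(50):
    dist = dist @ P           # advance the chain by one step

stationary = np.array([5 / 6, 1 / 6])   # solves pi = pi @ P

print("After 50 steps:", dist.round(6))
print("Stationary:    ", stationary.round(6))
```

MCMC algorithms reverse this logic: they start from the target distribution and design the transition rule so that the target is the stationary distribution.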

The general MCMC algorithm follows this pattern:

1. Start at some initial parameter values $\theta^{(0)}$
2. For iteration $t = 1, 2, ...$:
   - Propose new parameter values $\theta^*$ based on current values $\theta^{(t-1)}$
   - Accept or reject the proposal with a probability determined by the ratio of posterior densities at $\theta^*$ and $\theta^{(t-1)}$
   - Set $\theta^{(t)} = \theta^*$ if accepted, otherwise $\theta^{(t)} = \theta^{(t-1)}$

Different MCMC algorithms vary in how they propose new values and decide whether to accept them. The art lies in designing proposals that efficiently explore the posterior distribution.

## Part 5: Sampling Algorithms and Monte Carlo Integration

To understand how MCMC works in practice, let's explore the fundamental concepts through concrete examples. We'll start with basic Monte Carlo integration and build up to understanding why sophisticated algorithms like those in PyMC are necessary.

### Monte Carlo Integration

The foundation of all Monte Carlo methods is a simple but powerful idea: we can approximate integrals using random samples. Consider estimating the expected value of a function $h(\theta)$ under a probability distribution $p(\theta)$:

$$E[h(\theta)] = \int h(\theta) p(\theta) d\theta$$

If we can draw samples $\theta_1, \theta_2, ..., \theta_n$ from $p(\theta)$, then by the Law of Large Numbers:

$$E[h(\theta)] \approx \frac{1}{n} \sum_{i=1}^n h(\theta_i)$$

This approximation becomes exact as $n \to \infty$, and we can quantify the uncertainty in our estimate using the Central Limit Theorem.
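
A minimal sketch of both points, assuming $p(\theta)$ is a standard normal and $h(\theta) = \theta^2$ so that the exact answer is 1. The standard error uses the usual Central Limit Theorem approximation, the sample standard deviation of $h$ divided by $\sqrt{n}$.

```python
import numpy as np

demo_rng = np.random.default_rng(2)
n = 100_000

theta_draws = demo_rng.normal(0.0, 1.0, size=n)   # draws from p(theta) = Normal(0, 1)
h_values = theta_draws**2                         # h(theta) = theta^2, so E[h] = 1

estimate = h_values.mean()
std_error = h_values.std(ddof=1) / np.sqrt(n)     # CLT-based Monte Carlo standard error

print(f"Estimate: {estimate:.4f} +/- {1.96 * std_error:.4f} (exact value: 1)")
```

Because the error shrinks like $1/\sqrt{n}$, gaining one more digit of accuracy costs roughly 100 times as many samples.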

In [None]:
# Example: Monte Carlo estimation of pi
# We'll estimate pi by randomly sampling points in a square and checking if they fall in a circle

n_samples = 10000
x = np.random.uniform(-1, 1, n_samples)
y = np.random.uniform(-1, 1, n_samples)

# Check if points are inside the unit circle
inside_circle = (x**2 + y**2) <= 1

# Estimate pi: (area of circle / area of square) = (pi * r^2) / (2r)^2 = pi/4
pi_estimate = 4 * np.mean(inside_circle)

# Visualize
fig = go.Figure()

# Add points
fig.add_trace(go.Scatter(
    x=x[inside_circle], y=y[inside_circle],
    mode='markers', marker=dict(size=2, color='blue'),
    name='Inside circle'
))
fig.add_trace(go.Scatter(
    x=x[~inside_circle], y=y[~inside_circle],
    mode='markers', marker=dict(size=2, color='red'),
    name='Outside circle'
))

# Add circle
theta = np.linspace(0, 2*np.pi, 100)
fig.add_trace(go.Scatter(
    x=np.cos(theta), y=np.sin(theta),
    mode='lines', line=dict(color='black', width=2),
    name='Unit circle'
))

fig.update_layout(
    title=f'Monte Carlo Estimation of π ≈ {pi_estimate:.4f} (true value: {np.pi:.4f})',
    xaxis=dict(scaleanchor="y", scaleratio=1),
    yaxis=dict(scaleanchor="x", scaleratio=1),
    width=600, height=600
)
fig.show()

print(f"Estimated π: {pi_estimate:.4f}")
print(f"True π:      {np.pi:.4f}")
print(f"Error:       {abs(pi_estimate - np.pi):.4f}")

### The Metropolis-Hastings Algorithm

While Monte Carlo integration is powerful, it assumes we can directly sample from the distribution of interest. In Bayesian inference, we typically can only evaluate the unnormalized posterior $\tilde{p}(\theta|x) = p(x|\theta)p(\theta)$. The Metropolis-Hastings algorithm provides an elegant solution to this problem.

#### The Algorithm

The Metropolis-Hastings algorithm generates a Markov chain whose stationary distribution is the target posterior distribution. The key insight is the **Metropolis acceptance criterion**, which ensures detailed balance:

$$A(\theta^* | \theta) = \min\left\{1, \frac{\tilde{p}(\theta^*)}{\tilde{p}(\theta)} \cdot \frac{q(\theta|\theta^*)}{q(\theta^*|\theta)}\right\}$$

where:
- $\tilde{p}(\theta)$ is the unnormalized posterior
- $q(\theta^*|\theta)$ is the proposal distribution
- $A(\theta^* | \theta)$ is the acceptance probability

This criterion automatically adjusts for asymmetric proposals and ensures the chain converges to the correct distribution.
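
Because only the ratio $\tilde{p}(\theta^*)/\tilde{p}(\theta)$ enters the criterion, any normalizing constant cancels. The sketch below checks this numerically for a standard normal target, assuming a symmetric proposal so that the $q$ ratio equals one; the two parameter values are arbitrary.

```python
import numpy as np

def unnormalized(theta):
    """Standard normal kernel without the 1/sqrt(2*pi) constant."""
    return np.exp(-0.5 * theta**2)

def normalized(theta):
    """Fully normalized standard normal density."""
    return unnormalized(theta) / np.sqrt(2 * np.pi)

theta_current, theta_proposed = 0.5, 1.5   # arbitrary illustrative values

# Symmetric proposal assumed, so the q-ratio is 1 and drops out
accept_unnorm = min(1.0, unnormalized(theta_proposed) / unnormalized(theta_current))
accept_norm = min(1.0, normalized(theta_proposed) / normalized(theta_current))

print(f"Using unnormalized density: {accept_unnorm:.4f}")
print(f"Using normalized density:   {accept_norm:.4f}")
```

Both calls give the same acceptance probability, which is why the intractable evidence $P(x)$ never needs to be computed.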

Let's implement the general Metropolis-Hastings algorithm:

In [None]:
def metropolis_hastings(pdf, prop_dist, init=0):
    """General Metropolis-Hastings sampler.
    
    Args:
        pdf: Target probability density function (unnormalized is ok)
        prop_dist: Frozen scipy.stats distribution used as a fixed proposal;
            draws do not depend on the current state (an independence sampler)
        init: Initial value
        
    Yields:
        (sample, accepted): Current sample and whether it was accepted
    """
    current = init
    while True:
        # Propose new state from proposal distribution
        prop = prop_dist.rvs()
        
        # Calculate acceptance ratio
        p_accept = min(1, pdf(prop) / pdf(current) * 
                      prop_dist.pdf(current) / prop_dist.pdf(prop))
        
        # Accept or reject
        accept = np.random.rand() < p_accept
        if accept:
            current = prop
        yield current, accept
        
def gen_samples(draws, sampler):
    """Generate samples from a sampler."""
    samples = np.empty(draws)
    accepts = 0
    for idx, (z, accept) in takewhile(lambda j: j[0] < draws, enumerate(sampler)):
        accepts += accept
        samples[idx] = z
    return samples, accepts

# Example: Sample from a mixture of Gaussians
def target_pdf(x):
    """Mixture of two Gaussians"""
    return 0.3 * st.norm.pdf(x, -2, 0.8) + 0.7 * st.norm.pdf(x, 3, 1.2)

# Use a wide normal as proposal
proposal_dist = st.norm(0, 10)

# Generate samples
samples, accepts = gen_samples(10_000, metropolis_hastings(target_pdf, proposal_dist))

# Visualize results
t = np.linspace(-6, 8, 500)
pdf_values = [target_pdf(x) for x in t]

hist = go.Histogram(
    x=samples,
    histnorm='probability density',
    name='Samples',
    marker=dict(color='rgba(0, 0, 255, 0.7)')
)

pdf_curve = go.Scatter(
    x=t, y=pdf_values,
    mode='lines',
    name='True PDF',
    line=dict(color='orange', width=2)
)

go.Figure(
    data=[hist, pdf_curve]
).update_layout(
    title=f'Metropolis-Hastings: {samples.size:,d} samples with {100 * accepts / samples.size:.1f}% acceptance rate',
    xaxis_title='Value',
    yaxis_title='Density',
    width=750,
    height=400
)

### Random Walk Metropolis-Hastings

The general Metropolis-Hastings algorithm can be inefficient if the proposal distribution is poorly chosen. A popular special case is **Random Walk Metropolis**, where the proposal is centered at the current position:

$$\theta^* \sim \mathcal{N}(\theta, \sigma^2)$$

This symmetric proposal simplifies the acceptance ratio because the proposal terms cancel out:

$$A(\theta^* | \theta) = \min\left\{1, \frac{\tilde{p}(\theta^*)}{\tilde{p}(\theta)}\right\}$$

The key hyperparameter is the step size $\sigma$:
- **Too small**: Chain takes tiny steps and explores slowly (high acceptance, slow mixing)
- **Too large**: Many proposals are rejected (low acceptance, slow mixing)
- **Just right**: Balance between acceptance rate and step size (typically 20-50% acceptance)

Let's implement and compare different step sizes:

In [None]:
def random_walk_metropolis(pdf, step_size, init=0):
    """Random walk Metropolis algorithm.
    
    Args:
        pdf: Target probability density function (unnormalized is ok)
        step_size: Standard deviation of proposal distribution
        init: Initial value
        
    Yields:
        (sample, accepted): Current sample and whether it was accepted
    """
    current = init
    while True:
        # Random walk proposal
        prop = current + np.random.normal(0, step_size)
        
        # Simple acceptance ratio (proposal terms cancel)
        p_accept = min(1, pdf(prop) / pdf(current))
        
        # Accept or reject
        accept = np.random.rand() < p_accept
        if accept:
            current = prop
        yield current, accept

# Compare different step sizes
fig = make_subplots(rows=3, cols=1, 
                    subplot_titles=['Small Step Size (σ=0.1)', 
                                   'Medium Step Size (σ=8.0)', 
                                   'Large Step Size (σ=70.0)'])

step_sizes = [0.1, 8.0, 70.0]

for i, step_size in enumerate(step_sizes, 1):
    # Generate samples
    samples, accepts = gen_samples(10_000, random_walk_metropolis(target_pdf, step_size))
    
    # Calculate t and pdf values for the line
    t = np.linspace(samples.min(), samples.max(), 500)
    pdf_values = [target_pdf(x) for x in t]
    
    # Add histogram
    fig.add_trace(
        go.Histogram(
            x=samples,
            histnorm='probability density',
            marker=dict(color='rgba(0, 0, 255, 0.7)'),
            name=f"Samples (σ={step_size})",
            showlegend=False
        ),
        row=i, col=1
    )
    
    # Add PDF line
    fig.add_trace(
        go.Scatter(
            x=t,
            y=pdf_values,
            mode='lines',
            line=dict(color='orange', width=2),
            name=f"True PDF",
            showlegend=False
        ),
        row=i, col=1
    )
    
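    # Add annotation with acceptance rate, positioned relative to this subplot
    # ("x domain"/"y domain" refs use 0-1 fractions instead of data coordinates)
    axis_id = "" if i == 1 else str(i)
    fig.add_annotation(
        text=f"Acceptance rate: {100 * accepts / samples.size:.1f}%",
        xref=f"x{axis_id} domain", yref=f"y{axis_id} domain",
        x=0.95, y=0.95,
        xanchor="right", yanchor="top",
        showarrow=False,
        bgcolor="white",
        bordercolor="black",
        borderwidth=1
    )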

fig.update_layout(height=900, width=700, showlegend=False)
fig.update_xaxes(title_text="Value", row=3, col=1)
fig.update_yaxes(title_text="Density")
fig.show()

### Visualizing MCMC Behavior

To better understand how different step sizes affect the sampler's behavior, let's look at trace plots that show the evolution of the chain over time:

In [None]:
# Generate shorter chains for visualization
n_steps = 1000

fig = make_subplots(rows=3, cols=1, 
                    subplot_titles=['Small Step Size (σ=0.1): Slow Exploration', 
                                   'Medium Step Size (σ=8.0): Good Mixing', 
                                   'Large Step Size (σ=70.0): Many Rejections'])

for i, step_size in enumerate(step_sizes, 1):
    # Generate samples
    sampler = random_walk_metropolis(target_pdf, step_size)
    trace = []
    for _ in range(n_steps):
        sample, _ = next(sampler)
        trace.append(sample)
    
    # Add trace plot
    fig.add_trace(
        go.Scatter(
            y=trace,
            mode='lines',
            line=dict(width=1),
            name=f"σ={step_size}",
            showlegend=False
        ),
        row=i, col=1
    )
    
    # Add horizontal lines at the modes
    fig.add_hline(y=-2, line_dash="dash", line_color="red", 
                  row=i, col=1, annotation_text="Mode 1")
    fig.add_hline(y=3, line_dash="dash", line_color="red", 
                  row=i, col=1, annotation_text="Mode 2")

fig.update_layout(height=900, width=800, showlegend=False)
fig.update_xaxes(title_text="Iteration", row=3, col=1)
fig.update_yaxes(title_text="Value")
fig.show()

### Modern MCMC: The No-U-Turn Sampler (NUTS)

While Metropolis-Hastings and its variants are foundational, they can be inefficient for complex, high-dimensional models. Modern MCMC algorithms like Hamiltonian Monte Carlo (HMC) and its extension, the No-U-Turn Sampler (NUTS), use gradient information to make more intelligent proposals.

#### Key Advantages of NUTS:

1. **Automatic tuning**: NUTS adapts its parameters during warmup, eliminating the need for manual tuning
2. **Efficient exploration**: Uses gradient information to propose distant points that are still likely to be accepted
3. **Handles correlations**: Can efficiently sample from highly correlated posteriors
4. **Fewer tuning parameters**: Works well "out of the box" for many models
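
The gradient-guided proposals behind these advantages can be sketched with a leapfrog integrator, the building block of HMC. This is a simplified illustration for a one-dimensional standard normal target, not NUTS itself and not PyMC's implementation; the function names, step size, and number of steps are all illustrative assumptions.

```python
import numpy as np

def grad_neg_log_density(q):
    """Gradient of -log p(q) for a standard normal target: d/dq (q^2 / 2) = q."""
    return q

def leapfrog(q, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p = p - 0.5 * step_size * grad_neg_log_density(q)   # initial half step for momentum
    for _ in range(n_steps - 1):
        q = q + step_size * p                           # full step for position
        p = p - step_size * grad_neg_log_density(q)     # full step for momentum
    q = q + step_size * p                               # last full step for position
    p = p - 0.5 * step_size * grad_neg_log_density(q)   # final half step for momentum
    return q, p

def hamiltonian(q, p):
    """Total energy: potential (-log density) plus kinetic."""
    return 0.5 * q**2 + 0.5 * p**2

q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, step_size=0.1, n_steps=20)

energy_error = hamiltonian(q1, p1) - hamiltonian(q0, p0)
accept_prob = min(1.0, np.exp(-energy_error))
print(f"Moved from q={q0:.2f} to q={q1:.2f}; acceptance probability {accept_prob:.4f}")
```

The proposal lands far from the starting point yet is accepted with probability close to one, because the integrator nearly conserves energy. A random walk proposal of the same size would be rejected much more often.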

#### When NUTS Shines:

- High-dimensional parameter spaces (dozens to thousands of parameters)
- Complex posterior geometries with strong correlations
- Models with continuous parameters

#### Implementation in PyMC:

PyMC uses NUTS as the default sampler for continuous variables. When you call `pm.sample()`, PyMC:

1. Automatically differentiates your model to compute gradients
2. Runs adaptation to tune the step size and mass matrix
3. Generates samples using the tuned sampler
4. Provides diagnostics to assess convergence

The beauty of PyMC is that all this complexity is hidden behind a simple interface. However, understanding the basics helps us interpret diagnostics and troubleshoot when things go wrong.

## Part 6: Simple Linear Regression in PyMC

Let's conclude with a complete example of Bayesian linear regression. This fundamental model serves as a building block for more complex analyses and demonstrates the complete Bayesian workflow.

### The Fish Market Problem

Imagine we work for an e-commerce company that sells fresh fish. We need to predict fish weights for inventory and shipping purposes, but weighing each fish individually is time-consuming and expensive. However, we can easily measure fish dimensions using automated cameras.

Our goal is to build a model that:
1. Predicts fish weight from easily measured dimensions
2. Quantifies uncertainty in predictions
3. Identifies which measurements are most informative
4. Handles different fish species appropriately

This is a perfect application for Bayesian regression because:
- We care about prediction uncertainty (shipping costs have tiers)
- We have prior knowledge about fish (weight should increase with size)
- We want to understand which features matter most

In [None]:
# Load the fish market data
fish_data = pl.read_csv("../data/fish-market.csv")

print(f"Dataset shape: {fish_data.shape}")
print(f"\nColumns: {fish_data.columns}")
print(f"\nSpecies: {fish_data['Species'].unique()}")

# Display first few rows
fish_data.head()

In [None]:
# Explore relationships between variables
fig = px.scatter_matrix(
    fish_data.to_pandas(),
    dimensions=['Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width'],
    color='Species',
    title='Fish Measurements Scatter Matrix',
    height=800
)
fig.update_traces(diagonal_visible=False, showupperhalf=False, marker=dict(size=3))
fig.show()

### Modeling Approach

The scatter matrix reveals several important patterns:
1. Strong positive correlations between weight and all size measurements
2. The three length measurements are highly correlated (multicollinearity)
3. Different species show distinct patterns (suggesting species-specific models might be better)
4. The relationship appears non-linear (weight scales with volume, roughly the cube of length, rather than with length itself)

For this initial model, we'll:
1. Use all available predictors despite multicollinearity (the priors keep the coefficients regularized, although the individual coefficients of correlated predictors will be hard to interpret)
2. Model all species together (keeping it simple)
3. Use log transformation on weight to linearize relationships
4. Standardize predictors for numerical stability

In [None]:
# Prepare data for modeling
# Remove rows with zero weight (data quality issue)
fish_clean = fish_data.filter(pl.col('Weight') > 0)

# Log transform the target
fish_clean = fish_clean.with_columns(
    pl.col('Weight').log().alias('LogWeight')
)

# Select and standardize predictors
predictor_vars = ['Length1', 'Length2', 'Length3', 'Height', 'Width']
X_fish = fish_clean.select(predictor_vars).to_numpy()

# Standardize
X_mean = X_fish.mean(axis=0)
X_std = X_fish.std(axis=0)
X_fish_std = (X_fish - X_mean) / X_std

# Target variable
y_fish = fish_clean['LogWeight'].to_numpy()

print(f"Final dataset: {X_fish_std.shape[0]} fish, {X_fish_std.shape[1]} predictors")
print(f"Target range: {y_fish.min():.2f} to {y_fish.max():.2f} (log scale)")

### Building the Bayesian Linear Regression Model

Our model follows the standard Bayesian regression framework:

$$\log(\text{Weight}_i) = \alpha + \sum_{j=1}^{5} \beta_j X_{ij} + \epsilon_i$$

where:
- $\alpha$ is the intercept
- $\beta_j$ are coefficients for each predictor
- $\epsilon_i \sim \text{Normal}(0, \sigma)$ is the error term

We'll use weakly informative priors that encode our basic knowledge about the problem while letting the data dominate.

In [None]:
# Build the regression model
coords_fish = {"predictor": predictor_vars}

with pm.Model(coords=coords_fish) as fish_model:
    
    # Intercept - centered on mean log weight
    alpha = pm.Normal('alpha', mu=np.mean(y_fish), sigma=2)
    
    # Regression coefficients
    # Prior: We expect standardized coefficients to be moderate in size
    beta = pm.Normal('beta', mu=0, sigma=1, dims="predictor")
    
    # Error standard deviation
    # HalfStudentT is a positive-only prior whose heavy tail still allows large sigma values
    sigma = pm.HalfStudentT('sigma', nu=4, sigma=0.5)
    
    # Expected value
    mu = pm.Deterministic('mu', alpha + X_fish_std @ beta)
    
    # Likelihood
    pm.Normal('log_weight', mu=mu, sigma=sigma, observed=y_fish)
    
# Visualize model
pm.model_to_graphviz(fish_model)

### Prior Predictive Checks

Before fitting the model, let's check that our priors generate reasonable predictions. This is an important step that can catch specification errors early.

In [None]:
# Sample from prior predictive distribution
with fish_model:
    # Number of draws passed positionally (the keyword name differs across PyMC versions)
    prior_pred = pm.sample_prior_predictive(500, random_seed=42)

# Convert prior predictions back to original scale
prior_weights = np.exp(prior_pred.prior_predictive['log_weight'].values.flatten())

# Plot prior predictive distribution
fig = go.Figure()
fig.add_trace(go.Histogram(x=prior_weights[prior_weights < 10000],  # Drop extreme values for visualization
                          name='Prior Predictions',
                          nbinsx=50,
                          histnorm='probability'))
fig.add_trace(go.Histogram(x=fish_clean['Weight'].to_numpy(),
                          name='Actual Data',
                          nbinsx=50,
                          histnorm='probability',
                          opacity=0.7))
# Overlay the two histograms so the opacity setting makes both visible
fig.update_layout(title='Prior Predictive Check: Fish Weights',
                 xaxis_title='Weight (g)',
                 yaxis_title='Probability',
                 barmode='overlay')
fig.show()

print(f"Prior predictions range: {np.percentile(prior_weights, 5):.0f} to {np.percentile(prior_weights, 95):.0f} grams")
print(f"Actual data range: {fish_clean['Weight'].min():.0f} to {fish_clean['Weight'].max():.0f} grams")
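
To make explicit what a prior predictive draw involves, the same simulation can be sketched by hand with NumPy. This is a minimal sketch, not PyMC's implementation: `X_demo` is random stand-in data and `prior_center` is an assumed value standing in for `np.mean(y_fish)`; both are hypothetical. The half-Student-t draw is obtained by taking the absolute value of a Student-t variate and multiplying by the scale.

```python
import numpy as np

rng = np.random.default_rng(42)

n_draws, n_obs, n_pred = 500, 20, 5

# Hypothetical stand-ins for the standardized predictors and the prior center
X_demo = rng.standard_normal((n_obs, n_pred))
prior_center = 5.0  # assumed value; the model above uses np.mean(y_fish)

# One draw of every parameter per prior sample
alpha_prior = rng.normal(prior_center, 2.0, size=n_draws)
beta_prior = rng.normal(0.0, 1.0, size=(n_draws, n_pred))
# |Student-t(nu=4)| * 0.5 is a half-Student-t draw with scale 0.5
sigma_prior = np.abs(rng.standard_t(df=4, size=n_draws)) * 0.5

# Push the parameter draws through the likelihood
mu_prior = alpha_prior[:, None] + beta_prior @ X_demo.T  # (n_draws, n_obs)
log_weight_prior = rng.normal(mu_prior, sigma_prior[:, None])

# Central 90% interval of the simulated weights on the original scale
low, high = np.percentile(np.exp(log_weight_prior), [5, 95])
print(f"Prior predictive 90% interval: {low:.1f} to {high:.1f}")
```

If the simulated interval is wildly wider than any plausible fish weight, that is a sign the priors on `beta` or `sigma` could be tightened.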

### Fitting the Model

Once the prior predictive check looks reasonable, we can fit the model using MCMC. Because every parameter in this model is continuous, PyMC assigns the NUTS sampler by default, and NUTS typically handles regression models like this one efficiently.

In [None]:
# Fit the model
with fish_model:
    # Sample from posterior
    fish_trace = pm.sample(2000, tune=2000, random_seed=42)
    
    # Sample posterior predictive for model checking
    post_pred = pm.sample_posterior_predictive(fish_trace, random_seed=42)
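
Before interpreting the posterior, it helps to confirm that the chains agree with one another. The sketch below implements the classic Gelman-Rubin statistic on synthetic chains to show what an R-hat value measures. It is a simplified illustration under stated assumptions: the function name `gelman_rubin` and the synthetic chains are hypothetical, and the formula is the basic non-split version rather than the rank-normalized split variant that ArviZ reports.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic (non-split) R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    # W: average within-chain variance
    within = chains.var(axis=1, ddof=1).mean()
    # B / n: variance of the chain means
    between_over_n = chain_means.var(ddof=1)
    # Pooled estimate of the posterior variance
    var_hat = (n_draws - 1) / n_draws * within + between_over_n
    return np.sqrt(var_hat / within)

rng = np.random.default_rng(0)
# Four chains that all sample the same distribution
mixed = rng.normal(0.0, 1.0, size=(4, 1000))
# Four chains where two are stuck in a different region
stuck = mixed + np.array([0.0, 0.0, 3.0, 3.0])[:, None]

print(f"R-hat, well-mixed chains: {gelman_rubin(mixed):.3f}")
print(f"R-hat, stuck chains:      {gelman_rubin(stuck):.3f}")
```

For the fitted fish model, the `r_hat` column of `az.summary` plays the same role, with values close to 1.0 as the usual target, and `fish_trace.sample_stats['diverging']` records any divergent transitions from NUTS.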

### Analyzing Results

Now we examine the posterior distributions to understand what the model has learned from the data.

In [None]:
# Summary statistics
print("Regression Coefficients:")
print(az.summary(fish_trace, var_names=['alpha', 'beta', 'sigma'], 
                filter_vars="like"))

# Visualize coefficients
az.plot_forest(fish_trace, 
               var_names=['beta'],
               combined=True,
               hdi_prob=0.95,
               figsize=(10, 6))
plt.title('Posterior Distributions of Standardized Coefficients')
plt.xlabel('Standardized Coefficient Value')
plt.show()

# Interpret coefficients
beta_means = fish_trace.posterior['beta'].mean(dim=['chain', 'draw'])
for i, var in enumerate(predictor_vars):
    print(f"{var}: {beta_means[i].item():.3f}")
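
Because the predictors were standardized, each coefficient above is the change in log weight per one standard deviation of that predictor. To report effects per original unit, the coefficients can be rescaled using $\beta_j^{\text{raw}} = \beta_j / s_j$ and $\alpha^{\text{raw}} = \alpha - \sum_j \beta_j m_j / s_j$, where $m_j$ and $s_j$ are the training mean and standard deviation. The sketch below checks that algebra on synthetic data; `X_raw`, `alpha_std`, and `beta_std` are made-up stand-ins, not values from the fish model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for raw predictors and one posterior draw (hypothetical values)
X_raw = rng.normal(loc=[30.0, 9.0], scale=[10.0, 4.0], size=(50, 2))
alpha_std = 5.4
beta_std = np.array([0.8, 0.3])

X_mean = X_raw.mean(axis=0)
X_std = X_raw.std(axis=0)

# Coefficients per original unit of each predictor
beta_raw = beta_std / X_std
# Intercept that applies when predictors are left unstandardized
alpha_raw = alpha_std - np.sum(beta_std * X_mean / X_std)

# Both parameterizations give the same linear predictor
mu_from_std = alpha_std + ((X_raw - X_mean) / X_std) @ beta_std
mu_from_raw = alpha_raw + X_raw @ beta_raw
print(np.allclose(mu_from_std, mu_from_raw))
```

The same rescaling can be applied to every posterior draw rather than to the posterior mean, which gives a full posterior distribution for the original-scale coefficients.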

### Model Checking

A crucial part of the Bayesian workflow is checking whether our model adequately captures the data generating process. We'll use posterior predictive checks to compare model predictions with actual data.

In [None]:
# Posterior predictive check
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Predicted vs Actual
# Average over chains and draws, then convert to a plain NumPy array for plotting
y_pred = post_pred.posterior_predictive['log_weight'].mean(dim=['chain', 'draw']).values
axes[0].scatter(y_fish, y_pred, alpha=0.5)
axes[0].plot([y_fish.min(), y_fish.max()], [y_fish.min(), y_fish.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Log Weight')
axes[0].set_ylabel('Predicted Log Weight')
axes[0].set_title('Predicted vs Actual Values')

# Plot 2: Residuals
residuals = y_fish - y_pred
axes[1].scatter(y_pred, residuals, alpha=0.5)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('Predicted Log Weight')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')

plt.tight_layout()
plt.show()

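# Calculate R² from the posterior mean predictions
# (a point-estimate summary; it does not propagate posterior uncertainty)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y_fish - y_fish.mean())**2)
r_squared = 1 - (ss_res / ss_tot)
print(f"\nR² (posterior mean predictions): {float(r_squared):.3f}")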
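
The R² computed above uses a single set of point predictions. A draw-wise alternative, one common definition of a Bayesian R², computes the ratio of explained variance to explained plus residual variance separately for every posterior draw, which yields a distribution rather than a single number. The sketch below is illustrative only: the function name `bayesian_r2` and the synthetic draws are hypothetical, and the residual variance is taken to be $\sigma^2$ from the model.

```python
import numpy as np

def bayesian_r2(mu_draws, sigma_draws):
    """Draw-wise R²: explained variance / (explained + residual variance).

    mu_draws:    (n_draws, n_obs) expected values, one row per posterior draw
    sigma_draws: (n_draws,) residual standard deviation per draw
    """
    var_fit = np.var(mu_draws, axis=1, ddof=1)
    var_res = np.asarray(sigma_draws) ** 2
    return var_fit / (var_fit + var_res)

# Synthetic stand-in for posterior draws (hypothetical values)
rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 100)
slope_draws = rng.normal(1.0, 0.05, size=400)
sigma_draws = np.abs(rng.normal(0.3, 0.02, size=400))
mu_draws = slope_draws[:, None] * x[None, :]

r2_draws = bayesian_r2(mu_draws, sigma_draws)
lower, upper = np.percentile(r2_draws, [2.5, 97.5])
print(f"Bayesian R²: mean {r2_draws.mean():.3f}, central 95% interval [{lower:.3f}, {upper:.3f}]")
```

For the fish model, the draws of `mu` are stored in `fish_trace.posterior['mu']` because it was declared as a `pm.Deterministic`, so they could be reshaped to `(n_draws, n_obs)` and passed to a function like this one together with the draws of `sigma`.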

### Making Predictions with Uncertainty

One of the key advantages of Bayesian regression is that we get full posterior distributions for predictions, not just point estimates. Let's predict the weight of a new fish and quantify our uncertainty.

In [None]:
# Predict for a new fish
# Let's say we have a fish with these measurements:
new_fish = np.array([[
    25.0,  # Length1
    27.0,  # Length2  
    30.0,  # Length3
    11.5,  # Height
    6.5    # Width
]])

# Standardize using training statistics
new_fish_std = (new_fish - X_mean) / X_std

# Pull posterior draws as plain arrays, merging chains and draws
posterior = fish_trace.posterior
alpha_draws = posterior['alpha'].values.flatten()                       # (n_samples,)
beta_draws = posterior['beta'].values.reshape(-1, len(predictor_vars))  # (n_samples, n_predictors)
sigma_draws = posterior['sigma'].values.flatten()                       # (n_samples,)

# Expected log weight of the new fish under each posterior draw
mu_new = alpha_draws + beta_draws @ new_fish_std[0]

# Add observation noise so each value is a posterior predictive draw,
# not just the uncertainty in the expected value
rng = np.random.default_rng(42)
log_weight_pred = rng.normal(mu_new, sigma_draws)

# Convert to weight scale
pred_weights = np.exp(log_weight_pred)

# Summarize predictions
mean_weight = pred_weights.mean()
hdi = az.hdi(pred_weights, hdi_prob=0.95)

print(f"Predicted weight: {mean_weight:.0f} grams")
print(f"95% HDI: [{hdi[0]:.0f}, {hdi[1]:.0f}] grams")

# Visualize prediction distribution
fig = go.Figure()
fig.add_trace(go.Histogram(x=pred_weights, nbinsx=50, name='Predicted Weight'))
fig.add_vline(x=mean_weight, line_dash="dash", line_color="red", 
              annotation_text=f"Mean: {mean_weight:.0f}g")
fig.update_layout(title='Predicted Weight Distribution for New Fish',
                 xaxis_title='Weight (g)',
                 yaxis_title='Count')
fig.show()

## Summary and Key Takeaways

In this session, we've covered the essential components of building Bayesian models with PyMC:

### PyMC Fundamentals
- **Model Context**: All PyMC models exist within a context that tracks variable relationships
- **Distributions**: PyMC provides a rich library of probability distributions for priors and likelihoods
- **Random Variables**: Parameters and data are represented as random variables with associated distributions
- **Automatic Differentiation**: PyMC handles gradient calculations automatically

### MCMC Concepts
- **The Challenge**: Bayesian inference requires integration over high-dimensional spaces
- **The Solution**: MCMC methods draw samples from the posterior distribution
- **Metropolis-Hastings**: The foundational MCMC algorithm that inspired modern methods
- **Tuning Matters**: Step size and other hyperparameters critically affect performance
- **Modern Algorithms**: NUTS (used by PyMC) efficiently explores complex posteriors
- **Convergence**: Multiple chains help assess whether sampling has converged

### Practical Workflow
1. **Model Specification**: Encode assumptions using priors and likelihoods
2. **Prior Predictive Checks**: Verify that priors generate reasonable data
3. **Sampling**: Use `pm.sample()` to draw from the posterior
4. **Diagnostics**: Check convergence and sampling quality
5. **Posterior Analysis**: Examine parameter estimates and uncertainties
6. **Posterior Predictive Checks**: Verify that the model captures data patterns
7. **Predictions**: Generate predictions with full uncertainty quantification

### Key Advantages of Bayesian Modeling
- **Uncertainty Quantification**: Get distributions, not just point estimates
- **Prior Information**: Incorporate domain knowledge systematically
- **Hierarchical Models**: Naturally handle grouped data (next session!)
- **Model Comparison**: Principled ways to compare models
- **Missing Data**: Missing values can be treated as unknown quantities and estimated alongside the parameters

### Next Steps
In the next session, we'll extend these concepts to hierarchical models, which allow us to model data with natural grouping structures. We'll see how Bayesian methods excel at sharing information across groups while respecting their differences.

### Exercises
1. Try different prior distributions and observe their effect on the posterior
2. Fit separate models for different fish species and compare results
3. Add interaction terms between predictors
4. Implement a robust regression using Student-t likelihood instead of Normal
5. Use cross-validation to assess predictive performance

In [None]:
%load_ext watermark
%watermark -n -u -v -iv -w