[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/instats_gp/blob/main/sessions/Session_1.ipynb)

# Session 1: Introduction to Gaussian Processes and PyMC

## Learning Objectives

By the end of this session, you will be able to:

- Understand what Bayesian non-parametric models are and how Gaussian processes fit into this framework
- Connect the familiar concept of multivariate normal distributions to the more general idea of Gaussian processes
- Learn the basics of PyMC and how to implement simple models
- Understand the roles of mean and covariance functions in defining a GP's behavior
- Build and fit your first GP model using real data

## LLM-Assisted Exercises

Throughout these notebooks, you'll notice sections marked with 🤖. These are **LLM-assisted exercises** where you'll practice using large language models (like ChatGPT, Claude, or your favorite coding assistant) to help you implement GP concepts. This approach reflects modern data science practice: knowing *what* you want to accomplish and *how* to verify it is often more important than memorizing every implementation detail.

The exercises are designed to help you develop the skill of effectively communicating with AI assistants, a critical ability in today's data science workflow. You'll learn to write clear prompts, test implementations, and validate results, building both your GP expertise and your collaborative coding skills.

## Section 1.1: Introduction and Setup

### What Does "Non-Parametric" Mean in a GP Context?

When we talk about Gaussian processes being "non-parametric," we're not saying they have no parameters (that would be confusing!). Instead, we mean that GPs don't assume a fixed functional form with a finite number of parameters.

Think about it this way: in linear regression, you commit to a straight line (or hyperplane) defined by a slope and intercept. You're making a strong assumption about the shape of your function before seeing the data. With a GP, you're defining a *distribution over functions*. The data then tells you which functions from this distribution are most plausible.

This flexibility makes GPs incredibly powerful for modeling complex, unknown relationships. The "parameters" in a GP are actually hyperparameters that control properties like smoothness and lengthscale: they shape the space of possible functions rather than defining a single function.

In [None]:
import pymc as pm
import numpy as np
import polars as pl
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import arviz as az
import plotly.io as pio
from scipy import stats
import platform

DATA_DIR = "../data/"

# Set random seed for reproducibility
RANDOM_SEED = 20090425
rng = np.random.default_rng(RANDOM_SEED)

# Configure plotly for nice interactive plots
pio.templates.default = "plotly_white"

# Print versions to verify environment
print(f"PyMC version: {pm.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Polars version: {pl.__version__}")
print(f"\nEnvironment ready! Let's explore Gaussian Processes.")

Everything should import cleanly and you should see version numbers printed above. If you encounter any import errors, make sure your environment is set up correctly with PyMC 5.16+, NumPy 2.x, and Polars 1.x.

Notice that we're setting `RANDOM_SEED` right at the start. Reproducibility is crucial in Bayesian workflows: you want to be able to recreate your results exactly, especially when debugging or sharing your work with colleagues.

## Section 1.2: Bayesian Inference Primer

Before we learn the PyMC API or dive into Gaussian processes, let's build intuition for the Bayesian framework itself. Understanding how we update beliefs with data is fundamental to everything that follows.

### The Three-Step Bayesian Workflow

Every Bayesian analysis follows the same three-step pattern:

1. **Specify a probability model**: Assign probability distributions to all unknowns (parameters, predictions, even missing data)
2. **Calculate the posterior distribution**: Update our beliefs by combining prior knowledge with observed data  
3. **Check and interpret**: Validate the model and draw conclusions

This workflow applies whether we're estimating a simple proportion or building complex Gaussian process models.

### Bayes' Theorem: The Foundation

At its heart, Bayesian inference is about updating beliefs with data. We start with prior knowledge (the prior), observe data (the likelihood), and combine them through multiplication to get updated knowledge (the posterior):

$$P(\theta | y) \propto P(y | \theta) P(\theta)$$

**Reading this equation aloud:**
- **Posterior** ∝ **Likelihood** × **Prior**
- What we believe about $\theta$ after seeing data is proportional to how likely the data would be under different $\theta$ values, weighted by what we believed before

This seemingly simple equation is incredibly powerful. Think of it visually: if your prior was a curve and the likelihood was another curve, the posterior is their product: peaks where both agree, valleys where they disagree.

The proportionality constant is just a normalization factor ensuring the posterior integrates to one. It is important mathematically, but not essential for building intuition.
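As a minimal sketch of the "multiply, then normalize" idea, the posterior can be approximated on a grid of $\theta$ values using only NumPy. The data (3 successes in 5 trials), the flat prior, and all variable names here are illustrative assumptions rather than part of the workshop code.

```python
import numpy as np

# Illustrative assumption: 3 successes in 5 Bernoulli trials, flat prior on theta
n_trials, n_success = 5, 3
theta_grid = np.linspace(0.0, 1.0, 1001)
d_theta = theta_grid[1] - theta_grid[0]

prior_grid = np.ones_like(theta_grid)  # flat prior density
like_grid = theta_grid**n_success * (1 - theta_grid)**(n_trials - n_success)

# Multiply pointwise, then divide by the (approximate) integral
unnormalized = like_grid * prior_grid
posterior_grid = unnormalized / (unnormalized.sum() * d_theta)

grid_mean = (theta_grid * posterior_grid).sum() * d_theta
print(f"Grid-approximate posterior mean: {grid_mean:.3f}")
```

The normalization step only rescales the curve; the shape of the posterior is already fixed by the product of likelihood and prior.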

### A Concrete Example: Estimating a Proportion

Let's work through a simple analytical example to see Bayesian updating in action. Imagine we want to estimate the probability $\theta$ that a coin lands heads.

**Setup**: We flip a coin 5 times and observe 3 heads. What can we infer about $\theta$?

**Step 1: Specify the model**
- **Likelihood** (Binomial): Number of heads follows $\text{Binomial}(n=5, p=\theta)$
- **Prior** (Beta): We start with a uniform prior, $\text{Beta}(1, 1)$, representing maximum uncertainty about $\theta$

**Step 2: Compute the posterior**

The beautiful thing about the Beta-Binomial combination is that we can calculate the posterior analytically, with no fancy algorithms needed! This is called *conjugacy*.

The posterior is $\text{Beta}(1+3, 1+2) = \text{Beta}(4, 3)$. Notice how the parameters simply add:
- Prior "successes": 1 -> Posterior: 1 + 3 = 4  
- Prior "failures": 1 -> Posterior: 1 + 2 = 3
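To see where the "parameters simply add" rule comes from, multiply the Binomial likelihood by a general $\text{Beta}(a, b)$ prior and drop every factor that does not depend on $\theta$ (the symbols $a$ and $b$ are introduced here only for illustration):

$$\begin{aligned}
P(\theta | y) &\propto \underbrace{\theta^{y}(1-\theta)^{n-y}}_{\text{likelihood}} \times \underbrace{\theta^{a-1}(1-\theta)^{b-1}}_{\text{prior}} \\
&= \theta^{(a+y)-1}(1-\theta)^{(b+n-y)-1}
\end{aligned}$$

This is the unnormalized density of a $\text{Beta}(a+y,\, b+n-y)$ distribution. With $a = b = 1$, $n = 5$, and $y = 3$, it gives $\text{Beta}(4, 3)$.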

Let's visualize this updating process:

In [None]:
# Setup: 3 heads from 5 flips
n, y = 5, 3
theta = np.linspace(0, 1, 200)

# Prior: Beta(1, 1) = Uniform
prior = stats.beta.pdf(theta, 1, 1)

# Likelihood: Binomial (normalized for visibility)
likelihood = theta**y * (1-theta)**(n-y)
likelihood = likelihood / likelihood.max()

# Posterior: Beta(1+y, 1+n-y) = Beta(4, 3)
posterior = stats.beta.pdf(theta, 1+y, 1+n-y)

# Visualize
fig = go.Figure()
fig.add_trace(go.Scatter(x=theta, y=prior, name='Prior',
                         mode='lines', line=dict(color='gray', width=2, dash='dash')))
fig.add_trace(go.Scatter(x=theta, y=likelihood, name='Likelihood (normalized)',
                         mode='lines', line=dict(color='blue', width=2)))
fig.add_trace(go.Scatter(x=theta, y=posterior, name='Posterior',
                         mode='lines', line=dict(color='red', width=3)))

fig.update_layout(
    title='Bayesian Updating: Posterior ∝ Prior × Likelihood',
    xaxis_title='θ (probability of heads)',
    yaxis_title='Probability Density',
    showlegend=True
)
fig.show()

# Posterior mean
posterior_mean = (1 + y) / (1 + 1 + n)
print(f"\nPosterior mean: {posterior_mean:.3f}")
print(f"This is between our prior mean (0.5) and the observed proportion ({y/n:.3f})")

Notice the beautiful interplay:

- The **prior** (gray dashed) is flat: we started with no preference for any value of $\theta$
- The **likelihood** (blue) peaks at the observed proportion (3/5 = 0.6); this is what the data alone suggests
- The **posterior** (red) is their product, peaked near the likelihood but slightly regularized by the prior

With only 5 flips, there's still substantial uncertainty. Let's see what happens as we collect more data:


In [None]:
# Show how posterior narrows with more data (keeping proportion at 60%)
sample_sizes = [5, 20, 100]
observed_proportion = 0.6

fig = go.Figure()

for n in sample_sizes:
    y = round(n * observed_proportion)  # round() guards against float truncation
    posterior = stats.beta.pdf(theta, 1+y, 1+n-y)
    fig.add_trace(go.Scatter(
        x=theta, y=posterior,
        name=f'n={n}, y={y}',
        mode='lines', line=dict(width=2)
    ))

fig.update_layout(
    title='Posterior Distribution Narrows with More Data',
    xaxis_title='θ (probability of heads)',
    yaxis_title='Posterior Density',
    showlegend=True
)
fig.show()

This progression beautifully illustrates Bayesian learning:

- With **5 observations**, the posterior is wide: we're still quite uncertain
- With **20 observations**, uncertainty decreases substantially
- With **100 observations**, the posterior becomes very sharp: we're confident $\theta$ is near 0.6

In this conjugate model, the posterior mean is a weighted average of the prior mean and the observed proportion, with the data getting more weight as sample size grows. This is Bayesian learning in action: prior beliefs get updated, and eventually overwhelmed, by evidence.
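The weighted-average claim can be checked with a few lines of arithmetic. The sketch below assumes a $\text{Beta}(a, b)$ prior and uses the numbers from the coin example; the variable names are illustrative.

```python
# Beta(a, b) prior updated with n_heads successes in n_flips trials
a, b = 1, 1
n_flips, n_heads = 5, 3

prior_mean = a / (a + b)
data_prop = n_heads / n_flips

# Weight on the prior shrinks as n_flips grows: (a + b) / (a + b + n_flips)
w_prior = (a + b) / (a + b + n_flips)
weighted = w_prior * prior_mean + (1 - w_prior) * data_prop

# Mean of the Beta(a + y, b + n - y) posterior, computed directly
direct = (a + n_heads) / (a + b + n_flips)

print(f"weight on prior: {w_prior:.3f}")
print(f"weighted average: {weighted:.4f}, direct posterior mean: {direct:.4f}")
```

Increasing `n_flips` while holding the proportion fixed drives `w_prior` toward zero, which is why the posterior concentrates on the observed proportion.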

### Why This Matters for Gaussian Processes

This same Bayesian framework powers GP modeling. Instead of updating beliefs about a single parameter $\theta$, we'll update beliefs about entire *functions*. The prior becomes a distribution over functions (specified by mean and covariance), the likelihood describes how data relates to function values, and the posterior gives us updated beliefs about which functions are plausible.

But first, we need to learn how to implement Bayesian models in PyMC. Let's dive into the API with a real dataset.

## Section 1.3: Introduction to PyMC with Real Data

Now that we understand the Bayesian framework conceptually, let's learn how to implement it using PyMC. We'll start with the simplest possible model: estimating the distribution of a single continuous variable.

### The Data: Baseball Launch Angles

We'll use real baseball data from the `fastball_bat_angles.csv` dataset. When a batter makes contact with a fastball, the ball leaves the bat at a particular **launch angle**, the vertical angle relative to horizontal. Launch angle is crucial for hitting outcomes: too low and you hit a ground ball, too high and you pop out, just right and you might hit a home run.

Let's focus on a single player and estimate the distribution of their launch angles. We'll use this to learn PyMC syntax before moving to more complex models.


In [None]:
# Load the fastball bat angle data
df = pl.read_csv(DATA_DIR + 'fastball_bat_angles.csv')

# See what batters we have
print("Sample of available batters:")
print(df.select('batter_name').unique().head(10))

# Let's focus on Bryce Harper
harper = df.filter(pl.col('batter_name') == 'Harper, Bryce')
print(f"\nBryce Harper: {harper.shape[0]} fastball contacts")

# Extract launch angles as numpy array
launch_angles = harper['launch_angle'].to_numpy()

# Visualize the distribution
fig = go.Figure()
fig.add_trace(go.Histogram(
    x=launch_angles,
    nbinsx=30,
    name='Observed Launch Angles',
    marker_color='steelblue'
))
fig.update_layout(
    title='Bryce Harper: Launch Angle Distribution',
    xaxis_title='Launch Angle (degrees)',
    yaxis_title='Count',
    showlegend=False
)
fig.show()

print(f"\nSummary statistics:")
print(f"Mean: {launch_angles.mean():.1f}°")
print(f"Std Dev: {launch_angles.std():.1f}°")
print(f"Range: [{launch_angles.min():.1f}°, {launch_angles.max():.1f}°]")

The distribution looks roughly bell-shaped, though perhaps not perfectly symmetric. A normal distribution seems like a reasonable starting point.

### Building Your First PyMC Model

Let's estimate the mean and standard deviation of Harper's launch angle distribution. The model is simple:

$$\begin{aligned}
\mu &\sim \text{Normal}(30, 20) \\
\sigma &\sim \text{HalfNormal}(20) \\
y_i &\sim \text{Normal}(\mu, \sigma)
\end{aligned}$$

We're saying:
- The mean launch angle $\mu$ is probably around 30° (that's a good launch angle), but we're quite uncertain (SD of 20°)
- The standard deviation $\sigma$ is positive and probably not huge, maybe around 20° or less
- Each observed launch angle is drawn from a normal distribution with these parameters

Here's how to write this in PyMC:

In [None]:
with pm.Model() as launch_model:
    # Priors
    mu = pm.Normal('mu', mu=30, sigma=20)
    sigma = pm.HalfNormal('sigma', sigma=20)
    
    # Likelihood
    y_obs = pm.Normal('y', mu=mu, sigma=sigma, observed=launch_angles)
    
    # Sample from the posterior
    trace = pm.sample(1000, tune=1000, nuts_sampler='nutpie',
                      random_seed=rng)

Let's unpack what just happened:

1. `with pm.Model() as launch_model:` creates a model context. All variables defined inside this block are part of the model.

2. `pm.Normal('mu', mu=30, sigma=20)` defines our prior for the mean. The first argument is the variable name, then come the parameters.

3. `pm.HalfNormal('sigma', sigma=20)` defines our prior for the standard deviation. HalfNormal is like Normal but constrained to be positive, which makes it well suited to scale parameters.

4. `pm.Normal('y', mu=mu, sigma=sigma, observed=launch_angles)` defines the likelihood. The `observed=` argument connects the model to our actual data.

5. `pm.sample()` runs MCMC sampling to approximate the posterior distribution. We'll learn more about this later. For now, just know that it draws parameter values in proportion to how consistent they are with our data and priors.
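To give a feel for what "MCMC sampling" means, here is a minimal random-walk Metropolis sketch for the coin example from Section 1.2. This is not the NUTS algorithm that `pm.sample` uses; it is a much simpler sampler, shown only to illustrate drawing samples whose distribution approximates the posterior. The helper `log_post`, the step size, and the number of draws are illustrative choices.

```python
import numpy as np

def log_post(theta, n=5, y=3):
    """Unnormalized log posterior: Binomial likelihood with a flat Beta(1, 1) prior."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf  # outside the support of theta
    return y * np.log(theta) + (n - y) * np.log(1.0 - theta)

demo_rng = np.random.default_rng(0)
n_draws, step = 20_000, 0.3
draws = np.empty(n_draws)
current = 0.5

for i in range(n_draws):
    proposal = current + step * demo_rng.normal()
    # Accept with probability min(1, posterior ratio); otherwise stay put
    if np.log(demo_rng.uniform()) < log_post(proposal) - log_post(current):
        current = proposal
    draws[i] = current

kept = draws[2_000:]  # discard early draws as warm-up
print(f"MCMC mean: {kept.mean():.3f}  (exact Beta(4, 3) mean: {4 / 7:.3f})")
```

The sampler never needs the normalizing constant, only ratios of the unnormalized posterior, which is what makes MCMC practical for models without an analytical solution.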

Now let's examine what we learned:

In [None]:
# Summary table
print("Posterior Summary:")
print(az.summary(trace, var_names=['mu', 'sigma']))

# Visualize posteriors
fig = az.plot_posterior(trace, var_names=['mu', 'sigma'])
print("\n(See posterior plots above)")

The posterior tells us what we've learned about Harper's launch angle distribution:

- The **posterior mean** for $\mu$ is our best estimate of Harper's average launch angle
- The **94% HDI** (Highest Density Interval) tells us the range of plausible values: given the model and data, there's a 94% probability that the mean lies in this interval
- Notice how much narrower the posterior is compared to our prior: the data has substantially reduced our uncertainty
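ArviZ computes the HDI for you, but the idea is simple enough to sketch by hand: among all intervals that contain 94% of the posterior draws, pick the shortest. The function name `hdi_from_samples` and the synthetic draws below are illustrative assumptions, and the approach assumes a unimodal posterior.

```python
import numpy as np

def hdi_from_samples(samples, prob=0.94):
    """Shortest interval containing `prob` of the samples (assumes unimodality)."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    k = int(np.ceil(prob * n))            # number of samples inside the interval
    widths = x[k - 1:] - x[: n - k + 1]   # width of every candidate interval
    lo = int(np.argmin(widths))
    return x[lo], x[lo + k - 1]

# Stand-in "posterior draws" (synthetic, for illustration only)
demo_rng = np.random.default_rng(1)
fake_draws = demo_rng.normal(loc=10.0, scale=2.0, size=4000)

low, high = hdi_from_samples(fake_draws)
print(f"94% HDI: [{low:.2f}, {high:.2f}]")
```

For a symmetric posterior the HDI nearly coincides with the equal-tailed interval; for skewed posteriors the two can differ noticeably.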

### Posterior Predictive Checks

A crucial part of the Bayesian workflow is asking: "If this model is correct, would it generate data that looks like what we actually observed?" This is called a **posterior predictive check**.

We'll use the posterior distributions of $\mu$ and $\sigma$ to simulate new datasets and compare them to our observed data:

In [None]:
with launch_model:
    # Sample from the posterior predictive distribution
    trace.extend(pm.sample_posterior_predictive(trace, random_seed=rng))

# Use ArviZ's built-in posterior predictive check plot
az.plot_ppc(trace, num_pp_samples=100);
print("\n(Black: observed data, Blue: posterior predictive samples, Orange dashed: posterior predictive mean)")

With ArviZ's default colors, the posterior predictive samples (blue) overlay nicely with the observed data (black), suggesting our simple normal model is a reasonable approximation. If there were systematic discrepancies (say, the observed data had heavy tails that the model couldn't capture), we'd see clear differences and might consider a different likelihood (like Student's t).
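Under the hood, a posterior predictive check amounts to simulating one replicated dataset per posterior draw and comparing it with the observations. The sketch below mimics that with NumPy alone. Both the "observed" data and the "posterior draws" are synthetic stand-ins, not the workshop data or a real PyMC trace, and the choice of test statistic is an illustrative assumption.

```python
import numpy as np

demo_rng = np.random.default_rng(2)

# Synthetic stand-ins (NOT the workshop data): "observed" values and crude "posterior draws"
y_fake = demo_rng.normal(12.0, 25.0, size=300)
mu_draws = demo_rng.normal(y_fake.mean(), y_fake.std() / np.sqrt(300), size=500)
sigma_draws = np.abs(demo_rng.normal(y_fake.std(), y_fake.std() / np.sqrt(600), size=500))

# One replicated dataset per posterior draw, same size as the observed data
y_rep = demo_rng.normal(mu_draws[:, None], sigma_draws[:, None], size=(500, 300))

# Compare a summary statistic of the replicates to the observed value
obs_stat = np.abs(y_fake).max()
rep_stat = np.abs(y_rep).max(axis=1)
p_value = (rep_stat >= obs_stat).mean()
print(f"Share of replicated datasets with a larger extreme value: {p_value:.2f}")
```

A share very close to 0 or 1 would indicate that the model struggles to reproduce that feature of the data.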

This workflow (specify model, fit, check) is fundamental to Bayesian modeling and will be our pattern throughout this workshop.


### 🤖 EXERCISE 1: Compare Launch Angles Across Batters

Now that you've seen the basic workflow, practice it with different players.

In [None]:
# 🤖 EXERCISE: Use your LLM to help complete this comparison

def compare_batters(batter_names, df):
    """
    Fit launch angle models for multiple batters and compare posteriors.
    
    Parameters:
    -----------
    batter_names : list of str
        Names of batters to compare (e.g., ['Harper, Bryce', 'Alonso, Pete'])
    df : polars DataFrame
        The fastball bat angles dataset
        
    Returns:
    --------
    results : dict
        Dictionary mapping batter names to their trace objects
        
    Prompt suggestion: "Help me write a function that fits the same PyMC launch angle
    model (Normal likelihood with Normal and HalfNormal priors) for multiple batters,
    stores the traces, and creates a forest plot comparing the posterior mean launch
    angles across batters using ArviZ."
    """
    # YOUR LLM-ASSISTED CODE HERE
    pass

# Test with a few batters
# batters = ['Harper, Bryce', 'Alonso, Pete', 'Tucker, Kyle']
# results = compare_batters(batters, df)

## Section 1.4: Multivariate Normal Models with Coordinates

So far we've modeled single variables in isolation. But often we have multiple related measurements, like attack angle and launch angle in baseball, that vary together. A **multivariate normal distribution** lets us model the joint distribution of multiple variables, capturing how they covary.

This is also where PyMC's **coordinate system** shines. Coordinates (`coords`) let us:
- Give meaningful names to dimensions (like "variable" or "observation")
- Make models more readable and easier to extend
- Enable advanced features like `pm.Data()` for out-of-sample prediction

### Attack Angle and Launch Angle

When a batter swings, two angles matter:
- **Attack angle**: The vertical angle of the bat's path as it moves through the hitting zone
- **Launch angle**: The vertical angle of the ball as it leaves the bat

Intuitively, these should be related: if you swing with a more upward bat path (higher attack angle), you're more likely to hit the ball upward (higher launch angle). Let's model their joint distribution.


In [None]:
# Filter to observations with BOTH attack_angle and launch_angle
df_both = df.drop_nulls(subset=['attack_angle', 'launch_angle'])

print(f"Observations with both angles: {df_both.shape[0]}")
print(f"Original dataset: {df.shape[0]}")
print(f"Fraction with attack angle data: {df_both.shape[0]/df.shape[0]:.1%}")

# Stack into a (n_obs, 2) array
angles = np.column_stack([
    df_both['attack_angle'].to_numpy(),
    df_both['launch_angle'].to_numpy()
])

print(f"\nData shape: {angles.shape}")
print(f"Attack angle range: [{angles[:, 0].min():.1f}, {angles[:, 0].max():.1f}]")
print(f"Launch angle range: [{angles[:, 1].min():.1f}, {angles[:, 1].max():.1f}]")

# Visualize the joint distribution
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=angles[:, 0],
    y=angles[:, 1],
    mode='markers',
    marker=dict(size=3, opacity=0.3, color='steelblue'),
    name='Observed'
))
fig.update_layout(
    title='Joint Distribution of Attack Angle and Launch Angle',
    xaxis_title='Attack Angle (degrees)',
    yaxis_title='Launch Angle (degrees)',
    width=600, height=500
)
fig.show()

There's a clear positive relationship! Higher attack angles tend to produce higher launch angles, though there's substantial scatter.

### Building a Multivariate Normal Model

A bivariate normal distribution is characterized by:
- A **mean vector** $\boldsymbol{\mu} = [\mu_{\text{attack}}, \mu_{\text{launch}}]$
- A **covariance matrix** $\boldsymbol{\Sigma}$ that captures variances and covariance:

$$\boldsymbol{\Sigma} = \begin{bmatrix}
\sigma_{\text{attack}}^2 & \rho \sigma_{\text{attack}} \sigma_{\text{launch}} \\
\rho \sigma_{\text{attack}} \sigma_{\text{launch}} & \sigma_{\text{launch}}^2
\end{bmatrix}$$

where $\rho$ is the correlation coefficient.
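The matrix above can be assembled from standard deviations and a correlation, and then factorized, which is the parameterization the PyMC model in this section relies on. The numeric values below are illustrative assumptions, not estimates from the data.

```python
import numpy as np

# Illustrative values (assumptions, not estimates from the data)
sd_attack, sd_launch, rho_demo = 8.0, 25.0, 0.4

sds = np.array([sd_attack, sd_launch])
corr_demo = np.array([[1.0, rho_demo],
                      [rho_demo, 1.0]])

# Sigma = diag(sds) @ corr @ diag(sds)
Sigma_demo = np.diag(sds) @ corr_demo @ np.diag(sds)

# Lower-triangular Cholesky factor: Sigma = L @ L.T
L_demo = np.linalg.cholesky(Sigma_demo)

print(Sigma_demo)
print(np.allclose(L_demo @ L_demo.T, Sigma_demo))
```

Working with the Cholesky factor rather than $\boldsymbol{\Sigma}$ itself keeps the covariance matrix positive definite by construction.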

In PyMC, we'll use the **LKJ prior** for the correlation matrix. It's a flexible prior over correlation matrices; with `eta=2` it mildly favors weaker correlations without ruling out strong ones. We'll also use coordinates to make the model readable:


In [None]:
# Define coordinates
COORDS = {
    'variable': ['attack_angle', 'launch_angle'],
    'obs': np.arange(angles.shape[0])
}

with pm.Model(coords=COORDS) as mvn_model:
    # Mean vector (one mean per variable)
    mu = pm.Normal('mu', mu=0, sigma=30, dims='variable')
    
    # Cholesky-factorized covariance with an LKJ prior on the correlation
    # matrix (eta=2 is weakly informative) and HalfNormal priors on the
    # standard deviations of each variable
    chol, corr, stds = pm.LKJCholeskyCov(
        'chol', n=2, eta=2.0,
        sd_dist=pm.HalfNormal.dist(sigma=20, shape=2),
        compute_corr=True
    )
    
    # Multivariate normal likelihood
    pm.MvNormal('angles', mu=mu, chol=chol,
                observed=angles, dims=('obs', 'variable'))
    
    # Sample
    trace_mvn = pm.sample(1000, tune=1000, nuts_sampler='nutpie',
                          random_seed=rng)

Let's examine what we learned about the means, standard deviations, and correlation:


In [None]:
# Summary table
print("Posterior Summary:")
print(az.summary(trace_mvn, var_names=['mu', 'chol_stds', 'chol_corr']))

# Visualize mean and standard deviation estimates
az.plot_posterior(trace_mvn, var_names=['mu', 'chol_stds']);
print("\n(See posterior distributions for means and standard deviations above)")

In [None]:
# Extract and visualize the correlation
corr_samples = trace_mvn.posterior['chol_corr'].values
# Correlation is the off-diagonal element [0,1]
rho = corr_samples[:, :, 0, 1].flatten()

fig = go.Figure()
fig.add_trace(go.Histogram(
    x=rho,
    nbinsx=40,
    name='Correlation',
    marker_color='coral'
))
fig.update_layout(
    title='Posterior Distribution of Correlation between Attack Angle and Launch Angle',
    xaxis_title='Correlation (ρ)',
    yaxis_title='Count',
    showlegend=False
)
fig.show()

print(f"\nPosterior mean correlation: {rho.mean():.3f}")
print(f"95% credible interval: [{np.percentile(rho, 2.5):.3f}, {np.percentile(rho, 97.5):.3f}]")

The positive correlation confirms our intuition: attack angle and launch angle are positively related. When batters swing upward, they tend to hit the ball upward.

### Why Use Coordinates?

Notice how `dims='variable'` makes the model self-documenting. When we look at posterior summaries, we see `mu[attack_angle]` and `mu[launch_angle]` instead of `mu[0]` and `mu[1]`. This readability becomes crucial in larger models.

The `pm.Data()` container (which we could add) would let us:
- Update the data without rebuilding the model
- Make predictions on new observations
- Perform leave-one-out cross-validation

We'll use these features extensively when building GP models in Session 2.


### 🤖 EXERCISE 2: Posterior Predictive Check for Multivariate Model

Verify that the bivariate normal model captures the joint distribution well by generating posterior predictive samples and comparing to the observed data.

In [None]:
# 🤖 EXERCISE: Use your LLM to help create a posterior predictive check

def mvn_posterior_predictive_check(trace_mvn, angles):
    """
    Generate posterior predictive samples and visualize against observed data.
    
    Create a scatter plot with:
    - Observed (attack_angle, launch_angle) points in blue
    - A few posterior predictive samples in red/orange with transparency
    
    Prompt suggestion: "Help me sample from the posterior predictive distribution
    of the bivariate normal model using pm.sample_posterior_predictive(), then
    create a Plotly scatter plot overlaying observed data and 3-5 posterior
    predictive samples to visually check model fit."
    """
    # YOUR LLM-ASSISTED CODE HERE
    pass

# Uncomment to test:
# mvn_posterior_predictive_check(trace_mvn, angles)

## Section 1.5: From Multivariate Normals to Gaussian Processes

### Building Intuition: Functions as Infinite-Dimensional Vectors

Here's the key insight that unlocks Gaussian processes: if you squint just right, a function is just a really long vector (infinitely long, in fact). At every point $x$ in the input space, the function has a value $f(x)$. You can think of $f$ as an infinite-dimensional vector, with one entry for each possible $x$.

Now, you already know what a multivariate normal (MVN) distribution is: it's a probability distribution over vectors. A Gaussian process is simply the natural extension of an MVN distribution to this infinite-dimensional case. It's a distribution over *functions*.

The magic is that we can work with GPs using only finite-dimensional operations. We never actually deal with infinity; we only care about the function values at the specific points where we have (or want) observations. Let's make this concrete.
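The reason finite-dimensional operations suffice is the marginalization property of the multivariate normal: dropping some entries of the vector simply drops the matching rows and columns of the covariance matrix. The sketch below checks this numerically. It uses a squared-exponential covariance (introduced in the next subsection) through a hypothetical helper named `rbf`, and the input locations are arbitrary illustrative values.

```python
import numpy as np

def rbf(xa, xb, ell=1.0):
    """Squared-exponential covariance for 1D inputs (hypothetical helper for this sketch)."""
    return np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / ell**2)

x_all = np.array([0.0, 1.0, 2.5, 4.0, 7.0])  # a "large" set of inputs
keep = [0, 2, 4]                              # the points we actually care about

K_all = rbf(x_all, x_all)
K_marginal = K_all[np.ix_(keep, keep)]        # marginalize: drop rows and columns
K_direct = rbf(x_all[keep], x_all[keep])      # build the kernel on the subset only

print(np.allclose(K_marginal, K_direct))
```

Because the two matrices agree, ignoring every input location we do not care about (including the infinitely many we never list) changes nothing about the distribution at the points we keep.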

### Drawing Functions from a GP Prior

A Gaussian process is completely specified by two functions:
- A **mean function** $m(x)$ that gives the expected value at each point
- A **covariance function** (or kernel) $k(x, x')$ that describes how function values at different points relate to each other

We write: $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$

Let's implement a simple GP prior using the squared exponential (also called RBF or ExpQuad) kernel and draw some functions from it. First, we'll implement the kernel function ourselves to build intuition:

### 🤖 EXERCISE 3: Implement the ExpQuad Kernel

Let's start by implementing the exponentiated quadratic (RBF) kernel from scratch. This exercise will help you understand what a covariance function actually does.

In [None]:
# 🤖 EXERCISE 3: Use your LLM to help complete this implementation

# STEP 1: Ask your LLM to help you implement this function
def exp_quad_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """
    Compute the ExpQuad (RBF/Squared Exponential) covariance matrix.
    
    The ExpQuad kernel is: k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))
    
    Parameters:
    -----------
    X1 : array-like, shape (n_samples_1, n_features)
        First input matrix
    X2 : array-like, shape (n_samples_2, n_features)  
        Second input matrix
    lengthscale : float
        Length scale parameter - controls how quickly correlation decays with distance
    variance : float
        Variance parameter - controls overall scale of function values
        
    Returns:
    --------
    K : array, shape (n_samples_1, n_samples_2)
        Covariance matrix
    
    Prompt suggestion: "Help me implement an RBF/ExpQuad kernel that returns the pairwise
    covariance matrix k(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 * ell^2)) using NumPy 2.
    The inputs X1 and X2 should be 2D arrays where each row is a data point."
    """
    # REFERENCE SOLUTION (for instructors):
    # Compute squared Euclidean distances
    # For each pair (i,j), we need ||X1[i] - X2[j]||^2
    X1_sq = np.sum(X1**2, axis=1, keepdims=True)  # shape (n1, 1)
    X2_sq = np.sum(X2**2, axis=1, keepdims=True)  # shape (n2, 1)
    distances_sq = X1_sq + X2_sq.T - 2 * X1 @ X2.T  # shape (n1, n2)
    distances_sq = np.maximum(distances_sq, 0.0)  # clip tiny negatives from round-off
    
    # Apply RBF kernel formula
    K = variance * np.exp(-distances_sq / (2 * lengthscale**2))
    return K

# STEP 2: Test your implementation
# Create a simple test case
X_test = np.array([[0.], [1.], [2.]])
K_test = exp_quad_kernel(X_test, X_test, lengthscale=1.0, variance=1.0)

print("Covariance matrix:")
print(K_test)
print("\nProperties to check:")
print(f"1. Symmetric? {np.allclose(K_test, K_test.T)}")
print(f"2. Diagonal values (should be ~{1.0}): {np.diag(K_test)}")
print(f"3. Positive semi-definite? {np.all(np.linalg.eigvalsh(K_test) >= -1e-10)}")

Let's visualize this kernel as a heatmap to understand what it's doing. The kernel matrix tells us how correlated function values are at different input locations:

In [None]:
# Create a grid of points
X_grid = np.linspace(0, 10, 50)[:, None]
K_grid = exp_quad_kernel(X_grid, X_grid, lengthscale=1.5, variance=1.0)

# Visualize the covariance matrix
fig = px.imshow(K_grid, 
                x=X_grid.flatten(),
                y=X_grid.flatten(),
                color_continuous_scale='Viridis',
                labels=dict(x="x", y="x'", color="Covariance"),
                title="ExpQuad Kernel: How function values covary")
fig.update_layout(width=600, height=500)
fig.show()

Notice the structure: the diagonal is bright (variance = 1.0), meaning each point is perfectly correlated with itself. As you move away from the diagonal, the correlation decays smoothly. This decay rate is controlled by the lengthscale parameter, which determines how far apart two points can be before they become essentially uncorrelated.
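To put numbers on "essentially uncorrelated," the sketch below evaluates the ExpQuad correlation $\exp(-d^2 / (2\ell^2))$ at separations of 0.5, 1, 2, and 3 lengthscales. The helper name `expquad_corr` is an illustrative assumption.

```python
import math

def expquad_corr(distance, lengthscale):
    """Correlation implied by the ExpQuad kernel at a given separation."""
    return math.exp(-distance**2 / (2 * lengthscale**2))

ell = 1.5  # lengthscale used for the heatmap above
for multiple in (0.5, 1, 2, 3):
    corr_d = expquad_corr(multiple * ell, ell)
    print(f"distance = {multiple} x lengthscale -> correlation = {corr_d:.3f}")
```

At one lengthscale the correlation is still about 0.61; by three lengthscales it has dropped to roughly 0.01.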

Now let's use this kernel to draw actual functions from a GP prior:

In [None]:
# Define input points where we want to evaluate the function
X = np.linspace(0, 10, 100)[:, None]

# Build the covariance matrix
lengthscale = 1.5
variance = 1.0
K = exp_quad_kernel(X, X, lengthscale=lengthscale, variance=variance)

# Add small jitter for numerical stability
K += 1e-8 * np.eye(len(X))

# Zero mean function (we'll explore non-zero means later)
mean = np.zeros(len(X))

# Draw 5 sample functions from the GP prior
n_samples = 5
f_samples = rng.multivariate_normal(mean, K, size=n_samples)

# Plot the sample functions
fig = go.Figure()
for i in range(n_samples):
    fig.add_trace(go.Scatter(
        x=X.flatten(),
        y=f_samples[i],
        mode='lines',
        name=f'Sample {i+1}',
        line=dict(width=2),
        opacity=0.7
    ))

fig.update_layout(
    title=f'Functions Drawn from GP Prior (lengthscale={lengthscale}, variance={variance})',
    xaxis_title='x',
    yaxis_title='f(x)',
    showlegend=True
)
fig.show()

These are functions drawn from our GP prior! Each colored line represents one possible function that the GP considers plausible before seeing any data. Notice how smooth they are; that smoothness comes from the ExpQuad kernel.

Here's the beautiful thing: we just drew these functions by sampling from a multivariate normal distribution. The connection is direct:
- The mean vector of the MVN is our mean function evaluated at our input points
- The covariance matrix of the MVN is our kernel evaluated at all pairs of input points
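One common way to draw from a multivariate normal, sketched here as an alternative to calling `rng.multivariate_normal`, is to transform standard normal noise with a Cholesky factor of the covariance matrix. The variable names and the slightly larger jitter (`1e-6`) are illustrative choices for this sketch.

```python
import numpy as np

demo_rng = np.random.default_rng(3)

x_demo = np.linspace(0, 10, 50)
# Squared-exponential covariance (lengthscale 1.5, variance 1) plus jitter
K_demo = np.exp(-0.5 * (x_demo[:, None] - x_demo[None, :])**2 / 1.5**2)
K_demo += 1e-6 * np.eye(len(x_demo))

m_demo = np.zeros(len(x_demo))        # mean function evaluated at the inputs
L_chol = np.linalg.cholesky(K_demo)   # K = L @ L.T

# If z ~ N(0, I), then m + L @ z ~ N(m, K)
z = demo_rng.standard_normal((len(x_demo), 3))
f_demo = m_demo[:, None] + L_chol @ z  # each column is one sampled function

print(f_demo.shape)
```

This makes the roles explicit: the mean vector sets where samples are centered, and the kernel matrix (through its factor) turns independent noise into correlated function values.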

### Understanding Lengthscale and Variance

Let's explore how the kernel parameters affect the functions we draw:

In [None]:
# Compare different lengthscales
lengthscales = [0.5, 1.5, 3.0]

fig = make_subplots(rows=1, cols=3,
                    subplot_titles=[f'lengthscale = {l}' for l in lengthscales])

for idx, ls in enumerate(lengthscales, 1):
    K = exp_quad_kernel(X, X, lengthscale=ls, variance=1.0)
    K += 1e-8 * np.eye(len(X))
    
    samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
    
    for sample in samples:
        fig.add_trace(
            go.Scatter(x=X.flatten(), y=sample, mode='lines',
                      showlegend=False, line=dict(width=1.5)),
            row=1, col=idx
        )

fig.update_xaxes(title_text="x")
fig.update_yaxes(title_text="f(x)", col=1)
fig.update_layout(height=300, width=900,
                  title_text="Effect of Lengthscale on Function Smoothness")
fig.show()

The lengthscale parameter controls the "wiggliness" of functions:
- **Small lengthscale** (left): Functions vary rapidly, with high-frequency wiggles
- **Medium lengthscale** (middle): Smooth functions with moderate variation
- **Large lengthscale** (right): Very smooth, slowly varying functions

Think of lengthscale as answering the question: "How far do I need to move in input space before function values become uncorrelated?" This is one of the most important hyperparameters you'll tune when fitting GPs to real data.

The variance parameter, on the other hand, controls the typical distance of functions from the mean function (in this case, zero). Larger variance means larger excursions from the mean.
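Since $k(x, x)$ equals the variance, the prior at any single input is $\text{Normal}(0, \text{variance})$ under a zero mean, so the square root of the variance is the typical size of an excursion. The sketch below checks this by simulation; the three variance values are arbitrary illustrative choices.

```python
import numpy as np

demo_rng = np.random.default_rng(4)

for variance_demo in (0.25, 1.0, 4.0):
    # At any single input x, k(x, x) = variance, so f(x) ~ Normal(0, variance)
    f_at_x = demo_rng.normal(0.0, np.sqrt(variance_demo), size=10_000)
    band = 1.96 * np.sqrt(variance_demo)
    print(f"variance={variance_demo}: sample SD={f_at_x.std():.2f}, "
          f"~95% band = +/-{band:.2f}")
```

Quadrupling the variance therefore doubles the vertical scale of the sampled functions without changing how quickly they wiggle.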

We've just connected multivariate normals to Gaussian processes! A GP is simply a generalization of the MVN to infinite dimensions, and we can work with it by considering only the finite set of points we care about. This is the key insight that makes GPs practical.

## Section 1.6: Mean Functions

### The Role of the Mean Function

So far, we've been using a zero mean function, but this is just one choice among many. The mean function $m(x)$ specifies the expected value of the function at each input point, before we see any data. Think of it as your baseline assumption about what the function looks like.

The mean function shifts functions up or down (or even gives them a trend), but it doesn't affect their smoothness or correlation structure; that's the job of the covariance function. In many applications, we stick with a zero mean because:
1. It's simple and introduces no additional parameters
2. The covariance function is usually flexible enough to capture the important structure
3. If we're working with standardized data (mean-centered), a zero mean function makes sense

However, when you have strong prior knowledge about the general shape of your function (maybe you know it should have a linear trend or oscillate around some value), encoding this in the mean function can improve your model.

Let's explore the three most common mean functions: zero, constant, and linear.

In [None]:
# Define our input points
X_mean = np.linspace(0, 10, 100)[:, None]
n_samples = 3

# Shared covariance (we'll use the same kernel for all)
K = exp_quad_kernel(X_mean, X_mean, lengthscale=1.5, variance=1.0)
K += 1e-8 * np.eye(len(X_mean))

# Three different mean functions
mean_zero = np.zeros(len(X_mean))
mean_constant = np.full(len(X_mean), 5.0)  # Constant offset of 5
mean_linear = 0.5 * X_mean.flatten() + 2.0  # Linear trend

# Create subplots
fig = make_subplots(rows=1, cols=3,
                    subplot_titles=['Zero Mean', 'Constant Mean (5.0)', 'Linear Mean (0.5x + 2)'])

for idx, (mean_func, title) in enumerate([
    (mean_zero, 'Zero'),
    (mean_constant, 'Constant'),
    (mean_linear, 'Linear')
], 1):
    # Draw samples
    samples = rng.multivariate_normal(mean_func, K, size=n_samples)
    
    # Plot mean function
    fig.add_trace(
        go.Scatter(x=X_mean.flatten(), y=mean_func, mode='lines',
                  line=dict(color='black', width=3, dash='dash'),
                  showlegend=False, name='Mean'),
        row=1, col=idx
    )
    
    # Plot samples
    for sample in samples:
        fig.add_trace(
            go.Scatter(x=X_mean.flatten(), y=sample, mode='lines',
                      showlegend=False, line=dict(width=1.5), opacity=0.7),
            row=1, col=idx
        )

fig.update_xaxes(title_text="x")
fig.update_yaxes(title_text="f(x)", col=1)
fig.update_layout(height=300, width=900,
                  title_text="GP Prior Samples with Different Mean Functions (black dashed = mean)")
fig.show()

Notice how the mean function (black dashed line) acts as an "anchor" around which the random functions vary:

- **Zero mean** (left): Functions wander around zero, with equal probability of being positive or negative
- **Constant mean** (middle): Functions are shifted up to fluctuate around 5.0, with no systematic upward or downward trend
- **Linear mean** (right): Functions inherit the upward trend, but with wiggles around that trend

The key insight: the covariance function determines the smoothness and correlation structure (how wiggly the functions are), while the mean function determines where they're centered or whether they have a systematic trend. They work together, but control different aspects of the prior.
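This separation can be verified directly: a draw from a GP with mean $m(x)$ is a zero-mean draw plus $m(x)$, so the empirical covariance is untouched by the choice of mean. The sketch below assumes a squared-exponential kernel with lengthscale 1.5 and a slightly larger jitter (1e-6) to keep the Cholesky factorization stable; `rng_demo` and the other names are introduced only for this illustration.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
x = np.linspace(0, 10, 40)

# Shared covariance: squared-exponential kernel plus jitter for stability
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 1.5**2)
K += 1e-6 * np.eye(len(x))
L = np.linalg.cholesky(K)

# Zero-mean draws: each row is one sampled function
z = rng_demo.standard_normal((len(x), 2000))
zero_mean_draws = (L @ z).T

# Adding a linear mean shifts every draw by the same deterministic curve
mean_linear = 0.5 * x + 2.0
linear_mean_draws = mean_linear + zero_mean_draws

# The empirical mean tracks the mean function...
max_mean_error = np.abs(linear_mean_draws.mean(axis=0) - mean_linear).max()
# ...while the empirical covariance is identical for both sets of draws.
cov_zero = np.cov(zero_mean_draws, rowvar=False)
cov_linear = np.cov(linear_mean_draws, rowvar=False)

print("Largest gap between empirical mean and mean function:", max_mean_error)
print("Covariances match:", np.allclose(cov_zero, cov_linear))
```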

### When to Use Each Mean Function

Here's a practical guide:

- **Zero mean**: Default choice, especially with standardized data or when you have no strong prior beliefs
- **Constant mean**: When you expect the function to fluctuate around some non-zero value (e.g., temperature anomalies around a baseline)
- **Linear mean**: When you have a clear trend in your data and want the GP to model deviations from that trend

In PyMC, you can easily specify these using built-in mean functions:

In [None]:
# PyMC makes it easy to specify mean functions
mean_zero_pymc = pm.gp.mean.Zero()  # Default
mean_const_pymc = pm.gp.mean.Constant(c=5.0)
mean_linear_pymc = pm.gp.mean.Linear(coeffs=0.5, intercept=2.0)

# Test them
X_test = np.array([[0.], [5.], [10.]])
print("Zero mean:", mean_zero_pymc(X_test).eval())
print("Constant mean:", mean_const_pymc(X_test).eval())
print("Linear mean:", mean_linear_pymc(X_test).eval())

### 🤖 EXERCISE 4: Compare Different Mean Functions

Now let's explore how different mean functions affect GP prior draws more systematically.

In [None]:
# 🤖 EXERCISE 4: Use your LLM to help complete this implementation

def compare_mean_functions(X, n_samples=5, lengthscale=1.5, variance=1.0):
    """
    Create comparison plots of GP priors with different mean functions.
    
    Parameters:
    -----------
    X : array, shape (n_points, 1)
        Input locations
    n_samples : int
        Number of function samples to draw from each GP
    lengthscale : float
        Kernel lengthscale
    variance : float
        Kernel variance
        
    Returns:
    --------
    fig : plotly Figure
        Comparison plot
    
    Prompt suggestion: "Help me write a function that compares GP prior samples using 
    three different mean functions: zero, constant (value=5), and linear (slope=0.5, intercept=2).
    Use the exp_quad_kernel defined earlier and create side-by-side Plotly subplots showing
    samples from each GP. Draw the mean function as a dashed black line and samples as colored lines."
    """
    # REFERENCE SOLUTION:
    K = exp_quad_kernel(X, X, lengthscale=lengthscale, variance=variance)
    K += 1e-8 * np.eye(len(X))
    
    means = {
        'Zero': np.zeros(len(X)),
        'Constant (5.0)': np.full(len(X), 5.0),
        'Linear (0.5x + 2)': 0.5 * X.flatten() + 2.0
    }
    
    fig = make_subplots(rows=1, cols=3, subplot_titles=list(means.keys()))
    
    for idx, (name, mean_func) in enumerate(means.items(), 1):
        samples = rng.multivariate_normal(mean_func, K, size=n_samples)
        
        # Plot mean
        fig.add_trace(
            go.Scatter(x=X.flatten(), y=mean_func, mode='lines',
                      line=dict(color='black', width=2, dash='dash'),
                      name='Mean', showlegend=(idx==1)),
            row=1, col=idx
        )
        
        # Plot samples
        for i, sample in enumerate(samples):
            fig.add_trace(
                go.Scatter(x=X.flatten(), y=sample, mode='lines',
                          name=f'Sample {i+1}' if idx==1 else None,
                          showlegend=(idx==1), line=dict(width=1.5),
                          opacity=0.7),
                row=1, col=idx
            )
    
    fig.update_xaxes(title_text="x")
    fig.update_yaxes(title_text="f(x)", col=1)
    fig.update_layout(height=350, width=1000,
                      title_text="Comparing Different Mean Functions")
    return fig

# Test the function
X_compare = np.linspace(0, 10, 100)[:, None]
fig = compare_mean_functions(X_compare, n_samples=5)
fig.show()

The exercise above reinforces an important point: **the mean function doesn't change the "wiggliness" or correlation structure**; that's entirely determined by the covariance function. The mean function simply shifts or tilts the family of functions we're considering.

In practice, most GP applications use a zero or constant mean function, letting the flexible covariance function do the heavy lifting. Now let's dive deep into those covariance functions!

## Section 1.7: Covariance Functions

### The Heart of Gaussian Processes

If the mean function sets the baseline, the covariance function (or kernel) is where the magic happens. The kernel defines the structure, smoothness, and patterns that your GP can express. Choosing the right kernel is often the most important modeling decision you'll make.

A covariance function $k(x, x')$ must satisfy one critical property: it must produce **positive semi-definite covariance matrices** for any set of input points. This ensures that the resulting multivariate normal distributions are valid. Fortunately, PyMC provides a rich library of kernels that satisfy this property, and you can combine them in powerful ways.
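To make the positive semi-definite requirement concrete, the sketch below checks the eigenvalues of two candidate matrices: one from a squared-exponential kernel and one from a function that is not a valid kernel (raw distance used as "similarity"). The helper `is_psd` and its tolerance are assumptions for this illustration, not part of PyMC.

```python
import numpy as np

def is_psd(K, tol=1e-8):
    """Return True if the symmetric matrix K has no eigenvalue below -tol."""
    eigenvalues = np.linalg.eigvalsh(K)
    return bool(eigenvalues.min() >= -tol)

x = np.linspace(0, 10, 30)
diff = x[:, None] - x[None, :]

# Valid kernel: squared exponential with lengthscale 1.5
K_valid = np.exp(-0.5 * diff**2 / 1.5**2)

# Invalid "kernel": plain distance. Its diagonal is zero, so the eigenvalues
# sum to zero and at least one of them must be negative.
K_invalid = np.abs(diff)

print("Squared exponential is PSD:", is_psd(K_valid))
print("Raw distance is PSD:", is_psd(K_invalid))
```

A matrix that fails this check cannot serve as a covariance matrix, because it would assign negative variance to some linear combination of function values.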

Let's explore the most commonly used kernels and understand when to use each.

In [None]:
# Helper function to visualize kernels
def visualize_kernel(cov_func, X, title, n_samples=3):
    """Visualize kernel covariance matrix and sample functions."""
    K = cov_func(X).eval()
    K += 1e-6 * np.eye(len(X))
    
    fig = make_subplots(rows=1, cols=2, subplot_titles=['Covariance Matrix', 'Sample Functions'],
                        specs=[[{'type': 'heatmap'}, {'type': 'scatter'}]])
    
    # Covariance matrix heatmap
    fig.add_trace(go.Heatmap(z=K, colorscale='Viridis', showscale=True), row=1, col=1)
    
    # Sample functions
    samples = rng.multivariate_normal(np.zeros(len(X)), K, size=n_samples)
    for sample in samples:
        fig.add_trace(go.Scatter(x=X.flatten(), y=sample, mode='lines', showlegend=False, line=dict(width=1.5)), row=1, col=2)
    
    fig.update_xaxes(title_text="Index", row=1, col=1)
    fig.update_yaxes(title_text="Index", row=1, col=1)
    fig.update_xaxes(title_text="x", row=1, col=2)
    fig.update_yaxes(title_text="f(x)", row=1, col=2)
    fig.update_layout(height=350, width=900, title_text=title)
    return fig

# Grid for visualization
X_kern = np.linspace(0, 10, 80)[:, None]

### ExpQuad (Squared Exponential / RBF)

We've already seen this kernel! It's infinitely differentiable, producing extremely smooth functions. Use it when you expect your function to be smooth everywhere.

$$k(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$$

In [None]:
cov_expquad = pm.gp.cov.ExpQuad(input_dim=1, ls=1.5)
fig = visualize_kernel(cov_expquad, X_kern, "ExpQuad Kernel")
fig.show()

### Matérn Family

The Matérn kernels generalize the ExpQuad by controlling smoothness through a parameter $\nu$. Smaller $\nu$ means less smooth, and the ExpQuad is recovered in the limit $\nu \to \infty$. PyMC provides three variants:

- **Matérn 1/2**: Continuous but not differentiable (roughest)
- **Matérn 3/2**: Once differentiable (medium smoothness)
- **Matérn 5/2**: Twice differentiable (smooth, but not as extreme as ExpQuad)

Use Matérn when you want control over smoothness or when your function might not be infinitely smooth.

In [None]:
# Compare Matérn kernels
matern_kernels = [
    (pm.gp.cov.Matern12(input_dim=1, ls=1.5), "Matérn 1/2 (Exponential)"),
    (pm.gp.cov.Matern32(input_dim=1, ls=1.5), "Matérn 3/2"),
    (pm.gp.cov.Matern52(input_dim=1, ls=1.5), "Matérn 5/2")
]

for cov, title in matern_kernels:
    fig = visualize_kernel(cov, X_kern, title)
    fig.show()

Notice how the functions become progressively smoother as we go from Matérn 1/2 to 5/2. The Matérn 1/2 (also called Exponential kernel) produces jagged, continuous but non-differentiable functions, which makes it well suited to modeling rough phenomena.
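The smoothness ordering shows up in how quickly each kernel's correlation drops at very short distances. The sketch below writes out the standard closed forms of the three Matérn kernels and the ExpQuad in NumPy (unit variance, as functions of the distance $r$); the function names are hypothetical and defined only for this comparison.

```python
import numpy as np

def matern12(r, ls):
    """Matern 1/2 (exponential) correlation at distance r."""
    return np.exp(-r / ls)

def matern32(r, ls):
    """Matern 3/2 correlation at distance r."""
    s = np.sqrt(3.0) * r / ls
    return (1.0 + s) * np.exp(-s)

def matern52(r, ls):
    """Matern 5/2 correlation at distance r."""
    s = np.sqrt(5.0) * r / ls
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def exp_quad(r, ls):
    """Squared-exponential correlation at distance r."""
    return np.exp(-0.5 * r**2 / ls**2)

ls = 1.5
r_small = 0.01  # a very short distance relative to the lengthscale

# How much correlation is lost after a tiny step away from a point
drops = {
    "Matern 1/2": 1.0 - matern12(r_small, ls),
    "Matern 3/2": 1.0 - matern32(r_small, ls),
    "Matern 5/2": 1.0 - matern52(r_small, ls),
    "ExpQuad": 1.0 - exp_quad(r_small, ls),
}
for name, drop in drops.items():
    print(f"{name:>10}: correlation lost at r={r_small} is {drop:.2e}")
```

Matérn 1/2 loses correlation linearly in $r$ near zero, which is what produces the jagged sample paths, while the other three lose it quadratically with progressively smaller coefficients.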

### Periodic Kernel

When your data has periodic structure (seasons, daily cycles, etc.), the Periodic kernel is your friend:

$$k(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi|x - x'|/T)}{\ell^2}\right)$$

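where $T$ is the period. This kernel enforces exact periodicity. Note that `pm.gp.cov.Periodic` scales the exponent differently, using $1/(2\ell^2)$ in place of $2/\ell^2$, and has no built-in $\sigma^2$; dividing `ls` by 2 recovers the form above.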

In [None]:
cov_periodic = pm.gp.cov.Periodic(input_dim=1, period=3.0, ls=1.0)
fig = visualize_kernel(cov_periodic, X_kern, "Periodic Kernel (period=3.0)")
fig.show()

See the repeating pattern? Functions drawn from this GP will have exact periodicity. In real applications, you often combine Periodic with other kernels to model periodic trends that slowly evolve over time.
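Exact periodicity means that two inputs separated by a whole number of periods are perfectly correlated. The sketch below checks this with a NumPy version of the formula written above (the common $2/\ell^2$ scaling, not necessarily the one PyMC uses internally); `periodic_kernel` is a hypothetical helper for this illustration.

```python
import numpy as np

def periodic_kernel(x1, x2, period, lengthscale, variance=1.0):
    """Periodic kernel in the form given above (hypothetical helper)."""
    d = np.abs(x1[:, None] - x2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2)

# Grid step is 0.1, so 30 grid points correspond to one period of 3.0
x = np.linspace(0, 9, 91)
K = periodic_kernel(x, x, period=3.0, lengthscale=1.0)

corr_one_period = K[0, 30]    # inputs exactly one period apart
corr_half_period = K[0, 15]   # inputs half a period apart

# The covariance as a function of lag repeats every period
lag_pattern_repeats = np.allclose(K[0, :61], K[0, 30:])

print("Correlation one period apart:", corr_one_period)
print("Correlation half a period apart:", corr_half_period)
print("Lag pattern repeats every period:", lag_pattern_repeats)
```

The correlation is lowest half a period away and returns to 1 at every full period, which is why every sampled function repeats itself exactly.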

### 🤖 EXERCISE 5: Explore Lengthscale Effects

Let's explore how lengthscale affects different kernels systematically.

In [None]:
# 🤖 EXERCISE 5: Use your LLM to help complete this visualization

def visualize_lengthscale_effects(X, lengthscales=(0.3, 1.0, 3.0), kernel_type='ExpQuad'):
    """
    Visualize how lengthscale affects GP prior samples.
    
    Parameters:
    -----------
    X : array
        Input points
    lengthscales : tuple
        Different lengthscales to compare
    kernel_type : str
        'ExpQuad', 'Matern32', or 'Matern52'
        
    Prompt suggestion: "Help me create a function that draws GP prior samples
    for multiple lengthscales using PyMC kernels. Create Plotly subplots showing
    3 functions for each lengthscale. Support ExpQuad, Matern32, and Matern52 kernels."
    """
    # REFERENCE SOLUTION:
    fig = make_subplots(rows=1, cols=len(lengthscales),
                        subplot_titles=[f'lengthscale = {ls}' for ls in lengthscales])
    
    for idx, ls in enumerate(lengthscales, 1):
        # Select kernel
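        if kernel_type == 'ExpQuad':
            cov = pm.gp.cov.ExpQuad(input_dim=1, ls=ls)
        elif kernel_type == 'Matern32':
            cov = pm.gp.cov.Matern32(input_dim=1, ls=ls)
        elif kernel_type == 'Matern52':
            cov = pm.gp.cov.Matern52(input_dim=1, ls=ls)
        else:
            # Fail early instead of hitting an undefined `cov` below
            raise ValueError(f"Unsupported kernel_type: {kernel_type!r}")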
        
        K = cov(X).eval() + 1e-6 * np.eye(len(X))
        samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
        
        for sample in samples:
            fig.add_trace(
                go.Scatter(x=X.flatten(), y=sample, mode='lines',
                          showlegend=False, line=dict(width=1.5)),
                row=1, col=idx
            )
    
    fig.update_xaxes(title_text="x")
    fig.update_yaxes(title_text="f(x)", col=1)
    fig.update_layout(height=300, width=1000,
                      title_text=f"Lengthscale Effects on {kernel_type} Kernel")
    return fig

# Test with different kernels
X_test = np.linspace(0, 10, 100)[:, None]
for kernel in ['ExpQuad', 'Matern32', 'Matern52']:
    fig = visualize_lengthscale_effects(X_test, kernel_type=kernel)
    fig.show()

The exercise above demonstrates that lengthscale has consistent effects across kernel families: smaller lengthscales create more rapidly varying functions, while larger lengthscales produce smoother, slowly varying functions. The kernel family (ExpQuad vs. Matérn) controls the *type* of smoothness, while lengthscale controls the *scale* at which variation occurs.
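The visual impression can be backed by a simple number. The sketch below uses the average absolute change between neighboring grid points as a rough "wiggliness" score and compares it across lengthscales. It assumes a unit-variance squared-exponential kernel built in NumPy; the score itself (`mean_abs_increment`) is an ad hoc choice for this illustration, not a standard diagnostic.

```python
import numpy as np

rng_demo = np.random.default_rng(42)
x = np.linspace(0, 10, 100)
sq_dist = (x[:, None] - x[None, :]) ** 2

def mean_abs_increment(lengthscale, n_draws=200):
    """Average |f(x_{i+1}) - f(x_i)| over prior draws (ad hoc roughness score)."""
    K = np.exp(-0.5 * sq_dist / lengthscale**2) + 1e-6 * np.eye(len(x))
    draws = rng_demo.multivariate_normal(np.zeros(len(x)), K, size=n_draws)
    return np.abs(np.diff(draws, axis=1)).mean()

roughness = {ls: mean_abs_increment(ls) for ls in (0.3, 1.0, 3.0)}
for ls, score in roughness.items():
    print(f"lengthscale={ls}: mean absolute step = {score:.3f}")
```

The score shrinks as the lengthscale grows, matching the plots: neighboring function values are more tightly correlated, so the function changes less from one grid point to the next.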

Now you have a solid understanding of the main kernel building blocks. In Session 2, we'll explore how to combine these kernels to model complex phenomena like trends plus seasonality!

## Section 1.8: Summary and Next Steps

Congratulations! You've completed Session 1 and built a strong foundation for working with Gaussian processes in PyMC.

### What We Covered

We took a three-part journey through the fundamentals:

**1. Bayesian Inference Principles**
- The three-step workflow: specify, calculate, check
- Bayes' theorem: posterior ∝ likelihood × prior
- Conjugacy and analytical posteriors (Beta-Binomial example)
- How data updates beliefs and reduces uncertainty

**2. The PyMC API**
- Building your first PyMC model with real baseball data
- Priors, likelihoods, and the `observed=` keyword
- MCMC sampling with `pm.sample()` and nutpie
- Posterior predictive checks for model validation
- Coordinate systems (`coords=` and `dims=`) for structured models
- Multi-level models with shared parameters

**3. Gaussian Process Components**
- The connection between multivariate normals and Gaussian processes
- Functions as infinite-dimensional vectors
- Mean functions: zero, constant, and linear
- Covariance functions (kernels): ExpQuad, Matérn family, Periodic
- How lengthscale and variance parameters control smoothness and scale
- Drawing sample functions from GP priors

### The Path Forward

We've deliberately kept things modular. You now understand:
- **What** Gaussian processes are (distributions over functions)
- **How** they're built (mean functions + covariance functions)
- **How** to implement Bayesian models in PyMC

In **Session 2**, we'll combine everything:
- Building GP models in PyMC with `pm.gp.Marginal` and `pm.gp.Latent`
- Fitting GPs to real data and making predictions
- Kernel composition: combining kernels to model complex patterns
- Non-Gaussian likelihoods for classification and count data
- Understanding marginal vs latent formulations

You're well-prepared. The concepts we've learned (Bayesian updating, PyMC workflow, and GP theory) will come together into a powerful modeling framework.

See you in Session 2!