# Session 2: Kernels, Likelihoods, and Model Building

**Duration:** 2-3 hours

## Learning Objectives

By the end of this session, you will:

- Understand different kernel families and their properties
- Learn kernel composition via addition and multiplication
- Understand marginal vs latent GP formulations
- Work with non-Gaussian likelihoods
- Build additive and multiplicative models for real-world data

This session builds directly on Session 1's foundations. We'll move beyond basic GP models to explore the rich toolkit of covariance functions and likelihood choices that make Gaussian processes so flexible for real-world problems.

## Setup and Imports

Let's begin by loading our standard toolkit. Notice we're using polars for data manipulation and plotly for visualization, following modern Python best practices for data science.

In [None]:
import arviz as az
import numpy as np
import polars as pl
import plotly.express as px
import plotly.graph_objects as go
import pymc as pm
import pytensor.tensor as pt
import scipy.stats as stats

from plotly.subplots import make_subplots
from sklearn.cluster import KMeans

RANDOM_SEED = 8675309
rng = np.random.default_rng(RANDOM_SEED)

# Print versions for reproducibility
print(f"PyMC version: {pm.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Polars version: {pl.__version__}")
print(f"ArviZ version: {az.__version__}")

If the version numbers above printed without errors, the environment is ready and we can move on to the world of kernels.

## Section 2.1: Introduction and Recap

In Session 1, we built intuition about Gaussian processes by connecting multivariate normal distributions to functions. We learned that the **covariance function** (or kernel) is the heart of a GP: it encodes our assumptions about how similar function values should be at different input locations. We also saw how to build simple GP models in PyMC using `pm.gp.Marginal` with Gaussian noise.

This session takes us deeper. Real-world problems rarely involve perfectly smooth functions with Gaussian noise. You might encounter:

- **Periodic patterns** that repeat over time (think seasonal sales, daily temperatures)
- **Long-term trends** combined with weekly patterns (like the births data we'll explore)
- **Binary outcomes** where we need classification, not regression
- **Heavy-tailed noise** that makes outliers more likely than a normal distribution suggests

Fortunately, the GP framework is remarkably flexible. By choosing appropriate kernels and likelihoods, we can handle all these scenarios. Let's see how.

## Section 2.2: The Kernel Zoo

Kernels are the building blocks that define the structure of our GP. Different kernels encode different assumptions about function smoothness, periodicity, and behavior. Let's explore the most important kernel families and develop intuition for when to use each one.

### Understanding Kernel Parameters

Before diving into specific kernels, let's understand the two parameters that appear in most covariance functions:

**Lengthscale ($\ell$)**: Controls how quickly the covariance between points decays with distance. Think of it as the "wiggliness" dial:
- Small lengthscale → function changes rapidly, lots of wiggles
- Large lengthscale → function changes slowly, smooth and gradually varying

**Amplitude or scale ($\eta$)**: Controls the vertical scale of variations. It determines how far from the mean function values typically deviate:
- Small amplitude → function stays close to the mean
- Large amplitude → function can wander far from the mean
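
To make the two dials concrete, here is how they enter the squared exponential (ExpQuad) kernel, written in the same "amplitude squared times a unit-variance kernel" form used in the code cells of this section. Other kernel families use different formulas, but $\ell$ and $\eta$ play the same roles.

$$
k(x, x') = \eta^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)
$$

Setting $x = x'$ gives the prior variance $\eta^2$, and at a separation of exactly one lengthscale the correlation has fallen to $e^{-1/2} \approx 0.61$.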

These parameters have profound effects on the GP prior. Let's see this in action by comparing different kernel families.

### Kernels for Different Smoothness Assumptions

The degree of smoothness (differentiability) is a key property that distinguishes kernels. We'll compare three important kernels that span the spectrum from rough to infinitely smooth.

First, let's create a helper function to visualize GP priors. This function will help us understand how each kernel shapes the functions we can represent.

In [None]:
def plot_gp_prior_samples(X, cov_func, n_samples=5, title="GP Prior Samples"):
    """
    Plot samples from a GP prior to visualize kernel behavior.
    
    Parameters
    ----------
    X : array
        Input locations (1D array, will be reshaped to column vector)
    cov_func : PyMC covariance function
        The kernel to visualize
    n_samples : int
        Number of function samples to draw
    title : str
        Plot title
    """
    X = X.reshape(-1, 1)
    
    # Compute covariance matrix and add jitter for numerical stability
    K = cov_func(X).eval() + 1e-6 * np.eye(len(X))
    
    # Draw samples from the GP prior
    samples = rng.multivariate_normal(np.zeros(len(X)), K, size=n_samples)
    
    # Create plotly figure
    fig = go.Figure()
    
    for i in range(n_samples):
        fig.add_trace(go.Scatter(
            x=X.flatten(),
            y=samples[i],
            mode='lines',
            name=f'Sample {i+1}',
            opacity=0.7,
            line=dict(width=2)
        ))
    
    fig.update_layout(
        title=title,
        xaxis_title='X',
        yaxis_title='f(x)',
        hovermode='x unified',
        width=900,
        height=500
    )
    
    return fig

Now let's explore the three most commonly used stationary kernels. We'll use the same lengthscale and amplitude for each so we can directly compare their smoothness properties.

In [None]:
# Create input points
X_grid = np.linspace(0, 10, 200)

# Fixed parameters for fair comparison
lengthscale = 1.0
amplitude = 1.0

# Define three kernels with different smoothness properties
kernels = {
    'ExpQuad (RBF)': amplitude**2 * pm.gp.cov.ExpQuad(1, lengthscale),
    'Matern52': amplitude**2 * pm.gp.cov.Matern52(1, lengthscale),
    'Matern32': amplitude**2 * pm.gp.cov.Matern32(1, lengthscale),
}

# Plot prior samples for each kernel
for name, cov in kernels.items():
    fig = plot_gp_prior_samples(X_grid, cov, n_samples=5, title=f'{name} Kernel - Prior Samples')
    fig.show()

### Interpreting the Kernel Comparisons

Looking at these prior samples, you should notice clear differences in smoothness:

**ExpQuad (Squared Exponential / RBF)**: These functions are infinitely differentiable: silky smooth with no sharp corners. The ExpQuad kernel assumes the underlying function is extremely well-behaved. This is great for phenomena like temperature changes or other physical processes that vary continuously. However, it can be *too* smooth for many real-world applications, potentially over-smoothing genuine features in the data.

**Matérn 5/2**: These functions are twice differentiable but not infinitely smooth. Notice they're still quite smooth but allow slightly more flexibility than ExpQuad. The Matérn 5/2 is often the go-to choice in practice: it's smooth enough for most applications but doesn't impose unrealistic smoothness assumptions. Think of it as the "Goldilocks" kernel.

**Matérn 3/2**: These functions are once differentiable, allowing them to have sharper features and more dramatic changes. This kernel works well when you expect the function to change more abruptly or have kinks. It's particularly useful for phenomena that might have sudden transitions.

**Practical guidance**: When in doubt, start with Matérn 5/2. It's flexible enough for most applications without the over-smoothing tendency of ExpQuad. Reserve ExpQuad for truly smooth phenomena, and use Matérn 3/2 when you anticipate sharper features or want to allow the data to speak more strongly about local variations.
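
The smoothness differences are easier to see when the three correlation functions are written out. The sketch below is a plain NumPy version of the standard ExpQuad, Matérn 5/2 and Matérn 3/2 formulas with unit amplitude; the function names are illustrative and are not part of PyMC.

```python
import numpy as np


def expquad_corr(r, ls):
    """Squared exponential correlation at separation r."""
    return np.exp(-0.5 * (r / ls) ** 2)


def matern52_corr(r, ls):
    """Matern 5/2 correlation at separation r."""
    s = np.sqrt(5.0) * np.abs(r) / ls
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)


def matern32_corr(r, ls):
    """Matern 3/2 correlation at separation r."""
    s = np.sqrt(3.0) * np.abs(r) / ls
    return (1.0 + s) * np.exp(-s)


# Separations measured in units of the lengthscale (ls = 1)
separations = np.array([0.0, 0.5, 1.0, 2.0, 3.0])
corr_table = {
    "ExpQuad": expquad_corr(separations, 1.0),
    "Matern52": matern52_corr(separations, 1.0),
    "Matern32": matern32_corr(separations, 1.0),
}

for name, values in corr_table.items():
    print(f"{name:9s}", np.round(values, 3))
```

At one lengthscale, ExpQuad retains the most correlation and Matérn 3/2 the least, which matches the rougher look of the Matérn 3/2 samples. At three lengthscales the order reverses, because the ExpQuad tail decays faster.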

### Exploring Lengthscale Effects

Let's deepen our intuition by seeing how the lengthscale parameter affects function behavior. This is one of the most important parameters you'll tune in practice.

In [None]:
# Compare different lengthscales with the same kernel
lengthscales = [0.3, 1.0, 3.0]
amplitude = 1.0

for ls in lengthscales:
    cov = amplitude**2 * pm.gp.cov.Matern52(1, ls)
    fig = plot_gp_prior_samples(
        X_grid, 
        cov, 
        n_samples=5, 
        title=f'Matérn 5/2 with lengthscale = {ls}'
    )
    fig.show()

### Understanding Lengthscale Through the Visualizations

Notice how dramatically the lengthscale changes function behavior:

With **lengthscale = 0.3**, the functions wiggle rapidly, changing direction frequently. This says "function values at points separated by more than a few multiples of 0.3 units are nearly independent" (for Matérn 5/2 the correlation is still about 0.5 at a separation of exactly one lengthscale). You'd use a small lengthscale when modeling high-frequency phenomena or when you have dense data and want to capture fine details.

With **lengthscale = 1.0**, we see moderate smoothness: functions change gradually but can still capture important variations. This is often a reasonable starting point for exploration.

With **lengthscale = 3.0**, the functions are very smooth and slowly varying. This says "function values stay correlated across long distances." Large lengthscales work well for modeling long-term trends or when you have sparse data that requires more smoothing.

The lengthscale essentially controls your model's **memory**: how far does the function need to travel before it "forgets" where it came from? This intuition will serve you well when building models.

### Periodic Kernels for Seasonal Patterns

Many real-world phenomena repeat: daily temperature cycles, weekly sales patterns, annual seasonal effects. For these, we need a kernel that encodes periodicity. The Periodic kernel is perfect for this.

The Periodic kernel has two key parameters beyond amplitude:
- **period**: The length of one complete cycle
- **lengthscale**: Controls smoothness *within* each period (how similar nearby points within a cycle should be)

Let's visualize periodic patterns and see how they differ from the kernels we've explored so far.

In [None]:
# Demonstrate periodic kernel (period of 5 units)
X_grid = np.linspace(0, 20, 400)  # Extended range to see multiple periods

period = 5.0  # Complete cycle every 5 units
lengthscale = 1.0
amplitude = 1.0

cov_periodic = amplitude**2 * pm.gp.cov.Periodic(1, period=period, ls=lengthscale)

fig = plot_gp_prior_samples(
    X_grid, 
    cov_periodic, 
    n_samples=5,
    title=f'Periodic Kernel (period={period}, lengthscale={lengthscale})'
)
fig.show()

### Interpreting Periodic Patterns

Notice how the functions repeat with the specified period. Because the kernel is purely periodic, the pattern from $x=0$ to $x=5$ is repeated (up to the tiny jitter) from $x=5$ to $x=10$, and so on. This is exactly what we want for seasonal data.

The lengthscale parameter controls smoothness *within* each period. A smaller lengthscale would allow more wiggly behavior within each cycle, while a larger lengthscale would make each cycle smoother.

**Real-world example**: If you're modeling daily births data with weekly seasonality, you'd set `period=7`. The GP would then learn that Mondays tend to be similar to other Mondays, Tuesdays to Tuesdays, and so forth, while still being flexible about the exact pattern.
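
The "Mondays resemble other Mondays" idea can be checked directly on a covariance matrix. This is a minimal NumPy sketch that assumes the parameterization $\exp\left(-\sin^2(\pi |x - x'| / T) / (2\ell^2)\right)$; conventions for the lengthscale factor differ between libraries, so treat the exact numbers as illustrative. The helper `periodic_kernel` is hypothetical and not a PyMC function.

```python
import numpy as np


def periodic_kernel(x1, x2, period, ls, eta=1.0):
    """Periodic covariance between two 1D arrays of inputs.

    Assumed form: eta^2 * exp(-sin^2(pi * |x - x'| / period) / (2 * ls^2)).
    """
    dist = np.abs(x1[:, None] - x2[None, :])
    return eta**2 * np.exp(-np.sin(np.pi * dist / period) ** 2 / (2.0 * ls**2))


# Four weeks of daily inputs with a weekly period
days = np.arange(0.0, 28.0)
K_week = periodic_kernel(days, days, period=7.0, ls=1.0)

same_weekday = K_week[0, 7]    # day 0 vs. the same weekday one week later
two_weeks = K_week[0, 14]      # same weekday two weeks later
adjacent_day = K_week[0, 1]    # neighbouring days within a week
mid_week = K_week[0, 3]        # days three apart within a week

print(f"same weekday: {same_weekday:.3f}, two weeks: {two_weeks:.3f}")
print(f"adjacent day: {adjacent_day:.3f}, three days apart: {mid_week:.3f}")
```

Points a whole number of periods apart are perfectly correlated regardless of how far apart they are in time, while the lengthscale only governs how quickly correlation falls off within a cycle.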

### The Linear Kernel for Trends

Sometimes your data contains linear or polynomial trends. The Linear kernel captures this by computing covariance proportional to the inner product of input locations. Unlike the stationary kernels we've seen, the Linear kernel is *non-stationary*: the covariance depends on the absolute position, not just the distance between points.

In [None]:
# Demonstrate linear kernel
X_grid = np.linspace(0, 10, 200)

c = 1.0  # Center point
variance = 1.0

cov_linear = variance * pm.gp.cov.Linear(1, c=c)

fig = plot_gp_prior_samples(
    X_grid,
    cov_linear,
    n_samples=5,
    title='Linear Kernel - Prior Samples'
)
fig.show()

### Understanding the Linear Kernel

These functions are straight lines: each draw is a random slope multiplied by $(x - c)$, so the only departure from linearity comes from the small jitter added in `plot_gp_prior_samples`. The Linear kernel alone is rarely sufficient for real data, but it becomes powerful when *combined* with other kernels (which we'll explore in Section 2.5).

The parameter `c` acts as a center point or offset. The covariance between two points $x$ and $x'$ is proportional to $(x - c)(x' - c)$. This makes the variance grow as you move away from $c$.
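
A short NumPy check makes the structure of this kernel explicit. It assumes the covariance is exactly $(x - c)(x' - c)$ with unit variance, as stated above; the seed and variable names are illustrative.

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # illustrative seed for this sketch

x = np.linspace(0.0, 10.0, 50)
c = 1.0

# Linear covariance as described above: k(x, x') = (x - c)(x' - c)
v = x - c
K_lin = np.outer(v, v)

# K_lin = v v^T has rank one, so a draw from N(0, K_lin) is a single
# standard-normal slope multiplied by (x - c)
slopes = rng_demo.standard_normal(3)
draws = slopes[:, None] * v[None, :]

rank = np.linalg.matrix_rank(K_lin)
prior_sd = np.sqrt(np.diag(K_lin))  # equals |x - c|

print("rank of K_lin:", rank)
print("prior sd at x = c:", prior_sd[np.argmin(np.abs(v))].round(3))
print("prior sd at x = 10:", prior_sd[-1].round(3))
```

The prior standard deviation is $|x - c|$, so every draw passes close to zero near $x = c$ and fans out linearly on either side.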

Now that we understand individual kernels and their properties, we're ready to see how to use them in real models. But first, let's address an important modeling decision: should we use `pm.gp.Marginal` or `pm.gp.Latent`?

## Section 2.3: Marginal Likelihood with pm.gp.Marginal

When your data are continuous observations with Gaussian noise, `pm.gp.Marginal` provides the most efficient implementation. It analytically integrates out the latent GP function, directly computing the marginal likelihood $p(y | x)$ without sampling the intermediate function values.

### Mathematical Intuition

Recall that a GP models a function $f(x)$, and our observations are:

$$
\begin{align}
f(x) &\sim \mathcal{GP}(m(x), k(x, x')) \\
y &= f(x) + \epsilon \\
\epsilon &\sim \mathcal{N}(0, \sigma^2)
\end{align}
$$

Because both the GP prior and the Gaussian noise are normal distributions, we can analytically integrate out $f$ to get:

$$
y \sim \mathcal{N}(m(X), K(X, X) + \sigma^2 I)
$$

This marginalization is exact and efficient: we don't need to sample the potentially high-dimensional latent function. The `marginal_likelihood` method implements this approach.
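
For intuition, the marginal density above can be evaluated by hand. The following is a minimal NumPy sketch, assuming a zero mean function and fixed hyperparameters; it is not how PyMC implements the computation internally, and the helper name `gp_log_marginal_likelihood` is hypothetical.

```python
import numpy as np


def gp_log_marginal_likelihood(y, K, sigma):
    """log N(y | 0, K + sigma^2 I), computed via a Cholesky factorization."""
    n = len(y)
    L = np.linalg.cholesky(K + sigma**2 * np.eye(n))
    # Solve (K + sigma^2 I) alpha = y using the factors L and L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)


# Toy inputs: ExpQuad kernel with lengthscale 1 and amplitude 1
x_demo = np.linspace(0.0, 5.0, 20)
K_demo = np.exp(-0.5 * (x_demo[:, None] - x_demo[None, :]) ** 2)
y_demo = np.sin(x_demo)

lml = gp_log_marginal_likelihood(y_demo, K_demo, sigma=0.3)
print(f"log marginal likelihood: {lml:.3f}")
```

The latent function values never appear: only the $n \times n$ matrix $K + \sigma^2 I$ is factorized, which is why sampling only has to explore the handful of hyperparameters.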

### When to Use pm.gp.Marginal

Choose `pm.gp.Marginal` when:
- Your likelihood is Gaussian (normal)
- You have continuous-valued observations
- You want the fastest inference possible
- You don't need samples from the latent function itself

Let's work through a complete example using real data.

### Example: Modeling CO₂ Concentrations

We'll use synthetic data patterned on a classic dataset: atmospheric CO₂ concentrations measured at Mauna Loa Observatory. Like the real record, it exhibits both a long-term increasing trend and a regular seasonal variation, a good demonstration of why we need flexible covariance functions.

Let's generate and visualize the data first. We'll use polars for data manipulation.

In [None]:
# Generate synthetic data similar to Mauna Loa CO2
# (In practice, you'd load real data here)
n_points = 100
X_train = np.linspace(0, 10, n_points)[:, None]

# True function: long-term trend + seasonal component + noise
def true_function(x):
    trend = 0.3 * x
    seasonal = 0.5 * np.sin(2 * np.pi * x)
    return trend + seasonal

f_true = true_function(X_train.flatten())
noise_std = 0.2
y_train = f_true + noise_std * rng.standard_normal(n_points)

# Create a polars dataframe for easier manipulation
df_train = pl.DataFrame({
    'x': X_train.flatten(),
    'y': y_train,
    'f_true': f_true
})

# Visualize the data
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=df_train['x'],
    y=df_train['y'],
    mode='markers',
    name='Observed data',
    marker=dict(size=4, color='black')
))

fig.add_trace(go.Scatter(
    x=df_train['x'],
    y=df_train['f_true'],
    mode='lines',
    name='True function',
    line=dict(color='dodgerblue', width=3)
))

fig.update_layout(
    title='Training Data with Trend and Seasonality',
    xaxis_title='Time',
    yaxis_title='Value',
    hovermode='x unified',
    width=900,
    height=500
)

fig.show()

### Observing the Data Patterns

Looking at this plot, we see exactly the patterns we discussed earlier: an increasing long-term trend combined with regular oscillations. The black dots show our noisy observations, while the blue line reveals the true underlying function.

A simple linear model would miss the oscillations entirely. A purely periodic model would miss the upward trend. We need a GP with a covariance function flexible enough to follow both features. As a first attempt we'll use a Matérn 5/2 kernel, which can flexibly model smooth, gradually varying patterns within the observed range.

Let's build the model.

### Building the Model

We'll specify priors for the hyperparameters (lengthscale, amplitude, and noise level), then use `pm.gp.Marginal` to efficiently compute the marginal likelihood. We'll use weakly informative priors that allow the data to guide us to reasonable values.

In [None]:
with pm.Model() as marginal_model:
    # Priors for hyperparameters
    # Lengthscale: how quickly covariance decays
    ell = pm.Gamma('ell', alpha=2, beta=1)
    
    # Amplitude: scale of variations
    eta = pm.HalfNormal('eta', sigma=2)
    
    # Define covariance function
    cov_func = eta**2 * pm.gp.cov.Matern52(1, ell)
    
    # Create the GP with mean function (default is zero)
    gp = pm.gp.Marginal(cov_func=cov_func)
    
    # Noise level
    sigma = pm.HalfNormal('sigma', sigma=0.5)
    
    # Marginal likelihood
    y_obs = gp.marginal_likelihood('y', X=X_train, y=y_train, sigma=sigma)
    
    # Sample from posterior
    marginal_trace = pm.sample(
        500,
        tune=1000,
        nuts_sampler='numpyro',
        chains=2,
        random_seed=rng
    )

### Understanding the Model Components

Let's break down what we just specified:

**Lengthscale prior (`ell`)**: We used a Gamma(2, 1) prior, which has most of its mass between 0.5 and 5. This weakly suggests that correlations decay over a moderate distance, but remains flexible enough for the data to pull us toward smaller or larger values if needed.

**Amplitude prior (`eta`)**: The HalfNormal(2) prior allows for a wide range of vertical scales, centered around moderate values. This is often a safe choice that lets the data speak.

**Noise prior (`sigma`)**: Another HalfNormal, this time with smaller scale, reflecting our expectation that observation noise is relatively modest compared to the signal.

The key method call is `gp.marginal_likelihood`, which efficiently computes $p(y | \theta)$ by marginalizing out the latent function. This is much faster than sampling the function explicitly.
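
The statement about where the Gamma(2, 1) lengthscale prior puts its mass can be verified with the closed-form CDF for shape 2 and rate 1, $F(x) = 1 - (1 + x)e^{-x}$. A small standard-library check:

```python
import math


def gamma_2_1_cdf(x):
    """CDF of a Gamma(alpha=2, beta=1) distribution: 1 - (1 + x) * exp(-x)."""
    return 1.0 - (1.0 + x) * math.exp(-x)


mass_between = gamma_2_1_cdf(5.0) - gamma_2_1_cdf(0.5)
print(f"P(0.5 < ell < 5) = {mass_between:.3f}")
```

Roughly 87% of the prior mass lies between 0.5 and 5, so the prior favours moderate lengthscales without ruling out shorter or longer ones.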

### Checking Inference Diagnostics

Before trusting our results, let's verify that sampling worked well. We'll check R-hat values and examine trace plots.

In [None]:
# Check R-hat values - should all be close to 1.0
summary = az.summary(marginal_trace, var_names=['ell', 'eta', 'sigma'])
print(summary)

# Check for divergences
divergences = marginal_trace.sample_stats['diverging'].sum().item()
print(f"\nNumber of divergences: {divergences}")

### Interpreting the Diagnostics

R-hat values close to 1.0 (typically < 1.01) indicate that our chains have converged and are exploring the same posterior distribution. Effective sample sizes (ESS) tell us how many independent samples we effectively have; higher is better, and we generally want at least a few hundred.

Zero divergences is ideal. Divergences suggest the sampler had trouble exploring certain regions of parameter space, often indicating problems with the model specification or prior choices. If you see divergences, you might need to reparameterize or use more informative priors.
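
To see what R-hat measures, here is a sketch of the classic Gelman-Rubin statistic in NumPy. It is a simplification: ArviZ reports a rank-normalized split version, so the numbers will not match `az.summary` exactly. The function name and the simulated chains are illustrative.

```python
import numpy as np


def classic_rhat(chains):
    """Classic Gelman-Rubin R-hat for an array of shape (n_chains, n_draws)."""
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()    # W: average within-chain variance
    between = n_draws * chain_means.var(ddof=1)   # B: variance between chain means
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(var_hat / within)


rng_demo = np.random.default_rng(1)  # illustrative seed

# Four chains drawing from the same distribution
mixed = rng_demo.standard_normal((4, 1000))

# Same draws, but one chain is stuck in a different region
stuck = mixed.copy()
stuck[0] += 3.0

rhat_mixed = classic_rhat(mixed)
rhat_stuck = classic_rhat(stuck)
print(f"well-mixed chains: {rhat_mixed:.3f}")
print(f"one stuck chain:   {rhat_stuck:.3f}")
```

When the chains agree, the between-chain variance adds almost nothing and the ratio stays near 1. A single chain exploring a different region inflates it well above the 1.01 threshold.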

Assuming our diagnostics look good, let's visualize the posterior.

In [None]:
# Visualize posterior distributions
az.plot_trace(marginal_trace, var_names=['ell', 'eta', 'sigma'], compact=True);

### Making Predictions

Now comes the exciting part: using our trained GP to make predictions at new input locations. We'll use the `conditional` method to compute the posterior predictive distribution.

In [None]:
# New X locations for prediction (including extrapolation)
X_new = np.linspace(-1, 12, 300)[:, None]

with marginal_model:
    # Conditional distribution for the latent function
    f_pred = gp.conditional('f_pred', X_new)
    
    # Sample from the posterior predictive
    marginal_ppc = pm.sample_posterior_predictive(
        marginal_trace,
        var_names=['f_pred'],
        random_seed=rng
    )

### Visualizing Posterior Predictions

Let's plot the posterior mean along with credible intervals to show our uncertainty. Notice how uncertainty grows as we extrapolate beyond the training data.

In [None]:
# Extract posterior samples and compute statistics
# (for a single variable, az.extract returns a DataArray of shape (n_points, n_samples))
f_pred_samples = az.extract(
    marginal_ppc,
    group='posterior_predictive',
    var_names='f_pred'
).values

f_pred_mean = f_pred_samples.mean(axis=1)
f_pred_lower = np.percentile(f_pred_samples, 2.5, axis=1)
f_pred_upper = np.percentile(f_pred_samples, 97.5, axis=1)

# Create visualization
fig = go.Figure()

# 95% credible interval
fig.add_trace(go.Scatter(
    x=X_new.flatten(),
    y=f_pred_upper,
    mode='lines',
    line=dict(width=0),
    showlegend=False,
    hoverinfo='skip'
))

fig.add_trace(go.Scatter(
    x=X_new.flatten(),
    y=f_pred_lower,
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(255, 0, 0, 0.2)',
    line=dict(width=0),
    name='95% Credible Interval'
))

# Posterior mean
fig.add_trace(go.Scatter(
    x=X_new.flatten(),
    y=f_pred_mean,
    mode='lines',
    name='Posterior mean',
    line=dict(color='red', width=2)
))

# Training data
fig.add_trace(go.Scatter(
    x=df_train['x'],
    y=df_train['y'],
    mode='markers',
    name='Observed data',
    marker=dict(size=4, color='black')
))

# True function
fig.add_trace(go.Scatter(
    x=df_train['x'],
    y=df_train['f_true'],
    mode='lines',
    name='True function',
    line=dict(color='dodgerblue', width=2, dash='dash')
))

# Add vertical lines to show training range
fig.add_vline(x=X_train.min(), line_dash='dot', line_color='gray', opacity=0.5)
fig.add_vline(x=X_train.max(), line_dash='dot', line_color='gray', opacity=0.5)

fig.update_layout(
    title='GP Marginal: Posterior Predictions with Uncertainty',
    xaxis_title='X',
    yaxis_title='f(X)',
    hovermode='x unified',
    width=900,
    height=500
)

fig.show()

### Interpreting the Predictions

This plot reveals several important features of GP regression:

**Within the training range** (between the vertical dotted lines), the posterior mean (red line) closely tracks the true function (blue dashed line), and the 95% credible interval (shaded region) is narrow. The model is confident because it has observed data in this region.

**Outside the training range**, the uncertainty grows rapidly. Notice how the credible interval widens as we extrapolate to the left and right. The GP is honestly expressing its uncertainty: "I haven't seen data out here, so I'm less sure about what's happening."

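Just beyond the data, the posterior mean still follows the most recent observations, but with a zero mean function and a stationary Matérn kernel it reverts toward zero once we are a few lengthscales away, so this model does not extrapolate the trend or the seasonality. The growing uncertainty reminds us to be cautious about predictions far from observed data. This is one of the great strengths of GPs: they provide uncertainty quantification automatically.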
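
The behaviour outside the training range can be reproduced with the closed-form GP conditional. This is a minimal NumPy sketch, assuming a zero mean function, a hand-written Matérn 5/2 kernel, and fixed hyperparameters (hypothetical values standing in for a single posterior draw). The model above instead averages over posterior draws of the hyperparameters.

```python
import numpy as np


def matern52(x1, x2, ls, eta):
    """Matern 5/2 covariance between two 1D arrays of inputs."""
    r = np.abs(x1[:, None] - x2[None, :])
    s = np.sqrt(5.0) * r / ls
    return eta**2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)


# Hypothetical fixed hyperparameters (one "posterior draw")
ls, eta, sigma = 1.0, 1.0, 0.2

rng_demo = np.random.default_rng(2)  # illustrative seed
x_obs = np.linspace(0.0, 10.0, 40)
y_obs = 0.3 * x_obs + 0.5 * np.sin(2 * np.pi * x_obs) + sigma * rng_demo.standard_normal(40)

# One point inside the data, then points increasingly far outside it
x_star = np.array([5.0, 10.5, 12.0, 30.0])

K = matern52(x_obs, x_obs, ls, eta) + sigma**2 * np.eye(len(x_obs))
K_s = matern52(x_obs, x_star, ls, eta)
K_ss = matern52(x_star, x_star, ls, eta)

# Conditional mean and covariance of the latent function at x_star
post_mean = K_s.T @ np.linalg.solve(K, y_obs)
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_sd = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

for xs, m, s in zip(x_star, post_mean, post_sd):
    print(f"x = {xs:5.1f}  mean = {m:6.3f}  sd = {s:.3f}")
```

The posterior standard deviation is smallest inside the data, grows past the boundary, and approaches `eta` far away, where the posterior mean also returns to the zero prior mean.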

Now that we understand `pm.gp.Marginal`, let's explore the more flexible but computationally intensive alternative: `pm.gp.Latent`.

## Section 2.4: Latent GPs with pm.gp.Latent

While `pm.gp.Marginal` is efficient for Gaussian likelihoods, many real-world problems don't produce continuous measurements with Gaussian noise. Consider:

- **Classification**: Predicting whether a patient has a disease (binary outcome)
- **Count data**: Number of events in a time period (discrete, non-negative)
- **Survival analysis**: Time until an event occurs (censored data)

For these cases, we need `pm.gp.Latent`, which keeps the GP function as a latent variable in the model and allows arbitrary likelihood functions.

### The Latent GP Approach

Instead of marginalizing out the GP, we explicitly sample it:

$$
\begin{align}
f &\sim \mathcal{GP}(m, k) \\
y_i &\sim \text{Likelihood}(g(f(x_i)), \theta)
\end{align}
$$

where $g$ is a link function (like logistic for binary outcomes) and Likelihood can be any distribution (Bernoulli, Poisson, Student-T, etc.).
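
As a concrete instance of this recipe, the sketch below pairs a logistic (inverse logit) link with a Bernoulli likelihood in plain NumPy. The latent values and outcomes are made up for illustration, and the helper names are not PyMC functions.

```python
import numpy as np


def invlogit(f):
    """Inverse logit link: maps log-odds to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-f))


def bernoulli_logit_loglik(y, f):
    """Sum of Bernoulli log-likelihoods log p(y | f) under a logit link.

    Uses y * f - log(1 + exp(f)), which stays stable for large |f|.
    """
    return np.sum(y * f - np.logaddexp(0.0, f))


f_demo = np.array([-2.0, 0.0, 3.0])  # hypothetical latent GP values (log-odds)
y_demo = np.array([0, 1, 1])         # hypothetical binary outcomes

p_demo = invlogit(f_demo)
loglik = bernoulli_logit_loglik(y_demo, f_demo)

print("probabilities:", np.round(p_demo, 3))
print(f"log-likelihood: {loglik:.3f}")
```

Swapping the link and likelihood (for example, an exponential link with a Poisson likelihood for counts) changes only these two functions; the GP prior on $f$ stays the same.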

### Trade-offs: Latent vs Marginal

**pm.gp.Marginal**:
- ✓ Fast: analytically integrates out the GP
- ✓ Efficient: fewer parameters to sample
- ✗ Limited: only works with Gaussian likelihoods

**pm.gp.Latent**:
- ✓ Flexible: works with any likelihood
- ✓ Includes GP samples in posterior
- ✗ Slower: must sample high-dimensional latent function
- ✗ Can have sampling challenges

Let's work through a classification example to see `pm.gp.Latent` in action.

### Example: Binary Classification with GPs

Imagine we're predicting whether a medical treatment is effective based on patient characteristics. The outcome is binary (success/failure), but we believe the probability of success varies smoothly with the input features. A GP with a Bernoulli likelihood is perfect for this.

Let's generate synthetic classification data where the probability varies smoothly with the input.

In [None]:
# Generate binary classification data
n_class = 150
X_class = np.linspace(0, 10, n_class)[:, None]

# True latent function (log-odds)
ell_true = 1.0
eta_true = 2.0
cov_true = eta_true**2 * pm.gp.cov.ExpQuad(1, ell_true)
K_true = cov_true(X_class).eval() + 1e-6 * np.eye(n_class)

f_true_class = rng.multivariate_normal(np.zeros(n_class), K_true)

# Convert to probabilities via logistic function
p_true = 1 / (1 + np.exp(-f_true_class))

# Generate binary observations
y_class = rng.binomial(1, p_true)

# Create polars dataframe
df_class = pl.DataFrame({
    'x': X_class.flatten(),
    'y': y_class,
    'p_true': p_true,
    'f_true': f_true_class
})

# Visualize
fig = go.Figure()

# True probability
fig.add_trace(go.Scatter(
    x=df_class['x'],
    y=df_class['p_true'],
    mode='lines',
    name='True p(y=1)',
    line=dict(color='dodgerblue', width=3)
))

# Binary observations (jittered for visibility; computed on the NumPy array)
y_jittered = y_class + 0.02 * rng.standard_normal(n_class)
fig.add_trace(go.Scatter(
    x=df_class['x'],
    y=y_jittered,
    mode='markers',
    name='Observed outcomes',
    marker=dict(size=5, color='black', symbol='x')
))

fig.update_layout(
    title='Binary Classification Data',
    xaxis_title='X',
    yaxis_title='Probability / Outcome',
    hovermode='x unified',
    width=900,
    height=500
)

fig.show()

### Understanding the Classification Data

The blue line shows the true probability that $y=1$ as a function of $x$. Notice how it varies smoothly, which is exactly the kind of structure a GP can capture. The black X markers show our binary observations, which are scattered around the probability curve (we've added small jitter for visibility).

In regions where the probability is close to 0 or 1, we see mostly failures or mostly successes. In the middle range where probability is around 0.5, we see a mix of both outcomes. Our goal is to recover this smooth probability function from the binary data alone.

This is a challenging problem: the observations are discrete, but we want to learn a smooth underlying probability. Let's see how `pm.gp.Latent` handles this.

### Building the Classification Model

We'll use a latent GP with a Bernoulli likelihood. The GP models the log-odds (logit) of the probability, and we transform to probabilities using the inverse logit function.

In [None]:
with pm.Model() as latent_model:
    # Hyperparameter priors
    ell = pm.Gamma('ell', alpha=2, beta=1)
    eta = pm.HalfNormal('eta', sigma=3)
    
    # Covariance function
    cov_func = eta**2 * pm.gp.cov.ExpQuad(1, ell)
    
    # Latent GP
    gp = pm.gp.Latent(cov_func=cov_func)
    
    # GP prior (this is the latent function)
    f = gp.prior('f', X=X_class)
    
    # Transform to probability via logistic function
    p = pm.Deterministic('p', pm.math.invlogit(f))
    
    # Bernoulli likelihood
    y_obs = pm.Bernoulli('y', p=p, observed=y_class)
    
    # Sample (this will be slower than Marginal)
    latent_trace = pm.sample(
        500,
        tune=1000,
        nuts_sampler='numpyro',
        chains=2,
        target_accept=0.95,
        random_seed=rng
    )

### Understanding the Latent Model Structure

The key differences from `pm.gp.Marginal` are:

**`gp.prior()`**: This explicitly adds the GP as a latent variable in the model. We're sampling the function values $f$ at our input locations.

**Deterministic transformation**: We use `pm.Deterministic` to track the probability $p = \text{logit}^{-1}(f)$. This is purely for convenience: we want to see probabilities in our posterior, not log-odds.

**Non-Gaussian likelihood**: The Bernoulli likelihood connects our latent function to the binary observations. This is what makes `pm.gp.Marginal` unsuitable: there's no way to marginalize a Bernoulli likelihood analytically.

**Slower sampling**: Because we're sampling a potentially high-dimensional latent function, this model will be slower than an equivalent Gaussian regression. The `target_accept=0.95` helps with sampling stability.

Let's check the diagnostics before looking at results.

In [None]:
# Check convergence
summary_latent = az.summary(latent_trace, var_names=['ell', 'eta'])
print(summary_latent)

# Check divergences
divergences_latent = latent_trace.sample_stats['diverging'].sum().item()
print(f"\nNumber of divergences: {divergences_latent}")

if divergences_latent == 0:
    print("✓ No divergences - sampling worked well!")
else:
    print("⚠ Warning: Some divergences detected. Consider reparameterization.")

### Visualizing the Posterior Probability Function

Now let's see how well we recovered the true probability function from the binary data.

In [None]:
# Extract posterior samples for probability
# (for a single variable, az.extract returns a DataArray directly)
p_samples = az.extract(latent_trace, var_names='p').values

p_mean = p_samples.mean(axis=1)
p_lower = np.percentile(p_samples, 2.5, axis=1)
p_upper = np.percentile(p_samples, 97.5, axis=1)

# Create visualization
fig = go.Figure()

# 95% credible interval
fig.add_trace(go.Scatter(
    x=X_class.flatten(),
    y=p_upper,
    mode='lines',
    line=dict(width=0),
    showlegend=False,
    hoverinfo='skip'
))

fig.add_trace(go.Scatter(
    x=X_class.flatten(),
    y=p_lower,
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(255, 0, 0, 0.2)',
    line=dict(width=0),
    name='95% Credible Interval'
))

# Posterior mean probability
fig.add_trace(go.Scatter(
    x=X_class.flatten(),
    y=p_mean,
    mode='lines',
    name='Posterior mean p(y=1)',
    line=dict(color='red', width=2)
))

# True probability
fig.add_trace(go.Scatter(
    x=df_class['x'],
    y=df_class['p_true'],
    mode='lines',
    name='True p(y=1)',
    line=dict(color='dodgerblue', width=2, dash='dash')
))

# Binary observations
fig.add_trace(go.Scatter(
    x=df_class['x'],
    y=y_jittered,
    mode='markers',
    name='Observed outcomes',
    marker=dict(size=5, color='black', symbol='x')
))

fig.update_layout(
    title='GP Classification: Recovered Probability Function',
    xaxis_title='X',
    yaxis_title='P(y=1)',
    hovermode='x unified',
    width=900,
    height=500
)

fig.show()

### Interpreting the Classification Results

This is quite remarkable: from binary 0/1 observations alone, we've recovered a smooth probability function that closely tracks the true underlying probability (blue dashed line).

The posterior mean (red line) captures the general shape of the true probability curve. The credible interval (shaded region) reflects our uncertainty; notice it's wider in regions with fewer data points or where observations are more variable.

The GP classification model has learned that probability varies smoothly with $x$, and it's appropriately uncertain in regions where the data are ambiguous (around $p \approx 0.5$). This is exactly the kind of calibrated uncertainty we want from a probabilistic model.

Now let's move on to one of the most powerful features of GPs: kernel composition.

## 🤖 EXERCISE 1: Binary GP Classifier

Now it's your turn to build a binary classification model using `pm.gp.Latent`.

### Task

Complete the function below that builds a GP classification model. You'll need to:
1. Define appropriate priors for lengthscale and amplitude
2. Create a covariance function (your choice of kernel)
3. Set up a `pm.gp.Latent` GP with Bernoulli likelihood
4. Sample from the posterior

**Prompt suggestion**: "Help me implement a Latent GP classification model in PyMC with an RBF kernel and Bernoulli likelihood using a logit link. The model should include priors for hyperparameters and return an InferenceData object."

In [None]:
def build_binary_latent_gp(X, y, kernel_type='ExpQuad'):
    """
    Build a pm.gp.Latent classification model with Bernoulli likelihood.
    
    Parameters
    ----------
    X : array, shape (n, 1)
        Input features
    y : array, shape (n,)
        Binary outcomes (0 or 1)
    kernel_type : str
        Type of kernel to use: 'ExpQuad', 'Matern52', or 'Matern32'
    
    Returns
    -------
    model : pm.Model
        The PyMC model
    trace : az.InferenceData
        Posterior samples
    """
    # YOUR LLM-ASSISTED CODE HERE
    pass

# Test your implementation
# Uncomment the lines below once you've implemented the function
# model, trace = build_binary_latent_gp(X_class, y_class)
# az.summary(trace, var_names=['ell', 'eta'])

## Section 2.5: Additive and Multiplicative Kernels

One of the most powerful features of GPs is that we can **compose kernels** through addition and multiplication. This allows us to build sophisticated models that capture multiple patterns simultaneously.

### The Algebra of Kernels

If $k_1$ and $k_2$ are valid covariance functions, then:

**Addition ($k_1 + k_2$)**: The resulting function is a sum of independent components
- Use when: You have multiple independent sources of variation (e.g., trend + seasonality)
- Interpretation: Functions can vary in ways explained by *either* kernel

**Multiplication ($k_1 \times k_2$)**: The resulting function has features of both kernels interacting
- Use when: One pattern modulates another (e.g., growing seasonal amplitude)
- Interpretation: Functions must satisfy *both* kernels' constraints simultaneously
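
To make the algebra concrete, here is a minimal NumPy sketch, separate from the PyMC models in this session. The helper names `expquad` and `periodic` are ours, and the periodic formula is the common $\exp(-\sin^2(\pi|x-x'|/T)/(2\ell^2))$ parameterization, which we assume matches `pm.gp.cov.Periodic`. It checks that the elementwise sum and product of two kernel matrices have no meaningfully negative eigenvalues, and shows how the prior variances combine.

```python
import numpy as np

def expquad(x, ls):
    """Exponentiated-quadratic kernel matrix for 1-D inputs."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def periodic(x, period, ls):
    """Periodic kernel matrix (assumed parameterization, see text)."""
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-np.sin(np.pi * d / period) ** 2 / (2 * ls**2))

x = np.linspace(0, 3, 50)
K1 = expquad(x, ls=1.0)
K2 = periodic(x, period=1.0, ls=0.5)

K_sum = K1 + K2    # additive kernel
K_prod = K1 * K2   # multiplicative kernel (elementwise product)

# A valid covariance matrix has no negative eigenvalues (up to round-off)
min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
print(f"smallest eigenvalue, sum:     {min_eig_sum:.2e}")
print(f"smallest eigenvalue, product: {min_eig_prod:.2e}")

# Prior variance at each input is the diagonal of the kernel matrix
print("variance under sum:    ", np.diag(K_sum)[:3])
print("variance under product:", np.diag(K_prod)[:3])
```

With unit-amplitude kernels, the sum has prior variance 2 at every input (variances add), while the product keeps variance 1 (variances multiply).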

Let's explore these through the classic example: birth rates over time.

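### Example Data: Daily Births in the USA

Our example is modeled on the number of births per day in the United States from 1969-1988. To keep the true structure known, the cell below simulates three years of daily data rather than loading the real records. The real dataset is fascinating because it exhibits multiple patterns, which the simulation mimics: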
- Long-term trends (changing birth rates over decades)
- Annual seasonality (more births in certain months)
- Weekly patterns (fewer births on weekends due to scheduling)

In [None]:
# Simulate births data
# For this example, we create synthetic data with known structure
# In practice, you would load real birth data from the data/ folder

n_days = 365 * 3  # Three years of daily data
time = np.arange(n_days)
X_births = (time / 365)[:, None]  # Time in years

# Simulate births with multiple components:
# 1. Long-term trend (slight decrease)
trend = 100 - 2 * X_births.flatten()

# 2. Annual seasonality (more births in summer)
seasonal = 5 * np.sin(2 * np.pi * X_births.flatten())

# 3. Weekly pattern (fewer on weekends)
weekly = -3 * ((time % 7) >= 5)  # Dip on days 5 and 6 (weekend)

# Combine and add noise
y_births = trend + seasonal + weekly + 2 * rng.standard_normal(n_days)

# Create polars dataframe
df_births = pl.DataFrame({
    'time': X_births.flatten(),
    'day': time,
    'births': y_births,
    'trend': trend,
    'seasonal': seasonal,
    'weekly': weekly
})

# Visualize the data
fig = px.scatter(
    df_births.to_pandas(),
    x='time',
    y='births',
    title='Daily Birth Counts Over Time',
    labels={'time': 'Time (years)', 'births': 'Relative births'}
)

fig.update_traces(marker=dict(size=2, opacity=0.5))
fig.update_layout(width=900, height=500)
fig.show()

### Observing Multiple Patterns

Looking at this data, we can see:
- A gradual downward trend over the years
- Clear periodic oscillations (annual seasonality)
- Regular weekly dips (though harder to see at this scale)

No single kernel we've seen so far can capture all these patterns. We need to combine kernels. Let's build up an additive model piece by piece.

### Building an Additive Model: Trend + Seasonality

We'll use an **additive kernel** to model the trend and seasonality as independent components:
- A Matérn 5/2 kernel with long lengthscale for the gradual trend
- A Periodic kernel with period=1 year for seasonal variation

The key insight: $k_{\text{total}} = k_{\text{trend}} + k_{\text{seasonal}}$ lets each component explain different features of the data.

In [None]:
# Subsample for computational efficiency
n_train = 300
indices = rng.choice(n_days, size=n_train, replace=False)
indices.sort()

X_train_births = X_births[indices]
y_train_births = y_births[indices]

with pm.Model() as additive_model:
    # Trend component: smooth, long lengthscale
    ell_trend = pm.Gamma('ell_trend', alpha=2, beta=0.5)  # Prior favors larger lengthscales
    eta_trend = pm.HalfNormal('eta_trend', sigma=5)
    cov_trend = eta_trend**2 * pm.gp.cov.Matern52(1, ell_trend)
    
    # Seasonal component: periodic with yearly period
    ell_seasonal = pm.Gamma('ell_seasonal', alpha=2, beta=2)
    eta_seasonal = pm.HalfNormal('eta_seasonal', sigma=5)
    period = 1.0  # One year
    cov_seasonal = eta_seasonal**2 * pm.gp.cov.Periodic(1, period=period, ls=ell_seasonal)
    
    # ADDITIVE KERNEL: sum of trend and seasonal
    cov_total = cov_trend + cov_seasonal
    
    # GP with additive kernel
    gp = pm.gp.Marginal(cov_func=cov_total)
    
    # Likelihood
    sigma = pm.HalfNormal('sigma', sigma=3)
    y_obs = gp.marginal_likelihood('y', X=X_train_births, y=y_train_births, sigma=sigma)
    
    # Sample
    additive_trace = pm.sample(
        500,
        tune=1000,
        nuts_sampler='numpyro',
        chains=2,
        random_seed=rng
    )

### Understanding the Additive Model

The critical line is `cov_total = cov_trend + cov_seasonal`. This creates a new kernel where:
- The trend component can vary slowly and independently
- The seasonal component can oscillate annually
- The total variation is the sum of both

We gave each component its own hyperparameters (`ell_trend`, `eta_trend`, `ell_seasonal`, `eta_seasonal`), allowing the model to learn how much variation each component explains.

Let's visualize the predictions to see how well this captures the data.

In [None]:
# Make predictions
X_pred_births = np.linspace(0, 3.5, 400)[:, None]

with additive_model:
    f_pred_births = gp.conditional('f_pred', X_pred_births)
    ppc_births = pm.sample_posterior_predictive(
        additive_trace,
        var_names=['f_pred'],
        random_seed=rng
    )

In [None]:
# Extract predictions
f_pred_births_samples = az.extract(
    ppc_births.posterior_predictive,
    var_names=['f_pred']
)['f_pred'].values

f_pred_births_mean = f_pred_births_samples.mean(axis=1)
f_pred_births_lower = np.percentile(f_pred_births_samples, 2.5, axis=1)
f_pred_births_upper = np.percentile(f_pred_births_samples, 97.5, axis=1)

# Visualize
fig = go.Figure()

# Credible interval
fig.add_trace(go.Scatter(
    x=X_pred_births.flatten(),
    y=f_pred_births_upper,
    mode='lines',
    line=dict(width=0),
    showlegend=False,
    hoverinfo='skip'
))

fig.add_trace(go.Scatter(
    x=X_pred_births.flatten(),
    y=f_pred_births_lower,
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(255, 0, 0, 0.2)',
    line=dict(width=0),
    name='95% Credible Interval'
))

# Posterior mean
fig.add_trace(go.Scatter(
    x=X_pred_births.flatten(),
    y=f_pred_births_mean,
    mode='lines',
    name='Posterior mean',
    line=dict(color='red', width=2)
))

# Training data
fig.add_trace(go.Scatter(
    x=X_train_births.flatten(),
    y=y_train_births,
    mode='markers',
    name='Training data',
    marker=dict(size=3, color='black', opacity=0.5)
))

fig.update_layout(
    title='Additive GP: Trend + Seasonality',
    xaxis_title='Time (years)',
    yaxis_title='Births',
    hovermode='x unified',
    width=900,
    height=500
)

fig.show()

### Interpreting the Additive Model Results

The fitted curve reflects both the trend and the seasonal component:
- The overall downward trajectory captures the long-term trend
- The regular oscillations capture the annual seasonality
- The model extrapolates reasonably beyond the training data (past year 3)

This is the power of additive models: each kernel explains a different aspect of variation, and the GP automatically learns how to attribute variation to each component.
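
For a Gaussian likelihood, this attribution can be written down directly: the posterior mean of each component uses that component's own cross-covariance, applied to the shared weights $(K_{\text{trend}} + K_{\text{seasonal}} + \sigma^2 I)^{-1} y$. The sketch below illustrates this in plain NumPy on toy data with hand-picked (not fitted) hyperparameters. The helper functions, the toy data, and all numeric values are illustrative assumptions, not part of the model above.

```python
import numpy as np

rng_demo = np.random.default_rng(0)

def matern52(a, b, ls):
    """Matern 5/2 kernel between 1-D input arrays a and b."""
    s = np.sqrt(5.0) * np.abs(a[:, None] - b[None, :]) / ls
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def periodic(a, b, period, ls):
    """Periodic kernel between 1-D input arrays a and b."""
    d = np.abs(a[:, None] - b[None, :])
    return np.exp(-np.sin(np.pi * d / period) ** 2 / (2 * ls**2))

# Toy data: centered linear trend + yearly cycle + noise
x = np.sort(rng_demo.uniform(0, 3, 80))
y = -2.0 * (x - 1.5) + 5.0 * np.sin(2 * np.pi * x) + rng_demo.normal(0, 1.0, 80)
x_new = np.linspace(0, 3, 301)

# Hand-picked hyperparameters, for illustration only
eta_t, ls_t, eta_s, ls_s, sigma = 5.0, 3.0, 5.0, 1.0, 1.0

K_t = eta_t**2 * matern52(x, x, ls_t)
K_s = eta_s**2 * periodic(x, x, 1.0, ls_s)
weights = np.linalg.solve(K_t + K_s + sigma**2 * np.eye(len(x)), y)

# Each component's posterior mean uses only its own cross-covariance
mean_trend = eta_t**2 * matern52(x_new, x, ls_t) @ weights
mean_seasonal = eta_s**2 * periodic(x_new, x, 1.0, ls_s) @ weights
mean_total = mean_trend + mean_seasonal
```

Because the seasonal cross-covariance is exactly periodic, `mean_seasonal` repeats every year, and anything that does not repeat has to be absorbed by `mean_trend`.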

**Note**: We haven't captured the weekly pattern yet. That would require either:
1. Adding another periodic component with period = 7 days (converted to years)
2. Using a categorical variable for day-of-week (as in the births HSGP example in Session 3)
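
For the first option, the only subtlety is units: time is measured in years here, so a weekly period is `7 / 365`. A small NumPy check, using the same assumed periodic-kernel formula as `pm.gp.cov.Periodic` and an arbitrary illustrative lengthscale of 0.5:

```python
import numpy as np

DAYS_PER_YEAR = 365                  # matches the time / 365 scaling of X_births
period_weekly = 7 / DAYS_PER_YEAR    # one week, expressed in years

def periodic_kernel(x1, x2, period, ls):
    """Periodic kernel value between two scalar inputs."""
    d = np.abs(x1 - x2)
    return np.exp(-np.sin(np.pi * d / period) ** 2 / (2 * ls**2))

t0 = 100 / DAYS_PER_YEAR             # an arbitrary day, in years
same_weekday = periodic_kernel(t0, t0 + 7 / DAYS_PER_YEAR, period_weekly, ls=0.5)
half_week_apart = periodic_kernel(t0, t0 + 3.5 / DAYS_PER_YEAR, period_weekly, ls=0.5)

print(f"correlation one week apart:  {same_weekday:.3f}")
print(f"correlation half a week off: {half_week_apart:.3f}")
```

In the PyMC model this would become a third term, `pm.gp.cov.Periodic(1, period=period_weekly, ls=...)`, added to `cov_total` with its own amplitude and lengthscale priors.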

Let's now explore multiplicative kernels.

### Multiplicative Kernels: When Patterns Interact

Multiplication creates more complex behaviors. A product $k_1 \times k_2$ forces the function to satisfy both kernels' constraints simultaneously. This is useful when one pattern **modulates** another.

For example: imagine seasonal patterns whose amplitude grows over time. A purely additive model can't capture this: the seasonal oscillations would have constant amplitude. A multiplicative kernel can.

Let's create synthetic data with growing seasonal amplitude and model it.

In [None]:
# Create data with growing seasonal amplitude
X_mult = np.linspace(0, 5, 200)[:, None]

# Seasonal pattern with amplitude that grows linearly
amplitude = 1 + 0.5 * X_mult.flatten()
y_mult = amplitude * np.sin(4 * np.pi * X_mult.flatten()) + 0.3 * rng.standard_normal(200)

df_mult = pl.DataFrame({
    'x': X_mult.flatten(),
    'y': y_mult
})

# Visualize
fig = px.scatter(
    df_mult.to_pandas(),
    x='x',
    y='y',
    title='Data with Growing Seasonal Amplitude'
)
fig.update_traces(marker=dict(size=4, color='black'))
fig.update_layout(width=900, height=500)
fig.show()

### Observing the Multiplicative Pattern

Notice how the oscillations grow in amplitude as x increases. An additive model `Linear + Periodic` couldn't capture this: it would just add a linear trend to constant-amplitude oscillations.

Instead, we need `Linear × Periodic`, where the linear component modulates the amplitude of the periodic component.

In [None]:
with pm.Model() as mult_model:
    # Linear component for growing amplitude
    c = pm.Normal('c', mu=0, sigma=2)
    var_lin = pm.HalfNormal('var_lin', sigma=2)
    cov_linear = var_lin * pm.gp.cov.Linear(1, c=c)
    
    # Periodic component
    ell_per = pm.Gamma('ell_per', alpha=2, beta=2)
    eta_per = pm.HalfNormal('eta_per', sigma=2)
    period = 1.0  # Data repeat every 0.5; period 1.0 covers that as a harmonic (0.5 would match directly)
    cov_periodic = eta_per**2 * pm.gp.cov.Periodic(1, period=period, ls=ell_per)
    
    # MULTIPLICATIVE KERNEL
    cov_mult = cov_linear * cov_periodic
    
    gp = pm.gp.Marginal(cov_func=cov_mult)
    
    sigma = pm.HalfNormal('sigma', sigma=1)
    y_obs = gp.marginal_likelihood('y', X=X_mult, y=y_mult, sigma=sigma)
    
    mult_trace = pm.sample(
        500,
        tune=1000,
        nuts_sampler='numpyro',
        chains=2,
        random_seed=rng
    )

### Understanding the Multiplicative Model

The line `cov_mult = cov_linear * cov_periodic` creates a kernel where:
- The periodic component creates oscillations
- The linear component modulates their amplitude
- Together, they create growing seasonal patterns
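
One way to see the modulation is to look at the prior standard deviation each kernel implies. The NumPy sketch below uses helper functions of our own and a hypothetical offset `c = -2.0`, chosen because that is where the simulated amplitude `1 + 0.5 * x` would reach zero. In the model above, `c` is learned rather than fixed.

```python
import numpy as np

def linear(x, c):
    """Linear kernel matrix (x - c)(x' - c)."""
    return np.outer(x - c, x - c)

def periodic(x, period, ls):
    """Periodic kernel matrix for 1-D inputs."""
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-np.sin(np.pi * d / period) ** 2 / (2 * ls**2))

x = np.linspace(0, 5, 11)
c = -2.0  # hypothetical offset, see text

K_per = periodic(x, period=1.0, ls=0.5)
K_mult = linear(x, c) * K_per

# Prior standard deviation of the oscillating part at each x
sd_periodic_alone = np.sqrt(np.diag(K_per))   # constant: what an additive model gets
sd_mult = np.sqrt(np.diag(K_mult))            # |x - c|: grows linearly with x

for xi, s_add, s_mult in zip(x, sd_periodic_alone, sd_mult):
    print(f"x={xi:.1f}  periodic alone: {s_add:.2f}  linear * periodic: {s_mult:.2f}")
```

The periodic kernel on its own has the same spread everywhere, whereas the product scales that spread by the distance from `c`.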

Let's see how well it works.

In [None]:
# Predictions
X_pred_mult = np.linspace(-0.5, 6, 300)[:, None]

with mult_model:
    f_pred_mult = gp.conditional('f_pred', X_pred_mult)
    ppc_mult = pm.sample_posterior_predictive(
        mult_trace,
        var_names=['f_pred'],
        random_seed=rng
    )

In [None]:
# Extract and visualize
f_pred_mult_samples = az.extract(
    ppc_mult.posterior_predictive,
    var_names=['f_pred']
)['f_pred'].values

f_pred_mult_mean = f_pred_mult_samples.mean(axis=1)
f_pred_mult_lower = np.percentile(f_pred_mult_samples, 2.5, axis=1)
f_pred_mult_upper = np.percentile(f_pred_mult_samples, 97.5, axis=1)

fig = go.Figure()

# Credible interval
fig.add_trace(go.Scatter(
    x=X_pred_mult.flatten(),
    y=f_pred_mult_upper,
    mode='lines',
    line=dict(width=0),
    showlegend=False
))

fig.add_trace(go.Scatter(
    x=X_pred_mult.flatten(),
    y=f_pred_mult_lower,
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(255, 0, 0, 0.2)',
    line=dict(width=0),
    name='95% CI'
))

# Mean
fig.add_trace(go.Scatter(
    x=X_pred_mult.flatten(),
    y=f_pred_mult_mean,
    mode='lines',
    name='Posterior mean',
    line=dict(color='red', width=2)
))

# Data
fig.add_trace(go.Scatter(
    x=df_mult['x'],
    y=df_mult['y'],
    mode='markers',
    name='Data',
    marker=dict(size=4, color='black')
))

fig.update_layout(
    title='Multiplicative GP: Growing Seasonal Amplitude',
    xaxis_title='X',
    yaxis_title='Y',
    hovermode='x unified',
    width=900,
    height=500
)

fig.show()

### Interpreting the Multiplicative Results

Beautiful! The model has captured the growing amplitude of the oscillations. Notice how:
- The oscillation frequency stays constant (determined by the periodic kernel)
- The amplitude grows linearly (determined by the linear kernel)
- The interaction between them creates the observed pattern

When you extrapolate past $x=5$, the model continues this pattern: oscillations with ever-growing amplitude. This is sometimes exactly what you want, but be cautious: multiplicative models can extrapolate in surprising ways.

**Key takeaway**: Use multiplication when one pattern modulates another. Use addition when patterns are independent sources of variation.

## 🤖 EXERCISE 2: Additive Kernel Model

Your turn! Build an additive GP model that combines a long-term trend with seasonal variation.

### Task

Complete the function below that builds an additive GP model combining:
1. A Matérn 5/2 kernel for the trend (with long lengthscale)
2. A Periodic kernel for seasonality (with specified period)
3. Use `pm.gp.Marginal` for efficient inference

**Prompt suggestion**: "Help me define and fit an additive GP in PyMC that models a long-term trend using Matern52 and a seasonal component using Periodic kernel. The model should combine them with addition and use pm.gp.Marginal."

In [None]:
def additive_trend_seasonality_gp(X, y, period=1.0):
    """
    Combine ExpQuad/Matern (trend) plus Periodic (seasonality) in pm.gp.Marginal.
    
    Parameters
    ----------
    X : array, shape (n, 1)
        Input time points
    y : array, shape (n,)
        Observations
    period : float
        Period of seasonal variation
    
    Returns
    -------
    model : pm.Model
        The PyMC model
    trace : az.InferenceData
        Posterior samples
    """
    # YOUR LLM-ASSISTED CODE HERE
    # Hints:
    # - Use separate hyperparameters for trend and seasonal components
    # - Combine kernels with: cov_total = cov_trend + cov_seasonal
    # - Don't forget to specify sigma for observation noise
    pass

# Test your implementation
# model, trace = additive_trend_seasonality_gp(X_train_births, y_train_births, period=1.0)

## Section 2.6: Non-Gaussian Likelihoods - Robust Regression with Student-T

So far, we've used Gaussian noise (in `pm.gp.Marginal`) or Bernoulli outcomes (in classification). But real data often contain **outliers**, observations that don't fit the typical pattern. Gaussian distributions are not robust to outliers: a single extreme value can dramatically affect inference.

The **Student-T distribution** offers a robust alternative. It has heavier tails than a Gaussian, meaning it assigns higher probability to extreme values. This makes it much more tolerant of outliers.

### Why Student-T for Robust Regression?

The Student-T distribution has three parameters:
- **Location ($\mu$)**: The center, similar to the mean
- **Scale ($\sigma$)**: The spread, similar to standard deviation
- **Degrees of freedom ($\nu$)**: Controls tail heaviness

The degrees of freedom parameter is key:
- Small $\nu$ (e.g., 2-5) → very heavy tails, very robust to outliers
- Large $\nu$ (> 30) → approaches normal distribution
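
To put numbers on "tail heaviness", the sketch below compares standardized log-densities five scale units from the center, using only the standard library. The function names are ours, and the Student-T formula is the standard density with location 0 and scale 1.

```python
import math

def normal_logpdf(z):
    """Log-density of a standard normal at z."""
    return -0.5 * z**2 - 0.5 * math.log(2 * math.pi)

def student_t_logpdf(z, nu):
    """Log-density of a standard Student-T with nu degrees of freedom at z."""
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1) / 2 * math.log1p(z**2 / nu))

z = 5.0  # an observation five scale units from the center
for nu in (3, 30, 1000):
    print(f"nu={nu:>4}: log p = {student_t_logpdf(z, nu):.2f}")
print(f"normal:  log p = {normal_logpdf(z):.2f}")
```

The smaller $\nu$ is, the less surprising such an observation is to the model, which is why it pulls less on the fitted function.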

Since we need `pm.gp.Latent` for non-Gaussian likelihoods, let's see robust regression in action with Student-T likelihood.

### Example: Data with Outliers

Let's generate data where most observations are clean, but a few are extreme outliers.

In [None]:
# Generate data with outliers
n_robust = 100
X_robust = np.linspace(0, 10, n_robust)[:, None]

# True function
f_true_robust = 2 * np.sin(1.5 * X_robust.flatten())

# Most observations have small Gaussian noise
y_robust = f_true_robust + 0.3 * rng.standard_normal(n_robust)

# Add outliers: shift 10% of points by 3 to 6 units in a random direction
n_outliers = 10
outlier_indices = rng.choice(n_robust, size=n_outliers, replace=False)
y_robust[outlier_indices] += rng.choice([-1, 1], size=n_outliers) * rng.uniform(3, 6, size=n_outliers)

df_robust = pl.DataFrame({
    'x': X_robust.flatten(),
    'y': y_robust,
    'f_true': f_true_robust,
    'is_outlier': [i in outlier_indices for i in range(n_robust)]
})

# Visualize
fig = go.Figure()

# Regular points
df_regular = df_robust.filter(pl.col('is_outlier') == False)
fig.add_trace(go.Scatter(
    x=df_regular['x'],
    y=df_regular['y'],
    mode='markers',
    name='Regular observations',
    marker=dict(size=5, color='black')
))

# Outliers
df_outlier = df_robust.filter(pl.col('is_outlier') == True)
fig.add_trace(go.Scatter(
    x=df_outlier['x'],
    y=df_outlier['y'],
    mode='markers',
    name='Outliers',
    marker=dict(size=8, color='red', symbol='x')
))

# True function
fig.add_trace(go.Scatter(
    x=df_robust['x'],
    y=df_robust['f_true'],
    mode='lines',
    name='True function',
    line=dict(color='dodgerblue', width=3)
))

fig.update_layout(
    title='Data with Outliers',
    xaxis_title='X',
    yaxis_title='Y',
    hovermode='x unified',
    width=900,
    height=500
)

fig.show()

### Observing the Outliers

The red X markers show outliers: observations that are far from the true function (blue line). If we used a Gaussian likelihood, these outliers would pull our posterior predictions toward them, distorting our estimate of the true function.

Let's build a robust model using a Student-T likelihood with `pm.gp.Latent`.

In [None]:
with pm.Model() as robust_model:
    # Hyperparameters
    ell = pm.Gamma('ell', alpha=2, beta=1)
    eta = pm.HalfNormal('eta', sigma=3)
    
    # Covariance function
    cov_func = eta**2 * pm.gp.cov.Matern52(1, ell)
    
    # Latent GP
    gp = pm.gp.Latent(cov_func=cov_func)
    f = gp.prior('f', X=X_robust)
    
    # Student-T likelihood for robustness
    sigma = pm.HalfNormal('sigma', sigma=1)
    nu = pm.Gamma('nu', alpha=2, beta=0.1)  # Prior mean 20: allows heavy tails without forcing them
    
    y_obs = pm.StudentT(
        'y',
        mu=f,
        sigma=sigma,
        nu=nu,
        observed=y_robust
    )
    
    # Sample
    robust_trace = pm.sample(
        500,
        tune=1000,
        nuts_sampler='numpyro',
        chains=2,
        target_accept=0.95,
        random_seed=rng
    )

### Understanding the Robust Model

The key difference from our earlier Gaussian models is the likelihood:

```python
y_obs = pm.StudentT('y', mu=f, sigma=sigma, nu=nu, observed=y_robust)
```

We're learning the degrees of freedom parameter `nu` from the data. The prior `Gamma(2, 0.1)` has mean 20, so it is only weakly informative: it allows heavy tails (small nu) but also leaves room for the data to favor near-Gaussian behavior.
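
A quick closed-form check of what this prior implies, assuming (as in PyMC) that `alpha` is the shape and `beta` is the rate. For shape 2 the Gamma CDF simplifies to $1 - e^{-\beta x}(1 + \beta x)$, so no extra libraries are needed:

```python
import math

alpha, beta = 2.0, 0.1   # shape and rate, as in pm.Gamma('nu', alpha=2, beta=0.1)

prior_mean = alpha / beta          # mean of a Gamma(shape, rate)
prior_mode = (alpha - 1) / beta    # mode, valid for shape >= 1

def gamma_shape2_cdf(x, rate):
    """CDF of a Gamma distribution with shape 2 and the given rate."""
    return 1.0 - math.exp(-rate * x) * (1.0 + rate * x)

p_heavy = gamma_shape2_cdf(5.0, beta)               # P(nu < 5): very heavy tails
p_near_normal = 1.0 - gamma_shape2_cdf(30.0, beta)  # P(nu > 30): close to Gaussian

print(f"prior mean = {prior_mean:.0f}, prior mode = {prior_mode:.0f}")
print(f"P(nu < 5)  = {p_heavy:.3f}")
print(f"P(nu > 30) = {p_near_normal:.3f}")
```

Most of the prior mass sits between these two regimes, so the posterior for `nu` is largely driven by how extreme the outliers in the data are.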

Let's see how well this handles the outliers.

In [None]:
# Check the learned nu parameter
nu_summary = az.summary(robust_trace, var_names=['nu'])
print("Degrees of freedom posterior:")
print(nu_summary)
print(f"\nMean nu: {nu_summary['mean'].values[0]:.2f}")
print("(Small nu indicates heavy tails were needed)")

### Making Robust Predictions

Now let's predict and see if the model successfully ignored the outliers.

In [None]:
# Predictions
X_pred_robust = np.linspace(-1, 11, 300)[:, None]

with robust_model:
    f_pred_robust = gp.conditional('f_pred', X_pred_robust)
    ppc_robust = pm.sample_posterior_predictive(
        robust_trace,
        var_names=['f_pred'],
        random_seed=rng
    )

In [None]:
# Extract and visualize
f_pred_robust_samples = az.extract(
    ppc_robust.posterior_predictive,
    var_names=['f_pred']
)['f_pred'].values

f_pred_robust_mean = f_pred_robust_samples.mean(axis=1)
f_pred_robust_lower = np.percentile(f_pred_robust_samples, 2.5, axis=1)
f_pred_robust_upper = np.percentile(f_pred_robust_samples, 97.5, axis=1)

fig = go.Figure()

# Credible interval
fig.add_trace(go.Scatter(
    x=X_pred_robust.flatten(),
    y=f_pred_robust_upper,
    mode='lines',
    line=dict(width=0),
    showlegend=False
))

fig.add_trace(go.Scatter(
    x=X_pred_robust.flatten(),
    y=f_pred_robust_lower,
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(255, 0, 0, 0.2)',
    line=dict(width=0),
    name='95% CI'
))

# Posterior mean
fig.add_trace(go.Scatter(
    x=X_pred_robust.flatten(),
    y=f_pred_robust_mean,
    mode='lines',
    name='Posterior mean',
    line=dict(color='red', width=2)
))

# True function
fig.add_trace(go.Scatter(
    x=df_robust['x'],
    y=df_robust['f_true'],
    mode='lines',
    name='True function',
    line=dict(color='dodgerblue', width=2, dash='dash')
))

# Regular data
fig.add_trace(go.Scatter(
    x=df_regular['x'],
    y=df_regular['y'],
    mode='markers',
    name='Regular observations',
    marker=dict(size=5, color='black')
))

# Outliers
fig.add_trace(go.Scatter(
    x=df_outlier['x'],
    y=df_outlier['y'],
    mode='markers',
    name='Outliers',
    marker=dict(size=8, color='red', symbol='x')
))

fig.update_layout(
    title='Robust GP with Student-T Likelihood',
    xaxis_title='X',
    yaxis_title='Y',
    hovermode='x unified',
    width=900,
    height=500
)

fig.show()

### Interpreting the Robust Results

Excellent! The posterior mean (red line) closely follows the true function (blue dashed line), essentially ignoring the outliers (red X markers). The Student-T likelihood has successfully downweighted the extreme observations.

Compare this to what would happen with a Gaussian likelihood: the outliers would pull the predicted function toward them, creating bulges in the estimate. The Student-T likelihood's heavy tails allow the model to say "these observations are just noise from the tail of the distribution," rather than trying to fit them.
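
The downweighting can be seen in the gradient of the log-likelihood with respect to the mean, which measures how hard one observation pulls on the fit. For a Normal it is $r/\sigma^2$, growing without bound in the residual $r = y - \mu$. For a Student-T it is $(\nu+1)\,r/(\nu\sigma^2 + r^2)$, which peaks at $r = \sigma\sqrt{\nu}$ and then decays. A small sketch with illustrative values (not posterior estimates from the model above):

```python
def normal_score(r, sigma):
    """d/dmu of the Normal log-density, with residual r = y - mu."""
    return r / sigma**2

def student_t_score(r, sigma, nu):
    """d/dmu of the Student-T log-density, with residual r = y - mu."""
    return (nu + 1) * r / (nu * sigma**2 + r**2)

sigma, nu = 0.3, 4.0   # illustrative values only
for r in (0.3, 0.6, 3.0, 6.0):
    print(f"residual={r:>4}: normal pull={normal_score(r, sigma):6.2f}, "
          f"student-t pull={student_t_score(r, sigma, nu):5.2f}")
```

Under these values the Student-T pull is largest for residuals around 0.6 and shrinks for the 3 to 6 unit outliers in this example, while the Normal pull keeps growing.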

**When to use robust likelihoods**:
- When you suspect your data contains outliers
- When measurement errors might occasionally be much larger than usual
- When you want your model to be less sensitive to a few extreme values

The cost is slower sampling (since we use `pm.gp.Latent`), but the benefit is much more reliable inference in the presence of outliers.

## 🤖 EXERCISE 3: Compare Marginal vs Latent GPs

Your final exercise: implement both approaches on the same dataset and compare their speed and accuracy.

### Task

Complete the function below that:
1. Fits the data with `pm.gp.Marginal` (Gaussian likelihood)
2. Fits the same data with `pm.gp.Latent` (also with a Gaussian likelihood, for a fair comparison)
3. Reports fit time and predictive quality (e.g., RMSE on held-out data)

**Prompt suggestion**: "Help me implement two GP fits on the same data: one using pm.gp.Marginal and one using pm.gp.Latent with Normal likelihood. Compare their sampling time and posterior predictive performance."

In [None]:
import time  # Rebinds `time`, which held the day-index array in the births example

def compare_marginal_vs_latent(X_train, y_train, X_test, y_test):
    """
    Fit the same dataset with pm.gp.Marginal and pm.gp.Latent,
    report fit time and predictive quality.
    
    Parameters
    ----------
    X_train, y_train : arrays
        Training data
    X_test, y_test : arrays
        Test data for evaluation
    
    Returns
    -------
    results : dict
        Dictionary with timing and performance metrics
    """
    # YOUR LLM-ASSISTED CODE HERE
    # Hints:
    # - Use time.time() to measure sampling duration
    # - For Latent model, use pm.Normal likelihood with same parameters as Marginal
    # - Compute RMSE on test set for both models
    # - Return dict with keys: 'marginal_time', 'latent_time', 'marginal_rmse', 'latent_rmse'
    pass

# Test on the training data from earlier
# Split into train/test
# n_test = 20
# test_indices = rng.choice(len(X_train), size=n_test, replace=False)
# train_mask = np.ones(len(X_train), dtype=bool)
# train_mask[test_indices] = False
#
# results = compare_marginal_vs_latent(
#     X_train[train_mask], y_train[train_mask],
#     X_train[test_indices], y_train[test_indices]
# )
# print(results)

## Section 2.7: Summary and Next Steps

Congratulations! You've now mastered the core modeling choices in Gaussian process regression:

### Key Concepts from Session 2

**The Kernel Zoo**: We explored how different kernels encode different assumptions:
- ExpQuad, Matérn 5/2, and Matérn 3/2 for varying smoothness
- Periodic for seasonal patterns
- Linear for trends and non-stationary behavior

**Marginal vs Latent**: You learned when to use each approach:
- `pm.gp.Marginal`: Fast, efficient, limited to Gaussian likelihoods
- `pm.gp.Latent`: Flexible, works with any likelihood, slower
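
The speed difference comes from what each class asks the sampler to do. With a Gaussian likelihood the latent function can be integrated out in closed form, leaving $\log p(y) = -\tfrac{1}{2} y^\top (K + \sigma^2 I)^{-1} y - \tfrac{1}{2} \log|K + \sigma^2 I| - \tfrac{n}{2} \log 2\pi$, so only the hyperparameters need sampling. `pm.gp.Latent` has no such shortcut: the function values themselves become parameters, one per data point. A minimal NumPy sketch of the marginal quantity for a zero-mean GP follows. The function name is ours, and this illustrates the math rather than PyMC's internal code.

```python
import numpy as np

def gp_marginal_loglik(y, K, sigma):
    """Log marginal likelihood of y under a zero-mean GP with Gaussian noise."""
    n = len(y)
    L = np.linalg.cholesky(K + sigma**2 * np.eye(n))      # K + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()                    # equals 0.5 * log-determinant
            - 0.5 * n * np.log(2 * np.pi))

# Toy data and an ExpQuad kernel at three candidate lengthscales
x = np.linspace(0, 10, 50)
y = np.sin(x)
for ls in (0.1, 1.0, 10.0):
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / ls) ** 2)
    print(f"ls={ls:>4}: log p(y) = {gp_marginal_loglik(y, K, sigma=0.3):.1f}")
```

Each evaluation costs one Cholesky factorization of an $n \times n$ matrix, which is where the cubic scaling of exact GPs comes from.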

**Kernel Composition**: The power of combining kernels:
- Addition for independent effects (trend + seasonality)
- Multiplication for interactions (modulating amplitude)

**Non-Gaussian Likelihoods**: How to handle:
- Binary outcomes with Bernoulli likelihood
- Outliers with Student-T likelihood
- Any distribution supported by PyMC

### Practical Guidance

When building your own GP models:

1. **Start simple**: Begin with a single kernel (usually Matérn 5/2) and Gaussian likelihood
2. **Examine residuals**: Plot what your model gets wrong; this often reveals missing structure
3. **Add complexity**: If you see periodic patterns, add a Periodic kernel. If you see outliers, consider Student-T
4. **Compare models**: Use LOO-CV (with ArviZ) to compare different kernel choices
5. **Check diagnostics**: Always verify R-hat, ESS, and look for divergences

### What's Next: Session 3

Everything we've done so far has a major limitation: computational cost. Exact GP inference is $O(n^3)$ in the number of data points, making it impractical for datasets with thousands of observations.

In Session 3, we'll tackle this head-on:
- **Sparse GPs** with inducing points that approximate the full GP
- **Hilbert Space GPs (HSGP)** that use basis function expansions
- **Practical strategies** for choosing approximation parameters
- **When approximations are (and aren't) appropriate**

These techniques will let you scale GPs to large datasets while maintaining the flexibility we've explored today. See you in Session 3!