# Session 2: Kernels, Likelihoods, and Model Building

Welcome to Session 2 of our Gaussian Processes with PyMC workshop! In the previous session, we covered the fundamentals of Gaussian processes and their implementation in PyMC. Today, we'll dive deeper into the building blocks that make GPs so powerful and flexible.

## Learning Objectives

By the end of this session, you will be able to:

1. **Master different kernel families** and understand when to use each type
2. **Understand kernel composition** and how to combine kernels for complex patterns
3. **Implement non-Gaussian likelihoods** for classification, count data, and robust regression
4. **Navigate inference trade-offs** between computational efficiency and model flexibility
5. **Build robust GP models** that handle real-world complexities

## Session Overview

### Part A: Kernel Functions Deep Dive
- Understanding the role of kernels in GP modeling
- Exploring different kernel families (RBF, Matérn, Periodic, etc.)
- Kernel hyperparameters and their effects
- Practical guidelines for kernel selection

### Part B: Kernel Composition
- Mathematical foundations of kernel combination
- Addition and multiplication of kernels
- Building complex patterns through composition
- Real-world examples of composite kernels

### Part C: Non-Gaussian Likelihoods
- Beyond Gaussian noise: classification and count data
- Student-t likelihoods for robust regression
- Latent variable implementation for non-Gaussian outcomes
- Computational considerations

### Part D: Model Building Best Practices
- Choosing between Marginal vs Latent implementations
- Hyperparameter priors and initialization
- Model validation and diagnostics
- Handling computational challenges

## Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import pymc as pm
import arviz as az
import pytensor.tensor as pt

# Set random seed for reproducibility
RANDOM_SEED = 42
rng = np.random.default_rng(RANDOM_SEED)

# Configure plotting
%config InlineBackend.figure_format = 'retina'
plt.style.use('seaborn-v0_8-whitegrid')
az.style.use("arviz-darkgrid")

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Part A: Kernel Functions Deep Dive

The kernel (or covariance) function is the heart of a Gaussian process. It encodes our assumptions about the smoothness, periodicity, and other structural properties of the function we're modeling.

## Understanding Kernel Functions

A kernel function $k(x, x')$ measures the similarity between two input points $x$ and $x'$. The key properties of a valid kernel are:

1. **Symmetry**: $k(x, x') = k(x', x)$
2. **Positive semi-definiteness**: For any finite set of inputs $x_1, \dots, x_n$, the matrix $K_{ij} = k(x_i, x_j)$ must be positive semi-definite
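
As a quick numerical check of these two properties, the sketch below builds a Gram matrix by hand with NumPy rather than PyMC. The helper name `rbf_gram` and the chosen inputs are illustrative assumptions, not part of the workshop code.

```python
import numpy as np

def rbf_gram(x, lengthscale=1.0, variance=1.0):
    """Gram matrix K[i, j] = k(x_i, x_j) for a 1-D RBF kernel (illustrative helper)."""
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

x = np.linspace(-3, 3, 25)
K = rbf_gram(x)

# Symmetry: K equals its transpose
is_symmetric = np.allclose(K, K.T)

# Positive semi-definiteness: all eigenvalues are >= 0 up to round-off
min_eig = np.linalg.eigvalsh(K).min()

print(f"symmetric: {is_symmetric}, smallest eigenvalue: {min_eig:.2e}")
```

The smallest eigenvalue is typically tiny and can dip slightly below zero through floating-point error, which is why the cells in this session add a small jitter term (`1e-6 * np.eye(...)`) before taking a Cholesky factor.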

Let's explore the most commonly used kernel families and their characteristics.

## 1. Radial Basis Function (RBF) Kernel

The RBF kernel (also called Gaussian or squared exponential kernel) is perhaps the most widely used kernel:

$$k(x, x') = \sigma^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$$

Where:
- $\sigma^2$ is the variance parameter (controls output scale)
- $\ell$ is the lengthscale parameter (controls input scale)

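Sample functions drawn from a GP with an RBF kernel are infinitely differentiable, which makes this kernel suitable for very smooth functions and potentially too smooth for rougher real-world signals.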

In [None]:
# Let's explore how RBF kernel parameters affect the covariance structure
def plot_kernel_comparison():
    X = np.linspace(-3, 3, 100)[:, None]
    x_test = np.array([[0.0]])  # Reference point
    
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=[
            "Effect of Lengthscale (σ²=1)",
            "Effect of Variance (ℓ=1)", 
            "Sample Functions (ℓ=0.5)",
            "Sample Functions (ℓ=2.0)"
        ]
    )
    
    # Effect of lengthscale
    lengthscales = [0.3, 1.0, 2.0]
    colors = ['red', 'blue', 'green']
    
    for i, (ls, color) in enumerate(zip(lengthscales, colors)):
        with pm.Model() as model:
            cov = pm.gp.cov.ExpQuad(1, ls=ls)
            K = cov(X, x_test).eval()
            
        fig.add_trace(
            go.Scatter(
                x=X.flatten(), y=K.flatten(),
                name=f'ℓ={ls}', line=dict(color=color)
            ),
            row=1, col=1
        )
    
    # Effect of variance
    variances = [0.5, 1.0, 2.0]
    
    for var, color in zip(variances, colors):
        with pm.Model() as model:
            # The first positional argument is input_dim, so the variance
            # is applied by scaling the kernel instead
            cov = var * pm.gp.cov.ExpQuad(1, ls=1.0)
            K = cov(X, x_test).eval()
            
        fig.add_trace(
            go.Scatter(
                x=X.flatten(), y=K.flatten(),
                name=f'σ²={var}', line=dict(color=color),
                showlegend=False
            ),
            row=1, col=2
        )
    
    # Sample functions with different lengthscales
    for ls, row_col in [(0.5, (2,1)), (2.0, (2,2))]:
        with pm.Model() as model:
            cov = pm.gp.cov.ExpQuad(1, ls=ls)
            K = cov(X).eval() + 1e-6 * np.eye(len(X))
            
        # Sample from the prior
        L = np.linalg.cholesky(K)
        samples = L @ rng.standard_normal((len(X), 3))
        
        for i in range(3):
            fig.add_trace(
                go.Scatter(
                    x=X.flatten(), y=samples[:, i],
                    name=f'Sample {i+1}', 
                    line=dict(color=colors[i]),
                    showlegend=False
                ),
                row=row_col[0], col=row_col[1]
            )
    
    fig.update_layout(height=600, title="RBF Kernel Properties")
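    fig.update_xaxes(title="x")
    fig.update_yaxes(title="k(x, 0)", row=1)  # top row shows kernel values
    fig.update_yaxes(title="f(x)", row=2)     # bottom row shows prior samples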
    
    return fig

plot_kernel_comparison().show()

**Key Insights:**
- **Lengthscale ($\ell$)**: Controls how far the influence of a data point extends. Smaller values create more wiggly functions.
- **Variance ($\sigma^2$)**: Controls the overall scale of variation. Higher values allow for larger deviations from the mean.

## 2. Matérn Kernel Family

The Matérn kernel family provides more flexibility in modeling smoothness:

$$k(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu} \frac{|x - x'|}{\ell}\right)^\nu K_\nu\left(\sqrt{2\nu} \frac{|x - x'|}{\ell}\right)$$

Where $\nu$ controls smoothness and $K_\nu$ is the modified Bessel function of the second kind.

Common choices:
- $\nu = 1/2$: Exponential kernel (non-differentiable)
- $\nu = 3/2$: Once differentiable
- $\nu = 5/2$: Twice differentiable
- $\nu \to \infty$: RBF kernel (infinitely differentiable)
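
For the half-integer values listed above, the Bessel-function expression reduces to simple closed forms. The sketch below writes out those standard closed forms in NumPy (unit variance, with `r = |x - x'|`); the function names are illustrative and are not PyMC APIs.

```python
import numpy as np

def matern12(r, ls=1.0):
    """nu = 1/2: exponential kernel."""
    return np.exp(-r / ls)

def matern32(r, ls=1.0):
    """nu = 3/2."""
    a = np.sqrt(3.0) * r / ls
    return (1.0 + a) * np.exp(-a)

def matern52(r, ls=1.0):
    """nu = 5/2."""
    a = np.sqrt(5.0) * r / ls
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

def rbf(r, ls=1.0):
    """Limit nu -> infinity: squared exponential kernel."""
    return np.exp(-0.5 * (r / ls) ** 2)

r = np.array([0.0, 0.5, 1.0])
for name, k in [("1/2", matern12), ("3/2", matern32), ("5/2", matern52), ("inf", rbf)]:
    print(f"nu={name:>3}: {np.round(k(r), 4)}")
```

At one lengthscale of separation the correlation increases with $\nu$, which reflects the increasing smoothness of the corresponding sample functions.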

In [None]:
def compare_matern_kernels():
    X = np.linspace(-3, 3, 100)[:, None]
    x_test = np.array([[0.0]])
    
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=["Matérn Kernel Comparison", "Sample Functions"]
    )
    
    nus = [0.5, 1.5, 2.5, np.inf]
    nu_labels = ['1/2', '3/2', '5/2', '∞ (RBF)']
    colors = ['red', 'blue', 'green', 'orange']
    
    # Compare kernel shapes; each ν maps to the matching PyMC covariance class
    kernel_classes = {
        0.5: pm.gp.cov.Exponential,
        1.5: pm.gp.cov.Matern32,
        2.5: pm.gp.cov.Matern52,
        np.inf: pm.gp.cov.ExpQuad,
    }
    for nu, label, color in zip(nus, nu_labels, colors):
        with pm.Model() as model:
            cov = kernel_classes[nu](1, ls=1.0)
            K = cov(X, x_test).eval()
            
        fig.add_trace(
            go.Scatter(
                x=X.flatten(), y=K.flatten(),
                name=f'ν={label}', line=dict(color=color)
            ),
            row=1, col=1
        )
    
    # Sample functions from Matérn 3/2
    with pm.Model() as model:
        cov = pm.gp.cov.Matern32(1, ls=1.0)
        K = cov(X).eval() + 1e-6 * np.eye(len(X))
        
    L = np.linalg.cholesky(K)
    samples = L @ rng.standard_normal((len(X), 3))
    
    for i in range(3):
        fig.add_trace(
            go.Scatter(
                x=X.flatten(), y=samples[:, i],
                name=f'Matérn 3/2 Sample {i+1}',
                line=dict(color=colors[i]),
                showlegend=False
            ),
            row=1, col=2
        )
    
    fig.update_layout(height=400, title="Matérn Kernel Family")
    fig.update_xaxes(title="x")
    fig.update_yaxes(title="Kernel Value / Function Value")
    
    return fig

compare_matern_kernels().show()

## 3. Periodic Kernel

The periodic kernel is designed for functions with repeating patterns:

$$k(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right)$$

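Where $p$ is the period and $\ell$ is the lengthscale. Note that `pm.gp.cov.Periodic` scales the exponent differently, using $\sin^2(\cdot)/(2\ell^2)$ rather than $2\sin^2(\cdot)/\ell^2$, so the lengthscale must be halved when passed as `ls` to reproduce the formula above.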
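
To make the role of $p$ concrete, here is a small NumPy sketch of the kernel in the parametrization of the formula above. The function name `periodic_kernel` is an illustrative assumption; it is not the PyMC implementation.

```python
import numpy as np

def periodic_kernel(x1, x2, period, lengthscale=1.0, variance=1.0):
    """Periodic kernel in the parametrization of the formula above."""
    s = np.sin(np.pi * np.abs(x1 - x2) / period)
    return variance * np.exp(-2.0 * s**2 / lengthscale**2)

p = 2 * np.pi
x = np.linspace(0.0, 8.0, 50)

k_base = periodic_kernel(x, 2.0, period=p)
k_shifted = periodic_kernel(x + p, 2.0, period=p)  # inputs shifted by one full period
max_gap = np.max(np.abs(k_base - k_shifted))

# Half a period away from the reference point the correlation is at its minimum
k_half = periodic_kernel(2.0 + p / 2, 2.0, period=p)

print(f"max difference after a one-period shift: {max_gap:.2e}")
```

Shifting either input by a whole period leaves the covariance unchanged, so points that are far apart in $x$ can still be perfectly correlated.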

In [None]:
def demonstrate_periodic_kernel():
    X = np.linspace(0, 8, 200)[:, None]
    
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=["Periodic Kernel (p=2π)", "Sample Functions"]
    )
    
    # Kernel visualization
    x_test = np.array([[2.0]])
    
    with pm.Model() as model:
        cov = pm.gp.cov.Periodic(1, period=2*np.pi, ls=1.0)
        K = cov(X, x_test).eval()
        
    fig.add_trace(
        go.Scatter(
            x=X.flatten(), y=K.flatten(),
            name='Periodic Kernel', line=dict(color='blue')
        ),
        row=1, col=1
    )
    
    # Sample functions
    with pm.Model() as model:
        cov = pm.gp.cov.Periodic(1, period=2*np.pi, ls=1.0)
        K = cov(X).eval() + 1e-6 * np.eye(len(X))
        
    L = np.linalg.cholesky(K)
    samples = L @ rng.standard_normal((len(X), 3))
    
    colors = ['red', 'green', 'orange']
    for i in range(3):
        fig.add_trace(
            go.Scatter(
                x=X.flatten(), y=samples[:, i],
                name=f'Sample {i+1}',
                line=dict(color=colors[i]),
                showlegend=False
            ),
            row=1, col=2
        )
    
    fig.update_layout(height=400, title="Periodic Kernel")
    fig.update_xaxes(title="x")
    
    return fig

demonstrate_periodic_kernel().show()

# Part B: Kernel Composition

One of the most powerful aspects of GP modeling is the ability to combine kernels to create more complex covariance structures. This allows us to model functions with multiple characteristics simultaneously.

## Mathematical Foundations

If $k_1$ and $k_2$ are valid kernels, then:

1. **Addition**: $k(x, x') = k_1(x, x') + k_2(x, x')$ models functions that are the sum of processes with different characteristics

2. **Multiplication**: $k(x, x') = k_1(x, x') \cdot k_2(x, x')$ models functions where both characteristics must be present simultaneously
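
Both rules can be checked numerically. In the sketch below, which uses plain NumPy and arbitrarily chosen lengthscales and period, the sum and the elementwise product of two Gram matrices are again symmetric with non-negative eigenvalues (up to round-off).

```python
import numpy as np

x = np.linspace(0.0, 10.0, 40)
diff = x[:, None] - x[None, :]

K_rbf = np.exp(-0.5 * (diff / 3.0) ** 2)                         # slow trend, lengthscale 3
K_per = np.exp(-2.0 * np.sin(np.pi * np.abs(diff) / 2.0) ** 2)   # period 2, lengthscale 1

K_sum = K_rbf + K_per    # additive: trend plus a periodic component
K_prod = K_rbf * K_per   # elementwise product: periodicity that fades with distance

min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()

print(f"smallest eigenvalue (sum): {min_eig_sum:.2e}")
print(f"smallest eigenvalue (product): {min_eig_prod:.2e}")
```

Note that kernel multiplication corresponds to the elementwise product of Gram matrices, not the matrix product.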

## Example: Trend + Seasonality + Noise

Let's build a kernel for modeling data with:
- Long-term smooth trends
- Seasonal patterns
- Short-term variations

In [None]:
def demonstrate_kernel_composition():
    # Generate synthetic data with trend, seasonality, and noise
    X = np.linspace(0, 4*np.pi, 100)
    
    # True components
    trend = 0.1 * X
    seasonal = 0.5 * np.sin(X)
    noise = 0.1 * rng.standard_normal(len(X))
    y_true = trend + seasonal + noise
    
    # Convert to training data
    X_train = X[::5]  # Subsample for training
    y_train = y_true[::5]
    X_test = X[:, None]
    X_train = X_train[:, None]
    
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=[
            "Individual Kernels",
            "Composite Kernel Samples",
            "GP Regression: Simple Kernel",
            "GP Regression: Composite Kernel"
        ]
    )
    
    # Plot individual kernel samples
    kernels = {
        'Long-term (RBF ℓ=3)': pm.gp.cov.ExpQuad(1, ls=3.0),
        'Seasonal (Periodic p=2π)': pm.gp.cov.Periodic(1, period=2*np.pi, ls=1.0),
        'Short-term (RBF ℓ=0.5)': pm.gp.cov.ExpQuad(1, ls=0.5)
    }
    
    colors = ['blue', 'red', 'green']
    for i, (name, kernel) in enumerate(kernels.items()):
        with pm.Model():
            K = kernel(X_test).eval() + 1e-6 * np.eye(len(X_test))
        
        L = np.linalg.cholesky(K)
        sample = L @ rng.standard_normal(len(X_test))
        
        fig.add_trace(
            go.Scatter(
                x=X, y=sample.flatten(),
                name=name, line=dict(color=colors[i])
            ),
            row=1, col=1
        )
    
    # Composite kernel samples
    with pm.Model():
        composite_kernel = (pm.gp.cov.ExpQuad(1, ls=3.0) + 
                          pm.gp.cov.Periodic(1, period=2*np.pi, ls=1.0) +
                          0.5 * pm.gp.cov.ExpQuad(1, ls=0.5))  # variance 0.5; first argument is input_dim
        K_comp = composite_kernel(X_test).eval() + 1e-6 * np.eye(len(X_test))
    
    L_comp = np.linalg.cholesky(K_comp)
    for i in range(3):
        sample = L_comp @ rng.standard_normal(len(X_test))
        fig.add_trace(
            go.Scatter(
                x=X, y=sample.flatten(),
                name=f'Composite Sample {i+1}',
                line=dict(color=colors[i]),
                showlegend=False
            ),
            row=1, col=2
        )
    
    # Add true data
    fig.add_trace(
        go.Scatter(
            x=X_train.flatten(), y=y_train,
            mode='markers', name='Training Data',
            marker=dict(color='black', size=4)
        ),
        row=2, col=1
    )
    
    fig.add_trace(
        go.Scatter(
            x=X_train.flatten(), y=y_train,
            mode='markers', name='Training Data',
            marker=dict(color='black', size=4),
            showlegend=False
        ),
        row=2, col=2
    )
    
    # Simple GP regression
    with pm.Model() as simple_model:
        ℓ = pm.Gamma("ℓ", alpha=2, beta=1)
        η = pm.HalfCauchy("η", beta=5)
        
        cov = η**2 * pm.gp.cov.ExpQuad(1, ls=ℓ)
        gp = pm.gp.Marginal(cov_func=cov)
        
        σ = pm.HalfCauchy("σ", beta=5)
        y_ = gp.marginal_likelihood("y", X=X_train, y=y_train, sigma=σ)
        
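        # Sample the hyperparameters first
        idata_simple = pm.sample(1000, tune=1000, random_seed=RANDOM_SEED, 
                                chains=2, cores=1, progressbar=False)
        
        # Add the predictive variable after sampling so it is not treated as
        # a free parameter, then request it explicitly by name
        f_pred = gp.conditional("f_pred", X_test)
        pred_samples = pm.sample_posterior_predictive(idata_simple, 
                                                     var_names=["f_pred"],
                                                     progressbar=False)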
    
    # Plot simple GP results
    f_mean = pred_samples.posterior_predictive['f_pred'].mean(dim=['chain', 'draw'])
    f_std = pred_samples.posterior_predictive['f_pred'].std(dim=['chain', 'draw'])
    
    fig.add_trace(
        go.Scatter(
            x=X, y=f_mean,
            name='GP Mean (Simple)',
            line=dict(color='blue'),
            showlegend=False
        ),
        row=2, col=1
    )
    
    fig.add_trace(
        go.Scatter(
            x=np.concatenate([X, X[::-1]]),
            y=np.concatenate([f_mean + 2*f_std, (f_mean - 2*f_std)[::-1]]),
            fill='toself',
            fillcolor='rgba(0, 0, 255, 0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            showlegend=False
        ),
        row=2, col=1
    )
    
    # Composite GP regression (simplified for demo)
    with pm.Model() as composite_model:
        # Long-term component
        ℓ_long = pm.Gamma("ℓ_long", alpha=2, beta=0.5)
        η_long = pm.HalfCauchy("η_long", beta=2)
        
        # Periodic component  
        ℓ_per = pm.Gamma("ℓ_per", alpha=2, beta=2)
        η_per = pm.HalfCauchy("η_per", beta=2)
        
        # Short-term component
        ℓ_short = pm.Gamma("ℓ_short", alpha=2, beta=4)
        η_short = pm.HalfCauchy("η_short", beta=1)
        
        # Composite kernel
        cov = (η_long**2 * pm.gp.cov.ExpQuad(1, ls=ℓ_long) +
               η_per**2 * pm.gp.cov.Periodic(1, period=2*np.pi, ls=ℓ_per) +
               η_short**2 * pm.gp.cov.ExpQuad(1, ls=ℓ_short))
        
        gp = pm.gp.Marginal(cov_func=cov)
        
        σ = pm.HalfCauchy("σ", beta=0.5)
        y_ = gp.marginal_likelihood("y", X=X_train, y=y_train, sigma=σ)
        
        idata_composite = pm.sample(1000, tune=1000, random_seed=RANDOM_SEED,
                                  chains=2, cores=1, progressbar=False)
        
        # Define the predictive variable after sampling and request it by name
        f_pred = gp.conditional("f_pred", X_test)
        pred_samples_comp = pm.sample_posterior_predictive(idata_composite,
                                                          var_names=["f_pred"],
                                                          progressbar=False)
    
    # Plot composite GP results
    f_mean_comp = pred_samples_comp.posterior_predictive['f_pred'].mean(dim=['chain', 'draw'])
    f_std_comp = pred_samples_comp.posterior_predictive['f_pred'].std(dim=['chain', 'draw'])
    
    fig.add_trace(
        go.Scatter(
            x=X, y=f_mean_comp,
            name='GP Mean (Composite)',
            line=dict(color='red'),
            showlegend=False
        ),
        row=2, col=2
    )
    
    fig.add_trace(
        go.Scatter(
            x=np.concatenate([X, X[::-1]]),
            y=np.concatenate([f_mean_comp + 2*f_std_comp, 
                            (f_mean_comp - 2*f_std_comp)[::-1]]),
            fill='toself',
            fillcolor='rgba(255, 0, 0, 0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            showlegend=False
        ),
        row=2, col=2
    )
    
    fig.update_layout(height=800, title="Kernel Composition Example")
    fig.update_xaxes(title="x")
    fig.update_yaxes(title="y")
    
    return fig

demonstrate_kernel_composition().show()

**Key Insights from Kernel Composition:**

1. **Additive kernels** allow modeling functions as sums of different components
2. **Individual kernels** capture specific patterns (trends, seasonality, noise)
3. **Composite models** can capture complex real-world patterns that simple kernels cannot
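4. **Parameter interpretation** becomes more meaningful, since each component has its own lengthscale and amplitude, but also more challenging, since components can trade off against one another during inference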

## Common Kernel Combinations

Here are some useful kernel combinations for different scenarios:

In [None]:
# Common kernel combination patterns
kernel_combinations = {
    "Smooth trend + noise": "RBF(large_ℓ) + RBF(small_ℓ)",
    "Periodic + trend": "Periodic × RBF + RBF(large_ℓ)", 
    "Local periodicity": "Periodic × RBF",
    "Polynomial trend": "Polynomial + RBF + White Noise",
    "Multi-scale patterns": "RBF(ℓ₁) + RBF(ℓ₂) + RBF(ℓ₃)"
}

print("Common Kernel Combination Patterns:")
print("=" * 40)
for use_case, formula in kernel_combinations.items():
    print(f"{use_case:20}: {formula}")

# Part C: Non-Gaussian Likelihoods

So far, we've focused on Gaussian likelihoods, which are appropriate for continuous data with symmetric noise. However, many real-world problems require different likelihood functions:

- **Classification**: Binary or multi-class outcomes
- **Count data**: Poisson or negative binomial observations
- **Robust regression**: Heavy-tailed noise (Student-t)
- **Ordinal data**: Ordered categorical outcomes

## Mathematical Foundation

For non-Gaussian likelihoods, we typically use the **latent variable approach**:

1. **Latent function**: $f(x) \sim \mathcal{GP}(m(x), k(x,x'))$
2. **Link function**: $g: f \mapsto \theta$ (transforms GP to likelihood parameters)
3. **Likelihood**: $y \mid \theta \sim p(y \mid \theta)$

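Because the latent function can no longer be integrated out analytically, this requires `pm.gp.Latent`, which samples the values of $f$ explicitly, instead of `pm.gp.Marginal`.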
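
The three steps can be read as a generative recipe. The following NumPy sketch simulates data from it for both a binary and a count outcome; the kernel settings and the separate generator `rng_demo` are assumptions made for illustration only.

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # separate generator, leaves the notebook's `rng` untouched

x = np.linspace(-3.0, 3.0, 50)
diff = x[:, None] - x[None, :]
K = np.exp(-0.5 * diff**2) + 1e-6 * np.eye(len(x))  # RBF kernel plus jitter

# 1. Latent function: one draw f ~ GP(0, k) evaluated at the inputs x
f = np.linalg.cholesky(K) @ rng_demo.standard_normal(len(x))

# 2. Link functions map f to valid likelihood parameters
p = 1.0 / (1.0 + np.exp(-f))   # inverse logit, values in (0, 1)
lam = np.exp(f)                # inverse of the log link, values > 0

# 3. Likelihood: observations are drawn given the transformed parameters
y_binary = rng_demo.binomial(1, p)
y_count = rng_demo.poisson(lam)

print(y_binary[:10], y_count[:10])
```

Inference runs this recipe in reverse: given `y`, the sampler explores the values of `f` and the kernel hyperparameters that could plausibly have generated it.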

## Example 1: Binary Classification with GP

For binary classification, we use a Bernoulli likelihood with a logistic link function:

$$p(y = 1 \mid f) = \text{logit}^{-1}(f) = \frac{1}{1 + e^{-f}}$$
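
Because the link is nonlinear, the order of averaging and transforming matters when summarizing predictions. The sketch below uses simulated normal draws as a stand-in for posterior samples of $f$ at a single input (an assumption for illustration) and compares the logistic of the mean with the mean of the logistic.

```python
import numpy as np

def inv_logit(f):
    """Numerically stable logistic function 1 / (1 + exp(-f)) for array input."""
    f = np.asarray(f, dtype=float)
    out = np.empty_like(f)
    pos = f >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-f[pos]))
    ef = np.exp(f[~pos])          # safe: the argument is negative here
    out[~pos] = ef / (1.0 + ef)
    return out

rng_demo = np.random.default_rng(1)
f_draws = rng_demo.normal(loc=2.0, scale=2.0, size=20_000)  # stand-in for posterior draws of f

p_of_mean = inv_logit(np.array([f_draws.mean()]))[0]  # logistic of the average latent value
mean_of_p = inv_logit(f_draws).mean()                 # average of the per-draw probabilities

print(f"logit^-1(E[f]) = {p_of_mean:.3f}, E[logit^-1(f)] = {mean_of_p:.3f}")
```

The second quantity is the posterior mean of $p(y = 1)$; the first is only a plug-in approximation and tends to be more extreme when the latent uncertainty is large.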

In [None]:
def gp_classification_demo():
    # Generate synthetic classification data
    n = 100
    X = rng.uniform(-3, 3, n)[:, None]
    
    # True latent function
    f_true = 2 * np.sin(X.flatten()) + 0.5 * X.flatten()**2 - 2
    p_true = 1 / (1 + np.exp(-f_true))  # logistic
    y = rng.binomial(1, p_true, n)
    
    # Test points for prediction
    X_test = np.linspace(-3, 3, 100)[:, None]
    
    # GP Classification model
    with pm.Model() as gp_classification:
        # Kernel hyperparameters
        ℓ = pm.Gamma("ℓ", alpha=2, beta=1)
        η = pm.HalfCauchy("η", beta=5)
        
        # Define covariance function
        cov = η**2 * pm.gp.cov.ExpQuad(1, ls=ℓ)
        
        # GP prior on latent function
        gp = pm.gp.Latent(cov_func=cov)
        f = gp.prior("f", X=X)
        
        # Bernoulli likelihood
        p = pm.math.invlogit(f)  # logistic transformation
        y_obs = pm.Bernoulli("y_obs", p=p, observed=y)
        
        # Fit model
        idata = pm.sample(1000, tune=1000, chains=2, cores=1, 
                         random_seed=RANDOM_SEED, progressbar=False)
        
        # Posterior predictions: f_pred is not an observed variable, so it must be requested by name
        f_pred = gp.conditional("f_pred", X_test)
        pred_samples = pm.sample_posterior_predictive(idata, var_names=["f_pred"],
                                                     progressbar=False)
    
    # Extract results
    f_draws = pred_samples.posterior_predictive['f_pred']
    f_pred_mean = f_draws.mean(dim=['chain', 'draw'])
    f_pred_std = f_draws.std(dim=['chain', 'draw'])
    # Average the per-draw probabilities rather than transforming the mean of f
    p_pred_mean = (1 / (1 + np.exp(-f_draws))).mean(dim=['chain', 'draw'])
    
    # Create plot
    fig = go.Figure()
    
    # Plot training data (one trace per class)
    
    for yi in [0, 1]:
        mask = y == yi
        fig.add_trace(
            go.Scatter(
                x=X[mask].flatten(),
                y=np.full(np.sum(mask), yi),
                mode='markers',
                name=f'Class {yi}',
                marker=dict(
                    color='red' if yi == 0 else 'blue',
                    size=8,
                    symbol='circle' if yi == 0 else 'diamond'
                )
            )
        )
    
    # Plot predicted probabilities
    fig.add_trace(
        go.Scatter(
            x=X_test.flatten(),
            y=p_pred_mean,
            name='Predicted P(y=1)',
            line=dict(color='green', width=3)
        )
    )
    
    # Approximate band: latent mean ± 2 sd mapped through the logistic link
    f_upper = 1 / (1 + np.exp(-(f_pred_mean + 2*f_pred_std)))
    f_lower = 1 / (1 + np.exp(-(f_pred_mean - 2*f_pred_std)))
    
    fig.add_trace(
        go.Scatter(
            x=np.concatenate([X_test.flatten(), X_test.flatten()[::-1]]),
            y=np.concatenate([f_upper, f_lower[::-1]]),
            fill='toself',
            fillcolor='rgba(0, 255, 0, 0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            name='95% Credible Interval',
            showlegend=False
        )
    )
    
    fig.update_layout(
        title="GP Classification Example",
        xaxis_title="x",
        yaxis_title="P(y=1)",
        height=500
    )
    
    return fig, idata

classification_fig, classification_idata = gp_classification_demo()
classification_fig.show()

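## Example 2: Student-t Likelihood for Robust Regression

Replacing the Gaussian likelihood with a Student-t likelihood makes GP regression robust to outliers. Strictly speaking, this is a GP with Student-t noise rather than a Student-t process, which places a heavy-tailed prior on the function itself (available in PyMC as `pm.gp.TP`):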

$$y \mid f, \nu, \sigma \sim \text{Student-t}(\nu, f, \sigma)$$

Where $\nu$ controls the tail heaviness (lower values = heavier tails).
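
To see why heavy tails help, compare how strongly each likelihood penalizes a single observation far from the latent function. This sketch writes both log-densities out with the standard library only; the choices of $\nu = 3$ and an outlier at 10 noise standard deviations are arbitrary illustrative values.

```python
import math

def normal_logpdf(y, mu, sigma):
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def student_t_logpdf(y, nu, mu, sigma):
    z = (y - mu) / sigma
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - 0.5 * (nu + 1.0) * math.log1p(z * z / nu))

# Log-density of an observation lying 10 noise standard deviations from f
outlier = 10.0
lp_normal = normal_logpdf(outlier, mu=0.0, sigma=1.0)
lp_student = student_t_logpdf(outlier, nu=3.0, mu=0.0, sigma=1.0)

print(f"Normal: {lp_normal:.1f}, Student-t (nu=3): {lp_student:.1f}")
```

The Gaussian penalty grows quadratically with the residual, so the fit is pulled toward outliers; the Student-t penalty grows only logarithmically, so the model can leave an outlier unexplained at modest cost.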

In [None]:
def student_t_process_demo():
    # Generate data with outliers
    n = 50
    X = np.linspace(-2, 2, n)[:, None]
    
    # True function
    f_true = np.sin(2 * X.flatten())
    
    # Add normal noise + some outliers
    noise = 0.1 * rng.standard_normal(n)
    # Add some outliers
    outlier_idx = rng.choice(n, size=5, replace=False)
    noise[outlier_idx] += rng.choice([-2, 2], size=5) * rng.uniform(1, 3, 5)
    
    y = f_true + noise
    
    X_test = np.linspace(-2, 2, 100)[:, None]
    
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=["Standard GP (Gaussian)", "Robust GP (Student-t)"]
    )
    
    # Standard Gaussian GP
    with pm.Model() as gaussian_gp:
        ℓ = pm.Gamma("ℓ", alpha=2, beta=2)
        η = pm.HalfCauchy("η", beta=5)
        σ = pm.HalfCauchy("σ", beta=2)
        
        cov = η**2 * pm.gp.cov.ExpQuad(1, ls=ℓ)
        gp = pm.gp.Marginal(cov_func=cov)
        
        y_ = gp.marginal_likelihood("y", X=X, y=y, sigma=σ)
        
        idata_gaussian = pm.sample(1000, tune=1000, chains=2, cores=1,
                                  random_seed=RANDOM_SEED, progressbar=False)
        
        # f_pred is not an observed variable, so it must be requested by name
        f_pred_gaussian = gp.conditional("f_pred", X_test)
        pred_gaussian = pm.sample_posterior_predictive(idata_gaussian, var_names=["f_pred"],
                                                      progressbar=False)
    
    # Student-t GP
    with pm.Model() as studentt_gp:
        ℓ = pm.Gamma("ℓ", alpha=2, beta=2) 
        η = pm.HalfCauchy("η", beta=5)
        σ = pm.HalfCauchy("σ", beta=2)
        ν = pm.Gamma("ν", alpha=2, beta=0.1)  # Degrees of freedom
        
        cov = η**2 * pm.gp.cov.ExpQuad(1, ls=ℓ)
        gp = pm.gp.Latent(cov_func=cov)
        f = gp.prior("f", X=X)
        
        # Student-t likelihood
        y_ = pm.StudentT("y", nu=ν, mu=f, sigma=σ, observed=y)
        
        idata_studentt = pm.sample(1000, tune=1000, chains=2, cores=1,
                                  random_seed=RANDOM_SEED, progressbar=False)
        
        # f_pred is not an observed variable, so it must be requested by name
        f_pred_studentt = gp.conditional("f_pred", X_test)
        pred_studentt = pm.sample_posterior_predictive(idata_studentt, var_names=["f_pred"],
                                                      progressbar=False)
    
    # Plot results
    models = {
        "Gaussian": (pred_gaussian, (1, 1)),
        "Student-t": (pred_studentt, (1, 2))
    }
    
    for name, (pred, pos) in models.items():
        f_mean = pred.posterior_predictive['f_pred'].mean(dim=['chain', 'draw'])
        f_std = pred.posterior_predictive['f_pred'].std(dim=['chain', 'draw'])
        
        # Add training data
        colors = ['red' if i in outlier_idx else 'black' for i in range(len(y))]
        sizes = [10 if i in outlier_idx else 6 for i in range(len(y))]
        
        fig.add_trace(
            go.Scatter(
                x=X.flatten(),
                y=y,
                mode='markers',
                name='Data (outliers in red)' if pos[1] == 1 else 'Data',
                marker=dict(color=colors, size=sizes),
                showlegend=pos[1] == 1
            ),
            row=pos[0], col=pos[1]
        )
        
        # Add GP predictions
        fig.add_trace(
            go.Scatter(
                x=X_test.flatten(),
                y=f_mean,
                name=f'{name} GP Mean',
                line=dict(color='blue'),
                showlegend=False
            ),
            row=pos[0], col=pos[1]
        )
        
        # Add uncertainty
        fig.add_trace(
            go.Scatter(
                x=np.concatenate([X_test.flatten(), X_test.flatten()[::-1]]),
                y=np.concatenate([f_mean + 2*f_std, (f_mean - 2*f_std)[::-1]]),
                fill='toself',
                fillcolor='rgba(0, 0, 255, 0.2)',
                line=dict(color='rgba(255,255,255,0)'),
                showlegend=False
            ),
            row=pos[0], col=pos[1]
        )
        
        # Add true function
        f_true_test = np.sin(2 * X_test.flatten())
        fig.add_trace(
            go.Scatter(
                x=X_test.flatten(),
                y=f_true_test,
                name='True Function',
                line=dict(color='green', dash='dash'),
                showlegend=pos[1] == 1
            ),
            row=pos[0], col=pos[1]
        )
    
    fig.update_layout(height=500, title="GP Regression: Gaussian vs Student-t")
    fig.update_xaxes(title="x")
    fig.update_yaxes(title="y")
    
    return fig, idata_gaussian, idata_studentt

robust_fig, gauss_idata, studentt_idata = student_t_process_demo()
robust_fig.show()

## Example 3: Count Data with Poisson Likelihood

For count data, we use a Poisson likelihood with a log link function:

$$\lambda = \exp(f)$$
$$y \mid \lambda \sim \text{Poisson}(\lambda)$$
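
Substituting the link into the Poisson density gives the log-likelihood as a function of the latent value: $\log p(y \mid f) = y f - e^{f} - \log y!$. The sketch below evaluates it on a grid for one hypothetical observation and locates its maximum; the function name and the grid are illustrative choices.

```python
import math
import numpy as np

def poisson_loglik(y, f):
    """log p(y | f) for y ~ Poisson(exp(f)): y*f - exp(f) - log(y!)."""
    return y * f - math.exp(f) - math.lgamma(y + 1.0)

y_obs = 7  # a single hypothetical count
f_grid = np.linspace(0.0, 4.0, 401)
loglik = np.array([poisson_loglik(y_obs, f) for f in f_grid])
f_best = f_grid[np.argmax(loglik)]

print(f"best f on grid: {f_best:.2f}, log(y): {math.log(y_obs):.2f}")
```

The maximum sits at $f = \log y$, which is why the GP prior is placed on the log-intensity: any real value of $f$ maps to a valid positive rate.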

In [None]:
def poisson_gp_demo():
    # Generate synthetic count data
    n = 60
    X = np.linspace(0, 4*np.pi, n)[:, None]
    
    # True log-intensity function
    f_true = 1 + 0.5 * np.sin(X.flatten()) + 0.3 * np.cos(2 * X.flatten())
    lambda_true = np.exp(f_true)
    y = rng.poisson(lambda_true)
    
    X_test = np.linspace(0, 4*np.pi, 100)[:, None]
    
    # Poisson GP model
    with pm.Model() as poisson_gp:
        # Kernel parameters
        ℓ = pm.Gamma("ℓ", alpha=2, beta=1)
        η = pm.HalfCauchy("η", beta=2)
        
        # GP prior on log-intensity
        cov = η**2 * pm.gp.cov.ExpQuad(1, ls=ℓ)
        gp = pm.gp.Latent(cov_func=cov)
        f = gp.prior("f", X=X)
        
        # Poisson likelihood
        λ = pm.math.exp(f)  # log link
        y_obs = pm.Poisson("y_obs", mu=λ, observed=y)
        
        # Fit model
        idata = pm.sample(1000, tune=1000, chains=2, cores=1,
                         random_seed=RANDOM_SEED, progressbar=False)
        
        # Predictions: f_pred is not an observed variable, so it must be requested by name
        f_pred = gp.conditional("f_pred", X_test)
        pred_samples = pm.sample_posterior_predictive(idata, var_names=["f_pred"],
                                                     progressbar=False)
    
    # Extract results
    f_pred_mean = pred_samples.posterior_predictive['f_pred'].mean(dim=['chain', 'draw'])
    f_pred_std = pred_samples.posterior_predictive['f_pred'].std(dim=['chain', 'draw'])
    lambda_pred_mean = np.exp(f_pred_mean)  # exp of the latent mean: a point estimate, not the posterior mean of exp(f)
    lambda_pred_upper = np.exp(f_pred_mean + 2*f_pred_std)
    lambda_pred_lower = np.exp(f_pred_mean - 2*f_pred_std)
    
    # Plot results
    fig = go.Figure()
    
    # Training data
    fig.add_trace(
        go.Scatter(
            x=X.flatten(),
            y=y,
            mode='markers',
            name='Count Data',
            marker=dict(color='black', size=6)
        )
    )
    
    # True intensity
    lambda_true_test = np.exp(1 + 0.5 * np.sin(X_test.flatten()) + 
                             0.3 * np.cos(2 * X_test.flatten()))
    fig.add_trace(
        go.Scatter(
            x=X_test.flatten(),
            y=lambda_true_test,
            name='True Intensity',
            line=dict(color='green', dash='dash')
        )
    )
    
    # Predicted intensity
    fig.add_trace(
        go.Scatter(
            x=X_test.flatten(),
            y=lambda_pred_mean,
            name='Predicted Intensity',
            line=dict(color='blue', width=3)
        )
    )
    
    # Uncertainty bands
    fig.add_trace(
        go.Scatter(
            x=np.concatenate([X_test.flatten(), X_test.flatten()[::-1]]),
            y=np.concatenate([lambda_pred_upper, lambda_pred_lower[::-1]]),
            fill='toself',
            fillcolor='rgba(0, 0, 255, 0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            name='95% Credible Interval',
            showlegend=False
        )
    )
    
    fig.update_layout(
        title="GP with Poisson Likelihood (Count Data)",
        xaxis_title="x",
        yaxis_title="Count / Intensity",
        height=500
    )
    
    return fig, idata

poisson_fig, poisson_idata = poisson_gp_demo()
poisson_fig.show()

## Computational Considerations for Non-Gaussian Likelihoods

Using non-Gaussian likelihoods comes with computational trade-offs:

### 1. **Inference Complexity**
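- **Gaussian**: The latent function is integrated out analytically, so only the kernel hyperparameters and noise scale need to be sampled (fast)
- **Non-Gaussian**: No closed-form marginal exists, so the latent function values are sampled by MCMC together with the hyperparameters (slower)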

### 2. **Model Specification**
- **Marginal**: `pm.gp.Marginal` for Gaussian likelihoods only
- **Latent**: `pm.gp.Latent` required for non-Gaussian likelihoods

### 3. **Hyperparameter Sensitivity**
- Non-Gaussian models are often more sensitive to priors
- More careful initialization may be needed
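
The "analytical" part of the Gaussian case is the log marginal likelihood $\log \mathcal{N}(y \mid 0, K + \sigma^2 I)$. The following NumPy sketch of the standard Cholesky-based formula assumes a zero mean function and made-up data; it illustrates the computation and is not PyMC's implementation.

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, noise_sd):
    """log N(y | 0, K + noise_sd^2 I), computed through a Cholesky factor."""
    n = len(y)
    Ky = K + noise_sd**2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = Ky^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))      # equals 0.5 * log|Ky|
            - 0.5 * n * np.log(2.0 * np.pi))

rng_demo = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 30)
y = np.sin(x) + 0.1 * rng_demo.standard_normal(len(x))
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # RBF kernel, unit lengthscale

lml = gp_log_marginal_likelihood(K, y, noise_sd=0.1)
print(f"log marginal likelihood: {lml:.2f}")
```

The Cholesky factorization costs $O(n^3)$ and is repeated for every hyperparameter value the sampler visits, which is the main scalability limit for both implementations.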

In [None]:
# Summary of likelihood choices
likelihood_guide = {
    "Data Type": ["Continuous (symmetric noise)", "Continuous (outliers)", 
                 "Binary", "Count", "Positive continuous"],
    "Likelihood": ["Gaussian", "Student-t", "Bernoulli", "Poisson", "Gamma/Lognormal"],
    "Link Function": ["Identity", "Identity", "Logistic", "Log", "Log"],
    "PyMC Implementation": ["gp.Marginal", "gp.Latent", "gp.Latent", "gp.Latent", "gp.Latent"]
}

likelihood_df = pd.DataFrame(likelihood_guide)
print("GP Likelihood Selection Guide:")
print("=" * 60)
print(likelihood_df.to_string(index=False))

# Part D: Model Building Best Practices

Building effective GP models requires careful consideration of several factors. Let's explore key best practices for successful GP modeling.

## 1. Choosing Between Marginal vs Latent

The choice between `pm.gp.Marginal` and `pm.gp.Latent` has important implications:

In [None]:
# Comparison of Marginal vs Latent implementations
comparison_data = {
    "Aspect": ["Computational Speed", "Likelihood Types", "Memory Usage", 
              "Posterior Samples", "Prediction", "Scalability"],
    "pm.gp.Marginal": ["Faster (f integrated out analytically)", "Gaussian only", "Lower", 
                       "Hyperparameters only", "Via conditional (closed form)",
                       "O(n³) per likelihood evaluation"],
    "pm.gp.Latent": ["Slower (f sampled by MCMC)", "Any likelihood", "Higher (n latent values per draw)", 
                     "Hyperparameters and latent f", "Via conditional",
                     "O(n³) plus n extra sampled dimensions"]
}

comparison_df = pd.DataFrame(comparison_data)
print("Marginal vs Latent Implementation Comparison:")
print("=" * 55)
print(comparison_df.to_string(index=False))

## 2. Hyperparameter Priors and Initialization

GP hyperparameters are often only weakly identified by the data, so the priors can noticeably change the fit. The example below fits the same data under three prior settings:

In [None]:
def demonstrate_prior_sensitivity():
    """Show impact of different prior choices on GP inference"""
    
    # Generate simple test data
    X = np.linspace(0, 1, 20)[:, None]
    y = np.sin(4 * np.pi * X.flatten()) + 0.1 * rng.standard_normal(20)
    
    X_test = np.linspace(0, 1, 100)[:, None]
    
    # Different prior specifications
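    # Prior specifications, stored as (distribution class, keyword arguments).
    # The random variables are created later, inside each model context,
    # because PyMC distributions cannot be instantiated outside a model.
    prior_configs = {
        "Informative": {
            "ℓ": (pm.Gamma, dict(alpha=4, beta=20)),      # Small lengthscale (prior mean 0.2)
            "η": (pm.HalfNormal, dict(sigma=1)),          # Moderate amplitude
        },
        "Weakly Informative": {
            "ℓ": (pm.Gamma, dict(alpha=2, beta=2)),       # Moderate lengthscale (prior mean 1)
            "η": (pm.HalfCauchy, dict(beta=2)),           # Heavy-tailed amplitude
        },
        "Vague": {
            "ℓ": (pm.Gamma, dict(alpha=1, beta=0.1)),     # Very diffuse (prior mean 10)
            "η": (pm.HalfCauchy, dict(beta=10)),          # Very diffuse
        }
    }
    
    fig = make_subplots(
        rows=1, cols=3,
        subplot_titles=list(prior_configs.keys())
    )
    
    results = {}
    
    for i, (name, priors) in enumerate(prior_configs.items()):
        with pm.Model() as model:
            # Instantiate this configuration's priors inside the model
            ls_dist, ls_kwargs = priors["ℓ"]
            eta_dist, eta_kwargs = priors["η"]
            ℓ = ls_dist("ℓ", **ls_kwargs)
            η = eta_dist("η", **eta_kwargs)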
            σ = pm.HalfNormal("σ", sigma=0.5)
            
            cov = η**2 * pm.gp.cov.ExpQuad(1, ls=ℓ)
            gp = pm.gp.Marginal(cov_func=cov)
            y_ = gp.marginal_likelihood("y", X=X, y=y, sigma=σ)
            
            try:
                idata = pm.sample(500, tune=500, chains=2, cores=1,
                                random_seed=RANDOM_SEED, progressbar=False)
                
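                f_pred = gp.conditional("f_pred", X_test)
                # f_pred is not an observed variable, so request it explicitly
                pred_samples = pm.sample_posterior_predictive(
                    idata, var_names=["f_pred"],
                    random_seed=RANDOM_SEED, progressbar=False)
                
                # Convert to NumPy arrays for plotting
                f_mean = pred_samples.posterior_predictive['f_pred'].mean(dim=['chain', 'draw']).values
                f_std = pred_samples.posterior_predictive['f_pred'].std(dim=['chain', 'draw']).values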
                
                results[name] = (f_mean, f_std, idata)
                
            except Exception as e:
                print(f"Sampling failed for {name}: {e}")
                continue
        
        # Plot data
        fig.add_trace(
            go.Scatter(
                x=X.flatten(), y=y,
                mode='markers', name='Data' if i == 0 else None,
                marker=dict(color='black', size=4),
                showlegend=(i == 0)
            ),
            row=1, col=i+1
        )
        
        # Plot true function
        y_true = np.sin(4 * np.pi * X_test.flatten())
        fig.add_trace(
            go.Scatter(
                x=X_test.flatten(), y=y_true,
                name='True' if i == 0 else None, 
                line=dict(color='green', dash='dash'),
                showlegend=(i == 0)
            ),
            row=1, col=i+1
        )
        
        if name in results:
            f_mean, f_std, _ = results[name]
            
            # Plot GP prediction
            fig.add_trace(
                go.Scatter(
                    x=X_test.flatten(), y=f_mean,
                    name='GP' if i == 0 else None,
                    line=dict(color='blue'),
                    showlegend=(i == 0)
                ),
                row=1, col=i+1
            )
            
            # Uncertainty
            fig.add_trace(
                go.Scatter(
                    x=np.concatenate([X_test.flatten(), X_test.flatten()[::-1]]),
                    y=np.concatenate([f_mean + 2*f_std, (f_mean - 2*f_std)[::-1]]),
                    fill='toself',
                    fillcolor='rgba(0, 0, 255, 0.2)',
                    line=dict(color='rgba(255,255,255,0)'),
                    showlegend=False
                ),
                row=1, col=i+1
            )
    
    fig.update_layout(height=400, title="Impact of Prior Specification")
    fig.update_xaxes(title="x")
    fig.update_yaxes(title="y")
    
    return fig, results

prior_fig, prior_results = demonstrate_prior_sensitivity()
prior_fig.show()

## Recommended Prior Guidelines

Based on the above example and general experience, here are some guidelines for choosing priors:

In [None]:
# Prior recommendations
prior_recommendations = {
    "Parameter": ["Lengthscale (ℓ)", "Amplitude (η)", "Noise (σ)", "Period (p)"],
    "Recommended Prior": [
        "Gamma(2, β) where β ≈ 2/expected_scale",
        "HalfCauchy(β) where β ≈ std(y)", 
        "HalfNormal(σ) where σ ≈ 0.1 * std(y)",
        "Normal(μ, σ) where μ = expected period"
    ],
    "Rationale": [
        "Prior mean equals expected_scale; avoids extreme values",
        "Heavy tails allow flexibility",
        "Assumes noise is small relative to signal; widen for noisy data",
        "Domain knowledge essential"
    ]
}

prior_df = pd.DataFrame(prior_recommendations)
print("GP Hyperparameter Prior Recommendations:")
print("=" * 50)
print(prior_df.to_string(index=False))
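
The recommendations above can be turned into concrete numbers with a small helper. The sketch below is illustrative only: `suggest_prior_params` is a hypothetical function, and `expected_scale` (the lengthscale you expect on the input scale) has to come from domain knowledge. It also checks by simulation that Gamma(2, β) with β = 2/expected_scale has its prior mean at the expected scale.

```python
import numpy as np

def suggest_prior_params(y, expected_scale):
    """Hypothetical helper: translate the rules of thumb into parameters."""
    y_sd = float(np.std(y))
    return {
        "lengthscale": {"dist": "Gamma", "alpha": 2.0, "beta": 2.0 / expected_scale},
        "amplitude": {"dist": "HalfCauchy", "beta": y_sd},
        "noise": {"dist": "HalfNormal", "sigma": 0.1 * y_sd},
    }

# Toy data (assumed for illustration only)
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(4 * np.pi * x) + 0.1 * rng.standard_normal(20)

params = suggest_prior_params(y, expected_scale=0.25)

# Simulate from the lengthscale prior; NumPy uses scale = 1 / rate
ls = params["lengthscale"]
draws = rng.gamma(shape=ls["alpha"], scale=1.0 / ls["beta"], size=100_000)
prior_mean = draws.mean()
q05, q95 = np.quantile(draws, [0.05, 0.95])

print(params)
print(f"Lengthscale prior mean: {prior_mean:.3f}, 90% interval: ({q05:.3f}, {q95:.3f})")
```

Looking at the simulated interval is a quick way to confirm that the prior covers the lengthscales you consider plausible before running the full model.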

## 3. Model Validation and Diagnostics

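Before interpreting a fitted GP, check that the sampler converged (R̂), that effective sample sizes are adequate, and that the parameter estimates are plausible: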

In [None]:
def gp_diagnostics(idata, model_name="GP Model"):
    """Comprehensive diagnostics for GP models"""
    
    print(f"Diagnostics for {model_name}:")
    print("=" * 40)
    
    # 1. Convergence diagnostics
    summary = az.summary(idata, hdi_prob=0.95)
    
    # Check R-hat values (1.01 is the commonly recommended threshold for rank-normalized R-hat)
    rhat_issues = summary[summary['r_hat'] > 1.01]
    if len(rhat_issues) > 0:
        print("⚠️  WARNING: Some parameters have R̂ > 1.01:")
        print(rhat_issues[['mean', 'r_hat']].to_string())
    else:
        print("✅ All R̂ values ≤ 1.01 (no sign of non-convergence)")
    
    # Check effective sample size
    ess_issues = summary[summary['ess_bulk'] < 400]
    if len(ess_issues) > 0:
        print("\n⚠️  WARNING: Low effective sample size:")
        print(ess_issues[['ess_bulk', 'ess_tail']].to_string())
    else:
        print("✅ Adequate effective sample sizes")
    
    # 2. Print key parameter estimates
    print("\nParameter Estimates:")
    print(summary[['mean', 'hdi_2.5%', 'hdi_97.5%']].round(4).to_string())
    
    return summary

# Example with our classification model
if 'classification_idata' in locals():
    classification_diagnostics = gp_diagnostics(classification_idata, "GP Classification")
else:
    print("Classification model not available for diagnostics")
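
To build intuition for what the R̂ check measures, here is a simplified split-R̂ computation in NumPy. It is a sketch under stated assumptions: the function `split_rhat` is written for this illustration, it omits the rank normalization that ArviZ applies, and the "chains" are simulated rather than taken from a fitted model.

```python
import numpy as np

def split_rhat(draws):
    """Simplified split R-hat for an array of shape (n_chains, n_draws)."""
    n_chains, n_draws = draws.shape
    half = n_draws // 2
    # Split every chain in half so that within-chain drift is also detected
    split = np.concatenate([draws[:, :half], draws[:, half:2 * half]], axis=0)
    n = split.shape[1]
    W = split.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * split.mean(axis=1).var(ddof=1)    # between-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(42)

# Four chains that all explore the same distribution
good_chains = rng.standard_normal((4, 1000))

# Same chains, but one is stuck in a different region
bad_chains = good_chains.copy()
bad_chains[0] += 3.0

rhat_good = split_rhat(good_chains)
rhat_bad = split_rhat(bad_chains)
print(f"Well-mixed chains: R-hat = {rhat_good:.3f}")
print(f"One stuck chain:   R-hat = {rhat_bad:.3f}")
```

When the chains agree, the between-chain and within-chain variances match and R̂ stays close to 1. A chain stuck elsewhere inflates the between-chain term and pushes R̂ well above 1.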

## 4. Computational Efficiency Tips

For larger datasets or complex models, consider these optimization strategies:

In [None]:
# Computational efficiency tips
efficiency_tips = {
    "Strategy": [
        "Use pm.gp.Marginal when possible",
        "Add jitter to diagonal", 
        "Use sparse/inducing points",
        "Hierarchical GPs for grouped data",
        "Consider HSGP approximation",
        "Proper kernel scaling"
    ],
    "When to Use": [
        "Gaussian likelihood",
        "Numerical instability",
        "Large datasets (n > 1000)", 
        "Multiple similar time series",
        "Stationary kernels, large n",
        "Always"
    ],
    "Implementation": [
        "pm.gp.Marginal(cov_func=cov)",
        # WhiteNoise takes a standard deviation: sigma=1e-3 adds 1e-6 to the diagonal
        "cov + pm.gp.cov.WhiteNoise(1e-3)",
        "pm.gp.MarginalApprox(cov_func=cov, approx='VFE')",
        "Shared hyperparameters", 
        "pm.gp.HSGP",
        "Scale inputs to [0,1] or [-1,1]"
    ]
}

efficiency_df = pd.DataFrame(efficiency_tips)
print("GP Computational Efficiency Strategies:")
print("=" * 45)
print(efficiency_df.to_string(index=False))
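
The jitter tip is easiest to appreciate numerically. The NumPy sketch below builds a smooth kernel matrix on a dense grid, which is typically close to singular, and compares a Cholesky factorization with and without a small diagonal term. The helpers `exp_quad` and `try_cholesky`, the grid, and the jitter value of 1e-6 are assumptions for this illustration.

```python
import numpy as np

def exp_quad(x, ls):
    """Exponentiated quadratic kernel matrix for 1-D inputs."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2)

def try_cholesky(K):
    """Return True if the Cholesky factorization succeeds."""
    try:
        np.linalg.cholesky(K)
        return True
    except np.linalg.LinAlgError:
        return False

# Dense grid with a long lengthscale: neighbouring rows are nearly identical
x = np.linspace(0, 1, 200)
K = exp_quad(x, ls=1.0)
K_jittered = K + 1e-6 * np.eye(len(x))

ok_plain = try_cholesky(K)
ok_jittered = try_cholesky(K_jittered)
cond_jittered = np.linalg.cond(K_jittered)

print("Cholesky without jitter succeeded:", ok_plain)
print("Cholesky with jitter succeeded:   ", ok_jittered)
print(f"Condition number with jitter: {cond_jittered:.2e}")
```

The jitter bounds the smallest eigenvalue away from zero, which keeps the factorization stable at the cost of a negligible amount of extra noise.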

# Summary and Key Takeaways

In this session, we've covered the essential building blocks for advanced GP modeling:

## 🎯 **Key Concepts Mastered**

1. **Kernel Selection**: Understanding when to use RBF, Matérn, Periodic, and other kernels
2. **Kernel Composition**: Building complex patterns through addition and multiplication
3. **Non-Gaussian Likelihoods**: Extending GPs beyond continuous data
4. **Model Building**: Best practices for robust, efficient GP models

## 📋 **Practical Guidelines**

### Kernel Selection Flowchart
```
Data Characteristics → Kernel Choice
├── Smooth, no periodicity → RBF
├── Some roughness → Matérn (3/2 or 5/2) 
├── Periodic patterns → Periodic (+ RBF for trend)
├── Multiple scales → Additive kernels
└── Interacting patterns (e.g. periodicity that drifts or decays) → Product kernels
```
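
As a quick illustration of the last three branches of the flowchart, the NumPy sketch below composes kernels by addition and multiplication and checks that the results are still valid (positive semi-definite) covariance matrices. The kernel functions use common textbook parameterizations, which may differ in scaling from the `pm.gp.cov` classes, and all hyperparameter values are assumptions.

```python
import numpy as np

def abs_dist(x):
    """Pairwise absolute distances for 1-D inputs."""
    return np.abs(x[:, None] - x[None, :])

def rbf(x, ls):
    return np.exp(-0.5 * abs_dist(x) ** 2 / ls**2)

def matern32(x, ls):
    r = np.sqrt(3.0) * abs_dist(x) / ls
    return (1.0 + r) * np.exp(-r)

def periodic(x, period, ls):
    return np.exp(-2.0 * np.sin(np.pi * abs_dist(x) / period) ** 2 / ls**2)

x = np.linspace(0, 4, 60)

# Periodic pattern plus a smooth trend: add the kernels
trend_plus_seasonal = rbf(x, ls=2.0) + periodic(x, period=1.0, ls=1.0)

# Periodicity whose shape drifts over time: multiply the kernels
locally_periodic = rbf(x, ls=2.0) * periodic(x, period=1.0, ls=1.0)

# Two scales of variation: smooth long-range plus rougher short-range
multi_scale = rbf(x, ls=2.0) + 0.3 * matern32(x, ls=0.2)

composites = {
    "sum": trend_plus_seasonal,
    "product": locally_periodic,
    "multi-scale": multi_scale,
}
min_eigs = {name: np.linalg.eigvalsh(K).min() for name, K in composites.items()}
for name, eig in min_eigs.items():
    print(f"{name:12s} smallest eigenvalue: {eig:.2e}")
```

Smallest eigenvalues at or near zero (up to floating-point error) confirm that sums and products of valid kernels remain valid kernels, which is what makes composition safe.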

### Implementation Strategy
```
Problem Type → Implementation
├── Gaussian noise → pm.gp.Marginal
├── Classification → pm.gp.Latent + Bernoulli
├── Count data → pm.gp.Latent + Poisson
├── Robust regression → pm.gp.Latent + StudentT
└── Large datasets → Consider HSGP or sparse approximations
```

## 🚀 **Next Steps**

Building on today's foundation, the remaining sessions will cover:
- **Session 3**: Advanced topics and computational methods
- **Session 4**: Real-world applications and case studies

## 💡 **Practice Exercises**

Try applying these concepts to your own data:

1. **Start Simple**: Begin with basic RBF kernels
2. **Add Complexity**: Experiment with kernel composition
3. **Match Likelihood**: Choose appropriate likelihood for your data type
4. **Validate**: Always check convergence and model fit
5. **Compare**: Test multiple kernel choices and compare performance

## 📚 Additional Resources

For deeper understanding, explore:

- **PyMC Documentation**: [GP module documentation](https://www.pymc.io/projects/docs/en/stable/api/gp.html)
- **Rasmussen & Williams**: *Gaussian Processes for Machine Learning* (the definitive textbook)
- **PyMC Examples**: [GP gallery](https://www.pymc.io/projects/examples/en/latest/gaussian_processes/index.html)
- **Kernel Cookbook**: David Duvenaud's kernel cookbook for kernel selection guidance

---

**End of Session 2** 🎉

*You now have the tools to build sophisticated GP models for a wide range of real-world problems!*