# 03 — Deep Learning for TVAR Estimation

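This notebook implements neural-network approaches for estimating time-varying autoregressive (TVAR) coefficients, plus a fixed-coefficient AR(p) baseline for comparison.

## Approaches
- **Neural ODE + TVAR**: ODE evolves latent → `tanh` → Levinson-Durbin → stable AR coeffs
- **Lag-Attention (Transformer)**: Attention over lag bank, sparse top-k selection
- **Transformer AR**: CLS token over a fixed-length lag sequence
- **MLP TVAR (Hyper-Network)**: MLP maps the lag vector to AR coefficients and a bias
- **Neural Operator**: Continuous-τ kernel with Fourier features
- **AR(p) Baseline**: Fixed coefficients estimated by OLS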

---

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 1. Neural ODE + TVAR (Levinson-Durbin Stability)

**Source**: `NeuralODE.ipynb`

**Key Idea**: A Neural ODE evolves a latent state `z(t)` over time. At each step the latent is passed through `tanh` to obtain reflection coefficients $\kappa \in (-1, 1)$, which the Levinson-Durbin recursion converts to AR coefficients that are stable at that time step. The implementation below covers $p = 2$.

$$\kappa_i \in (-1, 1) \Rightarrow \text{stable AR}$$

**Components**:
- `ODEFunc`: MLP that defines $dz/dt = f_\theta(z)$
- `Encoder`: Maps initial window of observations to $z_0$
- `NeuralODE_TVAR2`: Integrates ODE (Euler), applies Levinson-Durbin

**Levinson-Durbin for AR(2)** (with the convention $x_t = a_1 x_{t-1} + a_2 x_{t-2} + \epsilon_t$):
$$a_2 = \kappa_2, \quad a_1 = \kappa_1 (1 - \kappa_2)$$
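
As a quick sanity check of this mapping, the sketch below samples reflection coefficients through `tanh` and confirms that both roots of $z^2 - a_1 z - a_2$ lie inside the unit circle. It uses NumPy only, assumes the convention $x_t = a_1 x_{t-1} + a_2 x_{t-2} + \epsilon_t$, and the helper name `ar2_from_reflection` is illustrative rather than taken from the source notebook.

```python
import numpy as np

def ar2_from_reflection(kappa):
    """Map reflection coefficients (..., 2) to AR(2) coefficients [a1, a2]."""
    k1, k2 = kappa[..., 0], kappa[..., 1]
    return np.stack([k1 * (1.0 - k2), k2], axis=-1)

rng = np.random.default_rng(0)
kappa = np.tanh(rng.normal(size=(1000, 2)))   # every entry lies in (-1, 1)
coeffs = ar2_from_reflection(kappa)

# Roots of z^2 - a1*z - a2; the AR(2) model is stable iff all have modulus < 1
max_modulus = max(
    np.abs(np.roots([1.0, -a1, -a2])).max() for a1, a2 in coeffs
)
print(f"largest root modulus over 1000 draws: {max_modulus:.4f}")
```

The check is pointwise: it says each coefficient pair defines a stable AR(2) model on its own, which is the property the `tanh` plus Levinson-Durbin construction is meant to provide.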

In [None]:
# ==============================================================================
# Neural ODE + TVAR (from NeuralODE.ipynb)
# ==============================================================================

def levinson_order2(kappa):
    """
    Levinson-Durbin recursion for AR(2) from reflection coefficients.
    
    Maps reflection coefficients κ ∈ (-1, 1) to stable AR coefficients.
    
    Parameters
    ----------
    kappa : Tensor, shape (..., 2)
        Reflection coefficients (use tanh to constrain to (-1, 1))
    
    Returns
    -------
    a : Tensor, shape (..., 2)
        AR(2) coefficients [a1, a2] guaranteed to be stable
    """
    k1 = kappa[..., 0]
    k2 = kappa[..., 1]
    a2 = k2
    a1 = k1 * (1.0 - k2)
    return torch.stack([a1, a2], dim=-1)


class ODEFunc(nn.Module):
    """
    Defines the ODE dynamics dz/dt = f_θ(z).
    
    A simple MLP that maps z → dz/dt.
    """
    def __init__(self, dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z):
        return self.net(z)


class Encoder(nn.Module):
    """
    Maps an initial window of observations to the initial latent state z0.
    """
    def __init__(self, L=30, hidden=32, out_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(L, hidden),
            nn.Tanh(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x0):
        return self.net(x0)


class NeuralODE_TVAR2(nn.Module):
    """
    Neural ODE for Time-Varying AR(2) estimation.
    
    The latent state z(t) evolves via an ODE, then is mapped to stable
    AR coefficients using Levinson-Durbin recursion.
    
    Parameters
    ----------
    L : int
        Window size for encoding z0
    hidden : int
        Hidden dimension for ODE function
    
    Forward
    -------
    x_seq : Tensor, shape (T,)
        Observed time series
    phi_seq : Tensor, shape (T, 2)
        Lagged features [x_{t-1}, x_{t-2}]
    dt : float
        Integration time step
    
    Returns
    -------
    xhat : Tensor, shape (T,)
        One-step predictions (teacher-forced)
    a : Tensor, shape (T, 2)
        Time-varying AR coefficients
    z : Tensor, shape (T, 2)
        Latent ODE trajectory
    """
    def __init__(self, L=30, hidden=32):
        super().__init__()
        self.L = L
        self.func = ODEFunc(dim=2, hidden=hidden)
        self.enc = Encoder(L=L, hidden=hidden, out_dim=2)

    def forward(self, x_seq, phi_seq, dt=1.0):
        T = x_seq.shape[0]
        z0 = self.enc(x_seq[:self.L].unsqueeze(0)).squeeze(0)  # (2,)

        # Euler integration of ODE
        z_list = [z0]
        for _ in range(1, T):
            z_prev = z_list[-1]
            z_next = z_prev + dt * self.func(z_prev)
            z_list.append(z_next)
        z = torch.stack(z_list, dim=0)  # (T, 2)

        # Map to stable AR coefficients
        kappa = torch.tanh(z)          # reflection coeffs in (-1, 1)
        a = levinson_order2(kappa)     # stable AR(2) coeffs

        # One-step prediction; the first two steps lack a full lag history and stay at 0
        xhat = torch.zeros(T, device=x_seq.device, dtype=a.dtype)
        xhat[2:] = (a[2:] * phi_seq[2:]).sum(dim=1)
        
        return xhat, a, z

---

## 2. Lag-Attention TVAR (Transformer-based)

**Source**: `DeepLagAttention.ipynb`

**Key Idea**: Build a "lag bank" of past values $[x_{t-1}, x_{t-2}, \ldots, x_{t-L}]$ and use transformer attention to learn which lags are important at each time step. A sparse top-k selection focuses on the most relevant lags.

**Prediction**:
$$\hat{x}_t = \sum_{\ell=1}^{L} w_\ell(t) \cdot x_{t-\ell} + c(t)$$

where $w(t)$ is a learned, time-varying attention distribution over lags (nonnegative weights that sum to one) and $c(t)$ is a time-varying bias predicted from the context.

**Two Variants**:
1. `LagAttentionTVAR` — Full transformer encoder over lag tokens
2. `LagAttentionTVARFast` — Bilinear scoring (no transformer, faster)
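
The lag bank and the sparse top-k weighting can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the notebook's implementation: the random `logits` stand in for learned scores, the bias $c(t)$ is omitted, and the helper names `lag_bank` and `topk_softmax` are hypothetical.

```python
import numpy as np

def lag_bank(x, L):
    """lags[t, l] = x[t - (l + 1)], with zeros where the index is negative."""
    padded = np.concatenate([np.zeros(L), x])
    t = np.arange(len(x))[:, None]
    l = np.arange(L)[None, :]
    return padded[L - 1 + t - l]

def topk_softmax(logits, k):
    """Softmax over the k largest logits in each row; all other weights are 0."""
    kth = np.sort(logits, axis=-1)[:, -k][:, None]        # k-th largest per row
    masked = np.where(logits >= kth, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = np.arange(1.0, 7.0)                  # x_0..x_5 = 1..6
lags = lag_bank(x, L=4)                  # shape (6, 4)
logits = rng.normal(size=lags.shape)     # stand-in for learned scores
w = topk_softmax(logits, k=2)
x_hat = (w * lags).sum(axis=-1)          # bias c(t) omitted here

print(lags[4])   # lags of x_4 in the order x_3, x_2, x_1, x_0: [4. 3. 2. 1.]
```

Because the weights come from a softmax, each prediction is a convex combination of the selected lags; any negative contribution has to come through the bias term.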

In [None]:
# ==============================================================================
# Lag-Attention TVAR (from DeepLagAttention.ipynb)
# ==============================================================================

import torch.nn.functional as F


def build_lag_bank(x_tensor, L):
    """
    Build lag bank from time series.
    
    Parameters
    ----------
    x_tensor : Tensor, shape (B, T)
        Input time series
    L : int
        Number of lags
    
    Returns
    -------
    Xlags : Tensor, shape (B, T, L)
        Xlags[:, t, l] = x_{t-(l+1)}, left-padded with zeros
    """
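    B, T = x_tensor.shape
    pads = F.pad(x_tensor, (L, 0))  # [B, T+L]; pads[:, j] = x[:, j-L], zeros for j < L
    idx_base = torch.arange(T, device=x_tensor.device).view(T, 1)
    lag_offsets = torch.arange(L, device=x_tensor.device).view(1, L)
    # Lag l+1 at time t sits at padded index (t - (l+1)) + L = L - 1 + t - l.
    # Adding lag_offsets instead would read future samples and run out of bounds.
    gather_idx = L - 1 + idx_base - lag_offsets  # [T, L]
    Xlags = pads[:, gather_idx]  # [B, T, L]
    return Xlags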


def topk_mask_logits(logits, k):
    """Mask all but top-k logits with -inf for sparse softmax."""
    if k >= logits.shape[-1]:
        return logits
    topk = torch.topk(logits, k, dim=-1)
    mask = torch.full_like(logits, float('-inf'))
    return mask.scatter(-1, topk.indices, topk.values)


class LagAttentionTVAR(nn.Module):
    """
    Lag-Attention TVAR with full Transformer encoder.
    
    Learns time-varying attention weights over a bank of lagged values.
    Top-k masking keeps only the k highest-scoring lags at each time step,
    so the resulting weights are sparse.
    
    Parameters
    ----------
    L : int
        Number of lags in the bank
    d_model : int
        Model dimension
    n_layers : int
        Number of transformer layers
    n_heads : int
        Number of attention heads
    topk : int
        Top-k sparsity in attention
    use_var : bool
        If True, also predict log-variance for NLL loss
    """
    def __init__(self, L=256, d_model=128, n_layers=2, n_heads=4, topk=8, use_var=False):
        super().__init__()
        self.L = L
        self.topk = topk
        self.use_var = use_var

        # Embeddings for lag indices and values
        self.lag_embed = nn.Embedding(L + 1, d_model)
        self.val_proj = nn.Linear(1, d_model)

        # Transformer encoder for lag tokens
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads,
            dim_feedforward=4 * d_model,
            dropout=0.1, batch_first=True
        )
        self.lag_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)

        # Causal context (conv over past)
        self.ctx_conv = nn.Conv1d(1, d_model, kernel_size=9, padding=8, dilation=1)
        self.ctx_proj = nn.Linear(d_model, d_model)

        # Scoring: pair(lag_enc, context) → scalar
        self.score = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1)
        )
        self.bias_head = nn.Linear(d_model, 1)
        if use_var:
            self.logvar_head = nn.Linear(d_model, 1)

    def forward(self, x):
        """
        Parameters
        ----------
        x : Tensor, shape (B, T)
        
        Returns
        -------
        mu : Tensor, shape (B, T)
            Predicted mean
        logvar : Tensor or None
            Log-variance (if use_var=True)
        w : Tensor, shape (B, T, L)
            Attention weights over lags
        """
        B, T = x.shape
        L = self.L

        Xlags = build_lag_bank(x, L)  # [B, T, L]
        lag_vals = Xlags.unsqueeze(-1)  # [B, T, L, 1]
        lag_ids = torch.arange(1, L + 1, device=x.device).view(1, 1, L).expand(B, T, L)

        Hval = self.val_proj(lag_vals)  # [B, T, L, d]
        Hidx = self.lag_embed(lag_ids)  # [B, T, L, d]
        Hlag = Hval + Hidx  # [B, T, L, d]

        # Encode lag tokens per time step
        Hlag_flat = Hlag.view(B * T, L, -1)
        Henc = self.lag_encoder(Hlag_flat).view(B, T, L, -1)  # [B, T, L, d]

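        # Causal context from x, shifted right by one step so that ctx[:, t]
        # depends on x[t-9..t-1] only (an unshifted window would include the target x[t])
        x_past = F.pad(x.unsqueeze(1), (1, 0))[..., :T]  # [B, 1, T], x_past[t] = x[t-1]
        ctx = self.ctx_conv(x_past)  # [B, d, T+8]
        ctx = ctx[..., :T].transpose(1, 2)  # [B, T, d]
        ctx = self.ctx_proj(ctx)  # [B, T, d]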

        # Score lags with context
        ctx_exp = ctx.unsqueeze(2).expand(-1, -1, L, -1)
        pair = torch.cat([Henc, ctx_exp], dim=-1)  # [B, T, L, 2d]
        logits = self.score(pair).squeeze(-1)  # [B, T, L]

        logits_masked = topk_mask_logits(logits, k=self.topk)
        w = torch.softmax(logits_masked, dim=-1)  # [B, T, L]

        mu_ar = (w * Xlags).sum(dim=-1)  # [B, T]
        c = self.bias_head(ctx).squeeze(-1)  # [B, T]
        mu = mu_ar + c

        logvar = None
        if self.use_var:
            logvar = self.logvar_head(ctx).squeeze(-1).clamp(-8, 8)
        
        return mu, logvar, w


class LagAttentionTVARFast(nn.Module):
    """
    Fast Lag-Attention TVAR (no Transformer, bilinear scoring).
    
    Uses bilinear attention: score = <Wq·ctx, Wk·Hlag> instead of
    full transformer, making it much faster for long sequences.
    
    Parameters
    ----------
    L : int
        Number of lags
    d_model : int
        Model dimension
    topk : int
        Top-k sparsity
    use_var : bool
        If True, predict log-variance
    """
    def __init__(self, L=256, d_model=128, topk=8, use_var=False):
        super().__init__()
        self.L = L
        self.topk = topk
        self.use_var = use_var

        self.lag_embed = nn.Embedding(L + 1, d_model)
        self.val_proj = nn.Linear(1, d_model)

        # Causal context (left-padded conv)
        self.ctx_pad = nn.ConstantPad1d((8, 0), 0)
        self.ctx_conv = nn.Conv1d(1, d_model, kernel_size=9, padding=0)
        self.ctx_proj = nn.Linear(d_model, d_model)

        # Bilinear scorer
        self.Wq = nn.Linear(d_model, d_model, bias=False)
        self.Wk = nn.Linear(d_model, d_model, bias=False)

        self.bias_head = nn.Linear(d_model, 1)
        if use_var:
            self.logvar_head = nn.Linear(d_model, 1)

        # Init for stability
        for m in [self.Wq, self.Wk, self.val_proj, self.bias_head]:
            if hasattr(m, 'weight'):
                nn.init.xavier_uniform_(m.weight)

    def forward(self, x):
        B, T = x.shape
        L = self.L

        Xlags = build_lag_bank(x, L)  # [B, T, L]
        lag_vals = Xlags.unsqueeze(-1)  # [B, T, L, 1]
        lag_ids = torch.arange(1, L + 1, device=x.device).view(1, 1, L).expand(B, T, L)

        Hval = self.val_proj(lag_vals)  # [B, T, L, d]
        Hidx = self.lag_embed(lag_ids)  # [B, T, L, d]
        Hlag = Hval + Hidx  # [B, T, L, d]

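        # Causal context, shifted right by one step so that ctx[:, t] depends on
        # x[t-9..t-1] only (an unshifted window would include the target x[t])
        x_past = F.pad(x.unsqueeze(1), (1, 0))[..., :T]  # [B, 1, T], x_past[t] = x[t-1]
        ctx = self.ctx_conv(self.ctx_pad(x_past))  # [B, d, T]
        ctx = ctx.transpose(1, 2)  # [B, T, d]
        ctx = self.ctx_proj(ctx)  # [B, T, d]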

        # Bilinear scores
        q = self.Wq(ctx)  # [B, T, d]
        k = self.Wk(Hlag)  # [B, T, L, d]
        logits = torch.einsum('btd,btld->btl', q, k)  # [B, T, L]

        # Top-k masking
        if (self.topk is not None) and (self.topk < L):
            topk_vals = torch.topk(logits, self.topk, dim=-1)
            mask = torch.full_like(logits, float('-inf'))
            logits = mask.scatter(-1, topk_vals.indices, topk_vals.values)

        w = torch.softmax(logits, dim=-1)  # [B, T, L]
        mu_ar = (w * Xlags).sum(dim=-1)  # [B, T]
        c = self.bias_head(ctx).squeeze(-1)  # [B, T]
        mu = mu_ar + c

        logvar = None
        if self.use_var:
            logvar = self.logvar_head(ctx).squeeze(-1).clamp(-8, 8)
        
        return mu, logvar, w


def gaussian_nll(y, mu, logvar):
    """Gaussian negative log-likelihood loss."""
    if logvar is None:
        logvar = torch.zeros_like(mu)
    return 0.5 * (logvar + (y - mu)**2 / (logvar.exp() + 1e-8))

---

## 3. Transformer AR (Fixed Lag Order)

**Source**: `fixed_linear_transformer.ipynb`

**Key Idea**: Standard transformer encoder operating on a fixed-size lag sequence. Uses a CLS token to summarize the lag information and predict the next value.

**Architecture**:
- Input: $[x_{t-1}, x_{t-2}, \ldots, x_{t-P}]$ as sequence of tokens
- Learned positional embeddings for lag positions
- CLS token prepended, then transformed
- Output head predicts $\hat{x}_t$ from CLS representation

**Use Case**: When AR order is fixed but nonlinear lag interactions are important.
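
The input layout can be sketched as follows: one way to turn a 1-D series into lag-token sequences of shape `(N, P, 1)` with the newest lag first, assuming NumPy 1.20 or later for `sliding_window_view`. The helper name `make_lag_tokens` is hypothetical and does not come from the source notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_lag_tokens(x, P):
    """Return (X_seq, y) where X_seq[i] = [x_{t-1}, ..., x_{t-P}] for t = P + i."""
    windows = sliding_window_view(x, P)[:-1]   # rows hold x[t-P .. t-1]
    X_seq = windows[:, ::-1, None].copy()      # newest lag first, shape (N, P, 1)
    y = x[P:]                                  # targets x_t
    return X_seq.astype(np.float32), y.astype(np.float32)

x = np.arange(10, dtype=np.float64)
X_seq, y = make_lag_tokens(x, P=3)
print(X_seq.shape, y.shape)      # (7, 3, 1) (7,)
print(X_seq[0, :, 0], y[0])      # [2. 1. 0.] 3.0
```

Wrapping `X_seq` with `torch.from_numpy` would give a tensor in the layout the architecture list describes, with position 0 holding $x_{t-1}$.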

In [None]:
# ==============================================================================
# Transformer AR (from fixed_linear_transformer.ipynb)
# ==============================================================================

class TransformerAR(nn.Module):
    """
    Transformer-based AR model with fixed lag order.
    
    Uses a CLS token to summarize the lag sequence and predict the next value.
    Suitable for learning nonlinear interactions between lags.
    
    Parameters
    ----------
    Pmax : int
        Maximum lag order (sequence length)
    d_model : int
        Transformer model dimension
    nhead : int
        Number of attention heads
    depth : int
        Number of transformer layers
    dropout : float
        Dropout rate
    """
    def __init__(self, Pmax, d_model=64, nhead=4, depth=2, dropout=0.1):
        super().__init__()
        self.Pmax = Pmax
        self.d_model = d_model

        # Project scalar lag values to d_model
        self.in_proj = nn.Linear(1, d_model)

        # Learned positional embeddings for lag positions 0..Pmax-1
        self.pos = nn.Parameter(torch.zeros(1, Pmax, d_model))
        nn.init.normal_(self.pos, mean=0.0, std=0.02)

        # CLS token to summarize the sequence
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        nn.init.normal_(self.cls, mean=0.0, std=0.02)

        # Transformer encoder
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=4 * d_model,
            dropout=dropout,
            activation="gelu",
            batch_first=True,
            norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        
        # Output projection
        self.out = nn.Linear(d_model, 1)

    def forward(self, X_seq):
        """
        Parameters
        ----------
        X_seq : Tensor, shape (B, Pmax, 1)
            Lag sequence where position 0 is x_{t-1}, position 1 is x_{t-2}, etc.
        
        Returns
        -------
        yhat : Tensor, shape (B,)
            Predicted next value
        """
        h = self.in_proj(X_seq) + self.pos  # (B, Pmax, d)
        cls = self.cls.expand(X_seq.size(0), 1, -1)  # (B, 1, d)
        h = torch.cat([cls, h], dim=1)  # (B, 1+Pmax, d)

        h = self.encoder(h)  # (B, 1+Pmax, d)
        h_cls = h[:, 0, :]  # (B, d) — CLS token output
        yhat = self.out(h_cls).squeeze(-1)  # (B,)
        
        return yhat

---

## 4. MLP TVAR (Hyper-Network)

**Source**: `tvar_linear_mlp.ipynb`

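**Key Idea**: A "hyper-network" MLP takes the lag vector as input and outputs the AR coefficients for that step, so the coefficients vary over time through their dependence on the recent signal. The prediction is then a linear combination of the lags using those coefficients.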

$$\hat{x}_t = \mathbf{a}(z_t)^\top z_t + b(z_t)$$

where $z_t = [x_{t-1}, \ldots, x_{t-p}]$ and both $\mathbf{a}$ and $b$ are produced by the MLP.

**Architecture**:
- Backbone: MLP with GELU activation, LayerNorm, Dropout
- Head: Linear layer outputting $[a_1, \ldots, a_p, b]$
- Prediction: Dot product of learned coefficients with lags + bias
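
The prediction rule can be illustrated with a tiny, untrained network in NumPy. This is a sketch under simplifying assumptions: a single `tanh` hidden layer replaces the GELU, LayerNorm and Dropout backbone, the weights are random, and the function name `hyper_ar_predict` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p, hidden = 3, 8

# Randomly initialised one-hidden-layer network (untrained, illustration only)
W1, b1 = rng.normal(size=(p, hidden)) * 0.5, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, p + 1)) * 0.5, np.zeros(p + 1)

def hyper_ar_predict(Z):
    """Z: (B, p) lag vectors -> (pred, coeffs, bias)."""
    h = np.tanh(Z @ W1 + b1)
    out = h @ W2 + b2                      # (B, p + 1)
    coeffs, bias = out[:, :p], out[:, p]   # a(z_t) and b(z_t)
    pred = (coeffs * Z).sum(axis=1) + bias
    return pred, coeffs, bias

Z = rng.normal(size=(4, p))
pred, coeffs, bias = hyper_ar_predict(Z)
print(coeffs.round(3))   # one coefficient vector per input row
```

Each row gets its own coefficient vector, so the model is linear in the lags once the coefficients are fixed but nonlinear in $z_t$ overall.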

In [None]:
# ==============================================================================
# MLP TVAR — Hyper-Network (from tvar_linear_mlp.ipynb)
# ==============================================================================

class MLPTVAR(nn.Module):
    """
    MLP-based Time-Varying AR model (Hyper-Network approach).
    
    The MLP takes the lag vector z_t = [x_{t-1}, ..., x_{t-p}] and outputs
    time-varying coefficients a(t) and bias b(t). Prediction is:
    
        x̂_t = a(z_t)ᵀ z_t + b(z_t)
    
    Parameters
    ----------
    p_max : int
        Maximum lag order
    hidden : int
        Hidden dimension
    depth : int
        Number of hidden layers
    dropout : float
        Dropout rate
    """
    def __init__(self, p_max, hidden=128, depth=3, dropout=0.1):
        super().__init__()
        
        layers = []
        in_dim = p_max
        for _ in range(depth):
            layers.append(nn.Linear(in_dim, hidden))
            layers.append(nn.GELU())
            layers.append(nn.LayerNorm(hidden))
            layers.append(nn.Dropout(dropout))
            in_dim = hidden
        self.backbone = nn.Sequential(*layers)

        # Output: [a_1, ..., a_p_max, b]
        self.head = nn.Linear(hidden, p_max + 1)
        self.p_max = p_max

    def forward(self, Z):
        """
        Parameters
        ----------
        Z : Tensor, shape (B, p_max)
            Lag features [x_{t-1}, ..., x_{t-p}]
        
        Returns
        -------
        pred : Tensor, shape (B,)
            Predicted next value
        coeffs : Tensor, shape (B, p_max)
            Learned time-varying AR coefficients
        bias : Tensor, shape (B,)
            Learned time-varying bias
        """
        h = self.backbone(Z)
        out = self.head(h)  # [B, p_max+1]
        
        coeffs = out[:, :self.p_max]  # [B, p_max]
        bias = out[:, self.p_max:]  # [B, 1]
        
        # Dot product + bias
        pred = (coeffs * Z).sum(dim=1, keepdim=True) + bias
        
        return pred.squeeze(1), coeffs, bias.squeeze(1)

---

## 5. Neural Operator (Continuous Delay Kernel)

**Source**: `NeuralOperator.ipynb`

**Key Idea**: Model TVAR as an integral operator over continuous delays:

$$\hat{x}_t = c(t) + \int_{\tau_{\min}}^{\tau_{\max}} k_t(\tau) \, x(t-\tau) \, d\tau$$

The kernel $k_t(\tau)$ is parameterized via Fourier features over $\tau$:

$$k_t(\tau) = \sum_{m=1}^{M} \left[ a_m(t) \cos(\omega_m \tau) + b_m(t) \sin(\omega_m \tau) \right]$$

where the amplitudes $a_m(t), b_m(t)$ are produced by a causal context encoder.

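**Key Advantage**: The kernel is defined over delays in seconds rather than sample indices, so the same kernel can be applied at a different sampling interval $\Delta t$, with lagged values obtained by interpolation. The implementation below assumes a uniform $\Delta t$ within each sequence, and its context encoder still operates on samples.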
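
To make the kernel parameterization concrete, the sketch below builds $k(\tau)$ from fixed Fourier amplitudes and evaluates the Riemann sum at one time point for the same 3 Hz sine sampled at two different intervals. The amplitudes, the test signal and the function name `operator_output` are assumptions chosen for illustration; in the model the amplitudes would come from the context encoder.

```python
import numpy as np

tau = np.linspace(0.0, 0.5, 128)                 # delay grid in seconds
omega = np.linspace(0.0, np.pi, 4)               # M = 4 angular frequencies
a = np.array([0.5, -0.3, 0.2, 0.1])              # cosine amplitudes (fixed here)
b = np.array([0.0, 0.4, -0.2, 0.1])              # sine amplitudes (fixed here)

# k(tau) = sum_m a_m cos(omega_m tau) + b_m sin(omega_m tau)
phase = tau[:, None] * omega[None, :]            # (L, M)
kernel = np.cos(phase) @ a + np.sin(phase) @ b   # (L,)
delta_tau = tau[1] - tau[0]

def operator_output(dt, t0=1.0):
    """Riemann sum of k(tau) * x(t0 - tau) for a 3 Hz sine sampled every dt seconds."""
    t = np.arange(0.0, t0 + dt / 2, dt)
    x = np.sin(2 * np.pi * 3.0 * t)
    x_delayed = np.interp(t0 - tau, t, x)        # linear interpolation of x(t0 - tau)
    return float((kernel * x_delayed).sum() * delta_tau)

y_coarse = operator_output(dt=0.01)
y_fine = operator_output(dt=0.005)
print(y_coarse, y_fine)   # close to each other despite different sampling intervals
```

The two outputs differ only through the interpolation error, because the delay grid and the kernel are the same in both cases.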

In [None]:
# ==============================================================================
# Neural Operator — Continuous Delay Kernel (from NeuralOperator.ipynb)
# ==============================================================================

import math


def fractional_delay_samples(x, tau_grid, dt, t_offset=0):
    """
    Sample x(t - τ) with linear interpolation for continuous delays.
    
    Parameters
    ----------
    x : Tensor, shape (B, T)
        Input time series
    tau_grid : Tensor, shape (L,)
        Delay values in seconds
    dt : float
        Sampling interval in seconds
    t_offset : int
        Offset for absolute time indexing
    
    Returns
    -------
    Xlags : Tensor, shape (B, T, L)
        Interpolated lagged values
    """
    B, T = x.shape
    device, dtype = x.device, x.dtype

    if isinstance(tau_grid, torch.Tensor):
        tau_idx = tau_grid.to(device=device, dtype=dtype) / float(dt)
    else:
        tau_idx = torch.as_tensor(tau_grid, device=device, dtype=dtype) / float(dt)

    t_idx = torch.arange(T, device=device, dtype=dtype).view(1, T, 1) + float(t_offset)

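    src = t_idx - tau_idx.view(1, 1, -1)  # [1, T, L], fractional source index
    src_floor = torch.floor(src)
    w = src - src_floor  # interpolation weight in [0, 1)
    # Clamp indices only after computing w, so out-of-range positions hold the
    # edge sample instead of extrapolating with weights outside [0, 1]
    src0 = torch.clamp(src_floor, 0, T - 1).to(torch.long)
    src1 = torch.clamp(src_floor + 1, 0, T - 1).to(torch.long)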

    idx0 = src0.expand(B, -1, -1).contiguous()
    idx1 = src1.expand(B, -1, -1).contiguous()

    L = tau_idx.numel()
    x_exp = x.unsqueeze(-1).repeat(1, 1, L).contiguous()

    x0 = torch.gather(x_exp, 1, idx0)
    x1 = torch.gather(x_exp, 1, idx1)
    
    return (1 - w) * x0 + w * x1


def total_variation_time(k):
    """Total variation regularizer over time dimension."""
    return (k[:, 1:, :] - k[:, :-1, :]).abs().mean()


def l1_energy(k):
    """L1 sparsity regularizer."""
    return k.abs().mean()


class TVAROperator(nn.Module):
    """
    Time-Varying AR as a Neural Operator over continuous delays.
    
    Models the prediction as an integral:
        y_t = c(t) + ∫ k_t(τ) x(t-τ) dτ
    
    The kernel k_t(τ) is parameterized with Fourier features over τ,
    with time-varying amplitudes from a causal context encoder.
    
    Parameters
    ----------
    L : int
        Number of delay points to discretize
    tau_min : float
        Minimum delay (seconds)
    tau_max : float
        Maximum delay (seconds)
    n_modes : int
        Number of Fourier modes for kernel
    hidden : int
        Channels in context encoder
    """
    def __init__(self, L=128, tau_min=0.0, tau_max=0.5, n_modes=16, hidden=64):
        super().__init__()
        assert tau_max > tau_min >= 0.0
        self.L = L
        self.tau_min = tau_min
        self.tau_max = tau_max
        self.register_buffer("tau_grid", torch.linspace(tau_min, tau_max, L))

        # Causal context encoder (1D convs, left-padded)
        self.ctx = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=9, padding=8, dilation=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=4, dilation=2),
            nn.ReLU()
        )

        # Fourier basis over τ
        self.register_buffer("freqs", torch.linspace(0.0, math.pi, n_modes))
        self.head_a = nn.Linear(hidden, n_modes)  # cos amplitudes
        self.head_b = nn.Linear(hidden, n_modes)  # sin amplitudes
        self.bias = nn.Linear(hidden, 1)  # c(t), time-varying intercept

        # Global gain for stability
        self.kernel_gain = nn.Parameter(torch.tensor(0.1))

    def make_kernel(self, h):
        """
        Generate time-varying kernel from context features.
        
        Parameters
        ----------
        h : Tensor, shape (B, T, H)
            Context features
        
        Returns
        -------
        k : Tensor, shape (B, T, L)
            Kernel values over delay grid
        """
        B, T, H = h.shape
        a = self.head_a(h)  # [B, T, M]
        b = self.head_b(h)  # [B, T, M]
        
        # Fourier features over τ
        tau = self.tau_grid.view(1, 1, self.L, 1)  # [1, 1, L, 1]
        omega = self.freqs.view(1, 1, 1, -1)  # [1, 1, 1, M]
        cosF = torch.cos(omega * tau)  # [1, 1, L, M]
        sinF = torch.sin(omega * tau)  # [1, 1, L, M]
        
        # Combine with time-varying amplitudes
        k = (a.unsqueeze(2) * cosF + b.unsqueeze(2) * sinF).sum(-1)  # [B, T, L]
        
        return self.kernel_gain * k

    def forward(self, x, dt, t_offset=0, return_kernel=False):
        """
        Parameters
        ----------
        x : Tensor, shape (B, T)
            Input time series
        dt : float
            Sampling interval in seconds
        t_offset : int
            Offset for absolute time indexing
        return_kernel : bool
            If True, also return intermediate values
        
        Returns
        -------
        yhat : Tensor, shape (B, T)
            Predicted values
        k : Tensor, shape (B, T, L)
            Kernel (if return_kernel=True)
        c : Tensor, shape (B, T)
            Bias (if return_kernel=True)
        Xlags : Tensor, shape (B, T, L)
            Lagged values (if return_kernel=True)
        """
        B, T = x.shape

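        # Causal context features. The convs are symmetrically padded, so each
        # output index sees 12 samples on either side. Left-pad by 32 and read the
        # output at index t + 20 so that h[..., t] depends only on x[t-24..t].
        x1 = x.unsqueeze(1)  # [B, 1, T]
        h = self.ctx(F.pad(x1, (32, 0)))  # [B, H, T+32]
        h = h[..., 20:20 + T]  # [B, H, T]
        hT = h.transpose(1, 2)  # [B, T, H]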

        k = self.make_kernel(hT)  # [B, T, L]
        c = self.bias(hT).squeeze(-1)  # [B, T]

        # Sample lagged signal at continuous τ-grid
        Xlags = fractional_delay_samples(x, self.tau_grid, float(dt), t_offset=t_offset)

        # Riemann sum approximation of integral
        delta_tau = (self.tau_max - self.tau_min) / max(self.L - 1, 1)
        yhat = (k * Xlags).sum(-1) * delta_tau + c  # [B, T]

        if return_kernel:
            return yhat, k, c, Xlags
        return yhat

---

## Utilities

Helpers for data preparation (`make_lags`, `make_phi_seq`) are defined in the final code cell, after the AR(p) baseline.

---

## 6. AR(p) Baseline (OLS)

**Key Idea**: Classic autoregressive model with fixed coefficients, estimated via Ordinary Least Squares.

$$x_t = c + \sum_{i=1}^{p} a_i \, x_{t-i} + \epsilon_t, \qquad \hat{x}_t = c + \sum_{i=1}^{p} a_i \, x_{t-i}$$

The intercept $c$ and the coefficients $\mathbf{a} = [a_1, \ldots, a_p]$ are estimated by minimizing:

$$\min_{c,\, \mathbf{a}} \| \mathbf{y} - c\,\mathbf{1} - \mathbf{X} \mathbf{a} \|^2$$

**Use Case**: Baseline for stationary signals. Compare TVAR models against this to verify time-varying behavior.
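
As a small worked example of the OLS fit, the sketch below simulates a stationary AR(2) process and recovers its coefficients with `np.linalg.lstsq`. The true coefficients, series length and seed are arbitrary choices for illustration. It also checks the roots afterwards, since OLS itself does not constrain them.

```python
import numpy as np

rng = np.random.default_rng(0)
true_a = np.array([0.6, -0.3])           # stable AR(2), chosen for illustration
T, p = 5000, 2

x = np.zeros(T)
for t in range(p, T):
    x[t] = true_a @ x[t - p:t][::-1] + rng.normal()

# Design matrix with columns x_{t-1}, ..., x_{t-p}, plus an intercept column
X = np.column_stack([x[p - k:T - k] for k in range(1, p + 1)])
A = np.column_stack([np.ones(T - p), X])
beta, *_ = np.linalg.lstsq(A, x[p:], rcond=None)
intercept, a_hat = beta[0], beta[1:]

# OLS does not constrain the roots, so stability is checked after fitting
roots = np.roots(np.concatenate([[1.0], -a_hat]))
print(a_hat.round(3), np.abs(roots).max() < 1.0)
```

On a stationary series like this one the estimates should land close to the true values; on a series with drifting dynamics the single coefficient vector can only capture an average, which is what the TVAR models are compared against.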

In [None]:
# ==============================================================================
# AR(p) Baseline — OLS Estimation
# ==============================================================================

class ARModel:
    """
    Classic AR(p) model with OLS estimation.
    
    A simple baseline for comparison with time-varying models.
    Uses numpy for fitting (no PyTorch needed).
    
    Parameters
    ----------
    p : int
        AR order (number of lags)
    
    Attributes
    ----------
    coeffs : ndarray, shape (p,)
        Estimated AR coefficients [a_1, ..., a_p]
    intercept : float
        Estimated intercept term
    residual_std : float
        Standard deviation of residuals
    """
    def __init__(self, p: int):
        self.p = p
        self.coeffs = None
        self.intercept = None
        self.residual_std = None
    
    def fit(self, x: np.ndarray) -> "ARModel":
        """
        Fit AR(p) model using OLS.
        
        Parameters
        ----------
        x : ndarray, shape (T,)
            Time series to fit
        
        Returns
        -------
        self : ARModel
            Fitted model
        """
        T = len(x)
        p = self.p
        
        # Build design matrix X and target y
        X = np.column_stack([x[p-k:T-k] for k in range(1, p+1)])  # (T-p, p)
        y = x[p:]  # (T-p,)
        
        # Add intercept column
        X_with_intercept = np.column_stack([np.ones(len(y)), X])  # (T-p, p+1)
        
        # OLS: β = (X'X)^{-1} X'y
        beta = np.linalg.lstsq(X_with_intercept, y, rcond=None)[0]
        
        self.intercept = beta[0]
        self.coeffs = beta[1:]
        
        # Compute residuals
        y_hat = X_with_intercept @ beta
        residuals = y - y_hat
        self.residual_std = np.std(residuals)
        
        return self
    
    def predict(self, x: np.ndarray) -> np.ndarray:
        """
        One-step ahead prediction (teacher-forced).
        
        Parameters
        ----------
        x : ndarray, shape (T,)
            Input time series
        
        Returns
        -------
        y_hat : ndarray, shape (T-p,)
            Predictions for x[p], x[p+1], ...
        """
        if self.coeffs is None:
            raise ValueError("Model not fitted. Call fit() first.")
        
        T = len(x)
        p = self.p
        
        X = np.column_stack([x[p-k:T-k] for k in range(1, p+1)])
        y_hat = self.intercept + X @ self.coeffs
        
        return y_hat
    
    def forecast(self, x: np.ndarray, horizon: int) -> np.ndarray:
        """
        Multi-step ahead forecast (autoregressive rollout).
        
        Parameters
        ----------
        x : ndarray, shape (T,)
            Input time series (uses last p values as initial condition)
        horizon : int
            Number of steps to forecast
        
        Returns
        -------
        forecast : ndarray, shape (horizon,)
            Forecasted values
        """
        if self.coeffs is None:
            raise ValueError("Model not fitted. Call fit() first.")
        
        p = self.p
        buffer = list(x[-p:])  # Last p values
        forecast = []
        
        for _ in range(horizon):
            lags = np.array(buffer[-p:][::-1])  # [x_{t-1}, ..., x_{t-p}]
            y_next = self.intercept + np.dot(self.coeffs, lags)
            forecast.append(y_next)
            buffer.append(y_next)
        
        return np.array(forecast)
    
    def __repr__(self):
        if self.coeffs is None:
            return f"ARModel(p={self.p}, fitted=False)"
        return f"ARModel(p={self.p}, coeffs={self.coeffs.round(4)}, intercept={self.intercept:.4f})"

In [None]:
# ==============================================================================
# Common Utilities
# ==============================================================================

def make_lags(x, p):
    """
    Create lagged design matrix for AR(p) regression.
    
    Parameters
    ----------
    x : ndarray, shape (T,)
        Time series
    p : int
        Lag order
    
    Returns
    -------
    X : ndarray, shape (T-p, p)
        Lagged features [x_{t-1}, ..., x_{t-p}]
    y : ndarray, shape (T-p,)
        Target values x_t
    """
    X = np.stack([x[p-k:len(x)-k] for k in range(1, p+1)], axis=1)
    y = x[p:]
    return X, y


def make_phi_seq(x, p=2):
    """
    Create lagged feature sequence for Neural ODE TVAR.
    
    Parameters
    ----------
    x : ndarray, shape (T,)
        Time series
    p : int
        Number of lags
    
    Returns
    -------
    phi : ndarray, shape (T, p)
        Lagged features (zero-padded for t < p)
    """
    phi = np.stack([np.roll(x, k) for k in range(1, p+1)], axis=1).astype(np.float32)
    phi[:p] = 0.0
    return phi

---

## Summary

| Model | Key Idea | Stability | Continuous τ | Time-Varying | Source |
|-------|----------|-----------|--------------|--------------|--------|
| **ARModel** | OLS baseline | ❌ Not enforced (check roots) | ❌ | ❌ Fixed | — |
| **NeuralODE_TVAR2** | ODE evolves latent → Levinson-Durbin | ✅ Guaranteed | ❌ | ✅ | `NeuralODE.ipynb` |
| **LagAttentionTVAR** | Transformer attention over lag bank | ❌ Learned | ❌ | ✅ | `DeepLagAttention.ipynb` |
| **LagAttentionTVARFast** | Bilinear scoring (no transformer) | ❌ Learned | ❌ | ✅ | `DeepLagAttention.ipynb` |
| **TransformerAR** | CLS token over lag sequence | ❌ N/A | ❌ | ❌ Nonlinear | `fixed_linear_transformer.ipynb` |
| **MLPTVAR** | Hyper-network outputs coefficients | ❌ Learned | ❌ | ✅ | `tvar_linear_mlp.ipynb` |
| **TVAROperator** | Fourier kernel over continuous delays | ❌ Regularized | ✅ | ✅ | `NeuralOperator.ipynb` |

### Next Steps
- Move models to `stochastic_dynamics/models/`
- Add training loops and evaluation metrics
- Benchmark on simulated TVAR data