In a previous post, I described a procedure for sampling from probability distributions using Monte Carlo methods.
One of the subtle points that I tried to address there is what happens when the quantity we want to sample is a function of space and time.
For example, we might want to infer the conductivity of an aquifer from well head measurements; the material density in the earth's subsurface from satellite gravimetry; or basal drag between a glacier and the landscape from measurements of its surface velocity.
The parameter space is now a space of functions, and so it has infinitely many dimensions.
Extending common algorithms to work in function spaces is hard.

Since writing that first post, I read [this paper](https://doi.org/10.1214/13-STS421) by Cotter and others.
I now think my previous post was wrong.
Here I'd like to explain why (it involves Weyl's law) and how to fix it (it involves Nitsche's method).

The failure is trying to bolt a statistical interpretation onto deterministic inverse problems.
I'm guilty of this.
I'd argue that many textbooks on inverse problems are too.
One misstep is to assume that common choices of regularization functional used in the deterministic inverse problems literature are going to translate into sensible prior distributions.
They don't.

To clarify the notation for later, we're interested in inferring a field $q$ defined over some spatial domain $\Omega$.
In a deterministic inverse problem, we would add a multiple of a functional $R(q)$ to the objective in order to regularize the problem.
The conventional approach is to use a multiple of
$$R(q) = \frac{1}{2}\int_\Omega|\nabla q|^2dx.$$
We can make this look like a statistical inference problem by taking the prior probability distribution $\rho$ to have
$$-\ln \rho(q) \propto R(q) + \ldots$$
The ellipsis denotes (in the quantum mechanic's parlance) another infinite constant that we can set to zero.

### Theory

Our goal is to construct a procedure for sampling from a probability distribution $\pi(q)$.
We assume that the parameters $q$ live in a separable Hilbert space $H$.
When $H$ is infinite-dimensional, there are a couple of hard parts that don't occur in finite dimensions.

For the rest of this post, we'll be concerned exclusively with the case where $\pi$ is a Gaussian measure.
In future posts we'll look at what happens when it isn't.

#### Proper priors

In the deterministic inverse problems literature, it's common to use
$$R(q) = \frac{1}{2}\int_\Omega|\nabla q|^2dx$$
as a regularization functional.
This accomplishes several goals.
First, the inferred parameters obtained without regularization usually have spurious high-wavenumber noise.
Adding this penalty filters out the noise.
Second, it can guarantee a unique solution where the problem would be otherwise ill-posed.

If we try to form a Gaussian measure $\rho$ for which $-\ln\rho \propto R(q) + \ldots$, we'll run aground right away.
Unless we also add boundary conditions to $q$, there is no constraint on the average value of $q$.
We can add any constant we want to $q$ and the value of $R$ is unchanged.
Viewing this in a probabilistic light now, the putative prior distribution doesn't integrate to 1 -- it is *improper*.
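
The null space of the gradient penalty is easy to see in a toy discretization. Here's a minimal sketch, assuming a 1D finite-difference grid on the unit interval rather than the finite element spaces used later; the names `D`, `K`, and `A` are made up for this illustration.

```python
import numpy as np

n = 50               # number of grid points (illustrative choice)
h = 1.0 / (n - 1)    # grid spacing on the unit interval

# Forward-difference matrix: (D q)_i = (q_{i+1} - q_i) / h
D = (np.eye(n, k=1) - np.eye(n))[:-1] / h

# Discrete analogue of R(q) = 1/2 * integral of |q'|^2, i.e. 1/2 q^T K q
K = h * D.T @ D

# Constant fields are annihilated by K, so K is singular and cannot be
# the precision matrix of a proper Gaussian
ones = np.ones(n)
null_residual = np.abs(K @ ones).max()
smallest_eig_K = np.linalg.eigvalsh(K)[0]

# Adding a mass term (the q^2 penalty) removes the null space
A = h * np.eye(n) + K
smallest_eig_A = np.linalg.eigvalsh(A)[0]

print(null_residual, smallest_eig_K, smallest_eig_A)
```

The smallest eigenvalue of `K` is zero up to rounding, while that of `A` is strictly positive. The direction with zero precision has infinite prior variance.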

Strictly speaking, you can use an improper prior so long as the posterior is proper.
For that to work, the observational data need to constrain the constant mode, and in practice they often do.
So we would use an improper prior when we know nothing at all about what value the parameter can take and we expect that the data we have can give us that information.

Do we really have no prior information whatsoever about the average value of the parameters we want to infer?
Let's take the example I mentioned above about inferring the density of material within the earth from satellite observations of earth's gravitational pull.
Maybe you don't remember common rock densities off the top of your head, but you can go fetch one and weigh it and then put it in a beaker of water.
I'll save you the trouble: it's probably got a density around 3000 kg/m${}^3$.
Granted, the densities deep inside the earth are higher.
Say you guessed that the range spanned an order of magnitude in both directions: between 300 and 30,000 kg/m${}^3$.
You wouldn't be doing particularly well, but you wouldn't be doing terribly either.
So it beggars belief that we should use a prior for a geophysical inverse problem that is totally uninformative about the mean value.
It might have high variance.
It shouldn't be infinite.

#### Trace class operators

To prescribe a Gaussian measure, we need to know its mean and covariance.
The covariance has to be symmetric and positive-definite.
(You can rephrase my objection about properness above as a question about the nature of the covariance.)
But in the function space setting, there are other criteria.
We don't think about them in finite dimensions because they're almost vacuously true.
In the infinite-dimensional case they demand some thought (ugh).

The first criterion is that the sum of all the eigenvalues of the covariance operator, i.e. the trace, is finite.
In finite dimensions this is obvious.
In function spaces, being a trace class operator is special.
For example, the identity operator or any unitary map is not trace class.

Suppose we wanted to make the most minimal adaptation of a deterministic inverse problem into a statistical one.
Rather than use the improper prior I showed above, we instead use
$$R(q) = \frac{1}{2}\int_\Omega\left(q^2 + \alpha^2|\nabla q|^2\right)dx$$
where $\alpha$ is some length scale we have to choose.
If we define the operator
$$L = I - \alpha^2\Delta,$$
then $R(q) = \frac{1}{2}\langle Lq, q\rangle$.
The covariance is then equal to $L^{-1}$; the jargon is that $L$ is the *precision* operator.
Is $L^{-1}$ of trace class?

We can answer that question using Weyl's law.
I want to be careful here about the units, so I'll adopt a slightly different convention and say that $\lambda$, $\phi$ are an eigenvalue / eigenfunction pair for $-\Delta$ if
$$-\Delta\phi = \lambda^{-2}\phi.$$
This choice, instead of the more conventional one, makes $\lambda$ have units of length.
Weyl's law then implies that the eigenvalues decay like
$$\lambda_n \sim \text{const}\times\text{vol}(\Omega)^{1/d}\times n^{-1/d}.$$
I'm being so fussy about the units because you can guess at this formula from dimensional analysis.
So the eigenvalues of $(I - \alpha^2\Delta)^{-1}$ go like
$$\sigma_n = (1 + \alpha^2\lambda_n^{-2})^{-1} \sim \text{const}\times\frac{\text{vol}(\Omega)^{2/d}}{\alpha^2} \times n^{-2/d}.$$
In dimension $d = 2$ the eigenvalues decay like $n^{-1}$.
The trace diverges like the harmonic series.
In 3D the divergence is even more rapid.
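
The Weyl asymptotics are easy to check numerically in a case where the spectrum is known in closed form. This is a sketch under assumptions not made in the post: the domain is the unit square with Neumann boundary conditions, so the eigenvalues of $-\Delta$ are $\pi^2(j^2 + k^2)$ for integers $j, k \ge 0$, and in $d = 2$ the constant in Weyl's law is $(4\pi)^{-1/2}$.

```python
import numpy as np

m = 200  # largest wavenumber in each direction (illustrative)
j, k = np.meshgrid(np.arange(m + 1), np.arange(m + 1))
mu = np.sort((np.pi**2 * (j**2 + k**2)).ravel())

# Only eigenvalues below pi^2 m^2 are guaranteed to be fully enumerated
mu = mu[mu <= np.pi**2 * m**2]
mu = mu[1:]  # drop the zero eigenvalue of the constant mode

# Length-scale convention from the text: -Delta phi = lambda^{-2} phi
n = np.arange(1, len(mu) + 1)
lengths = mu ** -0.5

# Weyl's law with vol = 1 and d = 2: lambda_n ~ (4 pi n)^{-1/2}
weyl = 1.0 / np.sqrt(4 * np.pi * n)
ratio = lengths / weyl

for index in [100, 1000, 10000, 20000]:
    print(f"n = {index:6d}: lambda_n / Weyl estimate = {ratio[index - 1]:.4f}")
```

The ratio drifts toward 1 as $n$ grows, which is the $n^{-1/d}$ decay with $d = 2$.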

Suppose we instead take the precision to have the biharmonic operator as the leading term:
$$L = I - \alpha_1^2\Delta + \alpha_2^4\Delta^2$$
where now we need to pick two length scales $\alpha_1$, $\alpha_2$.
Weyl's law now comes to the rescue: the eigenvalues of $L^{-1}$ are instead asymptotic to
$$\sigma_n \sim \text{const} \times n^{-4/d}$$
which is summable in dimensions 2 and 3.
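
To see the difference between the two covariances concretely, we can add up partial traces. As a sketch, I'm again assuming the unit square with Neumann boundary conditions, where the eigenvalues of $-\Delta$ are $\pi^2(j^2 + k^2)$. I'm also taking both length scales equal to a single illustrative value `alpha`, with the biharmonic coefficient written as `alpha**4` so that the units balance.

```python
import numpy as np

alpha = 0.1  # illustrative length scale

def partial_traces(m):
    """Sum covariance eigenvalues over all wavenumbers j, k <= m."""
    j, k = np.meshgrid(np.arange(m + 1), np.arange(m + 1))
    mu = np.pi**2 * (j**2 + k**2)  # eigenvalues of -Delta
    second_order = np.sum(1.0 / (1.0 + alpha**2 * mu))
    fourth_order = np.sum(1.0 / (1.0 + alpha**2 * mu + alpha**4 * mu**2))
    return second_order, fourth_order

cutoffs = [50, 100, 200, 400]
traces = [partial_traces(m) for m in cutoffs]
for m, (second_order, fourth_order) in zip(cutoffs, traces):
    print(f"m = {m:3d}: second-order {second_order:8.3f}, "
          f"fourth-order {fourth_order:8.5f}")
```

Each doubling of the cutoff adds a roughly constant amount to the second-order partial trace, which is logarithmic divergence. The fourth-order partial trace settles down.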

Further down, I'll show some samples generated from both the wrong $I - \alpha^2\Delta$ prior and the biharmonic prior.

#### Proposal mechanism

Questions about what kinds of covariance operator to choose concern the problem that we wish to solve.
The final challenge I ran into is not about what problem to solve but how to solve it.
A common thread in the advanced MCMC sampling literature is to use a Langevin-type equation to generate samples:
$$\dot q = \frac{1}{2}K^{-1}\nabla\log\pi + K^{-1/2}\dot W$$
where $K$ is some symmetric positive-definite (s.p.d.) linear operator.
Discretize the ODE however you like and you get a proposal mechanism.

Seems fine but isn't.
In infinite dimensions, we can only use one discretization scheme.
To understand why, we need to think (ugh) about measure theory (yech).

Our goal was to sample from some intractable probability distribution $\pi$.
MCMC algorithms construct computable Markov chains whose limiting distribution is $\pi$.
The Metropolis-Hastings algorithm works by combining a proposal mechanism -- a random process that suggests new states of the Markov chain from the current value -- with an accept-reject step.
This Markov chain is in statistical equilibrium when the probability of going from $q_1$ to $q_2$ is the same as the probability of going backward.
In other words, if $P(q_1 \rightarrow q_2)$ is the transition kernel for Metropolis-Hastings, then
$$\pi(q_1)P(q_1\rightarrow q_2) = \pi(q_2)P(q_2\rightarrow q_1).$$
At each step, we use a proposal density $Q(q_1\rightarrow q_2)$ to select our next guess.
We then either accept or reject the proposal according to whether it is more or less likely under the posterior distribution, weighted by the dynamics of the proposal mechanism:
$$\alpha(q \rightarrow q^*) = \min\left\{1,\frac{\pi(q^*)Q(q^*\rightarrow q)}{\pi(q)Q(q\rightarrow q^*)}\right\}.$$
(As an aside, I never really understood this on a gut level until I read [this paper](https://www.jstor.org/stable/3182774).)
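
On a small discrete state space, we can build the whole Metropolis-Hastings transition matrix and check detailed balance directly. This is only a sketch: the target `target` and the asymmetric proposal matrix `Q` are random and hypothetical, chosen just to exercise the acceptance formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Hypothetical target distribution and asymmetric proposal matrix;
# row i of Q is the proposal distribution when the chain is in state i
target = rng.random(n)
target /= target.sum()
Q = rng.random((n, n))
Q /= Q.sum(axis=1, keepdims=True)

# Acceptance probability: min(1, pi_j Q_ji / (pi_i Q_ij))
ratio = (target[None, :] * Q.T) / (target[:, None] * Q)
accept = np.minimum(1.0, ratio)

# Off-diagonal transitions are accepted proposals; whatever probability
# is left over (rejections and self-proposals) stays on the diagonal
P = Q * accept
np.fill_diagonal(P, 0.0)
P[np.diag_indices(n)] = 1.0 - P.sum(axis=1)

# Detailed balance says the probability flux matrix is symmetric
flux = target[:, None] * P
print(np.abs(flux - flux.T).max())
print(np.abs(target @ P - target).max())
```

Both printed numbers are zero up to rounding: the flux is symmetric, and so `target` is stationary for `P`.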

Let's get pedantic.
What if the numerator or denominator is zero?
Some of the state space would be inaccessible.
The Markov chain wouldn't converge to the limiting density we want.
There's fancy measure theory jargon for the condition I'm talking about: the proposal density needs to be [absolutely continuous](https://en.wikipedia.org/wiki/Absolute_continuity) with respect to the reverse proposal.
In the finite-dimensional case, it's hard to even imagine how you could fuck up so badly that your proposal mechanism has zero probability of inhabiting part of the state space.
A pure random walk -- taking the proposal to be the current state plus a normal random variable -- won't fall into this trap.
It might converge real slow, but it can at least get from anywhere to anywhere.
But in the function space setting, *almost no proposal mechanisms are absolutely continuous*.

To understand this, we need to unpack the [Feldman-Hájek theorem](https://en.wikipedia.org/wiki/Feldman%E2%80%93H%C3%A1jek_theorem), which is kind of a doozy.
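
One standard consequence of the theorem, which I'm stating here rather than deriving, is that a Gaussian measure and a rescaled copy of itself are mutually singular in infinite dimensions. Here's a finite-dimensional sketch of how that shows up. The scale factor `s` and the sample count are arbitrary choices for illustration. We draw from $N(0, I_n)$ and evaluate the log of the density ratio against $N(0, s^2I_n)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(7)
s = 1.1            # illustrative rescaling of the standard deviation
num_samples = 200

def mean_log_ratio(n):
    # Draw x ~ N(0, I_n) and evaluate log[p_s(x) / p_1(x)], where p_s is
    # the density of N(0, s^2 I_n)
    x = rng.standard_normal((num_samples, n))
    sq_norms = np.sum(x**2, axis=1)
    log_ratio = -n * np.log(s) + 0.5 * (1.0 - 1.0 / s**2) * sq_norms
    return log_ratio.mean()

dims = [10, 100, 1000, 10000]
means = [mean_log_ratio(n) for n in dims]
for n, value in zip(dims, means):
    print(f"n = {n:6d}: mean log density ratio = {value:9.3f}")
```

The mean log ratio falls off roughly linearly in $n$, so the density ratio itself collapses toward zero as the dimension grows. That's the finite-dimensional shadow of mutual singularity.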

Now let's discretize the Langevin equation.
We'll assume again that $-\log\pi = \frac{1}{2}\langle Lq, q\rangle$.
The simplest family of schemes we might come up with are the $\theta$ schemes:
$$\frac{q_{n + 1} - q_n}{\delta t} = -\frac{1}{2}K^{-1}L\left((1 - \theta)q_n + \theta\,q_{n + 1}\right) + \delta t^{-1/2}\,K^{-1/2}\delta W_n.$$
Rearranging terms, we get
$$q_{n + 1} = \left(I + \frac{\theta\,\delta t}{2}K^{-1}L\right)^{-1}\left\{\left(I - \frac{(1 - \theta)\delta t}{2}K^{-1}L\right)q_n + \delta t^{1/2}\,K^{-1/2}\delta W_n\right\}.$$
That's kind of a lot.
We can learn something useful by assuming that $K = L$, i.e. that we can exactly invert the precision operator.
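
Here's a sketch of what that assumption buys us. With $K = L$ we have $K^{-1}L = I$, and multiplying the $\theta$ scheme through by $\delta t$ puts a factor of $\delta t^{1/2}$ on the noise. Every eigenmode of $L^{-1}$ then follows the same scalar recursion $x_{n + 1} = a\,x_n + b\sqrt{\sigma}\,\xi_n$, where $\sigma$ is that mode's prior variance and $\xi_n$ is standard normal (my reading of $\delta W_n$). The function name below is made up for this illustration. It returns the stationary variance of the recursion divided by $\sigma$.

```python
import numpy as np

def stationary_variance_ratio(theta, dt):
    """Stationary variance of the theta scheme with K = L, relative to
    the prior variance of the mode. Valid when |a| < 1."""
    denom = 1.0 + theta * dt / 2
    a = (1.0 - (1.0 - theta) * dt / 2) / denom  # damping of the current state
    b = np.sqrt(dt) / denom                     # amplitude of the injected noise
    return b**2 / (1.0 - a**2)

for theta in [0.0, 0.5, 1.0]:
    ratios = [stationary_variance_ratio(theta, dt) for dt in [0.01, 0.1, 1.0]]
    print(f"theta = {theta:.1f}: " + ", ".join(f"{r:.4f}" for r in ratios))
```

A little algebra gives the ratio as $1 / (1 + (2\theta - 1)\delta t / 4)$. Only $\theta = 1/2$ reproduces the prior variance for every time step, and the answer doesn't depend on the mode. That choice is the preconditioned Crank-Nicolson proposal from the paper by Cotter and others.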

### Practice

Now let's write some code to try and put this into practice.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import firedrake
from firedrake import (
    Constant, exp, inner, outer, avg, jump, grad, dx, ds, dS, assemble
)
from firedrake.petsc import PETSc

In [None]:
nx, ny = 32, 32
lx, ly = 1.0, 1.0
Lx, Ly = Constant(lx), Constant(ly)
mesh = firedrake.RectangleMesh(nx, ny, lx, ly, diagonal="crossed")
Q = firedrake.FunctionSpace(mesh, "CG", 2)
area = assemble(Constant(1) * dx(mesh))

The code below is copied directly from the previous notebook.

In [None]:
class NoiseGenerator:
    def __init__(
        self,
        function_space,
        covariance=None,
        generator=np.random.default_rng()
    ):
        if covariance is None:
            ϕ = firedrake.TrialFunction(function_space)
            ψ = firedrake.TestFunction(function_space)
            covariance = inner(ϕ, ψ) * dx

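        # Note: despite the argument name, the form passed in here is
        # factored and then inverted when sampling, so later cells use it
        # as the precision operator of the random field.
        M = assemble(covariance, mat_type='aij').M.handle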
        ksp = PETSc.KSP().create()
        ksp.setOperators(M)
        ksp.setUp()

        pc = ksp.pc
        pc.setType(pc.Type.CHOLESKY)
        pc.setFactorSolverType(PETSc.Mat.SolverType.PETSC)
        pc.setFactorSetUpSolverType()
        L = pc.getFactorMatrix()
        pc.setUp()

        self.rng = generator
        self.function_space = function_space
        self.preconditioner = pc
        self.cholesky_factor = L

        self.rhs = firedrake.Function(self.function_space)
        self.noise = firedrake.Function(self.function_space)

    def __call__(self):
        z, ξ = self.rhs, self.noise
        N = len(z.dat.data_ro[:])
        z.dat.data[:] = self.rng.standard_normal(N)

        L = self.cholesky_factor
        with z.dat.vec_ro as Z:
            with ξ.dat.vec as Ξ:
                L.solveBackward(Z, Ξ)
                Ξ *= np.sqrt(area / N)

        return ξ.copy(deepcopy=True)
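
The idea behind this class, as I read it, is the usual Cholesky trick: factor the matrix as $M = LL^\top$ and solve $L^\top\xi = z$ with $z$ standard normal, which gives $\xi$ covariance $M^{-1}$. Here's a dense NumPy stand-in that checks this empirically. It is not the PETSc code path, it leaves out the `area / N` scaling, and the matrix `M` is a random s.p.d. matrix made up for the test.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4

# Hypothetical symmetric positive-definite matrix standing in for the
# assembled finite element matrix
A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)

# Lower-triangular Cholesky factor, M = L L^T
L = np.linalg.cholesky(M)

# Draw many standard normal vectors at once, one per column, and
# solve L^T xi = z for each of them
num_samples = 200_000
Z = rng.standard_normal((n, num_samples))
X = np.linalg.solve(L.T, Z)

# The sample covariance should approach the inverse of M
empirical = X @ X.T / num_samples
error = np.abs(empirical - np.linalg.inv(M)).max()
print(error)
```

So whatever form we hand to the generator acts as a precision, and the samples are smoother the more strongly that form penalizes high wavenumbers.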

Here we'll make an object to sample from the first random process, which uses the precision $I - \alpha^2\Delta$.
The operator $L^{-1}$ is not trace-class.

In [None]:
ϕ, ψ = firedrake.TestFunction(Q), firedrake.TrialFunction(Q)
ℓ = firedrake.sqrt(Lx * Ly)
M = (ϕ * ψ + ℓ**2 * inner(grad(ϕ), grad(ψ))) * dx

In [None]:
h1_generator = NoiseGenerator(
    function_space=Q,
    covariance=M,
    generator=np.random.default_rng(1453),
)

Now we'd like to use the functional
$$R(q) = \frac{1}{2}\int_\Omega\left(q^2 + \lambda^4|\nabla^2q|^2\right)dx$$
which penalizes large values of the curvature $\nabla^2q$.
Conventional finite element basis functions are continuous and piecewise-differentiable, but their derivatives have jump discontinuities across cell boundaries.
There are continuously-differentiable finite element bases which we could use to construct a conforming discretization of the curvature penalty.
I'll instead use a non-conforming discretization based on ordinary CG elements.
This approach is similar to how we used DG elements for the convection-diffusion equation.
For that problem, we applied Nitsche's method at all of the cell boundaries in order to make the solution continuous.
Here we'll instead apply Nitsche's method at all the cell boundaries to make the solution's gradient continuous.
I'm partly following [this paper](https://doi.org/10.1515/jnma-2023-0028) for discretization of the curvature penalty but working back to a minimization form.

In [None]:
ϕ = firedrake.Function(Q)

λ = firedrake.Constant(1.0)
Dϕ = grad(ϕ)
DDϕ = grad(Dϕ)

h = firedrake.FacetArea(mesh)
vol = firedrake.CellVolume(mesh)

α = firedrake.Constant(4.0)
k = firedrake.Constant(Q.ufl_element().degree())
β = 3 * α * k * (k - 1) / 8 * avg(h)**2 * avg(1 / vol)
β_Γ = 3 * α * k * (k - 1) * h**2 / vol

ν = firedrake.FacetNormal(mesh)
J_cells = inner(DDϕ, DDϕ) * dx
J_facets = avg(inner(DDϕ, outer(ν, ν))) * jump(Dϕ, ν) * dS
J_facet_penalty = β / avg(h) * jump(Dϕ, ν)**2 * dS
J_boundary = inner(Dϕ, ν) * inner(DDϕ, outer(ν, ν)) * ds
J_boundary_penalty = β_Γ / h * inner(Dϕ, ν)**2 * ds

J_Δ = 0.5 * (J_cells - J_facets - J_boundary + J_facet_penalty + J_boundary_penalty)
J_2 = 0.5 * ϕ**2 * dx

J = J_2 + λ ** 4 * J_Δ

In [None]:
from firedrake import derivative

M = derivative(derivative(J, ϕ), ϕ)
h2_generator = NoiseGenerator(
    function_space=Q,
    covariance=M,
    generator=np.random.default_rng(seed=1666),
)

The plots below show samples from the Gaussian measures with precision operators $I - \alpha^2\Delta$ first and $I + \lambda^4\Delta^2$ second.
Maybe the most compelling argument I can give you for why biharmonic regularization is the way to go is, just look at it.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=3, subplot_kw={"projection": "3d"})
for ax in axes.flatten():
    w = h1_generator()
    firedrake.trisurf(w, axes=ax)

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=3, subplot_kw={"projection": "3d"})
for ax in axes.flatten():
    z = h2_generator()
    firedrake.trisurf(z, axes=ax)