In a previous post, I described a procedure for sampling from probability distributions using Monte Carlo methods.
One of the subtle points that I tried to address there is what happens when the quantity we want to sample is a function of space and time.
For example, we might want to infer the conductivity of an aquifer from well head measurements; the material density in the earth's subsurface from satellite gravimetry; or basal drag between a glacier and the landscape from measurements of its surface velocity.
The parameter space is now a space of functions, and so it has infinitely many dimensions.
Extending common algorithms to work in function spaces is hard.

Since writing that first post, I read [this paper](https://doi.org/10.1214/13-STS421) by Cotter and others.
I now think that my previous post was wrong.
Here I'd like to explain why (it involves Weyl's law) and how to fix it (it involves Nitsche's method).

The failure, I think, is trying to straightforwardly bolt a statistical interpretation onto deterministic inverse problems.
I'm guilty of this, but I would argue that many textbooks on inverse problems are too.
A common misstep is to assume that the usual choices of regularization functional from the deterministic inverse problems literature will translate into sensible prior distributions.
They don't.

To clarify the notation for later, we're interested in inferring a field $q$.
In a deterministic inverse problem, we would add a multiple of a functional $R(q)$ to the objective in order to regularize the problem.
The conventional approach is to use a multiple of
$$R(q) = \frac{1}{2}\int_\Omega|\nabla q|^2dx.$$
When we try to turn this into a Bayesian inference problem, we want a probability distribution $\rho$ -- the prior -- such that
$$-\ln \rho(q) \propto R(q) + \ldots$$
The ellipsis denotes (in the quantum mechanic's parlance) another infinite constant that we can set to zero.

### Theory

Our goal is to construct a procedure for sampling from a probability distribution $\pi(q)$.
We assume that the parameters $q$ live in a separable Hilbert space $H$.
When $H$ is infinite-dimensional, there are a couple of hard parts that don't occur in finite dimensions.

For the rest of this post, we'll be concerned exclusively with the case where $\pi$ is a Gaussian measure.
In future posts we'll look at what happens when it isn't.

#### Proper priors

In the deterministic inverse problems literature, it's common to use
$$R(q) = \frac{1}{2}\int_\Omega|\nabla q|^2dx$$
as a regularization functional.
This accomplishes several goals.
First, the inferred parameters obtained without regularization usually have spurious high-wavenumber noise.
Adding this penalty filters out the noise.
Second, it can guarantee a unique solution where the problem would be otherwise ill-posed.

If we try to form a Gaussian measure $\rho$ for which $-\ln\rho \propto R(q) + \ldots$, we'll run aground right away.
Unless we also add boundary conditions to $q$, there is no constraint on the average value of $q$.
We can add any constant we want to $q$ and the value of $R$ is unchanged.
Viewing this in a probabilistic light now, the putative prior distribution doesn't integrate to 1 -- it is *improper*.
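
To make the degeneracy concrete, here is a minimal finite-difference sketch on a 1D grid. The grid, the helper name `gradient_penalty_matrix`, and the test field are illustrative assumptions, not part of the setup used elsewhere in this post. It checks that shifting $q$ by a constant leaves the discrete penalty unchanged and that the penalty matrix has a zero eigenvalue.

```python
import numpy as np

def gradient_penalty_matrix(num_points, length=1.0):
    """Return K such that q @ K @ q / 2 approximates the integral of
    |dq/dx|^2 / 2 on a uniform 1D grid with no boundary conditions."""
    h = length / (num_points - 1)
    # Forward-difference operator mapping nodal values to cell gradients
    D = (np.eye(num_points - 1, num_points, k=1)
         - np.eye(num_points - 1, num_points)) / h
    return h * D.T @ D

num_points = 50
K = gradient_penalty_matrix(num_points)
x = np.linspace(0.0, 1.0, num_points)
q = np.sin(2 * np.pi * x)

def R(q):
    return 0.5 * q @ K @ q

# Shifting q by a constant leaves the penalty unchanged...
shift_error = abs(R(q + 3.0) - R(q))
# ...because the constant vector lies in the null space of K.
smallest_eigenvalue = np.linalg.eigvalsh(K)[0]
print(shift_error, smallest_eigenvalue)
```

A Gaussian with precision `K` would have infinite variance along the constant direction, which is the discrete counterpart of the prior failing to integrate to 1.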

Strictly speaking, you can use an improper prior so long as the posterior is proper.
For that to hold, the observational data need to constrain the constant mode well enough, which happens often.

An improper prior is justified when we have no information at all about what value the parameter in question can take, but we expect that the data can give it to us.
Do we have no prior information whatsoever about the average value of the parameters we want to infer?
Let's take the example I mentioned above about inferring the density of material within the earth from satellite observations of earth's gravitational pull.
If you want priors, go weigh a rock, then put it in a beaker of water to measure its volume.
I'll save you the trouble: it's probably got a density around 3000 kg/m${}^3$.
Suppose you didn't know that Earth's core is mostly iron.
If you guessed that the range went from 300 to 30,000 kg/m${}^3$, you wouldn't be doing particularly well, but it wouldn't be a terrible guess either.
So it beggars belief that we should use a prior for a geophysical inverse problem that is totally uninformative about the mean value.
It might have very high variance but certainly not infinite.

#### Trace class operators

To prescribe a Gaussian measure, we need to know its mean and covariance.
The covariance has to be symmetric and positive-definite.
(You can rephrase my objection about properness above as a question about the nature of the covariance.)
But in the function space setting, there are other criteria.
We don't think about them in finite dimensions because they're almost vacuously true.
But in the infinite-dimensional case they require much more thought.

The first criterion is that the sum of all the eigenvalues of the covariance operator, i.e. the trace, is finite.
In finite dimensions this is obvious.
In function spaces, the property of being a trace class operator is very special.
For example, the identity operator or any unitary map is not of trace class.

Suppose that we wanted to make the most minimal adaptation of a deterministic inverse problem into a statistical one.
Rather than use the improper prior I showed above, we instead use
$$R(q) = \frac{1}{2}\int_\Omega\left(q^2 + \alpha^2|\nabla q|^2\right)dx$$
where $\alpha$ is some length scale we have to choose.
If we define the operator
$$L = I - \alpha^2\Delta,$$
then $R(q) = \frac{1}{2}\langle Lq, q\rangle$.
In other words, $L$ is the precision -- the inverse of the covariance operator.
Is $L^{-1}$ of trace class?

We can answer that question using Weyl's law.
But I want to be careful here about the units.
I'll adopt a slightly different convention and say that $\lambda$, $\phi$ are an eigenvalue / eigenfunction pair for $-\Delta$ if
$$-\Delta\phi = \lambda^{-2}\phi.$$
This choice, instead of the more conventional one, makes $\lambda$ have units of length.
Weyl's law then implies that the eigenvalues decay like
$$\lambda_n \sim \text{const}\times\text{vol}(\Omega)^{1/d}\times n^{-1/d}.$$
So the eigenvalues of $(I - \alpha^2\Delta)^{-1}$ go like
$$\sigma_n = (1 + \alpha^2\lambda_n^{-2})^{-1} \sim \text{const}\times\frac{\text{vol}(\Omega)^{2/d}}{\alpha^2} \times n^{-2/d}.$$
In dimension $d = 2$ the eigenvalues decay like $n^{-1}$.
The trace diverges like the harmonic series!
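
As a sanity check on the Weyl asymptotics, here is a small sketch for a case where the spectrum is known in closed form. It assumes the unit square with Dirichlet boundary conditions (my choice for illustration; the argument above doesn't fix the domain or the boundary conditions), where the eigenvalues of $-\Delta$ are $\pi^2(j^2 + k^2)$. In the convention above these are $\lambda_n^{-2}$, and the leading-order Weyl estimate in two dimensions is $\lambda_n^{-2} \approx 4\pi n / \text{vol}(\Omega)$.

```python
import math

def dirichlet_eigenvalues_unit_square(max_index):
    """Sorted eigenvalues pi^2 (j^2 + k^2) of -Laplacian on the unit square."""
    return sorted(
        math.pi**2 * (j**2 + k**2)
        for j in range(1, max_index + 1)
        for k in range(1, max_index + 1)
    )

# Indices up to 60 are enough to capture the 2000 smallest eigenvalues
mu = dirichlet_eigenvalues_unit_square(60)
domain_area = 1.0
for n in (10, 100, 1000, 2000):
    weyl_estimate = 4 * math.pi * n / domain_area
    # Ratio of the true n-th eigenvalue to the leading-order Weyl estimate
    print(n, mu[n - 1] / weyl_estimate)

ratio = mu[1999] / (4 * math.pi * 2000 / domain_area)
```

The ratio approaches 1 slowly because of boundary corrections, but the linear growth in $n$, and hence the $n^{-1}$ decay of $\sigma_n$ in two dimensions, is already visible.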

Suppose we instead take the precision to have the biharmonic operator as the leading term:
$$L = I - \alpha_1^2\Delta + \alpha_2^2\Delta^2$$
where now we need to pick two length scales $\alpha_1$, $\alpha_2$.
Weyl's law now comes to the rescue: the eigenvalues of $L^{-1}$ are instead asymptotic to
$$\sigma_n \sim \text{const} \times n^{-4/d}$$
which is summable in dimensions 2 and 3.
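
The difference between the two decay rates is easy to see numerically. The sketch below drops all the constants and just sums the model sequences $n^{-2/d}$ and $n^{-4/d}$ in dimension $d = 2$; the function name `partial_trace` is only illustrative.

```python
def partial_trace(decay_exponent, num_modes):
    """Sum of n^(-decay_exponent) over the first num_modes modes."""
    return sum(n ** -decay_exponent for n in range(1, num_modes + 1))

d = 2
for num_modes in (10**2, 10**3, 10**4, 10**5):
    second_order = partial_trace(2 / d, num_modes)  # model for I - α²Δ
    fourth_order = partial_trace(4 / d, num_modes)  # model for the biharmonic precision
    print(num_modes, second_order, fourth_order)
```

The first column of sums keeps growing like $\ln n$, while the second settles down below $\pi^2/6$.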

Further down, I'll show some samples generated from both the wrong $I - \alpha^2\Delta$ prior and the biharmonic prior.

#### Proposal mechanism

Questions about what kinds of covariance operator to choose concern the problem that we wish to solve.
The final challenge I ran into is not about what problem to solve but how to solve it.
A common thread in the advanced MCMC sampling literature is to use a Langevin-type equation to generate samples:
$$\dot q = \frac{1}{2}G^{-1}\nabla\log\pi + G^{-1/2}\dot W$$
where $G$ is some s.p.d. linear operator.
We can then discretize this ODE using the scheme of our choice.
In fact, there is only one correct choice of scheme in the function space setting.
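
The text above doesn't name the scheme. My reading of Cotter and others is that it is the Crank-Nicolson discretization with $G$ taken to be the prior precision, which gives the preconditioned Crank-Nicolson (pCN) proposal. Under that assumption, here is a minimal finite-dimensional sketch. The function `pcn_step`, the step parameter `beta`, and the scalar toy problem are all illustrative choices and not the code used later in this post.

```python
import numpy as np

def pcn_step(q, potential, sample_prior, beta, rng):
    """One pCN step targeting exp(-potential(q)) times a Gaussian prior N(0, C).

    `sample_prior(rng)` must return a draw from N(0, C).
    """
    xi = sample_prior(rng)
    proposal = np.sqrt(1.0 - beta**2) * q + beta * xi
    # The prior cancels out of the acceptance ratio; only the potential remains
    log_accept = potential(q) - potential(proposal)
    if np.log(rng.uniform()) < log_accept:
        return proposal, True
    return q, False

# Toy problem: scalar prior N(0, 1) and one observation y = 1 with unit noise,
# so the exact posterior is N(1/2, 1/2).
rng = np.random.default_rng(seed=42)
potential = lambda q: 0.5 * (q - 1.0) ** 2
sample_prior = lambda rng: rng.standard_normal()

q, samples = 0.0, []
for _ in range(20000):
    q, accepted = pcn_step(q, potential, sample_prior, beta=0.8, rng=rng)
    samples.append(q)

# Discard the first 2000 steps as burn-in
posterior_mean = np.mean(samples[2000:])
posterior_variance = np.var(samples[2000:])
print(posterior_mean, posterior_variance)
```

The proposal leaves the Gaussian prior invariant for any `beta` between 0 and 1, which is why the acceptance ratio involves only the potential and stays well-defined as the discretization is refined.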

### Practice

Now let's write some code to try and put this into practice.

In [None]:
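import numpy as np
from numpy import random
import firedrake
from firedrake import assemble, Constant, dx, inner
from firedrake.petsc import PETSc

# `mesh` is assumed to have been created already
area = assemble(Constant(1) * dx(mesh))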

class NoiseGenerator:
    def __init__(
        self,
        function_space,
        covariance=None,
        generator=random.default_rng()
    ):
        if covariance is None:
            ϕ = firedrake.TrialFunction(function_space)
            ψ = firedrake.TestFunction(function_space)
            covariance = inner(ϕ, ψ) * dx

        M = assemble(covariance, mat_type='aij').M.handle
        ksp = PETSc.KSP().create()
        ksp.setOperators(M)
        ksp.setUp()

        pc = ksp.pc
        pc.setType(pc.Type.CHOLESKY)
        pc.setFactorSolverType(PETSc.Mat.SolverType.PETSC)
        pc.setFactorSetUpSolverType()
        L = pc.getFactorMatrix()
        pc.setUp()

        self.rng = generator
        self.function_space = function_space
        self.preconditioner = pc
        self.cholesky_factor = L

        self.rhs = firedrake.Function(self.function_space)
        self.noise = firedrake.Function(self.function_space)

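    def __call__(self):
        z, ξ = self.rhs, self.noise
        # Fill the right-hand side with i.i.d. standard normal coefficients
        N = len(z.dat.data_ro[:])
        z.dat.data[:] = self.rng.standard_normal(N)

        # Apply a triangular solve with the Cholesky factor to correlate the noise
        L = self.cholesky_factor
        with z.dat.vec_ro as Z:
            with ξ.dat.vec as Ξ:
                L.solveBackward(Z, Ξ)
                # Rescale by the square root of the mean area per degree of freedom
                Ξ *= np.sqrt(area / N)

        # Return a copy so that later calls don't overwrite this sample
        return ξ.copy(deepcopy=True)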
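
To see what the triangular solve accomplishes, here is a dense NumPy analogue. It is a sketch under my reading of the class above: I assume that the backward solve amounts to solving $L^\top\xi = z$ where $M = LL^\top$, and I leave out the `area / N` rescaling. The small matrix `M` is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1729)

# A small s.p.d. matrix standing in for the assembled operator
M = np.array([[2.0, -1.0], [-1.0, 2.0]])
L = np.linalg.cholesky(M)  # M = L @ L.T with L lower-triangular

num_samples = 200_000
Z = rng.standard_normal((2, num_samples))
# Solve L.T @ ξ = z for every column of Z at once
Xi = np.linalg.solve(L.T, Z)

sample_covariance = Xi @ Xi.T / num_samples
print(sample_covariance)
print(np.linalg.inv(M))
```

Under these assumptions the samples have covariance $M^{-1}$, so the form passed in through the `covariance` argument effectively plays the role of a precision operator.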

### The biharmonic operator

Here we'd like to use the functional
$$R(q) = \frac{1}{2}\int_\Omega\left(q^2 + \lambda^4|D^2q|^2\right)dx$$
which penalizes large values of the curvature $D^2q$.
Conventional finite element basis functions are continuous and piecewise-differentiable, but their derivatives have jump discontinuities across cell boundaries.
There are continuously-differentiable finite element bases which we could use to construct a conforming discretization of the curvature penalty.
I'll instead use a non-conforming discretization based on ordinary CG elements.
This approach is similar to how we used DG elements for the convection-diffusion equation.
For that problem, we applied Nitsche's method at all of the cell boundaries in order to make the solution continuous.
Here we'll instead apply Nitsche's method at all the cell boundaries to make the solution's gradient continuous.
I'm partly following [this paper](https://doi.org/10.1515/jnma-2023-0028) for discretization of the curvature penalty but working back to a minimization form.

In [None]:
import firedrake
from firedrake import avg, jump, outer, grad, inner, dx, ds, dS
firedrake.adjoint.continue_annotation()

ϕ = firedrake.Function(Q)

λ = firedrake.Constant(1.0)
Dϕ = grad(ϕ)
DDϕ = grad(Dϕ)

h = firedrake.FacetArea(mesh)
vol = firedrake.CellVolume(mesh)

α = firedrake.Constant(4.0)
k = firedrake.Constant(Q.ufl_element().degree())
β = 3 * α * k * (k - 1) / 8 * avg(h)**2 * avg(1 / vol)
β_Γ = 3 * α * k * (k - 1) * h**2 / vol

ν = firedrake.FacetNormal(mesh)
J_cells = inner(DDϕ, DDϕ) * dx
J_facets = avg(inner(DDϕ, outer(ν, ν))) * jump(Dϕ, ν) * dS
J_facet_penalty = β / avg(h) * jump(Dϕ, ν)**2 * dS
J_boundary = inner(Dϕ, ν) * inner(DDϕ, outer(ν, ν)) * ds
J_boundary_penalty = β_Γ / h * inner(Dϕ, ν)**2 * ds

J_Δ = 0.5 * (J_cells - J_facets - J_boundary + J_facet_penalty + J_boundary_penalty)
J_2 = 0.5 * ϕ**2 * dx

J = J_2 + λ ** 4 * J_Δ

In [None]:
from firedrake import derivative

M = derivative(derivative(J, ϕ), ϕ)
biharmonic_generator = NoiseGenerator(
    function_space=Q,
    covariance=M,
    generator=np.random.default_rng(seed=1729),
)

The plot below shows a single random sample generated from the prior distribution.
This stochastic process has been used as an approximate model for the topography of real landscapes and perhaps you can see why.

In [None]:
z = biharmonic_generator()

In [None]:
firedrake.trisurf(z);

As before, the proposal is generated by numerically integrating the SDE
$$\dot q = -\frac{1}{2}M^{-1}dJ(q) + M^{-\frac{1}{2}}\dot B.$$
We then accept or reject the proposal using the usual Metropolis criterion.