# Data assimilation 1 

This is the first of two lectures on **data assimilation**. Their aim is to provide an introduction to this topic in the context of climate science. The lectures are accompanied by two practical sessions where you will apply the ideas yourself.
In these lectures, we focus on simple "toy problems" that can be solved easily on a laptop. Nevertheless, the ideas discussed are applicable to large-scale problems in climate science. 

## Weather forecasts

Weather forecasts are the most familiar, and perhaps the most important, application of data assimilation. Of course, most people have not heard the term "data assimilation" and have not thought about how weather forecasts are generated. The idea, in outline, is as follows. 

Let us assume that the physics of the weather system is known. This means that if we know the state of the system at some time, then we can calculate its state at any future time by solving the relevant equations. In practice, solving these equations means running 
forward in time a large-scale numerical model for the weather system.

We cannot, of course, measure the state of the system perfectly at any single time. Instead we have **noisy partial observations** of the weather's state at a set of observation times. If we guessed the initial state at some point in the past, we could evolve the system forward in time and calculate predictions for each available observation. Such an initial guess is unlikely to fit the observations satisfactorily, but we might somehow update the initial guess until the available data have been explained. 

If, in this manner, we can adequately estimate the initial state, we could carry our simulation forward into the future to arrive at the desired weather forecast. Because the available data are limited in both quantity and accuracy, our reconstruction of the initial state will never be perfect, and hence uncertainties in this state are propagated into the resulting forecasts. The quantification of these uncertainties is the most important and most challenging aspect of data assimilation. 

There are two main approaches to solving data assimilation problems. The first poses the estimation of the initial state as an **optimisation problem**, this being essentially the method just outlined. The second, which is a little harder to explain in outline, is to formulate the problem using the methods of **Bayesian inference**. While these approaches are 
not equivalent, there do exist relations between them. In these lectures, we focus on the Bayesian formulation of data assimilation problems because it allows for a simpler, and arguably superior, treatment of uncertainty quantification. 


## Dynamical systems

### The finite pendulum

To motivate the definition of a **dynamical system** we consider an idealised **finite pendulum**. Such a pendulum comprises a bob of mass, $m$, attached to one end of a massless and rigid rod of length, $l$. The other end of the rod is fixed at a point, with the rod being able to pivot about this point freely within a plane. We write $\theta$ for the angle between the rod and a line that points vertically downward. For convenience, we choose a co-ordinate system whose origin is at the pivot point of the rod, and suppose that the pendulum moves in the $(x,y)$-plane, with the $x$-axis horizontal and the $y$-axis pointing vertically upwards. The Cartesian co-ordinates of the pendulum bob can then be written

$$
x = l \sin\theta, \quad y = -l\cos\theta.
$$

This shows that the position of the pendulum at any time is determined completely by the angle, $\theta$.

We suppose that the pendulum is acted on by a constant gravitational force acting vertically downwards, and write, $g$, for the associated gravitational acceleration. From Newton's laws of motion, this physical system is governed by the following set of ordinary differential equations:

$$
\frac{\mathrm{d} \theta}{\mathrm{d} t} = \frac{1}{ml^{2}}p, \quad \frac{\mathrm{d} p}{\mathrm{d} t} = - mgl \sin \theta
$$

Here $p$ represents the (angular) momentum of the pendulum bob. If desired, these two equations can be combined into the single second-order equation

$$
\frac{\mathrm{d}^{2} \theta}{\mathrm{d} t^{2}} = -\frac{g}{l}\sin \theta,
$$

where we note that the mass has been eliminated. 

Given initial values for the angle, $\theta$, and momentum, $p$, of the pendulum at a time, $t_{0}$, the above equations can be integrated to uniquely determine their values for all later times. The state of this system at any time is, therefore, expressed fully by its **state vector**

$$
\mathbf{x} = \left(\begin{array}{c}
             \theta  \\ p
             \end{array}\right)
$$

in terms of which we can write an **evolution equation** in the general form

$$
\frac{\mathrm{d}\mathbf{x}}{\mathrm{d} t} = \mathbf{f}(\mathbf{x}),
$$
with $\mathbf{f}$ defined appropriately. 
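
As a minimal sketch of how $\mathbf{f}$ can be defined and the evolution equation integrated, the block below uses a hand-written fourth-order Runge-Kutta scheme in NumPy. The parameter values ($m = l = 1$, $g = 9.81$), the number of steps, and the helper names `f`, `rk4_flow`, and `energy` are assumptions made for illustration; this is not the integrator used inside `pygeoinf`.

```python
import numpy as np

# Illustrative parameter values (assumed, not taken from the lectures)
m, l, g = 1.0, 1.0, 9.81


def f(x):
    """Dynamical rule for the pendulum, with state x = (theta, p)."""
    theta, p = x
    return np.array([p / (m * l**2), -m * g * l * np.sin(theta)])


def rk4_flow(rule, x0, t, n_steps=2000):
    """Approximate the state at time t, starting from x0, with classical RK4."""
    x = np.asarray(x0, dtype=float)
    h = t / n_steps
    for _ in range(n_steps):
        k1 = rule(x)
        k2 = rule(x + 0.5 * h * k1)
        k3 = rule(x + 0.5 * h * k2)
        k4 = rule(x + h * k3)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x


def energy(x):
    """Total energy of the pendulum, conserved by the exact dynamics."""
    theta, p = x
    return p**2 / (2 * m * l**2) - m * g * l * np.cos(theta)


x0 = np.array([np.deg2rad(130), 0.0])
x1 = rk4_flow(f, x0, 5.0)
print("state at t = 5:", x1)
print("energy drift:", energy(x1) - energy(x0))
```

The energy is used here only as a sanity check on the numerical integration: the exact solution conserves it, so a small drift suggests the step size is adequate.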

These ideas are illustrated within the following code block which generates an animation of the pendulum's motion subject to given initial conditions.

In [None]:
# Import the necessary libraries for this notebook, 
# installing pygeoinf if required. 
try:
    from pygeoinf import data_assimilation as da
except ImportError: 
    %pip install pygeoinf --quiet
    from pygeoinf import data_assimilation as da

import numpy as np
import matplotlib.pyplot as plt
from pygeoinf.data_assimilation.pendulum import single


# Set the initial state and evolve the system in time
initial_angle = np.deg2rad(130)
initial_momentum = 0.0
initial_state = [initial_angle, initial_momentum]
fps = 20
t0 = 0
t1 = 15
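t_points = np.linspace(t0, t1, round(fps*(t1-t0)))

# `solution` must hold the trajectory obtained by integrating the equations of
# motion from `initial_state` at the times in `t_points`; that step is not
# shown in this listing.

# Animate the physical motion and the state space evolution
anim = single.animate_combined(t_points, solution)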
plt.close()

da.display_animation_html(anim)


### Autonomous dynamical systems

Building on the above example, we can consider a more general **dynamical system** for which the state vector, $\mathbf{x}$, 
lives in the $m$-dimensional **state space**, $\mathbb{R}^{m}$, and is governed by an evolution equation that again takes the standard form
$$
\frac{\mathrm{d}\mathbf{x}}{\mathrm{d} t} = \mathbf{f}(\mathbf{x}). 
$$
The function, $\mathbf{f}$, is known as the **dynamical rule**. If the value of the state vector is given at a time, $t_{0}$, then these equations can be integrated to **uniquely determine** the state vector at any later time (strictly, for non-linear systems this result is only guaranteed to hold for some finite length of time).

Within our evolution equation, the time-derivative of the state vector is given by a function of the state vector alone. Such a dynamical system is said to be **autonomous**. There are also **non-autonomous** dynamical systems whose evolution equation has an explicit dependence on time, with this feature typically representing the action of an **external force**. For simplicity we will consider only autonomous systems in these lectures, though there is a trick by which a non-autonomous system can be transformed into an autonomous one on a larger state space. 

It should be clear that the solution, $\mathbf{x}(t)$, to an autonomous dynamical system depends only on the difference, $t-t_{0}$, and hence we are free to take the initial time to always be zero. 

A dynamical system is said to be **linear** if the dynamical rule takes the form

$$
\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x},
$$

where $\mathbf{A}$ is a matrix that maps the state space to itself. Otherwise, the dynamical system is **non-linear**. Clearly the finite pendulum example is non-linear, and this is also true for most physical models. 
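
To make the linear case concrete, the sketch below uses the small-angle approximation of the pendulum, in which $\sin\theta$ is replaced by $\theta$ so that the dynamical rule becomes $\mathbf{A}\mathbf{x}$. For a linear system the solution is $\mathbf{x}(t) = \exp(\mathbf{A}t)\,\mathbf{x}_{0}$, which is evaluated here through the eigendecomposition of $\mathbf{A}$. The small-angle model, the parameter values, and the helper name `linear_flow` are assumptions introduced for illustration.

```python
import numpy as np

m, l, g = 1.0, 1.0, 9.81  # illustrative values (assumed)

# Small-angle pendulum: replacing sin(theta) by theta gives f(x) = A x
A = np.array([[0.0, 1.0 / (m * l**2)],
              [-m * g * l, 0.0]])


def linear_flow(A, x0, t):
    """Evaluate exp(A t) x0 using the eigendecomposition of A.

    Assumes A is diagonalisable, which holds for this example.
    """
    eigvals, eigvecs = np.linalg.eig(A)
    coeffs = np.linalg.solve(eigvecs, np.asarray(x0, dtype=complex))
    return (eigvecs @ (np.exp(eigvals * t) * coeffs)).real


x0 = np.array([0.1, 0.0])
t = 2.0
x_t = linear_flow(A, x0, t)

# Closed-form solution of the small-angle pendulum for comparison
omega = np.sqrt(g / l)
theta_exact = x0[0] * np.cos(omega * t)
p_exact = -m * l**2 * x0[0] * omega * np.sin(omega * t)
print(x_t, (theta_exact, p_exact))
```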




### The weather as a dynamical system

The physics describing the weather can be formulated as a dynamical system, albeit a rather complicated one. In this context, the state vector, $\mathbf{x}$, comprises continuous fields, such as temperature, pressure, and wind velocity,  defined at every point within the Earth's atmosphere. Consequently, the state space is  **infinite-dimensional**, while the dynamical rule is given through coupled systems of partial differential equations that express the conservation of quantities like mass, linear momentum, and energy. Dynamical systems modelling the weather are generally not autonomous due to forcing from the oceans and ice sheets.

To address the infinite-dimensional complication, the most common method is to approximate the true dynamical system by a finite-dimensional one that arises through a **numerical approximation**. For example, the physical fields can be approximated through their values on a grid of points, and the various spatial derivatives within the partial differential equations then replaced using finite-difference formulae. The arguments in favour of this approach are that (i) in any practical situation the true dynamics will eventually be reduced to a finite-dimensional form within numerical calculations, and (ii) by limiting attention to finite dimensional systems the mathematics is much simpler. 
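
As a rough illustration of this reduction, the sketch below discretises a one-dimensional diffusion equation, $\partial u/\partial t = \kappa\, \partial^{2} u/\partial x^{2}$, on a periodic grid. The diffusion equation is only a stand-in chosen for simplicity (it is not a weather model), and the grid size, domain length, and diffusivity are assumed values. The state vector is the set of grid values, and the second derivative is replaced by a centred finite-difference formula, giving a finite-dimensional dynamical rule.

```python
import numpy as np

n_grid = 64            # dimension m of the discretised state space (assumed)
length = 2 * np.pi     # length of the periodic domain (assumed)
kappa = 0.1            # diffusivity (assumed)
dx = length / n_grid
x = np.arange(n_grid) * dx

# Centred second-difference matrix with periodic boundary conditions
A = np.zeros((n_grid, n_grid))
for i in range(n_grid):
    A[i, i] = -2.0
    A[i, (i - 1) % n_grid] = 1.0
    A[i, (i + 1) % n_grid] = 1.0
A *= kappa / dx**2


def f(u):
    """Dynamical rule of the discretised system; here it is linear in the state."""
    return A @ u


# For u = sin(x) the exact right-hand side is -kappa * sin(x)
u = np.sin(x)
discretisation_error = np.max(np.abs(f(u) + kappa * u))
print("maximum discretisation error:", discretisation_error)
```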

Alternatively, it is possible to formulate and solve data assimilation problems on infinite dimensional state spaces, with such methods offering both conceptual and practical advantages. In this context, one first shows that a solution to the assimilation problem can be well-defined mathematically on the appropriate infinite-dimensional space. Having done this, numerical approximations must be developed that are proven to converge to the exact solution given a sufficiently fine discretisation. 

Within this introductory course we will restrict attention to finite-dimensional problems. Our examples will have state spaces of dimensions 2 or 4, but for contemporary applications to weather forecasting, a state space whose dimension is of order $10^{9}$ would be typical. As we will see, the high dimension of the state space within realistic applications places severe limits on the methods that can be used in practice. 

### The flow of a dynamical system

Given an initial state vector, $\mathbf{x}_{0}$, at the time $t=0$, we have seen that the evolution equation can be integrated to determine the state vector, $\mathbf{x}(t)$, at any other time. The resulting curve, $t\mapsto\mathbf{x}(t)$, in the state space depends parametrically on the chosen initial condition. To express this idea, we can introduce the **flow** for the dynamical system. This mapping, $\Phi:\mathbb{R}^{m}\times \mathbb{R} \rightarrow \mathbb{R}^{m}$, is defined by
$$
\Phi(\mathbf{x}_{0}, t) =\mathbf{x}(t), 
$$
where $\mathbf{x}(t)$ denotes the solution of the evolution equation at time $t$ subject to the initial condition $\mathbf{x}_{0}$ at $t=0$. By definition, we clearly have $\Phi(\mathbf{x}_{0},0) =\mathbf{x}_{0}$, while the following identity is readily established
$$
\Phi[\Phi(\mathbf{x}_{0}, t_{1}),t_{2}] = \Phi(\mathbf{x}_{0}, t_{1}+t_{2}). 
$$
If we fix the time, $t$, then the flow induces a mapping, $\Phi_{t}:\mathbb{R}^{m}\rightarrow \mathbb{R}^{m}$, from the state space to itself through
$$
\Phi_{t}(\mathbf{x}_{0}) = \Phi(\mathbf{x}_{0},t). 
$$
This mapping is invertible, with the inverse given by 
$$
\Phi_{t}^{-1}(\mathbf{x}) =  \Phi(\mathbf{x},-t).
$$
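
The composition identity and the formula for the inverse can be checked numerically. The sketch below approximates the flow of the pendulum with a Runge-Kutta scheme, where a negative time simply integrates backwards. The parameter values, the initial state, and the helper name `flow` are assumptions for illustration, and the identities hold only up to the accuracy of the integrator.

```python
import numpy as np

m, l, g = 1.0, 1.0, 9.81  # illustrative values (assumed)


def f(x):
    """Dynamical rule for the pendulum, with state x = (theta, p)."""
    theta, p = x
    return np.array([p / (m * l**2), -m * g * l * np.sin(theta)])


def flow(x0, t, n_steps=2000):
    """Approximate Phi(x0, t) with RK4; a negative t integrates backwards."""
    x = np.asarray(x0, dtype=float)
    h = t / n_steps
    for _ in range(n_steps):
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x


x0 = np.array([1.0, 0.5])
t1, t2 = 1.5, 2.5

# Composition: flowing for t1 and then t2 equals flowing for t1 + t2
composed = flow(flow(x0, t1), t2)
direct = flow(x0, t1 + t2)

# Inverse: flowing forwards for t1 and then backwards for t1 recovers x0
recovered = flow(flow(x0, t1), -t1)

print(composed - direct, recovered - x0)
```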

### Linearisation of the dynamical system

Consider two initial states, $\mathbf{x}_{0}$ and $\mathbf{x}_{0} + \Delta\mathbf{x}_{0}$, with their difference, $\Delta\mathbf{x}_{0}$, being suitably small. The difference in the 
resulting states at a time, $t$, can then be written
$$
\Phi_{t}(\mathbf{x}_{0} + \Delta\mathbf{x}_{0}) - \Phi_{t}(\mathbf{x}_{0}) = \frac{\partial \Phi_{t}}{\partial\mathbf{x}_{0}}(\mathbf{x}_{0}) \Delta\mathbf{x}_{0} + o(\|\Delta \mathbf{x}_{0}\|), 
$$
where we have performed a first-order Taylor expansion to obtain the right hand side. The partial derivative occurring in this expression will be called the **sensitivity matrix** for the dynamical system at time, $t$, and captures the linearised dependence of the flow on the initial conditions. 

To calculate $\frac{\partial \Phi_{t}}{\partial\mathbf{x}_{0}}(\mathbf{x}_{0})$ we can implicitly differentiate the evolution equation with respect to the initial condition. 
Letting $\mathbf{x}(t)$ be the solution of the evolution equation subject to initial conditions, $\mathbf{x}(0) =\mathbf{x}_{0}$, and applying the chain rule, we find that
$$
\frac{\mathrm{d}}{\mathrm{d} t}\frac{\partial\mathbf{x}}{\partial\mathbf{x}_{0}}(t) = \frac{\partial \mathbf{f}}{\partial\mathbf{x}}[\mathbf{x}(t)] \frac{\partial\mathbf{x}}{\partial\mathbf{x}_{0}}(t).
$$
This is a *linear* ordinary differential equation for the matrix, $\frac{\partial \mathbf{x}}{\partial\mathbf{x}_{0}}$, that must be solved subject to the initial condition
$$
\frac{\partial\mathbf{x}}{\partial\mathbf{x}_{0}}(0) = \mathbf{1}, 
$$
with $\mathbf{1}$ the identity matrix on $\mathbb{R}^{m}$. Note that within this equation, the coefficient matrix, $\frac{\partial \mathbf{f}}{\partial\mathbf{x}}[\mathbf{x}(t)]$ is generally time-dependent due to its dependence on the reference state, $\mathbf{x}(t)$.  In this manner, we can write the sensitivity matrix of the flow as
$$
\frac{\partial \Phi_{t}}{\partial\mathbf{x}_{0}}(\mathbf{x}_{0}) = \frac{\partial\mathbf{x}}{\partial\mathbf{x}_{0}}(t).
$$
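
One way to compute the sensitivity matrix in practice is to integrate the state and the matrix together as a single augmented system. The sketch below does this for the pendulum and compares the result with a centred finite-difference estimate obtained by perturbing the initial condition. The parameter values, the perturbation size `eps`, and the helper names are assumptions for illustration.

```python
import numpy as np

m, l, g = 1.0, 1.0, 9.81  # illustrative values (assumed)


def f(x):
    """Dynamical rule for the pendulum, with state x = (theta, p)."""
    theta, p = x
    return np.array([p / (m * l**2), -m * g * l * np.sin(theta)])


def jac(x):
    """Matrix of partial derivatives of f with respect to the state."""
    theta, _ = x
    return np.array([[0.0, 1.0 / (m * l**2)],
                     [-m * g * l * np.cos(theta), 0.0]])


def augmented_rule(z):
    """Right-hand side for the state and the sensitivity matrix together."""
    x, S = z[:2], z[2:].reshape(2, 2)
    return np.concatenate([f(x), (jac(x) @ S).ravel()])


def rk4(rule, z0, t, n_steps=2000):
    """Integrate dz/dt = rule(z) from 0 to t with classical RK4."""
    z = np.asarray(z0, dtype=float)
    h = t / n_steps
    for _ in range(n_steps):
        k1 = rule(z)
        k2 = rule(z + 0.5 * h * k1)
        k3 = rule(z + 0.5 * h * k2)
        k4 = rule(z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return z


x0 = np.array([1.0, 0.5])
t = 3.0

# The sensitivity matrix starts from the identity
z = rk4(augmented_rule, np.concatenate([x0, np.eye(2).ravel()]), t)
S = z[2:].reshape(2, 2)

# Cross-check: centred finite differences in each component of x0
eps = 1e-6
S_fd = np.column_stack([
    (rk4(f, x0 + eps * e, t) - rk4(f, x0 - eps * e, t)) / (2 * eps)
    for e in np.eye(2)
])
print(S)
print(S_fd)
```

The finite-difference check needs $2m$ extra integrations of the non-linear system, whereas the augmented system costs one integration of $m + m^{2}$ variables. Neither is attractive when $m$ is very large.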




## Bayesian  data assimilation

### Expressing uncertainty on the initial state

Within a Bayesian framework, our knowledge of the initial state of the system is expressed in terms of a **probability distribution**. In particular, the **prior probability distribution** specifies completely our knowledge 
of the initial state before we have taken into account any of the available observations. 

We will write $\pi_{0}$ for this prior probability distribution, and assume that it can be written in terms of a **probability density function** (PDF), $p_{0}$, with respect to the standard volume element on
the state space, $\mathbb{R}^{m}$. 
The probability that the initial state lies within a subset $U\subseteq \mathbb{R}^{m}$ is then given through
$$
\pi_{0}(U) = \int_{U} p_{0}(\mathbf{x}) \,\mathrm{d}\mathbf{x}.
$$
This probability quantifies our **degree of belief** that the statement, $\mathbf{x}_{0} \in U$, is true. If $\pi_{0}(U)=1$ then we think that the statement is definitely true, while if $\pi_{0}(U)=0$ we think that it 
is definitely false. If for two subsets, $U,V\subseteq \mathbb{R}^{m}$, we have $\pi_{0}(U) < \pi_{0}(V)$ then we think it more likely that the initial state is in $V$. 

The precise interpretation of these probabilities is an issue of philosophy and not mathematics and so will not be considered in detail. But it is worth emphasising that in the Bayesian framework, probabilities need have no relation to relative frequencies of outcomes within a random process.

### Propagating uncertainties under the flow of the system

Suppose that we have a prior probability that expresses our knowledge about a dynamical system's initial state. If there are no observations to further constrain the system, then we will show 
that our knowledge of the state at later times is fully determined by evolving 
the prior distribution under the flow of the dynamical system.

Let us write $\pi_{t}$ for the probability distribution for the state at a time, $t$, this being the quantity we would like to calculate. Given any $U\subseteq \mathbb{R}^{m}$, we know that $\pi_{t}(U)$ represents the probability that the 
state vector, $\mathbf{x}(t)$, lies within this subset. Because, however, the flow of the dynamical system induces an invertible mapping, $\Phi_{t}$, between states at times $0$ and $t$, the condition $\mathbf{x}(t) \in U$ is equivalent to $\mathbf{x}_{0} \in \Phi_{t}^{-1}(U)$ for the corresponding initial condition. 
Here we have defined the **inverse image** of the subset $U$ under the mapping, $\Phi_{t}$, 
through
$$
\Phi_{t}^{-1}(U) = \left\{
\mathbf{x}_{0} \in \mathbb{R}^{m} \,|\, \Phi_{t}(\mathbf{x}_{0}) \in U
\right\}. 
$$
Note that the inverse image is a well-defined subset for any mapping, but because 
$\Phi_{t}$ is invertible, there is a one-to-one relationship between states
in $U$ and $\Phi_{t}^{-1}(U)$.
Using these notations and results, we can then express the probability distribution at time, $t$, in the form
$$
\pi_{t}(U) = \pi_{0}[\Phi_{t}^{-1}(U)].
$$ 
Equivalently, we write this as
$$
\pi_{t} = \pi_{0} \circ \Phi_{t}^{-1}, 
$$
with the right hand side being called the **push-forward** of the probability distribution, $\pi_{0}$, under the mapping, $\Phi_{t}$. 

### Transformation of the probability density functions 

We assumed that the prior probability distribution is defined in terms of a probability density function, $p_{0}$. We now show that the same is true for the pushed-forward 
distribution, $\pi_{t}$, and hence the preceding results can be made more concrete. Letting $p_{t}$ denote the PDF for $\pi_{t}$, we require that 
$$
\int_{U} p_{t}(\mathbf{x}) \,\mathrm{d}\mathbf{x} = \int_{\Phi_{t}^{-1}(U)} p_{0}(\mathbf{x}_{0}) \,\mathrm{d}\mathbf{x}_{0},
$$
for any $U \subseteq \mathbb{R}^{m}$. Setting $\mathbf{x} = \Phi_{t}(\mathbf{x}_{0})$, and recalling the change of variables formula for integration, 
the left hand side of the above equality can be written
$$
\int_{U} p_{t}(\mathbf{x}) \,\mathrm{d}\mathbf{x} = \int_{\Phi_{t}^{-1}(U)} p_{t}[\Phi_{t}(\mathbf{x}_{0})] \det\!\left[
    \frac{\partial \Phi_{t}}{\partial \mathbf{x}_{0}}(\mathbf{x}_{0})\right] \mathrm{d}\mathbf{x}_{0},
$$
where $\frac{\partial \Phi_{t}}{\partial \mathbf{x}_{0}}(\mathbf{x}_{0})$ is the sensitivity matrix, 
and $\det$ denotes the determinant. Because the subset $U$ was arbitrary, 
it follows that
$$
p_{0}(\mathbf{x}_{0}) = p_{t}[\Phi_{t}(\mathbf{x}_{0})]  \det\!\left[
    \frac{\partial \Phi_{t}}{\partial \mathbf{x}_{0}}(\mathbf{x}_{0})\right], 
$$
and hence  the desired PDF can be written
$$
p_{t}(\mathbf{x}) = p_{0}[\Phi_{t}^{-1}(\mathbf{x})]  \det\!\left\{
    \frac{\partial \Phi_{t}}{\partial \mathbf{x}_{0}}[\Phi_{t}^{-1}(\mathbf{x})]\right\}^{-1}.
$$



To apply the above formulae, we need to be able to integrate the dynamical system backwards in 
time, and we also need the determinant of the sensitivity matrix. This matrix could 
be constructed using the methods discussed above, and then its determinant evaluated. However, using an identity for the derivative of the determinant with respect to a matrix's components, it can be shown that
$$
\frac{\mathrm{d}}{\mathrm{d} t}\det \!\left[\frac{\partial \mathbf{x}}{\partial \mathbf{x}_{0}}(t)\right] = \mathrm{tr}\!\left\{\frac{\partial \mathbf{f}}{\partial \mathbf{x}}[\mathbf{x}(t)]\right\} \det \!\left[\frac{\partial \mathbf{x}}{\partial \mathbf{x}_{0}}(t)\right], 
$$
where $\mathrm{tr}$ denotes the trace of a matrix. This is a scalar differential equation for the 
determinant that can be solved subject to the initial condition, $\det \!\left[\frac{\partial \mathbf{x}}{\partial \mathbf{x}_{0}}(0)\right]=1$. In this manner, the necessary Jacobian factor can be determined without the explicit construction of the sensitivity matrix. 
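
The scalar equation for the determinant can be checked numerically. For the frictionless pendulum the trace vanishes, so the sketch below uses a pendulum with a linear damping term, $-\gamma p$, added to the momentum equation. This damped model is a hypothetical example introduced only so that the trace is non-zero; the parameter values and helper names are likewise assumptions. The state, the sensitivity matrix, and the determinant are integrated together, and the determinant obtained from its own equation is compared with that of the integrated matrix.

```python
import numpy as np

m, l, g = 1.0, 1.0, 9.81   # illustrative values (assumed)
gamma = 0.3                # hypothetical damping coefficient


def f(x):
    """Damped pendulum rule (hypothetical variant of the lecture example)."""
    theta, p = x
    return np.array([p / (m * l**2), -m * g * l * np.sin(theta) - gamma * p])


def jac(x):
    """Matrix of partial derivatives of f; its trace is -gamma."""
    theta, _ = x
    return np.array([[0.0, 1.0 / (m * l**2)],
                     [-m * g * l * np.cos(theta), -gamma]])


def rule(z):
    """Evolve the state, the sensitivity matrix, and its determinant together."""
    x, S, d = z[:2], z[2:6].reshape(2, 2), z[6]
    J = jac(x)
    return np.concatenate([f(x), (J @ S).ravel(), [np.trace(J) * d]])


def rk4(rule, z0, t, n_steps=2000):
    """Integrate dz/dt = rule(z) from 0 to t with classical RK4."""
    z = np.asarray(z0, dtype=float)
    h = t / n_steps
    for _ in range(n_steps):
        k1 = rule(z)
        k2 = rule(z + 0.5 * h * k1)
        k3 = rule(z + 0.5 * h * k2)
        k4 = rule(z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return z


x0 = np.array([1.0, 0.5])
t = 3.0
z0 = np.concatenate([x0, np.eye(2).ravel(), [1.0]])
z = rk4(rule, z0, t)

det_from_matrix = np.linalg.det(z[2:6].reshape(2, 2))
det_from_ode = z[6]
det_closed_form = np.exp(-gamma * t)   # valid because the trace is constant here
print(det_from_matrix, det_from_ode, det_closed_form)
```

In this example the trace is constant, so the determinant is known in closed form. For a general dynamical rule the trace depends on the reference state, and the scalar equation must be integrated alongside it.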

### Application to the pendulum system

The above ideas can be readily illustrated for the pendulum system. The first step is calculating the matrix $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}[\mathbf{x}(t)]$
for the dynamical rule. Recalling from earlier that 
$$
\mathbf{f}(\mathbf{x}) = \left(\begin{array}{c}
              \frac{1}{ml^{2}}p \\ - mgl \sin \theta
             \end{array}\right), 
$$
we find that the associated matrix of partial derivatives is then given by
$$
\frac{\partial \mathbf{f}}{\partial \mathbf{x}}[\mathbf{x}(t)] = \left(\begin{array}{cc}
              0 & \frac{1}{ml^{2}} \\ - mgl \cos \theta & 0
             \end{array}\right).
$$
The matrix, $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}[\mathbf{x}(t)]$, is seen to have zero trace, and hence for this system the determinant of the sensitivity matrix is constant and equal to its initial value of one. The transformation of the PDFs then reduces to
$$
p_{t}(\mathbf{x}) = p_{0}[\Phi_{t}^{-1}(\mathbf{x})].
$$
This simplification is an instance of **Liouville's theorem**, which holds for **Hamiltonian** dynamical systems. This property is not typically present within climatological applications and so we will not emphasise this point further, though it is used to simplify our numerical implementations.

These ideas are applied numerically for the pendulum system in the code below. To do this, we define a simple Gaussian PDF for the prior distribution on the initial state at time, $t_{0} = 0$. For a later time, $t_{1}$, we can then calculate and visualise the prior PDF as follows:

1. We form a regular grid in the state space that is sufficiently fine as to resolve the details of the two PDFs.
2. For each state, $\mathbf{x}$, in the grid we integrate the evolution equation backwards in time from $t_{1}$ to $t_{0}$. This gives the corresponding initial state, $\mathbf{x}_{0} = \Phi_{t_{1}}^{-1}(\mathbf{x})$. We then use $p_{t_{1}}(\mathbf{x}) = p_{t_{0}}(\mathbf{x}_{0})$ to set the corresponding value on our grid. For a more general dynamical system, it would be necessary to compute the appropriate determinant of the sensitivity matrix as discussed above. 
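
The two steps above can be sketched in plain NumPy as follows. This is an illustration under stated assumptions rather than the library's implementation: the parameter values, grid size, window in momentum, and helper names are all chosen for the example, the angle is treated as an ordinary real variable, and a hand-written Runge-Kutta scheme is applied to every grid point at once.

```python
import numpy as np

m, l, g = 1.0, 1.0, 9.81  # illustrative values (assumed)


def f(theta, p):
    """Pendulum dynamical rule, vectorised over arrays of states."""
    return p / (m * l**2), -m * g * l * np.sin(theta)


def flow(theta, p, t, n_steps=400):
    """RK4 approximation of the flow; a negative t integrates backwards."""
    h = t / n_steps
    for _ in range(n_steps):
        k1t, k1p = f(theta, p)
        k2t, k2p = f(theta + 0.5 * h * k1t, p + 0.5 * h * k1p)
        k3t, k3p = f(theta + 0.5 * h * k2t, p + 0.5 * h * k2p)
        k4t, k4p = f(theta + h * k3t, p + h * k3p)
        theta = theta + (h / 6.0) * (k1t + 2 * k2t + 2 * k3t + k4t)
        p = p + (h / 6.0) * (k1p + 2 * k2p + 2 * k3p + k4p)
    return theta, p


def prior_pdf(theta, p, stds=(0.5, 1.0)):
    """Zero-mean Gaussian prior PDF with independent components."""
    norm = 2 * np.pi * stds[0] * stds[1]
    return np.exp(-0.5 * ((theta / stds[0])**2 + (p / stds[1])**2)) / norm


# Step 1: a regular grid of states at the later time t1
theta_axis = np.linspace(-np.pi, np.pi, 101)
p_axis = np.linspace(-6.0, 6.0, 101)
d_theta = theta_axis[1] - theta_axis[0]
d_p = p_axis[1] - p_axis[0]
theta_grid, p_grid = np.meshgrid(theta_axis, p_axis, indexing="ij")

# Step 2: integrate each grid point backwards to t0 = 0 and evaluate the
# prior PDF there. No Jacobian factor is needed for this system.
t1 = 2.0
theta_init, p_init = flow(theta_grid, p_grid, -t1)
pdf_t1 = prior_pdf(theta_init, p_init)

# The total probability on the grid should remain close to one
mass = pdf_t1.sum() * d_theta * d_p
print("total mass on the grid:", mass)
```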

The necessary methods are built into the `ProbabilityGrid` class (accessed as `da.ProbabilityGrid` in the code below) that you will make use of within the first practical. 

In [None]:
# Set the times for the simulation
t0 = 0
t1 = 10

# Set the prior PDF at t0=0 using a simple Gaussian with 
# diagonal covariance matrix.
mean = [0.0, 0.0] 
stds  = [0.5, 1.0] 
theta_range = (-np.pi, np.pi)
p_range = (-3*stds[1], 3*stds[1])
resolution = 300

prior_t0 = da.ProbabilityGrid.from_bounds(
    (theta_range, p_range), 
    resolution, 
    da.get_independent_gaussian_pdf(mean, stds)
)

# Push forward the prior under the flow
prior_t1 = prior_t0.push_forward(single.eom, t1 - t0)


# Plot the two PDFs side by side for comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

_, c1 = da.plot_grid_marginal(prior_t0, ax = ax1, cmap="Blues")
ax1.set_title(f'prior PDF at {t0}')
fig.colorbar(c1, ax=ax1, location="bottom", shrink=0.7, pad=0.1)

_, c2 = da.plot_grid_marginal(prior_t1, ax = ax2, cmap="Blues")
ax2.set_title(f'prior PDF at {t1}')
fig.colorbar(c2, ax=ax2, location="bottom", shrink=0.7, pad=0.1)


plt.tight_layout()
plt.show()

### Bayes theorem

Before considering the application of **Bayes theorem** to data assimilation, it will be useful to discuss this idea more generally. 

Suppose that our knowledge of two variables, $\mathbf{x} \in \mathbb{R}^{m}$ and $\mathbf{y}\in \mathbb{R}^{n}$, is expressed through a **joint probability distribution**
that we denote by $\rho$ and which has a PDF, $r$, defined on the product space $\mathbb{R}^{m}\times \mathbb{R}^{n}$. This means that the probability that 
$\mathbf{x} \in U$ and $\mathbf{y} \in V$ can be written
$$
\rho(U\times V) = \int_{U}\int_{V} r(\mathbf{x},\mathbf{y}) \,\mathrm{d}\mathbf{y} \,\mathrm{d}\mathbf{x} = \int_{V}\int_{U} r(\mathbf{x},\mathbf{y}) \,\mathrm{d}\mathbf{x} \,\mathrm{d}\mathbf{y}, 
$$
where we note that the integrations can be performed in either order. 

Our knowledge of the variable, $\mathbf{x}$, alone is expressed through the **marginal probability distribution** defined by
$$
\rho_{\mathbf{x}}(U) = \rho(U\times \mathbb{R}^{n}) =  \int_{U}\int_{\mathbb{R}^{n}} r(\mathbf{x},\mathbf{y}) \,\mathrm{d}\mathbf{y} \,\mathrm{d}\mathbf{x}.
$$
Here we note that the statement, $\mathbf{x} \in U$, is the same as asking that $\mathbf{x} \in U$ and $\mathbf{y}\in \mathbb{R}^{n}$. From this definition, 
it is clear that the PDF for this marginal distribution takes the form
$$
r_{\mathbf{x}}(\mathbf{x}) = \int_{\mathbb{R}^{n}} r(\mathbf{x},\mathbf{y}) \,\mathrm{d}\mathbf{y}.
$$
The marginal distribution, $\rho_{\mathbf{y}}$, and its PDF, $r_{\mathbf{y}}$, can be similarly defined. 

Starting with our state of knowledge as defined by the joint distribution, $\rho$,  suppose that we 
learn that $\mathbf{y} \in V$. It seems reasonable that our knowledge of $\mathbf{x}$ should in general have 
improved.  This idea is expressed by the **conditional probability distribution**, $\rho_{\mathbf{x}|\mathbf{y} \in V}$, on $\mathbf{x}$
given that $\mathbf{y} \in V$. By definition, this conditional distribution takes the form
$$
\rho_{\mathbf{x}|\mathbf{y}\in V}(U) = \frac{\rho(U\times V)}{\rho_{\mathbf{y}}(V)}.
$$
The numerator on the right hand side is the probability that $\mathbf{x} \in U$ and $\mathbf{y} \in V$, while the denominator 
is the marginal probability that $\mathbf{y} \in V$, which serves to correctly normalise the conditional distribution. In an identical manner, we can write 
$$
\rho_{\mathbf{y}|\mathbf{x}\in U}(V) = \frac{\rho(U\times V)}{\rho_{\mathbf{x}}(U)}, 
$$
and hence by a simple rearrangement we arrive at **Bayes theorem**:
$$
\rho_{\mathbf{x}|\mathbf{y}\in V}(U) = \frac{\rho_{\mathbf{y}|\mathbf{x}\in U}(V)\rho_{\mathbf{x}}(U)}{\rho_{\mathbf{y}}(V)}.
$$
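
These definitions can be illustrated on a grid. The sketch below takes scalar variables ($m = n = 1$), builds a joint PDF, and approximates the joint, marginal, and conditional probabilities of two intervals by simple Riemann sums. The correlated Gaussian joint PDF, the intervals $U$ and $V$, and the grid are assumptions chosen for the example.

```python
import numpy as np

# Grid for a pair of scalar variables
x = np.linspace(-6.0, 6.0, 601)
y = np.linspace(-6.0, 6.0, 601)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")

# Joint PDF: zero-mean Gaussian with unit variances and correlation c (assumed)
c = 0.7
r = np.exp(-(X**2 - 2 * c * X * Y + Y**2) / (2 * (1 - c**2)))
r /= r.sum() * dx * dy   # normalise on the grid

# Subsets U and V, represented by boolean masks on each axis
in_U = (x > 0.0) & (x < 1.0)
in_V = (y > 0.5) & (y < 2.0)

# Joint and marginal probabilities
rho_UV = r[np.ix_(in_U, in_V)].sum() * dx * dy
rho_x_U = r[in_U, :].sum() * dx * dy
rho_y_V = r[:, in_V].sum() * dx * dy

# Conditional probabilities from their definitions
cond_x_given_V = rho_UV / rho_y_V
cond_y_given_U = rho_UV / rho_x_U

# Bayes theorem: the right hand side reproduces the conditional on x
bayes_rhs = cond_y_given_U * rho_x_U / rho_y_V
print(rho_x_U, cond_x_given_V, bayes_rhs)
```

Comparing `rho_x_U` with `cond_x_given_V` shows how learning that $\mathbf{y} \in V$ changes the probability assigned to $\mathbf{x} \in U$.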

Through a limiting argument that we will not detail, Bayes theorem can be expressed in terms of probability density functions. 
We can also specialise to the case in which $\mathbf{y}$ is known completely (i.e., it belongs to a subset comprising a single point). 
Let $r_{\mathbf{x}|\mathbf{y}}$ be the PDF for $\mathbf{x}$ given that $\mathbf{y}$ is known, and similarly for $r_{\mathbf{y}|\mathbf{x}}$. 
Then Bayes  theorem is equivalent to the condition
$$
r_{\mathbf{x}|\mathbf{y}}(\mathbf{x}) = \frac{r_{\mathbf{y}|\mathbf{x}}(\mathbf{y}) r_{\mathbf{x}}(\mathbf{x})}{\int_{\mathbb{R}^{m}}r_{\mathbf{y}|\mathbf{x}}(\mathbf{y}) r_{\mathbf{x}}(\mathbf{x})\,\mathrm{d}\mathbf{x}}.
$$
On the right hand side of this equality, $r_{\mathbf{y}|\mathbf{x}}(\mathbf{y})$ is being viewed as a function of $\mathbf{x}$, and we call this quantity 
the **likelihood** of $\mathbf{y}$ given $\mathbf{x}$. Here we see Bayes theorem provides the rule needed to transform the **prior** PDF, $r_{\mathbf{x}}$, 
into the **posterior** PDF, $r_{\mathbf{x}|\mathbf{y}}$, following our observation of $\mathbf{y}$.




### Updating beliefs in light of  observations

We now consider the dynamical system at a fixed time, and write $\mathbf{x}$ for the state vector. Suppose that our prior knowledge of the state is expressed through a probability distribution, $\pi$, with PDF, $p$.

A **noisy partial observation** of the state takes the form
$$
\mathbf{y} = \mathbf{g}(\mathbf{x}) + \mathbf{z}, 
$$
where the **observation function**, $\mathbf{g}:\mathbb{R}^{m}\rightarrow \mathbb{R}^{n}$, maps the state space to a lower dimensional observation space ($n < m$), and $\mathbf{z}$ is an observational error. We suppose that $\mathbf{z}$ is a sample from a known probability distribution with PDF denoted by $q$.

If the state, $\mathbf{x}$, were known, then $\mathbf{y} - \mathbf{g}(\mathbf{x})$ would be 
distributed according to the error distribution, and hence the conditional PDF for $\mathbf{y}$ given 
$\mathbf{x}$ takes the simple form
$$
q[\mathbf{y} - \mathbf{g}(\mathbf{x})].
$$
Regarded as a function of the state vector, this is the likelihood for our problem. 

Applying **Bayes theorem**, the posterior distribution, $\tilde{\pi}$, given that $\mathbf{y}$  has been observed
has the PDF:
$$
\tilde{p}(\mathbf{x})  = \frac{q[\mathbf{y} - \mathbf{g}(\mathbf{x})] p(\mathbf{x})}{\int_{\mathbb{R}^{m}}
q[\mathbf{y} - \mathbf{g}(\mathbf{x})] p(\mathbf{x}) \,\mathrm{d}\mathbf{x}}.
$$ 
The normalising constant on the right hand side is the probability density of the observation under the prior; this quantity 
is also known as the **Bayesian evidence** and plays  a central role within hypothesis testing. 

These ideas can be readily applied to the pendulum problem. Within the previous section, we pushed forward the prior PDF on the initial state to the time $t_{1}$. Within the jargon of data assimilation, this is known as the **prediction** step. Let us suppose that at this later time we make a 
noisy observation of the angle, $\theta$. This means that our observation operator, $\mathbf{g}:\mathbb{R}^{2}\rightarrow \mathbb{R}$, is a simple linear mapping:
$$
\mathbf{g}(\mathbf{x}) =\mathbf{G}\mathbf{x}, \quad  \mathbf{G} = \left(
    \begin{array}{cc}
    1 & 0
    \end{array}
\right)
$$
For simplicity we suppose that the error distribution takes a zero-mean Gaussian form:
 $$
 q(\theta) = \frac{1}{\sqrt{2\pi \sigma^{2}}}\exp\!\left(-\frac{\theta^{2}}{2\sigma^{2}}\right), 
 $$
 where $\sigma$ is the standard deviation for the measurement. The product of the likelihood and prior PDF at $t_{1}$ can then be formed to determine the un-normalised posterior PDF on our numerical grid.  The Bayesian evidence is found using a quadrature scheme, and hence the full posterior PDF can be constructed. Determining the posterior PDF at an observation time is known as an **analysis** step within the data assimilation problem. 
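
The analysis step can be sketched directly in NumPy. For the purpose of a check, the prior at the observation time is assumed here to be a correlated Gaussian, which is not what the pendulum dynamics would produce in general. With a linear observation and Gaussian errors the posterior mean is then known in closed form, and it can be compared with the value obtained on the grid. The grid, the prior parameters, and the variable names are assumptions for illustration.

```python
import numpy as np

# Regular grid in the state space (theta, p)
theta = np.linspace(-np.pi, np.pi, 401)
p = np.linspace(-6.0, 6.0, 401)
d_theta, d_p = theta[1] - theta[0], p[1] - p[0]
T, P = np.meshgrid(theta, p, indexing="ij")

# Assumed Gaussian prior at the observation time, with correlated components
mu = np.array([0.3, 0.0])
C = np.array([[0.25, 0.25],
              [0.25, 1.0]])
C_inv = np.linalg.inv(C)
dT, dP = T - mu[0], P - mu[1]
quad = C_inv[0, 0] * dT**2 + 2 * C_inv[0, 1] * dT * dP + C_inv[1, 1] * dP**2
prior = np.exp(-0.5 * quad)
prior /= prior.sum() * d_theta * d_p

# Likelihood q[y - g(x)] for a noisy observation of the angle alone
y_obs, sigma = 0.8, 0.2
likelihood = np.exp(-0.5 * ((y_obs - T) / sigma)**2) / np.sqrt(2 * np.pi * sigma**2)

# Bayes theorem on the grid: multiply, then normalise by the evidence
unnormalised = likelihood * prior
evidence = unnormalised.sum() * d_theta * d_p
posterior = unnormalised / evidence

# Posterior mean from the grid
post_mean = np.array([(T * posterior).sum(), (P * posterior).sum()]) * d_theta * d_p

# Closed-form posterior mean for a Gaussian prior and linear Gaussian likelihood
G = np.array([[1.0, 0.0]])
gain = C @ G.T / (G @ C @ G.T + sigma**2)
exact_mean = mu + (gain * (y_obs - G @ mu)).ravel()
print(post_mean, exact_mean)
```

Although only the angle is observed, the posterior mean of the momentum also moves, because the assumed prior correlates the two components.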

In [None]:
# 1. Setup the Observation System
theta_t1 = 0.8
z1_std = 0.2

# Define vectors/matrices for the Linear Gaussian Likelihood
y_obs = [theta_t1]       # The observation vector
R     = [[z1_std**2]]    # The observation error covariance matrix
G     = [[1.0, 0.0]]     # Observation Matrix: We observe Theta (index 0), not P (index 1) 

# Set up the likelihood and evaluate it on the gridded state space. 
likelihood_t1 = da.LinearGaussianLikelihood(
    y_obs, 
    R, 
    G
).evaluate(prior_t1)

# Perform the Bayesian update. 
unnormalised_posterior_t1 = prior_t1 * likelihood_t1
evidence_t1 = unnormalised_posterior_t1.total_mass
posterior_t1 = unnormalised_posterior_t1 / evidence_t1


# Visualise the analysis step
single.plot_bayesian_analysis(prior_t1, likelihood_t1, posterior_t1, theta_t1, t1)

Having formed the posterior distribution at time, $t_{1}$, we can evolve its PDF under the dynamics until the next observation time, $t_{2}$. At this time, let us suppose that we make a second noisy observation of the angle, $\theta$. Using the pushed-forward posterior from $t_{1}$ as our prior, we can then apply Bayes theorem just as before to obtain a new posterior. 

In [None]:
# Set the observation at t2.
t2 = t1 + 5
theta_t2 = -1.0
z2_std = 0.1

# push forward the posterior to form the new prior.
prior_t2 = posterior_t1.push_forward(da.pendulum.single.eom, t2-t1)

# Form the likelihood.
likelihood_t2 = da.LinearGaussianLikelihood(
    [theta_t2], 
    [[z2_std**2]], 
    G
).evaluate(prior_t2)

# Perform the Bayesian update. 
unnormalised_posterior_t2 = prior_t2 * likelihood_t2
evidence_t2 = unnormalised_posterior_t2.total_mass
posterior_t2 = unnormalised_posterior_t2 / evidence_t2


# Visualise the analysis step
single.plot_bayesian_analysis(prior_t2, likelihood_t2, posterior_t2, theta_t2, t2)


And if we have a third observation at time, $t_{3}$, we just iterate the process:

In [None]:
# Set the observation at t3
t3 = t2 + 10
theta_t3 = 1.2
z3_std = 0.1

# push forward the posterior to form the new prior.
prior_t3 = posterior_t2.push_forward(da.pendulum.single.eom, t3-t2)

# Form the likelihood.
likelihood_t3 = da.LinearGaussianLikelihood(
    [theta_t3], 
    [[z3_std**2]], 
    G
).evaluate(prior_t3)

# Perform the Bayesian update. 
unnormalised_posterior_t3 = prior_t3 * likelihood_t3
evidence_t3 = unnormalised_posterior_t3.total_mass
posterior_t3 = unnormalised_posterior_t3 / evidence_t3


# Visualise the analysis step
da.pendulum.single.plot_bayesian_analysis(prior_t3, likelihood_t3, posterior_t3, theta_t3, t3)


Following the assimilation of our three measurements, we can see that the posterior at $t_{3}$  is quite well localised in the state space. 

To form a **forecast** of the state at later times, all we need to do is evolve this PDF forward in the usual manner. Within the code below, we form an animation of this forecast over a chosen period of time. 

In [None]:
# Set the end time of the forecast
t4 = t3+5

# Animate the forecast
t_points = np.linspace(0, t4-t3, round((t4-t3)*fps))
anim = single.animate_advection(
    posterior_t3.to_interpolator(), 
    t_points, 
    x_lim=theta_range,
    y_lim=p_range,
    res=resolution,    
    title='Forecast',
    t_start=t3
)
plt.close()

da.display_animation_html(anim)

When introducing the idea of data assimilation, we stated that the problem amounted to estimating the initial state from the available observations. Within the prediction-analysis loop we have described, it has not actually been necessary to estimate the initial state directly. But if this is needed, it is simply a matter of pushing back the posterior at the final observation time, $t_{3}$, to the initial time, $t_{0}$. In the code below we compare the prior and posterior distributions for the initial state, and in this manner can see qualitatively how much we have learned from the data.

In [None]:
# Push the posterior at t3 back to t0 (note the negative time interval, t0-t3)
posterior_t0 = posterior_t3.push_forward(da.pendulum.single.eom, t0-t3)

# Compare the prior and posterior for the initial state
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

_, c1 = da.plot_grid_marginal(prior_t0, ax=ax1, cmap="Blues")
ax1.set_title(f'prior PDF at {t0}')
fig.colorbar(c1, ax=ax1, location="bottom", shrink=0.7, pad=0.1)

_, c2 = da.plot_grid_marginal(posterior_t0, ax=ax2, cmap="Blues")
ax2.set_title(f'posterior PDF at {t0}')
fig.colorbar(c2, ax=ax2, location="bottom", shrink=0.7, pad=0.1)

    
plt.tight_layout()
plt.show()

Using the posterior at the final observation time to constrain the system's state at an earlier time is known as **reanalysis**. Within the context of weather forecasting it is not a central concern, but it is very important in other climate applications. For example, in oceanography, reanalysis is one of the primary tools by which the available observations, which are comparatively sparse in both space and time, can be combined to reconstruct ocean dynamics on a global scale over an observational period.

## Is data assimilation a solved problem?

Within the above example, we have considered the application of data assimilation to a non-linear dynamical system, showing that a sequence of prediction and analysis steps is all that is required to form a forecast of the future state from noisy partial observations. Clearly within a serious application this process would be automated within a loop, but otherwise we have seen all the details. 
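The automation mentioned above can be sketched in a few lines. The block below is a minimal, self-contained illustration rather than the course library's implementation: it does not use `da`, and every name in it (`flow`, `push_forward`, `analysis`, the grid sizes, the prior parameters) is hypothetical. It assumes the non-dimensional pendulum equations $\dot{\theta} = p$, $\dot{p} = -\sin\theta$, represents each PDF as a plain Python function, and generates synthetic observations of the angle from an assumed true state.

```python
import math
import random


def rhs(theta, p):
    # Assumed non-dimensional pendulum: dtheta/dt = p, dp/dt = -sin(theta).
    return p, -math.sin(theta)


def flow(state, dt, h=0.1):
    """Integrate the pendulum over a signed interval dt using RK4 steps."""
    theta, p = state
    n = max(1, math.ceil(abs(dt) / h))
    s = dt / n
    for _ in range(n):
        k1 = rhs(theta, p)
        k2 = rhs(theta + 0.5 * s * k1[0], p + 0.5 * s * k1[1])
        k3 = rhs(theta + 0.5 * s * k2[0], p + 0.5 * s * k2[1])
        k4 = rhs(theta + s * k3[0], p + s * k3[1])
        theta += s * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
        p += s * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return theta, p


def push_forward(pdf, dt):
    # Prediction step. The pendulum flow preserves phase-space area, so the
    # PDF at the later time equals the earlier PDF at the back-propagated state.
    return lambda state: pdf(flow(state, -dt))


def analysis(prior, obs, std, grid, cell):
    # Analysis step: Gaussian likelihood for a direct observation of theta.
    def unnormalised(state):
        misfit = (state[0] - obs) / std
        return prior(state) * math.exp(-0.5 * misfit**2) / (std * math.sqrt(2 * math.pi))

    # Evidence approximated by a Riemann sum over the grid.
    evidence = sum(unnormalised(s) for s in grid) * cell
    return (lambda state: unnormalised(state) / evidence), evidence


# Regular M x M grid on the (theta, p) state space (sizes are illustrative).
M = 41
thetas = [-math.pi + 2 * math.pi * i / (M - 1) for i in range(M)]
ps = [-2.5 + 5.0 * j / (M - 1) for j in range(M)]
grid = [(th, p) for th in thetas for p in ps]
cell = (thetas[1] - thetas[0]) * (ps[1] - ps[0])


def prior_t0(state):
    # Assumed Gaussian prior at t0, centred at (0.8, 0.0) with std 0.4.
    th, p = state
    return math.exp(-0.5 * ((th - 0.8) / 0.4) ** 2 - 0.5 * (p / 0.4) ** 2) / (2 * math.pi * 0.4**2)


# Synthetic experiment: an assumed true state observed with Gaussian noise.
truth, obs_std, obs_times = (1.0, 0.0), 0.2, [2.0, 4.0, 6.0]
rng = random.Random(0)

pdf, t_prev, evidences = prior_t0, 0.0, []
for t_obs in obs_times:
    truth = flow(truth, t_obs - t_prev)
    obs = truth[0] + rng.gauss(0.0, obs_std)
    prior = push_forward(pdf, t_obs - t_prev)                   # prediction
    pdf, evidence = analysis(prior, obs, obs_std, grid, cell)   # analysis
    evidences.append(evidence)
    t_prev = t_obs

print("evidence at each observation time:", evidences)
```

After the loop, `pdf` is the posterior at the final observation time. Calling `push_forward(pdf, dt)` with a positive `dt` gives a forecast, while a negative `dt` pulls the PDF back to an earlier time, as in the reanalysis step. Because each PDF is a nested function, the cost of one evaluation grows with the number of observations assimilated; interpolating onto the grid after each step would avoid this at the price of interpolation error.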

Is this then all that is required within large-scale climatological applications? The answer is no for three reasons. 

First, within the present discussion we have considered only deterministic dynamical systems. By this we mean that the system is governed by a differential equation for which the later states are determined uniquely by the initial state. Within many climate applications, however, the underlying physics is not perfectly known. The uncertainty within the physical model can often be usefully represented as a random forcing term, and this leads us into the world of **stochastic differential equations**. There is not time to discuss this extension within this course. But we note that while the basic pattern of prediction and analysis steps is retained, there are additional complications. Moreover, reanalysis is no longer the simple matter of pushing back the final posterior distribution to the initial time.

The second point is that most non-linear dynamical systems are **chaotic**. This means, roughly, that their future states are highly sensitive to initial conditions. The practical implication for data assimilation is that even if we have done a good job of estimating a tightly peaked posterior at the final observation time, as we push this distribution forward for our forecast the uncertainty can grow rapidly and the forecast soon becomes useless. This is, in outline, why weather forecasts are only valid for a small number of days. As computers have become more powerful and the amount of data available has increased, it has been possible to stretch the forecast window to around one week, but there will always remain a fundamental difficulty in forming long-term weather forecasts.
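Sensitivity to initial conditions is easy to see numerically. The sketch below uses the logistic map $x_{n+1} = 4 x_{n}(1 - x_{n})$, which is not one of the systems studied in these lectures and is chosen here only because it is a very compact chaotic system. Two trajectories start a distance $10^{-10}$ apart and their separation is recorded at each step. The function names and the starting values are illustrative assumptions.

```python
def logistic(x, r=4.0):
    # One step of the logistic map, which is chaotic for r = 4.
    return r * x * (1.0 - x)


def separation_history(x0, delta, n_steps):
    """Track |b - a| for two trajectories started at x0 and x0 + delta."""
    a, b = x0, x0 + delta
    history = [abs(b - a)]
    for _ in range(n_steps):
        a, b = logistic(a), logistic(b)
        history.append(abs(b - a))
    return history


history = separation_history(0.2, 1e-10, 60)
for n in range(0, 61, 10):
    print(f"step {n:2d}: separation = {history[n]:.3e}")
```

The separation grows roughly exponentially until it is comparable to the size of the state space, after which the two trajectories are effectively unrelated. In a forecasting context, the initial separation plays the role of the posterior uncertainty at the final observation time.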

The third point is ostensibly the least interesting, but it is the most important in practice. The methods we have applied within this lecture are simply too costly to use within large-scale data assimilation problems. To see this, consider the cost of one prediction step. We formed an equally spaced grid within our two-dimensional state space. Suppose that this grid has $M$ points along each co-ordinate axis and so comprises $M^{2}$ states in total. To transform the PDF from time $t_{i}$ to $t_{i+1}$ we loop over the states within this grid and for each one integrate the evolution equations backwards in time from $t_{i+1}$ to $t_{i}$. The cost of this process is proportional to the number of states in the grid, and hence scales like $M^{2}$. The cost of multiplying the likelihood and prior and computing the Bayesian evidence similarly scales like $M^{2}$, though relative to the integrations these costs are typically negligible.

If we tried to apply these methods directly to an $m$-dimensional dynamical system, it is readily seen that the computational cost would scale like $M^{m}$. We noted above that modern weather models typically have $m \sim 10^{9}$. Suppose, as an example, we used just 10 points along each co-ordinate axis in the state space. There would then need to be $10^{10^{9}}$ integrations of the evolution equations, meaning in this case the numerical solution of coupled non-linear partial differential equations, to perform a single prediction step. Even a state space of only nine dimensions would already require one billion integrations. Such computational costs lie far beyond what is possible now or at any foreseeable point in the future. And even if this were somehow possible, the cost in terms of energy usage might be so high as to make the process unconscionable.
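The $M^{m}$ scaling can be made concrete with a little arithmetic. The sketch below counts the grid points, and hence the integrations needed for one prediction step, for $M = 10$ points per axis and a few values of $m$. For $m = 10^{9}$ the count is far too large to print, so only its base-10 logarithm is computed. The function names are illustrative.

```python
import math


def grid_size(points_per_axis, dimension):
    # Number of states in a regular grid: M**m.
    return points_per_axis**dimension


def log10_grid_size(points_per_axis, dimension):
    # log10(M**m) = m * log10(M), usable when M**m is too large to form.
    return dimension * math.log10(points_per_axis)


for m in (2, 3, 6, 9):
    print(f"m = {m}: {grid_size(10, m):,} integrations per prediction step")

exponent = log10_grid_size(10, 10**9)
print(f"m = 10^9: about 10^{exponent:.0f} integrations per prediction step")
```

With $M = 10$, nine dimensions already require one billion integrations, and for $m = 10^{9}$ the count is $10^{10^{9}}$, a number with around a billion digits.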

What we are seeing here is often called the **curse of dimensionality**. Methods that work nicely for low-dimensional toy problems cannot be scaled up to real applications. This does not mean that data assimilation is impossible in such cases, but it means we have to shift emphasis towards finding workable approximations to the full Bayesian solution. In the next lecture we will consider some of the available methods.

## Further reading

These lectures can only provide an incomplete introduction to the topic of data assimilation. For those who wish to know more I recommend the following books:

- Wunsch, C., 1996. The ocean circulation inverse problem. Cambridge University Press.
- Law, K., Stuart, A. and Zygalakis, K., 2015. Data assimilation. Cham, Switzerland: Springer.
- Sanz-Alonso, D., Stuart, A. and Taeb, A., 2023. Inverse problems and data assimilation (Vol. 107). Cambridge University Press.

