# ACSE-7 (Inversion and Optimisation)  <a class="tocSkip"></a>

## Lecture 7: PDE-Constrained Optimisation and Adjoint Methods   <a class="tocSkip"></a>

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Motivation" data-toc-modified-id="Motivation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Motivation</a></span></li><li><span><a href="#Continuous-and-Discontinuous-PDE-Constrained-Optimisation" data-toc-modified-id="Continuous-and-Discontinuous-PDE-Constrained-Optimisation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Continuous and Discontinuous PDE-Constrained Optimisation</a></span><ul class="toc-item"><li><span><a href="#The-Reduced-Problem" data-toc-modified-id="The-Reduced-Problem-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>The Reduced Problem</a></span></li></ul></li><li><span><a href="#The-Tangent-Linear-Approach" data-toc-modified-id="The-Tangent-Linear-Approach-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>The Tangent Linear Approach</a></span></li><li><span><a href="#The-Adjoint-Equation" data-toc-modified-id="The-Adjoint-Equation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The Adjoint Equation</a></span><ul class="toc-item"><li><span><a href="#Background:-Variational-Calculus-(*)" data-toc-modified-id="Background:-Variational-Calculus-(*)-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Background: Variational Calculus (*)</a></span></li><li><span><a href="#Continuous-Adjoint-for-1D-Advection-(*)" data-toc-modified-id="Continuous-Adjoint-for-1D-Advection-(*)-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Continuous Adjoint for 1D Advection (*)</a></span></li><li><span><a href="#A-Backwards-Continuous-PDE-for-$\lambda$-(*)" data-toc-modified-id="A-Backwards-Continuous-PDE-for-$\lambda$-(*)-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>A Backwards Continuous PDE for $\lambda$ (*)</a></span></li><li><span><a href="#Continuous-adjoint-equation---summary" data-toc-modified-id="Continuous-adjoint-equation---summary-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Continuous adjoint equation - summary</a></span></li><li><span><a 
href="#Solving-the-discrete-model-backwards" data-toc-modified-id="Solving-the-discrete-model-backwards-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Solving the discrete model backwards</a></span></li></ul></li><li><span><a href="#A-Derivative-of-the-Reduced-Problem-via-the-Adjoint-Equation" data-toc-modified-id="A-Derivative-of-the-Reduced-Problem-via-the-Adjoint-Equation-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>A Derivative of the Reduced Problem via the Adjoint Equation</a></span></li><li><span><a href="#Implementation-(*)" data-toc-modified-id="Implementation-(*)-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Implementation (*)</a></span><ul class="toc-item"><li><span><a href="#Continuous-vs.-Discrete-Adjoint" data-toc-modified-id="Continuous-vs.-Discrete-Adjoint-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Continuous vs. Discrete Adjoint</a></span></li><li><span><a href="#Automatic-Differentiation" data-toc-modified-id="Automatic-Differentiation-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Automatic Differentiation</a></span></li><li><span><a href="#Checkpointing" data-toc-modified-id="Checkpointing-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Checkpointing</a></span></li></ul></li><li><span><a href="#List-of-Definitions" data-toc-modified-id="List-of-Definitions-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>List of Definitions</a></span></li></ul></div>

## Summary <a class="tocSkip"></a>
In this lecture we discuss PDE-constrained optimisation, where the function whose value we want to optimise depends on the solution to a PDE. The PDE, and therefore its solution, in turn depends on model parameters such as initial conditions, boundary conditions and forcing terms. The optimisation problem can be described as a search for those parameters that produce the optimal value after solving the PDE and substituting the solution into the functional. 
Since we want to apply gradient-based optimisation algorithms to this problem, we will discuss two different approaches to compute the derivative of the reduced functional. The first is the tangent linear approach, which requires the solution of a set of linearized PDEs, one for each model parameter. It is therefore only feasible for a small number of parameters. Once these PDEs have been solved, however, it is relatively cheap to compute the derivative for many different functionals. The adjoint method, on the other hand, solves a single linearized PDE for one specific functional. The cost of computing the derivative is therefore (almost) independent of the number of model parameters.
    
### Important concepts: <a class="tocSkip"></a>

- the reduced problem and reduced functional associated with a PDE-constrained optimisation problem
- the tangent linear approach to computing the derivative
- the adjoint approach to computing the derivative
- continuous vs. discrete PDE-constrained optimisation and the adjoint method

<font size="1pt">Some $\LaTeX$ definitions hidden in this cell (double-click to reveal)</font>
$
\newcommand\vec[1]{\mathbf{#1}}
\newcommand\mat[1]{\underline{\mathbf{#1}}}
\newcommand\R{\mathbb{R}}
\newcommand\vlam{\boldsymbol{\lambda}}
$

In [7]:
%%html
<style>
a.definition {
    color: blue;
    font-style: italic;
    font-weight: bold;
}
div.optional {
    display: block;
    background-color: #f0f8ff;
    border-color: #e0f0ff;
    border-left: 5px solid #e0f0ff;
    padding: 0.5em;
}
</style>

# Motivation

PDE-constrained optimisation problems, that is, optimisation problems where the function that is optimised depends on the solution to a PDE, occur in many different application areas. In the design stage of an engineering project, we are often not just interested in running a numerical model for a single design: there might be many design parameters and we need efficient methods to find the optimal design. Inversion problems in which the map from the model data to the observed data involves solving a PDE that depends on the model data form another example of PDE-constrained optimisation problems. For instance, we might want to invert for the initial condition of a time-dependent PDE, where we are trying to minimise the difference between some observed quantity and the same quantity predicted by the PDE-based model. Other examples include data assimilation in weather and ocean forecasting models, and flow control problems.

As we have seen in this course, efficient continuous optimisation algorithms require access to gradient information. So it will be important to efficiently calculate the gradient of the outcome of a PDE-based (numerical) model with respect to inputs of that model. Such gradient information is not only needed in gradient-based optimisation algorithms, it is also very useful for a better understanding of model outcomes. In many applications there is a huge variability in the amount of certainty we have about the inputs to our
numerical model. Some input data comes with an error bar, some parameters might just be guesswork. By calculating the derivative of the outcome of interest with respect to the input parameters, we learn how sensitive that outcome is to changes in the input data, and thus we learn more about the reliability of the outcomes of our model.

<table>
<tr>
<td width="40%"><img src="https://www.researchgate.net/profile/Simon_Neill/publication/314606415/figure/fig7/AS:470778710892544@1489253799783/Leased-wave-and-tidal-sites-in-a-Scotland-and-b-Pentland-Firth-and-Orkney-waters.png"></td>
<td><img src="https://opentidalfarm.readthedocs.io/en/latest/_static/slider_media/discrete_streamlines.png"></td><td>
<img src="https://opentidalfarm.readthedocs.io/en/latest/_static/slider_media/discrete_turbine133_iter_plot.png"></td>
</tr>
<tr>
<td>
<img src="https://opentidalfarm.readthedocs.io/en/latest/_static/slider_media/smooth_turbine.png"></td>
<td width="24%">
<img src="https://d2n4wb9orp1vta.cloudfront.net/cms/turbine_blade_composites_Lockheed.jpeg"></td>
<td>
<img src="https://wp-assets.futurism.com/2016/09/Hammerfest-Str__m-Tidal-Turbine-array.jpg">
</td>
</tr>
</table>

**Tidal turbine farm optimisation** in the Pentland Firth between mainland Scotland and the Orkney islands. As the tidal turbines influence the flow, the positioning of the turbines within the farm changes the amount of energy that can be extracted. This is an example of a PDE-constrained optimisation problem.
**Top right figure**: Optimisation of individual turbines in the Inner Sound of Stroma site in the Pentland Firth.
<span style="font-size:10px">SW Funke, PE Farrell, MD Piggott (2014). Tidal turbine array optimisation using the adjoint approach, Renewable Energy, 63, pp. 658-673. [doi:10.1016/j.renene.2013.09.031](https://www.sciencedirect.com/science/article/pii/S0960148113004989)</span>
**Bottom left figure**: Optimisation of multiple farms simultaneously in the form of a turbine density
<span style="font-size:10px">SW Funke, SC Kramer, MD Piggott (2016). Design optimisation and resource assessment for tidal-stream renewable energy farms using a new continuous turbine approach, Renewable Energy, 99, pp. 1046-1061. [doi:10.1016/j.renene.2016.07.039](https://www.sciencedirect.com/science/article/pii/S0960148116306358)</span>

In [6]:
from IPython.display import HTML, IFrame
IFrame("https://www.youtube.com/embed/qz9jXNS0VQY", 560, 315)

Example of PDE-constrained optimisation for positioning 256 individual turbines within a farm, using OpenTidalFarm based on FEniCS and dolfin-adjoint. For more information see: [OpenTidalFarm](http://opentidalfarm.org/)

<img src="./figures/mesh.png" width=500x>

**Inversion for Bottom Friction** In this example from some current research we have a model for tides in the Bristol Channel and Severn Estuary. The (numerical) "model" computes the depth-averaged 2D velocity field and the free surface elevation. There are multiple numerical and physical "model parameters" that go into the model, including incoming tidal boundary conditions, bathymetry, etc., some of which are known to varying levels of uncertainty. 

In this example we consider bed roughness (or bottom friction) as the parameter we wish to invert for, given time series of tidal heights at the tide gauges indicated by the red dots in the following image, which also shows the discretised domain and the computational mesh.

<img src="figures/pasc2019_luporini-page.svg"/>

**Full Waveform Inversion**: Sending waves through the Earth to detect the properties of the different materials below the surface. We have a numerical model that solves a PDE for the waves travelling through the Earth. The different materials/properties, which affect the wave speed, are inputs to the model, and using the techniques described today we can invert for these inputs and thus detect the composition of the different layers. This can be phrased as an optimisation problem, as we are optimising for those inputs such that the numerical model best predicts the response that has been measured by the receivers. You will find out more about this next week. Similar techniques are also used in medical imaging.

# Continuous and Discontinuous PDE-Constrained Optimisation

We consider the optimisation problem:

$$
  \text{minimize} f(u, m)\;\;
  \text{subject to }g(u, m) = 0
$$

where $u$ is a solution to a PDE (or set of PDEs) $g(u, m) = 0$, and $m$ encodes any inputs to the PDE: initial and boundary conditions, physical parameters such as density, thermal conductivity, viscosity, forcing terms, etc. In the context of an optimisation we only include those parameters in $m$ that we actually want to vary, and assume the others as given.

If we approach this problem analytically, that is before numerical discretisation, the solutions $u$ are functions in some function space $V$. The model parameters may be scalar constants, but could also be time-varying and/or spatially varying functions in some function space $M$. The function $f$ is then a real-valued function $f:V\times M \to \R$. It is often referred to as the *functional*. The function $g$ encodes the PDE including boundary and initial conditions, which is viewed as a constraint on $u$ and $m$. We will refer to this formulation as the <a class="definition" href="#definitions" id="continuousoptimisationproblem">continuous optimisation problem</a> where we optimise a functional that depends on the solution $u$ of a continuous PDE.

We will also consider the same optimisation problem *after discretisation*. In this case the numerical solution is given by a vector $\vec u\in\R^U$, for instance describing the solution at all grid points, or in the case of the finite element method these may be the coefficients of the solution with respect to the finite element basis functions. Note that in the case of a time-dependent PDE, $\vec u$ contains the solution at all timesteps; so for instance in a 2D finite difference discretisation, if we have $N_x\times N_y$ grid-points and $N_t$ timesteps, then $\vec u\in \R^U=\R^{N_t \times N_x\times N_y}$. Similarly, if $m$ refers to (one or more) time- or spatially varying functions used in the PDE, then $\vec m\in \R^M$ describes a discretised version of these. The function $g$ is given by the discretised PDE. In general the number of discretised equations is the same as the number $U$ of solution values in space and time. So we write: $g: \R^U\times \R^M \to \R^U$.
This PDE-constrained optimisation problem, where we optimise a functional that depends on the numerical solution $\vec u$ of a discretised PDE, will be called the <a class="definition" href="#definitions" id="discontinousoptimisationproblem">discontinuous optimisation problem</a>.

(Note: the literature on this subject usually reserves the letter $J$ for the functional, and $F$ for the PDE-constraint. Here we will use $f$ and $g$ to be consistent with previous lectures.)

## Example: 1D Advection Equation <a class="tocSkip"></a>
Consider the one-dimensional scalar advection PDE:

$$
  \frac{\partial u(x,t)}{\partial t} + v\frac{\partial u(x,t)}{\partial x} = 0
$$

in a domain $x \in [0, L]$ and with $t \in [0, T]$, a velocity $v$ and initial and boundary conditions

$$
  \text{initial condition: }\;u(x,0) = u_{\text{ic}}(x), \;\;
  \text{boundary condition: }\;u(0,t) = u_{\text{bc}}(t)
$$

Consider the *inversion problem*: which initial condition $u_{\text{ic}}(x)$ produces an observed end state at $t=T$ of $u(x, T)=u_{\text{end}}(x)$? In this problem, the initial condition function $u_{\text{ic}}$ is the unknown; we write $m=u_{\text{ic}}$, and formulate 
an *optimisation problem* based on the following functional:

$$
  f(u, m) = \int_0^L (u(x,T)-u_{\text{end}}(x))^2 \mathrm{d}x
$$

which we minimize for $m$ with the *constraint* that $u$ and $m$ satisfy

$$
  g(u, m) = \begin{pmatrix}  
  \frac{\partial u}{\partial t} + v\frac{\partial u}{\partial x} \\
  u(\cdot, 0) - m \\
  u(0, \cdot) - u_{\text{bc}}
  \end{pmatrix} =
  \begin{pmatrix}
    0 \\ 0 \\ 0
  \end{pmatrix}
  \;\;\;
  \begin{array}{c}
    \text{PDE constraint} \\ \text{initial condition} \\ \text{boundary condition}
  \end{array}
$$

Note that $g$ does not map $u$ and $m$ to three scalars, but to three function expressions which need to be zero for all $x \in [0,L]$ and $t \in [0,T]$.

In this particular example $f$ does not actually explicitly depend on $m$, in other words $\frac{\partial f}{\partial m}=0$; rather, the dependency on $m$ comes through the fact that $u$ needs to be a solution of the PDE with initial condition $u(x,0)=m(x)$. In the more general case $f$ may depend on $m$ both implicitly (through the PDE constraint) and explicitly. Also, the functional does not necessarily depend on the end state of $u$ only; for instance in a time-dependent ocean model, we may have a time series of observations at a fixed point and the functional could be the time-average of the difference between the observed value and the value predicted by the model.

Instead of inverting for the initial condition, we could also invert for the boundary condition with $m=u_{\text{bc}}$ or for a spatially or temporally varying velocity $m=v(x,t)$.

## Example: Discretised 1D Advection Equation  <a class="tocSkip"></a>

The simplest discretisation of this PDE is given by the following explicit first order upwind formula:

$$
  \frac{u^{n+1}_i-u^n_i}{\Delta t} + v_i^n \frac{u^{n}_i-u^n_{i-1}}{\Delta x} = 0
$$

where we assume that the velocity is always positive $v^n_i>0$. Here the superscript $n$, with $0\leq n \leq N_t-1$ refers to the time level, and the subscript $i$ with $0\leq i \leq N_x-1$ to the grid point. To simplify the formulas, we multiply by $\Delta t$ and write

\begin{align*}
  u^{n+1}_i-u^n_i + k_i^n \left(u^{n}_i-u^n_{i-1}\right) &= 0,\;\;
  \text{with }\;k_i^n = \frac{v_i^n\Delta t}{\Delta x} \\
  \text{or}\;\;
  u^{n+1}_i+(k_i^n-1)u^n_i - k_i^n u^n_{i-1} &= 0
\end{align*}

The initial and boundary conditions are given by:

$$
  \text{initial condition: }\;u^0_i = u_{\text{ic}, i}\;\;
  \text{boundary condition: }\;u^n_0 = u_{\text{bc}}^n
$$
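The upwind update above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a reference implementation: the function name `advect_upwind`, the array layout `u[n, i]` for $u^n_i$, and the test data are assumptions made here.

```python
import numpy as np

def advect_upwind(u_ic, u_bc, k):
    """Explicit first-order upwind time stepping (illustrative sketch).

    u_ic: initial condition u^0_i, shape (N_x,)
    u_bc: inflow boundary values u^n_0, shape (N_t,) (entry 0 is not used)
    k:    v*dt/dx, a scalar or an array of shape (N_t, N_x)
    Returns u with shape (N_t, N_x), where u[n, i] = u^n_i.
    """
    N_t, N_x = len(u_bc), len(u_ic)
    k = np.broadcast_to(k, (N_t, N_x))
    u = np.zeros((N_t, N_x))
    u[0] = u_ic
    for n in range(N_t - 1):
        u[n + 1, 0] = u_bc[n + 1]  # boundary condition at grid point 0
        # u^{n+1}_i = u^n_i - k^n_i (u^n_i - u^n_{i-1}) for i >= 1
        u[n + 1, 1:] = u[n, 1:] - k[n, 1:] * (u[n, 1:] - u[n, :-1])
    return u

# A single pulse with k = 1 and zero inflow (made-up test data)
u_ic = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
u_bc = np.zeros(3)
u = advect_upwind(u_ic, u_bc, k=1.0)
print(u[2])
```

With $k=1$ the update reduces to $u^{n+1}_i = u^n_{i-1}$, so the pulse moves exactly one grid cell per timestep.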

Let us again consider the case where we are inverting for an initial condition, thus we write $m_i = u_{\text{ic}, i}$ and $\vec m$ is a vector in $\R^{N_x}$. The solution vector $\vec u$ contains the solution at all timesteps, so $\vec u\in \R^{N_t\times N_x}$. We can write the discretisation in a single linear system

$$
\begin{array}{c}
\text{time step}\; 0 \left\{
\begin{array}{c}
\\ \\ \\ \\
\end{array}\right. \\[2pt]
\text{time step}\; 1 \left\{
\begin{array}{c}
\\ \\ \\ \\
\end{array}\right. \\[2pt]
\text{time step}\; 2 \left\{
\begin{array}{c}
\\ \\ \\ \\
\end{array}\right. \\[2pt]
\vdots
\end{array}
\left(
\begin{array}{cccc|cccc|cccc|c}
1 & 0 & 0 & \dots \\
      0 & 1 & 0 & \dots \\
      0 & 0 & 1 & \dots \\
      \vdots & \vdots & \vdots & \ddots \\
      \hline
      0 & 0 & 0 & \dots &
      1 & 0 & 0 & \dots \\
      -k^1_1 & k^1_1 - 1 & 0 & \dots &
      0 & 1 & 0 & \dots \\
      0 & -k^1_2 & k^1_2 -1 & \dots &
      0 & 0 & 1 & \dots \\
      \vdots & \vdots & \vdots & \ddots &
      \vdots & \vdots & \vdots & \ddots \\
      \hline
      &&&& 0 & 0 & 0 & \dots &
      1 & 0 & 0 & \dots \\
      &&&& -k^2_1 & k^2_1 -1 & 0 & \dots &
      0 & 1 & 0 & \dots \\
      &&&& 0 & -k^2_2 & k^2_2 - 1 & \dots &
      0 & 0 & 1 & \dots \\
      &&&& \vdots & \vdots & \vdots & \ddots &
      \vdots & \vdots & \vdots & \ddots \\
      \hline
      &&&& &&&& &&&& \ddots
  \end{array}\right)
 \left(
\begin{array}{c}
    u^0_0 \\ u^0_1 \\ u^0_2 \\ \vdots \\
    \hline
    u^1_0 \\ u^1_1 \\ u^1_2 \\ \vdots \\
    \hline
    u^2_0 \\ u^2_1 \\ u^2_2 \\ \vdots \\
    \hline
    \vdots
  \end{array}\right)
  =
 \left(
\begin{array}{c}
    m_0 \\ m_1 \\ m_2 \\ \vdots \\
    \hline
    u^1_{\text{bc}} \\ 0 \\ 0 \\ \vdots \\
    \hline
    u^2_{\text{bc}} \\ 0 \\ 0 \\ \vdots \\
    \hline
    \vdots
  \end{array}\right)
$$

or written as a constraint

$$
  g(\vec u, \vec m) = \mat A \vec u - \vec b = 0
$$

Here $g$ is thus a function $g: \R^{N_t\times N_x}\times \R^{N_x} \to \R^{N_t\times N_x}$.
In our example of inverting for initial condition, it is $\vec b$ that depends on $\vec m$. If we were inverting for velocity it would be the matrix $\mat A$ that depends on $\vec m$.

The matrix $\mat A$ above consists of a number of $N_x\times N_x$ identity matrix-blocks along the diagonal, one for each time level. Further nonzero $N_x\times N_x$ matrix-blocks are found just to the left of this block diagonal; these contain the references to values at the previous timestep used in the discretisation. All other blocks are zero. This is the general structure of an explicit model. An implicit (two-level) time integration produces more general matrix blocks along the diagonal, but is also nonzero only in the blocks on the diagonal and just to the left of it. In a nonlinear model, in which we solve $U$ nonlinear discretised equations $g(\vec u, \vec m)=0$, the $U\times U$ matrix $\partial g/\partial u$ has the same structure. This matrix (in the linear case we simply have $\partial g/\partial u=\mat A$) will play an important role in the rest of this lecture.
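To make the block structure concrete, the following sketch assembles $\mat A$ and $\vec b$ as dense arrays for a small problem and checks that solving $\mat A\vec u=\vec b$ reproduces plain time stepping. The sizes, the constant value of $k$, the boundary value and the helper `idx` are assumptions chosen for illustration; a real model would not form this matrix explicitly.

```python
import numpy as np

N_t, N_x, k = 4, 5, 0.5   # small illustrative sizes, constant k = v*dt/dx
U = N_t * N_x

def idx(n, i):
    """Position of u^n_i in the stacked solution vector (hypothetical helper)."""
    return n * N_x + i

# Identity blocks on the diagonal, upwind coefficients in the block to the left
A = np.eye(U)
for n in range(N_t - 1):
    for i in range(1, N_x):
        A[idx(n + 1, i), idx(n, i)] = k - 1.0
        A[idx(n + 1, i), idx(n, i - 1)] = -k

# Right-hand side: initial condition m, then boundary values at i = 0
m = np.sin(np.linspace(0.0, np.pi, N_x))
u_bc = np.full(N_t, 0.25)
b = np.zeros(U)
b[:N_x] = m
for n in range(1, N_t):
    b[idx(n, 0)] = u_bc[n]

u_solve = np.linalg.solve(A, b).reshape(N_t, N_x)

# Reference: ordinary explicit time stepping
u_step = np.zeros((N_t, N_x))
u_step[0] = m
for n in range(N_t - 1):
    u_step[n + 1, 0] = u_bc[n + 1]
    u_step[n + 1, 1:] = u_step[n, 1:] - k * (u_step[n, 1:] - u_step[n, :-1])

print(np.allclose(u_solve, u_step))   # both routes give the same solution
print(np.allclose(A, np.tril(A)))     # A is lower triangular
```

Because $\mat A$ is lower triangular with a unit diagonal, solving the system by forward substitution is the same computation as stepping forward through the time levels.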

## The Reduced Problem
In PDE-constrained optimisation we assume that for a given choice of model parameters $m$ we can uniquely solve for a $u$ that satisfies the PDE (which may also depend on other model parameters that we choose not to vary). In other words for each $m$ we can find a unique $u$ that satisfies the PDE constraint $g(u,m)=0$. This means we can consider $u$ a *function* $u(m)$ of $m$, which returns the solution for each choice of model parameter $m$. In this light, we can define the <a class="definition" href="#definitions" id="reducedfunctional">reduced functional</a>

$$
  \hat f(m) = f(u(m), m)
$$

which is a function of $m$ only. In order to evaluate the reduced functional $\hat f$ for a given $m$, we need to first solve the PDE to obtain the solution $u(m)$ and substitute the solution in the functional $f$.

The advantage of this formulation is that we can now define our optimisation problem as an *unconstrained optimisation problem*:

$$
  \text{minimize}\;\; \hat f(m)
$$

In principle we could try to solve this problem by evaluating the function for many values of $m$. However, as we have seen in our lectures so far, *efficient* optimisation algorithms make use of the gradient of the function to be optimised. In PDE-constrained optimisation problems in particular, a single evaluation of the function can be very expensive. Thus, unless the model is relatively cheap and there are only a few feasible values of $m$, say because we are only optimising for a single parameter, access to gradient information becomes essential.
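As a sketch of what evaluating $\hat f$ involves for the discretised advection example, the code below first solves the forward model and then substitutes the end state into the functional. The function names, the rectangle-rule approximation of the integral, and the synthetic "observed" end state are all assumptions made for this illustration.

```python
import numpy as np

def solve_forward(m, u_bc, k):
    """Upwind model with initial condition m; returns u[n, i] = u^n_i."""
    N_t, N_x = len(u_bc), len(m)
    u = np.zeros((N_t, N_x))
    u[0] = m
    for n in range(N_t - 1):
        u[n + 1, 0] = u_bc[n + 1]
        u[n + 1, 1:] = u[n, 1:] - k * (u[n, 1:] - u[n, :-1])
    return u

def reduced_functional(m, u_end, u_bc, k, dx):
    """f_hat(m) = f(u(m), m): solve the model, then evaluate the functional."""
    u = solve_forward(m, u_bc, k)
    # rectangle-rule approximation of the integral of (u(x,T) - u_end(x))^2
    return dx * np.sum((u[-1] - u_end) ** 2)

# Synthetic test: generate u_end from a known initial condition
N_t, N_x, k, dx = 10, 40, 0.5, 0.025
x = dx * np.arange(N_x)
u_bc = np.zeros(N_t)
m_true = np.exp(-((x - 0.3) / 0.1) ** 2)
u_end = solve_forward(m_true, u_bc, k)[-1]

f_true = reduced_functional(m_true, u_end, u_bc, k, dx)
f_guess = reduced_functional(np.zeros(N_x), u_end, u_bc, k, dx)
print(f_true, f_guess)
```

Every call to `reduced_functional` costs one full model run, which is why a derivative-free search over many values of $m$ quickly becomes expensive.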

# The Tangent Linear Approach

The derivative of $\hat f(m)$ with respect to $m$ can be expanded as

$$
  \frac{d\hat f(m)}{dm} = \frac{df(u(m), m)}{dm} = 
  \left.\frac{\partial f(u, m)}{\partial u}\right|_{u=u(m)} \frac{du(m)}{dm} + 
  \left.\frac{\partial f(u, m)}{\partial m}\right|_{u=u(m)}
$$

The second term, the partial derivative of $f$ with respect to $m$, can often easily be derived from the explicit dependency of $f$ on $m$. In the example above we in fact had $\frac{\partial f}{\partial m}=0$. For the first term we need to know $du/dm$, and it is not immediately clear how to obtain it, as $u(m)$ is a function of $m$ through solving a PDE.

The most straightforward approach is to consider the fact that

$$
  g(u(m), m) = 0
$$

holds for every $m$, and thus the $m$-derivative of $g(u(m), m)$ should also be zero:

$$
  \frac{dg(u(m), m)}{dm} = 
  \left.\frac{\partial g(u, m)}{\partial u}\right|_{u=u(m)} \frac{du(m)}{dm} + 
  \left.\frac{\partial g(u, m)}{\partial m}\right|_{u=u(m)} = 0
$$

which means we could solve

$$
  \left.\frac{\partial g(u, m)}{\partial u}\right|_{u=u(m)} \frac{du(m)}{dm} = - 
  \left.\frac{\partial g(u, m)}{\partial m}\right|_{u=u(m)}
  \tag{tangent-linear-equation} \label{tangent-linear-equation}
$$

for $du/dm$.



The meaning of this <a class="definition" href="#definitions" id="tangentlinearequation">tangent linear equation</a> is easiest to understand for the case that $m$ is just a single parameter (a value) in the model. In that case, if we consider an (infinitesimally) small variation $\delta m$ in $m$, the change in $u(m)$ is given by

$$
  \delta u = \frac{du}{dm}\delta m
$$

Here $\delta u$, and therefore $du/dm$, is just a function of $x$ and $t$ that represents the perturbation to $u$ in all points $x$ and for all times $t$. The tangent linear equation above provides a PDE that can be solved to obtain $du/dm$. This PDE is a modification of the original PDE. In particular, even if the original PDE is nonlinear, the PDE described by $\eqref{tangent-linear-equation}$ is linear.

## Example: Single Parameter Tangent Linear Equation for 1D Advection <a class="tocSkip"></a>
In the 1D advection example if we assume that the velocity $v$ is constant in space and time, then we can 
choose it as a single parameter $m$ to optimise for. This means that we write the PDE constraint as

$$
  g(u, m) = \begin{pmatrix}  
  \frac{\partial u}{\partial t} + m\frac{\partial u}{\partial x} \\
  u(\cdot, 0) - u_{\text{ic}}(\cdot) \\
  u(0, \cdot) - u_{\text{bc}}(\cdot)
  \end{pmatrix} =
  \begin{pmatrix}
    0 \\ 0 \\ 0
  \end{pmatrix}
  \;\;\;
  \begin{array}{c}
    \text{PDE constraint} \\ \text{initial condition} \\ \text{boundary condition}
  \end{array}
$$

where now both the initial and boundary condition $u_{\text{ic}}$ and $u_{\text{bc}}$ are assumed given.

If we substitute the perturbation $\delta u$ to the solution $u$ due to a perturbation $\delta m$ in $m$
the tangent linear equation reads

$$
  \frac{\partial g(u(m), m)}{\partial u} \delta u = - 
  \frac{\partial g(u(m), m)}{\partial m}\delta m
$$

We work out the left-hand side

$$
  \frac{\partial g(u(m), m)}{\partial u}\delta u = 
  \begin{pmatrix}  
  \frac{\partial \delta u}{\partial t} + m\frac{\partial \delta u}{\partial x} \\
  \delta u(\cdot, 0) \\
  \delta u(0, \cdot)
  \end{pmatrix}
$$

and the right-hand side

$$
-\frac{\partial g(u(m), m)}{\partial m}\delta m =
\begin{pmatrix}  
  -\delta m\frac{\partial u}{\partial x} \\
  0 \\
  0
  \end{pmatrix}
$$

separately. Combined this gives

\begin{align*}
  \frac{\partial \delta u(x,t)}{\partial t} + m\frac{\partial \delta u(x, t)}{\partial x} = 
  -\delta m\frac{\partial u(x,t)}{\partial x} \\
  \text{with}\; \delta u(x, 0)=0\;\;
  \text{and}\; \delta u(0, t)=0
\end{align*}

Note that this is just a PDE for $\delta u$. In fact it is the same PDE that we solve for $u$, except that it has a source term on the right-hand side (which we know if we have solved for $u$ first). If the original PDE were nonlinear, terms on the left-hand side would have been different as well, and we would have solved a linearised version of the PDE for $\delta u$. Further, we note that unlike for $u$, the boundary and initial conditions for $\delta u$ are both zero. This makes sense, as the perturbation $\delta u$, after adding it to the original solution $u$, should not alter the initial and boundary conditions of the original PDE.
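A discrete analogue of this tangent linear equation can be checked numerically. The sketch below differentiates the upwind update with respect to a constant velocity $v$, steps the resulting equation for $du/dv$ alongside the forward model, and compares against a central finite difference. The function name, grid sizes and test data are assumptions for illustration only.

```python
import numpy as np

def forward_and_tangent(v, u_ic, u_bc, dt, dx):
    """Upwind solution u and its derivative du/dv for constant velocity v."""
    N_t, N_x = len(u_bc), len(u_ic)
    k = v * dt / dx
    u = np.zeros((N_t, N_x))
    du = np.zeros((N_t, N_x))   # zero initial and boundary values for du/dv
    u[0] = u_ic
    for n in range(N_t - 1):
        diff = u[n, 1:] - u[n, :-1]
        ddiff = du[n, 1:] - du[n, :-1]
        u[n + 1, 0] = u_bc[n + 1]
        u[n + 1, 1:] = u[n, 1:] - k * diff
        # same update applied to du, plus a source term built from u itself
        du[n + 1, 1:] = du[n, 1:] - k * ddiff - (dt / dx) * diff
    return u, du

N_t, N_x, dt, dx, v = 20, 30, 0.01, 0.05, 2.0
x = dx * np.arange(N_x)
u_ic = np.exp(-((x - 0.4) / 0.15) ** 2)
u_bc = np.zeros(N_t)

u, du = forward_and_tangent(v, u_ic, u_bc, dt, dx)

# Central finite-difference check of du/dv
eps = 1e-6
u_plus, _ = forward_and_tangent(v + eps, u_ic, u_bc, dt, dx)
u_minus, _ = forward_and_tangent(v - eps, u_ic, u_bc, dt, dx)
du_fd = (u_plus - u_minus) / (2 * eps)
print(np.max(np.abs(du - du_fd)))
```

The source term `-(dt / dx) * diff` is the discrete counterpart of $-\partial u/\partial x$ on the right-hand side, and `du` keeps zero initial and boundary values exactly as $\delta u$ does in the continuous equation.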

## Multiple Parameters and Functions as Parameters  <a class="tocSkip"></a>

If there are multiple, say $M$, parameters meaning that $\vec m\in\R^M$, then $du/dm$ in fact represents $M$ functions, each containing the perturbation to $u$ in response to a perturbation in one of the $M$ parameters.
For a perturbation $\delta m_i\in\R$ in the $i$-th parameter $m_i$, the corresponding change $\delta u_i$ in the solution $u$ can be solved as

$$
  \frac{\partial g}{\partial u}\delta u_i = - 
  \frac{\partial g}{\partial m_i}\delta m_i 
  \;\;\text{with }\;\delta u_i = \frac{du}{dm_i}\delta m_i
$$

However, to find the derivative of $u(m)$ in all directions, i.e. for perturbations in all $M$ parameters, we would need to solve this PDE $M$ times.

As we have seen above we would also like to consider parameters that are in fact functions of space and time, e.g. initial and boundary conditions, in which case there are in fact infinitely many perturbations $\delta m$ to consider. If $m(x)$ represents the initial condition (as in the example before) then $\delta m(x)$ represents a perturbation to it in every point $x$ of the domain.

## Discrete Tangent Linear Model  <a class="tocSkip"></a>

We get the same picture if we apply the tangent linear approach after discretisation. Now $g$ represents the discretised equations. In our example we had $N_t\cdot N_x$ equations for $N_t$ timesteps and $N_x$ gridpoints, which we solve for $\vec u$, a $U$-length vector containing the numerical solution in all grid points and at all time steps. Thus $\partial g/\partial u$ becomes a $U \times U$ matrix. Similarly, for the other terms in the tangent linear equation we have

$$
  \underset{\bf U\times U}{\frac{\partial g}{\partial u}}\;
  \underset{\bf U\times M}{\frac{du}{dm}}
  = - 
  \underset{\bf U\times M}{\frac{\partial g}{\partial m}}
$$

This can be solved for each column in $du/dm$ separately with the corresponding column of $\partial g/\partial m$ on the right-hand side, but each of those $M$ solves is equivalent to running an entire discretised model through all its timesteps. If $m$ represents an initial condition, then the number of model parameters $M$ is the same as the number of gridpoints, which makes this method infeasible.

One advantage of this method (unlike the adjoint method which we'll discuss next) is that it is cheap to consider multiple functionals. Once we have calculated $du/dm$, we can compute the derivative $d\hat f(m)/dm$ for arbitrarily many different functionals $f$ nearly for free, as the partial derivatives $\partial f/\partial u$ and $\partial f/\partial m$ are cheap to compute in comparison.

**Conclusion**: the *tangent linear method* is useful for cases with a small number of parameters, and a potentially large number of functionals.
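The cost structure of the discrete tangent linear model can be sketched for the case where $\vec m$ is the initial condition of the upwind example, so that $g(\vec u,\vec m)=\mat A\vec u-\vec b(\vec m)$. The dense matrices, the small sizes, and the two functionals (a misfit against a hypothetical target `u_target` and the integral of the end state) are assumptions for illustration.

```python
import numpy as np

N_t, N_x, k, dx = 6, 8, 0.5, 0.1
U, M = N_t * N_x, N_x    # m is the initial condition, so M = N_x

# dg/du = A for the upwind scheme with constant k
A = np.eye(U)
for n in range(N_t - 1):
    for i in range(1, N_x):
        A[(n + 1) * N_x + i, n * N_x + i] = k - 1.0
        A[(n + 1) * N_x + i, n * N_x + i - 1] = -k

# dg/dm = -db/dm: m only enters the first N_x entries of b
dg_dm = np.zeros((U, M))
dg_dm[:N_x, :] = -np.eye(N_x)

# One linear solve (one linearised model run) per parameter
du_dm = np.zeros((U, M))
for j in range(M):
    du_dm[:, j] = np.linalg.solve(A, -dg_dm[:, j])

# Forward solution for a made-up initial condition and zero boundary values
m = np.exp(-((np.arange(N_x) - 2.0) ** 2))
u = np.linalg.solve(A, np.concatenate([m, np.zeros(U - N_x)]))
u_target = np.zeros(N_x)   # hypothetical target end state

# Two functionals of the end state; neither depends explicitly on m
df1_du = np.zeros(U)
df1_du[-N_x:] = 2.0 * dx * (u[-N_x:] - u_target)   # f1: squared misfit
df2_du = np.zeros(U)
df2_du[-N_x:] = dx                                 # f2: integral of end state

# Reusing du_dm, each extra gradient is a single vector-matrix product
grad_f1 = df1_du @ du_dm
grad_f2 = df2_du @ du_dm
print(grad_f1.shape, grad_f2.shape)
```

The loop over `j` is where the expense lies: it grows linearly with $M$, while each additional functional only adds one cheap product with the already computed `du_dm`.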

# The Adjoint Equation

Let's return to the optimisation formulated as a constrained optimisation problem. We consider any potential solutions $u$ and model parameters $m$, and minimize $f(u, m)$ over all of these but with the constraint that $g(u, m)=0$, i.e. $u$ satisfies the PDE constraint which depends on $m$.

In lecture 6 we have seen how solutions to a constrained optimisation problem can be found by searching for stationary points of the Lagrangian

$$
  \mathcal{L}(u, m, \lambda) = f(u, m) - \lambda\cdot g(u, m)
$$

Note that both $u$ and $m$ are variables that we optimize over; in lecture 6 these would be combined in a single vector $\vec x$. Another difference with lecture 6 is the sign in front of $\lambda$; this is merely a sign convention that gives the same results after changing $\lambda \to -\lambda$. 
The Lagrange multiplier $\lambda$ is easiest to think about in the discrete case. Then $g(u,m)$ represents $U$ equations ($U=N_t\cdot N_x$ in our example) or constraints, and $\lambda$ should be a vector of length $U$ as well.

Before discretisation $g(u, m)=0$ is an equation that should be satisfied everywhere, i.e. for all locations $\vec x$ in the spatial domain $\Omega$ and for each time $t \in [0, T]$. It is therefore best to think about those as different constraints for all locations and times, and thus $\lambda$ should be a function that has a value $\lambda(\vec x, t)$ for each $\vec x\in\Omega, t\in [0,T]$. The dot-product should then be interpreted as

$$
  \lambda\cdot g(u, m) = \int_{t=0}^T \int_\Omega \lambda(\vec x, t) g(u(\vec x, t), m) \mathrm{d}\vec x \mathrm{d}t
$$

In addition to the actual PDE, $g$ also encodes initial and boundary conditions. This gives Lagrange multiplier functions $\lambda_{\text{ic}}(\vec x)$ associated with the initial condition enforced in each $\vec x \in \Omega$, and $\lambda_{\text{bc}}(\vec x, t)$ associated with the boundary condition for each point $\vec x$ on the boundary $\partial\Omega$ and all times $t$. The full expression for $\lambda\cdot g$ then becomes

$$
  \lambda\cdot g(u, m) = \int_{t=0}^T \int_\Omega \lambda(\vec x, t) g(u(\vec x, t), m) \mathrm{d}\vec x \mathrm{d}t + \int_\Omega \lambda_{\text{ic}}(\vec x)\left(u(\vec x,0)-u_{\text{ic}}(\vec x)\right) \mathrm{d}\vec x
  + \int_{t=0}^T \int_{\partial\Omega} \lambda_{\text{bc}}(\vec x, t)\left(u(\vec x,t)-u_{\text{bc}}(\vec x,t)\right) \mathrm{d}\vec x\mathrm{d}t
$$



To find the stationary points of $\mathcal{L}$ we simply write down its partial derivatives and set those to zero

\begin{align*}
  \frac{\partial\mathcal{L}(u, m, \lambda)}{\partial u} = \frac{\partial f(u,m)}{\partial u} - \lambda\cdot \frac{\partial g(u, m)}{\partial u} &= 0 \tag{adjoint-equation} \label{adjoint-equation} \\
  \frac{\partial\mathcal{L}(u, m, \lambda)}{\partial m} = \frac{\partial f(u,m)}{\partial m} - \lambda\cdot \frac{\partial g(u, m)}{\partial m} &= 0 \tag{m-equation} \\
  \frac{\partial\mathcal{L}(u, m, \lambda)}{\partial\lambda} = g(u, m) &= 0 \tag{PDE-constraint}
\end{align*}

Note that we now have three equations, since we have split the variables that we optimise over into solutions $u$ and parameters $m$, so that instead of a single equation (the $\partial \mathcal{L}/ \partial x = 0$ equation in lecture 6), we get two. The third equation is, as before, the constraint equation.

The first equation we have labeled adjoint-equation. In the case of real-valued matrices and vectors (or functions), the adjoint is simply the transpose. The reason for the name of this equation becomes clear after taking the transpose (adjoint) of this equation. Again let's first consider the discrete case

\begin{align*}
  \underset{\bf 1\times U}{\frac{\partial f(\vec u, \vec m)}{\partial\vec u}} - \underset{\bf 1\times U}{\lambda^T} 
  \underset{\bf U\times U}{\frac{\partial g(\vec u, \vec m)}{\partial\vec u}} = 0 \\
  \implies
  \underset{\bf U\times U}{\left(\frac{\partial g(\vec u, \vec m)}{\partial \vec u}\right)^T}
  \underset{\bf U\times 1}{\lambda} = \underset{\bf U\times 1}{\left(\frac{\partial f(\vec u,\vec m)}{\partial \vec u}\right)^T}
\end{align*}

Thus we see that by taking the transpose of the matrix $\partial g/\partial\vec u$ - in the tangent linear approach this matrix described a linearisation of the discretised PDE - we obtain an equation that we can solve for $\lambda$. We will see in a minute how to interpret the solution to the transpose of a linear system that corresponds to a discretised PDE.

## Background: Variational Calculus (\*)
<div class="optional">
Before we move on to the continuous case, we first need to better understand what derivatives like $\partial\mathcal{L}/\partial u$ actually mean when $u$ is a continuous function. For functions of multiple (but finitely many) variables we use the concept of a directional derivative:

$$
  \lim_{h\to 0} \frac{f(\vec x+h\vec v) - f(\vec x)}{h}
$$

for the derivative of $f$ in the direction of $v$, which we can relate to the gradient vector $\partial f/\partial\vec x$ by

<a name="directional_derivative"></a>
$$
  \frac{\partial f(\vec x)}{\partial\vec x}\cdot\vec v = \lim_{h\to 0} \frac{f(\vec x+h\vec v) - f(\vec x)}{h} \tag{directional_derivative} \label{directional_derivative}
$$

and in fact we can define $\partial f/\partial\vec x$ as the unique vector for which this equation holds for all $\vec v$.
</div>

<div class="optional">
Now instead of vectors $\vec x$, let's consider a function $u:\R\to\R$ as the input for $f$. For now let's consider $u$ to be a function of time $t$. The functional $f$ (a function that takes a function as an input) takes $u$, i.e. all values of $u(t)$ at all times $t$, and returns a number, for instance:

$$
  f(u) = \int u(t)^3 dt
$$


A variation in $u$ is another function, say $v$, which we consider as a perturbation of $u$ at all times $t$: $u(t) \to u(t)+v(t)$. We can now define the derivative of $f$ in the direction of $v$ to be

$$
      \lim_{h\to 0} \frac{f(u+h v) - f(u)}{h}
$$

which we can work out as

\begin{align}
  &= \lim_{h\to 0} \frac{\int (u(t)+h v(t))^3 dt - \int u(t)^3 dt}h \\
  &=
  \lim_{h\to 0} \frac{\int (u(t)+h v(t))^3 - u(t)^3 dt}h 
  & \text{combine integrals} \\
  &=
  \int \lim_{h\to 0} \frac{(u(t)+h v(t))^3 - u(t)^3}h dt
  & \text{move limit inside integral} \\
  &=
  \int \lim_{\delta u\to 0} \frac{(u(t)+\delta u)^3 - u(t)^3}{\delta u} v(t) dt & \text{(substitute $\delta u=h v(t)$)} \\
  &=
  \int 3u(t)^2 v(t) dt
\end{align}

Since we know the directional derivative of $f$ for *any* perturbation $v$, we can now say that the derivative of $f$ with respect to $u$ is given by

$$
  \frac{\partial f}{\partial u} = 3 u^2
$$

which makes $\partial f/\partial u$ itself a function of $t$. By defining the following inner (dot) product between two functions:

$$
  u \cdot v = \int u(t)v(t) dt
$$

we then indeed have

\begin{align}
    \frac{\partial f(u)}{\partial u}\cdot v = \lim_{h\to 0} \frac{f(u + hv) - f(u)}{h} = \int 3u(t)^2 v(t) dt
\end{align}

for all perturbations $v$ of $u$.

So far the results are pretty intuitive: you can interpret $\partial f/\partial u$ simply as the derivative with respect to the value of $u(t)$ at all different times $t$. This is similar to the derivative with respect to a vector $\vec x$ being the derivative with respect to the individual $x_i$ for all indices $i$. The sum over $i$ that's implicit in the dot product of [(directional_derivative)](#directional_derivative) is replaced by an integral over $t$.
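
The identity above can be checked numerically. The following is a minimal sketch, not part of the original derivation: the grid, the particular choices of $u$ and $v$, and the helper `trapezoid` are all illustrative assumptions. It approximates the integrals with the trapezoidal rule and compares a finite-difference directional derivative of $f(u)=\int u^3 dt$ with $\int 3u^2 v\, dt$.

```python
import math

def trapezoid(values, dt):
    """Composite trapezoidal rule on a uniform grid (illustrative helper)."""
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))

n = 2001
dt = 1.0 / (n - 1)
t = [i * dt for i in range(n)]
u = [math.sin(2.0 * ti) + 1.0 for ti in t]   # an arbitrary smooth u(t)
v = [math.cos(3.0 * ti) for ti in t]         # an arbitrary perturbation v(t)

def f(w):
    """Discrete approximation of the functional f(w) = int w(t)^3 dt."""
    return trapezoid([wi ** 3 for wi in w], dt)

# Directional derivative by a one-sided finite difference in h
h = 1e-6
fd = (f([ui + h * vi for ui, vi in zip(u, v)]) - f(u)) / h

# The inner product of df/du = 3 u^2 with v
exact = trapezoid([3.0 * ui ** 2 * vi for ui, vi in zip(u, v)], dt)

print(fd, exact)
```

The two printed numbers should agree up to the $\mathcal{O}(h)$ error of the finite difference.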
</div>

<div class="optional">
Things get a little trickier if we have functionals of the following form

$$
  f(u) = \int \left(\frac{\partial u(t)}{\partial t}\right)^2 dt
$$

To simplify the analysis we introduce the following notation:

$$
  \delta f(u) = f(u+\delta u) - f(u)
$$

which is an expression for the perturbation in $f(u)$ when we add a tiny perturbation $\delta u$ to $u$. For such a small perturbation $\delta u$ we can write

$$
  f(u+\delta u) = f(u) + \frac{\partial f(u)}{\partial u}\cdot \delta u
     + \mathcal{O}(\|\delta u\|^2) \\
  \delta f(u) = \frac{\partial f(u)}{\partial u}\cdot \delta u
     + \mathcal{O}(\|\delta u\|^2)
$$

Thus by working out $\delta f(u)$ and neglecting any higher order terms in $\delta u$, we can obtain the directional derivative of $f$ in the direction of $\delta u$.

Let's apply this to the example above:

\begin{align}
   \delta f(u) &= \int \left(\frac{\partial \left[u(t)+\delta u(t)\right]}{\partial t}\right)^2 dt - \int \left(\frac{\partial u(t)}{\partial t}\right)^2 dt \\
   &= \int \left(u'(t)\right)^2 + 2u'(t)\delta u'(t) + \left(\delta u'(t)\right)^2 - \left(u'(t)\right)^2 dt \\
   &= \int 2u'(t)\delta u'(t) dt + \mathcal{O}(\delta u^2),
\end{align}

where we have used $u'$ and $\delta u'$ for the time derivatives of $u$ and $\delta u$. Thus we may say:

$$
\frac{\partial f(u)}{\partial u}\cdot \delta u
= \int 2u'(t)\delta u'(t) dt
$$

The problem with this form of the directional derivative is that it is not immediately obvious what $\partial f/\partial u$ on its own is. We want to write $\partial f/\partial u$ as a function so that we can write:

$$
\frac{\partial f(u)}{\partial u}\cdot \delta u
= \int \frac{\partial f}{\partial u}(t)\delta u(t)dt = \int 2u'(t)\delta u'(t) dt
$$

The trick to achieve this is integration by parts. First let's be a bit more explicit about the (time) domain of $u$ and assume that $t_0\leq t \leq t_1$ with

$$
  f(u) = \int_{t_0}^{t_1} \left(u'(t)\right)^2 dt
$$

Then

\begin{align}
\frac{\partial f(u)}{\partial u}\cdot \delta u
&= \int_{t_0}^{t_1} 2u'(t)\delta u'(t) dt \\
&= \int_{t_0}^{t_1} \left(\frac{\partial}{\partial t} \left[2u'(t)\delta u(t)\right] - 2u''(t)\delta u(t)\right) dt \\
&= \left[2u'(t)\delta u(t)\right]_{t_0}^{t_1} - \int_{t_0}^{t_1} 2u''(t)\delta u(t) dt \\
&= 2u'(t_1)\delta u(t_1) - 2u'(t_0)\delta u(t_0) - \int_{t_0}^{t_1} 2u''(t)\delta u(t) dt \\
&= \int_{t_0}^{t_1} 2\delta(t-t_1)u'(t)\delta u(t)dt - \int_{t_0}^{t_1}2\delta(t-t_0)u'(t)\delta u(t)dt - \int_{t_0}^{t_1} 2u''(t)\delta u(t) dt \\
&= \int_{t_0}^{t_1} \left(2\delta(t-t_1)u'(t)- 2\delta(t-t_0)u'(t) - 2u''(t)\right)\delta u(t) dt.
\end{align}

In the last two lines we used the [Dirac delta function](https://en.wikipedia.org/wiki/Dirac_delta_function) $\delta(t-t_0)$. Comparing the beginning and end of the last equation block, we see that it makes sense to write:

$$
  \frac{\partial f(u)}{\partial u} = 2\delta(t-t_1)u'(t)- 2\delta(t-t_0)u'(t) - 2u''(t)
$$

If we use this analysis to find a stationary point of $f$, we set this derivative to zero and, after dividing by 2, obtain

$$
  \delta(t-t_1)u'(t)- \delta(t-t_0)u'(t) - u''(t) = 0
$$

Things are a little simpler if we fix the start and end values, i.e. we only consider functions $u$ for which $u(t_0)=u_0$ and $u(t_1)=u_1$ for some chosen values $u_0$ and $u_1$. In that case there can be no perturbations to $u$ at $t_0$ and $t_1$, or in other words, $\delta u(t_0)=0$ and $\delta u(t_1)=0$. In the derivation above this means that the terms with the Dirac delta function drop out, and we simply have

$$
  -u''(t) = 0
$$

Note that this is a 2nd order ODE with initial/boundary conditions of $u(t_0)=u_0$ and $u(t_1)=u_1$.
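
As a hedged numerical illustration (the grid, the end values and the perturbation below are arbitrary choices, and the forward-difference discretisation of $f$ is an assumption of this sketch), we can check that the straight line through $(t_0,u_0)$ and $(t_1,u_1)$, which solves $-u''=0$, is a stationary point: the directional derivative vanishes for a perturbation that is zero at both ends, whereas it does not vanish for a curved $u$ with the same end values.

```python
import math

n = 201
t0, t1, u0, u1 = 0.0, 2.0, 1.0, 3.0          # illustrative domain and end values
dt = (t1 - t0) / (n - 1)
t = [t0 + i * dt for i in range(n)]

def f(u):
    """Discrete version of int (u')^2 dt using forward differences."""
    return sum(((u[i + 1] - u[i]) / dt) ** 2 * dt for i in range(n - 1))

def directional_derivative(u, du, h=1e-4):
    """Central finite difference of f in the direction du."""
    up = [a + h * b for a, b in zip(u, du)]
    um = [a - h * b for a, b in zip(u, du)]
    return (f(up) - f(um)) / (2.0 * h)

# Solution of -u'' = 0 with u(t0) = u0 and u(t1) = u1: a straight line
u_line = [u0 + (u1 - u0) * (ti - t0) / (t1 - t0) for ti in t]

# Perturbation that vanishes at both ends, so the end values stay fixed
du = [math.sin(math.pi * (ti - t0) / (t1 - t0)) for ti in t]

# A curved function with the same end values, for comparison
u_curved = [a + 0.5 * b for a, b in zip(u_line, du)]

d_line = directional_derivative(u_line, du)
d_curved = directional_derivative(u_curved, du)
print(d_line, d_curved)
```

Here `d_line` should be zero up to rounding, while `d_curved` is clearly nonzero.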
</div>

## Continuous Adjoint for 1D Advection (\*)

<div class="optional">
First let us try to make sense of the continuous case (before discretisation). We will do this using the 1D example from above. 
We use again the same functional

$$
  f(u, m) = \int_0^L (u(x,T)-u_{\text{end}}(x))^2 \mathrm{d}x
$$

and thus the Lagrangian becomes

$$
  \mathcal{L}(u, m, \lambda) = f(u, m) - \lambda\cdot g(u, m) =
  \int_0^L (u(x,T)-u_{\text{end}}(x))^2 \mathrm{d}x 
    -\int_{t=0}^T\int_{x=0}^L \lambda(x, t) \left(\frac{\partial u(x,t)}{\partial t} + v(x,t)\frac{\partial u(x,t)}{\partial x}\right) \mathrm{d}x\mathrm{d}t \\
  -\int_{x=0}^L \lambda_{\text{ic}}(x) \left(u(x,0) - u_{\text{ic}}(x)\right) \mathrm{d}x
  -\int_{t=0}^T \lambda_{\text{bc}}(t) \left(u(0,t) - u_{\text{bc}}(t)\right) \mathrm{d}t
$$

where for now we can leave open whether we are optimising for the initial condition $m=u_{\text{ic}}$, the boundary condition $m=u_{\text{bc}}$ or the velocity $m=v$.
</div>

<div class="optional">
The adjoint equation is obtained by taking the derivative of $\mathcal{L}$
with respect to $u$ and setting it to zero. For ease of notation we consider the change in $\mathcal{L}$, to first order in $\delta u$, for arbitrary perturbations $\delta u(x,t)$ to $u(x,t)$:

\begin{align*}
  \mathcal{L}(u+\delta u, m, \lambda) - \mathcal{L}(u, m, \lambda) 
  = & \int_{t=0}^T\int_{x=0}^L \frac{\partial \mathcal{L}(u, m, \lambda)}{\partial u(x, t)} \delta u(x, t) \mathrm{d}x\mathrm{d}t \\
  = &
  \int_{x=0}^L 2\left(u(x, T) - u_{\text{end}}(x)\right) \delta u(x, T) \mathrm{d}x - 
  \int_{t=0}^T\int_{x=0}^L \lambda(x, t) \left(\frac{\partial\delta u(x,t)}{\partial t} + v(x,t)\frac{\partial\delta u(x,t)}{\partial x}\right) \mathrm{d}x\mathrm{d}t \\
  &-\int_{x=0}^L \lambda_{\text{ic}}(x) \delta u(x,0) \mathrm{d}x
  -\int_{t=0}^T \lambda_{\text{bc}}(t) \delta u(0,t) \mathrm{d}t = 0
\end{align*}

The second integral on the right-hand side consists of two terms (the two terms of the PDE) that we work out separately. First, the time derivative:

\begin{align*}
  \int_{t=0}^T\int_{x=0}^L \lambda(x, t) \frac{\partial \delta u(x,t)}{\partial t} \mathrm{d}x\mathrm{d}t
&= \int_{x=0}^L\int_{t=0}^T \frac{\partial\left( \lambda(x, t)\delta u(x,t)\right)}{\partial t} - \frac{\partial \lambda(x,t)}{\partial t}\delta u(x,t) ~\mathrm{d}t\mathrm{d}x \\
&= \int_{x=0}^L \big[\lambda(x, t)\delta u(x,t)\big]_{t=0}^T \mathrm{d}x - \int_{x=0}^L\int_{t=0}^T \frac{\partial \lambda(x,t)}{\partial t}\delta u(x,t) \mathrm{d}t\mathrm{d}x \\
&= \int_{x=0}^L \lambda(x, T)\delta u(x,T) \mathrm{d}x - \int_{x=0}^L\lambda(x, 0)\delta u(x,0) \mathrm{d}x - \int_{t=0}^T\int_{x=0}^L \frac{\partial \lambda(x,t)}{\partial t}\delta u(x,t) \mathrm{d}x\mathrm{d}t
\end{align*}

Similarly for the advection term:

\begin{align*}
  \int_{t=0}^T\int_{x=0}^L \lambda(x, t) v(x,t) & \frac{\partial \delta u(x,t)}{\partial x} \mathrm{d}x\mathrm{d}t
= \int_{t=0}^T\int_{x=0}^L \frac{\partial\left( \lambda(x, t)v(x,t)\delta u(x,t)\right)}{\partial x} - \frac{\partial \lambda(x,t)v(x,t)}{\partial x}\delta u(x,t) ~\mathrm{d}x\mathrm{d}t \\
&= \int_{t=0}^T \big[\lambda(x, t)v(x,t)\delta u(x,t)\big]_{x=0}^L \mathrm{d}t - \int_{t=0}^T\int_{x=0}^L \frac{\partial \lambda(x,t)v(x,t)}{\partial x}\delta u(x,t) \mathrm{d}x\mathrm{d}t \\
&= \int_{t=0}^T \lambda(L, t)v(L, t)\delta u(L,t) \mathrm{d}t - \int_{t=0}^T\lambda(0, t)v(0,t)\delta u(0,t) \mathrm{d}t - \int_{x=0}^L\int_{t=0}^T \frac{\partial \lambda(x,t)v(x,t)}{\partial x}\delta u(x,t) \mathrm{d}x\mathrm{d}t
\end{align*}

Putting this all together we obtain

\begin{align*}
  \int_{x=0}^L\int_{t=0}^T \left(\frac{\partial \lambda(x,t)}{\partial t} + \frac{\partial \lambda(x,t)v(x,t)}{\partial x}\right)\delta u(x,t) \mathrm{d}x\mathrm{d}t
  + \int_{x=0}^L \Big(2\left(u(x, T) - u_{\text{end}}(x)\right) - \lambda(x, T)\Big)\delta u(x, T) \mathrm{d}x \\
  + \int_{x=0}^L \Big(\lambda(x, 0) - \lambda_{\text{ic}}(x)\Big)\delta u(x, 0) \mathrm{d}x \\
  + \int_{t=0}^T \Big(\lambda(0, t)v(0,t)- \lambda_{\text{bc}}(t)\Big)\delta u(0,t)\mathrm{d}t
  - \int_{t=0}^T \lambda(L, t)v(L, t)\delta u(L,t) \mathrm{d}t = 0
\end{align*}
</div>

## A Backwards Continuous PDE for $\lambda$ (\*)

<div class="optional">

Let us now return to our assertion that we can solve the adjoint equation for $\lambda$. For this to be true we need to find a $\lambda$ such that the previous equation holds for all possible perturbations $\delta u(x,t)$. 
The first term can be made to disappear by having $\lambda$ satisfy the PDE:

$$
  \frac{\partial \lambda(x,t)}{\partial t} + \frac{\partial v(x,t)\lambda(x,t)}{\partial x} = 0
$$

The second term can only be made to disappear by imposing

$$
  \lambda(x, T) = 2(u(x, T) - u_{\text{end}}(x))
$$

This means we need to solve this PDE with, instead of an initial condition at $t=0$, a "terminal" condition at $t=T$. One way to look at this is that we are solving the PDE for $\lambda$ *backwards in time*. One consequence of this is that we need boundary conditions at the opposite boundaries. In the "forward" PDE that we solve for $u$ we have assumed that $v(0,t)>0$ and $v(L, t)>0$, so that we have inflow at $x=0$ and outflow at $x=L$. As we know, this implies we need a Dirichlet boundary condition at $x=0$, and no condition at $x=L$. If we reverse time however, as we do for $\lambda$, the outflow boundaries become inflow boundaries and vice versa. This means we need a boundary condition for $\lambda$ at $x=L$, and no condition at $x=0$. If we choose

$$
  \lambda(L, t) = 0
$$

we can make the last term of the equation disappear.

That leaves us with the third and fourth term which seem to be associated with the initial condition at $t=0$, and boundary condition at $x=0$. One way to deal with those is to simply solve the above PDE for $\lambda$ with the stated "terminal" condition at $t=T$ and boundary condition at $x=L$. This will produce values for $\lambda(x, 0)$ and $\lambda(0, t)$. Since $\lambda_{\text{ic}}$ and $\lambda_{\text{bc}}$ are independent variables, we can then make the third and fourth term disappear by choosing

$$
  \lambda_{\text{ic}}(x) = \lambda(x,0),\;\text{and}\;
  \lambda_{\text{bc}}(t) = \lambda(0,t)v(0,t).
$$

An alternative approach, in the case we are not optimising for boundary or initial conditions, is to say that we only consider functions $u$ that satisfy the initial and boundary conditions $u(x,0)=u_{\text{ic}}(x)$ and $u(0, t)=u_{\text{bc}}(t)$. In that case the perturbation $\delta u$ at $t=0$ and in the location $x=0$ should be zero, i.e. $\delta u(x,0)=0$ and $\delta u(0, t)=0$, in order for the initial and boundary conditions not to change. This also makes the third and fourth term disappear. If we indeed limit ourselves to only functions that satisfy the initial and boundary conditions, we can in fact get rid of the explicit constraints associated with those and thus get rid of the $\lambda_{\text{ic}}$ and $\lambda_{\text{bc}}$ terms.
</div>

## Continuous adjoint equation - summary
In the last three sections, we have derived that if 

$g(u, m)=0$ represents the forward one-dimensional advection equation:

$$
  \frac{\partial u(x,t)}{\partial t} + v\frac{\partial u(x,t)}{\partial x} = 0,
$$

in a domain $x \in [0, L]$ and with $t \in [0, T]$, a velocity $v$ and initial and boundary conditions

$$
  \text{initial condition: }\;u(x,0) = u_{\text{ic}}(x), \;\;
  \text{boundary condition: }\;u(0,t) = u_{\text{bc}}(t)
$$

then the adjoint equation

$$
\frac{\partial\mathcal{L}(u, m, \lambda)}{\partial u} = \frac{\partial f(u,m)}{\partial u} - \lambda\cdot \frac{\partial g(u, m)}{\partial u} =0
$$

can be solved for $\lambda$, where $\lambda(x,t)$ is a function of space and time, by solving the following PDE backwards in time:

$$
  \frac{\partial \lambda(x,t)}{\partial t} + \frac{\partial v(x,t)\lambda(x,t)}{\partial x} = 0
$$

with a final (instead of initial) condition at $t=T$:


$$
  \lambda(x, T) = 2(u(x, T) - u_{\text{end}}(x))
$$

and boundary condition (at the right instead of the left of the domain!):

$$
  \lambda(L, t) = 0
$$
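
To make the backwards solve concrete, here is a minimal sketch of one possible discretisation of the adjoint PDE. It is not taken from the lecture's own code: it assumes a constant velocity $v>0$, a uniform grid with the last grid point at $x=L$, and a first-order upwind scheme; the function name `solve_adjoint_upwind` and its arguments are hypothetical. Because we march from $t=T$ down to $t=0$, the upwind direction is the right-hand neighbour.

```python
def solve_adjoint_upwind(lam_final, courant, nsteps):
    """March lambda_t + v lambda_x = 0 backwards in time (constant v > 0).

    lam_final: values lambda(x_i, T) on a uniform grid, last point at x = L
    courant:   v * dt / dx, should be <= 1 for stability of this scheme
    nsteps:    number of time steps between t = T and t = 0
    """
    lam = list(lam_final)
    for _ in range(nsteps):
        new = lam[:]
        for i in range(len(lam) - 1):
            # going backwards in time, information arrives from the right
            new[i] = lam[i] + courant * (lam[i + 1] - lam[i])
        new[-1] = 0.0                 # boundary condition lambda(L, t) = 0
        lam = new
    return lam

# Illustrative terminal data; in the optimisation problem this would be
# 2 * (u(x, T) - u_end(x)) evaluated from the forward solution.
lam0 = solve_adjoint_upwind([1.0, 2.0, 3.0, 4.0, 5.0], courant=1.0, nsteps=2)
print(lam0)
```

With a Courant number of exactly 1 the scheme simply shifts the terminal data to the left by one grid point per step, with zeros entering from the boundary at $x=L$, which mirrors the way the forward solution is advected to the right.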

The fact that the "final" condition depends on the mismatch between $u$ and $u_{\text{end}}$ is a direct reflection of the fact that we have chosen the functional

$$
  f(u, m) = \int_0^L (u(x,T)-u_{\text{end}}(x))^2 \mathrm{d}x
$$

as we are essentially tracking the sensitivity of this functional, and thus of the final state $u(x,T)$, backwards through the model. As an example, if instead we had chosen a functional that depends on the solution $u$ as it leaves the domain (say we're interested in the flux of a concentration $u$ coming out of the right outflow boundary), e.g.:

$$
  f(u,m) = \int_{t=0}^T u(L, t) \mathrm{d}t
$$

then in the PDE for $\lambda$ we would get a nonzero boundary condition on the right, and since in this case the functional does not depend on $u(x,T)$, the "final" condition for $\lambda$ would just be zero. As another example, if the functional depended on a mismatch with some measurement in the interior of the domain, say an observation point $x_i$:

$$
  f(u,m) = \int_{t=0}^T (u(x_i,t)-u_{\text{obs}})^2 \mathrm{d}t
$$

then in the PDE for $\lambda$ we would end up with an extra term on the right-hand side, that would act like a source term for $\lambda$ at location $x_i$.
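
These two statements can be made explicit by repeating the derivation of the continuous adjoint with the alternative functionals. The following is a sketch under the same assumptions and sign conventions as before ($\mathcal{L} = f - \lambda\cdot g$, inflow at $x=0$, outflow at $x=L$). For the outflow functional the variation of $f$ is $\int_{t=0}^T \delta u(L,t)\,\mathrm{d}t$, so the terms multiplying $\delta u(L,t)$ and $\delta u(x,T)$ become

$$
  \int_{t=0}^T \Big(1 - \lambda(L,t)v(L,t)\Big)\delta u(L,t)\,\mathrm{d}t = 0, \;\;
  -\int_{x=0}^L \lambda(x,T)\delta u(x,T)\,\mathrm{d}x = 0,
$$

which gives

$$
  \lambda(L,t) = \frac{1}{v(L,t)}, \;\; \lambda(x,T) = 0.
$$

For the observation-point functional the variation of $f$ is $\int_{t=0}^T 2\left(u(x_i,t)-u_{\text{obs}}\right)\delta u(x_i,t)\,\mathrm{d}t$, which we can write as an integral over the whole domain using the Dirac delta function $\delta(x-x_i)$. Collecting the terms that multiply $\delta u(x,t)$ in the interior then gives

$$
  \frac{\partial \lambda(x,t)}{\partial t} + \frac{\partial v(x,t)\lambda(x,t)}{\partial x} = -2\left(u(x_i,t)-u_{\text{obs}}\right)\delta(x-x_i),
$$

with $\lambda(x,T)=0$ and $\lambda(L,t)=0$.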

## Solving the discrete model backwards

So far we have seen that we can solve for the Lagrange multiplier $\lambda$ associated with the PDE-constraint $g$ by solving the adjoint equation. In the discrete case this looked like

$$
  \underset{\bf U\times U}{\left(\frac{\partial g(u, m)}{\partial u}\right)^T}
  \underset{\bf U\times 1}{\lambda} = \underset{\bf U\times 1}{\left(\frac{\partial f(u,m)}{\partial u}\right)^T}
$$

where $U$ is the number of discretised equations (in time and space). For the continuous case, we learned that $\lambda$ satisfies a PDE that we solve backwards in time and which is related to, but not the same as, the PDE we solve for $u$.

A natural question to ask is what the relation is between the vector $\lambda$ in the discrete case, and the function $\lambda$ in the continuous case.

For simplicity's sake let us assume that the numerical model that corresponds to $g(u, m)=0$ is linear.
As we have seen in the discretised 1D advection example, the structure of 
our model looks like

$$
  g(u, m) =
  \begin{pmatrix}
    \mat A_{1,1} \\
    \mat A_{2,1} & \mat A_{2,2} \\
    & \mat A_{3,2} & \mat A_{3,3} \\
    & & \mat A_{4,3} & \mat A_{4,4} \\
    & & & \ddots & \ddots
  \end{pmatrix}
  \begin{pmatrix}
    \vec u^1 \\ \vec u^2 \\ \vec u^3 \\ \vec u^4 \\ \vdots
  \end{pmatrix}
-
  \begin{pmatrix}
    \vec b^1 \\ \vec b^2 \\ \vec b^3 \\ \vec b^4 \\ \vdots
  \end{pmatrix} = 0
$$

where each of the $\mat A_{n,n}$ and $\mat A_{n,n-1}$ are $N_x\times N_x$ blocks for each time level $n$, $1\leq n \leq N_t$, and
we have split up the solution vector $\vec u$ of length $N_t\cdot N_x$ into $N_t$ vectors $\vec u^n$ of length $N_x$ corresponding to the numerical solution at each time level. The $\vec b^n$ are forcing terms at each time level. For explicit finite difference models the diagonal block $\mat A_{n,n}$ is simply the identity matrix, so we can work out the solution $\vec u^n$ for time level $n$ directly, as all other references are to the previous time level $\vec u^{n-1}$ (via the $\mat A_{n, n-1}$ block). In implicit models we need to do a linear solve involving $\mat A_{n,n}$ at each time level.

From now on we will call this model the forward model, as it corresponds to the PDE we solve for $u$ forward in time. Because we assumed that the forward model (represented by $g(u,m)=0$) is linear, $\partial g/\partial u$ is based on the same matrix:

$$
  \frac{\partial g(u, m)}{\partial u} =
  \begin{pmatrix}
    \mat A_{1,1} \\
    \mat A_{2,1} & \mat A_{2,2} \\
    & \mat A_{3,2} & \mat A_{3,3} \\
    & & \mat A_{4,3} & \mat A_{4,4} \\
    & & & \ddots & \ddots
  \end{pmatrix}
$$
The adjoint equation gives rise to a linear system of the following form

$$
  \begin{pmatrix}
    \mat A_{1,1}^T & \mat A_{2,1}^T \\
    & \mat A_{2,2}^T & \mat A_{3,2}^T \\
    & & \mat A_{3,3}^T & \mat A_{4,3}^T \\
    & & & \ddots & \ddots \\
    & & & & \mat A_{N_t-1, N_t-1}^T & \mat A_{N_t, N_t-1}^T \\
    & & & & & \mat A_{N_t, N_t}^T
  \end{pmatrix}
  \begin{pmatrix}
    \vlam^1 \\ \vlam^2 \\ \vlam^3 \\ \vdots \\ \vlam^{N_t-1} \\ \vlam^{N_t}
  \end{pmatrix}
=
  \begin{pmatrix}
    \frac{\partial f}{\partial \vec u^1} \\ 
    \frac{\partial f}{\partial \vec u^2} \\ 
    \frac{\partial f}{\partial \vec u^3} \\ 
    \vdots \\
    \frac{\partial f}{\partial \vec u^{N_t-1}} \\ 
    \frac{\partial f}{\partial \vec u^{N_t}}
  \end{pmatrix}.
$$

Note that the natural way to solve this system is to actually go backwards through the time levels, i.e. first we solve the last row:

$$
  \mat A_{N_t, N_t}^T \vlam^{N_t} = \frac{\partial f}{\partial \vec u^{N_t}},
$$

then using $\vlam^{N_t}$, we solve for $\vlam^{N_t-1}$

$$
  \mat A_{N_t-1, N_t-1}^T \vlam^{N_t-1} + \mat A_{N_t, N_t-1}^T\vlam^{N_t} = \frac{\partial f}{\partial \vec u^{N_t-1}},
$$

etc.

Thus we see that for a discrete PDE-constrained optimisation problem, the adjoint equation leads to a numerical model that is solved backwards in time. The discrete adjoint solution $\vlam$ can in fact be interpreted as a discrete version of the continuous solution $\lambda(x, t)$ to the backwards adjoint PDE that we would have obtained by formulating the same optimisation as a continuous problem. It is however *not* generally the case that, if we start with the continuous formulation, derive the continuous backward PDE and then choose a suitable discretisation method for that PDE, we obtain the same discrete solution $\vlam$ - the difference being a discretisation error.
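
The backward block substitution can be sketched in a few lines of NumPy. This is an illustration only: the block sizes, the explicit upwind step matrix `M` and the random right-hand side (standing in for $\partial f/\partial \vec u^n$) are assumptions made for the example. The result is compared against a direct solve with the transpose of the fully assembled matrix.

```python
import numpy as np

nx, nt, c = 4, 5, 0.5     # grid points, time levels, Courant number (illustrative)
I = np.eye(nx)
# One explicit upwind step: u_i^n = u_i^{n-1} - c (u_i^{n-1} - u_{i-1}^{n-1})
M = (1.0 - c) * I + c * np.eye(nx, k=-1)

diag = [I for _ in range(nt)]                 # blocks A_{n,n}
sub = [None] + [-M for _ in range(nt - 1)]    # blocks A_{n,n-1}, stored at index n

# Assemble the full block lower-bidiagonal matrix, for reference only
A = np.zeros((nt * nx, nt * nx))
for n in range(nt):
    A[n * nx:(n + 1) * nx, n * nx:(n + 1) * nx] = diag[n]
    if n > 0:
        A[n * nx:(n + 1) * nx, (n - 1) * nx:n * nx] = sub[n]

rng = np.random.default_rng(0)
rhs = rng.standard_normal((nt, nx))           # stands in for df/du^n at each level

# Solve A^T lam = rhs by going backwards through the time levels
lam = np.zeros((nt, nx))
lam[nt - 1] = np.linalg.solve(diag[nt - 1].T, rhs[nt - 1])
for n in range(nt - 2, -1, -1):
    lam[n] = np.linalg.solve(diag[n].T, rhs[n] - sub[n + 1].T @ lam[n + 1])

lam_ref = np.linalg.solve(A.T, rhs.reshape(-1))
print(np.allclose(lam.reshape(-1), lam_ref))
```

Note that the backward sweep never forms the full $U\times U$ matrix; each time level only needs the two blocks that couple it to the next level.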

# A Derivative of the Reduced Problem via the Adjoint Equation
In the previous section we have seen how we can solve the adjoint equation for $\lambda$, either in the continuous formulation via a backwards-in-time PDE, or in the discrete case by taking the adjoint of our discrete numerical model and solving the equations backwards. This solution procedure only makes sense however if we already know $m$ and $u$, as the adjoint equation depends on the values of these two. If we know $m$ we can first solve the PDE-constraint for $u$, i.e. solve the original forward PDE. This does not bring us any further, as the model parameters $m$ are the actual unknowns of our optimisation problem. We do have one remaining equation, derived from $\partial \mathcal{L}/\partial m=0$, that we could solve for $m$, but only if we know $u$ and $\lambda$ already.

This shows that the three equations based on finding stationary points of the Lagrangian $\mathcal{L}$ are coupled.
One approach, called the "one shot" approach, is to solve all three equations in one big system, and hope that we can find a solution strategy. As we discussed in lecture 6 in the context of KKT systems (of which this is an example), solving these kinds of systems is very tricky. In this case it becomes even harder because the system we need to solve is very large: we need to solve for the solution at all time levels simultaneously.

A more easily manageable approach is to link the adjoint equation to the reduced problem. If we assume that $u(m)$ is a solution of $g(u,m)=0$ then the Lagrangian simplifies to

$$
  \mathcal{L}(u(m), m, \lambda) = f(u(m), m) - \lambda\cdot g(u(m), m) = f(u(m), m) = \hat f(m)
$$

We can now compute the total derivative of $\mathcal{L}(u(m), m, \lambda)$ and therefore of the reduced functional $\hat f(m)$, with respect to $m$:

$$
  \frac{d\hat f(m)}{dm} = \frac{d\mathcal{L}(u(m), m, \lambda)}{dm} = \left.\frac{\partial\mathcal{L}(u, m, \lambda)}{\partial u}\right|_{u=u(m)} \frac{du(m)}{dm}
    + \left.\frac{\partial\mathcal{L}(u, m, \lambda)}{\partial m}\right|_{u=u(m)}
$$

Note that the two terms on the right-hand side explicitly depend on $\lambda$. However their sum should be the same regardless of our choice of $\lambda$, since $\hat f(m)$ does not depend on $\lambda$. We also see a dependence on $du(m)/dm$ which we encountered in the tangent linear approach, where we concluded that calculating this derivative is expensive if we have more than just a few parameters. In this case however we can make that term disappear by choosing our $\lambda$ to satisfy the adjoint equation! After all the adjoint equation is given by

$$
  \frac{\partial\mathcal{L}(u, m, \lambda)}{\partial u} = \frac{\partial f(u,m)}{\partial u} - \lambda\cdot \frac{\partial g(u, m)}{\partial u} = 0 
$$  

Now we can compute $d\hat f(m)/dm$ for any $m$ in three steps:

1. Solve the PDE constraint $g(u, m)=0$ to obtain $u=u(m)$

2. Solve the adjoint equation using $m$ and $u(m)$ for $\lambda$, such that $\frac{\partial\mathcal{L}(u, m, \lambda)}{\partial u}=0$

3. The derivative of the reduced functional is now given by

$$
  \frac{d\hat f(m)}{dm} = \frac{\partial\mathcal{L}(u, m, \lambda)}{\partial m} = \frac{\partial f(u,m)}{\partial m} - \lambda\cdot \frac{\partial g(u, m)}{\partial m}
$$

It is important to realize that the three-step procedure does not solve for the stationary points of $\mathcal{L}$ in one go: we have solved two out of three equations, $\partial\mathcal{L}/\partial\lambda=0$ and $\partial\mathcal{L}/\partial u=0$, but in the last step we compute $\partial\mathcal{L}/\partial m$, which is not necessarily zero. We do however have a way of computing $d\hat{f}/dm$, and we can use that to iteratively solve the reduced optimisation problem for $m$. The iterative solution procedure will typically require us to repeat the three steps to compute the gradient for each iterate $m$. Once we have found the optimal solution, which means we have found an $m$ for which $d\hat{f}/dm=0$, we then also have, as expected, found a stationary point of $\mathcal{L}$, since  $\partial\mathcal{L}/\partial m = d\hat{f}/dm=0$.
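
The three steps can be sketched for a small linear model. Everything in this example is an assumption chosen for illustration: the constraint is taken to be $g(u,m) = \mat A\vec u - \mat B\vec m = 0$ with random matrices, and the functional is $f(u) = \sum_i (u_i - u_{\text{target},i})^2$, so that $\partial g/\partial u = \mat A$, $\partial g/\partial m = -\mat B$ and $\partial f/\partial m = 0$. The adjoint-based gradient is compared with central finite differences of the reduced functional.

```python
import numpy as np

rng = np.random.default_rng(1)
n_u, n_m = 6, 3                     # number of states and parameters (illustrative)
A = np.eye(n_u) + 0.1 * np.tril(rng.standard_normal((n_u, n_u)))
B = rng.standard_normal((n_u, n_m))
u_target = rng.standard_normal(n_u)

def solve_forward(m):
    """Step 1: solve the constraint g(u, m) = A u - B m = 0 for u."""
    return np.linalg.solve(A, B @ m)

def reduced_functional(m):
    u = solve_forward(m)
    return np.sum((u - u_target) ** 2)

def adjoint_gradient(m):
    u = solve_forward(m)                          # step 1: forward solve
    df_du = 2.0 * (u - u_target)
    lam = np.linalg.solve(A.T, df_du)             # step 2: adjoint solve
    dg_dm = -B
    df_dm = np.zeros(n_m)
    return df_dm - lam @ dg_dm                    # step 3: df/dm - lam . dg/dm

m = rng.standard_normal(n_m)
grad_adjoint = adjoint_gradient(m)

# Reference: central finite differences, one pair of forward solves per parameter
h = 1e-6
grad_fd = np.array([
    (reduced_functional(m + h * e) - reduced_functional(m - h * e)) / (2 * h)
    for e in np.eye(n_m)
])
print(np.max(np.abs(grad_adjoint - grad_fd)))
```

The adjoint gradient costs one forward and one adjoint solve regardless of the number of parameters, whereas the finite-difference reference needs two forward solves per parameter.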

# Implementation (*)

<div class="optional">

## Continuous vs. Discrete Adjoint

Although we have outlined a procedure to solve PDE-constrained optimisation problems using the adjoint method, there are still a few important choices we have to make when implementing this method. We distinguish between two approaches:

* The <a class="definition" href="#definitions" id="continuousadjoint">continuous adjoint</a> approach starts with a continuous formulation of the PDE-constrained optimisation, i.e. we simply write down what PDE we want to solve, which we will call the *forward* PDE, without specifying which discretisation is going to be used. The adjoint equation then produces a second PDE, the *adjoint* or *backward* PDE that solves for $\lambda$. We now need to implement a discretisation for both the forward and the backward PDE. Luckily in many cases, the backward PDE is closely related to the forward PDE, so the code used to solve the forward PDE numerically can be adapted to also solve the backward problem. For complicated nonlinear PDEs, however, the relation between the forward and the backward model is not always that straightforward.

* In the <a class="definition" href="#definitions" id="discontinuousordiscreteadjoint">discontinuous (or discrete) adjoint</a> approach, the discretisation of the forward PDE is already chosen beforehand. This leads to a discrete PDE-constrained optimisation problem. The procedure to solve for $\lambda$ is to take the adjoint of the forward *discrete* model, and requires taking derivatives of that code (see next section), which can be a complicated procedure.

The main advantage of the discontinuous approach is that it produces the exact (except for details such as solver tolerances, machine precision, etc.) derivative of the discrete model that is used to evaluate the reduced functional for a given $m$. In the continuous approach, when we hook up our final implementation to the optimisation algorithm, we also use a discrete model to calculate the reduced functional values. This is however just a numerical approximation to the underlying continuous PDE, which introduces numerical error between the two. In the continuous adjoint method for calculating the derivative we again introduce numerical error, by discretising the adjoint PDE. Thus the gradient information that the optimisation algorithm receives is only a numerical approximation of the derivative of the procedure that computes the functional values. The introduced error may lead to undesired convergence, or non-convergence, of the optimisation algorithm. For instance it might be that the adjoint-based derivative goes to zero, whereas the implemented functional still changes for perturbations to $m$. Or vice versa, the adjoint-based derivative predicts a change in some direction that is not observed when perturbing $m$ in that direction.
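
One commonly used way to detect such a mismatch between functional and gradient is a Taylor remainder test: for a consistent gradient the remainder $|\hat f(m+h\,\delta m) - \hat f(m) - h\,\frac{d\hat f}{dm}\cdot\delta m|$ decreases like $h^2$, whereas for an inconsistent gradient it only decreases like $h$. The sketch below is an illustration and not part of the lecture's code: `f_hat` is a hypothetical stand-in for a reduced functional, and `grad_inexact` mimics an inconsistent gradient by scaling the exact one by 5%.

```python
import math

def taylor_remainders(f, grad, m, dm, hs):
    """First-order Taylor remainders of f at m in direction dm, for each h."""
    f0 = f(m)
    slope = sum(g * d for g, d in zip(grad, dm))
    return [abs(f([mi + h * di for mi, di in zip(m, dm)]) - f0 - h * slope)
            for h in hs]

def convergence_orders(rems, hs):
    """Observed order of convergence between successive step sizes."""
    return [math.log(rems[i] / rems[i + 1]) / math.log(hs[i] / hs[i + 1])
            for i in range(len(hs) - 1)]

def f_hat(m):
    """Hypothetical stand-in for a reduced functional."""
    return sum(math.sin(mi) ** 2 for mi in m)

def grad_exact(m):
    return [2.0 * math.sin(mi) * math.cos(mi) for mi in m]

def grad_inexact(m):
    """Mimics a gradient that is not consistent with f_hat (5% off)."""
    return [1.05 * g for g in grad_exact(m)]

m = [0.3, -0.7, 1.1]
dm = [1.0, -0.5, 0.25]
hs = [1e-3 / 2 ** k for k in range(4)]

orders_exact = convergence_orders(taylor_remainders(f_hat, grad_exact(m), m, dm, hs), hs)
orders_inexact = convergence_orders(taylor_remainders(f_hat, grad_inexact(m), m, dm, hs), hs)
print(orders_exact, orders_inexact)
```

Observed orders close to 2 indicate that the gradient is the derivative of the functional that is actually being evaluated; orders close to 1 indicate a mismatch of the kind described above.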

## Automatic Differentiation
To implement a discrete adjoint model, in principle we could write out by hand what the mathematical operations are that occur in the discrete forward model, differentiate these by hand and implement the result. In practice, this is only feasible for very simple models. There are *automatic differentiation (AD)* frameworks that automate some or all of this work for you based on the computer code of the forward model.

On the one end of the spectrum (fully automated), this allows you to treat the computer code completely as a black box with some inputs and outputs. By applying the chain rule over all operations that occur in the model, the framework allows you either to take the derivative with respect to individual input parameters and work out how that information propagates through the model towards the outputs (this is equivalent to the tangent linear approach), or to specify a particular outcome of the model and work backwards to compute its sensitivity to the various input parameters (as in the adjoint approach). The downside of this approach is that information spreads quickly through the model, and any libraries it uses, the code of which would all have to be analysed as well. This means it will quickly descend into areas of the code that are not necessarily relevant for the mathematical calculation, and which are hard to differentiate sensibly, for instance code to do with I/O routines, parallel communication, etc.
Fully automated adjoint codes are often not very efficient. This is not only because they spend time in areas of the code that are not relevant, but also because in order for the backward mode to be executed, the framework first needs to establish a tree of all the operations that were performed in the forward model so that it can be traversed backwards. Because of the fine granularity of all individual operations, this tree, which also needs to store all intermediate results to deal with nonlinearity, may become very large.
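
To illustrate what such a tree of operations looks like, here is a deliberately tiny sketch of operator-level reverse-mode differentiation for scalars. It is not how any particular AD framework is implemented; the class `Var` and the functions `sin` and `backward` are hypothetical names invented for this example. Every arithmetic operation records its inputs and local partial derivatives, and the backward pass traverses the recorded operations in reverse.

```python
import math

class Var:
    """Scalar that records how it was computed (one node of the operation tree)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents     # tuple of (parent Var, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__

def sin(x):
    # the intermediate value cos(x) is stored for use in the backward pass
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))

def backward(output):
    """Accumulate d(output)/d(node) in node.grad for every recorded node."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):          # consumers are processed before inputs
        for parent, local in node.parents:
            parent.grad += local * node.grad

x, y = Var(0.5), Var(2.0)
z = x * y + sin(x) * x
backward(z)
print(x.grad, y.grad)
```

Even this one-line expression creates four recorded nodes besides the inputs, each holding intermediate values, which gives a feeling for how quickly the storage grows for a full numerical model.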

In the discrete approach that we sketched in this lecture, more user intervention would be required to indicate which of the many variables in the computer program actually store the intermediate solutions $\vec u$, and which are the equations $g(\vec u, \vec m)=0$ that are solved. This might happen for instance through annotation of the code. The automatic differentiation framework would then still be responsible for working out what the partial derivatives $\partial g/\partial u$ and $\partial g/\partial m$ are. Because the model is broken down into larger steps, there are fewer intermediate results that need to be stored. As you might imagine, the user annotation can be laborious, and in particular in codes that are still being developed (which is the usual case) it can be hard to keep the annotation up-to-date.

In an ideal world, the computer would have a better understanding of the mathematical operations that are performed to implement a certain numerical scheme. For instance if one of the steps in the model requires the solution to a linear system, we can work out analytically what the adjoint to that step is. This would be a far more efficient and elegant approach than blindly applying the automatic differentiation tool to the entire code base of the linear solver library. An interesting approach is followed in [dolfin-adjoint](http://www.dolfin-adjoint.org/). Automated code generation frameworks for solving PDEs, such as [Firedrake](https://firedrakeproject.org/), [FEniCS](https://fenicsproject.org/) and [Devito](https://www.devitoproject.org/), provide a higher level ("domain specific") language to describe the numerical equations approximating a PDE. These frameworks can automatically derive highly optimized code that is then executed to numerically solve the PDE. For models written in Firedrake or FEniCS, because the framework has access to this higher level description of the underlying numerical mathematics, dolfin-adjoint can automatically derive and implement an efficient adjoint model and compute gradients of any specified outcome of the model with respect to any input. Obtaining derivative information can then be as simple as adding three lines of code!

## Checkpointing
It should be noted that although the adjoint equation is linear (in $\lambda$!), it does depend on the forward solution $u$. This means that when solving the adjoint, we need access to the entire forward solution $\vec u$, meaning the discrete solution at all timesteps. When the forward model is run for more than a few timesteps, it might become infeasible to store all these in memory. Writing every single timestep to disc, and reading it back when needed, would however be very expensive (both in time and disc space). A compromise that is often taken is to only write (checkpoint) a few intermediate results at some specified intervals. When the adjoint model is going backwards (in time) through the model and requires access to the forward solution for some range of time levels, the missing intermediate solutions can be recomputed by starting from the nearest checkpoint. The *revolve* algorithm is a method for working out an optimal checkpointing schedule (see [pyrevolve](https://github.com/opesci/pyrevolve) for a Python implementation).
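
The recomputation idea can be sketched as follows. Note that this is *not* the revolve algorithm: it simply stores a checkpoint at a fixed interval and recomputes each segment of the forward solution when the backward sweep reaches it. The scalar toy model `step`, its linearised transpose `step_adjoint` and all function names are hypothetical and chosen only to keep the example short.

```python
import math

def step(u):
    """One nonlinear forward time step (hypothetical scalar toy model)."""
    return u + 0.1 * math.sin(u)

def step_adjoint(u, lam):
    """Adjoint of the linearised step, evaluated at the forward state u."""
    return (1.0 + 0.1 * math.cos(u)) * lam

def forward_with_checkpoints(u0, nsteps, interval):
    """Run the forward model, storing only every `interval`-th state."""
    checkpoints = {0: u0}
    u = u0
    for n in range(nsteps):
        u = step(u)
        if (n + 1) % interval == 0 and n + 1 < nsteps:
            checkpoints[n + 1] = u
    return u, checkpoints

def adjoint_with_checkpoints(checkpoints, nsteps, lam_final):
    """Sweep backwards, recomputing forward states segment by segment."""
    lam = lam_final
    segment_end = nsteps
    for start in sorted(checkpoints, reverse=True):
        # recompute the forward states u^start, ..., u^(segment_end - 1)
        states = [checkpoints[start]]
        for _ in range(start, segment_end - 1):
            states.append(step(states[-1]))
        # apply the adjoint of each step in reverse order
        for u in reversed(states):
            lam = step_adjoint(u, lam)
        segment_end = start
    return lam

u0, nsteps, interval = 0.4, 10, 3
u_final, checkpoints = forward_with_checkpoints(u0, nsteps, interval)
# Sensitivity of the final state with respect to the initial state
sensitivity = adjoint_with_checkpoints(checkpoints, nsteps, lam_final=1.0)
print(sorted(checkpoints), sensitivity)
```

With this fixed-interval strategy only a handful of states are stored, at the price of roughly one extra forward run; revolve improves on this by choosing where to place a given number of checkpoints so that the amount of recomputation is minimised.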
</div>

# List of Definitions
<a id="definitions"/>

* <a class=definition href="#continuousoptimisationproblem">continuous optimisation problem</a>
* <a class=definition href="#discontinousoptimisationproblem">discontinuous optimisation problem</a>
* <a class=definition href="#reducedfunctional">reduced functional</a>
* <a class=definition href="#tangentlinearequation">tangent linear equation</a>
* <a class=definition href="#continuousadjoint">continuous adjoint</a>
* <a class=definition href="#discontinuousordiscreteadjoint">discontinuous (or discrete) adjoint</a>
* <a class=definition href="#automaticdifferentationAD">automatic differentiation (AD)</a>
