# Full-waveform inversion - Theory

Full-waveform inversion (FWI) is a computational scheme for generating high-resolution, high-fidelity models of physical properties using finite-frequency waves. The waves could be electromagnetic, acoustic, elastic or of various other kinds. The method is used in medical imaging of soft tissues, in non-destructive testing, in petroleum exploration, in earthquake seismology, and to image the interior of the Sun. FWI is a form of tomography, but conventional tomography assumes that energy travels along infinitely thin geometric ray paths, that there are no finite-frequency wave effects, and that it is necessary only to fit a single amplitude or travel-time for each source-and-receiver pair. In contrast, FWI attempts to fully account for the finite wavelength of the observed signals, and it seeks to explain the detailed waveforms of the recorded data.

Like other simpler forms of tomography, FWI is a local, iterated inversion scheme that successively improves a starting model. It does this by using the two-way wave equation to predict the observed data from a model, and then seeks to find a new model that minimises the differences between those predictions and the observed data; it attempts to match the raw observed data wiggle for wiggle. The computational effort required for FWI is large, but the resulting spatial resolution is much better than can be obtained by conventional tomography - FWI has only become economically feasible for three-dimensional models in the last ten years or so. The notation used is summarised at the end of these notes, together with definitions of the $L_2$-norm, the *gradient* and the *Hessian*.

<tr>
    <td> <img src="figures/survey-ship-diagram.png" alt="Drawing" style="width: 450px;"/> </td>
    <td> <img src="figures/Marmousi3D.png" alt="Drawing" style="width: 450px;"/> </td>
</tr>

**Left:** Sketch of offshore seismic survey. **Right:** Example model result for $v_p$.

## The FWI algorithm

The aim of FWI is to find a model that minimises some measure of the misfit between a dataset predicted by a model and an observed dataset - this measure is called the *objective function*.

A simple geometric analogy, in which the model has just two parameters, is to regard the misfit as being represented by the local height of a two-dimensional error surface, and the two model parameters as representing the $x$ and $y$-coordinates of a point on this surface. FWI then involves starting at some point on this surface, and trying to find the bottom of the deepest valley by heading downhill in a sequence of finite steps. To do this, we have to discover which way is downhill, and how far to step. In real FWI, the model has not just two parameters, but many millions, but the analogy is still appropriate. The algorithm proceeds as follows:
1. Calculate the direction of the local gradient $\nabla_\mathbf{m}f$ of the objective function $f$ with respect to the model parameters - this points uphill
    - Using the *starting model* $\mathbf{m}$ and a known *source* $\mathbf{s}$, calculate the forward *wavefield* $\mathbf{u}$ everywhere in the model including the *predicted data* $\mathbf{p}$ at the receivers.
    - At the receivers, subtract the observed data $\mathbf{d}$ from the predicted data to obtain the *residual data* $\delta\mathbf{d}$.
    - Treating the receivers as virtual sources, back-propagate the residual data into the model, to generate the residual wavefield $\delta\mathbf{u}$.
    - Scale the residual wavefield by the local slowness $1/c$, and differentiate it twice in time.
    - At every point in the model, cross-correlate the forward and scaled residual wavefields, and take the zero lag in time to generate the *gradient* for one source.
    - Do this for every source, and stack together the results to make the global gradient.
2. Find the step length - how far is the bottom of the hill?
    - Take a small step and a larger step directly downhill, and calculate the objective function at the current model and in these two new models.
    - Assume a linear relationship between changes in the model and changes in the residual data so that there will be a parabolic relationship between changes in the model and changes in the objective function, then fit a parabola through these three points.
    - The lowest point on this parabola represents the optimal step length (assuming a locally linear relationship).
    - Step downhill by the required amount, and update the model.
3. Do it all over again
    - Use the new model as the starting model, and repeat steps 1. and 2.
    - Repeat this process until the model is 'good enough', that is the model is no longer changing (to some numerical tolerance), or we run out of time, money or patience.

This is the basic algorithm. There are several ways to enhance and improve it, but nearly all of these involve a greater computational cost (which is already high).
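The step-length search in step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `parabolic_step_length`, the fallback used when the fitted parabola has no interior minimum, and the toy objective are all hypothetical and are not taken from any particular FWI code.

```python
import numpy as np

def parabolic_step_length(f0, alpha1, f1, alpha2, f2):
    """Step length at the minimum of the parabola through
    (0, f0), (alpha1, f1) and (alpha2, f2)."""
    # Fit f(alpha) = a*alpha**2 + b*alpha + c exactly through the three points
    a, b, _ = np.polyfit([0.0, alpha1, alpha2], [f0, f1, f2], 2)
    if a <= 0.0:
        # Assumed fallback: the parabola opens downwards, so simply keep
        # whichever trial step gave the lower objective value
        return alpha1 if f1 < f2 else alpha2
    return -b / (2.0 * a)

def toy_objective(alpha):
    # Hypothetical objective along the downhill direction, minimum at alpha = 0.3
    return 2.0 * (alpha - 0.3) ** 2 + 1.0

alpha_opt = parabolic_step_length(
    toy_objective(0.0), 0.1, toy_objective(0.1), 0.2, toy_objective(0.2)
)
print(alpha_opt)
```

For an objective that really is parabolic along the search direction, as in this toy case, the fitted minimum coincides with the true one. In real FWI each of the three objective values costs additional forward modelling, which is why only two trial steps are taken.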

## The Wave Equation

The wave equation is a simplified model for, e.g., the displacement of a vibrating
string (approx. 1D), a membrane (such as a drum skin, approx. 2D) or an elastic solid in 3D (the situation relevant to FWI). That is, the main physics the wave equation is attempting to capture is, broadly speaking, the transfer through space of oscillatory energy (vibrations in time).

The simplest wave equation that is commonly used in FWI is:

\begin{equation}
  \frac{1}{c^2}\frac{\partial^2 u}{\partial t^2}-\nabla^2 u = s,
\label{eq:we0} \tag{1}
\end{equation}

where $u$ is the propagating wavefield measured using some appropriate material property (for example electric field in an EM wave or acoustic pressure in an acoustic wave), $s$ is the driving source that produces the wavefield, and $c$ is the wave speed. Both $u$ and $s$ vary in space and time, and $c$ varies in space. This equation applies to small-amplitude linear waves propagating within an inhomogeneous, isotropic, constant-density, non-attenuating, non-dispersive, fluid medium. It is relatively straightforward to add variable density, shear
strength, attenuation, anisotropy, dispersion, polarisation and other physical effects to this simple wave equation; these effects change the detailed equations and numerical complexity, but not the general approach.

### Simplified derivation of the 1D wave-equation

For a simple derivation of the 1D acoustic wave equation let us focus on an isotropic and homogeneous elastic string, where $u(x, t)$ describes displacement or 'height' from the position of rest at position $x$ and time $t$. We consider a
small subinterval $[x_1, x_2]$. The total acceleration, $a$, in the '$u$'-direction within this interval is

\begin{equation}
   a=\frac{\partial^2}{\partial t^2}\int_{x_1}^{x_2}u(x,t)\mathrm{d}x=\int_{x_1}^{x_2}u_{tt}(x,t)\mathrm{d}x.
\label{eq:sd0} \tag{2}
\end{equation}

The total force acting on this interval of the string is the net force of the forces
acting at the points $x_1$ and $x_2$. The force $F=F(u)$ will be some function
of $u$. By Newton’s law, force equals acceleration for unit mass and hence

\begin{equation}
  a = F(u(x_2,t))-F(u(x_1,t))=\int_{x_1}^{x_2}F_{x}(u(x,t))\mathrm{d}x.
  \label{eq:sd1} \tag{3}
\end{equation}

We are now required to make some assumptions. Let us assume that the force $F$ is proportional to the slope of the string with a proportionality factor $c^2$ (can be justified for small displacements). Hence

\begin{equation}
  F(u) \approx c^2 u_x.
  \label{eq:sd2} \tag{4}
\end{equation}

Combining the above yields

\begin{equation}
  \int_{x_1}^{x_2}(u_{tt}-c^2 u_{xx})\mathrm{d}x=0.
  \label{eq:sd3} \tag{5}
\end{equation}

Since this holds for an arbitrary interval $[x_1, x_2]$, we must have that

\begin{equation}
  u_{tt}=c^2 u_{xx},
  \label{eq:sd4} \tag{6}
\end{equation}

which is the 1D wave equation. Similar arguments (albeit using vector calculus) can be made to derive the corresponding wave equation in two or three dimensions.
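As a numerical illustration, the 1D wave equation $u_{tt}=c^2u_{xx}$ can be stepped forward in time with a standard second-order (leapfrog) finite-difference scheme. The sketch below assumes a constant wave speed, zero initial velocity and fixed end points; the function name `solve_wave_1d`, the grid sizes and the Gaussian initial pulse are illustrative choices only. The result is compared with d'Alembert's solution, in which the initial pulse splits into two half-amplitude pulses travelling in opposite directions.

```python
import numpy as np

def solve_wave_1d(u0, c, dx, dt, nsteps):
    """Leapfrog scheme for u_tt = c^2 u_xx with zero initial velocity
    and fixed (u = 0) end points. Requires nsteps >= 1."""
    courant2 = (c * dt / dx) ** 2
    u_prev = u0.copy()
    # First step from a Taylor expansion using u_t(x, 0) = 0
    u_curr = u0.copy()
    u_curr[1:-1] += 0.5 * courant2 * (u0[2:] - 2.0 * u0[1:-1] + u0[:-2])
    for _ in range(nsteps - 1):
        u_next = np.zeros_like(u_curr)
        u_next[1:-1] = (2.0 * u_curr[1:-1] - u_prev[1:-1]
                        + courant2 * (u_curr[2:] - 2.0 * u_curr[1:-1] + u_curr[:-2]))
        u_prev, u_curr = u_curr, u_next
    return u_curr

def pulse(z):
    # Gaussian initial displacement centred on z = 0.5
    return np.exp(-((z - 0.5) / 0.05) ** 2)

x = np.linspace(0.0, 1.0, 401)
dx = x[1] - x[0]
c = 1.0
dt = 0.5 * dx / c          # Courant number 0.5, inside the stability limit
nsteps = 160

u_num = solve_wave_1d(pulse(x), c, dx, dt, nsteps)

# d'Alembert solution for zero initial velocity
t = nsteps * dt
u_exact = 0.5 * (pulse(x - c * t) + pulse(x + c * t))
max_error = np.max(np.abs(u_num - u_exact))
print(max_error)
```

The run stops before the pulses reach the ends of the string, so the fixed end points play no role in the comparison.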

### Other forms of the wave equation

More general forms of the wave equation can be written as

\begin{equation}
  \rho(\mathbf{x})\frac{\partial^2 \mathbf{u}}{\partial t^2}(\mathbf{x},t)-\nabla\cdot\mathbf{\sigma}(\mathbf{x},t)=\mathbf{f}(\mathbf{x},t),
  \label{eq:awe0} \tag{7}
\end{equation}

where $\mathbf{u}(\mathbf{x},t)$ is the displacement field and $\rho(\mathbf{x})$, $\mathbf{\sigma}(\mathbf{x},t)$, and $\mathbf{f}(\mathbf{x},t)$ represent the material density, stress tensor and an external force density respectively. Depending on the fidelity of the model we wish to implement, $\mathbf{\sigma}(\mathbf{x},t)$ can take on many different forms (the simplest, as we'll see below, resulting in the acoustic wave equation defined above). For example, the acoustic wave equation only accounts for the propagation of *pressure waves* (P-waves) but the propagation of *shear waves* (S-waves) is also important in many physical problems. In the *elastic wave equation*, which accounts for the propagation of both P- and S-waves (i.e. both longitudinal and transverse motions) the stress tensor can be written as (dropping function dependencies for conciseness)

\begin{equation}
  \sigma_{ij}=\sum_{k,l=1}^{3}(\lambda\delta_{ij}\delta_{kl}+\mu\delta_{ik}\delta_{jl}+\mu\delta_{il}\delta_{jk})\varepsilon_{kl},
  \label{eq:awe1} \tag{8}
\end{equation}

where

\begin{equation}
  \varepsilon_{ij}=\frac{1}{2}(\partial_{i}u_{j}+\partial_{j}u_{i})
  \label{eq:awe2} \tag{9}
\end{equation}

and $\lambda$ and $\mu$ are known as the *Lamé parameters* (which are determined by the physical properties of the isotropic homogeneous elastic medium). Note that in the formulations introduced so far, energy is not dissipated - one formulation in which energy is dissipated is known as the *viscoelastic wave equation*. We will not discuss this formulation here, but for interested readers details regarding viscoelastic formulations (along with acoustic and elastic) and details regarding FWI in general can be found in the book entitled **Full Seismic Waveform Modelling and Inversion** by **Andreas Fichtner**.
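The index notation in the isotropic stress-strain relation can be checked numerically. The sketch below builds the fourth-order tensor from Kronecker deltas with `numpy.einsum`, applies it to a strain tensor, and compares the result with the equivalent compact form $\lambda\,\mathrm{tr}(\varepsilon)\,\mathbf{I}+2\mu\varepsilon$. The displacement gradient and the values of $\lambda$ and $\mu$ are arbitrary illustrative numbers, and the function name `isotropic_stress` is hypothetical.

```python
import numpy as np

def isotropic_stress(strain, lam, mu):
    """Evaluate sigma_ij = sum_kl C_ijkl eps_kl for an isotropic medium."""
    delta = np.eye(3)
    stiffness = (lam * np.einsum("ij,kl->ijkl", delta, delta)
                 + mu * np.einsum("ik,jl->ijkl", delta, delta)
                 + mu * np.einsum("il,jk->ijkl", delta, delta))
    return np.einsum("ijkl,kl->ij", stiffness, strain)

# Arbitrary small displacement gradient, grad_u[i, j] = d_i u_j
grad_u = np.array([[1.0e-3, 2.0e-4, 0.0],
                   [0.0, -5.0e-4, 1.0e-4],
                   [3.0e-4, 0.0, 2.0e-4]])
strain = 0.5 * (grad_u + grad_u.T)

lam, mu = 2.0e9, 1.0e9   # illustrative values in Pa

sigma = isotropic_stress(strain, lam, mu)
sigma_compact = lam * np.trace(strain) * np.eye(3) + 2.0 * mu * strain

# With mu = 0 (a fluid) only the isotropic, pressure-like part remains
sigma_fluid = isotropic_stress(strain, lam, 0.0)
print(np.allclose(sigma, sigma_compact))
```

Setting $\mu=0$ leaves a stress that is a multiple of the identity, which is the fluid case considered next.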

In the fluid regions of the Earth (e.g. oceans and outer core) the shear modulus $\mu$ (one of the *Lamé parameters*) is effectively zero. In such cases the stress tensor reduces to

\begin{equation}
  \sigma_{ij} = \kappa\delta_{ij}\nabla\cdot\mathbf{u}=-p\delta_{ij},
  \label{eq:awe3} \tag{10}
\end{equation}

where we have introduced the scalar pressure $p:=-\kappa\nabla\cdot\mathbf{u}$ and $\kappa$ ($=\lambda+\frac{2}{3}\mu$) is the *bulk modulus* (which has a more straightforward physical interpretation). Hence, $\eqref{eq:awe0}$ reduces to

\begin{equation}
  \rho\ddot{\mathbf{u}}+\nabla p = \mathbf{f}.
  \label{eq:awe4} \tag{11}
\end{equation}

(Note: $\ddot{f}\equiv\frac{\partial^2 f}{\partial t^2}$.) Dividing by the density $\rho$ and taking the divergence gives

\begin{equation}
  \nabla\cdot\ddot{\mathbf{u}}+\nabla\cdot(\rho^{-1}\nabla p) = \nabla\cdot(\rho^{-1}\mathbf{f}).
  \label{eq:awe5} \tag{12}
\end{equation}

Using our definition of pressure we can then eliminate $\mathbf{u}$ leaving

\begin{equation}
  \kappa^{-1}\ddot{p}-\nabla\cdot(\rho^{-1}\nabla p) = -\nabla\cdot(\rho^{-1}\mathbf{f}).
  \label{eq:awe6} \tag{13}
\end{equation}

Provided the density varies much more slowly than the pressure field $p$ and the source term $\mathbf{f}$, we can simplify further to obtain

\begin{equation}
  \ddot{p}-v_p^2\nabla^2p=-v_p^2\nabla\cdot\mathbf{f},
  \label{eq:awe7} \tag{14}
\end{equation}

with the *acoustic wave speed* $v_p:=\sqrt{\frac{\kappa}{\rho}}$. Dividing through by $v_p^2$ recovers the 'main' equation of FWI introduced earlier, $\eqref{eq:we0}$, if we let the wavefield $u=p$ and the wave speed $c=v_p$, and 'abstract away' any intricacies in the source term such that we have simply $s=-\nabla\cdot\mathbf{f}$.
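As a quick worked example of the wave-speed definition, the snippet below evaluates $\sqrt{\kappa/\rho}$ for water. The bulk modulus and density are round, approximate textbook values chosen for illustration, and the function name `acoustic_wave_speed` is hypothetical.

```python
import numpy as np

def acoustic_wave_speed(bulk_modulus, density):
    """Acoustic wave speed sqrt(kappa / rho); works on scalars or arrays."""
    return np.sqrt(np.asarray(bulk_modulus) / np.asarray(density))

kappa_water = 2.2e9   # Pa, approximate illustrative value
rho_water = 1000.0    # kg/m^3, approximate illustrative value

v_water = acoustic_wave_speed(kappa_water, rho_water)
print(f"{v_water:.0f} m/s")
```

The result, a little under 1500 m/s, is in line with the speed of sound usually quoted for water.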


### Matrix form

The wave equation represents a linear relationship between a wavefield $u$ and the source $s$ that generates the wavefield. After discretisation (with for example finite differences) we can therefore write $\eqref{eq:we0}$ as a matrix equation

\begin{equation}
  \mathbf{A}\mathbf{u}=\mathbf{s},
\label{eq:we1} \tag{15}
\end{equation}

where $\mathbf{u}$ and $\mathbf{s}$ are column vectors that represent the wavefield and source at discrete points
in space and time, and $\mathbf{A}$ is a matrix that represents the discrete numerical implementation of the operator

\begin{equation}
  \frac{1}{c^2}\frac{\partial^2}{\partial t^2}-\nabla^2.
\label{eq:we3} \tag{16}
\end{equation}

Although the wave equation represents a linear relationship between $u$ and $s$, it also represents a non-linear relationship between a model $\mathbf{m}$ and wavefield $\mathbf{u}$. Thus we can also write the wave equation as

\begin{equation}
  G(\mathbf{m})=\mathbf{u}.
\label{eq:we4} \tag{17}
\end{equation}

Here $\mathbf{m}$ is a column vector that contains the model parameters. Commonly these will be the values of $c$ at every point in the model, but they may be any set of parameters that is sufficient to describe the model, for example slownesses $1/c$. Note that in equation $\eqref{eq:we4}$ $G$ is not a matrix; it is a (non-linear) function that describes how to calculate a wavefield $\mathbf{u}$ given a model $\mathbf{m}$.

Note that the form of matrix $\mathbf{A}$ depends upon both the model properties and the details of the numerical implementation, and that the form of the function $G$ depends upon the source and the acquisition geometry. The form of $\mathbf{A}$ does not depend upon the source and the form of $G$ does not depend upon the model.

It is common in FWI to construct the numerical wave equation in $\eqref{eq:we1}$ such that the matrix
$\mathbf{A}$ represents a wave travelling forward in time, and its transpose represents a wave travelling
backwards in time. This is not essential, but it is often straightforward to achieve, in which
case it simplifies the numerics of FWI.
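To make the matrix form concrete, the sketch below assembles $\mathbf{A}$ explicitly for a very small 1D problem using an explicit second-order finite-difference stencil, and checks that solving $\mathbf{Au}=\mathbf{s}$ gives the same wavefield as ordinary time stepping. The grid sizes, the wave speeds and the source position are arbitrary illustrative choices, and the time-major ordering of the unknowns is an assumption of this sketch. Real FWI codes never form $\mathbf{A}$ explicitly, because it is far too large.

```python
import numpy as np

nx, nt = 6, 8
dx, dt = 1.0, 0.4
c = np.linspace(1.0, 1.5, nx)     # wave speed at each grid point

# Second difference in space, with u = 0 outside the grid
laplacian = (np.eye(nx, k=1) - 2.0 * np.eye(nx) + np.eye(nx, k=-1)) / dx**2

# Time-shift matrices: shift1 selects u^{n-1}, shift2 selects u^{n-2}
shift1 = np.eye(nt, k=-1)
shift2 = np.eye(nt, k=-2)
d_tt = (np.eye(nt) - 2.0 * shift1 + shift2) / dt**2

# Row block n encodes (u^n - 2u^{n-1} + u^{n-2}) / (c^2 dt^2) - L u^{n-1} = s^n
A = np.kron(d_tt, np.diag(1.0 / c**2)) - np.kron(shift1, laplacian)

# Source: a single impulse at the middle grid point on the second time step
s = np.zeros((nt, nx))
s[1, nx // 2] = 1.0

# Solve A u = s as one linear system ...
u_matrix = np.linalg.solve(A, s.ravel()).reshape(nt, nx)

# ... and by explicit time stepping from rest
u_step = np.zeros((nt, nx))
for n in range(nt):
    u_nm1 = u_step[n - 1] if n >= 1 else np.zeros(nx)
    u_nm2 = u_step[n - 2] if n >= 2 else np.zeros(nx)
    u_step[n] = 2.0 * u_nm1 - u_nm2 + (c * dt) ** 2 * (s[n] + laplacian @ u_nm1)

is_lower_triangular = np.allclose(A, np.tril(A))
print(is_lower_triangular, np.allclose(u_matrix, u_step))
```

With this ordering $\mathbf{A}$ is lower triangular, so each time level depends only on earlier ones. Its transpose is upper triangular, so a solve with $\mathbf{A}^T$ marches from the last time level to the first, which is the sense in which the transpose represents propagation backwards in time.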

## The Objective Function

The central purpose of FWI is to find a physical model of the wave-transmitting medium that
minimises the difference between an observed dataset and the same dataset as predicted by
the model. Consequently we need a means to measure this difference. There are many ways
to do this, but the most common is a *least-squares* formulation where we seek to minimise the
sum of the squares of the differences between the two datasets over all sources and receivers,
and over all times. That is, we seek to find a model that minimises the square of the $L_2$-norm
of the *data residuals*.

The $L_2$-norm expresses the misfit between the two datasets as a single number. This number
is variously called the *cost function*, the *objective function*, the *misfit function*, or just the
*functional*. It is typically given the symbol $f$. It is a real non-negative scalar quantity, and it is
a function of the model $\mathbf{m}$. In practice, a factor of a half is often included in the definition
of the objective function to 'simplify' the formulation (as we will see later). Define:

\begin{equation}
  f(\mathbf{m})=\frac{1}{2}||\mathbf{p}-\mathbf{d}||^2=\frac{1}{2}||\delta\mathbf{d}||^2=\frac{1}{2}\delta\mathbf{d}^{T}\delta\mathbf{d}=\frac{1}{2}\sum_{n_s}\sum_{n_r}\sum_{n_t}|p-d|^2,
\label{eq:oe0} \tag{18}
\end{equation}

where $n_s$, $n_r$ and $n_t$ are the number of sources, receivers and time samples in the data set,
and $\mathbf{d}$ and $\mathbf{p}$ are the observed and predicted datasets.
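In code, the least-squares objective is a single reduction over the residual array. The following sketch assumes the data are stored as an array of shape `(n_s, n_r, n_t)`; the function name `least_squares_misfit` and the tiny synthetic dataset are hypothetical.

```python
import numpy as np

def least_squares_misfit(predicted, observed):
    """f = 0.5 * ||p - d||^2, summed over sources, receivers and time samples."""
    residual = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
    return 0.5 * np.sum(residual**2)

# Hypothetical dataset: 2 sources x 3 receivers x 4 time samples
rng = np.random.default_rng(0)
d = rng.standard_normal((2, 3, 4))
p = d + 0.1     # predictions that are off by 0.1 in every sample

f_value = least_squares_misfit(p, d)
print(f_value)  # 0.5 * 24 * 0.1**2 = 0.12, up to rounding
```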

To minimise this function with respect to the model parameters $\mathbf{m}$, we have to differentiate
$f$ with respect to $\mathbf{m}$, set the differentials equal to zero, and solve for $\mathbf{m}$.

## Local Inversion

FWI is a local iterative inversion scheme. It begins from a starting model $\mathbf{m}_0$ that is
assumed to be sufficiently close to the true model, and it seeks to make a series of step-wise
improvements to this model which successively reduce the objective function towards zero.
Thus, we need to consider the objective function for a starting model $\mathbf{m}_0$ and a new model
$\mathbf{m}=\mathbf{m}_0+\delta\mathbf{m}$.

Recall the Taylor series, truncated to second order, for a scalar function of a single *scalar*
variable is

\begin{equation}
  f(x) = f(x_0+\delta x)=f(x_0)+\delta x\frac{\mathrm{d} f}{\mathrm{d} x}\biggr|_{x=x_0}+\frac{1}{2}\delta x^2 \frac{\mathrm{d}^2f}{\mathrm{d}x^2}\biggr|_{x=x_0}+\mathcal{O}(\delta x^3).
  \label{eq:li0} \tag{19}
\end{equation}

For a scalar function of a *vector*, the analogous expression is

\begin{equation}
  f(\mathbf{m}) = f(\mathbf{m}_0+\delta\mathbf{m})=f(\mathbf{m}_0)+\delta\mathbf{m}^T\frac{\partial f}{\partial \mathbf{m}}\biggr|_{\mathbf{m}=\mathbf{m}_0}+\frac{1}{2}\delta\mathbf{m}^T \frac{\partial^2f}{\partial \mathbf{m}^2}\biggr|_{\mathbf{m}=\mathbf{m}_0}\delta\mathbf{m}+\mathcal{O}(\delta \mathbf{m}^3).
  \label{eq:li1} \tag{20}
\end{equation}

Now we must differentiate this equation with respect to $\mathbf{m}$, and set the result to zero in order to
minimise $f$ with respect to $\mathbf{m}_0+\delta\mathbf{m}$. Note that differentiating with respect to $\mathbf{m}$ is the same as differentiating with respect to $\delta\mathbf{m}$ because $\mathbf{m}_0$ is constant. Note also that $f$, $\partial f/\partial\mathbf{m}$ and $\partial^2 f/\partial\mathbf{m}^2$, evaluated at $\mathbf{m}=\mathbf{m}_0$, do not depend upon $\delta\mathbf{m}$. Thus, when we differentiate equation $\eqref{eq:li1}$ with respect to $\mathbf{m}$, we obtain

\begin{equation}
  \frac{\partial f}{\partial \mathbf{m}}\biggr|_{\mathbf{m}=\mathbf{m}_0+\delta \mathbf{m}} = \frac{\partial f}{\partial \mathbf{m}}\biggr|_{\mathbf{m}=\mathbf{m}_0}+\left(\frac{1}{2}\delta\mathbf{m}^T \frac{\partial^2f}{\partial \mathbf{m}^2}\biggr|_{\mathbf{m}=\mathbf{m}_0}\right)^T+\frac{1}{2}\frac{\partial^2f}{\partial \mathbf{m}^2}\biggr|_{\mathbf{m}=\mathbf{m}_0}\delta\mathbf{m}+\ldots.
  \label{eq:li2} \tag{21}
\end{equation}

Setting this equal to zero, and combining the two middle terms (which are equal because the Hessian is symmetric), gives

\begin{equation}
  \frac{\partial f}{\partial \mathbf{m}}\biggr|_{\mathbf{m}=\mathbf{m}_0}+\frac{\partial^2f}{\partial \mathbf{m}^2}\biggr|_{\mathbf{m}=\mathbf{m}_0}\delta\mathbf{m}+\mathcal{O}(\delta\mathbf{m}^2)=0.
  \label{eq:li3} \tag{22}
\end{equation}

Neglecting second-order terms, and rearranging, gives an expression for the update to the
model $\delta\mathbf{m}$:

\begin{equation}
  \delta\mathbf{m} \approx - \left(\frac{\partial^2 f}{\partial\mathbf{m}^2}\right)^{-1}\frac{\partial f}{\partial\mathbf{m}} \equiv -\mathbf{H}^{-1}\nabla_{\mathbf{m}}f.
\label{li4} \tag{23}
\end{equation}

Here $\nabla_{\mathbf{m}}f$ is the *gradient* of the objective function $f$ with respect to the model parameters,
and $\mathbf{H}$ is the *Hessian* matrix of second differentials, both evaluated at $\mathbf{m}_0$.

If the model has $n$ parameters, then the gradient is a column vector of length $n$, and the
Hessian is an $n \times n$ symmetric matrix. Methods that solve the inversion problem using
equation $\eqref{li4}$ directly are called *Newton* methods. Methods that use equation $\eqref{li4}$ with a
'reasonable' approximation to the Hessian are called Gauss-Newton or quasi-Newton methods
depending upon how the approximation is formulated.
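For an objective function that is exactly quadratic, the Newton update lands on the minimum in a single step. The sketch below illustrates this with a hypothetical two-parameter objective $f(\mathbf{m})=\frac{1}{2}\mathbf{m}^T\mathbf{H}\mathbf{m}-\mathbf{b}^T\mathbf{m}$; the matrix, vector and starting model are arbitrary illustrative values.

```python
import numpy as np

# Hypothetical quadratic objective f(m) = 0.5 m^T H m - b^T m
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # symmetric positive-definite Hessian
b = np.array([1.0, 2.0])

def objective(m):
    return 0.5 * m @ H @ m - b @ m

def gradient(m):
    return H @ m - b

m0 = np.array([5.0, -3.0])

# Newton update: solve H dm = -grad rather than forming the inverse explicitly
dm = -np.linalg.solve(H, gradient(m0))
m_new = m0 + dm

print(m_new, gradient(m_new))
```

The FWI objective is not quadratic in the model, so in practice a Newton step only improves the model and has to be iterated.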

## Steepest Descent

If the number of model parameters $n$ is large, calculating the Hessian is a major
undertaking, and inverting it is not normally computationally feasible. Consequently the
method that is typically used is to replace the inverse of the Hessian in equation ([23](#mjx-eqn-li4)) by a
simple scalar $\alpha$; this scalar is called the step length. We now have

\begin{equation}
  \delta\mathbf{m} = -\alpha\frac{\partial f}{\partial \mathbf{m}} = -\alpha\nabla_{\mathbf{m}}f .
  \tag{24}
\end{equation}

The method that uses this approach is called the method of *steepest descent*, and in its
simplest form it consists of the following steps:
1. start from a model $\mathbf{m}_0$,
2. evaluate the gradient of the objective function, $\nabla_{\mathbf{m}}f$, for the current model,
3. find the step length $\alpha$,
4. subtract $\alpha$ times the gradient from the current model to obtain a new model,
5. iterate from step 2 using the new model until the objective function is sufficiently small (or we run out of patience).

To implement this, we need a method of calculating the local gradient.
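The steepest-descent steps listed above can be sketched on a toy quadratic objective, where the gradient is known analytically. For this special case the optimal step length along the negative gradient has the closed form $\alpha=\mathbf{g}^T\mathbf{g}/\mathbf{g}^T\mathbf{H}\mathbf{g}$, which is used here purely for brevity; it is an assumption of the sketch, and in FWI the step length would be found by a line search instead. All names and values are illustrative.

```python
import numpy as np

# Hypothetical quadratic objective f(m) = 0.5 m^T H m - b^T m
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def objective(m):
    return 0.5 * m @ H @ m - b @ m

def gradient(m):
    return H @ m - b

def steepest_descent(m0, max_iterations=200, tolerance=1e-10):
    m = np.array(m0, dtype=float)          # step 1: starting model
    history = [objective(m)]
    for _ in range(max_iterations):
        g = gradient(m)                    # step 2: gradient at current model
        if np.linalg.norm(g) < tolerance:  # stop once the gradient is negligible
            break
        alpha = (g @ g) / (g @ H @ g)      # step 3: exact step for a quadratic
        m = m - alpha * g                  # step 4: update the model
        history.append(objective(m))       # step 5: iterate
    return m, history

m_final, history = steepest_descent([5.0, -3.0])
print(m_final, len(history) - 1)
```

Unlike the single Newton step, steepest descent needs a number of iterations even on this two-parameter problem, but each iteration requires only the gradient and never the Hessian inverse.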

## Calculating the gradient

In principle, we could find the gradient by perturbing each of the model parameters in turn,
and calculating what happens to the objective function each time. For $n$ model parameters,
that would require $n+1$ modelling runs, and this is not computationally feasible. Fortunately
there is a faster way using a solution to the *adjoint* problem.

First, write the gradient in terms of the residual data $\delta\mathbf{d}=\mathbf{p}-\mathbf{d}$:

\begin{equation}
  \nabla_{\mathbf{m}}f=\frac{\partial f}{\partial \mathbf{m}}=\frac{\partial}{\partial \mathbf{m}}\left(\frac{1}{2}\delta\mathbf{d}^T\delta\mathbf{d}\right)=\frac{\partial (\mathbf{p}-\mathbf{d})^T}{\partial \mathbf{m}}\delta\mathbf{d}=\left(\frac{\partial \mathbf{p}}{\partial \mathbf{m}}\right)^T\delta\mathbf{d}.
  \label{eq:rw0} \tag{25}
\end{equation}

Now, write the wave equation for a dataset $\mathbf{p}$ generated by a source $\mathbf{s}$ as

\begin{equation}
  \mathbf{Au}=\mathbf{s},
  \label{eq:rwe1} \tag{26}
\end{equation}

where $\mathbf{p}$ is the subset of the full wavefield $\mathbf{u}$ that is located at the receiver positions.
Mathematically, we can extract the data at the receivers from the data everywhere in the
model simply by using a diagonal *restriction* matrix $\mathbf{R}$ that has non-zero unit values only
where there exists observed data. That is

\begin{equation}
  \mathbf{p}=\mathbf{Ru}.
  \label{eq:rwe2} \tag{27}
\end{equation}

Now, differentiate equation $\eqref{eq:rwe1}$ with respect to $\mathbf{m}$ to obtain

\begin{equation}
  \frac{\partial\mathbf{A}}{\partial\mathbf{m}}\mathbf{u}+\mathbf{A}\frac{\partial\mathbf{u}}{\partial\mathbf{m}}=\frac{\partial\mathbf{s}}{\partial\mathbf{m}}=0,
  \label{eq:rwe3} \tag{28}
\end{equation}

which is equal to zero because the source $\mathbf{s}$ does not depend upon the model $\mathbf{m}$.
Rearranging gives

\begin{equation}
  \frac{\partial\mathbf{u}}{\partial\mathbf{m}}=-\mathbf{A}^{-1}\frac{\partial\mathbf{A}}{\partial\mathbf{m}}\mathbf{u},
  \label{eq:rwe4} \tag{29}
\end{equation}

and pre-multiplying $\eqref{eq:rwe4}$ by the matrix $\mathbf{R}$ extracts the wavefield only at those points where
we have data.

So now, to find the variation of the data with the model, we have

\begin{equation}
  \frac{\partial\mathbf{p}}{\partial\mathbf{m}}=\mathbf{R}\frac{\partial\mathbf{u}}{\partial\mathbf{m}}=-\mathbf{R}\mathbf{A}^{-1}\frac{\partial\mathbf{A}}{\partial\mathbf{m}}\mathbf{u}.
  \label{eq:rwe5} \tag{30}
\end{equation}

Substituting $\eqref{eq:rwe5}$ into $\eqref{eq:rw0}$ and rearranging results in the following expression for the gradient

\begin{equation}
  \nabla_{\mathbf{m}}f=-\mathbf{u}^T\left(\frac{\partial \mathbf{A}}{\partial \mathbf{m}}\right)^T(\mathbf{A^{-1}})^T\mathbf{R}^T\delta\mathbf{d}.
  \label{eq:rwe6} \tag{31}
\end{equation}

Therefore, to find the gradient, we must calculate the forward wavefield $\mathbf{u}$, differentiate
the numerical operator $\mathbf{A}$ with respect to the model parameters (this is an operation that
we can do analytically), and we must also compute the final term $(\mathbf{A}^{-1})^T\mathbf{R}^T\delta\mathbf{d}$. We must multiply these terms together at all times, and for all sources, and sum these together to
give a value corresponding to each parameter within the model; typically this means one
value of $\nabla_{\mathbf{m}}f$ at each grid point within the model.

## Interpreting the expression for the gradient

The final term in ([31](#mjx-eqn-eq:rwe6)) represents the back-propagation of the residual data $\delta\mathbf{d}$. This can be seen by writing the term as

\begin{equation}
  (\mathbf{A^{-1}})^T\mathbf{R}^T\delta\mathbf{d}=\delta\mathbf{u},
  \label{eq:ig0} \tag{32}
\end{equation}

which can be rearranged to give

\begin{equation}
  \mathbf{A}^T\delta\mathbf{u}=\mathbf{R}^T\delta\mathbf{d}.
  \label{eq:ig1} \tag{33}
\end{equation}

The matrix $\mathbf{R}$ represents the operation of extracting the wavefield at the receivers; consequently, its transpose represents the operation of re-injecting the wavefield at the receivers
back into the model. Equation $\eqref{eq:ig1}$ then simply describes a wavefield $\delta\mathbf{u}$ that is generated by a (virtual) source $\delta\mathbf{d}$ located at the receivers, and that is propagated by the operator $\mathbf{A}^T$ which is the *adjoint* of the operator in the original wave equation. So the term that we
need to compute in ([31](#mjx-eqn-eq:rwe6)) is the wavefield generated by a *modified* wave equation with
the data residuals used as sources.

Typically, to simplify the computations in the time domain, the numerical operator $\mathbf{A}$ is
designed such that it is symmetrical (i.e. self adjoint) in space, and such that its transpose
is equivalent to a back propagation in time. Formulating the problem this way allows the
use of the same code to calculate both $\mathbf{A}$ and $\mathbf{A}^T$, where $\mathbf{A}$ propagates a wavefield forward in time, and $\mathbf{A}^T$ propagates a wavefield backward in time.

So now, to calculate the gradient, we must find

\begin{equation}
  \nabla_{\mathbf{m}}f=-\mathbf{u}^{T}\left(\frac{\partial\mathbf{A}}{\partial\mathbf{m}}\right)^T\delta\mathbf{u},
  \label{eq:ig2} \tag{34}
\end{equation}

where $\mathbf{u}$ is the calculated forward wavefield, $\delta\mathbf{u}$ represents the wavefield generated by back-propagating the data residuals, and the 'middle' term $\partial\mathbf{A}/\partial\mathbf{m}$ can be calculated analytically.
The expression for the gradient in $\eqref{eq:ig2}$ represents the zero lag of the temporal cross
correlation of the forward wavefield for a particular source with the back-propagated wavefield
generated by the data residuals at each receiver for that source, calculated at every point
in the model, and weighted and modified by an analytical expression that depends upon
how the model $\mathbf{m}$ and operator $\mathbf{A}$ have been defined. Typically, in the time domain, this
weighting and modification involves the double differential of the back-propagated wavefield
with respect to time, and a scaling by the local value of the slowness.
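The adjoint expression for the gradient can be verified on a small linear system. The sketch below uses a toy operator $\mathbf{A}(\mathbf{m})=\mathbf{K}+\mathrm{diag}(\mathbf{m})$, which is an assumption made so that $\partial\mathbf{A}/\partial m_i$ has a single unit entry at position $(i,i)$ and the gradient reduces to $-u_i\,\delta u_i$. It is not a discretised wave equation, and the matrix $\mathbf{K}$, the receiver positions and all names are hypothetical. The adjoint gradient, which costs two solves, is compared with a finite-difference gradient, which costs $n+1$ solves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Hypothetical model-independent part of the operator
K = 3.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
s = rng.standard_normal(n)                  # source vector

receivers = [1, 4, 6]
R = np.zeros((n, n))
R[receivers, receivers] = 1.0               # diagonal restriction matrix

def operator(m):
    # Toy operator A(m) = K + diag(m)
    return K + np.diag(m)

def forward(m):
    return np.linalg.solve(operator(m), s)

def misfit(m, d):
    return 0.5 * np.sum((R @ forward(m) - d) ** 2)

def adjoint_gradient(m, d):
    A = operator(m)
    u = np.linalg.solve(A, s)                       # forward wavefield
    delta_d = R @ u - d                             # residual data
    delta_u = np.linalg.solve(A.T, R.T @ delta_d)   # back-propagated residual
    # -u^T (dA/dm_i)^T delta_u reduces to -u_i * delta_u_i for this operator
    return -u * delta_u

m_true = rng.uniform(0.5, 1.5, n)
d_obs = R @ forward(m_true)                 # synthetic "observed" data
m0 = np.ones(n)                             # starting model

grad_adj = adjoint_gradient(m0, d_obs)

# Finite differences: one base run plus one perturbed run per parameter
h = 1e-6
f0 = misfit(m0, d_obs)
grad_fd = np.array([(misfit(m0 + h * e, d_obs) - f0) / h for e in np.eye(n)])

print(np.max(np.abs(grad_adj - grad_fd)))
```

The two gradients agree to within the accuracy of the finite-difference approximation, while the cost of the adjoint version does not grow with the number of model parameters.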

For one source, the gradient calculated this way requires only two modelling runs rather
than the $n+1$ modelling runs that direct methods require. For multi-source datasets, the
full gradient is a sum over all sources. In practical applications with real datasets, the wave equation will nearly always be modified in various ways to include additional physics, but this does not change the underlying
approach. Several other numerical enhancements will normally also be incorporated into
the basic FWI scheme. Simply ignoring the Hessian is a gross simplification and while it
is not normally possible to incorporate its effects fully, there are several possibilities for
approximating its effects - L-BFGS is widely used, as are conjugate gradients.

Useful links:
- **L-BFGS**: https://en.wikipedia.org/wiki/Limited-memory_BFGS
- **Conjugate gradient**: https://en.wikipedia.org/wiki/Conjugate_gradient_method

## General notation and definitions

|      Symbol   |  Definition  |
|---------------|--------------|
|   $\mathbf{v}$  | column vector (bold lower case letter)|
|      $||\mathbf{v}||$    | Euclidean norm of $\mathbf{v}$|
| $\mathbf{M}$ | matrix (bold capitalised letter)|
| $\mathbf{M}^T$ | transpose of $\mathbf{M}$; if $\mathbf{M}$ is real then $\mathbf{M}^T$ is the *adjoint* of $\mathbf{M}$ |
| $\mathbf{M}^{-1}$ | inverse of $\mathbf{M}$ |
| $\mathbf{H}$ | Hessian matrix |
| $\mathcal{O}(.)$ | Terms of order $.$ and higher |
| $x$ | scalar variable |
| $\delta x$ | infinitesimal perturbation to $x$ |
| $\nabla_{\mathbf{x}}$ | gradient with respect to $\mathbf{x}$, i.e. $\left(\frac{\partial}{\partial x_1}, \frac{\partial}{\partial x_2}, ..., \frac{\partial}{\partial x_n}\right)^T$ |
| $|_{\mathbf{x}=\mathbf{x}_0}$ | evaluation of a function of $\mathbf{x}$ at $\mathbf{x}_0$ |

## FWI notation and definitions

|      Symbol   |  Definition  |
|---------------|--------------|
| $\alpha$ | step length |
| $\mathbf{A}$ | operator resulting from the numerical discretisation of the wave-equation |
| $c$ | wave speed |
| $\mathbf{d}$ | observed dataset |
| $\delta\mathbf{d}$ | residual dataset,  $\delta\mathbf{d}=\mathbf{p}-\mathbf{d}$|
| $f$ | the functional (or objective/cost/misfit function) |
| $G$ | function that generates a wavefield from the model parameters |
| $\mathbf{m}$ | discretised model parameters |
| $\mathbf{m}_0$ | starting model parameters |
| $\delta\mathbf{m}$ | perturbation to model parameters |
| $n$ | number of parameters in a model |
| $n_r$ | number of receivers |
| $n_s$ | number of sources |
| $n_t$ | number of time samples |
| $\mathbf{p}$ | a predicted dataset - typically this will be a subset of the wavefield $\mathbf{u}$ |
| $\mathbf{R}$ | diagonal restriction matrix that selects a subset of a wavefield such that $\mathbf{p}=\mathbf{Ru}$ |
| $s$ | source term |
| $\mathbf{s}$ | source term at all locations in a discretised model |
| $t$ | time |
| $u$ | the wavefield |
| $\mathbf{u}$ | the wavefield at all locations in a discretised model |
| $\delta\mathbf{u}$ | wavefield generated through back-propagating the residuals |

## The $L_2$-norm

The square of the $L_2$-norm for a *real* vector $\mathbf{v}$ with $n$ elements is given by

\begin{equation}
  ||\mathbf{v}||^2 = \mathbf{v}^T\mathbf{v}=\sum_{i=1}^{n}v_i^2.
\end{equation}

It is the inner product of $\mathbf{v}$ with itself. It represents the square of the length of the vector
in Euclidean space. The $L_2$-norm always has a non-negative real scalar value.
If $\mathbf{v}$ represents the difference between two vectors, say $\mathbf{v} = \mathbf{v}_{computed}-\mathbf{v}_{observation}$, then the $L_2$-norm represents the distance between the two vectors $\mathbf{v}_{computed}$ and $\mathbf{v}_{observation}$. It is the square of this quantity that is minimised in *least-squares* problems.

## The Gradient

The *gradient* of the objective function $f$ with respect to the $n$ model parameters $\mathbf{m}$ is given by

\begin{align}
    \nabla_{\mathbf{m}}f &\equiv \frac{\partial f}{\partial\mathbf{m}} &\equiv \begin{bmatrix}
           \frac{\partial f}{\partial m_1} \\
           \frac{\partial f}{\partial m_2} \\
           \vdots \\
           \frac{\partial f}{\partial m_n}
         \end{bmatrix}
  \end{align}
  
The gradient is a vector that points in the direction of steepest ascent in the model space.
That is, if the model parameters are changed (by an appropriate amount) in the opposite direction to the gradient, then the objective function will decrease fastest.

## The Hessian

The Hessian matrix describes the variation of the objective function with respect to changes
in pairs of model parameters. It is a symmetric matrix of size $n\times n$ if there are $n$ model
parameters:

\begin{align}
    \mathbf{H} &\equiv \frac{\partial^2 f}{\partial \mathbf{m}^2} &\equiv
      \begin{bmatrix}
           \frac{\partial^2 f}{\partial m_1^2} & \frac{\partial^2 f}{\partial m_1\partial m_2} & \cdots & \frac{\partial^2 f}{\partial m_1\partial m_n} \\
           \frac{\partial^2 f}{\partial m_2\partial m_1} & \frac{\partial^2 f}{\partial m_2^2} & \cdots & \frac{\partial^2 f}{\partial m_2\partial m_n} \\  
           \vdots & \vdots & \ddots & \vdots \\
           \frac{\partial^2 f}{\partial m_n\partial m_1} & \frac{\partial^2 f}{\partial m_n\partial m_2} & \cdots & \frac{\partial^2 f}{\partial m_n^2}
      \end{bmatrix}
  \end{align}
  
The Hessian provides a measure of the local curvature of the $n$-dimensional error surface for
the current model.
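For a model with only a handful of parameters, both the gradient and the Hessian can be approximated directly by central finite differences, which makes the definitions above concrete. The sketch below is for illustration only: the function names, step sizes and toy objective are hypothetical, and the number of function evaluations makes this approach impractical for realistic FWI models.

```python
import numpy as np

def fd_gradient(f, m, h=1e-5):
    """Central-difference approximation to the gradient of f at m."""
    m = np.asarray(m, dtype=float)
    grad = np.zeros(m.size)
    for i in range(m.size):
        e = np.zeros(m.size)
        e[i] = h
        grad[i] = (f(m + e) - f(m - e)) / (2.0 * h)
    return grad

def fd_hessian(f, m, h=1e-4):
    """Central-difference approximation to the Hessian of f at m."""
    m = np.asarray(m, dtype=float)
    n = m.size
    hess = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n)
        ei[i] = h
        for j in range(n):
            ej = np.zeros(n)
            ej[j] = h
            hess[i, j] = (f(m + ei + ej) - f(m + ei - ej)
                          - f(m - ei + ej) + f(m - ei - ej)) / (4.0 * h**2)
    return hess

def toy_objective(m):
    # Hypothetical smooth objective with two model parameters
    return m[0] ** 2 + 3.0 * m[0] * m[1] + 2.0 * m[1] ** 2 + np.sin(m[0])

m0 = np.array([0.5, -1.0])
grad_num = fd_gradient(toy_objective, m0)
hess_num = fd_hessian(toy_objective, m0)

# Analytic values for comparison
grad_exact = np.array([2.0 * m0[0] + 3.0 * m0[1] + np.cos(m0[0]),
                       3.0 * m0[0] + 4.0 * m0[1]])
hess_exact = np.array([[2.0 - np.sin(m0[0]), 3.0],
                       [3.0, 4.0]])
print(grad_num, hess_num, sep="\n")
```

The numerical Hessian is symmetric to within rounding, as expected for a smooth objective.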