Writing docs

duncanwp · Aug 4, 2021 · 17c2461 · 17c2461
1 parent c8f5bdc
commit 17c2461
Show file tree

Hide file tree

Showing 3 changed files with 116 additions and 4 deletions.
diff --git a/docs/calibration.rst b/docs/calibration.rst
@@ -3,11 +3,56 @@
 Calibrating with ESEm
 =====================
 
+Having trained a fast, robust emulator this can be used to calibrate our model against available observations.
+Generally, this problem involves estimating the model parameters which could give rise to, or best match, the available observations.
+
+In other words, we would like to know the posterior probability distribution of the input parameters: :math:`p\left(\theta\middle| Y^0\right)`.
+
+Using Bayes' theorem, we can write this as:
+
+.. math::
+    p\left(\theta\middle| Y^0\right)=p\left(Y^0\middle|\theta\right)p(θ)p(Y0)
+    :label: Bayes
+
+Where the probability of an output given the input parameters, :math:`p\left(Y^0\middle|\theta\right)`, is referred to as the likelihood.
+While the model is capable of sampling this distribution, generally the full distribution is unknown and intractable, and we must approximate this likelihood.
+Depending on the purpose of the calibration and assumptions about the form of :math:`p\left(Y^0\middle| Y\right)`, different techniques can be used.
+In order to determine a (conservative) estimate of the parametric uncertainty in the model for example, we can use approximate Bayesian computation (ABC) to determine those parameters which are plausible given a set of observations.
+Alternatively, we may wish to know the optimal parameters to best match a set of observations and Markov-Chain Monte-Carlo based techniques might be more appropriate.
+Both of these sampling strategies are available in ESEm and we describe each of them here.
+
+
 Using approximate Bayesian computation (ABC)
 ============================================
 
+The simplest ABC approach seeks to approximate the likelihood using only samples from the simulator and a discrepancy function :math:`\rho`:
+.. math::
+    p\left(\theta\middle| Y^0\right)\propto p\left(Y^0\middle| Y\right)p\left(Y\middle|\theta\right)p(\theta)\approx\int{\mathbb{I}(\rho\left(Y^0,Y\right)\le\epsilon)\ \ p\left(Y\middle|\theta\right)\ p(\theta)\ dY}
+    :label: Eq 2
+where the indicator function :math:`\mathbb{I}(x)=1,&x is true0,&x is false`, and :math:`\epsilon` is a small discrepancy.
+This can then be integrated numerically using e.g., Monte-Carlo sampling of p(\theta).
+Any of those parameters for which \rho\left(Y^0,Y\right)\le\epsilon are accepted and those which do not are rejected.
+As \epsilon\rightarrow\infty therefore, all parameters are accepted and we recover p(\theta).
+For \epsilon=0, it can be shown that we generate samples from the posterior p\left(\theta\middle| Y^0\right) exactly.
+
+In practice however the simulator proposals will never exactly match the observations and we must make a pragmatic choice for both \rho and \epsilon. ESEm includes an implementation of the ‘implausibility metric’ (Williamson et al., 2013; Craig et al., 1996; Vernon et al., 2010)(Williamson et al., 2013; Craig et al., 1996; Vernon et al., 2010) which defines the discrepancy in terms of the standardized Cartesian distance:
+\rho(Y^0,Y(\theta))=\frac{\left|Y^0-Y(\theta)\right|}{\sqrt{\sigma_E^2+\sigma_Y^2+\sigma_R^2+\sigma_S^2}}=\rho(Y^0,\theta)	Eq. 3
+where the total standard deviation is taken to be the squared sum of the emulator variance (\sigma_E^2)\ and the uncertainty in the observations (\sigma_Y^2) and due to representation (\sigma_R^2) and structural model uncertainties (\sigma_S^2). As described above, the representation uncertainty represents the degree to which observations at a particular time and location can be expected to match the (typically aggregate) model output  (Schutgens et al., 2016a, b). While reasonable approximates can often be made of this and the observational uncertainties, the model structural uncertainties are typically unknown. In some cases, a multi-model ensemble may be available which can provide an indication of the structural uncertainties for particular observables (Sexton et al., 1995), but these are likely to underestimate true structural uncertainties as models typically share many key processes and assumptions (Knutti et al., 2013). Indeed, one benefit of a comprehensive analysis of the parametric uncertainty of a model is that this structural uncertainty can be explored and determined (Williamson et al., 2015).
+Framed in this way, \epsilon, can be thought of as representing the number of standard deviations the (emulated) model value is from the observations. While this can be treated as a free parameter and may be specified in ESEm, it is common to choose \epsilon=3\ since it can be shown that for unimodal distributions values of 3\sigma correspond to a greater than 95% confidence bound (Vysochanskij and Petunin, 1980).
+This approach is closely related to the approach of ‘history matching’ (Williamson et al., 2013) and can be shown to be identical in the case of fixed \epsilon and uniform priors (Holden et al., 2015b). The key difference being that history matching may result in an empty posterior distribution, that is, it may find no plausible model configurations which match the observations. With ABC on the other hand the epsilon is typically treated as a hyper-parameter which can be tuned in order to return a suitably large number of posterior samples. Both \epsilon and the prior distributions can be specified in ESEm and it can thus be used to perform either analysis. The speed at which samples can typically be generated from the emulator means we can keep \epsilon fixed as in history matching and generate as many samples as is required to estimate the posterior distribution.
+When multiple (\mathcal{N})\ observations are used (as is often the case) \rho can be written as a vector of implausibilities, \rho(Y_i^O,\theta) or simply \rho_i(\theta), and a modified method of rejection or acceptance must be used. A simple choice is to require \rho_i<\epsilon\ \forall\ i\ \in\mathcal{N}, however this can become restrictive for large \mathcal{N} due to the curse of dimensionality. The first step should be to reduce \mathcal{N} through the use of summary statistics as described above. An alternative is to introduce a tolerance (T) such that only some proportion of \rho_i need to be smaller than \epsilon: \sum_{i=0}^{\mathcal{N}}{H(}\rho_i\ -\ \epsilon)<T, where H is the Heaviside function (Johnson et al. 2019), although this is a somewhat unsatisfactory approach that can hide potential structural uncertainties. On the other hand, choosing T=0 as a first approximation and then identifying any particular observations which generate a very large implausibility provides a mechanism for identifying potential structural (or observational) errors. These can then be removed and noted for further investigation.
+
 
 Using Markov-chain Monte-Carlo
 ==============================
 
+The ABC method described above is simple and powerful, but somewhat inefficient as it repeatedly samples from the same prior. In reality each rejection or acceptance of a set of parameters provides us with extra information about the ‘true’ form of p\left(\theta\middle| Y^0\right) so that the sampler could spend more time in plausible regions of the parameter space. This can then allow us to use smaller values of \epsilon and hence find better approximations of  p\left(\theta\middle| Y^0\right).
+Given the joint probability distribution described by Eq. 2 and an initial choice of parameters \theta\prime and (emulated) output Y\prime, the acceptance probability r of a new set of parameters (\theta) is given by:
+r=\frac{p\left(Y^0\middle| Y\prime\right)p\left(\theta\prime\middle|\theta\right)p(\theta\prime)}{p\left(Y^0\middle| Y\right)p\left(\theta\middle|\theta\prime\right)p(\theta)}	Eq. 4
+In the default implementation of MCMC calibration ESEm uses the TensorFlow-probability implementation of Hamiltonian Monte-Carlo (HMC) (Neal, 2011) which uses the gradient information automatically calculated by TensorFlow to inform the proposed new parameters \theta. For simplicity, we assume that the proposal distribution is symmetric: p\left(\theta\prime\middle|\theta\right)\ =\ p\left(\theta\middle|\theta\prime\right), which is implemented as a zero log-acceptance correction in the initialisation of the TensorFlow target distribution. The target log probability provided to the TensorFlow HMC algorithm is then:
+log(r)=log(p\left(Y^0\middle| Y\prime\right))\ +\ log(p(\theta\prime))\ -\ log(p\left(Y^0\middle| Y\right))\ -\ log(p(\theta))\ 	Eq. 5
+Note, that for this implementation the distance metric \rho must be cast as a probability distribution with values [0, 1]. We therefore assume that this discrepancy can be approximated as a normal distribution centred about zero, with standard deviation equal to the sum of the squares of the variances as described in Eq. 3:
+p\left(Y^0\middle| Y\right)\approx{\frac{1}{\sigma_t\sqrt{2\pi}}e}^{-\frac{1}{2}\left(\frac{Y^0-Y}{\sigma_t}\right)^2},\ \ \sigma_t=\sqrt{\sigma_E^2+\sigma_Y^2+\sigma_R^2+\sigma_S^2}\ 	Eq. 5
+The implementation will then return the requested number of accepted samples as well as reporting the acceptance rate, which provides a useful metric for tuning the algorithm. It should be noted that MCMC algorithms can be sensitive to a number of key parameters, including the number of burn-in steps used (and discarded) before sampling occurs and the step size. Each of these can be controlled via keyword arguments to the sampler.
+This approach can provide much more efficient sampling of the emulator and provide improved parameter estimates, especially when used with informative priors which can guide the sampler.
 
diff --git a/docs/emulation.rst b/docs/emulation.rst
@@ -3,15 +3,82 @@
 Emulating with ESEm
 ===================
 
-ESEm provides a simple and streamlined interface to emulate earth system datasets stored as iris Cubes.
-The corresponding predictors can be provided as a numpy array or pandas DataFrame.
-This emulation is essentially just a regression and can be performed using a variety of techniques using the same API.
+ESEm provides a simple and streamlined interface to emulate earth system datasets stored as iris Cubes, denoted :math:`Y` in the following documentation.
+The corresponding predictors (:math:`X`) can be provided as a numpy array or pandas `DataFrame`s.
+This emulation is essentially just a regression estimating the functional form:
+
+.. math::
+
+   Y \approx f(X)
+
+and can be performed using a variety of techniques using the same API.
 
 
 Gaussian processes emulation
 ============================
 
-The ESEm Gaussian process emulation module provides a wrapper around the `GPFlow <https://gpflow.readthedocs.io/en/master/#>`_ implementation.
+Gaussian processes (GPs) are a popular choice for model emulation due to their simple formulation and robust uncertainty estimates, particularly in cases of relatively small amounts of training data.
+Many excellent texts are available to describe their implementation and use (Rasmussen and Williams, 2005) and we only provide a short description here.
+Briefly, a GP is a stochastic process (a distribution of continuous functions) and can be thought of as an infinite dimensional normal distribution (hence the name).
+
+The ESEm GP emulation module provides a thin wrapper around the `GPFlow <https://gpflow.readthedocs.io/en/master/#>`_ implementation.
+Please see their documentation for a detailed description of the implementation.
+
+An important consideration when using RP regression is the form of the covariance matrix, or kernel. Typical kernels include: constant; linear; radial basis function (RBF; or squared exponential); and Matérn 3/2 and 5/2 which are only once and twice differentiable respectively.
+Kernels can also be designed to represent any aspect of the functions of interest such as non-stationarity or periodicity.
+This choice can often be informed by the physical setting and provides greater control and interpretability of the resulting model compared to e.g., Neural Networks.
+
+.. Note::
+    By default, ESEm uses a combination of linear, RBF and polynomial kernels which are suitable for the smooth and continuous parameter response expected for the examples used in this paper and related problems.
+    However, given the importance of the kernel for determining the form of the functions generated by the GP we have also included the ability for users to specify combinations of other common kernels.
+    See e.g., `Duvenaud, 2011 <https://www.cs.toronto.edu/~duvenaud/thesis.pdf>`_ for a clear description of some common kernels and their combinations, as well as work towards automated methods for choosing them.
+
+The framework provided by GPFlow also allows for multi-output GP regression and ESEm takes advantage of this to automatically provide regression over each of the output features provided in the training data.
+E.g. :math:`Y` can be of arbitrary dimensionality. It will be automatically flattened and reshaped before being passed to GPFlow.
+
+.. code-block:: python
+
+    from esem import gp_model
+
+
+
+Neural network emulation
+========================
+
+Through the development of automatic differentiation and batch-gradient descent it has become possible to efficiently train very large (millions of parameters), deep (dozens of layers) neural networks, using large amounts (terabytes) of training data.
+The price of this scalability is the risk of overfitting, and the lack of any information about the uncertainty of the outputs.
+However, both of these shortcomings can be addressed using a technique known as ‘dropout’ whereby individual weights are randomly set to zero and effectively ‘dropped’ from the network.
+During training this has the effect of forcing the network to learn redundant representations and reduce the risk of overfitting (Srivastava et al., 2014).
+More recently it was shown that applying the same technique during inference casts the NN as approximating Bayesian inference in deep Gaussian processes and can provide a well calibrated uncertainty estimate on the outputs (Gal and Ghahramani, 2015).
+The convolutional layers within these networks also take into account spatial correlations which cannot currently be directly modelled by GPs (although dimension reduction in the input can have the same effect).
+The main drawback with a CNN based emulator is that they typically need a much larger amount of training data than GP based emulators.
+
+While fully connected neural networks have been used for many years, even in climate science (Knutti et al., 2006; Krasnopolsky et al., 2005), the recent surge in popularity has been powered by the increases in expressibility provided by deep, convolutional neural networks (CNNs) and the regularisation techniques which prevent these huge models from over-fitting the large amounts of training data required to train them.
+Many excellent introductions can be found elsewhere but, briefly, a neural network consists of a network of nodes connecting (through a variety of architectures) the inputs to the target outputs via a series of weighted activation functions.
+The network architecture and activation functions are typically chosen a-priori and then the model weights are determined through a combination of back-propagation and (batch) gradient descent until the outputs match (defined by a given loss function) the provided training data. As previously discussed, the random dropping of nodes (by setting the weights to zero), termed dropout, can provide estimates of the prediction uncertainty of such networks.
+The computational efficiency of such networks and the rich variety of architectures available have made them the tool of choice in many machine learning settings, and they are starting to be used in climate sciences for emulation (Dagon et al., 2020), although the large amounts of training data required have so far limited their use somewhat.
+
+.. image:: images/CNN_diagram.png
+  :width: 400
+  :alt: Schematic illustration of the structure of the default convolutional neural network model.
+
+
+Random forest emulation
+=======================
+
+ESEm also provides the option for emulation with Random Forests using the open-source implementation provided by scikit-learn.
+Random Forest estimators are comprised of an ensemble of decision trees; each decision tree is a recursive binary partition over the training data and the predictions are an average over the predictions of the decision trees (Breiman, 2001).
+As a result of this architecture, Random Forests (along with other algorithms built on decision trees) have two main attractions.
+Firstly, they require very little pre-processing of the inputs as the binary partitions are invariant to monotonic rescaling of the training data.
+Secondly, and of particular importance for climate problems, they are unable to extrapolate outside of their training data because the predictions are averages over subsets of the training dataset.
+As a result of this, a Random Forest trained on output from an idealized GCM was shown to automatically conserve water and energy (O’Gorman and Dwyer, 2018).
+
+These features are of particular importance for problems involving the parameterization of sub-grid processes in climate models (Beucler et al., 2021) and as such, although parameterization is not the purpose of ESEm, we include a simple Random Forest implementation and hope to build on this in future.
+
+
+Data processing
+===============
+
 
 
 Feature selection

diff --git a/docs/images/CNN_diagram.png b/docs/images/CNN_diagram.png