<a href="https://colab.research.google.com/github/comet-toolkit/comet_training/blob/main/defining_digital_effects_table.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Defining digital effects tables
================================

In this notebook, we will show how to create a digital effects table with obsarray (which can be propagated through a measurement function using punpy). 
First, we show how obsarray can be used as a templater for efficiently making xarray datasets (both with and without uncertainties). We show how, using obsarray's special variable types (uncertainties and flags), datasets including detailed uncertainty and covariance information as well as quality flags can be created.
Finally, we define an example for a digital effects table quantifying the uncertainties and error-correlation of the gas temperature, pressure and amount of substance (number of moles). Using such a dataset, the uncertainties can be efficiently and easily propagated through a measurement function using punpy (see [this notebook](https://colab.research.google.com/github/comet-toolkit/comet_training/blob/master/training/punpy_digital_effects_table_example.ipynb)).

We first install and import the obsarray package (and xarray):

In [None]:
!pip install "obsarray>=1.0.0"

In [None]:
import obsarray
import xarray as xr

Using obsarray as a templater
================================

**obsarray** can create ``xarray.Dataset`` objects that follow a particular template. Templates are defined as ``dict`` objects (referred to hereafter as **template** dictionaries) and can range from very simple to more complex. Every key in the **template** dictionary is the name of a variable, and the corresponding entry is a further variable specification dictionary (referred to hereafter as a **variable** dictionary).

So a **template** dictionary may look something like this:

    template = {
        "temperature": temperature_variable,
        "u_temperature": u_temperature_variable
    }

Each **variable** dictionary defines the following entries:

* ``dim`` - list of variable dimension names.
* ``dtype`` - variable data type, generally a ``numpy.dtype``, though particular values may be required for some special variables (see "Special variable types" below).
* ``attributes`` - dictionary of variable metadata; particular entries may be required for some special variables.
* ``encoding`` - (optional) variable [encoding](http://xarray.pydata.org/en/stable/user-guide/io.html?highlight=encoding#writing-encoded-data).

So for the previous example we may define:

In [None]:
import numpy as np

temperature_variable = {
    "dim": ["lon", "lat", "time"],
    "dtype": np.float32,
    "attributes": {"units": "K", "unc_comps": ["u_temperature"]}
}

u_temperature_variable = {
    "dim": ["lon", "lat", "time"],
    "dtype": np.float16,
    "attributes": {"units": "%"}
}


In [None]:
template = {
    "temperature": temperature_variable,
    "u_temperature": u_temperature_variable
}
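Before passing a template to obsarray, it can be useful to check that it is internally consistent. The following is a minimal sketch, not part of obsarray: `check_template` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration. It checks that each **variable** dictionary has the entries listed above and that every name in ``unc_comps`` is itself defined in the template.

```python
import numpy as np

# entries described above as required; "encoding" is optional
REQUIRED_KEYS = {"dim", "dtype", "attributes"}


def check_template(template):
    """Return a list of problems found in a template dictionary (hypothetical helper)."""
    problems = []
    for name, variable in template.items():
        missing = REQUIRED_KEYS - variable.keys()
        if missing:
            problems.append(f"{name}: missing entries {sorted(missing)}")
        # every uncertainty component listed in unc_comps should itself be a variable
        for comp in variable.get("attributes", {}).get("unc_comps", []):
            if comp not in template:
                problems.append(f"{name}: unc_comps refers to undefined variable {comp!r}")
    return problems


example_template = {
    "temperature": {
        "dim": ["lon", "lat", "time"],
        "dtype": np.float32,
        "attributes": {"units": "K", "unc_comps": ["u_temperature"]},
    },
    "u_temperature": {
        "dim": ["lon", "lat", "time"],
        "dtype": np.float16,
        "attributes": {"units": "%"},
    },
}

# an empty list means no problems were found
print(check_template(example_template))
```

A check like this catches a misspelt uncertainty variable name early, before the dataset is built.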

The following section details the special variable types that can be defined with **obsarray**.


Special variable types
------------------------

**obsarray**'s special variables allow the quick definition of a set of standardised variable formats. The following special variable types are available.

Uncertainties
_____________

[Recent work](https://www.mdpi.com/2072-4292/11/5/474/htm) in the Earth Observation metrology domain aims to standardise the representation of measurement uncertainty information in data, with a particular focus on capturing the error-covariance associated with the uncertainty. Although storing full error-covariance matrices is typically impractical for large measurement datasets, the error-covariance between measurements can often be parameterised efficiently. Work to standardise such parameterisations is ongoing (see, for example, the list of definitions from the EU H2020 FIDUCEO project in Appendix A of [this project report](https://ec.europa.eu/research/participants/documents/downloadPublic?documentIds=080166e5c84c9e2c&appId=PPGMS)).

**obsarray** enables the specification of such error-correlation parameterisations for uncertainty variables through the variable attributes. This is achieved by including an ``"err_corr"`` list entry in the ``attributes`` of a variable's **variable** dictionary. Each element of ``err_corr`` is a dictionary defining the error-correlation along one or more dimensions, which should include the following entries:

* ``dim`` (*str*/*list*) - name of the dimension(s) as a str or list of strs (i.e. names taken from the variable's ``dim`` list)
* ``form`` (*str*) - error-correlation form, defines functional form of error-correlation structure along
  dimension. Suggested error-correlation forms are defined in the table below.
* ``params`` (*list*) - (optional) parameters of the error-correlation structure defining function for dimension
  if required. The number of parameters required depends on the particular form.
* ``units`` (*list*) - (optional) units of the error-correlation function parameters for dimension
  (ordered as the parameters)

Measurement variables with uncertainties should include a list of ``unc_comps`` in their attributes, as in the above example.

An example ``err_corr`` dictionary may therefore look like:


In [None]:
err_corr = [
        {
            "dim": "x",
            "form": "err_corr_matrix",
            "params": "err_corr_var_x",
            "units": []
        },
        {
            "dim": "y",
            "form": "random",
            "params": [],
            "units": []
        }
]


If the error-correlation structure is not defined along a particular dimension (i.e. it is not included in ``err_corr``), the error-correlation is assumed random. Variable attributes are populated to the effect of this assumption.

| Form Name | Parameters | Description |
| --- | --- | --- |
| ``"random"`` | None required | Errors uncorrelated along dimension(s) |
| ``"systematic"`` | None required | Errors fully correlated along dimension(s) |
| ``"err_corr_matrix"`` | Error-correlation matrix variable name | Error-correlation for dimension(s) not parameterised, defined as a full matrix in another named variable in dataset. |
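To make the forms in the table concrete, the sketch below builds the correlation matrix that each form implies along a single dimension, and combines it with a vector of uncertainties to obtain a covariance matrix. This uses plain NumPy and is only an illustration of the definitions, not obsarray's implementation; `corr_matrix` is a hypothetical helper and the uncertainty values are made up.

```python
import numpy as np


def corr_matrix(form, size, matrix=None):
    """Correlation matrix along one dimension for a given err_corr form (illustrative only)."""
    if form == "random":
        return np.eye(size)           # errors uncorrelated: identity matrix
    if form == "systematic":
        return np.ones((size, size))  # errors fully correlated: all ones
    if form == "err_corr_matrix":
        return np.asarray(matrix)     # full matrix supplied explicitly
    raise ValueError(f"unknown error-correlation form: {form}")


u = np.array([0.1, 0.2, 0.3])  # made-up uncertainties along a dimension of size 3

for form in ("random", "systematic"):
    corr = corr_matrix(form, u.size)
    cov = np.outer(u, u) * corr  # cov_ij = u_i * u_j * corr_ij
    print(form)
    print(cov)
```

In both cases the diagonal of the covariance matrix holds the variances ``u**2``; only the off-diagonal terms differ between the forms.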

Flags
_____

Setting the ``"flag"`` dtype builds a variable in the [CF conventions flag format](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#flags). Each bit of a datum corresponds to a boolean condition flag with a given meaning.

The variable must be defined with an attribute that lists the per bit flag meanings as follows:

In [None]:
variables = {
    "quality_flag": {
        "dim": ["x", "y"],
        "dtype": "flag",
        "attributes": {
            "flag_meanings": ["good_data", "bad_data"]
        }
    }
}

The smallest necessary integer type is used as the flag variable's ``numpy.dtype``, given the number of flag meanings defined (e.g. 7 flag meanings result in an 8-bit integer variable).
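The bit-per-flag idea can be sketched with plain NumPy as follows. This is an illustration under stated assumptions rather than obsarray's own code: `smallest_flag_dtype` is a hypothetical helper, and the choice of unsigned integers is an assumption made here for simplicity.

```python
import numpy as np


def smallest_flag_dtype(n_flags):
    """Smallest unsigned integer dtype with at least n_flags bits (hypothetical helper)."""
    for bits in (8, 16, 32, 64):
        if n_flags <= bits:
            return np.dtype(f"uint{bits}")
    raise ValueError("more than 64 flag meanings are not supported in this sketch")


flag_meanings = ["good_data", "bad_data"]
dtype = smallest_flag_dtype(len(flag_meanings))

# one bit per meaning: good_data -> bit 0 (mask 1), bad_data -> bit 1 (mask 2)
masks = {name: 1 << bit for bit, name in enumerate(flag_meanings)}

flags = np.zeros((2, 3), dtype=dtype)
flags[0, 1] |= masks["bad_data"]                   # set the "bad_data" bit for one datum
is_bad = (flags & masks["bad_data"]).astype(bool)  # test that bit for every datum

print(dtype)
print(is_bad)
```

Because each meaning has its own bit, several flags can be set on the same datum without interfering with one another.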

Creating a template dataset
----------------------------

With the ``template`` dictionary prepared, only two more specifications are required to build a template dataset. First, a dictionary that defines the sizes of all the dimensions used in the ``template`` dictionary, e.g.:

In [None]:
dim_size = {"lon": 100, "lat": 50, "time": 10}

Secondly, a dictionary of dataset global metadata, e.g.:

In [None]:
metadata = {"dataset_name": "my cool image"}

Combining the above, a template dataset can be created as follows:


In [None]:
ds = obsarray.create_ds(
    template,
    dim_size,
    metadata
)

Here, ``ds`` is an empty xarray dataset with variables defined by the template definition. Fill values for the empty arrays are chosen using the [CF convention values](http://cfconventions.org/cf-conventions/cf-conventions.html#missing-data).

Populating and writing the dataset
------------------------------------

[Populating](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dictionary-like-methods) and [writing](http://xarray.pydata.org/en/stable/user-guide/io.html#reading-and-writing-files) the dataset can be achieved using xarray's built-in functionality. Here's a dummy example:

    ds["band_red"] = ...  # populate variable with red image array
    ds["band_green"] = ...  # populate variable with green image array
    ds["band_blue"] = ...  # populate variable with blue image array

    ds.to_netcdf("path/to/file.nc")

Defining the example digital effects table
==========================================




Here we provide a full example of creating a digital effects table. The example is for a dataset quantifying the uncertainties and error-correlation of the gas temperature, pressure and amount of substance (number of moles) to be used in the calculation of the volume through the ideal gas law (see [this notebook](https://colab.research.google.com/github/comet-toolkit/comet_training/blob/master/training/punpy_digital_effects_table_example.ipynb)). Uncertainty propagation becomes very straightforward with punpy, once this digital effects table has been defined:

In [None]:
import numpy as np
import obsarray

# define ds variables
template = {
    "temperature": {
        "dtype": np.float32,
        "dim": ["x", "y", "time"],
        "attributes": {
            "units": "K",
            "unc_comps": ["u_ran_temperature","u_sys_temperature"]
        }
    },
    "u_ran_temperature": {
        "dtype": np.float32,
        "dim": ["x", "y", "time"],
        "attributes": {
            "units": "K",
            "err_corr": [
              {
                  "dim": "x",
                  "form": "random",
                  "params": [],
                  "units": []
              },
              {
                  "dim": "y",
                  "form": "random",
                  "params": [],
                  "units": []
              },
              {
                  "dim": "time",
                  "form": "random",
                  "params": [],
                  "units": []
              }
          ]
        },
    },
    "u_sys_temperature": {
        "dtype": np.float32,
        "dim": ["x", "y", "time"],
        "attributes": {
            "units": "K",
            "err_corr": [
              {
                  "dim": "x",
                  "form": "systematic",
                  "params": [],
                  "units": []
              },
              {
                  "dim": "y",
                  "form": "systematic",
                  "params": [],
                  "units": []
              },
              {
                  "dim": "time",
                  "form": "systematic",
                  "params": [],
                  "units": []
              }
          ]
        }
    },
    "pressure": {
        "dtype": np.float32,
        "dim": ["x", "y", "time"],
        "attributes": {
            "units": "Pa",
            "unc_comps": ["u_str_pressure"]
        }
    },
    "u_str_pressure": {
        "dtype": np.float32,
        "dim": ["x", "y", "time"],
        "attributes": {
            "units": "Pa",
            "err_corr": [
              {
                  "dim": "x",
                  "form": "random",
                  "params": [],
                  "units": []
              },
              {
                  "dim": "y",
                  "form": "err_corr_matrix",
                  "params": "err_corr_str_pressure_y",
                  "units": []
              },
              {
                  "dim": "time",
                  "form": "systematic",
                  "params": [],
                  "units": []
              }
          ]
        },
    },
    "err_corr_str_pressure_y": {
        "dtype": np.float32,
        "dim": ["y", "y"],
        "attributes": {"units": ""},
    },
    "n_moles": {
        "dtype": np.float32,
        "dim": ["x", "y", "time"],
        "attributes": {
            "units": "",
            "unc_comps": ["u_ran_n_moles"]
        }
    },
    "u_ran_n_moles": {
        "dtype": np.float32,
        "dim": ["x", "y", "time"],
        "attributes": {
            "units": "",
            "err_corr": [
              {
                  "dim": "x",
                  "form": "random",
                  "params": [],
                  "units": []
              },
              {
                  "dim": "y",
                  "form": "random",
                  "params": [],
                  "units": []
              },
              {
                  "dim": "time",
                  "form": "random",
                  "params": [],
                  "units": []
              }
          ]
        },
    },
}

# define dim_sizes to specify the size of each dimension
dim_sizes = {
    "x": 20,
    "y": 30,
    "time": 6
}

# create dataset template
ds = obsarray.create_ds(template, dim_sizes)

# populate with example data
ds["temperature"].values = 293*np.ones((20,30,6))
ds["u_ran_temperature"].values = 1*np.ones((20,30,6))
ds["u_sys_temperature"].values = 0.4*np.ones((20,30,6))
ds["pressure"].values = 10**5*np.ones((20,30,6))
ds["u_str_pressure"].values = 10*np.ones((20,30,6))
ds["err_corr_str_pressure_y"].values = 0.5*np.ones((30,30))+0.5*np.eye(30)
ds["n_moles"].values = 40*np.ones((20,30,6))
ds["u_ran_n_moles"].values = 1*np.ones((20,30,6))

# store example file
# ds.to_netcdf("path/to/file.nc")

Here the last line has been commented out, as we do not want to save the NetCDF file as part of this notebook.
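For orientation, the measurement function this dataset is intended for can be sketched for a single datum. The snippet below evaluates the ideal gas law with the example values used above and applies the standard first-order propagation formula for a product/quotient of independent inputs. It is only a hand calculation for one point and does not represent how punpy works internally; the function name `volume` and the value used for the molar gas constant are choices made here for illustration.

```python
import numpy as np

R = 8.314462618  # molar gas constant in J mol^-1 K^-1


def volume(pressure, temperature, n_moles):
    """Ideal gas law, V = n R T / P, with V in m^3 for SI inputs."""
    return n_moles * R * temperature / pressure


# single-datum values taken from the example dataset above
T, P, n = 293.0, 1.0e5, 40.0
u_T = np.sqrt(1.0**2 + 0.4**2)  # random and systematic components in quadrature
u_P, u_n = 10.0, 1.0

V = volume(P, T, n)

# first-order propagation for independent inputs of a pure product/quotient:
# the relative variances add
u_V = V * np.sqrt((u_T / T) ** 2 + (u_P / P) ** 2 + (u_n / n) ** 2)

print(V, u_V)
```

For a single datum the error-correlation structure plays no role; it becomes important once results are combined across ``x``, ``y`` or ``time``, which is where the digital effects table is needed.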

We can then inspect some of the results using the obsarray uncertainty accessor:

In [None]:
print(ds.unc["temperature"].total_unc())
print(ds.unc["pressure"]["u_str_pressure"].value)
print(ds.unc["pressure"]["u_str_pressure"].err_corr)
# the [0,:,:] slice selects the first index along x, so the error-correlation matrix
# returned covers the second and third dimensions (y and time)
print(ds.unc["pressure"][0,:,:]["u_str_pressure"].err_corr_matrix())
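As a rough cross-check of what these calls are expected to return, the same quantities can be reproduced by hand with NumPy. The sketch below rests on two assumptions that are not confirmed by this notebook: that the total uncertainty combines the components in quadrature, and that the error-correlation matrix of a (y, time) slice corresponds to a row-major flattening of that slice, in which case it is the Kronecker product of the per-dimension correlation matrices.

```python
import numpy as np

# uncertainty components of temperature, using the example values above
u_ran = 1.0 * np.ones((20, 30, 6))
u_sys = 0.4 * np.ones((20, 30, 6))

# assumption: independent components combine in quadrature
u_tot = np.sqrt(u_ran**2 + u_sys**2)

# correlation of u_str_pressure over a (y, time) slice at fixed x
corr_y = 0.5 * np.ones((30, 30)) + 0.5 * np.eye(30)  # "err_corr_matrix" form along y
corr_time = np.ones((6, 6))                          # "systematic" form along time

# assumption: the slice is flattened in row-major order (time varies fastest),
# so element (y, t) maps to index y * 6 + t
corr_slice = np.kron(corr_y, corr_time)

print(u_tot[0, 0, 0])
print(corr_slice.shape)  # (180, 180)
```

If the shapes or values printed by the accessor differ from this sketch, the flattening order is the first assumption to revisit.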