# ESiWACE3 Compression Turing Test in WebAssembly

**Note:** Any changes you make to this notebook will be lost once the page is closed or refreshed. Please download any files you would like to keep.

**Note:** The WASM-based version of the compression lab running inside JupyterLite has only been tested in recent Chrome and Firefox browsers.

## Introductory remarks

This notebook is an advanced example that implements a Turing Test for lossy data compression methods. Please review the accompanying [`lab.ipynb`](https://esiwace3-compression-lab.onrender.com/retro/notebooks/?path=lab.ipynb) notebook first to learn how to load your own data into the lab, how to visualise it, how to run different compression algorithms on the data, how to analyse the compression results, and how to download files from the lab to your own computer.

If you are already familiar with these steps, or feel ready to take the plunge and refer back to the [`lab.ipynb`](https://esiwace3-compression-lab.onrender.com/retro/notebooks/?path=lab.ipynb) notebook when necessary, you are welcome to continue and test which lossy compression algorithms provide functionally identical results for your downstream scientific analysis on your datasets.

## Import the ESiWACE3 field compression lab library

Importing `fcpy` also imports a large number of dependencies, which may take around a minute. Why not use this time to grab a drink, stretch your legs, and look out of your window?

In [1]:
import fcpy

Many Python packages that are common in scientific computing and meteorology are available in this lab and can simply be imported. These include, e.g., `cartopy`, `cfgrib`, `dask`, `matplotlib`, `netcdf4`, `numcodecs`, `numpy`, `pandas`, `xarray`, and `zarr`. You can also install additional *pure* Python packages from PyPI by running `%pip install PACKAGE` before the import statement.

## Fetch your dataset

Here we download a small dataset and save it to the in-memory file system of this JupyterLite notebook. Since the memory of the notebook is limited, this only works for very small demo datasets.

In [2]:
small_path = fcpy.fetch_small_http_file(
    "https://esiwace3-compression-lab.onrender.com/data/hplp_ml_q_dx=2.0.grib"
)
small_path

PosixPath('/scratch/8d4c6c62-a3e9-41df-98ea-3c824ec3114c/hplp_ml_q_dx=2.0.grib')

Alternatively, you can also upload a dataset file from your own computer. The file is mounted in read-only mode into the notebook's file system without reading the file into memory, thus allowing arbitrarily large files to be made accessible. It is worth remembering that large files can still only be read if the algorithm that processes them supports streaming or chunking and does not request to load all data into memory at the same time.
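
To illustrate what chunked processing means here, the following is a minimal sketch, not part of `fcpy`, that computes the minimum and maximum of a raw binary file of float64 values while holding only one chunk in memory at a time. The function name `streaming_min_max`, the raw file layout, and the chunk size are assumptions made for illustration; real GRIB or NetCDF files need a format-aware reader.

```python
import array
import os
import tempfile


def streaming_min_max(path, chunk_items=1024):
    """Return (min, max) of the float64 values in a raw binary file.

    Hypothetical helper: only `chunk_items` values are held in memory at once.
    """
    lo, hi = float("inf"), float("-inf")
    item_size = array.array("d").itemsize
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_items * item_size)
            if not raw:
                break  # end of file reached
            chunk = array.array("d")
            chunk.frombytes(raw)
            lo = min(lo, min(chunk))
            hi = max(hi, max(chunk))
    return lo, hi


# Demo: write 5000 values to a temporary file, then stream them back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as f:
        array.array("d", range(5000)).tofile(f)
    demo_lo, demo_hi = streaming_min_max(demo_path, chunk_items=1024)

print(demo_lo, demo_hi)
```

An algorithm that instead needs all values at once, such as a global sort, cannot be restructured this way and would still exhaust the notebook's memory on a large file.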

If this notebook is run inside JupyterLite, the file also never leaves your own computer.

Note that the following code cell is commented out since it requires your input to upload a file before the notebook can continue. To mount an uploaded file, uncomment the cell and run it.

In [3]:
# upload_path = await fcpy.mount_user_local_file()
# upload_path

## Load the example dataset into `xarray`

To select which dataset you wish to load, execute only one of the two `dataset_path` assignments below.

Afterwards, we load the dataset into `xarray`.
```python
fcpy.open_dataset(path: pathlib.Path, **kwargs) -> xarray.Dataset
```
is a thin wrapper around
```python
xarray.open_dataset(filename: str, **kwargs) -> xarray.Dataset
```
and thus takes the same arguments. Please refer to the [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html) documentation if you need to perform some special configuration.

In [4]:
dataset_path = small_path
# dataset_path = upload_path

ds = fcpy.open_dataset(dataset_path)
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 1.25 MiB | 1.25 MiB |
| Shape | (10, 91, 180) | (10, 91, 180) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


## Define your Analysis Procedure

Next we define the procedure by which we want to analyse the dataset. You are encouraged to test any downstream analysis you normally perform on your data here, as long as it produces some visual output inside this notebook, e.g. a plot, a table, or a few `print()` statements.

The analysis function has the following signature:
```python
def my_analysis_function(
    ds: xarray.Dataset,
    rng: numpy.random.Generator,
) -> typing.Any
```
where `ds` is a (possibly compressed) version of the dataset you have loaded, and `rng` is a random number generator which you can use to reproducibly add randomness to your analysis.

In this example, we simply plot a randomly selected variable over a randomly selected model level of the provided dataset. `fcpy` provides the
```python
fcpy.suite.plot_spatial_dataarray(
    da: xr.DataArray,
) -> Tuple[matplotlib.figure.Figure, matplotlib.axes.Axes]
```
helper function to plot spatial data. Note that this function requires that the data array is gridded along the standard "latitude" and "longitude" axes.

In [5]:
import warnings

import cartopy
from matplotlib import pyplot as plt

def plotting_analysis(ds, rng):
    variable = rng.choice(list(ds))
    level = rng.choice(ds.hybrid.values)
    
    fig, ax = fcpy.suite.plot_spatial_dataarray(
        ds[variable].sel(dict(hybrid=level)),
    )
    ax.set_title(None)

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=cartopy.io.DownloadWarning)
        plt.show()

## Choose the Compression Algorithms you wish to compare

In this penultimate step, we select which compression algorithm configurations we want to compare. `fcpy` supports compression codecs implementing the `numcodecs` API and provides additional compressors in the `fcpy.codecs` module. Each entry in the list of compressors can be either:
- an instance of a subclass of `numcodecs.abc.Codec`, from now on just referred to as a codec. It will be applied to every variable in the dataset.
- a list of codecs. Note that compression is applied from left to right, i.e. the rightmost codec will be applied last. Decompression is applied in reverse. This stack of codecs is applied to every variable in the dataset.
- a mapping from variable names to either a codec or a list of codecs. A different compression scheme is thus applied to every variable.
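
The left-to-right ordering of a codec list can be sketched with two toy codecs. `Scale` and `Round` are hypothetical classes written for this illustration; they only mimic the `encode`/`decode` method names of the `numcodecs` API and operate on plain Python lists rather than arrays.

```python
class Scale:
    """Toy codec: multiply by a factor on encode, divide on decode."""

    def __init__(self, factor):
        self.factor = factor

    def encode(self, buf):
        return [x * self.factor for x in buf]

    def decode(self, buf):
        return [x / self.factor for x in buf]


class Round:
    """Toy lossy codec: round to integers on encode, decode is a no-op."""

    def encode(self, buf):
        return [round(x) for x in buf]

    def decode(self, buf):
        return list(buf)


def encode_stack(codecs, data):
    # Compression runs left to right: the rightmost codec is applied last.
    for codec in codecs:
        data = codec.encode(data)
    return data


def decode_stack(codecs, data):
    # Decompression undoes the codecs in reverse order.
    for codec in reversed(codecs):
        data = codec.decode(data)
    return data


data = [1.234, 5.678]

stack = [Scale(100), Round()]  # keeps two decimal places
roundtrip = decode_stack(stack, encode_stack(stack, data))

swapped = [Round(), Scale(100)]  # rounds first, so all decimals are lost
roundtrip_swapped = decode_stack(swapped, encode_stack(swapped, data))

print(roundtrip, roundtrip_swapped)
```

The two stacks contain the same codecs but lose different amounts of information, which is why the order of a codec list matters.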

Here we test various configurations of the simple lossy `LinearQuantize` codec that rescales real-valued data from $[\min; \max]$ to the integer range $[0; 2^b - 1]$. Here, $b$ is the number of bits of precision you want to keep.
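
As a rough sketch of the idea behind linear quantization, and not the actual `fcpy.codecs.LinearQuantize` implementation, each value $x$ can be mapped to the integer code $\mathrm{round}\left((x - \min) / (\max - \min) \cdot (2^b - 1)\right)$ and mapped back by inverting the scaling. The function names `linear_quantize` and `linear_dequantize`, the use of plain Python lists, and the handling of constant data are assumptions made for this illustration.

```python
def linear_quantize(values, bits):
    """Map floats from [min, max] onto integer codes in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        # Constant data: every value maps to code 0.
        return [0] * len(values), lo, hi
    scale = levels / (hi - lo)
    # Clamp to guard against floating-point overshoot at the range edges.
    codes = [min(levels, max(0, round((v - lo) * scale))) for v in values]
    return codes, lo, hi


def linear_dequantize(codes, lo, hi, bits):
    """Map integer codes back to floats in [lo, hi]."""
    levels = (1 << bits) - 1
    if hi == lo:
        return [lo for _ in codes]
    step = (hi - lo) / levels
    return [lo + c * step for c in codes]


values = [0.0, 0.1, 0.5, 0.9, 1.0]
for bits in [2, 4, 6, 8]:
    codes, lo, hi = linear_quantize(values, bits)
    restored = linear_dequantize(codes, lo, hi, bits)
    max_error = max(abs(a - b) for a, b in zip(values, restored))
    print(bits, max_error)
```

In this sketch the reconstruction error is bounded by half a quantization step, $(\max - \min) / (2 \cdot (2^b - 1))$, so each additional bit roughly halves the worst-case error.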

Note that if you want to compare the codecs against the uncompressed dataset, you can simply include the `fcpy.codecs.Identity()` codec in the list.

In [6]:
compressors = [
    fcpy.codecs.LinearQuantize(bits=b, dtype="float64") for b in [2, 4, 6, 8]
] + [fcpy.codecs.Identity()]

## A Turing Test for Compression Algorithms

We can now perform a Turing Test for data compression algorithms, which is inspired by Milan Klöwer's PhD thesis:
> Klöwer, M. (2021). Low-precision climate computing: preserving information despite fewer bits [PhD thesis]. University of Oxford. Available from: https://ora.ox.ac.uk/objects/uuid:1158e44a-7faf-45a0-8ab1-73c91fd694a6

Once you start the Turing test, you will be repeatedly presented with two results of your analysis procedure, which have been run with identical random number generators on differently compressed versions of your dataset. Your task is to decide which of the cases was compressed worse, i.e. with higher information loss, and to click the case's associated button. After the test has gone through a sufficient number of examples, it produces a ranking of the compression algorithms.

If you compare the performance of different compressors against the uncompressed dataset by including the `fcpy.codecs.Identity()` codec amongst the compressors, this ranking reveals which compressors have passed the Turing test. In particular, if a compressor ranks above `fcpy.codecs.Identity()`, it must produce analysis results that are functionally identical to the ones produced on the uncompressed data. Thus, this compressor can be used for this data and specific analysis procedure without losing any information that is relevant to this downstream analysis.
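
The notebook does not describe how the lab turns your answers into a ranking. Purely as an illustration of the idea, the sketch below ranks compressors by how often they were picked as the worse case. The function `rank_by_votes`, the compressor labels, and the tallying rule are all assumptions and may differ from what `fcpy` does internally.

```python
from collections import Counter


def rank_by_votes(names, worse_votes):
    """Order names from least to most often judged 'worse'.

    Hypothetical tally: ties keep the order in which names were given.
    """
    counts = Counter(worse_votes)
    return sorted(names, key=lambda name: (counts[name], names.index(name)))


names = ["2 bits", "8 bits", "identity"]
# Each entry records which of the two presented cases was clicked as worse.
worse_votes = ["2 bits", "2 bits", "8 bits", "2 bits"]

ranking = rank_by_votes(names, worse_votes)
print(ranking)
```

With a tally like this, a compressor that is never reliably told apart from the uncompressed data collects about as few votes as the identity entry and ends up next to it in the ranking.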

The Turing Test for data compression is initiated with the following function:
```python
def initiate_turing_test(
    ds: xarray.Dataset,
    compressors: list[
        Union[
            numcodecs.abc.Codec,
            list[numcodecs.abc.Codec],
            dict[str, Union[
                numcodecs.abc.Codec,
                list[numcodecs.abc.Codec],
            ]],
        ],
    ],
    analysis: Callable[[xarray.Dataset, numpy.random.Generator], None],
    rng: Optional[numpy.random.Generator] = None,
)
```
The `rng` parameter can be used to reproducibly fix the randomness used throughout the test and passed to the analysis procedure calls.
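
As a small sketch of why a fixed seed makes the test reproducible, two `numpy.random.Generator` instances created with the same seed yield the same sequence of random choices. The seed value `42` and the example list of levels are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created from the same seed produce identical streams.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

levels = [1, 2, 3, 4, 5]
picks_a = [int(rng_a.choice(levels)) for _ in range(5)]
picks_b = [int(rng_b.choice(levels)) for _ in range(5)]

print(picks_a == picks_b)
```

Following the signature above, a seeded generator would be passed as `fcpy.initiate_turing_test(ds, compressors, plotting_analysis, rng=np.random.default_rng(42))`.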

Also note that the exact ranking produced by the Turing Test depends on your answers and the initial randomness provided to the test.

In [7]:
fcpy.initiate_turing_test(ds, compressors, plotting_analysis)

## Feedback on the ESiWACE3 Compression Lab

We aim to build an online compression laboratory in which you can easily test and apply the most relevant compression algorithms on your data. We want to hear from you about what your requirements for compression are to ensure that any downstream scientific analysis is not adversely affected.

Please use the link below to provide us with feedback on
- your requirements for compression
- any bugs in the compression lab
- missing features that would allow you to better use it (e.g. unsupported data formats, compression methods, or compression error analysis methods)
- complicated or unclear functionality in the compression lab

https://forms.office.com/e/hKqfmvFTkz