## How to save a data cube with a desired chunking
### A DeepESDL example notebook 

This notebook demonstrates how to modify the chunking of a dataset before persisting it.

Please also refer to the [DeepESDL documentation](https://deepesdl.readthedocs.io/en/latest/guide/jupyterlab/) and visit the platform's [website](https://www.earthsystemdatalab.net/) for further information!

Brockmann Consult, 2024

-----------------

**This notebook runs with the Python environment `deepesdl-xcube-1.7.0`. Please check out the documentation for [help on changing the environment](https://deepesdl.readthedocs.io/en/latest/guide/jupyterlab/#python-environment-selection-of-the-jupyter-kerne).**

First, let's create a small cube to work with. We will use ESA CCI data for this. Please head over to 3 - Generate CCI data cubes to get more details about the xcube-cci data store :)

In [1]:
import datetime
import os

from xcube.core.store import new_data_store

In [2]:
store = new_data_store("ccizarr")

Next, we create a cube containing only 4 days of data:

In [3]:
def open_zarrstore(filename, time_range, variables):
    ds = store.open_data(filename)
    subset = ds.sel(time=slice(time_range[0], time_range[1]))
    subset = subset[variables]

    return subset


dataset = open_zarrstore(
    "ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.zarr",
    time_range=[datetime.datetime(2015, 10, 1), datetime.datetime(2015, 10, 5)],
    variables=["analysed_sst"],
)

In [4]:
dataset

| | Array | Chunk |
|---|---|---|
| Bytes | 31.64 MiB | 15.82 MiB |
| Shape | (4, 720, 1440) | (4, 720, 720) |
| Dask graph | 2 chunks in 3 graph layers | |
| Data type | float64 numpy.ndarray | |


In the example above, the variable `analysed_sst` is chunked as (4, 720, 720): each chunk contains 4 time values, 720 lat values, and 720 lon values.
Chunks that hold a single time value and a large spatial extent are well suited to visualising or plotting one time step.
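
The figures in the representation above follow directly from the array shape, the chunk shape, and the 8-byte item size of `float64`. A minimal sketch of that arithmetic in plain Python; the helper name `chunk_stats` is made up for this illustration and is not part of xcube or xarray:

```python
import math


def chunk_stats(shape, chunks, itemsize=8):
    """Return (number of chunks, bytes per full chunk) for a chunked array."""
    # Number of chunks along each dimension, rounding up for partial chunks.
    chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    n_chunks = math.prod(chunks_per_dim)
    # Size of one full chunk in bytes (itemsize=8 corresponds to float64).
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes


n_chunks, chunk_bytes = chunk_stats((4, 720, 1440), (4, 720, 720))
print(n_chunks, round(chunk_bytes / 2**20, 2))  # 2 15.82
```

The same helper can be used to estimate the number and size of chunks for any candidate chunking before writing a cube.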

For analysing long time series, it is beneficial to chunk the dataset so that each chunk contains more values along the time dimension and fewer along the spatial dimensions.
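
To make this trade-off concrete, the sketch below counts how many chunks a read has to touch for two access patterns. It assumes a hypothetical cube with 365 time steps on the same 720 x 1440 grid; the helper `chunks_touched` and both chunk layouts are illustrative only:

```python
def chunks_touched(selection, chunks):
    """Count the chunks that overlap a selection.

    `selection` holds one (start, stop) index pair per dimension,
    `chunks` holds the chunk size per dimension.
    """
    touched = 1
    for (start, stop), size in zip(selection, chunks):
        first = start // size        # index of the first chunk hit
        last = (stop - 1) // size    # index of the last chunk hit
        touched *= last - first + 1
    return touched


# Hypothetical cube: 365 time steps, 720 lat values, 1440 lon values.
pixel_series = [(0, 365), (100, 101), (200, 201)]  # one pixel, all time steps
single_map = [(0, 1), (0, 720), (0, 1440)]         # one time step, all pixels

for chunks in [(1, 720, 1440), (365, 120, 120)]:
    print(chunks, chunks_touched(pixel_series, chunks), chunks_touched(single_map, chunks))
# (1, 720, 1440) 365 1
# (365, 120, 120) 1 72
```

With one time step per chunk, a pixel time series has to read every chunk of the cube, while a single map is one read. The time-oriented layout reverses this.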

In [5]:
# Example chunk sizes, for illustration only. For a time-optimised layout on a
# longer cube, increase time_chunksize and keep the spatial chunks small.
time_chunksize = 1
x_chunksize = 120  # or lon
y_chunksize = 120  # or lat

Now the chunking is applied to all data variables, skipping `crs` and bounds variables if present:

In [6]:
dataset.data_vars

Data variables:
    analysed_sst  (time, lat, lon) float64 33MB dask.array<chunksize=(4, 720, 720), meta=np.ndarray>

In [7]:
for var_name in dataset.data_vars:
    if var_name != "crs" and "_bounds" not in var_name:
        print(var_name)
        dataset[var_name] = dataset[var_name].chunk(
            {"time": time_chunksize, "lat": y_chunksize, "lon": x_chunksize}
        )

analysed_sst


To save a copy of a cube with a specific chunking, the encoding must be adjusted accordingly.

In [8]:
encoding_dict = dict()

We want the coordinate variables to be stored in the most performant way, so we make sure that they are not chunked. This can be specified via the encoding:

In [9]:
coords_encoding = {k: dict(chunks=v.shape) for k, v in dataset.coords.items()}

In [10]:
coords_encoding

{'lat': {'chunks': (720,)},
 'lon': {'chunks': (1440,)},
 'time': {'chunks': (4,)}}

Next, specify the chunking in the encoding of the data variables, and ensure that empty chunks are not written to disk by adding `write_empty_chunks=False`. This saves disk space. Again, `crs` is skipped if present.

In [11]:
vars_encoding = {
    k: dict(chunks=(time_chunksize, y_chunksize, x_chunksize), write_empty_chunks=False)
    for k, v in dataset.data_vars.items()
    if k != "crs"
}

In [12]:
vars_encoding

{'analysed_sst': {'chunks': (1, 120, 120), 'write_empty_chunks': False}}
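
The tuple `(time_chunksize, y_chunksize, x_chunksize)` assumes that every data variable has the dimensions `(time, lat, lon)` in exactly that order. For cubes where this may not hold, one option is to derive the chunk tuple from the dimension names. A minimal sketch in plain Python; `chunks_for_dims` is a hypothetical helper, not an xcube or xarray function:

```python
def chunks_for_dims(dims, sizes, wanted):
    """Build a chunk tuple matching the order of `dims`.

    dims:   dimension names of one variable, e.g. ("time", "lat", "lon")
    sizes:  mapping of dimension name to dimension length
    wanted: mapping of dimension name to desired chunk size; dimensions
            not listed are kept in a single chunk
    """
    # Never request a chunk that is larger than the dimension itself.
    return tuple(min(wanted.get(d, sizes[d]), sizes[d]) for d in dims)


sizes = {"time": 4, "lat": 720, "lon": 1440}
wanted = {"time": 1, "lat": 120, "lon": 120}

print(chunks_for_dims(("time", "lat", "lon"), sizes, wanted))  # (1, 120, 120)
print(chunks_for_dims(("lat", "lon"), sizes, wanted))          # (120, 120)
```

With an xarray dataset, `dataset[name].dims` and `dataset.sizes` would supply the first two arguments.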

Next, combine both dictionaries to form the encoding for the entire dataset.

In [13]:
encoding_dict.update(coords_encoding)
encoding_dict.update(vars_encoding)

In [14]:
encoding_dict

{'lat': {'chunks': (720,)},
 'lon': {'chunks': (1440,)},
 'time': {'chunks': (4,)},
 'analysed_sst': {'chunks': (1, 120, 120), 'write_empty_chunks': False}}
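
Before writing, it can be worth checking that every entry of the encoding dictionary fits the variable it refers to, since a chunk tuple of the wrong length only fails at write time. The following is a sketch of such a check under the assumption that the variable shapes are available as a plain mapping; `check_encoding` is a hypothetical helper written for this notebook:

```python
def check_encoding(shapes, encoding):
    """Return a list of problems found in an encoding dictionary.

    shapes:   mapping of variable name to array shape
    encoding: mapping of variable name to a dict with a "chunks" tuple
    """
    problems = []
    for name, enc in encoding.items():
        if name not in shapes:
            problems.append(f"{name}: not a variable of the dataset")
            continue
        chunks = enc.get("chunks")
        if chunks is None:
            continue  # nothing to check for this variable
        if len(chunks) != len(shapes[name]):
            problems.append(
                f"{name}: {len(chunks)} chunk sizes for {len(shapes[name])} dimensions"
            )
        elif any(c < 1 for c in chunks):
            problems.append(f"{name}: chunk sizes must be positive")
    return problems


shapes = {"lat": (720,), "lon": (1440,), "time": (4,), "analysed_sst": (4, 720, 1440)}
encoding = {
    "lat": {"chunks": (720,)},
    "lon": {"chunks": (1440,)},
    "time": {"chunks": (4,)},
    "analysed_sst": {"chunks": (1, 120, 120), "write_empty_chunks": False},
}
print(check_encoding(shapes, encoding))  # []
```

With an xarray dataset, the shapes could be collected with `{name: var.shape for name, var in dataset.variables.items()}`.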

Next, save it to the team S3 storage.

To store the cube in your team's user space, first retrieve the details from your environment variables as follows:

In [15]:
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
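
Accessing `os.environ[...]` directly raises a bare `KeyError` for the first variable that is not set. If a clearer message is preferred, for example when running outside DeepESDL, the lookup could be wrapped as sketched below. The function `read_s3_settings` is a hypothetical helper, and the demo values are placeholders rather than real credentials:

```python
import os


def read_s3_settings(env=os.environ):
    """Collect the three S3 settings and report all missing ones at once."""
    names = [
        "S3_USER_STORAGE_KEY",
        "S3_USER_STORAGE_SECRET",
        "S3_USER_STORAGE_BUCKET",
    ]
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Demo with placeholder values instead of the real environment.
demo_env = {
    "S3_USER_STORAGE_KEY": "demo-key",
    "S3_USER_STORAGE_SECRET": "demo-secret",
    "S3_USER_STORAGE_BUCKET": "demo-bucket",
}
settings = read_s3_settings(demo_env)
print(settings["S3_USER_STORAGE_BUCKET"])  # demo-bucket
```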

You need to instantiate an S3 data store pointing to the team bucket:

In [16]:
team_store = new_data_store(
    "s3",
    root=S3_USER_STORAGE_BUCKET,
    storage_options=dict(
        anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
    ),
)

List the datasets that are already stored in your user space. If you have not stored any data yet, the returned list will be empty:

In [17]:
list(team_store.get_data_ids())

['SST.levels',
 'amazonas_v8.zarr',
 'amazonas_v9.zarr',
 'noise_trajectory.zarr',
 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr',
 's2-demo-cube.zarr']

In [18]:
output_id = "analysed_sst.zarr"

Now let's write the data to the team S3 storage. Remember to specify the encoding while doing so:

In [19]:
team_store.write_data(dataset, output_id, encoding=encoding_dict, replace=True)

'analysed_sst.zarr'
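
Note that `replace=True` overwrites an existing dataset with the same identifier without asking. Where that is not wanted, a small guard can check for the identifier first. The sketch below assumes the store offers `has_data` and `write_data` as in the xcube data store API; `write_if_absent` itself is a hypothetical helper:

```python
def write_if_absent(store, dataset, data_id, **write_params):
    """Write `dataset` under `data_id` unless the store already has that id.

    Returns True if the data was written, False if the write was skipped.
    """
    if store.has_data(data_id):
        print(f"{data_id} already exists, skipping write")
        return False
    # replace=False so that a concurrent write is not silently overwritten.
    store.write_data(dataset, data_id, replace=False, **write_params)
    return True
```

In this notebook it would be called as `write_if_absent(team_store, dataset, output_id, encoding=encoding_dict)`.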

If you list the content of your data store again, you will now see the newly written dataset in the list:

In [20]:
list(team_store.get_data_ids())

['SST.levels',
 'amazonas_v8.zarr',
 'amazonas_v9.zarr',
 'analysed_sst.zarr',
 'noise_trajectory.zarr',
 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr',
 's2-demo-cube.zarr']

Let's verify that our chunking has been applied: 

In [21]:
ds = team_store.open_data(output_id)

In [22]:
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 31.64 MiB | 112.50 kiB |
| Shape | (4, 720, 1440) | (1, 120, 120) |
| Dask graph | 288 chunks in 2 graph layers | |
| Data type | float64 numpy.ndarray | |
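
Instead of reading the chunk shape off the representation, the check can also be done programmatically. Dask-backed arrays expose a `.chunks` attribute holding one tuple of chunk sizes per dimension. The sketch below works on such a tuple of tuples; `leading_chunks` is a hypothetical helper, and the example layout is written out by hand to match the output above:

```python
def leading_chunks(dask_chunks):
    """Return the size of the first chunk along each dimension.

    `dask_chunks` is a tuple of tuples as returned by the `.chunks`
    attribute of a dask-backed array, one inner tuple per dimension.
    The last chunk of a dimension may be smaller, so only the first
    one is compared.
    """
    return tuple(sizes[0] for sizes in dask_chunks)


# Chunk layout matching the representation above: 4 x 6 x 12 = 288 chunks.
example = ((1,) * 4, (120,) * 6, (120,) * 12)
print(leading_chunks(example))  # (1, 120, 120)
```

For the reopened cube, `leading_chunks(ds["analysed_sst"].chunks)` could then be compared with `(time_chunksize, y_chunksize, x_chunksize)`.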


Looks good. Now let's clean up the example cube :)

In [23]:
team_store.delete_data(output_id)