## How to save a data cube with a desired chunking
### A DeepESDL example notebook 

This notebook demonstrates how to modify the chunking of a dataset before persisting it.

Please also refer to the [DeepESDL documentation](https://deepesdl.readthedocs.io/en/latest/guide/jupyterlab/) and visit the platform's [website](https://www.earthsystemdatalab.net/) for further information!

Brockmann Consult, 2025

-----------------

**This notebook runs with the Python environment `deepesdl-xcube-1.9.1`. Please check out the documentation for [help on changing the environment](https://deepesdl.readthedocs.io/en/latest/guide/jupyterlab/#python-environment-selection-of-the-jupyter-kerne).**

First, let's create a small cube that we can rechunk. We will use ESA CCI data for this. Please head over to "xcube datastores - Generate CCI data cubes" for more details about the xcube-cci data store :)

In [1]:
import datetime
import os

from xcube.core.store import new_data_store
from xcube.core.chunk import chunk_dataset

In [2]:
store = new_data_store("ccizarr")

Next, we create a cube containing three years of data:

In [3]:
def open_zarrstore(filename, time_range, variables):
    ds = store.open_data(filename)
    subset = ds.sel(time=slice(time_range[0], time_range[1]))
    subset = subset[variables]

    return subset


dataset = open_zarrstore(
    "ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.zarr",
    time_range=[datetime.datetime(2013, 10, 1), datetime.datetime(2016, 9, 30)],
    variables=["analysed_sst"],
)

In [4]:
dataset

| | Array | Chunk |
|---|---|---|
| Bytes | 8.46 GiB | 63.28 MiB |
| Shape | (1095, 720, 1440) | (16, 720, 720) |
| Dask graph | 140 chunks in 3 graph layers | 140 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [5]:
dataset.analysed_sst.encoding

{'chunks': (16, 720, 720),
 'preferred_chunks': {'time': 16, 'lat': 720, 'lon': 720},
 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
 'filters': None,
 '_FillValue': np.int16(-32768),
 'scale_factor': 0.009999999776482582,
 'add_offset': 273.1499938964844,
 'dtype': dtype('int16')}

In the example above, the variable `analysed_sst` is chunked as (16, 720, 720): each chunk contains 16 time values, 720 lat values and 720 lon values.
Chunks that hold a single time value and a large spatial extent are best suited to visualising or plotting one time stamp.

For analysing long time series, it is beneficial to chunk the dataset so that each chunk contains more values along the time dimension and fewer along the spatial dimensions.

In [6]:
# time optimised chunking - please note, this is just an example
time_chunksize = 1095
x_chunksize = 10  # or lon
y_chunksize = 10  # or lat
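
To make the trade-off concrete, the following sketch counts how many chunks a read would touch under the original and the time-optimised chunking. It is plain arithmetic and does not use xcube; `chunks_touched` is a hypothetical helper introduced here for illustration, and it assumes that the selection starts on a chunk boundary, so real counts may be slightly higher for unaligned subsets.

```python
import math


def chunks_touched(chunks, extent):
    """Count the chunks overlapped by a selection of `extent` elements per
    dimension, assuming the selection starts on a chunk boundary."""
    return math.prod(math.ceil(e / c) for e, c in zip(extent, chunks))


# (time, lat, lon) shape of the subset used in this notebook
shape = (1095, 720, 1440)
original_chunking = (16, 720, 720)
time_optimised_chunking = (1095, 10, 10)

# Two typical access patterns
time_series = (shape[0], 1, 1)        # all time steps at one grid cell
single_map = (1, shape[1], shape[2])  # one time step, full spatial extent

for name, chunks in [
    ("original", original_chunking),
    ("time-optimised", time_optimised_chunking),
]:
    print(
        name,
        chunks_touched(chunks, time_series),
        chunks_touched(chunks, single_map),
    )
# original 69 2
# time-optimised 1 10368
```

With the example sizes, a full time series at one grid cell needs a single chunk instead of 69, while a single global map needs 10368 chunks instead of 2. This is why the chunking should follow the dominant access pattern.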

Rechunk the dataset to the desired chunking using xcube's `chunk_dataset`.

In [7]:
rechunked_ds = chunk_dataset(
    dataset,
    {"time": time_chunksize, "lat": y_chunksize, "lon": x_chunksize},
    format_name="zarr",
    data_vars_only=True,
)

Save the rechunked dataset to the team S3 storage.

In [8]:
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]

In [9]:
team_store = new_data_store(
    "s3",
    root=S3_USER_STORAGE_BUCKET,
    storage_options=dict(
        anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
    ),
)

In [10]:
team_store.list_data_ids()

['ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.rechunked.zarr',
 'LC-1x720x1440-0.25deg-2.0.0-v1.zarr',
 'LC-1x720x1440-0.25deg-2.0.0-v2.zarr',
 'SST.levels',
 'SeasFireCube-8D-0.25deg-1x720x1440-3.0.0.zarr',
 'amazonas_v8.zarr',
 'amazonas_v9.zarr',
 'analysed_sst.zarr',
 'analysed_sst_2.zarr',
 'analysed_sst_3.zarr',
 'analysed_sst_4.zarr',
 'esa-cci-permafrost-1x1151x1641-0.1.0.zarr',
 'esa-cci-permafrost-1x1151x1641-0.4.0.zarr',
 'esa-cci-permafrost-1x1151x1641-0.5.0.zarr',
 'esa-cci-permafrost-1x1151x1641-0.6.0.zarr',
 'esa-cci-permafrost-1x1151x1641-0.7.0.zarr',
 'esa-cci-permafrost-1x1151x1641-0.8.0.zarr',
 'esa-cci-permafrost-1x1151x1641-1.0.0.zarr',
 'esa_gda-health_pakistan_ERA5_precipitation_and_temperature_testdata.zarr',
 'noise_trajectory.zarr']

In [11]:
output_id = "ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.rechunked_AB.zarr"

In [12]:
team_store.write_data(rechunked_ds, output_id, replace=True)

'ESACCI-L4_GHRSST-SST-GMPE-GLOB_CDR2.0-1981-2016-v02.0-fv01.0.rechunked_AB.zarr'

In [13]:
ds_re = team_store.open_data(output_id)

In [14]:
ds_re

| | Array | Chunk |
|---|---|---|
| Bytes | 8.46 GiB | 855.47 kiB |
| Shape | (1095, 720, 1440) | (1095, 10, 10) |
| Dask graph | 10368 chunks in 2 graph layers | 10368 chunks in 2 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


Let's have a look at the chunking of the variable `analysed_sst` now: (1095, 10, 10). Each chunk contains 1095 time values, 10 lat values and 10 lon values.
This corresponds to the chunk sizes we passed to xcube's `chunk_dataset`.
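
As a quick sanity check, the chunk count and in-memory chunk size shown in the dataset summary can be reproduced by hand. The sketch below is plain arithmetic; `chunk_stats` is a hypothetical helper, not an xcube function, and the item size of 8 bytes assumes the decoded `float64` values shown in the summary rather than the `int16` values stored on disk.

```python
import math


def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, bytes per full chunk) for a chunk grid
    that starts at the array origin."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes


# Shape and chunking of the rechunked cube; 8 bytes per float64 value
n_chunks, chunk_bytes = chunk_stats((1095, 720, 1440), (1095, 10, 10), itemsize=8)

print(n_chunks)                          # 10368
print(f"{chunk_bytes / 1024:.2f} KiB")   # 855.47 KiB
```

Both numbers match the summary above: 10368 chunks of about 855.47 KiB each.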

In [15]:
ds_re.analysed_sst.encoding

{'chunks': (1095, 10, 10),
 'preferred_chunks': {'time': 1095, 'lat': 10, 'lon': 10},
 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
 'filters': None,
 '_FillValue': np.int16(-32768),
 'scale_factor': 0.009999999776482582,
 'add_offset': 273.1499938964844,
 'dtype': dtype('int16')}

In [16]:
# Clean up test dataset
team_store.delete_data(output_id)