## 10 - How to save a data cube with a desired chunking
### A DeepESDL example notebook 

This notebook demonstrates how to modify the chunking of a dataset before persisting it.

Please also refer to the [DeepESDL documentation](https://deepesdl.readthedocs.io/en/latest/guide/jupyterlab/) and visit the platform's [website](https://www.earthsystemdatalab.net/) for further information!

Brockmann Consult, 2023

-----------------

**This notebook runs with the Python environment `deepesdl-xcube-1.1.2`. Please check out the documentation for [help on changing the environment](https://deepesdl.readthedocs.io/en/latest/guide/jupyterlab/#python-environment-selection-of-the-jupyter-kerne).**

First, let's create a small cube to work with. We will use ESA CCI data for this. Please head over to 3 - Generate CCI data cubes to get more details about the xcube-cci data store :)

In [1]:
from xcube.core.store import new_data_store
import os

In [2]:
store = new_data_store('cciodp')
store

<xcube_cci.dataaccess.CciOdpDataStore at 0x7f1ba43a9c70>

Next, we create a cube containing only 4 days of data:

In [3]:
dataset = store.open_data('esacci.SST.day.L4.SSTdepth.multi-sensor.multi-platform.OSTIA.2-1.sst', 
                          variable_names=['analysed_sst'],
                          time_range=['1981-09-01','1981-09-04'])


In [4]:
dataset

Dask array summary from the dataset repr (each array has 2 graph layers):

| Variable | Type | Shape | Chunk shape | Array bytes | Chunk bytes | Chunks |
|---|---|---|---|---|---|---|
| lat_bnds | float32 | (3600, 2) | (3600, 2) | 28.12 kiB | 28.12 kiB | 1 |
| lon_bnds | float32 | (7200, 2) | (7200, 2) | 56.25 kiB | 56.25 kiB | 1 |
| time_bnds | datetime64[ns] | (4, 2) | (4, 2) | 64 B | 64 B | 1 |
| analysed_sst | float32 | (4, 3600, 7200) | (1, 600, 1200) | 395.51 MiB | 2.75 MiB | 144 |

In the example above, we can see that the variable analysed_sst is chunked as follows: (1, 600, 1200). This means each chunk contains 1 time value, 600 lat values and 1200 lon values.
Variables whose chunks contain 1 time value and a large spatial extent are optimal for visualising or plotting a single time stamp.

For analysing long time series, it is beneficial to chunk a dataset accordingly, so that each chunk contains more values along the time dimension and fewer along the spatial dimensions.
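
To make the trade-off concrete, the following minimal sketch compares the chunking shown in the dataset above with the time-oriented chunking used later in this notebook. The helper `chunk_stats` is hypothetical (it is not part of xcube or xarray) and assumes a regular chunk grid and 4-byte `float32` values.

```python
import math

def chunk_stats(shape, chunks, itemsize=4):
    """Return (number of chunks, bytes per full chunk) for a regular chunking."""
    # Number of chunks along each dimension, rounded up, multiplied together.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    # Size of one full (uncompressed) chunk in bytes.
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes

shape = (4, 3600, 7200)  # (time, lat, lon) of analysed_sst

for label, chunks in [("map-oriented", (1, 600, 1200)),
                      ("time-oriented", (4, 120, 120))]:
    n_chunks, chunk_bytes = chunk_stats(shape, chunks)
    print(f"{label}: {n_chunks} chunks of {chunk_bytes / 1024:.2f} KiB")
```

The resulting counts (144 and 1800 chunks) agree with the values reported in the dataset summaries in this notebook. Reading a full time series at one location touches a single chunk with the time-oriented layout, but four chunks with the map-oriented one.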

In [5]:
# time optimised chunking - please note, this is just an example
time_chunksize = 4
x_chunksize = 120  # or lon
y_chunksize = 120 # or lat

Now the chunking is applied to all data variables, skipping `crs` and any variable whose name contains `_bounds`, if present:

In [6]:
for var_name in dataset.data_vars:
    if var_name != "crs" and "_bounds" not in var_name:
        print(var_name)
        dataset[var_name] = dataset[var_name].chunk({'time': time_chunksize, 'lat': y_chunksize, 'lon': x_chunksize})

analysed_sst


To save a copy of a cube with a specific chunking, the encoding must be adjusted accordingly.

In [7]:
encoding_dict = dict()

We want the coordinate variables to be stored in the most performant way, so we make sure that they are not chunked, i.e. each one is stored as a single chunk covering its full shape. This can be specified via the encoding:

In [8]:
coords_encoding = {k: dict(chunks=v.shape) for k, v in dataset.coords.items()} 

In [9]:
coords_encoding

{'lat': {'chunks': (3600,)},
 'lat_bnds': {'chunks': (3600, 2)},
 'lon': {'chunks': (7200,)},
 'lon_bnds': {'chunks': (7200, 2)},
 'time': {'chunks': (4,)},
 'time_bnds': {'chunks': (4, 2)}}

Next, specify the chunking in the encoding of the data variables, and ensure that empty chunks are not written to disk by adding `write_empty_chunks=False`. This saves space on disk. Again, `crs` is skipped if present.

In [10]:
vars_encoding = {
    k: dict(chunks=(time_chunksize, y_chunksize, x_chunksize), write_empty_chunks=False)
    for k in dataset.data_vars
    if k != "crs"
}

In [11]:
vars_encoding

{'analysed_sst': {'chunks': (4, 120, 120), 'write_empty_chunks': False}}

Next, combine both dictionaries to form the encoding for the entire dataset.

In [12]:
encoding_dict.update(coords_encoding)
encoding_dict.update(vars_encoding)

In [13]:
encoding_dict

{'lat': {'chunks': (3600,)},
 'lat_bnds': {'chunks': (3600, 2)},
 'lon': {'chunks': (7200,)},
 'lon_bnds': {'chunks': (7200, 2)},
 'time': {'chunks': (4,)},
 'time_bnds': {'chunks': (4, 2)},
 'analysed_sst': {'chunks': (4, 120, 120), 'write_empty_chunks': False}}
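
Before writing, it can be worth checking that every `chunks` entry matches the dimensionality of its variable. The sketch below is one possible way to do so; `check_encoding` is a hypothetical helper, and treating a chunk size larger than the dimension as a problem is an assumption of this sketch rather than a rule imposed by the storage format.

```python
def check_encoding(encoding, shapes):
    """Return a list of problems found in the 'chunks' entries of an encoding dict."""
    problems = []
    for name, enc in encoding.items():
        if name not in shapes:
            problems.append(f"{name}: not a variable of the dataset")
            continue
        chunks, shape = enc["chunks"], shapes[name]
        if len(chunks) != len(shape):
            problems.append(f"{name}: {len(chunks)} chunk sizes for {len(shape)} dimensions")
        elif any(c < 1 or c > s for c, s in zip(chunks, shape)):
            problems.append(f"{name}: chunks {chunks} do not fit shape {shape}")
    return problems

# Shapes as shown in this notebook. With a real dataset, the mapping could be
# built as {name: var.shape for name, var in dataset.variables.items()}.
shapes = {
    "lat": (3600,), "lat_bnds": (3600, 2),
    "lon": (7200,), "lon_bnds": (7200, 2),
    "time": (4,), "time_bnds": (4, 2),
    "analysed_sst": (4, 3600, 7200),
}
encoding = {
    "lat": {"chunks": (3600,)}, "lat_bnds": {"chunks": (3600, 2)},
    "lon": {"chunks": (7200,)}, "lon_bnds": {"chunks": (7200, 2)},
    "time": {"chunks": (4,)}, "time_bnds": {"chunks": (4, 2)},
    "analysed_sst": {"chunks": (4, 120, 120), "write_empty_chunks": False},
}
print(check_encoding(encoding, shapes))  # an empty list means no problems were found
```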

Next, save it to the team S3 storage.

To store the cube in your team's user space, first retrieve the access details from your environment variables as follows:

In [14]:
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
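
If one of these variables is not set, `os.environ[...]` raises a `KeyError` for the first missing name only. As an optional variation, the sketch below reports all missing variables at once. `read_required_env` is a hypothetical helper, and the stand-in mapping contains placeholder values, not real credentials.

```python
import os

def read_required_env(names, environ=os.environ):
    """Return {name: value}; raise one error listing every missing variable."""
    missing = [n for n in names if not environ.get(n)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {n: environ[n] for n in names}

# Demonstration with a stand-in mapping in which the bucket is not set.
fake_env = {"S3_USER_STORAGE_KEY": "key", "S3_USER_STORAGE_SECRET": "secret"}
try:
    read_required_env(
        ["S3_USER_STORAGE_KEY", "S3_USER_STORAGE_SECRET", "S3_USER_STORAGE_BUCKET"],
        environ=fake_env,
    )
except RuntimeError as err:
    print(err)
```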

You need to instantiate an S3 data store pointing to the team bucket:

In [15]:
from xcube.core.store import new_data_store
team_store = new_data_store("s3", 
                       root=S3_USER_STORAGE_BUCKET, 
                       storage_options=dict(anon=False, 
                                            key=S3_USER_STORAGE_KEY, 
                                            secret=S3_USER_STORAGE_SECRET))


If you have not yet stored any data in your user space, the returned list will be empty:

In [16]:
list(team_store.get_data_ids())

[]

In [17]:
output_id = 'analysed_sst.zarr'

Now let's write the data to the team S3 storage, remembering to specify the encoding while doing so:

In [18]:
team_store.write_data(dataset, output_id, encoding=encoding_dict)

'analysed_sst.zarr'

If you list the contents of your data store again, you will now see the newly written dataset in the list:

In [19]:
list(team_store.get_data_ids())

['analysed_sst.zarr']

Let's verify that our chunking has been applied: 

In [20]:
ds = team_store.open_data(output_id)

In [21]:
ds

Dask array summary from the dataset repr (each array has 2 graph layers):

| Variable | Type | Shape | Chunk shape | Array bytes | Chunk bytes | Chunks |
|---|---|---|---|---|---|---|
| lat_bnds | float32 | (3600, 2) | (3600, 2) | 28.12 kiB | 28.12 kiB | 1 |
| lon_bnds | float32 | (7200, 2) | (7200, 2) | 56.25 kiB | 56.25 kiB | 1 |
| time_bnds | datetime64[ns] | (4, 2) | (4, 2) | 64 B | 64 B | 1 |
| analysed_sst | float32 | (4, 3600, 7200) | (4, 120, 120) | 395.51 MiB | 225.00 kiB | 1800 |

Looks good, now let's clean up the example cube :) 

In [22]:
team_store.delete_data(output_id)