## How to append data to an existing datacube stored in team S3 storage
### A DeepESDL example notebook 

This notebook demonstrates how to append new data to an existing datacube. This cannot be done using xcube directly yet. 

Please also refer to the [DeepESDL documentation](https://deepesdl.readthedocs.io/en/latest/guide/jupyterlab/) and visit the platform's [website](https://www.earthsystemdatalab.net/) for further information!

Brockmann Consult, 2024

-----------------

**This notebook runs with the Python environment `deepesdl-xcube-1.7.0`. Please check out the documentation for [help on changing the environment](https://deepesdl.readthedocs.io/en/latest/guide/jupyterlab/#python-environment-selection-of-the-jupyter-kerne).**

First, let's create a small cube to which we can later append data. We will use ESA CCI data for this. Please head over to 3 - Generate CCI data cubes for more details about the xcube-cci data store :)

In [1]:
import datetime
import os

from xcube.core.store import new_data_store

In [2]:
store = new_data_store("ccizarr")

Next, we create a cube containing only 2 months of data:

In [3]:
def open_zarrstore(filename, time_range, variables):
    ds = store.open_data(filename)
    subset = ds.sel(time=slice(time_range[0], time_range[1]))
    subset = subset[variables]

    return subset

In [4]:
dataset = open_zarrstore(
    "ESACCI-OC-L3S-RRS-MERGED-1M_MONTHLY_4km_GEO_PML_RRS-1997-2022-fv6.0.zarr",
    time_range=[datetime.datetime(1998, 3, 1), datetime.datetime(1998, 4, 30)],
    variables=["Rrs_412"],
)

In [5]:
dataset

| | Array | Chunk |
|---|---|---|
| Bytes | 284.77 MiB | 17.80 MiB |
| Shape | (2, 4320, 8640) | (1, 2160, 2160) |
| Dask graph | 16 chunks in 3 graph layers | |
| Data type | float32 numpy.ndarray | |
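The figures in the dataset summary above can be checked by hand. The following is a minimal sketch using only the standard library; the helper name `array_stats` is hypothetical, and the item size of 4 bytes assumes the `float32` data type shown in the output.

```python
import math

MIB = 1024 ** 2  # bytes per mebibyte


def array_stats(shape, chunks, itemsize=4):
    """Return (total_bytes, chunk_bytes, n_chunks) for a chunked array.

    itemsize=4 assumes float32 values, as reported in the summary.
    """
    total_bytes = math.prod(shape) * itemsize
    chunk_bytes = math.prod(chunks) * itemsize
    # Chunks per dimension, rounded up, multiplied over all dimensions.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    return total_bytes, chunk_bytes, n_chunks


total, chunk, n = array_stats((2, 4320, 8640), (1, 2160, 2160))
print(f"{total / MIB:.2f} MiB, {chunk / MIB:.2f} MiB per chunk, {n} chunks")
# prints: 284.77 MiB, 17.80 MiB per chunk, 16 chunks
```

With four time steps instead of two, the same calculation gives 32 chunks, which is the layout the cube should have after appending.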


Next, save it to the team S3 storage:

To store the cube in your team's user space, first retrieve the connection details from your environment variables as follows:

In [6]:
S3_USER_STORAGE_KEY = os.environ["S3_USER_STORAGE_KEY"]
S3_USER_STORAGE_SECRET = os.environ["S3_USER_STORAGE_SECRET"]
S3_USER_STORAGE_BUCKET = os.environ["S3_USER_STORAGE_BUCKET"]
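Reading `os.environ[...]` directly raises a bare `KeyError` if one of the variables is not set. As an optional variation, the lookup could be wrapped so that all missing names are reported at once. This is only a sketch: the helper `read_storage_settings` is hypothetical and not part of xcube or DeepESDL, and the demo values are placeholders.

```python
import os

REQUIRED_VARS = (
    "S3_USER_STORAGE_KEY",
    "S3_USER_STORAGE_SECRET",
    "S3_USER_STORAGE_BUCKET",
)


def read_storage_settings(env=None):
    """Return the three storage settings, or raise listing all missing ones."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Demo with placeholder values; in the notebook, call read_storage_settings()
# without arguments to read from the real environment.
demo_env = {
    "S3_USER_STORAGE_KEY": "demo-key",
    "S3_USER_STORAGE_SECRET": "demo-secret",
    "S3_USER_STORAGE_BUCKET": "demo-bucket",
}
settings = read_storage_settings(demo_env)
print(settings["S3_USER_STORAGE_BUCKET"])
```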

You need to instantiate an S3 data store pointing to the team bucket:

In [7]:
team_store = new_data_store(
    "s3",
    root=S3_USER_STORAGE_BUCKET,
    storage_options=dict(
        anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
    ),
)

List the datasets currently in your user space. If you have not stored any data there yet, the returned list will be empty:

In [8]:
list(team_store.get_data_ids())

['SST.levels',
 'amazonas_v8.zarr',
 'amazonas_v9.zarr',
 'noise_trajectory.zarr',
 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr',
 's2-demo-cube.zarr']

Appending data currently only works with .zarr format and is not supported in .levels yet. 

In [9]:
output_id = "ocean_color.zarr"

In [10]:
team_store.write_data(dataset, output_id, replace=True)

'ocean_color.zarr'

If you list the contents of your data store again, you will now see the newly written dataset in the list:

In [11]:
list(team_store.get_data_ids())

['SST.levels',
 'amazonas_v8.zarr',
 'amazonas_v9.zarr',
 'noise_trajectory.zarr',
 'ocean_color.zarr',
 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr',
 's2-demo-cube.zarr']

Now, to append new time stamps, xcube cannot be used but there is a workaround :) 

In [12]:
# needed for appending data to an existing cube saved in s3 storage
import s3fs

Connect to your team storage in S3:

In [13]:
# Connect to AWS S3 storage
fs = s3fs.S3FileSystem(
    anon=False, key=S3_USER_STORAGE_KEY, secret=S3_USER_STORAGE_SECRET
)

In [14]:
s3_client_kwargs = {"endpoint_url": "https://s3.eu-central-1.amazonaws.com"}
target_bucket_path = f"s3://{S3_USER_STORAGE_BUCKET}"
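Note that `s3_client_kwargs` is defined here but not passed to `s3fs.S3FileSystem` in the cells shown. If an explicit endpoint is needed, s3fs accepts it through its `client_kwargs` argument. The sketch below only assembles the keyword arguments so that it runs without credentials; the helper `build_s3fs_kwargs` and the demo values are hypothetical.

```python
def build_s3fs_kwargs(key, secret, endpoint_url=None):
    """Assemble keyword arguments for s3fs.S3FileSystem.

    The endpoint is only included when one is given, so the default
    AWS endpoint resolution is left untouched otherwise.
    """
    kwargs = {"anon": False, "key": key, "secret": secret}
    if endpoint_url is not None:
        kwargs["client_kwargs"] = {"endpoint_url": endpoint_url}
    return kwargs


fs_kwargs = build_s3fs_kwargs(
    "demo-key", "demo-secret", "https://s3.eu-central-1.amazonaws.com"
)
print(fs_kwargs["client_kwargs"])

# In the notebook, with real credentials and s3fs imported:
# fs = s3fs.S3FileSystem(**fs_kwargs)
```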

We create a new dataset, with different time stamps, which we want to append to the existing cube: 

In [15]:
dataset = open_zarrstore(
    "ESACCI-OC-L3S-RRS-MERGED-1M_MONTHLY_4km_GEO_PML_RRS-1997-2022-fv6.0.zarr",
    time_range=[datetime.datetime(1998, 5, 1), datetime.datetime(1998, 6, 30)],
    variables=["Rrs_412"],
)

In [16]:
dataset

| | Array | Chunk |
|---|---|---|
| Bytes | 284.77 MiB | 17.80 MiB |
| Shape | (2, 4320, 8640) | (1, 2160, 2160) |
| Dask graph | 16 chunks in 3 graph layers | |
| Data type | float32 numpy.ndarray | |


In [17]:
# we need to create a mapper pointing to the existing cube, stored in the team s3 storage
mapper = fs.get_mapper(f"{target_bucket_path}/{output_id}")

Now we can append the new dataset to the existing datacube:

In [18]:
dataset.to_zarr(mapper, mode="a", append_dim="time", consolidated=True)

<xarray.backends.zarr.ZarrStore at 0x7f211d077240>

In [19]:
list(team_store.get_data_ids())

['SST.levels',
 'amazonas_v8.zarr',
 'amazonas_v9.zarr',
 'noise_trajectory.zarr',
 'ocean_color.zarr',
 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr',
 's2-demo-cube.zarr']

Check if the cube now contains the expected time stamps: 

In [20]:
ds = team_store.open_data(output_id)

As expected, we now find all four monthly time steps (March to June 1998) in the datacube. **Please note: you are responsible for passing the time stamps in the right order. Appending does not sort them, so if they are out of order, this may cause trouble later on and you will need to reorder the time dimension.**

In [21]:
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 569.53 MiB | 17.80 MiB |
| Shape | (4, 4320, 8640) | (1, 2160, 2160) |
| Dask graph | 32 chunks in 2 graph layers | |
| Data type | float32 numpy.ndarray | |
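Since the order of the time stamps is the user's responsibility, it can help to check it before calling `to_zarr` with `append_dim="time"`. The following is a minimal sketch with a hypothetical helper, `check_append_order`; the time stamps in the demo are illustrative and are not read from the CCI dataset.

```python
import datetime


def check_append_order(existing_times, new_times):
    """Raise ValueError unless new_times can be appended after existing_times."""
    existing_times = list(existing_times)
    new_times = list(new_times)
    if not new_times:
        raise ValueError("No new time stamps to append.")
    # The new time stamps must be strictly increasing among themselves.
    for earlier, later in zip(new_times, new_times[1:]):
        if later <= earlier:
            raise ValueError(f"New time stamps are not increasing: {earlier}, {later}")
    # The first new time stamp must come after everything already stored.
    if existing_times and new_times[0] <= max(existing_times):
        raise ValueError("New time stamps overlap with or precede existing ones.")


existing = [datetime.datetime(1998, 3, 1), datetime.datetime(1998, 4, 1)]
new = [datetime.datetime(1998, 5, 1), datetime.datetime(1998, 6, 1)]
check_append_order(existing, new)  # passes silently
print("OK to append")
```

In the notebook, the two lists could be taken from the opened datasets, for example with `list(ds.indexes["time"])`. If a cube already contains unordered time stamps, `ds.sortby("time")` returns a reordered dataset, which then has to be written out again.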


It is also possible to append a variable with the same dimensions to an existing datacube:

In [22]:
dataset = open_zarrstore(
    "ESACCI-OC-L3S-RRS-MERGED-1M_MONTHLY_4km_GEO_PML_RRS-1997-2022-fv6.0.zarr",
    time_range=[datetime.datetime(1998, 3, 1), datetime.datetime(1998, 6, 30)],
    variables=["Rrs_443"],
)

In [23]:
dataset

| | Array | Chunk |
|---|---|---|
| Bytes | 569.53 MiB | 17.80 MiB |
| Shape | (4, 4320, 8640) | (1, 2160, 2160) |
| Dask graph | 32 chunks in 3 graph layers | |
| Data type | float32 numpy.ndarray | |


Now we can append the new dataset with the additional variable to the existing datacube:

In [24]:
dataset.to_zarr(mapper, mode="a", consolidated=True)

<xarray.backends.zarr.ZarrStore at 0x7f211d0aadc0>

Check if the cube now contains the expected new variable: 

In [25]:
ds = team_store.open_data(output_id)

In [26]:
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 569.53 MiB | 17.80 MiB |
| Shape | (4, 4320, 8640) | (1, 2160, 2160) |
| Dask graph | 32 chunks in 2 graph layers | |
| Data type | float32 numpy.ndarray | |

| | Array | Chunk |
|---|---|---|
| Bytes | 569.53 MiB | 17.80 MiB |
| Shape | (4, 4320, 8640) | (1, 2160, 2160) |
| Dask graph | 32 chunks in 2 graph layers | |
| Data type | float32 numpy.ndarray | |


**In other use cases, the chunking of the new variable may differ from that of the existing cube. You can ensure the desired chunking before writing the data by specifying it in the encoding passed to `to_zarr`. To learn about chunking, head over to the example notebook 10 - Chunking of Datasets.**
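As a rough illustration of the note on chunking, the encoding is a dictionary that maps each variable name to its Zarr settings, with the chunk sizes under the `"chunks"` key. The helper `build_chunk_encoding` below is hypothetical; the chunk sizes are the ones shown in the outputs above.

```python
def build_chunk_encoding(variables, chunks):
    """Map each variable name to a Zarr encoding with the given chunk sizes."""
    return {name: {"chunks": tuple(chunks)} for name in variables}


encoding = build_chunk_encoding(["Rrs_443"], (1, 2160, 2160))
print(encoding)
# prints: {'Rrs_443': {'chunks': (1, 2160, 2160)}}

# In the notebook, the encoding would be passed when adding the new variable:
# dataset.to_zarr(mapper, mode="a", consolidated=True, encoding=encoding)
```

For Dask-backed data, the in-memory chunks may need to be aligned with the encoding first, for example with `dataset.chunk(...)`. The dimension names to use there depend on the dataset.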

Alright, now you know how to append new time stamps or variables to an existing cube - let's clean up our example :) 

In [27]:
team_store.delete_data(output_id)

In [28]:
list(team_store.get_data_ids())

['SST.levels',
 'amazonas_v8.zarr',
 'amazonas_v9.zarr',
 'noise_trajectory.zarr',
 'reanalysis-era5-single-levels-monthly-means-subset-2001-2010_TMH.zarr',
 's2-demo-cube.zarr']