# Consolidating metadata of `zarr` datasets

Examples of how to consolidate the metadata of a netCDF4 dataset, either during or after its conversion to `zarr`.

Author: Charles Blackmon-Luca

## Getting started

We start by importing the necessary packages:

In [1]:
import xarray as xr
import zarr

Note that we are using development versions of `xarray` and `zarr` so that we can take advantage of `zarr`'s consolidated metadata:

In [2]:
print(xr.__version__)
print(zarr.__version__)

0.11.1+64.g612d390
2.2.1.dev140


We connect to a running `dask` cluster:

In [3]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:34227")
client

Client
  Scheduler: tcp://127.0.0.1:34227
  Dashboard: http://127.0.0.1:8787/status
Cluster
  Workers: 4
  Cores: 16
  Memory: 135.44 GB


When `xarray` reads in a `zarr` dataset, it must load all of the metadata immediately. This overhead grows with the number of variables that carry metadata, so `zarr >= 2.3` supports consolidating the metadata into a single `.zmetadata` file. This can be done either when converting a dataset to `zarr`:


In [4]:
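# Clear any output from a previous run
!rm -rf /data2/tracmip/zarr/test/

# Open the monthly netCDF files as one dataset, rechunked along time and plev
monthly = xr.open_mfdataset('/data2/tracmip/ECHAM-6.3/LandOrbit/Amon/*.nc')
monthly = monthly.chunk(chunks={'time': 'auto', 'plev': 'auto'})
# Write to zarr and consolidate the metadata in the same step
monthly.to_zarr('/data2/tracmip/zarr/test/', consolidated=True)

!ls -a /data2/tracmip/zarr/test/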

.      clt	hfss  plev  ps	    rlut    rsus     ta    ts	wap
..     clw	hur   pr    psl     rlutcs  rsuscs   tas   ua	.zattrs
cl     clwvi	hus   prc   rlds    rsds    rsut     tauu  uas	zg
cli    evspsbl	lat   prsn  rldscs  rsdscs  rsutcs   tauv  va	.zgroup
clivi  hfls	lon   prw   rlus    rsdt    sfcWind  time  vas	.zmetadata


Or after the fact with a preexisting `zarr` dataset:

In [5]:
!rm -rf /data2/tracmip/zarr/test/

# Write without consolidation, then consolidate the existing store in place
monthly.to_zarr('/data2/tracmip/zarr/test/')
zarr.consolidate_metadata('/data2/tracmip/zarr/test/')

!ls -a /data2/tracmip/zarr/test/

.      clt	hfss  plev  ps	    rlut    rsus     ta    ts	wap
..     clw	hur   pr    psl     rlutcs  rsuscs   tas   ua	.zattrs
cl     clwvi	hus   prc   rlds    rsds    rsut     tauu  uas	zg
cli    evspsbl	lat   prsn  rldscs  rsdscs  rsutcs   tauv  va	.zgroup
clivi  hfls	lon   prw   rlus    rsdt    sfcWind  time  vas	.zmetadata
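
Either route leaves a `.zmetadata` file at the root of the store. To make its contents concrete, the sketch below mimics consolidation with the standard library alone. It assumes the `zarr` v2 directory layout (`.zgroup`, `.zattrs`, and one `.zarray` per array) and version 1 of the consolidated format, in which `.zmetadata` is a JSON document holding every metadata file under a `metadata` key. The store contents (with an abbreviated `.zarray`) and the helper names `write_json` and `consolidate_by_hand` are made up for illustration; in practice `zarr.consolidate_metadata` does this work.

```python
import json
import os
import tempfile

# File names that hold metadata in a zarr v2 directory store
META_NAMES = (".zarray", ".zattrs", ".zgroup")


def write_json(store_path, relpath, obj):
    """Write `obj` as JSON to `relpath` inside the store, creating folders."""
    full = os.path.join(store_path, *relpath.split("/"))
    os.makedirs(os.path.dirname(full), exist_ok=True)
    with open(full, "w") as f:
        json.dump(obj, f)


def consolidate_by_hand(store_path):
    """Collect every metadata file in the store into one dictionary."""
    metadata = {}
    for dirpath, _dirnames, filenames in os.walk(store_path):
        for name in filenames:
            if name in META_NAMES:
                full = os.path.join(dirpath, name)
                # Keys are store-relative and use "/", e.g. "tas/.zattrs"
                key = os.path.relpath(full, store_path).replace(os.sep, "/")
                with open(full) as f:
                    metadata[key] = json.load(f)
    return {"zarr_consolidated_format": 1, "metadata": metadata}


# A made-up store with a single array, `tas` (the .zarray is abbreviated)
store = tempfile.mkdtemp()
write_json(store, ".zgroup", {"zarr_format": 2})
write_json(store, ".zattrs", {})
write_json(store, "tas/.zarray", {"zarr_format": 2, "shape": [480, 96, 192]})
write_json(store, "tas/.zattrs", {"units": "K"})

# Gather the four metadata files and store them as a single document
consolidated = consolidate_by_hand(store)
write_json(store, ".zmetadata", consolidated)

print(sorted(consolidated["metadata"]))
```

A reader that knows about `.zmetadata` can then fetch this one file instead of visiting every array's folder.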


In either case, a `zarr` dataset with consolidated metadata must be opened explicitly as such. In `xarray`, pass `consolidated=True`:

In [6]:
xr.open_zarr('/data2/tracmip/zarr/test/', consolidated=True)

<xarray.Dataset>
Dimensions:  (lat: 96, lon: 192, plev: 17, time: 480)
Coordinates:
  * lat      (lat) float64 88.57 86.72 84.86 83.0 ... -83.0 -84.86 -86.72 -88.57
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
  * plev     (plev) float64 1e+05 9.25e+04 8.5e+04 7e+04 ... 3e+03 2e+03 1e+03
  * time     (time) datetime64[ns] 2066-01-30T23:52:00 ... 2105-12-30T23:52:00
Data variables:
    cl       (time, plev, lat, lon) float32 dask.array<shape=(480, 17, 96, 192), chunksize=(160, 10, 96, 192)>
    cli      (time, plev, lat, lon) float32 dask.array<shape=(480, 17, 96, 192), chunksize=(160, 10, 96, 192)>
    clivi    (time, lat, lon) float32 dask.array<shape=(480, 96, 192), chunksize=(480, 96, 192)>
    clt      (time, lat, lon) float32 dask.array<shape=(480, 96, 192), chunksize=(480, 96, 192)>
    clw      (time, plev, lat, lon) float32 dask.array<shape=(480, 17, 96, 192), chunksize=(160, 10, 96, 192)>
    clwvi    (time, lat, lon) float32 dask.array<shape

And in `zarr`:

In [7]:
zarr.open_consolidated('/data2/tracmip/zarr/test/').info

Name             : /
Type             : zarr.hierarchy.Group
Read-only        : False
Store type       : zarr.storage.ConsolidatedMetadataStore
Chunk store type : zarr.storage.DirectoryStore
No. members      : 45
No. arrays       : 45
No. groups       : 0
Arrays           : cl, cli, clivi, clt, clw, clwvi, evspsbl, hfls, hfss, hur, hus, lat, lon, plev, pr, prc, prsn, prw, ps, psl, rlds, rldscs, rlus, rlut, rlutcs, rsds, rsdscs, rsdt, rsus, rsuscs, rsut, rsutcs, sfcWind, ta, tas, tauu, tauv, time, ts, ua, uas, va, vas, wap, zg
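
Because consolidated and unconsolidated stores are opened differently, it can help to check which kind is on disk first. The helper below is a minimal sketch that assumes a local directory store and simply looks for `.zmetadata` at its root; the name `has_consolidated_metadata` and the two temporary stores are hypothetical and not part of `xarray` or `zarr`.

```python
import os
import tempfile


def has_consolidated_metadata(store_path):
    """Return True if a directory store has a `.zmetadata` file at its root."""
    return os.path.isfile(os.path.join(store_path, ".zmetadata"))


# Two made-up stores: one empty, one carrying an (empty) consolidated file
plain = tempfile.mkdtemp()
consolidated = tempfile.mkdtemp()
with open(os.path.join(consolidated, ".zmetadata"), "w") as f:
    f.write('{"zarr_consolidated_format": 1, "metadata": {}}')

print(has_consolidated_metadata(plain))
print(has_consolidated_metadata(consolidated))
```

For a local store, the result could be passed straight through, as in `xr.open_zarr(path, consolidated=has_consolidated_metadata(path))`.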


Consolidated metadata reduces the number of reads needed to open a `zarr` dataset, which matters most in cloud-based environments, where each metadata file is a separate request to object storage. It is therefore recommended to consolidate `zarr` metadata before uploading data to Pangeo.
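
As a rough way to see what is saved, the sketch below counts the separate metadata files in a directory store, each of which would otherwise need its own read. It assumes the `zarr` v2 file names (`.zarray`, `.zattrs`, `.zgroup`); the helper names `count_metadata_files` and `touch`, and the three-variable stand-in store, are made up for illustration.

```python
import os
import tempfile

# Metadata file names in a zarr v2 directory store
META_NAMES = {".zarray", ".zattrs", ".zgroup"}


def count_metadata_files(store_path):
    """Count the individual metadata files under a directory store."""
    return sum(
        1
        for _dirpath, _dirnames, filenames in os.walk(store_path)
        for name in filenames
        if name in META_NAMES
    )


def touch(path):
    """Create an empty JSON file, making parent folders as needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("{}")


# Stand-in store: group-level metadata plus three arrays
store = tempfile.mkdtemp()
touch(os.path.join(store, ".zgroup"))
touch(os.path.join(store, ".zattrs"))
for var in ["tas", "pr", "ps"]:
    touch(os.path.join(store, var, ".zarray"))
    touch(os.path.join(store, var, ".zattrs"))

n = count_metadata_files(store)
print(f"{n} metadata files without consolidation, 1 with")
```

The count grows with the number of arrays, while a consolidated store always needs a single `.zmetadata` read.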