# Icechunk Performance - Zarr V2

The benchmarks below use data from the [NCAR ERA5 AWS Public Dataset](https://nsf-ncar-era5.s3.amazonaws.com/index.html).

In [1]:
import xarray as xr
import zarr
import dask
import fsspec
from dask.diagnostics import ProgressBar

print('xarray:  ', xr.__version__)
print('dask:    ', dask.__version__)
print('zarr:    ', zarr.__version__)

xarray:   2024.7.0
dask:     2024.6.2
zarr:     2.18.2


In [6]:
url = "https://nsf-ncar-era5.s3.amazonaws.com/e5.oper.an.pl/194106/e5.oper.an.pl.128_060_pv.ll025sc.1941060100_1941060123.nc"
%time ds = xr.open_dataset(fsspec.open(url).open(), engine="h5netcdf", chunks={"time": 1}).drop_encoding()

CPU times: user 123 ms, sys: 44.5 ms, total: 168 ms
Wall time: 1.91 s




In [7]:
print(ds)

<xarray.Dataset> Size: 4GB
Dimensions:    (time: 24, level: 37, latitude: 721, longitude: 1440)
Coordinates:
  * latitude   (latitude) float64 6kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * level      (level) float64 296B 1.0 2.0 3.0 5.0 ... 925.0 950.0 975.0 1e+03
  * longitude  (longitude) float64 12kB 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
  * time       (time) datetime64[ns] 192B 1941-06-01 ... 1941-06-01T23:00:00
Data variables:
    PV         (time, level, latitude, longitude) float32 4GB dask.array<chunksize=(1, 37, 721, 1440), meta=np.ndarray>
    utc_date   (time) int32 96B dask.array<chunksize=(1,), meta=np.ndarray>
Attributes:
    DATA_SOURCE:          ECMWF: https://cds.climate.copernicus.eu, Copernicu...
    NETCDF_CONVERSION:    CISL RDA: Conversion from ECMWF GRIB 1 data to netC...
    NETCDF_VERSION:       4.8.1
    CONVERSION_PLATFORM:  Linux r1i4n4 4.12.14-95.51-default #1 SMP Fri Apr 1...
    CONVERSION_DATE:      Wed May 10 06:33:49 MDT 2023
    Conventions:       

### Load Data from HDF5 File

Loading the full 4 GB dataset directly from an HDF5 (netCDF4) file on S3 is slow, even with Dask: the run below takes about a minute.

In [8]:
with ProgressBar():
    dsl = ds.load()

[########################################] | 100% Completed | 61.19 s


### Write Zarr Store - No Dask

In [9]:
encoding = {
    "PV": {
        "compressor": zarr.Zstd(),
        "chunks": (1, 1, 721, 1440)
    }
}
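
As a quick check on what this encoding implies, the chunk count and uncompressed chunk size can be worked out from the shape in the dataset repr above and the chunk shape in `encoding`. This is only an arithmetic sketch, not part of the original benchmark; the variable names are illustrative, and the sizes are before Zstd compression, so the objects actually stored on S3 will be smaller.

```python
import math

# Shape of PV (from the dataset repr) and the chunk shape requested in `encoding`.
shape = (24, 37, 721, 1440)
chunks = (1, 1, 721, 1440)
itemsize = 4  # bytes per float32 value

# Number of chunks along each dimension, then overall.
chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
n_chunks = math.prod(chunks_per_dim)

# Uncompressed size of one chunk and of the whole array.
chunk_nbytes = math.prod(chunks) * itemsize
total_nbytes = math.prod(shape) * itemsize

print(f"chunks:                       {n_chunks}")
print(f"uncompressed bytes per chunk: {chunk_nbytes:,}")
print(f"uncompressed bytes total:     {total_nbytes:,}")
```

Each chunk is one full latitude/longitude slab for a single time step and level, so every chunk maps to one object in the store.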

In [17]:
target_url = "s3://icechunk-test/ryan/zarr-v2/test-era5-11"
store = zarr.storage.FSStore(target_url)

In [18]:
%time dsl.to_zarr(store, consolidated=False, encoding=encoding, mode="w")

CPU times: user 21.4 s, sys: 3.73 s, total: 25.1 s
Wall time: 31.8 s


<xarray.backends.zarr.ZarrStore at 0x7efac8869fc0>

In [22]:
# Same write with Dask: rechunk to match the Zarr chunk shape (1, 1, 721, 1440)
dslc = dsl.chunk({"time": 1, "level": 1})
store_d = zarr.storage.FSStore(target_url + '-dask')
with ProgressBar():
    dslc.to_zarr(store_d, consolidated=False, encoding=encoding, mode="w")

[########################################] | 100% Completed | 12.30 s


### Read Data Back

In [12]:
%time dss = xr.open_dataset(store, consolidated=False, engine="zarr")

CPU times: user 50.4 ms, sys: 7.21 ms, total: 57.6 ms
Wall time: 487 ms


In [13]:
%time dss.PV[0, 0, 0, 0].values

CPU times: user 15.2 ms, sys: 671 μs, total: 15.9 ms
Wall time: 97.4 ms


array(0.00710905, dtype=float32)

In [23]:
%time _ = dss.compute()

CPU times: user 8.6 s, sys: 1.53 s, total: 10.1 s
Wall time: 22.6 s


In [15]:
dssd = xr.open_dataset(store, consolidated=False, engine="zarr").chunk({"time": 1, "level": 10})

In [16]:
with ProgressBar():
    _ = dssd.compute()

[########################################] | 100% Completed | 4.55 s


In [22]:
1893510506 / 2 / 1e6

946.755253

In [20]:
group = zarr.open_group(store, mode="r")
group.info

Name        : /
Type        : zarr.hierarchy.Group
Read-only   : True
Store type  : zarr.storage.FSStore
No. members : 6
No. arrays  : 6
No. groups  : 0
Arrays      : PV, latitude, level, longitude, time, utc_date


In [21]:
group.PV.info

Name        : /PV
Type        : zarr.core.Array
Data type   : float32
Shape       : (24, 37, 721, 1440)
Chunk shape : (1, 1, 721, 1440)
Order       : C
Read-only   : True
Compressor  : Zstd(level=1)
Store type  : zarr.storage.FSStore
No. bytes   : 3687828480 (3.4G)
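
To put the timings above on a common scale, they can be converted to effective throughput using the uncompressed size of `PV` reported by `group.PV.info`. The sketch below is an addition, not part of the original run. It assumes the final progress-bar readings are 61.19 s and 4.55 s, and it ignores the small coordinate and `utc_date` arrays. Because the stored chunks are Zstd-compressed, these figures describe uncompressed data delivered per second, not bytes moved over the network.

```python
# Uncompressed size of PV, from "No. bytes" in group.PV.info.
nbytes = 3_687_828_480

# Wall-clock times in seconds, copied from the cell outputs above.
wall_seconds = {
    "load from HDF5 on S3 (Dask)": 61.19,
    "write Zarr, no Dask": 31.8,
    "write Zarr, Dask": 12.30,
    "read Zarr, no Dask": 22.6,
    "read Zarr, Dask": 4.55,
}

# Effective throughput in MB/s (1 MB = 1e6 bytes).
throughput_mb_s = {
    name: nbytes / seconds / 1e6 for name, seconds in wall_seconds.items()
}

for name, rate in throughput_mb_s.items():
    print(f"{name:30s} {rate:7.1f} MB/s")
```

These are single-run numbers, so they are best read as rough orders of magnitude rather than precise measurements.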
