# Read time comparison: Zarr vs HDF5 
This notebook computes the maximum water level during Hurricane Ike from a storm surge model on a triangular mesh with about 9 million nodes, which requires reading 53 GB of data. The data is stored both as Zarr and as NetCDF4/HDF5, in 11 MB chunks with zlib (level 5) compression and no filters.

Instead of reading the NetCDF4/HDF5 file with an HDF5 or NetCDF4 library, we extract its metadata, including the byte offset and length of each chunk, into an fsspec ReferenceFileSystem file, create a mapper from that file, and then read the mapper with the Zarr library.
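The idea behind a reference file can be illustrated with a toy example that uses only the Python standard library. This is a simplified sketch, not fsspec's implementation: the file layout, the `zeta/0` style keys, and the `read_chunk` helper are all hypothetical. It shows that once each chunk's byte offset and length are recorded, a chunk can be fetched with a plain byte-range read and decompressed without any HDF5 library.

```python
import json
import os
import struct
import tempfile
import zlib

# Build a stand-in for a binary data file (hypothetical layout): an opaque
# header followed by two zlib-compressed chunks of four float64 values each.
chunks = [
    struct.pack("<4d", 0.0, 1.0, 2.0, 3.0),
    struct.pack("<4d", 4.0, 5.0, 6.0, 7.0),
]
header = b"\x89toy-file-header"
path = os.path.join(tempfile.mkdtemp(), "toy.bin")

refs = {}  # chunk key -> [file path, byte offset, byte length]
with open(path, "wb") as f:
    f.write(header)
    for i, raw in enumerate(chunks):
        compressed = zlib.compress(raw, 5)
        refs[f"zeta/{i}"] = [path, f.tell(), len(compressed)]
        f.write(compressed)

# The references are plain JSON, so they can be stored separately from the data.
refs_json = json.dumps(refs)


def read_chunk(references, key):
    """Fetch one chunk with a byte-range read, then decompress it."""
    url, offset, length = references[key]
    with open(url, "rb") as f:
        f.seek(offset)
        compressed = f.read(length)
    return list(struct.unpack("<4d", zlib.decompress(compressed)))


loaded = json.loads(refs_json)
second_chunk = read_chunk(loaded, "zeta/1")
print(second_chunk)
```

In the real workflow the byte-range reads go to object storage rather than a local file, and the Zarr library handles decompression using the codec recorded in the extracted metadata.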

Using a cluster with 60 cores, we find no significant difference in performance between this approach and reading native Zarr.

In [1]:
import xarray as xr
import zarr
import fsspec
import fsspec.implementations.reference as refs
import intake
import intake_xarray

### Start a Dask cluster to crunch the data

In [2]:
from dask.distributed import Client
from dask_gateway import Gateway
gateway = Gateway()
cluster = gateway.new_cluster()

In [3]:
cluster.scale(30);

In [4]:
client = Client(cluster)

In [5]:
client

Client
  Scheduler: gateway://traefik-prod-dask-gateway.prod:80/prod.a8824221e75842b380a5f36f33da870d
  Dashboard: https://hub.aws-uswest2-binder.pangeo.io/services/dask-gateway/clusters/prod.a8824221e75842b380a5f36f33da870d/status
Cluster
  Workers: 30
  Cores: 60
  Memory: 120.00 GiB


### Open Intake Catalog

In [6]:
cat = intake.open_catalog('intake_catalog.yml')
list(cat)

['ike-hdf5', 'ike-zarr', 'ike-hdf5-30']

### Zarr library reading HDF5 file with fsspec

In [7]:
ds_hdf5 = cat['ike-hdf5'].to_dask()
print(ds_hdf5.zeta.encoding, '\n')
ds_hdf5.zeta

{'chunks': (10, 141973), 'preferred_chunks': {'time': 10, 'node': 141973}, 'compressor': Zlib(level=5), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64'), 'coordinates': 'y x'} 



`zeta` (time: 720, node: 9228245):

| | Array | Chunk |
|---|---|---|
| Bytes | 49.50 GiB | 10.83 MiB |
| Shape | (720, 9228245) | (10, 141973) |
| Count | 4681 Tasks | 4680 Chunks |
| Type | float64 | numpy.ndarray |

Each of the two coordinate arrays:

| | Array | Chunk |
|---|---|---|
| Bytes | 70.41 MiB | 70.41 MiB |
| Shape | (9228245,) | (9228245,) |
| Count | 2 Tasks | 1 Chunks |
| Type | float64 | numpy.ndarray |


In [8]:
%%time
max1 = ds_hdf5['zeta'].max(dim='time').compute()

CPU times: user 1.14 s, sys: 364 ms, total: 1.5 s
Wall time: 30.8 s
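
As a sanity check, the sizes quoted in the introduction and the effective read rate can be recomputed from the shapes and wall time reported above. This is a back-of-the-envelope sketch: the throughput is in decompressed bytes per wall-clock second, and the number of compressed bytes actually transferred is not reported in this notebook.

```python
# Shapes and timing copied from the notebook output above.
shape = (720, 9228245)        # (time, node)
chunk_shape = (10, 141973)    # stored chunk shape, from the encoding
itemsize = 8                  # bytes per float64 value
wall_seconds = 30.8
cores = 60

total_bytes = shape[0] * shape[1] * itemsize
chunk_bytes = chunk_shape[0] * chunk_shape[1] * itemsize

# Chunks along each axis, rounding up to cover any partial chunk.
chunks_per_axis = [-(-s // c) for s, c in zip(shape, chunk_shape)]
n_chunks = chunks_per_axis[0] * chunks_per_axis[1]

GB, GiB, MiB = 1e9, 2**30, 2**20
throughput_gib_s = total_bytes / GiB / wall_seconds
per_core_mib_s = total_bytes / MiB / wall_seconds / cores

print(f"total: {total_bytes / GB:.2f} GB = {total_bytes / GiB:.2f} GiB")
print(f"chunk: {chunk_bytes / 1e6:.2f} MB = {chunk_bytes / MiB:.2f} MiB")
print(f"chunks: {chunks_per_axis[0]} x {chunks_per_axis[1]} = {n_chunks}")
print(f"throughput: {throughput_gib_s:.2f} GiB/s, {per_core_mib_s:.1f} MiB/s per core")
```

The 53 GB and 49.50 GiB figures are the same quantity in decimal and binary units, and likewise for the 11 MB and 10.83 MiB chunk sizes.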


### Zarr library reading HDF5 file with fsspec, with a chunk size of 30 in time

In [9]:
ds_hdf5_30 = cat['ike-hdf5-30'].to_dask()
print(ds_hdf5_30.zeta.encoding, '\n')
ds_hdf5_30.zeta

{'chunks': (10, 141973), 'preferred_chunks': {'time': 10, 'node': 141973}, 'compressor': Zlib(level=5), 'filters': None, '_FillValue': nan, 'dtype': dtype('float64'), 'coordinates': 'y x'} 



`zeta` (time: 720, node: 9228245):

| | Array | Chunk |
|---|---|---|
| Bytes | 49.50 GiB | 32.50 MiB |
| Shape | (720, 9228245) | (30, 141973) |
| Count | 1561 Tasks | 1560 Chunks |
| Type | float64 | numpy.ndarray |

Each of the two coordinate arrays:

| | Array | Chunk |
|---|---|---|
| Bytes | 70.41 MiB | 70.41 MiB |
| Shape | (9228245,) | (9228245,) |
| Count | 2 Tasks | 1 Chunks |
| Type | float64 | numpy.ndarray |


In [10]:
%%time
max2 = ds_hdf5_30['zeta'].max(dim='time').compute()

CPU times: user 492 ms, sys: 369 ms, total: 861 ms
Wall time: 19.7 s
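
The two HDF5 runs differ only in the Dask chunk size along time: the encoding still reports stored chunks of shape (10, 141973), so each Dask chunk of 30 time steps spans three stored chunks. The sketch below recomputes the chunk counts and the ratio of the two wall times from the output above. It does not establish why the second run is faster. Fewer and larger tasks reducing scheduling overhead is one plausible explanation, but that is an assumption and is not measured in this notebook.

```python
import math

shape = (720, 9228245)          # (time, node)
stored_chunks = (10, 141973)    # on-disk chunk shape, from the encoding
itemsize = 8                    # bytes per float64 value


def count_chunks(array_shape, chunk_shape):
    """Number of chunks needed to tile array_shape, rounding up per axis."""
    return math.prod(-(-s // c) for s, c in zip(array_shape, chunk_shape))


# Dask chunk shapes and wall times copied from the two runs above.
runs = {
    "time chunk 10": {"dask_chunks": (10, 141973), "wall_s": 30.8},
    "time chunk 30": {"dask_chunks": (30, 141973), "wall_s": 19.7},
}

summary = {}
for label, run in runs.items():
    dask_chunks = run["dask_chunks"]
    summary[label] = {
        "chunks": count_chunks(shape, dask_chunks),
        "stored_chunks_per_dask_chunk": count_chunks(dask_chunks, stored_chunks),
        "chunk_mib": math.prod(dask_chunks) * itemsize / 2**20,
    }
    print(label, summary[label])

speedup = runs["time chunk 10"]["wall_s"] / runs["time chunk 30"]["wall_s"]
print(f"wall time ratio: {speedup:.2f}")
```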
