# A brief example of how to extract a single time step from a zarr store

Includes a clarification of when loading into memory is actually triggered.

In [1]:
import numpy as np
import xarray as xr
import zarr

In [2]:
import dask
dask.config.set({"array.slicing.split_large_chunks": True})

from dask.distributed import Client
client = Client()

In [3]:
client

Client: Scheduler tcp://127.0.0.1:42715, Dashboard http://127.0.0.1:8787/status
Cluster: 8 workers, 48 cores, 67.11 GB memory


In [4]:
# load a zarr store; function taken from calc_ccf_sampling.ipynb
def load_var_ddt_temp_rad_fromflux(expid, var):
    # open previously calculated radiative heating rates from zarr store
    zarr_store = '/work/bb1018/nawdex-hackathon_pp/ddttemp_rad-from-fluxes/'+expid+'_ddttemp_rad-from-fluxes_DOM01_ML.zarr'
    return ( xr.open_zarr(zarr_store)
             [var].resample(time="1H").nearest(tolerance="5M").squeeze() )

Once opened as below, ds is not loaded into memory. This is dask's lazy data access model, and it can be seen from the fact that 442 tasks are associated with the dataset, none of which have run yet.

In [5]:
ds = load_var_ddt_temp_rad_fromflux('nawdexnwp-2km-mis-0001', 'ddt_temp_radlw_fromflux')

In [6]:
ds

,Array,Chunk
Bytes,237.15 GB,600.00 MB
Shape,"(97, 75, 8149456)","(2, 75, 1000000)"
Count,442 Tasks,441 Chunks
Type,float32,numpy.ndarray
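
The lazy behaviour can be reproduced without the real store. The following is a minimal sketch on a small synthetic dask-backed array; the array `lazy`, its shape, and its chunking are made up for illustration and are not taken from the dataset above. It shows that chunk counts, task counts, and the total size are all available before any data is loaded.

```python
import dask.array as da
import xarray as xr

# Hypothetical small stand-in for the real (time, height, ncells) array.
lazy = xr.DataArray(
    da.zeros((8, 5, 1000), chunks=(2, 5, 250), dtype="float32"),
    dims=("time", "height", "ncells"),
    name="demo",
)

# The data is a dask array: only metadata and a task graph exist so far.
is_lazy = isinstance(lazy.data, da.Array)
n_chunks = lazy.data.npartitions
n_tasks = len(lazy.data.__dask_graph__())

# nbytes is derived from shape and dtype, so reading it loads nothing.
size_bytes = lazy.nbytes

print(is_lazy, n_chunks, n_tasks, size_bytes)
```

Nothing in this block calls compute, so no chunk is ever materialised.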


We can request loading the data into memory via compute, which triggers the computation of the tasks. However, this leads to a memory error on the compute node used here, because the dataset has a size of ~240 GB.

In [7]:
ds.compute()



KilledWorker: ("('zarr-e8c768b4bc8ca98128410b3f5b91bd26', 16, 0, 6)", <Worker 'tcp://127.0.0.1:40298', name: 3, memory: 0, processing: 75>)
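
Whether a compute can succeed may be estimated beforehand from shape and dtype alone. The sketch below uses the shapes and the 67.11 GB cluster memory reported above; the helper `nbytes_from_shape` is a hypothetical name introduced here for illustration, not part of xarray or dask.

```python
import numpy as np

def nbytes_from_shape(shape, dtype):
    """Return the in-memory size in bytes of a dense array of this shape."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize

# Shapes and dtype as shown in the array summaries of this notebook.
full = nbytes_from_shape((97, 75, 8149456), "float32")
one_step = nbytes_from_shape((75, 8149456), "float32")

# Total cluster memory as reported by the client above.
cluster_memory = 67.11e9

print(f"full array:    {full / 1e9:.2f} GB, fits: {full < cluster_memory}")
print(f"one time step: {one_step / 1e9:.2f} GB, fits: {one_step < cluster_memory}")
```

This is only a lower bound on what a compute needs, since intermediate results and the gathered output also occupy memory.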

We can circumvent this by selecting one time step before requesting a compute. A single time step has a size of about 2.4 GB, which fits comfortably into memory.

In [8]:
ds.isel(time=0)

,Array,Chunk
Bytes,2.44 GB,300.00 MB
Shape,"(75, 8149456)","(75, 1000000)"
Count,451 Tasks,9 Chunks
Type,float32,numpy.ndarray


And now the compute runs fine.

In [9]:
ds.isel(time=0).compute()
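
The select-then-compute pattern can be tried on a small synthetic array as well. This is a minimal sketch under the assumption that a dask-backed DataArray behaves like the one opened from the zarr store; the array `lazy` and its dimensions are hypothetical. It shows that the selection itself stays lazy and that only compute returns a numpy-backed result.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Hypothetical small stand-in for the real (time, height, ncells) array.
lazy = xr.DataArray(
    da.ones((8, 5, 1000), chunks=(2, 5, 250), dtype="float32"),
    dims=("time", "height", "ncells"),
)

# Selecting is still lazy: the result is a smaller dask-backed array.
first_step = lazy.isel(time=0)
still_lazy = isinstance(first_step.data, da.Array)

# compute() returns a new, numpy-backed object; `lazy` itself stays lazy.
loaded = first_step.compute()
in_memory = isinstance(loaded.data, np.ndarray)

print(first_step.shape, still_lazy, in_memory, float(loaded.sum()))
```

Because compute returns a new object, the result has to be assigned to a name if it is to be reused without recomputation.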