# Example: Opening large remote datasets

> [!NOTE]
> The online laboratory has only been tested in recent Firefox and Chrome browsers. Some features may not (yet) be supported in Safari browsers.

> [!CAUTION]
> Any changes you make to this notebook will be lost once the page is closed or refreshed. Please download any files you would like to keep.

## Motivation

Datasets in weather and climate modelling can easily exceed the size of your machine's working memory or even its file storage. It is thus increasingly important to work with remotely stored datasets that are streamed in as needed. Instead of downloading the entire dataset up front, only the dataset metadata (e.g. information about the variables in the dataset) is downloaded immediately. The actual data is only fetched when the user requests it, e.g. by inspecting the data values, visualising a variable, or computing the mean value over a subslice of the data. Even when such a request occurs, care is taken to load only the minimum data needed to satisfy it, and to stream that data in small chunks instead of downloading it all in one go.

## Installing `fsspec`, `kerchunk`, and `zarr`

Zarr is a modern dataset format that is specifically designed for chunked access. It supports loading datasets from any filesystem implementing the `fsspec` API, e.g. local or remote HTTP or S3 filesystems. The `kerchunk` package helps with utilising the power of `fsspec` ❤ `zarr` for non-Zarr datasets, including NetCDF.

In [1]:
import aiohttp
import dask
import fsspec
import kerchunk
import s3fs
import xarray as xr
import zarr



We also import a utility module `utils.py` from the parent directory.

In [2]:
import sys
sys.path.insert(0, "..")

In [3]:
import utils



We also install the `humanize` package so that we can later pretty-print the size of the remotely opened datasets.

In [4]:
%pip install humanize
import humanize

In [5]:
dask.config.set(array__chunk_size="4MiB");

## Opening a remote Zarr dataset

First, we open a remote Zarr dataset via HTTPS:

In [6]:
ds = xr.open_dataset(
    "https://noaa-nwm-retro-v2-zarr-pds.s3.amazonaws.com", engine="zarr", chunks=dict(),
)
ds

Dask-backed arrays in `ds`, grouped by identical layout (each with 2 graph layers):

| Arrays | Data type | Shape | Chunk shape | Array size | Chunk size | Chunks |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | float32 | (2729077,) | (2729077,) | 10.41 MiB | 10.41 MiB | 1 |
| 1 | float32 | (227904, 2729077) | (672, 30000) | 2.26 TiB | 76.90 MiB | 30940 |
| 1 | int32 | (227904, 2729077) | (672, 30000) | 2.26 TiB | 76.90 MiB | 30940 |
| 6 | float64 | (227904, 2729077) | (672, 30000) | 4.53 TiB | 153.81 MiB | 30940 |


Since this particular dataset is stored in an S3 bucket, we could have also used an S3 filesystem directly:

In [7]:
ds = xr.open_dataset(
    "s3://noaa-nwm-retro-v2-zarr-pds", engine="zarr", chunks=dict(),
    backend_kwargs=dict(storage_options=dict(anon=True)),
)
ds

The summary is identical to the output of the previous cell: the same ten dask-backed arrays, with the same shapes, chunk shapes, and data types.


After loading the dataset, which may take a few minutes, we can inspect its metadata and variables and verify its large size.

In [8]:
print(f"ds.nbytes = {humanize.naturalsize(ds.nbytes, binary=True)}")

ds.nbytes = 31.7 TiB
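
The sizes reported in the dataset summary can be checked by hand from the array shape and chunk shape alone. The following is a minimal sketch using only the standard library; the shape, chunk shape, and item sizes are copied from the output above, and the variable names are illustrative.

```python
import math

# Values copied from the dataset summary above.
shape = (227904, 2729077)
chunks = (672, 30000)

# Number of chunks: round up along each axis, then multiply.
n_chunks = math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))

# Uncompressed size of one full float32 chunk (4 bytes per value).
chunk_bytes_f32 = math.prod(chunks) * 4

# Uncompressed size of one whole float64 array (8 bytes per value).
array_bytes_f64 = math.prod(shape) * 8

print(n_chunks)                              # 30940
print(f"{chunk_bytes_f32 / 2**20:.2f} MiB")  # 76.90 MiB
print(f"{array_bytes_f64 / 2**40:.2f} TiB")  # 4.53 TiB
```

These are in-memory (uncompressed) sizes; the amount of data stored in the bucket or transferred over the network may be smaller if the chunks are compressed.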


Note that even inspecting a variable or creating a slice does not immediately request any data but instead just operates on the variable's metadata.

In [9]:
ds["velocity"]

Dask-backed arrays shown for `ds["velocity"]` (each with 2 graph layers):

| Arrays | Data type | Shape | Chunk shape | Array size | Chunk size | Chunks |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | float64 | (227904, 2729077) | (672, 30000) | 4.53 TiB | 153.81 MiB | 30940 |
| 2 | float32 | (2729077,) | (2729077,) | 10.41 MiB | 10.41 MiB | 1 |


At this point, only the metadata of the dataset is stored in memory. The data values of the array will only be streamed in once we request them, e.g. by accessing the `values` attribute of the variable (slice):

In [10]:
ds["velocity"][:10,:10].values

array([[0.08      , 0.14      , 0.14      , 0.03      , 0.03      ,
        0.06      , 0.11      , 0.26999999, 0.28999999, 0.2       ],
       [0.08      , 0.14      , 0.14      , 0.03      , 0.03      ,
        0.06      , 0.11      , 0.26999999, 0.28999999, 0.2       ],
       [0.08      , 0.14      , 0.14      , 0.03      , 0.03      ,
        0.06      , 0.11      , 0.26999999, 0.28999999, 0.2       ],
       [0.08      , 0.14      , 0.14      , 0.03      , 0.03      ,
        0.06      , 0.11      , 0.26999999, 0.28999999, 0.2       ],
       [0.08      , 0.14      , 0.14      , 0.03      , 0.03      ,
        0.06      , 0.11      , 0.26999999, 0.28999999, 0.2       ],
       [0.08      , 0.14      , 0.14      , 0.03      , 0.03      ,
        0.06      , 0.11      , 0.26999999, 0.28999999, 0.2       ],
       [0.08      , 0.14      , 0.14      , 0.03      , 0.03      ,
        0.06      , 0.11      , 0.26999999, 0.28999999, 0.2       ],
       [0.08      , 0.14      , 0.14     

## Opening a remote NetCDF dataset

The NetCDF format also supports streaming its data. Here, we want to use the power of `zarr` with a remotely hosted NetCDF dataset.

First, we use `kerchunk` to fetch the chunking metadata for a remote NetCDF dataset via HTTPS and to translate it into a Zarr-compatible format. We then apply the

```python
utils.kerchunk_autochunk(kc: dict, *, chunk_size: int) -> dict
```
helper function to automatically chunk the `kerchunk`-loaded dataset such that every chunk is at most `chunk_size` bytes in size.

In [11]:
import kerchunk.hdf

kc = kerchunk.hdf.SingleHdf5ToZarr(
    "https://a3s.fi/compression.lab.climet.eu/HighResMIP_6h_reduced_pl_t.nc",
    inline_threshold=0, error="raise", storage_options=dict(block_size=512),
).translate()

kc = utils.kerchunk_autochunk(kc, chunk_size=2**22)  # 4MiB
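
To build some intuition for what "at most `chunk_size` bytes" means for the chunk shapes, here is a small sketch of one possible rule: split an axis into the fewest equal parts such that one part fits into the limit. `even_split` is a hypothetical name, and this is not the implementation of `utils.kerchunk_autochunk`, whose internals are not shown in this notebook. The axis length and item sizes are taken from the dataset opened in this section.

```python
def even_split(axis_len: int, bytes_per_index: int, chunk_size: int) -> int:
    """Return the longest chunk length that divides `axis_len` evenly and
    keeps one chunk at or below `chunk_size` bytes.

    Hypothetical helper for illustration only.
    """
    # Try 1 part, 2 parts, ... so the first hit uses the fewest parts.
    for parts in range(1, axis_len + 1):
        chunk_len, remainder = divmod(axis_len, parts)
        if remainder == 0 and chunk_len * bytes_per_index <= chunk_size:
            return chunk_len
    raise ValueError("a single index already exceeds chunk_size")


limit = 2**22  # 4 MiB, as in the call above
print(even_split(6599680, 4, limit))      # float32 vector: 824960
print(even_split(6599680, 4 * 4, limit))  # float32 rows of 4 values: 206240
print(even_split(6599680, 8, limit))      # float64 values: 412480
```

Under this assumed rule, the resulting chunk lengths happen to match the chunk shapes reported when the dataset is opened, but other rules (for example, halving the axis repeatedly) would give the same numbers here.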

After the conversion has completed, which may take a minute, we can open the dataset using Zarr:

In [12]:
ds = xr.open_dataset(
    "reference://", engine="zarr", backend_kwargs=dict(
        storage_options=dict(fo=kc),
    ), consolidated=False, chunks=dict(),
)
ds

Dask-backed arrays in `ds`, grouped by identical layout (each with 2 graph layers):

| Arrays | Data type | Shape | Chunk shape | Array size | Chunk size | Chunks |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | float32 | (6599680,) | (824960,) | 25.18 MiB | 3.15 MiB | 8 |
| 1 | datetime64[ns] | (20,) | (1,) | 160 B | 8 B | 20 |
| 2 | float32 | (6599680, 4) | (206240, 4) | 100.70 MiB | 3.15 MiB | 32 |
| 1 | float64 | (20, 19, 6599680) | (1, 1, 412480) | 18.69 GiB | 3.15 MiB | 6080 |
| 2 | datetime64[ns] | (20, 2) | (1, 2) | 320 B | 16 B | 20 |


After loading the dataset, we can inspect its metadata and variables and verify its large size.

In [13]:
print(f"ds.nbytes = {humanize.naturalsize(ds.nbytes, binary=True)}")

ds.nbytes = 18.9 GiB


Note that even inspecting a variable or creating a slice does not immediately request any data but instead just operates on the variable's metadata.

In [14]:
ds["t"]

Dask-backed arrays shown for `ds["t"]` (each with 2 graph layers):

| Arrays | Data type | Shape | Chunk shape | Array size | Chunk size | Chunks |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | float64 | (20, 19, 6599680) | (1, 1, 412480) | 18.69 GiB | 3.15 MiB | 6080 |
| 2 | float32 | (6599680,) | (824960,) | 25.18 MiB | 3.15 MiB | 8 |
| 1 | datetime64[ns] | (20,) | (1,) | 160 B | 8 B | 20 |


At this point, only the metadata of the dataset is stored in memory. The data values of the array will only be streamed in once we request them, e.g. by accessing the `values` attribute of the variable (slice):

In [15]:
ds["t"][:2, :2, :10].values

array([[[263.40517198, 263.42814901, 263.44936419, 263.46693394,
         263.4795639 , 263.48691854, 263.48951936, 263.48797143,
         263.48207581, 263.47062646],
        [263.26139479, 263.23130788, 263.19385379, 263.15260127,
         263.11252696, 263.07885051, 263.05588428, 263.04620513,
         263.05037622, 263.06733049]],

       [[264.75515573, 264.87320114, 264.94528026, 264.96253324,
         264.93121055, 264.86723462, 264.78821846, 264.70725971,
         264.63061958, 264.55893116],
        [262.22296366, 262.27007024, 262.29859419, 262.3072605 ,
         262.29656567, 262.26837826, 262.22556472, 262.17163671,
         262.11076081, 262.04783023]]])
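
To see why such a request is cheap, we can estimate how many chunks the slice `[:2, :2, :10]` overlaps. This is a rough sketch under the assumption that whole chunks are the unit of loading; `chunks_touched` is an illustrative helper, not part of `xarray`, `dask`, or `zarr`, and the chunk shape is copied from the summary of `ds["t"]`.

```python
import math


def chunks_touched(ranges, chunks):
    """Count the chunks overlapped by half-open (start, stop) index ranges."""
    return math.prod(
        (stop - 1) // c - start // c + 1
        for (start, stop), c in zip(ranges, chunks)
    )


chunks = (1, 1, 412480)              # chunk shape of ds["t"]
request = ((0, 2), (0, 2), (0, 10))  # the slice [:2, :2, :10]

n = chunks_touched(request, chunks)
loaded = n * math.prod(chunks) * 8   # float64: 8 bytes per value
wanted = math.prod(stop - start for start, stop in request) * 8

print(n)                               # 4
print(f"{loaded / 2**20:.2f} MiB")     # 12.59 MiB
print(f"{wanted} B")                   # 320 B
```

Under this assumption, reading 320 bytes of values involves about 12.59 MiB of uncompressed chunk data, which is still a tiny fraction of the 18.69 GiB array. The number of bytes actually transferred depends on compression and on how the underlying filesystem issues its requests.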