# Zarr in Practice

This notebook demonstrates how to create, explore and modify a Zarr store.

These concepts are explored in more detail in the official [Zarr Tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html).

This notebook also shows how to use public Zarr stores for geospatial data.

## How to create a Zarr store

In [1]:
import sys
import numpy as np
import xarray as xr
import zarr

# Here we create a simple Zarr store.
zstore = zarr.array(np.arange(10))

This is an in-memory Zarr store. To persist it to disk, we can use `.save`.

In [2]:
zarr.save("test.zarr", zstore)

We can inspect the metadata for this array, which gives us some interesting information. The array has a shape of 10 and a chunk size of 10, so we know all the data was stored in 1 chunk, and it was compressed with the `blosc` compressor.

In [3]:
!cat test.zarr/.zarray 

{
    "chunks": [
        10
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<i8",
    "fill_value": 0,
    "filters": null,
    "order": "C",
    "shape": [
        10
    ],
    "zarr_format": 2
}

This was a pretty basic example. Let's explore the other things we might want to do when creating Zarr.

## How to create a group

In [4]:
root = zarr.group()
group1 = root.create_group('group1')
group2 = root.create_group('group2')
z1 = group1.create_dataset('ds_in_group', shape=(100,100), chunks=(10,10), dtype='i4')
z2 = group2.create_dataset('ds_in_group', shape=(1000,1000), chunks=(10,10), dtype='i4')
root.tree(expand=True)

Tree(nodes=(Node(disabled=True, name='/', nodes=(Node(disabled=True, name='group1', nodes=(Node(disabled=True,…

## How to Examine and Modify the Chunk Shape

If your data is sufficiently large, Zarr will choose a chunk size for you.

In [5]:
zarr_no_chunks = zarr.array(np.arange(100), chunks=True)
zarr_no_chunks.chunks, zarr_no_chunks.shape

((100,), (100,))

In [6]:
zarr_with_chunks = zarr.array(np.arange(10000000), chunks=True)
zarr_with_chunks.chunks, zarr_with_chunks.shape

((156250,), (10000000,))

For `zarr_with_chunks` we see the chunks are smaller than the shape, so we know the data has been chunked. Other ways to examine the chunk structure are the array's `info` and `cdata_shape` properties.

In [7]:
?zarr_no_chunks.cdata_shape

Type:        property
String form: <property object at 0x7efde6ecfb00>
Docstring:  
A tuple of integers describing the number of chunks along each
dimension of the array.

In [8]:
zarr_no_chunks.cdata_shape, zarr_with_chunks.cdata_shape

((1,), (64,))

`zarr_with_chunks` has 64 chunks. Here, the number of chunks multiplied by the chunk size equals the length of the whole array.

In [9]:
zarr_with_chunks.cdata_shape[0] * zarr_with_chunks.chunks[0] == zarr_with_chunks.shape[0]

True
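The check above comes out `True` because 10,000,000 is an exact multiple of 156,250. In general the number of chunks along a dimension is the length divided by the chunk length, rounded up, and the last chunk may be only partly filled. A short sketch of that arithmetic follows; `chunk_count` is a hypothetical helper, not part of Zarr.

```python
def chunk_count(length, chunk_length):
    """Number of chunks needed to cover `length` items (ceiling division)."""
    return (length + chunk_length - 1) // chunk_length


# Matches the cdata_shape values reported in this notebook.
print(chunk_count(10_000_000, 156_250))  # 64
print(chunk_count(10_000_000, 500_000))  # 20

# A length that is not a multiple of the chunk length needs a partial chunk,
# so chunks * chunk_length can exceed the array length.
print(chunk_count(100, 30))       # 4
print(chunk_count(100, 30) * 30)  # 120
```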

### What's the storage size of these chunks?

The default chunks are pretty small once compressed.

In [10]:
sys.getsizeof(zarr_with_chunks.chunk_store['0']) # this is in bytes

8049

In [11]:
zarr_with_big_chunks = zarr.array(np.arange(10000000), chunks=(500000,))

In [12]:
zarr_with_big_chunks.chunks, zarr_with_big_chunks.shape, zarr_with_big_chunks.cdata_shape

((500000,), (10000000,), (20,))

This Zarr store has 10 million values, stored in 20 chunks of 500,000 data values.

In [13]:
sys.getsizeof(zarr_with_big_chunks.chunk_store['0'])

24941

These chunks are still pretty small, but this is just a silly example. In the real world, you will likely want to deal in Zarr chunks of 1MB or greater, especially when dealing with remote storage options where data is read over a network and the number of requests should be minimized.
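The sizes measured above are for compressed chunks (plus a little Python object overhead from `sys.getsizeof`), and a simple integer sequence compresses very well. When planning chunk shapes it can help to reason about the uncompressed size instead. The sketch below does that arithmetic for the `int64` data used here; `chunk_length_for_target` is a hypothetical helper, and the 1 MiB target is just the rule of thumb mentioned above.

```python
def chunk_length_for_target(target_bytes, itemsize):
    """Elements per chunk so one uncompressed chunk is about `target_bytes`.

    Hypothetical helper for illustration; it is not part of Zarr.
    """
    return max(1, target_bytes // itemsize)


INT64_ITEMSIZE = 8  # bytes per value for the int64 data in this notebook

# Uncompressed size of the two chunk lengths used above.
default_chunk_bytes = 156_250 * INT64_ITEMSIZE
big_chunk_bytes = 500_000 * INT64_ITEMSIZE
print(default_chunk_bytes, big_chunk_bytes)  # 1250000 4000000

# Chunk length that gives roughly 1 MiB of uncompressed int64 data.
one_mib_chunk = chunk_length_for_target(1024 * 1024, INT64_ITEMSIZE)
print(one_mib_chunk)  # 131072
```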

## Exploring and Modifying Data Compression

Continuing with data from the example above, we can tell that Zarr has also compressed the data for us by looking at the array's `info` or `compressor` property.

In [26]:
zarr_with_chunks.compressor

Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)

The `Blosc` compressor is a meta-compressor: it wraps several different internal compressors. In this case, it is using `lz4` compression. We can also explore how much space was saved by using this compression method.

In [15]:
zarr_with_chunks.info

| Field | Value |
| --- | --- |
| Type | zarr.core.Array |
| Data type | int64 |
| Shape | (10000000,) |
| Chunk shape | (156250,) |
| Order | C |
| Read-only | False |
| Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
| Store type | zarr.storage.KVStore |
| No. bytes | 80000000 (76.3M) |
| No. bytes stored | 514193 (502.1K) |


Comparing the number of bytes with the number of bytes stored above, we can see that compression has made our data about 155 times smaller 😱 .
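The ratio can be reproduced from the two byte counts in the info output. This is plain arithmetic on the numbers shown above, not a call into Zarr.

```python
nbytes = 80_000_000      # "No. bytes" from the info output above
nbytes_stored = 514_193  # "No. bytes stored" from the info output above

storage_ratio = nbytes / nbytes_stored
print(f"{storage_ratio:.1f}")  # 155.6
```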

You can set `compressor=None` when creating a Zarr array to turn off this behavior, but I'm not sure why you would do that.

Let's see what happens when we use a different compression method. We can check out a full list of numcodecs compressors here: [https://numcodecs.readthedocs.io/](https://numcodecs.readthedocs.io/).

In [16]:
from numcodecs import GZip
compressor = GZip()
zstore_gzip_compressed = zarr.array(np.arange(10000000), chunks=True, compressor=compressor)
zstore_gzip_compressed.info

| Field | Value |
| --- | --- |
| Type | zarr.core.Array |
| Data type | int64 |
| Shape | (10000000,) |
| Chunk shape | (156250,) |
| Order | C |
| Read-only | False |
| Compressor | GZip(level=1) |
| Store type | zarr.storage.KVStore |
| No. bytes | 80000000 (76.3M) |
| No. bytes stored | 15086009 (14.4M) |


In this case, the storage ratio is 5.3, so not as good! How to choose a compression algorithm is a topic for future investigation.

## Consolidating metadata

It's important to consolidate metadata to minimize requests. Each group and array has its own metadata file, so to avoid a separate request for every file in the tree, Zarr provides the ability to consolidate metadata into a single metadata file at the root of the store.

So far we have only been dealing with single-array Zarr stores. In this next example, we will create a Zarr store with multiple arrays and then consolidate metadata. The speedup with local storage is insignificant, but it becomes significant with remote storage options, as we will see in the following example on accessing cloud storage.

In [17]:
root = zarr.group()
zarr_store = 'example.zarr'
# Let's create many groups and many arrays
num_groups, num_arrays_per_group = 100, 100
for i in range(num_groups):
    group = root.create_group(f'group-{i}')
    for j in range(num_arrays_per_group):
        group.create_dataset(f'array-{j}', shape=(1000,1000), dtype='i4')

store = zarr.DirectoryStore(zarr_store)
zarr.save(store, root)

In [1]:
# We don't expect it to exist yet!
!cat {zarr_store}/.zmetadata

cat: {zarr_store}/.zmetadata: No such file or directory


In [19]:
zarr.consolidate_metadata(zarr_store)

<zarr.core.Array (100,) <U8>

In [20]:
zarr.open_consolidated(zarr_store)

<zarr.core.Array (100,) <U8>

In [21]:
!cat {zarr_store}/.zmetadata

{
    "metadata": {
        ".zarray": {
            "chunks": [
                100
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<U8",
            "fill_value": "",
            "filters": null,
            "order": "C",
            "shape": [
                100
            ],
            "zarr_format": 2
        }
    },
    "zarr_consolidated_format": 1
}

# Example of Cloud-Optimized Access for this Format

Fortunately, there are many publicly accessible cloud archives of Zarr data.

Zarr provides storage backends for the major cloud providers: [Zarr Tutorial - Distributed/cloud storage](https://zarr.readthedocs.io/en/stable/tutorial.html#distributed-cloud-storage).

Here are a few we are aware of:

* [Zarr data in Microsoft's Planetary Computer](https://planetarycomputer.microsoft.com/catalog?filter=zarr)
* [Zarr data from Google](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset&_ga=2.226354714.1000882083.1692116148-1788942020.1692116148&pli=1&q=zarr)
* [Amazon Sustainability Data Initiative available from Registry of Open Data on AWS](https://registry.opendata.aws/collab/asdi/) - Enter "Zarr" in the Search input box.
* [Pangeo-Forge Data Catalog](https://pangeo-forge.org/catalog)

The Pangeo-Forge Data Catalog provides handy examples of how to open each dataset, for example, from the [Global Precipitation Climatology Project (GPCP)](https://pangeo-forge.org/dashboard/feedstock/42) page:

In [22]:
store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'

In [23]:
ds = xr.open_dataset(store, engine='zarr', chunks={}, consolidated=True)
ds

| Array shape | Chunk shape | Array bytes | Chunk bytes | Dask graph | Data type |
| --- | --- | --- | --- | --- | --- |
| (180, 2) | (180, 2) | 1.41 kiB | 1.41 kiB | 1 chunks in 2 graph layers | float32 numpy.ndarray |
| (360, 2) | (360, 2) | 2.81 kiB | 2.81 kiB | 1 chunks in 2 graph layers | float32 numpy.ndarray |
| (9226, 2) | (200, 2) | 144.16 kiB | 3.12 kiB | 47 chunks in 2 graph layers | datetime64[ns] numpy.ndarray |
| (9226, 180, 360) | (200, 180, 360) | 2.23 GiB | 49.44 MiB | 47 chunks in 2 graph layers | float32 numpy.ndarray |


Microsoft's Planetary Computer goes above and beyond, providing tutorials alongside each dataset. We recommend exploring these on your own to get an idea of what you can do with Zarr and Xarray. See all tutorials here: [microsoft/PlanetaryComputerExamples](https://github.com/microsoft/PlanetaryComputerExamples/tree/main/tutorials). Note, this repo contains ALL tutorials, not just Zarr tutorials, so you may want to filter for Zarr.

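For example, here is some code adapted from the example notebook for the [Daymet Puerto Rico Dataset on MS Planetary Computer](https://planetarycomputer.microsoft.com/dataset/daymet-daily-pr#Example-Notebook). Note that the collection URL used below points to the Hawaii collection (`daymet-daily-hi`):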

In [1]:
import cartopy.crs as ccrs
import fsspec
import matplotlib.pyplot as plt
import pystac
import xarray as xr
import warnings

warnings.simplefilter("ignore", RuntimeWarning)

In [2]:
url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-daily-hi"
collection = pystac.read_file(url)
asset = collection.assets["zarr-https"]
store = fsspec.get_mapper(asset.href)
ds = xr.open_zarr(store, **asset.extra_fields["xarray:open_kwargs"])
ds

| Array shape | Chunk shape | Array bytes | Chunk bytes | Dask graph | Data type |
| --- | --- | --- | --- | --- | --- |
| (584, 284) | (584, 284) | 647.88 kiB | 647.88 kiB | 1 chunks in 2 graph layers | float32 numpy.ndarray |
| (584, 284) | (584, 284) | 647.88 kiB | 647.88 kiB | 1 chunks in 2 graph layers | float32 numpy.ndarray |
| (14965, 584, 284) | (365, 584, 284) | 9.25 GiB | 230.93 MiB | 41 chunks in 2 graph layers | float32 numpy.ndarray |
| (14965, 584, 284) | (365, 584, 284) | 9.25 GiB | 230.93 MiB | 41 chunks in 2 graph layers | float32 numpy.ndarray |
| (14965, 584, 284) | (365, 584, 284) | 9.25 GiB | 230.93 MiB | 41 chunks in 2 graph layers | float32 numpy.ndarray |
| (14965, 584, 284) | (365, 584, 284) | 9.25 GiB | 230.93 MiB | 41 chunks in 2 graph layers | float32 numpy.ndarray |
| (14965, 2) | (365, 2) | 233.83 kiB | 5.70 kiB | 41 chunks in 2 graph layers | datetime64[ns] numpy.ndarray |
| (14965, 584, 284) | (365, 584, 284) | 9.25 GiB | 230.93 MiB | 41 chunks in 2 graph layers | float32 numpy.ndarray |
| (14965, 584, 284) | (365, 584, 284) | 9.25 GiB | 230.93 MiB | 41 chunks in 2 graph layers | float32 numpy.ndarray |
| (14965, 584, 284) | (365, 584, 284) | 9.25 GiB | 230.93 MiB | 41 chunks in 2 graph layers | float32 numpy.ndarray |
| (14965,) | (365,) | 29.23 kiB | 730 B | 41 chunks in 2 graph layers | int16 numpy.ndarray |

# Additional Resources

* Jupyter Notebook for a high level overview of Zarr on Google Cloud by Tyson Swetnam: [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/tyson-swetnam/agic-2022/blob/main/docs/notebooks/zarr.ipynb)