# How to Determine your Zarr Chunking Strategy

A frequently asked question is how to determine the optimal chunk shape for your Zarr store. Unfortunately, there is no easy answer. At the time of writing, there is an [active discussion in zarr-python](https://github.com/zarr-developers/zarr-python/discussions/1479) about benchmarking for specific use cases, and it is long and complicated.

What follows is a first attempt at a step-by-step guide to determining the best chunk shape and size. Be warned: Step 3 comes down to either a best guess or a trip down the rabbit hole of performance testing.

## 1. What is the use case you wish to optimize for?

Or do you wish to (dis)satisfy many use cases equally?

The use case should inform the chunk size and shape. 

## 2. Determine the desired aspect ratio of your chunk shape

Your Zarr store will have multiple dimensions to chunk. A typical scenario is a dataset with the dimensions latitude, longitude, and time, and one or more data variables. If you want to optimize for time series, store more values along the time dimension in each chunk, so that fewer chunks have to be loaded to build a time series. If you want to evaluate large spatial extents, for visualization or aggregation, store more values along the latitude and longitude dimensions for the same reason. In general, each chunk should hold more values along the dimensions you will be summarizing, aggregating, or visualizing.
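To make the trade-off concrete, the following sketch counts how many chunks a read has to touch under two chunk layouts. The dimension sizes, the two layouts, and the helper `chunks_touched` are illustrative assumptions and not part of any library.

```python
# Illustrative dataset dimensions (assumed for this sketch).
dims = {"time": 365, "lat": 600, "lon": 1440}


def chunks_touched(selection, chunk_shape):
    """Count the chunks that intersect a selection.

    `selection` maps each dimension to a (start, stop) index range and
    `chunk_shape` maps each dimension to the chunk length along it.
    """
    total = 1
    for dim, (start, stop) in selection.items():
        size = chunk_shape[dim]
        first = start // size        # index of the first chunk touched
        last = (stop - 1) // size    # index of the last chunk touched
        total *= last - first + 1
    return total


# Two hypothetical layouts: long in time, or wide in space.
time_optimized = {"time": 365, "lat": 30, "lon": 36}
space_optimized = {"time": 1, "lat": 600, "lon": 1440}

# A full-length time series at a single grid cell.
point_series = {"time": (0, dims["time"]), "lat": (300, 301), "lon": (720, 721)}
# A global map for a single time step.
global_map = {"time": (0, 1), "lat": (0, dims["lat"]), "lon": (0, dims["lon"])}

for name, layout in [("time-optimized", time_optimized),
                     ("space-optimized", space_optimized)]:
    print(name,
          "| point series:", chunks_touched(point_series, layout),
          "| global map:", chunks_touched(global_map, layout))
```

Under these assumptions, the point time series touches 1 chunk in the time-oriented layout and 365 in the space-oriented one, while the single-day global map touches 800 and 1 chunks respectively.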

### 2.1 Using the aspect ratio strategy

There is an open PR to pangeo-forge-recipes that adds a helper for determining the chunk shape, so you'll first need to install `pangeo_forge_recipes` from [that PR](https://github.com/pangeo-forge/pangeo-forge-recipes/pull/546).

In [1]:
%%capture
!pip install git+https://github.com/jbusecke/pangeo-forge-recipes@dynamic_chunks_2

In [2]:
import fsspec
from pangeo_forge_recipes import aggregation, dynamic_target_chunks
import s3fs
import xarray as xr

In [3]:
model = "GISS-E2-1-G"
variable = "tas"
s3_path = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1*/{variable}/*"
fs_read = fsspec.filesystem("s3", anon=True)
file_paths = fs_read.glob(s3_path)
s3_fs = s3fs.S3FileSystem(anon=True)
print(f"{len(file_paths)} discovered from {s3_path}")

65 discovered from s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/GISS-E2-1-G/historical/r1i1p1*/tas/*


In [23]:
file = s3_fs.open(file_paths[0])
ds = xr.open_dataset(file)

In [24]:
ds

In [6]:
d = ds.to_dict(data=False)
schema = aggregation.XarraySchema(
    attrs=d.get("attrs"),
    coords=d.get("coords"),
    data_vars=d.get("data_vars"),
    dims=d.get("dims"),
    chunks=d.get("chunks", {}),
)
target_chunks = dynamic_target_chunks.dynamic_target_chunks_from_schema(
    schema,
    target_chunk_size='3MB',
    # Dictionary mapping dimension names to the desired aspect ratio of the
    # total number of chunks along each dimension.
    # The aspect ratio sets the number of chunks along that dimension for the
    # entire dataset, NOT the length of that dimension within each chunk.
    # So a lower value means fewer total chunks are needed to cover that dimension.
    # If we want squarish spatial chunks, and we know we have about twice as many
    # lon as lat values, we can set lon to 2 and lat to 1, since we expect twice
    # as many chunks along the lon dimension as along the lat dimension.
    # Here -1 for time keeps the full time dimension in each chunk (see the output below).
    target_chunks_aspect_ratio={'time': -1, 'lat': 1, 'lon': 2},
    # size_tolerance is multiplied by target_chunk_size to get the accepted range
    # of chunk sizes, so this gives us chunks from 1.5MB to 4.5MB.
    size_tolerance=0.5
)
d.get("dims"), target_chunks

({'time': 365, 'lat': 600, 'lon': 1440}, {'time': 365, 'lat': 30, 'lon': 36})

The example above is optimized for creating time series over small spatial areas. Another use case we might want to optimize for is loading the entire spatial extent, say for visualizing the data globally. Assuming we're OK with having one chunk for each day of data, we can modify our chunk configuration as follows:

In [7]:
target_chunks = dynamic_target_chunks.dynamic_target_chunks_from_schema(
    schema,
    target_chunk_size='3MB',
    target_chunks_aspect_ratio={'time': 365, 'lat': 1, 'lon': 1},
    size_tolerance=0.5
)
d.get("dims"), target_chunks

({'time': 365, 'lat': 600, 'lon': 1440}, {'time': 1, 'lat': 600, 'lon': 1440})
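As a sanity check, the two chunk shapes returned above can be converted to uncompressed sizes and compared against the 3MB target with a tolerance of 0.5. This is a minimal sketch: it assumes 4-byte (float32) values and 1 MB = 10^6 bytes, and the helpers `chunk_megabytes` and `within_tolerance` are hypothetical names written for this illustration, not functions from `pangeo_forge_recipes`.

```python
from math import prod

BYTES_PER_VALUE = 4  # assumption: the variable is stored as float32


def chunk_megabytes(chunk_shape, bytes_per_value=BYTES_PER_VALUE):
    """Uncompressed size of one chunk in MB (taking 1 MB as 10**6 bytes)."""
    return prod(chunk_shape.values()) * bytes_per_value / 1e6


def within_tolerance(size_mb, target_mb, tolerance):
    """True if size_mb lies in [target * (1 - tol), target * (1 + tol)]."""
    return target_mb * (1 - tolerance) <= size_mb <= target_mb * (1 + tolerance)


shapes = {
    "time series": {"time": 365, "lat": 30, "lon": 36},
    "global map": {"time": 1, "lat": 600, "lon": 1440},
}

for label, shape in shapes.items():
    size = chunk_megabytes(shape)
    print(f"{label}: {size:.2f} MB, within tolerance: {within_tolerance(size, 3, 0.5)}")
```

Under these assumptions both shapes fall inside the 1.5MB to 4.5MB window, with the time-series shape sitting near the lower bound.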

We can then pass `target_chunks` when opening our dataset, so that it is chunked accordingly when written to Zarr.

In [28]:
file = s3_fs.open(file_paths[0])
ds = xr.open_dataset(file, chunks=target_chunks)
# To build a data store from multiple files, use open_mfdataset instead, for example:
# ds = xr.open_mfdataset([s3_fs.open(p) for p in file_paths], combine='by_coords', chunks=target_chunks)
# Writing the store takes some time, so it is commented out here.
# ds.to_zarr("test.zarr")
# xr.open_zarr("test.zarr")

## 3. Determine your target chunk size

We chose a (somewhat) arbitrary target chunk size in step 2. You might need to do the same unless you have lots of time on your hands.

The Pangeo project has been recommending a chunk size of about 100MB, a figure that originated from the [Dask Best Practices](https://docs.dask.org/en/stable/array-best-practices.html#orient-your-chunks).

>Partitions should fit comfortably in memory (smaller than a gigabyte) but also not be too many. Every operation on every partition takes the central scheduler a few hundred microseconds to process.

The [Zarr tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html#chunk-size-and-shape) recommends a chunk size of at least 1MB.

>In general, chunks of at least 1 megabyte (1M) uncompressed size seem to provide better performance, at least when using the Blosc compression library.
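To get a feel for how these recommendations play out, the sketch below estimates the number of chunks in a dataset for a few chunk sizes, along with a rough per-operation scheduler overhead. Everything numeric here is an assumption for illustration: 65 yearly files of 365 daily steps on a 600 x 1440 grid, 4-byte values, 1 MB = 10^6 bytes, and 200 microseconds of scheduler time per chunk as a stand-in for "a few hundred microseconds".

```python
from math import ceil

# Assumed dataset: 65 files x 365 time steps x 600 lat x 1440 lon, float32.
total_values = 65 * 365 * 600 * 1440
total_bytes = total_values * 4

# Assumed scheduler cost per chunk per operation (a guess within the
# "few hundred microseconds" range quoted above).
SCHEDULER_SECONDS_PER_CHUNK = 200e-6


def chunk_count(total_bytes, chunk_mb):
    """Number of chunks needed to hold total_bytes at a given chunk size in MB."""
    return ceil(total_bytes / (chunk_mb * 1e6))


print(f"dataset size: {total_bytes / 1e9:.1f} GB uncompressed")
for chunk_mb in (1, 3, 100):
    n = chunk_count(total_bytes, chunk_mb)
    overhead = n * SCHEDULER_SECONDS_PER_CHUNK
    print(f"{chunk_mb:>3} MB chunks: {n:>6} chunks, "
          f"~{overhead:.2f} s scheduler overhead per operation")
```

With these assumed numbers, the dataset is about 82 GB uncompressed, which works out to roughly 27,000 chunks at 3MB versus about 820 chunks at 100MB. Smaller chunks allow finer-grained reads, at the cost of more per-chunk overhead.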

As you may be able to tell by now, performance really depends on the clients and systems accessing the data. A wide variety of libraries (such as xarray and fsspec) and systems (such as Dask) may be involved, and parallelism and caching will differ among them. Ideally, you will be able to test several chunk sizes to understand how each one performs in your setup. Future versions of this guide will provide or link to benchmarking examples or tools.

One way to break this problem down is to create several test datasets, each written with a different chunk size, and treat those as your input. The output should be whatever subset of your Zarr dataset you want your application(s) to read. You can then run each test dataset through your target system and correlate performance with chunk size.
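The testing loop described above could be organized along these lines. This is only a sketch of the harness: `benchmark` and `fake_read_subset` are hypothetical names, and `fake_read_subset` is a placeholder workload so that the example runs on its own. In a real test it would open the store written with the given chunk shape and load the subset your application reads.

```python
import statistics
import time


def benchmark(candidates, read_subset, repeats=3):
    """Time `read_subset(chunk_shape)` for each candidate chunk shape.

    Returns a dict mapping each candidate label to the median
    wall-clock time in seconds over `repeats` runs.
    """
    results = {}
    for label, chunk_shape in candidates.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            read_subset(chunk_shape)
            timings.append(time.perf_counter() - start)
        results[label] = statistics.median(timings)
    return results


def fake_read_subset(chunk_shape):
    # Placeholder workload (assumption): replace with code that opens the
    # test store written with `chunk_shape` and reads the target subset.
    sum(range(1000))


candidates = {
    "time-optimized": {"time": 365, "lat": 30, "lon": 36},
    "space-optimized": {"time": 1, "lat": 600, "lon": 1440},
}

results = benchmark(candidates, fake_read_subset)
for label, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{label}: {seconds:.6f} s")
```

When timing real reads, keep in mind that repeated runs may be served from a cache, so the first run and later runs can measure different things.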

## Useful examples

Sometimes the best strategy is to learn from what others have done.

* [Understanding optimal zarr chunking scheme for a climatology (Pangeo Discourse)](https://discourse.pangeo.io/t/understanding-optimal-zarr-chunking-scheme-for-a-climatology/2335)

## To be continued

As the community around Zarr continues to evolve and mature, we hope to provide more guidance on how to choose chunk sizes. In the meantime, we hope this guide has helped you understand the problem and provided some ideas about how to move forward with designing your Zarr store.