# Tiny data chunks

There are three primary reasons to avoid chunks that are too small in your datacubes:

- Inefficient compression since most compression algorithms leverage correlations within a chunk.
- Inefficient data loading when querying large subsets of the datacube, because each of the many GET requests incurs high latency. The excessive GET requests also increase costs.
- Inefficient decompression due to the number of chunks greatly exceeding available parallelism.

Note that the issue of too many GET requests can be mitigated, but not completely solved, by using Zarr V3 sharding or a cloud-native file format that allows storing multiple chunks in a single file.
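To make the scale of the problem concrete, the arithmetic below counts how many objects a 10 GB array produces at the two chunk sizes used in this notebook, and how grouping chunks into shards changes that count. It is a minimal sketch: `object_count` is a hypothetical helper, and the figure of 1,000 chunks per shard is an assumption chosen for illustration, not a recommendation.

```python
def object_count(array_bytes: int, chunk_bytes: int, chunks_per_shard: int = 1) -> int:
    """Number of stored objects when every `chunks_per_shard` chunks share one object."""
    n_chunks = -(-array_bytes // chunk_bytes)  # integer ceiling division
    return -(-n_chunks // chunks_per_shard)


ARRAY_BYTES = 10 * 10**9  # 10 GB in decimal units

tiny = object_count(ARRAY_BYTES, 25 * 10**3)  # 25 KB chunks, one object per chunk
regular = object_count(ARRAY_BYTES, 25 * 10**6)  # 25 MB chunks, one object per chunk
# Assumed shard layout: 1,000 tiny chunks stored together in each shard.
sharded = object_count(ARRAY_BYTES, 25 * 10**3, chunks_per_shard=1000)

print(f"25 KB chunks, unsharded: {tiny:,} objects")
print(f"25 MB chunks, unsharded: {regular:,} objects")
print(f"25 KB chunks, 1,000 per shard: {sharded:,} objects")
```

Sharding brings the object count down to that of the larger chunks, but each tiny chunk is still compressed and decoded on its own, which is one reason the mitigation is only partial.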


In [None]:
import datacube_benchmark
import obstore as obs
import zarr
import pandas as pd
import hvplot.pandas  # noqa
from pint import Quantity
from azure.identity import DefaultAzureCredential
from obstore.auth.azure import AzureCredentialProvider

_ = zarr.config.set({"async.concurrency": 128})
pd.set_option("display.float_format", "{:0.30f}".format)

credential_provider = AzureCredentialProvider(credential=DefaultAzureCredential())

## Demonstrating storage inefficiencies of chunks that are too small

Create a Blosc-compressed array with 25 KB chunks.

In [None]:
object_store = obs.store.AzureStore.from_url(
    "https://datacubeguide.blob.core.windows.net/performance-testing/zarr-tiny-chunks",
    credential_provider=credential_provider,
)
tiny_chunks_zarr_store = datacube_benchmark.create_zarr_store(
    object_store,
    compressor=zarr.codecs.BloscCodec(
        cname="zstd", clevel=3, shuffle=zarr.codecs.BloscShuffle.shuffle
    ),
    target_chunk_size="25 kilobyte",
    target_array_size="10 GB",
)

Create a Blosc-compressed array with 25 MB chunks.

In [None]:
object_store = obs.store.AzureStore.from_url(
    "https://datacubeguide.blob.core.windows.net/performance-testing/zarr-reg-chunks",
    credential_provider=credential_provider,
)
reg_chunks_zarr_store = datacube_benchmark.create_zarr_store(
    object_store,
    compressor=zarr.codecs.BloscCodec(
        cname="zstd", clevel=3, shuffle=zarr.codecs.BloscShuffle.shuffle
    ),
    target_chunk_size="25 megabyte",
    target_array_size="10 GB",
)

Compare the storage size and compression ratio of the two arrays.

In [None]:
# `zarr_format` replaces the deprecated `zarr_version` keyword in zarr-python 3
tiny_chunks_array = zarr.open_array(tiny_chunks_zarr_store, zarr_format=3, path="data")
tiny_chunks_storage_size = Quantity(
    datacube_benchmark.utils.array_storage_size(tiny_chunks_array), "bytes"
).to("GB")
tiny_chunks_compression_ratio = (
    tiny_chunks_array.nbytes / tiny_chunks_storage_size.to("bytes").magnitude
)
# `zarr_format` replaces the deprecated `zarr_version` keyword in zarr-python 3
reg_chunks_array = zarr.open_array(reg_chunks_zarr_store, zarr_format=3, path="data")
reg_chunks_storage_size = Quantity(
    datacube_benchmark.utils.array_storage_size(reg_chunks_array), "bytes"
).to("GB")
reg_chunks_compression_ratio = (
    reg_chunks_array.nbytes / reg_chunks_storage_size.to("bytes").magnitude
)

print("Storage size of a 10 GB array in object storage:")
print(f"\t25 KB chunks: {tiny_chunks_storage_size:.2f}")
print(f"\t25 MB chunks: {reg_chunks_storage_size:.2f}")
print("Compression ratio of a 10 GB array in object storage:")
print(f"\t25 KB chunks: {tiny_chunks_compression_ratio:.2f}")
print(f"\t25 MB chunks: {reg_chunks_compression_ratio:.2f}")

Notice the much better compression ratio for a datacube with 25 MB chunks relative to a datacube with 25 KB chunks.
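The effect can be reproduced locally without any cloud storage. The sketch below is an illustration under stated assumptions, not the benchmark above: it uses the standard library's `zlib` as a stand-in for Blosc/Zstandard, a synthetic periodic signal instead of real data, and much smaller chunk sizes (256 bytes versus 64 KiB) so that it runs in well under a second. `compression_ratio` is a hypothetical helper.

```python
import math
import struct
import zlib


def compression_ratio(data: bytes, chunk_bytes: int) -> float:
    """Compress each chunk independently; return raw size / total compressed size."""
    compressed = sum(
        len(zlib.compress(data[start : start + chunk_bytes], 6))
        for start in range(0, len(data), chunk_bytes)
    )
    return len(data) / compressed


# Synthetic int16 signal that repeats every 1,000 samples (2,000 bytes).
samples = [
    round(1000 * math.sin(2 * math.pi * (i % 1000) / 1000)) for i in range(200_000)
]
data = struct.pack(f"<{len(samples)}h", *samples)

small_ratio = compression_ratio(data, 256)  # chunk is shorter than one period
large_ratio = compression_ratio(data, 64 * 1024)  # chunk spans many periods

print(f"256 B chunks:  {small_ratio:.2f}")
print(f"64 KiB chunks: {large_ratio:.2f}")
```

Each small chunk is shorter than one period of the signal, so the compressor never sees the repetition it could otherwise exploit, and every chunk also pays its own fixed header overhead.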

## Demonstrating performance inefficiencies of chunks that are too small

Measure the time required to load a random point, a time series, or a spatial slice from each array.

In [None]:
tiny_chunks_results = datacube_benchmark.benchmark_access_patterns(
    tiny_chunks_array, num_samples=10
).reset_index(drop=True)
reg_chunks_results = datacube_benchmark.benchmark_access_patterns(
    reg_chunks_array, num_samples=10
).reset_index(drop=True)

In [None]:
df = pd.concat([tiny_chunks_results, reg_chunks_results])
df["access_pattern"] = df["access_pattern"].replace(
    {
        "point": "Random point",
        "time_series": "Time series",
        "spatial_slice": "Spatial slice",
        "full": "Full scan",
    }
)
df["mean_time"] = df.apply(lambda row: float(row["mean_time"].magnitude), axis=1)
df["chunk_size"] = df.apply(lambda row: f"{row['chunk_size'].magnitude:,.2f}", axis=1)
df

In [None]:
title = "Duration to load data for different access patterns"
plt = df.hvplot.bar(
    x="chunk_size",
    y="mean_time",
    by="access_pattern",
    width=1000,
    rot=45,
    title=title,
    ylabel="Duration (s)",
    xlabel="Chunk Size, Query type",
)

In [None]:
plt

Note that while random point access is faster for the datacube with smaller chunks, queries that must load many chunks are dramatically slower.
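A crude cost model helps explain why the two chunk sizes trade places. The sketch below is not derived from the measurements above: the 50 ms request latency and 1 GB/s bandwidth are assumed values, `estimated_load_seconds` is a hypothetical helper, and the model ignores decompression time and any overlap between requests and transfer. The concurrency of 128 matches the `async.concurrency` setting used at the top of this notebook.

```python
import math


def estimated_load_seconds(
    n_chunks: int,
    chunk_bytes: float,
    latency_s: float = 0.05,  # assumed per-request latency
    bandwidth_bytes_per_s: float = 1e9,  # assumed sustained throughput
    concurrency: int = 128,  # matches zarr's async.concurrency above
) -> float:
    """Latency paid once per round of concurrent requests, plus transfer time."""
    rounds = math.ceil(n_chunks / concurrency)
    return rounds * latency_s + n_chunks * chunk_bytes / bandwidth_bytes_per_s


# A random point touches a single chunk.
point_tiny = estimated_load_seconds(1, 25e3)
point_reg = estimated_load_seconds(1, 25e6)

# Reading the whole 10 GB array touches every chunk.
full_tiny = estimated_load_seconds(400_000, 25e3)
full_reg = estimated_load_seconds(400, 25e6)

print(f"Point read: 25 KB chunks {point_tiny:.3f} s, 25 MB chunks {point_reg:.3f} s")
print(f"Full read:  25 KB chunks {full_tiny:.1f} s, 25 MB chunks {full_reg:.1f} s")
```

Under these assumptions a point read is dominated by latency, so the smaller chunk wins only by the bytes it avoids transferring, while a large read with tiny chunks is dominated by thousands of rounds of request latency.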