duckdb_zarr is a DuckDB extension for exploring Zarr stores with SQL through a relational projection.
The current implementation follows the project documents conservatively:
- SPEC.md: start with relational metadata and chunk discovery
- ARCHITECTURE.md: build the store adapter, metadata parser, and DuckDB bridge first
- ROADMAP.md: advance incrementally from metadata discovery to relational cell scans and planner-aware execution
Today’s extension provides metadata and relational scan table functions for local stores and a constrained remote read path:
- `zarr(path)` for a simple array overview
- `zarr(path, array_path)` and `zarr(path, array_path, version_override)` as convenience aliases for `zarr_cells(...)`
- `zarr_groups(path)` and `zarr_groups(path, version_override)`
- `zarr_arrays(path)` and `zarr_arrays(path, version_override)`
- `zarr_chunks(path)` and `zarr_chunks(path, version_override)`
- `zarr_cells(path, array_path)` and `zarr_cells(path, array_path, version_override)` for Zarr dtypes `|b1`, `i1`, `i2`, `i4`, `i8`, `u1`, `u2`, `u4`, `u8`, `f2`, `f4`, and `f8`
This gives a usable SQL entrypoint for understanding a Zarr store and projecting dense numeric arrays into relational rows. Unsupported or non-standard Zarr dtypes, such as complex values, are currently rejected by `zarr_cells(path, array_path)`.
The MVP currently supports:
- Local filesystem Zarr v2 discovery
- Local filesystem Zarr v3 discovery from `zarr.json`
- Zarr v2 consolidated metadata discovery from `.zmetadata`
- Remote `http://`, `https://`, and `s3://` stores when DuckDB `httpfs` is available and the store is consolidated
- OME-Zarr-style multiscales groups built on Zarr v2, including level arrays such as `0`
- Real-world OME-Zarr v3 metadata discovery for stores such as image-label hierarchies and multiscales groups
- Group enumeration from `.zgroup`
- Array enumeration from `.zarray` and `zarr.json`
- Automatic v2/v3 detection for local stores, with explicit `version_override` support (`auto`, `v2`, `v3`) on the lower-level SQL functions
- Chunk enumeration for v2 `.` and `/` separators plus Zarr v3 `default` and `v2` chunk-key encodings
- `zarr_cells(path, array_path)` for dense arrays with dtypes `|b1`, `i1`, `i2`, `i4`, `i8`, `u1`, `u2`, `u4`, `u8`, `f2`, `f4`, and `f8`
- Uncompressed and gzip-compressed chunk payloads
- Zarr v3 regular chunk grids with `bytes` and optional `gzip` codecs
- Zarr v3 transpose handling for identity and full-reverse permutations
- Zarr v3 `sharding_indexed` arrays with inner Blosc (zstd) decode for common OME-Zarr image and label data
- Real OME-Zarr v3 cell scans against `test/data/idr0062A/6001240_labels.zarr`
- Missing-chunk fill-value materialization for supported `zarr_cells()` dtypes when `fill_value` is present
- Dynamic `(dim_0, ..., value)` projection based on array rank and dtype
- Projection-aware `zarr_cells()` scans
- Filter pushdown on dimension and value columns
- Chunk pruning from dimension filters before chunk decode
- Chunk-streamed execution with bounded scan-time memory instead of init-time row buffering
- Developer fixture generation and SQLLogic tests
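The v2 and v3 chunk-key encodings listed above can be sketched in a few lines. `chunk_key` is a hypothetical illustration, not code from the extension, and it assumes the v3 `default` encoding's usual `/` separator:

```python
def chunk_key(indices, zarr_version=2, separator=".", encoding="default"):
    """Build a store key for a chunk from its grid indices.

    Zarr v2 joins indices with the store's separator ("." by default,
    "/" for nested stores). Zarr v3's "default" encoding prefixes "c"
    and joins with "/"; its "v2" encoding behaves like Zarr v2.
    """
    parts = [str(i) for i in indices]
    if zarr_version == 2 or encoding == "v2":
        return separator.join(parts)
    # Zarr v3 default chunk-key encoding
    return "/".join(["c"] + parts)

print(chunk_key((0, 1)))                                    # 0.1
print(chunk_key((0, 1), separator="/"))                     # 0/1
print(chunk_key((0, 1), zarr_version=3))                    # c/0/1
```

This is only the happy path; the real spec also covers zero-dimensional arrays and configurable v3 separators.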
The MVP does not yet support:
- `zarr_cells()` for dtypes outside `|b1`, `i1`, `i2`, `i4`, `i8`, `u1`, `u2`, `u4`, `u8`, `f2`, `f4`, and `f8`
- Blosc codecs outside the currently supported zstd-backed path
- Non-consolidated remote store discovery
- Remote hierarchical discovery for non-consolidated Zarr v3 group stores
- Arrow materialization
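The dynamic `(dim_0, ..., value)` projection named in the supported-features list can be sketched as follows. `chunk_rows` is a hypothetical helper, not the extension's C++ implementation, and it assumes C-order (row-major) chunk layout:

```python
def chunk_rows(chunk_index, chunk_shape, values):
    """Project a decoded chunk into (dim_0, ..., value) rows in C order.

    chunk_index is the chunk's position in the chunk grid; global
    coordinates are the chunk origin plus the in-chunk offset.
    """
    origin = [ci * extent for ci, extent in zip(chunk_index, chunk_shape)]
    rows = []
    for flat, value in enumerate(values):
        # Unravel the flat offset into per-dimension coordinates.
        coords, rem = [], flat
        for extent in reversed(chunk_shape):
            coords.append(rem % extent)
            rem //= extent
        coords.reverse()
        rows.append(tuple(o + c for o, c in zip(origin, coords)) + (value,))
    return rows

# The second 2x2 chunk along dim_0 starts at global coordinate (2, 0).
print(chunk_rows((1, 0), (2, 2), [9, 8, 7, 6]))
# [(2, 0, 9), (2, 1, 8), (3, 0, 7), (3, 1, 6)]
```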
Build the extension:

```shell
make
```

Install developer tooling for formatting and hooks:

```shell
python3 -m pip install pre-commit clang-format==11.0.1
pre-commit install --hook-type pre-commit --hook-type pre-push
```

Re-run that command after pulling hook configuration changes. It installs both:

- a `pre-commit` hook that auto-fixes DuckDB formatting
- a `pre-push` hook that runs the same formatter in `--check` mode before CI sees the push

Recreate the checked-in sample fixture if needed:

```shell
make fixture
```

Run the extension tests:

```shell
make test_metadata
```

Run the Python smoke example with the same environment shape many Python users will use:

```shell
uvx --with duckdb==1.5.0 --with numpy --with pandas python test/python_smoke.py
```

There is also a matching make target:

```shell
make test_python_example
```

Package a static extension repository layout from downloaded CI artifacts:

```shell
python3 scripts/package_extension_repository.py \
  --artifacts-dir build/distribution-artifacts \
  --out-dir build/extension-repository \
  --extension-name duckdb_zarr \
  --duckdb-version v1.5.0
```

Run the formatter checks directly:

```shell
make format-check
```

Auto-fix formatting:

```shell
make format-fix
```

Open the DuckDB shell with the extension linked in:

```shell
./build/release/duckdb
```

Then query a sample store:
```sql
SELECT * FROM zarr_groups('test/data/simple_v2.zarr');
SELECT * FROM zarr_arrays('test/data/simple_v2.zarr');
SELECT * FROM zarr_chunks('test/data/simple_v2.zarr');
SELECT * FROM zarr_cells('test/data/simple_v2.zarr', 'temperature');
SELECT * FROM zarr('test/data/simple_v2.zarr');
SELECT * FROM zarr_arrays('test/data/simple_v3.zarr', 'v3');
SELECT * FROM zarr('test/data/simple_v3.zarr', 'temperature_v3', 'v3');
SELECT * FROM zarr('test/data/ome_example.ome.zarr', '0');
SELECT * FROM zarr('test/data/idr0062A/6001240_labels.zarr');
SELECT * FROM zarr('test/data/idr0062A/6001240_labels.zarr', '0') LIMIT 5;
SELECT SUM(value)
FROM zarr('test/data/idr0062A/6001240_labels.zarr', 'labels/0/0')
WHERE dim_0 = 0 AND dim_1 = 0 AND dim_2 < 4 AND dim_3 < 4;
```

For remote stores in this phase, the practical contract is:
- the store must expose Zarr v2 consolidated metadata at `.zmetadata`
- DuckDB must have `httpfs` available for the URI scheme you use
- chunk data are still read directly from remote chunk object paths
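The consolidated-metadata contract can be illustrated with a short sketch. `list_arrays` and the inline payload are hypothetical, assuming the standard Zarr v2 consolidated layout in which `.zmetadata` holds a `metadata` map keyed by paths relative to the store root:

```python
import json

# A minimal .zmetadata payload in the Zarr v2 consolidated layout.
ZMETADATA = json.dumps({
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "temperature/.zarray": {"zarr_format": 2, "shape": [4, 4],
                                "chunks": [2, 2], "dtype": "<f8"},
    },
})

def list_arrays(zmetadata_text):
    """Return the array paths recorded in consolidated metadata."""
    doc = json.loads(zmetadata_text)
    return sorted(key[: -len("/.zarray")]
                  for key in doc["metadata"]
                  if key.endswith("/.zarray"))

print(list_arrays(ZMETADATA))  # ['temperature']
```

Because the whole hierarchy lives in one JSON object, a remote store can be enumerated with a single fetch instead of one request per directory.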
Version detection notes:
- `zarr(path)` auto-detects between local v2 and v3 stores
- lower-level functions accept an optional `version_override` argument: `auto`, `v2`, or `v3`
- the convenience cell entrypoint also accepts an override as `zarr(path, array_path, version_override)`
- explicit override is useful when a path is ambiguous or when you want failures to be version-specific
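Auto-detection for local stores can be sketched as a marker-file check. `detect_zarr_version` is a hypothetical illustration of the idea, not the extension's actual detection code:

```python
import os
import tempfile

def detect_zarr_version(store_path):
    """Guess the Zarr version of a local store from its marker files:
    zarr.json signals v3; .zgroup or .zarray signals v2."""
    if os.path.exists(os.path.join(store_path, "zarr.json")):
        return "v3"
    if any(os.path.exists(os.path.join(store_path, marker))
           for marker in (".zgroup", ".zarray")):
        return "v2"
    return "unknown"

with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "zarr.json"), "w").close()
    print(detect_zarr_version(root))  # v3
```

A store carrying both marker styles is exactly the ambiguous case where an explicit `version_override` is useful.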
This repo now includes a release-asset publishing workflow in `PublishExtensionRepository.yml` that builds extension binaries on version tags and attaches platform-specific `.duckdb_extension.gz` assets to the matching GitHub Release.
To make this work for your repository:
- push a version tag such as `v0.1.0`
- create the matching GitHub Release, or let the workflow create/update it from the tag

Then users can install directly from a release asset URL:

```sql
INSTALL 'https://github.com/wayscience/duckdb_zarr/releases/download/v0.1.0/duckdb_zarr-v1.5.0-osx_arm64.duckdb_extension.gz';
LOAD duckdb_zarr;
```

Each release publishes assets named like:

```
duckdb_zarr-v1.5.0-osx_arm64.duckdb_extension.gz
duckdb_zarr-v1.5.0-linux_amd64.duckdb_extension.gz
```
To publish a new installable extension release from this repository:
- Ensure the release workflow in `PublishExtensionRepository.yml` is enabled and passing on the default branch.
- Create and push a version tag such as `v0.1.0`.
- Wait for the publish workflow to:
  - build the extension binaries
  - package platform-specific `.duckdb_extension.gz` assets
  - upload those assets to the matching GitHub Release
- Verify that a published artifact path exists, for example: `https://github.com/wayscience/duckdb_zarr/releases/download/v0.1.0/duckdb_zarr-v1.5.0-osx_arm64.duckdb_extension.gz`
- Optionally add human-readable release notes. The installable path is the GitHub Release asset itself.
After that, users can install the extension with:
```sql
INSTALL 'https://github.com/wayscience/duckdb_zarr/releases/download/v0.1.0/duckdb_zarr-v1.5.0-osx_arm64.duckdb_extension.gz';
LOAD duckdb_zarr;
```

For remote Zarr stores, users may also need:

```sql
INSTALL httpfs;
LOAD httpfs;
```

The repository is still the standard DuckDB extension template, so the main commands are unchanged:
- `make`: build DuckDB and the extension
- `make test`: run SQLLogic tests
- `make fixture`: regenerate the sample Zarr store
- `make test_metadata`: regenerate the fixture and run tests
- `make format-check`: verify DuckDB formatting for `src/` and `test/`
- `make format-fix`: apply DuckDB formatting for `src/` and `test/`
This repo also includes a pre-commit configuration in `.pre-commit-config.yaml` that runs the DuckDB formatter automatically before commits. The formatter wrapper depends on clang-format, and the current DuckDB script expects `clang-format==11.0.1`.
Useful build outputs:
- `./build/release/duckdb`
- `./build/release/test/unittest`
- `./build/release/extension/duckdb_zarr/duckdb_zarr.duckdb_extension`
- `./build/extension-repository/<duckdb-version>/<platform>/duckdb_zarr.duckdb_extension.gz`
The current code maps directly to the architecture document:
- Store Adapter: local filesystem traversal through DuckDB's `FileSystem`
- Store Adapter: consolidated remote discovery for `http://`, `https://`, and `s3://` paths through DuckDB filesystem routing
- Metadata Parser: Zarr v2 `.zgroup`/`.zarray` plus Zarr v3 `zarr.json` parsing via `yyjson`
- DuckDB Bridge: table functions registered from the extension entrypoint
- Relational Cell Scan: `zarr_cells()` with pushed filters, projection-aware output, and dimension-based chunk pruning
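The dimension-based chunk pruning in the cell scan can be sketched as an interval-overlap test per dimension. `prune_chunks` is a hypothetical helper, not the extension's C++ code; it assumes inclusive `[lo, hi]` bounds derived from pushed dimension filters:

```python
from itertools import product
from math import ceil

def prune_chunks(shape, chunks, dim_bounds):
    """Enumerate chunk grid indices whose extents overlap per-dimension
    inclusive [lo, hi] bounds; dimensions absent from dim_bounds are
    unconstrained."""
    grid = [ceil(s / c) for s, c in zip(shape, chunks)]
    keep_per_dim = []
    for dim, n in enumerate(grid):
        bounds = dim_bounds.get(dim)
        keep = []
        for ci in range(n):
            lo = ci * chunks[dim]
            hi = min((ci + 1) * chunks[dim], shape[dim]) - 1
            if bounds is None or (lo <= bounds[1] and hi >= bounds[0]):
                keep.append(ci)
        keep_per_dim.append(keep)
    return list(product(*keep_per_dim))

# A 10x10 array with 4x4 chunks and a filter dim_0 < 4 keeps only the
# first row of the 3x3 chunk grid.
print(prune_chunks((10, 10), (4, 4), {0: (0, 3)}))
# [(0, 0), (0, 1), (0, 2)]
```

Pruning happens before any chunk is fetched or decoded, so a selective dimension filter skips both I/O and codec work for non-overlapping chunks.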
The current `zarr_cells()` path now streams chunk-by-chunk at execution time, but there is still room to improve execution internals. The next major implementation step is to tighten the scan loop further across:

- chunk-file reads
- codec decode
- typed value materialization
- row projection into `(dim_0, ..., value)`
- Arrow-oriented batching and broader codec support
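The codec-decode and typed-materialization stages of that loop can be sketched together. `decode_chunk` and its dtype table are hypothetical illustrations covering only a few little-endian dtypes, not the extension's implementation:

```python
import gzip
import struct

# Map a small subset of Zarr v2 dtype strings to struct format characters.
DTYPE_FMT = {"|b1": "?", "|i1": "b", "<i4": "i", "<i8": "q",
             "<f4": "f", "<f8": "d"}

def decode_chunk(payload, dtype, count, compressed=True):
    """Decompress a chunk payload and unpack it into typed Python values."""
    raw = gzip.decompress(payload) if compressed else payload
    return list(struct.unpack("<" + str(count) + DTYPE_FMT[dtype], raw))

# Round-trip four little-endian int32 values through gzip.
packed = gzip.compress(struct.pack("<4i", 1, 2, 3, 4))
print(decode_chunk(packed, "<i4", 4))  # [1, 2, 3, 4]
```

The real scan must also handle the `bytes` codec, Blosc/zstd, fill values for missing chunks, and big-endian dtypes, which is exactly where the broader codec work lands.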
- `src/duckdb_zarr_extension.cpp`: extension entrypoint and function registration
- `src/zarr_metadata.cpp`: metadata discovery and table function implementations
- `scripts/create_sample_zarr.py`: reproducible fixture generation
- `test/sql/duckdb_zarr.test`: SQLLogic coverage for metadata, cell projection, and pushed-filter behavior
No new third-party runtime dependencies were added in this repo. For remote access, duckdb_zarr relies on DuckDB’s httpfs extension being available to the runtime.
For the next phase, useful dependency choices will likely be:
- a stable Zarr metadata/codec implementation strategy in C++
- blosc decode support for chunk payloads
- Arrow integration once `zarr_cells()` starts producing typed batches
If you want me to take on the next phase, the highest-value dependency discussion is now around broader codec support, especially Blosc, and whether remote execution should grow from the current consolidated-metadata path into fuller object-store discovery plus caching.