duckdb_zarr

duckdb_zarr is a DuckDB extension for exploring Zarr stores with SQL through a relational projection.

The current implementation follows the project documents conservatively:

  • SPEC.md: start with relational metadata and chunk discovery
  • ARCHITECTURE.md: build the store adapter, metadata parser, and DuckDB bridge first
  • ROADMAP.md: advance incrementally from metadata discovery to relational cell scans and planner-aware execution

Today’s extension provides metadata and relational scan table functions for local stores and a constrained remote read path:

  • zarr(path) for a simple array overview
  • zarr(path, array_path) and zarr(path, array_path, version_override) as convenience aliases for zarr_cells(...)
  • zarr_groups(path) and zarr_groups(path, version_override)
  • zarr_arrays(path) and zarr_arrays(path, version_override)
  • zarr_chunks(path) and zarr_chunks(path, version_override)
  • zarr_cells(path, array_path) and zarr_cells(path, array_path, version_override) for Zarr dtypes |b1, i1, i2, i4, i8, u1, u2, u4, u8, f2, f4, and f8

This gives a usable SQL entrypoint for understanding a Zarr store and projecting dense numeric arrays into relational rows. Other dtypes, including complex values and non-standard Zarr dtypes, are currently rejected by zarr_cells(path, array_path).
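
The supported dtype codes all map onto fixed-width binary layouts. As a rough illustration only (the table below is a sketch, not the extension's internal mapping), each code pairs naturally with a Python struct format and byte width:

```python
import struct

# Hypothetical mapping from supported Zarr dtype codes to
# (struct format char, byte width). "f2" is IEEE 754 half
# precision ("e" in struct); "b1" is a one-byte boolean.
SUPPORTED = {
    "b1": ("?", 1),
    "i1": ("b", 1), "i2": ("h", 2), "i4": ("i", 4), "i8": ("q", 8),
    "u1": ("B", 1), "u2": ("H", 2), "u4": ("I", 4), "u8": ("Q", 8),
    "f2": ("e", 2), "f4": ("f", 4), "f8": ("d", 8),
}

def check_dtype(dtype: str) -> tuple[str, int]:
    """Strip the byte-order prefix ("<", ">", "|") and reject anything
    outside the supported set, mirroring how zarr_cells() rejects
    complex and non-standard dtypes."""
    code = dtype.lstrip("<>|=")
    if code not in SUPPORTED:
        raise ValueError(f"unsupported Zarr dtype: {dtype}")
    return SUPPORTED[code]
```

For example, check_dtype("<f8") accepts a little-endian float64, while a complex dtype such as "<c16" is rejected.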

What Works

The MVP currently supports:

  • Local filesystem Zarr v2 discovery
  • Local filesystem Zarr v3 discovery from zarr.json
  • Zarr v2 consolidated metadata discovery from .zmetadata
  • Remote http://, https://, and s3:// stores when DuckDB httpfs is available and the store is consolidated
  • OME-Zarr-style multiscales groups built on Zarr v2, including level arrays such as 0
  • Real-world OME-Zarr v3 metadata discovery for stores such as image-label hierarchies and multiscales groups
  • Group enumeration from .zgroup
  • Array enumeration from .zarray and zarr.json
  • Automatic v2/v3 detection for local stores, with explicit version_override support (auto, v2, v3) on the lower-level SQL functions
  • Chunk enumeration for v2 . and / separators plus Zarr v3 default and v2 chunk-key encodings
  • zarr_cells(path, array_path) for dense arrays with dtypes |b1, i1, i2, i4, i8, u1, u2, u4, u8, f2, f4, and f8
  • Uncompressed and gzip-compressed chunk payloads
  • Zarr v3 regular chunk grids with bytes and optional gzip codecs
  • Zarr v3 transpose handling for identity and full-reverse permutations
  • Zarr v3 sharding_indexed arrays with inner Blosc (zstd) decode for common OME-Zarr image and label data
  • Real OME-Zarr v3 cell scans against test/data/idr0062A/6001240_labels.zarr
  • Missing-chunk fill-value materialization for supported zarr_cells() dtypes when fill_value is present
  • Dynamic (dim_0, ..., value) projection based on array rank and dtype
  • Projection-aware zarr_cells() scans
  • Filter pushdown on dimension and value columns
  • Chunk pruning from dimension filters before chunk decode
  • Chunk-streamed execution with bounded scan-time memory instead of init-time row buffering
  • Developer fixture generation and SQLLogic tests
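
The chunk-key encodings listed above differ between versions: Zarr v2 joins chunk indices with "." or "/", while the Zarr v3 default encoding adds a "c/" prefix, and v3 can also carry a v2-style encoding. A minimal sketch (function names are illustrative, not from the extension's code):

```python
def chunk_key_v2(indices, separator="."):
    """Zarr v2 chunk key: indices joined by "." or "/", e.g. "0.1.2"."""
    return separator.join(str(i) for i in indices)

def chunk_key_v3_default(indices):
    """Zarr v3 default chunk-key encoding: "c/" prefix, "/"-joined."""
    return "c/" + "/".join(str(i) for i in indices)

def chunk_key_v3_v2_encoding(indices, separator="."):
    """Zarr v3 "v2" chunk-key encoding: v2-style keys in a v3 store."""
    return separator.join(str(i) for i in indices)
```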

The MVP does not yet support:

  • zarr_cells() for dtypes outside |b1, i1, i2, i4, i8, u1, u2, u4, u8, f2, f4, and f8
  • Blosc codecs outside the currently supported zstd-backed path
  • Non-consolidated remote store discovery
  • Remote hierarchical discovery for non-consolidated Zarr v3 group stores
  • Arrow materialization

Quick Start

Build the extension:

make

Install developer tooling for formatting and hooks:

python3 -m pip install pre-commit clang-format==11.0.1
pre-commit install --hook-type pre-commit --hook-type pre-push

Re-run that command after pulling hook configuration changes. It installs both:

  • a pre-commit hook that auto-fixes DuckDB formatting
  • a pre-push hook that runs the same formatter in --check mode before CI sees the push

Regenerate the checked-in sample fixture if needed:

make fixture

Run the extension tests:

make test_metadata

Run the Python smoke example in an environment similar to what many Python users will have:

uvx --with duckdb==1.5.0 --with numpy --with pandas python test/python_smoke.py

There is also a matching make target:

make test_python_example

Package a static extension repository layout from downloaded CI artifacts:

python3 scripts/package_extension_repository.py \
  --artifacts-dir build/distribution-artifacts \
  --out-dir build/extension-repository \
  --extension-name duckdb_zarr \
  --duckdb-version v1.5.0
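
The script arranges gzipped extension binaries into the static directory layout DuckDB expects from an extension repository. A rough sketch of what that packaging step amounts to (the function and its parameters are assumptions mirroring the script's flags, not its actual code):

```python
import gzip
import shutil
from pathlib import Path

def package_extension(binary: Path, out_dir: Path,
                      duckdb_version: str, platform: str,
                      extension_name: str = "duckdb_zarr") -> Path:
    """Write <out>/<duckdb-version>/<platform>/<name>.duckdb_extension.gz,
    gzipping the raw extension binary into place."""
    dest_dir = out_dir / duckdb_version / platform
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{extension_name}.duckdb_extension.gz"
    with open(binary, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest
```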

Run the formatter checks directly:

make format-check

Auto-fix formatting:

make format-fix

Open the DuckDB shell with the extension linked in:

./build/release/duckdb

Then query a sample store:

SELECT * FROM zarr_groups('test/data/simple_v2.zarr');
SELECT * FROM zarr_arrays('test/data/simple_v2.zarr');
SELECT * FROM zarr_chunks('test/data/simple_v2.zarr');
SELECT * FROM zarr_cells('test/data/simple_v2.zarr', 'temperature');
SELECT * FROM zarr('test/data/simple_v2.zarr');
SELECT * FROM zarr_arrays('test/data/simple_v3.zarr', 'v3');
SELECT * FROM zarr('test/data/simple_v3.zarr', 'temperature_v3', 'v3');
SELECT * FROM zarr('test/data/ome_example.ome.zarr', '0');
SELECT * FROM zarr('test/data/idr0062A/6001240_labels.zarr');
SELECT * FROM zarr('test/data/idr0062A/6001240_labels.zarr', '0') LIMIT 5;
SELECT SUM(value)
FROM zarr('test/data/idr0062A/6001240_labels.zarr', 'labels/0/0')
WHERE dim_0 = 0 AND dim_1 = 0 AND dim_2 < 4 AND dim_3 < 4;
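
Dimension filters like those in the last query let the scan skip whole chunks before any decode work. A minimal sketch of that pruning idea for exclusive upper bounds (an illustration of the concept, not the extension's actual implementation):

```python
from itertools import product

def prune_chunks(shape, chunk_shape, upper_bounds):
    """Yield chunk indices whose extent can intersect the filter.
    upper_bounds[d] is an exclusive upper bound on dim_d (None means
    no filter), mirroring predicates like "dim_2 < 4"."""
    chunks_per_dim = [
        -(-s // c) for s, c in zip(shape, chunk_shape)  # ceil division
    ]
    for idx in product(*(range(n) for n in chunks_per_dim)):
        keep = True
        for d, i in enumerate(idx):
            bound = upper_bounds[d]
            if bound is not None and i * chunk_shape[d] >= bound:
                keep = False  # chunk starts at/after the bound: prune
                break
        if keep:
            yield idx
```

With an 8x8 array in 4x4 chunks and a filter dim_0 < 4, only the two chunks in the first chunk row survive pruning.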

For remote stores in this phase, the practical contract is:

  • the store must expose Zarr v2 consolidated metadata at .zmetadata
  • DuckDB must have httpfs available for the URI scheme you use
  • chunk data are still read directly from remote chunk object paths
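
Consolidated metadata keeps remote discovery to a single read: .zmetadata bundles every .zgroup/.zarray document under one "metadata" object. A sketch of pulling array paths out of such a document (key naming follows the Zarr v2 consolidated-metadata convention; the helper itself is illustrative):

```python
import json

def arrays_from_zmetadata(text: str) -> list[str]:
    """List array paths recorded in a Zarr v2 .zmetadata document.
    Keys look like "temperature/.zarray" and live under the top-level
    "metadata" object."""
    doc = json.loads(text)
    meta = doc.get("metadata", {})
    return sorted(
        key[: -len("/.zarray")]
        for key in meta
        if key.endswith("/.zarray")
    )
```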

Version detection notes:

  • zarr(path) auto-detects between local v2 and v3 stores
  • lower-level functions accept an optional version_override argument: auto, v2, or v3
  • the convenience cell entrypoint also accepts an override as zarr(path, array_path, version_override)
  • explicit override is useful when a path is ambiguous or when you want failures to be version-specific
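
The auto-detection rule can be summarized simply: a v3 store carries zarr.json at its root, while a v2 store carries .zgroup or .zarray. A sketch of that decision for local stores (an assumption about precedence; the extension's exact order of checks may differ):

```python
from pathlib import Path

def detect_zarr_version(store: str) -> str:
    """Return "v3" if a root zarr.json exists, "v2" if .zgroup or
    .zarray exists, roughly what an "auto" override resolves to."""
    root = Path(store)
    if (root / "zarr.json").is_file():
        return "v3"
    if (root / ".zgroup").is_file() or (root / ".zarray").is_file():
        return "v2"
    raise ValueError(f"not a recognizable Zarr store: {store}")
```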

Install Like An Extension

This repo now includes a release-asset publishing workflow in PublishExtensionRepository.yml that builds extension binaries on version tags and attaches platform-specific .duckdb_extension.gz assets to the matching GitHub Release.

To make this work for your repository:

  • push a version tag such as v0.1.0
  • create the matching GitHub Release, or let the workflow create/update it from the tag

Then users can install directly from a release asset URL:

INSTALL 'https://github.com/wayscience/duckdb_zarr/releases/download/v0.1.0/duckdb_zarr-v1.5.0-osx_arm64.duckdb_extension.gz';
LOAD duckdb_zarr;

Each release publishes assets named like:

duckdb_zarr-v1.5.0-osx_arm64.duckdb_extension.gz
duckdb_zarr-v1.5.0-linux_amd64.duckdb_extension.gz

Release Checklist

To publish a new installable extension release from this repository:

  1. Ensure the release workflow in PublishExtensionRepository.yml is enabled and passing on the default branch.
  2. Create and push a version tag such as v0.1.0.
  3. Wait for the publish workflow to:
    • build the extension binaries
    • package platform-specific .duckdb_extension.gz assets
    • upload those assets to the matching GitHub Release
  4. Verify that a published artifact path exists, for example: https://github.com/wayscience/duckdb_zarr/releases/download/v0.1.0/duckdb_zarr-v1.5.0-osx_arm64.duckdb_extension.gz
  5. Optionally add human-readable release notes. The installable path is the GitHub Release asset itself.

After that, users can install the extension with:

INSTALL 'https://github.com/wayscience/duckdb_zarr/releases/download/v0.1.0/duckdb_zarr-v1.5.0-osx_arm64.duckdb_extension.gz';
LOAD duckdb_zarr;

For remote Zarr stores, users may also need:

INSTALL httpfs;
LOAD httpfs;

Developer Workflow

The repository is still the standard DuckDB extension template, so the main commands are unchanged:

  • make: build DuckDB and the extension
  • make test: run SQLLogic tests
  • make fixture: regenerate the sample Zarr store
  • make test_metadata: regenerate the fixture and run tests
  • make format-check: verify DuckDB formatting for src/ and test/
  • make format-fix: apply DuckDB formatting for src/ and test/

This repo also includes a pre-commit configuration in .pre-commit-config.yaml that runs the DuckDB formatter automatically before commits. The formatter wrapper depends on clang-format, and the current DuckDB script expects clang-format==11.0.1.

Useful build outputs:

  • ./build/release/duckdb
  • ./build/release/test/unittest
  • ./build/release/extension/duckdb_zarr/duckdb_zarr.duckdb_extension
  • ./build/extension-repository/<duckdb-version>/<platform>/duckdb_zarr.duckdb_extension.gz

Architecture Mapping

The current code maps directly to the architecture document:

  • Store Adapter: local filesystem traversal through DuckDB’s FileSystem
  • Store Adapter: consolidated remote discovery for http://, https://, and s3:// paths through DuckDB filesystem routing
  • Metadata Parser: Zarr v2 .zgroup/.zarray plus Zarr v3 zarr.json parsing via yyjson
  • DuckDB Bridge: table functions registered from the extension entrypoint
  • Relational Cell Scan: zarr_cells() with pushed filters, projection-aware output, and dimension-based chunk pruning

The current zarr_cells() path now streams chunk-by-chunk at execution time, but there is still room to improve execution internals. The next major implementation step is to tighten the scan loop further with:

  1. chunk-file reads
  2. codec decode
  3. typed value materialization
  4. row projection (dim_0, ..., value)
  5. Arrow-oriented batching and broader codec support
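
Steps 1 through 4 above can be pictured as a single generator: read a chunk, decode its codec, unpack typed values, and emit (dim_0, ..., value) rows. A simplified sketch assuming a gzip-compressed, C-order chunk (a much-reduced model of the real pipeline):

```python
import gzip
import struct
from itertools import product

def scan_chunk(raw: bytes, chunk_shape, origin, fmt="<d"):
    """Decode one gzip chunk and yield (dim_0, ..., value) rows.
    origin is the chunk's starting index along each dimension; fmt is
    a two-char struct format for the dtype (e.g. "<d" for f8)."""
    data = gzip.decompress(raw)                      # codec decode
    width = struct.calcsize(fmt)
    values = struct.unpack(f"{fmt[0]}{len(data) // width}{fmt[1]}", data)
    # C-order chunk-local coordinates, shifted by the chunk origin
    for coords, value in zip(product(*(range(n) for n in chunk_shape)), values):
        yield tuple(o + c for o, c in zip(origin, coords)) + (value,)
```

Because each chunk is consumed and discarded in turn, scan-time memory stays bounded by one decoded chunk rather than the whole array, which is the property the chunk-streamed execution above aims for.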

Code Layout

Dependency Notes

No new third-party runtime dependencies were added in this repo. For remote access, duckdb_zarr relies on DuckDB’s httpfs extension being available to the runtime.

For the next phase, useful dependency choices will likely be:

  • a stable Zarr metadata/codec implementation strategy in C++
  • blosc decode support for chunk payloads
  • Arrow integration once zarr_cells() starts producing typed batches

Looking further ahead, the highest-value dependency discussion is around broader codec support, especially Blosc, and whether remote execution should grow from the current consolidated-metadata path into fuller object-store discovery plus caching.
