feat(raster-zarr): sedona-raster-zarr crate + sd_read_zarr UDTF#858
Draft
james-willis wants to merge 3 commits into
Draft
feat(raster-zarr): sedona-raster-zarr crate + sd_read_zarr UDTF#858james-willis wants to merge 3 commits into
james-willis wants to merge 3 commits into
Conversation
Adds a sedona-raster-zarr crate that maps Zarr groups onto SedonaDB's
N-D raster Arrow schema, plus an sd_read_zarr DataFusion UDTF that
exposes it to SQL.
## sedona-raster-zarr
Two entry points emit one raster row per chunk position in the group's
chunk grid, with one band per array:
* group_to_indb_rasters(group_uri) — eager: fetches every chunk's bytes
into the Arrow data column.
* group_to_outdb_rasters(group_uri) — URI-only: emits chunk-anchor URIs
in each band's outdb_uri (<store>#array=<path>&chunk=<i0>,...).
Built on the zarrs 0.23 crate, chosen over GDAL MultiDim because zarrs
reads Zarr v3 sharding and vlen-utf8 coordinate variables.
Modules:
* source_uri — group-URI normalisation (file:// + bare path; cloud
schemes error pending the resolver work) and chunk anchor build/parse.
* geozarr — proj:wkt2 / proj:projjson / proj:epsg precedence and
spatial:transform / spatial:dims parsing from group attributes.
* dtype — zarrs DataType ↔ BandDataType via the type-erased is::<T>()
downcast (zarrs 0.23 wraps Arc<dyn DataTypeTraits>, not an enum).
* loader — opens FilesystemStore + Group, enumerates child arrays
(sorted by path for deterministic band order), validates shared chunk
grid / chunk shape / dim names, walks the chunk grid row-major,
computes per-chunk transforms by translating the group affine along
the spatial axes.
Tests: 31 in-module unit tests + 4 integration tests that build a real
Zarr group on disk via zarrs::ArrayBuilder and verify byte-exact pixel
content, per-chunk transforms, anchor URI format, and error paths.
## sd_read_zarr UDTF
DataFusion table function exposing the loader to SQL. Mirrors
sd_random_geometry's shape (Function + Provider + Exec) and registers
in SedonaContext::new_from_context.
SELECT raster FROM sd_read_zarr('file:///path/to/datacube.zarr');
SELECT count(*) FROM sd_read_zarr(
'file:///path/to/datacube.zarr',
'{\"mode\": \"outdb\", \"rows_per_batch\": 256}'
);
Returns a single-column table raster: Raster. All existing RS_* UDFs
operate on the column unchanged.
JSON options:
* mode: \"indb\" (default) or \"outdb\". Phase 1 defaults to InDb so
byte-reading kernels work end-to-end. Flips to \"outdb\" once a
format-keyed OutDb dispatcher registers a zarr loader.
* rows_per_batch: chunks per RecordBatch (default 1024).
* num_partitions: scan partitions (default 1). Anything other than 1
errors — round-robin chunk partitioning lands with the resolver work.
ZarrChunkProvider builds the StructArray once at plan time. scan()
wraps the inner exec in ProjectionExec so empty / partial projections
(e.g. SELECT count(*)) match the requested physical schema.
ZarrChunkExec slices the StructArray into rows_per_batch-sized
RecordBatches. Bounded, Incremental, single-partition.
6 in-module integration tests build a Zarr fixture and exercise the
full SQL pipeline.
## Out of scope (follow-up PR)
* OutDb byte resolver (the function the byte-loading hook calls to
fetch zarr chunks) — depends on a format-keyed multi-backend
dispatcher.
* Default mode flip from \"indb\" to \"outdb\".
* Cloud storage backends (zarrs_object_store) and the async runtime
story they require.
* Round-robin chunk partitioning past num_partitions=1.
* CREATE EXTERNAL TABLE ... STORED AS ZARR.
Adds a `con.funcs.table.sd_read_zarr(uri, *, mode, rows_per_batch,
num_partitions)` Python method that mirrors the SQL UDTF added in the
companion Rust commit. Same shape as the existing
`sd_random_geometry` wrapper: builds the JSON options object,
filters out None entries, and runs the SQL UDTF through
`self._ctx.sql`.
Python test uses `pytest.importorskip("zarr")` so it cleanly skips
when the optional `zarr` library isn't installed in the test
environment. Builds a 2×2 UInt8 Zarr group on disk via the official
`zarr` Python lib and exercises both the default-options form and
the `rows_per_batch` override.
e9cb19f to
3fd6420
Compare
james-willis
commented
May 19, 2026
| .map(|&i| array_infos[0].chunk_shape[i] as i64) | ||
| .collect(); | ||
|
|
||
| let group_transform = geo.transform.unwrap_or([0.0, 1.0, 0.0, 0.0, 0.0, -1.0]); |
Contributor
Author
There was a problem hiding this comment.
Currently the type implementation does not allow null geotransforms so we need to provide this potentially misleading value here when geozarr metadata is not present.
Havent considered what to do about this.
Round of review feedback on the sd_read_zarr loader and UDTF surface: - Drop "Phase 1" references from code, error messages, and docstrings; describe behaviour in absolute terms. - Bound-check the chunk-grid product so a malformed or hostile Zarr can't drive RasterBuilder capacity to overflow / OOM. - Always populate `outdb_uri` and `outdb_format` on every band as provenance metadata; InDb vs OutDb stays discriminated by `data.is_empty()`. - Fall back to xarray's `_ARRAY_DIMENSIONS` attribute when an array lacks the first-class `dimension_names` field (covers Zarr v2 and any v3 file that uses the xarray convention). - Auto-skip 1-D arrays at selection time (typical xarray coord variables) — they can never produce a valid raster band, so reading them would always fail downstream at spatial-dim resolution. - Add an `arrays = [...]` option (Python kwarg + UDTF JSON key) to read a named subset of arrays; unknown names and 1-D arrays both error early with clear messages pointing at the option. - Error loudly when a group declares a CRS but no `spatial:transform` attribute — the high-confidence-bug case where falling back to the identity transform would silently produce wrong spatial-join results. - Warn-log when both CRS and transform are absent and we fall back to pixel-coordinate identity, so spatial-join surprises are debuggable from logs. - Flip the InDb/OutDb selector from `mode: "indb"|"outdb"` to `indb: bool` end-to-end (Python kwarg, SQL JSON key, internal options struct). The schema-level discriminator is binary, so a boolean is the more honest UI. Tests cover every new branch: auto-skip, explicit array filter, 1-D rejection, _ARRAY_DIMENSIONS fallback, CRS-without-transform error, indb-bool plumbing through SQL.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds support for loading zarrs as rasters. Each chunk becomes one row, similar to how each tiff becomes one row.
Rasters table is returns as a single column table with the column named
raster. For now, chunks are eagerly loaded as part of the sd_read_zarr operation. When lazy loading support lands in #849 I will add lazy zarr support and make that the default.We built on the zarrs 0.23 craterather than GDAL MultiDim because two GDAL gaps are independently disqualifying: Zarr v3 sharding is unreadable in GDAL ≤ 3.12 (sharded Zarr is a critical cloud-native feature IMO), and vlen-utf8 string coordinate variables are unreadable through GDAL MultiDim (climate/datacube Zarr uses these for band names).
What's in this PR
group_to_indb_rasters(uri)andgroup_to_outdb_rasters(uri)entry points.sd_read_zarrUDTF registered inSedonaContext::new_from_context. JSON options:mode(default "indb"),rows_per_batch(default 1024),num_partitions(default 1).SELECT count(*) FROM sd_read_zarr(...)works.Out of scope (separate follow-up PR)
num_partitions = 1.Open questions for reviewers
sd_read_zarrmatches the existingsd_random_geometryUDTF in this repo. The wider DataFusion ecosystem also uses the _scan pattern (e.g. delta_scan). Happy to rename tosd_zarr_scanif reviewers prefer