feat(sedona-raster-gdal) add GDAL foundation library for supporting GDAL-based RS functions#787
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a new Rust workspace crate, sedona-raster-gdal, establishing the foundational GDAL integration layer for SedonaDB raster types (dataset conversion, out-db source handling, and GDAL session/config plumbing) to support later GDAL-backed raster functions.
Changes:
- Add
sedona-raster-gdalcrate to the workspace and wire it into the mainsedonacrate dependencies. - Implement GDAL dataset/provider foundations: MEM conversion for in-db bands, VRT construction for out-db/mixed rasters, and URL-to-VSI normalization helpers.
- Expose
RasterExecutor(viasedona-raster-functions) and add Sedona config options for GDAL (sedona.gdal.cpl_debug).
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/sedona/Cargo.toml | Adds sedona-raster-gdal as a dependency of the main crate. |
| rust/sedona-raster-gdal/Cargo.toml | Defines the new GDAL raster foundation crate and dependencies. |
| rust/sedona-raster-gdal/src/lib.rs | Sets up module structure and re-exports key conversion utilities. |
| rust/sedona-raster-gdal/src/utils.rs | Adds helper to load a GDAL-openable raster path into an in-db raster StructArray. |
| rust/sedona-raster-gdal/src/gdal_common.rs | Adds shared GDAL helpers (type/nodata conversions, dataset open, URL normalization) + unit tests. |
| rust/sedona-raster-gdal/src/gdal_dataset_provider.rs | Implements dataset provider + thread-local caching + VRT windowing/alignment logic + tests. |
| rust/sedona-raster-functions/src/lib.rs | Re-exports RasterExecutor for external use. |
| rust/sedona-raster-functions/src/executor.rs | Makes new_with_num_iterations available outside tests. |
| rust/sedona-common/src/option.rs | Adds SedonaOptions.gdal namespace with cpl_debug option. |
| Cargo.toml | Adds the new crate to workspace members and workspace dependencies. |
| Cargo.lock | Updates lockfile for the new crate. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Share in-db raster builders through sedona-testing so GDAL raster tests reuse the same fixture construction path. Remove redundant GDAL driver initialization from the tests now that setup is handled consistently.
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you! The conversion to/from GDAL Dataset is huge. I'm happy to file follow-ups for any of my comments here.
| /// Normalize out-db raster URLs to GDAL VSI paths. | ||
| /// | ||
| /// Supported translations: | ||
| /// - `s3://bucket/key` -> `/vsis3/bucket/key` | ||
| /// - `s3a://bucket/key` -> `/vsis3/bucket/key` |
There was a problem hiding this comment.
No need to do it here, but {rest}.zip -> /vsizip/{rest} is also useful.
| pub(crate) fn configure_thread_local_options( | ||
| gdal: &Gdal, | ||
| config_options: Option<&ConfigOptions>, | ||
| ) -> Result<()> { | ||
| // Set frequently requested GDAL config options as thread-local options to eliminate the | ||
| // need for acquiring configs from global config or environment variable, which is very | ||
| // likely to result in heavy contention in multi-threaded environments. | ||
| let cpl_debug_enabled = config_options |
There was a problem hiding this comment.
It is probably a good pattern to always do this within some context like with_thread_local_options() that takes care to reset the values on exit, particularly when borrowing the shared object from something like rasterio which may be doing the same thing. During execution we'll often have our own threads from tokio but I think constant folding happens on the main Python thread. We can follow up with this if it's difficult to implement here.
| /// Get or create the thread-local `GDALDatasetCache`. | ||
| pub(crate) fn thread_local_cache() -> Result<Rc<GDALDatasetCache>> { | ||
| TL_GDAL_DATASET_CACHE.with(|cell| { | ||
| let mut opt = cell.borrow_mut(); | ||
| if let Some(rc) = opt.as_ref() { | ||
| Ok(Rc::clone(rc)) | ||
| } else { |
There was a problem hiding this comment.
I hope there is a future where we can move some of these caches to be session scoped (this feels like it could potentially be large in some cases).
| BandDataType::UInt64 => read_le_f64!(u64, 8), | ||
| BandDataType::Int64 => read_le_f64!(i64, 8), |
There was a problem hiding this comment.
Maybe remove these two since they're lossy and you handle/test them separately via the dedicated gdal setters?
There was a problem hiding this comment.
I have patched these arms to return Err to inform the developer to handle these types specially.
| #[allow(dead_code)] | ||
| mod gdal_common; | ||
| #[allow(dead_code)] | ||
| mod gdal_dataset_provider; |
There was a problem hiding this comment.
Do you still need these after the pub use? If you do perhaps add a comment with a link to a PR or issue that will remove it (or just make it pub).
| #[allow(dead_code)] | |
| mod gdal_common; | |
| #[allow(dead_code)] | |
| mod gdal_dataset_provider; | |
| mod gdal_common; | |
| mod gdal_dataset_provider; |
| /// Load a raster from any GDAL-openable path as an in-db raster `StructArray`. | ||
| /// | ||
| /// The `path` can be a regular file path, a `/vsimem/` memory path, | ||
| /// a `/vsicurl/` URL, or any other GDAL virtual filesystem path. | ||
| pub fn load_as_indb_raster(gdal: &Gdal, path: &str) -> Result<StructArray> { | ||
| // Open dataset from path | ||
| let dataset = gdal | ||
| .open_ex_with_options( |
There was a problem hiding this comment.
Should we file a follow-up issue to add tests for this?
There was a problem hiding this comment.
I was intended to move it to a dedicated patch. I'll remove it from this patch.
Reference issues apache#803 and apache#804 so the temporary lint suppressions are explicitly tracked without widening the crate API.
Rewrite the GDAL loader and dataset provider to read from the N-D band schema instead of apache#787's storage_type / outdb_url / outdb_band_id triplet. Each GDAL-backed read site now gates on `BandRef::is_2d()` and returns a clear `Execution` error on N-D input; multi-band source selection routes through `parse_outdb_source(outdb_uri)` so the `#band=N` SedonaDB convention and GDAL native subdataset URIs both work transparently. VSI normalization, the dataset cache, and the RasterIO bodies are byte-for-byte unchanged — only the schema-read sites move. In-db reads use `BandRef::contiguous_data()` and require `Cow::Borrowed` so MEM datasets can point at the StructArray's backing buffer without copying; for `is_2d` identity views this always holds. Tests rebuilt to use `RasterBuilder` directly. Adds an N-D rejection test for `raster_ref_to_gdal_mem` and the VRT path, plus an end-to-end `#band=2` selection test against a two-band GeoTIFF.
Replaces apache#787's 2D-only band schema with the canonical N-D schema: spatial_dims/spatial_shape at the raster level; bands carry dim_names, source_shape, nullable view, outdb_uri, outdb_format, plus the non-nullable data buffer. Removes nodata_value, storage_type, outdb_url, and outdb_band_id - every one is encodable in the new schema: - storage_type ↔ outdb_uri.is_null() (null = InDb, set = OutDbRef). - outdb_url ↔ outdb_uri (no rename, same string). - outdb_band_id ↔ encoded inside outdb_uri (#band=N or GDAL native subdataset URI), parsed only inside the GDAL format driver. - nodata_value ↔ typed nodata: Binary (a null row means "no nodata"). Top-level adds spatial_dims: List<Utf8View> and spatial_shape: List<Int64>; nullable view is List<Struct<source_axis, start, step, steps: Int64>> where a null row encodes the canonical identity view. Note: intermediate commits in this PR are not expected to build; only the PR tip is CI-green. The trait, reader/builder, RS_* migration, and GDAL loader port land in subsequent commits.
… reader/builder + RS_* migration View-aware Arrow reader (RasterStructArray, BandRefImpl) with corruption- surgery (negative steps, bad source_axis, length mismatch) that round-trips an ArrowError. Builder exposes start_raster / start_band for full N-D plus start_raster_2d / start_band_2d for legacy 2D, with identity-view default written as a null view row. finish_raster validates each band's visible shape against the raster's spatial_shape along the spatial dims. All 33 RS_* functions migrated mechanically; outputs on 2D inputs are byte-identical to apache#787. RS_BandPath keeps its existing inline fragment-stripping (format-agnostic display, untouched by the GDAL parser). Test helpers in sedona-testing rewritten on the N-D builder API.
Reads outdb_uri + parse_outdb_source instead of apache#787's storage_type / outdb_url / outdb_band_id triplet. Each GDAL-backed SQL function gates on BandRef::is_2d() at entry and returns an Execution error on N-D input. VSI normalization, the dataset cache, and RasterIO bodies are byte-for-byte unchanged from apache#787 - only the schema-read sites move. In-db reads use BandRef::contiguous_data() and require Cow::Borrowed so MEM datasets can point at the StructArray's backing buffer without copying; for is_2d identity views this always holds. Tests rebuilt to use RasterBuilder directly. Adds an N-D rejection test for raster_ref_to_gdal_mem and the VRT path, plus an end-to-end #band=2 selection test against a two-band GeoTIFF.
… reader/builder + RS_* migration View-aware Arrow reader (RasterStructArray, BandRefImpl) with corruption- surgery (negative steps, bad source_axis, length mismatch) that round-trips an ArrowError. Builder exposes start_raster / start_band for full N-D plus start_raster_2d / start_band_2d for legacy 2D, with identity-view default written as a null view row. finish_raster validates each band's visible shape against the raster's spatial_shape along the spatial dims. All 33 RS_* functions migrated mechanically; outputs on 2D inputs are byte-identical to apache#787. RS_BandPath keeps its existing inline fragment-stripping (format-agnostic display, untouched by the GDAL parser). Test helpers in sedona-testing rewritten on the N-D builder API.
Reads outdb_uri + parse_outdb_source instead of apache#787's storage_type / outdb_url / outdb_band_id triplet. Each GDAL-backed SQL function gates on BandRef::is_2d() at entry and returns an Execution error on N-D input. VSI normalization, the dataset cache, and RasterIO bodies are byte-for-byte unchanged from apache#787 - only the schema-read sites move. In-db reads use BandRef::contiguous_data() and require Cow::Borrowed so MEM datasets can point at the StructArray's backing buffer without copying; for is_2d identity views this always holds. Tests rebuilt to use RasterBuilder directly. Adds an N-D rejection test for raster_ref_to_gdal_mem and the VRT path, plus an end-to-end #band=2 selection test against a two-band GeoTIFF.
Summary
This PR is the starting point of implementing complex RS_* functions requiring GDAL library.
sedona-raster-gdalcrate and wire it into the workspace as the root GDAL raster foundationRasterExecutorfor later GDAL-backed raster functions while intentionally leaving user-facingRS_*functions and registrations to follow-up PRsNotes