Skip to content

feat(sedona-raster-gdal) add GDAL foundation library for supporting GDAL-based RS functions#787

Merged
Kontinuation merged 7 commits intoapache:mainfrom
Kontinuation:pr-b-raster-gdal-foundation
May 4, 2026
Merged

feat(sedona-raster-gdal) add GDAL foundation library for supporting GDAL-based RS functions#787
Kontinuation merged 7 commits intoapache:mainfrom
Kontinuation:pr-b-raster-gdal-foundation

Conversation

@Kontinuation
Copy link
Copy Markdown
Member

@Kontinuation Kontinuation commented Apr 26, 2026

Summary

This PR is the starting point of implementing complex RS_* functions requiring GDAL library.

  • add the new sedona-raster-gdal crate and wire it into the workspace as the root GDAL raster foundation
  • introduce the core GDAL conversion and provider layers for in-db, out-db, and mixed rasters, plus GDAL session options
  • expose RasterExecutor for later GDAL-backed raster functions while intentionally leaving user-facing RS_* functions and registrations to follow-up PRs

Notes

  • this PR is the root of the planned stacked raster PR series

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new Rust workspace crate, sedona-raster-gdal, establishing the foundational GDAL integration layer for SedonaDB raster types (dataset conversion, out-db source handling, and GDAL session/config plumbing) to support later GDAL-backed raster functions.

Changes:

  • Add sedona-raster-gdal crate to the workspace and wire it into the main sedona crate dependencies.
  • Implement GDAL dataset/provider foundations: MEM conversion for in-db bands, VRT construction for out-db/mixed rasters, and URL-to-VSI normalization helpers.
  • Expose RasterExecutor (via sedona-raster-functions) and add Sedona config options for GDAL (sedona.gdal.cpl_debug).

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
rust/sedona/Cargo.toml Adds sedona-raster-gdal as a dependency of the main crate.
rust/sedona-raster-gdal/Cargo.toml Defines the new GDAL raster foundation crate and dependencies.
rust/sedona-raster-gdal/src/lib.rs Sets up module structure and re-exports key conversion utilities.
rust/sedona-raster-gdal/src/utils.rs Adds helper to load a GDAL-openable raster path into an in-db raster StructArray.
rust/sedona-raster-gdal/src/gdal_common.rs Adds shared GDAL helpers (type/nodata conversions, dataset open, URL normalization) + unit tests.
rust/sedona-raster-gdal/src/gdal_dataset_provider.rs Implements dataset provider + thread-local caching + VRT windowing/alignment logic + tests.
rust/sedona-raster-functions/src/lib.rs Re-exports RasterExecutor for external use.
rust/sedona-raster-functions/src/executor.rs Makes new_with_num_iterations available outside tests.
rust/sedona-common/src/option.rs Adds SedonaOptions.gdal namespace with cpl_debug option.
Cargo.toml Adds the new crate to workspace members and workspace dependencies.
Cargo.lock Updates lockfile for the new crate.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rust/sedona-raster-gdal/src/gdal_dataset_provider.rs Outdated
Comment thread rust/sedona-raster-gdal/src/gdal_common.rs Outdated
Comment thread rust/sedona-raster-gdal/src/gdal_dataset_provider.rs Outdated
Comment thread rust/sedona-raster-gdal/src/gdal_dataset_provider.rs Outdated
kontinuation added 3 commits April 27, 2026 22:50
Share in-db raster builders through sedona-testing so GDAL raster tests reuse the same fixture construction path. Remove redundant GDAL driver initialization from the tests now that setup is handled consistently.
@Kontinuation Kontinuation changed the title feat(raster): add GDAL raster foundation feat(sedona-raster-gdal) add GDAL foundation library for supporting GDAL-based RS functions Apr 28, 2026
@Kontinuation Kontinuation marked this pull request as ready for review April 29, 2026 16:51
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! The conversion to/from GDAL Dataset is huge. I'm happy to file follow-ups for any of my comments here.

cc @james-willis

Comment on lines +308 to +312
/// Normalize out-db raster URLs to GDAL VSI paths.
///
/// Supported translations:
/// - `s3://bucket/key` -> `/vsis3/bucket/key`
/// - `s3a://bucket/key` -> `/vsis3/bucket/key`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to do it here, but {rest}.zip -> /vsizip/{rest} is also useful.

Comment on lines +91 to +98
pub(crate) fn configure_thread_local_options(
gdal: &Gdal,
config_options: Option<&ConfigOptions>,
) -> Result<()> {
// Set frequently requested GDAL config options as thread-local options to eliminate the
// need for acquiring configs from global config or environment variable, which is very
// likely to result in heavy contention in multi-threaded environments.
let cpl_debug_enabled = config_options
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably a good pattern to always do this within some context like with_thread_local_options() that takes care to reset the values on exit, particularly when borrowing the shared object from something like rasterio which may be doing the same thing. During execution we'll often have our own threads from tokio but I think constant folding happens on the main Python thread. We can follow up with this if it's difficult to implement here.

Comment on lines +123 to +129
/// Get or create the thread-local `GDALDatasetCache`.
pub(crate) fn thread_local_cache() -> Result<Rc<GDALDatasetCache>> {
TL_GDAL_DATASET_CACHE.with(|cell| {
let mut opt = cell.borrow_mut();
if let Some(rc) = opt.as_ref() {
Ok(Rc::clone(rc))
} else {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope there is a future where we can move some of these caches to be session scoped (this feels like it could potentially be large in some cases).

Comment on lines +127 to +128
BandDataType::UInt64 => read_le_f64!(u64, 8),
BandDataType::Int64 => read_le_f64!(i64, 8),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove these two since they're lossy and you handle/test them separately via the dedicated gdal setters?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have patched these arms to return Err to inform the developer to handle these types specially.

Comment on lines +28 to +31
#[allow(dead_code)]
mod gdal_common;
#[allow(dead_code)]
mod gdal_dataset_provider;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need these after the pub use? If you do perhaps add a comment with a link to a PR or issue that will remove it (or just make it pub).

Suggested change
#[allow(dead_code)]
mod gdal_common;
#[allow(dead_code)]
mod gdal_dataset_provider;
mod gdal_common;
mod gdal_dataset_provider;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #804 and added comments

Comment thread rust/sedona-raster-gdal/src/utils.rs Outdated
Comment on lines +35 to +42
/// Load a raster from any GDAL-openable path as an in-db raster `StructArray`.
///
/// The `path` can be a regular file path, a `/vsimem/` memory path,
/// a `/vsicurl/` URL, or any other GDAL virtual filesystem path.
pub fn load_as_indb_raster(gdal: &Gdal, path: &str) -> Result<StructArray> {
// Open dataset from path
let dataset = gdal
.open_ex_with_options(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we file a follow-up issue to add tests for this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was intended to move it to a dedicated patch. I'll remove it from this patch.

kontinuation added 3 commits April 30, 2026 20:34
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@Kontinuation Kontinuation merged commit 06b22a2 into apache:main May 4, 2026
17 checks passed
james-willis added a commit to james-willis/sedona-db that referenced this pull request May 4, 2026
Rewrite the GDAL loader and dataset provider to read from the N-D
band schema instead of apache#787's storage_type / outdb_url / outdb_band_id
triplet. Each GDAL-backed read site now gates on `BandRef::is_2d()`
and returns a clear `Execution` error on N-D input; multi-band source
selection routes through `parse_outdb_source(outdb_uri)` so the
`#band=N` SedonaDB convention and GDAL native subdataset URIs both
work transparently.

VSI normalization, the dataset cache, and the RasterIO bodies are
byte-for-byte unchanged — only the schema-read sites move. In-db
reads use `BandRef::contiguous_data()` and require `Cow::Borrowed`
so MEM datasets can point at the StructArray's backing buffer
without copying; for `is_2d` identity views this always holds.

Tests rebuilt to use `RasterBuilder` directly. Adds an N-D rejection
test for `raster_ref_to_gdal_mem` and the VRT path, plus an
end-to-end `#band=2` selection test against a two-band GeoTIFF.
james-willis added a commit to james-willis/sedona-db that referenced this pull request May 4, 2026
Replaces apache#787's 2D-only band schema with the canonical N-D schema:
spatial_dims/spatial_shape at the raster level; bands carry dim_names,
source_shape, nullable view, outdb_uri, outdb_format, plus the
non-nullable data buffer. Removes nodata_value, storage_type,
outdb_url, and outdb_band_id - every one is encodable in the new
schema:

- storage_type ↔ outdb_uri.is_null() (null = InDb, set = OutDbRef).
- outdb_url ↔ outdb_uri (no rename, same string).
- outdb_band_id ↔ encoded inside outdb_uri (#band=N or GDAL native
  subdataset URI), parsed only inside the GDAL format driver.
- nodata_value ↔ typed nodata: Binary (a null row means "no nodata").

Top-level adds spatial_dims: List<Utf8View> and spatial_shape:
List<Int64>; nullable view is List<Struct<source_axis, start, step,
steps: Int64>> where a null row encodes the canonical identity view.

Note: intermediate commits in this PR are not expected to build; only
the PR tip is CI-green. The trait, reader/builder, RS_* migration,
and GDAL loader port land in subsequent commits.
james-willis added a commit to james-willis/sedona-db that referenced this pull request May 4, 2026
… reader/builder + RS_* migration

View-aware Arrow reader (RasterStructArray, BandRefImpl) with corruption-
surgery (negative steps, bad source_axis, length mismatch) that
round-trips an ArrowError. Builder exposes start_raster / start_band
for full N-D plus start_raster_2d / start_band_2d for legacy 2D, with
identity-view default written as a null view row. finish_raster
validates each band's visible shape against the raster's spatial_shape
along the spatial dims.

All 33 RS_* functions migrated mechanically; outputs on 2D inputs are
byte-identical to apache#787. RS_BandPath keeps its existing inline
fragment-stripping (format-agnostic display, untouched by the GDAL
parser). Test helpers in sedona-testing rewritten on the N-D builder
API.
james-willis added a commit to james-willis/sedona-db that referenced this pull request May 4, 2026
Reads outdb_uri + parse_outdb_source instead of apache#787's storage_type /
outdb_url / outdb_band_id triplet. Each GDAL-backed SQL function gates
on BandRef::is_2d() at entry and returns an Execution error on N-D
input. VSI normalization, the dataset cache, and RasterIO bodies are
byte-for-byte unchanged from apache#787 - only the schema-read sites move.

In-db reads use BandRef::contiguous_data() and require Cow::Borrowed
so MEM datasets can point at the StructArray's backing buffer without
copying; for is_2d identity views this always holds.

Tests rebuilt to use RasterBuilder directly. Adds an N-D rejection
test for raster_ref_to_gdal_mem and the VRT path, plus an end-to-end
#band=2 selection test against a two-band GeoTIFF.
james-willis added a commit to james-willis/sedona-db that referenced this pull request May 5, 2026
… reader/builder + RS_* migration

View-aware Arrow reader (RasterStructArray, BandRefImpl) with corruption-
surgery (negative steps, bad source_axis, length mismatch) that
round-trips an ArrowError. Builder exposes start_raster / start_band
for full N-D plus start_raster_2d / start_band_2d for legacy 2D, with
identity-view default written as a null view row. finish_raster
validates each band's visible shape against the raster's spatial_shape
along the spatial dims.

All 33 RS_* functions migrated mechanically; outputs on 2D inputs are
byte-identical to apache#787. RS_BandPath keeps its existing inline
fragment-stripping (format-agnostic display, untouched by the GDAL
parser). Test helpers in sedona-testing rewritten on the N-D builder
API.
james-willis added a commit to james-willis/sedona-db that referenced this pull request May 5, 2026
Reads outdb_uri + parse_outdb_source instead of apache#787's storage_type /
outdb_url / outdb_band_id triplet. Each GDAL-backed SQL function gates
on BandRef::is_2d() at entry and returns an Execution error on N-D
input. VSI normalization, the dataset cache, and RasterIO bodies are
byte-for-byte unchanged from apache#787 - only the schema-read sites move.

In-db reads use BandRef::contiguous_data() and require Cow::Borrowed
so MEM datasets can point at the StructArray's backing buffer without
copying; for is_2d identity views this always holds.

Tests rebuilt to use RasterBuilder directly. Adds an N-D rejection
test for raster_ref_to_gdal_mem and the VRT path, plus an end-to-end
#band=2 selection test against a two-band GeoTIFF.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants