feat(rust/sedona-spatial-join-geography): Implement spatial join for geography type#775
Conversation
e8868ea to
e641d84
Compare
Co-authored-by: Copilot <copilot@github.com>
There was a problem hiding this comment.
Pull request overview
Adds a new Rust crate (sedona-spatial-join-geography) that enables optimized spatial joins for geography (spherical) types by injecting a geography-specific refiner and planner into Sedona’s join planning pipeline.
Changes:
- Introduces a geography spatial-join physical planner + join provider + refiner powered by
sedona-s2geography. - Refactors
sedona-spatial-jointo expose extension points (refiner trait/factory, init-once cache, and join-ordering helpers) needed by external join planners/providers. - Extends distance-handling in operand evaluation to support rectangle expansion by per-row/per-scalar distance values (and reports unsupported behavior for GPU joins).
Reviewed changes
Copilot reviewed 33 out of 34 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/sedona/src/context.rs | Registers the geography spatial-join physical planner when s2geography is enabled. |
| rust/sedona/Cargo.toml | Wires sedona-spatial-join-geography behind the s2geography feature. |
| rust/sedona-spatial-join/src/utils/init_once_array.rs | Makes InitOnceArray public and debuggable for extension crates. |
| rust/sedona-spatial-join/src/utils.rs | Exposes init_once_array module publicly. |
| rust/sedona-spatial-join/src/refine.rs | Makes refiner trait public and introduces a refiner factory abstraction. |
| rust/sedona-spatial-join/src/physical_planner.rs | Exposes join-order and repartition helpers for reuse by other planners. |
| rust/sedona-spatial-join/src/operand_evaluator.rs | Plumbs optional distance ColumnarValue into evaluated-array creation; centralizes rect expansion. |
| rust/sedona-spatial-join/src/lib.rs | Re-exports refiner/index types for extension crates. |
| rust/sedona-spatial-join/src/join_provider.rs | Makes default provider public (for reuse). |
| rust/sedona-spatial-join/src/index/spatial_index_builder.rs | Adds optional create_refiner() hook (default None). |
| rust/sedona-spatial-join/src/index/default_spatial_index_builder.rs | Supports injecting a custom refiner factory; passes refiner into index creation. |
| rust/sedona-spatial-join/src/index/default_spatial_index.rs | Accepts a refiner instance instead of constructing one internally. |
| rust/sedona-spatial-join/src/index.rs | Makes IndexQueryResult/builder module more accessible; re-exports SpatialIndexRef. |
| rust/sedona-spatial-join-gpu/src/physical_planner.rs | Reuses shared join-order heuristic from sedona-spatial-join. |
| rust/sedona-spatial-join-gpu/src/join_provider.rs | Explicitly rejects distance-based rectangle expansion for GPU joins (not yet supported). |
| rust/sedona-spatial-join-geography/src/spatial_index_builder.rs | Wraps the default builder and injects the geography refiner factory. |
| rust/sedona-spatial-join-geography/src/refiner.rs | Implements S2-based predicate refinement (relation + distance) with optional prepared caching. |
| rust/sedona-spatial-join-geography/src/physical_planner.rs | Adds a geography-aware physical planner that only triggers on geography-typed inputs. |
| rust/sedona-spatial-join-geography/src/lib.rs | Defines module layout for the new crate. |
| rust/sedona-spatial-join-geography/src/join_provider.rs | Provides geography evaluated-rect computation via RectBounder, plus builder wiring. |
| rust/sedona-spatial-join-geography/Cargo.toml | Declares the new crate and dependencies. |
| rust/sedona-query-planner/src/spatial_join_physical_planner.rs | Import cleanup to use datafusion_physical_plan::ExecutionPlan. |
| rust/sedona-query-planner/src/query_planner.rs | Import cleanup to use datafusion_physical_plan::ExecutionPlan. |
| rust/sedona-query-planner/src/optimizer.rs | Switches optimizer imports to datafusion_optimizer crate. |
| rust/sedona-query-planner/Cargo.toml | Adds datafusion-optimizer dependency; removes redundant dev-dep on datafusion. |
| python/sedonadb/tests/test_sjoin.py | Removes geography join test from the general sjoin test module. |
| python/sedonadb/tests/geography/test_geog_sjoin.py | Adds dedicated geography sjoin test suite (PostGIS comparisons + regression cases). |
| c/sedona-s2geography/src/s2geography_c_bindgen.rs | Exposes S2GeogRectBounderExpandByDistance via bindings. |
| c/sedona-s2geography/src/rect_bounder.rs | Adds safe wrapper expand_by_distance() + test coverage. |
| c/sedona-s2geography/src/geography.rs | Makes Geography debuggable; makes GeographyFactory::init_from_wkb public. |
| c/sedona-s2geography/build.rs | Uses RelWithDebInfo when building release with debug symbols. |
| Cargo.toml | Adds the new crate to the workspace and adds datafusion-optimizer to workspace deps. |
| Cargo.lock | Locks new crate + dependency changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 34 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <copilot@github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 34 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
zhangfengcdt
left a comment
There was a problem hiding this comment.
LGTM, and some minor comments
| }; | ||
|
|
||
| let result = match distance_columnar_value { | ||
| ColumnarValue::Scalar(ScalarValue::Float64(Some(distance))) => { |
There was a problem hiding this comment.
Do we need to match Int64 type as well? for example, ST_DWithin(geo, geo, 10)? Would be nice to add this type of test (instead of 10.0) as well.
| // partitioned join we may want to ensure we can express bounds in a way that | ||
| // the partitioner understands (if it doesn't already) to do a better job | ||
| // partitioning wraparounds. | ||
| if min_x > max_x { |
There was a problem hiding this comment.
This could make the shape that across antimeridian return a full longitude range, right? Is this because of the limitation of r-tree not being able to process cross-antimeridian geog correctly?
There was a problem hiding this comment.
I added #782 for this one...we basically need to replace the Rect in the evaluated geometry array with something that explicitly handles wraparound to make sure all of those usages don't do something invalid. But yes, for now, this is making the full longitude range from a wraparound to be defensive.
|
|
||
| let num_polygons = build_stats.polygonal_count().unwrap_or(0); | ||
| let num_lines = build_stats.lineal_count().unwrap_or(0); | ||
| (num_polygons + num_lines).try_into().unwrap_or(usize::MAX) * 500 |
There was a problem hiding this comment.
Should it also count the number of vertex in the estimation as well?
There was a problem hiding this comment.
It definitely should. I added a reference to this to the issue for tg where we also have this issue #281 ...we basically need to do some empirical tests with a bunch of geomery types and byte sizes for the libraries.
This PR adds a
sedona-spatial-join-geographycrate similar tosedona-spatial-join-gpu, which adds another join planner that intercepts geography and injects the appropriate refiner to evaluate the predicate on the sphere. Some basic preparedness is implemented but the heuristics aren't tuned. For my go-to join example (big enough that it usually doesn't finish if anything is off), the geography join is still slower than geometry (3 minutes vs 4 seconds) but is still faster than a lot of the other ways to do this ( https://dewey.dunnington.ca/post/2024/wrangling-and-joining-130m-points-with-duckdb--the-open-source-spatial-stack/ ).I did some more detailed profiling here to make sure the slow things were the slow things you might expect (e.g., the bulk of the time is actually spent evaluating a predicate and an index is getting built): https://gist.github.com/paleolimbot/629e96f56e193e89f4f7e5dc1acdee1c
A more practical example is using cities and countries from geoarrow data (DWithin joins work great here!)