Skip to content

feat(rust/sedona-spatial-join-geography): Implement spatial join for geography type#775

Merged
paleolimbot merged 38 commits into
apache:mainfrom
paleolimbot:geography-join
Apr 27, 2026
Merged

feat(rust/sedona-spatial-join-geography): Implement spatial join for geography type#775
paleolimbot merged 38 commits into
apache:mainfrom
paleolimbot:geography-join

Conversation

@paleolimbot
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot commented Apr 20, 2026

This PR adds a sedona-spatial-join-geography crate similar to sedona-spatial-join-gpu, which adds another join planner that intercepts geography and injects the appropriate refiner to evaluate the predicate on the sphere. Some basic preparedness is implemented but the heuristics aren't tuned. For my go-to join example (big enough that it usually doesn't finish if anything is off), the geography join is still slower than geometry (3 minutes vs 4 seconds) but is still faster than a lot of the other ways to do this ( https://dewey.dunnington.ca/post/2024/wrangling-and-joining-130m-points-with-duckdb--the-open-source-spatial-stack/ ).

import sedona.db

sd = sedona.db.connect()
sd.options.memory_limit = "unlimited"
sd.options.interactive = True

sd.read_parquet(
    "submodules/geoarrow-data/microsoft-buildings/files/microsoft-buildings_point.parquet"
).to_view("buildings")
sd.read_parquet(
    "https://github.com/paleolimbot/2025-02-25_duckdb-geography-geopython/raw/refs/heads/main/us-zip-codes.parquet"
).to_memtable().to_view("zipcodes")

# Regular geometry join
# 3.9s
df = sd.sql("""
SELECT
    zipcodes.zipcode,
    COUNT(*) AS building_count
FROM buildings
JOIN zipcodes
  ON ST_Contains(zipcodes.geom, buildings.geometry)
GROUP BY zipcodes.zipcode
ORDER BY building_count DESC
""").to_memtable()

res = df.to_memtable()
res.show(5)
# ┌─────────┬────────────────┐
# │ zipcode ┆ building_count │
# │   utf8  ┆      int64     │
# ╞═════════╪════════════════╡
# │ 60629   ┆          39767 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 77449   ┆          38128 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 77494   ┆          37960 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 60617   ┆          37125 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 32162   ┆          34956 │
# └─────────┴────────────────┘

# Make a geography version of this
sd.sql(
    "SELECT ST_GeogFromWKB(ST_AsBinary(geometry)) as geometry FROM buildings"
).to_view("buildings_geog", overwrite=True)
sd.sql(
    "SELECT zipcode, ST_GeogFromWKB(ST_AsBinary(geom)) as geom FROM zipcodes"
).to_view("zipcodes_geog", overwrite=True)

df = sd.sql("""
SELECT
    zipcodes_geog.zipcode,
    COUNT(*) AS building_count
FROM buildings_geog
JOIN zipcodes_geog
  ON ST_Contains(zipcodes_geog.geom, buildings_geog.geometry)
GROUP BY zipcodes_geog.zipcode
ORDER BY building_count DESC
""")

# 3m 18s for geography
res = df.to_memtable()
res.show(5)
# ┌─────────┬────────────────┐
# │ zipcode ┆ building_count │
# │   utf8  ┆      int64     │
# ╞═════════╪════════════════╡
# │ 60629   ┆          39767 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 77449   ┆          38128 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 77494   ┆          37959 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 60617   ┆          37125 │
# ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ 32162   ┆          34956 │
# └─────────┴────────────────┘

I did some more detailed profiling here to make sure the slow things were the slow things you might expect (e.g., the bulk of the time is actually spent evaluating a predicate and an index is getting built): https://gist.github.com/paleolimbot/629e96f56e193e89f4f7e5dc1acdee1c

A more practical example is using cities and countries from geoarrow data (DWithin joins work great here!)

import sedona.db

sd = sedona.db.connect()

url = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries-geography.parquet"
sd.read_parquet(url).to_view("countries")

url = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_cities.parquet"
sd.read_parquet(url).to_view("cities")

sd.sql("""
SELECT cities.name, countries.name AS country_name, cities.geometry
FROM cities
INNER JOIN countries ON ST_Contains(countries.geometry, ST_GeogFromWKB(ST_AsBinary(cities.geometry)))
""").show(3)
# ┌──────────────┬──────────────┬──────────────────────────────┐
# │     name     ┆ country_name ┆           geometry           │
# │     utf8     ┆     utf8     ┆           geometry           │
# ╞══════════════╪══════════════╪══════════════════════════════╡
# │ Vatican City ┆ Italy        ┆ POINT(12.4533865 41.9032822) │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ San Marino   ┆ Italy        ┆ POINT(12.4417702 43.9360958) │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ Vaduz        ┆ Austria      ┆ POINT(9.5166695 47.1337238)  │
# └──────────────┴──────────────┴──────────────────────────────┘

@github-actions github-actions Bot requested a review from zhangfengcdt April 20, 2026 19:36
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Rust crate (sedona-spatial-join-geography) that enables optimized spatial joins for geography (spherical) types by injecting a geography-specific refiner and planner into Sedona’s join planning pipeline.

Changes:

  • Introduces a geography spatial-join physical planner + join provider + refiner powered by sedona-s2geography.
  • Refactors sedona-spatial-join to expose extension points (refiner trait/factory, init-once cache, and join-ordering helpers) needed by external join planners/providers.
  • Extends distance-handling in operand evaluation to support rectangle expansion by per-row/per-scalar distance values (and reports unsupported behavior for GPU joins).

Reviewed changes

Copilot reviewed 33 out of 34 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
rust/sedona/src/context.rs Registers the geography spatial-join physical planner when s2geography is enabled.
rust/sedona/Cargo.toml Wires sedona-spatial-join-geography behind the s2geography feature.
rust/sedona-spatial-join/src/utils/init_once_array.rs Makes InitOnceArray public and debuggable for extension crates.
rust/sedona-spatial-join/src/utils.rs Exposes init_once_array module publicly.
rust/sedona-spatial-join/src/refine.rs Makes refiner trait public and introduces a refiner factory abstraction.
rust/sedona-spatial-join/src/physical_planner.rs Exposes join-order and repartition helpers for reuse by other planners.
rust/sedona-spatial-join/src/operand_evaluator.rs Plumbs optional distance ColumnarValue into evaluated-array creation; centralizes rect expansion.
rust/sedona-spatial-join/src/lib.rs Re-exports refiner/index types for extension crates.
rust/sedona-spatial-join/src/join_provider.rs Makes default provider public (for reuse).
rust/sedona-spatial-join/src/index/spatial_index_builder.rs Adds optional create_refiner() hook (default None).
rust/sedona-spatial-join/src/index/default_spatial_index_builder.rs Supports injecting a custom refiner factory; passes refiner into index creation.
rust/sedona-spatial-join/src/index/default_spatial_index.rs Accepts a refiner instance instead of constructing one internally.
rust/sedona-spatial-join/src/index.rs Makes IndexQueryResult/builder module more accessible; re-exports SpatialIndexRef.
rust/sedona-spatial-join-gpu/src/physical_planner.rs Reuses shared join-order heuristic from sedona-spatial-join.
rust/sedona-spatial-join-gpu/src/join_provider.rs Explicitly rejects distance-based rectangle expansion for GPU joins (not yet supported).
rust/sedona-spatial-join-geography/src/spatial_index_builder.rs Wraps the default builder and injects the geography refiner factory.
rust/sedona-spatial-join-geography/src/refiner.rs Implements S2-based predicate refinement (relation + distance) with optional prepared caching.
rust/sedona-spatial-join-geography/src/physical_planner.rs Adds a geography-aware physical planner that only triggers on geography-typed inputs.
rust/sedona-spatial-join-geography/src/lib.rs Defines module layout for the new crate.
rust/sedona-spatial-join-geography/src/join_provider.rs Provides geography evaluated-rect computation via RectBounder, plus builder wiring.
rust/sedona-spatial-join-geography/Cargo.toml Declares the new crate and dependencies.
rust/sedona-query-planner/src/spatial_join_physical_planner.rs Import cleanup to use datafusion_physical_plan::ExecutionPlan.
rust/sedona-query-planner/src/query_planner.rs Import cleanup to use datafusion_physical_plan::ExecutionPlan.
rust/sedona-query-planner/src/optimizer.rs Switches optimizer imports to datafusion_optimizer crate.
rust/sedona-query-planner/Cargo.toml Adds datafusion-optimizer dependency; removes redundant dev-dep on datafusion.
python/sedonadb/tests/test_sjoin.py Removes geography join test from the general sjoin test module.
python/sedonadb/tests/geography/test_geog_sjoin.py Adds dedicated geography sjoin test suite (PostGIS comparisons + regression cases).
c/sedona-s2geography/src/s2geography_c_bindgen.rs Exposes S2GeogRectBounderExpandByDistance via bindings.
c/sedona-s2geography/src/rect_bounder.rs Adds safe wrapper expand_by_distance() + test coverage.
c/sedona-s2geography/src/geography.rs Makes Geography debuggable; makes GeographyFactory::init_from_wkb public.
c/sedona-s2geography/build.rs Uses RelWithDebInfo when building release with debug symbols.
Cargo.toml Adds the new crate to the workspace and adds datafusion-optimizer to workspace deps.
Cargo.lock Locks new crate + dependency changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rust/sedona-spatial-join-geography/src/refiner.rs Outdated
Comment thread rust/sedona-spatial-join-geography/src/refiner.rs Outdated
Comment thread rust/sedona/src/context.rs Outdated
Comment thread rust/sedona-spatial-join-geography/src/join_provider.rs Outdated
Comment thread rust/sedona-spatial-join-geography/src/join_provider.rs
Comment thread rust/sedona-spatial-join-geography/src/refiner.rs
Comment thread rust/sedona-spatial-join-geography/src/refiner.rs Outdated
Comment thread rust/sedona/src/context.rs Outdated
Comment thread rust/sedona-spatial-join-geography/src/refiner.rs Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rust/sedona-spatial-join/src/operand_evaluator.rs Outdated
Comment thread rust/sedona-spatial-join/src/operand_evaluator.rs
Comment thread rust/sedona-spatial-join/src/index/default_spatial_index_builder.rs Outdated
Comment thread rust/sedona-spatial-join/src/utils/init_once_array.rs
Comment thread rust/sedona-spatial-join-geography/src/refiner.rs
paleolimbot and others added 2 commits April 23, 2026 16:32
Co-authored-by: Copilot <copilot@github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rust/sedona-spatial-join-geography/src/refiner.rs
Comment thread rust/sedona-spatial-join-geography/src/refiner.rs
Comment thread rust/sedona-spatial-join-geography/src/join_provider.rs
Comment thread rust/sedona/Cargo.toml
Comment thread rust/sedona-spatial-join-geography/src/physical_planner.rs Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member

@zhangfengcdt zhangfengcdt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and some minor comments

};

let result = match distance_columnar_value {
ColumnarValue::Scalar(ScalarValue::Float64(Some(distance))) => {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to match Int64 type as well? for example, ST_DWithin(geo, geo, 10)? Would be nice to add this type of test (instead of 10.0) as well.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

// partitioned join we may want to ensure we can express bounds in a way that
// the partitioner understands (if it doesn't already) to do a better job
// partitioning wraparounds.
if min_x > max_x {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could make the shape that across antimeridian return a full longitude range, right? Is this because of the limitation of r-tree not being able to process cross-antimeridian geog correctly?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added #782 for this one...we basically need to replace the Rect in the evaluated geometry array with something that explicitly handles wraparound to make sure all of those usages don't do something invalid. But yes, for now, this is making the full longitude range from a wraparound to be defensive.


let num_polygons = build_stats.polygonal_count().unwrap_or(0);
let num_lines = build_stats.lineal_count().unwrap_or(0);
(num_polygons + num_lines).try_into().unwrap_or(usize::MAX) * 500
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it also count the number of vertex in the estimation as well?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely should. I added a reference to this to the issue for tg where we also have this issue #281 ...we basically need to do some empirical tests with a bunch of geomery types and byte sizes for the libraries.

@paleolimbot paleolimbot merged commit c25e70a into apache:main Apr 27, 2026
21 checks passed
@paleolimbot paleolimbot deleted the geography-join branch April 27, 2026 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants