Skip to content

[GH-2830] Improve Geography query support - core#2831

Merged
jiayuasu merged 36 commits intoapache:masterfrom
zhangfengcdt:feature/geography.support
Apr 21, 2026
Merged

[GH-2830] Improve Geography query support - core#2831
jiayuasu merged 36 commits intoapache:masterfrom
zhangfengcdt:feature/geography.support

Conversation

@zhangfengcdt
Copy link
Copy Markdown
Member

@zhangfengcdt zhangfengcdt commented Apr 8, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [GH-XXX] my subject. Closes #<issue_number>

What changes were proposed in this PR?

Implements WKB-based Geography serialization (Option B: WKB with Cached S2) and a full set of Geography ST functions.

Core architecture:

  • WKBGeography — stores WKB bytes as primary representation with lazy-parsed JTS, S2, and ShapeIndex caches (double-checked locking for thread safety)
  • GeographyWKBSerializer — WKB serializer with 0xFF format byte, backward-compatible with legacy S2-native format
  • GeographyUDT, implicits.scala, GeometrySerde — switched to WKBSerializer for all serialization paths

Geography functions (3 new):

  • Level 1 (JTS): ST_NPoints
  • Level 2 (JTS + Spheroid): ST_Distance
  • Level 3 (S2): ST_Contains

Docs: API docs for all 3 new functions in docs/api/sql/geography/

Note: Geography-aware spatial join partitioning using S2 cells will be in a separate PR

How was this patch tested?

  • all unit tests pass in common module (WKBGeographyTest, FunctionTest)
  • GeographyFunctionTest.scala — Spark SQL integration tests covering constructors, structural functions, metrics, predicates, DataFrame API, and serialization round-trips

Did this PR include necessary documentation updates?

  • Yes, I have updated the documentation.

@zhangfengcdt zhangfengcdt requested a review from jiayuasu as a code owner April 8, 2026 15:28
@jiayuasu
Copy link
Copy Markdown
Member

jiayuasu commented Apr 9, 2026

@zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR to several small pieces so we can review piece by piece.

@zhangfengcdt zhangfengcdt marked this pull request as draft April 9, 2026 22:12
@paleolimbot
Copy link
Copy Markdown
Member

Cool!

In s2geography there's quite a bit of code to avoid the shape index build at all for ST_Distance() and each individual predicate, which may bring your results closer to on par. Notably, there's the possible intersection check which should help the "point is not in polygon" case.

 - point container return false immediately
 - point target use S2ClosestEdgeQuery instead of building ShapeIndexTarget
@zhangfengcdt
Copy link
Copy Markdown
Member Author

zhangfengcdt commented Apr 16, 2026

Cool!

In s2geography there's quite a bit of code to avoid the shape index build at all for ST_Distance() and each individual predicate, which may bring your results closer to on par. Notably, there's the possible intersection check which should help the "point is not in polygon" case.

Thanks for pointing to the s2geography optimizations. I implemented the following two optimization for now:

  1. point container to false directly early stop
  2. pointTarget for distance: avoids building ShapeIndex for the point side

I also tried to implement the MayIntersect that might help with the contains false case, but looks like the MayIntersects path (~700ns) is more expensive than the cached ShapeIndex path (~220ns) in a single call.

In this PR, I was thinking to optimize the hot path, which may benefit more with cached ShapeIndex (good for joins). For the code path optimizations, adding mayIntersects pre-filters may be better; potentially adding checking covering intersection pre-filtering as well. I'd prefer to add them in following PRs with a configurable flags to avoid the hot path gets slower.

There seems to be tradeoffs we need to make default and also custom sedona configs for users to tune these optimizations since the geography has too much overhead compared to the geometry functions.

What do you think?

@paleolimbot
Copy link
Copy Markdown
Member

That sounds good to me! I the work I did in C++ was mostly to expose all of the possible optimizations but I haven't tuned them yet (and the tuning is probably much different in Java than it is in C++). The tuning/extreme optimization is also much better done in the presence of integration testing against BigQuery/PostGIS (which is in the works on the SedonaDB side). The point optimizations are nice because they truly are easy outs (and point/polygon is I think common).

One of the things that may have affected my initial implementation is that SedonaDB benchmarks mostly have one Scalar argument (where the cost of computing the covering is amortized over many more rows). I didn't implement a S2LatLngRect bound comparison but I think computing that might be cheaper.

@zhangfengcdt
Copy link
Copy Markdown
Member Author

That sounds good to me! I the work I did in C++ was mostly to expose all of the possible optimizations but I haven't tuned them yet (and the tuning is probably much different in Java than it is in C++). The tuning/extreme optimization is also much better done in the presence of integration testing against BigQuery/PostGIS (which is in the works on the SedonaDB side). The point optimizations are nice because they truly are easy outs (and point/polygon is I think common).

One of the things that may have affected my initial implementation is that SedonaDB benchmarks mostly have one Scalar argument (where the cost of computing the covering is amortized over many more rows). I didn't implement a S2LatLngRect bound comparison but I think computing that might be cheaper.

Yes, one scalar case is a very popular case and the ShapeIndex cache and/or covering will help a lot in performance. The join also benefit from the cache and covering.

@zhangfengcdt
Copy link
Copy Markdown
Member Author

@jiayuasu Could you take a look again? I think the core is ready and I'd create a couple of following up PRs for different geography ST functions once this core PR is merged.

I'd also like to add a few optimization configs once all of the st functions are merged so we can measure the performance under different optimizations in all.

Thanks!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements WKB-based serialization for Sedona Geography (with lazy JTS/S2/ShapeIndex caching) and adds initial Geography function support (ST_NPoints, ST_Distance, ST_Contains), along with Spark SQL integration tests and documentation updates.

Changes:

  • Introduces WKBGeography as the primary Geography representation plus GeographyWKBSerializer for backward-compatible serialization.
  • Extends Spark SQL expressions to dual-dispatch between Geometry (JTS) and Geography (S2) for ST_NPoints, ST_Distance, and ST_Contains.
  • Adds/updates unit + integration tests and publishes new SQL API docs for the new Geography functions.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
spark/common/src/test/scala/org/apache/sedona/sql/geography/GeographyFunctionTest.scala New Spark SQL integration tests for Geography constructors + representative L1/L2/L3 functions + serialization round-trips.
spark/common/src/test/scala/org/apache/sedona/sql/geography/FunctionsTest.scala Removes older geography function tests (envelope/EWKT).
spark/common/src/test/scala/org/apache/sedona/sql/geography/ConstructorsTest.scala Updates expected WKTs to reflect S2 normalization behavior (loop orientation/winding).
spark/common/src/test/scala/org/apache/sedona/sql/geography/ConstructorsDataFrameAPITest.scala Updates expected WKT for multipolygon due to S2 normalization behavior.
spark/common/src/test/scala/org/apache/sedona/sql/FunctionResolverSuite.scala Updates resolver test expectations for ambiguity handling.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala Adjusts join predicate detection to accommodate ST_Contains no longer being an ST_Predicate.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/implicits.scala Switches Geography serialization/deserialization to GeographyWKBSerializer.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Predicates.scala Makes ST_Contains dual-dispatch (Geometry + Geography) via InferredExpression.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala Adds Geography overloads for ST_Distance and ST_NPoints.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/FunctionResolver.scala Changes overload resolution behavior for ambiguous matches (prefers first).
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/GeographyUDT.scala Switches UDT serialization/deserialization to GeographyWKBSerializer.
spark/common/src/main/java/org/apache/sedona/core/utils/SedonaConf.java Adds config to enable eager ShapeIndex building at Geography deserialization.
docs/api/sql/geography/Geography-Functions/ST_NPoints.md New docs page for ST_NPoints.
docs/api/sql/geography/Geography-Functions/ST_Distance.md New docs page for ST_Distance.
docs/api/sql/geography/Geography-Functions/ST_Contains.md New docs page for ST_Contains.
docs/api/sql/geography/Geography-Functions.md Updates Geography function index to include the new functions and refresh descriptions.
common/src/test/java/org/apache/sedona/common/S2Geography/WKBGeographyTest.java New unit tests for WKBGeography + serializer behavior and legacy compatibility.
common/src/test/java/org/apache/sedona/common/Geography/FunctionTest.java Extends common geography function tests to cover nPoints/distance/contains and refactors envelope tests.
common/src/main/java/org/apache/sedona/common/geometrySerde/GeometrySerde.java Routes Geography serialization through GeographyWKBSerializer.
common/src/main/java/org/apache/sedona/common/geography/Functions.java Adds Geography implementations for nPoints/distance/contains plus helper conversions.
common/src/main/java/org/apache/sedona/common/geography/Constructors.java Updates geography constructors/converters to produce WKBGeography and adds EWKB SRID extraction.
common/src/main/java/org/apache/sedona/common/S2Geography/WkbS2Shape.java New lightweight S2Shape backed by WKB bytes for faster index building on simple types.
common/src/main/java/org/apache/sedona/common/S2Geography/WKBGeography.java New Geography implementation storing WKB bytes with lazy JTS/S2/ShapeIndex caches.
common/src/main/java/org/apache/sedona/common/S2Geography/GeographyWKBSerializer.java New WKB-based serializer with 0xFF marker and legacy S2-native fallback.
common/src/main/java/org/apache/sedona/common/S2Geography/Distance.java Adds a point-to-index distance helper using S2ClosestEdgeQuery.PointTarget.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread common/src/main/java/org/apache/sedona/common/S2Geography/WKBGeography.java Outdated
Comment thread docs/api/sql/geography/Geography-Functions/ST_Contains.md Outdated
Comment thread common/src/main/java/org/apache/sedona/common/geography/Functions.java Outdated
Comment thread docs/api/sql/geography/Geography-Functions/ST_Contains.md
@jiayuasu
Copy link
Copy Markdown
Member

There is a bug in the geography test helper which never actually validates the correctness of the code: #2810

Is it still true in this PR?

@zhangfengcdt
Copy link
Copy Markdown
Member Author

There is a bug in the geography test helper which never actually validates the correctness of the code: #2810

Is it still true in this PR?

Yes, the bug was still present. I just fixed it in this PR.

The TestHelper.java now shows "compareTo(geoWKB, geoWKT)" and compareTo — added early-return for same-class empty geometries before the kind comparison.

extraCondition: Option[Expression] = None): Option[JoinQueryDetection] = {
predicate match {
case ST_Contains(Seq(leftShape, rightShape)) =>
case ST_Intersects(Seq(leftShape, rightShape)) =>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does ST_Contains get changed here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And other predicates do not seem to change? @zhangfengcdt

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the ST_Contains moved out of getJoinDetection because it was converted to InferredExpression in this PR (dual-dispatch JTS + S2 Geography). That removed it from the ST_Predicate class hierarchy, so it no longer is type-checked.

Once we do the same geography function changes for other predicates, we will move them as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants