[GH-2830] Improve Geography query support - core#2831
[GH-2830] Improve Geography query support - core#2831jiayuasu merged 36 commits intoapache:masterfrom
Conversation
…ation with lazy-parsed JTS and S2 caches. This enables GeoArrow-compatible serialization while maintaining full S2 functionality on demand.
- Functions.distance(Geography, Geography) → Spheroid.distance() — returns meters - Functions.area(Geography) → Spheroid.area() — returns m² - Functions.length(Geography) → Spheroid.length() — returns meters - All route through WKBGeography.getJTSGeometry() (no S2 parse needed)
also, added getShapeIndexGeography() as a third lazy cache
- null type ambiguity - level 1 functions missing Geography dispatch
…hole winding order
|
@zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR to several small pieces so we can review piece by piece. |
|
Cool! In s2geography there's quite a bit of code to avoid the shape index build at all for ST_Distance() and each individual predicate, which may bring your results closer to on par. Notably, there's the possible intersection check which should help the "point is not in polygon" case. |
- point container return false immediately - point target use S2ClosestEdgeQuery instead of building ShapeIndexTarget
Thanks for pointing to the s2geography optimizations. I implemented the following two optimization for now:
I also tried to implement the MayIntersect that might help with the contains false case, but looks like the MayIntersects path (~700ns) is more expensive than the cached ShapeIndex path (~220ns) in a single call. In this PR, I was thinking to optimize the hot path, which may benefit more with cached ShapeIndex (good for joins). For the code path optimizations, adding mayIntersects pre-filters may be better; potentially adding checking covering intersection pre-filtering as well. I'd prefer to add them in following PRs with a configurable flags to avoid the hot path gets slower. There seems to be tradeoffs we need to make default and also custom sedona configs for users to tune these optimizations since the geography has too much overhead compared to the geometry functions. What do you think? |
|
That sounds good to me! I the work I did in C++ was mostly to expose all of the possible optimizations but I haven't tuned them yet (and the tuning is probably much different in Java than it is in C++). The tuning/extreme optimization is also much better done in the presence of integration testing against BigQuery/PostGIS (which is in the works on the SedonaDB side). The point optimizations are nice because they truly are easy outs (and point/polygon is I think common). One of the things that may have affected my initial implementation is that SedonaDB benchmarks mostly have one Scalar argument (where the cost of computing the covering is amortized over many more rows). I didn't implement a S2LatLngRect bound comparison but I think computing that might be cheaper. |
Yes, one scalar case is a very popular case and the ShapeIndex cache and/or covering will help a lot in performance. The join also benefit from the cache and covering. |
|
@jiayuasu Could you take a look again? I think the core is ready and I'd create a couple of following up PRs for different geography ST functions once this core PR is merged. I'd also like to add a few optimization configs once all of the st functions are merged so we can measure the performance under different optimizations in all. Thanks! |
There was a problem hiding this comment.
Pull request overview
Implements WKB-based serialization for Sedona Geography (with lazy JTS/S2/ShapeIndex caching) and adds initial Geography function support (ST_NPoints, ST_Distance, ST_Contains), along with Spark SQL integration tests and documentation updates.
Changes:
- Introduces
WKBGeographyas the primary Geography representation plusGeographyWKBSerializerfor backward-compatible serialization. - Extends Spark SQL expressions to dual-dispatch between Geometry (JTS) and Geography (S2) for
ST_NPoints,ST_Distance, andST_Contains. - Adds/updates unit + integration tests and publishes new SQL API docs for the new Geography functions.
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| spark/common/src/test/scala/org/apache/sedona/sql/geography/GeographyFunctionTest.scala | New Spark SQL integration tests for Geography constructors + representative L1/L2/L3 functions + serialization round-trips. |
| spark/common/src/test/scala/org/apache/sedona/sql/geography/FunctionsTest.scala | Removes older geography function tests (envelope/EWKT). |
| spark/common/src/test/scala/org/apache/sedona/sql/geography/ConstructorsTest.scala | Updates expected WKTs to reflect S2 normalization behavior (loop orientation/winding). |
| spark/common/src/test/scala/org/apache/sedona/sql/geography/ConstructorsDataFrameAPITest.scala | Updates expected WKT for multipolygon due to S2 normalization behavior. |
| spark/common/src/test/scala/org/apache/sedona/sql/FunctionResolverSuite.scala | Updates resolver test expectations for ambiguity handling. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala | Adjusts join predicate detection to accommodate ST_Contains no longer being an ST_Predicate. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/implicits.scala | Switches Geography serialization/deserialization to GeographyWKBSerializer. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Predicates.scala | Makes ST_Contains dual-dispatch (Geometry + Geography) via InferredExpression. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala | Adds Geography overloads for ST_Distance and ST_NPoints. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/FunctionResolver.scala | Changes overload resolution behavior for ambiguous matches (prefers first). |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/GeographyUDT.scala | Switches UDT serialization/deserialization to GeographyWKBSerializer. |
| spark/common/src/main/java/org/apache/sedona/core/utils/SedonaConf.java | Adds config to enable eager ShapeIndex building at Geography deserialization. |
| docs/api/sql/geography/Geography-Functions/ST_NPoints.md | New docs page for ST_NPoints. |
| docs/api/sql/geography/Geography-Functions/ST_Distance.md | New docs page for ST_Distance. |
| docs/api/sql/geography/Geography-Functions/ST_Contains.md | New docs page for ST_Contains. |
| docs/api/sql/geography/Geography-Functions.md | Updates Geography function index to include the new functions and refresh descriptions. |
| common/src/test/java/org/apache/sedona/common/S2Geography/WKBGeographyTest.java | New unit tests for WKBGeography + serializer behavior and legacy compatibility. |
| common/src/test/java/org/apache/sedona/common/Geography/FunctionTest.java | Extends common geography function tests to cover nPoints/distance/contains and refactors envelope tests. |
| common/src/main/java/org/apache/sedona/common/geometrySerde/GeometrySerde.java | Routes Geography serialization through GeographyWKBSerializer. |
| common/src/main/java/org/apache/sedona/common/geography/Functions.java | Adds Geography implementations for nPoints/distance/contains plus helper conversions. |
| common/src/main/java/org/apache/sedona/common/geography/Constructors.java | Updates geography constructors/converters to produce WKBGeography and adds EWKB SRID extraction. |
| common/src/main/java/org/apache/sedona/common/S2Geography/WkbS2Shape.java | New lightweight S2Shape backed by WKB bytes for faster index building on simple types. |
| common/src/main/java/org/apache/sedona/common/S2Geography/WKBGeography.java | New Geography implementation storing WKB bytes with lazy JTS/S2/ShapeIndex caches. |
| common/src/main/java/org/apache/sedona/common/S2Geography/GeographyWKBSerializer.java | New WKB-based serializer with 0xFF marker and legacy S2-native fallback. |
| common/src/main/java/org/apache/sedona/common/S2Geography/Distance.java | Adds a point-to-index distance helper using S2ClosestEdgeQuery.PointTarget. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
There is a bug in the geography test helper which never actually validates the correctness of the code: #2810 Is it still true in this PR? |
Yes, the bug was still present. I just fixed it in this PR. The TestHelper.java now shows "compareTo(geoWKB, geoWKT)" and compareTo — added early-return for same-class empty geometries before the kind comparison. |
| extraCondition: Option[Expression] = None): Option[JoinQueryDetection] = { | ||
| predicate match { | ||
| case ST_Contains(Seq(leftShape, rightShape)) => | ||
| case ST_Intersects(Seq(leftShape, rightShape)) => |
There was a problem hiding this comment.
Why does ST_Contains get changed here?
There was a problem hiding this comment.
And other predicates do not seem to change? @zhangfengcdt
There was a problem hiding this comment.
Yes, the ST_Contains moved out of getJoinDetection because it was converted to InferredExpression in this PR (dual-dispatch JTS + S2 Geography). That removed it from the ST_Predicate class hierarchy, so it no longer is type-checked.
Once we do the same geography function changes for other predicates, we will move them as well.
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes #<issue_number>What changes were proposed in this PR?
Implements WKB-based Geography serialization (Option B: WKB with Cached S2) and a full set of Geography ST functions.
Core architecture:
Geography functions (3 new):
Docs: API docs for all 3 new functions in docs/api/sql/geography/
Note: Geography-aware spatial join partitioning using S2 cells will be in a separate PR
How was this patch tested?
Did this PR include necessary documentation updates?