feat: add 40 native geospatial SQL functions (ST_ expressions)#4423
feat: add 40 native geospatial SQL functions (ST_ expressions)#4423sp-202 wants to merge 69 commits into
Conversation
…T_Within, ST_Area, ST_Centroid) Implements six Sedona-compatible spatial functions as native DataFusion ScalarUDFs backed by the pure-Rust `geo` crate, replacing JVM round-trips for geometry predicates and measurements. - Add geo/wkt/geos crate dependencies to native/core/Cargo.toml - Implement StContains, StIntersects, StWithin, StDistance, StArea, StCentroid as ScalarUDFImpl in native/core/src/execution/expressions/geo/ - Register all six UDFs with the DataFusion SessionContext in jni_api.rs - Add Scala serde (geo.scala) that reflectively maps Sedona ST_ expression classes to the corresponding ScalarFunc proto at runtime; Comet compiles cleanly without Sedona on the classpath - Wire geoExpressions into exprSerdeMap in QueryPlanSerde.scala Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split long import line in geo.scala (col 110 violation) - Wrap Class.forName with scalastyle:off/on classforname guard - Replace non-ASCII em-dashes with hyphens in comments Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Revert multi-line import to single line, rewrap doc comments to match the project's spotless/scalafmt rules, and fix scalastyle:on guard indentation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spark's analyzer cannot resolve st_contains etc. unless they are registered in the Spark function catalog. Add CometGeoFunctions, an injectResolutionRule that registers all six geo UDFs as ScalaUDFs on the SparkSession at startup. When Comet native execution is active, CometScalaUDF serde intercepts the ScalaUDF expression and routes it to the native Rust path via DataFusion. The JVM lambda bodies (CometGeoFallback) are only reached when Comet is disabled, and throw UnsupportedOperationException with a clear message directing the user to enable Comet or add Sedona. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
injectResolutionRule fires per-query, after LookupFunctions already failed with UNRESOLVED_ROUTINE. Switch to calling registerAll() inside the injectColumnar lambda which Spark invokes once at session-build time, ensuring the UDFs are in the catalog before any SQL runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… access injectColumnar and injectResolutionRule both fire after LookupFunctions has already failed. injectOptimizerRule receives the SparkSession at session-build time before any query analysis runs, guaranteeing the UDFs are in the catalog when SQL first executes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…asses
ScalaUDF is intercepted by Comet's codegen path which rejected the
lambdas, so the native Rust UDFs were never reached. Replace with
proper Spark Expression subclasses (StContains, StIntersects, etc.)
registered via SparkSessionExtensions.injectFunction. The serde
maps these expression classes directly to ScalarFunc { func = name }
so the DataFusion planner resolves them to the Rust geo UDFs.
- GeoExpressions.scala: BinaryExpression/UnaryExpression subclasses
for each ST_ function with JVM fallback via CometGeoFallback
- CometGeoFallback moved to expressions package (referenced from codegen)
- geo.scala serde: maps native expression classes + optional Sedona classes
- CometSparkSessionExtensions: registers via injectFunction (compile-time,
no session required, guaranteed to fire before any query analysis)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…for injectFunction extensions.FunctionDescription is a 3-tuple (FunctionIdentifier, ExpressionInfo, builder); our type alias was a 2-tuple missing FunctionIdentifier. ExpressionInfo 10-arg constructor does not exist in Spark 3.5 -- use the 2-arg (className, name) form instead. Also drop unused InternalRow and TypeCheckResult imports. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Merge GeoExpressions import into comet group (no blank line, alphabetical order) - Break doGenCode string literals at 100-char boundary for contains/intersects/within/distance Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nt geometries - st_contains/st_within/st_intersects: replace geo::Contains/Within/Intersects traits with geo::relate::Relate DE-9IM matrix to match OGC/Sedona/JTS boundary semantics. The trait impls used closed-set (covers) semantics; DE-9IM requires interior intersection. - st_distance: geo::EuclideanDistance on Geometry enum returns 0.0 for disjoint Polygon/Point pairs. Fix by checking intersection first, then falling through to euclidean_distance which correctly handles disjoint geometries via nearest-point. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ype, st_numpoints, st_x, st_y, st_envelope, st_convexhull, st_simplify, st_buffer, st_union, st_intersection) Pure-geo tier: st_length (EuclideanLength), st_isempty, st_geometrytype, st_numpoints (CoordsIter), st_x/st_y (Point coord extraction), st_envelope (BoundingRect), st_convexhull (ConvexHull), st_simplify (Ramer-Douglas-Peucker). GEOS tier: st_buffer, st_union, st_intersection via static geos crate. All registered in DataFusion SessionContext, mapped in exprSerdeMap, and exposed via SparkSessionExtensions.injectFunction with JVM fallback stubs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…isempty/st_length/st_simplify - st_isempty: import HasDimensions and CoordsIter traits (required for is_empty/coords_count) - st_length: EuclideanLength not impl for Geometry enum; match on LineString/MultiLineString/ Polygon/MultiPolygon variants explicitly - st_simplify: Simplify not impl for Geometry enum; match on concrete variants, pass others through Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Moved to org.apache.comet.expressions in a previous commit; the old file was left behind causing a duplicate object definition compile error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
values_to_arrays does not broadcast scalar literals to batch length. Extract the numeric arg directly from ColumnarValue::Scalar before falling back to array extraction, preventing unwrap() panic on literals. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d POLYGON EMPTY bug
…nts, transformations, set ops)
…Block type Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove code\"\"\"...\"\"\" interpolation from StMakeEnvelope (confuses scalariform parser); use supportCodegen=false instead - Reorder geo.scala import: StGeometryType before StGeomFromGeoJson/Wkt (case-insensitive alphabetical order) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scalariform parser rejects (x, y) => when inline with defineCodeGen args; must place lambda on its own indented line. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e error Scalariform fails to parse (x, y) => in defineCodeGen context; rename params to g1/g2 which match the established pattern in the codebase. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Semicolons between val declarations confuse scalariform's parser, causing it to misidentify the next => in the file as a parse error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
StMakeEnvelope extends Expression directly (not BinaryExpression) and uses IndexedSeq[Expression] in withNewChildrenInternal. This unique structure causes scalariform 0.2.6 to corrupt its AST state, causing subsequent (g1, g2) => lambdas in the same file to fail to parse. Moving it to a separate file isolates the issue. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rd 2-line form for scalariform Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…deGen format Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rse error Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… for scalariform Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ns for scalariform Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s breaking scalariform Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xpression has neither) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… class) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tting Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…) tuple style Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents all 40 geospatial SQL functions registered by Comet as Spark SQL extensions. Each entry covers the SQL signature, parameter types, return type, a description of the semantics, and a usage example. Functions are grouped by category (constructors, serializers, predicates, measurements, accessors, transformations, set operations) with a summary table at the end. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for the contribution @sp-202! I'm not very familiar with geo functions in Spark. Are the functions in this PR intended to accelerate existing Spark geo functions? If so, could you add Comet SQL tests so that we can ensure Spark and Comet produce identical results? |
|
@paleolimbot fyi |
If the goal is to accelerate Sedona, I think we need to have a discussion on the mailing list first since this would be a shift in scope for Comet. |
|
Hi @andygrove — thanks for the clarification, and great to have @paleolimbot's eyes on this
I am working on the SQL tests now — will push them shortly. Happy to proceed however you think is best |
…functions The geos Rust crate (features = ["static"]) compiles GEOS from bundled C++ source and requires cmake. libgeos-dev provides headers used by pkg-config. On macOS, brew install geos adds the native library needed by the geos crate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes rustfmt --check failures in CI: splits long import lines, expands single-line trait methods, and wraps long register_udf calls in mod.rs to satisfy max_width = 100. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y clippy Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| apt-get update | ||
| apt-get install -y protobuf-compiler | ||
| apt-get install -y clang | ||
| apt-get install -y protobuf-compiler clang cmake libgeos-dev |
There was a problem hiding this comment.
I don't think we should be adding dependencies on C++ libraries, especially when they are LGPL
There was a problem hiding this comment.
@andygrove — completely agree, thank you for catching this. I will remove the geos dependency entirely. The LGPL license and C++ dependency are both good reasons to avoid it. I'll reimplement ST_Buffer, ST_Union, ST_Intersection, and ST_Difference using the pure Rust geo
andygrove
left a comment
There was a problem hiding this comment.
Let's have a discussion on the mailing list about this work.
paleolimbot
left a comment
There was a problem hiding this comment.
Cool!
A note that you can achieve roughly this as a user by using Wherobots (which I believe has 100ish functions + join + Parquet IO fully implemented on the Rust side). Of course, it's great to have this in Comet proper as well and I'm happy to help. Full UDT support would help us quite a bit...as I understand it there are a lot of DataType usages that would have to be FieldRef usages if UDTs or true Spark geometry were to be implemented (Spark geometry has a type parameter that is dropped when represented as a Utf8 here).
In its current form, I wonder if this could be implemented in such a way that the function maintenance is not Comet's responsibility...we have many functions implemented, tested, and benchmarked in SedonaDB that could be wrapped (such that only the wrapper code is a Comet maintenance burden).
Geometries are represented as WKT strings
I didn't know about this...Spark and Parquet and SedonaDB are all using WKB here, which is much faster (what Comet is currently doing is the equivalent of passing around doubles as strings...very slow).
cc @zhangfengcdt and @jiayuasu who are more familiar with the Spark side 🙂
| geo = "0.28" | ||
| geoarrow = "0.8" | ||
| geojson = { version = "0.24", features = ["geo-types"] } | ||
| geos = { version = "8.3", features = ["static"] } |
There was a problem hiding this comment.
Just a note that statically linking GEOS is often cited as not being compatible with a non-(L)GPL license (although in practice it is frequently done). I believe the general Apache line on this is that you can have (L)GPL dependencies as long as they are optional (which in this case I think would mean a non-default feature flag).
| iceberg-storage-opendal = { workspace = true } | ||
| serde_json = "1.0" | ||
| uuid = "1.23.0" | ||
| geo = "0.28" |
There was a problem hiding this comment.
I believe the latest version is 0.33?
| fn return_type(&self, _arg_types: &[DataType]) -> DataFusionResult<DataType> { | ||
| Ok(DataType::Float64) | ||
| } | ||
|
|
||
| fn invoke_with_args(&self, args: ScalarFunctionArgs) -> DataFusionResult<ColumnarValue> { | ||
| let args = ColumnarValue::values_to_arrays(&args.args)?; | ||
| let geom_col = args[0].as_any().downcast_ref::<StringArray>().unwrap(); | ||
|
|
||
| let result: Float64Array = geom_col | ||
| .iter() | ||
| .map(|g| { | ||
| let wkt = g?; | ||
| let geom = geo::Geometry::<f64>::try_from_wkt_str(wkt).ok()?; | ||
| Some(geom.unsigned_area()) | ||
| }) | ||
| .collect(); | ||
|
|
||
| Ok(ColumnarValue::Array(Arc::new(result) as ArrayRef)) | ||
| } |
There was a problem hiding this comment.
We have ~150 of these defined in SedonaDB ( https://github.com/apache/sedona-db ) with thousands of lines of tests against a live PostGIS instance. These are exportable/importable from C so you don't necessarily need the DataFusion/arrow-rs versions to align ( https://github.com/apache/sedona-db/blob/d286f9af8164a48889fd0d6fc82bc2bc274d687e/c/sedona-extension/src/scalar_kernel.rs#L39-L46 ). You would need a wrapper on this side to handle the to/from WKT for as long as you are still using it (WKB is what Spark and SedonaDB are using and it's much faster).
| ### st_geomfromwkt | ||
|
|
||
| ```sql | ||
| st_geomfromwkt(wkt STRING) -> STRING | ||
| ``` | ||
|
|
||
| Parses a WKT string and returns the geometry. Returns `null` if the input is `null`. | ||
|
|
||
| ```sql | ||
| SELECT st_geomfromwkt('POINT(1.0 2.0)'); | ||
| -- POINT (1 2) | ||
| ``` |
There was a problem hiding this comment.
Both Apache Sedona ( https://sedona.apache.org/latest/api/sql/Geometry-Constructors/ST_GeomFromWKT/ ) and SedonaDB ( https://sedona.apache.org/sedonadb/latest/reference/sql/st_geomfromwkt/ ) maintain copies of approximately these you can link to to avoid maintaining these yourselves. (We will probably converge those two docs at some point)
There was a problem hiding this comment.
@paleolimbot — thanks, really helpful direction!
- Is sedona-extension on crates.io or would it be a git dependency?
- Which arrow-rs version does SedonaDB pin? Comet is on 58.3.0 they need to match.
- Direct Cargo dep or the C ABI path you mentioned — which do you prefer?
There was a problem hiding this comment.
A Cargo dependency would be great, but we're a few versions behind Comet and you're right that they have to align for the non-FFI case. We can figure out a way to make this work on our end if Comet is interested.
…:BooleanOps GEOS is LGPL-licensed and a C++ dependency, both of which are incompatible with Apache project policy. Replace ST_Union, ST_Intersection, ST_Difference, and ST_SymDifference with geo::BooleanOps (MIT/Apache-2.0). Also revert the CI build action changes that added cmake and libgeos-dev. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: remove LGPL geos crate; reimplement set ops using pure Rust geo
Summary
SparkSessionExtensions.injectFunction. All functions execute natively inthe Rust/DataFusion engine when Comet is enabled (
spark.comet.exec.enabled=true).geo UDF convention.
ST_*expression classes are transparently mapped to the same Comet native UDFsvia the serde layer, so Sedona queries also benefit from native acceleration.
New files
spark/src/main/scala/org/apache/comet/expressions/GeoExpressions.scala— 40 Spark expression case classes and their
SparkSessionExtensionsregistration descriptors.spark/src/main/scala/org/apache/comet/expressions/CometGeoFallback.scala— JVM fallback stubs. Called only when Comet native execution is disabled and
Sedona is not on the classpath; all methods throw
UnsupportedOperationException.spark/src/main/scala/org/apache/comet/serde/geo.scala— Serde map wiring each expression class to its named DataFusion scalar function.
docs/geo-functions.md— Reference documentation covering all 40 functions with signatures, parameter
types, return types, descriptions, and SQL examples.
Functions added (40)
st_geomfromwkt,st_geomfromgeojson,st_point,st_makeenvelope,st_makelinest_astext,st_asgeojsonst_contains,st_intersects,st_within,st_covers,st_coveredby,st_equals,st_touches,st_crosses,st_disjoint,st_overlapsst_area,st_length,st_perimeter,st_distance,st_distancesphere,st_hausdorffdistance,st_numpoints,st_x,st_yst_isempty,st_geometrytypest_centroid,st_envelope,st_convexhull,st_buffer,st_simplify,st_simplifypreservetopology,st_flipcoordinates,st_boundaryst_union,st_intersection,st_difference,st_symdifferenceTest plan
./mvn scalastyle:checkpasses on all changed Scala files./mvn spotless:checkpasses (scalafmt formatting)./mvn compilesucceeds with no errors or warningsspark-shellsession(
spark.sessionState.functionRegistry.lookupFunction)transforms (18 functions), aggregation + shuffle (
CometHashAggregate),broadcast self-join (
CometBroadcastHashJoin), and window functions allproduced correct results
CometNativeScan+CometProjectthroughout —no fallback to JVM evaluation