Skip to content

feat: add 40 native geospatial SQL functions (ST_ expressions)#4423

Open
sp-202 wants to merge 69 commits into
apache:mainfrom
sp-202:main
Open

feat: add 40 native geospatial SQL functions (ST_ expressions)#4423
sp-202 wants to merge 69 commits into
apache:mainfrom
sp-202:main

Conversation

@sp-202
Copy link
Copy Markdown

@sp-202 sp-202 commented May 25, 2026

Summary

  • Adds 40 geospatial SQL functions registered as Spark SQL extensions via
    SparkSessionExtensions.injectFunction. All functions execute natively in
    the Rust/DataFusion engine when Comet is enabled (spark.comet.exec.enabled=true).
  • Geometries are represented as WKT strings, consistent with the existing Comet
    geo UDF convention.
  • Includes optional Sedona interop: if Apache Sedona is on the classpath, Sedona's
    ST_* expression classes are transparently mapped to the same Comet native UDFs
    via the serde layer, so Sedona queries also benefit from native acceleration.

New files

  • spark/src/main/scala/org/apache/comet/expressions/GeoExpressions.scala
    — 40 Spark expression case classes and their SparkSessionExtensions registration descriptors.
  • spark/src/main/scala/org/apache/comet/expressions/CometGeoFallback.scala
    — JVM fallback stubs. Called only when Comet native execution is disabled and
    Sedona is not on the classpath; all methods throw UnsupportedOperationException.
  • spark/src/main/scala/org/apache/comet/serde/geo.scala
    — Serde map wiring each expression class to its named DataFusion scalar function.
  • docs/geo-functions.md
    — Reference documentation covering all 40 functions with signatures, parameter
    types, return types, descriptions, and SQL examples.

Functions added (40)

Category Functions
Constructors st_geomfromwkt, st_geomfromgeojson, st_point, st_makeenvelope, st_makeline
Serializers st_astext, st_asgeojson
Predicates st_contains, st_intersects, st_within, st_covers, st_coveredby, st_equals, st_touches, st_crosses, st_disjoint, st_overlaps
Measurements st_area, st_length, st_perimeter, st_distance, st_distancesphere, st_hausdorffdistance, st_numpoints, st_x, st_y
Accessors st_isempty, st_geometrytype
Transformations st_centroid, st_envelope, st_convexhull, st_buffer, st_simplify, st_simplifypreservetopology, st_flipcoordinates, st_boundary
Set operations st_union, st_intersection, st_difference, st_symdifference

Test plan

  • ./mvn scalastyle:check passes on all changed Scala files
  • ./mvn spotless:check passes (scalafmt formatting)
  • ./mvn compile succeeds with no errors or warnings
  • Verified all 40 functions are registered in a live spark-shell session
    (spark.sessionState.functionRegistry.lookupFunction)
  • Ran a stress test with 10 000 parquet rows across 10 partitions: per-row
    transforms (18 functions), aggregation + shuffle (CometHashAggregate),
    broadcast self-join (CometBroadcastHashJoin), and window functions all
    produced correct results
  • Query plans confirm CometNativeScan + CometProject throughout —
    no fallback to JVM evaluation

sp-202 and others added 30 commits May 25, 2026 06:13
…T_Within, ST_Area, ST_Centroid)

Implements six Sedona-compatible spatial functions as native DataFusion
ScalarUDFs backed by the pure-Rust `geo` crate, replacing JVM round-trips
for geometry predicates and measurements.

- Add geo/wkt/geos crate dependencies to native/core/Cargo.toml
- Implement StContains, StIntersects, StWithin, StDistance, StArea,
  StCentroid as ScalarUDFImpl in native/core/src/execution/expressions/geo/
- Register all six UDFs with the DataFusion SessionContext in jni_api.rs
- Add Scala serde (geo.scala) that reflectively maps Sedona ST_ expression
  classes to the corresponding ScalarFunc proto at runtime; Comet compiles
  cleanly without Sedona on the classpath
- Wire geoExpressions into exprSerdeMap in QueryPlanSerde.scala

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split long import line in geo.scala (col 110 violation)
- Wrap Class.forName with scalastyle:off/on classforname guard
- Replace non-ASCII em-dashes with hyphens in comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Revert multi-line import to single line, rewrap doc comments to match
the project's spotless/scalafmt rules, and fix scalastyle:on guard
indentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spark's analyzer cannot resolve st_contains etc. unless they are
registered in the Spark function catalog. Add CometGeoFunctions, an
injectResolutionRule that registers all six geo UDFs as ScalaUDFs on
the SparkSession at startup.

When Comet native execution is active, CometScalaUDF serde intercepts
the ScalaUDF expression and routes it to the native Rust path via
DataFusion. The JVM lambda bodies (CometGeoFallback) are only reached
when Comet is disabled, and throw UnsupportedOperationException with a
clear message directing the user to enable Comet or add Sedona.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
injectResolutionRule fires per-query, after LookupFunctions already
failed with UNRESOLVED_ROUTINE. Switch to calling registerAll() inside
the injectColumnar lambda which Spark invokes once at session-build
time, ensuring the UDFs are in the catalog before any SQL runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… access

injectColumnar and injectResolutionRule both fire after LookupFunctions
has already failed. injectOptimizerRule receives the SparkSession at
session-build time before any query analysis runs, guaranteeing the
UDFs are in the catalog when SQL first executes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…asses

ScalaUDF is intercepted by Comet's codegen path which rejected the
lambdas, so the native Rust UDFs were never reached. Replace with
proper Spark Expression subclasses (StContains, StIntersects, etc.)
registered via SparkSessionExtensions.injectFunction. The serde
maps these expression classes directly to ScalarFunc { func = name }
so the DataFusion planner resolves them to the Rust geo UDFs.

- GeoExpressions.scala: BinaryExpression/UnaryExpression subclasses
  for each ST_ function with JVM fallback via CometGeoFallback
- CometGeoFallback moved to expressions package (referenced from codegen)
- geo.scala serde: maps native expression classes + optional Sedona classes
- CometSparkSessionExtensions: registers via injectFunction (compile-time,
  no session required, guaranteed to fire before any query analysis)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…for injectFunction

extensions.FunctionDescription is a 3-tuple (FunctionIdentifier, ExpressionInfo, builder);
our type alias was a 2-tuple missing FunctionIdentifier. ExpressionInfo 10-arg constructor
does not exist in Spark 3.5 -- use the 2-arg (className, name) form instead. Also drop
unused InternalRow and TypeCheckResult imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Merge GeoExpressions import into comet group (no blank line, alphabetical order)
- Break doGenCode string literals at 100-char boundary for contains/intersects/within/distance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nt geometries

- st_contains/st_within/st_intersects: replace geo::Contains/Within/Intersects traits
  with geo::relate::Relate DE-9IM matrix to match OGC/Sedona/JTS boundary semantics.
  The trait impls used closed-set (covers) semantics; DE-9IM requires interior intersection.
- st_distance: geo::EuclideanDistance on Geometry enum returns 0.0 for disjoint
  Polygon/Point pairs. Fix by checking intersection first, then falling through to
  euclidean_distance which correctly handles disjoint geometries via nearest-point.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ype, st_numpoints, st_x, st_y, st_envelope, st_convexhull, st_simplify, st_buffer, st_union, st_intersection)

Pure-geo tier: st_length (EuclideanLength), st_isempty, st_geometrytype, st_numpoints
(CoordsIter), st_x/st_y (Point coord extraction), st_envelope (BoundingRect),
st_convexhull (ConvexHull), st_simplify (Ramer-Douglas-Peucker).
GEOS tier: st_buffer, st_union, st_intersection via static geos crate.
All registered in DataFusion SessionContext, mapped in exprSerdeMap, and exposed
via SparkSessionExtensions.injectFunction with JVM fallback stubs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…isempty/st_length/st_simplify

- st_isempty: import HasDimensions and CoordsIter traits (required for is_empty/coords_count)
- st_length: EuclideanLength not impl for Geometry enum; match on LineString/MultiLineString/
  Polygon/MultiPolygon variants explicitly
- st_simplify: Simplify not impl for Geometry enum; match on concrete variants, pass others through

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Moved to org.apache.comet.expressions in a previous commit; the old file
was left behind causing a duplicate object definition compile error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
values_to_arrays does not broadcast scalar literals to batch length.
Extract the numeric arg directly from ColumnarValue::Scalar before
falling back to array extraction, preventing unwrap() panic on literals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Block type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove code\"\"\"...\"\"\" interpolation from StMakeEnvelope (confuses
  scalariform parser); use supportCodegen=false instead
- Reorder geo.scala import: StGeometryType before StGeomFromGeoJson/Wkt
  (case-insensitive alphabetical order)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scalariform parser rejects (x, y) => when inline with defineCodeGen args;
must place lambda on its own indented line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e error

Scalariform fails to parse (x, y) => in defineCodeGen context; rename
params to g1/g2 which match the established pattern in the codebase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Semicolons between val declarations confuse scalariform's parser, causing
it to misidentify the next => in the file as a parse error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
StMakeEnvelope extends Expression directly (not BinaryExpression) and
uses IndexedSeq[Expression] in withNewChildrenInternal. This unique
structure causes scalariform 0.2.6 to corrupt its AST state, causing
subsequent (g1, g2) => lambdas in the same file to fail to parse.
Moving it to a separate file isolates the issue.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sp-202 and others added 16 commits May 25, 2026 11:36
…rd 2-line form for scalariform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…deGen format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rse error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… for scalariform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ns for scalariform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s breaking scalariform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xpression has neither)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… class)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…) tuple style

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents all 40 geospatial SQL functions registered by Comet as Spark
SQL extensions. Each entry covers the SQL signature, parameter types,
return type, a description of the semantics, and a usage example.
Functions are grouped by category (constructors, serializers, predicates,
measurements, accessors, transformations, set operations) with a summary
table at the end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@andygrove
Copy link
Copy Markdown
Member

Thanks for the contribution @sp-202! I'm not very familiar with geo functions in Spark. Are the functions in this PR intended to accelerate existing Spark geo functions? If so, could you add Comet SQL tests so that we can ensure Spark and Comet produce identical results?

@andygrove
Copy link
Copy Markdown
Member

@paleolimbot fyi

@andygrove
Copy link
Copy Markdown
Member

Thanks for the contribution @sp-202! I'm not very familiar with geo functions in Spark. Are the functions in this PR intended to accelerate existing Spark geo functions? If so, could you add Comet SQL tests so that we can ensure Spark and Comet produce identical results?

If the goal is to accelerate Sedona, I think we need to have a discussion on the mailing list first since this would be a shift in scope for Comet.

@sp-202
Copy link
Copy Markdown
Author

sp-202 commented May 25, 2026

Hi @andygrove — thanks for the clarification, and great to have @paleolimbot's eyes on this

To clarify the intent: this PR is not about accelerating Sedona. The goal is to add native geospatial SQL functions directly into Comet the same way Comet has native math, string, and datetime functions today. No Sedona dependency, no external library coupling. Comet compiles and runs completely without Sedona on the classpath.

So the core ask of this PR is simply:

  • Users add CometSparkSessionExtensions (which they already do to enable Comet)
  • They get ST_Contains, ST_Intersects, ST_Distance etc. natively in Spark SQL
  • No Sedona, no extra JARs, no config changes

On build dependencies — the native geo layer requires the following additions to native/core/Cargo.toml:

geo = "0.28"
geoarrow = "0.8"
geojson = { version = "0.24", features = ["geo-types"] }
geos = { version = "8.3", features = ["static"] }
wkt = "0.11"

And on the build machine, libgeos-dev needs to be installed:

sudo apt install -y libgeos-dev

Since geos is built with features = ["static"], the GEOS C library is statically linked into libcomet.so — so end users don't need libgeos-dev at runtime. The build machine needs it, but the final JAR is self-contained. For CI, I'm happy to submit a follow-up PR updating the CI workflow once this PR direction is agreed.

I am working on the SQL tests now — will push them shortly. Happy to proceed however you think is best

subhodeep and others added 5 commits May 25, 2026 23:45
…functions

The geos Rust crate (features = ["static"]) compiles GEOS from bundled
C++ source and requires cmake. libgeos-dev provides headers used by
pkg-config. On macOS, brew install geos adds the native library needed
by the geos crate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes rustfmt --check failures in CI: splits long import lines,
expands single-line trait methods, and wraps long register_udf
calls in mod.rs to satisfy max_width = 100.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y clippy

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
apt-get update
apt-get install -y protobuf-compiler
apt-get install -y clang
apt-get install -y protobuf-compiler clang cmake libgeos-dev
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be adding dependencies on C++ libraries, especially when they are LGPL

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andygrove — completely agree, thank you for catching this. I will remove the geos dependency entirely. The LGPL license and C++ dependency are both good reasons to avoid it. I'll reimplement ST_Buffer, ST_Union, ST_Intersection, and ST_Difference using the pure Rust geo

Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have a discussion on the mailing list about this work.

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

A note that you can achieve roughly this as a user by using Wherobots (which I believe has 100ish functions + join + Parquet IO fully implemented on the Rust side). Of course, it's great to have this in Comet proper as well and I'm happy to help. Full UDT support would help us quite a bit...as I understand it there are a lot of DataType usages that would have to be FieldRef usages if UDTs or true Spark geometry were to be implemented (Spark geometry has a type parameter that is dropped when represented as a Utf8 here).

In its current form, I wonder if this could be implemented in such a way that the function maintenance is not Comet's responsibility...we have many functions implemented, tested, and benchmarked in SedonaDB that could be wrapped (such that only the wrapper code is a Comet maintenance burden).

Geometries are represented as WKT strings

I didn't know about this...Spark and Parquet and SedonaDB are all using WKB here, which is much faster (what Comet is currently doing is the equivalent of passing around doubles as strings...very slow).

cc @zhangfengcdt and @jiayuasu who are more familiar with the Spark side 🙂

Comment thread native/core/Cargo.toml Outdated
geo = "0.28"
geoarrow = "0.8"
geojson = { version = "0.24", features = ["geo-types"] }
geos = { version = "8.3", features = ["static"] }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that statically linking GEOS is often cited as not being compatible with a non-(L)GPL license (although in practice it is frequently done). I believe the general Apache line on this is that you can have (L)GPL dependencies as long as they are optional (which in this case I think would mean a non-default feature flag).

Comment thread native/core/Cargo.toml
iceberg-storage-opendal = { workspace = true }
serde_json = "1.0"
uuid = "1.23.0"
geo = "0.28"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the latest version is 0.33?

Comment on lines +56 to +74
fn return_type(&self, _arg_types: &[DataType]) -> DataFusionResult<DataType> {
Ok(DataType::Float64)
}

fn invoke_with_args(&self, args: ScalarFunctionArgs) -> DataFusionResult<ColumnarValue> {
let args = ColumnarValue::values_to_arrays(&args.args)?;
let geom_col = args[0].as_any().downcast_ref::<StringArray>().unwrap();

let result: Float64Array = geom_col
.iter()
.map(|g| {
let wkt = g?;
let geom = geo::Geometry::<f64>::try_from_wkt_str(wkt).ok()?;
Some(geom.unsigned_area())
})
.collect();

Ok(ColumnarValue::Array(Arc::new(result) as ArrayRef))
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have ~150 of these defined in SedonaDB ( https://github.com/apache/sedona-db ) with thousands of lines of tests against a live PostGIS instance. These are exportable/importable from C so you don't necessarily need the DataFusion/arrow-rs versions to align ( https://github.com/apache/sedona-db/blob/d286f9af8164a48889fd0d6fc82bc2bc274d687e/c/sedona-extension/src/scalar_kernel.rs#L39-L46 ). You would need a wrapper on this side to handle the to/from WKT for as long as you are still using it (WKB is what Spark and SedonaDB are using and it's much faster).

Comment thread docs/geo-functions.md
Comment on lines +11 to +22
### st_geomfromwkt

```sql
st_geomfromwkt(wkt STRING) -> STRING
```

Parses a WKT string and returns the geometry. Returns `null` if the input is `null`.

```sql
SELECT st_geomfromwkt('POINT(1.0 2.0)');
-- POINT (1 2)
```
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Apache Sedona ( https://sedona.apache.org/latest/api/sql/Geometry-Constructors/ST_GeomFromWKT/ ) and SedonaDB ( https://sedona.apache.org/sedonadb/latest/reference/sql/st_geomfromwkt/ ) maintain copies of approximately these you can link to to avoid maintaining these yourselves. (We will probably converge those two docs at some point)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paleolimbot — thanks, really helpful direction!

  1. Is sedona-extension on crates.io or would it be a git dependency?
  2. Which arrow-rs version does SedonaDB pin? Comet is on 58.3.0 they need to match.
  3. Direct Cargo dep or the C ABI path you mentioned — which do you prefer?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Cargo dependency would be great, but we're a few versions behind Comet and you're right that they have to align for the non-FFI case. We can figure out a way to make this work on our end if Comet is interested.

subhodeep and others added 2 commits May 26, 2026 00:39
…:BooleanOps

GEOS is LGPL-licensed and a C++ dependency, both of which are incompatible
with Apache project policy. Replace ST_Union, ST_Intersection, ST_Difference,
and ST_SymDifference with geo::BooleanOps (MIT/Apache-2.0). Also revert the
CI build action changes that added cmake and libgeos-dev.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: remove LGPL geos crate; reimplement set ops using pure Rust geo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants