Skip to content

geos C++ dependecy removed#4431

Closed
sp-202 wants to merge 68 commits into
apache:mainfrom
sp-202:ci-test-geo-functions
Closed

geos C++ dependecy removed#4431
sp-202 wants to merge 68 commits into
apache:mainfrom
sp-202:ci-test-geo-functions

Conversation

@sp-202
Copy link
Copy Markdown

@sp-202 sp-202 commented May 25, 2026

No description provided.

sp-202 and others added 30 commits May 25, 2026 06:13
…T_Within, ST_Area, ST_Centroid)

Implements six Sedona-compatible spatial functions as native DataFusion
ScalarUDFs backed by the pure-Rust `geo` crate, replacing JVM round-trips
for geometry predicates and measurements.

- Add geo/wkt/geos crate dependencies to native/core/Cargo.toml
- Implement StContains, StIntersects, StWithin, StDistance, StArea,
  StCentroid as ScalarUDFImpl in native/core/src/execution/expressions/geo/
- Register all six UDFs with the DataFusion SessionContext in jni_api.rs
- Add Scala serde (geo.scala) that reflectively maps Sedona ST_ expression
  classes to the corresponding ScalarFunc proto at runtime; Comet compiles
  cleanly without Sedona on the classpath
- Wire geoExpressions into exprSerdeMap in QueryPlanSerde.scala

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split long import line in geo.scala (col 110 violation)
- Wrap Class.forName with scalastyle:off/on classforname guard
- Replace non-ASCII em-dashes with hyphens in comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Revert multi-line import to single line, rewrap doc comments to match
the project's spotless/scalafmt rules, and fix scalastyle:on guard
indentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spark's analyzer cannot resolve st_contains etc. unless they are
registered in the Spark function catalog. Add CometGeoFunctions, an
injectResolutionRule that registers all six geo UDFs as ScalaUDFs on
the SparkSession at startup.

When Comet native execution is active, CometScalaUDF serde intercepts
the ScalaUDF expression and routes it to the native Rust path via
DataFusion. The JVM lambda bodies (CometGeoFallback) are only reached
when Comet is disabled, and throw UnsupportedOperationException with a
clear message directing the user to enable Comet or add Sedona.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
injectResolutionRule fires per-query, after LookupFunctions already
failed with UNRESOLVED_ROUTINE. Switch to calling registerAll() inside
the injectColumnar lambda which Spark invokes once at session-build
time, ensuring the UDFs are in the catalog before any SQL runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… access

injectColumnar and injectResolutionRule both fire after LookupFunctions
has already failed. injectOptimizerRule receives the SparkSession at
session-build time before any query analysis runs, guaranteeing the
UDFs are in the catalog when SQL first executes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…asses

ScalaUDF is intercepted by Comet's codegen path which rejected the
lambdas, so the native Rust UDFs were never reached. Replace with
proper Spark Expression subclasses (StContains, StIntersects, etc.)
registered via SparkSessionExtensions.injectFunction. The serde
maps these expression classes directly to ScalarFunc { func = name }
so the DataFusion planner resolves them to the Rust geo UDFs.

- GeoExpressions.scala: BinaryExpression/UnaryExpression subclasses
  for each ST_ function with JVM fallback via CometGeoFallback
- CometGeoFallback moved to expressions package (referenced from codegen)
- geo.scala serde: maps native expression classes + optional Sedona classes
- CometSparkSessionExtensions: registers via injectFunction (compile-time,
  no session required, guaranteed to fire before any query analysis)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…for injectFunction

extensions.FunctionDescription is a 3-tuple (FunctionIdentifier, ExpressionInfo, builder);
our type alias was a 2-tuple missing FunctionIdentifier. ExpressionInfo 10-arg constructor
does not exist in Spark 3.5 -- use the 2-arg (className, name) form instead. Also drop
unused InternalRow and TypeCheckResult imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Merge GeoExpressions import into comet group (no blank line, alphabetical order)
- Break doGenCode string literals at 100-char boundary for contains/intersects/within/distance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nt geometries

- st_contains/st_within/st_intersects: replace geo::Contains/Within/Intersects traits
  with geo::relate::Relate DE-9IM matrix to match OGC/Sedona/JTS boundary semantics.
  The trait impls used closed-set (covers) semantics; DE-9IM requires interior intersection.
- st_distance: geo::EuclideanDistance on Geometry enum returns 0.0 for disjoint
  Polygon/Point pairs. Fix by checking intersection first, then falling through to
  euclidean_distance which correctly handles disjoint geometries via nearest-point.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ype, st_numpoints, st_x, st_y, st_envelope, st_convexhull, st_simplify, st_buffer, st_union, st_intersection)

Pure-geo tier: st_length (EuclideanLength), st_isempty, st_geometrytype, st_numpoints
(CoordsIter), st_x/st_y (Point coord extraction), st_envelope (BoundingRect),
st_convexhull (ConvexHull), st_simplify (Ramer-Douglas-Peucker).
GEOS tier: st_buffer, st_union, st_intersection via static geos crate.
All registered in DataFusion SessionContext, mapped in exprSerdeMap, and exposed
via SparkSessionExtensions.injectFunction with JVM fallback stubs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…isempty/st_length/st_simplify

- st_isempty: import HasDimensions and CoordsIter traits (required for is_empty/coords_count)
- st_length: EuclideanLength not impl for Geometry enum; match on LineString/MultiLineString/
  Polygon/MultiPolygon variants explicitly
- st_simplify: Simplify not impl for Geometry enum; match on concrete variants, pass others through

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Moved to org.apache.comet.expressions in a previous commit; the old file
was left behind causing a duplicate object definition compile error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
values_to_arrays does not broadcast scalar literals to batch length.
Extract the numeric arg directly from ColumnarValue::Scalar before
falling back to array extraction, preventing unwrap() panic on literals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Block type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove code\"\"\"...\"\"\" interpolation from StMakeEnvelope (confuses
  scalariform parser); use supportCodegen=false instead
- Reorder geo.scala import: StGeometryType before StGeomFromGeoJson/Wkt
  (case-insensitive alphabetical order)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scalariform parser rejects (x, y) => when inline with defineCodeGen args;
must place lambda on its own indented line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e error

Scalariform fails to parse (x, y) => in defineCodeGen context; rename
params to g1/g2 which match the established pattern in the codebase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Semicolons between val declarations confuse scalariform's parser, causing
it to misidentify the next => in the file as a parse error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
StMakeEnvelope extends Expression directly (not BinaryExpression) and
uses IndexedSeq[Expression] in withNewChildrenInternal. This unique
structure causes scalariform 0.2.6 to corrupt its AST state, causing
subsequent (g1, g2) => lambdas in the same file to fail to parse.
Moving it to a separate file isolates the issue.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sp-202 and others added 27 commits May 25, 2026 11:26
The 4-line split format was not in the original file. Only intended
change is removing s"POINT(${...})" interpolation from nullSafeEval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ode for scalariform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ard style for scalariform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rd 2-line form for scalariform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…deGen format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rse error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… for scalariform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ns for scalariform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s breaking scalariform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xpression has neither)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… class)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…) tuple style

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents all 40 geospatial SQL functions registered by Comet as Spark
SQL extensions. Each entry covers the SQL signature, parameter types,
return type, a description of the semantics, and a usage example.
Functions are grouped by category (constructors, serializers, predicates,
measurements, accessors, transformations, set operations) with a summary
table at the end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…functions

The geos Rust crate (features = ["static"]) compiles GEOS from bundled
C++ source and requires cmake. libgeos-dev provides headers used by
pkg-config. On macOS, brew install geos adds the native library needed
by the geos crate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes rustfmt --check failures in CI: splits long import lines,
expands single-line trait methods, and wraps long register_udf
calls in mod.rs to satisfy max_width = 100.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y clippy

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…:BooleanOps

GEOS is LGPL-licensed and a C++ dependency, both of which are incompatible
with Apache project policy. Replace ST_Union, ST_Intersection, ST_Difference,
and ST_SymDifference with geo::BooleanOps (MIT/Apache-2.0). Also revert the
CI build action changes that added cmake and libgeos-dev.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sp-202 sp-202 closed this May 25, 2026
@sp-202 sp-202 deleted the ci-test-geo-functions branch May 25, 2026 19:12
@sp-202 sp-202 restored the ci-test-geo-functions branch May 25, 2026 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant