
feat: add all Spark regexp expressions via JVM UDF framework #4239

Draft
andygrove wants to merge 39 commits into apache:main from andygrove:java-regexp

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #.

Rationale for this change

This PR extends the JVM UDF framework (introduced in #4232) to support all Spark regular expression functions with full Java regex compatibility. Previously, only rlike was implemented.

Note: This PR is stacked on #4232 (JVM UDF framework) and should be reviewed/merged after that PR.

What changes are included in this PR?

  • Add JVM UDF implementations for regexp_extract, regexp_extract_all, regexp_instr, regexp_replace, and split
  • Change default regexp engine from rust to java for full Spark compatibility (backreferences, lookaheads, embedded flags)
  • Add serde support to route these expressions through the JVM UDF bridge
  • Add Arrow schema normalization in the Rust JVM UDF executor (handles ListVector field naming differences between Arrow Java and DataFusion; see the sketch after this list)
  • Reorganize SQL test files: separate files per engine (*_rust.sql, *_rust_enabled.sql, *_java.sql)
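
For context on the ListVector normalization above, here is a small standalone Java snippet (illustrative only, not Comet code) showing the mismatch it has to bridge: Arrow Java names a list's child vector "$data$" by default, while arrow-rs/DataFusion default to "item", so list schemas crossing the JNI boundary need their child fields renamed before they compare equal.

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.complex.ListVector;
import org.apache.arrow.vector.types.Types;
import org.apache.arrow.vector.types.pojo.FieldType;

public class ListChildNameDemo {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator();
         ListVector list = ListVector.empty("col", allocator)) {
      list.addOrGetVector(FieldType.nullable(Types.MinorType.INT.getType()));
      // Arrow Java's default child field name is "$data$"; arrow-rs uses
      // "item", so a schema produced on the Java side will not match
      // DataFusion's expectation without renaming the child field.
      System.out.println(list.getDataVector().getField().getName());
    }
  }
}
```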

How are these changes tested?

  • CometRegExpJvmSuite: 45 tests covering all regexp expressions
  • 12 SQL test files covering both Java and Rust engine paths
  • Existing CometExpressionSuite and CometStringExpressionSuite tests continue to pass

Also fix CometArrayExpressionSuite compilation by qualifying the Spark
udf() call, which was shadowed by the new org.apache.comet.udf package.

Implements a DataFusion PhysicalExpr that evaluates child expressions,
exports the results as Arrow FFI arrays, calls CometUdfBridge.evaluate()
via JNI, and imports the output array.

Adds datafusion-comet-jni-bridge as a dependency of the spark-expr crate.
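
As a rough illustration of the Java half of that round trip, the sketch below uses Arrow's C Data Interface classes to import the input array from a raw address, apply a UDF, and export the result back through out-parameter addresses. The method shape, parameter list, and applyUdf helper are assumptions for illustration, not Comet's actual bridge signature.

```java
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.FieldVector;

final class UdfBridgeSketch {
  // Hypothetical JNI entry point: the Rust PhysicalExpr passes addresses of
  // exported ArrowArray/ArrowSchema structs; we import them, evaluate, and
  // export the result back through the out-parameter addresses.
  static void evaluate(BufferAllocator allocator, String udfClass,
                       long inArrayAddr, long inSchemaAddr,
                       long outArrayAddr, long outSchemaAddr) throws Exception {
    try (ArrowArray inArray = ArrowArray.wrap(inArrayAddr);
         ArrowSchema inSchema = ArrowSchema.wrap(inSchemaAddr);
         FieldVector input = Data.importVector(allocator, inArray, inSchema, null)) {
      try (FieldVector result = applyUdf(udfClass, input)) { // hypothetical helper
        Data.exportVector(allocator, result, null,
            ArrowArray.wrap(outArrayAddr), ArrowSchema.wrap(outSchemaAddr));
      }
    }
  }

  private static FieldVector applyUdf(String udfClass, FieldVector input) {
    throw new UnsupportedOperationException("illustration only");
  }
}
```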
…UDF class via context classloader

Wrap the JNI body in try/finally so input ValueVectors and the result vector
are always closed, even when the UDF or arrow export throws. Resolve the
CometUDF class through the thread context classloader so user-supplied UDF
jars (added via spark.jars) are visible from the bridge.
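
A condensed sketch of both fixes, with hypothetical names (the real bridge differs):

```java
import java.util.List;
import org.apache.arrow.vector.ValueVector;

final class BridgeCallSketch {
  interface CometUdfSketch { // stand-in for the real CometUDF interface
    ValueVector evaluate(List<ValueVector> inputs);
  }

  static void evaluateAndExport(String udfClassName, List<ValueVector> inputs,
                                long outAddr) throws Exception {
    // Resolve through the context classloader so jars added via spark.jars
    // are visible from the bridge, not just the application classpath.
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    Class<?> cls = Class.forName(udfClassName, true, loader);
    ValueVector result = null;
    try {
      CometUdfSketch udf = (CometUdfSketch) cls.getDeclaredConstructor().newInstance();
      result = udf.evaluate(inputs);
      exportToNative(result, outAddr); // stand-in for the Arrow FFI export
    } finally {
      // Always release Arrow memory, even if evaluate() or the export throws.
      for (ValueVector v : inputs) v.close();
      if (result != null) result.close();
    }
  }

  private static void exportToNative(ValueVector result, long outAddr) {
    throw new UnsupportedOperationException("illustration only");
  }
}
```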
…ns fall back to Spark

When routing RLike through the JVM UDF, reject Literal(null) and patterns
that fail Pattern.compile during planning. Both cases now produce withInfo
+ None, letting Spark evaluate the expression instead of crashing the
executor task with PatternSyntaxException or NullPointerException.
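
The planning-side check lives in Scala in Comet; this Java sketch shows the equivalent logic (helper name is hypothetical):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class PatternPlanningCheck {
  // Returns true only when the pattern can safely run on the JVM UDF path;
  // otherwise the planner emits withInfo + None and Spark evaluates it.
  static boolean canOffloadPattern(String pattern) {
    if (pattern == null) {
      return false; // Literal(null): fall back rather than risk an NPE in the task
    }
    try {
      Pattern.compile(pattern);
      return true;
    } catch (PatternSyntaxException e) {
      return false; // invalid pattern: fall back rather than crash the executor
    }
  }
}
```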
Make comet_udf_bridge an Option in JVMClasses so a missing
org.apache.comet.udf.CometUdfBridge class (e.g. shading dropped
org.apache.comet.udf.*) no longer crashes executor JVM init. The
JVM-UDF dispatch path returns a clear ExecutionError when the bridge
is unavailable. Also clarify the FFI lifetime contract on the result
import.
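
The actual lookup happens in Rust-side JNI init; the Java analogue below just shows the failure mode being absorbed rather than propagated:

```java
import java.util.Optional;

final class BridgeLookupSketch {
  // If shading (or anything else) removed org.apache.comet.udf.CometUdfBridge,
  // return empty instead of failing JVM init; the JVM-UDF dispatch path then
  // reports a clear ExecutionError when the bridge is actually needed.
  static Optional<Class<?>> loadUdfBridge() {
    try {
      return Optional.of(Class.forName("org.apache.comet.udf.CometUdfBridge"));
    } catch (ClassNotFoundException | LinkageError e) {
      return Optional.empty();
    }
  }
}
```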
Replace string literals "rust"/"java" used for the regexp engine selector
with named constants on CometConf. Tighten CometRLike.getSupportLevel so
it only reports Compatible(None) when the pattern is a Literal, matching
the actual constraint enforced by the convert path.
Literal-folded children no longer get expanded to batch-row count before
crossing JNI; ColumnarValue::Scalar is materialized at length 1, avoiding
an O(rows) copy of values that never vary across the batch. Document the
contract on CometUDF: scalar inputs arrive as length-1 vectors, vector
inputs at the batch row count, and the result must match the longest
input.
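
That contract reduces to a single indexing rule on the UDF side; a sketch (helper name is hypothetical):

```java
import org.apache.arrow.vector.ValueVector;

final class ScalarBroadcastSketch {
  // Per the CometUDF contract: scalar inputs arrive as length-1 vectors, so a
  // UDF reads index 0 for them on every row, while full-length inputs are
  // read positionally. The result must match the longest input.
  static int inputIndex(ValueVector input, int row) {
    return input.getValueCount() == 1 ? 0 : row;
  }
}
```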
andygrove added 9 commits May 1, 2026 09:42

Add return_nullable to JvmScalarUdf proto and plumb it through to
JvmScalarUdfExpr::nullable. Use the source RLike's own nullable on
the convert path so downstream operators can apply null-aware
optimizations instead of assuming every JVM-UDF call may yield null.

Also clean up Hash/PartialEq to delegate to the children's PhysicalExpr
impls (matching SubstringExpr) rather than stringifying via Display, and
print children using Display in the EXPLAIN output.

The bridge now rejects UDFs that return a vector whose length does not
match the longest input. Previously a buggy UDF returning a shorter or
longer vector would silently corrupt query results.
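
A sketch of what that bridge-side validation can look like (assumed shape, not the exact code):

```java
import java.util.List;
import org.apache.arrow.vector.ValueVector;

final class ResultLengthCheck {
  // Reject results whose length differs from the longest input instead of
  // letting a buggy UDF silently corrupt the batch.
  static void checkResultLength(List<ValueVector> inputs, ValueVector result) {
    int expected = 0;
    for (ValueVector in : inputs) {
      expected = Math.max(expected, in.getValueCount());
    }
    if (result.getValueCount() != expected) {
      throw new IllegalStateException(
          "UDF returned " + result.getValueCount() + " rows, expected " + expected);
    }
  }
}
```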
Lift the engine selector into the user-facing reason string so the
generated compatibility guide reads the same regardless of which mode
is the project default. Future Java-engine-specific incompatibilities
can be appended with the matching qualifier.

The per-thread CometUDF instance cache (CometUdfBridge.INSTANCES) and
the per-instance Pattern cache in RegExpLikeUDF were unbounded, so a
workload that registers many UDF classes or evaluates many distinct
regex patterns would retain entries for the executor's JVM lifetime.
Both caches now use access-order LinkedHashMap with removeEldestEntry
bounds (64 UDF instances, 128 patterns).
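
The removeEldestEntry idiom referred to here looks like this (a sketch using the 128-pattern bound from the message; the real cache shape may differ):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class PatternCacheSketch {
  private static final int MAX_PATTERNS = 128; // bound from the commit message

  // Access-order LinkedHashMap evicts the least-recently-used entry once the
  // bound is exceeded, keeping the cache size constant under churn.
  private final Map<String, Pattern> cache =
      new LinkedHashMap<String, Pattern>(16, 0.75f, /* accessOrder */ true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
          return size() > MAX_PATTERNS;
        }
      };

  Pattern get(String regex) {
    return cache.computeIfAbsent(regex, Pattern::compile);
  }
}
```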
CI fails the unit suite with 'Memory was leaked by query (98304/147456)'
when RegExpLikeUDFSuite runs after CometRegExpJvmSuite in the same
surefire JVM (Spark 3.5 Scala 2.13 / Spark 4.x). The per-test
RootAllocator only ever held the input vectors anyway: the UDF
allocates its output BitVector from CometArrowAllocator, which is
where leak detection actually needs to happen.

Allocate inputs from CometArrowAllocator alongside the output, and
verify each test cleans up by snapshotting the allocator's outstanding
bytes before/after the body.
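
The before/after snapshot is a one-liner with Arrow's BufferAllocator API; a sketch of the per-test check, where the allocator argument stands in for CometArrowAllocator:

```java
import org.apache.arrow.memory.BufferAllocator;

final class LeakCheckSketch {
  // Snapshot outstanding bytes around the test body; any difference means the
  // body leaked Arrow memory from this allocator.
  static void runWithLeakCheck(BufferAllocator allocator, Runnable body) {
    long before = allocator.getAllocatedMemory();
    body.run();
    long after = allocator.getAllocatedMemory();
    if (after != before) {
      throw new AssertionError("test leaked " + (after - before) + " bytes");
    }
  }
}
```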
The unit suite imports `org.apache.arrow.memory.RootAllocator` and
`org.apache.comet.CometArrowAllocator` from the spark module's test
sources, but at runtime the spark module loads the shaded
comet-common.jar from the local Maven repo. That JAR has
`package$.CometArrowAllocator()` returning the relocated
`org.apache.comet.shaded.arrow.memory.RootAllocator`, while the test
bytecode expects unshaded `org.apache.arrow.memory.RootAllocator`,
producing NoSuchMethodError. The earlier per-test RootAllocator
variant hit a different failure mode (Arrow leak detection).

End-to-end coverage of the JVM regex path is provided by
CometRegExpJvmSuite, which exercises Java-only constructs
(backreference, lookahead, lookbehind, embedded flags, named
groups), null handling, empty inputs, and multi-batch flows.
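
For reference, illustrative Java patterns for the constructs listed above (not the suite's actual test data):

```java
import java.util.regex.Pattern;

final class JavaRegexConstructExamples {
  static final Pattern BACKREF    = Pattern.compile("(\\w+) \\1");       // backreference
  static final Pattern LOOKAHEAD  = Pattern.compile("foo(?=bar)");       // lookahead
  static final Pattern LOOKBEHIND = Pattern.compile("(?<=foo)bar");      // lookbehind
  static final Pattern FLAGS      = Pattern.compile("(?i)hello");        // embedded flag
  static final Pattern NAMED      = Pattern.compile("(?<year>\\d{4})");  // named group
}
```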
Implement regexp_extract, regexp_extract_all, regexp_instr,
regexp_replace, and split using the JVM scalar UDF framework.
Change default regexp engine from rust to java for full
compatibility with Spark's Java regex semantics.