
feat: add all Spark regexp expressions via JVM UDF framework #4239

Draft
andygrove wants to merge 39 commits into apache:main from andygrove:java-regexp

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #.

Rationale for this change

This PR extends the JVM UDF framework (introduced in #4232) to support all Spark regular expression functions with full Java regex compatibility. Previously, only rlike was implemented.

Note: This PR is stacked on #4232 (JVM UDF framework) and should be reviewed/merged after that PR.

What changes are included in this PR?

  • Add JVM UDF implementations for regexp_extract, regexp_extract_all, regexp_instr, regexp_replace, and split
  • Change default regexp engine from rust to java for full Spark compatibility (backreferences, lookaheads, embedded flags)
  • Add serde support to route these expressions through the JVM UDF bridge
  • Add Arrow schema normalization in the Rust JVM UDF executor (handles ListVector field naming differences between Arrow Java and DataFusion; see the sketch after this list)
  • Reorganize SQL test files: separate files per engine (*_rust.sql, *_rust_enabled.sql, *_java.sql)
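
For context on the ListVector normalization above, here is a small standalone Java snippet (illustrative only, not Comet code) showing the mismatch it has to bridge: Arrow Java names a list's child vector "$data$" by default, while arrow-rs/DataFusion default to "item", so list schemas crossing the JNI boundary need their child fields renamed before they compare equal.

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.complex.ListVector;
import org.apache.arrow.vector.types.Types;
import org.apache.arrow.vector.types.pojo.FieldType;

public class ListChildNameDemo {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator();
         ListVector list = ListVector.empty("col", allocator)) {
      list.addOrGetVector(FieldType.nullable(Types.MinorType.INT.getType()));
      // Arrow Java's default child field name is "$data$"; arrow-rs uses
      // "item", so a schema produced on the Java side will not match
      // DataFusion's expectation without renaming the child field.
      System.out.println(list.getDataVector().getField().getName());
    }
  }
}
```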

How are these changes tested?

  • CometRegExpJvmSuite: 45 tests covering all regexp expressions
  • 12 SQL test files covering both Java and Rust engine paths
  • Existing CometExpressionSuite and CometStringExpressionSuite tests continue to pass

Also fix CometArrayExpressionSuite compilation by qualifying the Spark
udf() call, which was shadowed by the new org.apache.comet.udf package.

Implements a DataFusion PhysicalExpr that evaluates child expressions,
exports the results as Arrow FFI arrays, calls CometUdfBridge.evaluate()
via JNI, and imports the output array.

Adds datafusion-comet-jni-bridge as a dependency of the spark-expr crate.
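
As a rough illustration of the Java half of that round trip, the sketch below uses Arrow's C Data Interface classes to import the input array from a raw address, apply a UDF, and export the result back through out-parameter addresses. The method shape, parameter list, and applyUdf helper are assumptions for illustration, not Comet's actual bridge signature.

```java
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.FieldVector;

final class UdfBridgeSketch {
  // Hypothetical JNI entry point: the Rust PhysicalExpr passes addresses of
  // exported ArrowArray/ArrowSchema structs; we import them, evaluate, and
  // export the result back through the out-parameter addresses.
  static void evaluate(BufferAllocator allocator, String udfClass,
                       long inArrayAddr, long inSchemaAddr,
                       long outArrayAddr, long outSchemaAddr) throws Exception {
    try (ArrowArray inArray = ArrowArray.wrap(inArrayAddr);
         ArrowSchema inSchema = ArrowSchema.wrap(inSchemaAddr);
         FieldVector input = Data.importVector(allocator, inArray, inSchema, null)) {
      try (FieldVector result = applyUdf(udfClass, input)) { // hypothetical helper
        Data.exportVector(allocator, result, null,
            ArrowArray.wrap(outArrayAddr), ArrowSchema.wrap(outSchemaAddr));
      }
    }
  }

  private static FieldVector applyUdf(String udfClass, FieldVector input) {
    throw new UnsupportedOperationException("illustration only");
  }
}
```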
…UDF class via context classloader

Wrap the JNI body in try/finally so input ValueVectors and the result vector
are always closed, even when the UDF or arrow export throws. Resolve the
CometUDF class through the thread context classloader so user-supplied UDF
jars (added via spark.jars) are visible from the bridge.
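
A condensed sketch of both fixes, with hypothetical names (the real bridge differs):

```java
import java.util.List;
import org.apache.arrow.vector.ValueVector;

final class BridgeCallSketch {
  interface CometUdfSketch { // stand-in for the real CometUDF interface
    ValueVector evaluate(List<ValueVector> inputs);
  }

  static void evaluateAndExport(String udfClassName, List<ValueVector> inputs,
                                long outAddr) throws Exception {
    // Resolve through the context classloader so jars added via spark.jars
    // are visible from the bridge, not just the application classpath.
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    Class<?> cls = Class.forName(udfClassName, true, loader);
    ValueVector result = null;
    try {
      CometUdfSketch udf = (CometUdfSketch) cls.getDeclaredConstructor().newInstance();
      result = udf.evaluate(inputs);
      exportToNative(result, outAddr); // stand-in for the Arrow FFI export
    } finally {
      // Always release Arrow memory, even if evaluate() or the export throws.
      for (ValueVector v : inputs) v.close();
      if (result != null) result.close();
    }
  }

  private static void exportToNative(ValueVector result, long outAddr) {
    throw new UnsupportedOperationException("illustration only");
  }
}
```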
…ns fall back to Spark

When routing RLike through the JVM UDF, reject Literal(null) and patterns
that fail Pattern.compile during planning. Both cases now produce withInfo
+ None, letting Spark evaluate the expression instead of crashing the
executor task with PatternSyntaxException or NullPointerException.
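
The planning-side check lives in Scala in Comet; this Java sketch shows the equivalent logic (helper name is hypothetical):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class PatternPlanningCheck {
  // Returns true only when the pattern can safely run on the JVM UDF path;
  // otherwise the planner emits withInfo + None and Spark evaluates it.
  static boolean canOffloadPattern(String pattern) {
    if (pattern == null) {
      return false; // Literal(null): fall back rather than risk an NPE in the task
    }
    try {
      Pattern.compile(pattern);
      return true;
    } catch (PatternSyntaxException e) {
      return false; // invalid pattern: fall back rather than crash the executor
    }
  }
}
```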
Make comet_udf_bridge an Option in JVMClasses so a missing
org.apache.comet.udf.CometUdfBridge class (e.g. shading dropped
org.apache.comet.udf.*) no longer crashes executor JVM init. The
JVM-UDF dispatch path returns a clear ExecutionError when the bridge
is unavailable. Also clarify the FFI lifetime contract on the result
import.
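
The actual lookup happens in Rust-side JNI init; the Java analogue below just shows the failure mode being absorbed rather than propagated:

```java
import java.util.Optional;

final class BridgeLookupSketch {
  // If shading (or anything else) removed org.apache.comet.udf.CometUdfBridge,
  // return empty instead of failing JVM init; the JVM-UDF dispatch path then
  // reports a clear ExecutionError when the bridge is actually needed.
  static Optional<Class<?>> loadUdfBridge() {
    try {
      return Optional.of(Class.forName("org.apache.comet.udf.CometUdfBridge"));
    } catch (ClassNotFoundException | LinkageError e) {
      return Optional.empty();
    }
  }
}
```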
Replace string literals "rust"/"java" used for the regexp engine selector
with named constants on CometConf. Tighten CometRLike.getSupportLevel so
it only reports Compatible(None) when the pattern is a Literal, matching
the actual constraint enforced by the convert path.
Literal-folded children no longer get expanded to batch-row count before
crossing JNI; ColumnarValue::Scalar is materialized at length 1, avoiding
an O(rows) copy of values that never vary across the batch. Document the
contract on CometUDF: scalar inputs arrive as length-1 vectors, vector
inputs at the batch row count, and the result must match the longest
input.
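
That contract reduces to a single indexing rule on the UDF side; a sketch (helper name is hypothetical):

```java
import org.apache.arrow.vector.ValueVector;

final class ScalarBroadcastSketch {
  // Per the CometUDF contract: scalar inputs arrive as length-1 vectors, so a
  // UDF reads index 0 for them on every row, while full-length inputs are
  // read positionally. The result must match the longest input.
  static int inputIndex(ValueVector input, int row) {
    return input.getValueCount() == 1 ? 0 : row;
  }
}
```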
andygrove added 9 commits May 1, 2026 09:42

Add return_nullable to JvmScalarUdf proto and plumb it through to
JvmScalarUdfExpr::nullable. Use the source RLike's own nullable on
the convert path so downstream operators can apply null-aware
optimizations instead of assuming every JVM-UDF call may yield null.

Also clean up Hash/PartialEq to delegate to the children's PhysicalExpr
impls (matching SubstringExpr) rather than stringifying via Display, and
print children using Display in the EXPLAIN output.

The bridge now rejects UDFs that return a vector whose length does not
match the longest input. Previously a buggy UDF returning a shorter or
longer vector would silently corrupt query results.
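
A sketch of what that bridge-side validation can look like (assumed shape, not the exact code):

```java
import java.util.List;
import org.apache.arrow.vector.ValueVector;

final class ResultLengthCheck {
  // Reject results whose length differs from the longest input instead of
  // letting a buggy UDF silently corrupt the batch.
  static void checkResultLength(List<ValueVector> inputs, ValueVector result) {
    int expected = 0;
    for (ValueVector in : inputs) {
      expected = Math.max(expected, in.getValueCount());
    }
    if (result.getValueCount() != expected) {
      throw new IllegalStateException(
          "UDF returned " + result.getValueCount() + " rows, expected " + expected);
    }
  }
}
```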
Lift the engine selector into the user-facing reason string so the
generated compatibility guide reads the same regardless of which mode
is the project default. Future Java-engine-specific incompatibilities
can be appended with the matching qualifier.

The per-thread CometUDF instance cache (CometUdfBridge.INSTANCES) and
the per-instance Pattern cache in RegExpLikeUDF were unbounded, so a
workload that registers many UDF classes or evaluates many distinct
regex patterns would retain entries for the executor's JVM lifetime.
Both caches now use access-order LinkedHashMap with removeEldestEntry
bounds (64 UDF instances, 128 patterns).
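
The removeEldestEntry idiom referred to here looks like this (a sketch using the 128-pattern bound from the message; the real cache shape may differ):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class PatternCacheSketch {
  private static final int MAX_PATTERNS = 128; // bound from the commit message

  // Access-order LinkedHashMap evicts the least-recently-used entry once the
  // bound is exceeded, keeping the cache size constant under churn.
  private final Map<String, Pattern> cache =
      new LinkedHashMap<String, Pattern>(16, 0.75f, /* accessOrder */ true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
          return size() > MAX_PATTERNS;
        }
      };

  Pattern get(String regex) {
    return cache.computeIfAbsent(regex, Pattern::compile);
  }
}
```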
CI fails the unit suite with 'Memory was leaked by query (98304/147456)'
when RegExpLikeUDFSuite runs after CometRegExpJvmSuite in the same
surefire JVM (Spark 3.5 Scala 2.13 / Spark 4.x). The per-test
RootAllocator only ever held the input vectors anyway: the UDF
allocates its output BitVector from CometArrowAllocator, which is
where leak detection actually needs to happen.

Allocate inputs from CometArrowAllocator alongside the output, and
verify each test cleans up by snapshotting the allocator's outstanding
bytes before/after the body.
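
The before/after snapshot is a one-liner with Arrow's BufferAllocator API; a sketch of the per-test check, where the allocator argument stands in for CometArrowAllocator:

```java
import org.apache.arrow.memory.BufferAllocator;

final class LeakCheckSketch {
  // Snapshot outstanding bytes around the test body; any difference means the
  // body leaked Arrow memory from this allocator.
  static void runWithLeakCheck(BufferAllocator allocator, Runnable body) {
    long before = allocator.getAllocatedMemory();
    body.run();
    long after = allocator.getAllocatedMemory();
    if (after != before) {
      throw new AssertionError("test leaked " + (after - before) + " bytes");
    }
  }
}
```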
The unit suite imports `org.apache.arrow.memory.RootAllocator` and
`org.apache.comet.CometArrowAllocator` from the spark module's test
sources, but at runtime the spark module loads the shaded
comet-common.jar from the local Maven repo. That JAR has
`package$.CometArrowAllocator()` returning the relocated
`org.apache.comet.shaded.arrow.memory.RootAllocator`, while the test
bytecode expects unshaded `org.apache.arrow.memory.RootAllocator`,
producing NoSuchMethodError. The earlier per-test RootAllocator
variant hit a different failure mode (Arrow leak detection).

End-to-end coverage of the JVM regex path is provided by
CometRegExpJvmSuite, which exercises Java-only constructs
(backreference, lookahead, lookbehind, embedded flags, named
groups), null handling, empty inputs, and multi-batch flows.
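
For reference, illustrative Java patterns for the constructs listed above (not the suite's actual test data):

```java
import java.util.regex.Pattern;

final class JavaRegexConstructExamples {
  static final Pattern BACKREF    = Pattern.compile("(\\w+) \\1");       // backreference
  static final Pattern LOOKAHEAD  = Pattern.compile("foo(?=bar)");       // lookahead
  static final Pattern LOOKBEHIND = Pattern.compile("(?<=foo)bar");      // lookbehind
  static final Pattern FLAGS      = Pattern.compile("(?i)hello");        // embedded flag
  static final Pattern NAMED      = Pattern.compile("(?<year>\\d{4})");  // named group
}
```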
Implement regexp_extract, regexp_extract_all, regexp_instr,
regexp_replace, and split using the JVM scalar UDF framework.
Change default regexp engine from rust to java for full
compatibility with Spark's Java regex semantics.