feat: add all Spark regexp expressions via JVM UDF framework#4239
Draft
andygrove wants to merge 39 commits into apache:main from
Conversation
Also fix CometArrayExpressionSuite compilation by qualifying the Spark udf() call, which was shadowed by the new org.apache.comet.udf package.
Implements a DataFusion PhysicalExpr that evaluates child expressions, exports the results as Arrow FFI arrays, calls CometUdfBridge.evaluate() via JNI, and imports the output array. Adds datafusion-comet-jni-bridge as a dependency of the spark-expr crate.
…UDF class via context classloader
Wrap the JNI body in try/finally so input ValueVectors and the result vector are always closed, even when the UDF or Arrow export throws. Resolve the CometUDF class through the thread context classloader so user-supplied UDF jars (added via spark.jars) are visible from the bridge.
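The cleanup and classloader contract can be sketched as follows. This is an illustrative stand-in, not the actual Comet bridge code: `TrackedVector`, `evaluate`, and `loadUdfClass` are hypothetical names, with a plain object standing in for an Arrow ValueVector.

```java
import java.util.List;

public class BridgeCleanupSketch {
    // Hypothetical stand-in for an Arrow ValueVector whose buffers must be released.
    public static class TrackedVector {
        public boolean closed = false;
        public void close() { closed = true; }
    }

    // Run the UDF body, always closing the inputs afterward, even when
    // the body throws (the try/finally pattern described in the commit).
    public static void evaluate(List<TrackedVector> inputs, Runnable udfBody) {
        try {
            udfBody.run();
        } finally {
            for (TrackedVector v : inputs) {
                v.close();
            }
        }
    }

    // Resolve a UDF class through the thread context classloader so jars
    // added via spark.jars are visible from the bridge.
    public static Class<?> loadUdfClass(String name) throws ClassNotFoundException {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = BridgeCleanupSketch.class.getClassLoader();
        }
        return Class.forName(name, true, cl);
    }
}
```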
…ns fall back to Spark
When routing RLike through the JVM UDF, reject Literal(null) and patterns that fail Pattern.compile during planning. Both cases now produce withInfo + None, letting Spark evaluate the expression instead of crashing the executor task with PatternSyntaxException or NullPointerException.
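The planning-time guard boils down to a small check; a minimal sketch, where `tryCompile` is an illustrative helper (not the actual Comet API) that returns empty for both reject cases so the planner can fall back to Spark:

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PlanTimeRegexCheck {
    // Empty result means: do not route to the JVM UDF, let Spark evaluate.
    public static Optional<Pattern> tryCompile(String pattern) {
        if (pattern == null) {
            return Optional.empty(); // Literal(null): fall back to Spark
        }
        try {
            return Optional.of(Pattern.compile(pattern));
        } catch (PatternSyntaxException e) {
            return Optional.empty(); // malformed pattern: fall back to Spark
        }
    }
}
```

Compiling at planning time moves the failure from a per-task executor exception to a one-time planner decision.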
Make comet_udf_bridge an Option in JVMClasses so a missing org.apache.comet.udf.CometUdfBridge class (e.g. shading dropped org.apache.comet.udf.*) no longer crashes executor JVM init. The JVM-UDF dispatch path returns a clear ExecutionError when the bridge is unavailable. Also clarify the FFI lifetime contract on the result import.
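The Optional lookup pattern can be sketched as below. `findBridge` is an illustrative helper, not the real JVMClasses code; the idea is that a missing class becomes an empty Optional checked at dispatch time rather than an exception during JVM init.

```java
import java.util.Optional;

public class OptionalBridgeLookup {
    public static Optional<Class<?>> findBridge(String className) {
        try {
            return Optional.of(Class.forName(className));
        } catch (ClassNotFoundException | LinkageError e) {
            // e.g. shading dropped the package: defer to dispatch time,
            // where a clear ExecutionError can be raised, instead of
            // crashing executor JVM init here.
            return Optional.empty();
        }
    }
}
```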
Replace string literals "rust"/"java" used for the regexp engine selector with named constants on CometConf. Tighten CometRLike.getSupportLevel so it only reports Compatible(None) when the pattern is a Literal, matching the actual constraint enforced by the convert path.
Literal-folded children no longer get expanded to batch-row count before crossing JNI; ColumnarValue::Scalar is materialized at length 1, avoiding an O(rows) copy of values that never vary across the batch. Document the contract on CometUDF: scalar inputs arrive as length-1 vectors, vector inputs at the batch row count, and the result must match the longest input.
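From a UDF author's point of view, the contract above can be sketched like this: a length-1 input is a folded scalar, logically broadcast across the batch, while a full-length input is indexed per row. Plain String arrays stand in for Arrow vectors here, and all names are illustrative.

```java
public class ScalarBroadcastSketch {
    // Read row i, broadcasting a length-1 (scalar) input.
    static String at(String[] input, int row) {
        return input.length == 1 ? input[0] : input[row];
    }

    // Example UDF body: per-row concat. The result length must match the
    // longest input, i.e. the batch row count.
    public static String[] concat(String[] left, String[] right, int numRows) {
        String[] out = new String[numRows];
        for (int i = 0; i < numRows; i++) {
            out[i] = at(left, i) + at(right, i);
        }
        return out;
    }
}
```

The length-1 representation is what avoids the O(rows) copy: the scalar crosses JNI once instead of being expanded to the batch row count.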
Add return_nullable to JvmScalarUdf proto and plumb it through to JvmScalarUdfExpr::nullable. Use the source RLike's own nullable on the convert path so downstream operators can apply null-aware optimizations instead of assuming every JVM-UDF call may yield null. Also clean up Hash/PartialEq to delegate to the children's PhysicalExpr impls (matching SubstringExpr) rather than stringifying via Display, and print children using Display in the EXPLAIN output.
The bridge now rejects UDFs that return a vector whose length does not match the longest input. Previously a buggy UDF returning a shorter or longer vector would silently corrupt query results.
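A minimal sketch of that validation, with plain ints standing in for Arrow vector lengths (class and method names are illustrative):

```java
public class ResultLengthCheck {
    public static int expectedLength(int[] inputLengths) {
        int max = 0;
        for (int len : inputLengths) {
            max = Math.max(max, len);
        }
        return max;
    }

    // Reject results shorter or longer than the longest input, which
    // would otherwise silently corrupt query output.
    public static void validate(int resultLength, int[] inputLengths) {
        int expected = expectedLength(inputLengths);
        if (resultLength != expected) {
            throw new IllegalStateException(
                "UDF returned " + resultLength + " rows, expected " + expected);
        }
    }
}
```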
Lift the engine selector into the user-facing reason string so the generated compatibility guide reads the same regardless of which mode is the project default. Future Java-engine-specific incompatibilities can be appended with the matching qualifier.
The per-thread CometUDF instance cache (CometUdfBridge.INSTANCES) and the per-instance Pattern cache in RegExpLikeUDF were unbounded, so a workload that registers many UDF classes or evaluates many distinct regex patterns would retain entries for the executor's JVM lifetime. Both caches now use access-order LinkedHashMap with removeEldestEntry bounds (64 UDF instances, 128 patterns).
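The bounded-cache technique named above is standard Java: an access-order LinkedHashMap whose `removeEldestEntry` override evicts the least recently used entry once the bound is exceeded. A generic sketch (the class below is illustrative, not the Comet code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries; // e.g. 64 UDF instances or 128 patterns

    public BoundedLruCache(int maxEntries) {
        // accessOrder = true: get() moves an entry to the most-recent end,
        // so the eldest entry is the least recently *used* one.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

Because both caches were keyed by unbounded user input (UDF class names, regex pattern strings), the fixed bound caps executor-lifetime retention at a constant.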
CI fails the unit suite with 'Memory was leaked by query (98304/147456)' when RegExpLikeUDFSuite runs after CometRegExpJvmSuite in the same surefire JVM (Spark 3.5 Scala 2.13 / Spark 4.x). The per-test RootAllocator only ever held the input vectors anyway: the UDF allocates its output BitVector from CometArrowAllocator, which is where leak detection actually needs to happen. Allocate inputs from CometArrowAllocator alongside the output, and verify each test cleans up by snapshotting the allocator's outstanding bytes before/after the body.
The unit suite imports `org.apache.arrow.memory.RootAllocator` and `org.apache.comet.CometArrowAllocator` from the spark module's test sources, but at runtime the spark module loads the shaded comet-common.jar from the local Maven repo. That JAR has `package$.CometArrowAllocator()` returning the relocated `org.apache.comet.shaded.arrow.memory.RootAllocator`, while the test bytecode expects unshaded `org.apache.arrow.memory.RootAllocator`, producing NoSuchMethodError. The earlier per-test RootAllocator variant hit a different failure mode (Arrow leak detection). End-to-end coverage of the JVM regex path is provided by CometRegExpJvmSuite, which exercises Java-only constructs (backreference, lookahead, lookbehind, embedded flags, named groups), null handling, empty inputs, and multi-batch flows.
Implement regexp_extract, regexp_extract_all, regexp_instr, regexp_replace, and split using the JVM scalar UDF framework. Change default regexp engine from rust to java for full compatibility with Spark's Java regex semantics.
Which issue does this PR close?
Closes #.
Rationale for this change
This PR extends the JVM UDF framework (introduced in #4232) to support all Spark regular expression functions with full Java regex compatibility. Previously only `rlike` was implemented.

Note: This PR is stacked on #4232 (JVM UDF framework) and should be reviewed/merged after that PR.
What changes are included in this PR?
- Implements `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, and `split` via the JVM scalar UDF framework
- Changes the default regexp engine from `rust` to `java` for full Spark compatibility (backreferences, lookaheads, embedded flags)
- Updates the engine-specific SQL test files (`*_rust.sql`, `*_rust_enabled.sql`, `*_java.sql`)

How are these changes tested?
- `CometRegExpJvmSuite`: 45 tests covering all regexp expressions
- Existing `CometExpressionSuite` and `CometStringExpressionSuite` tests continue to pass