feat: add JVM UDF framework for native execution #4232
andygrove wants to merge 2 commits into apache:main
Add a framework that allows Comet to invoke JVM-side UDF implementations operating on Arrow data via JNI, avoiding expensive fallback to Spark while maintaining 100% Spark compatibility for expressions not yet implemented natively in Rust. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
comphead left a comment:
Btw @andygrove can we use this framework for regexp udfs?
Yes, there is an example in #4170. It is perfect for regexp because we get 100% compatibility with almost no effort, enabled by default.
I'm also wondering whether we can use this framework for user UDFs 🤔 Currently it is a huge drawback in Comet that we fall back for user-defined functions, since there is no way to transpile custom user code to the native side. Could this framework be offered to users as an alternative? Depending on UDF complexity, it may or may not be easy to rewrite custom user code from a Spark UDF to a Comet Java UDF. For example, I anticipate some problems if the user works at the row level, i.e. updates specific values in a row, which might be more complicated in Arrow Java, but it is still promising.
I am already working on enabling this in #4233.
Which issue does this PR close?
Part of #4193
Rationale for this change
This PR adds the core JVM UDF framework that enables Comet to invoke JVM-side UDF implementations operating on Arrow data via JNI. This allows us to quickly implement expressions with 100% Spark compatibility without re-implementing them in native Rust code: we call existing Java/Spark code but operate on Arrow data, which avoids an expensive fallback to Spark.
What changes are included in this PR?
The framework consists of:
JVM side:
- CometUDF trait: interface that JVM UDF implementations must satisfy
- CometUdfBridge: JNI entry point that native execution calls to invoke a UDF; handles class instantiation caching, Arrow FFI import/export, and result validation
- CometLambdaRegistry: thread-safe registry bridging plan-time Spark expressions to execution-time UDF lookup
Native (Rust) side:
- JvmScalarUdfExpr: DataFusion PhysicalExpr that delegates evaluation to a JVM-side CometUDF via JNI and the Arrow C Data Interface
- CometUdfBridge JNI handle in jni-bridge: caches class/method references
- JvmScalarUdf protobuf message: serde format for transmitting UDF invocations from plan to execution
Planner integration:
- ExprStruct::JvmScalarUdf handling in the native planner
This is the framework only; individual expression implementations (e.g., array_exists) will be added in follow-up PRs.
How are these changes tested?
cargo check passes for all affected crates.
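For readers unfamiliar with the JVM-side pieces listed under "What changes are included in this PR?", the following is a minimal, dependency-free sketch of the two ideas named there: a thread-safe registry that hands a key from plan time to execution time, and a cache of UDF instances keyed by class name. It is not the code in this PR. Every name here (`ColumnUdf`, `Registry`, `InstanceCache`, `AddOne`) is hypothetical, and plain `long[]` arrays stand in for the Arrow vectors that the real `CometUDF` operates on.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class UdfRegistrySketch {
    /** Hypothetical stand-in for a columnar UDF; long[] replaces Arrow vectors here. */
    interface ColumnUdf {
        long[] evaluate(long[] input);
    }

    /** Thread-safe registry: register at plan time, look up by key at execution time. */
    static final class Registry {
        private final Map<String, ColumnUdf> entries = new ConcurrentHashMap<>();

        String register(ColumnUdf udf) {
            String key = UUID.randomUUID().toString();
            entries.put(key, udf);
            return key;
        }

        ColumnUdf lookup(String key) {
            ColumnUdf udf = entries.get(key);
            if (udf == null) {
                throw new IllegalStateException("no UDF registered for key " + key);
            }
            return udf;
        }

        void remove(String key) {
            entries.remove(key);
        }
    }

    /** Caches one instance per class name so repeated calls skip reflection. */
    static final class InstanceCache {
        private final Map<String, ColumnUdf> instances = new ConcurrentHashMap<>();

        ColumnUdf get(String className) {
            return instances.computeIfAbsent(className, name -> {
                try {
                    return (ColumnUdf) Class.forName(name)
                            .getDeclaredConstructor()
                            .newInstance();
                } catch (ReflectiveOperationException e) {
                    throw new IllegalArgumentException("cannot instantiate " + name, e);
                }
            });
        }
    }

    /** Example UDF: adds one to every element of the input column. */
    public static final class AddOne implements ColumnUdf {
        @Override
        public long[] evaluate(long[] input) {
            long[] out = new long[input.length];
            for (int i = 0; i < input.length; i++) {
                out[i] = input[i] + 1;
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        String key = registry.register(new AddOne());
        System.out.println(Arrays.toString(registry.lookup(key).evaluate(new long[] {1, 2, 3})));

        InstanceCache cache = new InstanceCache();
        ColumnUdf first = cache.get(AddOne.class.getName());
        ColumnUdf second = cache.get(AddOne.class.getName());
        System.out.println(first == second); // same cached instance is reused
    }
}
```

The sketch uses `ConcurrentHashMap.computeIfAbsent` so that concurrent callers asking for the same class share one instance; whether the PR's bridge uses the same mechanism is an assumption.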