
feat: add user-facing CometUDF registration for custom JVM UDFs#4233

Draft
andygrove wants to merge 6 commits into apache:main from andygrove:user-jvm-udf

Conversation

@andygrove
Member

Which issue does this PR close?

Part of #4193

Builds on #4232 (JVM UDF framework)

Rationale for this change

This PR enables end users to provide their own CometUDF implementations that operate on Arrow columnar data, registered alongside standard Spark UDFs. When Comet encounters a matching UDF during planning, it routes to the vectorized Arrow implementation instead of falling back to Spark's row-at-a-time execution.

What changes are included in this PR?

  • CometUdfRegistry — a thread-safe registry mapping Spark UDF names to CometUDF implementation class names + metadata. Includes a convenience method that registers both with Spark and Comet in one call.
  • CometScalaUdf serde handler — intercepts ScalaUDF expressions in query planning; if the UDF name is registered in CometUdfRegistry, emits a JvmScalarUdf proto for native execution.
  • User guide page (custom-jvm-udfs.md) — documents how to write, register, and deploy custom JVM UDFs.

User-facing API:

// Register the Spark UDF (row-at-a-time fallback)
spark.udf.register("is_positive", (x: Int) => x > 0)

// Register the CometUDF (vectorized Arrow implementation)
CometUdfRegistry.register(
  "is_positive",
  "com.example.IsPositiveUdf",
  BooleanType,
  nullable = true
)
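
For context, a minimal sketch of what the vectorized implementation referenced above (com.example.IsPositiveUdf) might look like. The CometUDF trait shown here is an assumed shape of the interface introduced in #4232; the real method name and signature may differ.

package com.example

import org.apache.arrow.memory.BufferAllocator
import org.apache.arrow.vector.{BitVector, IntVector, ValueVector}

// Assumed interface shape; see #4232 for the actual definition.
trait CometUDF {
  def evaluate(args: Array[ValueVector], allocator: BufferAllocator): ValueVector
}

// Hypothetical vectorized is_positive: reads an IntVector, writes a BitVector,
// preserving nulls.
class IsPositiveUdf extends CometUDF {
  override def evaluate(args: Array[ValueVector], allocator: BufferAllocator): ValueVector = {
    val input = args(0).asInstanceOf[IntVector]
    val rows = input.getValueCount
    val result = new BitVector("is_positive", allocator)
    result.allocateNew(rows)
    var i = 0
    while (i < rows) {
      if (input.isNull(i)) result.setNull(i)
      else result.setSafe(i, if (input.get(i) > 0) 1 else 0)
      i += 1
    }
    result.setValueCount(rows)
    result
  }
}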

How are these changes tested?

  • JVM compilation verified (mvn compile passes for common + spark modules)
  • End-to-end testing will come in a follow-up PR with a concrete UDF example

Test plan

  • Verify CometUdfRegistry.register + ScalaUDF interception emits JvmScalarUdf proto
  • Verify fallback to Spark when the UDF is not registered (a sketch follows this list)
  • Verify user guide renders correctly
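
A hedged sketch of the fallback check from the plan above, reusing the sql and checkSparkAnswer helpers that appear in the diff excerpts later in this thread; the triple_int UDF is the hypothetical example used there.

test("user CometUDF - fallback when UDF is not registered") {
  // Register only the Spark UDF; there is no matching CometUdfRegistry entry
  spark.udf.register("triple_int", (x: Int) => x * 3)
  sql("CREATE TABLE t (x INT) USING parquet")
  sql("INSERT INTO t VALUES (1), (2), (3)")
  // Should still produce correct results via Spark fallback
  checkSparkAnswer(sql("SELECT triple_int(x) FROM t"))
}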

🤖 Generated with Claude Code

andygrove and others added 3 commits May 5, 2026 15:26
Add a framework that allows Comet to invoke JVM-side UDF implementations
operating on Arrow data via JNI, avoiding expensive fallback to Spark while
maintaining 100% Spark compatibility for expressions not yet implemented
natively in Rust.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add CometUdfRegistry that allows end users to register their own CometUDF
implementations to be accelerated by Comet's native execution. When a ScalaUDF
is encountered during planning whose name matches a registry entry, Comet emits
a JvmScalarUdf proto instead of falling back to Spark's row-at-a-time execution.

Also adds user guide documentation explaining how to write, register, and deploy
custom JVM UDFs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds end-to-end tests verifying:
- Basic CometUDF execution (integer doubling via Arrow vectors)
- Unregistered UDFs correctly fall back to Spark
- Multiple UDF invocations in a single query
- UDF combined with WHERE filter
- CometUdfRegistry API (register, lookup, remove)

Also fixes KnownNotNull unwrapping in CometScalaUdf: Spark wraps UDF
arguments in KnownNotNull when the UDF is non-nullable, and the wrapper must
be stripped before serializing the underlying expression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
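
As an illustration of the KnownNotNull fix described above, the unwrapping amounts to roughly the following; the exact hook inside CometScalaUdf is an assumption.

import org.apache.spark.sql.catalyst.expressions.{Expression, KnownNotNull}

object KnownNotNullUtil {
  // Spark wraps non-nullable UDF arguments in KnownNotNull; strip the wrapper
  // so the underlying child expression is the one that gets serialized.
  def unwrap(expr: Expression): Expression = expr match {
    case KnownNotNull(child) => child
    case other => other
  }
}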

test("user CometUDF - basic integer doubling") {
CometUdfRegistry.register(
Contributor


Non-critical, just from dev experience: we could probably have a single facade method that registers the function in all registries.

sql("CREATE TABLE t (x INT) USING parquet")
sql("INSERT INTO t VALUES (1), (2), (3)")
// Should still produce correct results via Spark fallback
checkSparkAnswer(sql("SELECT triple_int(x) FROM t"))
Contributor


Should we also check the fallback message?

andygrove added 3 commits May 5, 2026 18:25
Move the DoubleIntUdf test fixture from spark/src/test/ to common/src/main/
so that its bytecode references to org.apache.arrow are relocated by common's
shade plugin to org.apache.comet.shaded.arrow, matching the shaded CometUDF
interface that user code sees at runtime. A test-scope class in spark/ was
compiled against common/target/classes (unshaded) due to Maven workspace
resolution and failed at runtime with AbstractMethodError when dispatched
through the shaded interface.

Update the user-guide page to import Arrow from org.apache.comet.shaded.arrow,
which is the package real users compile against in the published comet-spark
JAR.
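
For example, user code compiled against the published comet-spark JAR would import Arrow classes from the relocated package (illustrative line, not taken verbatim from the guide):

// Shaded package seen by users of the published JAR
import org.apache.comet.shaded.arrow.vector.IntVector
// rather than: import org.apache.arrow.vector.IntVector
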
The PR's new ScalaUDF dispatch in QueryPlanSerde changes the fallback message
emitted for an anonymous (no-name) UDF from the generic "scalaudf is not
supported" to "ScalaUDF has no name, cannot look up CometUDF registration".
Update the test's expected fallback reasons accordingly.