
feat: add user-facing CometUDF registration for custom JVM UDFs#4233

Draft
andygrove wants to merge 6 commits into apache:main from andygrove:user-jvm-udf

Conversation

@andygrove
Member

Which issue does this PR close?

Part of #4193

Builds on #4232 (JVM UDF framework)

Rationale for this change

This PR enables end users to provide their own CometUDF implementations that operate on Arrow columnar data, registered alongside standard Spark UDFs. When Comet encounters a matching UDF during planning, it routes to the vectorized Arrow implementation instead of falling back to Spark's row-at-a-time execution.

What changes are included in this PR?

  • CometUdfRegistry — a thread-safe registry mapping Spark UDF names to CometUDF implementation class names + metadata. Includes a convenience method that registers both with Spark and Comet in one call.
  • CometScalaUdf serde handler — intercepts ScalaUDF expressions in query planning; if the UDF name is registered in CometUdfRegistry, emits a JvmScalarUdf proto for native execution.
  • User guide page (custom-jvm-udfs.md) — documents how to write, register, and deploy custom JVM UDFs.

User-facing API:

// Register the Spark UDF (row-at-a-time fallback)
spark.udf.register("is_positive", (x: Int) => x > 0)

// Register the CometUDF (vectorized Arrow implementation)
CometUdfRegistry.register(
  "is_positive",
  "com.example.IsPositiveUdf",
  BooleanType,
  nullable = true
)
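
For context, a minimal sketch of what the vectorized implementation referenced above (com.example.IsPositiveUdf) might look like. The CometUDF trait shown here is an assumed shape of the interface introduced in #4232; the real method name and signature may differ.

package com.example

import org.apache.arrow.memory.BufferAllocator
import org.apache.arrow.vector.{BitVector, IntVector, ValueVector}

// Assumed interface shape; see #4232 for the actual definition.
trait CometUDF {
  def evaluate(args: Array[ValueVector], allocator: BufferAllocator): ValueVector
}

// Hypothetical vectorized is_positive: reads an IntVector, writes a BitVector,
// preserving nulls.
class IsPositiveUdf extends CometUDF {
  override def evaluate(args: Array[ValueVector], allocator: BufferAllocator): ValueVector = {
    val input = args(0).asInstanceOf[IntVector]
    val rows = input.getValueCount
    val result = new BitVector("is_positive", allocator)
    result.allocateNew(rows)
    var i = 0
    while (i < rows) {
      if (input.isNull(i)) result.setNull(i)
      else result.setSafe(i, if (input.get(i) > 0) 1 else 0)
      i += 1
    }
    result.setValueCount(rows)
    result
  }
}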

How are these changes tested?

  • JVM compilation verified (mvn compile passes for common + spark modules)
  • End-to-end testing will come in a follow-up PR with a concrete UDF example

Test plan

  • Verify CometUdfRegistry.register + ScalaUDF interception emits JvmScalarUdf proto
  • Verify fallback to Spark when the UDF is not registered (a sketch follows this list)
  • Verify user guide renders correctly
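
A hedged sketch of the fallback check from the plan above, reusing the sql and checkSparkAnswer helpers that appear in the diff excerpts later in this thread; the triple_int UDF is the hypothetical example used there.

test("user CometUDF - fallback when UDF is not registered") {
  // Register only the Spark UDF; there is no matching CometUdfRegistry entry
  spark.udf.register("triple_int", (x: Int) => x * 3)
  sql("CREATE TABLE t (x INT) USING parquet")
  sql("INSERT INTO t VALUES (1), (2), (3)")
  // Should still produce correct results via Spark fallback
  checkSparkAnswer(sql("SELECT triple_int(x) FROM t"))
}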

🤖 Generated with Claude Code

andygrove and others added 3 commits May 5, 2026 15:26
Add a framework that allows Comet to invoke JVM-side UDF implementations
operating on Arrow data via JNI, avoiding expensive fallback to Spark while
maintaining 100% Spark compatibility for expressions not yet implemented
natively in Rust.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add CometUdfRegistry that allows end users to register their own CometUDF
implementations to be accelerated by Comet's native execution. When a ScalaUDF
is encountered during planning whose name matches a registry entry, Comet emits
a JvmScalarUdf proto instead of falling back to Spark's row-at-a-time execution.

Also adds user guide documentation explaining how to write, register, and deploy
custom JVM UDFs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds end-to-end tests verifying:
- Basic CometUDF execution (integer doubling via Arrow vectors)
- Unregistered UDFs correctly fall back to Spark
- Multiple UDF invocations in a single query
- UDF combined with WHERE filter
- CometUdfRegistry API (register, lookup, remove)

Also fixes KnownNotNull unwrapping in CometScalaUdf: Spark wraps UDF
arguments in KnownNotNull when the UDF is non-nullable, and the wrapper must
be stripped before serializing the underlying expression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
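
As an illustration of the KnownNotNull fix described above, the unwrapping amounts to roughly the following; the exact hook inside CometScalaUdf is an assumption.

import org.apache.spark.sql.catalyst.expressions.{Expression, KnownNotNull}

object KnownNotNullUtil {
  // Spark wraps non-nullable UDF arguments in KnownNotNull; strip the wrapper
  // so the underlying child expression is the one that gets serialized.
  def unwrap(expr: Expression): Expression = expr match {
    case KnownNotNull(child) => child
    case other => other
  }
}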

test("user CometUDF - basic integer doubling") {
CometUdfRegistry.register(
Contributor


Non-critical, just from dev experience: we could probably have a single facade method that registers the function in all registries.

sql("CREATE TABLE t (x INT) USING parquet")
sql("INSERT INTO t VALUES (1), (2), (3)")
// Should still produce correct results via Spark fallback
checkSparkAnswer(sql("SELECT triple_int(x) FROM t"))
Contributor


Should we also check the fallback message?

andygrove added 3 commits May 5, 2026 18:25
Move the DoubleIntUdf test fixture from spark/src/test/ to common/src/main/
so that its bytecode references to org.apache.arrow are relocated by common's
shade plugin to org.apache.comet.shaded.arrow, matching the shaded CometUDF
interface that user code sees at runtime. A test-scope class in spark/ was
compiled against common/target/classes (unshaded) due to Maven workspace
resolution and failed at runtime with AbstractMethodError when dispatched
through the shaded interface.

Update the user-guide page to import Arrow from org.apache.comet.shaded.arrow,
which is the package real users compile against in the published comet-spark
JAR.
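
For example, user code compiled against the published comet-spark JAR would import Arrow classes from the relocated package (illustrative line, not taken verbatim from the guide):

// Shaded package seen by users of the published JAR
import org.apache.comet.shaded.arrow.vector.IntVector
// rather than: import org.apache.arrow.vector.IntVector
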
The PR's new ScalaUDF dispatch in QueryPlanSerde changes the fallback message
emitted for an anonymous (no-name) UDF from the generic "scalaudf is not
supported" to "ScalaUDF has no name, cannot look up CometUDF registration".
Update the test's expected fallback reasons accordingly.