Skip to content

feat: enable JVM Scala UDF codegen dispatch by default#4514

Merged
andygrove merged 5 commits into
apache:mainfrom
andygrove:enable-udf-codegen-default
May 30, 2026
Merged

feat: enable JVM Scala UDF codegen dispatch by default#4514
andygrove merged 5 commits into
apache:mainfrom
andygrove:enable-udf-codegen-default

Conversation

@andygrove
Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #4513.

Rationale for this change

Comet's Arrow-direct codegen dispatcher routes eligible Spark ScalaUDF expressions through native execution, keeping the project, exchange, and sort operators around a UDF on the Comet path instead of forcing a fall back to Spark and a columnar-to-row roundtrip. Until now this was gated behind spark.comet.exec.scalaUDF.codegen.enabled=false and documented as experimental.

The feature has broad type coverage (scalars, complex types with arbitrary nesting, higher-order functions) and is backed by end-to-end correctness, fuzz, and Iceberg test coverage. Enabling it by default lets users benefit without opt-in.

What changes are included in this PR?

  • Flip the default of spark.comet.exec.scalaUDF.codegen.enabled to true.
  • Drop the "Experimental" prefix from the config documentation.
  • Update the Scala/Java UDF and Iceberg user guides to state the dispatcher is enabled by default and document how to disable it.

Users can set spark.comet.exec.scalaUDF.codegen.enabled=false to restore the previous fall-back-to-Spark behavior.

How are these changes tested?

Covered by existing tests for the codegen dispatcher (CometCodegenSuite, CometCodegenFuzzSuite, CometCodegenHOFSuite, CometTemporalExpressionSuite, the datetime SQL file tests, and CometIcebergRewriteActionSuite), which exercise both the enabled and disabled paths.

Flip `spark.comet.exec.scalaUDF.codegen.enabled` to default `true` so that
eligible Spark `ScalaUDF` expressions are routed through Comet's Arrow-direct
codegen dispatcher without requiring opt-in. The feature is no longer marked
experimental.

Update the Scala/Java UDF and Iceberg user guides to reflect that the
dispatcher is on by default and document how to disable it.
@andygrove andygrove force-pushed the enable-udf-codegen-default branch from 87fbb79 to 1f5de3e Compare May 29, 2026 12:21
@mbutrovich
Copy link
Copy Markdown
Contributor

org.apache.comet.codegen.CometBatchKernelCodegen.compile has a logInfo which seemed appropriate to me for an experimental feature that wasn't on by default. It might be noisy now. Do we want to drop it down to debug, or leave it for now while we evaluate behavior?

Disable UDF codegen dispatch within the ArrayInsertUnsupportedArgs test so
the UDF-derived position stays non-convertible and continues to exercise the
ArrayInsert fallback path. With codegen now enabled by default, the UDF was
converted to proto, making ArrayInsert fully native and removing the expected
fallback reason.

Drop the per-compile log in CometBatchKernelCodegen from info to debug now
that codegen dispatch is on by default and the message would otherwise be
noisy.
@andygrove
Copy link
Copy Markdown
Member Author

org.apache.comet.codegen.CometBatchKernelCodegen.compile has a logInfo which seemed appropriate to me for an experimental feature that wasn't on by default. It might be noisy now. Do we want to drop it down to debug, or leave it for now while we evaluate behavior?

Good point. I dropped it down to logDebug.

Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really excited to see this code path exercised more! Thanks @andygrove!

@mbutrovich
Copy link
Copy Markdown
Contributor

I think you need the diffs from this commit (from when I had it enabled by default in #4267) 2be5f73.

@andygrove
Copy link
Copy Markdown
Member Author

I think you need the diffs from this commit (from when I had it enabled by default in #4267) 2be5f73.

Thanks @mbutrovich. I went with a different approach of disabling the codegen dispatch for these tests rather than changing the test expectations. Let me know what you think when you have time.

@mbutrovich
Copy link
Copy Markdown
Contributor

Thanks @mbutrovich. I went with a different approach of disabling the codegen dispatch for these tests rather than changing the test expectations. Let me know what you think when you have time.

Hm, assuming the tests checked the query results too I'd prefer to keep the feature enabled, since then we'd just be changing the plan structure checks. Not having dug into the tests, if it's just a plan structure check it's fine to disable the tests.

@andygrove
Copy link
Copy Markdown
Member Author

Thanks @mbutrovich. I went with a different approach of disabling the codegen dispatch for these tests rather than changing the test expectations. Let me know what you think when you have time.

Hm, assuming the tests checked the query results too I'd prefer to keep the feature enabled, since then we'd just be changing the plan structure checks. Not having dug into the tests, if it's just a plan structure check it's fine to disable the tests.

Thanks for the quick response. Ok, will revert to your approach.

…patch

Enabling the JVM Scala UDF codegen dispatcher by default routes supported
ScalaUDF expressions into native execution, changing observable behavior in two
upstream Spark tests. Rather than disabling the dispatcher for these tests,
adjust the expectations so the feature stays exercised:

- SQLQuerySuite "Common subexpression elimination": CometProject and
  CometHashAggregate do not implement cross-sibling subexpression elimination
  over ScalaUDF, so the aggregate case invokes the UDF once per reference
  (count 3 vs 1). The query result is unchanged. Tracking: apache#4516.
- SQLQueryTestSuite "udf/postgreSQL/udf-select_having.sql": the divide-by-zero
  in 1/udf(a) evaluates natively and surfaces as CometNativeException instead of
  SparkArithmeticException; normalize both to a DIVIDE_BY_ZERO placeholder for
  the literal compare. Tracking: apache#4517.
@andygrove andygrove force-pushed the enable-udf-codegen-default branch from 962e3d7 to 72286f5 Compare May 29, 2026 23:20
@andygrove
Copy link
Copy Markdown
Member Author

Good call. I dug into the two failing tests and you are right that both check query results, not just plan structure:

  • SQLQuerySuite "Common subexpression elimination" uses verifyCallCount, which runs checkAnswer and asserts the UDF invocation count.
  • SQLQueryTestSuite udf-select_having.sql is golden-file output matching.

So I switched away from disabling the dispatcher and went with adjusting the expectations instead, keeping the feature exercised:

  • CSE test: expected count is now if (isCometEnabled) 3 else 1 for the aggregate case. CometProject/CometHashAggregate do not do cross-sibling subexpression elimination over ScalaUDF, so the UDF runs once per reference. The result is unchanged.
  • udf-select_having.sql: 1/udf(a) now divides by zero natively and surfaces as CometNativeException instead of SparkArithmeticException. I normalize both sides to a DIVIDE_BY_ZERO placeholder, gated on isCometEnabled.

I filed follow-ons for the underlying gaps and linked them in the diff comments: #4516 (cross-sibling CSE over ScalaUDF) and #4517 (native divide-by-zero surfaces the wrong exception type).

Regenerated all four diffs (3.4.3 / 3.5.8 / 4.0.2 / 4.1.1) via the clone-apply-modify-regenerate workflow and merged latest main.

After merging main, apache#4499 routes InitCap through the JVM codegen dispatcher when
allowIncompatible is off. With this PR enabling the dispatcher by default,
`select initcap(_1)` now runs natively without allowIncompatible, so the
"enable incompat expression using dynamic config" test no longer sees the Spark
fallback it asserts. Disable the dispatcher within the test so it exercises the
serde-level incompatible fallback path it is meant to cover.
@andygrove andygrove merged commit 9cb4927 into apache:main May 30, 2026
68 checks passed
@andygrove andygrove deleted the enable-udf-codegen-default branch May 30, 2026 02:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable Scala/Java UDF codegen dispatch by default

2 participants