fix: route StringReplace through codegen dispatcher to fix empty-search divergence by andygrove · Pull Request #4537 · apache/datafusion-comet

andygrove · 2026-05-30T16:34:49Z

Which issue does this PR close?

Closes #4497.

Rationale for this change

replace(str, search, replacement) diverges from Spark when the search string is empty. Spark short-circuits and returns the source unchanged (UTF8String.replace checks search.numBytes == 0). DataFusion's replace instead inserts the replacement between every character and at both boundaries, producing e.g. replace('hello', '', 'x') -> xhxexlxlxox. The bug applies to both literal empty searches and column-valued empty strings at runtime.

What changes are included in this PR?

New CometStringReplace serde in spark/src/main/scala/org/apache/comet/serde/strings.scala. By default it routes through CometScalaUDF.emitJvmCodegenDispatch so Spark's own doGenCode runs inside the Comet pipeline. Users can opt back into the native (divergent) path with spark.comet.expression.StringReplace.allowIncompatible=true. The pattern mirrors CometInitCap.
QueryPlanSerde now wires StringReplace to CometStringReplace instead of the inline CometScalarFunction("replace").
string_replace.sql removes the query ignore(#3344) directive on the empty-literal case, adds rows for empty-source / NULL-source / NULL-replacement, and adds a row ('hello', '', 'x') to the test table to exercise the column-valued empty-search case.

Are these changes tested?

Yes. CometSqlFileTestSuite runs the updated string_replace.sql and passes on Spark 3.5. Compile verified for Spark 3.4 (Java 11), Spark 3.5 (Java 11), and Spark 4.0 (Java 17).

Are there any user-facing changes?

The default behavior of replace becomes Spark-compatible for all empty-search cases. Plans that previously executed replace natively now route through the JVM codegen dispatcher when spark.comet.exec.scalaUDF.codegen.enabled=true (the default), which is slightly slower than the native path but matches Spark exactly. The native path remains available behind spark.comet.expression.StringReplace.allowIncompatible=true.

…pty literal [skip ci] DataFusion's `replace` inserts the replacement between every character and at both boundaries when the search string is empty, but Spark short-circuits and returns the source unchanged. The new `CometStringReplace` serde detects an empty literal search and routes through `CometScalaUDF.emitJvmCodegenDispatch` so Spark's own `doGenCode` handles the case. Users can opt back into the native (divergent) path with `spark.comet.expression.StringReplace.allowIncompatible=true`. Closes apache#4497.

…owIncompatible [skip ci] The empty-search divergence affects column-valued empty strings too, not just literal empty searches. Drop the literal-only plan-time check and route every StringReplace through `CometScalaUDF.emitJvmCodegenDispatch` by default, mirroring the pattern used by `CometInitCap`. Users can opt back into the native path with `spark.comet.expression.StringReplace.allowIncompatible=true`. Test data now includes a row with an empty search column value to cover the non-literal case.

andygrove added 2 commits May 30, 2026 10:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: route StringReplace through codegen dispatcher to fix empty-search divergence#4537

fix: route StringReplace through codegen dispatcher to fix empty-search divergence#4537
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix/replace-empty-search

andygrove commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andygrove commented May 30, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant