Skip to content

fix: route StringReplace through codegen dispatcher to fix empty-search divergence#4537

Open
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix/replace-empty-search
Open

fix: route StringReplace through codegen dispatcher to fix empty-search divergence#4537
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix/replace-empty-search

Conversation

@andygrove
Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #4497.

Rationale for this change

replace(str, search, replacement) diverges from Spark when the search string is empty. Spark short-circuits and returns the source unchanged (UTF8String.replace checks search.numBytes == 0). DataFusion's replace instead inserts the replacement between every character and at both boundaries, producing e.g. replace('hello', '', 'x') -> xhxexlxlxox. The bug applies to both literal empty searches and column-valued empty strings at runtime.

What changes are included in this PR?

  • New CometStringReplace serde in spark/src/main/scala/org/apache/comet/serde/strings.scala. By default it routes through CometScalaUDF.emitJvmCodegenDispatch so Spark's own doGenCode runs inside the Comet pipeline. Users can opt back into the native (divergent) path with spark.comet.expression.StringReplace.allowIncompatible=true. The pattern mirrors CometInitCap.
  • QueryPlanSerde now wires StringReplace to CometStringReplace instead of the inline CometScalarFunction("replace").
  • string_replace.sql removes the query ignore(#3344) directive on the empty-literal case, adds rows for empty-source / NULL-source / NULL-replacement, and adds a row ('hello', '', 'x') to the test table to exercise the column-valued empty-search case.

Are these changes tested?

Yes. CometSqlFileTestSuite runs the updated string_replace.sql and passes on Spark 3.5. Compile verified for Spark 3.4 (Java 11), Spark 3.5 (Java 11), and Spark 4.0 (Java 17).

Are there any user-facing changes?

The default behavior of replace becomes Spark-compatible for all empty-search cases. Plans that previously executed replace natively now route through the JVM codegen dispatcher when spark.comet.exec.scalaUDF.codegen.enabled=true (the default), which is slightly slower than the native path but matches Spark exactly. The native path remains available behind spark.comet.expression.StringReplace.allowIncompatible=true.

andygrove added 2 commits May 30, 2026 10:06
…pty literal [skip ci]

DataFusion's `replace` inserts the replacement between every character and at
both boundaries when the search string is empty, but Spark short-circuits and
returns the source unchanged. The new `CometStringReplace` serde detects an
empty literal search and routes through `CometScalaUDF.emitJvmCodegenDispatch`
so Spark's own `doGenCode` handles the case. Users can opt back into the native
(divergent) path with `spark.comet.expression.StringReplace.allowIncompatible=true`.

Closes apache#4497.
…owIncompatible [skip ci]

The empty-search divergence affects column-valued empty strings too, not just
literal empty searches. Drop the literal-only plan-time check and route every
StringReplace through `CometScalaUDF.emitJvmCodegenDispatch` by default, mirroring
the pattern used by `CometInitCap`. Users can opt back into the native path with
`spark.comet.expression.StringReplace.allowIncompatible=true`.

Test data now includes a row with an empty search column value to cover the
non-literal case.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] replace returns wrong result for empty-string search

1 participant