[SPARK-57200] Fix JVM Codegen Bug - NULL for 3-arg form with column nullReplacement by rgyhuang · Pull Request #56249 · apache/spark

rgyhuang · 2026-06-01T18:51:05Z

What changes were proposed in this pull request?

This PR fixes a whole-stage codegen (WSCG) correctness bug in ArrayJoin (array_join) where the generated code computes the correct joined string but discards it as NULL.

ArrayJoin.doGenCode initializes ev.isNull = true whenever the expression is nullable (which is the case when the optional nullReplacement argument is a nullable column). The actual join is then produced by genCodeForArrayAndDelimiter, which has two branches:

When array or delimiter is nullable, the body is wrapped in nullSafeExec and explicitly emits ev.isNull = false before building the result. When both array and delimiter are non-nullable, the else branch builds the result but never resets ev.isNull, leaving it at its initialized true.

A minimal reproduction:

  SET spark.sql.codegen.wholeStage = true;
  SET spark.sql.codegen.factoryMode = CODEGEN_ONLY;
  -- Returns NULL for every row (buggy):
  SELECT array_join(
           array('a', 'b'),
           ',',
           CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END
         ) AS r
  FROM range(4);
  SET spark.sql.codegen.wholeStage = false;
  SET spark.sql.codegen.factoryMode = NO_CODEGEN;
  -- Returns ['a,NR,b', NULL, 'a,NR,b', NULL] (correct):
  SELECT array_join(
           array('a', 'b'),
           ',',
           CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END
         ) AS r
  FROM range(4);

Why are the changes needed?

This is a silent correctness bug: array_join(arr, delimiter, repl) returns NULL for every row instead of the joined string, but only under a specific (and realistic) combination:

The third argument nullReplacement is a nullable, non-foldable column, so ArrayJoin.nullable is true.
An upstream Filter containing IsNotNull(array) (and/or IsNotNull(delimiter)) tightens those children to non-nullable. FilterExec.output marks IsNotNull-referenced attributes as non-nullable, and UpdateAttributeNullability propagates this downstream, so genCodeForArrayAndDelimiter takes the non-nullable else branch.
The query stays in whole-stage codegen over a materialized source (e.g. FileScan parquet, or an InMemoryRelation from CACHE TABLE). Inline VALUES / WITH sources are folded by ConvertToLocalRelation to interpreted eval() and therefore do not hit the bug.

Interpreted eval() returns the correct result, so the same query produces different answers depending on whether codegen kicks in.

Does this PR introduce any user-facing change?

Yes. It fixes incorrect results. Previously, array_join(arr, delimiter, nullReplacement) could return NULL for every row under whole-stage codegen when nullReplacement was a nullable column and an upstream IsNotNull filter made the array/delimiter non-nullable. After this change, such queries return the correctly joined string, matching interpreted execution. Queries that were already correct (2-arg form, literal non-null nullReplacement, no upstream IsNotNull filter, or non-codegen execution) are unaffected.

How was this patch tested?

Unit testing in CollectionExpressionsSuite and DataFrameFunctionsSuite

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

cloud-fan

0 blocking, 1 non-blocking, 0 nits.
Correct, complete, well-tested WSCG correctness fix that adopts an existing in-file idiom (ArrayContains's setIsNullCode, same file). The only finding is a minor test-clarity cleanup.

Suggestions (1)

DataFrameFunctionsSuite.scala:2013: the delimiter is the literal ',', so delim_col and its IS NOT NULL filter are unused and the comment's "(and delimiter)" is inaccurate — see inline

Verification

Traced the ev.isNull lifecycle across all result paths in genCodeForArrayAndDelimiter: doGenCode initializes ev.isNull = true when nullable; the array.nullable || delimiter.nullable branch already resets it to false inside nullSafeExec; the buggy else branch (both children non-nullable) is reached only when nullReplacement is present and nullable, and now resets ev.isNull = false via the nullable-guarded resetIsNull. The replacement-null case is correctly left as NULL by the outer nullSafeExec(replacement.nullable). The non-nullable expression case keeps ev.isNull as FalseLiteral (guard avoids assigning to a literal). This restores codegen/eval() parity for every input combination.

cloud-fan · 2026-06-02T07:02:10Z

thanks, merging to master/4.x/4.2 (bug fix)

…umn nullReplacement ### What changes were proposed in this pull request? This PR fixes a whole-stage codegen (WSCG) correctness bug in ArrayJoin (array_join) where the generated code computes the correct joined string but discards it as NULL. `ArrayJoin.doGenCode` initializes `ev.isNull = true` whenever the expression is nullable (which is the case when the optional `nullReplacement` argument is a nullable column). The actual join is then produced by `genCodeForArrayAndDelimiter`, which has two branches: When array or delimiter is nullable, the body is wrapped in `nullSafeExec` and explicitly emits `ev.isNull = false` before building the result. When both array and delimiter are non-nullable, the else branch builds the result but never resets `ev.isNull`, leaving it at its initialized true. A minimal reproduction: SET spark.sql.codegen.wholeStage = true; SET spark.sql.codegen.factoryMode = CODEGEN_ONLY; -- Returns NULL for every row (buggy): SELECT array_join( array('a', 'b'), ',', CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END ) AS r FROM range(4); SET spark.sql.codegen.wholeStage = false; SET spark.sql.codegen.factoryMode = NO_CODEGEN; -- Returns ['a,NR,b', NULL, 'a,NR,b', NULL] (correct): SELECT array_join( array('a', 'b'), ',', CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END ) AS r FROM range(4); ### Why are the changes needed? This is a silent correctness bug: `array_join(arr, delimiter, repl)` returns `NULL` for every row instead of the joined string, but only under a specific (and realistic) combination: - The third argument nullReplacement is a nullable, non-foldable column, so `ArrayJoin.nullable` is true. - An upstream `Filter` containing `IsNotNull(array)` (and/or `IsNotNull(delimiter)`) tightens those children to non-nullable. `FilterExec.output` marks `IsNotNull`-referenced attributes as non-nullable, and `UpdateAttributeNullability` propagates this downstream, so `genCodeForArrayAndDelimiter` takes the non-nullable else branch. - The query stays in whole-stage codegen over a materialized source (e.g. `FileScan parquet`, or an `InMemoryRelation` from `CACHE TABLE`). Inline `VALUES / WITH` sources are folded by `ConvertToLocalRelation` to interpreted `eval()` and therefore do not hit the bug. Interpreted `eval()` returns the correct result, so the same query produces different answers depending on whether codegen kicks in. ### Does this PR introduce _any_ user-facing change? Yes. It fixes incorrect results. Previously, `array_join(arr, delimiter, nullReplacement)` could return `NULL` for every row under whole-stage codegen when nullReplacement was a nullable column and an upstream `IsNotNull` filter made the array/delimiter non-nullable. After this change, such queries return the correctly joined string, matching interpreted execution. Queries that were already correct (2-arg form, literal non-null `nullReplacement`, no upstream `IsNotNull` filter, or non-codegen execution) are unaffected. ### How was this patch tested? Unit testing in `CollectionExpressionsSuite` and `DataFrameFunctionsSuite` ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.8 Closes #56249 from rgyhuang/r-huang_data/rgyhuang/JVM-codegen-fix. Lead-authored-by: Roy Huang <57263072+rgyhuang@users.noreply.github.com> Co-authored-by: Roy Huang <r.huang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b0633b0) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

rgyhuang marked this pull request as ready for review June 1, 2026 19:01

rgyhuang force-pushed the r-huang_data/rgyhuang/JVM-codegen-fix branch from 62b9db0 to a8813f4 Compare June 1, 2026 19:05

Fix with test suite updates

1a055d0

rgyhuang force-pushed the r-huang_data/rgyhuang/JVM-codegen-fix branch from a8813f4 to 1a055d0 Compare June 1, 2026 20:47

rgyhuang changed the title ~~[SPARK-XXXXX] Fix JVM Codegen Bug - NULL for 3-arg form with column nullReplacement~~ [SPARK-57200] Fix JVM Codegen Bug - NULL for 3-arg form with column nullReplacement Jun 1, 2026

cloud-fan reviewed Jun 2, 2026

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala Outdated

Update DataFrameFunction unit test with delim_col

116012e

cloud-fan closed this in b0633b0 Jun 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57200] Fix JVM Codegen Bug - NULL for 3-arg form with column nullReplacement#56249

[SPARK-57200] Fix JVM Codegen Bug - NULL for 3-arg form with column nullReplacement#56249
rgyhuang wants to merge 2 commits into
apache:masterfrom
rgyhuang:r-huang_data/rgyhuang/JVM-codegen-fix

rgyhuang commented Jun 1, 2026 •

edited

Loading

Uh oh!

cloud-fan left a comment

Uh oh!

Uh oh!

cloud-fan commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rgyhuang commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Suggestions (1)

Verification

Uh oh!

Uh oh!

cloud-fan commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rgyhuang commented Jun 1, 2026 •

edited

Loading