[SPARK-57200] Fix JVM Codegen Bug - NULL for 3-arg form with column nullReplacement#56249

Closed
rgyhuang wants to merge 2 commits into apache:master from rgyhuang:r-huang_data/rgyhuang/JVM-codegen-fix

@rgyhuang rgyhuang commented Jun 1, 2026

What changes were proposed in this pull request?

This PR fixes a whole-stage codegen (WSCG) correctness bug in ArrayJoin (array_join) where the generated code computes the correct joined string but discards it as NULL.

ArrayJoin.doGenCode initializes ev.isNull = true whenever the expression is nullable (which is the case when the optional nullReplacement argument is a nullable column). The actual join is then produced by genCodeForArrayAndDelimiter, which has two branches:

  • When array or delimiter is nullable, the body is wrapped in nullSafeExec and explicitly emits ev.isNull = false before building the result.
  • When both array and delimiter are non-nullable, the else branch builds the result but never resets ev.isNull, leaving it at its initialized true.
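
As a rough illustration of the two branches, the sketch below emits Java-like text shaped like what each branch produces. It is a toy model, not Spark's code: `isNull_0`, `value_0`, `arr_isNull`, `delim_isNull`, and the `<joined string>` placeholder are made-up names, and the `fixed` flag models one plausible repair (resetting the flag in the else branch only when the expression is nullable).

```python
def gen_array_and_delimiter(array_nullable, delimiter_nullable,
                            replacement_nullable, fixed):
    """Return Java-like text for the array/delimiter part of the generated code.

    All identifiers in the emitted text are hypothetical placeholders.
    """
    expr_nullable = array_nullable or delimiter_nullable or replacement_nullable
    # A non-nullable expression uses the literal `false` as its isNull term.
    is_null = "isNull_0" if expr_nullable else "false"
    build = "value_0 = <joined string>;"

    guards = [g for flag, g in ((array_nullable, "!arr_isNull"),
                                (delimiter_nullable, "!delim_isNull")) if flag]
    if guards:
        # Nullable branch: null-safe wrapper that resets the flag itself.
        return "\n".join([
            f"if ({' && '.join(guards)}) {{",
            f"  {is_null} = false;",
            f"  {build}",
            "}",
        ])

    # Else branch: the buggy version emits no reset at all.
    reset = f"{is_null} = false;" if (fixed and expr_nullable) else ""
    return "\n".join(line for line in (reset, build) if line)


print("buggy else branch:")
print(gen_array_and_delimiter(False, False, True, fixed=False))
print("fixed else branch:")
print(gen_array_and_delimiter(False, False, True, fixed=True))
```

In the buggy output the only statement is the assignment to `value_0`, so a flag initialized to true stays true. The nullable guard on the reset matters because, for a non-nullable expression, the isNull term is a literal and cannot be assigned to.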

A minimal reproduction:

  SET spark.sql.codegen.wholeStage = true;
  SET spark.sql.codegen.factoryMode = CODEGEN_ONLY;
  -- Returns NULL for every row (buggy):
  SELECT array_join(
           array('a', 'b'),
           ',',
           CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END
         ) AS r
  FROM range(4);
  SET spark.sql.codegen.wholeStage = false;
  SET spark.sql.codegen.factoryMode = NO_CODEGEN;
  -- Returns ['a,b', NULL, 'a,b', NULL] (correct; the array has no NULL elements, so 'NR' is never substituted):
  SELECT array_join(
           array('a', 'b'),
           ',',
           CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END
         ) AS r
  FROM range(4);
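
For reference, the interpreted semantics that the reproduction is compared against can be modelled in a few lines. This is a hedged toy model, with `None` standing in for SQL NULL and a sentinel (an assumption of this sketch) distinguishing the 2-arg form from a NULL third argument; it is not Spark's `eval()`.

```python
_ABSENT = object()  # sentinel: 2-arg form, i.e. no nullReplacement given


def array_join(arr, delimiter, null_replacement=_ABSENT):
    """Toy model of array_join; None stands in for SQL NULL."""
    if arr is None or delimiter is None:
        return None
    if null_replacement is None:
        # 3-arg form with a NULL replacement: the whole result is NULL.
        return None
    parts = []
    for elem in arr:
        if elem is not None:
            parts.append(elem)
        elif null_replacement is not _ABSENT:
            # 3-arg form: NULL elements are replaced.
            parts.append(null_replacement)
        # 2-arg form: NULL elements are skipped.
    return delimiter.join(parts)


# Same shape as the reproduction: the replacement is 'NR' for even ids, NULL otherwise.
rows = [array_join(["a", "b"], ",", "NR" if i % 2 == 0 else None) for i in range(4)]
print(rows)  # ['a,b', None, 'a,b', None]
```

Because `array('a', 'b')` contains no NULL elements, the replacement string never appears in the joined output; it only decides whether the row is NULL.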

Why are the changes needed?

This is a silent correctness bug: array_join(arr, delimiter, repl) returns NULL for every row instead of the joined string, but only under a specific (and realistic) combination:

  • The third argument nullReplacement is a nullable, non-foldable column, so ArrayJoin.nullable is true.
  • An upstream Filter containing IsNotNull(array) (and/or IsNotNull(delimiter)) tightens those children to non-nullable. FilterExec.output marks IsNotNull-referenced attributes as non-nullable, and UpdateAttributeNullability propagates this downstream, so genCodeForArrayAndDelimiter takes the non-nullable else branch.
  • The query stays in whole-stage codegen over a materialized source (e.g. FileScan parquet, or an InMemoryRelation from CACHE TABLE). Inline VALUES / WITH sources are folded by ConvertToLocalRelation to interpreted eval() and therefore do not hit the bug.
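
The interaction of these three conditions can be sketched as a small predicate. This is a simplified model under stated assumptions: `Attr`, `apply_is_not_null_filter`, and `hits_bug` are hypothetical helpers invented for illustration, and the nullability tightening is reduced to flipping a flag on the filtered attributes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Attr:
    """Hypothetical stand-in for an attribute with a nullability flag."""
    name: str
    nullable: bool


def apply_is_not_null_filter(output, not_null_names):
    """Model of a filter marking IsNotNull-referenced attributes non-nullable."""
    return [replace(a, nullable=False) if a.name in not_null_names else a
            for a in output]


def hits_bug(array, delimiter, replacement, whole_stage_codegen):
    """True when the unfixed codegen would take the else branch with isNull = true."""
    takes_else_branch = not (array.nullable or delimiter.nullable)
    return whole_stage_codegen and replacement.nullable and takes_else_branch


# A scan whose three columns are all nullable.
scan = [Attr("arr", True), Attr("delim", True), Attr("repl", True)]
print(hits_bug(*scan, whole_stage_codegen=True))       # False: nullable branch is taken

# After WHERE arr IS NOT NULL AND delim IS NOT NULL.
filtered = apply_is_not_null_filter(scan, {"arr", "delim"})
print(hits_bug(*filtered, whole_stage_codegen=True))   # True
print(hits_bug(*filtered, whole_stage_codegen=False))  # False: interpreted path
```

Filtering only one of the two children is enough when the other is already non-nullable, for example a literal delimiter.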

Interpreted eval() returns the correct result, so the same query produces different answers depending on whether codegen kicks in.

Does this PR introduce any user-facing change?

Yes. It fixes incorrect results. Previously, array_join(arr, delimiter, nullReplacement) could return NULL for every row under whole-stage codegen when nullReplacement was a nullable column and an upstream IsNotNull filter made the array/delimiter non-nullable. After this change, such queries return the correctly joined string, matching interpreted execution. Queries that were already correct (2-arg form, literal non-null nullReplacement, no upstream IsNotNull filter, or non-codegen execution) are unaffected.

How was this patch tested?

Unit tests in CollectionExpressionsSuite and DataFrameFunctionsSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

@rgyhuang rgyhuang marked this pull request as ready for review June 1, 2026 19:01
@rgyhuang rgyhuang force-pushed the r-huang_data/rgyhuang/JVM-codegen-fix branch from 62b9db0 to a8813f4 Compare June 1, 2026 19:05
@rgyhuang rgyhuang force-pushed the r-huang_data/rgyhuang/JVM-codegen-fix branch from a8813f4 to 1a055d0 Compare June 1, 2026 20:47
@rgyhuang rgyhuang changed the title [SPARK-XXXXX] Fix JVM Codegen Bug - NULL for 3-arg form with column nullReplacement [SPARK-57200] Fix JVM Codegen Bug - NULL for 3-arg form with column nullReplacement Jun 1, 2026
@cloud-fan cloud-fan left a comment

0 blocking, 1 non-blocking, 0 nits.
Correct, complete, well-tested WSCG correctness fix that adopts an existing in-file idiom (ArrayContains's setIsNullCode, same file). The only finding is a minor test-clarity cleanup.

Suggestions (1)

  • DataFrameFunctionsSuite.scala:2013: the delimiter is the literal ',', so delim_col and its IS NOT NULL filter are unused and the comment's "(and delimiter)" is inaccurate — see inline

Verification

Traced the ev.isNull lifecycle across all result paths in genCodeForArrayAndDelimiter: doGenCode initializes ev.isNull = true when nullable; the array.nullable || delimiter.nullable branch already resets it to false inside nullSafeExec; the buggy else branch (both children non-nullable) is reached only when nullReplacement is present and nullable, and now resets ev.isNull = false via the nullable-guarded resetIsNull. The replacement-null case is correctly left as NULL by the outer nullSafeExec(replacement.nullable). The non-nullable expression case keeps ev.isNull as FalseLiteral (guard avoids assigning to a literal). This restores codegen/eval() parity for every input combination.
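
The same trace can be approximated mechanically. The sketch below is a toy behavioural model, not Spark's code: it enumerates every combination of nullability flags and NULL inputs for the 3-arg form and compares a model of the generated control flow against a model of interpreted evaluation. The function names and the `fixed` flag are assumptions of this sketch.

```python
from itertools import product


def join(arr, delim, repl):
    # NULL elements are replaced, as in the 3-arg form.
    return delim.join(repl if e is None else e for e in arr)


def interpreted(arr, delim, repl):
    """Model of interpreted evaluation: any NULL argument yields NULL."""
    if arr is None or delim is None or repl is None:
        return None
    return join(arr, delim, repl)


def generated(arr, delim, repl, arr_n, delim_n, repl_n, fixed):
    """Model of the generated control flow for the given nullability flags."""
    nullable = arr_n or delim_n or repl_n
    is_null = nullable  # starts true only when the expression is nullable
    value = None
    if not repl_n or repl is not None:  # outer null-safe wrapper on the replacement
        if arr_n or delim_n:
            if arr is not None and delim is not None:
                is_null = False
                value = join(arr, delim, repl)
        else:
            if fixed and nullable:  # nullable-guarded reset
                is_null = False
            value = join(arr, delim, repl)
    return None if is_null else value


def cases():
    for arr_n, delim_n, repl_n in product([False, True], repeat=3):
        # A NULL input is only possible for a child marked nullable.
        arrs = [["a", None, "b"]] + ([None] if arr_n else [])
        delims = [","] + ([None] if delim_n else [])
        repls = ["NR"] + ([None] if repl_n else [])
        for args in product(arrs, delims, repls):
            yield (arr_n, delim_n, repl_n), args


def mismatches(fixed):
    return [flags for flags, args in cases()
            if generated(*args, *flags, fixed=fixed) != interpreted(*args)]


print(mismatches(fixed=False))  # [(False, False, True)]
print(mismatches(fixed=True))   # []
```

In this model the only disagreement without the reset is the combination of non-nullable array, non-nullable delimiter, and nullable replacement with a non-NULL value, which matches the path described above.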

Comment thread sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala Outdated
@cloud-fan

thanks, merging to master/4.x/4.2 (bug fix)

@cloud-fan cloud-fan closed this in b0633b0 Jun 2, 2026
cloud-fan pushed a commit that referenced this pull request Jun 2, 2026
…umn nullReplacement

### What changes were proposed in this pull request?

This PR fixes a whole-stage codegen (WSCG) correctness bug in ArrayJoin (array_join) where the generated code computes the correct joined string but discards it as NULL.

`ArrayJoin.doGenCode` initializes `ev.isNull = true` whenever the expression is nullable (which is the case when the optional `nullReplacement` argument is a nullable column). The actual join is then produced by `genCodeForArrayAndDelimiter`, which has two branches:

- When array or delimiter is nullable, the body is wrapped in `nullSafeExec` and explicitly emits `ev.isNull = false` before building the result.
- When both array and delimiter are non-nullable, the else branch builds the result but never resets `ev.isNull`, leaving it at its initialized true.

A minimal reproduction:

      SET spark.sql.codegen.wholeStage = true;
      SET spark.sql.codegen.factoryMode = CODEGEN_ONLY;
      -- Returns NULL for every row (buggy):
      SELECT array_join(
               array('a', 'b'),
               ',',
               CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END
             ) AS r
      FROM range(4);
      SET spark.sql.codegen.wholeStage = false;
      SET spark.sql.codegen.factoryMode = NO_CODEGEN;
      -- Returns ['a,b', NULL, 'a,b', NULL] (correct; the array has no NULL elements, so 'NR' is never substituted):
      SELECT array_join(
               array('a', 'b'),
               ',',
               CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END
             ) AS r
      FROM range(4);

### Why are the changes needed?

This is a silent correctness bug: `array_join(arr, delimiter, repl)` returns `NULL` for every row instead of the joined string, but only under a specific (and realistic) combination:

- The third argument nullReplacement is a nullable, non-foldable column, so `ArrayJoin.nullable` is true.
- An upstream `Filter` containing `IsNotNull(array)` (and/or `IsNotNull(delimiter)`) tightens those children to non-nullable. `FilterExec.output` marks `IsNotNull`-referenced attributes as non-nullable, and `UpdateAttributeNullability` propagates this downstream, so `genCodeForArrayAndDelimiter` takes the non-nullable else branch.
- The query stays in whole-stage codegen over a materialized source (e.g. `FileScan parquet`, or an `InMemoryRelation` from `CACHE TABLE`). Inline `VALUES / WITH` sources are folded by `ConvertToLocalRelation` to interpreted `eval()` and therefore do not hit the bug.

Interpreted `eval()` returns the correct result, so the same query produces different answers depending on whether codegen kicks in.

### Does this PR introduce _any_ user-facing change?

Yes. It fixes incorrect results. Previously, `array_join(arr, delimiter, nullReplacement)` could return `NULL` for every row under whole-stage codegen when nullReplacement was a nullable column and an upstream `IsNotNull` filter made the array/delimiter non-nullable. After this change, such queries return the correctly joined string, matching interpreted execution. Queries that were already correct (2-arg form, literal non-null `nullReplacement`, no upstream `IsNotNull` filter, or non-codegen execution) are unaffected.

### How was this patch tested?

Unit testing in `CollectionExpressionsSuite` and `DataFrameFunctionsSuite`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

Closes #56249 from rgyhuang/r-huang_data/rgyhuang/JVM-codegen-fix.

Lead-authored-by: Roy Huang <57263072+rgyhuang@users.noreply.github.com>
Co-authored-by: Roy Huang <r.huang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b0633b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jun 2, 2026
…umn nullReplacement