[SPARK-57200] Fix JVM Codegen Bug - NULL for 3-arg form with column nullReplacement#56249
[SPARK-57200] Fix JVM Codegen Bug - NULL for 3-arg form with column nullReplacement#56249rgyhuang wants to merge 2 commits into
Conversation
62b9db0 to
a8813f4
Compare
a8813f4 to
1a055d0
Compare
cloud-fan
left a comment
There was a problem hiding this comment.
0 blocking, 1 non-blocking, 0 nits.
Correct, complete, well-tested WSCG correctness fix that adopts an existing in-file idiom (ArrayContains's setIsNullCode, same file). The only finding is a minor test-clarity cleanup.
Suggestions (1)
- DataFrameFunctionsSuite.scala:2013: the delimiter is the literal
',', sodelim_coland itsIS NOT NULLfilter are unused and the comment's "(and delimiter)" is inaccurate — see inline
Verification
Traced the ev.isNull lifecycle across all result paths in genCodeForArrayAndDelimiter: doGenCode initializes ev.isNull = true when nullable; the array.nullable || delimiter.nullable branch already resets it to false inside nullSafeExec; the buggy else branch (both children non-nullable) is reached only when nullReplacement is present and nullable, and now resets ev.isNull = false via the nullable-guarded resetIsNull. The replacement-null case is correctly left as NULL by the outer nullSafeExec(replacement.nullable). The non-nullable expression case keeps ev.isNull as FalseLiteral (guard avoids assigning to a literal). This restores codegen/eval() parity for every input combination.
|
thanks, merging to master/4.x/4.2 (bug fix) |
…umn nullReplacement
### What changes were proposed in this pull request?
This PR fixes a whole-stage codegen (WSCG) correctness bug in ArrayJoin (array_join) where the generated code computes the correct joined string but discards it as NULL.
`ArrayJoin.doGenCode` initializes `ev.isNull = true` whenever the expression is nullable (which is the case when the optional `nullReplacement` argument is a nullable column). The actual join is then produced by `genCodeForArrayAndDelimiter`, which has two branches:
When array or delimiter is nullable, the body is wrapped in `nullSafeExec` and explicitly emits `ev.isNull = false` before building the result. When both array and delimiter are non-nullable, the else branch builds the result but never resets `ev.isNull`, leaving it at its initialized true.
A minimal reproduction:
SET spark.sql.codegen.wholeStage = true;
SET spark.sql.codegen.factoryMode = CODEGEN_ONLY;
-- Returns NULL for every row (buggy):
SELECT array_join(
array('a', 'b'),
',',
CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END
) AS r
FROM range(4);
SET spark.sql.codegen.wholeStage = false;
SET spark.sql.codegen.factoryMode = NO_CODEGEN;
-- Returns ['a,NR,b', NULL, 'a,NR,b', NULL] (correct):
SELECT array_join(
array('a', 'b'),
',',
CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END
) AS r
FROM range(4);
### Why are the changes needed?
This is a silent correctness bug: `array_join(arr, delimiter, repl)` returns `NULL` for every row instead of the joined string, but only under a specific (and realistic) combination:
- The third argument nullReplacement is a nullable, non-foldable column, so `ArrayJoin.nullable` is true.
- An upstream `Filter` containing `IsNotNull(array)` (and/or `IsNotNull(delimiter)`) tightens those children to non-nullable. `FilterExec.output` marks `IsNotNull`-referenced attributes as non-nullable, and `UpdateAttributeNullability` propagates this downstream, so `genCodeForArrayAndDelimiter` takes the non-nullable else branch.
- The query stays in whole-stage codegen over a materialized source (e.g. `FileScan parquet`, or an `InMemoryRelation` from `CACHE TABLE`). Inline `VALUES / WITH` sources are folded by `ConvertToLocalRelation` to interpreted `eval()` and therefore do not hit the bug.
Interpreted `eval()` returns the correct result, so the same query produces different answers depending on whether codegen kicks in.
### Does this PR introduce _any_ user-facing change?
Yes. It fixes incorrect results. Previously, `array_join(arr, delimiter, nullReplacement)` could return `NULL` for every row under whole-stage codegen when nullReplacement was a nullable column and an upstream `IsNotNull` filter made the array/delimiter non-nullable. After this change, such queries return the correctly joined string, matching interpreted execution. Queries that were already correct (2-arg form, literal non-null `nullReplacement`, no upstream `IsNotNull` filter, or non-codegen execution) are unaffected.
### How was this patch tested?
Unit testing in `CollectionExpressionsSuite` and `DataFrameFunctionsSuite`
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.8
Closes #56249 from rgyhuang/r-huang_data/rgyhuang/JVM-codegen-fix.
Lead-authored-by: Roy Huang <57263072+rgyhuang@users.noreply.github.com>
Co-authored-by: Roy Huang <r.huang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b0633b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…umn nullReplacement
### What changes were proposed in this pull request?
This PR fixes a whole-stage codegen (WSCG) correctness bug in ArrayJoin (array_join) where the generated code computes the correct joined string but discards it as NULL.
`ArrayJoin.doGenCode` initializes `ev.isNull = true` whenever the expression is nullable (which is the case when the optional `nullReplacement` argument is a nullable column). The actual join is then produced by `genCodeForArrayAndDelimiter`, which has two branches:
When array or delimiter is nullable, the body is wrapped in `nullSafeExec` and explicitly emits `ev.isNull = false` before building the result. When both array and delimiter are non-nullable, the else branch builds the result but never resets `ev.isNull`, leaving it at its initialized true.
A minimal reproduction:
SET spark.sql.codegen.wholeStage = true;
SET spark.sql.codegen.factoryMode = CODEGEN_ONLY;
-- Returns NULL for every row (buggy):
SELECT array_join(
array('a', 'b'),
',',
CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END
) AS r
FROM range(4);
SET spark.sql.codegen.wholeStage = false;
SET spark.sql.codegen.factoryMode = NO_CODEGEN;
-- Returns ['a,NR,b', NULL, 'a,NR,b', NULL] (correct):
SELECT array_join(
array('a', 'b'),
',',
CASE WHEN id % 2 = 0 THEN 'NR' ELSE CAST(NULL AS STRING) END
) AS r
FROM range(4);
### Why are the changes needed?
This is a silent correctness bug: `array_join(arr, delimiter, repl)` returns `NULL` for every row instead of the joined string, but only under a specific (and realistic) combination:
- The third argument nullReplacement is a nullable, non-foldable column, so `ArrayJoin.nullable` is true.
- An upstream `Filter` containing `IsNotNull(array)` (and/or `IsNotNull(delimiter)`) tightens those children to non-nullable. `FilterExec.output` marks `IsNotNull`-referenced attributes as non-nullable, and `UpdateAttributeNullability` propagates this downstream, so `genCodeForArrayAndDelimiter` takes the non-nullable else branch.
- The query stays in whole-stage codegen over a materialized source (e.g. `FileScan parquet`, or an `InMemoryRelation` from `CACHE TABLE`). Inline `VALUES / WITH` sources are folded by `ConvertToLocalRelation` to interpreted `eval()` and therefore do not hit the bug.
Interpreted `eval()` returns the correct result, so the same query produces different answers depending on whether codegen kicks in.
### Does this PR introduce _any_ user-facing change?
Yes. It fixes incorrect results. Previously, `array_join(arr, delimiter, nullReplacement)` could return `NULL` for every row under whole-stage codegen when nullReplacement was a nullable column and an upstream `IsNotNull` filter made the array/delimiter non-nullable. After this change, such queries return the correctly joined string, matching interpreted execution. Queries that were already correct (2-arg form, literal non-null `nullReplacement`, no upstream `IsNotNull` filter, or non-codegen execution) are unaffected.
### How was this patch tested?
Unit testing in `CollectionExpressionsSuite` and `DataFrameFunctionsSuite`
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.8
Closes #56249 from rgyhuang/r-huang_data/rgyhuang/JVM-codegen-fix.
Lead-authored-by: Roy Huang <57263072+rgyhuang@users.noreply.github.com>
Co-authored-by: Roy Huang <r.huang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b0633b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR fixes a whole-stage codegen (WSCG) correctness bug in ArrayJoin (array_join) where the generated code computes the correct joined string but discards it as NULL.
ArrayJoin.doGenCodeinitializesev.isNull = truewhenever the expression is nullable (which is the case when the optionalnullReplacementargument is a nullable column). The actual join is then produced bygenCodeForArrayAndDelimiter, which has two branches:When array or delimiter is nullable, the body is wrapped in
nullSafeExecand explicitly emitsev.isNull = falsebefore building the result. When both array and delimiter are non-nullable, the else branch builds the result but never resetsev.isNull, leaving it at its initialized true.A minimal reproduction:
Why are the changes needed?
This is a silent correctness bug:
array_join(arr, delimiter, repl)returnsNULLfor every row instead of the joined string, but only under a specific (and realistic) combination:ArrayJoin.nullableis true.FiltercontainingIsNotNull(array)(and/orIsNotNull(delimiter)) tightens those children to non-nullable.FilterExec.outputmarksIsNotNull-referenced attributes as non-nullable, andUpdateAttributeNullabilitypropagates this downstream, sogenCodeForArrayAndDelimitertakes the non-nullable else branch.FileScan parquet, or anInMemoryRelationfromCACHE TABLE). InlineVALUES / WITHsources are folded byConvertToLocalRelationto interpretedeval()and therefore do not hit the bug.Interpreted
eval()returns the correct result, so the same query produces different answers depending on whether codegen kicks in.Does this PR introduce any user-facing change?
Yes. It fixes incorrect results. Previously,
array_join(arr, delimiter, nullReplacement)could returnNULLfor every row under whole-stage codegen when nullReplacement was a nullable column and an upstreamIsNotNullfilter made the array/delimiter non-nullable. After this change, such queries return the correctly joined string, matching interpreted execution. Queries that were already correct (2-arg form, literal non-nullnullReplacement, no upstreamIsNotNullfilter, or non-codegen execution) are unaffected.How was this patch tested?
Unit testing in
CollectionExpressionsSuiteandDataFrameFunctionsSuiteWas this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.8