[SPARK-43413][SQL] Fix IN subquery ListQuery nullability #41094

jchen5 · 2023-05-08T21:22:21Z

What changes were proposed in this pull request?

Before this PR, IN subquery expressions were incorrectly marked as non-nullable, even when they are actually nullable. The nullability of the left-hand side was checked correctly, but the right-hand side of an IN subquery, the ListQuery, was always defined with nullability = false. This is incorrect and can lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN expression returns NULL when nullable_col is NULL and no other value matches, but our code marks it as non-nullable. SimplifyBinaryComparison therefore removes the <=> TRUE and rewrites the expression to non_nullable_col IN (select nullable_col). That rewrite is incorrect, because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE.
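The three-valued logic behind this example can be modeled in a few lines of plain Scala. This is only an illustrative sketch, not Spark code: `SqlBool`, `sqlIn`, and `nullSafeEq` are hypothetical names introduced here, with `None` standing in for SQL NULL.

```scala
object InNullSemantics {
  // SQL three-valued boolean: Some(true), Some(false), or None for NULL.
  type SqlBool = Option[Boolean]

  // value IN (list), for a non-nullable left-hand value:
  // TRUE on a match, NULL if nothing matches but the list contains a NULL,
  // FALSE otherwise (including for an empty list).
  def sqlIn(value: Int, list: Seq[Option[Int]]): SqlBool =
    if (list.contains(Some(value))) Some(true)
    else if (list.contains(None)) None
    else Some(false)

  // Null-safe equality (<=>): always returns a definite true or false.
  def nullSafeEq(a: SqlBool, b: SqlBool): Boolean = a == b
}
```

With a list of `Seq(Some(2), None)` and a left-hand value of 1, `sqlIn` yields `None` (NULL), so `nullSafeEq(sqlIn(...), Some(true))` is `false`. Dropping the `<=> TRUE` wrapper would leave the NULL in place of that FALSE, which is the discrepancy described above.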

Fix this by calculating nullability correctly from the ListQuery child output expressions.
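As a rough sketch of the rule, the IN expression should be treated as nullable if either the left-hand side or any output column of the subquery is nullable. The code below is a simplified stand-in and does not use the Catalyst classes: `Attr`, `inSubqueryNullable`, and the `legacy` parameter are hypothetical, with `legacy` loosely mirroring the legacy conf flag discussed later in this thread (the behavior that does not check the right side's nullability).

```scala
object InSubqueryNullabilitySketch {
  // Hypothetical stand-in for an output attribute: just a name and a nullability bit.
  final case class Attr(name: String, nullable: Boolean)

  // Nullability of `lhs IN (subquery)`.
  // legacy = true models the old behavior, which only looked at the left-hand side.
  def inSubqueryNullable(
      lhs: Seq[Attr],
      subqueryOutput: Seq[Attr],
      legacy: Boolean = false): Boolean = {
    val leftNullable = lhs.exists(_.nullable)
    if (legacy) leftNullable
    else leftNullable || subqueryOutput.exists(_.nullable)
  }
}
```

For the example above, a non-nullable left-hand column checked against a nullable subquery column gives `true` under the corrected rule and `false` under the legacy one.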

This bug can potentially lead to wrong results, but in most cases it does not directly cause wrong results end-to-end. IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and that rewrite can also incorrectly discard NULLs, which is a separate bug. The wrong behavior can at least be observed in unit tests.

This is a long-standing bug that has existed since at least 2016, for as long as the ListQuery class has existed.

Why are the changes needed?

Fix correctness bug.

Does this PR introduce any user-facing change?

May change query results to fix correctness bug.

How was this patch tested?

Unit tests

jchen5 · 2023-05-08T21:25:09Z

@cloud-fan @allisonwang-db

dongjoon-hyun

Thank you for making a PR. Apache Spark has a test case directory for IN subquery like the following. Could you add a nullable ListQuery test case there, please?

https://github.com/apache/spark/tree/master/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery

dongjoon-hyun

Thank you for the updates!

cloud-fan · 2023-05-16T01:40:31Z

thanks, merging to master!

dongjoon-hyun · 2023-05-16T16:41:43Z

Thank you, @jchen5 and @cloud-fan .

dongjoon-hyun · 2023-05-16T16:45:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+        "behavior that does not check the right side's nullability.")
+      .version("3.5.0")
+      .booleanConf
+      .createWithDefault(false)


Since this is false, we need to add spark.sql.legacy.inSubqueryNullability to our SQL migration guide. Could you make a followup, @jchen5 ?

cloud-fan · 2023-05-17

I think the purpose of this flag is not to let users go back to the wrong result, but to provide a fallback in case the newly added assertion fails. I think it's better to mention this flag in the assert error message than in the migration guide.

dongjoon-hyun · 2023-05-17

+100 for @cloud-fan's holistic and complete solution.

jchen5 · 2023-05-17

Agreed, created followup here: #41202

…nullability assertion

### What changes were proposed in this pull request?

In case the assert for the call to ListQuery.nullable is hit, mention in the assert error message the conf flag that can be used to disable the assert. Follow-up to #41094 (comment)

### Why are the changes needed?

Improve error message.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests

Closes #41202 from jchen5/in-nullability-assert.

Authored-by: Jack Chen <jack.chen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

[SPARK-43413] Fix IN subquery ListQuery nullability

30ec9e2

github-actions bot added the SQL label May 8, 2023

dongjoon-hyun reviewed May 8, 2023

...test/scala/org/apache/spark/sql/catalyst/optimizer/BinaryComparisonSimplificationSuite.scala (outdated, resolved)

dongjoon-hyun reviewed May 8, 2023

...test/scala/org/apache/spark/sql/catalyst/optimizer/BinaryComparisonSimplificationSuite.scala (outdated, resolved)

dongjoon-hyun previously requested changes May 8, 2023

Review comments

7a1081c

dongjoon-hyun reviewed May 9, 2023

cloud-fan reviewed May 9, 2023

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala (outdated, resolved)

cloud-fan reviewed May 9, 2023

sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-nulls.sql (outdated, resolved)

jchen5 added 4 commits May 10, 2023 10:25

Test updates

84521f1

Move logic to InSubquery

f6a94d8

Add flag for legacy behavior

90d7b9a

Retrigger tests

112d77e

jchen5 requested review from dongjoon-hyun and cloud-fan May 11, 2023 18:11

cloud-fan approved these changes May 16, 2023

cloud-fan closed this in 2e56821 May 16, 2023

dongjoon-hyun reviewed May 16, 2023

jchen5 mentioned this pull request May 17, 2023

[SPARK-43413][SQL][FOLLOWUP] Show a directional message in ListQuery nullability assertion #41202

Closed

uday1409 mentioned this pull request Jun 15, 2023

Lineage is not getting tracked for subquery IN AbsaOSS/spline-spark-agent#700

Closed
