[SPARK-43413][SQL] Fix IN subquery ListQuery nullability #41094
Thank you for making a PR. Apache Spark has a test case directory for IN subquery
like the following. Could you add a nullable ListQuery test case there, please?
Thank you for updates!
thanks, merging to master!
Thank you, @jchen5 and @cloud-fan.
"behavior that does not check the right side's nullability.")
.version("3.5.0")
.booleanConf
.createWithDefault(false)
Since this is `false`, we need to add `spark.sql.legacy.inSubqueryNullability` to our SQL migration guide. Could you make a follow-up, @jchen5?
I think the purpose of this flag is not to let users go back to the wrong result, but just in case the newly added assertion fails. I think it's better to mention this flag in the assert error message, than mentioning it in the migration guide.
+100 for @cloud-fan 's holistic and complete solution.
Agreed, created followup here: #41202
…nullability assertion

### What changes were proposed in this pull request?
In case the assert for the call to ListQuery.nullable is hit, mention in the assert error message the conf flag that can be used to disable the assert. Follow-up to #41094 (comment)

### Why are the changes needed?
Improve error message.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests

Closes #41202 from jchen5/in-nullability-assert.

Authored-by: Jack Chen <jack.chen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
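The follow-up described above can be pictured with a small, self-contained sketch. It is not Spark's code: `listQueryNullable` and its `legacyEnabled` parameter are hypothetical stand-ins, the assertion condition and message wording are assumptions, and only the conf key comes from this thread.

```scala
object ListQueryAssertSketch {
  // Conf key as named in the review thread.
  val LegacyFlag = "spark.sql.legacy.inSubqueryNullability"

  // Hypothetical stand-in for ListQuery.nullable; `legacyEnabled` models the conf value.
  def listQueryNullable(legacyEnabled: Boolean): Boolean = {
    // Assumption: outside legacy mode, reaching this call is treated as a bug,
    // and the message tells the reader which flag disables the assertion.
    assert(
      legacyEnabled,
      s"ListQuery.nullable should not be called; set $LegacyFlag=true to disable this assertion."
    )
    false // legacy behavior: the right-hand side is always reported as non-nullable
  }
}
```

With this shape, a user who hits the assertion sees the flag name in the error itself, which is the argument made in the thread for preferring the message over a migration-guide entry.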
What changes were proposed in this pull request?
Before this PR, IN subquery expressions are incorrectly marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand side, but the right-hand side of an IN subquery, the ListQuery, is currently always defined with nullability = false. This is incorrect and can lead to incorrect query transformations.
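To make the two nullability rules concrete, here is a minimal sketch using hypothetical stand-in types (`Col`, `InSubqueryModel`), not Spark's actual `InSubquery` and `ListQuery` classes. It contrasts consulting only the left-hand side with also consulting the subquery's output columns:

```scala
object InNullabilityModel {
  // Hypothetical stand-in for an expression or output attribute with a nullability flag.
  final case class Col(name: String, nullable: Boolean)

  // Hypothetical model of `values IN (subquery)`; `subqueryOutput` plays the
  // role of the ListQuery child output expressions.
  final case class InSubqueryModel(values: Seq[Col], subqueryOutput: Seq[Col]) {
    // Behavior described as the bug: only the left-hand side is consulted.
    def legacyNullable: Boolean = values.exists(_.nullable)

    // Rule that accounts for both sides: nullable if either side can produce NULL.
    def nullable: Boolean =
      values.exists(_.nullable) || subqueryOutput.exists(_.nullable)
  }

  // A non-nullable left-hand side compared against a nullable subquery column.
  val example: InSubqueryModel = InSubqueryModel(
    values = Seq(Col("non_nullable_col", nullable = false)),
    subqueryOutput = Seq(Col("nullable_col", nullable = true))
  )
}
```

For `example`, `legacyNullable` is `false` while `nullable` is `true`; an optimizer rule that trusts the first value may apply rewrites that are only valid for expressions that can never be NULL.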
Example: `(non_nullable_col IN (select nullable_col)) <=> TRUE`. Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the `<=> TRUE`, transforming the expression to `non_nullable_col IN (select nullable_col)`, which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE.

Fix this by calculating nullability correctly from the ListQuery child output expressions.
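The difference between the original and the simplified predicate can be checked by hand with a small model of SQL three-valued logic. This is an illustrative sketch only: `sqlIn` and `nullSafeEq` are hypothetical helpers, SQL booleans are modeled as `Option[Boolean]` with `None` for NULL, and standard SQL semantics for IN are assumed.

```scala
object InThreeValuedLogic {
  // SQL boolean modeled as Option: Some(true), Some(false), or None for NULL.
  type SqlBool = Option[Boolean]

  // `x IN (rows)` under SQL three-valued logic (standard semantics assumed).
  def sqlIn(x: Option[Int], rows: Seq[Option[Int]]): SqlBool =
    if (rows.isEmpty) Some(false)          // empty subquery: FALSE
    else if (x.isEmpty) None               // NULL IN (non-empty): NULL
    else if (rows.contains(x)) Some(true)  // a match: TRUE
    else if (rows.contains(None)) None     // no match, but a NULL row: NULL
    else Some(false)                       // no match, no NULLs: FALSE

  // Null-safe equality `<=>`: NULL <=> NULL is TRUE, and the result is never NULL.
  def nullSafeEq(a: SqlBool, b: SqlBool): Boolean = a == b

  // non_nullable_col = 1, and the subquery yields a single NULL nullable_col.
  val inResult: SqlBool = sqlIn(Some(1), Seq(None))

  // Original predicate: (non_nullable_col IN (...)) <=> TRUE
  val original: SqlBool = Some(nullSafeEq(inResult, Some(true)))

  // After the incorrect simplification: non_nullable_col IN (...)
  val simplified: SqlBool = inResult
}
```

In this model `original` is FALSE while `simplified` is NULL, so dropping `<=> TRUE` is only safe when the IN expression genuinely cannot be NULL.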
This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests at least.
This is a long-standing bug that has existed since at least 2016, for as long as the ListQuery class has existed.
Why are the changes needed?
Fix correctness bug.
Does this PR introduce any user-facing change?
May change query results to fix correctness bug.
How was this patch tested?
Unit tests