Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43413][SQL] Fix IN subquery ListQuery nullability #41094

Closed
wants to merge 6 commits into from

Conversation

jchen5
Copy link
Contributor

@jchen5 jchen5 commented May 8, 2023

What changes were proposed in this pull request?

Before this PR, IN subquery expressions are incorrectly marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE. Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col), which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE.

Fix this by calculating nullability correctly from the ListQuery child output expressions.

This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests at least.

This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed.

Why are the changes needed?

Fix correctness bug.

Does this PR introduce any user-facing change?

May change query results to fix correctness bug.

How was this patch tested?

Unit tests

@github-actions github-actions bot added the SQL label May 8, 2023
@jchen5
Copy link
Contributor Author

jchen5 commented May 8, 2023

@cloud-fan @allisonwang-db

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR. Apache Spark has a test case directory for IN subquery like the following. Could you add a nullable ListQuery test case there, please?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updates!

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 2e56821 May 16, 2023
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 16, 2023

Thank you, @jchen5 and @cloud-fan .

"behavior that does not check the right side's nullability.")
.version("3.5.0")
.booleanConf
.createWithDefault(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is false, we need to add spark.sql.legacy.inSubqueryNullability to our SQL migration guide. Could you make a followup, @jchen5 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the purpose of this flag is not to let users go back to the wrong result, but just in case the newly added assertion fails. I think it's better to mention this flag in the assert error message, than mentioning it in the migration guide.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+100 for @cloud-fan 's holistic and complete solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, created followup here: #41202

dongjoon-hyun pushed a commit that referenced this pull request May 18, 2023
…nullability assertion

### What changes were proposed in this pull request?
In case the assert for the call to ListQuery.nullable is hit, mention in the assert error message the conf flag that can be used to disable the assert. Follow-up to #41094 (comment)

### Why are the changes needed?
Improve error message.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests

Closes #41202 from jchen5/in-nullability-assert.

Authored-by: Jack Chen <jack.chen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants