-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-40618][SQL] Fix bug in MergeScalarSubqueries rule with nested subqueries #38052
Conversation
Hi @peter-toth, @sigmod, @gengliangwang could you please help to review this bug fix? 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @dtenedor!
...lyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueriesSuite.scala
Outdated
Show resolved
Hide resolved
...lyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueriesSuite.scala
Outdated
Show resolved
Hide resolved
@dtenedor just reminder, could you check the test failure? |
|
@@ -2157,7 +2157,7 @@ class SubquerySuite extends QueryTest | |||
} | |||
} | |||
|
|||
test("Merge non-correlated scalar subqueries from different parent plans") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this is also a valid test and so it shouldn't be altered. The 2 leaf level subqueries are independent and can be merted to compute once. Similarly the 2 parents can be also merged. Admittedly, this test case is not the best, I should have used a different table than testData
in the parent queries to make the test more simple.
But I don't want to block this fix so if we need it urgently I'm ok with mergint this PR but probably we can come up with a better alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is that currently we allow merging a subquery to another one that is needed to compute the subquery itself. So we should collect the (transitive) subquery references in the new plan and don't try merging the new plan to those.
@dtenedor, let me know if you want to update this PR or I can open one with the fix tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peter-toth Yeah, the bug is in tryMergePlans
where it tries to merge a subquery plan with another plan that contains the original subquery plan inside, e.g. the regression test select(select sum((select sum(col) from t)) from t
. I tried doing that by keeping sets of visited plans and checking them before merging, but it got complex since some ScalarSubqueryReference
s were already converted.
We should probably merge this PR to fix planning errors while keeping most of the optimizations from this rule in the short term. Then we can follow-up to restore the remaining optimization. I put a TODO for that; with the regression test present, it should be safe to proceed from there.
Can one of the admins verify this patch? |
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Outdated
Show resolved
Hide resolved
Thanks, merging to master |
…subqueries using reference tracking ### What changes were proposed in this pull request? This PR reverts the previous fix #38052 and adds subquery reference tracking to `MergeScalarSubqueries` to restore previous functionality of merging independent nested subqueries. ### Why are the changes needed? Restore previous functionality but fix the bug discovered in https://issues.apache.org/jira/browse/SPARK-40618. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing and new UTs. Closes #38093 from peter-toth/SPARK-40618-fix-mergescalarsubqueries. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…subqueries using reference tracking ### What changes were proposed in this pull request? This PR reverts the previous fix apache#38052 and adds subquery reference tracking to `MergeScalarSubqueries` to restore previous functionality of merging independent nested subqueries. ### Why are the changes needed? Restore previous functionality but fix the bug discovered in https://issues.apache.org/jira/browse/SPARK-40618. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing and new UTs. Closes apache#38093 from peter-toth/SPARK-40618-fix-mergescalarsubqueries. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
There is a bug in the
MergeScalarSubqueries
rule for queries with subquery expressions nested inside each other, wherein the rule attempts to merge the nested subquery with its enclosing parent subquery. The result is not a valid plan and raises an exception in the optimizer. Here is a minimal reproducing case:To fix, we disable the optimization for subqueries with nested subqueries inside them for now.
Why are the changes needed?
This fixes a bug.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Updated existing unit tests and added the reproducing case as a new test case.