[SPARK-40618][SQL] Fix bug in MergeScalarSubqueries rule with nested subqueries #38052

dtenedor · 2022-09-29T23:45:31Z

What changes were proposed in this pull request?

There is a bug in the MergeScalarSubqueries rule for queries with subquery expressions nested inside each other, wherein the rule attempts to merge the nested subquery with its enclosing parent subquery. The result is not a valid plan and raises an exception in the optimizer. Here is a minimal reproducing case:

sql("create table test(col int) using csv")
checkAnswer(sql("select(select sum((select sum(col) from test)) from test)"), Row(null))

To fix, we disable the optimization for subqueries with nested subqueries inside them for now.
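
As a rough illustration of the guard described above, here is a minimal sketch on a toy expression model. It does not use Catalyst's real classes: `Expr`, `Plan`, `containsSubquery`, and `isMergeCandidate` are hypothetical names invented for this example, and the actual rule may detect nested subqueries differently.

```scala
// Toy model only: these types are hypothetical stand-ins, not Catalyst classes.
object NestedSubqueryGuard {
  sealed trait Expr
  case class Column(name: String) extends Expr
  case class Sum(child: Expr) extends Expr
  case class ScalarSubquery(plan: Plan) extends Expr

  // A plan here is just a table name plus the expressions it outputs.
  case class Plan(table: String, exprs: Seq[Expr])

  // True if the expression tree contains a scalar subquery anywhere.
  def containsSubquery(e: Expr): Boolean = e match {
    case ScalarSubquery(_) => true
    case Sum(child)        => containsSubquery(child)
    case Column(_)         => false
  }

  // A subquery is considered for merging only if its own plan
  // has no subquery expressions nested inside it.
  def isMergeCandidate(sub: ScalarSubquery): Boolean =
    !sub.plan.exprs.exists(containsSubquery)
}
```

For the reproducing query, the inner `select sum(col) from test` would still be a candidate under this sketch, while the enclosing subquery would be skipped.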

Why are the changes needed?

This fixes a bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated existing unit tests and added the reproducing case as a new test case.

dtenedor · 2022-09-29T23:49:57Z

Hi @peter-toth, @sigmod, @gengliangwang could you please help to review this bug fix? 🙏

sigmod

thanks @dtenedor!

gengliangwang · 2022-09-30T20:27:56Z

@dtenedor just a reminder, could you check the test failure?

dtenedor · 2022-09-30T21:30:57Z

> @dtenedor just a reminder, could you check the test failure?

@gengliangwang my mistake 😦 I have updated the unit test file. That test now shows that we avoid merging nested subqueries in that case.

peter-toth · 2022-10-01T09:43:51Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

@@ -2157,7 +2157,7 @@ class SubquerySuite extends QueryTest
    }
  }

-  test("Merge non-correlated scalar subqueries from different parent plans") {


Unfortunately, this is also a valid test, so it shouldn't be altered. The 2 leaf-level subqueries are independent and can be merged so that they are computed once. Similarly, the 2 parents can also be merged. Admittedly, this test case is not the best; I should have used a different table than testData in the parent queries to make the test simpler.

But I don't want to block this fix, so if we need it urgently I'm OK with merging this PR, though we can probably come up with a better alternative.
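
To make "independent subqueries can be merged and computed once" concrete, here is a small sketch on a toy model. The `Subquery` case class, the `merge` function, and the table and column names are hypothetical and chosen only for illustration; the real rule works on logical plans and handles many more cases.

```scala
// Toy model only: Subquery and merge are hypothetical, not Spark APIs.
object MergeIndependent {
  // A scalar subquery reduced to: which table it scans, which aggregates it computes.
  case class Subquery(table: String, aggregates: Seq[String])

  // Combine subqueries over the same table so the table is scanned once
  // and all requested aggregates are computed together.
  def merge(subqueries: Seq[Subquery]): Seq[Subquery] =
    subqueries
      .groupBy(_.table)
      .toSeq
      .sortBy(_._1) // deterministic output order by table name
      .map { case (table, group) =>
        Subquery(table, group.flatMap(_.aggregates).distinct)
      }
}
```

In this sketch, `sum(a)` and `avg(a)` over the same table `t` collapse into a single subquery computing both, while a subquery over another table is left alone.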

peter-toth · 2022-10-03

I think the issue is that we currently allow merging a subquery into another one that is itself needed to compute that subquery. So we should collect the (transitive) subquery references in the new plan and not try merging the new plan into those.
@dtenedor, let me know if you want to update this PR, or I can open one with the fix tomorrow.
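
The reference-tracking idea above could look roughly like the following sketch. It assumes a hypothetical cache in which each already-seen subquery is identified by its index and stores the indices of the subqueries it references directly; `transitiveRefs` and `mergeCandidates` are illustrative names, not functions from the actual rule.

```scala
import scala.annotation.tailrec

// Toy model only: the cache layout and function names are assumptions.
object ReferenceTracking {
  // cache(i) holds the indices of cached subqueries that subquery i references directly.
  def transitiveRefs(direct: Set[Int], cache: IndexedSeq[Set[Int]]): Set[Int] = {
    @tailrec
    def loop(frontier: Set[Int], seen: Set[Int]): Set[Int] =
      if (frontier.isEmpty) seen
      else {
        val visited = seen ++ frontier
        // Follow one more level of references, skipping anything already visited.
        val next = frontier.flatMap(i => cache(i)) -- visited
        loop(next, visited)
      }
    loop(direct, Set.empty)
  }

  // A new plan must not be merged into any subquery it depends on,
  // directly or transitively; every other cached subquery stays a candidate.
  def mergeCandidates(direct: Set[Int], cache: IndexedSeq[Set[Int]]): List[Int] = {
    val excluded = transitiveRefs(direct, cache)
    cache.indices.filterNot(i => excluded.contains(i)).toList
  }
}
```

With a cache where subquery 2 references 1 and 1 references 0, a new plan that references 2 would only be compared against subqueries outside that chain.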

dtenedor · 2022-10-03

@peter-toth Yeah, the bug is in tryMergePlans, where it tries to merge a subquery plan with another plan that contains the original subquery plan inside, e.g. the regression test select(select sum((select sum(col) from t)) from t). I tried doing that by keeping sets of visited plans and checking them before merging, but it got complex since some ScalarSubqueryReferences were already converted.

We should probably merge this PR to fix the planning errors while keeping most of the optimizations from this rule in the short term. Then we can follow up to restore the remaining optimization. I put in a TODO for that; with the regression test present, it should be safe to proceed from there.

AmplabJenkins · 2022-10-01T14:05:07Z

Can one of the admins verify this patch?

gengliangwang · 2022-10-03T18:24:48Z

Thanks, merging to master

…subqueries using reference tracking

### What changes were proposed in this pull request?
This PR reverts the previous fix #38052 and adds subquery reference tracking to `MergeScalarSubqueries` to restore previous functionality of merging independent nested subqueries.

### Why are the changes needed?
Restore previous functionality but fix the bug discovered in https://issues.apache.org/jira/browse/SPARK-40618.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing and new UTs.

Closes #38093 from peter-toth/SPARK-40618-fix-mergescalarsubqueries.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

dtenedor added 2 commits September 29, 2022 16:38

initial implementation (766f862)

initial implementation (59a5d49)

github-actions bot added the SQL label Sep 29, 2022

dtenedor marked this pull request as ready for review September 29, 2022 23:46

initial implementation (49a4bee)

sigmod approved these changes Sep 29, 2022

gengliangwang reviewed Sep 30, 2022: ...lyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueriesSuite.scala (outdated)

gengliangwang approved these changes Sep 30, 2022

peter-toth reviewed Sep 30, 2022: ...lyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueriesSuite.scala (outdated)

dtenedor added 2 commits September 30, 2022 10:57

respond to code review comments (5c70a69)

respond to code review comments (d51df03)

dtenedor requested review from peter-toth and gengliangwang and removed request for peter-toth and gengliangwang September 30, 2022 17:59

peter-toth approved these changes Sep 30, 2022

fix test (c8ee73c)

peter-toth reviewed Oct 1, 2022

amaliujia reviewed Oct 1, 2022: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (outdated)

respond to code review comments (03ba8df)

gengliangwang closed this in 9ac9cd5 Oct 3, 2022

peter-toth mentioned this pull request Oct 4, 2022: [SPARK-40618][SQL] Fix bug in MergeScalarSubqueries rule with nested subqueries using reference tracking #38093 (Closed)
