Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-40618][SQL] Fix bug in MergeScalarSubqueries rule with nested subqueries #38052

Closed
wants to merge 7 commits into from
Closed

[SPARK-40618][SQL] Fix bug in MergeScalarSubqueries rule with nested subqueries #38052

wants to merge 7 commits into from

Conversation

dtenedor
Copy link
Contributor

What changes were proposed in this pull request?

There is a bug in the MergeScalarSubqueries rule for queries with subquery expressions nested inside each other, wherein the rule attempts to merge the nested subquery with its enclosing parent subquery. The result is not a valid plan and raises an exception in the optimizer. Here is a minimal reproducing case:

sql("create table test(col int) using csv")
checkAnswer(sql("select(select sum((select sum(col) from test)) from test)"), Row(null))

To fix, we disable the optimization for subqueries with nested subqueries inside them for now.

Why are the changes needed?

This fixes a bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated existing unit tests and added the reproducing case as a new test case.

@github-actions github-actions bot added the SQL label Sep 29, 2022
@dtenedor dtenedor marked this pull request as ready for review September 29, 2022 23:46
@dtenedor
Copy link
Contributor Author

Hi @peter-toth, @sigmod, @gengliangwang could you please help to review this bug fix? 🙏

Copy link
Contributor

@sigmod sigmod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @dtenedor!

@dtenedor dtenedor requested review from peter-toth and gengliangwang and removed request for peter-toth and gengliangwang September 30, 2022 17:59
@gengliangwang
Copy link
Member

@dtenedor just reminder, could you check the test failure?

@dtenedor
Copy link
Contributor Author

@dtenedor just reminder, could you check the test failure?
@gengliangwang my mistake 😦 I have updated the unit test file. That test should now show that we avoid merging nested subqueries in that case. I updated it accordingly.

@@ -2157,7 +2157,7 @@ class SubquerySuite extends QueryTest
}
}

test("Merge non-correlated scalar subqueries from different parent plans") {
Copy link
Contributor

@peter-toth peter-toth Oct 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this is also a valid test and so it shouldn't be altered. The 2 leaf level subqueries are independent and can be merted to compute once. Similarly the 2 parents can be also merged. Admittedly, this test case is not the best, I should have used a different table than testData in the parent queries to make the test more simple.

But I don't want to block this fix so if we need it urgently I'm ok with mergint this PR but probably we can come up with a better alternative.

Copy link
Contributor

@peter-toth peter-toth Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is that currently we allow merging a subquery to another one that is needed to compute the subquery itself. So we should collect the (transitive) subquery references in the new plan and don't try merging the new plan to those.
@dtenedor, let me know if you want to update this PR or I can open one with the fix tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peter-toth Yeah, the bug is in tryMergePlans where it tries to merge a subquery plan with another plan that contains the original subquery plan inside, e.g. the regression test select(select sum((select sum(col) from t)) from t. I tried doing that by keeping sets of visited plans and checking them before merging, but it got complex since some ScalarSubqueryReferences were already converted.

We should probably merge this PR to fix planning errors while keeping most of the optimizations from this rule in the short term. Then we can follow-up to restore the remaining optimization. I put a TODO for that; with the regression test present, it should be safe to proceed from there.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@gengliangwang
Copy link
Member

Thanks, merging to master

cloud-fan pushed a commit that referenced this pull request Oct 13, 2022
…subqueries using reference tracking

### What changes were proposed in this pull request?
This PR reverts the previous fix #38052 and adds subquery reference tracking to `MergeScalarSubqueries` to restore previous functionality of merging independent nested subqueries.

### Why are the changes needed?
Restore previous functionality but fix the bug discovered in https://issues.apache.org/jira/browse/SPARK-40618.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing and new UTs.

Closes #38093 from peter-toth/SPARK-40618-fix-mergescalarsubqueries.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…subqueries using reference tracking

### What changes were proposed in this pull request?
This PR reverts the previous fix apache#38052 and adds subquery reference tracking to `MergeScalarSubqueries` to restore previous functionality of merging independent nested subqueries.

### Why are the changes needed?
Restore previous functionality but fix the bug discovered in https://issues.apache.org/jira/browse/SPARK-40618.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing and new UTs.

Closes apache#38093 from peter-toth/SPARK-40618-fix-mergescalarsubqueries.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants