HIVE-26968: Compare DPP sources when compare and gather parent operators #3981
Kudos, SonarCloud Quality Gate passed! |
Thanks for the PR, @ngsg. Since the ticket claims that the optimization creates a semantically incorrect query plan, the PR should contain a minimal .q test with an EXPLAIN demonstrating the problem and validating the fix. This would also ensure that the problem does not resurface in the future.
Hello @zabetak. I have added a new qfile, which validates my PR. In a nutshell, this qfile submits the same query twice while varying the value of hive.optimize.shared.work.dppunion. I checked that current Hive produces different results for the two runs, as I described in the JIRA issue (https://issues.apache.org/jira/browse/HIVE-26968). Could you please review the changes? Thank you.
Thanks for the update, @ngsg; I am a bit underwater these days, but this is on my TODO list!
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
1. Use regular tables instead of Iceberg since the problem is not specific to Iceberg.
2. Drop redundant columns from DDLs and queries since we can repro the problem without these.
3. Drop redundant WHERE filters and aggregate functions from queries to make plans more compact and readable.
4. Remove explicit set of hive.optimize.shared.work.dppunion since the problem appears no matter the value of this property.
5. Remove comments about the expected plan since they are hard to follow and actually the whole point of running EXPLAIN just after is to verify that we get the expected plan.
Move the test under the correct directory and update the plan since hive.user.explain=false is in effect.
b6b0cb0 to 6102c2e
Hey @ngsg, apologies for the late response. It took me a bit of time to understand how the shared work optimizer works, but I now understand the problem, and the proposed fix LGTM.
I rebased to latest master and added a few commits simplifying your test reproducer. Please go over the new changes and let me know if you have any comments.
If you are OK with the new changes and the tests come back green I will merge this PR.
Thank you for your review, @zabetak. I have checked the changes and they look good to me.
…or with different DPP edges (Seonggon Namgung reviewed by Stamatis Zampetakis)

During extended shared work optimization, 2 TS operators are merged even if they have semantically different DPP parent operators. The merged TS operator keeps only one of the DPP parent operators, and the filter generated by the missing DPP operator is not applied to the merged TS operator. As a consequence, executing the optimized query plan returns an incorrect result set.

Modify SharedWorkOptimizer to check the parents of TableScan operators and bail out if DPP edges differ.

Closes apache#3981
What changes were proposed in this pull request?
Compare the parent DPP operators when comparing and gathering the parents of 2 TableScan operators.
Why are the changes needed?
During extended shared work optimization, 2 TS operators are merged even if they have semantically different DPP parent operators. The merged TS operator keeps only one of the DPP parent operators, and the filter generated by the missing DPP operator is not applied to the merged TS operator. As a consequence, executing the optimized query plan returns an incorrect result set.
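
To make the failure mode concrete, here is a minimal, self-contained sketch of the kind of check described above. It is not the code from SharedWorkOptimizer: the class `DppMergeCheck`, the nested `TableScan` type, the string encoding of a DPP source, and the table and column names are all hypothetical and exist only for illustration. The idea is simply that two scans of the same table are merge candidates only when the sets of DPP sources feeding them are equal.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative model only; these types are NOT Hive classes.
final class DppMergeCheck {

    // Hypothetical stand-in for a TableScan operator and the DPP sources
    // (encoded here as plain strings) whose pruning filters it relies on.
    static final class TableScan {
        final String table;
        final Set<String> dppSources;

        TableScan(String table, String... dppSources) {
            this.table = table;
            this.dppSources = new HashSet<>(Arrays.asList(dppSources));
        }
    }

    // Two scans may be merged only if they read the same table AND are fed
    // by the same set of DPP sources. If the sets differ, merging would drop
    // one pruning filter, so the caller should bail out.
    static boolean canMerge(TableScan a, TableScan b) {
        return a.table.equals(b.table) && a.dppSources.equals(b.dppSources);
    }

    public static void main(String[] args) {
        TableScan ts1 = new TableScan("fact", "dim_a.key -> fact.a_key");
        TableScan ts2 = new TableScan("fact", "dim_b.key -> fact.b_key");
        TableScan ts3 = new TableScan("fact", "dim_a.key -> fact.a_key");

        System.out.println(canMerge(ts1, ts2)); // false: different DPP sources
        System.out.println(canMerge(ts1, ts3)); // true: identical DPP sources
    }
}
```

In this toy model, skipping the `dppSources` comparison would let `ts1` and `ts2` merge into one scan that keeps only one of the two pruning filters, which mirrors the incorrect result described above.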
Does this PR introduce any user-facing change?
No
How was this patch tested?