Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-26968: Compare DPP sources when compare and gather parent operators #3981

Conversation

ngsg
Copy link
Contributor

@ngsg ngsg commented Jan 25, 2023

What changes were proposed in this pull request?

Compare parent DPP operators when compare and gather the parents of 2 TableScan operators.

Why are the changes needed?

During extended shared work optimization, 2 TS operators are merged even if they have semantically different DPP parent operators. The merged TS operator keeps only one of the DPP parent operators, and the filter generated by the missing DPP operator will not be applied on the merged TS operator. As a consequence, executing the optimized query plan returns incorrect result set.

Does this PR introduce any user-facing change?

No

How was this patch tested?

mvn test -Dtest=TestSharedWorkOptimizer
cd itests; mvn test -Dtest=TestCliDriver

@sonarcloud
Copy link

sonarcloud bot commented Jan 25, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 2 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@zabetak zabetak self-requested a review March 21, 2023 08:30
@zabetak zabetak self-assigned this Mar 21, 2023
Copy link
Contributor

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @ngsg . Since the ticket claims that the optimization creates a semantically incorrect query plan the PR should contain a minimal .q test with an EXPLAIN demonstrating the problem and validating the fix. This would also ensure that the problem does not resurface again in the future.

@sonarcloud
Copy link

sonarcloud bot commented Mar 28, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 2 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@ngsg
Copy link
Contributor Author

ngsg commented Mar 29, 2023

Hello @zabetak. I have added a new qfile, which validates my PR. In a nutshell, this qfile submits the same query twice while varying the value of hive.optimize.shared.work.dppunion. I checked that current Hive produces different results as I described in the JIRA issue (https://issues.apache.org/jira/browse/HIVE-26968). Could you please review the changes? Thank you.

@zabetak
Copy link
Contributor

zabetak commented Apr 5, 2023

Thanks for the update @ngsg ; I am bit underwater these days but this is on my TODO list!

@github-actions
Copy link

github-actions bot commented Jun 5, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Jun 5, 2023
ngsg and others added 5 commits June 5, 2023 11:49
1. Use regular tables instead of Iceberg since the problem is not specific to Iceberg.
2. Drop redundant columns from DDLs and queries since we can repro the problem without these.
3. Drop redundant WHERE filters and aggregate functions from queries to make plans more compact and readable.
4. Remove explicit set of hive.optimize.shared.work.dppunion since the problem appears no matter the value of this property.
5. Remove comments about the expected plan since they are hard to follow and actually the whole point of running EXPLAIN just after is to verify that we get the expected plan.
Move the test under the correct directory and update the plan since hive.user.explain=false is in effect.
@zabetak zabetak force-pushed the HIVE-26968-compareAndGatherOps-compares-DPPSources branch from b6b0cb0 to 6102c2e Compare June 5, 2023 10:52
Copy link
Contributor

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ngsg apologies for the late response. It took me a bit of time to understand how the shared work optimizer works but I now understand the problem and the proposed fix LGTM.

I rebased to latest master and added a few commits simplifying your test reproducer. Please go over the new changes and let me know if you have any comments.

If you are OK with the new changes and the tests come back green I will merge this PR.

@ngsg
Copy link
Contributor Author

ngsg commented Jun 5, 2023

Thank you for your review, @zabetak . I have checked the changes and it looks good to me.

@sonarcloud
Copy link

sonarcloud bot commented Jun 5, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 7 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@zabetak zabetak closed this in 136aff9 Jun 8, 2023
yeahyung pushed a commit to yeahyung/hive that referenced this pull request Jul 20, 2023
…or with different DPP edges (Seonggon Namgung reviewed by Stamatis Zampetakis)

During extended shared work optimization, 2 TS operators are merged
even if they have semantically different DPP parent operators. The
merged TS operator keeps only one of the DPP parent operators, and the
filter generated by the missing DPP operator will not be applied on the
merged TS operator. As a consequence, executing the optimized query
plan returns incorrect result set.

Modify SharedWorkOptimizer to check the parents of TableScan operators
and bail out if DPP edges differ.

Closes apache#3981
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
…or with different DPP edges (Seonggon Namgung reviewed by Stamatis Zampetakis)

During extended shared work optimization, 2 TS operators are merged
even if they have semantically different DPP parent operators. The
merged TS operator keeps only one of the DPP parent operators, and the
filter generated by the missing DPP operator will not be applied on the
merged TS operator. As a consequence, executing the optimized query
plan returns incorrect result set.

Modify SharedWorkOptimizer to check the parents of TableScan operators
and bail out if DPP edges differ.

Closes apache#3981
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants