HIVE-26184: COLLECT_SET with GROUP BY is very slow when some keys are highly skewed #3253
Merged
kgyrtkirk approved these changes May 3, 2022
Contributor
Author
CI failed, but the failure does not appear to be caused by this PR.
DongWei-4 pushed a commit to DongWei-4/hive that referenced this pull request Oct 28, 2022
… highly skewed (apache#3253) (okumin reviewed by Zoltan Haindrich)
dengzhhu653 pushed a commit to dengzhhu653/hive that referenced this pull request Dec 15, 2022
… highly skewed (apache#3253) (okumin reviewed by Zoltan Haindrich)
What changes were proposed in this pull request?
This would reduce the time complexity of COLLECT_SET from O({maximum length} * {num rows}) to O({maximum length} + {num rows}).
https://issues.apache.org/jira/browse/HIVE-26184
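The description above does not include the diff, so the following is only a sketch of one way a cost of O({maximum length} * {num rows}) can arise and be avoided. It assumes, without confirmation from this PR text, that an aggregation buffer reuses one hash-based set across groups and resets it with `clear()`. In OpenJDK's `HashMap` (which backs `LinkedHashSet`), `clear()` walks the whole backing table, and that table never shrinks after one very large group. `CollectSetResetSketch`, `Buffer`, `resetByClear`, and `resetByReplace` are hypothetical names for illustration, not Hive classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CollectSetResetSketch {
    // Hypothetical stand-in for a per-operator aggregation buffer (not Hive's class).
    static final class Buffer {
        Set<Integer> container = new LinkedHashSet<>();
    }

    // Strategy A: reuse the set. In OpenJDK, clear() on a non-empty set nulls
    // every slot of the backing table, so its cost follows the largest group seen.
    static void resetByClear(Buffer b) {
        b.container.clear();
    }

    // Strategy B: drop the old set and allocate a fresh, small one.
    static void resetByReplace(Buffer b) {
        b.container = new LinkedHashSet<>();
    }

    // Processes groups one after another, as a sorted GROUP BY would, and
    // returns the number of distinct values collected for each group.
    static List<Integer> aggregate(int[] groupSizes, boolean replace) {
        Buffer b = new Buffer();
        List<Integer> distinctCounts = new ArrayList<>();
        for (int size : groupSizes) {
            if (replace) {
                resetByReplace(b);
            } else {
                resetByClear(b);
            }
            for (int v = 0; v < size; v++) {
                b.container.add(v);
                b.container.add(v); // duplicate, ignored by the set
            }
            distinctCounts.add(b.container.size());
        }
        return distinctCounts;
    }

    public static void main(String[] args) {
        // One skewed group followed by many single-row groups (assumed workload).
        int[] sizes = new int[5_001];
        sizes[0] = 100_000;
        Arrays.fill(sizes, 1, sizes.length, 1);
        for (boolean replace : new boolean[] {false, true}) {
            long start = System.nanoTime();
            List<Integer> counts = aggregate(sizes, replace);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println((replace ? "replace" : "clear") + ": "
                    + counts.size() + " groups in " + ms + " ms");
        }
    }
}
```

Both strategies return the same results; only the reset cost differs. Timings depend on the JVM and hardware, so the sketch prints them rather than asserting any particular ratio.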
Why are the changes needed?
I'm observing that some reducers take a long time because of this issue.
Does this PR introduce any user-facing change?
No
How was this patch tested?
I have run the reproduction case in HIVE-26184 with this patch and confirmed the reduce vertex finished more than 30x faster.