HIVE-25448: Invalid partition columns when skew with distinct #2585
8787f86 to 512a9ef
@dengzhhu653 do you happen to have a test case for this?
Not yet. I have tested it in our environment on a skewed table, and it shows a considerable performance gain (MR).
Hi @kgyrtkirk, what do you think about this? There are also some tests, such as groupby11.q and groupby8_map_skew.q, that show the changes in partition columns after applying the fix. Thank you!
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
@@ -56,7 +56,7 @@ STAGE PLANS:
 key expressions: _col0 (type: string), _col1 (type: string)
 null sort order: zz
 sort order: ++
-Map-reduce partition columns: _col0 (type: string)
+Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
Do you happen to have a directed test case that was working incorrectly before this patch?
I guess it was returning 3 for the distinct count when the rows arrived in this order:
a | b
a | a
a | b
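To make the concern in the comment above concrete, here is a minimal Python sketch of a two-stage distinct count. It is not Hive's implementation; `count_distinct_two_stage` and the three assignment functions are hypothetical names invented for illustration. It only shows the general property at stake: partial distinct counts can be summed safely when every duplicate (group key, value) pair lands on the same first-stage reducer, and a shuffle that ignores the value (round-robin here, standing in for any random distribution) can over-count. It makes no claim about what Hive actually returned before the patch.

```python
import zlib
from collections import defaultdict


def stable_hash(value):
    """Deterministic hash so the example behaves the same on every run."""
    return zlib.crc32(repr(value).encode("utf-8"))


def count_distinct_two_stage(rows, assign, num_reducers):
    """Count distinct values per group key using two aggregation stages.

    rows: list of (group_key, value) tuples.
    assign: hypothetical function (row_index, row) -> int choosing a reducer.
    """
    # Shuffle: route each row to a first-stage reducer.
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        buckets[assign(i, row) % num_reducers].append(row)

    # Stage 1: each reducer dedupes the pairs it received.
    # Stage 2: the per-reducer partial counts are summed per group key.
    totals = defaultdict(int)
    for bucket in buckets:
        for group_key, _value in set(bucket):
            totals[group_key] += 1
    return dict(totals)


def by_group(i, row):
    # Partition on the group key only: correct, but one reducer gets the whole key.
    return stable_hash(row[0])


def by_group_and_value(i, row):
    # Partition on group key and distinct column: duplicates stay together.
    return stable_hash(row)


def round_robin(i, row):
    # Ignores the row content, so duplicates can be split across reducers.
    return i


rows = [("a", "b"), ("a", "a"), ("a", "b")]

result_by_group = count_distinct_two_stage(rows, by_group, 3)
result_by_pair = count_distinct_two_stage(rows, by_group_and_value, 3)
result_round_robin = count_distinct_two_stage(rows, round_robin, 3)

print(result_by_group)     # {'a': 2}
print(result_by_pair)      # {'a': 2}
print(result_round_robin)  # {'a': 3}
```

Partitioning on both columns keeps the result correct while letting a hot group key spread over several reducers, which is the trade-off the changed plan line reflects.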
94eb868 to cf10957
I found something interesting, when I explain
The partition column is rand() in this case. It seems something has already been done to improve the skew case, though I have not been able to find where that happens.
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
#### A masked pattern was here ####
The plan for select col1, count(distinct col2) from partition_distinct_skew group by col1
introduces some redundant reducers.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.