Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-25448: Invalid partition columns when skew with distinct #2585

Closed
wants to merge 4 commits into from

Conversation

dengzhhu653
Copy link
Member

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

@dengzhhu653

This comment has been minimized.

@dengzhhu653

This comment has been minimized.

@kgyrtkirk
Copy link
Member

@dengzhhu653 do you happen to have a testcase for this?

@dengzhhu653
Copy link
Member Author

@dengzhhu653 do you happen to have a testcase for this?

Not yet, I have tested on our environment for the skew table, shows that it can get pretty performance gain(mr).

@dengzhhu653
Copy link
Member Author

@dengzhhu653 do you happen to have a testcase for this?

Not yet, I have tested on our environment for the skew table, shows that it can get pretty performance gain(mr).

Hi @kgyrtkirk, what do you think about this? there are also some tests like groupby11.q and groupby8_map_skew.q showing the changes in partition columns after applying the fix. Thank you!

@github-actions
Copy link

github-actions bot commented Feb 2, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@@ -56,7 +56,7 @@ STAGE PLANS:
key expressions: _col0 (type: string), _col1 (type: string)
null sort order: zz
sort order: ++
Map-reduce partition columns: _col0 (type: string)
Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you happen to have a directed testcase which were working incorrectly before this patch?

I guess it was returning 3 for distinct in case the rows were in the order of:

a | b
a | a
a | b

@dengzhhu653
Copy link
Member Author

dengzhhu653 commented Feb 23, 2022

I found something interesting, when I explain select col1, count(distinct col2) from partition_distinct_skew group by col1; on master branch, the output is following:

      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: partition_distinct_skew
                  Statistics: Num rows: 3 Data size: 510 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: col1 (type: string), col2 (type: string)
                    outputColumnNames: col1, col2
                    Statistics: Num rows: 3 Data size: 510 Basic stats: COMPLETE Column stats: COMPLETE
                    Group By Operator
                      keys: col1 (type: string), col2 (type: string)
                      minReductionHashAggr: 0.4
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 2 Data size: 340 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col1 (type: string)
                        null sort order: zz
                        sort order: ++
                        Map-reduce partition columns: rand() (type: double)
                        Statistics: Num rows: 2 Data size: 340 Basic stats: COMPLETE Column stats: COMPLETE

The partition column is rand() for this case. It's seems we have done something to improve the skew case, though I'm not able to find where the cause locates.

Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
#### A masked pattern was here ####
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan of select col1, count(distinct col2) from partition_distinct_skew group by col1 introduces some redundant reducers.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Apr 27, 2022
@github-actions github-actions bot closed this May 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants