HIVE-25448: Invalid partition columns when skew with distinct #2585

dengzhhu653 · 2021-08-16T02:35:29Z

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

kgyrtkirk · 2021-11-22T11:45:39Z

@dengzhhu653 do you happen to have a testcase for this?

dengzhhu653 · 2021-11-22T12:00:43Z

> @dengzhhu653 do you happen to have a testcase for this?

Not yet. I have tested it in our environment on a skewed table, and it shows a considerable performance gain (MR).

dengzhhu653 · 2021-12-03T11:12:48Z

> > @dengzhhu653 do you happen to have a testcase for this?

> Not yet. I have tested it in our environment on a skewed table, and it shows a considerable performance gain (MR).

Hi @kgyrtkirk, what do you think about this? There are also some tests, such as groupby11.q and groupby8_map_skew.q, that show the changed partition columns after applying the fix. Thank you!

github-actions · 2022-02-02T00:13:24Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

kgyrtkirk · 2022-02-23T10:16:03Z

ql/src/test/results/clientpositive/llap/autoColumnStats_7.q.out

@@ -56,7 +56,7 @@ STAGE PLANS:
                      key expressions: _col0 (type: string), _col1 (type: string)
                      null sort order: zz
                      sort order: ++
-                      Map-reduce partition columns: _col0 (type: string)
+                      Map-reduce partition columns: _col0 (type: string), _col1 (type: string)


Do you happen to have a directed test case that was working incorrectly before this patch?

I guess it was returning 3 for the distinct count when the rows arrived in this order:

a | b
a | a
a | b
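The guess above can be illustrated with a minimal sketch. It assumes a reducer that counts distinct values by comparing each row only with the previous one, which is correct only when identical (key, value) pairs arrive adjacent to each other. The function name `count_distinct_adjacent` is hypothetical and this is not Hive's reducer code; it only reproduces the 3-versus-2 arithmetic for the row order given above.

```python
def count_distinct_adjacent(rows):
    """Count distinct values per key by comparing each row with the previous one.

    Assumption for illustration: this is only correct when identical
    (key, value) pairs are adjacent, for example when the input is sorted.
    """
    counts = {}
    previous = None
    for key, value in rows:
        # A row is treated as "new" whenever it differs from the row before it.
        if (key, value) != previous:
            counts[key] = counts.get(key, 0) + 1
        previous = (key, value)
    return counts


# Row order from the review comment: (a, b), (a, a), (a, b).
arrival_order = [("a", "b"), ("a", "a"), ("a", "b")]

unsorted_result = count_distinct_adjacent(arrival_order)
sorted_result = count_distinct_adjacent(sorted(arrival_order))

print(unsorted_result)  # {'a': 3}  the duplicate (a, b) is counted twice
print(sorted_result)    # {'a': 2}  duplicates are adjacent, so counted once
```

Whether the pre-patch plan could actually deliver rows in that order is exactly what a directed test case would need to show.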

dengzhhu653 · 2022-02-23T15:28:01Z

I found something interesting: when I run explain select col1, count(distinct col2) from partition_distinct_skew group by col1; on the master branch, the output is as follows:

      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: partition_distinct_skew
                  Statistics: Num rows: 3 Data size: 510 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: col1 (type: string), col2 (type: string)
                    outputColumnNames: col1, col2
                    Statistics: Num rows: 3 Data size: 510 Basic stats: COMPLETE Column stats: COMPLETE
                    Group By Operator
                      keys: col1 (type: string), col2 (type: string)
                      minReductionHashAggr: 0.4
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 2 Data size: 340 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col1 (type: string)
                        null sort order: zz
                        sort order: ++
                        Map-reduce partition columns: rand() (type: double)
                        Statistics: Num rows: 2 Data size: 340 Basic stats: COMPLETE Column stats: COMPLETE

The partition column is rand() in this case. It seems we have already done something to improve the skew case, though I have not been able to find where that happens.
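To make the three partitioning choices seen in this thread concrete (the group key only, the group key plus the distinct column, and rand()), here is a small simulation. It is a sketch under stated assumptions, not Hive's partitioner: `stable_hash`, `assign`, and `summed_local_distinct` are hypothetical helpers, CRC32 stands in for the real hash function, and the sample rows are invented.

```python
import random
import zlib
from collections import defaultdict


def stable_hash(value):
    # Deterministic stand-in for a partitioner hash (assumption: CRC32).
    return zlib.crc32(value.encode("utf-8"))


def assign(rows, num_reducers, strategy, seed=0):
    """Distribute (col1, col2) rows to reducers under one partitioning strategy."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for col1, col2 in rows:
        if strategy == "group_key":
            reducer = stable_hash(col1) % num_reducers
        elif strategy == "group_and_distinct_key":
            reducer = stable_hash(col1 + "\x00" + col2) % num_reducers
        elif strategy == "rand":
            reducer = rng.randrange(num_reducers)
        else:
            raise ValueError("unknown strategy: " + strategy)
        buckets[reducer].append((col1, col2))
    return buckets


def summed_local_distinct(buckets):
    # Number of distinct pairs each reducer sees, summed over all reducers.
    return sum(len(set(bucket)) for bucket in buckets.values())


# Invented skewed input: key "a" dominates and has 50 distinct col2 values.
rows = [("a", str(i % 50)) for i in range(1000)] + [("b", "x"), ("c", "y")]
global_distinct = len(set(rows))

for strategy in ("group_key", "group_and_distinct_key", "rand"):
    buckets = assign(rows, 4, strategy)
    largest = max(len(bucket) for bucket in buckets.values())
    print(strategy, "largest reducer:", largest,
          "summed local distinct:", summed_local_distinct(buckets),
          "global distinct:", global_distinct)
```

In this model, partitioning on the group key alone sends every row of the hot key to one reducer. Partitioning on both columns keeps identical pairs together, so the per-reducer distinct counts add up to the global count. Random partitioning can scatter identical pairs across reducers, so the same sum can exceed the global count.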

dengzhhu653 · 2022-02-24T02:08:11Z

ql/src/test/results/clientpositive/llap/partition_distinct_skew.q.out

+        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
+        Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
+        Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
+#### A masked pattern was here ####


The plan for select col1, count(distinct col2) from partition_distinct_skew group by col1 introduces some redundant reducers.

github-actions · 2022-04-27T00:23:16Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

kgyrtkirk added tests pending tests unstable and removed tests pending labels Aug 16, 2021

dengzhhu653 force-pushed the HIVE-25448 branch from 8787f86 to 512a9ef Compare August 17, 2021 01:01

kgyrtkirk added tests pending tests passed and removed tests unstable tests pending labels Aug 17, 2021

github-actions bot added the stale label Feb 2, 2022

github-actions bot closed this Feb 9, 2022

dengzhhu653 reopened this Feb 11, 2022

kgyrtkirk added tests pending tests unstable tests passed and removed tests passed tests pending tests unstable labels Feb 11, 2022

github-actions bot closed this Feb 19, 2022

kgyrtkirk reviewed Feb 23, 2022

dengzhhu653 added 3 commits February 23, 2022 22:21

HIVE-25448: Invalid partition columns when skew with distinct

b09e33f

remove some codes

16cfe8a

add test

cf10957

dengzhhu653 reopened this Feb 23, 2022

kgyrtkirk added tests pending and removed tests passed labels Feb 23, 2022

dengzhhu653 force-pushed the HIVE-25448 branch from 94eb868 to cf10957 Compare February 23, 2022 15:15

kgyrtkirk added tests failed tests pending and removed tests pending tests failed labels Feb 23, 2022

github-actions bot removed the stale label Feb 24, 2022

more

771299f

kgyrtkirk added tests pending and removed tests failed labels Feb 24, 2022

dengzhhu653 commented Feb 24, 2022

kgyrtkirk added tests unstable and removed tests pending labels Feb 24, 2022

github-actions bot added the stale label Apr 27, 2022

github-actions bot closed this May 5, 2022
