[feature-wip](multi-catalog) support pruning buckets for hive bucket table by AshinGau · Pull Request #11156 · apache/doris

AshinGau · 2022-07-24T18:59:26Z

Proposed changes

Support pruning buckets for hive bucket table.

Notes

Spark does not currently write bucketed output that is compatible with Hive, so Spark bucket tables are not supported in the current implementation.
Hive 3.0 introduced bucket version 2, but Doris still uses Hive 2.3.7, which lacks the version 2 hash function, so I followed Trino's implementation and copied the hash function from Hive.
The current implementation does not support tables with multiple bucketed columns, and supports only Equal and In predicates.
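
To illustrate the idea behind the notes above, here is a minimal sketch of how the values of an Equal or In predicate on a single bucketed column could be mapped to bucket ids. It is not the code in this PR: the class and method names are hypothetical, it covers only a BIGINT column, and it assumes Hive's bucket version 1 rule (a Long.hashCode-style hash, masked to non-negative, modulo the bucket count).

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Doris: maps predicate values to bucket ids.
final class BucketPruneSketch {
    // Assumption: bucket version 1 hashes a BIGINT the same way Long.hashCode does.
    static int hashBigintV1(long value) {
        return (int) (value ^ (value >>> 32));
    }

    // Assumption: the bucket id is the non-negative hash modulo the bucket count.
    static int bucketOf(long value, int numBuckets) {
        return (hashBigintV1(value) & Integer.MAX_VALUE) % numBuckets;
    }

    // An Equal predicate passes one value; an In predicate passes several.
    static Set<Integer> bucketsFor(Collection<Long> values, int numBuckets) {
        Set<Integer> buckets = new TreeSet<>();
        for (long v : values) {
            buckets.add(bucketOf(v, numBuckets));
        }
        return buckets;
    }
}
```

Under these assumptions, a predicate such as l_orderkey = 1 on a table with 96 buckets selects only bucket 1, so files belonging to the other 95 buckets need not be scanned.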

Profit

Suppose the lineitem table in TPC-H (sf=100) is bucketed by l_orderkey into 96 buckets. For the query:

select count(*) from lineitem where l_orderkey = 1;

the number of splits scanned is reduced from 144 to 6.

Checklist(Required)

Does it affect the original behavior: (No)
Has unit tests been added: (No)
Has document been added or modified: (No)
Does it need to update dependencies: (No)
Are there any changes that cannot be rolled back: (No)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

morningman · 2022-07-25T01:48:06Z

fe/fe-core/src/main/java/org/apache/doris/catalog/HiveBucketUtil.java

+                }
+            } else {
+                valid = false;
+                LOG.warn("File {} is not a bucket file in hive table {}, skip bucket pruning.", fileName, tableName);


morningman · 2022-07-25T01:48:42Z

fe/fe-core/src/main/java/org/apache/doris/catalog/HiveBucketUtil.java

+            }
+        }
+        if (valid) {
+            LOG.info("{} / {} input splits in hive table {} after bucket pruning.",


morningman · 2022-07-25T03:32:51Z

fe/fe-core/src/main/java/org/apache/doris/planner/external/ExternalHiveScanProvider.java

-            return getSplitsByPath(inputFormat, configuration, splitsPath);
+            result = getSplitsByPath(inputFormat, configuration, splitsPath);
+        }
+        Optional<Set<Integer>> prunedBuckets = HiveBucketUtil.getPrunedBuckets(


It looks like getPrunedBuckets and getPrunedSplitsByBuckets could be merged into one method.
Would that make it simpler to use?
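
As a rough sketch of what a single entry point might look like, the snippet below filters file names by bucket id and falls back to the unpruned input when any file is not a bucket file, mirroring the "skip bucket pruning" warning in the diff above. Everything here is hypothetical: the class, the method names, and the file-name pattern (bucket files assumed to be named like "000001_0") are assumptions for illustration, not the signatures used in HiveBucketUtil.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Doris implementation.
final class SplitPruneSketch {
    // Assumption: a bucket file name starts with the bucket id, e.g. "000001_0".
    private static final Pattern BUCKET_FILE = Pattern.compile("^(\\d{1,9})_\\d+.*$");

    static OptionalInt bucketIdOf(String fileName) {
        Matcher m = BUCKET_FILE.matcher(fileName);
        return m.matches()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    // Returns the files whose bucket id is wanted, or the original list
    // unchanged if any file does not look like a bucket file.
    static List<String> pruneByBuckets(List<String> fileNames, Set<Integer> wantedBuckets) {
        List<String> kept = new ArrayList<>();
        for (String name : fileNames) {
            OptionalInt id = bucketIdOf(name);
            if (!id.isPresent()) {
                return fileNames; // not a bucket file: skip pruning entirely
            }
            if (wantedBuckets.contains(id.getAsInt())) {
                kept.add(name);
            }
        }
        return kept;
    }
}
```

Returning the input unchanged on an unrecognized file name keeps the query correct at the cost of losing the optimization, which is the conservative choice when the table layout cannot be trusted.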


morningman

LGTM

github-actions · 2022-07-27T02:37:48Z

PR approved by at least one committer and no changes requested.

github-actions · 2022-07-27T02:37:51Z

PR approved by anyone and no changes requested.


github-actions bot added the area/planner Issues or PRs related to the query planner label Jul 24, 2022

morningman added the area/multi-catalog label Jul 25, 2022

morningman reviewed Jul 25, 2022


AshinGau force-pushed the bucket-prune branch 5 times, most recently from 74fba98 to 9179e12 Compare July 25, 2022 12:52

[feature-wip](multi-catalog) support pruning buckets for hive bucket …

38ad795

…table

AshinGau force-pushed the bucket-prune branch from 9179e12 to 38ad795 Compare July 26, 2022 02:47

morningman approved these changes Jul 27, 2022


github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 27, 2022

github-actions bot added the reviewed label Jul 27, 2022

morningman merged commit 8128e5d into apache:master Jul 27, 2022

AshinGau deleted the bucket-prune branch August 3, 2022 06:55

morningman merged 1 commit into apache:master from AshinGau:bucket-prune
