[feature-wip](multi-catalog) support pruning buckets for hive bucket table #11156

Merged
morningman merged 1 commit into apache:master from AshinGau:bucket-prune on Jul 27, 2022

@AshinGau
Member

Proposed changes

Support pruning buckets for hive bucket table.

Notes

  1. Spark currently does not write bucketed output that is compatible with Hive, so Spark bucket tables are not supported in the current implementation.
  2. Hive 3.0 introduced bucket version 2, but Doris still uses Hive 2.3.7, which lacks the version 2 hash function, so I referred to Trino's implementation and copied the hash function from Hive.
  3. The current implementation does not support tables with multiple bucketed columns, and only supports Equal and In predicates.
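
To make the pruning idea concrete, here is a minimal sketch of how an Equal or In predicate on a single bucketed column can be mapped to a set of bucket ids. It is not the code in this PR: the class and method names are hypothetical, the column is assumed to be BIGINT, and only the Hive bucket version 1 rule is shown (fold the long into an int, clear the sign bit, take the modulus). Version 2 uses a Murmur3-based hash that is not reproduced here.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; names are hypothetical and not taken from the PR.
final class BucketPruneSketch {
    // Hive bucket version 1 hash for a BIGINT value: fold the 64 bits into 32.
    static int hashLongV1(long value) {
        return (int) (value ^ (value >>> 32));
    }

    // Map a hash code to a bucket id: clear the sign bit, then take the modulus.
    static int bucketOf(int hash, int numBuckets) {
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    // For `col = v` pass one literal; for `col IN (...)` pass every literal.
    // An empty Optional means "cannot prune, scan all buckets".
    static Optional<Set<Integer>> prunedBuckets(List<Long> literals, int numBuckets) {
        if (literals.isEmpty() || numBuckets <= 0) {
            return Optional.empty();
        }
        Set<Integer> buckets = new TreeSet<>();
        for (long v : literals) {
            buckets.add(bucketOf(hashLongV1(v), numBuckets));
        }
        return Optional.of(buckets);
    }

    public static void main(String[] args) {
        // Equal predicate: a single literal selects a single bucket.
        System.out.println(prunedBuckets(List.of(1L), 96));
        // In predicate: the union of the buckets of all literals.
        System.out.println(prunedBuckets(List.of(1L, 97L, 200L), 96));
    }
}
```

Under these assumptions an Equal predicate always selects exactly one bucket, and an In predicate selects at most as many buckets as it has literals, which is why only these two predicate kinds are straightforward to support.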

Profit

Suppose the lineitem table in TPC-H (sf=100) is bucketed by l_orderkey into 96 buckets. For the query:

select count(*) from lineitem where l_orderkey = 1;

The number of splits scanned drops from 144 to 6.
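
The reduction comes from dropping every split whose file belongs to a bucket the predicate cannot match. The following is a rough sketch of that filtering step, not the PR's code: the class, the method names, and the file name pattern are assumptions (it assumes Hive-written bucket files begin with the zero-padded bucket id, as in `000001_0`). If any file name does not fit the pattern, the sketch gives up and keeps all splits, which matches the "skip bucket pruning" fallback in the reviewed diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and the file name pattern are assumptions.
final class SplitPruneSketch {
    // Assumed naming: "<zero-padded bucket id>_<attempt>...", e.g. "000001_0".
    private static final Pattern BUCKET_FILE = Pattern.compile("^(\\d{1,9})_\\d+.*");

    // Extract the bucket id from a file name, or empty if the name does not match.
    static OptionalInt bucketIdOf(String fileName) {
        Matcher m = BUCKET_FILE.matcher(fileName);
        return m.matches()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    // Keep only the splits whose file belongs to one of the wanted buckets.
    static List<String> pruneSplits(List<String> splitFiles, Set<Integer> buckets) {
        List<String> kept = new ArrayList<>();
        for (String file : splitFiles) {
            OptionalInt id = bucketIdOf(file);
            if (!id.isPresent()) {
                // Not a recognizable bucket file: skip pruning and scan everything.
                return splitFiles;
            }
            if (buckets.contains(id.getAsInt())) {
                kept.add(file);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> files = List.of("000000_0", "000001_0", "000001_0", "000002_0");
        System.out.println(pruneSplits(files, Set.of(1)));
    }
}
```

A large bucket file can be cut into several splits, so one surviving bucket may still contribute more than one split.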

Checklist(Required)

  1. Does it affect the original behavior: (No)
  2. Has unit tests been added: (No)
  3. Has document been added or modified: (No)
  4. Does it need to update dependencies: (No)
  5. Are there any changes that cannot be rolled back: (No)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@github-actions github-actions bot added the area/planner Issues or PRs related to the query planner label Jul 24, 2022
}
} else {
valid = false;
LOG.warn("File {} is not a bucket file in hive table {}, skip bucket pruning.", fileName, tableName);
Contributor

use debug

Member Author

done

}
}
if (valid) {
LOG.info("{} / {} input splits in hive table {} after bucket pruning.",
Contributor

debug

Member Author

done

- return getSplitsByPath(inputFormat, configuration, splitsPath);
+ result = getSplitsByPath(inputFormat, configuration, splitsPath);
}
Optional<Set<Integer>> prunedBuckets = HiveBucketUtil.getPrunedBuckets(
Contributor

Looks like we can merge getPrunedBuckets and getPrunedSplitsByBuckets into one method, to make it simpler to use?

Member Author

done

@AshinGau AshinGau force-pushed the bucket-prune branch 5 times, most recently from 74fba98 to 9179e12 Compare July 25, 2022 12:52
Contributor

@morningman morningman left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 27, 2022
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@morningman morningman merged commit 8128e5d into apache:master Jul 27, 2022
miswujian pushed a commit to miswujian/doris that referenced this pull request Jul 28, 2022
…table (apache#11156)
whutpencil pushed a commit to whutpencil/incubator-doris that referenced this pull request Jul 29, 2022
…table (apache#11156)
@AshinGau AshinGau deleted the bucket-prune branch August 3, 2022 06:55
Labels

approved, area/multi-catalog, area/planner, reviewed