[MINOR] Fix skew in clustering operator #12765
  LOG.info("Execute clustering plan for instant {} as {} file slices", clusteringInstantTime, clusteringGroup.getSlices().size());
  output.collect(new StreamRecord<>(
-     new ClusteringPlanEvent(clusteringInstantTime, ClusteringGroupInfo.create(clusteringGroup), clusteringPlan.getStrategy().getStrategyParams())
+     new ClusteringPlanEvent(clusteringInstantTime, ClusteringGroupInfo.create(clusteringGroup), clusteringPlan.getStrategy().getStrategyParams(), operationIndex)
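To illustrate why the diff threads an `operationIndex` through each event, here is a minimal sketch, not the PR's actual code. `PlanEvent`, `toEvents`, and `countPerSubtask` are hypothetical names, and the modulo routing is a simplification: Flink's real key-group assignment hashes keys, so the actual distribution may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OperationIndexSketch {

  /** Hypothetical stand-in for ClusteringPlanEvent; carries only what this sketch needs. */
  static final class PlanEvent {
    final String instantTime;
    final String groupId;
    final int operationIndex;

    PlanEvent(String instantTime, String groupId, int operationIndex) {
      this.instantTime = instantTime;
      this.groupId = groupId;
      this.operationIndex = operationIndex;
    }
  }

  /** Tags each clustering group of one plan with a running index, starting at 0. */
  static List<PlanEvent> toEvents(String instantTime, List<String> groupIds) {
    List<PlanEvent> events = new ArrayList<>();
    int operationIndex = 0;
    for (String groupId : groupIds) {
      events.add(new PlanEvent(instantTime, groupId, operationIndex++));
    }
    return events;
  }

  /** Counts events per subtask under an assumed, simplified "index % parallelism" routing. */
  static int[] countPerSubtask(List<PlanEvent> events, int parallelism) {
    int[] counts = new int[parallelism];
    for (PlanEvent event : events) {
      counts[event.operationIndex % parallelism]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    List<PlanEvent> events =
        toEvents("20250101000000", Arrays.asList("g0", "g1", "g2", "g3", "g4", "g5"));
    System.out.println(Arrays.toString(countPerSubtask(events, 3)));
  }
}
```

Under this assumed routing, six groups across three subtasks land two per subtask, whereas routing on a hash of the file ID gives no such guarantee.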
Do we need a hash map like the one in this PR? https://github.com/apache/hudi/pull/11757/files
The file IDs of each operation in a clustering plan are unique, so I think a hash map is not necessary. WDYT?
Is it possible that the operations come from two different plans?
No, a hash map here can only store operations of one plan.
Why? The plan generator can actually handle multiple plans.
// the first instant takes the highest priority.
Option<HoodieInstant> firstRequested = Option.fromJavaOptional(
pendingClusteringInstantTimes.stream()
.filter(instant -> instant.getState() == HoodieInstant.State.REQUESTED).findFirst());

The scheduleClustering method in ClusteringPlanOperator only handles the first plan each time.
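The "first requested instant wins" selection quoted above can be tried in isolation with the sketch below. It is only an illustration: `PendingInstant` and `State` are hypothetical stand-ins for `HoodieInstant` and its state enum, and `java.util.Optional` replaces Hudi's `Option` so the example compiles without Hudi on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class FirstRequestedSketch {

  /** Hypothetical stand-in for HoodieInstant.State. */
  enum State { REQUESTED, INFLIGHT, COMPLETED }

  /** Hypothetical stand-in for HoodieInstant; only the fields used here. */
  static final class PendingInstant {
    final String timestamp;
    final State state;

    PendingInstant(String timestamp, State state) {
      this.timestamp = timestamp;
      this.state = state;
    }
  }

  /** Returns the earliest-listed instant still in REQUESTED state, if any. */
  static Optional<PendingInstant> firstRequested(List<PendingInstant> pending) {
    return pending.stream()
        .filter(instant -> instant.state == State.REQUESTED)
        .findFirst();
  }

  public static void main(String[] args) {
    List<PendingInstant> pending = Arrays.asList(
        new PendingInstant("001", State.INFLIGHT),
        new PendingInstant("002", State.REQUESTED),
        new PendingInstant("003", State.REQUESTED));
    // Only one plan is picked per call; "003" waits for a later scheduling round.
    firstRequested(pending).ifPresent(instant -> System.out.println(instant.timestamp));
  }
}
```

Because only one instant is selected per call, the events emitted in a single scheduling round all belong to the same plan, which is the point made in the comment above.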
Let's add a hash map, just like compaction does.
(cherry picked from commit 7380c26)
Change Logs
Related to #11757.
Fixes the same skew in the clustering operator.
Impact
none
Risk level (write none, low, medium or high below)
none
Documentation Update
none
Contributor's checklist