[HUDI-7957] fix data skew when writing with bulk_insert + bucket_index enabled #11578
danny0405 merged 1 commit into apache:master
Conversation
public static Functions.Function2<String, Integer, Integer> getPartitionIndexFunc(int bucketNum, int parallelism) {
  return (partition, curBucket) -> {
    int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
    int globalIndex = partitionIndex + curBucket;
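The diff above is truncated, so here is a self-contained sketch of how the new assignment reads; the class name and the final `% parallelism` step (mapping globalIndex to a task) are my assumptions, not code from the PR.

```java
// Hypothetical, self-contained sketch of the new bucket-index assignment.
// Names mirror the diff; the final "% parallelism" is assumed to be what
// the caller does with globalIndex when picking a task.
public class BucketIndexSketch {

  // Start index of a partition's buckets: the partition hash picks one of
  // `parallelism` blocks, each `bucketNum` wide.
  public static int partitionIndex(String partition, int bucketNum, int parallelism) {
    return (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
  }

  // Task index for a (partition, curBucket) pair.
  public static int taskIndex(String partition, int curBucket, int bucketNum, int parallelism) {
    int globalIndex = partitionIndex(partition, bucketNum, parallelism) + curBucket;
    return globalIndex % parallelism;
  }
}
```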
Has the logic in Flink got a regression?
After trying it many times, I can't reproduce this problem.
Did you check the original commit PR description and comments, maybe we can sync up there first: #11346
Yes, I have checked, and added a more complete unit test.
These two algorithms divide the parallelism into bucket-sized blocks, so that the tasks are evenly distributed.
(partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum means the start index of the buckets in a partition.
The difference is whether it uses "% parallelism" or "/ (parallelism / bucketNum)".
Assume bucketNum = 2, parallelism = 10.
"(partition.hashCode() & Integer.MAX_VALUE) % parallelism" means the start index can be any value in [0, parallelism); the values of different partitions can be next to each other.
With "(partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum)", there will be an interval between the start indexes of different partitions.
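To make the difference concrete, here is a small sketch computing both start-index variants; the class and method names are mine, not from the Hudi code base.

```java
// Sketch comparing the two start-index formulas from the discussion.
// Variant names are mine, not from the Hudi code base.
public class StartIndexCompare {

  // Modulo variant: start indexes of different partitions can be adjacent
  // multiples of bucketNum in [0, parallelism * bucketNum).
  public static int startWithModulo(String partition, int bucketNum, int parallelism) {
    return (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
  }

  // Division variant: dividing the hash instead of taking a modulo leaves
  // an interval between the start indexes of different partitions.
  public static int startWithDivision(String partition, int bucketNum, int parallelism) {
    return (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum);
  }
}
```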
Can you summarize the current drawback of the Flink partitioner algorithm for 3 scenarios and what is the main reason to rollback the changes to the original?
@danny0405 no, the original logic in Flink is (partition.hashCode() & Integer.MAX_VALUE) % parallelism; the current one is (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNumber.
@xicm (partition.hashCode() & Integer.MAX_VALUE) % parallelism is right, but (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum + 1) * bucketNum has a small problem: because it uses / rather than %, each adjacent partition will be in the same slot. The range depends on parallelism / bucketNum, which in your case is 5. So I think it should be like this.

Each adjacent partition will be in the same slot
this is not what we expect.
You can run the test case with parallelism = 10, bucketNumber = 5 and partition = ["2021-01-01", "2021-01-03"]; the result is [2, 2, 2, 2, 2], so half of the TMs are idle.
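The idle-TM effect can be reproduced with a small counting helper; this is a hypothetical sketch (the helper name is mine, not from the Hudi code base) that tallies how many (partition, bucket) slices each task receives under the modulo formula.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not from the Hudi code base) that counts how many
// (partition, bucket) slices each task receives under the modulo formula.
public class SkewCheck {

  public static Map<Integer, Integer> taskLoad(String[] partitions, int bucketNum, int parallelism) {
    Map<Integer, Integer> load = new HashMap<>();
    for (String partition : partitions) {
      // Start index of this partition's bucket block.
      int start = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
      for (int bucket = 0; bucket < bucketNum; bucket++) {
        // Each bucket lands on one task; count the slice.
        load.merge((start + bucket) % parallelism, 1, Integer::sum);
      }
    }
    return load;
  }
}
```

With parallelism = 10, bucketNumber = 5 and two partitions there are only 2 * 5 = 10 slices in total, so if the two partitions hash onto overlapping blocks, only a subset of the 10 tasks receives work.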
@xicm hi, #11578 (comment), as the comment says, it cannot cover all cases; it is just based on practical considerations.
hudi-common/src/test/java/org/apache/hudi/common/util/hash/TestBucketIndexUtil.java
// parallelism2TaskCount size needs to get close to parallelism or (bucketNumber * 8)
if (parallelism >= (bucketNumber * 8)) {
  assertTrue(parallelism2TaskCount.size() >= (bucketNumber * 8 * 0.9));
} else {
  assertTrue(parallelism2TaskCount.size() >= parallelism * 0.9);
}
Added a new check: the number of worker threads needs to get close to parallelism or (bucketNumber * partitionNumber) if we have enough resources.
KnightChess left a comment
@danny0405 @xicm From the discussion results and unit test situations, we can conclude that in the case of consecutive partitions, the new algorithm is more stable than the old algorithm. However, in the case of non-consecutive partitions, both algorithms exhibit some fluctuations. For example, with a parallelism of 10 and 5 buckets, the new algorithm can only achieve half of the parallelism for partitions 01 and 03, whereas the old algorithm can fully utilize the parallelism. With a parallelism of 20 and 5 buckets, and the same partitions, the new algorithm can achieve half of the parallelism, but the old algorithm can only achieve a quarter. The specific outcome depends on the initial position of the partitions.
Both of these algorithms are better than the original Spark bulk bucket partitioner algorithm; I think they can both address the skew issue to some extent. If we want to keep the original Flink implementation, I will modify the logic and unit tests, because with the current unit test scenarios, half of them cannot pass. I am inclined towards the current fix. What do you think?
Yes, the result of the two algorithms depends on the hash value of the partition, which is almost random. Is there a better way to get the partition number?
The old algorithm has overflow problems; if we fix the overflow problem, the old algorithm is better.
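For reference, the usual sign/overflow pitfall with hash-based indexing looks like the following. This is a generic sketch, not the actual old Hudi code: String.hashCode() can be negative, and Math.abs(Integer.MIN_VALUE) itself overflows, which is why the formula masks with Integer.MAX_VALUE.

```java
// Generic sketch of the sign/overflow pitfall, not the actual old Hudi code.
public class HashGuard {

  // Plain % keeps the sign of the dividend, so a negative hash would
  // produce a negative index, and Math.abs(Integer.MIN_VALUE) overflows
  // back to Integer.MIN_VALUE. Masking the sign bit avoids both problems.
  public static int nonNegativeMod(int hash, int parallelism) {
    return (hash & Integer.MAX_VALUE) % parallelism;
  }
}
```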
Let's fire a fix for it, and @KnightChess let's keep the Flink hashing algorithm the same as it is and we can improve it in a separate PR I think.
@xicm no, even after fixing the overflow problem the old algorithm will not be better; you can try the unit test. I have tried it before.
@danny0405 I have tried before; the result is that the new algorithm is better. I will cherry-pick it in a separate PR.
Oh, there's something wrong with my test case; the old algorithm also has drawbacks.
@xicm @KnightChess So we reach consensus that the algorithm raised by @KnightChess is better? If that's true, let's fire a fix in a separate PR.
Both algorithms have drawbacks. parallelism = 20, bucketNumber = 5 and partition = ["2021-01-01", "2021-01-03"]. The element in the array means how many data slices each TM processes.
@xicm That's fine, the new algorithm looks simpler, there is no need to distinguish between different parallelisms.
@danny0405 @KnightChess Shall we use the new algorithm? The new algorithm looks simpler. The previous implementation had an overflow problem; does it need a fix?
@xicm @danny0405 yes, I will submit a new PR to fix it soon.
Change Logs
The BucketIndexUtil partitionIndex algorithm makes the data evenly distributed. BucketPartitionUtils in Spark uses the BucketIndexUtil method, same logic as Flink.
Impact
Risk level (write none, low, medium or high below)
low
Documentation Update
None
Contributor's checklist