
[HUDI-7957] fix data skew when writing with bulk_insert + bucket_inde…#11578

Merged
danny0405 merged 1 commit into apache:master from KnightChess:fix-data-skew on Jul 10, 2024

Conversation

@KnightChess
Contributor

[HUDI-7957] fix data skew when writing with bulk_insert + bucket_index enabled

Change Logs

  • Improve the BucketIndexUtil partitionIndex algorithm so that the data is evenly distributed.
  • BucketPartitionUtils in Spark uses the BucketIndexUtil method, the same logic as Flink.

Impact

  • Flink partitionIndex will use the new algorithm

Risk level (write none, low medium or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Jul 5, 2024
    public static Functions.Function2<String, Integer, Integer> getPartitionIndexFunc(int bucketNum, int parallelism) {
      return (partition, curBucket) -> {
        // Spread partitions with the new start-index formula, then offset by the bucket id.
        int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
        int globalIndex = partitionIndex + curBucket;
        return BucketIdentifier.mod(globalIndex, parallelism);
      };
    }
Contributor

Has the logic in Flink got a regression?

Contributor Author

Not sure; I couldn't reproduce it locally. I re-triggered the CI.

Contributor Author

After trying it many times, I can't reproduce this problem.

Contributor

Did you check the original commit PR description and comments, maybe we can sync up there first: #11346

Contributor Author

Yes, I have checked, and I added more complete unit tests.

Contributor
@xicm xicm Jul 9, 2024

These two algorithms divide the parallelism into bucket-sized blocks so that the tasks are evenly distributed.

(partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum means the start index of the buckets in a partition.

The difference is whether we use "% parallelism" or "/ (parallelism / bucketNum)".

Assume bucketNum = 2, parallelism = 10.

"(partition.hashCode() & Integer.MAX_VALUE) % parallelism" means the start index can be any value in [0, parallelism); the values of different partitions can be next to each other.

"(partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum)" leaves an interval between the start indexes of different partitions.
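A tiny sketch of the difference (a hypothetical demo, not Hudi code; the hash values are stand-ins for partition.hashCode(), and the two expressions are taken as quoted in this comment):

```java
// Hypothetical demo: compare the start indexes produced by the two expressions
// discussed above, with bucketNum = 2 and parallelism = 10.
public class StartIndexDemo {
    // "% parallelism" style: start index can be any value in [0, parallelism)
    static int modStyle(int hash, int parallelism) {
        return (hash & Integer.MAX_VALUE) % parallelism;
    }

    // "/ (parallelism / bucketNum)" style: adjacent hashes collapse into
    // blocks of size parallelism / bucketNum
    static int divStyle(int hash, int parallelism, int bucketNum) {
        return (hash & Integer.MAX_VALUE) / (parallelism / bucketNum);
    }

    public static void main(String[] args) {
        int[] hashes = {3, 4, 17, 28};
        for (int h : hashes) {
            System.out.println("hash=" + h
                + " modStyle=" + modStyle(h, 10)
                + " divStyle=" + divStyle(h, 10, 2));
        }
    }
}
```

With the division form, hashes 3 and 4 fall into the same block (both yield 0), while the modulo form keeps them on adjacent start indexes (3 and 4).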

Contributor Author
@KnightChess KnightChess Jul 9, 2024

> Can you summarize the current drawback of the Flink partitioner algorithm for 3 scenarios, and what is the main reason to roll back the changes to the original?

@danny0405 no, the original logic in Flink is (partition.hashCode() & Integer.MAX_VALUE) % parallelism; the current one is (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNumber

Contributor Author

@xicm (partition.hashCode() & Integer.MAX_VALUE) % parallelism is right, but (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum + 1) * bucketNum has a little problem: because it uses / rather than %, each adjacent partition will be in the same slot. The range depends on parallelism / bucketNum; in your case it is 5.

Contributor
@xicm xicm Jul 9, 2024

> Each adjacent partition will be in the same slot

This is not what we expect.

You can run the test case with parallelism = 10, bucketNumber = 5 and partition = ["2021-01-01", "2021-01-03"]; the result is [2, 2, 2, 2, 2], so half of the TMs are idle.

Contributor Author

@xicm hi, see #11578 (comment); as that comment says, it cannot cover all cases, it is just based on practical considerations.

@KnightChess KnightChess closed this Jul 6, 2024
@KnightChess KnightChess reopened this Jul 6, 2024
Comment on lines +120 to +125

    // parallelism2TaskCount size need Infinitely close parallelism or (bucketNumber * 8)
    if (parallelism >= (bucketNumber * 8)) {
      assertTrue(parallelism2TaskCount.size() >= (bucketNumber * 8 * 0.9));
    } else {
      assertTrue(parallelism2TaskCount.size() >= parallelism * 0.9);
    }
Contributor Author

Added a new check: the number of worker threads should approach parallelism, or (bucketNumber * partitionNumber), if we have enough resources.

Contributor Author
@KnightChess KnightChess left a comment

@danny0405 @xicm From the discussion results and unit test situations, we can conclude that in the case of consecutive partitions, the new algorithm is more stable than the old algorithm. However, in the case of non-consecutive partitions, both algorithms exhibit some fluctuations. For example, with a parallelism of 10 and 5 buckets, the new algorithm can only achieve half of the parallelism for partitions 01 and 03, whereas the old algorithm can fully utilize the parallelism. With a parallelism of 20 and 5 buckets, and the same partitions, the new algorithm can achieve half of the parallelism, but the old algorithm can only achieve a quarter. The specific outcome depends on the initial position of the partitions.

@KnightChess
Contributor Author

Both of these algorithms are better than the original Spark bulk-insert bucket partitioner algorithm; I think they can both address the skew issue to some extent. If we want to keep the original Flink implementation, I will modify the logic and unit tests, because half of the current unit test scenarios cannot pass. I am inclined towards the current fix. What do you think?

@xicm
Contributor

xicm commented Jul 9, 2024

> @danny0405 @xicm From the discussion results and unit test situations, we can conclude that in the case of consecutive partitions, the new algorithm is more stable than the old algorithm. However, in the case of non-consecutive partitions, both algorithms exhibit some fluctuations. For example, with a parallelism of 10 and 5 buckets, the new algorithm can only achieve half of the parallelism for partitions 01 and 03, whereas the old algorithm can fully utilize the parallelism. With a parallelism of 20 and 5 buckets, and the same partitions, the new algorithm can achieve half of the parallelism, but the old algorithm can only achieve a quarter. The specific outcome depends on the initial position of the partitions.

Yes, the result of the two algorithms depends on the hash value of the partition, which is almost random. Is there a better way to get the partition number?


@KnightChess

The old algorithm has overflow problems; if we fix the overflow problem, the old algorithm is better.

          int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum) * bucketNum;
          int globalIndex = Math.abs(partitionIndex) + curBucket;
          return BucketIdentifier.mod(globalIndex, parallelism);

@danny0405
Contributor

> if we fix the overflow problem, the old algorithm is better.

Let's fire a fix for it, and @KnightChess let's keep the Flink hashing algorithm the same as it is; we can improve it in a separate PR, I think.

@danny0405 danny0405 self-assigned this Jul 10, 2024
@danny0405 danny0405 added the area:performance Performance optimizations label Jul 10, 2024
@KnightChess
Contributor Author

@xicm no, even after fixing the overflow problem, the old one will not be better; you can try the unit tests. I have tried this before.

@KnightChess
Contributor Author

KnightChess commented Jul 10, 2024

@danny0405 I have tried it before; the result is that the new algorithm is better. I will cherry-pick it in a separate PR.

@xicm
Contributor

xicm commented Jul 10, 2024

> @xicm no, even after fixing the overflow problem, the old one will not be better; you can try the unit tests. I have tried this before.

Oh, there was something wrong with my test case; the old algorithm also has drawbacks.

@danny0405
Contributor

@xicm @KnightChess So do we reach consensus that the algorithm raised by @KnightChess is better? If that's true, let's fire a fix in a separate PR.

@github-actions github-actions bot added size:S PR with lines of changes in (10, 100] and removed size:M PR with lines of changes in (100, 300] labels Jul 10, 2024
@xicm
Contributor

xicm commented Jul 10, 2024

> So do we reach consensus that the algorithm raised by @KnightChess is better? If that's true, let's fire a fix in a separate PR.

Both algorithms have drawbacks.
For example,
parallelism = 10, bucketNumber = 5 and partition = ["2021-01-01", "2021-01-03"]
old: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
new: [2, 2, 2, 2, 2]

parallelism = 20, bucketNumber = 5 and partition = ["2021-01-01", "2021-01-03"]
old: [2, 2, 2, 2, 2]
new: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Each element in the array means how many data slices that TM processes.
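The comparison above can be reproduced with a small sketch (hypothetical demo, not Hudi code; the two start-index formulas are my reading of this thread, and `newStart`, `oldStart`, and `taskCounts` are illustrative names):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count how many buckets land on each task under the
// two start-index formulas discussed in this thread.
public class SkewDemo {
    // "new" style: start index from % parallelism, scaled by bucketNum
    static int newStart(String partition, int parallelism, int bucketNum) {
        return (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
    }

    // "old" style (with the overflow guarded by Math.abs below): start index
    // from dividing the hash into blocks of size parallelism / bucketNum
    static int oldStart(String partition, int parallelism, int bucketNum) {
        return (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum) * bucketNum;
    }

    // Map task id -> number of data slices assigned to it
    static Map<Integer, Integer> taskCounts(String[] partitions, int parallelism,
                                            int bucketNum, boolean useNew) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String p : partitions) {
            int start = useNew ? newStart(p, parallelism, bucketNum)
                               : oldStart(p, parallelism, bucketNum);
            for (int bucket = 0; bucket < bucketNum; bucket++) {
                int task = Math.abs(start + bucket) % parallelism;
                counts.merge(task, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] partitions = {"2021-01-01", "2021-01-03"};
        System.out.println("old: " + taskCounts(partitions, 10, 5, false));
        System.out.println("new: " + taskCounts(partitions, 10, 5, true));
        System.out.println("old: " + taskCounts(partitions, 20, 5, false));
        System.out.println("new: " + taskCounts(partitions, 20, 5, true));
    }
}
```

Either way, the total number of assigned slices is partitions x buckets; only how they spread over the tasks differs between the two formulas.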

@hudi-bot
Collaborator

CI report:


@KnightChess
Contributor Author

I don't know why this check contains the docker module; other successful checks don't seem to contain it. Retriggering again.

@KnightChess KnightChess reopened this Jul 10, 2024
@danny0405
Contributor

> Both algorithms have drawbacks.

@xicm That's fine, the new algorithm looks simpler, there is no need to distinguish between different parallelisms.

@KnightChess KnightChess reopened this Jul 10, 2024
@danny0405 danny0405 merged commit bb0621e into apache:master Jul 10, 2024
@xicm
Contributor

xicm commented Jul 10, 2024

@danny0405 @KnightChess Shall we use the new algorithm? The new algorithm looks simpler. The previous implementation had an overflow problem; does it need a fix?

@KnightChess
Contributor Author

@xicm @danny0405 yes, I will submit a new PR to fix it soon.

