
Conversation

@wsry wsry (Contributor) commented Jan 25, 2022

What is the purpose of the change

Currently, for the result partition of sort-shuffle, there is extra record copy overhead introduced by clustering records by subpartition index. For small records, this overhead can cause a performance regression of up to 20%. This ticket aims to solve that problem.

In fact, the hash-based implementation is a natural way to achieve the goal of sorting records by partition index. However, it has some serious weaknesses. For example, when there are not enough buffers or the data is skewed, it can waste buffers and hurt compression efficiency, which can cause performance regression.

This ticket solves the issue by dynamically switching between the two implementations: if there are enough buffers, the hash-based implementation is used; otherwise, the sort-based implementation is used.
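In code, the switching decision boils down to something like this (a minimal, self-contained sketch; the class and method names are illustrative, not this PR's code, and the two-buffers-per-subpartition threshold follows the review discussion below):

enum DataBufferType { HASH_BASED, SORT_BASED }

final class DataBufferSelector {
    /** Picks the buffer implementation from the available buffer budget. */
    static DataBufferType select(int numAvailableBuffers, int numSubpartitions) {
        // With enough buffers, records can be clustered per subpartition directly
        // (no extra copy); otherwise fall back to sort-based appending.
        return numAvailableBuffers >= 2 * numSubpartitions
                ? DataBufferType.HASH_BASED
                : DataBufferType.SORT_BASED;
    }

    public static void main(String[] args) {
        System.out.println(select(200, 50)); // HASH_BASED
        System.out.println(select(60, 50));  // SORT_BASED
    }
}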

Brief change log

  • Dynamically switch between the two implementations: if there are enough buffers, the hash-based implementation is used; otherwise, the sort-based implementation is used.

Verifying this change

This change adds tests, and existing tests also help to verify it.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot (Collaborator)

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 7937706 (Tue Jan 25 13:09:59 UTC 2022)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot flinkbot (Collaborator) commented Jan 25, 2022

CI report:

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure to re-run the last Azure build

wsry pushed a commit to wsry/flink that referenced this pull request Jan 26, 2022
…-shuffle if there are enough buffers for better performance

Currently, for the result partition of sort-shuffle, there is extra record copy overhead introduced by clustering records by subpartition index. For small records, this overhead can cause a performance regression of up to 20%. This patch aims to solve that problem.

In fact, the hash-based implementation is a natural way to achieve the goal of sorting records by partition index. However, it has some serious weaknesses. For example, when there are not enough buffers or the data is skewed, it can waste buffers and hurt compression efficiency, which can cause performance regression.

This patch solves the issue by dynamically switching between the two implementations: if there are enough buffers, the hash-based implementation is used; otherwise, the sort-based implementation is used.

This closes apache#18505.
@gaoyunhaii gaoyunhaii (Contributor) left a comment

Many thanks @wsry for the PR! I have left some comments.

In the long run, I tend to think we should move the differing implementations into the implementation classes, like the strategy for splitting write buffers and sort buffers and how they write the buffers to the files. Perhaps we could create a new issue for the future refactoring?

Also, if possible, I tend to think we should rename the classes to DataBuffer, HashBasedDataBuffer and SortBasedDataBuffer, and also the variables, to avoid reusing the word sort.

private void writeLargeRecord(
        ByteBuffer record, int targetSubpartition, DataType dataType, boolean isBroadcast)
        throws IOException {
    checkState(numBuffersForWrite > 0, "No buffers available for writing.");
Contributor

Would this cause problems if there are large records when using the hash-based implementation? Might we keep at least one buffer for writing?

Contributor Author

For the hash-based implementation, a large record will be appended to the sort buffer; when the data buffer is full, the partial data of the record will be spilled as a data region, and the remaining data of the large record will be appended to the sort buffer again. That is to say, a large record can span multiple data regions.
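A toy sketch of that spilling loop (all names here are hypothetical; it only illustrates how one record ends up split across several data regions):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class LargeRecordAppendSketch {
    // Splits a record into region-sized chunks, "spilling" each chunk as one region.
    static List<byte[]> appendAcrossRegions(ByteBuffer record, int regionCapacityBytes) {
        List<byte[]> spilledRegions = new ArrayList<>();
        while (record.hasRemaining()) {
            int chunk = Math.min(regionCapacityBytes, record.remaining());
            byte[] region = new byte[chunk];
            record.get(region);         // copy the next chunk of the record
            spilledRegions.add(region); // spill it as one data region
        }
        return spilledRegions;
    }

    public static void main(String[] args) {
        ByteBuffer record = ByteBuffer.allocate(10_000);                // larger than one region
        System.out.println(appendAcrossRegions(record, 4_096).size()); // prints 3
    }
}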

if (!isFull) {
    ++numTotalRecords;
}
numTotalBytes += totalBytes - source.remaining();
Contributor

If the source takes 5 buffers and only 3 buffers are written, do we expect to write the remaining buffers in the next call? If so, might we add some comments in the method docs?

Contributor Author

Yes, I will add some comments to explain that.
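For illustration, the added comments could read roughly like this (a sketch only; the method name and exact contract wording are assumptions, not this PR's final javadoc):

interface DataBuffer {
    /**
     * Appends data from the given source buffer to the target subpartition.
     *
     * <p>If this buffer becomes full before the whole source is consumed, the
     * append stops early and {@code source} keeps its unconsumed remainder; the
     * caller is expected to spill/reset this buffer and then call this method
     * again with the same {@code source} to write the rest.
     *
     * @return true if this buffer is full and must be spilled before retrying.
     */
    boolean append(java.nio.ByteBuffer source, int targetSubpartition)
            throws java.io.IOException;
}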

boolean isReleased();

/** Resets this {@link SortBuffer} to be reused for data appending. */
void reset();
Contributor

We might move this method before finish().

Perhaps we could also add some description of the lifecycle of the SortBuffer in the class documentation? For example, describing the process of write, write, full, read, read, reset, ..., finish, release.
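Such a lifecycle description could be accompanied by a walk-through like the following (a toy sketch with made-up types; only the ordering of the calls reflects the lifecycle under discussion):

import java.util.ArrayDeque;
import java.util.Deque;

final class LifecycleSketch {
    /** Toy stand-in for a data buffer; real implementations hold memory segments. */
    static final class ToyDataBuffer {
        private final Deque<Integer> records = new ArrayDeque<>();
        private final int capacity;

        ToyDataBuffer(int capacity) { this.capacity = capacity; }

        /** Returns false when full; the caller must read and reset before retrying. */
        boolean append(int record) {
            if (records.size() == capacity) {
                return false;
            }
            records.add(record);
            return true;
        }

        Integer read() { return records.poll(); } // read buffered data out
        void reset() { records.clear(); }         // reuse for the next round of writes
        void finish() {}                          // no more appends after this
        void release() {}                         // free underlying resources
    }

    public static void main(String[] args) {
        ToyDataBuffer buffer = new ToyDataBuffer(2);
        for (int record = 0; record < 5; record++) {
            if (!buffer.append(record)) {        // write, write, full ...
                while (buffer.read() != null) {} // ... read, read ...
                buffer.reset();                  // ... reset, and write again
                buffer.append(record);
            }
        }
        buffer.finish();                         // finish once all data is appended
        while (buffer.read() != null) {}         // drain the remainder
        buffer.release();                        // finally release
    }
}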

* Number of reserved network buffers for data writing. This value can be 0 and 0 means that
* {@link HashBasedPartitionSortedBuffer} will be used.
*/
private int numBuffersForWrite;
Contributor

From the following modification, it seems the variable here mainly serves to indicate whether we want to use the hash-based or the sort-based implementation. I think perhaps we could directly use a variable like sortBufferType or useHashBuffer to make it clearer. numBuffersForWrite could then become a local variable in the constructor. We could change the implementation to be like:

useHashBuffer = numRequiredBuffer >= 2 * numSubpartitions;

if (!useHashBuffer) {
    int expectedWriteBuffers;
    if (networkBufferSize >= NUM_WRITE_BUFFER_BYTES) {
        expectedWriteBuffers = 1;
    } else {
        expectedWriteBuffers =
            Math.min(EXPECTED_WRITE_BATCH_SIZE, NUM_WRITE_BUFFER_BYTES / networkBufferSize);
    }

    int numBuffersForWrite = Math.min(numRequiredBuffer / 2, expectedWriteBuffers);
    numBuffersForSort = numRequiredBuffer - numBuffersForWrite;

    try {
        for (int i = 0; i < numBuffersForWrite; ++i) {
            MemorySegment segment = bufferPool.requestMemorySegmentBlocking();
            writeSegments.add(segment);
        }
    } catch (InterruptedException exception) {
        // the setup method does not allow InterruptedException
        throw new IOException(exception);
    }
}


fileWriter.writeBuffers(toWrite);
}
BufferWithChannel bufferWithChannel = sortBuffer.copyIntoSegment(segments.poll());
Contributor

It is also a bit weird that copyIntoSegment might be passed a null segment. In consideration of the deadline, I think we might rename the method to something like getNextBuffer(@Nullable MemorySegment transitBuffer) and add proper comments?
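A sketch of what the renamed method could look like (the javadoc wording is illustrative; MemorySegment and the nested BufferWithChannel type are the existing classes referenced in this thread, so the snippet assumes they are on the classpath):

import javax.annotation.Nullable;

import org.apache.flink.core.memory.MemorySegment;

interface DataBuffer {
    /**
     * Returns the next buffer of data to spill, or null if no data is available.
     * The sort-based implementation copies its data into the given transit
     * buffer, while the hash-based implementation already holds finished
     * buffers, so callers of that implementation may pass null here.
     */
    BufferWithChannel getNextBuffer(@Nullable MemorySegment transitBuffer);
}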

@wsry wsry (Contributor Author) commented Feb 9, 2022

Many thanks @wsry for the PR! I have left some comments.

In the long run, I tend to think we should move the differing implementations into the implementation classes, like the strategy for splitting write buffers and sort buffers and how they write the buffers to the files. Perhaps we could create a new issue for the future refactoring?

Also, if possible, I tend to think we should rename the classes to DataBuffer, HashBasedDataBuffer and SortBasedDataBuffer, and also the variables, to avoid reusing the word sort.

@gaoyunhaii Thanks for the review and comments. I agree that we can rename the sort buffer classes in this PR. As for the other refactoring, I will create a new ticket to do it later. I will update the PR soon.

wsry pushed a commit to wsry/flink that referenced this pull request Feb 10, 2022
…-shuffle if there are enough buffers for better performance
@wsry wsry (Contributor Author) commented Feb 10, 2022

@gaoyunhaii I have updated the PR.

@gaoyunhaii gaoyunhaii (Contributor) left a comment

LGTM % one small comment.

Also, @wsry, could you squash the commits and rebase onto the latest master to retrigger the CI pipeline? The architecture check issue should be fixed.


private static final int numThreads = 4;

private final boolean useHashBasedSortBuffer;
Contributor

nit: might we change this to useHashBasedDataBuffer?

wsry pushed a commit to wsry/flink that referenced this pull request Feb 11, 2022
…-shuffle if there are enough buffers for better performance
@wsry wsry (Contributor Author) commented Feb 11, 2022

@flinkbot run azure

@wsry wsry closed this in 3be35d9 Feb 12, 2022
MrWhiteSike pushed a commit to MrWhiteSike/flink that referenced this pull request Mar 3, 2022
…-shuffle if there are enough buffers for better performance
RexXiong pushed a commit to apache/celeborn that referenced this pull request Apr 9, 2024
### What changes were proposed in this pull request?

Refactor `SortBuffer` and `PartitionSortedBuffer` with introduction of `DataBuffer` and `SortBasedDataBuffer`.

### Why are the changes needed?

`SortBuffer` and `PartitionSortedBuffer` were refactored in apache/flink#18505. Celeborn's Flink integration should also refactor `SortBuffer` and `PartitionSortedBuffer` to sync with the interface changes in Flink. Meanwhile, `SortBuffer` and `PartitionSortedBuffer` should distinguish between channel and subpartition for apache/flink#23927.

### Does this PR introduce _any_ user-facing change?

- `SortBuffer` is renamed to `DataBuffer`.
- `PartitionSortedBuffer` is renamed to `SortBasedDataBuffer`.
- `SortBuffer.BufferWithChannel` is renamed to `BufferWithSubpartition`.

### How was this patch tested?

UT and IT.

Closes #2448 from SteNicholas/CELEBORN-1374.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>