[FLINK-25796][network] Avoid record copy for result partition of sort-shuffle if there are enough buffers for better performance #18505
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community with automated checks. Last check on commit 7937706 (Tue Jan 25 13:09:59 UTC 2022). Mention the bot in a comment to re-run the automated checks.
Please see the Pull Request Review Guide for a full explanation of the review process. The bot is tracking the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
…-shuffle if there are enough buffers for better performance. Currently, for the result partition of sort-shuffle, there is extra record copy overhead introduced by clustering records by subpartition index. For small records, this overhead can cause up to a 20% performance regression. This patch aims to solve the problem. In fact, the hash-based implementation is a natural way to achieve the goal of sorting records by partition index. However, it has some serious weaknesses: for example, when there are not enough buffers or there is data skew, it can waste buffers and hurt compression efficiency, which can cause performance regression. This patch solves the issue by dynamically switching between the two implementations: if there are enough buffers, the hash-based implementation is used; if not, the sort-based implementation is used. This closes apache#18505.
...time/src/main/java/org/apache/flink/runtime/io/network/api/writer/ResultPartitionWriter.java
Many thanks @wsry for the PR! I have left some comments.
For the long run, I tend to think we should move the differing logic into the implementation classes, like the strategy to split write buffers and sort buffers, and how they write the buffers to the files. Perhaps we could create a new issue for the future refactoring?
Also, if possible, I tend to think we should rename the classes to DataBuffer, HashBasedDataBuffer and SortBasedDataBuffer, and also rename the variables to avoid reusing the word sort.
private void writeLargeRecord(
        ByteBuffer record, int targetSubpartition, DataType dataType, boolean isBroadcast)
        throws IOException {
    checkState(numBuffersForWrite > 0, "No buffers available for writing.");
Would this cause problems if there are large records when using the hash-based implementation? Might we keep at least one buffer for writing?
For the hash-based implementation, a large record will be appended to the sort buffer; when the data buffer is full, the partial data of the record will be spilled as a data region and the remaining data of the large record will be appended to the sort buffer again. That is to say, a large record can span multiple data regions.
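The spanning behavior described above could be sketched as follows. This is a minimal illustration, not Flink's actual code: the class and method names (`LargeRecordSketch`, `writeLargeRecord`) and the tiny buffer capacity are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a large record is appended chunk by chunk; each time
// the in-memory buffer fills, its contents are spilled as one data region
// and appending resumes with the remaining bytes of the record.
public class LargeRecordSketch {
    static final int BUFFER_CAPACITY = 4; // bytes per buffer (tiny, for illustration)

    static List<byte[]> writeLargeRecord(ByteBuffer record) {
        List<byte[]> spilledRegions = new ArrayList<>();
        while (record.hasRemaining()) {
            int chunk = Math.min(BUFFER_CAPACITY, record.remaining());
            byte[] region = new byte[chunk];
            record.get(region);
            spilledRegions.add(region); // spill the (partial) record as a data region
        }
        return spilledRegions;
    }

    public static void main(String[] args) {
        // A 10-byte record with 4-byte buffers spans 3 data regions (4 + 4 + 2).
        List<byte[]> regions = writeLargeRecord(ByteBuffer.wrap(new byte[10]));
        System.out.println(regions.size());
    }
}
```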
if (!isFull) {
    ++numTotalRecords;
}
numTotalBytes += totalBytes - source.remaining();
If the source takes 5 buffers and 3 buffers are written, do we expect to write the remaining buffers in the next call? If so, might we add some comments in the method docs?
Yes, I will add some comments to explain that.
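The partial-write semantics discussed here could be sketched like this. The names (`PartialAppendSketch`, `append`, `flush`) and the return convention are illustrative assumptions, not Flink's actual API.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: append() copies only as many bytes as fit; when the
// buffer becomes full it returns true, the caller flushes, and then retries
// with the remaining bytes of the same source record.
public class PartialAppendSketch {
    private final ByteBuffer buffer;

    PartialAppendSketch(int capacity) {
        buffer = ByteBuffer.allocate(capacity);
    }

    // Returns true if the buffer became full and must be flushed before the
    // remaining bytes of `source` can be appended.
    boolean append(ByteBuffer source) {
        int toCopy = Math.min(buffer.remaining(), source.remaining());
        for (int i = 0; i < toCopy; i++) {
            buffer.put(source.get());
        }
        return !buffer.hasRemaining();
    }

    void flush() {
        buffer.clear(); // stand-in for spilling the contents and making room
    }

    public static void main(String[] args) {
        PartialAppendSketch sketch = new PartialAppendSketch(4);
        ByteBuffer record = ByteBuffer.wrap(new byte[10]);
        while (record.hasRemaining()) {
            if (sketch.append(record)) {
                sketch.flush();
            }
        }
        System.out.println(record.remaining()); // the whole record was eventually appended
    }
}
```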
boolean isReleased();

/** Resets this {@link SortBuffer} to be reused for data appending. */
void reset();
Might move this method before finish()? Perhaps we could also add some description of the lifecycle of the SortBuffer in the class document, like describing the process of write, write, full, read, read, reset, ..., finish, release.
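Such a lifecycle could be sketched as a small state machine. The state names and methods below are hypothetical, kept only to the operations mentioned in the discussion; the actual SortBuffer interface may differ.

```java
// Hypothetical lifecycle sketch: records are written until the buffer is
// finished, data is read out, the buffer may be reset for reuse, and
// release() frees it for good.
public class DataBufferLifecycle {
    enum State { WRITING, FINISHED, RELEASED }

    private State state = State.WRITING;

    void write() { // append a record; only legal while in WRITING state
        if (state != State.WRITING) {
            throw new IllegalStateException("buffer is not writable");
        }
    }

    void finish() { // no more appends; the buffer is ready for reading
        state = State.FINISHED;
    }

    void reset() { // reuse the buffer for another round of data appending
        if (state == State.RELEASED) {
            throw new IllegalStateException("buffer already released");
        }
        state = State.WRITING;
    }

    void release() { // free the underlying memory segments
        state = State.RELEASED;
    }

    boolean isReleased() {
        return state == State.RELEASED;
    }
}
```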
* Number of reserved network buffers for data writing. This value can be 0, and 0 means that
* {@link HashBasedPartitionSortedBuffer} will be used.
*/
private int numBuffersForWrite;
From the following modification, it seems this variable mainly serves to indicate whether we want to use the hash-based or the sort-based implementation. I think perhaps we could directly use a variable like sortBufferType or useHashBuffer to make it clearer. The numBuffersForWrite could be changed into a local variable in the constructor. We could change the implementation to be like:
useHashBuffer = numRequiredBuffer >= 2 * numSubpartitions;

if (!useHashBuffer) {
    int expectedWriteBuffers;
    if (networkBufferSize >= NUM_WRITE_BUFFER_BYTES) {
        expectedWriteBuffers = 1;
    } else {
        expectedWriteBuffers =
                Math.min(EXPECTED_WRITE_BATCH_SIZE, NUM_WRITE_BUFFER_BYTES / networkBufferSize);
    }
    int numBuffersForWrite = Math.min(numRequiredBuffer / 2, expectedWriteBuffers);
    numBuffersForSort = numRequiredBuffer - numBuffersForWrite;
    try {
        for (int i = 0; i < numBuffersForWrite; ++i) {
            MemorySegment segment = bufferPool.requestMemorySegmentBlocking();
            writeSegments.add(segment);
        }
    } catch (InterruptedException exception) {
        // the setup method does not allow InterruptedException
        throw new IOException(exception);
    }
}
    fileWriter.writeBuffers(toWrite);
}
BufferWithChannel bufferWithChannel = sortBuffer.copyIntoSegment(segments.poll());
It is also a bit weird that copyIntoSegment might be passed a null segment. In consideration of the deadline, I think we might rename the method to something like getNextBuffer(@Nullable MemorySegment transitBuffer) and add proper comments?
@gaoyunhaii Thanks for the review and comments. I agree that we can rename the sort buffer class in this PR. As for the other refactoring, I will create a new ticket to do it later. I will update the PR soon.
@gaoyunhaii I have updated the PR.
LGTM % one small comment.
Also @wsry, could you squash the commits and rebase to the latest master to retrigger the CI pipeline? The architecture check issue should be fixed.
private static final int numThreads = 4;

private final boolean useHashBasedSortBuffer;
nit: might change to useHashBasedDataBuffer?
@flinkbot run azure
### What changes were proposed in this pull request?
Refactor `SortBuffer` and `PartitionSortedBuffer` with the introduction of `DataBuffer` and `SortBasedDataBuffer`.

### Why are the changes needed?
`SortBuffer` and `PartitionSortedBuffer` were refactored in apache/flink#18505. Celeborn's Flink integration should also refactor `SortBuffer` and `PartitionSortedBuffer` to sync the interface changes in Flink. Meanwhile, `SortBuffer` and `PartitionSortedBuffer` should distinguish channel and subpartition for apache/flink#23927.

### Does this PR introduce _any_ user-facing change?
- `SortBuffer` is renamed to `DataBuffer`.
- `PartitionSortedBuffer` is renamed to `SortBasedDataBuffer`.
- `SortBuffer.BufferWithChannel` is renamed to `BufferWithSubpartition`.

### How was this patch tested?
UT and IT.

Closes #2448 from SteNicholas/CELEBORN-1374.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
What is the purpose of the change
Currently, for the result partition of sort-shuffle, there is extra record copy overhead introduced by clustering records by subpartition index. For small records, this overhead can cause up to a 20% performance regression. This ticket aims to solve the problem.
In fact, the hash-based implementation is a natural way to achieve the goal of sorting records by partition index. However, it has some serious weaknesses: for example, when there are not enough buffers or there is data skew, it can waste buffers and hurt compression efficiency, which can cause performance regression.
This ticket solves the issue by dynamically switching between the two implementations: if there are enough buffers, the hash-based implementation is used; if not, the sort-based implementation is used.
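The switching decision could be sketched as below. The threshold (at least two buffers per subpartition) follows the heuristic visible in the review discussion; the class and method names are hypothetical and the exact constants in Flink may differ.

```java
// Hypothetical sketch of the dynamic selection between the hash-based and
// sort-based data buffer implementations.
public class BufferChoice {
    // "Enough buffers" is assumed to mean at least two buffers per
    // subpartition; otherwise the hash-based layout could starve or skew.
    static boolean useHashBasedDataBuffer(int numRequiredBuffers, int numSubpartitions) {
        return numRequiredBuffers >= 2 * numSubpartitions;
    }

    public static void main(String[] args) {
        System.out.println(useHashBasedDataBuffer(200, 50)); // enough buffers: hash-based
        System.out.println(useHashBasedDataBuffer(80, 50));  // too few: fall back to sort-based
    }
}
```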
Brief change log
Verifying this change
This change added tests, and existing tests also help to verify the change.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation