
[#1472][part-2] fix(server): Reuse ByteBuf when decoding shuffle blocks instead of reallocating it #1521

Merged
merged 1 commit into apache:master on Feb 15, 2024

Conversation

rickyma
Contributor

@rickyma rickyma commented Feb 12, 2024

What changes were proposed in this pull request?

Reuse ByteBuf when decoding shuffle blocks instead of reallocating it

Why are the changes needed?

A sub PR for: #1519

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

@codecov-commenter

codecov-commenter commented Feb 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (576a925) 54.27% compared to head (f85291a) 55.15%.
Report is 9 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1521      +/-   ##
============================================
+ Coverage     54.27%   55.15%   +0.87%     
+ Complexity     2807     2806       -1     
============================================
  Files           427      410      -17     
  Lines         24349    22048    -2301     
  Branches       2077     2082       +5     
============================================
- Hits          13215    12160    -1055     
+ Misses        10305     9129    -1176     
+ Partials        829      759      -70     



github-actions bot commented Feb 12, 2024

Test Results

2 287 files  ±0  2 287 suites  ±0   4h 30m 33s ⏱️ + 1m 33s
  819 tests ±0    818 ✅ ±0   1 💤 ±0  0 ❌ ±0 
9 086 runs  ±0  9 073 ✅ ±0  13 💤 ±0  0 ❌ ±0 

Results for commit f85291a. ± Comparison against base commit d87dc90.

♻️ This comment has been updated with latest results.

@@ -47,8 +46,7 @@ public static ShuffleBlockInfo decodeShuffleBlockInfo(ByteBuf byteBuf) {
     long crc = byteBuf.readLong();
     long taskAttemptId = byteBuf.readLong();
     int dataLength = byteBuf.readInt();
-    ByteBuf data = NettyUtils.getNettyBufferAllocator().directBuffer(dataLength);
-    data.writeBytes(byteBuf, dataLength);
+    ByteBuf data = byteBuf.retain().readSlice(dataLength);
Contributor

Will byteBuf be split into multiple parts? Will every part be released multiple times? Will it cause errors?

Contributor Author

@rickyma rickyma Feb 13, 2024

The ByteBuf will not be split into multiple parts. It will be used by a SendShuffleDataRequest as a whole.
It will not cause errors, because we retain the ByteBuf (refCnt++) every time we do a readSlice:

public static SendShuffleDataRequest decode(ByteBuf byteBuf) {
    long requestId = byteBuf.readLong();
    String appId = ByteBufUtils.readLengthAndString(byteBuf);
    int shuffleId = byteBuf.readInt();
    long requireId = byteBuf.readLong();
    Map<Integer, List<ShuffleBlockInfo>> partitionToBlocks = decodePartitionData(byteBuf);
    long timestamp = byteBuf.readLong();
    return new SendShuffleDataRequest(
        requestId, appId, shuffleId, requireId, partitionToBlocks, timestamp);
  }

But it might slow down the flushing process, because the underlying ByteBuf will not actually be freed until all the ShufflePartitionedData has been flushed (refCnt decreased to 0):

List<ShufflePartitionedData> shufflePartitionedData = toPartitionedData(sendShuffleDataRequest);
...
for (ShufflePartitionedData spd : shufflePartitionedData) {
    ...
    ret = manager.cacheShuffleData(appId, shuffleId, isPreAllocated, spd);
    ...
}

The ByteBuf cannot be split; once it is split, we have to allocate new ByteBufs.
So maybe we should hold this PR and find a better way to do this.
On the other hand, it will speed up the decoding process.
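
As a side note (a minimal sketch, not code from the Uniffle repository), this is how Netty's derived-buffer reference counting behaves with the retain().readSlice() pattern; it illustrates why the whole frame stays resident until every decoded block has been released:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class RetainSliceDemo {
  public static void main(String[] args) {
    // Stand-in for an incoming frame holding several block payloads back to back.
    ByteBuf frame = Unpooled.buffer(12);
    frame.writeInt(1).writeInt(2).writeInt(3);
    System.out.println(frame.refCnt()); // 1, held by the frame decoder

    // Decode two "blocks" the way the patched decoder does: retain + readSlice (zero copy).
    ByteBuf block1 = frame.retain().readSlice(4); // frame refCnt -> 2
    ByteBuf block2 = frame.retain().readSlice(4); // frame refCnt -> 3

    // A derived slice has no counter of its own; releasing it decrements the parent.
    block1.release(); // frame refCnt -> 2
    block2.release(); // frame refCnt -> 1

    // Only the frame decoder's own release frees the underlying memory,
    // so the frame cannot be freed before all blocks have been flushed and released.
    frame.release();
    System.out.println(frame.refCnt()); // 0
  }
}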

@rickyma rickyma closed this Feb 13, 2024
@rickyma rickyma reopened this Feb 13, 2024
@rickyma
Contributor Author

rickyma commented Feb 13, 2024

I reopened this PR.

When stress testing the shuffle server without this PR, we easily encounter OutOfDirectMemoryError, which means this PR is necessary.

[epollEventLoopGroup-3-45] [WARN] TransportChannelHandler.exceptionCaught - Exception in connection from /127.0.0.1:58767
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 4194304 byte(s) of direct memory (used: 161061273600, max: 161061273600)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:843)
at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:772)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:710)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:685)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:212)
at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:194)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:136)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:397)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
at org.apache.uniffle.common.netty.protocol.Decoders.decodeShuffleBlockInfo(Decoders.java:50)
at org.apache.uniffle.common.netty.protocol.SendShuffleDataRequest.decodePartitionData(SendShuffleDataRequest.java:95)
at org.apache.uniffle.common.netty.protocol.SendShuffleDataRequest.decode(SendShuffleDataRequest.java:107)
at org.apache.uniffle.common.netty.protocol.Message.decode(Message.java:145)
at org.apache.uniffle.common.netty.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:77)

We can see that each time an out-of-direct-memory error occurs, it is caused by the code org.apache.uniffle.common.netty.protocol.Decoders.decodeShuffleBlockInfo(Decoders.java:50), which is ByteBuf data = NettyUtils.getNettyBufferAllocator().directBuffer(dataLength). This is the most direct trigger for insufficient direct memory.

When a large number of requests arrive simultaneously, there can be a brief period (before the TransportFrameDecoder has a chance to release the ByteBuf) during which the shuffle server holds two copies of each block's data. This means that, for a very short time, direct memory usage is doubled, which is very hard to control.
That is why it is so easy to hit an out-of-direct-memory error without this PR.

So, we need this PR anyway. It might slow down the flushing process a little bit, but the shuffle server will at least remain available during the whole stress test.

From the results of my stress tests, there doesn't seem to be any impact on performance. In fact, it may even be faster, since it speeds up the decoding process by not allocating new ByteBufs.
There have been no anomalies or performance issues caused by the slower flushing. Eventually, all these buffers are flushed, and all ByteBufs are successfully released, with no memory leaks.
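
Purely for illustration (a sketch with a hypothetical 4 MiB payload, not numbers from the stress test), this shows why the copy-based decode roughly doubles peak direct memory per in-flight request, while a retained slice adds none. It assumes a Netty 4.1 allocator whose metric() exposes usedDirectMemory():

import io.netty.buffer.ByteBuf;
import io.netty.buffer.UnpooledByteBufAllocator;

public class PeakDirectMemoryDemo {
  public static void main(String[] args) {
    UnpooledByteBufAllocator alloc = new UnpooledByteBufAllocator(true);
    int dataLength = 4 * 1024 * 1024; // hypothetical block payload size

    // The frame the decoder reads from is already resident in direct memory.
    ByteBuf frame = alloc.directBuffer(dataLength);
    frame.writeZero(dataLength);

    // Old decode path: copy the payload into a freshly allocated direct buffer.
    // Until the frame decoder releases the frame, both buffers are alive,
    // so peak usage is roughly 2 * dataLength.
    ByteBuf copied = alloc.directBuffer(dataLength);
    copied.writeBytes(frame, frame.readerIndex(), dataLength);
    System.out.println("peak with copy:  " + alloc.metric().usedDirectMemory());
    copied.release();

    // New decode path: a retained slice shares the frame's memory and adds nothing.
    ByteBuf sliced = frame.retain().readSlice(dataLength);
    System.out.println("peak with slice: " + alloc.metric().usedDirectMemory());
    sliced.release(); // decrements the frame's refCnt
    frame.release();  // frees the underlying direct memory
  }
}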

PTAL @jerqi

@jerqi
Contributor

jerqi commented Feb 14, 2024


Maybe we should also modify our flush strategy. Right now we flush the larger reduce partitions, but if the map partition contains smaller reduce partitions, that memory won't be released either.

@jerqi
Contributor

jerqi commented Feb 14, 2024

I prefer adding a config option for this improvement.

@rickyma
Contributor Author

rickyma commented Feb 14, 2024


Maybe we should also modify our flush strategy. Right now we flush the larger reduce partitions, but if the map partition contains smaller reduce partitions, that memory won't be released either.

Flushing strategy will be changed in the final PR.

@rickyma
Contributor Author

rickyma commented Feb 14, 2024

I prefer adding a config option for this improvement.

This is not an improvement; it actually fixes a bug, because the shuffle server won't remain available during stress testing without it.
So I think it's a must rather than an improvement. We must avoid the double allocation of ByteBufs at all costs.

@rickyma rickyma requested a review from jerqi February 14, 2024 08:48
Contributor

@jerqi jerqi left a comment

LGTM, thanks @rickyma

@jerqi jerqi merged commit 7fbe7c9 into apache:master Feb 15, 2024
75 checks passed
zuston pushed a commit that referenced this pull request Feb 23, 2024
…mory issue causing OOM (#1534)

### What changes were proposed in this pull request?

When we use `UnpooledByteBufAllocator` to allocate off-heap `ByteBuf`, Netty directly requests off-heap memory from the operating system instead of allocating it according to `pageSize` and `chunkSize`. This way, we can obtain the exact `ByteBuf` size during the pre-allocation of memory, avoiding distortion of metrics such as `usedMemory`. 

Moreover, we have restored the code changes of PR [#1521](#1521). We ensure that there is sufficient direct memory for the Netty server when decoding `sendShuffleDataRequest` by taking the `encodedLength` of the `ByteBuf` into account in advance during memory pre-allocation, thus avoiding OOM while decoding `sendShuffleDataRequest`.

Since we are not using `PooledByteBufAllocator`, the PR [#1524](#1524) is no longer needed.

### Why are the changes needed?

A sub PR for: #1519

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.
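
As context for the allocator change described above (a hedged sketch, not code from the Uniffle repository; chunk sizes and exact numbers depend on the Netty version and configuration), the pooled allocator's usedDirectMemory() reflects whole pre-allocated chunks, while the unpooled allocator's reflects roughly what was actually requested:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.UnpooledByteBufAllocator;

public class AllocatorMetricDemo {
  public static void main(String[] args) {
    PooledByteBufAllocator pooled = new PooledByteBufAllocator(true);
    UnpooledByteBufAllocator unpooled = new UnpooledByteBufAllocator(true);

    // Request the same small buffer from both allocators.
    ByteBuf p = pooled.directBuffer(100);
    ByteBuf u = unpooled.directBuffer(100);

    // The pooled allocator reports at least one full chunk (megabytes),
    // so "used" memory no longer matches the bytes actually requested.
    System.out.println("pooled used:   " + pooled.metric().usedDirectMemory());
    // The unpooled allocator reports roughly the requested size.
    System.out.println("unpooled used: " + unpooled.metric().usedDirectMemory());

    p.release();
    u.release();
  }
}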
@rickyma rickyma deleted the issue-1472-part-2 branch May 5, 2024 08:34