[#1472][part-5] Inaccurate flow control leads to Shuffle server OOM when enabling Netty #1531

rickyma · 2024-02-15T21:02:32Z

What changes were proposed in this pull request?

When Netty is enabled on the shuffle server, the checks that gate memory pre-allocation and buffer flushing should use the direct memory that is actually in use (pinnedDirectMemory in PooledByteBufAllocator) instead of the previous usedMemory and capacity values, which are inaccurate in this mode (see #1472).

The capacity variable is initialized from the maximum direct memory.
The usedMemory variable is set from pinnedDirectMemory.
usedMemory is updated periodically by NettyDirectMemoryTracker.

Default values of rss.server.netty.directMemoryTracker.memoryUsage.updateMetricsIntervalMs and rss.server.netty.directMemoryTracker.memoryUsage.initialFetchDelayMs configurations are decreased to 1s.
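The periodic update described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PinnedMemorySampler` and its methods are hypothetical names, and a plain `LongSupplier` stands in for the allocator's pinned direct memory reading.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for a tracker that copies an allocator metric into usedMemory. */
final class PinnedMemorySampler {
  // Assumption: supplies the allocator's pinned direct memory in bytes.
  private final LongSupplier pinnedDirectMemory;
  private final AtomicLong usedMemory = new AtomicLong();

  PinnedMemorySampler(LongSupplier pinnedDirectMemory) {
    this.pinnedDirectMemory = pinnedDirectMemory;
  }

  /** Copies the latest reading into usedMemory; the value is stale between samples. */
  void sample() {
    usedMemory.set(pinnedDirectMemory.getAsLong());
  }

  long getUsedMemory() {
    return usedMemory.get();
  }

  /** Runs sample() on a single scheduler thread; the caller shuts the scheduler down. */
  ScheduledExecutorService start(long initialDelayMs, long intervalMs) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(this::sample, initialDelayMs, intervalMs, TimeUnit.MILLISECONDS);
    return scheduler;
  }
}
```

With the 1s defaults mentioned above, a caller would use something like `sampler.start(1000, 1000)`. Readers of `getUsedMemory()` see a value that can be up to one interval old.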

Why are the changes needed?

A sub PR for: #1519

Does this PR introduce any user-facing change?

No.

How was this patch tested?

1. Modified existing UTs.
2. Fix #1008. It does not actually test GRPC_NETTY mode, because it uses ShuffleServerGrpcClient everywhere instead of ShuffleServerGrpcNettyClient.

github-actions · 2024-02-15T21:29:34Z

Test Results

2 289 files (-140) | 2 289 suites (-140) | 4h 33m 25s ⏱️ (-7m 6s)
816 tests (-3) | 815 ✅ (-3) | 1 💤 (±0) | 0 ❌ (±0)
9 621 runs (-92) | 9 607 ✅ (-92) | 14 💤 (±0) | 0 ❌ (±0)

Results for commit 7cdccde. ± Comparison against base commit b924aca.

This pull request removes 29 and adds 26 tests. Note that renamed tests count towards both.

org.apache.uniffle.server.TopNShuffleDataSizeOfAppCalcTaskTest ‑ testTopNShuffleDataSizeOfAppCalcTask
org.apache.uniffle.test.DiskErrorToleranceTest ‑ diskErrorTest
org.apache.uniffle.test.HybridStorageHadoopFallbackTest ‑ fallbackTest
org.apache.uniffle.test.HybridStorageLocalFileFallbackTest ‑ fallbackTest
org.apache.uniffle.test.ShuffleServerConcurrentWriteOfHadoopTest ‑ hadoopWriteReadTest
org.apache.uniffle.test.ShuffleServerConcurrentWriteOfHadoopTest ‑ testConcurrentWrite2Hadoop{int, int}[1]
org.apache.uniffle.test.ShuffleServerConcurrentWriteOfHadoopTest ‑ testConcurrentWrite2Hadoop{int, int}[2]
org.apache.uniffle.test.ShuffleServerFaultToleranceTest ‑ testReadFaultTolerance
org.apache.uniffle.test.ShuffleServerGrpcTest ‑ sendDataAndRequireBufferTest
org.apache.uniffle.test.ShuffleServerGrpcTest ‑ sendDataWithoutRegisterTest
…

org.apache.uniffle.test.ShuffleServerConcurrentWriteOfHadoopTest ‑ testConcurrentWrite2Hadoop{int, int, boolean}[1]
org.apache.uniffle.test.ShuffleServerConcurrentWriteOfHadoopTest ‑ testConcurrentWrite2Hadoop{int, int, boolean}[2]
org.apache.uniffle.test.ShuffleServerConcurrentWriteOfHadoopTest ‑ testConcurrentWrite2Hadoop{int, int, boolean}[3]
org.apache.uniffle.test.ShuffleServerConcurrentWriteOfHadoopTest ‑ testConcurrentWrite2Hadoop{int, int, boolean}[4]
org.apache.uniffle.test.ShuffleServerWithMemLocalHadoopTest ‑ memoryLocalFileHadoopReadWithFilterTest{boolean, boolean}[1]
org.apache.uniffle.test.ShuffleServerWithMemLocalHadoopTest ‑ memoryLocalFileHadoopReadWithFilterTest{boolean, boolean}[2]
org.apache.uniffle.test.SparkClientWithLocalTest ‑ readTest10{boolean}[1]
org.apache.uniffle.test.SparkClientWithLocalTest ‑ readTest10{boolean}[2]
org.apache.uniffle.test.SparkClientWithLocalTest ‑ readTest1{boolean}[1]
org.apache.uniffle.test.SparkClientWithLocalTest ‑ readTest1{boolean}[2]
…

♻️ This comment has been updated with latest results.

codecov-commenter · 2024-02-15T21:51:19Z

Codecov Report

Attention: 140 lines in your changes are missing coverage. Please review.

Comparison is base (7fbe7c9) 54.15% compared to head (7cdccde) 54.40%.
Report is 4 commits behind head on master.

Files	Patch %	Lines
...iffle/server/buffer/NettyShuffleBufferManager.java	18.27%	74 Missing and 2 partials ⚠️
...ava/org/apache/uniffle/common/util/NettyUtils.java	6.66%	14 Missing ⚠️
...le/server/buffer/AbstractShuffleBufferManager.java	75.00%	8 Missing and 5 partials ⚠️
...ache/uniffle/server/buffer/NettyShuffleBuffer.java	0.00%	13 Missing ⚠️
...a/org/apache/uniffle/common/ShuffleServerInfo.java	0.00%	6 Missing ⚠️
...niffle/server/netty/ShuffleServerNettyHandler.java	0.00%	5 Missing ⚠️
.../org/apache/uniffle/server/ShuffleTaskManager.java	33.33%	3 Missing and 1 partial ⚠️
...niffle/server/buffer/GrpcShuffleBufferManager.java	95.74%	1 Missing and 3 partials ⚠️
...e/uniffle/server/buffer/AbstractShuffleBuffer.java	57.14%	3 Missing ⚠️
...g/apache/uniffle/server/ShuffleDataFlushEvent.java	80.00%	1 Missing ⚠️
... and 1 more

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #1531      +/-   ##
============================================
+ Coverage     54.15%   54.40%   +0.24%     
- Complexity     2803     2808       +5     
============================================
  Files           430      415      -15     
  Lines         24417    22259    -2158     
  Branches       2081     2112      +31     
============================================
- Hits          13224    12110    -1114     
+ Misses        10361     9389     -972     
+ Partials        832      760      -72


rickyma · 2024-02-15T21:53:44Z

PTAL @jerqi.

The main modifications are focused in the following files:
ShuffleBuffer.java
ShuffleBufferManager.java (Mostly in this file)
ShuffleServerNettyHandler.java
NettyDirectMemoryTracker.java
ShuffleDataFlushEvent.java
ShuffleServer.java
ShuffleTaskManager.java.

Other modifications are mainly in test files.

jerqi · 2024-02-16T13:29:11Z

cc @zuston

jerqi · 2024-02-16T13:32:38Z

server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java

      flushBuffer(buffer, appId, shuffleId, startPartition, endPartition, isHugePartition);
      return;
    }
  }

  public void flushIfNecessary() {
    // if data size in buffer > highWaterMark, do the flush
-    if (usedMemory.get() - preAllocatedSize.get() - inFlushSize.get() > highWaterMark) {


Could you extract a method to make the logic clearer?

Could you extract a method to make the logic clearer?

I think the code is already clear enough, so I'm not sure a separate method is needed. Extracting one might even make the logic less clear.

if (nettyServerEnabled) {
  needFlush = pinnedDirectMemory > highWaterMark;
} else {
  needFlush = usedMemory.get() - preAllocatedSize.get() - inFlushSize.get() > highWaterMark;
}

The pseudocode for needFlush is as follows:

needFlush = current shuffle server's actual used buffer > highWaterMark;

In Netty mode we allocate buffers with PooledByteBufAllocator, so pinnedDirectMemory can basically be regarded as the shuffle server's actual used buffer.

In Netty mode, the shuffle server's actual used buffer is pinnedDirectMemory.
In gRPC mode, the shuffle server's actual used buffer is usedMemory.get() - preAllocatedSize.get() - inFlushSize.get().
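For reference, the two conditions above could be pulled into a small helper along these lines. This is only a sketch: `FlushCheck` and its method names are hypothetical, and plain `long` parameters replace the atomic counters used in ShuffleBufferManager.

```java
/** Hypothetical helper illustrating the two flush conditions; not part of this PR. */
final class FlushCheck {
  private FlushCheck() {}

  /** Netty mode: the allocator's pinned direct memory is treated as the used buffer. */
  static boolean needFlushNetty(long pinnedDirectMemory, long highWaterMark) {
    return pinnedDirectMemory > highWaterMark;
  }

  /** gRPC mode: used buffer = usedMemory - preAllocatedSize - inFlushSize. */
  static boolean needFlushGrpc(
      long usedMemory, long preAllocatedSize, long inFlushSize, long highWaterMark) {
    return usedMemory - preAllocatedSize - inFlushSize > highWaterMark;
  }

  /** Dispatches on the mode, mirroring the if/else shown above. */
  static boolean needFlush(
      boolean nettyServerEnabled,
      long pinnedDirectMemory,
      long usedMemory,
      long preAllocatedSize,
      long inFlushSize,
      long highWaterMark) {
    return nettyServerEnabled
        ? needFlushNetty(pinnedDirectMemory, highWaterMark)
        : needFlushGrpc(usedMemory, preAllocatedSize, inFlushSize, highWaterMark);
  }
}
```

Both branches compare the same quantity, the server's actual used buffer, against highWaterMark. Only the way that quantity is obtained differs between the modes.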

jerqi · 2024-02-16T13:36:09Z

Do we need to modify the logic of the pickFlushedShuffle method? You can refer to the comment in [#1472][part-2] fix(server): Reuse ByteBuf when decoding shuffle blocks instead of reallocating it #1521 (comment).
Could you extract some methods to make the logic clearer?

jerqi · 2024-02-16T16:17:33Z

server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBuffer.java

@@ -47,6 +47,8 @@ public class ShuffleBuffer {

  private final long capacity;
  private long size;
+  // for Netty mode
+  private long estimatedSize;


Do we need estimatedSize? Could we reuse size?

We need it because pinnedDirectMemory, the accurate real-time measure of used direct memory, now decides whether to pre-allocate or flush. If usedMemory were still calculated from size, it would drift further and further from pinnedDirectMemory over time. That would make pickFlushedShuffle and the coordinator's shuffle server assignment inaccurate, since both still use usedMemory as the basis for their decisions.

We also cannot reuse size, because the real file size is still used in places such as:
LocalStorageManager.updateWriteMetrics
HadoopStorageManager.updateWriteMetrics
ShuffleTaskInfo.addOnLocalFileDataSize
ShuffleTaskInfo.addOnHadoopDataSize
StorageWriteMetrics.eventSize

So size has to keep its original meaning.
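The split between the two counters can be illustrated with a small sketch. Everything here is an assumption for illustration: `DualSizeBuffer` is a hypothetical class, and the estimator passed in is a placeholder, not the estimation logic used in this PR.

```java
import java.util.function.LongUnaryOperator;

/** Hypothetical buffer that tracks real data size and estimated allocation separately. */
final class DualSizeBuffer {
  // Assumption: maps a requested length to the memory the allocator is expected to reserve.
  private final LongUnaryOperator allocationEstimator;
  private long size; // real data length, used for storage and write metrics
  private long estimatedSize; // estimated allocation, used for memory accounting

  DualSizeBuffer(LongUnaryOperator allocationEstimator) {
    this.allocationEstimator = allocationEstimator;
  }

  /** Records one appended block in both counters. */
  void append(long blockLength) {
    size += blockLength;
    estimatedSize += allocationEstimator.applyAsLong(blockLength);
  }

  long getSize() {
    return size;
  }

  long getEstimatedSize() {
    return estimatedSize;
  }
}
```

For example, with a placeholder estimator that rounds up to a multiple of 64 bytes, appending blocks of 100 and 10 bytes gives a size of 110 and an estimatedSize of 192. The gap between the two grows with every append, which is why one field cannot serve both purposes.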

rickyma · 2024-02-16T16:26:03Z

Do we need to modify the logic of the pickFlushedShuffle method? You can refer to the comment in [#1472][part-2] fix(server): Reuse ByteBuf when decoding shuffle blocks instead of reallocating it #1521 (comment).

Could you extract some methods to make the logic clearer?

We reuse highWaterMark and lowWaterMark:

this.capacity = conf.getSizeAsBytes(ShuffleServerConf.SERVER_BUFFER_CAPACITY);
if (this.capacity < 0) {
  this.capacity =
      nettyServerEnabled
          ? (long)
              (NettyUtils.getMaxDirectMemory()
                  * conf.getDouble(ShuffleServerConf.SERVER_BUFFER_CAPACITY_RATIO))
          : (long) (heapSize * conf.getDouble(ShuffleServerConf.SERVER_BUFFER_CAPACITY_RATIO));
}
this.highWaterMark =
    (long)
        (capacity
            / 100.0
            * conf.get(ShuffleServerConf.SERVER_MEMORY_SHUFFLE_HIGHWATERMARK_PERCENTAGE));
this.lowWaterMark =
    (long)
        (capacity
            / 100.0
            * conf.get(ShuffleServerConf.SERVER_MEMORY_SHUFFLE_LOWWATERMARK_PERCENTAGE));

So we don't need to modify the logic of pickFlushedShuffle.
In Netty mode, pickedFlushSize in pickFlushedShuffle is based on estimatedSize, because it comes from shuffleSizeMap, which is updated in cacheShuffleData -> updateShuffleSize.
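As a worked example of the arithmetic above, the sketch below computes capacity and both watermarks from plain numbers. `WaterMarkSketch` and its parameter names are hypothetical. `maxMemory` stands for either the maximum direct memory (Netty mode) or the heap size (gRPC mode).

```java
/** Hypothetical value holder reproducing the capacity and watermark arithmetic. */
final class WaterMarkSketch {
  final long capacity;
  final long highWaterMark;
  final long lowWaterMark;

  private WaterMarkSketch(long capacity, long highWaterMark, long lowWaterMark) {
    this.capacity = capacity;
    this.highWaterMark = highWaterMark;
    this.lowWaterMark = lowWaterMark;
  }

  /**
   * A negative configuredCapacity means "derive from maxMemory * capacityRatio".
   * The percentages are given on a 0..100 scale.
   */
  static WaterMarkSketch of(
      long configuredCapacity,
      long maxMemory,
      double capacityRatio,
      double highPercentage,
      double lowPercentage) {
    long capacity =
        configuredCapacity < 0 ? (long) (maxMemory * capacityRatio) : configuredCapacity;
    long high = (long) (capacity / 100.0 * highPercentage);
    long low = (long) (capacity / 100.0 * lowPercentage);
    return new WaterMarkSketch(capacity, high, low);
  }
}
```

With no explicit capacity, a maximum of 2000 bytes, a ratio of 0.5, and percentages of 75 and 25, this gives a capacity of 1000, a highWaterMark of 750, and a lowWaterMark of 250. Because both watermarks derive from capacity, switching the source of capacity to direct memory moves them without touching pickFlushedShuffle.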

jerqi · 2024-02-17T02:36:20Z

Could you provide a common abstraction for Netty mode and non-Netty mode, with each mode implementing its own specific behaviour? Maybe we need some interfaces.

rickyma · 2024-02-17T18:23:01Z

Could you provide a common abstraction for Netty mode and non-Netty mode, with each mode implementing its own specific behaviour? Maybe we need some interfaces.

The abstraction is provided as follows:

AbstractShuffleBuffer
├── GrpcShuffleBuffer
└── NettyShuffleBuffer

AbstractShuffleBufferManager
├── GrpcShuffleBufferManager
└── NettyShuffleBufferManager

ShuffleBufferManagerFactory
└── createShuffleBufferManager()
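The shape of this hierarchy can be sketched as below. The class names carry a `Sketch` suffix to make clear they are hypothetical and much simpler than the classes in the PR: the base class owns the flush decision, each mode supplies its own notion of used memory, and a factory method picks the implementation.

```java
import java.util.function.LongSupplier;

/** Hypothetical base class: shared flush decision, mode-specific memory reading. */
abstract class BufferManagerSketch {
  protected final long highWaterMark;

  BufferManagerSketch(long highWaterMark) {
    this.highWaterMark = highWaterMark;
  }

  /** Each mode reports what it considers the server's actual used buffer. */
  abstract long actualUsedMemory();

  final boolean needFlush() {
    return actualUsedMemory() > highWaterMark;
  }

  /** Factory method choosing the implementation from the server mode. */
  static BufferManagerSketch create(
      boolean nettyServerEnabled, long highWaterMark, LongSupplier pinnedDirectMemory) {
    return nettyServerEnabled
        ? new NettyBufferManagerSketch(highWaterMark, pinnedDirectMemory)
        : new GrpcBufferManagerSketch(highWaterMark);
  }
}

/** gRPC mode: used buffer is derived from the bookkeeping counters. */
final class GrpcBufferManagerSketch extends BufferManagerSketch {
  private long usedMemory;
  private long preAllocatedSize;
  private long inFlushSize;

  GrpcBufferManagerSketch(long highWaterMark) {
    super(highWaterMark);
  }

  void setCounters(long usedMemory, long preAllocatedSize, long inFlushSize) {
    this.usedMemory = usedMemory;
    this.preAllocatedSize = preAllocatedSize;
    this.inFlushSize = inFlushSize;
  }

  @Override
  long actualUsedMemory() {
    return usedMemory - preAllocatedSize - inFlushSize;
  }
}

/** Netty mode: used buffer is read from an allocator metric (a supplier stands in here). */
final class NettyBufferManagerSketch extends BufferManagerSketch {
  private final LongSupplier pinnedDirectMemory;

  NettyBufferManagerSketch(long highWaterMark, LongSupplier pinnedDirectMemory) {
    super(highWaterMark);
    this.pinnedDirectMemory = pinnedDirectMemory;
  }

  @Override
  long actualUsedMemory() {
    return pinnedDirectMemory.getAsLong();
  }
}
```

Callers only see `BufferManagerSketch`, so the `if (nettyServerEnabled)` branches move out of the flush path and into the factory.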

@jerqi

zuston · 2024-02-18T07:25:17Z

I'm still evaluating the effectiveness and rationale of this PR. Do you have similar experience with Netty? @EnricoMi

Details can be found in #1519

rickyma · 2024-02-18T08:34:49Z

I'm still evaluating the effectiveness and rationale of this PR. Do you have similar experience with Netty? @EnricoMi

Details can be found in #1519

I tested this in our test environment and it solved the issue. Before this PR, it failed very quickly.

XuQianJin-Stars · 2024-02-18T09:15:03Z

Hi @rickyma, the core code looks good overall. More UTs are needed for Netty memory to account for memory growth.

zuston

I understand your motivation, but the change is not reasonable; it looks like a hack.

zuston · 2024-02-18T11:50:39Z

common/src/main/java/org/apache/uniffle/common/util/NettyUtils.java

+   * @param requestedSize The requested size of the direct memory.
+   * @return The estimated allocated direct memory size.
+   */
+  public static int calculateEstimatedMemoryAllocationSize(int requestedSize) {


It's really weird

zuston · 2024-02-18T11:54:06Z

server/src/main/java/org/apache/uniffle/server/NettyDirectMemoryTracker.java

@@ -68,6 +74,9 @@ public void start() {
            ShuffleServerMetrics.gaugeUsedDirectMemorySize.set(usedDirectMemory);
            ShuffleServerMetrics.gaugeAllocatedDirectMemorySize.set(allocatedDirectMemory);
            ShuffleServerMetrics.gaugePinnedDirectMemorySize.set(pinnedDirectMemory);
+            if (nettyServerEnabled) {
+              shuffleBufferManager.setUsedMemory(pinnedDirectMemory);


Emm... Using a scheduled thread to update usedMemory is not a good design, because the value is then non-deterministic.

You can never obtain an accurate usedMemory by calculating it in business code.

The first reason is that you cannot estimate this size, for the many reasons mentioned before, such as network fluctuations.
The second reason is that PooledByteBufAllocator may reuse direct memory through caching.
That means even if you calculate the size directly from the ByteBuf received in the channelRead method on the server side, the usedMemory you count may still be larger than the memory managed by PooledByteBufAllocator.

Moreover, NettyUtils.getNettyBufferAllocator().pinnedDirectMemory() is expensive to call, so it is sampled periodically.

So the value is bound to be non-deterministic anyway, and we don't need a deterministic usedMemory here. That's why I use the calculateEstimatedMemoryAllocationSize method to calculate the size.

rickyma · 2024-02-21T06:53:52Z

Closed. I've created a new PR to solve this problem: #1534

@zuston @jerqi

rickyma added 10 commits February 5, 2024 13:51

[apache#1472] fix(server): Inaccurate flow control leads to Shuffle s…

89d2ae8

…erver OOM when enabling Netty

[apache#1472] fix(server): Inaccurate flow control leads to Shuffle s…

947bbf3

…erver OOM when enabling Netty

[apache#1472] fix(server): Inaccurate flow control leads to Shuffle s…

6980ae0

…erver OOM when enabling Netty

fix previous tests

a34c300

fix previous tests

a4db16e

fix previous tests

a1600fa

fix previous tests

3fb2c8f

Upgrade GRPC to latest

409a307

Merge branch 'master' into issue-1472

988fc8b

Fix usedMemory

292a1d1

rickyma force-pushed the issue-1472-part5 branch from d890a53 to 292a1d1 Compare February 15, 2024 21:06

rickyma added 3 commits February 16, 2024 05:37

Remove redundant configurations

48ad6c1

Merge branch 'master' into issue-1472-part5

d15dbb7

Remove redundant configurations

3f9dc40

rickyma force-pushed the issue-1472-part5 branch from 2723676 to 3f9dc40 Compare February 16, 2024 08:11

jerqi reviewed Feb 16, 2024

rickyma requested a review from jerqi February 16, 2024 16:48

rickyma force-pushed the issue-1472-part5 branch from b25e6fd to dc7ea2b Compare February 17, 2024 18:18

rickyma force-pushed the issue-1472-part5 branch 3 times, most recently from b1713a0 to e83e9a1 Compare February 18, 2024 03:27

rickyma force-pushed the issue-1472-part5 branch 6 times, most recently from 5c01e94 to 73a2ac6 Compare February 18, 2024 07:12

Refactor ShuffleBufferManager & ShuffleBuffer

7cdccde

rickyma force-pushed the issue-1472-part5 branch from 73a2ac6 to 7cdccde Compare February 18, 2024 08:25

rickyma closed this Feb 18, 2024

rickyma reopened this Feb 18, 2024

zuston reviewed Feb 18, 2024

rickyma closed this Feb 21, 2024

rickyma mentioned this pull request Feb 23, 2024

[#1472][part-5] Use UnpooledByteBufAllocator to obtain accurate ByteBuf sizes to fix inaccurate usedMemory issue causing OOM #1534

Merged

rickyma deleted the issue-1472-part5 branch May 5, 2024 08:34
