[#1472][part-5] Inaccurate flow control leads to Shuffle server OOM when enabling Netty #1531
Test Results: 2 289 files (-140), 2 289 suites (-140), 4h 33m 25s (-7m 6s). Results for commit 7cdccde, compared against base commit b924aca. This pull request removes 29 and adds 26 tests. Note that renamed tests count towards both.
PTAL @jerqi. The main modifications are concentrated in the following files:
Other modifications are mainly in test files.
cc @zuston
      flushBuffer(buffer, appId, shuffleId, startPartition, endPartition, isHugePartition);
      return;
    }
  }

  public void flushIfNecessary() {
    // if data size in buffer > highWaterMark, do the flush
    if (usedMemory.get() - preAllocatedSize.get() - inFlushSize.get() > highWaterMark) {
Could you extract a method to make the logic clearer?
I think the code is clear enough, and I'm not sure extracting a method would help. It might even make the logic less clear.
    if (nettyServerEnabled) {
      needFlush = pinnedDirectMemory > highWaterMark;
    } else {
      needFlush = usedMemory.get() - preAllocatedSize.get() - inFlushSize.get() > highWaterMark;
    }
The pseudocode for needFlush is as follows:
    needFlush = current shuffle server's actual used buffer > highWaterMark;
We use PooledByteBufAllocator to allocate buffers in Netty mode, so we can basically regard pinnedDirectMemory as the current shuffle server's actual used buffer when Netty is enabled.
In Netty mode, the current shuffle server's actual used buffer is pinnedDirectMemory.
In gRPC mode, the current shuffle server's actual used buffer is usedMemory.get() - preAllocatedSize.get() - inFlushSize.get().
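The two branches described in this comment can be put side by side in a small, self-contained sketch. It is only an illustration under stated assumptions: the class name FlushDecision and its constructor are hypothetical, and pinnedDirectMemory is passed in as a plain long instead of being read from PooledByteBufAllocator.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of the needFlush decision; not the PR's actual code.
final class FlushDecision {
  final AtomicLong usedMemory = new AtomicLong();
  final AtomicLong preAllocatedSize = new AtomicLong();
  final AtomicLong inFlushSize = new AtomicLong();
  private final long highWaterMark;
  private final boolean nettyServerEnabled;

  FlushDecision(long highWaterMark, boolean nettyServerEnabled) {
    this.highWaterMark = highWaterMark;
    this.nettyServerEnabled = nettyServerEnabled;
  }

  // Netty mode: the allocator's pinned direct memory is the actual used buffer.
  // gRPC mode: the actual used buffer is derived from the bookkeeping counters.
  boolean needFlush(long pinnedDirectMemory) {
    long actualUsedBuffer = nettyServerEnabled
        ? pinnedDirectMemory
        : usedMemory.get() - preAllocatedSize.get() - inFlushSize.get();
    return actualUsedBuffer > highWaterMark;
  }
}
```

In gRPC mode the pinnedDirectMemory argument is ignored, which mirrors the if/else in the comment above.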
@@ -47,6 +47,8 @@ public class ShuffleBuffer {

  private final long capacity;
  private long size;
  // for Netty mode
  private long estimatedSize;
Do we need estimatedSize? Could we reuse size?
We need it because we use the accurate, real-time used direct memory (pinnedDirectMemory) to decide whether to pre-allocate (or flush). If we used size to calculate usedMemory, usedMemory would gradually deviate from pinnedDirectMemory over time, and the divergence would keep growing. This would lead to inaccuracies when calling the pickFlushedShuffle method and when the coordinator allocates shuffle servers, as both continue to use usedMemory as the basis for their decisions.
Also, we cannot reuse size, because the real size of the file is still used in places like:
  LocalStorageManager.updateWriteMetrics
  HadoopStorageManager.updateWriteMetrics
  ShuffleTaskInfo.addOnLocalFileDataSize
  ShuffleTaskInfo.addOnHadoopDataSize
  StorageWriteMetrics.eventSize
We have to keep it the original way.
We reuse
So we don't need to modify the logic of method |
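To make the distinction between size and estimatedSize concrete, here is a minimal sketch of a buffer that keeps both counters. Everything in it is an assumption for illustration: the class TrackedBuffer is hypothetical, and the estimate (round up to the next power of two, with a 16-byte minimum) is a stand-in that is not taken from the PR's calculateEstimatedMemoryAllocationSize.

```java
// Hypothetical illustration; not the PR's ShuffleBuffer implementation.
final class TrackedBuffer {
  private long size;          // real payload bytes, e.g. for write metrics
  private long estimatedSize; // estimated direct memory held (Netty mode)

  // Assumed estimate: round up to the next power of two, minimum 16 bytes.
  static int estimateAllocation(int requestedSize) {
    if (requestedSize <= 16) {
      return 16;
    }
    if (requestedSize > (1 << 30)) {
      return requestedSize; // avoid int overflow when doubling
    }
    int highest = Integer.highestOneBit(requestedSize);
    return highest == requestedSize ? requestedSize : highest << 1;
  }

  // Record one appended block in both counters.
  void append(int dataLength) {
    size += dataLength;
    estimatedSize += estimateAllocation(dataLength);
  }

  long getSize() {
    return size;
  }

  long getEstimatedSize() {
    return estimatedSize;
  }
}
```

With an estimate like this the two counters drift apart as data is appended, which is why the file-size metrics and the memory accounting cannot share one field.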
Could you provide a common abstraction for Netty mode and non-Netty mode? Each mode would then implement its own specific behaviour. Maybe we need some interfaces.
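One shape such an abstraction could take is sketched below. This is a hypothetical design, not the one adopted in the PR: the interface MemoryUsageSource, both implementations, and FlushPolicy are invented names, and the Netty gauge is modelled as a LongSupplier instead of a real allocator call.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical interface; each mode supplies its own notion of "used buffer".
interface MemoryUsageSource {
  long actualUsedBuffer();
}

// Non-Netty (gRPC) mode: derive usage from the bookkeeping counters.
final class CounterBasedUsage implements MemoryUsageSource {
  final AtomicLong usedMemory = new AtomicLong();
  final AtomicLong preAllocatedSize = new AtomicLong();
  final AtomicLong inFlushSize = new AtomicLong();

  @Override
  public long actualUsedBuffer() {
    return usedMemory.get() - preAllocatedSize.get() - inFlushSize.get();
  }
}

// Netty mode: delegate to a gauge such as the allocator's pinned direct memory.
final class PinnedMemoryUsage implements MemoryUsageSource {
  private final LongSupplier pinnedDirectMemory;

  PinnedMemoryUsage(LongSupplier pinnedDirectMemory) {
    this.pinnedDirectMemory = pinnedDirectMemory;
  }

  @Override
  public long actualUsedBuffer() {
    return pinnedDirectMemory.getAsLong();
  }
}

// The flush check no longer needs to know which mode is active.
final class FlushPolicy {
  private final MemoryUsageSource source;
  private final long highWaterMark;

  FlushPolicy(MemoryUsageSource source, long highWaterMark) {
    this.source = source;
    this.highWaterMark = highWaterMark;
  }

  boolean needFlush() {
    return source.actualUsedBuffer() > highWaterMark;
  }
}
```

The mode is chosen once, when the source is constructed, so callers of needFlush() carry no nettyServerEnabled branches.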
The abstraction is provided as follows:
Hi @rickyma. The core code looks good overall. The unit tests for Netty memory need to be extended to account for the growth of memory.
I understand your motivation, but the change is not reasonable; it looks like a hack.
   * @param requestedSize The requested size of the direct memory.
   * @return The estimated allocated direct memory size.
   */
  public static int calculateEstimatedMemoryAllocationSize(int requestedSize) {
It's really weird
@@ -68,6 +74,9 @@ public void start() {
          ShuffleServerMetrics.gaugeUsedDirectMemorySize.set(usedDirectMemory);
          ShuffleServerMetrics.gaugeAllocatedDirectMemorySize.set(allocatedDirectMemory);
          ShuffleServerMetrics.gaugePinnedDirectMemorySize.set(pinnedDirectMemory);
          if (nettyServerEnabled) {
            shuffleBufferManager.setUsedMemory(pinnedDirectMemory);
Emm... Using a scheduled thread to update usedMem is not a good design, because the value is then not deterministic.
You can never accurately obtain usedMemory by calculating it in business code.
The first reason is that you cannot estimate this size, for many of the reasons mentioned before, such as network fluctuations.
The second reason is that PooledByteBufAllocator may reuse direct memory through caching. That means even if you calculate the size directly from the ByteBuf received in the channelRead method on the server side, the usedMemory you count may still be larger than the memory managed by PooledByteBufAllocator.
Moreover, NettyUtils.getNettyBufferAllocator().pinnedDirectMemory() is expensive to call, so it is sampled periodically.
So the value is bound to be non-deterministic anyway, and we don't need a deterministic usedMemory here. That's why I use a calculateEstimatedMemoryAllocationSize method to calculate the size.
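The periodic sampling described here can be sketched with only JDK classes. This is an assumption-laden illustration, not the PR's NettyDirectMemoryTracker: the class PeriodicUsageTracker is hypothetical, and the expensive gauge is modelled as a LongSupplier standing in for a call such as NettyUtils.getNettyBufferAllocator().pinnedDirectMemory().

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical illustration of sampling an expensive gauge on a schedule.
final class PeriodicUsageTracker implements AutoCloseable {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(runnable -> {
        Thread thread = new Thread(runnable, "usage-tracker");
        thread.setDaemon(true); // do not keep the JVM alive for sampling
        return thread;
      });
  private final AtomicLong usedMemory = new AtomicLong();
  private final LongSupplier expensiveGauge;

  PeriodicUsageTracker(LongSupplier expensiveGauge) {
    this.expensiveGauge = expensiveGauge;
  }

  // Sample the gauge after initialDelayMs, then every intervalMs.
  void start(long initialDelayMs, long intervalMs) {
    scheduler.scheduleAtFixedRate(
        this::refresh, initialDelayMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  // One sampling step: copy the gauge value into the cached counter.
  void refresh() {
    usedMemory.set(expensiveGauge.getAsLong());
  }

  // Cheap read of the last sampled value; may lag behind the real gauge.
  long currentUsedMemory() {
    return usedMemory.get();
  }

  @Override
  public void close() {
    scheduler.shutdownNow();
  }
}
```

Readers on the hot path only touch the AtomicLong, so the cost of the gauge is paid once per interval, and the cached value can be stale by up to one interval.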
What changes were proposed in this pull request?
When the shuffle server enables Netty, during memory pre-allocation and buffer flushing, we should use the actual used direct memory (which is pinnedDirectMemory in PooledByteBufAllocator) in the if statement, instead of the previous usedMemory and capacity, due to #1472.
When initializing the capacity variable, direct memory will be used.
When setting the usedMemory variable, pinnedDirectMemory will be used. usedMemory will be updated in NettyDirectMemoryTracker periodically.
The default values of the rss.server.netty.directMemoryTracker.memoryUsage.updateMetricsIntervalMs and rss.server.netty.directMemoryTracker.memoryUsage.initialFetchDelayMs configurations are decreased to 1s.
Why are the changes needed?
A sub PR for: #1519
Does this PR introduce any user-facing change?
No.
How was this patch tested?
1. Modified existing UTs.
2. Fix #1008. It does not actually test GRPC_NETTY mode, because it uses ShuffleServerGrpcClient everywhere instead of ShuffleServerGrpcNettyClient.