-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#567] feat: add a shuffle-server metric about read_used_buffer_size #568
Conversation
@@ -362,10 +363,12 @@ public boolean requireReadMemoryWithRetry(long size) { | |||
public void releaseReadMemory(long size) { | |||
if (readDataMemory.get() >= size) { | |||
readDataMemory.addAndGet(-size); | |||
ShuffleServerMetrics.gaugeReadBufferUsedSize.dec(size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ShuffleServerMetrics.gaugeReadBufferUsedSize.dec(size); | |
ShuffleServerMetrics.gaugeReadBufferUsedSize.set(readDataMemory.get()); |
How about this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's Ok. The metrics should be always consistent with the value of readDataMemory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, the suggest might not be better than the previous one.
Consider this case:
T1, T2 for two threads.
t0, t1, t2 for times
N: current read buffer
t0: T1: readDataMemory.addAndGet(-size)
t1: T1: readDataMemory.get() --> N - size
t2: T2: readDataMemory.addAndGet(-size)
t3: T2: readDataMemory.get() --> N - size - size
t4: T2 : gaugeReadBufferUsedSize <- N -size - size
t5: T1: gaugeReadBufferUsedSize <- N - size
Unless there's safe bet to actually reset the gauge to the readDataMemory.get().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, the suggest might not be better than the previous one.
Consider this case:
T1, T2 for two threads. t0, t1, t2 for times N: current read buffer t0: T1: readDataMemory.addAndGet(-size) t1: T1: readDataMemory.get() --> N - size t2: T2: readDataMemory.addAndGet(-size) t3: T2: readDataMemory.get() --> N - size - size t4: T2 : gaugeReadBufferUsedSize <- N -size - size t5: T1: gaugeReadBufferUsedSize <- N - size
Unless there's safe bet to actually reset the gauge to the readDataMemory.get().
You are right, My mistake. Forgot that the same problem occurred before. 😂 #404
@@ -362,10 +363,12 @@ public boolean requireReadMemoryWithRetry(long size) { | |||
public void releaseReadMemory(long size) { | |||
if (readDataMemory.get() >= size) { | |||
readDataMemory.addAndGet(-size); | |||
ShuffleServerMetrics.gaugeReadBufferUsedSize.dec(size); | |||
} else { | |||
LOG.warn("Current read memory[" + readDataMemory.get() | |||
+ "] is less than released[" + size + "], set read memory to 0"); | |||
readDataMemory.set(0L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unreasonable to set 0 here,because it means there is a bug here and the server may be unhealthy and the user should restart it in this case. So we should expose it. I think we can add a new metric for this. But we can do this in another pr.
By the way, this problem also exists elsewhere:
incubator-uniffle/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java
Line 315 in 29322c5
usedMemory.set(0L); |
incubator-uniffle/server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBufferManager.java
Line 335 in 29322c5
inFlushSize.set(0L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I also found this problem, but I don't know its context.
Codecov Report
@@ Coverage Diff @@
## master #568 +/- ##
=========================================
Coverage 60.86% 60.86%
- Complexity 1795 1796 +1
=========================================
Files 213 213
Lines 12375 12379 +4
Branches 1049 1049
=========================================
+ Hits 7532 7535 +3
- Misses 4435 4437 +2
+ Partials 408 407 -1
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What changes were proposed in this pull request?
add a shuffle-server metric about read_used_buffer_size
Why are the changes needed?
Show more server's info for users.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Don't need.