Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#567] feat: add a shuffle-server metric about read_used_buffer_size #568

Merged
merged 1 commit into from
Feb 9, 2023

Conversation

zuston
Copy link
Member

@zuston zuston commented Feb 9, 2023

What changes were proposed in this pull request?

add a shuffle-server metric about read_used_buffer_size

Why are the changes needed?

Show more server's info for users.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Don't need.

@@ -362,10 +363,12 @@ public boolean requireReadMemoryWithRetry(long size) {
public void releaseReadMemory(long size) {
if (readDataMemory.get() >= size) {
readDataMemory.addAndGet(-size);
ShuffleServerMetrics.gaugeReadBufferUsedSize.dec(size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ShuffleServerMetrics.gaugeReadBufferUsedSize.dec(size);
ShuffleServerMetrics.gaugeReadBufferUsedSize.set(readDataMemory.get());

How about this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's Ok. The metrics should be always consistent with the value of readDataMemory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the suggest might not be better than the previous one.

Consider this case:

T1, T2 for two threads.
t0, t1, t2 for times
N: current read buffer
t0: T1: readDataMemory.addAndGet(-size)
t1: T1: readDataMemory.get() --> N - size
t2: T2: readDataMemory.addAndGet(-size)
t3: T2: readDataMemory.get() --> N - size - size
t4: T2 : gaugeReadBufferUsedSize <- N -size - size
t5: T1:  gaugeReadBufferUsedSize  <- N - size

Unless there's safe bet to actually reset the gauge to the readDataMemory.get().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the suggest might not be better than the previous one.

Consider this case:

T1, T2 for two threads.
t0, t1, t2 for times
N: current read buffer
t0: T1: readDataMemory.addAndGet(-size)
t1: T1: readDataMemory.get() --> N - size
t2: T2: readDataMemory.addAndGet(-size)
t3: T2: readDataMemory.get() --> N - size - size
t4: T2 : gaugeReadBufferUsedSize <- N -size - size
t5: T1:  gaugeReadBufferUsedSize  <- N - size

Unless there's safe bet to actually reset the gauge to the readDataMemory.get().

You are right, My mistake. Forgot that the same problem occurred before. 😂 #404

@@ -362,10 +363,12 @@ public boolean requireReadMemoryWithRetry(long size) {
public void releaseReadMemory(long size) {
if (readDataMemory.get() >= size) {
readDataMemory.addAndGet(-size);
ShuffleServerMetrics.gaugeReadBufferUsedSize.dec(size);
} else {
LOG.warn("Current read memory[" + readDataMemory.get()
+ "] is less than released[" + size + "], set read memory to 0");
readDataMemory.set(0L);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unreasonable to set 0 here,because it means there is a bug here and the server may be unhealthy and the user should restart it in this case. So we should expose it. I think we can add a new metric for this. But we can do this in another pr.

By the way, this problem also exists elsewhere:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I also found this problem, but I don't know its context.

@codecov-commenter
Copy link

codecov-commenter commented Feb 9, 2023

Codecov Report

Merging #568 (29322c5) into master (0779946) will increase coverage by 0.00%.
The diff coverage is 25.00%.

@@            Coverage Diff            @@
##             master     #568   +/-   ##
=========================================
  Coverage     60.86%   60.86%           
- Complexity     1795     1796    +1     
=========================================
  Files           213      213           
  Lines         12375    12379    +4     
  Branches       1049     1049           
=========================================
+ Hits           7532     7535    +3     
- Misses         4435     4437    +2     
+ Partials        408      407    -1     
Impacted Files Coverage Δ
...he/uniffle/server/buffer/ShuffleBufferManager.java 83.09% <0.00%> (-0.89%) ⬇️
...rg/apache/uniffle/server/ShuffleServerMetrics.java 97.34% <100.00%> (+0.02%) ⬆️
...he/uniffle/server/storage/LocalStorageManager.java 87.00% <0.00%> (+1.12%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@xianjingfeng xianjingfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xianjingfeng xianjingfeng linked an issue Feb 9, 2023 that may be closed by this pull request
3 tasks
@xianjingfeng xianjingfeng merged commit 7729c16 into apache:master Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add the shuffle server metric about used memory when reading
4 participants