[#1596] fix(netty): Use a ChannelFutureListener callback mechanism to release readMemory #1605

Merged
merged 8 commits into from
Mar 28, 2024

Contributor

@rickyma rickyma commented Mar 26, 2024

What changes were proposed in this pull request?

  1. Add a ChannelFutureListener and use its callback mechanism to release readMemory only after the writeAndFlush method is truly completed.
  2. Change the descriptions of configurations rss.server.buffer.capacity.ratio and rss.server.read.buffer.capacity.ratio.
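The first change can be pictured with a small stand-alone model. The sketch below is not the Uniffle or Netty code: it uses the JDK's `CompletableFuture` in place of Netty's `ChannelFuture`, and `ReadMemoryLedger` and `DeferredReleaseSketch` are hypothetical names introduced only for illustration. It shows the reservation staying counted while the write is in flight and being released only in the completion callback.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the server's read-memory accounting. */
class ReadMemoryLedger {
    private final AtomicLong used = new AtomicLong();

    void require(long bytes) { used.addAndGet(bytes); }
    void release(long bytes) { used.addAndGet(-bytes); }
    long used() { return used.get(); }
}

class DeferredReleaseSketch {
    /**
     * Reserves read memory for a response and releases it only when the
     * asynchronous write completes, whether it succeeded or failed.
     * The future here models what writeAndFlush(response) would return.
     */
    static void respond(ReadMemoryLedger ledger, long bytes, CompletableFuture<Void> writeFuture) {
        ledger.require(bytes);
        // Plays the role of addListener(listener) on the write future.
        writeFuture.whenComplete((ok, err) -> ledger.release(bytes));
    }

    public static void main(String[] args) {
        ReadMemoryLedger ledger = new ReadMemoryLedger();
        CompletableFuture<Void> write = new CompletableFuture<>();

        respond(ledger, 1024, write);
        System.out.println("in flight: " + ledger.used());        // in flight: 1024

        write.complete(null); // the write has actually finished
        System.out.println("after completion: " + ledger.used()); // after completion: 0
    }
}
```

Releasing in the callback rather than right after issuing the write is what keeps the usage figure meaningful while data is still being sent.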

Why are the changes needed?

This is actually a bug, which was introduced by PR #879. The issue has been present since the very beginning when the Netty feature was first integrated.
Fix #1596.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I don't think we need new tests. Tested in our env.
The new log will be:

[2024-03-26 23:11:51.039] [epollEventLoopGroup-3-158] [INFO] ShuffleServerNettyHandler.operationComplete - Successfully executed getLocalShuffleData for appId[application_1703049085550_7359933_1711463990606], shuffleId[0], partitionId[1328], offset[0], length[14693742]. Took 1457 ms and retrieved 14693742 bytes of data
[2024-03-26 23:11:51.040] [epollEventLoopGroup-3-130] [INFO] ShuffleServerNettyHandler.operationComplete - Successfully executed getMemoryShuffleData for appId[application_1703049085550_7359933_1711463990606], shuffleId[0], partitionId[1262]. Took 1 ms and retrieved 0 bytes of data
[2024-03-26 23:11:51.068] [epollEventLoopGroup-3-177] [INFO] ShuffleServerNettyHandler.operationComplete - Successfully executed getLocalShuffleIndex for appId[application_1703049085550_7359933_1711463990606], shuffleId[0], partitionId[1366]. Took 918 ms and retrieved 1653600 bytes of data

Contributor Author

rickyma commented Mar 26, 2024

After rethinking the issue, I found that it is a bug that occurs when Netty is enabled. I've changed the description of issue #1596. I think this PR will be enough for now. We can decrease the readCapacity to reduce the CPU load on the machine.
PTAL. @jerqi @zuston


codecov-commenter commented Mar 26, 2024

Codecov Report

Attention: Patch coverage is 2.53165%, with 77 lines in your changes missing coverage. Please review.

Project coverage is 54.85%. Comparing base (3a1b4d2) to head (cf50ded).

| Files | Patch % | Lines |
|---|---|---|
| ...niffle/server/netty/ShuffleServerNettyHandler.java | 0.00% | 72 Missing ⚠️ |
| ...mon/netty/protocol/GetLocalShuffleDataRequest.java | 0.00% | 1 Missing ⚠️ |
| ...on/netty/protocol/GetLocalShuffleIndexRequest.java | 0.00% | 1 Missing ⚠️ |
| ...on/netty/protocol/GetMemoryShuffleDataRequest.java | 0.00% | 1 Missing ⚠️ |
| .../common/netty/protocol/SendShuffleDataRequest.java | 0.00% | 1 Missing ⚠️ |
| ...he/uniffle/server/buffer/ShuffleBufferManager.java | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1605      +/-   ##
============================================
+ Coverage     54.02%   54.85%   +0.82%     
  Complexity     2863     2863              
============================================
  Files           438      418      -20     
  Lines         24858    22551    -2307     
  Branches       2114     2123       +9     
============================================
- Hits          13430    12370    -1060     
+ Misses        10587     9408    -1179     
+ Partials        841      773      -68     


Contributor Author

rickyma commented Mar 26, 2024

The metric finally displays correctly now.

Before this PR:
image

After this PR:
image

github-actions bot commented Mar 26, 2024

Test Results

 2 340 files  ±0   2 340 suites  ±0   4h 31m 8s ⏱️ - 1m 0s
   908 tests ±0     905 ✅  - 2   1 💤 ±0  1 ❌ +1  1 🔥 +1 
10 541 runs  ±0  10 525 ✅  - 2  14 💤 ±0  1 ❌ +1  1 🔥 +1 

For more details on these failures and errors, see this check.

Results for commit cf50ded. ± Comparison against base commit 3a1b4d2.

♻️ This comment has been updated with latest results.

new ReleaseMemoryAndRecordReadTimeListener(
start, readBufferSize, data.size(), requestInfo, req, client);
client.getChannel().writeAndFlush(response).addListener(listener);
return;
} catch (Exception e) {
status = StatusCode.INTERNAL_ERROR;
Member

The read memory is not released here.

Contributor Author

Nice catch! Done.

new ReleaseMemoryAndRecordReadTimeListener(
start, assumedFileSize, data.size(), requestInfo, req, client);
client.getChannel().writeAndFlush(response).addListener(listener);
return;
} catch (FileNotFoundException indexFileNotFoundException) {
LOG.warn(
Member

ditto

new ReleaseMemoryAndRecordReadTimeListener(
start, length, sdr.getDataLength(), requestInfo, req, client);
client.getChannel().writeAndFlush(response).addListener(listener);
return;
} catch (Exception e) {
status = StatusCode.INTERNAL_ERROR;
Member

ditto

@rickyma rickyma requested a review from zuston March 27, 2024 07:15
} catch (FileNotFoundException indexFileNotFoundException) {
if (shuffleIndexResult != null) {
shuffleIndexResult.release();
Member

The required read memory is not released. Why not use shuffleServer.getShuffleBufferManager().releaseReadMemory(readBufferSize);

Contributor Author

You are right. I missed it. Done.

@rickyma rickyma requested a review from zuston March 27, 2024 09:24
} catch (Exception e) {
shuffleServer.getShuffleBufferManager().releaseReadMemory(length);
Contributor

I have a concern about this: will the memory be released twice? The completion listener may also release the memory when an exception is thrown.

Contributor Author

No, writeAndFlush is asynchronous, so no exceptions will be caught here.
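The point about asynchronous failures can be illustrated with a stand-alone model. This is a sketch under assumptions, not the handler's code: `CompletableFuture` stands in for the future returned by writeAndFlush, and `SingleReleaseSketch`, `handle`, and the `releases` counter are hypothetical. A failure before the write is issued reaches the `catch` block, while a failure of the write itself is delivered only to the callback, so each request releases exactly once.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

class SingleReleaseSketch {
    /**
     * Models one request. Each increment of {@code releases} represents one
     * release of the read memory reserved for that request.
     */
    static void handle(boolean failBeforeWrite, CompletableFuture<Void> write, AtomicInteger releases) {
        try {
            if (failBeforeWrite) {
                throw new IllegalStateException("read failed before the write was issued");
            }
            // Asynchronous write: a later failure completes the future
            // exceptionally; it is not thrown into this try block.
            write.whenComplete((ok, err) -> releases.incrementAndGet());
            return;
        } catch (Exception e) {
            // Reached only for synchronous failures, before any callback was registered.
            releases.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        AtomicInteger releases = new AtomicInteger();
        CompletableFuture<Void> write = new CompletableFuture<>();

        handle(false, write, releases);
        write.completeExceptionally(new RuntimeException("connection reset"));

        System.out.println("releases after async failure: " + releases.get()); // releases after async failure: 1
    }
}
```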

Contributor

Why do we throw an exception in the future listener?

Contributor Author

@rickyma rickyma Mar 27, 2024

throw new RssException("Can not handle request " + request.type());

You mean this code snippet? I think we will never reach it, so it doesn't matter what we do there. It's like using an assert to make sure something cannot happen; if it does happen, it must be a serious bug that needs to be fixed immediately.

Contributor

Yes, this exception will be ignored now. That's not a good approach.

Contributor Author

Or I can just log an error message if you prefer.

Contributor

Either an ERROR log message or an assertion is acceptable. It's up to you.

Contributor Author

Done.

Contributor

jerqi commented Mar 27, 2024

We still use gRPC as the default for now. I think we should change the default config options only when Netty becomes the default.

Contributor

jerqi commented Mar 27, 2024

The metric finally displays correctly now.

Before this PR: image

After this PR: image

Why do we use more memory after this pull request?

Contributor Author

rickyma commented Mar 27, 2024

Why do we use more memory after this pull request?

Because readMemory was not working correctly. See the detailed description in the related issue #1596.

We are actually using "more" memory. This is the normal behavior. Before this PR, the read_used_buffer_size metric was incorrect.

Contributor Author

rickyma commented Mar 27, 2024

We still use gRPC as the default for now. I think we should change the default config options only when Netty becomes the default.

Do you want to change the default value in this PR?

@rickyma rickyma requested a review from jerqi March 27, 2024 11:09
Contributor Author

rickyma commented Mar 27, 2024

We still use gRPC as the default for now. I think we should change the default config options only when Netty becomes the default.

I get your point. You want to revert the changes to the configurations' default values in this PR. Right?

Contributor

jerqi commented Mar 27, 2024

We still use gRPC as the default for now. I think we should change the default config options only when Netty becomes the default.

I get your point. You want to revert the changes to the configurations' descriptions in this PR. Right?

Yes.

Contributor Author

rickyma commented Mar 27, 2024

We still use gRPC as the default for now. I think we should change the default config options only when Netty becomes the default.

I get your point. You want to revert the changes to the configurations' default values in this PR. Right?

Yes.

Done. Default values are reverted. But I think the descriptions still need to be updated, because the implementation of the code has changed.

@@ -42,7 +42,7 @@ public class ShuffleServerConf extends RssBaseConf {
           .doubleType()
           .defaultValue(0.6)
           .withDescription(
-              "JVM heap size * ratio for the maximum memory of buffer manager for shuffle server, this "
+              "JVM heap size or off-heap size(when enabling Netty) * ratio for the maximum memory of buffer manager for shuffle server, this "

Could you modify the document, too?

Contributor Author

Sure. Updated.
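The updated description reads as "base memory size * ratio", where the base is the JVM heap size, or the off-heap size when Netty is enabled. A rough sketch of that arithmetic follows. `BufferCapacitySketch` and `capacity` are hypothetical names, the 8 GiB off-heap figure is an arbitrary example value, and how the server actually determines its off-heap size is not shown in this PR excerpt.

```java
class BufferCapacitySketch {
    /** Capacity in bytes: base memory size multiplied by the ratio, truncated to whole bytes. */
    static long capacity(long baseBytes, double ratio) {
        if (baseBytes < 0 || ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("baseBytes must be >= 0 and ratio must be in [0, 1]");
        }
        return (long) (baseBytes * ratio);
    }

    public static void main(String[] args) {
        double ratio = 0.6; // the default value visible in the hunk above

        // Heap-based reading of the description.
        long heapMax = Runtime.getRuntime().maxMemory();
        System.out.println("heap-based capacity: " + capacity(heapMax, ratio));

        // Off-heap reading (Netty enabled), with an assumed 8 GiB off-heap size.
        long assumedOffHeapBytes = 8L * 1024 * 1024 * 1024;
        System.out.println("off-heap-based capacity: " + capacity(assumedOffHeapBytes, ratio));
    }
}
```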

@rickyma rickyma requested a review from qqqttt123 March 28, 2024 04:01
Member

zuston commented Mar 28, 2024

We still use gRPC as the default for now. I think we should change the default config options only when Netty becomes the default.

I get your point. You want to revert the changes to the configurations' default values in this PR. Right?

Yes.

Done. Default values are reverted. But I think the descriptions still need to be updated, because the implementation of the code has changed.

Please update the description, and then I will merge this.

Contributor Author

rickyma commented Mar 28, 2024

Please update the description, and then I will merge this.

I think the descriptions of the configs have already been updated in this PR.

Member

zuston commented Mar 28, 2024

Please update the description, and then I will merge this.

I think the descriptions of the configs have already been updated in this PR.

Got it. Thanks for your contribution, merged.

@zuston zuston merged commit cbf4f6f into apache:master Mar 28, 2024
38 of 41 checks passed
jerqi pushed a commit that referenced this pull request Apr 12, 2024
…nnel is writable (#1641)

### What changes were proposed in this pull request?

1. Send failed responses only when the channel is writable.
2. Print debug logs when the data is successfully sent, reducing log output.
3. Reduce the duplicated error log.

### Why are the changes needed?

A follow-up PR for: #1605.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.
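The first item of the follow-up commit, sending a failure response only when the channel is writable, might look roughly like the guard below. This is a sketch under assumptions: `Outbound` is a hypothetical minimal interface rather than Netty's `Channel` type (which does expose an `isWritable()` method), and `sendFailureIfWritable` and `RecordingOutbound` are names invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class WritableGuardSketch {
    /** Hypothetical minimal view of a connection; not Netty's Channel type. */
    interface Outbound {
        boolean isWritable();
        void send(String message);
    }

    /** Sends the failure response only if the channel can currently accept writes. */
    static boolean sendFailureIfWritable(Outbound channel, String errorMessage) {
        if (!channel.isWritable()) {
            return false; // skip the response instead of queueing more data
        }
        channel.send(errorMessage);
        return true;
    }

    /** Test double that records what was sent. */
    static final class RecordingOutbound implements Outbound {
        final List<String> sent = new ArrayList<>();
        private final boolean writable;

        RecordingOutbound(boolean writable) { this.writable = writable; }

        @Override public boolean isWritable() { return writable; }
        @Override public void send(String message) { sent.add(message); }
    }

    public static void main(String[] args) {
        RecordingOutbound blocked = new RecordingOutbound(false);
        System.out.println(sendFailureIfWritable(blocked, "INTERNAL_ERROR")); // false

        RecordingOutbound open = new RecordingOutbound(true);
        System.out.println(sendFailureIfWritable(open, "INTERNAL_ERROR"));    // true
    }
}
```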
jerqi pushed a commit that referenced this pull request Apr 30, 2024
…nnel is writable (#1641)

@rickyma rickyma deleted the issue-1596 branch May 5, 2024 08:33
Successfully merging this pull request may close these issues.

[Bug] Reading local shuffle data in high-pressure scenarios may lead to high system load
5 participants