
[Feature] add a configuration to control shuffle data flush #445

Merged (4 commits) on Dec 28, 2022

Conversation

@advancedxy (Contributor) commented Dec 22, 2022:

What changes were proposed in this pull request?

Add a configuration to control which kinds of shuffle buffers should be flushed.

Why are the changes needed?

To reduce potential small I/Os to local persistent storage. This should close #419.

Does this PR introduce any user-facing change?

End user: No
Shuffle Server: a new configuration

How was this patch tested?

Added unit tests.
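For reviewers skimming: the change is essentially one new server-side option plus a size filter in the flush pick. Below is a minimal sketch of the option, assuming the Flink-style ConfigOptions builder already used in ShuffleServerConf; the key name, default value, and description are illustrative, not necessarily the merged ones.

    // Illustrative sketch only; key name and default are assumptions.
    public static final ConfigOption<Long> SERVER_SHUFFLE_FLUSH_THRESHOLD =
        ConfigOptions.key("rss.server.shuffle.flush.threshold")
            .longType()
            .defaultValue(0L) // 0 disables the filter, preserving old behavior
            .withDescription("Buffers at or below this size (in bytes) are skipped"
                + " when picking shuffles to flush, to avoid small I/Os.");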

Review excerpt (non-contiguous diff lines):

        if (pickedFlushSize > expectedFlushSize) {
          LOG.info("Finish flush pick with {} bytes", pickedFlushSize);
          break;
        ...
        if (size > this.shuffleFlushThreshold) {
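To make the excerpt self-contained, here is a hedged sketch of the whole pick loop it belongs to. pickFlushedShuffle is a stand-in name, and the map types and descending sort are assumptions based on the variable names visible in the diff, not the merged code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: pick the largest shuffles to flush until enough bytes are
    // selected, skipping buffers at or below the threshold.
    static Map<Integer, Long> pickFlushedShuffle(
        Map<Integer, Long> shuffleSizes, long expectedFlushSize, long shuffleFlushThreshold) {
      List<Map.Entry<Integer, Long>> sorted = new ArrayList<>(shuffleSizes.entrySet());
      // Largest buffers first, so flushes are few and big.
      sorted.sort(Map.Entry.<Integer, Long>comparingByValue().reversed());

      Map<Integer, Long> picked = new HashMap<>();
      long pickedFlushSize = 0L;
      for (Map.Entry<Integer, Long> entry : sorted) {
        long size = entry.getValue();
        if (size > shuffleFlushThreshold) { // the new filter under discussion
          picked.put(entry.getKey(), size);
          pickedFlushSize += size;
          if (pickedFlushSize > expectedFlushSize) {
            break; // enough memory will be freed by this pick
          }
        }
      }
      return picked;
    }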
Member:

If the max size in pickedShuffle is less than shuffleFlushThreshold, the flushing mechanism becomes invalid, right?

Contributor Author:

Yeah... shuffleFlushThreshold is intended to keep buffering small I/Os.

If the configuration is bad, the configuration should be adjusted. Or do you have other ideas?

Member:

I think there is no need for such a change. If too much small I/O happens, we'd better expand the cluster to distribute the pressure.

Besides, small I/O can be observed via the event_size_threshold_level4 metric.
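For readers unfamiliar with that metric family, here is a toy illustration of how size-bucketed counters surface small I/Os. The class, bucket boundaries, and field names are made up for illustration; only the event_size_threshold_levelN naming comes from this thread (and, per the follow-up below, level1 is the small-event bucket).

    import java.util.concurrent.atomic.AtomicLong;

    // Toy illustration; the server's real bucket boundaries differ.
    class EventSizeHistogram {
      final AtomicLong level1 = new AtomicLong(); // smallest events, i.e. small I/Os
      final AtomicLong level2 = new AtomicLong();
      final AtomicLong level3 = new AtomicLong();
      final AtomicLong level4 = new AtomicLong(); // largest events

      void record(long bytes) {
        if (bytes < (1L << 20)) {           // < 1 MiB
          level1.incrementAndGet();
        } else if (bytes < (32L << 20)) {   // < 32 MiB
          level2.incrementAndGet();
        } else if (bytes < (256L << 20)) {  // < 256 MiB
          level3.incrementAndGet();
        } else {
          level4.incrementAndGet();
        }
      }
    }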

Contributor Author:

> I think there is no need for such a change. If too much small I/O happens, we'd better expand the cluster to distribute the pressure.

How would we expand the cluster? More shuffle servers? There is a limited number of shuffle servers, and you cannot scale out to many shuffle servers for one application.

This configuration is specially tailored for HDD-only shuffle servers. I think it's necessary to reduce small I/Os to HDDs as much as possible.

Contributor Author:

> event_size_threshold_level4

Do you mean event_size_threshold_level1?

@zuston (Member) commented Dec 23, 2022:

Yes. Got your thought. But consider an extreme case: all buffer sizes are 1M and memory reaches the high watermark, and the later send-data requests don't belong to the existing buffers' partitions. Will the in-memory data be flushed?

In this case, will a deadlock happen?

Contributor Author:

> Yes. Got your thought. But consider an extreme case: all buffer sizes are 1M and memory reaches the high watermark, and the later send-data requests don't belong to the existing buffers' partitions. Will the in-memory data be flushed?
>
> In this case, will a deadlock happen?

If misconfigured, say the shuffle memory's high watermark is 1G and all buffer sizes are 1M or less, then yes, it's possible that all data stays in memory. But as you said, it's an extreme case; in practice shuffle data sizes vary widely.

I don't think we need to consider such cases, and there are other ways to deal with them: lower the threshold size to go with a lower high watermark. Do you have other concerns?

@advancedxy (Contributor Author) commented Dec 23, 2022:

That said, selecting the shuffleIds to flush could be made a bit more adaptive, as sketched below: if a pick doesn't free any memory, just ignore the threshold and re-select the shuffle IDs. But I think introducing such flexibility at this stage is a bit of overkill/over-engineering.
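A hedged sketch of that adaptive idea (not what was merged), reusing the hypothetical pickFlushedShuffle sketch from earlier in this thread: if the thresholded pick selects nothing, re-pick with the threshold disabled so the high watermark can still be relieved.

    // Sketch of the fallback discussed above; names are placeholders.
    static Map<Integer, Long> pickWithFallback(
        Map<Integer, Long> shuffleSizes, long expectedFlushSize, long threshold) {
      Map<Integer, Long> picked =
          pickFlushedShuffle(shuffleSizes, expectedFlushSize, threshold);
      if (picked.isEmpty() && threshold > 0) {
        // Every buffer was below the threshold; ignore it rather than let
        // memory sit at the high watermark (the "deadlock" case above).
        picked = pickFlushedShuffle(shuffleSizes, expectedFlushSize, 0L);
      }
      return picked;
    }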

Member:

Emm... if this possibility exists, we should account for it in our design instead of hoping it never happens in users' environments.

@advancedxy (Contributor Author) commented Dec 23, 2022:

> Emm... if this possibility exists, we should account for it in our design instead of hoping it never happens in users' environments.

😂 The option is the safe bet, and the user/admin should take responsibility for what they configure. Take SERVER_BUFFER_CAPACITY and the low/high watermarks as an analogy:

  1. Do we need to account for a user misconfiguring buffer_capacity to some extremely small value, say 1M?
  2. Do we need to account for the low and high watermarks being equal, which makes the flush process handle one shuffle at a time?

Have you encountered anything similar in your prod envs?

@advancedxy marked this pull request as ready for review on December 23, 2022 03:40
@advancedxy (Contributor Author):

This is ready for review.

jerqi previously approved these changes Dec 26, 2022
@jerqi (Contributor) left a comment:

LGTM, thanks @advancedxy @zuston

Review excerpt (non-contiguous diff lines):

          LOG.info("Finish flush pick with {} bytes", pickedFlushSize);
          break;
        ...
        if (size > this.shuffleFlushThreshold) {
          pickedFlushSize += size;
Contributor:

What will happen if the shuffle server only has small I/Os?

Contributor Author:

The shuffle server will not reach the high watermark, so the shuffle data stays almost entirely in memory, which is effectively an in-memory shuffle.

Contributor:

If the shuffle server reaches the high watermark but there are only small I/Os, what will happen?

Contributor Author:

I've discussed this situation with @zuston. In practice, the cases you described should not happen.

However, since both of you raised a similar concern, let me address it in this PR.

Contributor Author:

Fixed, please take another look.

Contributor:

Is the behavior different from the original one?

Contributor:

I mean that you changed the flush strategy instead of only adding an extra configuration. Could we keep the original behavior via the configuration?

Contributor Author:

> I mean that you changed the flush strategy instead of only adding an extra configuration. Could we keep the original behavior via the configuration?

No, the new behavior is controlled by an extra configuration. The original behavior is not changed by default; it only changes if an admin/operator explicitly sets the flush threshold, as sketched below.
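To make the default-off claim concrete, a tiny sketch assuming the option defaults to 0 as in the hypothetical ConfigOption sketch earlier in this thread; the method and parameter names are illustrative.

    // Sketch of the compatibility argument; names are illustrative.
    static boolean eligibleForFlush(long bufferSizeBytes, long flushThresholdBytes) {
      // With the default threshold of 0, any non-empty buffer (> 0 bytes)
      // passes, so the pick behaves exactly as it did before this PR. Only
      // an explicit, positive threshold changes which buffers are skipped.
      return bufferSizeBytes > flushThresholdBytes;
    }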

@codecov-commenter commented Dec 26, 2022:

Codecov Report

Merging #445 (6c1e04a) into master (3aaac3e) will increase coverage by 0.04%.
The diff coverage is 93.10%.

@@             Coverage Diff              @@
##             master     #445      +/-   ##
============================================
+ Coverage     58.65%   58.70%   +0.04%     
- Complexity     1647     1650       +3     
============================================
  Files           199      199              
  Lines         11201    11214      +13     
  Branches        996      998       +2     
============================================
+ Hits           6570     6583      +13     
  Misses         4237     4237              
  Partials        394      394              
Impacted Files                                          Coverage Δ
...he/uniffle/server/buffer/ShuffleBufferManager.java   82.75% <86.66%> (+0.47%) ⬆️
...a/org/apache/uniffle/server/ShuffleServerConf.java   99.27% <100.00%> (+0.01%) ⬆️


@jerqi (Contributor) commented Dec 26, 2022:

LGTM, let @zuston take another look.

@advancedxy (Contributor Author):

Gently ping @zuston

@advancedxy (Contributor Author):

> Gently ping @zuston

Seems @zuston is busy... I will hold this PR until tomorrow.

@zuston (Member) commented Dec 28, 2022:

> Seems @zuston is busy... I will hold this PR until tomorrow.

Sorry, I'm suffering from COVID-19 and don't have much time to review this.

@advancedxy (Contributor Author):

> Sorry, I'm suffering from COVID-19 and don't have much time to review this.

Bless, and take care of yourself.

@jerqi (Contributor) commented Dec 28, 2022:

> Sorry, I'm suffering from COVID-19 and don't have much time to review this.

Bless

@advancedxy merged commit 6ddf8a7 into apache:master on Dec 28, 2022

Successfully merging this pull request may close these issues.

[Subtask] Introduce a new configuration to control the flush threshold for local storage