Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce startup-silent-period mechanism to avoid partial assignments #247

Merged
merged 3 commits into from
Oct 10, 2022

Conversation

zuston
Copy link
Member

@zuston zuston commented Sep 28, 2022

What changes were proposed in this pull request?

Introduce startup-silent-period mechanism to avoid partial assignments

Why are the changes needed?

When changing some coordinator's conf and then restart, coordinator will accept client getAssignment request immediately, but it will serve for jobs request based on the partial registered shuffle-servers, which will make some jobs gotten not enough required shuffle-servers and then slow the running speed.

I think we should make coordinator wait for more than one shuffle-server heartbeat interval before serving for client. During out-of-service, requests from client will fallback to slave coordinator.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

UTs

@zuston zuston requested a review from jerqi September 28, 2022 05:49
@zuston
Copy link
Member Author

zuston commented Sep 28, 2022

PTAL @jerqi

@codecov-commenter
Copy link

codecov-commenter commented Sep 28, 2022

Codecov Report

Merging #247 (63af987) into master (c89f95c) will increase coverage by 0.02%.
The diff coverage is 66.66%.

@@             Coverage Diff              @@
##             master     #247      +/-   ##
============================================
+ Coverage     59.16%   59.18%   +0.02%     
- Complexity     1340     1343       +3     
============================================
  Files           163      163              
  Lines          8810     8837      +27     
  Branches        833      835       +2     
============================================
+ Hits           5212     5230      +18     
- Misses         3332     3339       +7     
- Partials        266      268       +2     
Impacted Files Coverage Δ
...he/uniffle/coordinator/CoordinatorGrpcService.java 2.31% <0.00%> (-0.03%) ⬇️
...ache/uniffle/coordinator/SimpleClusterManager.java 86.61% <53.33%> (-4.46%) ⬇️
...rg/apache/uniffle/coordinator/CoordinatorConf.java 97.26% <100.00%> (+0.20%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@zuston zuston changed the title [FEATUER] Introduce startup-silent-period mechanism to avoid partial assignments Introduce startup-silent-period mechanism to avoid partial assignments Sep 28, 2022
@jerqi
Copy link
Contributor

jerqi commented Sep 29, 2022

Is startup-silent-period a good name? What's name of similar mechanism of Yarn?

@zuston
Copy link
Member Author

zuston commented Sep 29, 2022

Is startup-silent-period a good name? What's name of similar mechanism of Yarn?

No such config in Yarn. Safe mode is for HDFS. I think it's not proper in Uniffle scenarios

Do u have any ideas?

@jerqi
Copy link
Contributor

jerqi commented Oct 9, 2022

Is startup-silent-period a good name? What's name of similar mechanism of Yarn?

No such config in Yarn. Safe mode is for HDFS. I think it's not proper in Uniffle scenarios

Do u have any ideas?

It's ok for Uniffle. I don't have another good ideas.

@zuston
Copy link
Member Author

zuston commented Oct 9, 2022

It's ok for Uniffle. I don't have another good ideas.

Got it. I will rename to safemode.

@jerqi
Copy link
Contributor

jerqi commented Oct 9, 2022

It's ok for Uniffle. I don't have another good ideas.

Got it. I will rename to safemode.

I means that startup-silent-period is ok.

@zuston
Copy link
Member Author

zuston commented Oct 9, 2022

OK. Do u have any other advice? @jerqi

.key("rss.coordinator.startup-silent-period.enabled")
.booleanType()
.defaultValue(false)
.withDescription("Enable the startup-silent-period to reject the assignment requests "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add more description to explain why we shouldn't use true as default value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@zuston
Copy link
Member Author

zuston commented Oct 10, 2022

Updated.

  1. Add the doc of these configs
  2. Explain why make the startup-slient-period stop default.

Copy link
Contributor

@jerqi jerqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @zuston , wait for CI

@zuston
Copy link
Member Author

zuston commented Oct 10, 2022

Error:  Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.998 s <<< FAILURE! - in org.apache.uniffle.test.CoordinatorGrpcTest
Error:  rpcMetricsTest  Time elapsed: 1.035 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <0.0> but was: <1.0>

CI failed. https://github.com/apache/incubator-uniffle/actions/runs/3216468616/jobs/5258387024 #244

Rerun it.

@jerqi
Copy link
Contributor

jerqi commented Oct 10, 2022

Unrelated flaky test failed, I will rerun the flaky test

@jerqi jerqi merged commit 5875eb3 into apache:master Oct 10, 2022
@zuston zuston deleted the rejection branch October 10, 2022 06:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants