
Use the conf of shuffleNodesNumber from jobs to be as checking factor #208

Merged
merged 4 commits into apache:master on Sep 20, 2022

Conversation

zuston
Member

@zuston zuston commented Sep 9, 2022

What changes were proposed in this pull request?

Use the conf of shuffleNodesNumber from jobs to be as checking factor

Why are the changes needed?

PR #97 allowed the client to specify the number of shuffle servers, but clusterLoadChecker does not take this into consideration. This PR uses the shuffleNodesNumber conf from jobs as the checking factor, but only when the conf of rss.coordinator.access.loadChecker.serverNum.threshold is not set.
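A minimal sketch of the proposed check. The class and method names here are illustrative, not the actual AccessClusterLoadChecker API:

```java
// Hypothetical sketch of the fallback described above; names are
// illustrative, not the real Uniffle implementation.
public class LoadCheckSketch {
    // serverNumThreshold == null models a missing
    // rss.coordinator.access.loadChecker.serverNum.threshold
    static boolean allowAccess(int availableNodes, Integer serverNumThreshold, int jobShuffleNodesNum) {
        // The explicit threshold wins when set; otherwise fall back to the
        // shuffle-server number requested by the job.
        int required = (serverNumThreshold != null) ? serverNumThreshold : jobShuffleNodesNum;
        return availableNodes >= required;
    }

    public static void main(String[] args) {
        System.out.println(allowAccess(20, null, 10)); // threshold unset, job needs 10 of 20 -> true
        System.out.println(allowAccess(20, 30, 10));   // threshold 30 set, only 20 available -> false
    }
}
```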

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UTs.

@codecov-commenter

codecov-commenter commented Sep 9, 2022

Codecov Report

Merging #208 (240fb1a) into master (9b83f66) will increase coverage by 0.14%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #208      +/-   ##
============================================
+ Coverage     59.02%   59.17%   +0.14%     
- Complexity     1326     1332       +6     
============================================
  Files           160      160              
  Lines          8699     8732      +33     
  Branches        815      819       +4     
============================================
+ Hits           5135     5167      +32     
+ Misses         3301     3300       -1     
- Partials        263      265       +2     
Impacted Files Coverage Δ
...java/org/apache/uniffle/common/util/Constants.java 0.00% <ø> (ø)
.../uniffle/coordinator/AccessClusterLoadChecker.java 96.96% <100.00%> (+0.81%) ⬆️
...rg/apache/uniffle/coordinator/CoordinatorConf.java 96.94% <100.00%> (ø)
...orage/handler/impl/LocalFileServerReadHandler.java 77.96% <0.00%> (-2.40%) ⬇️
.../java/org/apache/uniffle/common/util/RssUtils.java 69.32% <0.00%> (-0.80%) ⬇️
.../org/apache/spark/shuffle/writer/WriterBuffer.java 93.18% <0.00%> (-0.30%) ⬇️
.../storage/handler/impl/HdfsShuffleWriteHandler.java 85.07% <0.00%> (ø)
...rg/apache/uniffle/server/ShuffleServerMetrics.java 96.11% <0.00%> (+0.24%) ⬆️
...in/java/org/apache/uniffle/server/HealthCheck.java 61.36% <0.00%> (+1.83%) ⬆️
...org/apache/uniffle/server/LocalStorageChecker.java 70.19% <0.00%> (+2.45%) ⬆️
... and 3 more


@zuston
Member Author

zuston commented Sep 9, 2022

PTAL @jerqi

@@ -99,13 +103,17 @@ private boolean tryAccessCluster() {
long retryInterval = sparkConf.get(RssSparkConfig.RSS_CLIENT_ACCESS_RETRY_INTERVAL_MS);
int retryTimes = sparkConf.get(RssSparkConfig.RSS_CLIENT_ACCESS_RETRY_TIMES);

int assignmentShuffleNodesNum = sparkConf.get(RssSparkConfig.RSS_CLIENT_ASSIGNMENT_SHUFFLE_SERVER_NUMBER);
Contributor

Why do we only modify spark3?

Member Author

Sorry, I forgot to modify spark2.

@jerqi
Contributor

jerqi commented Sep 13, 2022

RSS_CLIENT_ASSIGNMENT_SHUFFLE_SERVER_NUMBER means the maximum number of servers that should be allocated to the client, not the minimum. I don't think it is suitable for judging whether to reject the application.

@zuston
Member Author

zuston commented Sep 13, 2022

The conf rss.coordinator.shuffle.nodes.max in the coordinator is a soft limit: when the number of available nodes is lower than this value, it does not throw an exception but instead reassigns using all available nodes.

But in the current implementation of AccessClusterLoadChecker, the app is rejected when the number of available nodes < rss.coordinator.shuffle.nodes.max.

For example:

available shuffle nodes number = 20
job required shuffle nodes number = 10
rss.coordinator.shuffle.nodes.max = 30

AccessClusterLoadChecker will reject this job, which I think is unreasonable.
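Plugging the numbers from this example into a quick check (illustrative only, not the actual Uniffle code):

```java
// Hypothetical comparison of the current vs. proposed behavior for the
// numbers in the example above; not the real checker implementation.
public class RejectExampleSketch {
    // Current checker: compares available nodes against the threshold,
    // which defaults to rss.coordinator.shuffle.nodes.max.
    static boolean currentCheck(int available, int nodesMax) {
        return available >= nodesMax;
    }

    // Proposed: compare against the number of servers the job actually asked for.
    static boolean proposedCheck(int available, int jobRequired) {
        return available >= jobRequired;
    }

    public static void main(String[] args) {
        System.out.println(currentCheck(20, 30));   // 20 >= 30 -> false, job rejected
        System.out.println(proposedCheck(20, 10));  // 20 >= 10 -> true, job accepted
    }
}
```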

@jerqi
Contributor

jerqi commented Sep 13, 2022

> The conf rss.coordinator.shuffle.nodes.max in the coordinator is a soft limit: when the number of available nodes is lower than this value, it does not throw an exception but instead reassigns using all available nodes.
>
> But in the current implementation of AccessClusterLoadChecker, the app is rejected when the number of available nodes < rss.coordinator.shuffle.nodes.max.
>
> For example: available shuffle nodes number = 20, job required shuffle nodes number = 10, rss.coordinator.shuffle.nodes.max = 30
>
> AccessClusterLoadChecker will reject this job, which I think is unreasonable.

I got it.

@jerqi
Contributor

jerqi commented Sep 13, 2022

But rss.coordinator.shuffle.nodes.max is just the default value. If you have other needs, you can change this default value. rss.coordinator.access.loadChecker.serverNum.threshold isn't actually related to rss.coordinator.shuffle.nodes.max.

@zuston
Member Author

zuston commented Sep 13, 2022

You mean we should remove the check of rss.coordinator.shuffle.nodes.max in AccessClusterLoadChecker?

@jerqi
Contributor

jerqi commented Sep 14, 2022

> You mean we should remove the check of rss.coordinator.shuffle.nodes.max in AccessClusterLoadChecker?

The original logic doesn't check rss.coordinator.shuffle.nodes.max. It checks the config option rss.coordinator.access.loadChecker.serverNum.threshold, whose default value is rss.coordinator.shuffle.nodes.max. A default value doesn't need to be reasonable for every situation.

@zuston
Member Author

zuston commented Sep 14, 2022

Got your point.

But the conf of rss.coordinator.access.loadChecker.serverNum.threshold is a hard limit: if the value is too low, it passes everything; if too high, it rejects everything.

For fairness across jobs, I won't set it, and I hope the checker can decide whether to reject based on the number of shuffle servers required by each job.

I think this is reasonable: it still preserves the hard limit while keeping flexibility.

@jerqi
Contributor

jerqi commented Sep 14, 2022

> Got your point.
>
> But the conf of rss.coordinator.access.loadChecker.serverNum.threshold is a hard limit: if the value is too low, it passes everything; if too high, it rejects everything.
>
> For fairness across jobs, I won't set it, and I hope the checker can decide whether to reject based on the number of shuffle servers required by each job.
>
> I think this is reasonable: it still preserves the hard limit while keeping flexibility.

Should we give users a coordinator option to decide whether to use the client's required shuffle-server number?

@zuston
Member Author

zuston commented Sep 14, 2022

> Should we give users a coordinator option to decide whether to use the client's required shuffle-server number?

Not necessary. If rss.client.assignment.shuffle.nodes.max is -1 or not set, the conf from the coordinator side is used.

@jerqi
Contributor

jerqi commented Sep 14, 2022

> Should we give users a coordinator option to decide whether to use the client's required shuffle-server number?
>
> Not necessary. If rss.client.assignment.shuffle.nodes.max is -1 or not set, the conf from the coordinator side is used.

If I want to use rss.coordinator.access.loadChecker.serverNum.threshold, it's weird that we must guarantee rss.client.assignment.shuffle.nodes.max is -1 or not set. They are unrelated config options.

@zuston
Member Author

zuston commented Sep 14, 2022

Oh, sorry for not explaining clearly.

  1. If rss.coordinator.access.loadChecker.serverNum.threshold is set, the checker uses only this value as the checking condition.
  2. If rss.coordinator.access.loadChecker.serverNum.threshold is missing, the checker uses the client's shuffle-server number as the checking condition. If rss.client.assignment.shuffle.nodes.max is -1 or not set, it falls back to the coordinator side's rss.coordinator.shuffle.nodes.max.
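The resolution order just described can be sketched as a single function (hypothetical names, not the actual implementation):

```java
// Hypothetical sketch of the threshold resolution described above;
// names are illustrative, not the real Uniffle code.
public class ThresholdResolutionSketch {
    static final int UNSET = -1;

    // Resolves the node count the access checker compares against.
    static int effectiveThreshold(Integer serverNumThreshold, int clientNodesMax, int coordinatorNodesMax) {
        if (serverNumThreshold != null) {
            return serverNumThreshold;   // 1. explicit serverNum.threshold conf wins
        }
        if (clientNodesMax != UNSET) {
            return clientNodesMax;       // 2. job's rss.client.assignment.shuffle.nodes.max
        }
        return coordinatorNodesMax;      //    fallback: rss.coordinator.shuffle.nodes.max
    }

    public static void main(String[] args) {
        System.out.println(effectiveThreshold(25, 10, 30));       // 25: threshold is set
        System.out.println(effectiveThreshold(null, 10, 30));     // 10: job's requested number
        System.out.println(effectiveThreshold(null, UNSET, 30));  // 30: coordinator default
    }
}
```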

@jerqi
Contributor

jerqi commented Sep 14, 2022

As discussed on the dev mailing list, we will freeze the code and cut the 0.6 version branch on September 15. We will not merge this PR before I cut the 0.6 version branch. Are you OK with that?

@zuston
Member Author

zuston commented Sep 14, 2022

> As discussed on the dev mailing list, we will freeze the code and cut the 0.6 version branch on September 15. We will not merge this PR before I cut the 0.6 version branch. Are you OK with that?

Got it

@zuston
Member Author

zuston commented Sep 19, 2022

Ping @jerqi

@jerqi
Contributor

jerqi commented Sep 19, 2022

Should we add more documentation for this feature?

@zuston
Member Author

zuston commented Sep 19, 2022

> Should we add more documentation for this feature?

Added.

@zuston zuston requested a review from jerqi September 20, 2022 02:03
Contributor

@jerqi jerqi left a comment

LGTM, thanks @zuston

@jerqi jerqi merged commit 9dc7f0d into apache:master Sep 20, 2022
@zuston
Member Author

zuston commented Sep 20, 2022

Thanks for your patient review, @jerqi
