-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
YARN-5871. [RESERVATION] Add support for reservation-based routing. #4632
Conversation
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@goiri I have integrated the code of YARN-5871, tested it locally, and it runs normally. There are 3 proto compilation prompts in javac, which I personally think can be ignored. I am sorting out the design documents, and it is expected to be available in 1-2 days. Please help to review code, thank you very much! I don't plan to split into multiple prs, because a lot of code is related to proto, I will explain this part in the documentation. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
try { | ||
long mem = -1; | ||
JSONObject obj = new JSONObject(value.getCapability()); | ||
mem = obj.getJSONObject("clusterMetrics").getLong("availableMB"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Directly:
long mem = obj.getJSONObject("clusterMetrics").getLong("availableMB");
for (ResourceRequest rr : rrList) { | ||
// Handle "node" requests | ||
try { | ||
targetId = resolver.getSubClusterForNode(rr.getResourceName()); | ||
nodeRequest = rr; | ||
} catch (YarnException e) { | ||
LOG.error("Cannot resolve node : {}", e.getLocalizedMessage()); | ||
LOG.error("Cannot resolve node.", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid the full exception?
Map<SubClusterId, SubClusterInfo> filteredSubClusters = prefilterSubClusters( | ||
appContext.getReservationID(), getActiveSubclusters()); | ||
|
||
FederationPolicyUtils.validateSubClusterAvailability( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we have a method that takes both arguments separate?
public SubClusterId getReservationHomeSubcluster(ReservationSubmissionRequest request) | ||
throws YarnException { | ||
if (request == null) { | ||
throw new FederationPolicyException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One line.
appSubmissionContext.getQueue().hashCode() % activeSubclusters.size()); | ||
|
||
List<SubClusterId> list = new ArrayList<>(activeSubclusters.keySet()); | ||
protected SubClusterId chooseSubCluster(String queue, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a javadoc in the parent?
+ rackRequest.getResourceName() + ", Any request: " + anyRequest | ||
.getResourceName()); | ||
|
||
LOG.info("Node request: {} , Rack request: {} , Any request: {}.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaces are weird.
// note: we cannot pre-compute the weights, as the set of activeSubcluster | ||
// changes dynamically (and this would unfairly spread the load to | ||
// sub-clusters adjacent to an inactive one), hence we need to count/scan | ||
// sub-clusters adja cent to an inactive one), hence we need to count/scan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
} | ||
if (entry.getKey() != null | ||
&& activeSubclusters.containsKey(entry.getKey().toId())) { | ||
if (entry.getKey() != null && preSelectSubClusters.containsKey(entry.getKey().toId())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract entry.getKey()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getValue() possibly too.
@@ -73,9 +56,8 @@ public SubClusterId getHomeSubcluster( | |||
int pickedIndex = FederationPolicyUtils.getWeightedRandom(weightList); | |||
if (pickedIndex == -1) { | |||
throw new FederationPolicyException( | |||
"No positive weight found on active subclusters"); | |||
"No positive weight found on active subClusters."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous capitalization looked more like it.
@slfan1989 this is a lot to review. |
@goiri Thank you very much for your help reviewing the code, I will split it. The overall implementation idea is as follows I uploaded a design document in YARN-5871, this is the V1 version (very simple), I will continue to add the content of the document, but it basically describes the work to be done by YARN-11235(#4656), I will upload the V2 version of the document as soon as possible. |
YARN-5871. [RESERVATION] Add support for reservation-based routing.
When completing the pr of YARN-11177, I found that YARN Federation has limited support for Reservation System.
After looking at Jira, I found that YARN-5871 implements some of the functions of Router Reservation System, I continue to follow this pr. However, the original code of YARN-5871 has changed too much for the Router Policy And it can no longer be directly merged into the current version of trunk. I refer to this part of the code to try to be compatible with the current version of the Router.
After reading the code of this PR, I will describe the design idea of this PR. This part of the work is in progress, first ensure that the PR can be compiled normally and passed the unit test.