[SPARK-26269][YARN] YarnAllocator should have same blacklist behaviour with YARN to maximize use of cluster resource #23223
Conversation
ping @attilapiros @vanzin @jerryshao for kind review.
@Ngone51 I read your change, but I would like to understand whether the intention of this change is right.
I mean, if node blacklisting in Spark were perfectly aligned with YARN, then it would be redundant to have it in Spark in the first place. Take disk failure, for example. According to the design of Spark blacklisting:
"Resources are blacklisted even if they are only flaky (not just when they are completely unusable), eg. because of one bad disk out of many."
This could be the intended design only for task-level blacklisting.
This is why I am a bit hesitant, so it would be nice to support it with some concrete case where this really helps and is needed.
Ping @squito, @tgravescs
@@ -612,11 +612,14 @@ private[yarn] class YarnAllocator(
        val message = "Container killed by YARN for exceeding physical memory limits. " +
          s"$diag Consider boosting ${EXECUTOR_MEMORY_OVERHEAD.key}."
        (true, message)
      case exit_status if NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(exit_status) =>
        (true, "Container marked as failed: " + containerId + onHostStr +
Nit: Use string interpolation.
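The nit refers to Scala string interpolation. A sketch of the suggested change (the message tail is assumed from later revisions of the same diff):

```scala
// Before: plain concatenation
// (true, "Container marked as failed: " + containerId + onHostStr + ...)

// After: an s-interpolated string, matching the style of the
// surrounding messages in YarnAllocator
(true, s"Container marked as failed: $containerId$onHostStr" +
  s". Exit status: ${completedContainer.getExitStatus}")
```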
Regarding "some concrete case where this really helps and is needed":
There is a 1K-node cluster, and jobs have performance degradation because of a single node. It's rather hard to convince Cluster Ops to decommission a node because of "performance degradation". Imagine 10 dev teams chasing a single ops team, whether for a valid reason (the node has problems), or because the code has a bug, or the data is skewed, or there are spots on the sun.
Simple solution:
- Rerun the failed/delayed job and blacklist the "problematic" node.
- Report the problem if the job then works without anomalies.
Results:
- Ops are not spammed with weird requests from devs.
- Devs are not blocked because of a really bad node.
@@ -612,11 +612,14 @@ private[yarn] class YarnAllocator(
        val message = "Container killed by YARN for exceeding physical memory limits. " +
          s"$diag Consider boosting ${EXECUTOR_MEMORY_OVERHEAD.key}."
        (true, message)
      case exit_status if NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(exit_status) =>
I would prefer not to have it as a separate case, but just a new if around handleResourceAllocationFailure.
And since NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS is introduced, it would make sense to separate it from the huge match on exitStatus. This way it would be easier to follow when it is really triggered (one should not have to check all the previous case branches before considering this condition with contains). That way, values like ContainerExitStatus.SUCCESS from the set would really be used.
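A rough sketch of the restructuring being suggested, with names taken from the diff (the exact surrounding code and message are assumptions):

```scala
// Hypothetical sketch: handle the by-design exit statuses up front,
// instead of as one more case branch in the large match on exitStatus.
if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(exitStatus)) {
  // Neither the app's nor the system's fault: do not blacklist the
  // node, and skip handleResourceAllocationFailure entirely.
  (false, s"Container marked as failed: $containerId$onHostStr" +
    s". Exit status: $exitStatus")
} else {
  exitStatus match {
    // ... existing case branches for memory limits, preemption, etc. ...
  }
}
```

This keeps the set's membership check in one visible place, so values like ContainerExitStatus.SUCCESS in the set are actually consulted rather than shadowed by earlier case branches.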
yeah, I agree this should be cleaned up; we already handle cases above that are in the NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS set.
also after this gets rearranged, I'd leave a comment in here pointing to the code in hadoop you linked to on the jira.
Updated. But I'm not sure about this part: "That way values like ContainerExitStatus.SUCCESS from the set would be really used." @attilapiros
val containerId = ContainerId.newContainerId(appAttemptId, containerNum)
def createContainer(
    host: String,
    containerId: ContainerId = ContainerId.newContainerId(appAttemptId, containerNum),
Just containerNumber as a parameter, with a default value of containerNum?
Good idea.
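The suggestion, roughly: instead of defaulting the whole ContainerId, take just the container number and build the id inside the test helper. A sketch (the return type and body are assumptions about the test code):

```scala
// Hypothetical sketch of the simplified test helper: callers pass only
// a number, and the ContainerId is constructed in one place.
def createContainer(
    host: String,
    containerNumber: Int = containerNum): Container = {
  val containerId = ContainerId.newContainerId(appAttemptId, containerNumber)
  // ... build and return the mock Container for this host ...
}
```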
ok to test
the approach here makes sense. Are you seeing actual cases where this blacklists when it shouldn't? I could see that being possible, and if so we should move this to a defect and make sure it goes into 2.4.1
Test build #99733 has finished for PR 23223 at commit
like @tgravescs said, I'm curious if you have cases where this is going wrong currently; if so, it should probably be a bug fix targeted at 2.4.1 as well.
val cs7 = ContainerStatus.newInstance(containers(7).getId, ContainerState.COMPLETE,
  "aborted", ContainerExitStatus.ABORTED)
val cs8 = ContainerStatus.newInstance(containers(8).getId, ContainerState.COMPLETE,
  "disk_failed", ContainerExitStatus.DISKS_FAILED)
just a suggestion, you can avoid some repetition here
val nonBlacklistedStatuses = Seq(ContainerExitStatus.SUCCESS, ..., ContainerExitStatus.DISKS_FAILED)
val containerStatuses = nonBlacklistedStatuses.zipWithIndex.map { case (state, idx) =>
  ContainerStatus.newInstance(containers(idx).getId, ContainerState.COMPLETE, "diagnostics", state)
}
Nice suggestion!
I'm not super familiar with exactly how the blacklisting works in YARN itself -- it looks like it's only going to blacklist the node for the AM, not other nodes for general containers. I don't totally follow where the
This change seems to result in Spark's node blacklisting being perfectly aligned with YARN, but my original thought was that some exit statuses (e.g. KILLED_BY_RESOURCEMANAGER) currently should not lead to node blacklisting. So, actually, being perfectly aligned with YARN is not the real target of this change, and we can also make some custom strategy for Spark.
As for Spark task-level blacklisting, should it be delegated to schedulerBlacklist in YarnAllocatorBlacklistTracker? And it seems ContainerExitStatus.DISKS_FAILED in YARN is not the same as Spark tasks' disk failures.
Unfortunately, no. @tgravescs @squito
@squito YARN has a blacklist for the AM when the config
Test build #99757 has finished for PR 23223 at commit
if you aren't seeing actual issues with this, I guess it would be interesting to test it further to see if they occur. I can see Spark blacklisting when it shouldn't for exit codes like the one you mention (KILLED_BY_RESOURCEMANAGER). So I would like to see someone test this further and determine whether that happens. If it does, we should change this to a bug and put it into 2.4.1.
@tgravescs Yeah, I have the same thought. I'd like to try it, but I can't guarantee that I can achieve it... because I have never done this kind of test before. I'll try my best.
Hi @tgravescs, I tried it, but found it's difficult to produce the KILLED_BY_RESOURCEMANAGER exit status. I followed YARN-73 and YARN-495, but things didn't go as I expected.
ok, thanks for trying. If I get a chance I can try later in the week, but that doesn't have to block this now if someone else has time to review before I get to it. We can always pull it back later.
sorry for the delay on this, will try it out today
// oop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/Apps.java#L273 for details)
if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status)) {
  (true, s"Container marked as failed: $containerId$onHostStr" +
I think we want to return false here for exitCausedByApp, since these don't seem to be issues with the app. From the comment in the YARN code:
// Neither the app's fault nor the system's fault. This happens by design,
// so no need for skipping nodes
If we mark it as true then it counts against our task failures; if it's false, it doesn't. It seems like these shouldn't count against our failures.
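Concretely, the suggestion is to flip the first element of the returned tuple, which is the exitCausedByApp flag (a sketch based on the diff above; the message text mirrors the existing one):

```scala
// exitCausedByApp = false: by-design exits (neither the app's nor the
// system's fault, per the YARN comment) should not count against the
// application's executor-failure budget.
(false, s"Container marked as failed: $containerId$onHostStr" +
  s". Exit status: ${completedContainer.getExitStatus}")
```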
Makes sense.
// ache/hadoop/yarn/util/Apps.java#L273 for details)
if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status)) {
  (true, s"Container marked as failed: $containerId$onHostStr" +
    s". Exit status: ${completedContainer.getExitStatus}" +
remove extra space after :
okay.
// completed container from a bad node
allocatorBlacklistTracker.handleResourceAllocationFailure(hostOpt)
(true, s"Container from a bad node: $containerId$onHostStr" +
  s". Exit status: ${completedContainer.getExitStatus}" +
remove extra space after :
haven't been able to easily reproduce, will try again today
Thank you for your efforts. @tgravescs
Test build #100336 has finished for PR 23223 at commit
+1. I haven't had time to try further to reproduce, but the changes make sense.
merged to master, going to merge into 2.4 if it's a clean merge
this doesn't cleanly merge to 2.4, @Ngone51 would you be able to put up a PR against branch-2.4?
sure, thanks a lot. @tgravescs
Hi, @tgravescs .
thanks for pointing that out, I forgot to change the type, will fix it now. I consider this a defect as we would blacklist things when we shouldn't.
+1. Thanks!
…r with yarn to maxmize use of cluster resource

## What changes were proposed in this pull request?

As I mentioned in jira [SPARK-26269](https://issues.apache.org/jira/browse/SPARK-26269), in order to maxmize the use of cluster resource, this pr try to make `YarnAllocator` have the same blacklist behaviour with YARN.

## How was this patch tested?

Added.

Closes apache#23223 from Ngone51/dev-YarnAllocator-should-have-same-blacklist-behaviour-with-YARN.

Lead-authored-by: wuyi <ngone_5451@163.com>
Co-authored-by: Ngone51 <ngone_5451@163.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
What changes were proposed in this pull request?
As I mentioned in JIRA SPARK-26269, in order to maximize the use of cluster resources, this PR tries to make `YarnAllocator` have the same blacklist behaviour as YARN.
How was this patch tested?
Tests were added.