
[SPARK-29683][YARN]False report isAllNodeBlacklisted when RM is having issue #28606

Conversation

@cnZach (Contributor) commented May 22, 2020

What changes were proposed in this pull request?

Improve the check logic for whether all node managers are really being blacklisted.

Why are the changes needed?

I observed that when the AM is out of sync with the ResourceManager, or the RM is having trouble reporting back the current number of available NMs, something like the following happens:
...
20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately.
...
20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing.
...

then the Spark job suddenly runs into the AllNodeBlacklisted state:
...
20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)
...

but there are actually no blacklisted nodes in currentBlacklistedYarnNodes, and I do not see any blacklisting message from:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119

We should only return isAllNodeBlacklisted = true when numClusterNodes > 0 AND currentBlacklistedYarnNodes.size >= numClusterNodes, as restated below.
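
Restated as code, this is a minimal sketch of the proposed predicate against the tracker's own fields (not the exact patch text):

```scala
// Sketch: "all nodes blacklisted" is only meaningful once the RM has
// reported a positive cluster size; a zero/unknown size must not count.
def isAllNodeBlacklisted: Boolean =
  numClusterNodes > 0 && currentBlacklistedYarnNodes.size >= numClusterNodes
```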

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A minor change. No changes to tests.

@cnZach (Contributor, Author) commented May 22, 2020

Can one of the admins verify this patch?

Inline review comment from a Member on the new code:

```scala
    logWarn("There's no available nodes reported, please check Resource Manager.")
    false
  } else if (currentBlacklistedYarnNodes.size >= numClusterNodes) {
    true
```

`if (x) true else false` is redundant; just return the value of the predicate as before.

Inline review comment from a Member on the changed lines:

```diff
-def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
+def isAllNodeBlacklisted: Boolean = {
+  if (numClusterNodes <= 0) {
+    logWarn("There's no available nodes reported, please check Resource Manager.")
```

"There are", or just drop the first word.

@cnZach force-pushed the false_AllNodeBlacklisted_when_RM_is_having_issue branch from 387b5e7 to 6dcb366 on May 24, 2020
@cnZach (Contributor, Author) commented May 24, 2020

Hi @srowen, thanks for reviewing this. I have addressed your review comments. Let me know if any further changes are needed.

@srowen (Member) commented May 24, 2020

@attilapiros, what do you think? It seems like a plausible logic change, but I don't know this part well; maybe you know it better.

@srowen (Member) commented May 24, 2020

Jenkins test this please

@SparkQA commented May 24, 2020

Test build #123061 has finished for PR 28606 at commit 508a746.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a review

I mostly agree with this change, but there is already a PR about this: #26343.

```diff
@@ -103,7 +103,14 @@ private[spark] class YarnAllocatorBlacklistTracker(
     refreshBlacklistedNodes()
   }
 
-  def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
+  def isAllNodeBlacklisted: Boolean = {
+    if (numClusterNodes <= 0) {
```

Inline comment from a Contributor: numClusterNodes == 0 would be better.

@srowen (Member) commented May 25, 2020

@attilapiros, is this minor change correct and complementary, or does it need to be part of the larger PR?

@attilapiros (Contributor) commented
Yes, this is correct on its own.

@cnZach (Contributor, Author) commented May 26, 2020

Thanks @attilapiros for pointing out #26343. The same issue is addressed there, so if #26343 goes in, we can just close this minor PR.

@srowen (Member) commented Jun 1, 2020

Not sure what the status of that PR is, so I will merge this and attach it to the JIRA as a partial fix.

@srowen changed the title from [MINOR][YARN]False report isAllNodeBlacklisted when RM is having issue to [SPARK-29683][YARN]False report isAllNodeBlacklisted when RM is having issue on Jun 1, 2020
@srowen closed this in e70df2c on Jun 1, 2020
sjrand pushed a commit to palantir/spark that referenced this pull request Jun 1, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request Jun 1, 2020
sungpeo pushed a commit to sungpeo/spark that referenced this pull request Jan 3, 2022