
[SPARK-29683][YARN]False report isAllNodeBlacklisted when RM is having issue #28606

Conversation

@cnZach (Contributor) commented May 22, 2020

What changes were proposed in this pull request?

Improve the check logic for whether all node managers are really being blacklisted.

Why are the changes needed?

I observed that when the AM is out of sync with the ResourceManager, or the RM is having trouble reporting back the current number of available NMs, something like the following happens:
...
20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately.
...
20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing.
...

then the Spark job suddenly runs into the AllNodeBlacklisted state:
...
20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)
...

but there are actually no blacklisted nodes in currentBlacklistedYarnNodes, and I do not see any blacklisting message from:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119

We should only return isAllNodeBlacklisted = true when numClusterNodes > 0 AND currentBlacklistedYarnNodes.size >= numClusterNodes, as restated below.
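
Restated as code, this is a minimal sketch of the proposed predicate against the tracker's own fields (not the exact patch text):

```scala
// Sketch: "all nodes blacklisted" is only meaningful once the RM has
// reported a positive cluster size; a zero/unknown size must not count.
def isAllNodeBlacklisted: Boolean =
  numClusterNodes > 0 && currentBlacklistedYarnNodes.size >= numClusterNodes
```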

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A minor change. No changes to tests.

@cnZach (Contributor, Author) commented May 22, 2020

Can one of the admins verify this patch?

Inline review comment from a Member on the new code:

```scala
    logWarn("There's no available nodes reported, please check Resource Manager.")
    false
  } else if (currentBlacklistedYarnNodes.size >= numClusterNodes) {
    true
```

`if (x) true else false` is redundant; just return the value of the predicate as before.

Inline review comment from a Member on the changed lines:

```diff
-def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
+def isAllNodeBlacklisted: Boolean = {
+  if (numClusterNodes <= 0) {
+    logWarn("There's no available nodes reported, please check Resource Manager.")
```

"There are", or just drop the first word.

@cnZach force-pushed the false_AllNodeBlacklisted_when_RM_is_having_issue branch from 387b5e7 to 6dcb366 on May 24, 2020
@cnZach (Contributor, Author) commented May 24, 2020

Hi @srowen, thanks for reviewing this. I have addressed your review comments. Let me know if any further changes are needed.

@srowen (Member) commented May 24, 2020

@attilapiros, what do you think? It seems like a plausible logic change, but I don't know this part well; maybe you know it better.

@srowen (Member) commented May 24, 2020

Jenkins test this please

@SparkQA commented May 24, 2020

Test build #123061 has finished for PR 28606 at commit 508a746.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) left a review

I mostly agree with this change, but there is already a PR about this: #26343.

```diff
@@ -103,7 +103,14 @@ private[spark] class YarnAllocatorBlacklistTracker(
     refreshBlacklistedNodes()
   }
 
-  def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
+  def isAllNodeBlacklisted: Boolean = {
+    if (numClusterNodes <= 0) {
```

Inline comment from a Contributor: numClusterNodes == 0 would be better.

@srowen (Member) commented May 25, 2020

@attilapiros, is this minor change correct and complementary, or does it need to be part of the larger PR?

@attilapiros (Contributor) commented
Yes, this is correct on its own.

@cnZach (Contributor, Author) commented May 26, 2020

Thanks @attilapiros for pointing out #26343. The same issue is addressed there, so if #26343 goes in, we can just close this minor PR.

@srowen (Member) commented Jun 1, 2020

Not sure what the status of that PR is, so I will merge this and attach it to the JIRA as a partial fix.

@srowen changed the title from [MINOR][YARN]False report isAllNodeBlacklisted when RM is having issue to [SPARK-29683][YARN]False report isAllNodeBlacklisted when RM is having issue on Jun 1, 2020
@srowen closed this in e70df2c on Jun 1, 2020
sjrand pushed a commit to palantir/spark that referenced this pull request Jun 1, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request Jun 1, 2020
sungpeo pushed a commit to sungpeo/spark that referenced this pull request Jan 3, 2022