Skip to content

[SPARK-32881][CORE] Catch some race condition errors and log them more clearly #29992

Closed
holdenk wants to merge 1 commit intoapache:masterfrom
holdenk:SPARK-32881-error-messaging-on-decom-race-messages
Closed

[SPARK-32881][CORE] Catch some race condition errors and log them more clearly #29992
holdenk wants to merge 1 commit intoapache:masterfrom
holdenk:SPARK-32881-error-messaging-on-decom-race-messages

Conversation

@holdenk
Copy link
Contributor

@holdenk holdenk commented Oct 9, 2020

What changes were proposed in this pull request?

Decommissioning can run out of time resulting in some race condition, these race conditions result in confusing error messages but not negative impact.

Why are the changes needed?

The NPE & element missing errors in the log can create a missunderstanding.

Does this PR introduce any user-facing change?

Logs change.

How was this patch tested?

Existing tests pass.

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34203/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34203/

@SparkQA
Copy link

SparkQA commented Oct 9, 2020

Test build #129599 has finished for PR 29992 at commit 1ca46a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

One K8s IT fails. Let's retrigger.

- Run in client mode. *** FAILED ***

@dongjoon-hyun
Copy link
Member

Retest this please

logWarning(s"Asked to update map output ${mapId} for untracked map status.")
}
} catch {
case e: java.lang.NullPointerException =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we remove java.lang.?

}.toSeq
} catch {
// If the block manager has already exited, nothing to replicate.
case e: java.util.NoSuchElementException =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we import?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be great if we have a warning log here.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @holdenk ! It looks helpful. I left a few comments.

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34215/

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34215/

@SparkQA
Copy link

SparkQA commented Oct 10, 2020

Test build #129612 has finished for PR 29992 at commit 1ca46a4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @holdenk .
Merged to master.

zhengruifeng pushed a commit to zhengruifeng/spark that referenced this pull request Oct 12, 2020
…e clearly

### What changes were proposed in this pull request?

Decommissioning can run out of time resulting in some race condition, these race conditions result in confusing error messages but not negative impact.

### Why are the changes needed?

The NPE & element missing errors in the log can create a missunderstanding.

### Does this PR introduce _any_ user-facing change?
Logs change.

### How was this patch tested?
Existing tests pass.

Closes apache#29992 from holdenk/SPARK-32881-error-messaging-on-decom-race-messages.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
logWarning(s"Asked to update map output ${mapId} for untracked map status.")
}
} catch {
case e: java.lang.NullPointerException =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question: can we avoid catching NullPointerException? It's a bit odd that we catch NullPointerException. We could just switch to if-else I guess.

@HyukjinKwon
Copy link
Member

How did we test this, @holdenk? I can make a quick followup for that.

@HyukjinKwon
Copy link
Member

gentle ping. This PR introduces the very first place that catches and suppresses NullPointerException in the codebase - the second place that catches NullPointerException (mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala) rethrows NPE.

@HyukjinKwon
Copy link
Member

gentile ping. Would you mind asking to address my comments? I thought it wouldn't be difficult to address.

@holdenk
Copy link
Contributor Author

holdenk commented Oct 21, 2020

Hey sorry I didn’t see your comments. To avoid this I don’t think we could do if else, we’d have to introduce locking I think. Since this only happens under a race condition where we don’t care about the data I think it’s ok. If you’d prefer we could also take that part out since the NPE will get eaten later. Happy to review a PR with whatever approach you want or work on a follow up if you want.

@HyukjinKwon
Copy link
Member

I can give a try. Do you have any stacktrace or symptom for this NPE issue? Seems SPARK-32881 did not throw NPE.

@dongjoon-hyun, should SPARK-32881 be resolved after this fix, or is it open mistakenly?

@dongjoon-hyun
Copy link
Member

Thanks, @HyukjinKwon . I resolved SPARK-32881 now.

@dongjoon-hyun
Copy link
Member

Actually, this PR fixes both java.lang.NullPointerException and java.util.NoSuchElementException.
I reported NPE to her personally in addition to SPARK-32881.

@HyukjinKwon
Copy link
Member

That's good. I just wonder if we can just:

  • blockManagerInfo.contains(...) instead of catching java.util.NoSuchElementException
  • Check if something is null, and just ignore instead of catching java.lang.NullPointerException

I don't believe catching java.lang.NullPointerException in a block is a good practice because it will catch all NPE that can potentially happen everywhere.

Would you mind if I ask the stacktrace? At least I want to understand why catching NPE is necessary even if I fail to make a fix.

@holdenk
Copy link
Contributor Author

holdenk commented Oct 21, 2020

So I think we get the null if the mapStatus is deleted during the update. The problem is checking if it's null doesn't guarantee it won't be deleted after the check.

holdenk added a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…e clearly

### What changes were proposed in this pull request?

Decommissioning can run out of time resulting in some race condition, these race conditions result in confusing error messages but not negative impact.

### Why are the changes needed?

The NPE & element missing errors in the log can create a missunderstanding.

### Does this PR introduce _any_ user-facing change?
Logs change.

### How was this patch tested?
Existing tests pass.

Closes apache#29992 from holdenk/SPARK-32881-error-messaging-on-decom-race-messages.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@HyukjinKwon
Copy link
Member

@holdenk, I am very sorry but can you guide me how you tested and/or share a traceback?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants