[SPARK-32881][CORE] Catch some race condition errors and log them more clearly #29992
holdenk wants to merge 1 commit into apache:master
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #129599 has finished for PR 29992 at commit
|
|
One K8s IT fails. Let's retrigger. |
|
Retest this please |
      logWarning(s"Asked to update map output ${mapId} for untracked map status.")
    }
  } catch {
    case e: java.lang.NullPointerException =>
Shall we remove java.lang.?
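For context on the suggestion above: Scala imports `java.lang._` implicitly, so the exception type can be named without the package prefix. A minimal sketch of that point, using a hypothetical `safeLength` helper that is not part of this PR:

```scala
object NpeCatchSketch {
  // Hypothetical helper: returns the length of s, or -1 if s is null.
  def safeLength(s: String): Int =
    try s.length
    catch {
      // No java.lang. prefix needed: java.lang._ is imported implicitly in Scala.
      case _: NullPointerException => -1
    }
}
```

The sketch only illustrates the naming; whether catching the exception is the right design is a separate question discussed later in this thread.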
    }.toSeq
  } catch {
    // If the block manager has already exited, nothing to replicate.
    case e: java.util.NoSuchElementException =>
Also, it would be great if we have a warning log here.
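A rough sketch of what a warning log in this catch block could look like. Everything here is an assumption for illustration: `peersFor`, the `registry` map, and the message text are hypothetical, and `logWarning` is a local stand-in that records messages rather than Spark's own logging method.

```scala
import scala.collection.mutable.ArrayBuffer

object ReplicationPeersSketch {
  // Stand-in for a logWarning method; collects messages so the behavior is observable.
  val warnings: ArrayBuffer[String] = ArrayBuffer.empty[String]
  def logWarning(msg: String): Unit = warnings += msg

  // Hypothetical lookup of replication peers for an executor.
  def peersFor(executorId: String, registry: Map[String, Seq[String]]): Seq[String] = {
    try {
      // Map.apply throws NoSuchElementException when the key is absent.
      registry(executorId)
    } catch {
      case _: NoSuchElementException =>
        // Log the race instead of letting the raw exception surface.
        logWarning(s"Block manager for $executorId has already exited; nothing to replicate.")
        Seq.empty
    }
  }
}
```

With this shape, a missing entry yields an empty result plus one readable warning instead of a stack trace.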
dongjoon-hyun
left a comment
Thank you, @holdenk ! It looks helpful. I left a few comments.
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #129612 has finished for PR 29992 at commit
|
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @holdenk .
Merged to master.
…e clearly
### What changes were proposed in this pull request?
Decommissioning can run out of time, which leads to race conditions. These race conditions produce confusing error messages but have no negative impact.
### Why are the changes needed?
The NPE and missing-element errors in the log can create a misunderstanding.
### Does this PR introduce _any_ user-facing change?
Logs change.
### How was this patch tested?
Existing tests pass.
Closes apache#29992 from holdenk/SPARK-32881-error-messaging-on-decom-race-messages.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
      logWarning(s"Asked to update map output ${mapId} for untracked map status.")
    }
  } catch {
    case e: java.lang.NullPointerException =>
Quick question: can we avoid catching NullPointerException? It's a bit odd that we catch NullPointerException. We could just switch to if-else I guess.
|
How did we test this, @holdenk? I can make a quick followup for that. |
|
gentle ping. This PR introduces the very first place that catches and suppresses |
|
Gentle ping. Would you mind addressing my comments? I thought they wouldn't be difficult to address. |
|
Hey, sorry, I didn't see your comments. I don't think an if-else would avoid this; I think we'd have to introduce locking. Since this only happens under a race condition where we don't care about the data, I think it's OK. If you'd prefer, we could also take that part out, since the NPE will get eaten later. Happy to review a PR with whatever approach you want, or to work on a follow-up if you want. |
|
I can give it a try. Do you have a stack trace or symptoms for this NPE issue? It seems SPARK-32881 did not throw an NPE. @dongjoon-hyun, should SPARK-32881 be resolved after this fix, or was it left open by mistake? |
|
Thanks, @HyukjinKwon . I resolved SPARK-32881 now. |
|
Actually, this PR fixes both |
|
That's good. I just wonder if we can just:
I don't believe catching
Would you mind sharing the stack trace? At least I want to understand why catching the NPE is necessary, even if I fail to make a fix. |
|
So I think we get the null if the mapStatus is deleted during the update. The problem is checking if it's null doesn't guarantee it won't be deleted after the check. |
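The check-then-act problem described here can be sketched in isolation. The code below is a hypothetical illustration, not Spark's implementation: `Status`, `updateLocation`, and the use of `AtomicReferenceArray` are all assumptions. It reads each slot exactly once into a local value, so a concurrent deletion after the read cannot turn a passed null check into an NPE. It does not show what Spark's tracker actually needs, and it does not prevent the update from landing on a status that was removed in flight.

```scala
import java.util.concurrent.atomic.AtomicReferenceArray

object MapStatusRaceSketch {
  // Hypothetical stand-in for a tracked map status; not Spark's MapStatus.
  final case class Status(mapId: Long, var location: String)

  // Returns true if a status with the given id was found and updated.
  def updateLocation(
      slots: AtomicReferenceArray[Status],
      mapId: Long,
      newLocation: String): Boolean = {
    var i = 0
    while (i < slots.length()) {
      // Read the slot once. A concurrent writer may null the slot afterwards,
      // but the local reference stays valid for the rest of this iteration.
      Option(slots.get(i)) match {
        case Some(s) if s.mapId == mapId =>
          s.location = newLocation
          return true
        case _ => // empty slot or a different id: keep scanning
      }
      i += 1
    }
    false
  }
}
```

By contrast, a pattern that checks `slots.get(i) != null` and then calls `slots.get(i).mapId` reads the slot twice, which is the window the comment above is describing.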
|
@holdenk, I am very sorry, but can you walk me through how you tested this and/or share a traceback? |
What changes were proposed in this pull request?
Decommissioning can run out of time, which leads to race conditions. These race conditions produce confusing error messages but have no negative impact.
Why are the changes needed?
The NPE and missing-element errors in the log can create a misunderstanding.
Does this PR introduce any user-facing change?
Logs change.
How was this patch tested?
Existing tests pass.