[SPARK-19147][CORE] Gracefully handle error in task after executor is stopped #25759

Closed
wants to merge 1 commit into from

@colinmjj colinmjj commented Sep 11, 2019

What changes were proposed in this pull request?

TransportClientFactory.createClient() is called by tasks, while TransportClientFactory.close() is called by the executor.
When the executor is stopped, close() sets workerGroup = null, so a later createClient() call throws an NPE, which generates many exceptions in the log.
An exception that occurs after close() is treated as an expected exception
and transformed into an InterruptedException, which can be processed by the Executor.

Why are the changes needed?

The change reduces the exception stack traces in the log file, so users won't be confused by these expected exceptions.

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

New tests are added in TransportClientFactorySuite and ExecutorSuite

@srowen
Member

srowen commented Sep 11, 2019

Wait, why would we be creating a new client after the executor is shut down?

@colinmjj
Author

Here is the scenario: the executor is killed before the shuffle process, but a task has already been created and is ready to fetch blocks. The NPE then occurs if the task tries to create a new client.
Such an NPE makes it harder for users to find out why the executor was killed.
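
The scenario above can be reproduced in miniature. The following Java sketch is not Spark's code: `SketchFactory` and its `workerGroup` field are hypothetical stand-ins that mimic only the pattern described in this PR, where close() nulls a field that createClient() later dereferences.

```java
// Hypothetical stand-in for a client factory; NOT Spark's TransportClientFactory.
final class SketchFactory {
    // Stand-in for the Netty worker group; cleared by close().
    private StringBuilder workerGroup = new StringBuilder("worker-group");

    // Called by task threads.
    String createClient(String host, int port) {
        // Dereferences workerGroup, so this throws NullPointerException after close().
        return workerGroup.toString() + "->" + host + ":" + port;
    }

    // Called by the executor on shutdown.
    void close() {
        workerGroup = null;
    }
}

final class NpeAfterCloseDemo {
    public static void main(String[] args) {
        SketchFactory factory = new SketchFactory();
        factory.close(); // the executor stops first
        try {
            factory.createClient("localhost", 7077); // a task then asks for a client
        } catch (NullPointerException e) {
            System.out.println("NPE after close(): " + e);
        }
    }
}
```

In a real executor the two calls happen on different threads, so the failure depends on timing; the sketch forces the ordering to make the failure deterministic.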

// and transform it to InterruptedException which can be processed by Executor.
// See SPARK-19147
if (workerGroup == null) {
throw new InterruptedException(e.getMessage());
Member

This is still going to generate an exception in the logs, no? Should it just be a log warning?
This is, I think, too indirect. Why not throw IllegalStateException in createClient in this case and catch it specifically?

@srowen
Member

srowen commented Sep 12, 2019

Also please improve the title of this PR

@colinmjj colinmjj changed the title [SPARK-19147][CORE] netty throw NPE [SPARK-19147][CORE] Avoid task to throw NPE caused by TransportClientFactory.createClient after executor stop Sep 13, 2019
@colinmjj
Author

@srowen thanks for the comments, I'll update the pr later.

@HyukjinKwon
Member

@colinmjj, please also fill in the other items in the PR description.

@SparkQA

SparkQA commented Sep 16, 2019

Test build #4871 has finished for PR 25759 at commit b2510fb.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@colinmjj colinmjj changed the title [SPARK-19147][CORE] Avoid task to throw NPE caused by TransportClientFactory.createClient after executor stop [SPARK-19147][CORE] Gracefully handle error in task after executor is stopped Sep 17, 2019
@colinmjj
Author

@srowen The patch is updated: it now adds exception handling for exceptions thrown from a task after executor.stop.

try {
// NullPointerException occurred if factory.createClient() after factory.close()
factory.createClient(TestUtils.getLocalHost(), server1.getPort());
} catch (Exception e) {
Member

Why not just catch NullPointerException and ignore it?
But, maybe createClient should throw a better exception to begin with?

Author

The test shows that an exception occurs if TransportClientFactory.createClient() is called after TransportClientFactory.close().
I agree that it should throw a better exception, and the patch is updated. It should throw an IOException now.
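
One way to "throw a better exception" is sketched below in Java, under assumptions: `GuardedFactory` is hypothetical, and the exact check and message used in the actual patch are not shown in this thread. The idea is to test for the closed state up front and raise an IOException instead of letting a NullPointerException escape.

```java
import java.io.IOException;

// Hypothetical sketch; NOT the code from this PR.
final class GuardedFactory {
    // Stand-in for the worker group; volatile so a close() on another thread is visible.
    private volatile Object workerGroup = new Object();

    String createClient(String host, int port) throws IOException {
        // Read the field once so the null check and the use see the same value.
        Object group = workerGroup;
        if (group == null) {
            // Assumed message text, for illustration only.
            throw new IOException("createClient() called after factory was closed");
        }
        return host + ":" + port;
    }

    void close() {
        workerGroup = null;
    }
}

final class GuardedFactoryDemo {
    public static void main(String[] args) {
        GuardedFactory factory = new GuardedFactory();
        factory.close();
        try {
            factory.createClient("localhost", 7077);
        } catch (IOException e) {
            System.out.println("Expected failure: " + e.getMessage());
        }
    }
}
```

A checked IOException lets callers that already handle fetch failures deal with this case on the same path, which is presumably why it reads better than an NPE in the logs.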

// The exception will be thrown from the task because of the unexpected status,
// see: SPARK-19147, here is to process the exception after executor.stop
// as the expected exception.
case t: Throwable if !isLocal && env.isStopped =>
Member

CC @jerryshao or @squito perhaps
My question is, if the executor is shut down, can you even report metrics etc, or is it meaningful?

Contributor

You might succeed, as this will race against the stopping of the executor. But you're very likely to trigger more exceptions from execBackend.statusUpdate, so it probably doesn't make sense to try, especially if the whole point of this change is to cut down on scary error msgs during shutdown.

btw I think env.isStopped will need to be volatile for this to work reliably.
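
The visibility concern can be illustrated with a small Java sketch. `SketchEnv` is a hypothetical stand-in, not SparkEnv; it only shows why a stop flag that is written by one thread and read by another is typically declared volatile.

```java
// Hypothetical stand-in for an environment object with a stop flag.
final class SketchEnv {
    // Without volatile, the memory model gives no guarantee that a reader
    // thread ever observes the write made by stop().
    private volatile boolean stopped = false;

    boolean isStopped() {
        return stopped;
    }

    void stop() {
        stopped = true;
    }
}

final class VolatileFlagDemo {
    public static void main(String[] args) throws InterruptedException {
        SketchEnv env = new SketchEnv();
        // Stand-in for a task thread that polls the flag.
        Thread task = new Thread(() -> {
            while (!env.isStopped()) {
                Thread.yield();
            }
        });
        task.start();
        env.stop();      // executor shutdown on another thread
        task.join(5000); // wait up to 5 seconds for the task to notice
        System.out.println("task observed stop: " + !task.isAlive());
    }
}
```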

Author

@srowen @squito, thanks for the comments. I checked the code again and worked out how metrics and heartbeats work. You're right: reporting metrics is meaningless after executor.close(), because the heartbeat won't work.
I updated the PR, and the exception is now handled in the "case t: Throwable =>" branch with logging only.

@@ -246,6 +246,46 @@ class ExecutorSuite extends SparkFunSuite
heartbeatZeroAccumulatorUpdateTest(false)
}

test("SPARK-19147: Gracefully handle error in task after executor is stopped") {
Member

The purpose of the change is really to avoid stack traces, not assert about the particular error. Would this pass even before this change? I'm just wondering if this is worth testing.

What may be worth testing is whether metrics are updated, which is the real possible behavior change here? or would they already be reported in case of an error?

Author

I removed the test case now that I understand how metrics work. Thanks for the review.

case _: NotSerializableException =>
// t is not serializable so just send the stacktrace
val ef = new ExceptionFailure(t, accUpdates, false).withAccums(accums)
if (env.isStopped) {
Member

This is looking OK overall, to me. You might be able to avoid most of the diff due to indentation by only adding a single case:

case t: Throwable if env.isStopped =>
  logError(...)
case t: Throwable =>
  // unchanged

Author

Done
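
The single-guarded-case suggestion can be mirrored in Java for readers less familiar with Scala pattern guards. This is an illustrative analogue only: `TaskFailureSketch`, `runAndClassify`, and the returned labels are hypothetical and do not correspond to names in Executor.scala.

```java
import java.util.function.BooleanSupplier;

// Hypothetical analogue of "case t: Throwable if env.isStopped => logError(...)".
final class TaskFailureSketch {
    static String runAndClassify(Runnable taskBody, BooleanSupplier isStopped) {
        try {
            taskBody.run();
            return "finished";
        } catch (Throwable t) {
            if (isStopped.getAsBoolean()) {
                // Executor already stopped: log only, skip metrics and status updates.
                System.err.println("Exception in task after executor stopped: " + t);
                return "logged-only";
            }
            // Unchanged path: the failure would be reported as before.
            return "reported-failure";
        }
    }
}

final class TaskFailureDemo {
    public static void main(String[] args) {
        Runnable failing = () -> { throw new IllegalStateException("boom"); };
        System.out.println(TaskFailureSketch.runAndClassify(failing, () -> true));
        System.out.println(TaskFailureSketch.runAndClassify(failing, () -> false));
    }
}
```

Putting the stopped check first keeps the existing failure path untouched, which is what keeps the diff small.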

@SparkQA

SparkQA commented Sep 19, 2019

Test build #4876 has finished for PR 25759 at commit 8fa404c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@srowen srowen closed this in 076186e Sep 21, 2019
@srowen
Member

srowen commented Sep 21, 2019

Merged to master

@dongjoon-hyun
Member

Hi, All.
Can we have this in branch-2.4, since SPARK-19147 is a long-standing issue?

@srowen
Member

srowen commented Sep 21, 2019

Although it's 'just' a cosmetic issue with how errors are logged in a sort of corner case, it's also a minor change. I'm fine if you want to back-port to 2.4.

@dongjoon-hyun
Member

Thanks, @srowen . I'll backport this to branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request Sep 21, 2019
… stopped

Closes #25759 from colinmjj/spark-19147.

Authored-by: colinma <colinma@tencent.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit 076186e)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>