
[SPARK-32738][CORE] Should reduce the number of active threads if fatal error happens in Inbox.process #29580

Closed
wants to merge 4 commits

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Aug 30, 2020

What changes were proposed in this pull request?

Processing for ThreadSafeRpcEndpoint is controlled by numActiveThreads in Inbox. Currently, if a fatal error happens during Inbox.process, numActiveThreads is not reduced. Other threads then cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep numActiveThreads correct.

This problem is more serious in previous Spark 2.x versions since the driver, executor and block manager endpoints are all thread safe endpoints.

To fix this, we should reduce the number of active threads if fatal error happens in Inbox.process.
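The fix can be sketched as follows. This is an illustrative toy model in Python, not Spark's Scala implementation; the names (`num_active_threads`, `process`, `post`) mirror the real Inbox, but the code is hypothetical. The key point is releasing the processing slot in a `finally` block, so even an error that escapes the handler cannot leave the inbox marked busy.

```python
import threading


class Inbox:
    """Toy model of Spark's rpc.netty.Inbox; names mirror the Scala code,
    but the implementation is an illustrative sketch, not Spark's."""

    def __init__(self):
        self.lock = threading.Lock()         # stands in for Inbox's `synchronized` blocks
        self.num_active_threads = 0
        self.messages = []

    def post(self, message):
        with self.lock:
            self.messages.append(message)

    def process(self, handler):
        with self.lock:
            if self.num_active_threads > 0:  # thread-safe endpoint: one processor at a time
                return
            if not self.messages:
                return
            message = self.messages.pop(0)
            self.num_active_threads += 1
        try:
            handler(message)                 # may raise anything, including fatal errors
        finally:
            # The essence of the fix: release the slot on *every* exit path,
            # so a fatal error cannot leave the inbox permanently "busy".
            with self.lock:
                self.num_active_threads -= 1


def oom_handler(message):
    raise MemoryError(message)               # plays the role of a fatal JVM error


def demo():
    inbox = Inbox()
    inbox.post("boom")
    inbox.post("ok")
    try:
        inbox.process(oom_handler)
    except MemoryError:
        pass
    processed = []
    inbox.process(processed.append)          # still works: the slot was released
    return inbox.num_active_threads, processed
```

Before the fix, the decrement ran only on the non-error path, so after the MemoryError the second `process` call would see `num_active_threads == 1` and return without ever delivering "ok".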

Why are the changes needed?

numActiveThreads is not correct when a fatal error happens, which causes the problem described above.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add a new test.

@wzhfy
Contributor Author

wzhfy commented Aug 30, 2020

cc @vanzin @cloud-fan could you please review this PR? Thanks!

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128026 has finished for PR 29580 at commit 58ca21d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128029 has finished for PR 29580 at commit eb8b0b3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128030 has finished for PR 29580 at commit 9a78c8b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Aug 30, 2020

retest this please

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128034 has finished for PR 29580 at commit 9a78c8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@mridulm mridulm left a comment


Thanks for fixing this, just had a minor question.

core/src/main/scala/org/apache/spark/rpc/netty/Inbox.scala (review comment, resolved)
@mridulm
Contributor

mridulm commented Sep 2, 2020

Then other threads can not process messages in that inbox, which causes the endpoint to hang

Other than when the inbox has been stopped, this should not happen.
Are you referring to that case, or to other cases?

@cloud-fan
Contributor

cc @zsxwing @jiangxb1987

@wzhfy
Contributor Author

wzhfy commented Sep 3, 2020

Then other threads can not process messages in that inbox, which causes the endpoint to hang

Other than when the inbox has been stopped, this should not happen.
Are you referring to that case, or to other cases?

@mridulm In our case, messages for DriverEndpoint couldn't get processed after an OOM happened in a dispatcher thread. The cluster's Spark version is 2.3, but I think the same problem exists in 2.4, and for other endpoints in 3.0 (DriverEndpoint becomes an IsolatedRpcEndpoint instead of a ThreadSafeRpcEndpoint in 3.x).

IIUC, an inbox is stopped only when it's unregistered.
When a dispatcher thread is processing messages in an inbox and a fatal error (e.g. OOM) happens, the thread just throws the error.
I can't find any place that stops the inbox in this case.
Please correct me if I'm wrong, thanks!
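The failure mode described above can be modeled with a small sketch (hypothetical code, not Spark's; Python's SystemExit stands in for a fatal JVM error such as OOM). Only non-fatal exceptions restore the counter, so a fatal error leaves the inbox permanently "busy" and later calls return without processing anything:

```python
import threading


class PreFixInbox:
    """Models the pre-fix behaviour of Inbox.process: the active-thread
    counter is only restored on the normal and non-fatal paths."""

    def __init__(self):
        self.lock = threading.Lock()
        self.num_active_threads = 0
        self.messages = []

    def post(self, message):
        with self.lock:
            self.messages.append(message)

    def process(self, handler):
        with self.lock:
            if self.num_active_threads > 0:
                return False                 # inbox looks busy; give up
            if not self.messages:
                return False
            message = self.messages.pop(0)
            self.num_active_threads += 1
        try:
            handler(message)
        except Exception:
            pass                             # non-fatal errors are handled here
        # A fatal error (SystemExit here, OOM on the JVM) propagates past the
        # except clause above, so this decrement never runs and the inbox
        # stays "busy" forever.
        with self.lock:
            self.num_active_threads -= 1
        return True


def fatal_handler(message):
    raise SystemExit(message)                # stand-in for a fatal JVM error


def demo():
    inbox = PreFixInbox()
    inbox.post("boom")
    inbox.post("ok")
    try:
        inbox.process(fatal_handler)
    except SystemExit:
        pass
    # The counter is stuck at 1, so "ok" can never be processed: the "hang".
    return inbox.process(lambda m: None), inbox.num_active_threads
```

The second `process` call returns False even though a message is waiting, which is exactly the stuck-endpoint symptom reported here.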

@wzhfy wzhfy changed the title [SPARK-32738][CORE] Should reduce the number of active threads if fatal exception happens in Inbox.process [SPARK-32738][CORE] Should reduce the number of active threads if fatal error happens in Inbox.process Sep 3, 2020
Member

@Ngone51 Ngone51 left a comment


The fix looks reasonable to me.

@Ngone51
Member

Ngone51 commented Sep 3, 2020

I guess in 2.4 and higher versions, the OOM can be caught by SparkUncaughtExceptionHandler, which then exits the driver process.
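A hedged sketch of the mechanism mentioned above (hypothetical code: Python's `threading.excepthook` plays the role of the JVM's uncaught-exception handler, and this handler records the error instead of exiting so the sketch is safe to run):

```python
import threading

fatal_errors = []


def uncaught_handler(args):
    # Spark's SparkUncaughtExceptionHandler would halt the process here on
    # OutOfMemoryError; this sketch only records what escaped the thread.
    fatal_errors.append((args.thread.name, type(args.exc_value).__name__))


threading.excepthook = uncaught_handler


def dispatcher_loop():
    # An error thrown out of a dispatcher thread's run loop reaches the
    # installed uncaught-exception handler.
    raise MemoryError("simulated OOM in a dispatcher thread")


t = threading.Thread(target=dispatcher_loop, name="dispatcher-event-loop-0")
t.start()
t.join()
```

Even when such a handler exits the process, keeping numActiveThreads correct is still worthwhile, since not every fatal-error path necessarily terminates the JVM.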

@SparkQA

SparkQA commented Sep 4, 2020

Test build #128304 has finished for PR 29580 at commit dcf29c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 9, 2020

kindly ping @mridulm @cloud-fan @Ngone51

@Ngone51
Member

Ngone51 commented Sep 9, 2020

LGTM.

@SparkQA

SparkQA commented Sep 10, 2020

Test build #128505 has finished for PR 29580 at commit c279e94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

retest this please

@SparkQA

SparkQA commented Sep 10, 2020

Test build #128535 has finished for PR 29580 at commit c279e94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 14, 2020

gentle ping @jiangxb1987 Are there any more comments?

@tgravescs
Contributor

I'm a little confused by the description here, can you please elaborate? Is the problem that a fatal error just kills the thread and is swallowed? Or is it the accounting of the number of threads, i.e. we think we have more active threads than we really do, so something hangs?

@tgravescs
Contributor

sorry took a closer look and it makes sense so ignore my question.

@jiangxb1987
Contributor

LGTM

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 99384d1 Sep 15, 2020
@cloud-fan
Contributor

@wzhfy can you open backport PRs for 3.0 and 2.4? thanks!

@wzhfy
Contributor Author

wzhfy commented Sep 15, 2020

@cloud-fan Sure, I'll submit backport PRs soon.
Thanks for reviewing! @mridulm @Ngone51 @cloud-fan @jiangxb1987 @tgravescs

wzhfy added a commit to wzhfy/spark that referenced this pull request Sep 15, 2020
…al error happens in `Inbox.process`

Closes apache#29580 from wzhfy/deal_with_fatal_error.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Sep 17, 2020
…f fatal error happens in `Inbox.process`

This is a backport for [pr#29580](#29580) to branch 3.0.

Closes #29763 from wzhfy/deal_with_fatal_error_3.0.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
asfgit pushed a commit that referenced this pull request Sep 18, 2020
…f fatal error happens in `Inbox.process`

This is a backport for [pr#29580](#29580) to branch 2.4.

Closes #29764 from wzhfy/deal_with_fatal_error_2.4.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…f fatal error happens in `Inbox.process`

This is a backport for [pr#29580](apache#29580) to branch 3.0.

Closes apache#29763 from wzhfy/deal_with_fatal_error_3.0.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
7 participants