
[SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in Inbox.process #29763

Closed · wants to merge 1 commit into branch-3.0

Conversation

@wzhfy (Contributor) commented Sep 15, 2020

This is a backport of PR #29580 to branch-3.0.

### What changes were proposed in this pull request?

Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Currently, if a fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Other threads then cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep `numActiveThreads` correct.

This problem is more serious in earlier Spark 2.x versions, since the driver, executor and block manager endpoints are all thread-safe endpoints.

To fix this, we should reduce the number of active threads if a fatal error happens in `Inbox.process`.
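The fix pattern can be sketched as follows. This is a hypothetical, much-simplified Java model of the inbox logic (Spark's actual `Inbox` is Scala and more involved; the names here are illustrative): the active-thread slot is released in a `finally` block, so even a fatal `Error` thrown while processing a message cannot leave the counter permanently elevated.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of an RPC endpoint inbox.
class Inbox {
    private final Queue<Runnable> messages = new ArrayDeque<>();
    private int numActiveThreads = 0; // guarded by `this`

    synchronized void post(Runnable message) {
        messages.add(message);
    }

    synchronized int numActiveThreads() {
        return numActiveThreads;
    }

    void process() {
        Runnable message;
        synchronized (this) {
            // Model of a thread-safe endpoint: at most one active thread.
            if (numActiveThreads > 0) return;
            message = messages.poll();
            if (message == null) return;
            numActiveThreads++;
        }
        try {
            message.run(); // may throw a fatal Error (e.g. OutOfMemoryError)
        } finally {
            // The fix: release the slot unconditionally, so a fatal error
            // cannot leave the inbox looking permanently busy.
            synchronized (this) {
                numActiveThreads--;
            }
        }
    }
}
```

Without the `finally` block, a fatal error would propagate past the decrement, the counter would stay at 1, and every later `process()` call would return immediately without touching the queue, which is exactly the "hang" described above.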

### Why are the changes needed?

`numActiveThreads` is not correct when a fatal error happens, which causes the problem described above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a new test.

[SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in `Inbox.process`

Closes apache#29580 from wzhfy/deal_with_fatal_error.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

SparkQA commented Sep 15, 2020

Test build #128722 has finished for PR 29763 at commit 589061c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm (Contributor) left a comment


Merging, thanks @wzhfy!

asfgit pushed a commit that referenced this pull request Sep 17, 2020
[SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in `Inbox.process`
Closes #29763 from wzhfy/deal_with_fatal_error_3.0.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

mridulm commented Sep 17, 2020

Weird, not sure why this did not get closed after merge - I see it in branch-3.0.
@HyukjinKwon Any thoughts?

@HyukjinKwon (Member)

Oh, it doesn't get closed if it targets other branches. It should be closed manually.


mridulm commented Sep 18, 2020

Thanks for clarifying @HyukjinKwon, I was not aware of that!
I learn something new every day :-)

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
[SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in `Inbox.process`
Closes apache#29763 from wzhfy/deal_with_fatal_error_3.0.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>