
[SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in Inbox.process #29764

Closed
wants to merge 2 commits into apache:branch-2.4 from wzhfy:deal_with_fatal_error_2.4

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Sep 15, 2020

This is a backport of pr#29580 to branch-2.4.

What changes were proposed in this pull request?

Processing for ThreadSafeRpcEndpoint is controlled by numActiveThreads in Inbox. Currently, if a fatal error happens during Inbox.process, numActiveThreads is not reduced. Other threads then cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep numActiveThreads correct.

This problem is more serious in previous Spark 2.x versions, since the driver, executor and block manager endpoints are all thread-safe endpoints.

To fix this, we should reduce the number of active threads if a fatal error happens in Inbox.process.
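
To illustrate the idea, here is a minimal, self-contained Scala sketch. It is not the actual Spark patch; `SimpleInbox`, `post`, `process` and `getNumActiveThreads` are hypothetical names that only mirror the role of `Inbox` and `numActiveThreads`. The point it demonstrates is that the active-thread counter must be released even when a fatal error (e.g. `OutOfMemoryError`) escapes the message loop, otherwise other dispatcher threads see the inbox as permanently busy.

```scala
import scala.util.control.NonFatal

// Hypothetical, heavily simplified stand-in for Spark's Inbox.
class SimpleInbox(handle: Any => Unit) {
  private val messages = new java.util.LinkedList[Any]()
  private var numActiveThreads = 0

  def post(msg: Any): Unit = synchronized { messages.add(msg) }

  def getNumActiveThreads: Int = synchronized { numActiveThreads }

  // Called by a dispatcher thread. The counter must stay accurate even if a
  // fatal error escapes the message loop; otherwise other threads treat the
  // inbox as busy forever and the endpoint "hangs".
  def process(): Unit = {
    synchronized { numActiveThreads += 1 }
    try {
      var msg: Any = synchronized { messages.poll() }
      while (msg != null) {
        try handle(msg) catch {
          // Non-fatal errors are handled in place and processing continues.
          case NonFatal(e) => println(s"non-fatal error handling $msg: $e")
        }
        msg = synchronized { messages.poll() }
      }
    } finally {
      // The essence of the fix: always release the slot, even on fatal errors.
      synchronized { numActiveThreads -= 1 }
    }
  }
}
```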

Why are the changes needed?

numActiveThreads is not correct when a fatal error happens, which causes the problem described above.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add a new test.
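
A simplified sketch of such a test (hypothetical names, written against the `SimpleInbox` sketch above rather than the PR's actual test code): a handler that throws a fatal error must still leave the active-thread counter at zero.

```scala
import org.scalatest.funsuite.AnyFunSuite

class SimpleInboxSuite extends AnyFunSuite {
  test("active thread count is reduced when a fatal error happens") {
    // Handler that simulates a fatal error while processing a message.
    val inbox = new SimpleInbox(_ => throw new OutOfMemoryError("boom"))
    inbox.post("hello")
    // The fatal error still propagates to the caller...
    intercept[OutOfMemoryError] { inbox.process() }
    // ...but the counter is back to zero, so other threads are not blocked.
    assert(inbox.getNumActiveThreads == 0)
  }
}
```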

[SPARK-32738][CORE] Should reduce the number of active threads if fatal error happens in `Inbox.process`

### What changes were proposed in this pull request?

Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Currently, if a fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Other threads then cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep `numActiveThreads` correct.

This problem is more serious in previous Spark 2.x versions, since the driver, executor and block manager endpoints are all thread-safe endpoints.

To fix this, we should reduce the number of active threads if a fatal error happens in `Inbox.process`.

### Why are the changes needed?

`numActiveThreads` is not correct when a fatal error happens, which causes the problem described above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a new test.

Closes apache#29580 from wzhfy/deal_with_fatal_error.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

val endpointRef = mock(classOf[NettyRpcEndpointRef])
val dispatcher = mock(classOf[Dispatcher])
val inbox = new Inbox(endpointRef, endpoint)
@wzhfy
Contributor Author

Here in 2.4 we pass an endpointRef as parameter instead of a name in 3.x
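
For illustration only (the 2.4 line is from the diff above; the 3.x line follows this comment and uses a hypothetical endpoint name, not code copied from that branch):

```scala
// branch-2.4: the Inbox is constructed with the endpoint's reference
//   val inbox = new Inbox(endpointRef, endpoint)
// 3.x: the Inbox is constructed with the endpoint's registered name instead
//   val inbox = new Inbox("test-endpoint", endpoint)
```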

@SparkQA

SparkQA commented Sep 15, 2020

Test build #128723 has finished for PR 29764 at commit af64be8.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 16, 2020

retest this please

@SparkQA

SparkQA commented Sep 16, 2020

Test build #128728 has finished for PR 29764 at commit af64be8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 17, 2020

retest this please

@SparkQA

SparkQA commented Sep 17, 2020

Test build #128791 has finished for PR 29764 at commit af64be8.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 17, 2020

retest this please

@SparkQA

SparkQA commented Sep 17, 2020

Test build #128822 has finished for PR 29764 at commit af64be8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 18, 2020

Finally, all tests passed...
cc @cloud-fan @mridulm This is a backport for branch-2.4.

Contributor

@mridulm mridulm left a comment


Thanks @wzhfy !

asfgit pushed a commit that referenced this pull request Sep 18, 2020
[SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

This is a backport of [pr#29580](#29580) to branch-2.4.

### What changes were proposed in this pull request?

Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Currently, if a fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Other threads then cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep `numActiveThreads` correct.

This problem is more serious in previous Spark 2.x versions, since the driver, executor and block manager endpoints are all thread-safe endpoints.

To fix this, we should reduce the number of active threads if a fatal error happens in `Inbox.process`.

### Why are the changes needed?

`numActiveThreads` is not correct when a fatal error happens, which causes the problem described above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a new test.

Closes #29764 from wzhfy/deal_with_fatal_error_2.4.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
@mridulm
Contributor

mridulm commented Sep 18, 2020

@wzhfy Can you close the PR please? It has been merged to branch-2.4.

@wzhfy
Contributor Author

wzhfy commented Sep 19, 2020

@mridulm Closed. Thanks!

@wzhfy wzhfy closed this Sep 19, 2020