
[SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in Inbox.process #29764

Closed
wants to merge 2 commits into apache:branch-2.4 from wzhfy:deal_with_fatal_error_2.4

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Sep 15, 2020

This is a backport of pr#29580 to branch-2.4.

What changes were proposed in this pull request?

Processing for ThreadSafeRpcEndpoint is controlled by numActiveThreads in Inbox. Currently, if a fatal error happens during Inbox.process, numActiveThreads is not reduced. Other threads then cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep numActiveThreads correct.

This problem is more serious in previous Spark 2.x versions, since the driver, executor and block manager endpoints are all thread-safe endpoints.

To fix this, we should reduce the number of active threads if a fatal error happens in Inbox.process.
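
To illustrate the idea, here is a minimal, self-contained Scala sketch. It is not the actual Spark patch; `SimpleInbox`, `post`, `process` and `getNumActiveThreads` are hypothetical names that only mirror the role of `Inbox` and `numActiveThreads`. The point it demonstrates is that the active-thread counter must be released even when a fatal error (e.g. `OutOfMemoryError`) escapes the message loop, otherwise other dispatcher threads see the inbox as permanently busy.

```scala
import scala.util.control.NonFatal

// Hypothetical, heavily simplified stand-in for Spark's Inbox.
class SimpleInbox(handle: Any => Unit) {
  private val messages = new java.util.LinkedList[Any]()
  private var numActiveThreads = 0

  def post(msg: Any): Unit = synchronized { messages.add(msg) }

  def getNumActiveThreads: Int = synchronized { numActiveThreads }

  // Called by a dispatcher thread. The counter must stay accurate even if a
  // fatal error escapes the message loop; otherwise other threads treat the
  // inbox as busy forever and the endpoint "hangs".
  def process(): Unit = {
    synchronized { numActiveThreads += 1 }
    try {
      var msg: Any = synchronized { messages.poll() }
      while (msg != null) {
        try handle(msg) catch {
          // Non-fatal errors are handled in place and processing continues.
          case NonFatal(e) => println(s"non-fatal error handling $msg: $e")
        }
        msg = synchronized { messages.poll() }
      }
    } finally {
      // The essence of the fix: always release the slot, even on fatal errors.
      synchronized { numActiveThreads -= 1 }
    }
  }
}
```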

Why are the changes needed?

numActiveThreads is not correct when a fatal error happens, which causes the problem described above.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add a new test.
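
A simplified sketch of such a test (hypothetical names, written against the `SimpleInbox` sketch above rather than the PR's actual test code): a handler that throws a fatal error must still leave the active-thread counter at zero.

```scala
import org.scalatest.funsuite.AnyFunSuite

class SimpleInboxSuite extends AnyFunSuite {
  test("active thread count is reduced when a fatal error happens") {
    // Handler that simulates a fatal error while processing a message.
    val inbox = new SimpleInbox(_ => throw new OutOfMemoryError("boom"))
    inbox.post("hello")
    // The fatal error still propagates to the caller...
    intercept[OutOfMemoryError] { inbox.process() }
    // ...but the counter is back to zero, so other threads are not blocked.
    assert(inbox.getNumActiveThreads == 0)
  }
}
```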

[SPARK-32738][CORE] Should reduce the number of active threads if fatal error happens in `Inbox.process`

### What changes were proposed in this pull request?

Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Currently, if a fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Other threads then cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep `numActiveThreads` correct.

This problem is more serious in previous Spark 2.x versions, since the driver, executor and block manager endpoints are all thread-safe endpoints.

To fix this, we should reduce the number of active threads if a fatal error happens in `Inbox.process`.

### Why are the changes needed?

`numActiveThreads` is not correct when a fatal error happens, which causes the problem described above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a new test.

Closes apache#29580 from wzhfy/deal_with_fatal_error.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

val endpointRef = mock(classOf[NettyRpcEndpointRef])
val dispatcher = mock(classOf[Dispatcher])
val inbox = new Inbox(endpointRef, endpoint)
@wzhfy
Contributor Author

Here in 2.4 we pass an endpointRef as parameter instead of a name in 3.x
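
For illustration only (the 2.4 line is from the diff above; the 3.x line follows this comment and uses a hypothetical endpoint name, not code copied from that branch):

```scala
// branch-2.4: the Inbox is constructed with the endpoint's reference
//   val inbox = new Inbox(endpointRef, endpoint)
// 3.x: the Inbox is constructed with the endpoint's registered name instead
//   val inbox = new Inbox("test-endpoint", endpoint)
```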

@SparkQA

SparkQA commented Sep 15, 2020

Test build #128723 has finished for PR 29764 at commit af64be8.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 16, 2020

retest this please

@SparkQA

SparkQA commented Sep 16, 2020

Test build #128728 has finished for PR 29764 at commit af64be8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 17, 2020

retest this please

@SparkQA

SparkQA commented Sep 17, 2020

Test build #128791 has finished for PR 29764 at commit af64be8.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 17, 2020

retest this please

@SparkQA

SparkQA commented Sep 17, 2020

Test build #128822 has finished for PR 29764 at commit af64be8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 18, 2020

Finally, all tests passed...
cc @cloud-fan @mridulm This is a backport for branch-2.4.

Contributor

@mridulm mridulm left a comment


Thanks @wzhfy !

asfgit pushed a commit that referenced this pull request Sep 18, 2020
[SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

This is a backport of [pr#29580](#29580) to branch-2.4.

### What changes were proposed in this pull request?

Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Currently, if a fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Other threads then cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep `numActiveThreads` correct.

This problem is more serious in previous Spark 2.x versions, since the driver, executor and block manager endpoints are all thread-safe endpoints.

To fix this, we should reduce the number of active threads if a fatal error happens in `Inbox.process`.

### Why are the changes needed?

`numActiveThreads` is not correct when a fatal error happens, which causes the problem described above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a new test.

Closes #29764 from wzhfy/deal_with_fatal_error_2.4.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
@mridulm
Contributor

mridulm commented Sep 18, 2020

@wzhfy Can you close the PR please? It has been merged to branch-2.4.

@wzhfy
Contributor Author

wzhfy commented Sep 19, 2020

@mridulm Closed. Thanks!

@wzhfy wzhfy closed this Sep 19, 2020