
[SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in Inbox.process #29763

Closed · wants to merge 1 commit into branch-3.0

Conversation

@wzhfy (Contributor) commented Sep 15, 2020

This is a backport of PR #29580 to branch-3.0.

### What changes were proposed in this pull request?

Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Currently, if a fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Other threads then cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep `numActiveThreads` correct.

This problem is more serious in earlier Spark 2.x versions, since the driver, executor and block manager endpoints are all thread-safe endpoints.

To fix this, we should reduce the number of active threads if a fatal error happens in `Inbox.process`.
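The fix pattern can be sketched as follows. This is a hypothetical, much-simplified Java model of the inbox logic (Spark's actual `Inbox` is Scala and more involved; the names here are illustrative): the active-thread slot is released in a `finally` block, so even a fatal `Error` thrown while processing a message cannot leave the counter permanently elevated.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of an RPC endpoint inbox.
class Inbox {
    private final Queue<Runnable> messages = new ArrayDeque<>();
    private int numActiveThreads = 0; // guarded by `this`

    synchronized void post(Runnable message) {
        messages.add(message);
    }

    synchronized int numActiveThreads() {
        return numActiveThreads;
    }

    void process() {
        Runnable message;
        synchronized (this) {
            // Model of a thread-safe endpoint: at most one active thread.
            if (numActiveThreads > 0) return;
            message = messages.poll();
            if (message == null) return;
            numActiveThreads++;
        }
        try {
            message.run(); // may throw a fatal Error (e.g. OutOfMemoryError)
        } finally {
            // The fix: release the slot unconditionally, so a fatal error
            // cannot leave the inbox looking permanently busy.
            synchronized (this) {
                numActiveThreads--;
            }
        }
    }
}
```

Without the `finally` block, a fatal error would propagate past the decrement, the counter would stay at 1, and every later `process()` call would return immediately without touching the queue, which is exactly the "hang" described above.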

### Why are the changes needed?

`numActiveThreads` is not correct when a fatal error happens, which causes the problem described above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a new test.

[SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in `Inbox.process`

Closes apache#29580 from wzhfy/deal_with_fatal_error.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

SparkQA commented Sep 15, 2020

Test build #128722 has finished for PR 29763 at commit 589061c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm (Contributor) left a comment


Merging, thanks @wzhfy!

asfgit pushed a commit that referenced this pull request Sep 17, 2020
[SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in `Inbox.process`
Closes #29763 from wzhfy/deal_with_fatal_error_3.0.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

mridulm commented Sep 17, 2020

Weird, not sure why this did not get closed after merge - I see it in branch-3.0.
@HyukjinKwon Any thoughts?

@HyukjinKwon (Member)

Oh, it doesn't get closed if it targets other branches. It should be closed manually.


mridulm commented Sep 18, 2020

Thanks for clarifying @HyukjinKwon, I was not aware of that!
I learn something new every day :-)

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
[SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in `Inbox.process`
Closes apache#29763 from wzhfy/deal_with_fatal_error_3.0.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>