
[SPARK-32738][CORE] Should reduce the number of active threads if fatal error happens in Inbox.process #29580

Closed
wants to merge 4 commits

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Aug 30, 2020

What changes were proposed in this pull request?

Processing for ThreadSafeRpcEndpoint is controlled by numActiveThreads in Inbox. Currently, if a fatal error happens during Inbox.process, numActiveThreads is not reduced. Other threads then cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep numActiveThreads correct.

This problem is more serious in previous Spark 2.x versions since the driver, executor and block manager endpoints are all thread safe endpoints.

To fix this, we should reduce the number of active threads if fatal error happens in Inbox.process.
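The fix can be sketched as follows. This is an illustrative toy model in Python, not Spark's Scala implementation; the names (`num_active_threads`, `process`, `post`) mirror the real Inbox, but the code is hypothetical. The key point is releasing the processing slot in a `finally` block, so even an error that escapes the handler cannot leave the inbox marked busy.

```python
import threading


class Inbox:
    """Toy model of Spark's rpc.netty.Inbox; names mirror the Scala code,
    but the implementation is an illustrative sketch, not Spark's."""

    def __init__(self):
        self.lock = threading.Lock()         # stands in for Inbox's `synchronized` blocks
        self.num_active_threads = 0
        self.messages = []

    def post(self, message):
        with self.lock:
            self.messages.append(message)

    def process(self, handler):
        with self.lock:
            if self.num_active_threads > 0:  # thread-safe endpoint: one processor at a time
                return
            if not self.messages:
                return
            message = self.messages.pop(0)
            self.num_active_threads += 1
        try:
            handler(message)                 # may raise anything, including fatal errors
        finally:
            # The essence of the fix: release the slot on *every* exit path,
            # so a fatal error cannot leave the inbox permanently "busy".
            with self.lock:
                self.num_active_threads -= 1


def oom_handler(message):
    raise MemoryError(message)               # plays the role of a fatal JVM error


def demo():
    inbox = Inbox()
    inbox.post("boom")
    inbox.post("ok")
    try:
        inbox.process(oom_handler)
    except MemoryError:
        pass
    processed = []
    inbox.process(processed.append)          # still works: the slot was released
    return inbox.num_active_threads, processed
```

Before the fix, the decrement ran only on the non-error path, so after the MemoryError the second `process` call would see `num_active_threads == 1` and return without ever delivering "ok".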

Why are the changes needed?

numActiveThreads is not correct when a fatal error happens, which causes the problem described above.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add a new test.

@wzhfy
Contributor Author

wzhfy commented Aug 30, 2020

cc @vanzin @cloud-fan could you please review this PR? Thanks!

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128026 has finished for PR 29580 at commit 58ca21d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128029 has finished for PR 29580 at commit eb8b0b3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128030 has finished for PR 29580 at commit 9a78c8b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Aug 30, 2020

retest this please

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128034 has finished for PR 29580 at commit 9a78c8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@mridulm mridulm left a comment


Thanks for fixing this, just had a minor question.

core/src/main/scala/org/apache/spark/rpc/netty/Inbox.scala (review comment, resolved)
@mridulm
Contributor

mridulm commented Sep 2, 2020

Then other threads can not process messages in that inbox, which causes the endpoint to hang

Other than when the inbox has been stopped, this should not happen.
Are you referring to that case, or to other cases?

@cloud-fan
Contributor

cc @zsxwing @jiangxb1987

@wzhfy
Contributor Author

wzhfy commented Sep 3, 2020

Then other threads can not process messages in that inbox, which causes the endpoint to hang

Other than when the inbox has been stopped, this should not happen.
Are you referring to that case, or to other cases?

@mridulm In our case, messages for DriverEndpoint couldn't get processed after an OOM happened in a dispatcher thread. The cluster's Spark version is 2.3, but I think the same problem exists in 2.4, and for other endpoints in 3.0 (DriverEndpoint becomes an IsolatedRpcEndpoint instead of a ThreadSafeRpcEndpoint in 3.x).

IIUC, an inbox is stopped only when it's unregistered.
When a dispatcher thread is processing messages in an inbox and a fatal error (e.g. OOM) happens, the thread just throws the error.
I can't find any place that stops the inbox in this case.
Please correct me if I'm wrong, thanks!
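The failure mode described above can be modeled with a small sketch (hypothetical code, not Spark's; Python's SystemExit stands in for a fatal JVM error such as OOM). Only non-fatal exceptions restore the counter, so a fatal error leaves the inbox permanently "busy" and later calls return without processing anything:

```python
import threading


class PreFixInbox:
    """Models the pre-fix behaviour of Inbox.process: the active-thread
    counter is only restored on the normal and non-fatal paths."""

    def __init__(self):
        self.lock = threading.Lock()
        self.num_active_threads = 0
        self.messages = []

    def post(self, message):
        with self.lock:
            self.messages.append(message)

    def process(self, handler):
        with self.lock:
            if self.num_active_threads > 0:
                return False                 # inbox looks busy; give up
            if not self.messages:
                return False
            message = self.messages.pop(0)
            self.num_active_threads += 1
        try:
            handler(message)
        except Exception:
            pass                             # non-fatal errors are handled here
        # A fatal error (SystemExit here, OOM on the JVM) propagates past the
        # except clause above, so this decrement never runs and the inbox
        # stays "busy" forever.
        with self.lock:
            self.num_active_threads -= 1
        return True


def fatal_handler(message):
    raise SystemExit(message)                # stand-in for a fatal JVM error


def demo():
    inbox = PreFixInbox()
    inbox.post("boom")
    inbox.post("ok")
    try:
        inbox.process(fatal_handler)
    except SystemExit:
        pass
    # The counter is stuck at 1, so "ok" can never be processed: the "hang".
    return inbox.process(lambda m: None), inbox.num_active_threads
```

The second `process` call returns False even though a message is waiting, which is exactly the stuck-endpoint symptom reported here.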

@wzhfy wzhfy changed the title [SPARK-32738][CORE] Should reduce the number of active threads if fatal exception happens in Inbox.process [SPARK-32738][CORE] Should reduce the number of active threads if fatal error happens in Inbox.process Sep 3, 2020
Member

@Ngone51 Ngone51 left a comment


The fix looks reasonable to me.

@Ngone51
Member

Ngone51 commented Sep 3, 2020

I guess in 2.4 and higher versions, the OOM can be caught by SparkUncaughtExceptionHandler, which then exits the driver process.
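A hedged sketch of the mechanism mentioned above (hypothetical code: Python's `threading.excepthook` plays the role of the JVM's uncaught-exception handler, and this handler records the error instead of exiting so the sketch is safe to run):

```python
import threading

fatal_errors = []


def uncaught_handler(args):
    # Spark's SparkUncaughtExceptionHandler would halt the process here on
    # OutOfMemoryError; this sketch only records what escaped the thread.
    fatal_errors.append((args.thread.name, type(args.exc_value).__name__))


threading.excepthook = uncaught_handler


def dispatcher_loop():
    # An error thrown out of a dispatcher thread's run loop reaches the
    # installed uncaught-exception handler.
    raise MemoryError("simulated OOM in a dispatcher thread")


t = threading.Thread(target=dispatcher_loop, name="dispatcher-event-loop-0")
t.start()
t.join()
```

Even when such a handler exits the process, keeping numActiveThreads correct is still worthwhile, since not every fatal-error path necessarily terminates the JVM.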

@SparkQA

SparkQA commented Sep 4, 2020

Test build #128304 has finished for PR 29580 at commit dcf29c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 9, 2020

kindly ping @mridulm @cloud-fan @Ngone51

@Ngone51
Member

Ngone51 commented Sep 9, 2020

LGTM.

@SparkQA

SparkQA commented Sep 10, 2020

Test build #128505 has finished for PR 29580 at commit c279e94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

retest this please

@SparkQA

SparkQA commented Sep 10, 2020

Test build #128535 has finished for PR 29580 at commit c279e94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Sep 14, 2020

gentle ping @jiangxb1987 Are there any more comments?

@tgravescs
Contributor

I'm a little confused by the description here, can you please elaborate? Is the problem that a fatal error just kills the thread and is swallowed? Or is it the accounting of the number of threads, i.e. we think we have more active threads than we really do, so something hangs?

@tgravescs
Contributor

sorry took a closer look and it makes sense so ignore my question.

@jiangxb1987
Contributor

LGTM

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 99384d1 Sep 15, 2020
@cloud-fan
Contributor

@wzhfy can you open backport PRs for 3.0 and 2.4? thanks!

@wzhfy
Contributor Author

wzhfy commented Sep 15, 2020

@cloud-fan Sure, I'll submit backport PRs soon.
Thanks for reviewing! @mridulm @Ngone51 @cloud-fan @jiangxb1987 @tgravescs

wzhfy added a commit to wzhfy/spark that referenced this pull request Sep 15, 2020
…al error happens in `Inbox.process`

Closes apache#29580 from wzhfy/deal_with_fatal_error.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Sep 17, 2020
…f fatal error happens in `Inbox.process`

This is a backport for [pr#29580](#29580) to branch 3.0.

Closes #29763 from wzhfy/deal_with_fatal_error_3.0.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
asfgit pushed a commit that referenced this pull request Sep 18, 2020
…f fatal error happens in `Inbox.process`

This is a backport for [pr#29580](#29580) to branch 2.4.

Closes #29764 from wzhfy/deal_with_fatal_error_2.4.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…f fatal error happens in `Inbox.process`

This is a backport for [pr#29580](apache#29580) to branch 3.0.

Closes apache#29763 from wzhfy/deal_with_fatal_error_3.0.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
7 participants