[SPARK-32738][CORE] Should reduce the number of active threads if fatal error happens in Inbox.process
#29580
Conversation
cc @vanzin @cloud-fan could you please review this PR? Thanks!
Test build #128026 has finished for PR 29580 at commit
Force-pushed from 58ca21d to eb8b0b3.
Test build #128029 has finished for PR 29580 at commit
Test build #128030 has finished for PR 29580 at commit
retest this please
Test build #128034 has finished for PR 29580 at commit
Thanks for fixing this, just had a minor question.
Other than when the inbox has been stopped, this would not happen.
@mridulm IIUC, an inbox is stopped only when it's unregistered.
The fix looks reasonable to me.
I guess in 2.4 and higher versions, OOM can be caught by
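For context on the comment above: Scala's `NonFatal` extractor is the usual way such catch-and-cleanup blocks are gated, and it deliberately does not match fatal JVM errors such as `OutOfMemoryError`, which is why cleanup guarded only by `NonFatal` can be skipped. A small illustration (the helper name `isNonFatal` is invented here for demonstration):

```scala
import scala.util.control.NonFatal

// scala.util.control.NonFatal matches ordinary exceptions but NOT
// fatal throwables (VirtualMachineError subclasses such as
// OutOfMemoryError, InterruptedException, LinkageError,
// ControlThrowable). Cleanup placed only inside a
// `case NonFatal(e)` handler is therefore skipped on OOM.
def isNonFatal(t: Throwable): Boolean = t match {
  case NonFatal(_) => true
  case _           => false
}
```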
Test build #128304 has finished for PR 29580 at commit
kindly ping @mridulm @cloud-fan @Ngone51
LGTM.
Test build #128505 has finished for PR 29580 at commit
retest this please
Test build #128535 has finished for PR 29580 at commit
gentle ping @jiangxb1987 Are there any more comments?
I'm a little confused by the description here; can you please elaborate? Is the problem that if a fatal error happens, it just kills the thread and is swallowed? Or is the problem dealing with the number of threads? Is it that we think we have more threads than we really do, so something hangs?
Sorry, I took a closer look and it makes sense, so ignore my question.
LGTM |
thanks, merging to master! |
@wzhfy can you open backport PRs for 3.0 and 2.4? thanks! |
@cloud-fan Sure, I'll submit backport PRs soon.
…al error happens in `Inbox.process` ### What changes were proposed in this pull request? Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Now if any fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Then other threads can not process messages in that inbox, which causes the endpoint to "hang". For other type of endpoints, we also should keep `numActiveThreads` correct. This problem is more serious in previous Spark 2.x versions since the driver, executor and block manager endpoints are all thread safe endpoints. To fix this, we should reduce the number of active threads if fatal error happens in `Inbox.process`. ### Why are the changes needed? `numActiveThreads` is not correct when fatal error happens and will cause the described problem. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add a new test. Closes apache#29580 from wzhfy/deal_with_fatal_error. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
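The commit message above describes the fix in prose. A minimal standalone sketch of the pattern (a hypothetical simplified class, not the real `org.apache.spark.rpc.netty.Inbox`, assuming a single queue and counter as described) might look like:

```scala
import java.util.LinkedList

// Simplified model of the inbox bookkeeping described in the PR.
class Inbox {
  private val messages = new LinkedList[String]()
  var numActiveThreads = 0 // guarded by this.synchronized

  def post(msg: String): Unit = synchronized { messages.add(msg) }

  // Before the fix, a fatal error (e.g. OutOfMemoryError) thrown by
  // the handler skipped the decrement, leaving numActiveThreads
  // inflated so later threads refused to process this inbox.
  def process(handler: String => Unit): Unit = {
    synchronized { numActiveThreads += 1 }
    try {
      var msg = synchronized { messages.poll() }
      while (msg != null) {
        handler(msg)
        msg = synchronized { messages.poll() }
      }
    } finally {
      // The fix: always restore the counter, even on fatal errors.
      synchronized { numActiveThreads -= 1 }
    }
  }
}
```

With the decrement in `finally`, a fatal error still propagates, but the counter is restored so other threads can continue draining the inbox.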
…f fatal error happens in `Inbox.process` This is a backport for [pr#29580](#29580) to branch 3.0. ### What changes were proposed in this pull request? Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Now if any fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Then other threads can not process messages in that inbox, which causes the endpoint to "hang". For other type of endpoints, we also should keep `numActiveThreads` correct. This problem is more serious in previous Spark 2.x versions since the driver, executor and block manager endpoints are all thread safe endpoints. To fix this, we should reduce the number of active threads if fatal error happens in `Inbox.process`. ### Why are the changes needed? `numActiveThreads` is not correct when fatal error happens and will cause the described problem. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add a new test. Closes #29763 from wzhfy/deal_with_fatal_error_3.0. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
…f fatal error happens in `Inbox.process` This is a backport for [pr#29580](#29580) to branch 2.4. ### What changes were proposed in this pull request? Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Now if any fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Then other threads can not process messages in that inbox, which causes the endpoint to "hang". For other type of endpoints, we also should keep `numActiveThreads` correct. This problem is more serious in previous Spark 2.x versions since the driver, executor and block manager endpoints are all thread safe endpoints. To fix this, we should reduce the number of active threads if fatal error happens in `Inbox.process`. ### Why are the changes needed? `numActiveThreads` is not correct when fatal error happens and will cause the described problem. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add a new test. Closes #29764 from wzhfy/deal_with_fatal_error_2.4. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
What changes were proposed in this pull request?
Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Now if any fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Then other threads cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep `numActiveThreads` correct.
This problem is more serious in earlier Spark 2.x versions, since the driver, executor, and block manager endpoints are all thread-safe endpoints.
To fix this, we should reduce the number of active threads if a fatal error happens in `Inbox.process`.
Why are the changes needed?
`numActiveThreads` is not correct when a fatal error happens, which causes the described problem.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a new test.