-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34273][CORE] Do not reregister BlockManager when SparkContext is stopped #31373
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Retest this please |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #134594 has finished for PR 31373 at commit
|
Retest this please |
@@ -128,14 +128,16 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) | |||
|
|||
// Messages received from executors | |||
case heartbeat @ Heartbeat(executorId, accumUpdates, blockManagerId, executorUpdates) => | |||
var reregisterBlockManager = !sc.isStopped |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a race condition that happens during SparkContext shutdown. So it's also possible that the SparkContext is stopped right after we sent the HeartbeatResponse
. In that case, IIUC, the issue will still exist.
Does the current behavior cause any real issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank. you for review, @Ngone51 . That's true. In the production environment, I hit those intermediate status. And, this will help us simplify the situation.
- The case you mentioned,
Send HeartbeatResponse and sc.stop invoked
, is a normal situation. The users don't complain about this. - The case in this PR,
sc.stop invoked and Spark works inefficiently by sending HeartbeatResponse(true)
is a problem. The users complain about this.
For the following, yes. We are After the apps sc.stop
takes a longer time than we expect.
Does the current behavior cause any real issue?
Kubernetes integration test starting |
Refer to this link for build results (access rights to CI server needed): |
Test build #134611 has finished for PR 31373 at commit
|
Refer to this link for build results (access rights to CI server needed): |
Also, cc @mridulm and @tgravescs |
@dongjoon-hyun can you update description with what user sees from this? I think its a bunch of extra log messages and shutdown takes longer? |
Thank you for review, @tgravescs . Yes, it dose. I'll update it. |
Updated! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall changes seem fine to me. Spark has bunch of cases where shutdown can be slow and spews a bunch of log messages, so getting rid of those is great even if its not perfect.
Thank you so much, @tgravescs ! |
…is stopped ### What changes were proposed in this pull request? This PR aims to prevent `HeartbeatReceiver` asks `Executor` to re-register blocker manager when the SparkContext is already stopped. ### Why are the changes needed? Currently, `HeartbeatReceiver` blindly asks re-registration for the new heartbeat message. However, when SparkContext is stopped, we don't need to re-register new block manager. Re-registration causes unnecessary executors' logs and and a delay on job termination. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the newly added test case. Closes #31373 from dongjoon-hyun/SPARK-34273. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bc41c5a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…is stopped ### What changes were proposed in this pull request? This PR aims to prevent `HeartbeatReceiver` asks `Executor` to re-register blocker manager when the SparkContext is already stopped. ### Why are the changes needed? Currently, `HeartbeatReceiver` blindly asks re-registration for the new heartbeat message. However, when SparkContext is stopped, we don't need to re-register new block manager. Re-registration causes unnecessary executors' logs and and a delay on job termination. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the newly added test case. Closes #31373 from dongjoon-hyun/SPARK-34273. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bc41c5a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…is stopped This PR aims to prevent `HeartbeatReceiver` asks `Executor` to re-register blocker manager when the SparkContext is already stopped. Currently, `HeartbeatReceiver` blindly asks re-registration for the new heartbeat message. However, when SparkContext is stopped, we don't need to re-register new block manager. Re-registration causes unnecessary executors' logs and and a delay on job termination. No. Pass the CIs with the newly added test case. Closes #31373 from dongjoon-hyun/SPARK-34273. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bc41c5a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Merged to master/3.1/3.0/2.4. |
…is stopped ### What changes were proposed in this pull request? This PR aims to prevent `HeartbeatReceiver` asks `Executor` to re-register blocker manager when the SparkContext is already stopped. ### Why are the changes needed? Currently, `HeartbeatReceiver` blindly asks re-registration for the new heartbeat message. However, when SparkContext is stopped, we don't need to re-register new block manager. Re-registration causes unnecessary executors' logs and and a delay on job termination. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the newly added test case. Closes apache#31373 from dongjoon-hyun/SPARK-34273. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
This PR aims to prevent
HeartbeatReceiver
asksExecutor
to re-register blocker manager when the SparkContext is already stopped.Why are the changes needed?
Currently,
HeartbeatReceiver
blindly asks re-registration for the new heartbeat message.However, when SparkContext is stopped, we don't need to re-register new block manager.
Re-registration causes unnecessary executors' logs and and a delay on job termination.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Pass the CIs with the newly added test case.