Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34273][CORE] Do not reregister BlockManager when SparkContext is stopped #31373

Closed
wants to merge 1 commit into from
Closed

Conversation

dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Jan 28, 2021

What changes were proposed in this pull request?

This PR aims to prevent HeartbeatReceiver asks Executor to re-register blocker manager when the SparkContext is already stopped.

Why are the changes needed?

Currently, HeartbeatReceiver blindly asks re-registration for the new heartbeat message.
However, when SparkContext is stopped, we don't need to re-register new block manager.
Re-registration causes unnecessary executors' logs and and a delay on job termination.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs with the newly added test case.

@github-actions github-actions bot added the CORE label Jan 28, 2021
@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft January 28, 2021 05:41
@SparkQA

This comment has been minimized.

@dongjoon-hyun
Copy link
Member Author

Retest this please

@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review January 28, 2021 06:33
@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39184/

@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39184/

@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Test build #134594 has finished for PR 31373 at commit 510a505.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please

@@ -128,14 +128,16 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)

// Messages received from executors
case heartbeat @ Heartbeat(executorId, accumUpdates, blockManagerId, executorUpdates) =>
var reregisterBlockManager = !sc.isStopped
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a race condition that happens during SparkContext shutdown. So it's also possible that the SparkContext is stopped right after we sent the HeartbeatResponse. In that case, IIUC, the issue will still exist.

Does the current behavior cause any real issue?

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun Jan 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank. you for review, @Ngone51 . That's true. In the production environment, I hit those intermediate status. And, this will help us simplify the situation.

  1. The case you mentioned, Send HeartbeatResponse and sc.stop invoked, is a normal situation. The users don't complain about this.
  2. The case in this PR, sc.stop invoked and Spark works inefficiently by sending HeartbeatResponse(true) is a problem. The users complain about this.

For the following, yes. We are After the apps sc.stop takes a longer time than we expect.

Does the current behavior cause any real issue?

@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39199/

@AmplabJenkins
Copy link

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39199/

@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Test build #134611 has finished for PR 31373 at commit 510a505.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134611/

@dongjoon-hyun
Copy link
Member Author

Also, cc @mridulm and @tgravescs

@tgravescs
Copy link
Contributor

@dongjoon-hyun can you update description with what user sees from this? I think its a bunch of extra log messages and shutdown takes longer?

@dongjoon-hyun
Copy link
Member Author

Thank you for review, @tgravescs . Yes, it dose. I'll update it.

@dongjoon-hyun
Copy link
Member Author

Updated!

Copy link
Contributor

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall changes seem fine to me. Spark has bunch of cases where shutdown can be slow and spews a bunch of log messages, so getting rid of those is great even if its not perfect.

@dongjoon-hyun
Copy link
Member Author

Thank you so much, @tgravescs !

dongjoon-hyun added a commit that referenced this pull request Jan 28, 2021
…is stopped

### What changes were proposed in this pull request?

This PR aims to prevent `HeartbeatReceiver` asks `Executor` to re-register blocker manager when the SparkContext is already stopped.

### Why are the changes needed?

Currently, `HeartbeatReceiver` blindly asks re-registration for the new heartbeat message.
However, when SparkContext is stopped, we don't need to re-register new block manager.
Re-registration causes unnecessary executors' logs and and a delay on job termination.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #31373 from dongjoon-hyun/SPARK-34273.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit bc41c5a)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun added a commit that referenced this pull request Jan 28, 2021
…is stopped

### What changes were proposed in this pull request?

This PR aims to prevent `HeartbeatReceiver` asks `Executor` to re-register blocker manager when the SparkContext is already stopped.

### Why are the changes needed?

Currently, `HeartbeatReceiver` blindly asks re-registration for the new heartbeat message.
However, when SparkContext is stopped, we don't need to re-register new block manager.
Re-registration causes unnecessary executors' logs and and a delay on job termination.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #31373 from dongjoon-hyun/SPARK-34273.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit bc41c5a)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun added a commit that referenced this pull request Jan 28, 2021
…is stopped

This PR aims to prevent `HeartbeatReceiver` asks `Executor` to re-register blocker manager when the SparkContext is already stopped.

Currently, `HeartbeatReceiver` blindly asks re-registration for the new heartbeat message.
However, when SparkContext is stopped, we don't need to re-register new block manager.
Re-registration causes unnecessary executors' logs and and a delay on job termination.

No.

Pass the CIs with the newly added test case.

Closes #31373 from dongjoon-hyun/SPARK-34273.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit bc41c5a)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member Author

Merged to master/3.1/3.0/2.4.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-34273 branch January 28, 2021 21:16
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…is stopped

### What changes were proposed in this pull request?

This PR aims to prevent `HeartbeatReceiver` asks `Executor` to re-register blocker manager when the SparkContext is already stopped.

### Why are the changes needed?

Currently, `HeartbeatReceiver` blindly asks re-registration for the new heartbeat message.
However, when SparkContext is stopped, we don't need to re-register new block manager.
Re-registration causes unnecessary executors' logs and and a delay on job termination.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes apache#31373 from dongjoon-hyun/SPARK-34273.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants