Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4592] Avoid duplicate worker registrations in standalone mode #3447

Closed
wants to merge 5 commits into from

Conversation

andrewor14
Copy link
Contributor

Summary. On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This is caused by this commit 4afe9a4, which adds logic for the worker to re-register with the master in case of failures. However, the following race condition may occur:

(1) Master A fails and Worker attempts to reconnect to all masters
(2) Master B takes over and notifies Worker
(3) Worker responds by registering with Master B
(4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice

Fix. Instead of attempting to register with all known masters, the worker should re-register with only the one that it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one.

Caveat. Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and other potential race conditions summarized in SPARK-4592, are much, much less likely than the one described above, which is deterministically reproducible.

Andrew Or added 2 commits November 24, 2014 23:40
The gist is that we only reconnect to the master we've been
communicating with instead of making a registration request
to all known masters. More details in the code comments.
If a worker cannot initially reach a master, then it will attempt
a retry. In this case, the active master actor must be null. This
commit removes an assert that falsely assumes the contrary.
@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23828 has started for PR 3447 at commit 1fce6a9.

  • This patch merges cleanly.

The Master may not necessarily be dead, as it may have recovered.
@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23830 has started for PR 3447 at commit 83b321c.

  • This patch merges cleanly.

If this is an initial retry, meaning the active master is not
set yet, then do try to contact all masters. Otherwise, we can
assume that retry means there is a master failure.
@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23835 has started for PR 3447 at commit 79286dc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23828 has finished for PR 3447 at commit 1fce6a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23828/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23830 has finished for PR 3447 at commit 83b321c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23830/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23835 has finished for PR 3447 at commit 79286dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23835/
Test PASSed.

Instead of possible sending registration requests to the master
in two separate threads (the actor thread and the timer thread),
we rely on the actor's single-threaded-ness to provide for
thread-safety.
@andrewor14
Copy link
Contributor Author

@JoshRosen I have addressed all of your high-level comments. Please have a look.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23847 has started for PR 3447 at commit 0d9716c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23847 has finished for PR 3447 at commit 0d9716c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23847/
Test PASSed.

@JoshRosen
Copy link
Contributor

LGTM; thanks!

@andrewor14
Copy link
Contributor Author

Awesome I'm merging this into master and 1.2 thanks.

mengxr pushed a commit to mengxr/spark that referenced this pull request Nov 26, 2014
**Summary.** On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This is caused by this commit apache@4afe9a4, which adds logic for the worker to re-register with the master in case of failures. However, the following race condition may occur:

(1) Master A fails and Worker attempts to reconnect to all masters
(2) Master B takes over and notifies Worker
(3) Worker responds by registering with Master B
(4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice

**Fix.** Instead of attempting to register with all known masters, the worker should re-register with only the one that it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one.

**Caveat.** Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and other potential race conditions summarized in [SPARK-4592](https://issues.apache.org/jira/browse/SPARK-4592), are much, much less likely than the one described above, which is deterministically reproducible.

Author: Andrew Or <andrew@databricks.com>

Closes apache#3447 from andrewor14/standalone-failover and squashes the following commits:

0d9716c [Andrew Or] Move re-registration logic to actor for thread-safety
79286dc [Andrew Or] Preserve old behavior for initial retries
83b321c [Andrew Or] Tweak wording
1fce6a9 [Andrew Or] Active master actor could be null in the beginning
b6f269e [Andrew Or] Avoid duplicate worker registrations

(cherry picked from commit 1b2ab1c)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit asfgit closed this in 1b2ab1c Nov 26, 2014
@andrewor14 andrewor14 deleted the standalone-failover branch March 3, 2015 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants