[SPARK-4592] Avoid duplicate worker registrations in standalone mode #3447

andrewor14 · 2014-11-25T08:09:11Z

Summary. On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This is caused by commit 4afe9a4, which added logic for the worker to re-register with the master in case of failures. However, the following race condition may occur:

(1) Master A fails and Worker attempts to reconnect to all masters
(2) Master B takes over and notifies Worker
(3) Worker responds by registering with Master B
(4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice
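
To make steps (3) and (4) concrete, here is a minimal sketch of the race. It is a toy model, not Spark's Master: `ToyMaster`, `register`, and `DuplicateRegistrationDemo` are hypothetical names, and the master is assumed to do nothing more than remember worker IDs.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a master: it only remembers which worker IDs it has seen.
final class ToyMaster {
  private val registered = mutable.Set.empty[String]

  /** Returns true for a first registration, false if the worker ID is already known. */
  def register(workerId: String): Boolean = registered.add(workerId)
}

object DuplicateRegistrationDemo {
  /** Replays steps (3) and (4): two registrations from one worker reach Master B. */
  def replay(): (Boolean, Boolean) = {
    val masterB = new ToyMaster
    val fromNotification = masterB.register("worker-1") // step (3): reply to Master B's notification
    val fromStaleRetry = masterB.register("worker-1")   // step (4): the earlier reconnection attempt arrives
    (fromNotification, fromStaleRetry)
  }
}
```

In this model the second call is the duplicate; in the scenario described above, that duplicate is what leads the worker to exit.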

Fix. Instead of attempting to register with all known masters, the worker should re-register with only the one that it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one.
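
As a rough illustration of why narrowing the retry helps, the sketch below counts how many registrations reach Master B during one failover under the old and the new strategy. `FailoverReplay` and `registrationsAtB` are hypothetical names, masters are plain strings, and message delivery is assumed to be reliable; none of this is taken from the patch.

```scala
object FailoverReplay {
  /** Counts the registrations that reach Master B for a single failover from A to B. */
  def registrationsAtB(retryAllMasters: Boolean): Int = {
    val knownMasters = List("A", "B")
    val activeMaster = "A" // the master the worker was talking to before the failure

    // Step (1): targets of the reconnection attempt made after Master A fails.
    val retryTargets = if (retryAllMasters) knownMasters else List(activeMaster)

    // Steps (2)-(3): Master B takes over, notifies the worker, and the worker registers with B.
    val fromNotification = List("B")

    (retryTargets ++ fromNotification).count(_ == "B")
  }
}
```

With `retryAllMasters = true` the stale retry and the notification reply both land on B; with `false` only the notification reply does.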

Caveat. Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and the other potential race conditions summarized in SPARK-4592 are much less likely than the one described above, which is deterministically reproducible.

SparkQA · 2014-11-25T08:15:01Z

Test build #23828 has started for PR 3447 at commit 1fce6a9.

This patch merges cleanly.

SparkQA · 2014-11-25T08:27:54Z

Test build #23830 has started for PR 3447 at commit 83b321c.

This patch merges cleanly.

SparkQA · 2014-11-25T09:20:17Z

Test build #23835 has started for PR 3447 at commit 79286dc.

This patch merges cleanly.

SparkQA · 2014-11-25T09:37:24Z

Test build #23828 has finished for PR 3447 at commit 1fce6a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-25T09:37:28Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23828/
Test PASSed.

SparkQA · 2014-11-25T09:50:37Z

Test build #23830 has finished for PR 3447 at commit 83b321c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-25T09:50:41Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23830/
Test PASSed.

SparkQA · 2014-11-25T10:46:00Z

Test build #23835 has finished for PR 3447 at commit 79286dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-25T10:46:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23835/
Test PASSed.

andrewor14 · 2014-11-25T21:06:41Z

@JoshRosen I have addressed all of your high-level comments. Please have a look.

SparkQA · 2014-11-25T21:10:05Z

Test build #23847 has started for PR 3447 at commit 0d9716c.

This patch merges cleanly.

SparkQA · 2014-11-25T22:42:06Z

Test build #23847 has finished for PR 3447 at commit 0d9716c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-25T22:42:11Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23847/
Test PASSed.

JoshRosen · 2014-11-25T23:29:29Z

LGTM; thanks!

andrewor14 · 2014-11-25T23:42:32Z

Awesome, I'm merging this into master and 1.2. Thanks.

**Summary.** On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This is caused by this commit apache@4afe9a4, which adds logic for the worker to re-register with the master in case of failures. However, the following race condition may occur:

(1) Master A fails and Worker attempts to reconnect to all masters
(2) Master B takes over and notifies Worker
(3) Worker responds by registering with Master B
(4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice

**Fix.** Instead of attempting to register with all known masters, the worker should re-register with only the one that it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one.

**Caveat.** Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and other potential race conditions summarized in [SPARK-4592](https://issues.apache.org/jira/browse/SPARK-4592), are much, much less likely than the one described above, which is deterministically reproducible.

Author: Andrew Or <andrew@databricks.com>

Closes apache#3447 from andrewor14/standalone-failover and squashes the following commits:

0d9716c [Andrew Or] Move re-registration logic to actor for thread-safety
79286dc [Andrew Or] Preserve old behavior for initial retries
83b321c [Andrew Or] Tweak wording
1fce6a9 [Andrew Or] Active master actor could be null in the beginning
b6f269e [Andrew Or] Avoid duplicate worker registrations

(cherry picked from commit 1b2ab1c)
Signed-off-by: Andrew Or <andrew@databricks.com>

Andrew Or added 2 commits November 24, 2014 23:40

Avoid duplicate worker registrations

b6f269e

The gist is that we only reconnect to the master we've been communicating with instead of making a registration request to all known masters. More details in the code comments.

Active master actor could be null in the beginning

1fce6a9

If a worker cannot initially reach a master, then it will attempt a retry. In this case, the active master actor must be null. This commit removes an assert that falsely assumes the contrary.

JoshRosen mentioned this pull request Nov 25, 2014

[SPARK-3736] Workers reconnect when disassociated from the master. #2828

Closed

Tweak wording

83b321c

The Master may not necessarily be dead, as it may have recovered.

Preserve old behavior for initial retries

79286dc

If this is an initial retry, meaning the active master is not set yet, then do try to contact all masters. Otherwise, we can assume that retry means there is a master failure.
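
The rule in this commit message can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in the patch: `RetryTargets` and `select` are hypothetical names, masters are represented as plain strings, and "active master not set yet" is modeled as `None` rather than a null actor reference.

```scala
object RetryTargets {
  /**
   * Picks the masters to contact on a registration retry.
   * `activeMaster` is None until the worker has an active master.
   */
  def select(allMasters: Seq[String], activeMaster: Option[String]): Seq[String] =
    activeMaster match {
      case None         => allMasters  // initial retry: keep the old behavior and try every master
      case Some(master) => Seq(master) // later retry: treat it as a failure of this master only
    }
}
```

For example, `RetryTargets.select(Seq("a", "b"), None)` yields both masters, while passing `Some("a")` narrows the retry to `"a"`. Modeling the unset case as an `Option` also avoids asserting that an active master exists before the first successful registration.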

Move re-registration logic to actor for thread-safety

0d9716c

Instead of possibly sending registration requests to the master from two separate threads (the actor thread and the timer thread), we rely on the actor's single-threadedness to provide thread-safety.
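
The idea of funnelling every registration attempt through one thread can be sketched without an actor library. The class below is a hypothetical stand-in, not Spark's Worker: it assumes a single-thread executor as the "mailbox", and the message names `"Reregister"` and `"Registered"` are illustrative strings chosen for this example.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

final class ToyWorkerActor {
  // One thread processes all messages in FIFO order, like an actor's mailbox.
  private val mailbox = Executors.newSingleThreadExecutor()

  // Touched only from the mailbox thread, so no locks are needed.
  private var attempts = 0
  private var registered = false

  /** Any thread (for example a timer thread) may call this; it only enqueues the message. */
  def tell(message: String): Unit = mailbox.execute(() => receive(message))

  private def receive(message: String): Unit = message match {
    case "Reregister" => if (!registered) attempts += 1 // a request would be sent here
    case "Registered" => registered = true
    case _            => ()
  }

  /** Reads the counter on the mailbox thread, after all earlier messages were handled. */
  def attemptsSoFar(): Int =
    mailbox.submit(new Callable[Int] { def call(): Int = attempts }).get()

  /** Stops the mailbox thread once queued messages have been processed. */
  def stop(): Unit = {
    mailbox.shutdown()
    mailbox.awaitTermination(5, TimeUnit.SECONDS)
  }
}
```

Because the timer only enqueues a message instead of sending the request itself, the decision to re-register and the handling of a successful registration can never interleave.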

asfgit closed this in 1b2ab1c Nov 26, 2014

andrewor14 deleted the standalone-failover branch March 3, 2015 01:06

Ngone51 mentioned this pull request May 9, 2019

[SPARK-23191][CORE] Warn rather than terminate when duplicate worker register happens #24569

Closed
