New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4592] Avoid duplicate worker registrations in standalone mode #3447
Conversation
The gist is that we only reconnect to the master we've been communicating with instead of making a registration request to all known masters. More details in the code comments.
If a worker cannot initially reach a master, then it will attempt a retry. In this case, the active master actor must be null. This commit removes an assert that falsely assumes the contrary.
Test build #23828 has started for PR 3447 at commit
|
The Master may not necessarily be dead, as it may have recovered.
Test build #23830 has started for PR 3447 at commit
|
If this is an initial retry, meaning the active master is not set yet, then do try to contact all masters. Otherwise, we can assume that retry means there is a master failure.
Test build #23835 has started for PR 3447 at commit
|
Test build #23828 has finished for PR 3447 at commit
|
Test PASSed. |
Test build #23830 has finished for PR 3447 at commit
|
Test PASSed. |
Test build #23835 has finished for PR 3447 at commit
|
Test PASSed. |
Instead of possible sending registration requests to the master in two separate threads (the actor thread and the timer thread), we rely on the actor's single-threaded-ness to provide for thread-safety.
@JoshRosen I have addressed all of your high-level comments. Please have a look. |
Test build #23847 has started for PR 3447 at commit
|
Test build #23847 has finished for PR 3447 at commit
|
Test PASSed. |
LGTM; thanks! |
Awesome I'm merging this into master and 1.2 thanks. |
**Summary.** On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This is caused by this commit apache@4afe9a4, which adds logic for the worker to re-register with the master in case of failures. However, the following race condition may occur: (1) Master A fails and Worker attempts to reconnect to all masters (2) Master B takes over and notifies Worker (3) Worker responds by registering with Master B (4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice **Fix.** Instead of attempting to register with all known masters, the worker should re-register with only the one that it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one. **Caveat.** Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and other potential race conditions summarized in [SPARK-4592](https://issues.apache.org/jira/browse/SPARK-4592), are much, much less likely than the one described above, which is deterministically reproducible. Author: Andrew Or <andrew@databricks.com> Closes apache#3447 from andrewor14/standalone-failover and squashes the following commits: 0d9716c [Andrew Or] Move re-registration logic to actor for thread-safety 79286dc [Andrew Or] Preserve old behavior for initial retries 83b321c [Andrew Or] Tweak wording 1fce6a9 [Andrew Or] Active master actor could be null in the beginning b6f269e [Andrew Or] Avoid duplicate worker registrations (cherry picked from commit 1b2ab1c) Signed-off-by: Andrew Or <andrew@databricks.com>
Summary. On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This is caused by this commit 4afe9a4, which adds logic for the worker to re-register with the master in case of failures. However, the following race condition may occur:
(1) Master A fails and Worker attempts to reconnect to all masters
(2) Master B takes over and notifies Worker
(3) Worker responds by registering with Master B
(4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice
Fix. Instead of attempting to register with all known masters, the worker should re-register with only the one that it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one.
Caveat. Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and other potential race conditions summarized in SPARK-4592, are much, much less likely than the one described above, which is deterministically reproducible.