ARTEMIS-4143 improve mitigation against split-brain with shared-storage#4344
Merged
Conversation
d4abf83 to
4b4ee1c
Compare
tabish121
reviewed
Jan 26, 2023
a6834b5 to
cf3bd1f
Compare
Contributor
|
@jbertram since I have seen the failures associated with this PR, I am making it a draft. Please make it ready to review after fixed? |
f33157b to
1303667
Compare
Configurations employing shared-storage with NFS are susceptible to
split-brain in certain scenarios. For example:
1) Primary loses network connection to NFS.
2) Backup activates.
3) Primary reconnects to NFS.
4) Split-brain.
In reality this situation is pretty unlikely due to the timing involved,
but the possibility still exists. Currently the file lock held by the
primary broker on the NFS share is essentially worthless in this
situation. This commit adds logic by which the timestamp of the lock
file is updated during activation and then routinely checked during
runtime to ensure consistency. This effectively mitigates split-brain in
this situation (and likely others). Here's how it works now.
1) Primary loses network connection to NFS.
2) Backup activates.
3) Primary reconnects to NFS.
4) Primary detects that the lock file's timestamp has been updated and
shuts itself down.
When the primary shuts down in step #4 the Topology on the backup can be
damaged. Protections were added for this via ARTEMIS-2868 but only for
the replicated use-case. This commit applies the protection for
removeMember() so that the Topology remains intact.
There are no tests for these changes as I cannot determine how to
properly simulate this use-case. However, there have never been robust,
automated tests for these kinds of NFS use-cases so this is not a
departure from the norm.
1303667 to
5bf7a14
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Configurations employing shared-storage with NFS are susceptible to split-brain in certain scenarios. For example:
In reality this situation is pretty unlikely due to the timing involved, but the possibility still exists. Currently the file lock held by the primary broker on the NFS share is essentially worthless in this situation. This commit adds logic by which the timestamp of the lock file is updated during activation and then routinely checked during runtime to ensure consistency. This effectively mitigates split-brain in this situation (and likely others). Here's how it works now:
shuts itself down.
When the primary shuts down in step #4 the Topology on the backup can be damaged. Protections were added for this via ARTEMIS-2868 but only for the replicated use-case. This commit applies the protection for removeMember() so that the Topology remains intact.
There are no tests for these changes as I cannot determine how to properly simulate this use-case. However, there have never been robust, automated tests for these kinds of NFS use-cases so this is not a departure from the norm.