Skip to content

ARTEMIS-4143 improve mitigation against split-brain with shared-storage#4344

Merged
clebertsuconic merged 1 commit into
apache:mainfrom
jbertram:ARTEMIS-4143
Feb 3, 2023
Merged

ARTEMIS-4143 improve mitigation against split-brain with shared-storage#4344
clebertsuconic merged 1 commit into
apache:mainfrom
jbertram:ARTEMIS-4143

Conversation

@jbertram
Copy link
Copy Markdown
Contributor

@jbertram jbertram commented Jan 26, 2023

Configurations employing shared-storage with NFS are susceptible to split-brain in certain scenarios. For example:

  1. Primary loses network connection to NFS.
  2. Backup activates.
  3. Primary reconnects to NFS.
  4. Split-brain.

In reality this situation is pretty unlikely due to the timing involved, but the possibility still exists. Currently the file lock held by the primary broker on the NFS share is essentially worthless in this situation. This commit adds logic by which the timestamp of the lock file is updated during activation and then routinely checked during runtime to ensure consistency. This effectively mitigates split-brain in this situation (and likely others). Here's how it works now:

  1. Primary loses network connection to NFS.
  2. Backup activates.
  3. Primary reconnects to NFS.
  4. Primary detects that the lock file's timestamp has been updated and
    shuts itself down.

When the primary shuts down in step #4 the Topology on the backup can be damaged. Protections were added for this via ARTEMIS-2868 but only for the replicated use-case. This commit applies the protection for removeMember() so that the Topology remains intact.

There are no tests for these changes as I cannot determine how to properly simulate this use-case. However, there have never been robust, automated tests for these kinds of NFS use-cases so this is not a departure from the norm.

@jbertram jbertram force-pushed the ARTEMIS-4143 branch 4 times, most recently from a6834b5 to cf3bd1f Compare January 27, 2023 15:41
@clebertsuconic clebertsuconic marked this pull request as draft January 30, 2023 15:25
@clebertsuconic
Copy link
Copy Markdown
Contributor

@jbertram since I have seen the failures associated with this PR, I am making it a draft. Please make it ready to review after fixed?

@jbertram jbertram force-pushed the ARTEMIS-4143 branch 2 times, most recently from f33157b to 1303667 Compare February 2, 2023 22:56
Configurations employing shared-storage with NFS are susceptible to
split-brain in certain scenarios. For example:

  1) Primary loses network connection to NFS.
  2) Backup activates.
  3) Primary reconnects to NFS.
  4) Split-brain.

In reality this situation is pretty unlikely due to the timing involved,
but the possibility still exists. Currently the file lock held by the
primary broker on the NFS share is essentially worthless in this
situation. This commit adds logic by which the timestamp of the lock
file is updated during activation and then routinely checked during
runtime to ensure consistency. This effectively mitigates split-brain in
this situation (and likely others). Here's how it works now.

  1) Primary loses network connection to NFS.
  2) Backup activates.
  3) Primary reconnects to NFS.
  4) Primary detects that the lock file's timestamp has been updated and
     shuts itself down.

When the primary shuts down in step #4 the Topology on the backup can be
damaged. Protections were added for this via ARTEMIS-2868 but only for
the replicated use-case. This commit applies the protection for
removeMember() so that the Topology remains intact.

There are no tests for these changes as I cannot determine how to
properly simulate this use-case. However, there have never been robust,
automated tests for these kinds of NFS use-cases so this is not a
departure from the norm.
@jbertram jbertram marked this pull request as ready for review February 3, 2023 15:36
@clebertsuconic clebertsuconic merged commit 8f30347 into apache:main Feb 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants