
SBR configuration #6898

Closed
PeterHageus opened this issue Aug 23, 2023 · 3 comments
@PeterHageus

Hi. I don't know if this is an Akka.Hosting issue or an Akka.Cluster issue, but we have a problem with the default configuration:

During cluster churn (high CPU load on the servers) our seed nodes are sometimes downed. They then form a new cluster, but a minority partition. Everything restarted or started after this point joins the minority partition, while the majority can remain stable for at least 24h (until IIS recycles), leading to a long-lived partition.

Would setting KeepMajority.Role to the seed-node role make SBR take only the seed nodes into account when resolving a partition? Would that be the correct way to configure the cluster?

@Arkatufus (Contributor) commented Aug 28, 2023

This is Akka.Cluster behaviour, and the short answer is "it depends":

  • How big is your cluster?
  • What is the ratio of cluster size to seed nodes?
  • Are you willing to take the risk that a large part of the cluster is downed if it was split-brained from the smaller part that holds all the seed nodes?

When you set KeepMajority.Role, only the cluster members that have that role are considered when SBR tries to resolve a split. In practice this means you will need at least 5 seed nodes for this to work properly in production.
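For reference, a keep-majority strategy with a role filter might be configured in HOCON roughly like this (provider class and key names follow the Akka.NET SBR documentation; the "seed" role name is this thread's example):

```hocon
akka.cluster {
  # Enable the built-in split brain resolver
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-majority
    # Only members carrying this role are counted when resolving a split
    keep-majority.role = "seed"
  }
}
```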

But let's take some examples.

Cluster settings:

  • 5 seed nodes with role "seed"
  • 100 nodes with no roles
  • KeepMajority.Role is set to "seed"

Scenario 1, the happy path:
The cluster splits into these parts:

  • Part 1: 3 "seed" nodes and 80 non-role nodes
  • Part 2: 2 "seed" nodes and 20 non-role nodes

SBR resolution: SBR will down part 2.

Scenario 2, the not-so-happy path:
The cluster splits into these parts:

  • Part 1: 3 "seed" nodes and 10 non-role nodes
  • Part 2: 2 "seed" nodes and 90 non-role nodes

SBR resolution: SBR will down part 2, even though it has the "majority" of the overall node count. This is because SBR only counts the nodes that carry the declared role when deciding which side has the majority.
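The rule behind both scenarios can be sketched in a few lines (a simplification, not the actual Akka.NET implementation; tie-breaking by lowest node address is elided):

```python
# Sketch of the keep-majority decision rule when KeepMajority.Role is set:
# each side counts only the members that carry the role, and the side with
# fewer role members downs itself. Non-role members never affect the outcome.

def surviving_side(part1_role_members: int, part2_role_members: int) -> str:
    """Return which partition survives under keep-majority with a role filter."""
    if part1_role_members > part2_role_members:
        return "part1"
    if part2_role_members > part1_role_members:
        return "part2"
    return "tie"  # the real SBR breaks ties via the lowest-address node

# Scenarios 1 and 2 above both split the "seed" role 3 vs 2, so part 1
# survives in both, regardless of the 80/20 or 10/90 non-role split.
print(surviving_side(3, 2))  # part1
```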

Scenario 3, the ugly path:

  • All of the "seed" nodes are down, leaving the 100 non-role nodes stranded.
  • 1 "seed" node is restarted and joins itself to form a new cluster.

SBR resolution: nothing. You will end up with a permanent split brain, and the two sides cannot find each other because the non-role cluster does not know about the newly created cluster.

@Arkatufus (Contributor)

The only way to fix this problem is to remove the arbiter from inside the cluster, be it a Lighthouse instance or a fixed set of seed nodes. To do this you will need Akka.Management.Cluster.Bootstrap in combination with Akka.Discovery, which uses an arbiter outside of the cluster and is available for Kubernetes, Azure, and AWS.
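For illustration, a Cluster.Bootstrap setup using the Kubernetes API discovery method might look roughly like this (key names mirror the Akka.Management documentation and may differ between versions; the service name is hypothetical):

```hocon
akka {
  # Resolve contact points via the Kubernetes API instead of in-cluster seed nodes
  discovery.method = kubernetes-api
  management.cluster.bootstrap {
    contact-point-discovery {
      # Hypothetical name used to look up peer pods
      service-name = "my-akka-cluster"
      discovery-method = kubernetes-api
    }
  }
}
```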

Note that Akka.Discovery.Config is not the answer. It still uses the cluster itself as the arbiter, which defeats the purpose.

@Arkatufus Arkatufus transferred this issue from akkadotnet/Akka.Hosting Aug 28, 2023
@PeterHageus (Author)

OK, thanks for your input! I guess our only strategy at the moment is a higher heartbeat tolerance, to avoid unnecessary disconnects.
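For reference, that heartbeat tolerance is governed by the cluster failure detector settings; a sketch with illustrative values (defaults per the Akka.NET documentation):

```hocon
akka.cluster.failure-detector {
  # Default is 3s; raising it makes the cluster more tolerant of GC pauses
  # and CPU starvation before marking a node unreachable (illustrative value)
  acceptable-heartbeat-pause = 6s
  # Phi accrual threshold; default 8.0. Higher values mean fewer false
  # positives at the cost of slower failure detection
  threshold = 8.0
}
```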
