
SBR configuration #6898

Closed
PeterHageus opened this issue Aug 23, 2023 · 3 comments
@PeterHageus

Hi. I don't know if this is an Akka.Hosting issue or an Akka.Cluster issue, but we have a problem with the default configuration:

During cluster churn (high CPU load on the servers) our seed nodes are sometimes downed. They then form a new cluster, but a minority partition. Everything restarted or started after this point joins the minority partition, while the majority can remain stable for at least 24h (until IIS recycles), leading to a long-lived partition.

Would setting KeepMajority.Role to the seed-node role make SBR take only the seed nodes into account when resolving a partition? Would that be the correct way to configure the cluster?

@Arkatufus (Contributor) commented Aug 28, 2023

This is Akka.Cluster behaviour, and the short answer is "it depends":

  • How big is your cluster?
  • What is the ratio of cluster size to seed nodes?
  • Are you willing to take the risk that a large part of the cluster is downed if it was split-brained from the smaller part that holds all the seed nodes?

When you set KeepMajority.Role, only the cluster members that have that role are considered when SBR tries to resolve a split. In practice this means you will need at least 5 seed nodes for this to work properly in production.
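For reference, a keep-majority strategy with a role filter might be configured in HOCON roughly like this (provider class and key names follow the Akka.NET SBR documentation; the "seed" role name is this thread's example):

```hocon
akka.cluster {
  # Enable the built-in split brain resolver
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-majority
    # Only members carrying this role are counted when resolving a split
    keep-majority.role = "seed"
  }
}
```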

But let's take some examples.

Cluster settings:

  • 5 seed nodes with role "seed"
  • 100 nodes with no roles
  • KeepMajority.Role is set to "seed"

Scenario 1, the happy path:
The cluster splits into these parts:

  • Part 1: 3 "seed" nodes and 80 non-role nodes
  • Part 2: 2 "seed" nodes and 20 non-role nodes

SBR resolution: SBR will down part 2.

Scenario 2, the not-so-happy path:
The cluster splits into these parts:

  • Part 1: 3 "seed" nodes and 10 non-role nodes
  • Part 2: 2 "seed" nodes and 90 non-role nodes

SBR resolution: SBR will down part 2, even though it has the "majority" of the overall node count. This is because SBR only counts the nodes that carry the declared role when deciding which side has the majority.
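The rule behind both scenarios can be sketched in a few lines (a simplification, not the actual Akka.NET implementation; tie-breaking by lowest node address is elided):

```python
# Sketch of the keep-majority decision rule when KeepMajority.Role is set:
# each side counts only the members that carry the role, and the side with
# fewer role members downs itself. Non-role members never affect the outcome.

def surviving_side(part1_role_members: int, part2_role_members: int) -> str:
    """Return which partition survives under keep-majority with a role filter."""
    if part1_role_members > part2_role_members:
        return "part1"
    if part2_role_members > part1_role_members:
        return "part2"
    return "tie"  # the real SBR breaks ties via the lowest-address node

# Scenarios 1 and 2 above both split the "seed" role 3 vs 2, so part 1
# survives in both, regardless of the 80/20 or 10/90 non-role split.
print(surviving_side(3, 2))  # part1
```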

Scenario 3, the ugly path:

  • All of the "seed" nodes are down, leaving the 100 non-role nodes stranded.
  • 1 "seed" node is restarted and joins itself to form a new cluster.

SBR resolution: nothing. You will end up with a permanent split brain, and the two sides cannot find each other because the non-role cluster does not know about the newly created cluster.

@Arkatufus (Contributor)

The only way to fix this problem is to remove the arbiter from inside the cluster, be it a Lighthouse instance or a fixed set of seed nodes. To do this you will need Akka.Management.Cluster.Bootstrap in combination with Akka.Discovery, which uses an arbiter outside of the cluster and is available for Kubernetes, Azure, and AWS.
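For illustration, a Cluster.Bootstrap setup using the Kubernetes API discovery method might look roughly like this (key names mirror the Akka.Management documentation and may differ between versions; the service name is hypothetical):

```hocon
akka {
  # Resolve contact points via the Kubernetes API instead of in-cluster seed nodes
  discovery.method = kubernetes-api
  management.cluster.bootstrap {
    contact-point-discovery {
      # Hypothetical name used to look up peer pods
      service-name = "my-akka-cluster"
      discovery-method = kubernetes-api
    }
  }
}
```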

Note that Akka.Discovery.Config is not the answer. It still uses the cluster itself as the arbiter, which defeats the purpose.

@Arkatufus Arkatufus transferred this issue from akkadotnet/Akka.Hosting Aug 28, 2023
@PeterHageus (Author)

OK, thanks for your input! I guess our only strategy at the moment is a higher heartbeat tolerance, to avoid unnecessary disconnects.
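For reference, that heartbeat tolerance is governed by the cluster failure detector settings; a sketch with illustrative values (defaults per the Akka.NET documentation):

```hocon
akka.cluster.failure-detector {
  # Default is 3s; raising it makes the cluster more tolerant of GC pauses
  # and CPU starvation before marking a node unreachable (illustrative value)
  acceptable-heartbeat-pause = 6s
  # Phi accrual threshold; default 8.0. Higher values mean fewer false
  # positives at the cost of slower failure detection
  threshold = 8.0
}
```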
