Network Partitions in Akka Cluster #12

Open
arunkpatra opened this issue Jun 19, 2020 · 0 comments
Labels
bug Something isn't working research Requires specialized research

Comments


arunkpatra commented Jun 19, 2020

Network partitions observed during manual scale-out of Akka cluster nodes

The thingverse-backend application runs a node in the backend Akka cluster. Thingverse uses CQRS, so some nodes exclusively handle writes while others handle only reads. Individual nodes join the Akka cluster using a configurable discovery mechanism (e.g. Kubernetes, Consul, Akka DNS). Specific conditions must be satisfied before the Akka cluster is considered fully ready to accept traffic, e.g. a minimum number of read and write nodes. For a Thingverse Akka cluster that is fully up and ready, we must handle scenarios where nodes fail, e.g.:
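For reference, readiness conditions of this kind can be expressed in Akka's standard HOCON configuration via per-role minimum member counts. A minimal sketch (the role names `read-model` and `write-model` and the counts are assumptions, not Thingverse's actual settings):

```hocon
akka {
  cluster {
    # Cluster leader waits for this many members per role before
    # moving joining nodes to 'Up'.
    role {
      read-model.min-nr-of-members = 2
      write-model.min-nr-of-members = 2
    }
  }
}
```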

  1. Any number of nodes in the cluster might become unreachable, e.g. the network link fails altogether while the node itself keeps running just fine.
  2. A node crashes due to an application programming error or data errors.

The essence is: in situations where nodes are unreachable and did not 'leave' gracefully, there is little Akka can do except evict those nodes from the cluster and run with a reduced cluster size. Situations like these can violate the configured conditions under which the cluster is declared valid.

A network partition, or split-brain situation, happens when a group of nodes can no longer communicate with the current leader of the Akka cluster. Those nodes tend to form an isolated cluster of their own, comprising the nodes they can still talk to. We then end up with two (or more) clusters that each believe they are the real one: a 'network partition' or 'split brain'. See the Lightbend documentation for an in-depth discussion of the subject.
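To make the failure mode concrete, here is a minimal sketch (not Thingverse code) of the decision rule behind a "keep-majority" style split-brain strategy: after a partition, each side compares the members it can still reach against the last known cluster size; the minority side downs itself and the majority side downs the unreachable nodes. Class and method names are hypothetical.

```java
// Sketch of the keep-majority decision rule used by split-brain resolvers.
public class KeepMajority {
    public enum Decision { DOWN_UNREACHABLE, DOWN_SELF }

    // totalMembers: cluster size before the partition;
    // reachable: members this side can still talk to (including itself).
    public static Decision decide(int totalMembers, int reachable) {
        // A strict majority keeps the cluster; minorities self-terminate.
        // (Ties need an extra tie-breaker in practice, e.g. lowest node
        // address; here a tie conservatively downs itself.)
        return reachable * 2 > totalMembers
                ? Decision.DOWN_UNREACHABLE
                : Decision.DOWN_SELF;
    }

    public static void main(String[] args) {
        // An 8-node cluster split 5/3: the 5-node side survives,
        // the 3-node side terminates.
        System.out.println(decide(8, 5)); // DOWN_UNREACHABLE
        System.out.println(decide(8, 3)); // DOWN_SELF
    }
}
```

Note that both sides must reach opposite decisions from the same rule; if each side wrongly believes it has the majority (e.g. stale membership information), both survive and the split brain persists.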

Thingverse ships with an experimental split-brain resolver. This resolver is not yet production ready, and extensive testing with clusters of different sizes still needs to be done.
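For comparison, Akka 2.6.6+ bundles its own split-brain resolver that can be enabled purely through configuration. A sketch of the standard Akka settings (not Thingverse's experimental resolver; the timing values are illustrative):

```hocon
akka {
  cluster {
    # Use Akka's bundled split-brain resolver as the downing provider.
    downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"

    split-brain-resolver {
      # The side with the majority of last-known members survives.
      active-strategy = keep-majority
      # How long membership must be stable before acting on unreachability.
      stable-after = 20s
      # Terminate the whole partition if membership never stabilizes.
      down-all-when-unstable = on
    }
  }
}
```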

To Reproduce
Steps to reproduce the behavior:

  1. Allow the backend cluster to run normally with 4 read-model nodes and 4 write-model nodes.
  2. Now try to scale out the cluster by increasing the replica count of the thingverse-backend Kubernetes deployment.
  3. Alternatively, start killing nodes randomly without giving Kubernetes enough time to reschedule new pods.
  4. You will see random network partitions, and Akka cluster nodes will start self-terminating.
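The steps above can be driven with kubectl roughly as follows (a sketch: the deployment name, namespace, and label selector are assumptions; adjust them to the actual manifests):

```shell
# Step 2: scale out the backend deployment (e.g. from 8 to 12 replicas).
kubectl scale deployment thingverse-backend --replicas=12 -n thingverse

# Step 3: force-kill a pod without waiting for graceful shutdown,
# so the node never 'leaves' the Akka cluster cleanly.
kubectl delete pod -n thingverse --grace-period=0 --force \
  "$(kubectl get pods -n thingverse -l app=thingverse-backend \
       -o jsonpath='{.items[0].metadata.name}')"

# Step 4: watch the logs for unreachability and self-termination events.
kubectl logs -n thingverse deploy/thingverse-backend | grep -i unreachable
```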

Expected behavior
Network partitions should not occur. When a genuinely unavoidable network partition does occur, a suitable notification mechanism should exist.

@arunkpatra arunkpatra added bug Something isn't working research Requires specialized research labels Jun 19, 2020
@arunkpatra arunkpatra added this to the Thingverse - v1.0.0.M3 milestone Jun 19, 2020