Network Partitions in Akka Cluster #12

Open
arunkpatra opened this issue Jun 19, 2020 · 0 comments
Labels
bug Something isn't working research Requires specialized research

Comments


arunkpatra commented Jun 19, 2020

Network partitions observed during manual scale-out of Akka cluster nodes

The thingverse-backend application runs a node in the backend Akka cluster. Thingverse uses CQRS, so some nodes exclusively handle writes while others handle only reads. Individual nodes join the Akka cluster using a configurable discovery mechanism (e.g. Kubernetes, Consul, Akka DNS). Specific conditions must be satisfied before the Akka cluster is considered fully ready to accept traffic, e.g. a minimum number of read and write nodes. For a Thingverse Akka cluster that is fully up and ready, we must handle scenarios where nodes fail, e.g.:
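For reference, readiness conditions of this kind can be expressed in Akka's standard HOCON configuration via per-role minimum member counts. A minimal sketch (the role names `read-model` and `write-model` and the counts are assumptions, not Thingverse's actual settings):

```hocon
akka {
  cluster {
    # Cluster leader waits for this many members per role before
    # moving joining nodes to 'Up'.
    role {
      read-model.min-nr-of-members = 2
      write-model.min-nr-of-members = 2
    }
  }
}
```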

  1. Any number of nodes in the cluster might become unreachable, e.g. the network link fails altogether while the node itself keeps running just fine.
  2. A node crashes due to an application programming error or data errors.

The essence is: in situations where nodes are unreachable and did not 'leave' gracefully, there is little Akka can do except evict those nodes from the cluster and run with a reduced cluster size. Situations like these can violate the configured conditions under which the cluster is declared valid.

A network partition, or split-brain situation, happens when a group of nodes can no longer communicate with the current leader of the Akka cluster. Those nodes tend to form an isolated cluster of their own, comprising the nodes they can still talk to. We then end up with two (or more) clusters that each believe they are the real one: a 'network partition' or 'split brain'. See the Lightbend documentation for an in-depth discussion of the subject.
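To make the failure mode concrete, here is a minimal sketch (not Thingverse code) of the decision rule behind a "keep-majority" style split-brain strategy: after a partition, each side compares the members it can still reach against the last known cluster size; the minority side downs itself and the majority side downs the unreachable nodes. Class and method names are hypothetical.

```java
// Sketch of the keep-majority decision rule used by split-brain resolvers.
public class KeepMajority {
    public enum Decision { DOWN_UNREACHABLE, DOWN_SELF }

    // totalMembers: cluster size before the partition;
    // reachable: members this side can still talk to (including itself).
    public static Decision decide(int totalMembers, int reachable) {
        // A strict majority keeps the cluster; minorities self-terminate.
        // (Ties need an extra tie-breaker in practice, e.g. lowest node
        // address; here a tie conservatively downs itself.)
        return reachable * 2 > totalMembers
                ? Decision.DOWN_UNREACHABLE
                : Decision.DOWN_SELF;
    }

    public static void main(String[] args) {
        // An 8-node cluster split 5/3: the 5-node side survives,
        // the 3-node side terminates.
        System.out.println(decide(8, 5)); // DOWN_UNREACHABLE
        System.out.println(decide(8, 3)); // DOWN_SELF
    }
}
```

Note that both sides must reach opposite decisions from the same rule; if each side wrongly believes it has the majority (e.g. stale membership information), both survive and the split brain persists.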

Thingverse ships with an experimental split-brain resolver. This resolver is not yet production ready, and extensive testing with clusters of different sizes still needs to be done.
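For comparison, Akka 2.6.6+ bundles its own split-brain resolver that can be enabled purely through configuration. A sketch of the standard Akka settings (not Thingverse's experimental resolver; the timing values are illustrative):

```hocon
akka {
  cluster {
    # Use Akka's bundled split-brain resolver as the downing provider.
    downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"

    split-brain-resolver {
      # The side with the majority of last-known members survives.
      active-strategy = keep-majority
      # How long membership must be stable before acting on unreachability.
      stable-after = 20s
      # Terminate the whole partition if membership never stabilizes.
      down-all-when-unstable = on
    }
  }
}
```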

To Reproduce
Steps to reproduce the behavior:

  1. Allow the backend cluster to run normally with 4 read-model nodes and 4 write-model nodes.
  2. Now try to scale out the cluster by increasing the replica count of the thingverse-backend Kubernetes deployment.
  3. Alternatively, start killing nodes randomly without giving Kubernetes enough time to reschedule new pods.
  4. You will see random network partitions, and Akka cluster nodes will start self-terminating.
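The steps above can be driven with kubectl roughly as follows (a sketch: the deployment name, namespace, and label selector are assumptions; adjust them to the actual manifests):

```shell
# Step 2: scale out the backend deployment (e.g. from 8 to 12 replicas).
kubectl scale deployment thingverse-backend --replicas=12 -n thingverse

# Step 3: force-kill a pod without waiting for graceful shutdown,
# so the node never 'leaves' the Akka cluster cleanly.
kubectl delete pod -n thingverse --grace-period=0 --force \
  "$(kubectl get pods -n thingverse -l app=thingverse-backend \
       -o jsonpath='{.items[0].metadata.name}')"

# Step 4: watch the logs for unreachability and self-termination events.
kubectl logs -n thingverse deploy/thingverse-backend | grep -i unreachable
```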

Expected behavior
Network partitions should not occur. When a genuinely unavoidable network partition does occur, a suitable notification mechanism should exist.

@arunkpatra arunkpatra added bug Something isn't working research Requires specialized research labels Jun 19, 2020
@arunkpatra arunkpatra added this to the Thingverse - v1.0.0.M3 milestone Jun 19, 2020