Network partitions observed during manual scale-out of Akka cluster nodes
The thingverse-backend application runs a node in the backend Akka cluster. Thingverse uses CQRS, so some nodes handle writes exclusively while others handle only reads. Individual nodes join the Akka cluster via a configurable discovery mechanism (e.g. Kubernetes, Consul, Akka DNS). Specific conditions must be satisfied before the Akka cluster is considered fully ready to accept traffic, e.g. a minimum number of read and write nodes. For a Thingverse Akka cluster that is fully up and ready, we must handle scenarios where nodes fail, e.g.:
Any number of nodes in the cluster might become unreachable, e.g. the network link fails altogether while the node itself keeps running just fine.
A node crashes due to an application programming error or data errors.
In essence, when nodes are unreachable and did not 'leave' gracefully, there is little Akka can do except evict those nodes from the cluster and run with a reduced-size cluster. Situations like these can violate the configured conditions for the cluster to be considered valid.
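The readiness conditions mentioned above are typically expressed with Akka's per-role minimum member counts in HOCON configuration. A minimal sketch, assuming role names `read-model` and `write-model` and illustrative counts (not necessarily Thingverse's actual settings):

```
akka.cluster {
  # Illustrative: the cluster leader will not mark itself 'up'
  # until at least this many members of each role have joined.
  role {
    read-model.min-nr-of-members  = 2
    write-model.min-nr-of-members = 2
  }
}
```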
A network partition, or split-brain, occurs when a group of nodes can no longer communicate with the current leader of the Akka cluster. Those nodes tend to form an isolated cluster of their own, comprising the nodes they can still reach. We then end up with the unfortunate situation of a 'network partition' or 'split brain': two clusters that each believe they are the whole cluster. See the Lightbend documentation for an in-depth discussion of the subject.
Thingverse ships with an experimental split-brain resolver. This resolver is not yet production ready, and extensive testing involving different-sized clusters still needs to be done.
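For comparison, Akka itself includes a split-brain resolver (in `akka.cluster.sbr`, bundled since Akka 2.6.6) that is enabled purely through configuration; whether Thingverse's experimental resolver follows the same pattern is an assumption here. A minimal sketch:

```
akka.cluster {
  downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
  split-brain-resolver {
    # keep-majority downs the side with fewer members after a partition.
    active-strategy = keep-majority
    # How long membership must be stable before the strategy acts;
    # too short a value risks downing nodes during normal rescheduling.
    stable-after = 20s
  }
}
```

With `keep-majority`, the minority side self-terminates, which matches the self-termination behavior described in the reproduction steps below when partitions are frequent.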
To Reproduce
Steps to reproduce the behavior:
Allow the backend cluster to run normally with 4 read-model nodes and 4 write-model nodes.
Now try to scale out the backend by increasing the replica count of the thingverse-backend deployment.
Alternatively, start killing nodes randomly without giving Kubernetes enough time to reschedule new pods.
You will see random network partitions, and Akka cluster nodes will start self-terminating.
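The scale-out and pod-killing steps above can be triggered with kubectl; the deployment name, label selector, and namespace below are assumptions and may differ from the actual Thingverse manifests:

```
# Step 2: scale out from 4 replicas (names and namespace assumed)
kubectl scale deployment thingverse-backend --replicas=8 -n thingverse

# Step 3: kill pods abruptly, faster than Kubernetes can reschedule them
kubectl delete pod -l app=thingverse-backend -n thingverse \
  --grace-period=0 --force
```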
Expected behavior
Network partitions should not occur. Suitable notification mechanisms should exist for when a truly unavoidable network partition does occur.