[Jepsen] Transient network partitions can disable the cluster indefinitely #886
Splitting this out from #821, since it's a distinct issue from slow recovery: in 188.8.131.52 through 184.108.40.206-b2, network partitions can cause master nodes to get "stuck". If a stuck master is the leader, no node in the cluster can make progress, disabling the cluster. Healthy master nodes will not take over as leaders automatically, though they can be induced to take over through further network partitions. Given long enough, all master nodes can become stuck, and the cluster will not service another request until those nodes are rebooted.
For instance, in this test, nodes n1 and n2 both got stuck trying to open the catalog table; n3 was healthy during the first recovery period, but declined to take over as leader. The second wave of partitions forced a leader election which made n3 the leader, and operations resumed.
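As a rough illustration of how one might spot this pattern in a test history, the sketch below checks each window in which the network was healthy for a successful operation, and reports how long the first one took to appear. It is not part of the Jepsen test itself: the function name `recovery_delays`, the input shapes, and the timestamps are all hypothetical, chosen only to mirror the scenario described above (no recovery during the first healthy window, recovery after the second wave of partitions).

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def recovery_delays(
    ok_times: List[float],
    healthy_windows: List[Tuple[float, float]],
) -> List[Optional[float]]:
    """For each (start, end) window in which the network was healthy,
    return the delay from `start` to the first successful operation
    inside that window, or None if no operation succeeded before `end`.

    Hypothetical helper: `ok_times` are completion times of successful
    operations, in seconds since the start of the test.
    """
    ok_sorted = sorted(ok_times)
    delays: List[Optional[float]] = []
    for start, end in healthy_windows:
        # Index of the first successful operation at or after `start`.
        i = bisect_left(ok_sorted, start)
        if i < len(ok_sorted) and ok_sorted[i] < end:
            delays.append(ok_sorted[i] - start)
        else:
            # No successful operation for the whole healthy window:
            # the cluster stayed unavailable despite a healed network.
            delays.append(None)
    return delays


# Invented timestamps for illustration only, not taken from the test run.
ok = [1.0, 2.0, 130.0, 131.0]
windows = [(10.0, 100.0), (120.0, 200.0)]
print(recovery_delays(ok, windows))
```

A `None` entry marks a healthy window in which the cluster never serviced a request, which is the symptom reported here; a finite delay corresponds to ordinary (if possibly slow) recovery as tracked in #821.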