[Jepsen] Transient network partitions can disable the cluster indefinitely #886

Closed
aphyr opened this Issue Feb 22, 2019 · 3 comments

@aphyr
commented Feb 22, 2019

Splitting this out from #821, since it's a distinct issue from slow recovery: in 1.1.9.0 through 1.1.13.0-b2, network partitions can cause master nodes to get "stuck". If a stuck master is the leader, no node in the cluster can make progress, disabling the cluster. Healthy master nodes will not take over as leaders automatically, though they can be induced to take over through further network partitions. Given long enough, all master nodes can become stuck, and the cluster will not service another request until those nodes are rebooted.

For instance, in this test, nodes n1 and n2 both got stuck trying to open the catalog table; n3 was healthy during the first recovery period, but declined to take over as leader. The second wave of partitions forced a leader election which made n3 the leader, and operations resumed.

[Image: latency-raw 9]

@aphyr

Author

commented Feb 22, 2019

@amitanandaiyer has a patch in 1ae0df3 which he thinks should fix this; I'll confirm once a build is ready! :-)

@kmuthukk kmuthukk added the bug label Feb 24, 2019

@kmuthukk kmuthukk added this to To do in Jepsen Testing via automation Feb 24, 2019

@kmuthukk

Collaborator

commented Feb 24, 2019

Thx @aphyr - will get a build which addresses the unavailability issues reported in #821 and #886, as well as the occasional memory issue due to use of libbacktrace (#862).

@aphyr

Author

commented Feb 28, 2019

It's looking like this is fixed in 1.1.15-b16. :)

@aphyr aphyr closed this Feb 28, 2019

Jepsen Testing automation moved this from To do to Done Feb 28, 2019
