[Jepsen] Transient network partitions can disable the cluster indefinitely #886

Closed
aphyr opened this issue Feb 22, 2019 · 3 comments

@aphyr aphyr commented Feb 22, 2019

Splitting this out from #821, since it's a distinct issue from slow recovery: in 1.1.9.0 through 1.1.13.0-b2, network partitions can cause master nodes to get "stuck". If a stuck master is the leader, no node in the cluster can make progress, disabling the cluster. Healthy master nodes will not take over as leaders automatically, though they can be induced to take over through further network partitions. Given long enough, all master nodes can become stuck, and the cluster will not service another request until those nodes are rebooted.

For instance, in this test, nodes n1 and n2 both got stuck trying to open the catalog table; n3 was healthy during the first recovery period, but declined to take over as leader. The second wave of partitions forced a leader election which made n3 the leader, and operations resumed.

[Figure: raw latency plot (latency-raw)]
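The symptom described above, a recovery period in which the network is healed but no operation succeeds, can be checked mechanically against a test history. The following is a minimal sketch, not code from the Jepsen test suite: `Op`, `stalled_windows`, and every timing in the sample history are hypothetical and chosen only to mirror the two recovery periods mentioned in the comment.

```python
from dataclasses import dataclass


@dataclass
class Op:
    time: float  # seconds since test start (hypothetical units)
    ok: bool     # True if the operation was acknowledged as successful


def stalled_windows(ops, recovery_windows):
    """Return the (start, end) recovery windows in which no operation succeeded."""
    stalled = []
    for start, end in recovery_windows:
        # A window counts as stalled if it contains no successful operation.
        if not any(op.ok and start <= op.time < end for op in ops):
            stalled.append((start, end))
    return stalled


# Hypothetical history: the cluster stays unavailable through the first
# recovery window and resumes service during the second one.
ops = [
    Op(5, True), Op(20, False), Op(70, False), Op(95, False),
    Op(130, False), Op(190, True), Op(200, True),
]
recovery_windows = [(60, 120), (180, 240)]

print(stalled_windows(ops, recovery_windows))
```

With this made-up history, only the first recovery window is reported as stalled, which corresponds to the case where n3 was healthy but did not take over as leader.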

@aphyr aphyr (Author) commented Feb 22, 2019

@amitanandaiyer has a patch in 1ae0df3 which he thinks should fix this; I'll confirm once a build is ready! :-)

@kmuthukk kmuthukk added this to To do in Jepsen Testing via automation Feb 24, 2019

@kmuthukk kmuthukk (Collaborator) commented Feb 24, 2019

Thx @aphyr - will get a build which addresses the unavailability issues reported in #821 and #886, as well as the occasional memory issue due to use of libbacktrace (#862).

@aphyr aphyr (Author) commented Feb 28, 2019

It's looking like this is fixed in 1.1.15-b16. :)

@aphyr aphyr closed this Feb 28, 2019
Jepsen Testing automation moved this from To do to Done Feb 28, 2019