
Partial network partitioning leads to cluster unavailability. #43183

Closed
101alex opened this issue Jun 13, 2019 · 4 comments

@101alex commented Jun 13, 2019

Elasticsearch version: 6.8.0

Plugins installed: []

JVM version (java -version): 1.8.0_212

OS version (uname -a if on a Unix-like system): OS[Linux/4.4.0-145-generic/amd64]

Description of the problem including expected versus actual behavior:

I have a cluster of 5 nodes (node1, node2, node3, node4, node5). Their ids are ordered as follows (this is critical for master election, which I believe is at the core of this failure): node2 < node3 < the rest of the nodes.

When the cluster first starts, node2 is elected as master, as expected. However, after a partial network partition occurs, isolating node2 from nodes 3, 4, and 5 (while node3 is able to communicate with all other nodes), the cluster gets stuck in an unavailable state. In other words, no new master is elected, and node2 steps down from being master.

I believe the expected behavior is for node3 to become the new master, since it is connected to a majority of nodes and it has the smallest id after node2. This does not happen, however. From examining the logs, it seems that node3 keeps trying to join node2, which it thinks should be the master, while it does not accept the join requests from nodes 3, 4, and 5.

Steps to reproduce:

In my setup I have 1 index with the following settings:
* Number of shards: 1
* Number of replicas: 2
* write.wait_for_active_shards: 3
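
For reference, settings like these can be applied at index creation time; the index name `test` below is a placeholder, not taken from the original report:

```json
PUT /test
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 2,
    "index.write.wait_for_active_shards": 3
  }
}
```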

With the 5 nodes running, once a partial network partition occurs (as described above), the system stays unavailable (I waited for more than 1 hour and it stayed in that state) and it does not accept index operations.

Log files are provided for the 5 nodes:

node1.log
node2.log
node3.log
node4.log
node5.log


@DaveCTurner (Contributor) commented Jun 13, 2019

Can you describe more precisely what "partial network partition" means in this context? Which nodes can communicate with each other? Is the partition symmetric or asymmetric?

I think this is fixed in 7.x by #32006, and we're unlikely to take any action to address this in the 6.x series. Can you reproduce it in a more recent version?

@101alex (Author) commented Jun 14, 2019

> Can you describe more precisely what "partial network partition" means in this context? Which nodes can communicate with each other? Is the partition symmetric or asymmetric?

A partial network partition is a partition in which two groups of nodes cannot communicate with each other, while a third group of nodes can communicate with both.

I just realized that I accidentally wrote node3 instead of node1 in some places in the original post, sorry.

In my case the three groups are:
g1: node2
g2: nodes 1, 4, and 5
g3: node3

g1 members cannot communicate with g2 members (and vice versa), while g3 members can communicate with all other nodes (and all other nodes can communicate with g3). I hope this clarifies what I mean by partial network partitioning.
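
The topology above can be sketched as a simple connectivity model (illustrative Python, not Elasticsearch code; all names are made up for the sketch). It shows that every node except node2 can still reach a quorum of the 5-node cluster, which is why one would expect a new master to be electable:

```python
# Model the partial partition described above and check which nodes
# can still see a majority of the 5-node cluster.

NODES = {"node1", "node2", "node3", "node4", "node5"}

# Symmetric link failures: the node2 <-> {node1, node4, node5} links are cut.
CUT = {frozenset({"node2", n}) for n in ("node1", "node4", "node5")}

def reachable(a: str, b: str) -> bool:
    """True if nodes a and b can communicate directly."""
    return a == b or frozenset({a, b}) not in CUT

def visible_cluster(node: str) -> set:
    """The set of nodes that `node` can communicate with (including itself)."""
    return {n for n in NODES if reachable(node, n)}

QUORUM = len(NODES) // 2 + 1  # 3 of 5

for n in sorted(NODES):
    view = visible_cluster(n)
    status = "majority" if len(view) >= QUORUM else "minority"
    print(n, sorted(view), status)
```

Running this, node2 sees only itself and node3 (a minority), while nodes 1, 3, 4, and 5 each see at least four nodes and so could, in principle, form a quorum.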

> I think this is fixed in 7.x by #32006, and we're unlikely to take any action to address this in the 6.x series. Can you reproduce it in a more recent version?

Will try to reproduce this in 7.x soon and will let you know how it goes.

Thanks.

@DaveCTurner (Contributor) commented Jun 14, 2019

> g1 members cannot communicate with g2 members (neither can g2 members communicate with g1 members) while g3 members can communicate with all other nodes (and all other nodes can communicate with g3).

Ok, this is a situation that master elections in 6.x and earlier are known not to tolerate, but in which 7.x and later clusters will still be able to elect a master. It also doesn't seem very likely to occur in the wild. It's always possible to find a partition that results in unavailability (e.g. disconnect all pairs of nodes) so we must make choices about the kinds of partition that are reasonable to tolerate. Typically we choose to concentrate on the ones that occur in practice in a properly-configured network, which don't look like this.
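
A loose caricature of the two behaviors (illustrative Python only, not the actual discovery code from either version) helps show why this partition wedges a 6.x cluster but not a 7.x one: a ping-based "follow the lowest-id live node" rule lets node3 keep chasing node2, while a quorum-vote rule lets node3 win outright.

```python
# Caricature of 6.x ping-based election vs. 7.x-style quorum election
# under the partial partition from this issue. Names are illustrative.

NODES = ["node2", "node3", "node1", "node4", "node5"]  # sorted by node id
CUT = {frozenset({"node2", n}) for n in ("node1", "node4", "node5")}

def reachable(a: str, b: str) -> bool:
    """True if nodes a and b can communicate directly."""
    return a == b or frozenset({a, b}) not in CUT

def zen_choice(node: str) -> str:
    """6.x-style: follow the reachable node with the lowest id."""
    candidates = [n for n in NODES if reachable(node, n)]
    return min(candidates, key=NODES.index)

def quorum_winner(candidate: str) -> bool:
    """7.x-style: a candidate wins if a majority of nodes can vote for it."""
    votes = sum(1 for n in NODES if reachable(candidate, n))
    return votes > len(NODES) // 2

# Zen-style: node3 still sees node2 (lowest id) and keeps trying to join it,
# while nodes 1, 4, and 5 try to follow node3, which is not a master.
print({n: zen_choice(n) for n in NODES})

# Quorum-style: node2 cannot gather a majority, but node3 can.
print({n: quorum_winner(n) for n in NODES})
```

Under the zen-style rule, node3 picks node2 and nodes 1, 4, and 5 pick node3, so no quorum ever assembles behind one candidate; under the quorum rule, node2 fails to win (2 of 5 votes) while node3 can (5 of 5).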

Closing this as there's no further action required here, but please do report back on your experiments with 7.x.
