minimum_master_nodes does not prevent split-brain if splits are intersecting #2488

Closed
saj opened this Issue Dec 17, 2012 · 103 comments


saj commented Dec 17, 2012

G'day,

I'm using ElasticSearch 0.19.11 with the unicast Zen discovery protocol.

With this setup, I can easily split a 3-node cluster into two 'hemispheres' (continuing with the brain metaphor) with one node acting as a participant in both hemispheres. I believe this to be a significant problem, because now minimum_master_nodes is incapable of preventing certain split-brain scenarios.

Here's what my 3-node test cluster looked like before I broke it:

Here's what the cluster looked like after simulating a communications failure between nodes (2) and (3):

Here's what seems to have happened immediately after the split:

  1. Node (2) and (3) lose contact with one another. (zen-disco-node_failed ... reason failed to ping)
  2. Node (2), still master of the left hemisphere, notes the disappearance of node (3) and broadcasts an advisory message to all of its followers. Node (1) takes note of the advisory.
  3. Node (3) has now lost contact with its old master and decides to hold an election. It declares itself winner of the election. On declaring itself, it assumes master role of the right hemisphere, then broadcasts an advisory message to all of its followers. Node (1) takes note of this advisory, too.

At this point, I can't say I know what to expect to find on node (1). If I query both masters for a list of nodes, I see node (1) in both clusters.
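
(For anyone wanting to reproduce the check: I'm simply asking each suspected master for its own view of the cluster. Hostnames below are placeholder names for my test nodes.)

```sh
# Ask each side of the split which nodes it sees and who it thinks is master.
# node2 and node3 are placeholder hostnames for my test cluster.
curl -s 'http://node2:9200/_cluster/state?pretty' | grep -E '"master_node"|"name"'
curl -s 'http://node3:9200/_cluster/state?pretty' | grep -E '"master_node"|"name"'
# If the same node appears in the "nodes" section of both responses while
# "master_node" differs, the splits are intersecting.
```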

Let's look at minimum_master_nodes as it applies to this test cluster. Assume I had set minimum_master_nodes to 2. Had node (3) been completely isolated from nodes (1) and (2), I would not have run into this problem. The left hemisphere would have enough nodes to satisfy the constraint; the right hemisphere would not. This would continue to work for larger clusters (with an appropriately larger value for minimum_master_nodes).

The problem with minimum_master_nodes is that it does not work when the split brains are intersecting, as in my example above. Even on a larger cluster of, say, 7 nodes with minimum_master_nodes set to 4, all that needs to happen is for the 'right' two nodes to lose contact with one another (a master election has to take place) for the cluster to split.
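
For reference, here is the kind of configuration I'm describing; it is the standard majority setting ((N / 2) + 1), and it still does not help in the intersecting case above (hostnames are placeholders):

```yaml
# elasticsearch.yml on each node of the 3-node test cluster (placeholder hosts)
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["node1", "node2", "node3"]
# majority of master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```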

Is there anything that can be done to detect the intersecting split on node (1)?

Would #1057 help?

Am I missing something obvious? :)

moscht commented Dec 18, 2012

We also had a similar issue at some point, where minimum_master_nodes did not prevent the cluster from having two different views of the nodes at the same time.

As our indices were created automatically, some of the indices were created twice, once in each half of the cluster, with the two masters broadcasting different states; after a full cluster restart some shards could not be allocated, as the state had been mixed up. This was on 0.17, so I am not sure if data would still be lost, as the state is now saved with the shards. But the other question is what happens when an index exists twice in the cluster (as it has been created on every master).

I think we should have a method to recover from such a situation. As I don't know exactly how zen discovery works, I can not say how to solve it, but IMHO a node should only be in one cluster: in your second image, node 1 should either be with node 2, preventing 3 from becoming master, or with node 3, preventing 2 from staying master.
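
A partial mitigation for the duplicate auto-created indices (it does nothing for the split itself) would be to turn off automatic index creation, so an isolated master cannot silently create an index a second time; something like:

```yaml
# elasticsearch.yml -- only mitigates the duplicate auto-created indices
# described above; it does not prevent the split-brain itself
action.auto_create_index: false
```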

tallpsmith commented Dec 18, 2012

See issue #2117 as well. I'm not sure if the unicast discovery is making it worse for you, but I think we captured the underlying problem over on that issue; would like your thoughts too.

saj commented Dec 20, 2012

From #2117:

> The split brain occurs if the nodeId(UUID) of the disconnected node is such that the disconnected node picks itself as the next logical master while pinging the other nodes(NodeFaultDetection).

Ditto.

> The split brain only occurs on the second time that the node is disconnected/isolated.

I see a split on the first partial isolation. To me, these bug reports look like two different problems.

trollybaz commented Apr 3, 2013

I believe I ran into this issue yesterday in a 3-node cluster: a node elects itself master when the current master is disconnected from it. The remaining participant node toggles between having the other nodes as its master before settling on one. Is this what you saw, @saj?

saj commented Apr 3, 2013

Yes, @trollybaz.

I ended up working around the problem (in testing) by using elasticsearch-zookeeper in place of Zen discovery. We already had reliable Zookeeper infrastructure up for other applications, so this approach made a whole lot of sense to me. I was unable to reproduce the problem with the Zookeeper discovery module.

tallpsmith commented Apr 4, 2013

I'm pretty sure we're suffering from this in certain situations, and I don't think it's limited to unicast discovery.

We've had bad networking, Virtual Machine stalls (the result of SAN issues, or VMware doing weird stuff), and even heavy GC activity cause enough pauses for aspects of the split brain to occur.

We were originally running pre-0.19.5, which contained an important fix for an edge case I thought we were suffering from, but since moving to 0.19.10 we've had at least one split brain (VMware->SAN related) that caused 1 of the 3 ES nodes to lose touch with the master and declare itself master, while still maintaining links back to the other nodes.

I'm going to be tweaking our ES logging config to output DEBUG-level discovery to a separate file so that I can properly trace these cases, but there have just been too many of these not to conclude that ES is mishandling these adversarial environment cases.

I believe #2117 is still an issue and is an interesting edge case, but I think this issue here best represents the majority of the problems people are having. My gut/intuition says the probability of this issue occurring drops with a larger cluster, so the 3-node, minimum_master_nodes=2 setup is the most prevalent case.

It seems like when the 'split brain' new master connects to its known child nodes, any node that already has an upstream connection to an existing master should probably be flagging it as a problem, and telling the newly connected master node "hey, I don't think you fully understand the cluster situation".
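
For anyone wanting to do the same, something along these lines in logging.yml should route discovery output to its own file (the appender name and layout here are illustrative; mirror the slow-log entries in the stock logging.yml):

```yaml
# logging.yml sketch -- send discovery DEBUG output to a separate rolling file
logger:
  discovery: DEBUG, discovery_log_file

additivity:
  discovery: false          # keep the noise out of the main cluster log

appender:
  discovery_log_file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}_discovery.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
```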

brusic commented Apr 5, 2013

I believe there are two issues at hand. One is the set of possible culprits for a node being disconnected from the cluster: network issues, large GCs, a discovery bug, etc. The other issue, and the more important one IMHO, is the failure of the master election process to detect that a node belongs to two separate clusters (with different masters). Clusters should embrace node failures for whatever reason, but master election needs to be rock solid. A tough problem in systems without an authoritative process such as ZooKeeper.

To add more data to the issue: I have seen the issue on two different 0.20RC1 clusters. One having eight nodes, the other with four.

tallpsmith commented Apr 5, 2013

I'm not sure the former is really something ES should be actively dealing with; the latter I agree with, and it is the main point here: how ES detects and recovers from cases where 2 masters have been elected.

There was supposed to have been some code in, I think, 0.19.5 that 'recovers' from this state by choosing the side that has the most recent ClusterStatus object (see issue #2042), but in practice it doesn't appear to be working as expected, because we get these child nodes accepting connections from multiple masters.

I think gathering the discovery-level DEBUG logging from the multiple nodes and presenting it here is the only way to get further traction on this case.

It's possible going through the steps in Issue #2117 may uncover edge cases related to this one (even though the source conditions are different); at least it might be a reproducible case to explore.

@s1monw nudge - have you had a chance to look into #2117 at all... ? :)

brusic commented Apr 5, 2013

Paul, I agree that the former is not something to focus on. Should have stated that. :) The beauty of many of the new big data systems is that they embrace failure. Nodes will come and go, either due to errors or just simple maintenance. #2117 might have a different source condition, but the recovery process after the fact should be identical.

I have enabled DEBUG logging at the discovery level and I can pinpoint when a node has left/joined a cluster, but I still have no insights on the election process.

tallpsmith commented May 24, 2013

We suffered from this the other day when an accidental provisioning error left a 4GB ES heap running on a host with 4GB of O/S memory, which was always going to end in trouble. The node swapped, the process hung, and the intersection issue described here happened.

Yes, the provisioning error could have been avoided, and yes, use of mlockall would probably have prevented the destined-to-die-a-horrible-swap-death, but there are other scenarios that could cause a hung process (bad I/O causing stalls, for example) where the way ES handles the cluster state is poor and leads to this problem.

We hope very much someone is looking hard into ways to make ES a bit more resilient when facing these situations, to improve data integrity... (goes on bended knees while pleading)
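
For the swap part at least, the usual guard is the setting below (plus a matching memlock ulimit for the ES user); it would not have fixed the underlying cluster-state handling, though:

```yaml
# elasticsearch.yml -- lock the JVM heap into RAM so a mis-sized node cannot
# swap its heap out (requires an appropriate memlock ulimit)
bootstrap.mlockall: true
```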

otisg commented May 24, 2013

Btw. why not adopt ZK, which I believe would make this situation impossible(?)? I don't love the extra process/management that the use of ZK would imply..... though maybe it could be embedded, like in SolrCloud, to work around that?

brusic commented May 24, 2013

From my understanding, the single embedded Zookeeper model is not ideal for production, and a full Zookeeper cluster is preferred. I have never tried it myself, so I cannot personally comment.

s1monw commented May 24, 2013

FYI - there is a zookeeper plugin for ES

otisg commented May 24, 2013

Oh, I didn't mean to imply a single embedded ZK. I meant N of them in different ES processes. Right Simon, there is the plugin, but I suspect people are afraid of using it because it's not clear if it's 100% maintained, if it works with the latest ES and such. So my Q is really about adopting something like that and supporting it officially. Is that a possibility?

mpalmer commented May 24, 2013

@otisg: The problem with the ZK plugin is that, with clients being part of the cluster, they need to know about ZK in order to be able to discover the servers in the cluster. Some client libraries (such as the one used by the application that started this bug report -- I'm a colleague of Saj's) don't support ZK discovery. In order for ZK to be a useful alternative in general, there either needs to be universal support for ZK in client libraries, or a backwards-compatible way for non-ZK-aware client libraries to discover the servers (perhaps a ZK-to-Zen translator or something... I don't know, I've got bugger-all knowledge of how ES actually works under the hood).

aochsner commented Jun 10, 2013

We've gotten into this situation twice now in our QA environment: 3 nodes, minimum_master_nodes = 2. Log files are at https://gist.github.com/aochsner/5749640 (sorry they are big and repetitive).

We are on 0.9.0 and using multicast.

As a bit of a walkthrough. sthapqa02 was the master and all it noticed was that sthapqa01 went bye bye and never rejoined. According to sthapqa02, the cluster was sthapqa02 (itself) and sthapqa03.

sthapqa01 is what appeared to have problems. It couldn't reach sthapqa02 and decided to create a cluster between itself and sthapqa03.

sthapqa03 went along w/ sthapqa01 to create a cluster and didn't notify sthapqa02.

So 01 and 03 are in a cluster and 02 thinks it's in a cluster w/ 03.
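
For anyone else debugging a similar setup, ruling multicast out by pinning an explicit unicast host list looks roughly like this (hostnames are our QA boxes, purely illustrative):

```yaml
# elasticsearch.yml -- switch from multicast to an explicit unicast host list
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["sthapqa01", "sthapqa02", "sthapqa03"]
discovery.zen.minimum_master_nodes: 2
```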

kimchy commented Aug 13, 2013

Just an update that this behaves much better in 0.90.3 with a dedicated master nodes deployment, but we are working on a better implementation down the road (with potential constraints on requiring fixed dedicated master nodes, by the nature of some consensus algorithm implementations; we will see how it goes...).
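
By a dedicated master nodes deployment I mean roughly the layout below: a few small nodes that are master-eligible but hold no data, with the data nodes not master-eligible (a sketch, not a complete config):

```yaml
# elasticsearch.yml sketch of a dedicated-master deployment

# on each of the three dedicated master nodes:
node.master: true
node.data: false

# on every data node (shown as comments here; it lives in a separate file):
#   node.master: false
#   node.data: true

# quorum of the three dedicated master-eligible nodes
discovery.zen.minimum_master_nodes: 2
```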

tallpsmith commented Aug 14, 2013

@kimchy that sounds promising. I would love to understand more of the changes in the 0.90.x series in this area, to understand what movements are going on. Is there a commit hash you remember that you could point me to, so I could take a peek?

By dedicated master node, do you mean nodes that just perform the master role, and not the data role (so additional nodes on top of the existing data nodes)? This would sort of mimic how adding Zookeeper as a master election co-ordinator works?

phungleson commented Aug 14, 2013

@kimchy Does 0.90.2 have the same features, or are they only available in 0.90.3?

brusic commented Aug 14, 2013

Shay, thanks for the update.

For us, the problem has gone away with the adoption of 0.90.2. The actual underlying problem might not have been fixed, but the improved memory usage with elasticsearch 0.90/Lucene 4 has eliminated large GCs, which probably were the root cause of our disconnections. No disconnections means no need to elect another master.

btiernay commented Sep 20, 2013

This situation happened to us recently running 0.90.1 with minimum_master_nodes set to N/2 + 1, with N = 15. I'm not sure what the root cause was, but this shows that such a scenario is probable in larger clusters as well.

trevorreeves commented Oct 18, 2013

We have been frequently experiencing this 'mix brain' issue in several of our clusters - up to 3 or 4 times a week. We have always had dedicated master eligible nodes (i.e. master=true, data=false), correctly configured minimum_master_nodes and have recently moved to 0.90.3, and seen no improvement in the situation.

As a side note, the initial cause of the disruption to our cluster has 'something' to do with the network links between the nodes, I imagine - one of the master-eligible nodes occasionally loses connectivity with the master node briefly; "transport disconnected (with verified connect)" is all we get in the logs. We haven't figured out this issue yet (something is killing the TCP connection?), but it explains the frequency with which we are affected by this bug, as it's a double hit given the inability of the cluster to recover itself correctly when this disconnect occurs.

@kimchy Is there any latest status on the 'better implementation down the road' and when it might be delivered?

Sounds like zookeeper is our reluctant interim solution.

tallpsmith commented Oct 22, 2013

Just as I was beginning plans to move to a set of dedicated master-only nodes, I read @trevorreeves' post above where he's still hitting the same problem. Doh!

Our situation appears to be IOWait related: a master node (also a data node) hits an issue that causes extensive IOWait (a _scroll based search can trigger this; we already cap the number of streams and MB/second recovery rate through settings), and the JVM becomes unresponsive. The other nodes doing Master Fault Detection are configured with 3 x 30-second ping timeouts, all of which fail, and then they give up on the master.

I'm not really sure what is stalling the master node JVM - I'm positive it's not GC related, and it's definitely linked to heavy IOWait. We have one node in one installation with a 'tenuous' connection to a NetApp storage backing the volume used by the ES local disk image, and that seems to be the underlying root of our issues, but it is the way the ES cluster fails to recover from this situation and does not properly re-establish a consensus on the cluster that causes problems (I don't mind any weirdness during times of whacky IO patterns that form the split brain so much as I dislike the way ES fails to keep track of who thinks who's who in the cluster).
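
(For completeness, the recovery caps I mentioned are set via settings along these lines; the values are just the ones we happen to run, and the names are from the 0.90 series:)

```yaml
# elasticsearch.yml -- cap recovery I/O, as mentioned above; example values,
# setting names as of the 0.90 series
indices.recovery.max_bytes_per_sec: 20mb
indices.recovery.concurrent_streams: 3
```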

At this point, it does seem like the Zookeeper based discovery/cluster management plugin is the most reliable way, though I'm not looking forward to setting up that up to be honest.

nik9000 commented Nov 21, 2013

We haven't hit this but this report is worrying - is this being worked on? This is the kind of thing that'd make us switch to Zookeeper.

brusic commented Nov 21, 2013

Just wanted to point out to Nik a comment in the other related issue: #2117 (comment)

"Unfortunately, this situation can in-fact occur with zen discovery at this point. We are working on a fix for this issue which might take a bit until we have something that can bring a solid solution for this."

I wonder what has happened since then and if their findings correspond to my scenario.

For my clusters, split-brains always occur when a node becomes isolated and then elects itself as master. More visibility (logging) into the election process would be helpful. Re-discovery would be helpful as well, since I rarely see the cluster self-heal despite being in erroneous situations (nodes belonging to two clusters). I am on version 0.90.2, so I am not sure if I am perhaps missing a critical update, although I do scan the issues and commits.

aphyr commented Dec 11, 2013

Could you do me a huge favor and not patch this until, like, May or so? I need to finish some other things before the next installation of Jepsen. ;-)

bitsofinfo commented Jan 2, 2014

Is there any update on this or timeline for when it will be fixed?

ghost commented Jan 15, 2014

Ran into this very problem on a 4 node cluster.

Node 1 and Node 2 got disconnected and elected themselves as masters; Node 3 and Node 4 remained followers of both Node 1 and Node 2.

We do not have the option of running ZK.

Does anyone know how the election process is governed? (I know it runs off the Paxos consensus algorithm.) In layman's terms, does each follower vote exactly once, or can it cast multiple votes?

amitelad7 commented Feb 8, 2014

We just ran into this problem on a cluster of 41 data nodes and 5 master nodes running 0.90.9.
@kimchy is your recommendation to use zookeeper and not zen?

ghost commented Feb 17, 2014

@amitelad7
You have a few options running with Zen: you can increase the fd timeouts/retries/intervals if your network/node is unresponsive. The other option is to explicitly define master nodes, but in a case like yours where you have 5 masters it may get tricky.
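
i.e. something along these lines in elasticsearch.yml; the values are only an example of "more forgiving", not a recommendation:

```yaml
# more forgiving master/node fault detection -- example values only
discovery.zen.fd.ping_interval: 15s   # default is 1s
discovery.zen.fd.ping_timeout: 60s    # default is 30s
discovery.zen.fd.ping_retries: 6      # default is 3
```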

mycrEEpy commented Mar 13, 2014

We experienced this problem in our test environment because of tcp connections (heartbeat?) getting dropped by a firewall after some time leading to the "transport disconnected (with verified connect)" error which results in a split brain as described in this issue.

I configured the "net.ipv4.tcp_keepalive_time" variable in the /etc/sysctl.conf to a lower value (e.g. 600 equals 10 minutes) which fixed the problem for us. No disconnects, no new master election, no split brain.

But giving my +1 for this issue to get fixed asap as it could still occur.
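
For anyone else behind a firewall that drops idle connections, the change is just (600 being the value mentioned above):

```sh
# apply immediately, then persist the same value in /etc/sysctl.conf
sysctl -w net.ipv4.tcp_keepalive_time=600
echo 'net.ipv4.tcp_keepalive_time = 600' >> /etc/sysctl.conf
```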

nikicat commented Mar 13, 2014

👍

AeroNotix commented Mar 24, 2014

Out of interest are you all running ES on EC2?

amitelad7 commented Mar 24, 2014

we're running on a private cloud of our own

AeroNotix commented Mar 24, 2014

@amitelad7 oh man >< Even worse.

41 nodes? Crazy. Did you try lowering the TCP keepalive setting like @mycrEEpy mentioned?

amitelad7 commented Mar 24, 2014

It's actually been quite stable over the past few weeks, so we haven't worked on further optimizations :)

AeroNotix commented Mar 24, 2014

@amitelad7 what does "quite stable" mean? :)

brusic commented Mar 24, 2014

We are also running on a private cloud.

Part of our problem was incorrect Elasticsearch documentation. The docs listed the default ping timeout as 2s, so in an effort to improve the cluster, we raised the value to 5s. In reality the default is 30s, so I was actually lowering the value. The documentation is now fixed. We are now more resilient to network failures.

AeroNotix commented Mar 24, 2014

@aphyr did you do any analysis using Jepsen on Elasticsearch?

aphyr commented Mar 26, 2014

Still pending. Been a bit overwhelmed.

pilvitaneli commented Aug 19, 2014

@kimchy I've been running jepsen tests on improve_zen branch quite regularly and four nemeses (isolate-self-primaries-nemesis, nemesis/partition-random-halves, nemesis/partition-halves, nemesis/partitioner nemesis/bridge) seem to transiently fail, i.e. 2-4 runs out of ten seem to result in some lost documents. One nemesis (nemesis/partition-random-node) hasn't failed in a few thousand successive runs, so would consider that scenario to be fixed.

kimchy commented Aug 19, 2014

@pilvitaneli thanks for the effort! We have been running it as well on our end. In short, we verified that the split brain doesn't seem to happen anymore, but we still need some work around the replication logic, in addition to it, to strengthen certain failure cases. We identified the case, but our hope is to push improve_zen with the fix to the split brain as soon as possible, and then progress on the replication aspect.

We are in the final stages of tests on improve_zen, now mainly doing verification tests at the scale of hundreds of nodes, since improve_zen now does a round of gossip on master failure and we had to work on optimizing the resource usage in that case.

I hope that in the next few days we will publish a new resiliency status page (writing it now; took some time, summer... :) ). The resiliency status page will shed light on, and be a good place to aggregate, all the effort going into any aspect of resiliency in ES (there is much work being done besides this issue): the work that has already been done, the work that is in progress, and the things that we know are still left to be done.

bleskes added a commit to bleskes/elasticsearch that referenced this issue Sep 1, 2014

[Discovery] accumulated improvements to ZenDiscovery
Merging the accumulated work from the feature/improve_zen branch. Here are the highlights of the changes:

__Testing infra__
- Networking:
    - all symmetric partitioning
    - dropping packets
    - hard disconnects
    - Jepsen Tests
- Single node service disruptions:
    - Long GC / Halt
    - Slow cluster state updates
- Discovery settings
    - Easy to setup unicast with partial host list

__Zen Discovery__
- Pinging after master loss (no local elects)
- Fixes the split brain issue: #2488
- Batching join requests
- More resilient joining process (wait on a publish from master)

Closes #7493

@bleskes bleskes added v2.0.0 labels Sep 1, 2014

bleskes commented Sep 1, 2014

I'm closing this issue, as it is solved, as specified, by the changes made in #7493. Of course, there is more work to be done and the effort continues.

Thx for all the input and discussion.

@bleskes bleskes closed this Sep 1, 2014

AeroNotix commented Sep 1, 2014

@bleskes so it's 100% fixed?

bleskes commented Sep 1, 2014

this issue (partial network splits causing split brain) is fixed now, yes.

AeroNotix commented Sep 1, 2014

Interesting, will have to confirm that myself with Jepsen tests.

shikhar commented Sep 1, 2014

> still need some work around the replication logic in addition to it to strengthen certain failure cases

is there an issue(s) open for this?

bleskes commented Sep 1, 2014

@AeroNotix sure, let me know what you run into. Do note, though, that Jepsen tests more than what is stated in this issue - for example, the document replication model.

@shikhar there is more than one thing to do. I think the best way to follow the work is through the resiliency label.

mschirrmeister commented Sep 1, 2014

Is there an ETA for when 1.4 will be released, or will this even go into the next 1.3.x update?

kimchy commented Sep 1, 2014

@shikhar things take a bit longer than expected, but expect issue(s) for the rest of the known work to be opened in the next few days, as well as the status page I talked about (just came back from vacation personally :) ).

kimchy commented Sep 1, 2014

@mschirrmeister this feature is not planned to be backported to 1.3; it's too big of a change. No concrete ETA for 1.4; hopefully we will have a release (possibly first in beta form) in the next couple of weeks.

bleskes commented Sep 3, 2014

@shikhar FYI - I opened a ticket for the issue we discussed above: #7572

shikhar commented Sep 3, 2014

thanks @bleskes!

kelaban commented Oct 15, 2014

@kimchy, has the behavior of minimum_master_nodes changed in the new implementation, as stated here? As I currently understand the setting, if fewer than N nodes are available, no cluster will exist (reads or writes).

kimchy commented Oct 15, 2014

@kelaban the "previous" behavior is the same as the current one: when N nodes are not available, that side of the cluster becomes blocked. In the new implementation (1.4), there is an option to decide whether reads will still be allowed on that side or not.
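
For reference, in 1.4 that option is exposed as a discovery setting; a sketch (check the 1.4 docs for the exact name and default):

```yaml
# 1.4 sketch -- what the blocked side rejects when minimum_master_nodes is
# not satisfied; verify against the 1.4 documentation
discovery.zen.no_master_block: all      # reject both reads and writes
# discovery.zen.no_master_block: write  # reject writes only, allow reads
```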

ghost commented Nov 4, 2014

@kimchy Where can we find this option? Should it already be present in the 1.4 branch?
We came across the split-brain issue and we want to know whether this is fixed in 1.4.

aphyr commented Apr 4, 2015

> this issue (partial network splits causing split brain) is fixed now, yes.

I'm not sure why this issue was closed--people keep citing it and saying the problem is solved, but the Jepsen test from earlier in this thread still fails. Partial network partitions (and, for that matter, clean network partitions, and single-node partitions, and single-node pauses) continue to result in split-brain and lost data, for both compare-and-set and document-creation tests. I don't think the changes from #7493 were sufficient to solve the problem, though they may have improved the odds of successfully retaining data.

For instance, here's a test in which we induce randomized 120-second-long intersecting partitions, for 600 seconds, with 10 seconds of complete connectivity in between each failure. This pattern resulted in 22/897 acknowledged documents being lost due to concurrent, conflicting primary nodes. You can reproduce this in Jepsen 7d0a718 by going to the elasticsearch directory and running `lein test :only elasticsearch.core-test/create-bridge` -- it may take a couple of runs to actually trigger the race though.

bleskes commented Apr 4, 2015

> I'm not sure why this issue was closed

This issue, as stated, relates to having two master nodes elected during a partial network split, despite min_master_nodes. This issue should be solved now. The thinking is that we will open issues for different scenarios as they are discovered. An example is #7572, as well as your recent tickets (#10407 & #10426). Once we figure out the root cause of those failures (and the one mentioned in your previous comment), and if it turns out to be similar to this issue, it will of course be re-opened.

speedplane commented Feb 26, 2016

Not directly on topic to this issue, but why is it so difficult to avoid/prevent this split brain issue? If there are two master nodes on a network (ie, a split brain configuration), why can't there be some protocol for the two masters to figure out which one should become a slave?

I imagine some mechanism would need to detect that the system is in a split-brain state, and then a heuristic would be applied to choose the real master (e.g., oldest running server, most number of docs, random choice, etc.). This probably takes work to do, but it does not seem too difficult.

@hamiltop

hamiltop commented Feb 26, 2016

Michael: Split brain occurs precisely because the two masters can't communicate. If they could, they would resolve it.

@speedplane

speedplane (Contributor) commented Feb 26, 2016

Got it. Earlier this week, two nodes in my cluster appeared to be fighting over which was the master of the cluster. They were both on the same network and, I believe, in communication with each other, but they went back and forth over which was the master. I shut down one of the nodes, gave it five minutes, restarted that node, and everything was fine. I thought that this was a split-brain issue, but I guess it may be something else.
@jasontedor

jasontedor (Member) commented Feb 26, 2016

> Earlier this week, two nodes in my cluster appeared to be fighting over which was the master of the cluster.

@speedplane Do you have exactly two master-eligible nodes? Do you have minimum master nodes set to two? If you're going to run with exactly two master-eligible nodes you should, although this means that your cluster becomes semi-unavailable if one of the masters faults; ideally, if you have multiple master-eligible nodes, you'll have at least three and set minimum master nodes to a quorum of them.

> I thought that this was a split-brain issue, but I guess it may be something else.

Split brain is when two nodes in a cluster are simultaneously acting as masters for that cluster.
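
To make the quorum recommendation above concrete, here is a minimal sketch, not from the original thread, of computing the quorum for three master-eligible nodes and applying discovery.zen.minimum_master_nodes through the cluster settings API on a pre-7.x cluster. The node address, the choice of the persistent setting, and the three-node count are assumptions for illustration.

```python
import json
import urllib.request

# Assumed example values: three master-eligible nodes, node reachable on localhost:9200.
MASTER_ELIGIBLE_NODES = 3
QUORUM = MASTER_ELIGIBLE_NODES // 2 + 1  # quorum of 2 for a 3-node cluster

# discovery.zen.minimum_master_nodes is dynamically updatable on pre-7.x clusters,
# so it can be applied via the cluster settings API as well as in elasticsearch.yml.
body = json.dumps({
    "persistent": {"discovery.zen.minimum_master_nodes": QUORUM}
}).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:9200/_cluster/settings",  # assumed node address
    data=body,
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))
```

The same value can equally be set statically as discovery.zen.minimum_master_nodes in each node's elasticsearch.yml.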

@speedplane

speedplane (Contributor) commented Feb 26, 2016

@jasontedor Yes, I had exactly two nodes, and minimum master nodes was set to one. I did this intentionally for the exact reason you described. It appeared that the two nodes were simultaneously acting as masters, but they were both in communication with each other, so shouldn't they be able to resolve it, as @hamiltop suggests?

@jasontedor

jasontedor (Member) commented Feb 26, 2016

> Yes, I had exactly two nodes, and minimum master nodes was set to one.

@speedplane This is bad because it does subject you to split brain.

> I did this intentionally for the exact reason you described.

That's not what I recommend. Either drop to one node (and lose high availability), or increase to three (and set minimum master nodes to two).

> It appeared that the two nodes were simultaneously acting as masters, but they were both in communication with each other, so shouldn't they be able to resolve it, as @hamiltop suggests?

What evidence do you have that they were simultaneously acting as master? How do you know that they were in communication with each other? What version of Elasticsearch?
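
One way to gather that kind of evidence is to ask each node directly which node it currently considers the master; if the answers disagree while both nodes are reachable, the cluster is in (or flapping into) a split-brain state. The following is an illustrative sketch, not from the thread, that assumes two nodes reachable at the given addresses and uses the _cat/master endpoint available on the 1.x releases discussed here.

```python
import urllib.request

# Assumed example addresses for the two nodes under investigation.
NODES = ["http://node1:9200", "http://node2:9200"]

def reported_master(base_url):
    """Return the master node name as reported by the node at base_url."""
    # _cat/master returns one line: "<node-id> <host> <ip> <node-name>"
    with urllib.request.urlopen(base_url + "/_cat/master", timeout=5) as response:
        line = response.read().decode("utf-8").strip()
    return line.split(None, 3)[-1]  # keep the full node name, which may contain spaces

views = {url: reported_master(url) for url in NODES}
for url, master in views.items():
    print(f"{url} reports the master as: {master}")

if len(set(views.values())) > 1:
    print("The nodes disagree about who is master: possible split brain.")
```

Running the check repeatedly also helps distinguish a genuine split brain from rapid re-elections caused by, for example, long garbage collection pauses.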

@speedplane

speedplane (Contributor) commented Feb 26, 2016

> What evidence do you have that they were simultaneously acting as master?

In the Bigdesk plugin, the little star next to the node name kept bouncing back and forth between my two nodes (see screenshot).

[screenshot: Bigdesk plugin]

> How do you know that they were in communication with each other?

I don't think I explicitly tested whether one could contact the other, but I was able to ssh into both, they were on the same network, and there did not appear to be any network issues.

> What version of Elasticsearch?

1.7.3

@jasontedor

jasontedor (Member) commented Feb 26, 2016

> In the Bigdesk plugin, the little star next to the node name kept bouncing back and forth between my two nodes (see screenshot).

@speedplane I'm not familiar with the Bigdesk plugin, sorry. Let's just assume that it's correct and as you say. Have you checked the logs or any other monitoring for repeated long-running garbage collection pauses on both of these nodes?

> I don't think I explicitly tested whether one could contact the other, but I was able to ssh into both, they were on the same network, and there did not appear to be any network issues.

Networks are fickle things, but I do suspect something else here.

> 1.7.3

Thanks.

@XANi

XANi commented Feb 26, 2016

@speedplane The two-node situation is inherently hard to deal with because there is no single metric by which it can be decided which node should be shut down.

"Most written to" or "last written to" doesn't really mean much, and in most cases alerting that something is wrong is preferable to "just throw away whatever the other node had".

That is why a lot of distributed software recommends at least 3 nodes: with 3 there is always a majority, so you can set it up to only allow requests if at least n/2 + 1 nodes are up.
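
To make the n/2 + 1 arithmetic concrete, here is a small illustrative calculation (not from the thread) showing why two master-eligible nodes cannot tolerate any failure under a majority rule, while three can tolerate one:

```python
# Majority quorum for small clusters of master-eligible nodes. With n = 2 the
# quorum is 2, so losing (or being partitioned from) either node blocks master
# election entirely; with n = 3 the quorum is still 2 and one failure is survivable.
for n in (1, 2, 3, 5, 7):
    quorum = n // 2 + 1
    tolerated = n - quorum
    print(f"{n} master-eligible node(s): quorum = {quorum}, tolerates {tolerated} failure(s)")
```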
