[Indexing] A network partition can cause in flight documents to be lost #7572

bleskes · 2014-09-03T18:01:33Z

This ticket is meant to capture an issue that was discovered as part of the work done in #7493, which contains a failing reproduction test marked with @awaitFix.

If a network partition separates a node from the master, there is some window of time before the node detects it. The length of the window depends on the type of partition. The window is extremely small if a socket is broken. More adversarial partitions, for example ones that silently drop requests without breaking the socket, can take longer to detect (up to 3x30s using current defaults).

If the node hosts a primary shard at the moment of partition, and ends up being isolated from the cluster (which could have resulted in Split Brain before), some documents that are being indexed into the primary may be lost if they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master.
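The scenario above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not Elasticsearch code: `ShardCopy`, `index_on_primary`, and the node names are hypothetical, and the model acknowledges a write even when a replica is unreachable, which mirrors the behaviour reported in this ticket.

```python
class ShardCopy:
    """Toy stand-in for one copy of a shard; it only tracks document ids."""

    def __init__(self, name):
        self.name = name
        self.docs = set()


def index_on_primary(primary, replicas, doc_id, reachable):
    """Index on the primary, then forward to every replica it can reach.

    Returns True (acknowledged) even when a replica was unreachable,
    which is the problematic behaviour described in this ticket.
    """
    primary.docs.add(doc_id)
    for replica in replicas:
        if replica.name in reachable:
            replica.docs.add(doc_id)
    return True


def simulate():
    primary = ShardCopy("node-a")
    replica = ShardCopy("node-b")
    acked = []

    # Before the partition: the replica receives the document.
    if index_on_primary(primary, [replica], "doc-1", reachable={"node-b"}):
        acked.append("doc-1")

    # During the undetected partition: requests to node-b are silently dropped.
    if index_on_primary(primary, [replica], "doc-2", reachable=set()):
        acked.append("doc-2")

    # The master, on the other side of the partition, promotes the replica.
    new_primary = replica
    return sorted(d for d in acked if d not in new_primary.docs)


print(simulate())  # acknowledged documents missing from the new primary
```

In this model, `doc-2` was acknowledged to the client but never reached the copy that later became the primary.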

shikhar · 2014-10-21T06:54:44Z

I am curious to learn what your current thinking on fixing the issue is. I believe so long as we are ensuring the write is acknowledged by WriteConsistencyLevel.QUORUM or WriteConsistencyLevel.ALL, the problem should not theoretically happen. This seems to be what TransportShardReplicationOperationAction is aiming at, but may be buggy?

As an aside, can you point me at the primary-selection logic used by Elasticsearch?

bleskes · 2014-10-21T08:15:04Z

@shikhar the write consistency check currently works based on the cluster state of the node that hosts the primary. That means that it can take some time before the master detects that a node does not respond to pings and removes it from the cluster state, or before a node detects that it is not connected to a master (again, this is when the network is just dropping requests; socket disconnects are detected quickly). The first step is improving transparency w.r.t. replica shard indexing errors (see #7994). That will help expose when a document was not successfully indexed to all replicas. After that we plan to continue with improving primary shard promotion. Current code is here: https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java#L271
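To make the point about stale cluster state concrete, here is a minimal sketch of a pre-write consistency check that only consults a local view of the cluster. The function names, the state strings, and the simple-majority quorum rule are assumptions for illustration; they are not taken from TransportShardReplicationOperationAction.

```python
def required_copies(level, total_copies):
    """Copies that must be active for a write to proceed (simplified rule)."""
    if level == "one":
        return 1
    if level == "quorum":
        return total_copies // 2 + 1
    if level == "all":
        return total_copies
    raise ValueError(f"unknown consistency level: {level}")


def pre_write_check(level, cluster_state):
    """Count copies marked 'started' in the given view of the cluster state."""
    active = sum(1 for state in cluster_state.values() if state == "started")
    return active >= required_copies(level, len(cluster_state))


# View held by the isolated node hosting the primary: the partition has not
# been detected yet, so both replicas still look healthy.
stale_view = {"node-a": "started", "node-b": "started", "node-c": "started"}

# What is actually reachable from node-a during the partition.
actual_view = {"node-a": "started", "node-b": "unreachable", "node-c": "unreachable"}

print(pre_write_check("quorum", stale_view))   # check passes on the stale view
print(pre_write_check("quorum", actual_view))  # would fail with an accurate view
```

Because the check runs before the write and against the local view, it says nothing about how many copies actually received the document.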

shikhar · 2014-10-21T19:26:25Z

Ah I see, my thinking was that the WCL check should be verified both before and after the write has been sent. The after is what really matters. So it seems you are suggesting that the responsibility of verifying how many replicas a write was acknowledged by will be borne by the requestor? I think the terminology around the "write consistency level" check may have to be reconsidered then!

From the primary selection logic I can't spot anywhere where it's trying to pick the most "recent" replica of the candidates. Does ES currently exercise any such preference?

bleskes · 2014-10-21T19:35:03Z

> So it seems you are suggesting that the responsibility of verifying how many replicas a write was acknowledged by will be borne by the requestor?

The PR I mentioned is just a first step to bring more transparency into the process, by no means the goal.

> From the primary selection logic I can't spot anywhere where it's trying to pick the most "recent" replica of the candidates. Does ES currently exercise any such preference?

"recent" is very tricky when you index concurrently different documents of different sizes on different nodes. Depending on how things run, there is no notion of a clear "recent" shard as each replica may be behind on different documents, all in flight. I currently have some thoughts on how to approach this better but it's early stages. One of the options is to make an intermediate step which will indeed involve some heuristic around "recency".

shikhar · 2014-10-21T20:16:10Z

> "recent" is very tricky when you index concurrently different documents of different sizes on different nodes. Depending on how things run, there is no notion of a clear "recent" shard as each replica may be behind on different documents, all in flight. I currently have some thoughts on how to approach this better but it's early stages. One of the options is to make an intermediate step which will indeed involve some heuristic around "recency".

Agreed that it's tricky.

It seems to me that what's required is a shard-specific monotonic counter, and since all writes go through the primary this can be safely implemented. Is this blocking on the "sequence ID" stuff I think I saw some talk of? Is there a ticket for that?
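A rough sketch of the shard-specific monotonic counter idea, under assumptions: the primary stamps each write with a sequence number, each replica tracks which numbers it has applied, and promotion prefers the replica with the longest gap-free prefix. The class names, `checkpoint`, and `pick_promotion_candidate` are hypothetical and do not describe what Elasticsearch does at the time of this thread.

```python
import itertools


class Primary:
    """Assigns a per-shard, monotonically increasing sequence number."""

    def __init__(self):
        self._next_seq = itertools.count(0)

    def assign(self, doc_id):
        return next(self._next_seq), doc_id


class Replica:
    def __init__(self, name):
        self.name = name
        self.applied = set()

    def apply(self, seq_no, doc_id):
        self.applied.add(seq_no)

    def checkpoint(self):
        """Highest n such that every seq_no in 0..n was applied; -1 if none."""
        n = -1
        while n + 1 in self.applied:
            n += 1
        return n


def pick_promotion_candidate(replicas):
    """Prefer the replica with the longest gap-free history."""
    return max(replicas, key=lambda r: r.checkpoint())


primary = Primary()
r1, r2 = Replica("node-b"), Replica("node-c")
ops = [primary.assign(f"doc-{i}") for i in range(4)]  # sequence numbers 0..3

for seq_no, doc_id in ops:
    if seq_no != 2:  # operation 2 is still in flight to node-b
        r1.apply(seq_no, doc_id)
    if seq_no != 3:  # operation 3 is still in flight to node-c
        r2.apply(seq_no, doc_id)

print(r1.checkpoint(), r2.checkpoint())
print(pick_promotion_candidate([r1, r2]).name)
```

Note that node-b holds the newest operation but has a gap, which is the "each replica may be behind on different documents" situation; a gap-free checkpoint is one possible way to compare copies in that case.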

bleskes · 2014-10-21T20:27:14Z

> It seems to me that what's required is a shard-specific monotonic counter, and since all writes go through the primary this can be safely implemented. Is this blocking on the "sequence ID" stuff I think I saw some talk of?

You read our minds :)

shikhar · 2014-10-21T20:28:30Z

recommendation from @aphyr for this problem: viewstamped replication

aphyr · 2014-10-21T20:39:54Z

Or Paxos, or ZAB, or Raft, or ...

evantahler · 2014-10-24T06:50:14Z

Chiming in with a related note that I mentioned on the mailing list (@shikhar linked me here) re: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/M17mgdZnikk/Vk5lVIRjIFAJ. This is a failure mode that can happen without a network partition, just with crashing nodes (which you can easily get with some long GC pauses).

I think the monotonic counters are a good solution to this, but only if they count something that indicates not only state (The next document inserted to the shard should be document 1000), but also size (which implies that I have 999 documents in my copy of the shard). This way, if you end up in a position where a partially-replicated shard is promoted to master (because it has the only copy of the shard remaining in the cluster), you can now offer the user some interesting cluster configuration options:

1. serve the data I have, but accept no writes/updates (until a full shard returns to the cluster)
2. temporarily close the index / 500 error (until a full shard returns to the cluster)
3. promote what I have to master (and re-replicate my copy to other nodes when they re-join the cluster)

Without knowing that a shard is in this "partial-data" state, you couldn't make the choice. I would personally choose option 1 most of the time, but I can see use cases for all three options. I would argue that option 3 is what is happening presently. While this would add overhead to each write/update (you would need to count the number of documents in the shard on EACH write), I think that allowing ES to run in this "more safe" mode is a good option. Hopefully the suggestion isn't too crazy, as this would only add a check on the local copy of the data, and we probably only need to do it on the master shard.

aphyr · 2014-10-24T20:47:32Z

> promote what I have to master (and re-replicate my copy to other nodes when they re-join the cluster)

There's some great literature that addresses this problem.

bleskes · 2014-10-24T21:27:40Z

@evantahler

> This way, if you end up in a position where a partially-replicated shard is promoted to master (because it has the only copy of the shard remaining in the cluster)

This should never happen. ES prefers going to red state and blocking indexing over promoting half copies to primaries. If it did happen, it is a major bug and I would ask you to open another issue about it (this one is about something else).

shikhar · 2015-05-05T17:57:30Z

linking #10708

JeanFrancoisContour · 2015-07-13T09:45:59Z

Since this issue is related to in-flight documents, do you think there is a risk of losing existing documents during primary shard relocation (cluster rebalancing after adding a new node, for instance)?

bleskes · 2015-07-16T09:34:06Z

@JeanFrancoisContour this issue relates to documents that are wrongfully acked, i.e., ES acknowledged them but they didn't really reach all the replicas. They are lost when the primary is removed in favour of one of the other replicas due to a network partition that isolates the primary. It should not affect primary relocation. If you have issues there, please do report them by opening a different ticket.

JeanFrancoisContour · 2015-07-16T17:05:54Z

Ok thanks, so if we can afford to send data twice (same _id), in real time for the first event and a few hours later (bulk) for the second try, can we be pretty confident in ES overall?

bleskes · 2016-04-07T07:55:47Z

For the record, the majority of the work to fix this can be found at #14252

#14252, #7572, #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.

bleskes added bug labels Sep 3, 2014

bleskes self-assigned this Sep 3, 2014

bleskes mentioned this issue Sep 3, 2014

minimum_master_nodes does not prevent split-brain if splits are intersecting #2488

Closed

dakrone mentioned this issue Sep 3, 2014

Jepsen transient failures under network partition conditions #7549

Closed

bleskes mentioned this issue Apr 3, 2015

A network partition isolating a primary can cause the loss of inserted documents #10407

Closed

aphyr mentioned this issue Apr 4, 2015

A VM pause (due to GC, high IO load, etc) can cause the loss of inserted documents #10426

Closed

kimchy mentioned this issue May 2, 2015

Node crashes can cause data loss #10933

Closed

clintongormley mentioned this issue May 8, 2015

Add monitoring for inconsistent doc count between primary and replica shards. #11046

Closed

bleskes mentioned this issue Jul 17, 2015

Shutting down a node stops the transport layer despite of ongoing indexing operations #12314

Closed

brwe mentioned this issue Aug 6, 2015

ES shard fail to recovery due do number of docs differ #12661

Closed

clintongormley added the :Cluster label Jan 26, 2016

haizaar mentioned this issue Feb 19, 2016

Documentation is misleading regarding write consistency #16728

Closed

jasontedor mentioned this issue Apr 6, 2016

Enable acked indexing #17038

Merged

jasontedor closed this as completed in #17038 Apr 6, 2016

bleskes added a commit that referenced this issue Apr 7, 2016

Update resliency page

557a3d1

#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.

bleskes mentioned this issue Apr 7, 2016

Update resliency page #17586

Merged

bleskes added a commit that referenced this issue Apr 7, 2016

Update resiliency page (#17586)

8eee28e

#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.

aphyr mentioned this issue Aug 17, 2016

Network partitions can cause divergence, dirty reads, and lost updates. #20031

Closed

clintongormley added :Distributed/Distributed and removed :Cluster labels Feb 13, 2018
