[Indexing] A network partition can cause in flight documents to be lost #7572

Closed
bleskes opened this Issue Sep 3, 2014 · 16 comments

Projects

None yet

6 participants

@bleskes
Member
bleskes commented Sep 3, 2014

This ticket is meant to capture an issue which was discovered as part of the work done in #7493 , which contains a failing reproduction test with @awaitFix.

If a network partition separates a node from the master, there is some window of time before the node detects it. The length of the window is dependent on the type of the partition. This window is extremely small if a socket is broken. More adversarial partitions, for example, silently dropping requests without breaking the socket can take longer (up to 3x30s using current defaults).

If the node hosts a primary shard at the moment of partition, and ends up being isolated from the cluster (which could have resulted in Split Brain before), some documents that are being indexed into the primary may be lost if they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master.

@bleskes bleskes self-assigned this Sep 3, 2014
@shikhar
Contributor
shikhar commented Oct 21, 2014

I am curious to learn what your current thinking on fixing the issue is. I believe so long as we are ensuring the write is acknowledged by WriteConsistencyLevel.QUORUM or WriteConsistencyLevel.ALL, the problem should not theoretically happen. This seems to be what TransportShardReplicationOperationAction is aiming at, but may be buggy?

As an aside, can you point me at the primary-selection logic used by Elasticsearch?

@bleskes
Member
bleskes commented Oct 21, 2014

@shikhar the write consistency check works at the moment based of the cluster state of the node that hosts the primary. That means that it can take some time (again, when the network is just dropping requests, socket disconnects are quick) before the master detects a node does not respond to pings and removes it from the cluster states (or that a node detects it's not connected to a master). The first step is improving transparency w.r.t replica shards indexing errors (see #7994). That will help expose when a document was not successfully indexed to all replicas. After that we plan to continue with improving primary shard promotion. Current code is here: https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java#L271

@shikhar
Contributor
shikhar commented Oct 21, 2014

Ah I see, my thinking was that the WCL check be verified both before and after the write has been sent. The after is what really matters. So it seems you are suggesting that the responsibility of verifying how many replicas a write was acknowledged by, will be borne by the requestor? I think the terminology around "write consistency level" check may have to be re-considered then!

From the primary selection logic I can't spot anywhere where it's trying to pick the most "recent" replica of the candidates. Does ES currently exercise any such preference?

@bleskes
Member
bleskes commented Oct 21, 2014

So it seems you are suggesting that the responsibility of verifying how many replicas a write was acknowledged by, will be borne by the requestor?

The PR I mentioned is just a first step to bring more transparency into the process, by no means the goal.

From the primary selection logic I can't spot anywhere where it's trying to pick the most "recent" replica of the candidates. Does ES currently exercise any such preference?

"recent" is very tricky when you index concurrently different documents of different sizes on different nodes. Depending on how things run, there is no notion of a clear "recent" shard as each replica may be behind on different documents, all in flight. I currently have some thoughts on how to approach this better but it's early stages. One of the options is take make a intermediate step which will indeed involve some heuristic around "recency".

@shikhar
Contributor
shikhar commented Oct 21, 2014

"recent" is very tricky when you index concurrently different documents of different sizes on different nodes. Depending on how things run, there is no notion of a clear "recent" shard as each replica may be behind on different documents, all in flight. I currently have some thoughts on how to approach this better but it's early stages. One of the options is take make a intermediate step which will indeed involve some heuristic around "recency".

Agreed that it's tricky.

It seems to me that what's required is a shard-specific monotonic counter, and since all writes go through the primary this can be safely implemented. Is this blocking on the "sequence ID" stuff I think I saw some talk of? Is there a ticket for that?

@bleskes
Member
bleskes commented Oct 21, 2014

It seems to me that what's required is a shard-specific monotonic counter, and since all writes go through the primary this can be safely implemented. Is this blocking on the "sequence ID" stuff I think I saw some talk of?

You read our minds :)

@shikhar
Contributor
shikhar commented Oct 21, 2014

recommendation from @aphyr for this problem: viewstamped replication

@aphyr
aphyr commented Oct 21, 2014

Or Paxos, or ZAB, or Raft, or ...

@evantahler

Chiming with a related note that I mentioned on the mailing list (@shikhar linked me here) re: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/M17mgdZnikk/Vk5lVIRjIFAJ. This is failure mode that can happen without a network partition... just crashing nodes (which you can easily get with some long GC pauses)

I think the monotonic counters are a good solution to this, but only if they count something that indicates not only state (The next document inserted to the shard should be document 1000), but also size (which implies that I have 999 documents in my copy of the shard). This way, if you end up in a position where a partially-replicated shard is promoted to master (because it has the only copy of the shard remaining in the cluster), you can now offer the user some interesting cluster configuration options:

  1. serve the data I have, but accept no writes/updates (until a full shard returns to the cluster)
  2. temporarily close the index / 500 error (until a full shard returns to the cluster)
  3. promote what I have to master (and re-replicate my copy to other nodes when they re-join the cluster)

Without knowing that a shard is in this "partial-data" state, you couldn't make the choice. I would personally choose #1 most of the time, but I can see use cases for all three options. I would argue that #3 is what is happening presently. While this would add overhead to each write/update (you would need to count the number of documents in the shard EACH write), I think that allowing ES to run in this "more safe" mode is a good option. Hopefully the suggestion isn't too crazy, as this would only add a check on the local copy of the data, and we probably only need to do it on the master shard.

@aphyr
aphyr commented Oct 24, 2014
  1. promote what I have to master (and re-replicate my copy to other nodes when they re-join the cluster)

There's some great literature that addresses this problem.

@bleskes
Member
bleskes commented Oct 24, 2014

@evantahler

This way, if you end up in a position where a partially-replicated shard is promoted to master (because it has the only copy of the shard remaining in the cluster)

This should never happen. ES prefers to go to red state and block indexing to promoting half copies to primaries. If it did it is a major bug and I would request you open another issue about it (this one is about something else).

@shikhar
Contributor
shikhar commented May 5, 2015

linking #10708

@JeanFrancoisContour

Since this issue is related to in-flight documents. Do you think there is a risk to loose existing document during primary shard relocation (cluster rebalancing after adding a new node for instance )?

@bleskes
Member
bleskes commented Jul 16, 2015

@JeanFrancoisContour this issue relates to documents that are wrongfully acked. I.e., ES acknowledge them but they didn't really reach all the replicas. They are lost when the primary is removed in favour of one of the other replica due to a network partition that isolates the primary. It should effect primary relocation. If you have issues there do please report by opening a different ticket.

@JeanFrancoisContour

Ok thanks, so if we can afford to send data twice (same _id), in real time for the first event and a few hour later (bulk) for the second try, we are pretty confident in ES overall ?

@jasontedor jasontedor closed this in #17038 Apr 6, 2016
@bleskes
Member
bleskes commented Apr 7, 2016

For the record, the majority of the work to fix this can be found at #14252

@bleskes bleskes added a commit that referenced this issue Apr 7, 2016
@bleskes bleskes Update resliency page
#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.
557a3d1
@bleskes bleskes added a commit that referenced this issue Apr 7, 2016
@bleskes bleskes Update resiliency page (#17586)
#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.
8eee28e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment