New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Indexing] A network partition can cause in flight documents to be lost #7572

Closed
bleskes opened this Issue Sep 3, 2014 · 16 comments

Comments

Projects
None yet
6 participants
@bleskes
Member

bleskes commented Sep 3, 2014

This ticket is meant to capture an issue which was discovered as part of the work done in #7493 , which contains a failing reproduction test with @awaitFix.

If a network partition separates a node from the master, there is some window of time before the node detects it. The length of the window is dependent on the type of the partition. This window is extremely small if a socket is broken. More adversarial partitions, for example, silently dropping requests without breaking the socket can take longer (up to 3x30s using current defaults).

If the node hosts a primary shard at the moment of partition, and ends up being isolated from the cluster (which could have resulted in Split Brain before), some documents that are being indexed into the primary may be lost if they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master.

@shikhar

This comment has been minimized.

Show comment
Hide comment
@shikhar

shikhar Oct 21, 2014

Contributor

I am curious to learn what your current thinking on fixing the issue is. I believe so long as we are ensuring the write is acknowledged by WriteConsistencyLevel.QUORUM or WriteConsistencyLevel.ALL, the problem should not theoretically happen. This seems to be what TransportShardReplicationOperationAction is aiming at, but may be buggy?

As an aside, can you point me at the primary-selection logic used by Elasticsearch?

Contributor

shikhar commented Oct 21, 2014

I am curious to learn what your current thinking on fixing the issue is. I believe so long as we are ensuring the write is acknowledged by WriteConsistencyLevel.QUORUM or WriteConsistencyLevel.ALL, the problem should not theoretically happen. This seems to be what TransportShardReplicationOperationAction is aiming at, but may be buggy?

As an aside, can you point me at the primary-selection logic used by Elasticsearch?

@bleskes

This comment has been minimized.

Show comment
Hide comment
@bleskes

bleskes Oct 21, 2014

Member

@shikhar the write consistency check works at the moment based of the cluster state of the node that hosts the primary. That means that it can take some time (again, when the network is just dropping requests, socket disconnects are quick) before the master detects a node does not respond to pings and removes it from the cluster states (or that a node detects it's not connected to a master). The first step is improving transparency w.r.t replica shards indexing errors (see #7994). That will help expose when a document was not successfully indexed to all replicas. After that we plan to continue with improving primary shard promotion. Current code is here: https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java#L271

Member

bleskes commented Oct 21, 2014

@shikhar the write consistency check works at the moment based of the cluster state of the node that hosts the primary. That means that it can take some time (again, when the network is just dropping requests, socket disconnects are quick) before the master detects a node does not respond to pings and removes it from the cluster states (or that a node detects it's not connected to a master). The first step is improving transparency w.r.t replica shards indexing errors (see #7994). That will help expose when a document was not successfully indexed to all replicas. After that we plan to continue with improving primary shard promotion. Current code is here: https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java#L271

@shikhar

This comment has been minimized.

Show comment
Hide comment
@shikhar

shikhar Oct 21, 2014

Contributor

Ah I see, my thinking was that the WCL check be verified both before and after the write has been sent. The after is what really matters. So it seems you are suggesting that the responsibility of verifying how many replicas a write was acknowledged by, will be borne by the requestor? I think the terminology around "write consistency level" check may have to be re-considered then!

From the primary selection logic I can't spot anywhere where it's trying to pick the most "recent" replica of the candidates. Does ES currently exercise any such preference?

Contributor

shikhar commented Oct 21, 2014

Ah I see, my thinking was that the WCL check be verified both before and after the write has been sent. The after is what really matters. So it seems you are suggesting that the responsibility of verifying how many replicas a write was acknowledged by, will be borne by the requestor? I think the terminology around "write consistency level" check may have to be re-considered then!

From the primary selection logic I can't spot anywhere where it's trying to pick the most "recent" replica of the candidates. Does ES currently exercise any such preference?

@bleskes

This comment has been minimized.

Show comment
Hide comment
@bleskes

bleskes Oct 21, 2014

Member

So it seems you are suggesting that the responsibility of verifying how many replicas a write was acknowledged by, will be borne by the requestor?

The PR I mentioned is just a first step to bring more transparency into the process, by no means the goal.

From the primary selection logic I can't spot anywhere where it's trying to pick the most "recent" replica of the candidates. Does ES currently exercise any such preference?

"recent" is very tricky when you index concurrently different documents of different sizes on different nodes. Depending on how things run, there is no notion of a clear "recent" shard as each replica may be behind on different documents, all in flight. I currently have some thoughts on how to approach this better but it's early stages. One of the options is take make a intermediate step which will indeed involve some heuristic around "recency".

Member

bleskes commented Oct 21, 2014

So it seems you are suggesting that the responsibility of verifying how many replicas a write was acknowledged by, will be borne by the requestor?

The PR I mentioned is just a first step to bring more transparency into the process, by no means the goal.

From the primary selection logic I can't spot anywhere where it's trying to pick the most "recent" replica of the candidates. Does ES currently exercise any such preference?

"recent" is very tricky when you index concurrently different documents of different sizes on different nodes. Depending on how things run, there is no notion of a clear "recent" shard as each replica may be behind on different documents, all in flight. I currently have some thoughts on how to approach this better but it's early stages. One of the options is take make a intermediate step which will indeed involve some heuristic around "recency".

@shikhar

This comment has been minimized.

Show comment
Hide comment
@shikhar

shikhar Oct 21, 2014

Contributor

"recent" is very tricky when you index concurrently different documents of different sizes on different nodes. Depending on how things run, there is no notion of a clear "recent" shard as each replica may be behind on different documents, all in flight. I currently have some thoughts on how to approach this better but it's early stages. One of the options is take make a intermediate step which will indeed involve some heuristic around "recency".

Agreed that it's tricky.

It seems to me that what's required is a shard-specific monotonic counter, and since all writes go through the primary this can be safely implemented. Is this blocking on the "sequence ID" stuff I think I saw some talk of? Is there a ticket for that?

Contributor

shikhar commented Oct 21, 2014

"recent" is very tricky when you index concurrently different documents of different sizes on different nodes. Depending on how things run, there is no notion of a clear "recent" shard as each replica may be behind on different documents, all in flight. I currently have some thoughts on how to approach this better but it's early stages. One of the options is take make a intermediate step which will indeed involve some heuristic around "recency".

Agreed that it's tricky.

It seems to me that what's required is a shard-specific monotonic counter, and since all writes go through the primary this can be safely implemented. Is this blocking on the "sequence ID" stuff I think I saw some talk of? Is there a ticket for that?

@bleskes

This comment has been minimized.

Show comment
Hide comment
@bleskes

bleskes Oct 21, 2014

Member

It seems to me that what's required is a shard-specific monotonic counter, and since all writes go through the primary this can be safely implemented. Is this blocking on the "sequence ID" stuff I think I saw some talk of?

You read our minds :)

Member

bleskes commented Oct 21, 2014

It seems to me that what's required is a shard-specific monotonic counter, and since all writes go through the primary this can be safely implemented. Is this blocking on the "sequence ID" stuff I think I saw some talk of?

You read our minds :)

@shikhar

This comment has been minimized.

Show comment
Hide comment
@shikhar

shikhar Oct 21, 2014

Contributor

recommendation from @aphyr for this problem: viewstamped replication

Contributor

shikhar commented Oct 21, 2014

recommendation from @aphyr for this problem: viewstamped replication

@aphyr

This comment has been minimized.

Show comment
Hide comment
@aphyr

aphyr Oct 21, 2014

Or Paxos, or ZAB, or Raft, or ...

aphyr commented Oct 21, 2014

Or Paxos, or ZAB, or Raft, or ...

@evantahler

This comment has been minimized.

Show comment
Hide comment
@evantahler

evantahler Oct 24, 2014

Chiming with a related note that I mentioned on the mailing list (@shikhar linked me here) re: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/M17mgdZnikk/Vk5lVIRjIFAJ. This is failure mode that can happen without a network partition... just crashing nodes (which you can easily get with some long GC pauses)

I think the monotonic counters are a good solution to this, but only if they count something that indicates not only state (The next document inserted to the shard should be document 1000), but also size (which implies that I have 999 documents in my copy of the shard). This way, if you end up in a position where a partially-replicated shard is promoted to master (because it has the only copy of the shard remaining in the cluster), you can now offer the user some interesting cluster configuration options:

  1. serve the data I have, but accept no writes/updates (until a full shard returns to the cluster)
  2. temporarily close the index / 500 error (until a full shard returns to the cluster)
  3. promote what I have to master (and re-replicate my copy to other nodes when they re-join the cluster)

Without knowing that a shard is in this "partial-data" state, you couldn't make the choice. I would personally choose #1 most of the time, but I can see use cases for all three options. I would argue that #3 is what is happening presently. While this would add overhead to each write/update (you would need to count the number of documents in the shard EACH write), I think that allowing ES to run in this "more safe" mode is a good option. Hopefully the suggestion isn't too crazy, as this would only add a check on the local copy of the data, and we probably only need to do it on the master shard.

evantahler commented Oct 24, 2014

Chiming with a related note that I mentioned on the mailing list (@shikhar linked me here) re: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/M17mgdZnikk/Vk5lVIRjIFAJ. This is failure mode that can happen without a network partition... just crashing nodes (which you can easily get with some long GC pauses)

I think the monotonic counters are a good solution to this, but only if they count something that indicates not only state (The next document inserted to the shard should be document 1000), but also size (which implies that I have 999 documents in my copy of the shard). This way, if you end up in a position where a partially-replicated shard is promoted to master (because it has the only copy of the shard remaining in the cluster), you can now offer the user some interesting cluster configuration options:

  1. serve the data I have, but accept no writes/updates (until a full shard returns to the cluster)
  2. temporarily close the index / 500 error (until a full shard returns to the cluster)
  3. promote what I have to master (and re-replicate my copy to other nodes when they re-join the cluster)

Without knowing that a shard is in this "partial-data" state, you couldn't make the choice. I would personally choose #1 most of the time, but I can see use cases for all three options. I would argue that #3 is what is happening presently. While this would add overhead to each write/update (you would need to count the number of documents in the shard EACH write), I think that allowing ES to run in this "more safe" mode is a good option. Hopefully the suggestion isn't too crazy, as this would only add a check on the local copy of the data, and we probably only need to do it on the master shard.

@aphyr

This comment has been minimized.

Show comment
Hide comment
@aphyr

aphyr Oct 24, 2014

  1. promote what I have to master (and re-replicate my copy to other nodes when they re-join the cluster)

There's some great literature that addresses this problem.

aphyr commented Oct 24, 2014

  1. promote what I have to master (and re-replicate my copy to other nodes when they re-join the cluster)

There's some great literature that addresses this problem.

@bleskes

This comment has been minimized.

Show comment
Hide comment
@bleskes

bleskes Oct 24, 2014

Member

@evantahler

This way, if you end up in a position where a partially-replicated shard is promoted to master (because it has the only copy of the shard remaining in the cluster)

This should never happen. ES prefers to go to red state and block indexing to promoting half copies to primaries. If it did it is a major bug and I would request you open another issue about it (this one is about something else).

Member

bleskes commented Oct 24, 2014

@evantahler

This way, if you end up in a position where a partially-replicated shard is promoted to master (because it has the only copy of the shard remaining in the cluster)

This should never happen. ES prefers to go to red state and block indexing to promoting half copies to primaries. If it did it is a major bug and I would request you open another issue about it (this one is about something else).

@shikhar

This comment has been minimized.

Show comment
Hide comment
@shikhar

shikhar May 5, 2015

Contributor

linking #10708

Contributor

shikhar commented May 5, 2015

linking #10708

@JeanFrancoisContour

This comment has been minimized.

Show comment
Hide comment
@JeanFrancoisContour

JeanFrancoisContour Jul 13, 2015

Since this issue is related to in-flight documents. Do you think there is a risk to loose existing document during primary shard relocation (cluster rebalancing after adding a new node for instance )?

JeanFrancoisContour commented Jul 13, 2015

Since this issue is related to in-flight documents. Do you think there is a risk to loose existing document during primary shard relocation (cluster rebalancing after adding a new node for instance )?

@bleskes

This comment has been minimized.

Show comment
Hide comment
@bleskes

bleskes Jul 16, 2015

Member

@JeanFrancoisContour this issue relates to documents that are wrongfully acked. I.e., ES acknowledge them but they didn't really reach all the replicas. They are lost when the primary is removed in favour of one of the other replica due to a network partition that isolates the primary. It should effect primary relocation. If you have issues there do please report by opening a different ticket.

Member

bleskes commented Jul 16, 2015

@JeanFrancoisContour this issue relates to documents that are wrongfully acked. I.e., ES acknowledge them but they didn't really reach all the replicas. They are lost when the primary is removed in favour of one of the other replica due to a network partition that isolates the primary. It should effect primary relocation. If you have issues there do please report by opening a different ticket.

@JeanFrancoisContour

This comment has been minimized.

Show comment
Hide comment
@JeanFrancoisContour

JeanFrancoisContour Jul 16, 2015

Ok thanks, so if we can afford to send data twice (same _id), in real time for the first event and a few hour later (bulk) for the second try, we are pretty confident in ES overall ?

JeanFrancoisContour commented Jul 16, 2015

Ok thanks, so if we can afford to send data twice (same _id), in real time for the first event and a few hour later (bulk) for the second try, we are pretty confident in ES overall ?

@bleskes

This comment has been minimized.

Show comment
Hide comment
@bleskes

bleskes Apr 7, 2016

Member

For the record, the majority of the work to fix this can be found at #14252

Member

bleskes commented Apr 7, 2016

For the record, the majority of the work to fix this can be found at #14252

bleskes added a commit that referenced this issue Apr 7, 2016

Update resliency page
#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.

bleskes added a commit that referenced this issue Apr 7, 2016

Update resiliency page (#17586)
#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment