Network partitions can cause divergence, dirty reads, and lost updates. #20031

Open
aphyr opened this Issue Aug 17, 2016 · 10 comments


aphyr commented Aug 17, 2016

Hello again! I hope everyone's having a nice week. :-)

Since #7572 (A network partition can cause in flight documents to be lost) is now closed, and the resiliency page reports that "Loss of documents during network partition" is now addressed in 5.0.0, I've updated the Jepsen Elasticsearch tests for 5.0.0-alpha5. In these tests, Elasticsearch appears to allow:

  • Dirty reads
  • Divergence between replicas
  • The loss of successfully inserted documents

This work was funded by Crate.io, which uses Elasticsearch internally and is affected by the same issues in their fork.

Here's the test code, and the supporting library which uses the Java client (also at 5.0.0-alpha5). I've lowered timeouts to help ES recover faster from the network shenanigans in our tests.

Several clients concurrently index documents, each with a unique ID; failed writes are never retried. A write is considered successful if the index request returns RestStatus.CREATED.

Meanwhile, several clients concurrently attempt to get recently inserted documents by ID. Reads are considered successful if response.isExists() is true.

During this time, we perform a sequence of simple minority/majority network partitions: 20 seconds on, 10 seconds off.

At the end of the test, we heal the network, allow 10 seconds for quiescence (I've set ES timeouts/polling intervals much lower than normal to allow faster convergence), and have every client perform an index refresh. Refreshes are retried until every shard reports successful. Somewhat confusingly, getSuccessfulShards + getFailedShards is rarely equal to getTotalShards, and .getShardFailures appears to always be empty; perhaps there's an indeterminate, unreported shard state between success and failure? In any case, I check that getTotalShards is equal to getSuccessfulShards.
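The refresh-until-complete step can be sketched as follows. This is a minimal sketch, assuming a hypothetical `refresh_index` callable standing in for the Java client's refresh call; as noted above, failed-shard counts proved unreliable, so only total vs. successful shards are compared:

```python
import time

def refresh_until_complete(refresh_index, retries=10, delay=0.5):
    """Retry an index refresh until every shard reports success.

    refresh_index is a hypothetical stand-in for the real client call;
    it returns a (total_shards, successful_shards) tuple. We only check
    that the two counts are equal, since the failed-shard count is not
    a reliable complement of the successful count.
    """
    for _ in range(retries):
        total, successful = refresh_index()
        if total == successful:
            return True
        time.sleep(delay)
    return False
```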

Once every client has performed a refresh, each client performs a read of all documents in the index. We're looking for four-ish cases:

  • Divergence: do nodes agree on the documents in the index?
  • Dirty reads: are we able to read records which were not successfully inserted, as determined by their absence in a final read?
  • Some lost updates: are successful writes missing on some nodes but not others?
  • Lost updates: are successful writes missing on every node?
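The four checks above reduce to set algebra over three inputs: the acknowledged writes, the IDs observed by readers during the test, and each node's final read. A minimal sketch (the names here are mine, not Jepsen's):

```python
def analyze(acked, observed, final_reads):
    """Classify documents from per-node final reads.

    acked       -- set of IDs whose index request returned CREATED
    observed    -- set of IDs successfully read during the test
    final_reads -- mapping of node name -> set of IDs in its final read
    """
    reads = list(final_reads.values())
    on_all = set.intersection(*reads)   # present on every node
    on_some = set.union(*reads)         # present on at least one node
    return {
        "nodes-agree?": all(r == reads[0] for r in reads),  # divergence
        "not-on-all": on_some - on_all,   # present on some nodes, not all
        "some-lost": acked - on_all,      # acked, missing on some node
        "lost": acked - on_some,          # acked, missing everywhere
        "dirty": observed - on_some,      # read, but absent from final read
    }
```

For example, with acked = {1, 2, 3}, observed = {1, 2, 9}, and final reads {1, 2} on one node and {1} on another, document 3 is lost, 2 is some-lost, and 9 is a dirty read.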

All of these cases appear to occur. Here's a case where some acknowledged writes were lost by every node (757 lost, 3374 present):

{:dirty-read
 {:dirty
  #{1111 1116 1121 1123 1124 1125 1126 1128 1129 1131 1137 1141 1143
    ...
    2311 2312 2313 2314 2315 2317 2321 2322 2324 2325 2326 2327 2329
    2333},
  :nodes-agree? true,
  :valid? false,
  :lost-count 757,
  :some-lost-count 757,
  :lost
  #{1111 1116 1117 1121 1123 1124 1125 1126 1128 1129 1131 1133 1136
    ...
    2313 2314 2315 2317 2318 2320 2321 2322 2323 2324 2325 2326 2327
    2329 2333 2335},
  :dirty-count 573,
  :read-count 3101,
  :on-some-count 3374,
  :not-on-all #{},
  :some-lost
  #{1111 1116 1117 1121 1123 1124 1125 1126 1128 1129 1131 1133 1136
    ...
    2313 2314 2315 2317 2318 2320 2321 2322 2323 2324 2325 2326 2327
    2329 2333 2335},
  :not-on-all-count 0,
  :on-all-count 3374,
  :unchecked-count 846},
 :perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :valid? false}

And here's a case where nodes disagree on the contents of the index:

{:dirty-read
 {:dirty #{},
  :nodes-agree? false,
  :valid? false,
  :lost-count 0,
  :some-lost-count 109,
  :lost #{},
  :dirty-count 0,
  :read-count 2014,
  :on-some-count 5738,
  :not-on-all
  #{2835 2937 2945 2950 2967 2992 2999 3004 3007 3011 3014 3019 3027
    ...
    5260 5263 5270 5272 5287 5295 5298 5300 5305 5327 5337 5343 5355
    5378 5396 5397 5402 5404},
  :some-lost
  #{4556 4557 4559 4560 4573 4575 4577 4583 4589 4595 4608 4613 4619
    4621 4629 4637 4639 4642 4651 4653 4655 4660 4664 4671 4680 4681
    4687 4702 4708 4721 4722 4723 4725 4728 4741 4744 4746 4759 4763
    4768 4769 4784 4795 4819 4832 4838 4849 4850 4858 4872 4878 4884
    4894 4899 4904 4916 4920 4921 4923 4928 4930 4935 4938 4942 4949
    4952 4958 4961 4971 4985 5022 5046 5056 5067 5082 5088 5104 5123
    5125 5126 5138 5147 5159 5172 5187 5188 5200 5231 5241 5252 5257
    5260 5263 5270 5272 5287 5295 5298 5300 5305 5327 5337 5343 5355
    5378 5396 5397 5402 5404},
  :not-on-all-count 317,
  :on-all-count 5421,
  :unchecked-count 3724},
 :perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :valid? false}

not-on-all denotes documents which were present on some, but not all, nodes. some-lost is the subset of those documents which were successfully inserted and should have been present.

I've also demonstrated lost updates due to dirty read semantics plus _version divergence, but I haven't ported that test to ES 5.0.0 yet.

Cheers!

Elasticsearch version: 5.0.0-alpha5

Plugins installed: []

JVM version: Oracle JDK 1.8.0_91

OS version: Debian Jessie

Description of the problem including expected versus actual behavior: Elasticsearch should probably not forget about inserted documents which were acknowledged

Steps to reproduce:

  1. Clone Jepsen at 63c34b16679f1aad396c1b3e212f92646dcff1be
  2. cd elasticsearch
  3. lein test
  4. Wait a few hours. I haven't had time to narrow down the conditions required to trigger this bug.
@clintongormley

Member
clintongormley commented Aug 17, 2016

thanks @aphyr - we'll dive into it

@jeffknupp


jeffknupp commented Aug 30, 2016

Will this block the 5.0 release (it seems as though it should, since resiliency is one of the features being heavily touted in the new version)?

@bleskes


Member

bleskes commented Sep 19, 2016

@aphyr thanks again for the clear description. Responding to the separate issues you mention:

  1. Dirty reads - as discussed before, this is possible with Elasticsearch. We are working on adding documentation to make it clear.
  2. Replica/primary divergence - the nature of primary/backup replication allows replicas to be out of sync with each other while operations are in flight (not committed). A primary failure at that moment means the new primary is potentially out of sync with the other replicas, but unlike the old primary it has no knowledge of which non-acked documents were still in flight and are missing from some of the replicas. Currently this is only fixed by the next replica relocation or recovery - which takes too long to correct itself. Dirty reads expose this as well. We are currently working on a document-level primary/replica syncing method (see #10708) which will be used immediately upon primary promotion.
  3. Loss of acknowledged writes - this one was interesting :) @ywelsch and I tracked it down, and this is what happens:

With 5.0 we added the notion of "in-sync copies" to keep track of valid copies. The valid copies are used to track which copies are allowed to become primaries. That set of what we call allocation IDs is tracked in the cluster state. In the test, a change to this set that was committed by the master was lost during a partition, causing a stale shard copy to be promoted to primary when the partition healed. This caused data loss. This issue is addressed in #20384 and will be part of the imminent 5.0 beta1 release. With it in place we ran Jepsen for almost two days straight with no data loss. However, as mentioned in the ticket, there is still a very small chance of this happening. We have added an entry to our resiliency page about this. We are working to fix that too, although that will take quite a bit longer and will not make it into the 5.0 release.
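The failure mode in point 3 can be illustrated with a toy model (my own simplification, not Elasticsearch code): a master-committed shrink of the in-sync set is lost during a partition, so a stale copy remains eligible for promotion after healing.

```python
# Toy model of the lost in-sync-set update (a simplification, not ES code).

def promote(in_sync_view, copies):
    """Promote a new primary: only copies believed in-sync are eligible."""
    candidate = sorted(in_sync_view)[0]
    return candidate, copies[candidate]

copies = {"A": {"doc1"}, "B": {"doc1"}}   # two shard copies, in sync

# During a partition, primary A indexes doc2; B is unreachable, so the
# master commits an update shrinking the in-sync set to {"A"} and the
# write is acknowledged to the client.
copies["A"].add("doc2")
committed_in_sync = {"A"}

# Bug: that cluster-state update is lost in the partition, so after
# healing the master still holds the stale set {"A", "B"}.
stale_in_sync = {"A", "B"}

# If A is now unavailable, the stale copy B is eligible for promotion:
primary, docs = promote(stale_in_sync - {"A"}, copies)
# B becomes primary without doc2 - the acknowledged write is lost.
```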

@aphyr I would love it if you can verify our findings. I will be happy to supply a snapshot build of beta1.

@aphyr


aphyr commented Sep 22, 2016

Hi @bleskes. I'd be delighted to work on verification. I've got an ongoing contract that's keeping me pretty busy right now, but if you'd like to send me an email (aphyr@jepsen.io) I'll reach out when I've got availability for more testing.

@elasticmachine


elasticmachine commented Mar 27, 2018

@saiergong


saiergong commented May 18, 2018

I wonder: when the primary shard sends an operation to its replicas, and one replica acknowledges the request to the primary but the other does not confirm, so the primary never acknowledges the write to the client - can a read request sent to the acknowledging replica see the latest data, even though it was never acknowledged to the client? @bleskes

@jasontedor

Member

jasontedor commented May 18, 2018

@saiergong Yes.
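To restate the scenario this answer confirms (a toy sketch, not Elasticsearch internals): the primary acknowledges to the client only after all replicas confirm, but a replica applies the write before confirming, so a read routed to it can observe data the client was never acked for.

```python
# Toy sketch of reading unacknowledged data from a replica (not ES internals).
replicas = {"r1": set(), "r2": set()}

def replicate(doc, responding):
    """Primary sends doc to all replicas and acks the client only if
    every replica confirms. 'responding' lists the replicas that applied
    and confirmed the write."""
    for r in responding:
        replicas[r].add(doc)        # the replica applies the write locally
    return set(responding) == set(replicas)   # acked to client?

acked = replicate("doc1", responding=["r1"])   # r2 never confirms
assert acked is False                # client was never acknowledged...
assert "doc1" in replicas["r1"]      # ...yet a read on r1 sees doc1
```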

@saiergong


saiergong commented May 18, 2018

Many thanks @jasontedor - but do we consider this acceptable, or is there a plan to resolve it?

@jasontedor

Member

jasontedor commented May 18, 2018

@saiergong Indeed it is not ideal; however, we consider it to be a lower priority than the other problems that we are currently trying to solve.

@saiergong


saiergong commented May 18, 2018

I know, thanks @jasontedor.
