Skip to content

Network partitions can cause divergence, dirty reads, and lost updates. #20031

@aphyr

Description

@aphyr

Hello again! I hope everyone's having a nice week. :-)

Since #7572 (A network partition can cause in flight documents to be lost) is now closed, and the resiliency page reports that "Loss of documents during network partition" is now addressed in 5.0.0, I've updated the Jepsen Elasticsearch tests for 5.0.0-alpha5. In these tests, Elasticsearch appears to allow:

  • Dirty reads
  • Divergence between replicas
  • The loss of successfully inserted documents

This work was funded by Crate.io, which uses Elasticsearch internally and is affected by the same issues in their fork.

Here's the test code, and the supporting library which uses the Java client (also at 5.0.0-alpha5). I've lowered timeouts to help ES recover faster from the network shenanigans in our tests.

Several clients concurrently index documents, each with a unique ID that is never retried. Writes succeed if the index request returns RestStatus.CREATED.

Meanwhile, several clients concurrently attempt to get recently inserted documents by ID. Reads are considered successful if response.isExists() is true.

During this time, we perform a sequence of simple minority/majority network partitions: 20 seconds on, 10 seconds off.

At the end of the test, we heal the network, allow 10 seconds for quiescence (I've set ES timeouts/polling intervals much lower than normal to allow faster convergence), and have every client perform an index refresh. Refreshes are retried until every shard reports successful. Somewhat confusingly, getSuccessfulShards + getFailedShards is rarely equal to getTotalShards, and .getShardFailures appears to always be empty; perhaps there's an indeterminate, unreported shard state between success and failure? In any case, I check that getTotalShards is equal to getSuccessfulShards.

Once every client has performed a refresh, each client performs a read of all documents in the index. We're looking for four-ish cases:

  • Divergence: do nodes agree on the documents in the index?
  • Dirty reads: Are we able to read records which were not successfully inserted, as determined by their absence in a final read?
  • Some lost updates: Are successful writes missing on some nodes but not others?
  • Lost updates: Are successful writes missing on every node?

All of these cases appear to occur. Here's a case where some acknowledged writes were lost by every node (757 lost, 3374 present):

{:dirty-read
 {:dirty
  #{1111 1116 1121 1123 1124 1125 1126 1128 1129 1131 1137 1141 1143
    ...
    2311 2312 2313 2314 2315 2317 2321 2322 2324 2325 2326 2327 2329
    2333},
  :nodes-agree? true,
  :valid? false,
  :lost-count 757,
  :some-lost-count 757,
  :lost
  #{1111 1116 1117 1121 1123 1124 1125 1126 1128 1129 1131 1133 1136
    ...
    2313 2314 2315 2317 2318 2320 2321 2322 2323 2324 2325 2326 2327
    2329 2333 2335},
  :dirty-count 573,
  :read-count 3101,
  :on-some-count 3374,
  :not-on-all #{},
  :some-lost
  #{1111 1116 1117 1121 1123 1124 1125 1126 1128 1129 1131 1133 1136
    ...
    2313 2314 2315 2317 2318 2320 2321 2322 2323 2324 2325 2326 2327
    2329 2333 2335},
  :not-on-all-count 0,
  :on-all-count 3374,
  :unchecked-count 846},
 :perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :valid? false}

And here's a case where nodes disagree on the contents of the index

{:dirty-read
 {:dirty #{},
  :nodes-agree? false,
  :valid? false,
  :lost-count 0,
  :some-lost-count 109,
  :lost #{},
  :dirty-count 0,
  :read-count 2014,
  :on-some-count 5738,
  :not-on-all
  #{2835 2937 2945 2950 2967 2992 2999 3004 3007 3011 3014 3019 3027
    ...
    5260 5263 5270 5272 5287 5295 5298 5300 5305 5327 5337 5343 5355
    5378 5396 5397 5402 5404},
  :some-lost
  #{4556 4557 4559 4560 4573 4575 4577 4583 4589 4595 4608 4613 4619
    4621 4629 4637 4639 4642 4651 4653 4655 4660 4664 4671 4680 4681
    4687 4702 4708 4721 4722 4723 4725 4728 4741 4744 4746 4759 4763
    4768 4769 4784 4795 4819 4832 4838 4849 4850 4858 4872 4878 4884
    4894 4899 4904 4916 4920 4921 4923 4928 4930 4935 4938 4942 4949
    4952 4958 4961 4971 4985 5022 5046 5056 5067 5082 5088 5104 5123
    5125 5126 5138 5147 5159 5172 5187 5188 5200 5231 5241 5252 5257
    5260 5263 5270 5272 5287 5295 5298 5300 5305 5327 5337 5343 5355
    5378 5396 5397 5402 5404},
  :not-on-all-count 317,
  :on-all-count 5421,
  :unchecked-count 3724},
 :perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :valid? false}

not-on-all denotes documents which were present on some, but not all, nodes. some-lost is the subset of those documents which were successfully inserted and should have been present.

I've also demonstrated lost updates due to dirty read semantics plus _version divergence, but I haven't ported that test to ES 5.0.0 yet.

Cheers!

Elasticsearch version: 5.0.0-alpha-5

Plugins installed: []

JVM version: Oracle JDK 1.8.0_91

OS version: Debian Jessie

Description of the problem including expected versus actual behavior: Elasticsearch should probably not forget about inserted documents which were acknowledged

Steps to reproduce:

  1. Clone Jepsen at 63c34b16679f1aad396c1b3e212f92646dcff1be
  2. cd elasticsearch
  3. lein test
  4. Wait a few hours. I haven't had time to narrow down the conditions required to trigger this bug.

Metadata

Metadata

Assignees

No one assigned

    Labels

    :Distributed/DistributedA catch all label for anything in the Distributed Area. Please avoid if you can.Metaresiliency

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions