A network partition isolating a primary can cause the loss of inserted documents #10407

Closed
aphyr opened this Issue Apr 3, 2015 · 8 comments

Projects

None yet

4 participants

@aphyr
aphyr commented Apr 3, 2015

This is tough for me to reliably reproduce, but I've seen it a half-dozen times and think it needs an issue. Elasticsearch can still lose inserted documents when a single primary node is isolated by a simple network partition. Nothing fancy, no overlapping partitions required. Sometimes it's only a few documents, from immediately following the partition:

{:valid? false,
 :lost "#{640 643 645 649}",
 :recovered
 "#{140..142 144 146 148..152 155 157 159..162 164..168 171 174..175 177 179..185 187 190..192 194 197 199..200 202 205..208 214 224 234 238 240..247 249..254 256..264 266..282 285 296 317 342 363 388 431}",
 :ok
 "#{0..137 140..142 144 146 148..152 155 157 159..162 164..168 171 174..175 177 179..185 187 190..192 194 197 199..200 202 205..208 210..212 214 224 234 238 240..247 249..254 256..264 266..639 641..642 644 646..648 650..654}",
 :recovered-frac 96/655,
 :unexpected-frac 0,
 :unexpected "#{}",
 :lost-frac 4/655,
 :ok-frac 598/655}

Other times, the cluster seems to get really confused, and loses over half of the acknowledged records, distributed throughout the history:

{:valid? false,
 :lost
 "#{53..55 126..129 131..142 144..167 169..189 191..212 214..235 237..260 262..281 283..305 307..327 329..352 354..367 374 376}",
 :recovered
 "#{46..47 49 52 60 428..430 432..435 437..440 442..445 447..450 452..455 457..473}",
 :ok
 "#{0..47 49 52 60 428..430 432..435 437..440 442..445 447..450 452..455 457..473}",
 :recovered-frac 15/158,
 :unexpected-frac 0,
 :unexpected "#{}",
 :lost-frac 118/237,
 :ok-frac 91/474}

The logs aren't particularly informative here, but I have noticed some interesting messages which seem correlated with data loss: for instance:

[2015-04-02][WARN ][index.shard] [n4] [jepsen-index][0] suspect illegal state: trying to move shard from primary mode to replica mode

and

[2015-04-02 17:19:54,370][DEBUG][action.index             ] [n2] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

which I think is a hardcoded message.

Again, these are tough to reproduce--only 1 in 10 runs or so--so I'll keep working on finding a failure schedule that reliably triggers the fault.

@aphyr
aphyr commented Apr 3, 2015

Potentially related: the worst data-loss incidents seem related to long-lasting split-brain, where after a 120-second partition isolating a single primary is resolved, the original node and a newly elected node both believe they're the primary--even after exchanging traffic.

@aphyr
aphyr commented Apr 3, 2015

Here's a particularly rough case, losing 25% of documents with only primary-isolating partitions:

{:valid? false,
 :lost
 "#{50..53 125..128 130..149 151..170 172..194 196..217 219..239 241..263 265..286 288..309 311..331 333..355 357..378 380..392 394}",
 :recovered
 "#{400 448..449 451 453..454 456..459 461..464 466..469 471..474 476..484 486..489 491..494 496..498 516 619 624 629 634 639 644 647..649 651..654 656..659 661..665 667..669 671..675 677..680 682..684 686..689 691 877..879 882..884 887..889 892..894 897..899 902..904 907..909 912..914 917..919}",
 :ok
 "#{0..49 400 448..449 451 453..454 456..459 461..464 466..469 471..474 476..484 486..489 491..494 496..499 501..537 539..561 563..583 585..606 608..617 619 624 629 634 639 644 647..649 651..654 656..659 661..665 667..669 671..675 677..680 682..684 686..689 691..694 696..713 715..737 739..758 760..783 785..804 806..829 831..848 850 877..879 882..884 887..889 892..894 897..899 902..904 907..909 912..914 917..919 922..924 927..938 940..942 944..956 958..960 962..975 977..979 981..993 995..998 1000..1012 1014..1016 1018..1030 1032..1035 1037..1039}",
 :recovered-frac 7/65,
 :unexpected-frac 0,
 :unexpected "#{}",
 :lost-frac 261/1040,
 :ok-frac 53/104}

This test is 600 seconds long, where the network is stable for 10 seconds, then all nodes considering themselves the primary are isolated for 120 seconds, then the network heals and the cycle repeats. We give the cluster 20 seconds of guaranteed healthy network, then wait for the health endpoint to return green on all nodes, then wait a while just in case, and do a final read.

            :generator (gen/phases
                         (->> (range)
                              (map (fn [x] {:type  :invoke
                                            :f     :add
                                            :value x}))
                              gen/seq
                              (gen/stagger 1/10)
                              (gen/delay 1)
                              (gen/nemesis
                                (gen/seq (cycle
                                           [(gen/sleep 10)
                                            {:type :info :f :start}
                                            (gen/sleep 120)
                                            {:type :info :f :stop}])))
                              (gen/time-limit 600))
                         (gen/nemesis
                           (gen/phases
                             (gen/once {:type :info :f :stop})
                             (gen/sleep 20)))
                         (gen/clients
                           (gen/once {:type :invoke :f :read})))}))

In this test, we isolate [n4], then [n5], then [n4], and then something interesting happens: both n5 and n4 consider themselves primaries, even after 120 seconds of n4 being isolated. On the next cycle round, n2 and n4 both consider themselves primaries!

In an attempt to speed up fault detection (this network is virtual with negligible latency and no contention except during simulated partitions), I've lowered the ping timeouts to the order of 2-3 seconds--the cluster seems to detect failures within ~10 seconds normally, but sometimes gets stuck!

discovery.zen.ping.timeout: 3s

# After joining, send a join request with this timeout? Docs are kinda unclear
# here.
discovery.zen.join_timeout: 5s

# Fault detection timeouts. Our network is super reliable and uncongested
# except when we explicitly simulate failure, so I'm gonna lower these to speed
# up cluster convergence.
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 1s
discovery.zen.fd.ping_retries: 2

# Publish cluster state updates more frequently
discovery.zen.publish_timeout: 5s

I've uploaded the Jepsen history, set analysis, and elasticsearch logs (timestamps are perfectly coordinated--they're LXC containers) to https://aphyr.com/media/es-create-loss.tar.gz, if you'd like to take a look. :)

@bleskes
Member
bleskes commented Apr 3, 2015

Thx @aphyr

Elasticsearch can still lose inserted documents when a single primary node is isolated by a simple network partition

This might be #7572 , which is also documented here: http://www.elastic.co/guide/en/elasticsearch/resiliency/current/#_loss_of_documents_during_network_partition_status_ongoing

Other times, the cluster seems to get really confused

This one feels like something else, but it will require some research. Can you share the exact commit of Jepsen you used? That will help understand what exactly happened. I'm also curious to see what effect @dakrone 's suggestion to use scan scroll will have.

@aphyr
aphyr commented Apr 3, 2015

Hi @bleskes! This is from Jepsen 4fdf509d82620ac59b4daf4e6a21f9495c73e68a--just cd elasticsearch; lein test. I've been breaking up the test suites in hopes of actually having Jepsen As A Library instead of a big monolithic hairball, haha. Slow going so far. ;-)

I've also switched to using scan scroll, but it doesn't seem to have any effect on observed data loss--I can reproduce it with both a size: n query or a scrolling one.

@dakrone
Member
dakrone commented Apr 7, 2015

Hi @aphyr , I've been looking through the https://aphyr.com/media/es-create-loss.tar.gz logs that you provided, however, it looks like the history.edn timestamps are nanoseconds relative to the time that the test was started.

Could you add or modify the timestamps to be absolute instead of relative so we can correlate the logs from the ES instances to the Jepsen history?

@aphyr
aphyr commented Apr 8, 2015

Sorry, I've got a huge backlog at the moment and have to get slides prepped, but you can just flip https://github.com/aphyr/jepsen/blob/master/jepsen/src/jepsen/core.clj#L114 to use (System/currentTimeMillis) if you want unix timestamps. Be advised that Jepsen has facilities for lying to the database about what time it is--those aren't currently in play on the Elasticsearch tests, but I wouldn't rely on the log timestamps going forward. ;-)

@bleskes
Member
bleskes commented Apr 28, 2015

A quick update on this one - we have put some time into reproducing this and it looks more and more like #7572 . I will post an update once this becomes definite (or if anything else comes up).

@clintongormley
Member

Closed as duplicate of #7572

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment