A VM pause (due to GC, high IO load, etc) can cause the loss of inserted documents #10426

Open
aphyr opened this Issue Apr 4, 2015 · 6 comments


@aphyr
aphyr commented Apr 4, 2015

Following up on #7572 and #10407, I've found that Elasticsearch will lose inserted documents even in the event of a node hiccup due to garbage collection, swapping, disk failure, IO panic, virtual machine pauses, VM migration, etc. https://gist.github.com/aphyr/b8c98e6149bc66a2d839 shows a log where we pause an elasticsearch primary via SIGSTOP and SIGCONT. Even though no operations can take place against the suspended node during this time, and a new primary for the cluster comes to power, it looks like the old primary is still capable of acking inserts which are not replicated to the new primary--somewhere right before or right after the pause. The result is the loss of ~10% of acknowledged inserts.

You can replicate these results with Jepsen (commit e331ff3578), by running lein test :only elasticsearch.core-test/create-pause in the elasticsearch directory.

Looking through the Elasticsearch cluster state code (which I am by no means qualified to understand or evaluate), I get the... really vague, probably incorrect impression that Elasticsearch might make a couple assumptions:

  1. Primaries are considered authoritative "now", without a logical clock that identifies what "now" means.
  2. Operations like "insert a document" don't... seem... to carry a logical clock with them allowing replicas to decide whether or not the operation supersedes their state, which means that messages delayed in flight can show up and cause interesting things to happen.

Are these at all correct? Have you considered looking into an epoch/term/generation scheme? If primaries are elected uniquely for a certain epoch, you can tag each operation with that epoch and use it to reject invalid requests from the logical past--invariants around advancing the epoch, in turn, can enforce the logical monotonicity of operations. It might make it easier to tamp down race conditions like this.
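
To make the idea concrete, here is a minimal sketch of what I have in mind; this is not Elasticsearch code, and every name in it is made up. Each elected primary carries a monotonically increasing term, every replicated operation is stamped with the term of the primary that issued it, and a replica refuses anything stamped with a term lower than the highest it has seen:

```java
// Hypothetical sketch of term-tagged replication; not Elasticsearch's implementation.
final class Replica {

    // Highest primary term this replica has accepted so far.
    private long highestSeenTerm = 0L;

    static final class ReplicatedOp {
        final long primaryTerm;   // term of the primary that issued the op
        final String documentId;
        final byte[] source;

        ReplicatedOp(long primaryTerm, String documentId, byte[] source) {
            this.primaryTerm = primaryTerm;
            this.documentId = documentId;
            this.source = source;
        }
    }

    // Apply an operation only if it does not come from the logical past.
    // An op stamped with a lower term than the highest term seen so far must
    // have been issued by a deposed primary (for example one that was paused
    // while a new primary was elected), so it is rejected instead of acked.
    synchronized boolean apply(ReplicatedOp op) {
        if (op.primaryTerm < highestSeenTerm) {
            return false; // stale primary: refuse to acknowledge
        }
        highestSeenTerm = op.primaryTerm;
        index(op);
        return true;
    }

    private void index(ReplicatedOp op) {
        // ... write the document into the local shard copy ...
    }
}
```

The hard part then moves to election: a newly elected primary must be handed a term strictly greater than any term the deposed primary could still stamp onto in-flight operations, which is exactly the kind of invariant that rules around advancing the epoch can enforce.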

@bleskes
Member
bleskes commented Apr 4, 2015

Thx @aphyr. I can give a quick answer to one question now, while we research the rest:

Have you considered looking into an epoch/term/generation scheme?

This is indeed the current plan.

@dakrone referenced this issue in jepsen-io/jepsen on Apr 7, 2015:

Merged: Various cleanups for Elasticsearch test #51

@bleskes
Member
bleskes commented Apr 10, 2015

We have made some effort to reproduce this failure. In general, we see GC as just another disruption that can happen, the same way we view network issues and file corruptions. If anyone is interested in the work we do there, the org.elasticsearch.test.disruption package and DiscoveryWithServiceDisruptionsTests are a good place to look.

In the Jepsen runs that failed for us, Jepsen created an index and then paused the JVM of the master node, where the primary of one of the index's shards happened to be allocated. At the time the JVM was paused, no other replica of this shard had been fully initialized after the initial creation. Because the master's JVM was paused, the other nodes elected another master, but that cluster had no copies left for that specific shard. This left the cluster in a red state. When the node is unpaused it rejoins the cluster. The shard is not re-allocated because we require a quorum of copies to assign a primary (in order to make sure we do not reuse a stale copy). As such the cluster stays red and all the data previously indexed into this shard is not available for searches.
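
To illustrate the quorum rule mentioned above, here is a simplified sketch; it is not our actual allocation code and the names are made up:

```java
// Simplified sketch of the "require a quorum of copies to assign a primary"
// rule described above; not Elasticsearch's actual allocation code.
final class PrimaryAllocationCheck {

    // A primary is only (re)assigned when a majority of the shard's configured
    // copies (primary plus replicas) are available, so that a single,
    // possibly stale copy cannot be promoted on its own.
    static boolean canAssignPrimary(int configuredCopies, int availableCopies) {
        int quorum = configuredCopies / 2 + 1;
        return availableCopies >= quorum;
    }
}
```

Assuming, for example, one configured replica (two copies in total), the rejoining node's single copy is below the quorum of two, so the shard stays unassigned and the cluster stays red.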

When we changed Jepsen to wait for all replicas to be assigned before starting the nemesis, the failure no longer happens. This change, and some other improvements, are part of this PR to Jepsen.
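
The Jepsen change itself is in Clojure, but the idea is simply to block on cluster health until every replica is assigned before any fault is injected. Against the HTTP API that looks roughly like the following sketch; the host, port and timeout are placeholders, and a real check should parse the response body rather than rely on the status code alone:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: block until the cluster reports green (all primaries and replicas
// assigned) before starting any fault injection.
final class WaitForGreen {

    static void awaitGreen(String host) throws IOException {
        URL url = new URL("http://" + host + ":9200/_cluster/health"
                + "?wait_for_status=green&timeout=60s");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // The request blocks until the requested status is reached or the
        // timeout expires. A robust check should also parse the JSON body and
        // verify that "status" is "green" and "timed_out" is false; this
        // sketch only checks that the request itself succeeded.
        int code = conn.getResponseCode();
        conn.disconnect();
        if (code != 200) {
            throw new IOException("cluster health request failed: HTTP " + code);
        }
    }

    public static void main(String[] args) throws IOException {
        awaitGreen(args.length > 0 ? args[0] : "localhost");
        // ... only now is it safe to start the nemesis ...
    }
}
```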

That said, because a GC pause and an unresponsive network are similar in nature, there is still a small window in which documents can be lost; this is captured by #7572 and documented on the resiliency status page.

@aphyr can you confirm that the changes in the PR give the same behavior for you?

@aphyr
aphyr commented Apr 15, 2015

Thanks for this, @bleskes! I have been super busy with a few other issues but this is the last one I have to clear before my talks go out! I'll take a look tomorrow morning. :)

@bleskes
Member
bleskes commented Apr 21, 2015

@aphyr re our previous discussion of:

Have you considered looking into an epoch/term/generation scheme?
This is indeed the current plan.

If you're curious, I've opened a (high-level) issue describing our current thinking; see #10708.

@aphyr
aphyr commented Apr 28, 2015

I've merged your PR, and can confirm that ES still drops documents when a primary process is paused.

```clj
{:valid? false,
 :lost "#{1761}",
 :recovered
 "#{0 2..3 8 30 51 73 97 119 141 165 187 211 233 257 279 302 324 348 371 394 436 457 482 504 527 550 572 597 619 642 664 688 711 734 758 781 804 827 850 894 911 934 957 979 1003 1025 1049 1071 1092 1117 1138 1163 1185 1208 1230 1253 1277 1299 1342 1344 1350 1372 1415 1439 1462 1485 1508 1553 1576 1599 1623 1645 1667 1690 1714 1736 1779 1803 1825 1848 1871 1893 1917 1939 1964 1985 2010 2031 2054 2077 2100 2123 2146 2169 2192}",
 :ok "#{0..1344 1346..1392 1394..1530 1532..1760 1762..2203}",
 :recovered-frac 24/551,
 :unexpected-frac 0,
 :unexpected "#{}",
 :lost-frac 1/2204,
 :ok-frac 550/551}
```

@dakrone
Member
dakrone commented Apr 28, 2015

@aphyr thanks for running it! I think the PR removes one cause of document loss (the index not being in a green state before the test starts), though not the only one. I will keep running the test with additional logging to try to reproduce the failure you see.
