Wait on shard failures #14252

jasontedor · 2015-10-22T14:32:28Z

Currently, when executing an action (e.g., bulk, delete, or indexing operations) on all shards, if an exception occurs while executing the action on a replica shard, we send a shard failure message to the master. However, we do not wait for the master to acknowledge this message, and we do not handle failures in sending it. This is problematic because we still acknowledge the action to the client, which can result in lost writes. For example, in a situation where a primary is isolated from the master and its replicas, the following sequence of events can occur:

1. we write to the local primary
2. we fail to write to the replicas
3. we fail in notifying the master to fail the replicas
4. the primary acknowledges the write to the client
5. the master notices the primary is gone and promotes one of the replicas to be primary

In this case, the promoted replica will not have the write that was acknowledged to the client, and this amounts to data loss.

Instead, if we waited on the master to acknowledge the shard failures we would never have acknowledged the write to the client in this case.
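The proposed behavior can be illustrated with a small, self-contained simulation. This is only a sketch and not Elasticsearch code: `Replica`, `replicate`, `fail_shard_on_master`, and `MasterUnreachable` are hypothetical names invented for the example. The idea it shows is that the primary reports the write as acknowledgeable only if every replica either took the write or was confirmed as failed by the master.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class MasterUnreachable(Exception):
    """Hypothetical: the shard-failed message was not delivered or not acknowledged."""


@dataclass
class Replica:
    """Hypothetical stand-in for a replica shard copy."""
    name: str
    reachable: bool = True
    docs: List[str] = field(default_factory=list)

    def write(self, doc: str) -> None:
        if not self.reachable:
            raise ConnectionError(f"cannot reach replica {self.name}")
        self.docs.append(doc)


def replicate(doc: str,
              primary_docs: List[str],
              replicas: List[Replica],
              fail_shard_on_master: Callable[[str], None]) -> bool:
    """Return True only if the write may be acknowledged to the client."""
    primary_docs.append(doc)  # write to the local primary first
    for replica in replicas:
        try:
            replica.write(doc)
        except ConnectionError:
            try:
                # Wait for the master to confirm that the replica was failed.
                # Assumption: the callback returns only after that confirmation.
                fail_shard_on_master(replica.name)
            except MasterUnreachable:
                # The master never confirmed the failure, so the stale replica
                # could still be promoted: do not acknowledge the write.
                return False
    return True
```

In the isolated-primary scenario above, both `replica.write` and `fail_shard_on_master` fail, so `replicate` returns `False` and the client never receives an acknowledgement for a write that a promoted replica would be missing.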

- Create listener mechanism for executing callbacks when exceptions occur sending a shard failure message to the master (Add listener mechanism for failures to send shard failed #14295)
- Add unit tests that show we wait until failure or success (do not have to handle the failures yet) (Add timeout mechanism for sending shard failures #14707)
- Add general support for cluster state batch updates (Split cluster state update tasks into roles #14899)
- Apply cluster state batch updates to shard failures (Use general cluster state batching mechanism for shard failures #15016)
- Handle when the node we thought was the master is no longer the master (e.g., master might have stepped down) -> find the actual master (e.g., wait for a new master to be elected) and retry the failed shard notice (Wait for new master when failing shard #15748)
- Fail shard failure requests from illegal sources (Illegal shard failure requests #16275)
- Master tells us we are no longer the primary -> fail the local shard, retry request on new primary (Fail demoted primary shards and retry request #16415)
- Handle failed shard has already been removed from the routing table -> okay (Shard failure requests for non-existent shards #16089)
- Handle master side of shard failures (do not respond to the node until the new cluster state is published, otherwise report failure or allow the node to timeout) (Master should wait on cluster state publication when failing a shard #15468)
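The node-side outcomes listed above (master changed, sender demoted, shard already removed) can be sketched as a single retry loop. This is an illustrative sketch under stated assumptions, not the code from the linked pull requests: the exception classes, the `Outcome` enum, `notify_shard_failed`, and the `send` and `wait_for_new_master` callbacks are all hypothetical names, and the bounded attempt count merely stands in for a timeout.

```python
from enum import Enum
from typing import Callable


class NotMasterError(Exception):
    """Hypothetical: the contacted node is not (or is no longer) the master."""


class NoLongerPrimaryError(Exception):
    """Hypothetical: the master reports that the sender is no longer the primary."""


class ShardAlreadyRemovedError(Exception):
    """Hypothetical: the shard to fail is already gone from the routing table."""


class Outcome(Enum):
    ACKNOWLEDGED = "acknowledged"  # safe to continue with the write
    DEMOTED = "demoted"            # fail the local shard, retry on the new primary
    GAVE_UP = "gave_up"            # no master confirmed the failure in time


def notify_shard_failed(shard_id: str,
                        master: str,
                        send: Callable[[str, str], None],
                        wait_for_new_master: Callable[[str], str],
                        max_attempts: int = 3) -> Outcome:
    """Send a shard-failed notice, retrying against a new master if needed."""
    for _ in range(max_attempts):
        try:
            send(master, shard_id)  # assumed to return only once the master confirms
            return Outcome.ACKNOWLEDGED
        except ShardAlreadyRemovedError:
            # The shard is no longer in the routing table, so the goal is already met.
            return Outcome.ACKNOWLEDGED
        except NoLongerPrimaryError:
            return Outcome.DEMOTED
        except NotMasterError:
            # The node we thought was master stepped down: find the new one and retry.
            master = wait_for_new_master(master)
    return Outcome.GAVE_UP
```

A caller would acknowledge the client only on `Outcome.ACKNOWLEDGED`; on `Outcome.DEMOTED` it would fail its local shard and reroute the request, and on `Outcome.GAVE_UP` it would report the write as failed.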

bleskes · 2015-10-23T14:32:05Z

+1

s1monw · 2015-10-23T20:03:47Z

sounds good to me too

makeyang · 2016-04-05T05:50:09Z

Will this one resolve issue #7572?

bleskes · 2016-04-05T06:54:40Z

@makeyang Yes, you are correct. We are holding off on closing that issue until #17038 is in.

jasontedor added the >enhancement, :Cluster, Meta, and v5.0.0-alpha1 labels Oct 22, 2015

jasontedor added the resiliency label Oct 26, 2015

jasontedor mentioned this issue Oct 27, 2015

Add listener mechanism for failures to send shard failed #14295

Merged

jasontedor mentioned this issue Nov 12, 2015

Add timeout mechanism for sending shard failures #14707

Merged

jasontedor mentioned this issue Dec 1, 2015

Use general cluster state batching mechanism for shard failures #15016

Merged

jasontedor mentioned this issue Dec 16, 2015

Master should wait on cluster state publication when failing a shard #15468

Merged

This was referenced Jan 3, 2016

Refactor master node change predicate for reuse #15735

Merged

Make cluster state external to o.e.c.a.s.ShardStateAction #15736

Merged

Wait for new master when failing shard #15748

Merged

jasontedor added a commit that referenced this issue Jan 17, 2016

Merge pull request #15748 from jasontedor/shard-failure-no-master-retry

69b21fe

Wait for new master when failing shard Relates #14252

jasontedor mentioned this issue Jan 19, 2016

Shard failure requests for non-existent shards #16089

Closed

jasontedor mentioned this issue Feb 3, 2016

Fail demoted primary shards and retry request #16415

Closed

bleskes added the release highlight label Feb 4, 2016

jasontedor closed this as completed in 346ff04 Feb 10, 2016

bleskes mentioned this issue Apr 7, 2016

[Indexing] A network partition can cause in flight documents to be lost #7572

Closed

bleskes added a commit that referenced this issue Apr 7, 2016

Update resliency page

557a3d1

#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.

bleskes mentioned this issue Apr 7, 2016

Update resliency page #17586

Merged

bleskes added a commit that referenced this issue Apr 7, 2016

Update resiliency page (#17586)

8eee28e

#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.

clintongormley added the :Distributed/Distributed label and removed the :Cluster label Feb 13, 2018
