
Wait on shard failures #14252

Closed
9 tasks done
jasontedor opened this issue Oct 22, 2015 · 4 comments
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >enhancement Meta release highlight resiliency v5.0.0-alpha1

Comments

@jasontedor
Member

jasontedor commented Oct 22, 2015

Currently, when executing an action (e.g., a bulk, delete, or indexing operation) on all shards, if an exception occurs while executing the action on a replica shard, we send a shard failure message to the master. However, we neither wait for the master to acknowledge this message nor handle failures in sending it. This is problematic because we will acknowledge the action to the client even though the shard failure may never be recorded, which can result in lost writes. For example, when a primary is isolated from the master and its replicas, the following sequence of events can occur:

  1. we write to the local primary
  2. we fail to write to the replicas
  3. we fail in notifying the master to fail the replicas
  4. the primary acknowledges the write to the client
  5. the master notices the primary is gone and promotes one of the replicas to be primary

In this case, the replica will not have the write that was acknowledged to the client and this amounts to data loss.

Instead, if we waited on the master to acknowledge the shard failures we would never have acknowledged the write to the client in this case.
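The difference between the two acknowledgment strategies can be sketched as a small simulation. This is purely illustrative, not Elasticsearch code; the function and parameter names are hypothetical:

```python
# Illustrative sketch of the two acknowledgment protocols discussed above.
# All names are hypothetical; this is not Elasticsearch's implementation.

def replicate_write(primary_ok: bool, replica_ok: bool,
                    master_ack_ok: bool, wait_for_master_ack: bool) -> bool:
    """Return True if the write is acknowledged to the client."""
    if not primary_ok:
        return False          # step 1 failed: write to the local primary
    if replica_ok:
        return True           # replicas hold the write; safe to acknowledge
    # Step 2 failed: the replica write failed, so a shard failure
    # message is sent to the master (step 3 may also fail).
    if not wait_for_master_ack:
        return True           # current behavior: ack without waiting
    return master_ack_ok      # proposed: ack only if the master confirmed
                              # the shard failure

# The scenario from the issue: the replica write fails AND the
# shard-failure message to the master is lost.
old = replicate_write(primary_ok=True, replica_ok=False,
                      master_ack_ok=False, wait_for_master_ack=False)
new = replicate_write(primary_ok=True, replica_ok=False,
                      master_ack_ok=False, wait_for_master_ack=True)
print(old, new)
```

Under the current behavior the write is acknowledged (`old` is `True`) even though no surviving copy holds it; with the proposed wait, the acknowledgment is withheld (`new` is `False`), so a later replica promotion cannot silently lose an acknowledged write.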

@bleskes
Contributor

bleskes commented Oct 23, 2015

+1

@s1monw
Contributor

s1monw commented Oct 23, 2015

sounds good to me too

@makeyang
Contributor

makeyang commented Apr 5, 2016

Will this one resolve issue #7572?

@bleskes
Contributor

bleskes commented Apr 5, 2016

@makeyang Yes, you are correct. We are waiting to close that issue until #17038 is in.

bleskes added a commit that referenced this issue Apr 7, 2016
#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Cluster labels Feb 13, 2018