Do not allow stale replicas to automatically be promoted to primary #14671

Closed
jasontedor opened this Issue Nov 11, 2015 · 4 comments

@jasontedor
Contributor

Consider a primary shard P hosted on node p and its replica shard Q hosted on node q. If p is isolated from the cluster (e.g., through node failure, a flapping NIC, or an excessively long garbage collection pause), indexing operations can continue on q after Q is promoted to primary; these indexing operations will be acknowledged to the requesting clients. If q is subsequently isolated before p rejoins and before a new replica is assigned to another node in the cluster, the subsequent rejoining of p can currently lead to P being promoted to primary again. The indexing operations acknowledged by q will be lost.

A mechanism needs to be built to prevent the automatic promotion of a stale shard in such a scenario and instead only promote a non-stale shard to primary (if a non-stale shard is available). The only scenario in which a stale shard should be promoted to primary is through manual intervention by a system operator (e.g., in cases when q suffers a total hardware failure).
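
To make the scenario and the desired behavior concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not Elasticsearch code and not necessarily how the fix is implemented: the `Cluster` and `ShardCopy` classes, the `in_sync` set, and the `allow_stale` flag are all hypothetical names introduced for illustration. The idea it shows is that the cluster remembers which copies received every acknowledged write, refuses to auto-promote any copy outside that set, and promotes a stale copy only on an explicit operator request.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ShardCopy:
    """One copy of a shard on a node (hypothetical model)."""
    node: str
    docs: List[str] = field(default_factory=list)


class Cluster:
    """Toy model of a single shard group; every name here is an assumption."""

    def __init__(self, copies: List[ShardCopy], primary: str) -> None:
        self.copies: Dict[str, ShardCopy] = {c.node: c for c in copies}
        self.reachable: Set[str] = set(self.copies)
        # Copies known to contain every acknowledged write.
        self.in_sync: Set[str] = set(self.copies)
        self.primary: Optional[str] = primary

    def index(self, doc: str) -> None:
        """Apply a write to all reachable in-sync copies, then acknowledge it."""
        if self.primary is None:
            raise RuntimeError("no active primary")
        for node in self.reachable & self.in_sync:
            self.copies[node].docs.append(doc)
        # Any copy that missed this acknowledged write is now stale.
        self.in_sync &= self.reachable

    def isolate(self, node: str) -> None:
        """Cut a node off; fail over if it held the primary."""
        self.reachable.discard(node)
        if self.primary == node:
            self.primary = None
            self.promote()

    def rejoin(self, node: str) -> None:
        """Bring a node back; try automatic promotion if there is no primary."""
        self.reachable.add(node)
        if self.primary is None:
            self.promote()

    def promote(self, allow_stale: bool = False) -> Optional[str]:
        """Pick a new primary. Stale copies qualify only when allow_stale is set,
        which stands in for manual intervention by an operator."""
        candidates = sorted(
            n for n in self.reachable if allow_stale or n in self.in_sync
        )
        self.primary = candidates[0] if candidates else None
        if self.primary is not None and allow_stale:
            # The operator accepted data loss; the promoted copy is the new truth.
            self.in_sync = {self.primary}
        return self.primary


# Replay the scenario from this issue.
cluster = Cluster([ShardCopy("p"), ShardCopy("q")], primary="p")
cluster.index("doc-1")   # written to both P and Q
cluster.isolate("p")     # Q is promoted to primary
cluster.index("doc-2")   # acknowledged by q only, so P becomes stale
cluster.isolate("q")     # no copy is reachable any more
cluster.rejoin("p")      # P is stale, so it is not promoted automatically
```

At the end of this replay the shard has no primary rather than a stale one, so `doc-2` is not silently discarded. Calling `cluster.promote(allow_stale=True)` models the operator explicitly accepting the loss, for example when q is gone for good. Recovery of a rejoining stale copy from the current primary is deliberately left out of the sketch.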

Relates #10933

@ywelsch ywelsch was assigned by jasontedor Nov 11, 2015
@bleskes
Member
bleskes commented Nov 11, 2015

Thanks @jasontedor. Can we also update the resiliency page?

@jasontedor
Contributor

@bleskes Added to the Resiliency page in #14681.

@bleskes
Member
bleskes commented Nov 11, 2015

Thanks Jason!

@clintongormley
Member

Closed by #15281

@bleskes bleskes added a commit that referenced this issue Apr 7, 2016
@bleskes bleskes Update resliency page
#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.
557a3d1
@bleskes bleskes added a commit that referenced this issue Apr 7, 2016
@bleskes bleskes Update resiliency page (#17586)
#14252 , #7572 , #15900, #12573, #14671, #15281 and #9126 have all been closed/merged and will be part of 5.0.0.
8eee28e