Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries #19287

Merged
1 commit merged into elastic:master on Jul 19, 2016

Conversation

@ywelsch
Contributor

ywelsch commented Jul 6, 2016

Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries, ultimately leading to documents not being properly replicated (see #19248 for more details).

Closes #19248

assert remove : "Handler was not registered [" + handler + "]";
if (remove) {
    shard.recoveryStats().decCurrentAsSource();
}
if (shardRecoveryContext.recoveryHandlers.isEmpty()) {
    ongoingRecoveries.remove(shard);
}

@bleskes

Member

bleskes Jul 18, 2016

can we assert that the delay recovery exception is null?

public class RecoveryWaitForClusterStateRequest extends TransportRequest {
    private long recoveryId;

@bleskes

Member

bleskes Jul 18, 2016

time for a base class for all these recovery requests? we can remove a lot of boilerplate in RecoveryTargetService by having the recovery resolving shared. As a different PR, of course.
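As a rough illustration of that suggestion, a shared base class might look like the sketch below. All names are hypothetical and `TransportRequest` is stubbed so the snippet is self-contained; this is not the code that later landed.

```java
// Hypothetical sketch of the suggested base class: every recovery-related
// transport request carries the recovery id and shard id, so the
// recovery-resolution boilerplate in RecoveryTargetService could live in
// one shared place.
abstract class TransportRequest {} // stub so the snippet is self-contained

abstract class RecoveryTransportRequest extends TransportRequest {
    private final long recoveryId;
    private final String shardId; // a ShardId in the real code base

    RecoveryTransportRequest(long recoveryId, String shardId) {
        this.recoveryId = recoveryId;
        this.shardId = shardId;
    }

    long recoveryId() {
        return recoveryId;
    }

    String shardId() {
        return shardId;
    }
}

// concrete requests then only add their own payload
class RecoveryWaitForClusterStateRequest extends RecoveryTransportRequest {
    private final long clusterStateVersion;

    RecoveryWaitForClusterStateRequest(long recoveryId, String shardId, long clusterStateVersion) {
        super(recoveryId, shardId);
        this.clusterStateVersion = clusterStateVersion;
    }

    long clusterStateVersion() {
        return clusterStateVersion;
    }
}
```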

@bleskes

Member

bleskes commented Jul 18, 2016

thx @ywelsch. I left some comments. Can we also maybe add some tests to RecoverySourceHandlerTests to test the new relocation behavior (waiting for a cluster state version and failing on move to relocated)?

@ywelsch

Contributor

ywelsch commented Jul 18, 2016

@bleskes I've pushed a new change addressing the comments. I've also added the unit tests that you asked for, but I am not convinced that they are all that useful.

@bleskes

Member

bleskes commented Jul 18, 2016

@ywelsch I'm wondering whether changing the title to "Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries" would better describe the situation? Or even taking over the issue's title? (We use PR titles to drive change lists.)

@ywelsch ywelsch changed the title from Fix data loss when indexing during primary relocation with ongoing replica recoveries to Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries Jul 18, 2016

@bleskes

Member

bleskes commented Jul 18, 2016

LGTM. Thanks @ywelsch

Fix replica-primary inconsistencies when indexing during primary relocation with ongoing replica recoveries

Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries,
ultimately leading to documents not being properly replicated.

Invariant 1: Document writes must be replicated based on the routing table of a cluster state that includes all shards which have ongoing or finished recoveries. This is ensured by the fact that we do not start a recovery that is not reflected in the cluster state available on the primary node, and that we always sample a fresh cluster state before starting to replicate write operations.
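As a toy sketch of invariant 1 (all class and field names below are invented for illustration, not the actual Elasticsearch code), the primary samples a fresh cluster state at the start of every replication, so a replica whose recovery has been started is always among the targets:

```java
import java.util.Set;

// Toy model of invariant 1: the primary samples a fresh cluster state
// before replicating each write, so any replica whose recovery has been
// started (and is therefore present in that state) receives the write.
class ClusterState {
    final long version;
    final Set<String> assignedReplicas; // active and initializing replicas

    ClusterState(long version, Set<String> assignedReplicas) {
        this.version = version;
        this.assignedReplicas = assignedReplicas;
    }
}

class PrimaryShard {
    private volatile ClusterState latest;

    PrimaryShard(ClusterState initial) {
        this.latest = initial;
    }

    // a recovery is only started once it is reflected in the cluster state
    // available on the primary node, so applying states in version order
    // keeps the routing view complete
    void applyClusterState(ClusterState state) {
        if (state.version > latest.version) {
            latest = state;
        }
    }

    // sample a fresh cluster state at the start of every replication
    Set<String> replicationTargets() {
        return latest.assignedReplicas;
    }
}
```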

Invariant 2: Every operation that is not part of the snapshot taken for phase 2 must be successfully indexed on the target replica (pending shard-level errors, which will cause the target shard to be failed). To ensure this, we start replicating to the target shard as soon as the recovery starts and open its engine before we take the snapshot. All operations that are indexed after the snapshot was taken are guaranteed to arrive at the shard when it is ready to index them. Note that this also means that replication does not fail a shard that is not yet ready to receive operations; that is a normal state for a recovering shard.
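Invariant 2 can be modeled with a toy sketch as well (names invented for illustration): operations indexed before the snapshot may be ignored by the not-yet-ready replica but are replayed from the snapshot, while operations indexed afterwards find the engine already open.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of invariant 2: replication to the recovering replica starts
// at recovery start, and the replica's engine is opened before the phase-2
// translog snapshot is taken, so every operation indexed after the
// snapshot can be applied on arrival.
class ReplicaTarget {
    private boolean engineOpen = false;
    final Set<Long> applied = new HashSet<>();

    void openEngine() {
        engineOpen = true;
    }

    void receive(long op) {
        // a shard that is not yet ready simply ignores the operation;
        // such operations must be covered by the translog snapshot instead
        if (engineOpen) {
            applied.add(op);
        }
    }
}

class SourceShard {
    private final List<Long> translog = new ArrayList<>();
    private List<Long> snapshot;
    private final ReplicaTarget target;

    SourceShard(ReplicaTarget target) {
        this.target = target;
    }

    void index(long op) {
        translog.add(op);
        target.receive(op); // live replication begins at recovery start
    }

    void takeSnapshot() {
        target.openEngine(); // correct ordering: open the engine first, ...
        snapshot = new ArrayList<>(translog); // ... then snapshot the translog
    }

    void replaySnapshot() {
        snapshot.forEach(target::receive);
    }
}
```

The bug described below arises when these two steps are split across the relocation source and target and their ordering is no longer guaranteed.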

With primary relocations, both invariants can be violated. Consider a primary relocating while another replica shard is recovering from that primary.

Invariant 1 can be violated if the target of the primary relocation lags so far behind on cluster state processing that it does not even know about the new initializing replica. This is very rare in practice, as replica recoveries take time to copy all the index files, but it is a theoretical gap that surfaces in testing scenarios.

Invariant 2 can be violated even if the target primary knows about the initializing replica. This can happen if the target primary replicates an operation to the initializing shard and that operation arrives at the initializing shard before it has opened its engine, but arrives at the primary relocation source only after it has taken the snapshot of the translog. Such operations are currently missed on the new initializing replica.

The fix to re-establish invariant 1 is to ensure that the primary relocation target has a cluster state that includes all replica recoveries that were successfully started on the primary relocation source. The fix to re-establish invariant 2 is to check, after opening the engine on the replica, whether the primary has been relocated in the meantime and, if so, to fail the recovery.
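The two fixes can be sketched roughly as follows (again with invented names; the real implementation blocks on a cluster state observer and consults the shard's routing entry rather than returning booleans):

```java
// Fix for invariant 1: the relocation source tells the relocation target
// which cluster state version covers all replica recoveries it started;
// the target must have processed at least that version before the handoff
// completes.
class RelocationTarget {
    private long appliedClusterStateVersion;

    RelocationTarget(long appliedClusterStateVersion) {
        this.appliedClusterStateVersion = appliedClusterStateVersion;
    }

    void applyClusterState(long version) {
        appliedClusterStateVersion = Math.max(appliedClusterStateVersion, version);
    }

    boolean hasProcessed(long requiredVersion) {
        return appliedClusterStateVersion >= requiredVersion;
    }
}

// Fix for invariant 2: after opening the engine on the recovering replica,
// re-check whether the primary has been relocated in the meantime; if so,
// fail the recovery so that it is retried against the new primary.
class ReplicaRecovery {
    void afterEngineOpened(boolean primaryRelocatedMeanwhile) {
        if (primaryRelocatedMeanwhile) {
            throw new IllegalStateException("primary relocated during recovery; failing recovery for retry");
        }
    }
}
```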

@ywelsch ywelsch merged commit c4fe8e7 into elastic:master Jul 19, 2016

1 check passed

CLA: Commit author is a member of Elasticsearch