
Adding a refresh listener to a recovering shard should be a noop #26055

Merged
merged 6 commits on Aug 4, 2017

Conversation

@bleskes (Member) commented Aug 4, 2017

When `refresh=wait_for` is set on an indexing request, we register a listener on the shard that is called during the next refresh. During the translog recovery phase, when the engine is open, there is a window of time in which indexing operations succeed and can add their listeners. Those listeners will only be called when the recovery finishes, as we do not refresh during recoveries (unless the indexing buffer is full). Besides being a bad user experience, this can also cause deadlocks with an ongoing peer recovery that may wait for those operations to mark the replica in sync (details below).

To fix this, this PR changes refresh listeners to be a noop when the shard is not yet serving reads (which implicitly covers the recovery period). Waiting is pointless anyway, as the shard cannot serve reads yet.

There is still a small problem I want to think about how to solve: an indexing operation that comes in with `wait_for_refresh` after recovery is finalized and before markAsDone is called will not be immediately visible when moving to POST_RECOVERY. I'm going to give it some more thought (I hope a simple refresh will do), but I think we can start reviewing the main issue.

Deadlock with recovery:

When finalizing a peer recovery, we mark the peer as "in sync". To do so, we wait until the peer's local checkpoint is at least as high as the global checkpoint. If an operation with `refresh=wait_for` is added as a listener on that peer during recovery, it is not completed from the perspective of the primary. The primary may then wait for it to complete before advancing the local checkpoint for that peer. Since that peer is not considered in sync, the global checkpoint on the primary can be higher, causing a deadlock: the operation waits for recovery to finish and a refresh to happen, while the recovery waits on the operation.
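The fix can be illustrated with a small, self-contained sketch. The class and method names below are hypothetical, not the actual IndexShard/RefreshListeners API: while the shard is not yet serving reads, a newly registered listener is completed immediately (the noop) instead of being queued, so no caller blocks on a refresh that will not happen until recovery ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, simplified sketch of the behavior this PR describes;
// names are illustrative only.
class RefreshListenerSketch {
    private boolean readAllowed = false;              // false while the shard is recovering
    private final List<Consumer<Boolean>> pending = new ArrayList<>();

    // Register a refresh listener. While the shard is not yet serving reads,
    // complete the listener immediately instead of queueing it.
    void addRefreshListener(Consumer<Boolean> listener) {
        if (readAllowed) {
            pending.add(listener);                    // fire on the next refresh
        } else {
            listener.accept(false);                   // noop: no refresh was forced
        }
    }

    void markReadAllowed() { readAllowed = true; }    // e.g. moving to POST_RECOVERY

    void refresh() {
        List<Consumer<Boolean>> toNotify = new ArrayList<>(pending);
        pending.clear();
        for (Consumer<Boolean> l : toNotify) {
            l.accept(true);                           // a refresh actually happened
        }
    }
}
```

With this shape, an indexing operation arriving during recovery gets its listener called right away, so the primary never waits on a listener that can only fire after the recovery it is blocking.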

@jasontedor (Member) left a comment

LGTM.

Can you also revert f154e53 and e1ef3d5?

@@ -848,7 +848,7 @@ public long getWritingBytes() {

public RefreshStats refreshStats() {
// Null refreshListeners means this shard doesn't support them so there can't be any.

@jasontedor (Member) commented Aug 4, 2017
Drop the comment too?

@bleskes (Author, Member) commented Aug 4, 2017

yep

.settings(settings)
.primaryTerm(0, 1).build();
IndexShard primary = newShard(new ShardId(metaData.getIndex(), 0), true, "n1", metaData, null);
recoveryShardFromStore(primary);

@jasontedor (Member) commented Aug 4, 2017

While we're here, can you fix the name of this method to be recoverShardFromStore?

@bleskes (Author, Member) commented Aug 4, 2017

fixed.

bleskes added some commits Aug 4, 2017

@bleskes (Member, Author) commented Aug 4, 2017

@jasontedor thanks. Pushed another commit with a solution for the visibility issue. Can you take another look?

@jasontedor (Member) left a comment

Visibility change looks good, so still LGTM.

bleskes added some commits Aug 4, 2017

doh

@bleskes bleskes force-pushed the bleskes:wait_for_refresh_recovery branch to fc6e558 Aug 4, 2017

@bleskes bleskes merged commit e11cbed into elastic:master Aug 4, 2017

2 checks passed
- CLA: Commit author is a member of Elasticsearch
- elasticsearch-ci: Build finished.

@bleskes bleskes deleted the bleskes:wait_for_refresh_recovery branch Aug 4, 2017

bleskes added a commit that referenced this pull request Aug 4, 2017

Adding a refresh listener to a recovering shard should be a noop (#26055)


bleskes added a commit that referenced this pull request Aug 4, 2017

Adding a refresh listener to a recovering shard should be a noop (#26055)


@colings86 colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
