
Check again on-going snapshots/restores of indices before closing #43873

Merged

Conversation

tlrx
Member

@tlrx commented Jul 2, 2019

Today we prevent any index that is actively snapshotted or restored from being closed. This verification is done during the execution of the first phase of index closing (i.e. before blocking the indices).

We should also do this verification again in the last phase of index closing (i.e. after the shard sanity checks and right before actually changing the index state and the routing table), because a snapshot/restore could sneak in while the shards are verified-before-close.
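
For readers skimming the change, here is a minimal, self-contained sketch of the shape of the fix. This is a model, not the actual Elasticsearch code: `ClusterStateModel`, `startClose`, `finishClose`, and `ensureNotSnapshottedOrRestored` are hypothetical stand-ins for the real cluster-state types and the close-index service.

```java
import java.util.Set;

final class CloseIndexModel {

    /** Stand-in for the parts of the cluster state that the close path consults. */
    static final class ClusterStateModel {
        final Set<String> snapshottingOrRestoring; // indices present in SnapshotsInProgress/RestoreInProgress
        ClusterStateModel(Set<String> busy) {
            this.snapshottingOrRestoring = busy;
        }
    }

    /** First phase: verify, then add the close block and start the shard sanity checks. */
    static void startClose(ClusterStateModel state, String index) {
        ensureNotSnapshottedOrRestored(state, index);
        // ... add the index close block and kick off verify-shard-before-close ...
    }

    /**
     * Last phase: run the same verification again right before flipping the index
     * state and the routing table, because a snapshot/restore may have started
     * while the shards were being verified-before-close.
     */
    static void finishClose(ClusterStateModel state, String index) {
        ensureNotSnapshottedOrRestored(state, index); // the re-check this PR adds
        // ... change the index state to CLOSE and update the routing table ...
    }

    static void ensureNotSnapshottedOrRestored(ClusterStateModel state, String index) {
        if (state.snapshottingOrRestoring.contains(index)) {
            throw new IllegalStateException(
                "cannot close index [" + index + "]: a snapshot or restore is in progress");
        }
    }
}
```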

@tlrx added the >bug, :Distributed/Snapshot/Restore, v8.0.0 and v7.3.0 labels Jul 2, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@tlrx
Member Author

tlrx commented Jul 3, 2019

@elasticmachine update branch

@original-brownbear
Member

original-brownbear left a comment

Spoke about this with @tlrx just now and had one concern:

I wondered if it wouldn't be better to prevent a new snapshot from being started for the affected indices between the two phases of closing (i.e. just bail out because closing is in progress for an index that should be snapshotted); a rough sketch of this alternative follows the list below.

It seems to me that this would:

  1. Be simpler in terms of the state transitions (no redundant failing-to-close step and subsequent cleanup step on concurrent snapshots).
  2. Avoid a situation where, for users running back-to-back snapshots of the whole cluster, closing an index becomes effectively impossible because a concurrent snapshot start could keep interrupting the close operation over and over (if the snapshot runs in a loop ... which some users apparently do).
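
For context, here is a rough sketch of the alternative being floated, with made-up names (this is not code from the PR): snapshot creation would bail out if any target index has a close in flight.

```java
import java.util.Set;

final class SnapshotGateModel {

    /**
     * Reject a new snapshot outright if any of its target indices is currently
     * being closed, instead of re-checking on the close side.
     */
    static void startSnapshot(Set<String> indicesBeingClosed, Set<String> indicesToSnapshot) {
        for (String index : indicesToSnapshot) {
            if (indicesBeingClosed.contains(index)) {
                throw new IllegalStateException(
                    "cannot snapshot index [" + index + "]: the index is being closed");
            }
        }
        // ... proceed with snapshot creation ...
    }
}
```

As the discussion below explains, this approach was ultimately not taken.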

@ywelsch
Contributor

ywelsch commented Jul 3, 2019

> I wondered if it wouldn't be better to prevent a new snapshot from being started for the affected indices between the two phases of closing (i.e. just bail out because closing is in progress for an index that should be snapshotted).

The main problem I see with this is that if an index is not successfully closed but the closed block is still left behind, no new snapshots can be taken, as we can't distinguish between "is between the two phases of closing" and "closing failed and left the closed block behind". This is trappy, as it puts the cluster at risk of data loss and requires manual intervention to fix. I would rather prefer the original solution.

@original-brownbear
Member

@ywelsch that makes sense to me given the way things are; @tlrx said the same :) One question below, just for my own education (whenever you have a second):

> as we can't distinguish between "is between the two phases of closing" and "closing failed and left the closed block behind".

Isn't the latter a bug (the fact that we just leave the block behind when closing fails), to be resolved by removing the close block when closing fails? (Or do we deliberately not do that, to prevent further indexing into the index?)

@original-brownbear
Member

original-brownbear left a comment

LGTM :) Just a suggestion on the exception messages, which seem to break our style in some spots.

@ywelsch
Contributor

ywelsch commented Jul 3, 2019

> as we can't distinguish between "is between the two phases of closing" and "closing failed and left the closed block behind".
>
> Isn't the latter a bug (the fact that we just leave the block behind when closing fails), to be resolved by removing the close block when closing fails? (Or do we deliberately not do that, to prevent further indexing into the index?)

There is currently no automated internal retry (we might add one at some point), but the user's intent was clearly to close the index, so leaving the block behind should not be surprising. If the user tries to index into the index, they will get a proper exception message saying that the index might be in the process of being closed, and that they can either remove the block by reopening the index or issue the close command again. Keeping the block in place also makes a follow-up close more likely to succeed, as it can simply reuse the block while stepping through the phases. If we auto-removed the block, two concurrent close operations might be problematic, as they would interfere with each other. The close operation is currently idempotent, also in failure cases.
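
To illustrate the idempotency argument, here is a tiny self-contained model; `CloseBlockModel` and its methods are hypothetical (the real block lives in the cluster state), but it shows how a repeated or concurrent close converges on the same block, which is only ever removed by an explicit reopen.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class CloseBlockModel {
    // index name -> id of its close block (the real code tracks this in the cluster blocks)
    private final Map<String, String> closeBlocks = new ConcurrentHashMap<>();

    /**
     * Idempotent close: reuse an existing block if a previous (possibly failed)
     * close attempt left one behind, otherwise mint a new one. Two concurrent
     * close operations converge on the same block instead of interfering.
     */
    String closeIndex(String index) {
        return closeBlocks.computeIfAbsent(index, i -> UUID.randomUUID().toString());
    }

    /** The block is only ever removed by an explicit reopen, never automatically. */
    void openIndex(String index) {
        closeBlocks.remove(index);
    }
}
```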

@jpountz added the v7.4.0 label and removed the v7.3.0 label Jul 3, 2019
@ywelsch added the v7.3.0 label Jul 5, 2019
@ywelsch
Contributor

ywelsch commented Jul 5, 2019

I've added the 7.3.0 label here as it's a bugfix.

@tlrx
Member Author

tlrx commented Jul 8, 2019

@elasticmachine update branch

@tlrx merged commit a5d9939 into elastic:master Jul 8, 2019
@tlrx deleted the check-again-restored-snapshot-indices-while-closing branch July 8, 2019 15:06
@tlrx
Member Author

tlrx commented Jul 8, 2019

Thanks @original-brownbear !

tlrx added a commit that referenced this pull request Jul 8, 2019
…3873)

Today we prevent any index that is actively snapshotted or restored from being
closed. This verification is done during the execution of the first phase of
index closing (i.e. before blocking the indices).

We should also do this verification again in the last phase of index closing
(i.e. after the shard sanity checks and right before actually changing the index
state and the routing table), because a snapshot/restore could sneak in while
the shards are verified-before-close.
tlrx added a commit that referenced this pull request Jul 8, 2019
…3873)
Labels: >bug, :Distributed/Snapshot/Restore, v7.3.0, v7.4.0, v8.0.0-alpha1