Check again on-going snapshots/restores of indices before closing #43873
Conversation
Pinging @elastic/es-distributed
@elasticmachine update branch
Spoke about this with @tlrx just now and had one concern:
I wondered if it wouldn't be better to prevent a new snapshot of the affected indices from being started between the two phases of closing (i.e. just bail out of starting the snapshot because closing is in progress for an index that would be snapshotted).
It seems to me that this would:
- Be simpler in terms of the state transitions (no redundant failing-to-close step and subsequent cleanup step on concurrent snapshots)
- Avoid a situation, for users running back-to-back snapshots of the whole cluster, where closing an index becomes effectively impossible because a concurrent snapshot start could keep interrupting the close operation over and over (if the snapshot is run in a loop, which some users apparently do)

The main problem I see with this is that if an index is not successfully closed (but the closed block is still left behind), no new snapshots can be done, as we can't distinguish between "is between the two phases of closing" and "closing failed and left the closed block behind". This is trappy, as it puts the cluster at risk of data loss and requires manual intervention to fix. I would prefer the original solution.
@ywelsch makes sense to me with the way things are; @tlrx said the same :) Maybe one question below, just for my own education (whenever you have a second):
Isn't the latter a bug (the fact that we just leave the block behind when closing fails), to be resolved by removing the close block when closing failed? (Or do we not want to do that, to prevent further indexing into the index?)
LGTM :) just a suggestion on the exception messages that seem to break our style in some spots.
server/src/main/java/org/elasticsearch/snapshots/RestoreService.java
server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataIndexStateService.java
There is currently no automated internal retry (we might add that at some point), but the user's intent was clearly to close the index, so leaving the block behind should not be surprising. If the user tries to index into the index, they will get a proper exception message saying that the index might be in the process of being closed, and that they can either remove the block by reopening the index or issue the close command again. By keeping the block in place, a follow-up close is also more likely to succeed, as it can just reuse the block while stepping through the phases. If we auto-removed the block, two concurrent close operations might be problematic, as they would interfere with each other. The close operation is currently idempotent, also in failure cases.
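The behaviour described above can be illustrated with a minimal sketch. This is not Elasticsearch's actual `MetaDataIndexStateService` code; the class, state names, and message text here are hypothetical, chosen only to show the idea: a failed close leaves the block behind, writes get a helpful rejection, and a retried close reuses the existing block rather than conflicting with it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: close is idempotent because adding the close block
// reuses a block left behind by an earlier failed close attempt.
public class IdempotentCloseSketch {
    enum State { OPEN, BLOCKED, CLOSED }

    private final Map<String, State> states = new HashMap<>();

    IdempotentCloseSketch(String index) {
        states.put(index, State.OPEN);
    }

    // Phase 1 of close: add the block. An existing block (from an earlier
    // failed close) is simply reused instead of causing a conflict.
    void addCloseBlock(String index) {
        if (states.get(index) == State.OPEN) {
            states.put(index, State.BLOCKED);
        }
        // already BLOCKED or CLOSED: nothing to do (idempotent)
    }

    // Indexing into a blocked index yields a hint, not a silent failure.
    String tryIndex(String index) {
        if (states.get(index) != State.OPEN) {
            return "index [" + index + "] may be in the process of being closed; "
                 + "reopen it to remove the block, or issue the close again";
        }
        return "indexed";
    }

    public static void main(String[] args) {
        IdempotentCloseSketch s = new IdempotentCloseSketch("logs");
        s.addCloseBlock("logs");  // first close attempt adds the block
        s.addCloseBlock("logs");  // retry reuses the block (no interference)
        System.out.println(s.tryIndex("logs").startsWith("index [logs]")); // prints "true"
    }
}
```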
I've added the 7.3.0 label here as it's a bugfix.
@elasticmachine update branch
Thanks @original-brownbear !
Today we prevent any index that is actively snapshotted or restored from being closed. This verification is done during the execution of the first phase of index closing (i.e. before blocking the indices).
We should also do this verification again in the last phase of index closing (i.e. after the shard sanity checks and right before actually changing the index state and the routing table), because a snapshot/restore could sneak in while the shards are being verified-before-close.
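The two-phase check above can be sketched as follows. This is a hypothetical simplification, not the real `MetaDataIndexStateService`/`SnapshotsService` code: the class and method names are invented, and cluster-state handling is reduced to in-memory sets, purely to show why the verification must run a second time right before the final state change.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the double-check: indices being snapshotted or
// restored are verified both before blocking (phase 1) and again right
// before the actual state/routing-table change (phase 2).
public class CloseIndexSketch {
    private final Set<String> snapshottedIndices = new HashSet<>();
    private final Set<String> blockedIndices = new HashSet<>();
    private final Set<String> closedIndices = new HashSet<>();

    void startSnapshot(String index) { snapshottedIndices.add(index); }
    void endSnapshot(String index)   { snapshottedIndices.remove(index); }

    // Phase 1: verify no snapshot/restore is running, then add the block.
    boolean beginClose(String index) {
        if (snapshottedIndices.contains(index)) {
            return false; // refuse to start closing
        }
        blockedIndices.add(index);
        return true;
    }

    // Phase 2: verify AGAIN, because a snapshot may have started while the
    // shards were being verified-before-close; abort and keep the block.
    boolean finishClose(String index) {
        if (snapshottedIndices.contains(index)) {
            return false; // snapshot sneaked in between the phases
        }
        closedIndices.add(index);
        return true;
    }

    public static void main(String[] args) {
        CloseIndexSketch s = new CloseIndexSketch();
        boolean phase1 = s.beginClose("logs"); // true: nothing running yet
        s.startSnapshot("logs");               // snapshot starts in between
        boolean phase2 = s.finishClose("logs");// false: second check catches it
        System.out.println(phase1 && !phase2); // prints "true"
    }
}
```

Without the second check, `finishClose` would happily close an index whose shards a concurrent snapshot had just started copying.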