Check again on-going snapshots/restores of indices before closing #43873
Conversation
Pinging @elastic/es-distributed
@elasticmachine update branch
Spoke about this with @tlrx just now and had one concern:
I wondered if it wouldn't be better to prevent a new snapshot of the affected indices from being started between the two phases of closing (i.e. just bail out of starting the snapshot because closing is in progress for an index that would be snapshotted).
It seems to me that this would:
- Be simpler in terms of the state transitions (no redundant failing-to-close step and subsequent cleanup step on concurrent snapshots)
- Avoid a situation, for users running back-to-back snapshots of the whole cluster, where closing an index becomes effectively impossible because a concurrent snapshot start could keep interrupting the close operation over and over (if the snapshot is run in a loop, which some users apparently do)

The main problem I see with this is that if an index is not successfully closed (but the closed block is still left behind), no new snapshots can be done, as we can't distinguish between "is between the two phases of closing" and "closing failed and left the closed block behind". This is trappy, as it puts the cluster at risk of data loss and requires manual intervention to fix. I would prefer the original solution.
@ywelsch makes sense to me with the way things are; @tlrx said the same :) Maybe one question below, just for my own education (whenever you have a second):
Isn't the latter a bug (the fact that we just leave the block behind when closing fails), to be resolved by removing the close block when closing failed? (Or do we not want to do that, to prevent further indexing into the index?)
LGTM :) just a suggestion on the exception messages that seem to break our style in some spots.
server/src/main/java/org/elasticsearch/snapshots/RestoreService.java
server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataIndexStateService.java
There is currently no automated internal retry (we might add that at some point), but the user's intent was clearly to close the index, so leaving the block behind should not be surprising. If the user tries to index into the index, they will get a proper exception message saying that the index might be in the process of being closed, and that they can either remove the block by reopening the index or issue the close command again. By keeping the block in place, a follow-up close is also more likely to succeed, as it can just reuse the block while stepping through the phases. If we auto-removed the block, two concurrent close operations might be problematic, as they would interfere with each other. The close operation is currently idempotent, also in failure cases.
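The behaviour described above can be illustrated with a minimal sketch. This is not Elasticsearch's actual `MetaDataIndexStateService` code; the class, state names, and message text here are hypothetical, chosen only to show the idea: a failed close leaves the block behind, writes get a helpful rejection, and a retried close reuses the existing block rather than conflicting with it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: close is idempotent because adding the close block
// reuses a block left behind by an earlier failed close attempt.
public class IdempotentCloseSketch {
    enum State { OPEN, BLOCKED, CLOSED }

    private final Map<String, State> states = new HashMap<>();

    IdempotentCloseSketch(String index) {
        states.put(index, State.OPEN);
    }

    // Phase 1 of close: add the block. An existing block (from an earlier
    // failed close) is simply reused instead of causing a conflict.
    void addCloseBlock(String index) {
        if (states.get(index) == State.OPEN) {
            states.put(index, State.BLOCKED);
        }
        // already BLOCKED or CLOSED: nothing to do (idempotent)
    }

    // Indexing into a blocked index yields a hint, not a silent failure.
    String tryIndex(String index) {
        if (states.get(index) != State.OPEN) {
            return "index [" + index + "] may be in the process of being closed; "
                 + "reopen it to remove the block, or issue the close again";
        }
        return "indexed";
    }

    public static void main(String[] args) {
        IdempotentCloseSketch s = new IdempotentCloseSketch("logs");
        s.addCloseBlock("logs");  // first close attempt adds the block
        s.addCloseBlock("logs");  // retry reuses the block (no interference)
        System.out.println(s.tryIndex("logs").startsWith("index [logs]")); // prints "true"
    }
}
```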
I've added the 7.3.0 label here as it's a bugfix.
@elasticmachine update branch
Thanks @original-brownbear !
Today we prevent any index that is actively snapshotted or restored from being closed. This verification is done during the execution of the first phase of index closing (i.e. before blocking the indices).
We should also do this verification again in the last phase of index closing (i.e. after the shard sanity checks and right before actually changing the index state and the routing table), because a snapshot/restore could sneak in while the shards are being verified-before-close.
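The two-phase check above can be sketched as follows. This is a hypothetical simplification, not the real `MetaDataIndexStateService`/`SnapshotsService` code: the class and method names are invented, and cluster-state handling is reduced to in-memory sets, purely to show why the verification must run a second time right before the final state change.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the double-check: indices being snapshotted or
// restored are verified both before blocking (phase 1) and again right
// before the actual state/routing-table change (phase 2).
public class CloseIndexSketch {
    private final Set<String> snapshottedIndices = new HashSet<>();
    private final Set<String> blockedIndices = new HashSet<>();
    private final Set<String> closedIndices = new HashSet<>();

    void startSnapshot(String index) { snapshottedIndices.add(index); }
    void endSnapshot(String index)   { snapshottedIndices.remove(index); }

    // Phase 1: verify no snapshot/restore is running, then add the block.
    boolean beginClose(String index) {
        if (snapshottedIndices.contains(index)) {
            return false; // refuse to start closing
        }
        blockedIndices.add(index);
        return true;
    }

    // Phase 2: verify AGAIN, because a snapshot may have started while the
    // shards were being verified-before-close; abort and keep the block.
    boolean finishClose(String index) {
        if (snapshottedIndices.contains(index)) {
            return false; // snapshot sneaked in between the phases
        }
        closedIndices.add(index);
        return true;
    }

    public static void main(String[] args) {
        CloseIndexSketch s = new CloseIndexSketch();
        boolean phase1 = s.beginClose("logs"); // true: nothing running yet
        s.startSnapshot("logs");               // snapshot starts in between
        boolean phase2 = s.finishClose("logs");// false: second check catches it
        System.out.println(phase1 && !phase2); // prints "true"
    }
}
```

Without the second check, `finishClose` would happily close an index whose shards a concurrent snapshot had just started copying.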