Handle retention of failed and partial snapshots in SLM #46988

Closed
gwbrown opened this issue Sep 23, 2019 · 7 comments
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement

Comments

@gwbrown
Contributor

gwbrown commented Sep 23, 2019

Problem

Right now, SLM treats PARTIAL and FAILED snapshots the same, and both are kept around forever. This is unlikely to be the behavior users will expect from SLM, so SLM should handle retention of partial and failed snapshots as well.

Proposed solution

FAILED snapshots will be kept until the configured expire_after period has passed, if present, and then be deleted. If there is no configured expire_after in the retention policy, then they will be deleted if there is at least one more recent successful snapshot from this policy (as they may otherwise be useful for troubleshooting purposes). Failed snapshots are not counted towards either min_count or max_count. (This has been implemented in #47617)
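As a rough illustration of the FAILED rule described above, here is a minimal Python sketch. The `Snapshot` class and the `failed_is_deletable` helper are hypothetical names invented for this example, not SLM's actual code, and the sketch assumes "period has passed" means the snapshot's age strictly exceeds `expire_after`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    name: str
    state: str       # "SUCCESS", "PARTIAL", or "FAILED"
    start: datetime  # when the snapshot was taken


def failed_is_deletable(
    snap: Snapshot,
    policy_snaps: Sequence[Snapshot],
    now: datetime,
    expire_after: Optional[timedelta],
) -> bool:
    """Decide whether a FAILED snapshot may be deleted (hypothetical helper)."""
    if snap.state != "FAILED":
        raise ValueError("this rule only covers FAILED snapshots")
    if expire_after is not None:
        # expire_after is configured: keep the snapshot until the period has passed.
        return now - snap.start > expire_after
    # No expire_after: delete once a more recent successful snapshot exists.
    return any(s.state == "SUCCESS" and s.start > snap.start for s in policy_snaps)
```

Because failed snapshots are not counted toward `min_count` or `max_count`, the helper deliberately takes neither as a parameter.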

PARTIAL snapshots are more likely to be useful, so need to be handled a bit differently. For this case, there are two potential routes: One that is simple, and one that attempts to be intuitive.

Simple

Partial snapshots are retained unless there is at least one more recent successful snapshot from the same policy, at which point they are deleted after the expire_after period has passed, if present. If expire_after is not present and there is a more recent successful snapshot, they are deleted in the next retention run. In this case, partial snapshots are not counted toward either min_count or max_count, which count successful snapshots only. (This has been implemented in #47833)
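The "simple" rule could be sketched roughly as follows. `Snapshot`, `partial_is_deletable`, and `counted_toward_limits` are hypothetical names for illustration only; the sketch assumes a snapshot's age is measured from its start time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    name: str
    state: str       # "SUCCESS", "PARTIAL", or "FAILED"
    start: datetime  # when the snapshot was taken


def partial_is_deletable(
    snap: Snapshot,
    policy_snaps: Sequence[Snapshot],
    now: datetime,
    expire_after: Optional[timedelta],
) -> bool:
    """Decide whether a PARTIAL snapshot may be deleted under the simple rule."""
    if snap.state != "PARTIAL":
        raise ValueError("this rule only covers PARTIAL snapshots")
    newer_success = any(
        s.state == "SUCCESS" and s.start > snap.start for s in policy_snaps
    )
    if not newer_success:
        # Without a more recent success, the partial snapshot is always retained.
        return False
    if expire_after is None:
        # No expire_after: eligible in the next retention run.
        return True
    return now - snap.start > expire_after


def counted_toward_limits(policy_snaps: Sequence[Snapshot]) -> List[Snapshot]:
    """Only successful snapshots count toward min_count and max_count."""
    return [s for s in policy_snaps if s.state == "SUCCESS"]
```

The key difference from the FAILED rule is that a newer successful snapshot is required even when `expire_after` is set.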

Complex

  • If min_count is the only condition: No snapshots for this policy are ever deleted, so partial snapshots have no special handling.
  • If expire_after is the only condition: At least one successful snapshot will be kept, regardless of expire_after. Partial snapshots are deleted after the expire_after period has passed, regardless of whether or not there is a more recent successful snapshot.
  • If max_count is the only condition: At least one successful snapshot will be kept. Partial snapshots are deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count.
  • If min_count and expire_after are configured: At least min_count successful snapshots will be retained. Partial snapshots are deleted after the expire_after period has passed, regardless of whether or not there is a more recent successful snapshot.
  • If min_count and max_count are configured: At least min_count successful snapshots will be retained. All other snapshots, whether successful or partial, will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count.
  • If expire_after and max_count are configured: At least one successful snapshot will be kept, regardless of expire_after. Partial snapshots will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count, as well as after the expire_after period, regardless of whether there is a more recent successful snapshot.
  • If all three conditions are configured: At least min_count successful snapshots will be retained. All other snapshots, whether successful or partial, will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count, as well as after the expire_after period has passed, regardless of whether there is a more recent successful snapshot.
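To make the seven cases above easier to compare, here is one possible reading of them as a single Python function. This is only a sketch under stated assumptions: the names `Snapshot` and `complex_retention` are hypothetical, the protected successful snapshots are assumed to be the newest ones, and in the `max_count`-only and `expire_after`-only cases older successful snapshots are assumed to be eligible for deletion alongside partial ones (the bullets leave this open).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    name: str
    state: str       # "SUCCESS" or "PARTIAL"
    start: datetime  # when the snapshot was taken


def complex_retention(
    snaps: Sequence[Snapshot],
    now: datetime,
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
    expire_after: Optional[timedelta] = None,
) -> List[str]:
    """Return the names of snapshots to delete, oldest first (one interpretation)."""
    if max_count is None and expire_after is None:
        # min_count alone (or no condition) never deletes anything.
        return []

    # Assumption: protect the newest min_count successes, or the newest one
    # when min_count is not configured.
    floor = min_count if min_count is not None else 1
    successes = sorted(
        (s for s in snaps if s.state == "SUCCESS"),
        key=lambda s: s.start,
        reverse=True,
    )
    protected = {s.name for s in successes[:floor]}

    # Everything else is a deletion candidate, considered oldest first.
    candidates = sorted(
        (s for s in snaps if s.name not in protected), key=lambda s: s.start
    )

    doomed = set()
    if expire_after is not None:
        doomed = {s.name for s in candidates if now - s.start > expire_after}

    if max_count is not None:
        remaining = len(snaps) - len(doomed)
        for s in candidates:
            if remaining <= max_count:
                break
            if s.name not in doomed:
                doomed.add(s.name)
                remaining -= 1

    return [s.name for s in candidates if s.name in doomed]
```

Note that if the protected snapshots alone exceed `max_count`, this sketch simply stops once the candidates run out rather than touching the protected set.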

Relates to #43663

@gwbrown gwbrown added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Sep 23, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@dakrone
Member

dakrone commented Sep 23, 2019

Thanks for opening this and describing the problem, Gordon. My opinion is that this should be as easy as possible for a user to reason about, and for that reason I am +1 on the failure handling and +1 on the "simple" solution for PARTIAL handling.

@original-brownbear
Member

I wonder if we really want to handle PARTIAL snapshots differently from SUCCESS.
Think about this:

  • You run a large cluster (lots of shards/nodes -> non-trivial chance of running into a PARTIAL snapshot) and take snapshots every N minutes
  • In most cases, your PARTIAL snapshots are mostly just missing a single shard/index and are otherwise fine

-> wouldn't you want your retention rules to be the same for SUCCESS and PARTIAL here?
It seems to me that for large clusters where PARTIAL is simply a reality (e.g. a primary will fail over during a snapshot every so often and fail its snapshot action), and whose users care about the actual point-in-time snapshots and not just about having the latest snapshot, the two statuses are pretty much the same?

@dakrone
Member

dakrone commented Sep 24, 2019

I think in that scenario a user always wants to have a valid SUCCESS snapshot, since that's the only one guaranteed to contain all of their data. We never want to expunge an old SUCCESS snapshot even if they have 10 other PARTIAL snapshots that are newer.

@original-brownbear
Member

original-brownbear commented Sep 24, 2019

@dakrone True, it's a fair point that users want snapshots that guarantee all data. I think I worded the above badly :) What I tried to get across is that there could be valid scenarios where PARTIAL snapshots are highly likely and dropping a partial snapshot instead of an old fully SUCCESSFUL one might be very painful. Maybe just add some option to simply not count partial snapshots towards the retention count limit (and then just expunge them as you expire successful ones, i.e. whenever you expunge a successful one you also expunge all PARTIAL ones older than it)?
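The option suggested in the paragraph above (partial snapshots do not count toward the limit, and are expunged together with the successful snapshots they precede) could look roughly like this. `Snapshot` and `expunge_by_count` are hypothetical names, and the sketch assumes the count limit behaves like `max_count` applied to successful snapshots only, with `max_count` of at least 1.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Sequence


@dataclass(frozen=True)
class Snapshot:
    name: str
    state: str       # "SUCCESS" or "PARTIAL"
    start: datetime  # when the snapshot was taken


def expunge_by_count(snaps: Sequence[Snapshot], max_count: int) -> List[str]:
    """Return names to delete, oldest first, counting only SUCCESS snapshots."""
    successes = sorted(
        (s for s in snaps if s.state == "SUCCESS"),
        key=lambda s: s.start,
        reverse=True,
    )
    # Successful snapshots beyond the newest max_count are expunged.
    expunged = successes[max_count:]
    if not expunged:
        return []
    # Every PARTIAL snapshot older than an expunged success goes with it.
    newest_expunged = max(s.start for s in expunged)
    partials = [
        s for s in snaps if s.state == "PARTIAL" and s.start < newest_expunged
    ]
    return [s.name for s in sorted(expunged + partials, key=lambda s: s.start)]
```

Under this scheme a partial snapshot is never deleted ahead of an older successful one, which is the pain point the comment describes.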

One thing to note:
In the end, you probably don't care much about PARTIAL snapshots in use cases where you just keep appending data to your cluster overall because snapshots will have a ton of overlap and you won't save much data by deleting PARTIAL snapshots that lie between SUCCESSFUL ones.
On the other hand, in use cases where PARTIAL snapshots matter because data in the cluster is updated/rotated in some form you'll see storage costs accruing from those PARTIAL snapshots.

=> we should probably deal with #42540 to up the value of PARTIAL snapshots :)

Either way, I now realise that technically users can use a long expire time to get around this anyway. Just figured I'd raise the question of whether we want to help ease the pain of frequent PARTIAL snapshots somehow :)

@nachogiljaldo

Something that might be useful (and that we do in ESS) is preserving the latest snapshot regardless of its state. For example, if the latest is FAILED you don't want to delete it, since you want to see that it failed and why it failed. So this might be another factor to consider.

@gwbrown
Contributor Author

gwbrown commented Oct 11, 2019

We've implemented the FAILED handling in #47617 and, hearing no objections, the simple version of PARTIAL handling in #47833, so I'm going to close this issue.

To clarify matters related to the comments above, the most recent snapshot overall is kept, as well as the most recent SUCCESSful snapshot - but if expire_after is not set, PARTIAL or FAILED snapshots will be deleted as soon as there is a more recent success.
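For the case where expire_after is not set, the behaviour described in this clarification can be illustrated with a small timeline. This is only a sketch: `Snapshot`, `always_kept`, and `deletable_without_expire_after` are hypothetical names, not the functions used in the linked pull requests.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Sequence, Set


@dataclass(frozen=True)
class Snapshot:
    name: str
    state: str       # "SUCCESS", "PARTIAL", or "FAILED"
    start: datetime  # when the snapshot was taken


def always_kept(snaps: Sequence[Snapshot]) -> Set[str]:
    """The most recent snapshot overall and the most recent SUCCESS are kept."""
    kept: Set[str] = set()
    if snaps:
        kept.add(max(snaps, key=lambda s: s.start).name)
    successes = [s for s in snaps if s.state == "SUCCESS"]
    if successes:
        kept.add(max(successes, key=lambda s: s.start).name)
    return kept


def deletable_without_expire_after(snaps: Sequence[Snapshot]) -> List[str]:
    """PARTIAL/FAILED snapshots with a more recent success, oldest first."""
    success_times = [s.start for s in snaps if s.state == "SUCCESS"]
    if not success_times:
        return []
    latest_success = max(success_times)
    older = [s for s in snaps if s.state != "SUCCESS" and s.start < latest_success]
    return [s.name for s in sorted(older, key=lambda s: s.start)]


timeline = [
    Snapshot("s1", "SUCCESS", datetime(2019, 10, 1)),
    Snapshot("f2", "FAILED", datetime(2019, 10, 2)),
    Snapshot("p3", "PARTIAL", datetime(2019, 10, 3)),
    Snapshot("s4", "SUCCESS", datetime(2019, 10, 4)),
    Snapshot("f5", "FAILED", datetime(2019, 10, 5)),
]
print(sorted(always_kept(timeline)))             # ['f5', 's4']
print(deletable_without_expire_after(timeline))  # ['f2', 'p3']
```

In this timeline the trailing failure `f5` survives because no successful snapshot is newer than it, which matches the wish to keep the latest snapshot around for troubleshooting.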

@gwbrown gwbrown closed this as completed Oct 11, 2019