Handle retention of failed and partial snapshots in SLM #46988

gwbrown · 2019-09-23T22:56:17Z

Problem

Right now, SLM treats PARTIAL and FAILED snapshots the same, and both are kept around forever. This is unlikely to be the behavior users will expect from SLM, so SLM should handle retention of partial and failed snapshots as well.

Proposed solution

FAILED snapshots will be kept until the configured expire_after period has passed, if present, and then be deleted. If there is no configured expire_after in the retention policy, then they will be deleted if there is at least one more recent successful snapshot from this policy (as they may otherwise be useful for troubleshooting purposes). Failed snapshots are not counted towards either min_count or max_count. (This has been implemented in #47617)
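
For illustration only, the FAILED rule above could be sketched as follows. The `Snapshot` class and `failed_snapshot_is_deletable` function are hypothetical names made up for this example; this is not the code from #47617.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in for a snapshot taken by one SLM policy."""
    name: str
    state: str  # "SUCCESS", "PARTIAL", or "FAILED"
    start_time: datetime


def failed_snapshot_is_deletable(
    snap: Snapshot,
    policy_snapshots: List[Snapshot],
    now: datetime,
    expire_after: Optional[timedelta] = None,
) -> bool:
    """Return True if retention should delete this FAILED snapshot."""
    if snap.state != "FAILED":
        raise ValueError("this sketch only covers FAILED snapshots")
    if expire_after is not None:
        # expire_after is configured: age alone decides.
        return now - snap.start_time > expire_after
    # No expire_after: keep the failure around for troubleshooting
    # until a more recent successful snapshot exists for the policy.
    return any(
        other.state == "SUCCESS" and other.start_time > snap.start_time
        for other in policy_snapshots
    )
```

Because FAILED snapshots never enter the count, min_count and max_count do not appear in the function at all.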

PARTIAL snapshots are more likely to be useful, so need to be handled a bit differently. For this case, there are two potential routes: One that is simple, and one that attempts to be intuitive.

Simple

Partial snapshots are retained unless there is at least one more recent successful snapshot from the same policy, at which point they are deleted after the expire_after period has passed, if present. If expire_after is not present and there is a more recent successful snapshot, they are deleted in the next retention run. In this case, partial snapshots are not counted toward either min_count or max_count, which count successful snapshots only. (This has been implemented in #47833)
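
A minimal sketch of the simple rule, assuming a hypothetical `Snapshot` record and function name (neither comes from #47833). The difference from the FAILED rule is that a newer successful snapshot is always required before a PARTIAL one can go.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in for a snapshot taken by one SLM policy."""
    name: str
    state: str  # "SUCCESS", "PARTIAL", or "FAILED"
    start_time: datetime


def partial_snapshot_is_deletable(
    snap: Snapshot,
    policy_snapshots: List[Snapshot],
    now: datetime,
    expire_after: Optional[timedelta] = None,
) -> bool:
    """Return True if the simple rule would delete this PARTIAL snapshot."""
    has_newer_success = any(
        other.state == "SUCCESS" and other.start_time > snap.start_time
        for other in policy_snapshots
    )
    if not has_newer_success:
        # Without a newer successful snapshot, the partial one is retained.
        return False
    if expire_after is None:
        # Eligible on the next retention run.
        return True
    # Otherwise it also has to outlive expire_after.
    return now - snap.start_time > expire_after
```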

Complex

If min_count is the only condition: No snapshots for this policy are ever deleted, so partial snapshots have no special handling.
If expire_after is the only condition: At least one successful snapshot will be kept, regardless of expire_after. Partial snapshots are deleted after the expire_after period has passed, regardless of whether or not there is a more recent successful snapshot.
If max_count is the only condition: At least one successful snapshot will be kept. Partial snapshots are deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count.
If min_count and expire_after are configured: At least min_count successful snapshots will be retained. Partial snapshots are deleted after the expire_after period has passed, regardless of whether or not there is a more recent successful snapshot.
If min_count and max_count are configured: At least min_count successful snapshots will be retained. All other snapshots, whether successful or partial, will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count.
If expire_after and max_count are configured: At least one successful snapshot will be kept, regardless of expire_after. Partial snapshots will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count, as well as after the expire_after period, regardless of whether there is a more recent successful snapshot.
If all three conditions are configured: At least min_count successful snapshots will be retained. All other snapshots, whether successful or partial, will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count, as well as after the expire_after period has passed, regardless of whether there is a more recent successful snapshot.
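
The seven cases above can be folded into one routine. The sketch below is one possible reading, not an implementation that exists anywhere: all names are hypothetical, FAILED snapshots are assumed to be handled separately, and it assumes that successful snapshots outside the protected set age out and count toward max_count in the same way partial ones do.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in for a snapshot taken by one SLM policy."""
    name: str
    state: str  # "SUCCESS" or "PARTIAL" in this sketch
    start_time: datetime


def complex_retention_deletions(
    snapshots: List[Snapshot],
    now: datetime,
    expire_after: Optional[timedelta] = None,
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
) -> List[str]:
    """Return names to delete, oldest first, under the complex proposal."""
    if expire_after is None and max_count is None:
        # min_count alone (or no conditions) never deletes anything.
        return []
    ordered = sorted(snapshots, key=lambda s: s.start_time)
    successes = [s for s in ordered if s.state == "SUCCESS"]
    # Protect the newest min_count successes, or one if min_count is unset.
    floor = min_count if min_count is not None else 1
    protected = {s.name for s in successes[max(0, len(successes) - floor):]}

    deleted: List[Snapshot] = []
    remaining: List[Snapshot] = []
    for s in ordered:
        expired = expire_after is not None and now - s.start_time > expire_after
        if expired and s.name not in protected:
            deleted.append(s)
        else:
            remaining.append(s)

    if max_count is not None:
        # Trim oldest first until successful + partial <= max_count.
        for s in list(remaining):
            if len(remaining) <= max_count:
                break
            if s.name not in protected:
                remaining.remove(s)
                deleted.append(s)
    return [s.name for s in deleted]
```

Even in this compressed form the outcome depends on how three settings interact, which is the main argument for the simple version.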

Relates to #43663

elasticmachine · 2019-09-23T22:56:19Z

Pinging @elastic/es-core-features

dakrone · 2019-09-23T23:01:49Z

Thanks for opening this and describing the problem Gordon. My opinion is that this should be as easy as possible for a user to reason about, and for that reason, I am +1 on the failure handling and +1 for the "simple" solution to PARTIAL handling.

original-brownbear · 2019-09-24T08:42:47Z

I wonder if we really want to handle PARTIAL snapshots differently from SUCCESS.
Think about this:

You run a large cluster (lots of shards/nodes -> non-trivial chance of running into a PARTIAL snapshot) and take snapshots every N minutes
In most cases, your PARTIAL snapshots are mostly just missing a single shard/index and are otherwise fine

-> wouldn't you want your retention rules to be the same for SUCCESS and PARTIAL here?
It seems to me that for large clusters where PARTIAL is simply a reality (e.g. a primary will fail over during a snapshot every so often and fail its snapshot action), and where users care about the actual point-in-time snapshots and not just about having the latest snapshot, the two statuses are pretty much the same?

dakrone · 2019-09-24T17:27:09Z

I think in that scenario a user always wants to have a valid SUCCESS snapshot, since that's the only one guaranteed to contain all of their data. We never want to expunge an old SUCCESS snapshot even if they have 10 other PARTIAL snapshots that are newer.

original-brownbear · 2019-09-24T18:45:39Z

@dakrone True, it's a fair point that users want snapshots that guarantee all data. I think I worded the above badly :) What I tried to get across is that there could be valid scenarios where PARTIAL snapshots are highly likely, and dropping a partial snapshot instead of an old fully SUCCESSFUL one might be very painful. Maybe just add some option to simply not count partial snapshots towards the retention count limit (and then just expunge them as you expire successful ones ... i.e. whenever you expunge a successful one you also expunge all PARTIAL ones older than it)?

One thing to note:
In the end, you probably don't care much about PARTIAL snapshots in use cases where you just keep appending data to your cluster overall because snapshots will have a ton of overlap and you won't save much data by deleting PARTIAL snapshots that lie between SUCCESSFUL ones.
On the other hand, in use cases where PARTIAL snapshots matter because data in the cluster is updated/rotated in some form you'll see storage costs accruing from those PARTIAL snapshots.

=> we should probably deal with #42540 to up the value of PARTIAL snapshots :)

Either way, I now realise that technically users can use a long expire time to get around this anyway. Just figured I'd raise the question of whether we want to help ease the pain of frequent PARTIAL snapshots somehow :)
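
To make the option floated above concrete (not counting PARTIAL snapshots toward the limit, and expunging them together with the successful snapshots they precede), here is a rough sketch. It was never proposed in this exact form; the names and the max_count-only setup are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Snapshot:
    """Hypothetical stand-in for a snapshot taken by one SLM policy."""
    name: str
    state: str  # "SUCCESS" or "PARTIAL" in this sketch
    start_time: datetime


def deletions_ignoring_partials_in_count(
    snapshots: List[Snapshot], max_count: int
) -> List[str]:
    """Count only SUCCESS toward max_count; drop older PARTIALs alongside."""
    successes = sorted(
        (s for s in snapshots if s.state == "SUCCESS"),
        key=lambda s: s.start_time,
    )
    excess = max(0, len(successes) - max_count)
    expunged = successes[:excess]  # oldest successful snapshots over the limit
    if not expunged:
        return []
    cutoff = expunged[-1].start_time
    # Every PARTIAL older than the newest expunged success goes with it.
    older_partials = [
        s for s in snapshots
        if s.state == "PARTIAL" and s.start_time < cutoff
    ]
    doomed = sorted(expunged + older_partials, key=lambda s: s.start_time)
    return [s.name for s in doomed]
```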

nachogiljaldo · 2019-09-27T13:45:06Z

Something that might be useful (and that we do in ESS) is preserving the latest snapshot regardless of its state. For example, if the latest is FAILED you don't want to delete it, since you want to see that it failed and why it failed. So this might be another factor to consider.

gwbrown · 2019-10-11T20:40:57Z

We've implemented the FAILED handling in #47617 and, hearing no objections, the simple version of PARTIAL handling in #47833, so I'm going to close this issue.

To clarify matters related to the comments above, the most recent snapshot overall is kept, as well as the most recent SUCCESSful snapshot - but if expire_after is not set, PARTIAL or FAILED snapshots will be deleted as soon as there is a more recent success.
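
As a rough illustration of that clarification for a policy with no expire_after (hypothetical names, not the merged code): an unsuccessful snapshot becomes deletable only once a newer success exists, which also means the most recent snapshot overall is never selected.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Snapshot:
    """Hypothetical stand-in for a snapshot taken by one SLM policy."""
    name: str
    state: str  # "SUCCESS", "PARTIAL", or "FAILED"
    start_time: datetime


def unsuccessful_deletions_without_expire_after(
    snapshots: List[Snapshot],
) -> List[str]:
    """Names of PARTIAL/FAILED snapshots to delete when expire_after is unset."""
    successes = [s for s in snapshots if s.state == "SUCCESS"]
    if not successes:
        # No success yet: every PARTIAL or FAILED snapshot is kept.
        return []
    latest_success = max(successes, key=lambda s: s.start_time)
    # Anything newer than the latest success, including the most recent
    # snapshot overall, fails this comparison and is therefore kept.
    return [
        s.name
        for s in sorted(snapshots, key=lambda s: s.start_time)
        if s.state in ("PARTIAL", "FAILED")
        and s.start_time < latest_success.start_time
    ]
```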

gwbrown added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Sep 23, 2019

dakrone mentioned this issue Sep 23, 2019

Retention for Snapshot Lifecycle Management #43663

Closed

26 tasks

gwbrown mentioned this issue Sep 24, 2019

Wait for snapshot completion in SLM snapshot invocation #47051

Merged

gwbrown self-assigned this Sep 30, 2019

This was referenced Oct 4, 2019

Manage retention of failed snapshots in SLM #47617

Merged

Manage retention of partial snapshots in SLM (Simple version) #47833

Merged

gwbrown closed this as completed Oct 11, 2019
