Handle retention of failed and partial snapshots in SLM #46988
Comments
Pinging @elastic/es-core-features |
Thanks for opening this and describing the problem Gordon. My opinion is that this should be as easy as possible for a user to reason about, and for that reason, I am +1 on the failure handling and +1 for the "simple" solution to PARTIAL handling. |
I wonder if we really want to handle
-> wouldn't you want your retention rules to be the same for |
I think in that scenario a user always wants to have a valid |
@dakrone True it's a fair point that users want snapshots that guarantee all data. I think I worded the above badly :) What I tried to get across is, that there could be valid scenarios where One thing to note: => we should probably deal with #42540 to up the value of Either way, I now realise that technically users can use a long expire time to get around this anyway. Just figured I'd raise the question of whether we want to help ease the pain of frequent |
Something that might be useful (and that we do in ESS) is preserving the latest snapshot regardless of its state. For example, if the latest is FAILED, you don't want to delete it, since you want to see that it failed and why it failed. So this might be another factor to consider. |
We've implemented the To clarify matters related to the comments above, the most recent snapshot overall is kept, as well as the most recent |
Problem
Right now, SLM treats PARTIAL and FAILED snapshots the same, and both are kept around forever. This is unlikely to be the behavior users will expect from SLM, so SLM should handle retention of partial and failed snapshots as well.
Proposed solution
FAILED snapshots will be kept until the configured expire_after period has passed, if present, and then be deleted. If there is no configured expire_after in the retention policy, then they will be deleted if there is at least one more recent successful snapshot from this policy (as they may otherwise be useful for troubleshooting purposes). Failed snapshots are not counted towards either min_count or max_count. (This has been implemented in #47617)
PARTIAL snapshots are more likely to be useful, so need to be handled a bit differently. For this case, there are two potential routes: One that is simple, and one that attempts to be intuitive.
Simple
Partial snapshots are retained unless there is at least one more recent successful snapshot from the same policy, at which point they are deleted after the expire_after period has passed, if present. If expire_after is not present and there is a more recent successful snapshot, they are deleted in the next retention run. In this case, partial snapshots are not counted toward either min_count or max_count, which count successful snapshots only. (This has been implemented in #47833)
Complex
min_count is the only condition: No snapshots for this policy are ever deleted, so partial snapshots have no special handling.
expire_after is the only condition: At least one successful snapshot will be kept, regardless of expire_after. Partial snapshots are deleted after the expire_after period has passed, regardless of whether or not there is a more recent successful snapshot.
max_count is the only condition: At least one successful snapshot will be kept. Partial snapshots are deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count.
min_count and expire_after are configured: At least min_count successful snapshots will be retained. Partial snapshots are deleted after the expire_after period has passed, regardless of whether or not there is a more recent successful snapshot.
min_count and max_count are configured: At least min_count successful snapshots will be retained. All other snapshots, whether successful or partial, will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count.
expire_after and max_count are configured: At least one successful snapshot will be kept, regardless of expire_after. Partial snapshots will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count, as well as after the expire_after period, regardless of whether there is a more recent successful snapshot.
min_count, expire_after, and max_count are configured: At least min_count successful snapshots will be retained. All other snapshots, whether successful or partial, will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count, as well as after the expire_after period has passed, regardless of whether there is a more recent successful snapshot.
Relates to #43663
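To make the FAILED rule and the "simple" PARTIAL rule above easier to reason about, here is a rough Python sketch of the decision as described in this issue. It is not the SLM implementation: the Snapshot class, the has_newer_success and eligible_for_deletion helpers, and the choice of ">=" for the expiry comparison are all assumptions made for illustration. It also leaves out retention of SUCCESS snapshots and the "keep the most recent snapshot overall" behavior mentioned in the comments.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a snapshot taken by one SLM policy."""
    name: str
    state: str  # "SUCCESS", "PARTIAL" or "FAILED"
    start_time: datetime


def has_newer_success(snap: Snapshot, snapshots: List[Snapshot]) -> bool:
    """True if the same policy has a more recent successful snapshot."""
    return any(
        s.state == "SUCCESS" and s.start_time > snap.start_time
        for s in snapshots
    )


def eligible_for_deletion(
    snap: Snapshot,
    snapshots: List[Snapshot],
    now: datetime,
    expire_after: Optional[timedelta],
) -> bool:
    """Apply only the FAILED and "simple" PARTIAL rules from this issue.

    min_count and max_count are ignored on purpose: under these rules,
    failed and partial snapshots do not count toward either of them.
    """
    # Assumption: a snapshot counts as expired once its age reaches expire_after.
    expired = expire_after is not None and now - snap.start_time >= expire_after
    newer_success = has_newer_success(snap, snapshots)

    if snap.state == "FAILED":
        # With expire_after: delete once expired.
        # Without it: delete once a newer successful snapshot exists.
        return expired if expire_after is not None else newer_success

    if snap.state == "PARTIAL":
        # Always retained until a newer successful snapshot exists.
        if not newer_success:
            return False
        # Then wait for expire_after if configured, else delete on the next run.
        return expired if expire_after is not None else True

    # SUCCESS snapshots follow the regular retention rules, not modeled here.
    return False
```

One consequence this makes visible: with expire_after unset, a FAILED or PARTIAL snapshot that is newer than every successful snapshot is never selected for deletion, which matches the troubleshooting rationale given above.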