New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate that snapshot repository exists for ILM policies during GenerateSnapshotNameStep #77657
Validate that snapshot repository exists for ILM policies during GenerateSnapshotNameStep #77657
Conversation
On subsequent runs, though, do not re-update the snapshot name -- once it's set it's forever
Pinging @elastic/es-data-management (Team:Data Management) |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach does look good to me.
I'm also curious in @dakrone's opinion.
// validate that the snapshot repository exists -- because policies are refreshed on later retries, and because | ||
// this fails prior to the snapshot repository being recorded in the ilm metadata, the policy can just be corrected | ||
// and everything will pass on the subsequent retry | ||
if (clusterState.metadata().custom(RepositoriesMetadata.TYPE, RepositoriesMetadata.EMPTY).repository(snapshotRepository) == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally to this check here, could a similar check also be performed as part of the put ilm policy api?
This way there is direct feedback upon creating the policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to that. @dakrone and I talked about adding that as a breaking change (so 8.0.0 only) at the rest layer. The scenario we're imagining is some sort of configuration as code that's going to put some templates and a policy and a snapshot repo as part of setting up ES, but where the policy put happens to be before the snapshot repo put -- currently that works, after our change it wouldn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this does LGTM, as it fixes the problem from an outside perspective. It's still possible to get wedged if you have an index in a step after the name generation that is using the wrong snapshot repo, but that should be an edge case once we have this validation.
As Martijn mentioned, +1 to adding the validation for existence to 8.0+ in a separate PR
#78468 is up as a draft on validating this at policy creation/update time, too. |
…rateSnapshotNameStep (#77657)
Fixes #72957
See #72957 for details, but at present we get quite far into ILM execution before we confirm that the snapshot repository actually exists for the
searchable_snapshot
action, and then by the time we find out it doesn't, the only way out is to make it exist (which is unfortunate in the case of typos!).A big part of the reason that ends up being such a pain is that we cache the
snapshot_repository
on the ILM custom metadata very early (atGenerateSnapshotNameStep
) and then never change it, at the same time we also have asserts here that prevent that same step from running again.My approach to solve this is two-fold:
I've added validation to the
GenerateSnapshotNameStep
that checks to make sure that the snapshot repository exists before the step is allowed to execute. In the event that the repository doesn't exist, you can either fix the policy to reference a repository that does exist (the crucial typo case), or you can create the repository.I've made it possible to execute
GenerateSnapshotNameStep
more than once. It still will not reset the generated snapshot name, but the rest of the logic executes regardless (and includes resetting the cachedsnapshot_repository
). This allows for a manual escape hatch for clusters that are already in this error state and then upgrade to a version where this fix exists, namely they can_ilm/move
togenerate-snapshot-name
in order to escape the error loop they'd otherwise find themselves in atcleanup-snapshot
.Note: @dakrone and I talked through a few approaches on this one, and... this PR is a totally different way that we didn't discuss at all. 😄