Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate that snapshot repository exists for ILM policies during GenerateSnapshotNameStep #77657

Merged

Conversation

joegallo
Copy link
Contributor

Fixes #72957

See #72957 for details, but at present we get quite far into ILM execution before we confirm that the snapshot repository actually exists for the searchable_snapshot action, and then by the time we find out it doesn't, the only way out is to make it exist (which is unfortunate in the case of typos!).

A big part of the reason that ends up being such a pain is that we cache the snapshot_repository on the ILM custom metadata very early (at GenerateSnapshotNameStep) and then never change it, at the same time we also have asserts here that prevent that same step from running again.

My approach to solve this is two-fold:

  1. I've added validation to the GenerateSnapshotNameStep that checks to make sure that the snapshot repository exists before the step is allowed to execute. In the event that the repository doesn't exist, you can either fix the policy to reference a repository that does exist (the crucial typo case), or you can create the repository.

  2. I've made it possible to execute GenerateSnapshotNameStep more than once. It still will not reset the generated snapshot name, but the rest of the logic executes regardless (and includes resetting the cached snapshot_repository). This allows for a manual escape hatch for clusters that are already in this error state and then upgrade to a version where this fix exists, namely they can _ilm/move to generate-snapshot-name in order to escape the error loop they'd otherwise find themselves in at cleanup-snapshot.

Note: @dakrone and I talked through a few approaches on this one, and... this PR is a totally different way that we didn't discuss at all. 😄

@joegallo joegallo added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.0.0 v7.16.0 labels Sep 13, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Sep 13, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@joegallo

This comment has been minimized.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach does look good to me.
I'm also curious in @dakrone's opinion.

// validate that the snapshot repository exists -- because policies are refreshed on later retries, and because
// this fails prior to the snapshot repository being recorded in the ilm metadata, the policy can just be corrected
// and everything will pass on the subsequent retry
if (clusterState.metadata().custom(RepositoriesMetadata.TYPE, RepositoriesMetadata.EMPTY).repository(snapshotRepository) == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally to this check here, could a similar check also be performed as part of the put ilm policy api?
This way there is direct feedback upon creating the policy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to that. @dakrone and I talked about adding that as a breaking change (so 8.0.0 only) at the rest layer. The scenario we're imagining is some sort of configuration as code that's going to put some templates and a policy and a snapshot repo as part of setting up ES, but where the policy put happens to be before the snapshot repo put -- currently that works, after our change it wouldn't.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this does LGTM, as it fixes the problem from an outside perspective. It's still possible to get wedged if you have an index in a step after the name generation that is using the wrong snapshot repo, but that should be an edge case once we have this validation.

As Martijn mentioned, +1 to adding the validation for existence to 8.0+ in a separate PR

@joegallo
Copy link
Contributor Author

#78468 is up as a draft on validating this at policy creation/update time, too.

@joegallo joegallo merged commit f68aef6 into elastic:master Sep 30, 2021
@joegallo joegallo deleted the ilm-validate-snapshot-repository-exists branch September 30, 2021 21:02
joegallo added a commit that referenced this pull request Sep 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team v7.16.0 v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ILM searchable snapshot repo change does not affect existing indices
5 participants