Validate that snapshot repository exists for ILM policies during GenerateSnapshotNameStep #77657

joegallo · 2021-09-13T19:12:25Z

See #72957 for details, but at present we get quite far into ILM execution before we confirm that the snapshot repository actually exists for the searchable_snapshot action, and then by the time we find out it doesn't, the only way out is to make it exist (which is unfortunate in the case of typos!).

A big part of the reason that ends up being such a pain is that we cache the snapshot_repository on the ILM custom metadata very early (at GenerateSnapshotNameStep) and then never change it. At the same time, we also have asserts that prevent that same step from running again.

My approach to solve this is two-fold:

1. I've added validation to GenerateSnapshotNameStep that checks that the snapshot repository exists before the step is allowed to execute. If the repository doesn't exist, you can either fix the policy to reference a repository that does exist (the crucial typo case), or you can create the repository.
2. I've made it possible to execute GenerateSnapshotNameStep more than once. It still will not reset the generated snapshot name, but the rest of the logic executes regardless (including resetting the cached snapshot_repository). This provides a manual escape hatch for clusters that are already in this error state and then upgrade to a version with this fix: they can _ilm/move to generate-snapshot-name to escape the error loop they'd otherwise be stuck in at cleanup-snapshot.
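The two behaviours described above can be sketched roughly as follows. This is a simplified, standalone model and not the actual Elasticsearch code: the class name `GenerateSnapshotNameSketch`, the `execute` signature, and the use of a plain `Map` for the ILM custom metadata are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of the step; not the real GenerateSnapshotNameStep. */
final class GenerateSnapshotNameSketch {

    // Assumed metadata keys, standing in for the fields cached on the ILM custom metadata.
    static final String SNAPSHOT_NAME = "snapshot_name";
    static final String SNAPSHOT_REPOSITORY = "snapshot_repository";

    /**
     * Returns an updated copy of the per-index ILM metadata.
     *
     * @param ilmMetadata            metadata recorded so far (may already hold a snapshot name)
     * @param policyRepository       repository named by the current version of the policy
     * @param registeredRepositories repositories that currently exist in the cluster
     * @param candidateSnapshotName  name to use if none has been generated yet
     */
    static Map<String, String> execute(Map<String, String> ilmMetadata,
                                       String policyRepository,
                                       Set<String> registeredRepositories,
                                       String candidateSnapshotName) {
        // Validate first, so that nothing is cached when the repository is missing.
        // The policy can then be corrected and the step retried.
        if (!registeredRepositories.contains(policyRepository)) {
            throw new IllegalStateException(
                "snapshot repository [" + policyRepository + "] does not exist");
        }

        Map<String, String> updated = new HashMap<>(ilmMetadata);

        // The snapshot name is generated once and never replaced on later runs.
        updated.putIfAbsent(SNAPSHOT_NAME, candidateSnapshotName);

        // The cached repository is always refreshed from the current policy,
        // which is what makes re-running the step a usable escape hatch.
        updated.put(SNAPSHOT_REPOSITORY, policyRepository);

        return updated;
    }
}
```

In this model, calling `execute` a second time with a corrected repository keeps the original snapshot name but replaces the cached repository, which mirrors the effect of moving an index back to generate-snapshot-name.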

Note: @dakrone and I talked through a few approaches on this one, and... this PR is a totally different way that we didn't discuss at all. 😄

elasticmachine · 2021-09-13T19:12:28Z

Pinging @elastic/es-data-management (Team:Data Management)

martijnvg

This approach does look good to me.
I'm also curious about @dakrone's opinion.

martijnvg · 2021-09-16T13:56:16Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/GenerateSnapshotNameStep.java

+        // validate that the snapshot repository exists -- because policies are refreshed on later retries, and because
+        // this fails prior to the snapshot repository being recorded in the ilm metadata, the policy can just be corrected
+        // and everything will pass on the subsequent retry
+        if (clusterState.metadata().custom(RepositoriesMetadata.TYPE, RepositoriesMetadata.EMPTY).repository(snapshotRepository) == null) {


In addition to this check, could a similar check also be performed as part of the put ILM policy API?
That way there is direct feedback upon creating the policy.

joegallo · 2021-09-16

+1 to that. @dakrone and I talked about adding that as a breaking change (so 8.0.0 only) at the REST layer. The scenario we're imagining is some sort of configuration as code that's going to put some templates and a policy and a snapshot repo as part of setting up ES, but where the policy put happens to be before the snapshot repo put -- currently that works, but after our change it wouldn't.
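To make the ordering concern concrete, here is a minimal sketch of why put-time validation would be a breaking change. It is a toy in-memory model, not the real put ILM policy API: `PutPolicyValidationSketch`, `putRepository`, `putPolicy`, and the `validateOnPut` flag are hypothetical names introduced only for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for a cluster; not the real Elasticsearch APIs. */
final class PutPolicyValidationSketch {

    private final Set<String> repositories = new HashSet<>();
    private final Map<String, String> policyToRepository = new HashMap<>();
    private final boolean validateOnPut;

    /** @param validateOnPut true models the proposed stricter behaviour */
    PutPolicyValidationSketch(boolean validateOnPut) {
        this.validateOnPut = validateOnPut;
    }

    void putRepository(String name) {
        repositories.add(name);
    }

    void putPolicy(String policyName, String snapshotRepository) {
        // With validation enabled, a policy that references a repository which
        // has not been registered yet is rejected immediately.
        if (validateOnPut && !repositories.contains(snapshotRepository)) {
            throw new IllegalArgumentException(
                "policy [" + policyName + "] references unknown repository ["
                    + snapshotRepository + "]");
        }
        policyToRepository.put(policyName, snapshotRepository);
    }

    boolean hasPolicy(String policyName) {
        return policyToRepository.containsKey(policyName);
    }
}
```

With `validateOnPut` set to false, a setup script that puts the policy before the repository succeeds. With it set to true, the same script fails on the policy put and has to be reordered.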

dakrone

So this does LGTM, as it fixes the problem from an outside perspective. It's still possible to get wedged if you have an index in a step after the name generation that is using the wrong snapshot repo, but that should be an edge case once we have this validation.

As Martijn mentioned, +1 to adding the validation for existence to 8.0+ in a separate PR.

joegallo · 2021-09-29T15:37:16Z

#78468 is up as a draft on validating this at policy creation/update time, too.

joegallo added 8 commits September 13, 2021 14:45

Organize imports

e80829d

Reorganize this code a little bit

8290a5d

Remove the assert that we haven't been through here

97769d0

On subsequent runs, though, do not re-update the snapshot name -- once it's set it's forever

Validate that the snapshot repository exists

31c97bf

Add a test for the new validation

bc9e613

Add a test for the snapshot repository being reset

16cb0d6

Check that the snapshot index name is as expected

27275be

Add a test for the case of re-running a successful run

a671fe8

joegallo added the >bug, :Data Management/ILM+SLM, v8.0.0, and v7.16.0 labels Sep 13, 2021

joegallo requested a review from martijnvg September 13, 2021 19:12

elasticmachine added the Team:Data Management label Sep 13, 2021

joegallo mentioned this pull request Sep 13, 2021

Unwrap this Optional for the log message #77661

Merged

Merge branch 'master' into ilm-validate-snapshot-repository-exists

a1f7ef4

joegallo requested a review from dakrone September 16, 2021 13:22

martijnvg approved these changes Sep 16, 2021

dakrone approved these changes Sep 20, 2021

joegallo mentioned this pull request Sep 29, 2021

Validate that snapshot repository exists for ILM policies at creation/update time #78468

Merged

elasticmachine added 2 commits September 30, 2021 01:38

Merge branch 'master' into ilm-validate-snapshot-repository-exists

004a03c

Merge branch 'master' into ilm-validate-snapshot-repository-exists

cdecfc5

joegallo merged commit f68aef6 into elastic:master Sep 30, 2021

joegallo deleted the ilm-validate-snapshot-repository-exists branch September 30, 2021 21:02

joegallo added the backport pending label Sep 30, 2021

joegallo added a commit that referenced this pull request Sep 30, 2021

Validate that snapshot repository exists for ILM policies during GenerateSnapshotNameStep (#77657)

35cb0ea

joegallo removed the backport pending label Sep 30, 2021

jakelandis added v8.0.0-beta1 and removed v8.0.0 labels Oct 27, 2021
