Retry ILM steps when transient or recoverable errors are encountered #48183

Closed
31 tasks done
andreidan opened this issue Oct 17, 2019 · 5 comments · Fixed by #70716
Labels
:Data Management/ILM+SLM (Index and Snapshot lifecycle management), Meta, Team:Data Management (Meta label for data/management team)

Comments

@andreidan
Contributor

andreidan commented Oct 17, 2019

This is a meta-issue to track and discuss the ILM steps that should be retryable and under which circumstances. This relates to the efforts on making the rollover action retryable (#44135) and the more general strategy ILM will employ in order to make actions more resilient and self-healing (#42824).

Below are all the steps we use, grouped by action. Since we likely won't treat a step differently depending on the action it occurs in, each step is listed only once, under the first action (in alphabetical order) that uses it. The Terminal/Error marker steps are not listed.

Steps

  • InitializePolicyContextStep - Initializes the LifecycleExecutionState for an index. This should be the first Step called on an index

AllocateAction

  • AllocationRoutedStep - Checks whether all shards have been correctly routed in response to an update to the allocation rules for an index
  • UpdateSettingsStep (also used in ForceMergeAction, ReadOnlyAction, RolloverAction, SetPriorityAction and ShrinkAction)

DeleteAction

  • WaitForNoFollowersStep - A step that waits until the index it's used on is no longer a leader index
  • DeleteStep - deletes a single index

ForceMergeAction

  • CloseIndexStep
  • OpenIndexStep
  • ForceMergeStep - Invokes a force merge on a single index
  • SegmentCountStep - evaluates whether force_merge was successful by checking the segment count
  • WaitForIndexColorStep

FreezeAction

  • FreezeStep - freezes an index

RolloverAction

  • CheckNotDataStreamWriteIndexStep - This step checks if the managed index is part of a data stream, in which case it will check it's not the write index
  • WaitForRolloverReadyStep - Waits for at least one rollover condition to be satisfied, using the Rollover API's dry_run option
  • RolloverStep - Unconditionally rolls over an index using the Rollover API
  • WaitForActiveShardsStep - Waits for the shards of the newly created index to become active
  • UpdateRolloverLifecycleDateStep - Copies the lifecycle reference date to a new index created by rolling over an alias.

ShrinkAction

  • BranchingStep - This step changes its getNextStepKey() depending on the outcome of a defined predicate. It performs no changes to the cluster state
  • WaitForNoFollowersStep - A step that waits until the index it's used on is no longer a leader index
  • SetSingleNodeAllocateStep - Allocates all shards in a single index to one node
  • CheckShrinkReadyStep - used prior to running a shrink step in order to ensure that the index being shrunk has a copy of each shard allocated on one particular node (the node used by the require parameter) and that the shards are not relocating
  • ShrinkStep - Shrinks an index, using a prefix prepended to the original index name for the name of the shrunken index
  • ShrunkShardsAllocatedStep - Checks whether all shards in a shrunken index have been successfully allocated
  • CopyExecutionStateStep - Copies the execution state data from one index to another
  • ShrinkSetAliasStep - Following shrinking an index and deleting the original index, this step creates an alias with the same name as the original index which points to the new shrunken index
  • ShrunkenIndexCheckStep - Verifies that an index was created through a shrink operation, rather than created some other way
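
The BranchingStep described in the list above can be illustrated with a minimal sketch. Everything here is hypothetical and simplified (string step keys, a generic state type, the `evaluate` method name); it is not the actual ILM class, only an illustration of a step whose next key depends on a predicate and which never modifies the state it inspects.

```java
import java.util.function.Predicate;

// Hypothetical sketch; not the actual ILM BranchingStep.
class BranchingSketch {

    /** A step that only decides where to go next; it never changes the state. */
    static final class BranchingStep<S> {
        private final String nextKeyOnFalse;
        private final String nextKeyOnTrue;
        private final Predicate<S> predicate;
        private Boolean outcome; // null until evaluate() has run

        BranchingStep(String nextKeyOnFalse, String nextKeyOnTrue, Predicate<S> predicate) {
            this.nextKeyOnFalse = nextKeyOnFalse;
            this.nextKeyOnTrue = nextKeyOnTrue;
            this.predicate = predicate;
        }

        /** Reads the state and records the predicate outcome. */
        void evaluate(S state) {
            outcome = predicate.test(state);
        }

        /** Returns the key of the step to run next, based on the last evaluation. */
        String getNextStepKey() {
            if (outcome == null) {
                throw new IllegalStateException("predicate has not been evaluated yet");
            }
            return outcome ? nextKeyOnTrue : nextKeyOnFalse;
        }
    }
}
```

Because a step like this has no side effects, re-running it after a failure should be harmless, which is one reason it seems a natural candidate for being retryable.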

UnfollowAction

  • WaitForIndexingCompleteStep
  • WaitForFollowShardTasksStep
  • PauseFollowerIndexStep
  • CloseFollowerIndexStep
  • UnfollowFollowerIndexStep
  • OpenFollowerIndexStep

Scope

Any action/step that can be made to be re-tried after a failure.
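
As a rough illustration of what "retryable" could mean here, the sketch below runs a step and retries it a bounded number of times if it is marked retryable. The `Step` interface, the `Outcome` enum, the `flaky` helper and the attempt limit are all assumptions made for this example; they are not ILM's actual classes or policy.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these types are the actual ILM classes.
class RetrySketch {

    /** A unit of lifecycle work that may fail. */
    interface Step {
        String name();
        boolean isRetryable();
        void execute() throws Exception;
    }

    enum Outcome { COMPLETED, ERROR }

    /**
     * Runs a step. Retryable steps get up to maxAttempts attempts (at least one);
     * non-retryable steps get exactly one attempt.
     */
    static Outcome run(Step step, int maxAttempts) {
        int attempts = step.isRetryable() ? Math.max(1, maxAttempts) : 1;
        for (int i = 0; i < attempts; i++) {
            try {
                step.execute();
                return Outcome.COMPLETED;
            } catch (Exception e) {
                // A real system would log the failure and back off before retrying.
            }
        }
        return Outcome.ERROR;
    }

    /** Builds a step that fails failuresBeforeSuccess times and then succeeds. */
    static Step flaky(String name, boolean retryable, int failuresBeforeSuccess) {
        AtomicInteger calls = new AtomicInteger();
        return new Step() {
            public String name() { return name; }
            public boolean isRetryable() { return retryable; }
            public void execute() throws Exception {
                if (calls.getAndIncrement() < failuresBeforeSuccess) {
                    throw new Exception("transient failure in " + name);
                }
            }
        };
    }
}
```

For example, `RetrySketch.run(RetrySketch.flaky("delete", true, 2), 3)` recovers from two transient failures, while the same step marked non-retryable ends in `ERROR` after its first failure.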

Duration

~ 2 months

@andreidan andreidan added Meta :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Oct 17, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@danhermann danhermann changed the title Retry ILM steps when transitive or recoverable from erros are encountered Retry ILM steps when transitive or recoverable from errors are encountered Oct 17, 2019
@dakrone dakrone changed the title Retry ILM steps when transitive or recoverable from errors are encountered Retry ILM steps when transitive or recoverable errors are encountered Oct 18, 2019
@andreidan andreidan changed the title Retry ILM steps when transitive or recoverable errors are encountered Retry ILM steps when transient or recoverable errors are encountered Dec 5, 2019
@andreidan andreidan self-assigned this Dec 5, 2019
@andreidan
Contributor Author

Removed team-discuss, as we decided to make the Rollover action retryable by executing the rollover in a single cluster state update (as opposed to the three chained updates used now: create the rolled-over index, update the aliases, and attach the rollover information to the source index).
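
To illustrate why a single update makes retrying safe, here is a toy model, assuming an immutable `State` class and a `rollover` function that are invented for this example and are not Elasticsearch's cluster state API. All three changes are computed on copies and returned together, so a failed attempt leaves the original state untouched and can simply be run again.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model; not Elasticsearch's cluster state API.
class AtomicRolloverSketch {

    /** Immutable snapshot: existing indices, alias write targets, rollover info. */
    static final class State {
        final Set<String> indices;
        final Map<String, String> aliasWriteIndex;
        final Map<String, String> rolledOverTo;

        State(Set<String> indices, Map<String, String> aliasWriteIndex, Map<String, String> rolledOverTo) {
            this.indices = Collections.unmodifiableSet(new HashSet<>(indices));
            this.aliasWriteIndex = Collections.unmodifiableMap(new HashMap<>(aliasWriteIndex));
            this.rolledOverTo = Collections.unmodifiableMap(new HashMap<>(rolledOverTo));
        }
    }

    /** Applies all three rollover changes in one step, or throws and changes nothing. */
    static State rollover(State current, String alias, String newIndex) {
        String source = current.aliasWriteIndex.get(alias);
        if (source == null) {
            throw new IllegalArgumentException("unknown alias: " + alias);
        }
        if (current.indices.contains(newIndex)) {
            throw new IllegalStateException("index already exists: " + newIndex);
        }
        Set<String> indices = new HashSet<>(current.indices);
        indices.add(newIndex); // 1. create the rolled-over index
        Map<String, String> aliases = new HashMap<>(current.aliasWriteIndex);
        aliases.put(alias, newIndex); // 2. point the alias at the new index
        Map<String, String> info = new HashMap<>(current.rolledOverTo);
        info.put(source, newIndex); // 3. attach rollover info to the source index
        return new State(indices, aliases, info);
    }
}
```

With three chained updates, a failure after the first or second one would leave a partially rolled-over state that a naive retry could not safely repeat; in this model there is no such intermediate state.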

@gwbrown
Contributor

gwbrown commented Dec 5, 2019

I'll be interested to see the outcome of that. A while ago I experimented with condensing rollover into a single cluster state update for a different problem, before going with a different solution. As I recall, the only issue I ran into was that I ended up having to duplicate some of the logic from the transport actions for creating an index, etc. (at least the way I did it, which may not be the best way). Unfortunately, I don't seem to have the branch around anymore.

dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 6, 2020
This commit makes the "init" ILM step retryable. It also adds a test
where an index is created with a non-parsable index name and then fails.

Related to elastic#48183
dakrone added a commit that referenced this issue Jan 8, 2020
This commit makes the "init" ILM step retryable. It also adds a test
where an index is created with a non-parsable index name and then fails.

Related to #48183
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 8, 2020
This commit makes the "init" ILM step retryable. It also adds a test
where an index is created with a non-parsable index name and then fails.

Related to elastic#48183
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
This commit makes the "init" ILM step retryable. It also adds a test
where an index is created with a non-parsable index name and then fails.

Related to elastic#48183
@andreidan andreidan removed their assignment Mar 4, 2020
@rjernst rjernst added the Team:Data Management Meta label for data/management team label May 4, 2020
@bczifra
Member

bczifra commented May 11, 2020

Since #43246 makes force merging "best effort", are there any remaining plans to make errors in the force merging step retryable? They're still listed under the "ForceMergeAction" section in the body of this issue.

@dakrone
Member

dakrone commented May 11, 2020

@bczifra the goal is to eventually make every step retryable, including the force merge step.
