Retry ILM steps when transient or recoverable errors are encountered #48183

andreidan · 2019-10-17T09:18:39Z

This is a meta-issue to track and discuss which ILM steps should be retryable, and under which circumstances. It relates to the effort to make the rollover action retryable (#44135) and to the more general strategy ILM will employ to make actions more resilient and self-healing (#42824).

Below are all the steps we use, grouped by action. Because we are unlikely to treat a step differently depending on the action it occurs in, each step is listed only once, under the first action (in alphabetical order) that uses it. The marker Terminal/Error steps are not listed.

Steps

InitializePolicyContextStep - Initializes the LifecycleExecutionState for an index. This should be the first Step called on an index

AllocateAction

AllocationRoutedStep - Checks whether all shards have been correctly routed in response to an update to the allocation rules for an index
UpdateSettingsStep (also used in ForceMergeAction, ReadOnlyAction, RolloverAction, SetPriorityAction and ShrinkAction)

DeleteAction

WaitForNoFollowersStep - A step that waits until the index it's used on is no longer a leader index
DeleteStep - deletes a single index

ForceMergeAction

CloseIndexStep
OpenIndexStep
ForceMergeStep - Invokes a force merge on a single index
SegmentCountStep - evaluates whether force_merge was successful by checking the segment count
WaitForIndexColorStep

FreezeAction

FreezeStep - freezes an index

RolloverAction

CheckNotDataStreamWriteIndexStep - This step checks if the managed index is part of a data stream, in which case it will check it's not the write index
WaitForRolloverReadyStep - Waits for at least one rollover condition to be satisfied, using the Rollover API's dry_run option
RolloverStep - Unconditionally rolls over an index using the Rollover API
WaitForActiveShardsStep - Waits for the shards of the newly created index to become active
UpdateRolloverLifecycleDateStep - Copies the lifecycle reference date to a new index created by rolling over an alias.

ShrinkAction

BranchingStep - This step changes its getNextStepKey() depending on the outcome of a defined predicate. It performs no changes to the cluster state
WaitForNoFollowersStep - A step that waits until the index it's used on is no longer a leader index
SetSingleNodeAllocateStep - Allocates all shards in a single index to one node
CheckShrinkReadyStep - used prior to running a shrink step in order to ensure that the index being shrunk has a copy of each shard allocated on one particular node (the node used by the require parameter) and that the shards are not relocating
ShrinkStep - Shrinks an index, using a prefix prepended to the original index name for the name of the shrunken index
ShrunkShardsAllocatedStep - Checks whether all shards in a shrunken index have been successfully allocated
CopyExecutionStateStep - Copies the execution state data from one index to another
ShrinkSetAliasStep - Following shrinking an index and deleting the original index, this step creates an alias with the same name as the original index which points to the new shrunken index
ShrunkenIndexCheckStep - Verifies that an index was created through a shrink operation, rather than created some other way
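
The BranchingStep description in the ShrinkAction list can be illustrated with a small sketch. This is a hypothetical, simplified model and not the actual Elasticsearch class: the `evaluate` method, the string step keys, and the shard-count predicate are assumptions made for illustration. Only the idea that `getNextStepKey()` depends on a predicate's outcome, and that the step changes nothing else, comes from the description above.

```java
import java.util.function.Predicate;

// Hypothetical, simplified model; not the actual Elasticsearch BranchingStep.
public class BranchingStepSketch {

    static final class BranchingStep<T> {
        private final String nextKeyOnFalse;
        private final String nextKeyOnTrue;
        private final Predicate<T> predicate;
        private Boolean outcome; // null until the predicate has been evaluated

        BranchingStep(String nextKeyOnFalse, String nextKeyOnTrue, Predicate<T> predicate) {
            this.nextKeyOnFalse = nextKeyOnFalse;
            this.nextKeyOnTrue = nextKeyOnTrue;
            this.predicate = predicate;
        }

        /** Evaluates the predicate and records the result; it modifies nothing else. */
        void evaluate(T input) {
            outcome = predicate.test(input);
        }

        /** Returns the key of the step to run next, chosen by the recorded outcome. */
        String getNextStepKey() {
            if (outcome == null) {
                throw new IllegalStateException("predicate has not been evaluated yet");
            }
            return outcome ? nextKeyOnTrue : nextKeyOnFalse;
        }
    }

    public static void main(String[] args) {
        // Assumed example predicate: the index already has the target shard count (1).
        BranchingStep<Integer> step =
            new BranchingStep<>("shrink", "skip-shrink", shards -> shards == 1);
        step.evaluate(5);
        System.out.println(step.getNextStepKey());
        step.evaluate(1);
        System.out.println(step.getNextStepKey());
    }
}
```

Because such a step only reads its input and records a boolean, re-running it after a failure should be harmless, which is one reason a step of this shape looks like an easy candidate for retrying.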

UnfollowAction

Scope

Any action/step that can be made to be re-tried after a failure.
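
To make the scope concrete, here is a minimal sketch of what "retryable" could mean for a step. It is not the ILM implementation: the `Step`, `Outcome`, `run`, and `flaky` names are hypothetical, and the bounded retry loop is an assumption chosen to keep the example self-contained. The only detail taken from this issue is the per-step `isRetryable` flag (see `Step#isRetryable` in the linked pull requests).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; not the actual Elasticsearch classes.
public class RetryableStepSketch {

    /** A lifecycle step; isRetryable() mirrors the Step#isRetryable flag named in this issue. */
    abstract static class Step {
        final String name;
        Step(String name) { this.name = name; }
        abstract boolean isRetryable();
        abstract void execute() throws Exception;
    }

    enum Outcome { COMPLETED, ERROR }

    /** Runs a step, re-running it after a failure only if it is retryable and attempts remain. */
    static Outcome run(Step step, int maxAttempts, List<String> log) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                step.execute();
                log.add(step.name + " succeeded on attempt " + attempt);
                return Outcome.COMPLETED;
            } catch (Exception e) {
                log.add(step.name + " failed on attempt " + attempt + ": " + e.getMessage());
                if (!step.isRetryable()) {
                    return Outcome.ERROR; // non-retryable steps stay in the error state
                }
            }
        }
        return Outcome.ERROR; // retryable, but the attempt budget is exhausted
    }

    /** Builds a step that fails a fixed number of times before succeeding (a simulated transient error). */
    static Step flaky(String name, boolean retryable, int failures) {
        final int[] remaining = { failures };
        return new Step(name) {
            @Override boolean isRetryable() { return retryable; }
            @Override void execute() throws Exception {
                if (remaining[0] > 0) {
                    remaining[0]--;
                    throw new Exception("simulated transient failure");
                }
            }
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        System.out.println(run(flaky("retryable-step", true, 2), 5, log));
        System.out.println(run(flaky("non-retryable-step", false, 1), 5, log));
        log.forEach(System.out::println);
    }
}
```

In this sketch a retryable step recovers from two simulated failures, while a non-retryable step stops at its first failure and would need manual intervention.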

Duration

~ 2 months

elasticmachine · 2019-10-17T09:18:41Z

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

andreidan · 2019-12-05T16:58:35Z

Removed team-discuss, as we decided to make the Rollover action retryable by executing the rollover in a single cluster state update, as opposed to the three chained updates used now: create the rolled-over index, update the aliases, and attach the rollover information to the source index.
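
The difference between three chained updates and one combined update can be sketched with a toy model. Everything here is hypothetical: `State`, `StateHolder`, and the stage functions are stand-ins, not Elasticsearch's cluster state or its update tasks. The sketch assumes an update is applied all-or-nothing, which is the property that makes a single update safe to retry.

```java
import java.util.function.UnaryOperator;

// Hypothetical toy model; not Elasticsearch's cluster state or update tasks.
public class SingleUpdateRolloverSketch {

    /** Immutable stand-in for the three facts a rollover changes. */
    static final class State {
        final boolean indexCreated;
        final boolean aliasesUpdated;
        final boolean rolloverInfoAttached;
        State(boolean indexCreated, boolean aliasesUpdated, boolean rolloverInfoAttached) {
            this.indexCreated = indexCreated;
            this.aliasesUpdated = aliasesUpdated;
            this.rolloverInfoAttached = rolloverInfoAttached;
        }
    }

    /** Applies each submitted update all-or-nothing: if the function throws, the state is unchanged. */
    static final class StateHolder {
        private State current = new State(false, false, false);
        State get() { return current; }
        void submit(UnaryOperator<State> update) { current = update.apply(current); }
    }

    /** Returns the transformation for stage 1, 2 or 3; it throws if stage == failAt (simulated failure). */
    static UnaryOperator<State> stageUpdate(int stage, int failAt) {
        return s -> {
            if (stage == failAt) {
                throw new IllegalStateException("simulated failure in stage " + stage);
            }
            switch (stage) {
                case 1: return new State(true, s.aliasesUpdated, s.rolloverInfoAttached);
                case 2: return new State(s.indexCreated, true, s.rolloverInfoAttached);
                default: return new State(s.indexCreated, s.aliasesUpdated, true);
            }
        };
    }

    /** Three separate updates: a failure in a later stage leaves the earlier stages applied. */
    static void chainedRollover(StateHolder holder, int failAt) {
        for (int stage = 1; stage <= 3; stage++) {
            holder.submit(stageUpdate(stage, failAt));
        }
    }

    /** One update composing all three stages: a failure anywhere leaves the state untouched. */
    static void singleUpdateRollover(StateHolder holder, int failAt) {
        holder.submit(s -> {
            State next = s;
            for (int stage = 1; stage <= 3; stage++) {
                next = stageUpdate(stage, failAt).apply(next);
            }
            return next;
        });
    }

    public static void main(String[] args) {
        StateHolder chained = new StateHolder();
        try { chainedRollover(chained, 3); } catch (IllegalStateException e) { /* simulated failure */ }
        System.out.println("chained, partial state: " + chained.get().indexCreated
            + " " + chained.get().aliasesUpdated + " " + chained.get().rolloverInfoAttached);

        StateHolder single = new StateHolder();
        try { singleUpdateRollover(single, 3); } catch (IllegalStateException e) { /* simulated failure */ }
        singleUpdateRollover(single, 0); // retrying from the untouched state is safe
        System.out.println("single, after retry: " + single.get().rolloverInfoAttached);
    }
}
```

With chained updates, a failure in the third stage leaves the new index created and the aliases moved but no rollover information attached, so a naive retry would start from a half-finished rollover. With a single update, a failed attempt changes nothing and the retry starts from the original state.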

gwbrown · 2019-12-05T18:07:18Z

I'll be interested to see the outcome of that. A while ago I experimented with condensing rollover into a single cluster state update for a different problem, before going with a different solution. As I recall, the only issue I ran into was that I ended up having to duplicate some of the logic from the transport actions for creating an index, etc. (at least the way I did it, which may not be the best way). Unfortunately I don't seem to have the branch around anymore.

This commit makes the "init" ILM step retryable. It also adds a test where an index is created with a non-parsable index name and then fails. Related to elastic#48183

bczifra · 2020-05-11T12:04:47Z

Since #43246 makes force merging "best effort", are there any remaining plans to make errors in the force merging step retryable? They're still listed under the "ForceMergeAction" section in the body of this issue.

dakrone · 2020-05-11T15:08:58Z

@bczifra the goal is to eventually make every step retryable, including the force merge step

andreidan added the Meta and :Data Management/ILM+SLM labels Oct 17, 2019

andreidan added the team-discuss label Oct 17, 2019

danhermann changed the title ~~Retry ILM steps when transitive or recoverable from erros are encountered~~ Retry ILM steps when transitive or recoverable from errors are encountered Oct 17, 2019

dakrone changed the title ~~Retry ILM steps when transitive or recoverable from errors are encountered~~ Retry ILM steps when transitive or recoverable errors are encountered Oct 18, 2019

andreidan mentioned this issue Nov 25, 2019

Rollover sometimes does not attach Rollover Info to index metadata #49413

Closed

andreidan changed the title ~~Retry ILM steps when transitive or recoverable errors are encountered~~ Retry ILM steps when transient or recoverable errors are encountered Dec 5, 2019

andreidan self-assigned this Dec 5, 2019

andreidan removed the team-discuss label Dec 5, 2019

andreidan mentioned this issue Dec 30, 2019

ILM retryable async action steps #50522

Merged

jasontedor mentioned this issue Jan 6, 2020

Revisit ILM retry strategy for additional conditions #42824

Closed

dakrone mentioned this issue Jan 6, 2020

Make InitializePolicyContextStep retryable #50685

Merged

andreidan mentioned this issue Jan 7, 2020

Make the UpdateRolloverLifecycleDateStep retryable #50702

Merged

dakrone added a commit that referenced this issue Jan 8, 2020

Make InitializePolicyContextStep retryable (#50685)

919b297

This commit makes the "init" ILM step retryable. It also adds a test where an index is created with a non-parsable index name and then fails. Related to #48183

gwbrown mentioned this issue Jan 9, 2020

index rollover running with "NORMAL" priority #50778

Closed

This was referenced Jan 16, 2020

ILM wait for active shards on rolled index in a separate step #50718

Merged

ILM: Make UpdateSettingsStep retryable #51235

Merged

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

andreidan removed their assignment Mar 4, 2020

codebrain mentioned this issue Apr 1, 2020

7.7.0 meta ticket elastic/elasticsearch-net#4525

Closed


rjernst added the Team:Data Management label May 4, 2020

This was referenced Dec 15, 2020

ILM: make the rest of the forcemerge action steps retryable #66352

Merged

ILM: make the unfollow action and CCR related steps retryable #66356

Merged

andreidan mentioned this issue Mar 8, 2021

ILM: Make all the shrink action steps retryable #70107

Merged

andreidan mentioned this issue Mar 23, 2021

Mark Step#isRetryable abstract #70716

Merged

andreidan closed this as completed in #70716 Mar 23, 2021

stevejgordon mentioned this issue Apr 21, 2021

7.13.0 Meta Ticket elastic/elasticsearch-net#5584

Closed


This was referenced Apr 27, 2021

ILM: _retry using wildcard should skip indices that are not managed by ILM #44188

Closed

ILM _retry skips non-managed indices #51339

Closed
