ILM Actions can fail due to in-progress snapshots of indices #37541

Closed
talevy opened this issue Jan 16, 2019 · 4 comments · Fixed by #37624
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management

Comments

@talevy
Contributor

talevy commented Jan 16, 2019

Situation

ILM actions such as Delete, Freeze, and Shrink execute ES actions like
DeleteIndex and CloseIndex, which throw an IllegalArgumentException when
a snapshot of the target index is in progress.

What can ES do about it?

ES can throw a SnapshotInProgressException instead of an IAE. That tells
ILM explicitly why the action failed, so it can retry the action
appropriately instead of moving to an error step.
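
To illustrate why a dedicated exception type helps, here is a minimal standalone sketch. It is not Elasticsearch code: `DeleteIndexSketch`, `StepOutcome`, and the `RuntimeException`-based `SnapshotInProgressException` below are hypothetical stand-ins. The point is only that a caller can tell "snapshot running, try again later" apart from a genuinely invalid request, which is not possible when both surface as an IAE.

```java
import java.util.Set;

// Hypothetical stand-in for the proposed exception (assumption: the real class
// would live in Elasticsearch and have a different parent type).
class SnapshotInProgressException extends RuntimeException {
    SnapshotInProgressException(String message) {
        super(message);
    }
}

// Hypothetical outcome an ILM step could report after attempting an action.
enum StepOutcome { COMPLETED, RETRY_LATER, ERROR }

class DeleteIndexSketch {
    // Indices that currently have a snapshot running (illustrative state only).
    private final Set<String> snapshotting;

    DeleteIndexSketch(Set<String> snapshotting) {
        this.snapshotting = snapshotting;
    }

    // Simulated delete: rejects bad input and indices that are being snapshotted.
    void delete(String index) {
        if (index == null || index.isEmpty()) {
            throw new IllegalArgumentException("index name must not be empty");
        }
        if (snapshotting.contains(index)) {
            throw new SnapshotInProgressException(
                "cannot delete [" + index + "] while it is being snapshotted");
        }
        // The actual deletion would happen here.
    }

    // The caller maps each exception type to a different outcome.
    StepOutcome runStep(String index) {
        try {
            delete(index);
            return StepOutcome.COMPLETED;
        } catch (SnapshotInProgressException e) {
            return StepOutcome.RETRY_LATER; // retryable: wait for the snapshot
        } catch (IllegalArgumentException e) {
            return StepOutcome.ERROR;       // not retryable: bad request
        }
    }
}
```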

@talevy talevy added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Jan 16, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@dakrone
Member

dakrone commented Jan 16, 2019

We discussed this. For now, the plan is to see whether we can address it as follows:

  • Change actions that can't happen while a snapshot is in progress to throw a SnapshotInProgressException instead of an IllegalArgumentException
  • In the steps that perform operations that are unsafe while a snapshot is running, catch this new exception
  • If the exception is caught, register a ClusterStateObserver listener that attempts to re-run the step when the snapshot for the index is no longer in progress
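
The three steps above can be sketched roughly as follows. This is a simplified, standalone illustration under stated assumptions, not the real implementation: `ClusterStateSketch` is a hypothetical stand-in for the cluster state plus a `ClusterStateObserver`, and `RetryingCloseStep` is a made-up step. The listener removes itself once the snapshot for the index is gone and then re-runs the step.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the proposed exception.
class SnapshotInProgressException extends RuntimeException {
    SnapshotInProgressException(String message) {
        super(message);
    }
}

// Hypothetical, simplified cluster state: tracks which indices are being
// snapshotted and notifies listeners whenever a snapshot finishes.
class ClusterStateSketch {
    private final Set<String> snapshotting = new HashSet<>();
    private final List<Runnable> listeners = new ArrayList<>();

    boolean isSnapshotting(String index) { return snapshotting.contains(index); }

    void startSnapshot(String index) { snapshotting.add(index); }

    void finishSnapshot(String index) {
        snapshotting.remove(index);
        // Iterate over a copy so listeners may remove themselves while running.
        for (Runnable listener : new ArrayList<>(listeners)) {
            listener.run();
        }
    }

    void addListener(Runnable listener) { listeners.add(listener); }

    void removeListener(Runnable listener) { listeners.remove(listener); }
}

class RetryingCloseStep {
    private final ClusterStateSketch state;
    final List<String> closed = new ArrayList<>(); // records successful closes

    RetryingCloseStep(ClusterStateSketch state) {
        this.state = state;
    }

    // Step 1: the unsafe action throws the dedicated exception.
    private void closeIndex(String index) {
        if (state.isSnapshotting(index)) {
            throw new SnapshotInProgressException("snapshot running for [" + index + "]");
        }
        closed.add(index);
    }

    void perform(String index) {
        try {
            closeIndex(index);
        } catch (SnapshotInProgressException e) {
            // Steps 2 and 3: catch it and wait for the snapshot to finish.
            state.addListener(new Runnable() {
                @Override
                public void run() {
                    if (!state.isSnapshotting(index)) {
                        state.removeListener(this); // fire at most once
                        perform(index);             // re-run the step
                    }
                }
            });
        }
    }
}
```

Note that the step never reports an error in the snapshot case; it simply stays pending until a state change shows the index is no longer being snapshotted.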

talevy added a commit to talevy/elasticsearch that referenced this issue Jan 16, 2019
delete and close index actions threw IllegalArgumentExceptions when
attempting to run against an index that has a snapshot in progress.

This change introduces a dedicated SnapshotInProgressException for
these scenarios. This is done to explicitly signal to clients that
this is the reason the action failed, and it is a retryable error.

relates to elastic#37541.
@talevy
Contributor Author

talevy commented Jan 16, 2019

  • Change actions that can't happen while a snapshot is in progress to throw a SnapshotInProgressException instead of an IllegalArgumentException

I opened a PR for this here: #37550

@talevy
Contributor Author

talevy commented Jan 16, 2019

I've also opened #37552 to codify the problem

talevy added a commit that referenced this issue Jan 17, 2019
delete and close index actions threw IllegalArgumentExceptions
when attempting to run against an index that has a snapshot
in progress.

This change introduces a dedicated SnapshotInProgressException
for these scenarios. This is done to explicitly signal to clients that
this is the reason the action failed, and it is a retryable error.

relates to #37541.
talevy added a commit to talevy/elasticsearch that referenced this issue Jan 22, 2019
…#37550)

delete and close index actions threw IllegalArgumentExceptions
when attempting to run against an index that has a snapshot
in progress.

This change introduces a dedicated SnapshotInProgressException
for these scenarios. This is done to explicitly signal to clients that
this is the reason the action failed, and it is a retryable error.

relates to elastic#37541.
talevy added a commit that referenced this issue Jan 23, 2019
…#37723)

delete and close index actions threw IllegalArgumentExceptions
when attempting to run against an index that has a snapshot
in progress.

This change introduces a dedicated SnapshotInProgressException
for these scenarios. This is done to explicitly signal to clients that
this is the reason the action failed, and it is a retryable error.

relates to #37541.
dakrone added a commit that referenced this issue Jan 23, 2019
Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed.

This change adds an abstract step (`AsyncRetryDuringSnapshotActionStep`) that certain steps (like the ones I mentioned above) can extend that will automatically handle a situation where a snapshot is taking place. When a `SnapshotInProgressException` is received by the listener wrapper, a `ClusterStateObserver` listener is registered to wait until the snapshot has completed, re-running the ILM action when no snapshot is occurring.

This also adds integration tests for these scenarios (thanks to @talevy in #37552).

Resolves #37541
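
For readers unfamiliar with the pattern, the abstract step with a listener wrapper described in the commit message above might look roughly like this. It is a standalone sketch, not the merged code: `StepListener`, `RetryDuringSnapshotStepSketch`, and the `waitingForSnapshot` queue are hypothetical names, and the queue merely stands in for the `ClusterStateObserver` wait. Subclasses implement only the action; the base class intercepts the snapshot failure and defers the retry.

```java
import java.util.Queue;

// Hypothetical stand-in for the exception introduced in #37550.
class SnapshotInProgressException extends RuntimeException {
    SnapshotInProgressException(String message) {
        super(message);
    }
}

// Hypothetical callback interface for asynchronous steps.
interface StepListener {
    void onResponse(boolean complete);
    void onFailure(Exception e);
}

abstract class RetryDuringSnapshotStepSketch {
    // Assumption: retries are parked here until something signals that the
    // snapshot finished; the real change waits on cluster state instead.
    private final Queue<Runnable> waitingForSnapshot;

    RetryDuringSnapshotStepSketch(Queue<Runnable> waitingForSnapshot) {
        this.waitingForSnapshot = waitingForSnapshot;
    }

    // Template method: wraps the caller's listener before running the action.
    final void performAction(String index, StepListener listener) {
        performDuringNoSnapshot(index, new StepListener() {
            @Override
            public void onResponse(boolean complete) {
                listener.onResponse(complete);
            }

            @Override
            public void onFailure(Exception e) {
                if (e instanceof SnapshotInProgressException) {
                    // Do not surface an error; schedule a re-run instead.
                    waitingForSnapshot.add(() -> performAction(index, listener));
                } else {
                    listener.onFailure(e);
                }
            }
        });
    }

    // Subclasses (delete, close, freeze, ...) implement only the action itself.
    abstract void performDuringNoSnapshot(String index, StepListener listener);
}
```

Keeping the retry logic in the base class means each concrete step stays unaware of snapshots, and any other failure still reaches the original listener unchanged.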