Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ILM] Delete step deletes data stream with only one index #105772

Merged
merged 6 commits into from
Mar 4, 2024

Conversation

andreidan
Copy link
Contributor

We seem to have a couple of checks to make sure we delete the data stream when the last index reaches the delete step however, these checks seem a bit contradictory.

Namely, the first check makes use if Index equality (UUID included) and the second just checks the index name.
So if a data stream with just one index (the write index) is restored from snapshot (different UUID) we would've failed the first index equality check and go through the second check dataStream.getWriteIndex().getName().equals(indexName) and fail the delete step (in a non-retryable way :( ) because we don't want to delete the write index of a data stream (but we really do if the data stream has only one index)

This PR makes 2 changes:

  1. use the index name equality everywhere in the step (we already looked up the index abstraction and the parent data stream, so we know for sure the managed index is part of the data stream)
  2. do not throw exception when we got here via a write index that is NOT the last index in the data stream but report the exception so we keep retrying this step (i.e. this enables our users to simply execute a manual rollover and the index is deleted by ILM eventually on retry)

We seem to have a couple of checks to make sure we delete the data stream
when the last index reaches the delete step however, these checks seem a
bit contradictory.

Namely, the first check makes use if `Index` equality (UUID included) and
the second just checks the index name.
So if a data stream with just one index (the write index) is restored from
snapshot (different UUID) we would've failed the first index equality check and
go through the second check `dataStream.getWriteIndex().getName().equals(indexName)` and
fail the delete step (in a non-retryable way :( ) because we don't want to delete the
write index of a data stream (but we really do if the data stream has only one index)

This PR makes 2 changes:
1. use the index name equality everywhere in the step (we already looked up the index
abstraction and the parent data stream, so we know for sure the managed index is part of
the data stream)
2. do not throw exception when we got here via a write index that is NOT the last index
in the data stream but report the exception so we keep retrying this step (i.e. this enables
our users to simply execute a manual rollover and the index is deleted by ILM eventually on retry)
@andreidan andreidan added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.13.1 v8.12.3 labels Feb 23, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team v8.14.0 labels Feb 23, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @andreidan, I've created a changelog YAML for you.

@andreidan
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left one comment about a simple comment, and another about enhancing the test, thanks for finding this Andrei!

@@ -41,7 +41,8 @@ public void performDuringNoSnapshot(IndexMetadata indexMetadata, ClusterState cu

if (dataStream != null) {
assert dataStream.getWriteIndex() != null : dataStream.getName() + " has no write index";
if (dataStream.getIndices().size() == 1 && dataStream.getIndices().get(0).equals(indexMetadata.getIndex())) {

if (dataStream.getIndices().size() == 1 && dataStream.getWriteIndex().getName().equals(indexName)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment about why we use name equality here so it doesn't get accidentally changed back? (I know we have tests, but it's still easy to stop someone from wasting work)

}

@Override
public void onFailure(Exception e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that now that this isn't throwing directly, we need to have a latch or other mechanism to ensure that the onFailure handler was actually invoked. Otherwise if we were to introduce a bug where neither onResponse nor onFailure were called, then we wouldn't hit any asserts and the test would pass (when it shouldn't).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The execution in this particular case is not async as we're not getting to the callback as part of a client interaction - this is all sync as it's part of the step validation.
I've stubbed the client to fail in case it's being called in this test so that changes to the ILM step structure will yield this test to fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I (later) understood what you meant - opened #105914 to make sure we fail the test if the listener is not called at all

@andreidan
Copy link
Contributor Author

@elasticmachine update branch

@andreidan andreidan added auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge Automatically create backport pull requests and merge when ready and removed v8.12.3 labels Mar 4, 2024
@elasticsearchmachine elasticsearchmachine merged commit c97160a into elastic:main Mar 4, 2024
14 checks passed
@andreidan andreidan deleted the fix-delete-step branch March 4, 2024 10:56
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Mar 4, 2024
…5772)

We seem to have a couple of checks to make sure we delete the data
stream when the last index reaches the delete step however, these checks
seem a bit contradictory.

Namely, the first check makes use if `Index` equality (UUID included)
and the second just checks the index name. So if a data stream with just
one index (the write index) is restored from snapshot (different UUID)
we would've failed the first index equality check and go through the
second check `dataStream.getWriteIndex().getName().equals(indexName)`
and fail the delete step (in a non-retryable way :( ) because we don't
want to delete the write index of a data stream (but we really do if the
data stream has only one index)

This PR makes 2 changes: 1. use the index name equality everywhere in
the step (we already looked up the index abstraction and the parent data
stream, so we know for sure the managed index is part of the data
stream) 2. do not throw exception when we got here via a write index
that is NOT the last index in the data stream but report the exception
so we keep retrying this step (i.e. this enables our users to simply
execute a manual rollover and the index is deleted by ILM eventually on
retry)
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.13

elasticsearchmachine pushed a commit that referenced this pull request Mar 4, 2024
…105897)

We seem to have a couple of checks to make sure we delete the data
stream when the last index reaches the delete step however, these checks
seem a bit contradictory.

Namely, the first check makes use if `Index` equality (UUID included)
and the second just checks the index name. So if a data stream with just
one index (the write index) is restored from snapshot (different UUID)
we would've failed the first index equality check and go through the
second check `dataStream.getWriteIndex().getName().equals(indexName)`
and fail the delete step (in a non-retryable way :( ) because we don't
want to delete the write index of a data stream (but we really do if the
data stream has only one index)

This PR makes 2 changes: 1. use the index name equality everywhere in
the step (we already looked up the index abstraction and the parent data
stream, so we know for sure the managed index is part of the data
stream) 2. do not throw exception when we got here via a write index
that is NOT the last index in the data stream but report the exception
so we keep retrying this step (i.e. this enables our users to simply
execute a manual rollover and the index is deleted by ILM eventually on
retry)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport-and-merge Automatically create backport pull requests and merge when ready auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team v8.13.1 v8.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants