[ILM] Delete step deletes data stream with only one index #105772

andreidan · 2024-02-23T12:27:50Z

We have a couple of checks to make sure we delete the data stream when its last index reaches the delete step; however, these checks are contradictory.

Namely, the first check makes use of Index equality (UUID included) and the second just checks the index name.
So if a data stream with just one index (the write index) is restored from a snapshot (different UUID), we fail the first Index equality check, fall through to the second check dataStream.getWriteIndex().getName().equals(indexName), and fail the delete step (in a non-retryable way :( ) because we don't want to delete the write index of a data stream (but we really do if the data stream has only one index).

This PR makes 2 changes:

1. Use index name equality everywhere in the step (we already looked up the index abstraction and the parent data stream, so we know for sure the managed index is part of the data stream).
2. Do not throw an exception when we got here via a write index that is NOT the only remaining index in the data stream; instead, report the exception so we keep retrying this step (i.e. this enables our users to simply execute a manual rollover, and the index is eventually deleted by ILM on retry).
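The decision logic described above can be sketched in isolation. This is a minimal illustration, not the actual `DeleteStep` code: the `Index` and `DataStream` records, the `Action` enum, and the `decide` method are hypothetical stand-ins, and the write index is assumed to be the last backing index in the list.

```java
import java.util.List;

final class DeleteDecisionSketch {
    // Hypothetical stand-in for an index identity: a name plus a UUID.
    record Index(String name, String uuid) {}

    // Hypothetical stand-in for a data stream; the last backing index is assumed to be the write index.
    record DataStream(String name, List<Index> indices) {
        Index writeIndex() {
            return indices.get(indices.size() - 1);
        }
    }

    enum Action { DELETE_DATA_STREAM, DELETE_INDEX, REPORT_FAILURE_AND_RETRY }

    // Uses name equality only, so a UUID mismatch cannot change the outcome.
    static Action decide(DataStream dataStream, String indexName) {
        boolean isWriteIndex = dataStream.writeIndex().name().equals(indexName);
        if (dataStream.indices().size() == 1 && isWriteIndex) {
            // The managed index is the only index left: delete the whole data stream.
            return Action.DELETE_DATA_STREAM;
        }
        if (isWriteIndex) {
            // Write index of a larger data stream: report a failure and retry later,
            // which leaves room for a manual rollover.
            return Action.REPORT_FAILURE_AND_RETRY;
        }
        return Action.DELETE_INDEX;
    }

    public static void main(String[] args) {
        // Same name, different UUIDs: the mismatch the description attributes to a snapshot restore.
        Index inDataStream = new Index(".ds-logs-000001", "uuid-a");
        Index managed = new Index(".ds-logs-000001", "uuid-b");
        DataStream logs = new DataStream("logs", List.of(inDataStream));

        System.out.println("full equality: " + inDataStream.equals(managed));
        System.out.println("decision by name: " + decide(logs, managed.name()));
    }
}
```

Comparing the two records with `equals` takes the UUID into account, which is the comparison that failed before this change; `decide` only looks at names, so the single-index data stream is still recognised.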


elasticsearchmachine · 2024-02-23T12:28:13Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2024-02-23T12:28:14Z

Hi @andreidan, I've created a changelog YAML for you.

andreidan · 2024-02-23T14:35:03Z

@elasticmachine update branch

dakrone

LGTM, I left one comment about a simple comment, and another about enhancing the test, thanks for finding this Andrei!

dakrone · 2024-03-02T11:19:27Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/DeleteStep.java

@@ -41,7 +41,8 @@ public void performDuringNoSnapshot(IndexMetadata indexMetadata, ClusterState cu

        if (dataStream != null) {
            assert dataStream.getWriteIndex() != null : dataStream.getName() + " has no write index";
-            if (dataStream.getIndices().size() == 1 && dataStream.getIndices().get(0).equals(indexMetadata.getIndex())) {
+
+            if (dataStream.getIndices().size() == 1 && dataStream.getWriteIndex().getName().equals(indexName)) {


Can you add a comment about why we use name equality here so it doesn't get accidentally changed back? (I know we have tests, but it's still easy to stop someone from wasting work)

dakrone · 2024-03-02T11:22:58Z

x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ilm/DeleteStepTests.java

+            }
+
+            @Override
+            public void onFailure(Exception e) {


I think that now that this isn't throwing directly, we need to have a latch or other mechanism to ensure that the onFailure handler was actually invoked. Otherwise if we were to introduce a bug where neither onResponse nor onFailure were called, then we wouldn't hit any asserts and the test would pass (when it shouldn't).

The execution in this particular case is not async, as we're not getting to the callback as part of a client interaction - this is all sync as it's part of the step validation.
I've stubbed the client to fail if it's called in this test, so that changes to the ILM step structure will cause this test to fail.

Ah I (later) understood what you meant - opened #105914 to make sure we fail the test if the listener is not called at all
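The concern raised in this thread (a test that passes when neither callback runs) can be illustrated with a small sketch. Everything here is hypothetical: the `Listener` interface, `performStep`, and `captureFailure` are stand-ins for illustration and not the code in `DeleteStepTests` or in #105914. Because the call is assumed to be synchronous, a flag checked after the call is enough and no latch is needed.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

final class ListenerCalledSketch {
    // Hypothetical callback interface with a success and a failure path.
    interface Listener {
        void onResponse();
        void onFailure(Exception e);
    }

    // Hypothetical synchronous step: reports a failure instead of throwing.
    static void performStep(boolean writeIndexOfLargerStream, Listener listener) {
        if (writeIndexOfLargerStream) {
            listener.onFailure(new IllegalStateException("index is the write index of the data stream"));
            return;
        }
        listener.onResponse();
    }

    // Runs the step and fails loudly if neither callback was invoked.
    // Returns the reported exception, or null when onResponse was called.
    static Exception captureFailure(boolean writeIndexOfLargerStream) {
        AtomicBoolean listenerCalled = new AtomicBoolean(false);
        AtomicReference<Exception> failure = new AtomicReference<>();
        performStep(writeIndexOfLargerStream, new Listener() {
            @Override
            public void onResponse() {
                listenerCalled.set(true);
            }

            @Override
            public void onFailure(Exception e) {
                listenerCalled.set(true);
                failure.set(e);
            }
        });
        if (!listenerCalled.get()) {
            throw new AssertionError("listener was never invoked");
        }
        return failure.get();
    }

    public static void main(String[] args) {
        System.out.println("failure reported: " + captureFailure(true));
        System.out.println("success path returns: " + captureFailure(false));
    }
}
```

If the step were asynchronous, the flag would have to be replaced by something the test can wait on, such as a `CountDownLatch` with a timeout.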

andreidan · 2024-03-04T09:13:07Z

@elasticmachine update branch


elasticsearchmachine · 2024-03-04T10:57:04Z

💚 Backport successful

Status	Branch	Result
✅	8.13


andreidan added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.13.1 v8.12.3 labels Feb 23, 2024

andreidan requested a review from dakrone February 23, 2024 12:27

elasticsearchmachine added Team:Data Management Meta label for data/management team v8.14.0 labels Feb 23, 2024

Update docs/changelog/105772.yaml

c966e86

Merge branch 'main' into fix-delete-step

066bc95

dakrone approved these changes Mar 2, 2024

View reviewed changes

elasticmachine and others added 3 commits March 4, 2024 19:43

Merge branch 'main' into fix-delete-step

99bf9bb

comment

f95bad9

Stub the client in test to throw exception as it shouldn't be called …

b27055d

…at all

andreidan added auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge Automatically create backport pull requests and merge when ready and removed v8.12.3 labels Mar 4, 2024

elasticsearchmachine merged commit c97160a into elastic:main Mar 4, 2024
14 checks passed

andreidan deleted the fix-delete-step branch March 4, 2024 10:56

andreidan mentioned this pull request Mar 4, 2024

[8.13] [ILM] Delete step deletes data stream with only one index (#105772) #105897

Merged

nielsbauman mentioned this pull request Mar 5, 2024

[ILM][Tests] Make sure we test the listener is called #105914

Merged
