ILM integration test with full policy #33402

talevy · 2018-09-04T23:34:25Z

- Adds an integration test that runs through a policy with all the actions defined.
- Adds a test specific to a policy that has just a rollover action.
- Bumps the node count to 4.

NOTE: the test fails, and I think it is due to the timing of async actions being executed in parallel.

elasticmachine · 2018-09-04T23:34:26Z

Pinging @elastic/es-core-infra

colings86

I left a few minor comments but LGTM once those are fixed.

colings86 · 2018-09-05T09:50:21Z

...n/ilm/src/test/java/org/elasticsearch/xpack/indexlifecycle/TimeSeriesLifecycleActionsIT.java

+            new RolloverAction(null, null, 1L))));
+        phases.put("warm", new Phase("warm", TimeValue.ZERO, warmActions));
+        phases.put("cold", new Phase("cold", TimeValue.ZERO, singletonMap(AllocateAction.NAME,
+            new AllocateAction(1, singletonMap("_name", "node-3"), null, null))));


The number of replicas needs to be set to 0 here; otherwise the shrunken index can never progress past the cold phase, meaning it will never be deleted. Setting this to 0 locally causes the test to pass for me.

oh, good catch! this is why I wanted another pair of eyes!

The error scenario is misleading because execution was attempted again while the transaction was stuck halfway.

hmm. this does not pass the test locally for me. I will continue investigating. I am also seeing exceptions with rollover

rollover failed stacktrace

[2018-09-05T11:20:18,002][ERROR][o.e.c.s.MasterService ] [node-1] exception thrown by listener notifying of failure from [ILM]
org.elasticsearch.ElasticsearchException: policy [nzxPV] for index [hwdnjldqgi-000001] failed trying to move from step [{"phase":"hot","action":"rollover","name":"attempt_rollover"}] to step [{"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}].
    at org.elasticsearch.xpack.indexlifecycle.MoveToNextStepUpdateTask.onFailure(MoveToNextStepUpdateTask.java:78) ~[?:?]
    at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.onFailure(MasterService.java:453) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService$TaskOutputs.notifyFailedTasks(MasterService.java:386) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:199) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:844) [?:?]
    Suppressed: java.lang.NullPointerException
        at org.elasticsearch.xpack.indexlifecycle.MoveToNextStepUpdateTask.execute(MoveToNextStepUpdateTask.java:54) ~[?:?]
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: java.lang.NullPointerException
    at org.elasticsearch.xpack.indexlifecycle.MoveToNextStepUpdateTask.execute(MoveToNextStepUpdateTask.java:54) ~[?:?]
    at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
    ... 9 more
=========================================

That stack trace is something we should fix, but I don't think it is the cause of any failure you are still seeing. It seems to happen because MoveToNextStepUpdateTask tries to access the index metadata for an index that has been deleted, and we are not checking that the index metadata is non-null before we use it. That should not cause a test failure, only an ugly and annoying NPE in the logs, so I think there will be another stack trace from the test itself showing the remaining error that causes the test to fail.

I raised #33455 to fix the NPEs
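A minimal sketch of the kind of null guard described in the comments above. It uses hypothetical stand-in types (`ClusterStateSketch`, `IndexMetaSketch`) rather than the real Elasticsearch classes, and the actual change in #33455 may look different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for cluster state and index metadata; NOT the real Elasticsearch classes.
final class NullGuardSketch {

    static final class IndexMetaSketch {
        final String currentStep;
        IndexMetaSketch(String currentStep) { this.currentStep = currentStep; }
    }

    static final class ClusterStateSketch {
        final Map<String, IndexMetaSketch> indices;
        ClusterStateSketch(Map<String, IndexMetaSketch> indices) { this.indices = indices; }
    }

    // Returns a new state with the index moved to nextStep, or the same state when there is nothing to do.
    static ClusterStateSketch moveToNextStep(ClusterStateSketch state, String index,
                                             String expectedStep, String nextStep) {
        IndexMetaSketch meta = state.indices.get(index);
        if (meta == null) {
            // The index was deleted before this task ran: return the state unchanged instead of throwing an NPE.
            return state;
        }
        if (!meta.currentStep.equals(expectedStep)) {
            // The index is no longer on the step this task was created for: also a no-op.
            return state;
        }
        Map<String, IndexMetaSketch> updated = new HashMap<>(state.indices);
        updated.put(index, new IndexMetaSketch(nextStep));
        return new ClusterStateSketch(updated);
    }
}
```

Returning the unchanged state for a missing index turns the deleted-index case into a quiet no-op rather than a failure reported to the listener.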

If I pull down this PR and rebase the branch on the latest from the feature branch I cannot reproduce the error you get above. However I can intermittently reproduce a failure with the following stack trace:

stacktrace

[2018-09-06T09:34:36,024][ERROR][o.e.ExceptionsHelper ] [node-3] fatal error
    at org.elasticsearch.ExceptionsHelper.lambda$maybeDieOnAnotherThread$2(ExceptionsHelper.java:264)
    at java.base/java.util.Optional.ifPresent(Optional.java:172)
    at org.elasticsearch.ExceptionsHelper.maybeDieOnAnotherThread(ExceptionsHelper.java:254)
    at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:201)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:844)
[2018-09-06T09:34:36,032][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-3] fatal error in thread [Thread-4], exiting
java.lang.AssertionError: expected all steps for [[pbifllksqt-000001/CO6HMKYZQG-ZcIIKXSaHAw]] to be in phase [new] but they were not, steps: [{"phase":"hot","action":"rollover","name":"attempt_rollover"} => {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}, {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"} => {"phase":"hot","action":"complete","name":"complete"}, {"phase":"hot","action":"complete","name":"complete"} => {"phase":"warm","action":"readonly","name":"readonly"}]
    at org.elasticsearch.xpack.indexlifecycle.PolicyStepsRegistry.getStep(PolicyStepsRegistry.java:240) ~[?:?]
    at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleRunner.getCurrentStep(IndexLifecycleRunner.java:200) ~[?:?]
    at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleRunner.runPolicy(IndexLifecycleRunner.java:89) ~[?:?]
    at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:207) ~[?:?]
    at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleService.triggered(IndexLifecycleService.java:165) ~[?:?]
    at org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:164) ~[?:?]
    at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:192) ~[?:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:844) [?:?]

This error seems to kill the node.

I'm looking into this to see if I can reproduce it

colings86 · 2018-09-05T09:51:31Z

...n/ilm/src/test/java/org/elasticsearch/xpack/indexlifecycle/TimeSeriesLifecycleActionsIT.java

+            "{ \"policy\":" + Strings.toString(builder) + "}", ContentType.APPLICATION_JSON);
+        Request request = new Request("PUT", "_ilm/" + policy);
+        request.setEntity(entity);
+        client().performRequest(request);


We should check the response here with something like assertOK()
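A minimal sketch of what such a response check does, written against a hypothetical `ResponseSketch` type instead of the real REST client classes. Treating both 200 and 201 as success is an assumption made for this illustration. Elsewhere in this test file the pattern appears as `assertOK(adminClient().performRequest(request))`.

```java
// Hypothetical stand-in for an HTTP response; the real test uses the Elasticsearch REST client types.
final class AssertOkSketch {

    static final class ResponseSketch {
        final int statusCode;
        ResponseSketch(int statusCode) { this.statusCode = statusCode; }
    }

    // Fails loudly unless the request succeeded. Accepting 200 and 201 is an assumption of this sketch.
    static ResponseSketch assertOk(ResponseSketch response) {
        if (response.statusCode != 200 && response.statusCode != 201) {
            throw new AssertionError("unexpected status code: " + response.statusCode);
        }
        // Returning the response lets callers keep chaining on it.
        return response;
    }
}
```

Wrapping the call makes a failed policy PUT surface at the line that issued it, rather than as a confusing timeout later in the test.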

colings86 · 2018-09-05T09:52:00Z

...n/ilm/src/test/java/org/elasticsearch/xpack/indexlifecycle/TimeSeriesLifecycleActionsIT.java

-            }pollIntervalEntity.endObject();
-        } pollIntervalEntity.endObject();
-        request.setJsonEntity(Strings.toString(pollIntervalEntity));
-        assertOK(adminClient().performRequest(request));
    }

    public static void updatePolicy(String indexName, String policy) throws IOException {
        Request request = new Request("PUT", "/" + indexName + "/_ilm/" + policy);
        client().performRequest(request);


We should check the response here, probably with assertOK()

talevy · 2018-09-14T17:53:33Z

We discussed, in a video call, some of the step pile-up problems that are making these tests flaky. I have added the stalled label to this PR until we resolve the locking problem where the same step executes multiple times and fails due to race conditions.
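To illustrate the class of problem described above, here is a minimal sketch of one generic way to stop the same step from running twice concurrently for the same index. The class `InFlightStepGuardSketch` and its method are hypothetical and written only for this illustration; this is not how ILM resolved the issue.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard, for illustration only; not part of the ILM code base.
final class InFlightStepGuardSketch {

    // Keys of (index, step) pairs that are currently executing.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    // Runs the action unless the same (index, step) pair is already executing.
    // Returns true if the action ran, false if it was skipped.
    boolean runOnce(String index, String step, Runnable action) {
        String key = index + "/" + step;
        if (!inFlight.add(key)) {
            // Another execution of this step for this index is still in progress.
            return false;
        }
        try {
            action.run();
            return true;
        } finally {
            // Always release the key so a later trigger can run the step again.
            inFlight.remove(key);
        }
    }
}
```

`Set.add` on a concurrent key set is atomic, so only one caller can claim a given key at a time; a periodic trigger that fires while the step is still running is simply skipped.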

add full policy integration test

542460a

- this adds an integration test that runs through a policy with all the actions defined. - adds a test specific to a policy having just a rollover action - bumps the node count to 4

talevy added the >test and :Data Management/ILM+SLM labels Sep 4, 2018

talevy requested review from dakrone and colings86 September 4, 2018 23:34

colings86 approved these changes Sep 5, 2018

elasticmachine mentioned this pull request Sep 5, 2018

[meta] Index Lifecycle Management Plan #29823

Closed

talevy added 4 commits September 5, 2018 10:48

Merge branch 'index-lifecycle' into ilm-it

36165f1

Merge branch 'index-lifecycle' into ilm-it

6340536

change replica count in final allocation

5880a43

Merge branch 'index-lifecycle' into ilm-it

b6d99a3

talevy added the stalled label Sep 14, 2018

talevy and others added 4 commits September 17, 2018 18:15

Merge branch 'index-lifecycle' into ilm-it

4763375

increase poll

2afafe5

upgdate build.gradle with more poll

8a31c48

Merge remote-tracking branch 'origin/index-lifecycle' into ilm-it

ac27fc7

dakrone mentioned this pull request Sep 27, 2018

Change step execution flow to be deliberate about type #34126

Merged

dakrone added 2 commits October 3, 2018 08:01

Merge remote-tracking branch 'origin/index-lifecycle' into ilm-it

a9347af

Merge remote-tracking branch 'origin/index-lifecycle' into ilm-it

e9f4d5f

dakrone merged commit f10735a into elastic:index-lifecycle Oct 3, 2018

talevy deleted the ilm-it branch October 10, 2018 11:13
