ILM integration test with full policy #33402
- this adds an integration test that runs through a policy with all the actions defined.
- adds a test specific to a policy having just a rollover action
- bumps the node count to 4
Pinging @elastic/es-core-infra
I left a few minor comments but LGTM once those are fixed.
new RolloverAction(null, null, 1L))));
phases.put("warm", new Phase("warm", TimeValue.ZERO, warmActions));
phases.put("cold", new Phase("cold", TimeValue.ZERO, singletonMap(AllocateAction.NAME,
new AllocateAction(1, singletonMap("_name", "node-3"), null, null))));
The number of replicas needs to be set to 0 here, otherwise the shrunken index can never progress past the cold phase, meaning it will never be deleted. Setting this to 0 locally causes the test to pass for me.
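One way to picture why the replica count matters, as a rough sketch only: it assumes the allocate action restricts the index to a single node (the `_name: node-3` filter in the diff) and that a replica is never placed on the same node as its primary. The class and method names are hypothetical and are not Elasticsearch APIs.

```java
/** Hypothetical model of shard-copy placement; not an Elasticsearch API. */
final class ReplicaAllocationSketch {

    /**
     * Returns true when every copy of a shard (one primary plus its replicas)
     * can be placed on a distinct node among the nodes the filter allows.
     */
    static boolean allCopiesAssignable(int numberOfReplicas, int eligibleNodes) {
        int copies = 1 + numberOfReplicas;
        return eligibleNodes >= copies;
    }

    public static void main(String[] args) {
        int eligibleNodes = 1; // assumption: only "node-3" matches the filter
        // With one replica, two copies need two nodes, so one copy stays unassigned.
        System.out.println(allCopiesAssignable(1, eligibleNodes));
        // With zero replicas, the single primary fits on the one eligible node.
        System.out.println(allCopiesAssignable(0, eligibleNodes));
    }
}
```

Under these assumptions, a replica count of 1 leaves a copy permanently unassigned, which would be consistent with the index never leaving the cold phase.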
oh, good catch! this is why I wanted another pair of eyes!
The error scenario is misleading due to the attempt to execute it again and the transaction was stuck halfway
hmm. this does not pass the test locally for me. I will continue investigating. I am also seeing exceptions with rollover
rollover failed stacktrace
[2018-09-05T11:20:18,002][ERROR][o.e.c.s.MasterService ] [node-1] exception thrown by listener notifying of failure from [ILM]
org.elasticsearch.ElasticsearchException: policy [nzxPV] for index [hwdnjldqgi-000001] failed trying to move from step [{"phase":"hot","action":"rollover","name":"attempt_rollover"}] to step [{"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}].
at org.elasticsearch.xpack.indexlifecycle.MoveToNextStepUpdateTask.onFailure(MoveToNextStepUpdateTask.java:78) ~[?:?]
at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.onFailure(MasterService.java:453) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService$TaskOutputs.notifyFailedTasks(MasterService.java:386) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:199) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
Suppressed: java.lang.NullPointerException
at org.elasticsearch.xpack.indexlifecycle.MoveToNextStepUpdateTask.execute(MoveToNextStepUpdateTask.java:54) ~[?:?]
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: java.lang.NullPointerException
at org.elasticsearch.xpack.indexlifecycle.MoveToNextStepUpdateTask.execute(MoveToNextStepUpdateTask.java:54) ~[?:?]
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
... 9 more
=========================================
That stack trace, whilst something we should fix, is not, I think, the cause of any failure you are still seeing. It seems to happen because MoveToNextStepUpdateTask tries to access the index metadata for an index that has already been deleted, and we are not checking that the index metadata is not null before we try to use it. However, that should not cause a real issue, only an ugly and annoying NPE in the logs, so I think there will be another stack trace from the test itself showing any remaining error that causes the test to fail.
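The kind of guard being described could look roughly like the sketch below. It is illustrative only: a plain `Map` stands in for the cluster state, and `NullGuardSketch` and `moveToNextStep` are hypothetical names, not the actual `MoveToNextStepUpdateTask` code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in; the real task operates on cluster state, not a Map. */
final class NullGuardSketch {

    /**
     * Moves the index to nextStep, or returns the state unchanged when the
     * index no longer exists (for example, it was deleted in the meantime).
     */
    static Map<String, String> moveToNextStep(Map<String, String> currentSteps,
                                              String index, String nextStep) {
        String currentStep = currentSteps.get(index);
        if (currentStep == null) {
            // Index is gone: nothing to update, so avoid dereferencing null.
            return currentSteps;
        }
        Map<String, String> updated = new HashMap<>(currentSteps);
        updated.put(index, nextStep);
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> state = new HashMap<>();
        state.put("index-000001", "attempt_rollover");
        System.out.println(moveToNextStep(state, "index-000001", "update-rollover-lifecycle-date"));
        // A missing index leaves the state object untouched instead of throwing.
        System.out.println(moveToNextStep(state, "deleted-index", "update-rollover-lifecycle-date") == state);
    }
}
```

Returning the unchanged state for a missing index mirrors the idea of treating the deletion as a no-op rather than an error.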
I raised #33455 to fix the NPEs
If I pull down this PR and rebase the branch on the latest from the feature branch I cannot reproduce the error you get above. However I can intermittently reproduce a failure with the following stack trace:
stacktrace
[2018-09-06T09:34:36,024][ERROR][o.e.ExceptionsHelper ] [node-3] fatal error
at org.elasticsearch.ExceptionsHelper.lambda$maybeDieOnAnotherThread$2(ExceptionsHelper.java:264)
at java.base/java.util.Optional.ifPresent(Optional.java:172)
at org.elasticsearch.ExceptionsHelper.maybeDieOnAnotherThread(ExceptionsHelper.java:254)
at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:201)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:844)
[2018-09-06T09:34:36,032][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-3] fatal error in thread [Thread-4], exiting
java.lang.AssertionError: expected all steps for [[pbifllksqt-000001/CO6HMKYZQG-ZcIIKXSaHAw]] to be in phase [new] but they were not, steps: [{"phase":"hot","action":"rollover","name":"attempt_rollover"} => {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}, {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"} => {"phase":"hot","action":"complete","name":"complete"}, {"phase":"hot","action":"complete","name":"complete"} => {"phase":"warm","action":"readonly","name":"readonly"}]
at org.elasticsearch.xpack.indexlifecycle.PolicyStepsRegistry.getStep(PolicyStepsRegistry.java:240) ~[?:?]
at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleRunner.getCurrentStep(IndexLifecycleRunner.java:200) ~[?:?]
at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleRunner.runPolicy(IndexLifecycleRunner.java:89) ~[?:?]
at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:207) ~[?:?]
at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleService.triggered(IndexLifecycleService.java:165) ~[?:?]
at org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:164) ~[?:?]
at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:192) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
This error seems to shoot the node
I'm looking into this to see if I can reproduce it
"{ \"policy\":" + Strings.toString(builder) + "}", ContentType.APPLICATION_JSON);
Request request = new Request("PUT", "_ilm/" + policy);
request.setEntity(entity);
client().performRequest(request);
We should check the response here with something like assertOK()
++
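For illustration, a response check of this kind might be sketched as follows. `FakeResponse` and `assertOk` are hypothetical stand-ins written for this example; they are not the REST client's `Response` class or the test framework's `assertOK()`, and the accepted status codes (200 and 201) are an assumption.

```java
/** Hypothetical stand-ins for a REST response and a test assertion helper. */
final class AssertOkSketch {

    /** Minimal response holder; only the status code matters here. */
    static final class FakeResponse {
        final int statusCode;

        FakeResponse(int statusCode) {
            this.statusCode = statusCode;
        }
    }

    /** Fails loudly unless the status is 200 (OK) or 201 (Created). */
    static void assertOk(FakeResponse response) {
        if (response.statusCode != 200 && response.statusCode != 201) {
            throw new AssertionError("unexpected status code: " + response.statusCode);
        }
    }

    public static void main(String[] args) {
        assertOk(new FakeResponse(200)); // passes silently
        try {
            assertOk(new FakeResponse(500));
        } catch (AssertionError e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Wrapping the request call in such a check makes a failed policy update fail the test at the point of the request instead of surfacing later as a confusing timeout.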
}pollIntervalEntity.endObject();
} pollIntervalEntity.endObject();
request.setJsonEntity(Strings.toString(pollIntervalEntity));
assertOK(adminClient().performRequest(request));
}

public static void updatePolicy(String indexName, String policy) throws IOException {
Request request = new Request("PUT", "/" + indexName + "/_ilm/" + policy);
client().performRequest(request);
We should check the response here, probably with assertOK()
We discussed some of the step pile-up problems that are causing these tests to be flaky in a video-call. I have added the
NOTE: test fails, and I think it is due to timing of async actions being executed in parallel