ILM integration test with full policy #33402

Merged
merged 11 commits into from Oct 3, 2018

@talevy (Contributor) commented Sep 4, 2018

  • this adds an integration test that runs through a policy
    with all the actions defined.
  • adds a test specific to a policy having just a rollover action
  • bumps the node count to 4

NOTE: the test currently fails, and I think it is due to the timing of async actions being executed in parallel

@talevy added the >test and :Data Management/ILM+SLM labels Sep 4, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra

@colings86 (Contributor) left a comment

I left a few minor comments but LGTM once those are fixed.

new RolloverAction(null, null, 1L))));
phases.put("warm", new Phase("warm", TimeValue.ZERO, warmActions));
phases.put("cold", new Phase("cold", TimeValue.ZERO, singletonMap(AllocateAction.NAME,
new AllocateAction(1, singletonMap("_name", "node-3"), null, null))));

Contributor

The number of replicas needs to be set to 0 here; otherwise the shrunken index can never progress past the cold phase, meaning it will never be deleted. Setting this to 0 locally causes the test to pass for me.
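
A minimal sketch of one plausible reading of this comment, assuming the allocation filter in the snippet above leaves node-3 as the only eligible node and that a primary and its replica cannot share a node. The class and method names are hypothetical and model the arithmetic only; this is not Elasticsearch code.

```java
// Hypothetical model, not Elasticsearch code: checks whether every copy of a
// shard can be placed when at most one copy of a shard may live on each node.
class ReplicaAllocationSketch {

    /**
     * Returns true when the primary plus all replicas fit on the eligible nodes,
     * assuming one copy of a given shard per node.
     */
    static boolean allCopiesAssignable(int replicas, int eligibleNodes) {
        int copies = 1 + replicas; // one primary plus its replicas
        return copies <= eligibleNodes;
    }

    public static void main(String[] args) {
        int eligibleNodes = 1; // assumption: only "node-3" matches the filter
        System.out.println("replicas=1 assignable: " + allCopiesAssignable(1, eligibleNodes));
        System.out.println("replicas=0 assignable: " + allCopiesAssignable(0, eligibleNodes));
    }
}
```

Under these assumptions, one replica leaves a copy permanently unassigned, which would be consistent with the index never finishing the cold phase, while zero replicas lets allocation complete.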

Contributor Author

oh, good catch! this is why I wanted another pair of eyes!

The error scenario is misleading because the action was attempted again while the transaction was stuck halfway.

@talevy (Contributor Author) Sep 5, 2018

hmm. the test still does not pass locally for me. I will continue investigating. I am also seeing exceptions with rollover:

rollover failed stacktrace
[2018-09-05T11:20:18,002][ERROR][o.e.c.s.MasterService    ] [node-1] exception thrown by listener notifying of failure from [ILM]
org.elasticsearch.ElasticsearchException: policy [nzxPV] for index [hwdnjldqgi-000001] failed trying to move from step [{"phase":"hot","action":"rollover","name":"attempt_rollover"}] to step [{"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}].
        at org.elasticsearch.xpack.indexlifecycle.MoveToNextStepUpdateTask.onFailure(MoveToNextStepUpdateTask.java:78) ~[?:?]
        at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.onFailure(MasterService.java:453) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService$TaskOutputs.notifyFailedTasks(MasterService.java:386) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:199) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
        Suppressed: java.lang.NullPointerException
                at org.elasticsearch.xpack.indexlifecycle.MoveToNextStepUpdateTask.execute(MoveToNextStepUpdateTask.java:54) ~[?:?]
                at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
                at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
                at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
                at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
                at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
                at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
                at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
                at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
                at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
                at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
                at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: java.lang.NullPointerException
        at org.elasticsearch.xpack.indexlifecycle.MoveToNextStepUpdateTask.execute(MoveToNextStepUpdateTask.java:54) ~[?:?]
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        ... 9 more
=========================================

Contributor

That stack trace is something we should fix, but I don't think it is the cause of any failure you are still seeing. It seems to happen because MoveToNextStepUpdateTask tries to access the index metadata for an index that has been deleted, without first checking that the index metadata is not null. That should not cause a test failure, only an ugly and annoying NPE in the logs, so I think there will be another stack trace from the test itself showing the remaining error that causes the test to fail.
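
A minimal sketch of the kind of null check described above, assuming a plain map stands in for the cluster state's index metadata. The class, method, and map layout are hypothetical; this is not the actual MoveToNextStepUpdateTask code or the fix that was later raised.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a cluster state update: index name -> current step.
class NullSafeStepMoveSketch {

    /**
     * Moves the index to nextStep. If the index no longer exists (for example,
     * it was deleted before the task ran), the state is returned unchanged
     * instead of dereferencing a null entry.
     */
    static Map<String, String> moveToNextStep(Map<String, String> state, String index, String nextStep) {
        String currentStep = state.get(index);
        if (currentStep == null) {
            return state; // index is gone: nothing to update
        }
        Map<String, String> updated = new HashMap<>(state);
        updated.put(index, nextStep);
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> state = new HashMap<>();
        state.put("logs-000001", "attempt_rollover");
        // A deleted index yields the same state object, with no exception.
        System.out.println(moveToNextStep(state, "deleted-index", "complete") == state);
        System.out.println(moveToNextStep(state, "logs-000001", "complete"));
    }
}
```

Returning the unchanged state for a missing index treats the update as a no-op, which matches the comment's point that the NPE is log noise rather than a real failure.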

Contributor

I raised #33455 to fix the NPEs

@colings86 (Contributor) Sep 6, 2018

If I pull down this PR and rebase the branch on the latest from the feature branch, I cannot reproduce the error you get above. However, I can intermittently reproduce a failure with the following stack trace:

stacktrace
[2018-09-06T09:34:36,024][ERROR][o.e.ExceptionsHelper     ] [node-3] fatal error
        at org.elasticsearch.ExceptionsHelper.lambda$maybeDieOnAnotherThread$2(ExceptionsHelper.java:264)
        at java.base/java.util.Optional.ifPresent(Optional.java:172)
        at org.elasticsearch.ExceptionsHelper.maybeDieOnAnotherThread(ExceptionsHelper.java:254)
        at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:201)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:844)
[2018-09-06T09:34:36,032][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-3] fatal error in thread [Thread-4], exiting
java.lang.AssertionError: expected all steps for [[pbifllksqt-000001/CO6HMKYZQG-ZcIIKXSaHAw]] to be in phase [new] but they were not, steps: [{"phase":"hot","action":"rollover","name":"attempt_rollover"} => {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}, {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"} => {"phase":"hot","action":"complete","name":"complete"}, {"phase":"hot","action":"complete","name":"complete"} => {"phase":"warm","action":"readonly","name":"readonly"}]
        at org.elasticsearch.xpack.indexlifecycle.PolicyStepsRegistry.getStep(PolicyStepsRegistry.java:240) ~[?:?]
        at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleRunner.getCurrentStep(IndexLifecycleRunner.java:200) ~[?:?]
        at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleRunner.runPolicy(IndexLifecycleRunner.java:89) ~[?:?]
        at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:207) ~[?:?]
        at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleService.triggered(IndexLifecycleService.java:165) ~[?:?]
        at org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:164) ~[?:?]
        at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:192) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]

This error seems to kill the node: the AssertionError is treated as fatal and node-3 exits.

Member

I'm looking into this to see if I can reproduce it

"{ \"policy\":" + Strings.toString(builder) + "}", ContentType.APPLICATION_JSON);
Request request = new Request("PUT", "_ilm/" + policy);
request.setEntity(entity);
client().performRequest(request);

Contributor

We should check the response here with something like assertOK()

Contributor Author

++

}pollIntervalEntity.endObject();
} pollIntervalEntity.endObject();
request.setJsonEntity(Strings.toString(pollIntervalEntity));
assertOK(adminClient().performRequest(request));
}

public static void updatePolicy(String indexName, String policy) throws IOException {
Request request = new Request("PUT", "/" + indexName + "/_ilm/" + policy);
client().performRequest(request);

Contributor

We should check the response here, probably with assertOK()
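
A minimal sketch of what such a check amounts to, assuming success means an HTTP status of 200 or 201. The helper below takes a bare status code and is a hypothetical stand-in, not the test framework's own assertOK(); in the diff context above, the poll interval request already wraps performRequest in assertOK.

```java
// Hypothetical stand-in for a response check: fails fast on an unexpected
// HTTP status instead of silently ignoring the result of the request.
class AssertOkSketch {

    /** Throws AssertionError unless the status code is 200 or 201 (assumed success codes). */
    static void assertOK(int statusCode) {
        if (statusCode != 200 && statusCode != 201) {
            throw new AssertionError("unexpected HTTP status: " + statusCode);
        }
    }

    /** Returns true when assertOK accepts the status code, false when it rejects it. */
    static boolean passes(int statusCode) {
        try {
            assertOK(statusCode);
            return true;
        } catch (AssertionError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("200 passes: " + passes(200));
        System.out.println("404 passes: " + passes(404));
    }
}
```

Checking the response at the point of the request makes a rejected policy update fail the test immediately, rather than surfacing later as a confusing timeout.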

@talevy added the stalled label Sep 14, 2018
@talevy (Contributor Author) commented Sep 14, 2018

We discussed some of the step pile-up problems that are making these tests flaky in a video call. I have added the stalled label to this PR until we resolve the locking problem where the same step executes multiple times and fails due to race conditions.
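
A minimal sketch of one generic way to keep the same step from running twice at once, assuming an in-memory set of in-flight index/step keys. The class and method names are hypothetical, and the thread does not say how the locking problem was actually resolved, so this only illustrates the idea.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard: allows at most one in-flight execution per (index, step).
class StepExecutionGuardSketch {

    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /**
     * Runs the action unless the same index/step pair is already executing.
     * Returns true if the action ran, false if it was skipped as a duplicate.
     */
    boolean runOnce(String index, String step, Runnable action) {
        String key = index + "/" + step;
        if (!inFlight.add(key)) {
            return false; // another execution of this step is still in progress
        }
        try {
            action.run();
            return true;
        } finally {
            inFlight.remove(key); // release so later triggers can run the step
        }
    }

    public static void main(String[] args) {
        StepExecutionGuardSketch guard = new StepExecutionGuardSketch();
        boolean ran = guard.runOnce("logs-000001", "attempt_rollover",
            () -> System.out.println("nested attempt ran: "
                + guard.runOnce("logs-000001", "attempt_rollover", () -> { })));
        System.out.println("outer attempt ran: " + ran);
    }
}
```

The set's atomic add acts as the lock, so a second trigger that arrives while the step is still executing is dropped instead of racing with the first.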

@dakrone merged commit f10735a into elastic:index-lifecycle Oct 3, 2018
dakrone pushed a commit that referenced this pull request Oct 3, 2018
@talevy deleted the ilm-it branch October 10, 2018 11:13