re-enable ILM integration tests and fix policyRegistry update bug #32108

talevy · 2018-07-16T20:27:54Z

This PR re-introduces our ILM integration tests with mock steps
that we can control in the tests.

These tests uncovered a bug where the policy-steps-registry was
not being updated on newly elected masters when there were no
cluster-state changes to ILM metadata. The fix laid out cleans up
the registry/runner when a node is un-elected as master. It re-assigns
the class variables so that any existing runner/registry instances that
may still be running in other threads can continue to do so.


elasticmachine · 2018-07-16T20:27:56Z

Pinging @elastic/es-core-infra

colings86

@talevy I left some comments. I also think @jasontedor should take a look at this as I know he had some thoughts on how it might work

colings86 · 2018-07-17T08:40:46Z

...in/core/src/main/java/org/elasticsearch/xpack/core/indexlifecycle/LockableLifecycleType.java

+import java.util.List;
+import java.util.Map;
+
+public class LockableLifecycleType implements LifecycleType {


It's a bit strange to have this in the main code rather than test code since we only want to use it for tests. I can see why you've done this, but I wonder if there is a way for us to only register this lifecycle type in tests so it doesn't need to live with the main code?

In fact I see below that you register a plugin for the tests which itself registers this lifecycle type so maybe this class could already be moved to the test source folder?

yeah... I convinced myself I already moved this. will move

colings86 · 2018-07-17T08:41:49Z

...in/core/src/main/java/org/elasticsearch/xpack/core/indexlifecycle/LockableLifecycleType.java

+    @Override
+    public void writeTo(StreamOutput out) {
+
+    }


I could be wrong but I think we need a LockableLifecycleType(StreamInput in); constructor here for serialisation?

I wondered that, but it seems it is never needed.

Since the class itself holds the logic for what to do with the method calls, there are no useful instance variables to serialize across the wire.
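To illustrate the point about a type with no instance state, here is a minimal sketch using plain `java.io` streams as a stand-in for Elasticsearch's `StreamInput`/`StreamOutput`. The class name `StatelessTypeSketch`, the `INSTANCE` singleton, and the `readFrom` helper are assumptions made for illustration; they are not taken from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a lifecycle type that carries no instance state.
final class StatelessTypeSketch {
    static final StatelessTypeSketch INSTANCE = new StatelessTypeSketch();

    private StatelessTypeSketch() {
    }

    // Nothing to serialize: all behaviour lives in the class, not in fields.
    void writeTo(DataOutputStream out) throws IOException {
    }

    // Reading consumes no bytes and simply hands back the shared instance.
    static StatelessTypeSketch readFrom(DataInputStream in) throws IOException {
        return INSTANCE;
    }

    // Writes the instance, reads it back, and returns how many bytes were written.
    static int roundTripBytes() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                INSTANCE.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                if (readFrom(in) != INSTANCE) {
                    throw new IllegalStateException("round trip returned a different instance");
                }
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Under these assumptions the round trip writes zero bytes, which is why a stream-reading constructor would have nothing to read.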

colings86 · 2018-07-17T08:48:03Z

x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/indexlifecycle/MockStep.java

+    public String getWriteableName() {
+        return NAME;
+    }


Could you explain why this is needed?

leftover. will remove

colings86 · 2018-07-17T08:50:01Z

...ex-lifecycle/src/main/java/org/elasticsearch/xpack/indexlifecycle/IndexLifecycleService.java

-            if (lifecycleMetadata != null && event.changedCustomMetaDataSet().contains(IndexLifecycleMetadata.TYPE)) {
+            if (lifecycleMetadata != null
+                    && (event.changedCustomMetaDataSet().contains(IndexLifecycleMetadata.TYPE) ||
+                        lifecycleMetadata.getPolicies().size() != policyRegistry.getLifecyclePolicyMap().size())) {


I'm not sure about this as a way to check if things have changed? what if a policy has been added and another has been removed? The number of policies will be equal but we should still run update().

Is there a problem with always calling update() and just letting that method work out if things have changed?

> I'm not sure about this as a way to check if things have changed? what if a policy has been added and another has been removed? The number of policies will be equal but we should still run update().

In the case that policyRegistry is fairly up-to-date and anything in the policy metadata changes, this will still call update because of the original changedCustomMetaDataSet check. The size comparison is only there to detect that policyRegistry was zeroed out, either from being un-elected or from just recently being elected master.

> Is there a problem with always calling update() and just letting that method work out if things have changed?

I do not have a problem with this, I was just trying to preserve the event.changedCustomMetaDataSet().contains(IndexLifecycleMetadata.TYPE) check. If we do not have a problem with always updating, then this will simplify things for sure.

I see. The size check is a bit of a cryptic way to check if the policy registry needs to be bootstrapped (which is essentially what it's doing, from your explanation?). Can we instead have a method on the policy registry which you can call to determine that it needs to be bootstrapped, or alternatively we can destroy the registry completely when we are no longer the master and recreate it here if it's null?

> I do not have a problem with this, I was just trying to preserve the event.changedCustomMetaDataSet().contains(IndexLifecycleMetadata.TYPE)

Given that we pass the whole cluster state to the registry we could also move this check inside the update method where we have more information and always call update() here.

What do you think of going with the always-call-update approach instead, to simplify things even further?
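The concern about the size comparison can be shown with a small sketch. It uses plain maps in place of the real registry and metadata classes; `SketchRegistry`, `SizeCheckSketch`, and the policy names are hypothetical. It only shows that a size comparison on its own cannot see a swap of one policy for another, whereas an `update()` that diffs internally can. In the PR the size check was paired with the `changedCustomMetaDataSet` check, which this sketch deliberately leaves out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the policy registry; not the real class from the PR.
final class SketchRegistry {
    private final Map<String, String> policies = new HashMap<>();

    int size() {
        return policies.size();
    }

    // Brings the registry in line with the metadata; returns true if anything changed.
    boolean update(Map<String, String> metadataPolicies) {
        if (policies.equals(metadataPolicies)) {
            return false;
        }
        policies.clear();
        policies.putAll(metadataPolicies);
        return true;
    }
}

final class SizeCheckSketch {
    // The size-based guard from the diff, reduced to its essentials.
    static boolean sizeCheckSaysStale(Map<String, String> metadataPolicies, SketchRegistry registry) {
        return metadataPolicies.size() != registry.size();
    }

    public static void main(String[] args) {
        SketchRegistry registry = new SketchRegistry();
        registry.update(Map.of("hot-warm", "v1"));

        // One policy removed and another added: same count, different content.
        Map<String, String> metadata = Map.of("delete-after-30d", "v1");

        // prints "size check says stale: false"
        System.out.println("size check says stale: " + sizeCheckSaysStale(metadata, registry));
        // prints "update() found changes: true"
        System.out.println("update() found changes: " + registry.update(metadata));
    }
}
```

Letting `update()` decide keeps the caller free of any heuristic about when the registry might be out of date.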

colings86 · 2018-07-17T09:04:04Z

...cle/src/test/java/org/elasticsearch/xpack/indexlifecycle/IndexLifecycleInitialisationIT.java

+        }
+    }
+
+    public static class ObservableClusterStateWaitStep extends IndexLifecycleRunnerTests.MockClusterStateWaitStep


Does this need to extend IndexLifecycleRunnerTests.MockClusterStateWaitStep? I don't think it uses anything from it?

I will take another look. this may not be needed in the latest revision

talevy · 2018-07-17T17:51:26Z

...ex-lifecycle/src/main/java/org/elasticsearch/xpack/indexlifecycle/IndexLifecycleService.java

@@ -121,7 +123,7 @@ public void applyClusterState(ClusterChangedEvent event) {
        if (event.localNodeMaster()) { // only act if we are master, otherwise
                                       // keep idle until elected
            IndexLifecycleMetadata lifecycleMetadata = event.state().metaData().custom(IndexLifecycleMetadata.TYPE);
-            if (lifecycleMetadata != null && event.changedCustomMetaDataSet().contains(IndexLifecycleMetadata.TYPE)) {


Hey @jasontedor,

Previously, IndexLifecycleService was not updating its internal state when
a node was newly-elected master, and there were no changes to policies in the cluster-state.

To fix this, there are a few options, but two that I see are as follows:

1. Always attempt to call policyRegistry.update, since the real diff that matters is between the internal state of the registry and the current cluster state.

2. Upon un-election, clear the registry, and re-bootstrap it on the first cluster-state listener call that is triggered once it is master. (A version of this is what exists in the code now, but it should be cleaned up if this approach is taken.)

(2) has the benefit that it makes it clear that this instance is no longer master and should therefore forget any policies it once had, since it no longer keeps them up to date as changes occur. The downside is that this can result in weird behavior when a node is un-elected and then re-elected before the next state listener callback occurs.

(1) has the benefit that there do not seem to be any edge cases to worry about. The only downside is that the registry will be stale until the node becomes master again. This should be OK because the cluster-state applier that re-updates the registry will be called before the cluster-state listener callback re-launches the scheduled job and triggers policies.

Are there aspects of re-election and state management that these thoughts are missing?

Update: this inquiry was discussed in #32181, and #32212 was opened to address it.
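The failover scenario behind option (1) can be sketched as a toy simulation. Everything here is a simplification under stated assumptions: `SketchNode`, its `alwaysUpdate` flag, and the three-argument `applyClusterState` are hypothetical and only mirror the shape of the guard in the diff, not the real `IndexLifecycleService`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical node-local view of the policy registry.
final class SketchNode {
    final Map<String, String> registry = new HashMap<>();
    private final boolean alwaysUpdate;

    SketchNode(boolean alwaysUpdate) {
        this.alwaysUpdate = alwaysUpdate;
    }

    // Mirrors the shape of the guard: only the master acts on a cluster state.
    void applyClusterState(boolean localNodeMaster, boolean ilmMetadataChanged, Map<String, String> policies) {
        if (!localNodeMaster) {
            return; // keep idle until elected
        }
        if (alwaysUpdate || ilmMetadataChanged) {
            registry.clear();
            registry.putAll(policies);
        }
    }
}

final class FailoverSketch {
    public static void main(String[] args) {
        Map<String, String> policies = Map.of("hot-warm", "v1");
        SketchNode oldGuard = new SketchNode(false);
        SketchNode optionOne = new SketchNode(true);

        for (SketchNode node : new SketchNode[] {oldGuard, optionOne}) {
            // The policy is created while some other node is master.
            node.applyClusterState(false, true, policies);
            // This node is elected; the ILM metadata did not change in this state.
            node.applyClusterState(true, false, policies);
        }

        // prints "old guard registry: {}"
        System.out.println("old guard registry: " + oldGuard.registry);
        // prints "always-update registry: {hot-warm=v1}"
        System.out.println("always-update registry: " + optionOne.registry);
    }
}
```

With the original guard the newly elected master ends up with an empty registry, which matches the bug described in this PR; always updating fills it on the first applied state after election.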

talevy · 2018-07-18T22:10:38Z

Hey @colings86,

After giving this some thought, I think that the change to always update is still good independent of the safety of the policyRegistry in various situations.

I have decided to revert the game of "refreshing" the local state variables. Since I think this
PR has value in itself by un-muting the only integration tests we have around master failover, I think we can move the discussion of the safety of the registry, in general, to a separate issue which I have opened here #32181

talevy · 2018-07-19T22:19:18Z

Hi @colings86, I've re-requested review after adding the cleanup commit (15ed476) you suggested.

colings86

This looks good to me, but I'd still like @jasontedor to take a look to check he is happy with the general approach here. Maybe @dakrone could also take a look to see if the approach is ok?

dakrone

This approach seems reasonable to me, I left one minor comment otherwise

dakrone · 2018-07-31T20:29:51Z

...ex-lifecycle/src/test/java/org/elasticsearch/xpack/indexlifecycle/LockableLifecycleType.java

+import java.util.List;
+import java.util.Map;
+
+public class LockableLifecycleType implements LifecycleType {


Can you add javadocs please?

yup. thanks for the suggestion!

talevy · 2018-07-31T22:16:52Z

thanks for the review @dakrone!


talevy added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Jul 16, 2018

talevy requested a review from colings86 July 16, 2018 20:27

talevy added >bug >test Issues or PRs that are addressing/adding tests labels Jul 16, 2018

colings86 requested changes Jul 17, 2018


talevy added 3 commits July 17, 2018 08:21

cleanup

1732360

checkout index-lifecycle MockStep

bb6807a

remove unecessary line

a5f9696

talevy commented Jul 17, 2018


talevy added the review label Jul 17, 2018

talevy requested a review from jasontedor July 17, 2018 23:56

talevy added 2 commits July 18, 2018 15:00

Merge branch 'index-lifecycle' into ilm-master-failover

7ae3727

revert re-assignment of policy registry

4503c18

talevy changed the title ~~re-enable ILM integration tests and fix policyRegistry bug~~ re-enable ILM integration tests and fix policyRegistry update bug Jul 18, 2018

elasticmachine mentioned this pull request Jul 19, 2018

[meta] Index Lifecycle Management Plan #29823

Closed

talevy added 2 commits July 19, 2018 12:45

simply argument to IndexLifecycleMetadata in registry update

15ed476

Merge branch 'index-lifecycle' into ilm-master-failover

fc66661

talevy requested a review from colings86 July 19, 2018 22:17

colings86 approved these changes Jul 20, 2018


talevy requested review from jasontedor and removed request for jasontedor July 24, 2018 18:06

Merge branch 'index-lifecycle' into ilm-master-failover

8f6f5cf

talevy requested review from jasontedor and dakrone and removed request for jasontedor July 30, 2018 18:51

dakrone approved these changes Jul 31, 2018


talevy added 2 commits July 31, 2018 15:13

Merge branch 'index-lifecycle' into ilm-master-failover

1a5049b

add javadoc to LockableLifecycleType

70ceba1

talevy merged commit 304304f into elastic:index-lifecycle Aug 1, 2018

talevy deleted the ilm-master-failover branch August 1, 2018 04:06
