Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re-enable ILM integration tests and fix policyRegistry update bug #32108

Merged
merged 11 commits into from Aug 1, 2018

Conversation

talevy
Copy link
Contributor

@talevy talevy commented Jul 16, 2018

This PR re-introduces our ILM integration tests with mock steps
that we can control in the tests.

These tests uncovered a bug where the policy-steps-registry was
not being updated on newly elected masters when there were no
cluster-state changes to ILM metadata. The fix layed out cleans up
the registry/runner when a node is un-elected as master. It re-assigns
the class variables so that the existing runner/registry instances that
may be running can continue to do so in other threads, potentially.

This PR re-introduces our ILM integration tests with mock steps
that we can control in the tests.

These tests uncovered a bug where the policy-steps-registry was
not being updated on newly elected masters when there were no
cluster-state changes to ILM metadata. The fix layed out cleans up
the registry/runner when a node is un-elected as master. It re-assigns
the class variables so that the existing runner/registry instances that
may be running can continue to do so in other threads, potentially.
@talevy talevy added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Jul 16, 2018
@talevy talevy requested a review from colings86 July 16, 2018 20:27
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@talevy talevy added >bug >test Issues or PRs that are addressing/adding tests labels Jul 16, 2018
Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@talevy I left some comments. I also think @jasontedor should take a look at this as I know he had some thoughts on how it might work

import java.util.List;
import java.util.Map;

public class LockableLifecycleType implements LifecycleType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a bit strange to have this in the main code rather than test code since we only want to use it for tests. I can see why you've done this but I wonder if there is a way for us to only register this lifecycle type in tests so it doesn't need to live with the main code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact I see below that you register a plugin for the tests which itself registers this lifecycle type so maybe this class could already be moved to the test source folder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... I convinced myself I already moved this. will move

@Override
public void writeTo(StreamOutput out) {

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be wrong but I think we need a LockableLifecycleType(StreamInput in); constructor here for serialisation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered that, but it is never needed it seems.

Since the class itself has the logic to what to do with the method calls, there are no useful instance variables to serialize across

public String getWriteableName() {
return NAME;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why this is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover. will remove

if (lifecycleMetadata != null && event.changedCustomMetaDataSet().contains(IndexLifecycleMetadata.TYPE)) {
if (lifecycleMetadata != null
&& (event.changedCustomMetaDataSet().contains(IndexLifecycleMetadata.TYPE) ||
lifecycleMetadata.getPolicies().size() != policyRegistry.getLifecyclePolicyMap().size())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this as a way to check if things have changed? what if a policy has been added and another has been removed? The number of policies will be equal but we should still run update().

Is there a problem with always calling update() and just letting that method work out if things have changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this as a way to check if things have changed? what if a policy has been added and another has been removed? The number of policies will be equal but we should still run update().

In the case that policyRegistry is fairly up-to-date, and anything in the policymetadata changes, this will still call update because of the original changedcustomeMetaDataSet check. The size is only to reflect that policyRegistry was zeroed from being unelected or just recently elected to be master

Is there a problem with always calling update() and just letting that method work out if things have changed?

I do not have a problem with this, I was just trying to preserve the

event.changedCustomMetaDataSet().contains(IndexLifecycleMetadata.TYPE)

check. If we do not have a problem with always updating, then this will simplify things for sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The size check is a bit of a cryptic way to check if the policy registry needs to be bootstrapped (which is essentially what its doing form your explanation?). Can we instead have a method on the policy registry which you can call to determine that it needs to be bootstrapped, or alternatively we can destroy the registry completely when we are no longer the master and recreate it here if its null?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have a problem with this, I was just trying to preserve the

event.changedCustomMetaDataSet().contains(IndexLifecycleMetadata.TYPE)

Given that we pass the whole cluster state to the registry we could also move this check inside the update method where we have more information and always call update() here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think of going the always call update approach instead to simplify things even further?

}
}

public static class ObservableClusterStateWaitStep extends IndexLifecycleRunnerTests.MockClusterStateWaitStep
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to extend IndexLifecycleRunnerTests.MockClusterStateWaitStep? I don't think it uses anything from it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take another look. this may not be needed in the latest revision

@@ -121,7 +123,7 @@ public void applyClusterState(ClusterChangedEvent event) {
if (event.localNodeMaster()) { // only act if we are master, otherwise
// keep idle until elected
IndexLifecycleMetadata lifecycleMetadata = event.state().metaData().custom(IndexLifecycleMetadata.TYPE);
if (lifecycleMetadata != null && event.changedCustomMetaDataSet().contains(IndexLifecycleMetadata.TYPE)) {
Copy link
Contributor Author

@talevy talevy Jul 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jasontedor,

Previously, IndexLifecycleService was not updating its internal state when
a node was newly-elected master, and there were no changes to policies in the cluster-state.

To fix this, there are a few options, but two that I see are as follows:

  1. always attempt to call policyRegistry.update, Since the real diff that matters is between the internal state of the registry and the current cluster state.
  2. upon un-election, clear the registry, and re-bootstrap it on the first cluster-state listener call that is triggered once it is master. (a version of this is what exists in the code now, but should be cleaned up if this approach is taken)

(2) has the benefit that it makes it clear that this instance is no longer master and therefore should forget about any policies it once had since it is not keeping up to date with updating it anymore for when changes do occur. The downside here is that this can result in weird behavior for when nodes are un-elected, then elected, all before the next state listener callback occurs.

(1) has the benefit that there does not seem to be any edge cases to worry about. The only downside is that the object will be stale until it becomes master again. This should be OK because the cluster-state-applier that will re-update the registry will be called before the cluster-state-listener-callback re-launches the scheduled job and triggers policies.

Are there aspects of this that these thoughts are missing in the story of re-election and state management?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update to this inquiry has been discussed in #32181 and #32212 was opened to address it

@talevy talevy added the review label Jul 17, 2018
@talevy talevy requested a review from jasontedor July 17, 2018 23:56
@talevy talevy changed the title re-enable ILM integration tests and fix policyRegistry bug re-enable ILM integration tests and fix policyRegistry update bug Jul 18, 2018
@talevy
Copy link
Contributor Author

talevy commented Jul 18, 2018

Hey @colings86,

After giving this some thought, I think that the change to always update is still good independent of the safety of the policyRegistry in various situations.

I have decided to revert the game of "refreshing" the local state variables. Since I think this
PR has value in itself by un-muting the only integration tests we have around master failover, I think we can move the discussion of the safety of the registry, in general, to a separate issue which I have opened here #32181

@talevy talevy requested a review from colings86 July 19, 2018 22:17
@talevy
Copy link
Contributor Author

talevy commented Jul 19, 2018

Hi @colings86, I've re-requested review after adding the cleanup commit (15ed476) you suggested.

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me but I'd like @jasontedor to take a looks still to check he is happy with the general approach here. Maybe @dakrone could also take a look to see if the approach is ok?

@talevy talevy requested review from jasontedor and removed request for jasontedor July 24, 2018 18:06
@talevy talevy requested review from jasontedor and dakrone and removed request for jasontedor July 30, 2018 18:51
Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach seems reasonable to me, I left one minor comment otherwise

import java.util.List;
import java.util.Map;

public class LockableLifecycleType implements LifecycleType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add javadocs please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup. thanks for the suggestion!

@talevy
Copy link
Contributor Author

talevy commented Jul 31, 2018

thanks for the review @dakrone!

@talevy talevy merged commit 304304f into elastic:index-lifecycle Aug 1, 2018
@talevy talevy deleted the ilm-master-failover branch August 1, 2018 04:06
jasontedor pushed a commit that referenced this pull request Aug 17, 2018
…2108)

This PR re-introduces our ILM integration tests with mock steps
that we can control in the tests.

These tests uncovered a bug where the policy-steps-registry was
not being updated on newly elected masters when there were no
cluster-state changes to ILM metadata. The fix layed out cleans up
the registry/runner when a node is un-elected as master. It re-assigns
the class variables so that the existing runner/registry instances that
may be running can continue to do so in other threads, potentially.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management >test Issues or PRs that are addressing/adding tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants