Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ILM use Priority.IMMEDIATE for stop ILM cluster update #54909

Merged

Conversation

andreidan
Copy link
Contributor

@andreidan andreidan commented Apr 7, 2020

This changes the priority of the cluster state update that stops ILM
altogether to IMMEDIATE. We've chosen to change this as it can be useful to
temporarily stop ILM if a cluster is overwhelmed, but a NORMAL
priority can see the "stop ilm update" not make it up the tasks queue.

On the same note, we're keeping the start ILM cluster update priority
to NORMAL on purpose such that we only start ILM if the cluster can
handle it.

This changes the priority of the cluster state update that stops ILM
altogether to `URGENT`. We've chosen to change this as it can be useful to
temporarily stop ILM if a cluster is overwhelemed, but a `NORMAL`
priority can see the "stop ilm update" not make it up the tasks queue.

On the same note, we're keeping the `start ILM` cluster update priority
to `NORMAL` on purpose such that we only start `ILM` if the cluster can
handle it.
@andreidan andreidan added :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.0.0 v7.8.0 v7.7.1 labels Apr 7, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@andreidan
Copy link
Contributor Author

@elasticmachine update branch

@andreidan andreidan requested a review from dakrone April 7, 2020 17:08
Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andreidan, this needs to update the second half of the shutdown to also use the priority though, so both parts are urgent

@DaveCTurner
Copy link
Contributor

I suggest IMMEDIATE rather than URGENT for this task. I don't know of any starvation issues at URGENT priority but my concern is that we may uncover such an issue in future which prevents us from stopping ILM in a timely fashion. This action is very lightweight, only available to admins, rarely called, but useful when the cluster is under severe pressure, and I can't think of a reason for spamming it that isn't obviously abusive.

Please sprinkle some comments throughout these cluster state updates indicating that they may be running at IMMEDIATE priority, as a defence against accidentally making them more heavyweight in future.

@DaveCTurner
Copy link
Contributor

Also FWIW you can test that the task really runs at a high enough priority with something like this:

final AtomicBoolean keepSubmittingTasks = new AtomicBoolean(true);
final ClusterService clusterService = internalCluster().getInstance(ClusterService.class, internalCluster().getMasterName());
clusterService.submitStateUpdateTask("looping task", new ClusterStateUpdateTask(Priority.LOW) {
@Override
public ClusterState execute(ClusterState currentState) {
return currentState;
}
@Override
public void onFailure(String source, Exception e) {
throw new AssertionError(source, e);
}
@Override
public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
if (keepSubmittingTasks.get()) {
clusterService.submitStateUpdateTask("looping task", this);
}
}
});

(replace Priority.LOW with Priority.IMMEDIATE) -- even if the cluster is processing this looping task it should still process other IMMEDIATE tasks but nothing lower. I prefer that to simply asserting that it returns the right .priority().

@andreidan andreidan changed the title ILM use Priority.URGENT for stop ILM cluster update ILM use Priority.IMMEDIATE for stop ILM cluster update Apr 8, 2020
@andreidan
Copy link
Contributor Author

@DaveCTurner thanks for pointing out the ClusterHealthIT. I wasn't aware of it.

If you don't have a strong opinion on adding an integration test, I'd rather not, as ILM is a client of the ClusterService, using the ClusterService API according to the specification (I think ensuring the order and fairness of task execution is something that we should, and probably are, test as part of the cluster module - I think the PrioritizedExecutorsTests is making sure the executor invariants hold)

@andreidan andreidan requested a review from dakrone April 8, 2020 13:11
Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Andrei

@andreidan andreidan merged commit d67df3a into elastic:master Apr 8, 2020
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2020
This changes the priority of the cluster state update that stops ILM
altogether to `IMMEDIATE`. We've chosen to change this as it can be useful to
temporarily stop ILM if a cluster is overwhelmed, but a `NORMAL`
priority can see the "stop ILM update" not make it up the tasks queue.

On the same note, we're keeping the `start ILM` cluster update priority
to `NORMAL` on purpose such that we only start `ILM` if the cluster can
handle it.

(cherry picked from commit d67df3a)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2020
This changes the priority of the cluster state update that stops ILM
altogether to `IMMEDIATE`. We've chosen to change this as it can be useful to
temporarily stop ILM if a cluster is overwhelmed, but a `NORMAL`
priority can see the "stop ILM update" not make it up the tasks queue.

On the same note, we're keeping the `start ILM` cluster update priority
to `NORMAL` on purpose such that we only start `ILM` if the cluster can
handle it.

(cherry picked from commit d67df3a)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2020
This changes the priority of the cluster state update that stops ILM
altogether to `IMMEDIATE`. We've chosen to change this as it can be useful to
temporarily stop ILM if a cluster is overwhelmed, but a `NORMAL`
priority can see the "stop ILM update" not make it up the tasks queue.

On the same note, we're keeping the `start ILM` cluster update priority
to `NORMAL` on purpose such that we only start `ILM` if the cluster can
handle it.

(cherry picked from commit d67df3a)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit that referenced this pull request Apr 11, 2020
…#55017)

* ILM use Priority.IMMEDIATE for stop ILM cluster update (#54909)

This changes the priority of the cluster state update that stops ILM
altogether to `IMMEDIATE`. We've chosen to change this as it can be useful to
temporarily stop ILM if a cluster is overwhelmed, but a `NORMAL`
priority can see the "stop ILM update" not make it up the tasks queue.

On the same note, we're keeping the `start ILM` cluster update priority
to `NORMAL` on purpose such that we only start `ILM` if the cluster can
handle it.

(cherry picked from commit d67df3a)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit that referenced this pull request Apr 11, 2020
…#55018)

* ILM use Priority.IMMEDIATE for stop ILM cluster update (#54909)

This changes the priority of the cluster state update that stops ILM
altogether to `IMMEDIATE`. We've chosen to change this as it can be useful to
temporarily stop ILM if a cluster is overwhelmed, but a `NORMAL`
priority can see the "stop ILM update" not make it up the tasks queue.

On the same note, we're keeping the `start ILM` cluster update priority
to `NORMAL` on purpose such that we only start `ILM` if the cluster can
handle it.

(cherry picked from commit d67df3a)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
andreidan added a commit that referenced this pull request Apr 11, 2020
…#55016)

This changes the priority of the cluster state update that stops ILM
altogether to `IMMEDIATE`. We've chosen to change this as it can be useful to
temporarily stop ILM if a cluster is overwhelmed, but a `NORMAL`
priority can see the "stop ILM update" not make it up the tasks queue.

On the same note, we're keeping the `start ILM` cluster update priority
to `NORMAL` on purpose such that we only start `ILM` if the cluster can
handle it.

(cherry picked from commit d67df3a)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
@bpintea bpintea added v7.7.0 and removed v7.7.1 labels Apr 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants