
[Transform] Integrate transforms with the node shutdown API #100891

Closed · Tracked by #107251 · Fixed by #107358

droberts195 opened this issue Oct 16, 2023 · 4 comments

Labels: :ml/Transform (Transform), Team:ML (Meta label for the ML team)

Comments

@droberts195 (Contributor)

Transforms currently do not take into account whether the node they're running on has been notified that it is shutting down.

In stateful Elasticsearch this never seemed to matter. I don't remember ever seeing a report of a failed transform that could be attributed to its node shutting down.

In stateless Elasticsearch we currently see many transform failures during rolling restarts of clusters. There are a variety of reasons, and some of the failures occur after moving to the new node rather than while running on the node that is shutting down.

Given that we haven't seen problems in the current GA product, and don't want to disturb the code too much, a compromise solution for dealing with node shutdowns on transform nodes would be: if a transform suffers a condition that causes it to call fail on a node that is shutting down, then instead of actually failing, the transform should mark itself locally complete so that the persistent tasks framework relocates it to a different node. If the cause of failure was systemic to the cluster, it will fail again on the new node after redoing its most recent work. But if the cause of failure was related to the node shutdown, the transform should pick up its work successfully on the new node, avoiding a spuriously failed transform.
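The reactive behavior proposed above can be sketched as a small decision: on failure, check whether the local node is shutting down, and if so relocate instead of failing. This is a standalone illustration, not the actual Elasticsearch code; the names `TransformFailureHandler`, `Outcome`, and `onFailure` are hypothetical stand-ins (the linked `AllocatedPersistentTask` code provides the real mechanism).

```java
// Hypothetical sketch of the proposed reactive handling; names are
// illustrative, not the real Elasticsearch API.
public class TransformFailureHandler {

    public enum Outcome { FAILED, RELOCATED }

    private final boolean nodeShuttingDown;

    public TransformFailureHandler(boolean nodeShuttingDown) {
        this.nodeShuttingDown = nodeShuttingDown;
    }

    // If the local node is shutting down, mark the task locally complete
    // (abort) instead of failing, so the persistent tasks framework
    // reassigns it to another node; otherwise fail as before.
    public Outcome onFailure(Exception cause) {
        if (nodeShuttingDown) {
            return Outcome.RELOCATED;
        }
        return Outcome.FAILED;
    }
}
```

A systemic failure would simply trigger `onFailure` again on the new node, which matches the "fail again after redoing its most recent work" behavior described above.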

@droberts195 droberts195 added the :ml/Transform Transform label Oct 16, 2023
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Oct 16, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@prwhelan (Member)

> instead of actually failing the transform should just mark itself locally complete so that the persistent tasks framework relocates it to a different node

https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/persistent/AllocatedPersistentTask.java#L144

@prwhelan (Member)

> Given that we haven't seen problems in current GA product, and don't want to disturb the code too much, a compromise solution to dealing with node shutdowns on transform nodes would be to say that if a transform suffers a condition that causes it to call fail on a node that is shutting down, instead of actually failing the transform should just mark itself locally complete so that the persistent tasks framework relocates it to a different node. If the cause of failure was something systemic to the cluster then it will fail again on the new node after redoing its most recent work. But if the cause of failure was something related to the node shutdown then the transform should pick up its work successfully on the new node, thus avoiding a spuriously failed transform.

We'll go with this option. The alternative is to mimic what ML and health do: a proactive graceful shutdown that listens for node shutdowns via cluster state changes (ML example, Health example). We already have a TransformClusterStateListener and may be able to reuse it.

The benefit of the alternative is that it could prevent errors from occurring in the first place, but the cost is a potentially intrusive change that impacts GA. We'd need something reactive anyway in case the proactive approach fails, so we'll start with the reactive option.
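The proactive alternative mentioned above could look roughly like this: a listener invoked on cluster state changes that starts draining local transform tasks once the local node appears in the shutdown metadata. This is a standalone illustration under assumed names; `TransformShutdownListener`, `clusterChanged(Set<String>)`, and `isDraining` are hypothetical simplifications of the real `ClusterStateListener`/`NodesShutdownMetadata` machinery.

```java
// Hypothetical sketch of the proactive option: react to cluster state
// changes and drain local transforms when this node is marked for shutdown.
import java.util.Set;

public class TransformShutdownListener {

    private final String localNodeId;
    private boolean drainingStarted = false;

    public TransformShutdownListener(String localNodeId) {
        this.localNodeId = localNodeId;
    }

    // Invoked on every (simulated) cluster state change with the set of
    // node IDs currently registered in the shutdown metadata.
    public void clusterChanged(Set<String> shuttingDownNodeIds) {
        if (!drainingStarted && shuttingDownNodeIds.contains(localNodeId)) {
            drainingStarted = true;
            // In a real implementation this would gracefully stop local
            // transform tasks so they relocate before the node goes away.
        }
    }

    public boolean isDraining() {
        return drainingStarted;
    }
}
```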

prwhelan added a commit to prwhelan/elasticsearch that referenced this issue Apr 11, 2024
Transforms continue to run even when a node is shutting down. This may
lead to a transform failing and putting itself into a failed state,
which will prevent it from restarting when the node comes back online.

The transform will now abort rather than fail, which puts it into a started state. When the node comes back online, or another node in the cluster starts the transform, the transform will pick up from its last successfully saved state and checkpoint.

Close elastic#100891
@prwhelan (Member)

2 out of 49 WARNs over the last 24 hours were due to:

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of TimedRunnable{original=org.elasticsearch.action.bulk.TransportBulkAction$2/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/org.elasticsearch.action.ActionListenerImplementations$MappedActionListener/org.elasticsearch.action.support.ContextPreservingActionListener/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener/org.elasticsearch.action.ActionListenerImplementations$MappedActionListener/org.elasticsearch.action.support.ContextPreservingActionListener/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/org.elasticsearch.tasks.TaskManager$1{org.elasticsearch.action.support.ContextPreservingActionListener/WrappedActionListener{org.elasticsearch.xpack.transform.persistence.IndexBasedTransformConfigManager$$Lambda/0x000000c0028f9690@1d4d01c}
...
org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@77480751[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]] (shutdown)

This looks like a node-shutdown effect, so the fix for this issue will prevent it.

The other 47 are due to NodeClosedException, which retried and succeeded. There is a chance that this could move a transform into a failed state if we exhaust the number of retries, but there is no indication of that happening in the last 90 days. If it does happen, we can change the retry logic to ignore failures while nodes are entering or exiting the cluster.
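The retry tweak floated above amounts to capping retries for ordinary failures while not counting attempts that fail because of cluster topology changes. A minimal sketch, assuming a hypothetical `RetryPolicy` with an illustrative `NodeClosedLikeException` standing in for the real `NodeClosedException`:

```java
// Hypothetical sketch: bounded retries for ordinary failures, but
// node-closed-style failures don't consume the retry budget.
public class RetryPolicy {

    public static class NodeClosedLikeException extends RuntimeException {}

    private final int maxFailureRetries;
    private int countedFailures = 0;

    public RetryPolicy(int maxFailureRetries) {
        this.maxFailureRetries = maxFailureRetries;
    }

    // Returns true if the task should retry after this failure.
    public boolean shouldRetry(Exception cause) {
        if (cause instanceof NodeClosedLikeException) {
            return true; // transient topology change: retry without counting
        }
        return ++countedFailures <= maxFailureRetries;
    }
}
```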

prwhelan added a commit that referenced this issue Apr 15, 2024
(Same commit message as the Apr 11 commit above.)

Close #100891