
[Transform] Integrate transforms with the node shutdown API #100891

Closed · Tracked by #107251 · Fixed by #107358

droberts195 opened this issue Oct 16, 2023 · 4 comments

Labels: :ml/Transform (Transform), Team:ML (Meta label for the ML team)

Comments

@droberts195 (Contributor)

Transforms currently do not take into account whether the node they're running on has been notified that it is shutting down.

In stateful Elasticsearch this never seemed to matter. I don't remember ever seeing a report of a failed transform that could be attributed to its node shutting down.

In stateless Elasticsearch we currently see many transform failures during rolling restarts of clusters. There are a variety of reasons, and some of the failures occur after moving to the new node rather than while running on the node that is shutting down.

Given that we haven't seen problems in the current GA product, and don't want to disturb the code too much, a compromise solution for dealing with node shutdowns on transform nodes would be: if a transform suffers a condition that causes it to call fail on a node that is shutting down, then instead of actually failing, the transform should mark itself locally complete so that the persistent tasks framework relocates it to a different node. If the cause of failure was systemic to the cluster, it will fail again on the new node after redoing its most recent work. But if the cause of failure was related to the node shutdown, the transform should pick up its work successfully on the new node, avoiding a spuriously failed transform.
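The reactive behavior proposed above can be sketched as a small decision: on failure, check whether the local node is shutting down, and if so relocate instead of failing. This is a standalone illustration, not the actual Elasticsearch code; the names `TransformFailureHandler`, `Outcome`, and `onFailure` are hypothetical stand-ins (the linked `AllocatedPersistentTask` code provides the real mechanism).

```java
// Hypothetical sketch of the proposed reactive handling; names are
// illustrative, not the real Elasticsearch API.
public class TransformFailureHandler {

    public enum Outcome { FAILED, RELOCATED }

    private final boolean nodeShuttingDown;

    public TransformFailureHandler(boolean nodeShuttingDown) {
        this.nodeShuttingDown = nodeShuttingDown;
    }

    // If the local node is shutting down, mark the task locally complete
    // (abort) instead of failing, so the persistent tasks framework
    // reassigns it to another node; otherwise fail as before.
    public Outcome onFailure(Exception cause) {
        if (nodeShuttingDown) {
            return Outcome.RELOCATED;
        }
        return Outcome.FAILED;
    }
}
```

A systemic failure would simply trigger `onFailure` again on the new node, which matches the "fail again after redoing its most recent work" behavior described above.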

@droberts195 droberts195 added the :ml/Transform Transform label Oct 16, 2023
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Oct 16, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@prwhelan (Member)

> instead of actually failing the transform should just mark itself locally complete so that the persistent tasks framework relocates it to a different node

https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/persistent/AllocatedPersistentTask.java#L144

@prwhelan (Member)

> Given that we haven't seen problems in current GA product, and don't want to disturb the code too much, a compromise solution to dealing with node shutdowns on transform nodes would be to say that if a transform suffers a condition that causes it to call fail on a node that is shutting down, instead of actually failing the transform should just mark itself locally complete so that the persistent tasks framework relocates it to a different node. If the cause of failure was something systemic to the cluster then it will fail again on the new node after redoing its most recent work. But if the cause of failure was something related to the node shutdown then the transform should pick up its work successfully on the new node, thus avoiding a spuriously failed transform.

We'll go with this option. The alternative is to mimic what ML and health do: a proactive graceful shutdown that listens for node shutdowns via cluster state changes (ML example, Health example). We already have a TransformClusterStateListener and may be able to reuse it.

The benefit of the alternative is that it could prevent errors from occurring in the first place, but the cost is a potentially intrusive change that impacts GA. We'd need something reactive anyway in case the proactive approach fails, so we'll start with the reactive option.
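The proactive alternative mentioned above could look roughly like this: a listener invoked on cluster state changes that starts draining local transform tasks once the local node appears in the shutdown metadata. This is a standalone illustration under assumed names; `TransformShutdownListener`, `clusterChanged(Set<String>)`, and `isDraining` are hypothetical simplifications of the real `ClusterStateListener`/`NodesShutdownMetadata` machinery.

```java
// Hypothetical sketch of the proactive option: react to cluster state
// changes and drain local transforms when this node is marked for shutdown.
import java.util.Set;

public class TransformShutdownListener {

    private final String localNodeId;
    private boolean drainingStarted = false;

    public TransformShutdownListener(String localNodeId) {
        this.localNodeId = localNodeId;
    }

    // Invoked on every (simulated) cluster state change with the set of
    // node IDs currently registered in the shutdown metadata.
    public void clusterChanged(Set<String> shuttingDownNodeIds) {
        if (!drainingStarted && shuttingDownNodeIds.contains(localNodeId)) {
            drainingStarted = true;
            // In a real implementation this would gracefully stop local
            // transform tasks so they relocate before the node goes away.
        }
    }

    public boolean isDraining() {
        return drainingStarted;
    }
}
```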

prwhelan added a commit to prwhelan/elasticsearch that referenced this issue Apr 11, 2024
Transforms continue to run even when a node is shutting down. This may
lead to a transform failing and putting itself into a failed state,
which will prevent it from restarting when the node comes back online.

The transform will now abort rather than fail, which puts it into a started state. When the node comes back online, or another node in the cluster starts the transform, the transform will pick up from its last successfully saved state and checkpoint.

Close elastic#100891
@prwhelan (Member)

2 out of 49 WARNs over the last 24 hours were due to:

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of TimedRunnable{original=org.elasticsearch.action.bulk.TransportBulkAction$2/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/org.elasticsearch.action.ActionListenerImplementations$MappedActionListener/org.elasticsearch.action.support.ContextPreservingActionListener/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener/org.elasticsearch.action.ActionListenerImplementations$MappedActionListener/org.elasticsearch.action.support.ContextPreservingActionListener/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/org.elasticsearch.tasks.TaskManager$1{org.elasticsearch.action.support.ContextPreservingActionListener/WrappedActionListener{org.elasticsearch.xpack.transform.persistence.IndexBasedTransformConfigManager$$Lambda/0x000000c0028f9690@1d4d01c}
...
org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@77480751[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]] (shutdown)

This looks like a node-shutdown effect, so the fix for this issue will prevent it.

The other 47 are due to NodeClosedException, which retried and succeeded. There is a chance that this could move a transform into a failed state if we exhaust the number of retries, but there is no indication of that happening in the last 90 days. If it does happen, we can change the retry logic to ignore failures while nodes are entering or exiting the cluster.
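The retry tweak floated above amounts to capping retries for ordinary failures while not counting attempts that fail because of cluster topology changes. A minimal sketch, assuming a hypothetical `RetryPolicy` with an illustrative `NodeClosedLikeException` standing in for the real `NodeClosedException`:

```java
// Hypothetical sketch: bounded retries for ordinary failures, but
// node-closed-style failures don't consume the retry budget.
public class RetryPolicy {

    public static class NodeClosedLikeException extends RuntimeException {}

    private final int maxFailureRetries;
    private int countedFailures = 0;

    public RetryPolicy(int maxFailureRetries) {
        this.maxFailureRetries = maxFailureRetries;
    }

    // Returns true if the task should retry after this failure.
    public boolean shouldRetry(Exception cause) {
        if (cause instanceof NodeClosedLikeException) {
            return true; // transient topology change: retry without counting
        }
        return ++countedFailures <= maxFailureRetries;
    }
}
```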

prwhelan added a commit that referenced this issue Apr 15, 2024
(Same commit message as the Apr 11 commit above.)

Close #100891