[Transform] Integrate transforms with the node shutdown API #100891
Comments
Pinging @elastic/ml-core (Team:ML)
We'll go with this option. The alternative is to mimic what ML and health do: a proactive graceful shutdown that listens for node shutdowns on cluster changes (ML Example, Health example). We already have a TransformClusterStateListener and may be able to reuse it. The benefit of the alternative is that it could prevent errors from occurring in the first place; the cost is a potentially intrusive change that impacts GA. We'd need something reactive anyway in case the proactive approach fails, so we'll go with the reactive option first.
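For comparison, here is a rough sketch of what the proactive alternative might look like. It is not the ML or health implementation, and it does not use TransformClusterStateListener. Every name in it (`ProactiveShutdownSketch`, `RunningTransform`, `ShutdownListener`, `onClusterChange`) is hypothetical and exists only to illustrate the idea of reacting to a shutdown marker before any error occurs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these types are Elasticsearch classes.
public class ProactiveShutdownSketch {

    /** Stand-in for a transform currently running on the local node. */
    public interface RunningTransform {
        String id();
        void stopGracefully();
    }

    /** Reacts to cluster changes and stops local transforms once the local node is marked for shutdown. */
    public static final class ShutdownListener {
        private final String localNodeId;
        private final List<RunningTransform> transforms;
        private boolean alreadyHandled = false;

        public ShutdownListener(String localNodeId, List<RunningTransform> transforms) {
            this.localNodeId = localNodeId;
            this.transforms = transforms;
        }

        /**
         * Assumed to be called on every cluster change with the ids of nodes
         * currently marked as shutting down. Returns the ids of the transforms
         * stopped by this call.
         */
        public List<String> onClusterChange(Set<String> nodesShuttingDown) {
            List<String> stopped = new ArrayList<>();
            // Act only once, and only when the local node is the one shutting down.
            if (alreadyHandled || !nodesShuttingDown.contains(localNodeId)) {
                return stopped;
            }
            alreadyHandled = true;
            for (RunningTransform transform : transforms) {
                transform.stopGracefully();
                stopped.add(transform.id());
            }
            return stopped;
        }
    }
}
```

The sketch also shows why a reactive fallback is still needed: a transform can hit an error between the shutdown marker being set and the listener being invoked.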
Transforms continue to run even when a node is shutting down. This can cause a transform to fail and put itself into a failed state, which prevents it from restarting when the node comes back online. The transform will now abort rather than fail, which leaves it in a started state. When the node comes back online, or another node in the cluster starts the transform, the transform will pick up from its last successfully saved state and checkpoint. Close elastic#100891
2 out of 49 WARNs over the last 24 hours were due to
This looks like a node shutdown, so closing this issue will prevent it. The other 47 are due to
Transforms currently do not take account of whether the node they're running on has been notified that it is shutting down.
In stateful Elasticsearch this never seemed to matter. I don't remember ever seeing a report of a failed transform that could be attributed to its node shutting down.
In stateless Elasticsearch we currently see many transform failures during rolling restarts of clusters. There are a variety of reasons, and some of the failures occur after moving to the new node rather than while running on the node that is shutting down.
Given that we haven't seen problems in the current GA product, and don't want to disturb the code too much, a compromise solution for dealing with node shutdowns on transform nodes would be this: if a transform suffers a condition that causes it to call `fail` on a node that is shutting down, then instead of actually failing, the transform should just mark itself locally complete so that the persistent tasks framework relocates it to a different node. If the cause of failure was something systemic to the cluster then it will fail again on the new node after redoing its most recent work. But if the cause of failure was related to the node shutdown then the transform should pick up its work successfully on the new node, thus avoiding a spuriously failed transform.
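The decision described above can be sketched roughly as follows. This is a minimal illustration, not the actual transform code. `TransformFailureSketch`, `ShutdownView`, `TransformTask`, and `Outcome` are hypothetical names, and the persistent tasks framework is represented only by the returned outcome.

```java
import java.util.Set;

// Hypothetical sketch: none of these types are Elasticsearch classes.
public class TransformFailureSketch {

    /** What the task ends up doing when it hits a condition that would call fail. */
    public enum Outcome { FAILED, ABORTED_FOR_RELOCATION }

    /** Stand-in for the node's knowledge of which nodes are shutting down. */
    public interface ShutdownView {
        boolean isShuttingDown(String nodeId);
    }

    public static final class TransformTask {
        private final String nodeId;
        private final ShutdownView shutdownView;
        private String state = "started";

        public TransformTask(String nodeId, ShutdownView shutdownView) {
            this.nodeId = nodeId;
            this.shutdownView = shutdownView;
        }

        /** Reactive check: only consults the shutdown status once a failure has occurred. */
        public Outcome fail(String reason) {
            if (shutdownView.isShuttingDown(nodeId)) {
                // Leave the state as "started" and end the local task so that
                // the task can be reassigned to another node.
                return Outcome.ABORTED_FOR_RELOCATION;
            }
            state = "failed";
            return Outcome.FAILED;
        }

        public String state() {
            return state;
        }
    }

    public static void main(String[] args) {
        Set<String> shuttingDown = Set.of("node-1");
        ShutdownView view = shuttingDown::contains;

        TransformTask onDyingNode = new TransformTask("node-1", view);
        TransformTask onHealthyNode = new TransformTask("node-2", view);

        System.out.println(onDyingNode.fail("search failed") + " " + onDyingNode.state());
        System.out.println(onHealthyNode.fail("search failed") + " " + onHealthyNode.state());
    }
}
```

If the failure is systemic, the relocated task would simply take the `FAILED` branch on its new node, which matches the behaviour described in the issue.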