Make SLM Tasks Use Infinite Timeout for Master Requests #72085
Conversation
No point in failing SLM tasks on slow masters. Using 30s timeouts likely leads to many needless SLM run failures when the master is temporarily busy, which is less than ideal, especially when snapshot or retention task frequencies are low.
Pinging @elastic/es-core-features (Team:Core/Features)
LGTM
x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SnapshotRetentionTask.java
Thanks David!
```diff
 String snapshotName = maybeMetadata.map(policyMetadata -> {
-    CreateSnapshotRequest request = policyMetadata.getPolicy().toRequest();
+    // don't time out on this request to not produce failed SLM runs in case of a temporarily slow master node
+    CreateSnapshotRequest request = policyMetadata.getPolicy().toRequest().masterNodeTimeout(TimeValue.MAX_VALUE);
```
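The merged diff above pins the master timeout on the create-snapshot request. A minimal sketch of the same pattern on the retention side (SnapshotRetentionTask.java, also touched in this PR per the resolved comment above), assuming the retention task issues standard master-node requests such as GetSnapshotsRequest; import paths vary by Elasticsearch version, and the repository name is illustrative:

```java
import org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsRequest;
import org.elasticsearch.common.unit.TimeValue; // org.elasticsearch.core.TimeValue in newer versions

// Illustrative sketch, not the PR's exact diff: opt a retention-side
// master-node request out of the default 30s master timeout the same way.
GetSnapshotsRequest getSnapshotsRequest = new GetSnapshotsRequest("my-repository")
    .masterNodeTimeout(TimeValue.MAX_VALUE);
```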
I am wary of infinite timeouts just because we should be able to come up with a reasonable time that a request should either succeed or fail, or else SLM should actually fail rather than waiting forever.
I know this is already merged, but what about just setting a reasonably long timeout (something like 2 hours, or even 24 hours) instead?
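For concreteness, a sketch of that suggested alternative (not what was merged), reusing the policyMetadata context from the diff above; the 24-hour figure comes from the comment:

```java
// Hypothetical variant floated in review: a long but finite master timeout
// instead of TimeValue.MAX_VALUE.
CreateSnapshotRequest request = policyMetadata.getPolicy()
    .toRequest()
    .masterNodeTimeout(TimeValue.timeValueHours(24));
```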
I think a timeout here is worse than patiently letting things complete in their own time. Timing out and retrying the thing that just timed out is particularly bad.
In this case, the timeout only applies to finding the master and processing the cluster state update that starts the snapshot. After that, we could already be waiting arbitrarily long. It's definitely a bug for us to take minutes or hours to process that first cluster state update, but it's almost certainly not a bug in SLM or snapshotting, and it doesn't make much sense to me to give up and retry (from the back of the queue) after a timeout elapses. We may as well stay in line and know that the master will get around to this eventually.
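A sketch to make that split concrete, using the public create-snapshot request API; whether SLM waits for completion on this path is an assumption made here purely for illustration:

```java
CreateSnapshotRequest request = new CreateSnapshotRequest("my-repo", "my-snap");
// masterNodeTimeout bounds only master discovery plus the initial
// cluster-state update that registers the snapshot...
request.masterNodeTimeout(TimeValue.MAX_VALUE);
// ...while the snapshot itself can already run arbitrarily long after that
// update is processed; waiting for completion is unbounded either way.
request.waitForCompletion(true);
```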
Do we retry regardless of whether the previous task completed or not? If so, do we have a mechanism to prevent too many of these jobs from piling up?
> Do we retry regardless of whether the previous task completed or not? If so, do we have a mechanism to prevent too many of these jobs from piling up?
We have it on our list to discuss SLM retries and what we want to do in the event that an SLM snapshot fails, so it's still under discussion.
Cross-linking: #70587
Same as #72085 but for ILM. Having a timeout on these internal "requests" only adds more noise when the master is already slow, since timed-out steps trigger moves to the error step. It also seems safe to remove the timeout setting outright, as it was not used anywhere and, as far as I can tell, was never documented.