Make SLM Tasks Use Infinite Timeout for Master Requests #72085
Conversation
No point in failing SLM tasks on slow masters. Using 30s timeouts likely leads to many needless SLM run failures when the master is temporarily busy, which is less than ideal, especially when snapshot or retention task frequencies are low.
Pinging @elastic/es-core-features (Team:Core/Features)
LGTM
x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SnapshotRetentionTask.java
Thanks David!
```diff
 String snapshotName = maybeMetadata.map(policyMetadata -> {
-    CreateSnapshotRequest request = policyMetadata.getPolicy().toRequest();
+    // don't time out on this request to not produce failed SLM runs in case of a temporarily slow master node
+    CreateSnapshotRequest request = policyMetadata.getPolicy().toRequest().masterNodeTimeout(TimeValue.MAX_VALUE);
```
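The merged diff above pins the master timeout on the create-snapshot request. A minimal sketch of the same pattern on the retention side (SnapshotRetentionTask.java, also touched in this PR per the resolved comment above), assuming the retention task issues standard master-node requests such as GetSnapshotsRequest; import paths vary by Elasticsearch version, and the repository name is illustrative:

```java
import org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsRequest;
import org.elasticsearch.common.unit.TimeValue; // org.elasticsearch.core.TimeValue in newer versions

// Illustrative sketch, not the PR's exact diff: opt a retention-side
// master-node request out of the default 30s master timeout the same way.
GetSnapshotsRequest getSnapshotsRequest = new GetSnapshotsRequest("my-repository")
    .masterNodeTimeout(TimeValue.MAX_VALUE);
```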
I am wary of infinite timeouts just because we should be able to come up with a reasonable time that a request should either succeed or fail, or else SLM should actually fail rather than waiting forever.
I know this is already merged, but what about just setting a reasonably long timeout (something like 2 hours, or even 24 hours) instead?
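For concreteness, a sketch of that suggested alternative (not what was merged), reusing the policyMetadata context from the diff above; the 24-hour figure comes from the comment:

```java
// Hypothetical variant floated in review: a long but finite master timeout
// instead of TimeValue.MAX_VALUE.
CreateSnapshotRequest request = policyMetadata.getPolicy()
    .toRequest()
    .masterNodeTimeout(TimeValue.timeValueHours(24));
```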
I think a timeout here is worse than patiently letting things complete in their own time. Timing out and retrying the thing that just timed out is particularly bad.
In this case, the timeout only applies to finding the master and processing the cluster state update that starts the snapshot. After that, we could already be waiting arbitrarily long. It's definitely a bug for us to take minutes or hours to process that first cluster state update, but it's almost certainly not a bug in SLM or snapshotting, and it doesn't make much sense to me to give up and retry (from the back of the queue) after a timeout elapses. We may as well stay in line and know that the master will get around to this eventually.
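A sketch to make that split concrete, using the public create-snapshot request API; whether SLM waits for completion on this path is an assumption made here purely for illustration:

```java
CreateSnapshotRequest request = new CreateSnapshotRequest("my-repo", "my-snap");
// masterNodeTimeout bounds only master discovery plus the initial
// cluster-state update that registers the snapshot...
request.masterNodeTimeout(TimeValue.MAX_VALUE);
// ...while the snapshot itself can already run arbitrarily long after that
// update is processed; waiting for completion is unbounded either way.
request.waitForCompletion(true);
```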
Do we retry regardless of whether the previous task completed or not? If so, do we have a mechanism to prevent too many of these jobs from piling up?
> Do we retry regardless of whether the previous task completed or not? If so, do we have a mechanism to prevent too many of these jobs from piling up?
We have it on our list to discuss SLM retries and what we want to do in the event that an SLM snapshot fails, so it's still under discussion.
Cross-linking: #70587
Same as #72085 but for ILM. Having a timeout on these internal "requests" only adds more noise when the master is already slow, since timed-out steps trigger moves to the error step. It also seems safe to remove the timeout setting outright, as it was not used anywhere and, as far as I can tell, was never documented.