
Conversation

original-brownbear
Contributor

@original-brownbear original-brownbear commented Apr 22, 2021

No point in failing SLM tasks on slow masters. Using 30s timeouts
likely leads to many needless SLM task failures when the master is
temporarily busy, which is less than ideal, especially when snapshot
or retention task frequencies are low.
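The gist of the change can be sketched with a minimal stand-in for the request builder. These classes are illustrative stand-ins, not the real `org.elasticsearch` types (the real change touches `CreateSnapshotRequest` and `TimeValue`, shown in the diff further down):

```java
// Minimal stand-ins; names mimic the real API but are simplified.
final class TimeValue {
    final long millis;
    private TimeValue(long millis) { this.millis = millis; }
    static final TimeValue MAX_VALUE = new TimeValue(Long.MAX_VALUE);
    static TimeValue timeValueSeconds(long s) { return new TimeValue(s * 1000); }
}

final class CreateSnapshotRequest {
    // old behaviour: an implicit 30s master node timeout
    TimeValue masterNodeTimeout = TimeValue.timeValueSeconds(30);
    CreateSnapshotRequest masterNodeTimeout(TimeValue t) {
        this.masterNodeTimeout = t;
        return this; // builder style, as in the real API
    }
}

public class SlmTimeoutSketch {
    public static void main(String[] args) {
        // Before: 30s timeout -> SLM run fails if the master is briefly busy.
        CreateSnapshotRequest before = new CreateSnapshotRequest();
        // After: effectively no timeout -> the task waits for the master.
        CreateSnapshotRequest after = new CreateSnapshotRequest()
                .masterNodeTimeout(TimeValue.MAX_VALUE);
        System.out.println(before.masterNodeTimeout.millis); // 30000
        System.out.println(after.masterNodeTimeout.millis);  // Long.MAX_VALUE
    }
}
```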

@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Apr 22, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

Contributor

@DaveCTurner DaveCTurner left a comment


LGTM

@original-brownbear
Contributor Author

Thanks David!

@original-brownbear original-brownbear merged commit 47fdb46 into elastic:master Apr 22, 2021
@original-brownbear original-brownbear deleted the infinite-timeouts-slm-tasks branch April 22, 2021 15:27
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 22, 2021
String snapshotName = maybeMetadata.map(policyMetadata -> {
-    CreateSnapshotRequest request = policyMetadata.getPolicy().toRequest();
+    // don't time out on this request to not produce failed SLM runs in case of a temporarily slow master node
+    CreateSnapshotRequest request = policyMetadata.getPolicy().toRequest().masterNodeTimeout(TimeValue.MAX_VALUE);
Member


I am wary of infinite timeouts; we should be able to come up with a reasonable time within which a request should either succeed or fail, or else SLM should actually fail rather than wait forever.

I know this is already merged, but what about just setting a reasonably long timeout (something like 2 hours, or even 24 hours) instead?
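For reference, the bounded-timeout alternative suggested here would look like this in the same builder style (a sketch using minimal stand-in classes, not the real `org.elasticsearch` types; the real `TimeValue` does offer `timeValueHours`):

```java
// Minimal stand-ins; names mimic the real API but are simplified.
final class TimeValue {
    final long millis;
    private TimeValue(long millis) { this.millis = millis; }
    static TimeValue timeValueHours(long h) { return new TimeValue(h * 60 * 60 * 1000); }
}

final class CreateSnapshotRequest {
    TimeValue masterNodeTimeout;
    CreateSnapshotRequest masterNodeTimeout(TimeValue t) {
        this.masterNodeTimeout = t;
        return this;
    }
}

public class BoundedTimeoutSketch {
    public static void main(String[] args) {
        // Very long but finite: a truly stuck request still fails eventually
        // instead of waiting forever.
        CreateSnapshotRequest request = new CreateSnapshotRequest()
                .masterNodeTimeout(TimeValue.timeValueHours(24));
        System.out.println(request.masterNodeTimeout.millis); // 86400000
    }
}
```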

Contributor


I think a timeout here is worse than patiently letting things complete in their own time. Timing out and retrying the thing that just timed out is particularly bad.

In this case, the timeout only applies to finding the master and processing the cluster state update that starts the snapshot. After that, we could already be waiting arbitrarily long. It's definitely a bug for us to take minutes or hours to process that first cluster state update, but it's almost certainly not a bug in SLM or snapshotting, and it doesn't make much sense to me to give up and retry (from the back of the queue) after a timeout elapses. We may as well stay in line and know that the master will get around to this eventually.
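The "retry from the back of the queue" point can be made concrete with a toy model (all numbers and names here are hypothetical, not SLM internals): if the master's queue drains slower than the timeout and stays full, a timeout-and-retry task starves, while a patient one eventually runs.

```java
public class RetryQueueSketch {
    /**
     * Time (ms) until our task completes if it waits patiently behind
     * `queueLength` tasks, each taking `taskMillis` on a busy master.
     */
    static long waitPatiently(int queueLength, long taskMillis) {
        return (long) (queueLength + 1) * taskMillis;
    }

    /**
     * Same queue, but we give up after `timeoutMillis` and re-submit at the
     * back. If the queue's drain time always exceeds the timeout, the task
     * starves (modelled as -1).
     */
    static long timeoutAndRetry(int queueLength, long taskMillis,
                                long timeoutMillis, int maxRetries) {
        long elapsed = 0;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            long timeToRun = (long) (queueLength + 1) * taskMillis;
            if (timeToRun <= timeoutMillis) {
                return elapsed + timeToRun; // reached the front in time
            }
            elapsed += timeoutMillis; // timed out; rejoin at the back
        }
        return -1; // never completed
    }

    public static void main(String[] args) {
        // 20 queued tasks, 5s each, 30s timeout: patient wait finishes in
        // 105s; timeout-and-retry never does.
        System.out.println(waitPatiently(20, 5_000));                // 105000
        System.out.println(timeoutAndRetry(20, 5_000, 30_000, 100)); // -1
    }
}
```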

Do we retry regardless of whether the previous task completed or not? If so, do we have a mechanism to prevent too many of these jobs from piling up?

Member


> Do we retry regardless of whether the previous task completed or not? If so, do we have a mechanism to prevent too many of these jobs from piling up?

We have it on our list to discuss SLM retries and what we want to do in the event that an SLM snapshot fails, so it's still under discussion.

Contributor


cross linking: #70587

original-brownbear added a commit that referenced this pull request Apr 23, 2021

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jun 15, 2021
Same as elastic#72085 but for ILM. Having a timeout on these internal "requests"
only adds more noise when the master is already slow, since timed-out steps trigger
moves to the error step.
It also seems safe to remove the timeout setting outright, as it was not used
anywhere and, as far as I can tell, was never documented.
original-brownbear added a commit that referenced this pull request Jun 17, 2021
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jun 28, 2021
original-brownbear added a commit that referenced this pull request Jun 28, 2021
@original-brownbear original-brownbear restored the infinite-timeouts-slm-tasks branch April 18, 2023 20:49

Labels

:Data Management/ILM+SLM (Index and Snapshot lifecycle management), >non-issue, Team:Data Management (Meta label for data/management team), v7.14.0, v8.0.0-alpha1
