Merged
@@ -21,6 +21,7 @@
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.bytes.BytesReference;
+import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.json.JsonXContent;
@@ -85,7 +86,8 @@ public static Optional<String> maybeTakeSnapshot(final String jobId, final Client client, final ClusterService clusterService,
final SnapshotHistoryStore historyStore) {
Optional<SnapshotLifecyclePolicyMetadata> maybeMetadata = getSnapPolicyMetadata(jobId, clusterService.state());
String snapshotName = maybeMetadata.map(policyMetadata -> {
-CreateSnapshotRequest request = policyMetadata.getPolicy().toRequest();
+// don't time out on this request to not produce failed SLM runs in case of a temporarily slow master node
+CreateSnapshotRequest request = policyMetadata.getPolicy().toRequest().masterNodeTimeout(TimeValue.MAX_VALUE);
Member commented:
I am wary of infinite timeouts just because we should be able to come up with a reasonable time that a request should either succeed or fail, or else SLM should actually fail rather than waiting forever.

I know this is already merged, but what about just setting a reasonably long timeout (something like 2 hours, or even 24 hours) instead?
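
A minimal sketch of the bounded alternative being suggested; the 24-hour value is only the example floated in this comment, not what the PR merged:

// Hypothetical alternative to TimeValue.MAX_VALUE: a long but finite master-node timeout.
// The 24-hour figure is the reviewer's example value, not code from this PR.
CreateSnapshotRequest request = policyMetadata.getPolicy()
    .toRequest()
    .masterNodeTimeout(TimeValue.timeValueHours(24));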

Contributor commented:
I think a timeout here is worse than patiently letting things complete in their own time. Timing out and retrying the thing that just timed out is particularly bad.

In this case, the timeout only applies to finding the master and processing the cluster state update that starts the snapshot. After that, we could already be waiting arbitrarily long. It's definitely a bug for us to take minutes or hours to process that first cluster state update, but it's almost certainly not a bug in SLM or snapshotting, and it doesn't make much sense to me to give up and retry (from the back of the queue) after a timeout elapses. We may as well stay in line and know that the master will get around to this eventually.

Do we retry regardless of whether the previous task completed or not? If so, do we have a mechanism to prevent too many of these jobs from piling up?
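
A sketch of that scope, restating the point above as comments on the request built in this change (the comments are illustrative, not from the PR):

CreateSnapshotRequest request = policyMetadata.getPolicy()
    .toRequest()
    // masterNodeTimeout bounds only (1) finding an elected master and (2) the master
    // applying the cluster state update that starts the snapshot; MAX_VALUE keeps the
    // task queued until the master gets to it instead of failing the SLM run.
    .masterNodeTimeout(TimeValue.MAX_VALUE);
// The snapshot itself runs after that cluster state update and is not bounded by
// this timeout at all, whatever value is chosen.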

Member commented:
> Do we retry regardless of whether the previous task completed or not? If so, do we have a mechanism to prevent too many of these jobs from piling up?

We have it on our list to discuss SLM retries and what we want to do in the event that an SLM snapshot fails, so it's still under discussion.

Contributor commented:
cross linking: #70587

final LifecyclePolicySecurityClient clientWithHeaders = new LifecyclePolicySecurityClient(client,
ClientHelper.INDEX_LIFECYCLE_ORIGIN, policyMetadata.getHeaders());
logger.info("snapshot lifecycle policy [{}] issuing create snapshot [{}]",
@@ -242,6 +242,8 @@ void getAllRetainableSnapshots(Collection<String> repositories, ActionListener<M

client.admin().cluster()
.prepareGetSnapshots(repositories.toArray(Strings.EMPTY_ARRAY))
+// don't time out on this request to not produce failed SLM runs in case of a temporarily slow master node
+.setMasterNodeTimeout(TimeValue.MAX_VALUE)
.setIgnoreUnavailable(true)
.execute(ActionListener.wrap(resp -> {
if (logger.isTraceEnabled()) {
@@ -373,7 +375,9 @@ private void deleteSnapshots(SnapshotLifecycleStats slmStats, AtomicInteger dele
void deleteSnapshot(String slmPolicy, String repo, SnapshotId snapshot, SnapshotLifecycleStats slmStats,
ActionListener<AcknowledgedResponse> listener) {
logger.info("[{}] snapshot retention deleting snapshot [{}]", repo, snapshot);
-client.admin().cluster().prepareDeleteSnapshot(repo, snapshot.getName()).execute(ActionListener.wrap(acknowledgedResponse -> {
+// don't time out on this request to not produce failed SLM runs in case of a temporarily slow master node
+client.admin().cluster().prepareDeleteSnapshot(repo, snapshot.getName()).setMasterNodeTimeout(TimeValue.MAX_VALUE).execute(
+ActionListener.wrap(acknowledgedResponse -> {
slmStats.snapshotDeleted(slmPolicy);
listener.onResponse(acknowledgedResponse);
},