Conversation

@nicktindall (Contributor) commented Oct 16, 2025

Experimenting with factoring out retry logic so it can be re-used for different CSPs

Just putting this out there as a potential approach for the Azure retrying. Basically it just pulls all the ES-specific retry logic out of the S3RetryingInputStream so we can reuse it for the other CSPs.

This might be controversial. The pros/cons as I see them:

Pros

  • Common retry logic for Azure/S3 (and eventually GCP if we think it's worthwhile)
  • Able to have more sophisticated retry logic (e.g. different retry behaviour per OperationPurpose, more persistent retrying when progress is being made)
  • Consistent delay/retry config across CSPs
  • Our retry logic is non-trivial, better not to duplicate it
  • Add a resilience feature in one place, get benefit for all 3 CSPs

Cons

  • The different CSP clients support retries differently (retries mid-stream-read are supported in Azure out of the box)
  • This change introduces risk in S3 in order to add a feature to Azure (though I left all the S3 tests unchanged to ensure they still pass)
  • Separate RetryingInputStreams per CSP might be a good thing as it allows specialisation according to each store's particular quirks? (though there is still a wrapper around the client-native streams for S3/Azure where that could be put)

I tried to keep all the logic I pulled from S3RetryingInputStream -> RetryingInputStream intact and in order, so that when doing a diff it is easy to see what changed and what stayed the same.

Anyhow, I wanted to put it out there for discussion to get feedback on the approach before I went much further with it.
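To illustrate the shape of the extraction, here is a minimal sketch of a CSP-agnostic retrying stream. This is illustrative only: the names and structure are made up, and the real RetryingInputStream is more sophisticated (retry delays, OperationPurpose-aware limits, metrics).

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Illustrative sketch only: a CSP-agnostic retrying stream in the spirit of
 * this PR. The real RetryingInputStream is more sophisticated; names here
 * are made up, not the PR's actual code.
 */
class SketchRetryingInputStream extends InputStream {

    /** Opens the underlying blob stream starting at the given offset. */
    interface StreamOpener {
        InputStream open(long position) throws IOException;
    }

    private final StreamOpener opener;
    private final int maxRetries;
    private final long meaningfulProgressSize;

    private InputStream current;
    private long position;          // bytes successfully read so far
    private long lastRetryPosition; // position at the time of the last failure
    private int attempts;

    SketchRetryingInputStream(StreamOpener opener, int maxRetries, long meaningfulProgressSize) throws IOException {
        this.opener = opener;
        this.maxRetries = maxRetries;
        this.meaningfulProgressSize = meaningfulProgressSize;
        this.current = opener.open(0);
    }

    @Override
    public int read() throws IOException {
        while (true) {
            try {
                int b = current.read();
                if (b != -1) {
                    position++;
                }
                return b;
            } catch (IOException e) {
                reopenStreamOrFail(e);
            }
        }
    }

    private void reopenStreamOrFail(IOException cause) throws IOException {
        // Meaningful progress since the last failure refills the retry budget,
        // making retrying more persistent while the download is advancing.
        if (position - lastRetryPosition >= meaningfulProgressSize) {
            attempts = 0;
        }
        lastRetryPosition = position;
        if (++attempts > maxRetries) {
            throw cause;
        }
        try {
            current.close();
        } catch (IOException ignored) {
            // best-effort close of the failed stream
        }
        current = opener.open(position); // resume from where we got up to
    }

    @Override
    public void close() throws IOException {
        current.close();
    }
}
```

The key point is that everything above is expressed against a single `open(position)` hook, so the per-CSP code only has to supply how to open (or reopen) a ranged download.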


Update: this is probably ready for review now. The Azure and GCS implementations are both using the new common retry logic. For downloads that the retrying input stream conducts, we disable client-internal retries, so the retrying behaviour should be consistent across the board.

I apologise for the size of this PR; if it's too much I could probably split it into one PR per CSP.

Relates: ES-13073

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Oct 21, 2025
@nicktindall nicktindall changed the title Factor out non-CSP-specific retrying logic Factor out common retrying logic Oct 21, 2025
@nicktindall nicktindall added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Oct 21, 2025
@nicktindall nicktindall requested review from fcofdez and ywangd October 21, 2025 05:31
@ywangd (Member) left a comment

Overall I am in favour of this approach. The extraction looks pretty clean to me.

A few questions to help my understanding:

  1. How feasible would it be, or how much effort, to expand this to cover GoogleCloudStorageRetryingInputStream?
  2. You said Azure supports retries mid-stream-read out of the box. What does this mean for the common retry code? Is it that reopenStreamOrFail is effectively ignored by Azure?

@fcofdez (Contributor) left a comment

Apologies for the late answer, this looks like a good direction to me 👍. As Yang mentioned, could we use this abstraction for the GCS repository too?

@nicktindall (Contributor Author) commented:

Thanks @ywangd and @fcofdez, yeah I think it will be simple to replicate this for the GCP client.

You said Azure supports retries mid-stream-read out of the box. What does this mean for the common retry code? Is it that reopenStreamOrFail is effectively ignored by Azure?

I'm still not sure about that. I think the most consistent way to do things is to set Azure to do 0 retries and do all the retrying in our layer. I have to look at the behaviour to see whether there's anything we're missing by not using their retry capability.

@ywangd (Member) commented Oct 30, 2025

On second thought, I don't think reopenStreamOrFail will be ignored by Azure even if it does its own functionally equivalent retries (not sure, but assuming it does). We retry indefinitely for indices data, so reopenStreamOrFail should be invoked when the Azure client exhausts its own retries. Basically we make the mid-read retries more tenacious even if the Azure client does them already.
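The "retry indefinitely for indices data" behaviour described above is the kind of purpose-aware policy the common layer enables on top of the SDK's own finite retries. A hedged sketch of that idea (names are illustrative, not the PR's actual code):

```java
// Illustrative sketch only: purpose-aware retry limits in the common retry
// layer, as described in this thread. Enum values and class names are made up.
enum OperationPurpose { SNAPSHOT_DATA, SNAPSHOT_METADATA, INDICES, REPOSITORY_ANALYSIS }

class RetryLimits {
    /** Indices data retries without limit; other purposes use the configured cap. */
    static int maxRetries(OperationPurpose purpose, int configuredMax) {
        return purpose == OperationPurpose.INDICES ? Integer.MAX_VALUE : configuredMax;
    }
}
```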

@nicktindall (Contributor Author) commented:

On second thought, I don't think reopenStreamOrFail will be ignored by Azure even if it does its own functionally equivalent retries (not sure, but assuming it does). We retry indefinitely for indices data, so reopenStreamOrFail should be invoked when the Azure client exhausts its own retries. Basically we make the mid-read retries more tenacious even if the Azure client does them already.

I think we should configure retries in a single layer only. As it stands (before this PR) there is no reopenStreamOrFail for Azure and we depend on the client's baked-in retries, which are less sophisticated than the S3 retries and of course don't include any special consideration of progress made or operation purpose.

@ywangd (Member) commented Oct 30, 2025

I think we should configure retries in a single layer only

While I think that's preferable in theory, the ES-level retries currently cover only read operations. So there will be gaps if we configure retries in a single layer.

@Override
public void onRetrySucceeded(String action, long numberOfRetries) {
    // No metrics for Azure
}
@nicktindall (Contributor Author) commented:

S3 has its own special metrics for these; we can probably make that consistent now, but I wonder if we want to do that in a separate PR to keep the volume down.

@Override
public long getMeaningfulProgressSize() {
    return Math.max(1L, blobStore.getReadChunkSize() / 100L);
}
@nicktindall (Contributor Author) commented:

This value seems kind of arbitrary.

The Azure value works out to about 320KiB, the GCS one to 160KiB, and the S3 one to 1MiB; they are all functions of various loosely related thresholds. Perhaps it makes sense to make this a first-class setting that is consistent across the CSPs?
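For reference, the formula in the snippet above is simply 1% of the read chunk size, floored at one byte. A standalone restatement (the chunk sizes used below are illustrative, not the CSPs' actual values):

```java
class MeaningfulProgress {
    // Same shape as the snippet above: 1% of the read chunk size, floored at one byte.
    static long meaningfulProgressSize(long readChunkSize) {
        return Math.max(1L, readChunkSize / 100L);
    }
}
```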

position,
length == null ? totalSize : length,
totalSize,
0,
@nicktindall (Contributor Author) commented Nov 20, 2025:

We disable Azure client retries in these downloads so we can control them in the RetryingInputStream.

.setInitialRpcTimeout(options.getRetrySettings().getInitialRpcTimeout())
.setRpcTimeoutMultiplier(options.getRetrySettings().getRpcTimeoutMultiplier())
.setMaxRpcTimeout(options.getRetrySettings().getMaxRpcTimeout())
.build()
@nicktindall (Contributor Author) commented Nov 20, 2025:

These tests used to configure time-based retries (totalTimeout: where we retry an infinite number of times up until some time limit). The new retrying input stream doesn't support that type of configuration, so I unfortunately had to change the config for this test to use attempt-count limits instead. I also removed a lot of the setters that just set the new value to the old value (we don't need those if we use oldSettings.toBuilder()).


int getMaxRetries() {
    return storageService.clientSettings(projectId, clientName).getMaxRetries();
}
@nicktindall (Contributor Author) commented:

In GCS, retries are configured at the client level, not the request level, so we need to create a separate client configured not to retry in order to manage retries in the RetryingInputStream. This will mean an extra client is cached for each config, but I don't think that's a huge overhead.
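The "extra client cached per config" point can be sketched as a cache keyed on both the client config and the desired retry behaviour, so the no-retries variant coexists with the normally configured client. This is an illustrative sketch with made-up names, not the actual GCS service code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

// Illustrative sketch only: cache one client per (clientName, retryBehaviour),
// so each config may hold at most two clients (configured retries vs. none).
class ClientCache<C> {
    enum RetryBehaviour { CLIENT_CONFIGURED, NONE }

    private final Map<String, C> cache = new ConcurrentHashMap<>();
    private final BiFunction<String, RetryBehaviour, C> factory;

    ClientCache(BiFunction<String, RetryBehaviour, C> factory) {
        this.factory = factory;
    }

    C client(String clientName, RetryBehaviour retryBehaviour) {
        // Keyed on both name and behaviour; the factory only runs on a cache miss.
        return cache.computeIfAbsent(
            clientName + "/" + retryBehaviour,
            key -> factory.apply(clientName, retryBehaviour)
        );
    }
}
```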

PREFIX,
"max_retries",
(key) -> Setting.intSetting(key, 5, 0, Setting.Property.NodeScope)
);
@nicktindall (Contributor Author) commented Nov 20, 2025:

GCS didn't have any retry config previously. The default was to retry up to 5 times, with a time limit of 50 seconds. The new default will be the same when RetryBehaviour is ClientConfigured, and we use com.google.cloud.ServiceOptions#getNoRetrySettings when RetryBehaviour is None.

retrySettingsBuilder.setMaxAttempts(maxRetries + 1);
}
.setMaxRpcTimeout(Duration.ofSeconds(1))
.setMaxAttempts(options.getRetrySettings().getMaxAttempts());
@nicktindall (Contributor Author) commented:

We use the setting now instead.

purpose -> purpose == OperationPurpose.REPOSITORY_ANALYSIS || purpose == OperationPurpose.INDICES,
BlobStoreTestUtil::randomPurpose
);
}
@nicktindall (Contributor Author) commented:

These were all moved up to the parent class as they're consistent across the different CSPs.

@nicktindall nicktindall marked this pull request as ready for review November 20, 2025 01:28
@nicktindall nicktindall requested a review from ywangd November 20, 2025 01:28
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Nov 20, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@fcofdez (Contributor) commented Nov 20, 2025

@nicktindall I'm not sure if I'll have the bandwidth to review this PR.

@nicktindall (Contributor Author) commented:

@nicktindall I'm not sure if I'll have the bandwidth to review this PR.

That's fine @fcofdez, I think Yang is OK with reviewing it; I just wanted to get your opinion on the approach because you raised the original ticket. Thanks for the help.


Labels: :Distributed Coordination/Snapshot/Restore, >non-issue, serverless-linked, Team:Distributed Coordination, v9.3.0

4 participants