Factor out common retrying logic #136663
Conversation
server/src/main/java/org/elasticsearch/common/blobstore/RetryingInputStream.java
ywangd left a comment:
Overall I am in favour of this approach. The extraction looks pretty clean to me.
A few questions to help my understanding:
- How feasible would it be, and how much effort, to expand this to cover `GoogleCloudStorageRetryingInputStream`?
- You said Azure supports retries mid-stream-read out of the box. What does this mean for the common retry code? Is it that `reopenStreamOrFail` is effectively ignored by Azure?
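To make the question concrete, here is a minimal sketch of the shape being discussed: a CSP-agnostic retrying stream where the provider-specific part is reduced to a single "open the blob at an offset" hook, and `reopenStreamOrFail` resumes from the last good position. All names and the structure are illustrative, not the actual ES implementation.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a common retrying input stream. A provider whose SDK
// already retries internally (e.g. Azure) would rarely hit reopenStreamOrFail.
abstract class RetryingInputStreamSketch extends InputStream {
    private InputStream current;
    private long position;
    private int attempts;
    private final int maxAttempts;

    RetryingInputStreamSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Provider-specific hook: open the blob starting at the given offset (ranged read). */
    protected abstract InputStream openStreamAt(long offset) throws IOException;

    @Override
    public int read() throws IOException {
        while (true) {
            try {
                if (current == null) {
                    current = openStreamAt(position);
                }
                int b = current.read();
                if (b != -1) {
                    position++; // track progress so a retry can resume, not restart
                }
                return b;
            } catch (IOException e) {
                reopenStreamOrFail(e);
            }
        }
    }

    /** Give up once the retry budget is spent; otherwise drop the broken stream and resume. */
    private void reopenStreamOrFail(IOException cause) throws IOException {
        if (++attempts > maxAttempts) {
            throw cause;
        }
        if (current != null) {
            current.close();
            current = null; // reopened lazily at `position` on the next read
        }
    }
}
```

In this shape, an SDK with built-in mid-read retries simply never throws up to this layer, so the common reopen path becomes a dead branch rather than a conflict.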
fcofdez left a comment:
Apologies for the late answer, this looks like a good direction to me 👍. As Yang mentioned, could we use this abstraction for the GCS repository too?
Thanks @ywangd and @fcofdez, yeah I think it will be simple to replicate this for the GCP client.
I'm still not sure about that; I think the most consistent way to do things is basically to set Azure to do 0 retries and do all the retrying in our layer? I have to look at the behaviour to see whether there's anything we're missing by not using their retry capability.
On second thought, I don't think …
I think we should configure retries in a single layer only; as it stands (before this PR) there is no …
While I think that's preferable in theory, the ES-level retries currently only cover read operations. So there will be gaps if we configure retries in a single layer.
```java
@Override
public void onRetrySucceeded(String action, long numberOfRetries) {
    // No metrics for Azure
}
```
S3 has its own special metrics for these. We can probably make that consistent now, but I wonder if we want to do that in a separate PR to keep the volume down.
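One way to reconcile "S3 has metrics, Azure doesn't" without blocking this PR would be a small listener hook on the common retry code, with a no-op for Azure today. This is a hypothetical sketch; interface and class names are invented here, not from the PR.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical metrics hook for the common retry code.
interface RetryMetricsListener {
    void onRetrySucceeded(String action, long numberOfRetries);
}

// Azure today: no metrics, so the hook is a no-op.
class NoopRetryMetrics implements RetryMetricsListener {
    @Override
    public void onRetrySucceeded(String action, long numberOfRetries) {
        // intentionally empty
    }
}

// Roughly what an S3-style implementation could do: count retries centrally.
class CountingRetryMetrics implements RetryMetricsListener {
    private final AtomicLong totalRetries = new AtomicLong();

    @Override
    public void onRetrySucceeded(String action, long numberOfRetries) {
        totalRetries.addAndGet(numberOfRetries);
    }

    long totalRetries() {
        return totalRetries.get();
    }
}
```

The point of the indirection is that making metrics consistent later becomes a per-CSP swap of the listener, not a change to the shared retry loop.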
```java
@Override
public long getMeaningfulProgressSize() {
    return Math.max(1L, blobStore.getReadChunkSize() / 100L);
}
```
This value seems kind of arbitrary.
The Azure value works out to about 320KiB, the GCS one to 160KiB, and the S3 one to 1MiB; they are all functions of various loosely related thresholds. Perhaps it makes sense to make this a first-class setting, consistent across the CSPs?
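The arithmetic behind those figures is just "one hundredth of the read chunk size, floored at one byte". The chunk sizes below are assumptions chosen to reproduce the approximate values quoted in the comment, not values read from the code.

```java
// Sketch of the per-CSP arithmetic behind getMeaningfulProgressSize().
class MeaningfulProgress {
    static long meaningfulProgressSize(long readChunkSizeBytes) {
        // Never return 0, or "meaningful progress" could never be detected.
        return Math.max(1L, readChunkSizeBytes / 100L);
    }
}
```

With an assumed 32MiB Azure read chunk this gives 335,544 bytes (~320KiB), and with an assumed 16MiB GCS chunk it gives 167,772 bytes (~160KiB), which matches the rough figures above.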
```java
position,
length == null ? totalSize : length,
totalSize,
0,
```
We disable Azure client retries in these downloads so we can control them in the `RetryingInputStream`.
```java
.setInitialRpcTimeout(options.getRetrySettings().getInitialRpcTimeout())
.setRpcTimeoutMultiplier(options.getRetrySettings().getRpcTimeoutMultiplier())
.setMaxRpcTimeout(options.getRetrySettings().getMaxRpcTimeout())
.build()
```
These tests used to configure time-based retries (`totalTimeout`: retry an unlimited number of times up until some time limit). The new retrying input stream doesn't support that type of configuration, so I unfortunately had to change the config for this test to use attempt-count limits instead. I also removed a lot of the setters that just set the new value to the old value; we don't need to do that if we use `oldSettings.toBuilder()`.
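The difference between the two retry-budget styles can be sketched in plain Java; the method names here are illustrative, not the ES or GCS SDK API.

```java
import java.util.function.IntPredicate;

// Attempt-count budgets (what the new RetryingInputStream supports) vs
// deadline budgets (the totalTimeout style these tests previously used).
class RetryBudgets {
    // Retry until success or until maxAttempts is exhausted.
    static int runWithAttemptLimit(int maxAttempts, IntPredicate succeedsOnAttempt) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (succeedsOnAttempt.test(attempt)) {
                return attempt;
            }
        }
        return -1; // budget exhausted
    }

    // Retry an unbounded number of times until a deadline passes.
    static int runWithDeadline(long deadlineNanos, IntPredicate succeedsOnAttempt) {
        int attempt = 0;
        while (System.nanoTime() < deadlineNanos) {
            if (succeedsOnAttempt.test(++attempt)) {
                return attempt;
            }
        }
        return -1; // deadline passed
    }
}
```

The two budgets are not interchangeable in tests: an attempt limit gives a deterministic failure point, while a deadline depends on wall-clock timing, which is part of why the test config had to change.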
```java
int getMaxRetries() {
    return storageService.clientSettings(projectId, clientName).getMaxRetries();
}
```
In GCS, retries are configured at the client level, not the request level, so we need to create a separate client configured not to retry, so that we can manage retries in the `RetryingInputStream`. This means an extra client is cached for each config, but I don't think that's a huge overhead.
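The "extra client per config" idea amounts to caching one client per (client name, retry behaviour) pair so a no-retry variant can coexist with the normally configured one. A minimal sketch, with all names hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

// Hypothetical client cache keyed by client name plus retry behaviour.
class ClientCache<C> {
    enum RetryBehaviour { CLIENT_CONFIGURED, NONE }

    private final Map<String, C> cache = new ConcurrentHashMap<>();
    private final BiFunction<String, RetryBehaviour, C> factory;

    ClientCache(BiFunction<String, RetryBehaviour, C> factory) {
        this.factory = factory;
    }

    // Build each distinct (name, behaviour) client once; reuse it afterwards.
    C client(String clientName, RetryBehaviour behaviour) {
        return cache.computeIfAbsent(
            clientName + "/" + behaviour,
            key -> factory.apply(clientName, behaviour)
        );
    }
}
```

The overhead is bounded: at most two cached clients per named config, one per `RetryBehaviour`, which matches the "not a huge overhead" assessment above.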
```java
PREFIX,
"max_retries",
(key) -> Setting.intSetting(key, 5, 0, Setting.Property.NodeScope)
);
```
GCS didn't have any retry config previously; the default was to retry up to 5 times, with a time limit of 50 seconds. The new default will be the same when the `RetryBehaviour` is `ClientConfigured`, and we use `com.google.cloud.ServiceOptions#getNoRetrySettings` when the `RetryBehaviour` is `None`.
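One easy-to-miss detail in the hunk below is the off-by-one convention: a `max_retries` setting of 5 means 6 total attempts (the first try plus five retries), hence `setMaxAttempts(maxRetries + 1)`. A trivial sketch of that conversion:

```java
// Sketch of the retries-to-attempts conversion implied by setMaxAttempts(maxRetries + 1).
class RetryConversion {
    static int maxAttempts(int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        // "N retries" = the initial attempt plus N more.
        return maxRetries + 1;
    }
}
```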
```java
retrySettingsBuilder.setMaxAttempts(maxRetries + 1);
}
```

```java
.setMaxRpcTimeout(Duration.ofSeconds(1))
.setMaxAttempts(options.getRetrySettings().getMaxAttempts());
```
Use the setting now instead
```java
purpose -> purpose == OperationPurpose.REPOSITORY_ANALYSIS || purpose == OperationPurpose.INDICES,
BlobStoreTestUtil::randomPurpose
);
}
```
These were all moved up to the parent class as they're consistent across the different CSPs
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
@nicktindall I'm not sure if I'll have the bandwidth to review this PR.
That's fine @fcofdez, I think Yang is OK with reviewing it; I just wanted to get your opinion on the approach because you raised the original ticket. Thanks for the help.
Experimenting with factoring out retry logic so it can be re-used for different CSPs
Just putting this out there as a potential approach for the Azure retrying. Basically it just pulls all the ES-specific retry logic out of the `S3RetryingInputStream` so we can reuse it for the other CSPs.
This might be controversial. The pros/cons as I see them:
Pros
- … (retries based on `OperationPurpose`, more persistent retrying when progress is being made)

Cons
- Having separate `RetryingInputStream`s per CSP might be a good thing, as it allows specialisation according to each store's particular quirks? (though there is still a wrapper around the client-native streams for S3/Azure where that could be put)

I tried to keep all the logic I pulled from `S3RetryingInputStream` -> `RetryingInputStream` intact and in order, so that when doing a diff it is easy to see what changed and what stayed the same. Anyhow, I wanted to put this out there for discussion before I went much further with it, to get feedback on the approach.
Update: this is probably ready for review now. The Azure and GCS implementations are both using the new common retry logic. For downloads that the retrying input stream conducts, we disable client-internal retries, so the retrying behaviour should be consistent across the board.
I apologise for the size of this PR; if it's too much I could probably split it into one PR per CSP.
Relates: ES-13073