Conversation

@nicktindall nicktindall commented Nov 25, 2025

This is the second step in breaking up and merging #136663

I chose to do GCS next because it introduces safe-resume (where we remember a version of the blob we were downloading so we can request specifically that one when we resume). This will mean less refactoring than if we'd done Azure first.

I didn't implement that logic for S3, although it's trivial. I will do that in a subsequent change.
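The safe-resume idea described above can be sketched with a toy versioned blob store: remember the blob's generation when the download starts and pin any resumed range request to that generation, failing fast if the blob was overwritten in between. All names below (`BlobSource`, `downloadWithResume`) are illustrative, not the actual `RetryingInputStream` or GCS SDK API.

```java
import java.util.Arrays;

public class SafeResumeSketch {

    /** A toy versioned blob store: each write bumps the generation. */
    static final class BlobSource {
        byte[] content = new byte[0];
        long generation = 0;

        void write(byte[] data) { content = data.clone(); generation++; }

        // Read the suffix starting at `offset`. A non-null requiredGeneration
        // models a generation precondition: the read fails if the blob was
        // overwritten since the download began, instead of mixing versions.
        byte[] readFrom(long offset, Long requiredGeneration) {
            if (requiredGeneration != null && requiredGeneration != generation) {
                throw new IllegalStateException("blob changed mid-download");
            }
            return Arrays.copyOfRange(content, (int) offset, content.length);
        }
    }

    // Download in two legs to simulate a dropped connection after `splitAt`
    // bytes; the second leg resumes at the recorded offset and generation.
    static byte[] downloadWithResume(BlobSource source, int splitAt) {
        byte[] firstLeg = source.readFrom(0, null);
        Long pinnedGeneration = source.generation;   // remembered for resume
        firstLeg = Arrays.copyOf(firstLeg, splitAt); // connection "drops" here
        byte[] secondLeg = source.readFrom(splitAt, pinnedGeneration);
        byte[] result = Arrays.copyOf(firstLeg, splitAt + secondLeg.length);
        System.arraycopy(secondLeg, 0, result, splitAt, secondLeg.length);
        return result;
    }

    public static void main(String[] args) {
        BlobSource source = new BlobSource();
        source.write("hello, safe resume".getBytes());
        System.out.println(new String(downloadWithResume(source, 5)));
    }
}
```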

.setRetryDelayMultiplier(options.getRetrySettings().getRetryDelayMultiplier())
.setMaxRetryDelay(Duration.ofSeconds(1L))
.setMaxAttempts(0)
.setJittered(false)
Contributor Author

@nicktindall nicktindall Nov 25, 2025

This test originally configured retries to be time-based (i.e. no limit on the number of attempts; just keep retrying for some amount of time). I changed it to make the retry intervals small and depend on the configured retry limits, because we don't support time-based retries anymore.
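A minimal sketch of what the test now relies on: retries bounded by a configured attempt count with a short fixed delay, rather than by elapsed time. The `retry` helper below is illustrative, not the actual RetrySettings API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class AttemptBoundedRetry {

    // Run `action` at most `maxAttempts` times, sleeping briefly between
    // failures; rethrow the last failure if every attempt fails.
    static <T> T retry(Callable<T> action, int maxAttempts, long delayMillis) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // small fixed delay keeps tests fast
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Fails twice, succeeds on the third attempt.
        String result = retry(() -> {
            if (calls.incrementAndGet() < 3) throw new RuntimeException("transient");
            return "ok";
        }, 5, 1L);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```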

container.writeBlob(randomPurpose(), blobKey, new BytesArray(initialValue), true);

try (InputStream inputStream = container.readBlob(randomPurpose(), blobKey)) {
try (InputStream inputStream = container.readBlob(randomRetryingPurpose(), blobKey)) {
Contributor Author

@nicktindall nicktindall Nov 25, 2025

We have to be careful where we use randomPurpose() now, because some purposes no longer retry (e.g. REPOSITORY_ANALYSIS).
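Illustratively, a helper like randomRetryingPurpose() would draw only from the purposes that still retry. The enum values and their retry mapping below are assumptions, apart from REPOSITORY_ANALYSIS, which the comment above names as non-retrying.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Stream;

public class PurposeFilterSketch {

    // Hypothetical purpose set; only the REPOSITORY_ANALYSIS flag is grounded
    // in the discussion above.
    enum OperationPurpose {
        SNAPSHOT_DATA(true),
        SNAPSHOT_METADATA(true),
        REPOSITORY_ANALYSIS(false); // no longer retries

        final boolean retries;
        OperationPurpose(boolean retries) { this.retries = retries; }
    }

    // Pick uniformly among the purposes that still retry.
    static OperationPurpose randomRetryingPurpose(Random random) {
        List<OperationPurpose> retrying = Stream.of(OperationPurpose.values())
            .filter(p -> p.retries)
            .toList();
        return retrying.get(random.nextInt(retrying.size()));
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int i = 0; i < 100; i++) {
            // never hands back a purpose that does not retry
            if (randomRetryingPurpose(random) == OperationPurpose.REPOSITORY_ANALYSIS) {
                throw new AssertionError("non-retrying purpose returned");
            }
        }
        System.out.println("only retrying purposes returned");
    }
}
```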

@Override
public long getMeaningfulProgressSize() {
return Math.max(1L, GoogleCloudStorageBlobStore.SDK_DEFAULT_CHUNK_SIZE / 100L);
}
Contributor Author

The choice of this value is somewhat arbitrary; open to suggestions on whether we should make this consistent across CSPs or use some other value here. The SDK default chunk size is 16MB, so this is roughly 160KB.
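For reference, the arithmetic behind the figure above: one hundredth of the chunk size, floored at one byte for tiny chunk configurations. The 16 MiB constant below is the value stated in the comment, not read from the SDK.

```java
public class ProgressSizeSketch {
    static final long SDK_DEFAULT_CHUNK_SIZE = 16 * 1024 * 1024; // 16 MiB (value assumed from the comment)

    // One hundredth of the chunk size, but never less than one byte.
    static long meaningfulProgressSize(long chunkSize) {
        return Math.max(1L, chunkSize / 100L);
    }

    public static void main(String[] args) {
        // 16777216 / 100 = 167772 bytes, i.e. roughly 160 KB as stated above
        System.out.println(meaningfulProgressSize(SDK_DEFAULT_CHUNK_SIZE));
        // the floor matters for pathologically small chunk sizes
        System.out.println(meaningfulProgressSize(50L));
    }
}
```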

@nicktindall nicktindall added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement >non-issue labels Nov 25, 2025
* will attempt to create a new one of these. If reading from it fails, it should not retry.
*/
protected abstract static class SingleAttemptInputStream extends InputStream {
protected static final class SingleAttemptInputStream<V> extends FilterInputStream {
Contributor Author

I think the best approach for the SingleAttemptInputStream is to implement it as a decorator. This gives the CSPs more freedom in how they implement the single-attempt stream.

Specifically, if you want to do something on every read, it's much more work to extend FilterInputStream because the default implementations all delegate to the wrapped stream. If you extend InputStream you only need to implement int read() and the defaults are all implemented on top of that.

If we expect everyone to extend SingleAttemptInputStream we force everyone to extend whichever of the above that we extended. This is the inheritance issue I alluded to earlier due to InputStream being an abstract class rather than an interface.
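The difference can be seen with two byte-counting decorators: the InputStream subclass observes bulk reads because the inherited read(byte[], int, int) default funnels through read(), while the same single override on a FilterInputStream subclass is bypassed entirely, since FilterInputStream's bulk read delegates straight to the wrapped stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamDecoratorDemo {

    // One method is enough when extending InputStream: the default bulk read
    // is implemented on top of read().
    static final class CountingStream extends InputStream {
        private final InputStream in;
        long bytesRead;
        CountingStream(InputStream in) { this.in = in; }
        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b >= 0) bytesRead++;
            return b;
        }
    }

    // The same override on FilterInputStream misses bulk reads entirely,
    // because read(byte[], int, int) delegates to the wrapped stream.
    static final class CountingFilterStream extends FilterInputStream {
        long bytesRead;
        CountingFilterStream(InputStream in) { super(in); }
        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) bytesRead++;
            return b;
        }
    }

    // Do a single bulk read of 1024 bytes through either decorator and
    // report how many bytes the read() override actually saw.
    static long bytesSeen(boolean useFilter) throws IOException {
        byte[] data = new byte[1024];
        if (useFilter) {
            CountingFilterStream s = new CountingFilterStream(new ByteArrayInputStream(data));
            s.read(new byte[1024], 0, 1024);
            return s.bytesRead;
        }
        CountingStream s = new CountingStream(new ByteArrayInputStream(data));
        s.read(new byte[1024], 0, 1024);
        return s.bytesRead;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("InputStream subclass saw " + bytesSeen(false) + " bytes");
        System.out.println("FilterInputStream subclass saw " + bytesSeen(true) + " bytes");
    }
}
```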

Member

I see that GCS's ContentLengthValidatingInputStream extends FilterInputStream which is a different base class than S3SingleAttemptInputStream. I think making S3SingleAttemptInputStream a FilterInputStream could work as well since it is effectively a delegate to ResponseInputStream. But it may be a problem again for Azure.

I am good to go with your suggestion. Can we maybe rename S3SingleAttemptInputStream to something not suggesting potential inheritance?

PREFIX,
"max_retries",
(key) -> Setting.intSetting(key, 5, 0, Setting.Property.NodeScope)
);
Contributor Author

We never configured the number of retries for GCS previously. The default settings were for 6 attempts (i.e. 5 retries).

private long currentOffset;
private boolean closed;
private Long lastGeneration;
private static final StorageRetryStrategy STORAGE_RETRY_STRATEGY = GoogleCloudStorageService.createStorageRetryStrategy();
Contributor Author

The one we use is stateless; we might need to re-think this lifecycle if we switch to one that is not. You can't get it out of the client or StorageOptions as far as I could see.

Member

I don't know how important this is. If necessary, I think we can store the original strategy object in the MeteredStorage object so that we can get it in this class?

}
return n;
} catch (IOException e) {
throw StorageException.translate(e);
Contributor Author

@nicktindall nicktindall Nov 26, 2025

We translate these for consistency with the existing implementation. We retry anything when reading, as before, but when something goes wrong the translation might add some more context to the stack trace.
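The translation pattern looks roughly like the decorator below. TranslatingStream and StoreException are illustrative stand-ins for the real GCS StorageException.translate call; the point is only that the IOException is preserved as the cause while the thrown type gains store-specific context.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TranslateSketch {

    // Stand-in for the store's exception type; keeps the IOException as cause.
    static class StoreException extends RuntimeException {
        StoreException(IOException cause) { super("storage read failed", cause); }
    }

    // A stream that always fails, simulating a broken connection.
    static final class FailingStream extends InputStream {
        @Override
        public int read() throws IOException { throw new IOException("boom"); }
    }

    // Wrap each read and convert IOExceptions into the store's exception
    // type so stack traces carry provider context.
    static final class TranslatingStream extends FilterInputStream {
        TranslatingStream(InputStream in) { super(in); }
        @Override
        public int read() {
            try {
                return super.read();
            } catch (IOException e) {
                throw new StoreException(e); // adds context, preserves the cause
            }
        }
    }

    public static void main(String[] args) {
        try {
            new TranslatingStream(new FailingStream()).read();
        } catch (StoreException e) {
            System.out.println(e.getMessage() + " / caused by: " + e.getCause().getMessage());
        }
    }
}
```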

…c_gcs

# Conflicts:
#	modules/repository-gcs/src/internalClusterTest/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageBlobStoreRepositoryTests.java
#	modules/repository-gcs/src/test/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageBlobContainerRetriesTests.java
#	server/src/test/java/org/elasticsearch/common/blobstore/RetryingInputStreamTests.java
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the GCS repository implementation to use common retry logic from RetryingInputStream, introducing blob version tracking for safe-resume functionality. This is the second step in breaking up and merging #136663, following S3's earlier adoption of the common retry pattern.

Key Changes:

  • Generalized RetryingInputStream to support blob versioning with type parameter <V> for version tracking
  • Implemented GCS-specific retry logic using blob generation headers for safe-resume
  • Added RetryBehaviour enum to control GCS client retry configuration
  • Enhanced test utilities with randomRetryingPurpose() and randomFiniteRetryingPurpose() helpers

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.

Summary per file:

RetryingInputStream.java — Made generic with version type parameter <V>, added StreamAction enum, moved SingleAttemptInputStream to use FilterInputStream, added isRetryableException() to the BlobStoreServices interface
GoogleCloudStorageRetryingInputStream.java — Completely refactored to extend RetryingInputStream<Long>, using GCS generation headers for version tracking; delegates to the GoogleCloudStorageBlobStoreServices implementation
GoogleCloudStorageService.java — Added RetryBehaviour enum for controlling client retry configuration; updated client caching to include retry behaviour in the cache key
GoogleCloudStorageBlobStore.java — Split client retrieval into client() and clientNoRetries() methods; added getMaxRetries() accessor
GoogleCloudStorageClientSettings.java — Added MAX_RETRIES_SETTING configuration option with a default value of 5
S3RetryingInputStream.java — Updated to use RetryingInputStream<Void> since S3 doesn't implement version tracking yet
BlobStoreTestUtil.java — Added randomRetryingPurpose() and randomFiniteRetryingPurpose() test utilities
RetryingInputStreamTests.java — Updated tests to use the parameterized version type; added a test for blob version tracking behaviour
Various GCS test files — Updated to accommodate the new RetryBehaviour parameter in client creation


@nicktindall nicktindall marked this pull request as ready for review November 26, 2025 22:34
@nicktindall nicktindall requested review from mhl-b and ywangd November 26, 2025 22:35
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Nov 26, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Member

ywangd commented Nov 28, 2025

The PR has both the >enhancement and >non-issue labels. Which one is true? I assume it's the latter and the PR is a pure refactor?

@elasticsearchmachine
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

@nicktindall
Contributor Author

nicktindall commented Nov 28, 2025

The PR has both the >enhancement and >non-issue labels. Which one is true? I assume it's the latter and the PR is a pure refactor?

@ywangd I think it's probably the former, because the RetryingInputStream includes a lot more smarts than the out-of-the-box retries we were using previously. Moving to the common retry semantics is a change in behaviour. I removed "non-issue"

Member

@ywangd ywangd left a comment

I skimmed through the production code changes. They mostly look good to me. I have some small comments.

The GCS retry has some quirkiness in that the openStream method is always retried the default number of times (with RetryHelper.runWithRetries) regardless of whether some of the retries are already consumed. I think this is a bug and the PR should fix it. In the meantime, I think we should preserve the inner retry plus outer retry, as commented.
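The cost of nesting retries compounds: if each stream open is retried R times internally while the outer loop makes O attempts, a persistent failure can cost up to O × R requests rather than the configured O. A trivial illustration of the worst case:

```java
public class NestedRetrySketch {

    // Upper bound on requests when every inner attempt of every outer
    // attempt fails: the two limits multiply.
    static int worstCaseAttempts(int outerAttempts, int innerAttemptsPerOpen) {
        return outerAttempts * innerAttemptsPerOpen;
    }

    public static void main(String[] args) {
        // e.g. 6 outer attempts, each opening the stream with 6 inner attempts
        System.out.println(worstCaseAttempts(6, 6)); // 36 requests in the worst case
        // with the inner layer disabled, attempts match the configured limit
        System.out.println(worstCaseAttempts(6, 1)); // 6
    }
}
```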

try {
@Override
public SingleAttemptInputStream<Long> getInputStream(@Nullable Long lastGeneration, long start, long end) throws IOException {
final MeteredStorage client = blobStore.clientNoRetries();
Member

I am not sure whether this is necessary for now. The current S3 implementation also has nested retries, an inner layer provided by the SDK and an outer layer with our own implementation. That seems to be the existing behaviour of the GCS retrying input stream as well. Maybe we should keep it as is for now. I am slightly concerned that this reduces the number of total retries in production.

We can change it in future for all CSP together if it is necessary.

Labels

:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0
