Conversation

@mhl-b
Contributor

@mhl-b mhl-b commented Dec 3, 2025

Add retry logic for bulk-delete items in GCS blob store.

@mhl-b
Contributor Author

mhl-b commented Dec 3, 2025

@DaveCTurner, @joshua-adams-1
Using a draft to align on the approach for retrying bulk items.

@mhl-b mhl-b requested a review from nicktindall December 3, 2025 00:08
@mhl-b
Contributor Author

mhl-b commented Dec 3, 2025

@nicktindall, I didn't have a chance to review the GCS retry refactoring PR, but I want to verify that we still keep the retry strategy for non-stream calls.

@nicktindall
Contributor

> @nicktindall, I didn't have a chance to review the GCS retry refactoring PR, but I want to verify that we still keep the retry strategy for non-stream calls.

Yes, only get blob should be affected by my changes.

Contributor

@joshua-adams-1 joshua-adams-1 left a comment

After reading #138364 this looks good to me. I'm happy to approve once the CI issues are resolved. Could we also add unit tests for the deleteBlobs function?


private static boolean isRetryErrCode(int code) {
    return switch (code) {
        case 408 | 429 | 500 | 502 | 503 | 504 -> true;
Contributor

  1. Can we replace these raw values with HttpURLConnection constants, like here?
  2. Is it worth explaining why we can retry on these codes?

Contributor

I'd rather we used the names in org.elasticsearch.rest.RestStatus, but yes, names >> numbers here, and if there are any docs about why we should retry these codes then it'd be great to link them in a comment.
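For illustration, a minimal sketch of what named status codes could look like. It uses the JDK's java.net.HttpURLConnection constants so that it compiles standalone; the class name RetryableStatusCodes and the local HTTP_TOO_MANY_REQUESTS constant are assumptions made for this example (HttpURLConnection has no constant for 429), and the PR itself may well end up using RestStatus instead. One thing worth noting: in Java, a case label written as `408 | 429 | 500 | 502 | 503 | 504` is a single constant expression (a bitwise OR, which evaluates to 511), not a list of alternatives, so the sketch separates the labels with commas.

```java
import java.net.HttpURLConnection;

// Hypothetical helper class, not taken from the PR.
final class RetryableStatusCodes {
    // Assumption: declared locally because HttpURLConnection defines no 429 constant.
    static final int HTTP_TOO_MANY_REQUESTS = 429;

    private RetryableStatusCodes() {}

    // Returns true for the status codes listed in the snippet under review.
    static boolean isRetryErrCode(int code) {
        return switch (code) {
            case HttpURLConnection.HTTP_CLIENT_TIMEOUT,  // 408
                 HTTP_TOO_MANY_REQUESTS,                 // 429
                 HttpURLConnection.HTTP_INTERNAL_ERROR,  // 500
                 HttpURLConnection.HTTP_BAD_GATEWAY,     // 502
                 HttpURLConnection.HTTP_UNAVAILABLE,     // 503
                 HttpURLConnection.HTTP_GATEWAY_TIMEOUT  // 504
                 -> true;
            default -> false;
        };
    }
}
```

With comma-separated labels each listed code matches individually; whether this exact set is the right one to retry is still the open documentation question raised above.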

if (failedItems.isEmpty() == false) {
    final var retryBlobId = failedItems.getLast().blobId;
    try {
        client().deleteBlob(retryBlobId);
Contributor

Two questions:

  1. I assume this blocks?
  2. Does storage.delete(blobId) use an exponential backoff retry strategy?
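For readers less familiar with the term in the second question, here is a minimal sketch of what a capped exponential backoff schedule computes. It says nothing about how storage.delete actually behaves, which is exactly what the question asks; the class BackoffSketch, the method delayMillis, and its parameters are hypothetical names chosen for this example.

```java
// Hypothetical illustration of capped exponential backoff, not the GCS client's code.
final class BackoffSketch {
    private BackoffSketch() {}

    // Delay before the given retry attempt (0-based): the base delay doubles
    // on each attempt and never exceeds maxMillis.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }
}
```

With a base of 100 ms and a cap of 1000 ms, attempts 0 through 3 wait 100, 200, 400 and 800 ms, and every later attempt waits 1000 ms.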

        // remaining items go the next bulk
        failedItems.removeLast();
    } catch (StorageException e) {
        throw new IOException(
Contributor

Can we also log the other elements in failedItems? Otherwise they would fail quietly.
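One possible shape for this, as a rough sketch based only on the fragments quoted in this thread: retry the last failed item on its own, carry the rest over to the next bulk request, and, if the single retry fails, name the items that were still pending in the exception message. SingleDeleter, retryLastAndCarryOver, and the record layout of DeleteFailure are hypothetical stand-ins, not the PR's actual types; real code would catch StorageException and might log instead of (or in addition to) extending the message.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PR's implementation.
final class BulkDeleteRetrySketch {
    // Assumed shape of a retryable per-item failure from a bulk response.
    record DeleteFailure(String blobId, int code) {}

    // Assumed stand-in for the storage client's blocking single-blob delete.
    interface SingleDeleter {
        void deleteBlob(String blobId) throws Exception;
    }

    private BulkDeleteRetrySketch() {}

    // Retries the last failed item individually and returns the blob IDs
    // that should go into the next bulk request.
    static List<String> retryLastAndCarryOver(List<DeleteFailure> failedItems, SingleDeleter client)
        throws IOException {
        if (failedItems.isEmpty()) {
            return List.of();
        }
        final DeleteFailure last = failedItems.get(failedItems.size() - 1);
        final List<String> remaining = new ArrayList<>();
        for (DeleteFailure failure : failedItems.subList(0, failedItems.size() - 1)) {
            remaining.add(failure.blobId());
        }
        try {
            client.deleteBlob(last.blobId());
        } catch (Exception e) {
            // Report the pending items too, so they do not fail quietly.
            throw new IOException(
                "Failed to delete blobId=" + last.blobId() + " (code " + last.code()
                    + "); also not yet deleted: " + remaining,
                e
            );
        }
        return remaining;
    }
}
```

Including the pending blob IDs in the message keeps the failure visible at the call site without a separate logging dependency; for very large bulks the list would probably need truncating.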

    final var retryBlobId = failedItems.getLast().blobId;
    try {
        client().deleteBlob(retryBlobId);
        // remaining items go the next bulk
Contributor

Suggested change
- // remaining items go the next bulk
+ // remaining items go into the next bulk

if (isRetryErrCode(errCode)) {
    failedItems.add(new DeleteFailure(deleteResult.blobId, e.getCode()));
} else {
    throw new IOException("Failed to process bulk delete, non-retryable error for blobId=" + deleteResult.blobId, e);
Contributor

General question: If we fail to delete N blobs, are these subsequently cleaned up?

Contributor

@DaveCTurner DaveCTurner left a comment

Looks sensible to me, but needs supporting changes in GoogleCloudStorageHttpHandler to exercise the retries properly.
