
Increase tenacity of S3RetryingInputStream #87243

Closed
DaveCTurner opened this issue May 31, 2022 · 1 comment · Fixed by #88015
Labels: >bug, :Distributed/Snapshot/Restore, Team:Distributed

Comments

@DaveCTurner (Contributor)

The S3RetryingInputStream hides cases where S3 closes a connection partway through downloading a blob. By default it retries 3 times before failing the download. However, larger blobs tend to encounter more failures, and 3 retries are often not enough to complete a multi-GB download when S3 is suffering from a cluster of failures, as sometimes happens. Typically we make 10s-to-100s of MBs of progress between failures even in this state.
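
For reference, here is a minimal sketch of how that retry accounting behaves (illustrative names only, not the actual S3RetryingInputStream code): every failed attempt counts toward the limit, no matter how far the download got.

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch only: every failure is charged against maxRetries, regardless of
// how much of the blob the failed attempt managed to download.
abstract class FixedRetryInputStream extends InputStream {
    private final int maxRetries; // 3 by default, per the description above
    private long offset;          // bytes successfully read so far
    private int failures;         // failed attempts so far
    private InputStream current;

    FixedRetryInputStream(int maxRetries) throws IOException {
        this.maxRetries = maxRetries;
        this.current = openAtOffset(0);
    }

    // Reopen the blob at the given offset, e.g. via an HTTP Range request.
    protected abstract InputStream openAtOffset(long offset) throws IOException;

    @Override
    public int read() throws IOException {
        while (true) {
            try {
                final int b = current.read();
                if (b != -1) {
                    offset++;
                }
                return b;
            } catch (IOException e) {
                if (++failures > maxRetries) {
                    throw e; // give up, even after GBs of cumulative progress
                }
                current = openAtOffset(offset); // resume where we left off
            }
        }
    }
}
```

On a multi-GB blob this means a cluster of just four connection resets aborts the download, even though each failed attempt may have advanced by 10s-to-100s of MBs.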

I think we should increase the tenacity of S3RetryingInputStream when downloading larger blobs. For instance, we could stop counting a partial download towards the retry limit if it makes significant progress before failing.

DaveCTurner added the >bug and :Distributed/Snapshot/Restore labels May 31, 2022
elasticmachine added the Team:Distributed label May 31, 2022
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 24, 2022
S3 sometimes enters a state where blob downloads repeatedly fail but
with nontrivial progress between failures. Often each attempt yields 10s
or 100s of MBs of data. Today we abort a download after three (by
default) such failures, but this may not be enough to completely
retrieve a large blob during one of these flaky patches.

With this commit we stop counting download attempts that retrieved at
least 1% of the configured `buffer_size` (typically 1MB) towards the
maximum number of retries.

Closes elastic#87243
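
A sketch of that rule (names here are assumptions for illustration, not the exact implementation): a failed attempt is charged against the retry limit only when it advanced by less than 1% of `buffer_size`.

```java
import java.io.IOException;

// Illustrative sketch of the new accounting; names are assumed, not the
// exact implementation. A failed attempt counts toward the retry limit only
// if it advanced by less than 1% of the configured buffer_size.
class ProgressAwareRetries {
    private final long meaningfulProgressBytes; // buffer_size / 100, typically 1MB
    private final int maxRetries;
    private int failuresWithoutProgress;        // only low-progress failures count

    ProgressAwareRetries(long bufferSizeBytes, int maxRetries) {
        this.meaningfulProgressBytes = Math.max(1L, bufferSizeBytes / 100);
        this.maxRetries = maxRetries;
    }

    // Called when an attempt fails after reading bytesThisAttempt bytes;
    // rethrows once too many attempts fail without meaningful progress.
    void onFailure(long bytesThisAttempt, IOException failure) throws IOException {
        if (bytesThisAttempt < meaningfulProgressBytes && ++failuresWithoutProgress > maxRetries) {
            throw failure; // only repeated no-progress failures abort the download
        }
        // otherwise retry: attempts that keep advancing never exhaust the
        // limit, so a multi-GB download can survive a long flaky patch
    }
}
```

Under this accounting, a download that keeps making the 10s-to-100s of MBs of progress described above never exhausts its retries, however many times the connection drops.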
elasticsearchmachine pushed a commit that referenced this issue Jun 30, 2022