
Increase tenacity of S3RetryingInputStream #87243

Closed
DaveCTurner opened this issue May 31, 2022 · 1 comment · Fixed by #88015
Labels: >bug, :Distributed/Snapshot/Restore, Team:Distributed

Comments

@DaveCTurner (Contributor)

The S3RetryingInputStream hides cases where S3 closes a connection partway through downloading a blob. By default it retries 3 times before failing the download. However, larger blobs tend to encounter more failures, and 3 retries are often not enough to complete a multi-GB download when S3 is suffering from a cluster of failures, as sometimes happens. Typically we make 10s-to-100s of MBs of progress between failures even in this state.
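
For reference, here is a minimal sketch of how that retry accounting behaves (illustrative names only, not the actual S3RetryingInputStream code): every failed attempt counts toward the limit, no matter how far the download got.

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch only: every failure is charged against maxRetries, regardless of
// how much of the blob the failed attempt managed to download.
abstract class FixedRetryInputStream extends InputStream {
    private final int maxRetries; // 3 by default, per the description above
    private long offset;          // bytes successfully read so far
    private int failures;         // failed attempts so far
    private InputStream current;

    FixedRetryInputStream(int maxRetries) throws IOException {
        this.maxRetries = maxRetries;
        this.current = openAtOffset(0);
    }

    // Reopen the blob at the given offset, e.g. via an HTTP Range request.
    protected abstract InputStream openAtOffset(long offset) throws IOException;

    @Override
    public int read() throws IOException {
        while (true) {
            try {
                final int b = current.read();
                if (b != -1) {
                    offset++;
                }
                return b;
            } catch (IOException e) {
                if (++failures > maxRetries) {
                    throw e; // give up, even after GBs of cumulative progress
                }
                current = openAtOffset(offset); // resume where we left off
            }
        }
    }
}
```

On a multi-GB blob this means a cluster of just four connection resets aborts the download, even though each failed attempt may have advanced by 10s-to-100s of MBs.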

I think we should increase the tenacity of S3RetryingInputStream when downloading larger blobs. For instance, we could stop counting a partial download towards the retry limit if it makes significant progress before failing.

DaveCTurner added the >bug and :Distributed/Snapshot/Restore labels May 31, 2022
elasticmachine added the Team:Distributed label May 31, 2022
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 24, 2022
S3 sometimes enters a state where blob downloads repeatedly fail but
with nontrivial progress between failures. Often each attempt yields 10s
or 100s of MBs of data. Today we abort a download after three (by
default) such failures, but this may not be enough to completely
retrieve a large blob during one of these flaky patches.

With this commit we stop counting download attempts that retrieved at
least 1% of the configured `buffer_size` (typically 1MB) towards the
maximum number of retries.

Closes elastic#87243
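
A sketch of that rule (names here are assumptions for illustration, not the exact implementation): a failed attempt is charged against the retry limit only when it advanced by less than 1% of `buffer_size`.

```java
import java.io.IOException;

// Illustrative sketch of the new accounting; names are assumed, not the
// exact implementation. A failed attempt counts toward the retry limit only
// if it advanced by less than 1% of the configured buffer_size.
class ProgressAwareRetries {
    private final long meaningfulProgressBytes; // buffer_size / 100, typically 1MB
    private final int maxRetries;
    private int failuresWithoutProgress;        // only low-progress failures count

    ProgressAwareRetries(long bufferSizeBytes, int maxRetries) {
        this.meaningfulProgressBytes = Math.max(1L, bufferSizeBytes / 100);
        this.maxRetries = maxRetries;
    }

    // Called when an attempt fails after reading bytesThisAttempt bytes;
    // rethrows once too many attempts fail without meaningful progress.
    void onFailure(long bytesThisAttempt, IOException failure) throws IOException {
        if (bytesThisAttempt < meaningfulProgressBytes && ++failuresWithoutProgress > maxRetries) {
            throw failure; // only repeated no-progress failures abort the download
        }
        // otherwise retry: attempts that keep advancing never exhaust the
        // limit, so a multi-GB download can survive a long flaky patch
    }
}
```

Under this accounting, a download that keeps making the 10s-to-100s of MBs of progress described above never exhausts its retries, however many times the connection drops.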
elasticsearchmachine pushed a commit that referenced this issue Jun 30, 2022