
Implement retries in S3 requests #41

Closed

tgeoghegan opened this issue Oct 6, 2020 · 3 comments · Fixed by #597
Labels
api-client-retries (Our Rust cloud API clients need better retry logic) · enhancement (New feature or request) · good first issue (Good for newcomers)

Comments

@tgeoghegan
Contributor

As a workaround for a mismatch between connection pool timeouts in Hyper and AWS S3, #36 constructs HTTP clients with a carefully chosen timeout value. For more robustness, we should implement retries.
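For context, here is a minimal sketch of the kind of client construction #36 performs, assuming hyper's 0.14-style `client::Builder` API; the 10-second idle timeout comes from the commit message further down, and the names here are illustrative rather than the project's actual code:

```rust
use std::time::Duration;

use hyper::{client::HttpConnector, Body, Client};

fn main() {
    // Keep pooled connections idle for less time than S3's server-side
    // idle timeout (~20 seconds per the commit message below), so Hyper
    // never hands us a connection that S3 has already closed.
    let _client: Client<HttpConnector, Body> = Client::builder()
        .pool_idle_timeout(Duration::from_secs(10))
        .build_http();
}
```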

@yuriks
Contributor

yuriks commented Oct 8, 2020

@yuriks added this to the Production readiness milestone Oct 13, 2020
@tgeoghegan added the p1 (Must be fixed for the corresponding milestone to be reached) label, and added and then removed the internal (Improvements to project infrastructure or code that don't directly affect the end result) label, Oct 19, 2020
tgeoghegan added a commit that referenced this issue Dec 3, 2020
During Narnia testing, we encountered cases where we failed to upload
validation batches to S3 because the ingestion batches were _just_ big
enough so that roughly ten seconds would elapse between the
CreateMultipartUpload call and the first UploadPart. We had previously
set the Hyper connection pool idle timeout to 10 seconds to avoid having
S3 close connections on us after 20 seconds, so sometimes we would lose
the race with Hyper and have uploads spuriously fail. Our fix is to make
up to three attempts to upload. This is far from a perfect solution to
problem #41, because (1) we should only retry on errors that are known
to be retryable, and (2) we should use exponential backoff.
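As a rough illustration of the stopgap described above, and of why it falls short, here is a hedged sketch of a bounded retry loop in plain Rust. `with_retries` and `UploadError` are hypothetical names, not the project's actual API:

```rust
use std::{thread, time::Duration};

/// Hypothetical error type standing in for the real S3 client error.
#[derive(Debug)]
struct UploadError(String);

/// Run `op` up to `max_attempts` times, sleeping briefly between tries.
/// A better version would retry only errors known to be retryable and
/// back off exponentially, as the commit message notes.
fn with_retries<T>(
    max_attempts: usize,
    mut op: impl FnMut() -> Result<T, UploadError>,
) -> Result<T, UploadError> {
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) => {
                last_err = Some(e);
                if attempt < max_attempts {
                    thread::sleep(Duration::from_secs(1));
                }
            }
        }
    }
    Err(last_err.expect("max_attempts must be at least 1"))
}

fn main() {
    let mut calls = 0;
    // Simulate an upload that fails once (e.g., Hyper hands us a
    // connection S3 already closed) and then succeeds on retry.
    let result = with_retries(3, || {
        calls += 1;
        if calls == 1 {
            Err(UploadError("connection closed by peer".into()))
        } else {
            Ok(())
        }
    });
    assert!(result.is_ok());
}
```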
@tgeoghegan
Contributor Author

tgeoghegan commented Dec 3, 2020

#260 is progress towards this, but I'm leaving this open because that PR only handles HTTP dispatch errors. We should have more comprehensive error handling as described in Amazon docs, and should also do exponential backoff. That work is not a P1 for production, though, so downgrading this ticket.
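To make the remaining work more concrete, here is a hedged sketch of the two missing pieces: deciding which failures are retryable (roughly following the AWS guidance referenced here) and computing exponential backoff delays. The outcome enum and thresholds are illustrative assumptions, not the project's real error model:

```rust
use std::time::Duration;

/// Hypothetical classification of an S3 request outcome, not the
/// project's real error type.
enum Outcome {
    Ok,
    Status(u16),       // HTTP status returned by S3
    Dispatch(String),  // transport-level failure (the case #260 retries)
}

/// Retry throttling (429), server errors (5xx), and dispatch errors;
/// give up on other client errors such as 403 or 404.
fn is_retryable(outcome: &Outcome) -> bool {
    match outcome {
        Outcome::Ok => false,
        Outcome::Status(429) => true,
        Outcome::Status(code) => *code >= 500,
        Outcome::Dispatch(_) => true,
    }
}

/// Exponential backoff: 1s, 2s, 4s, ... capped at `max`. A production
/// version would also add jitter, as the AWS guidance recommends.
fn backoff_delay(attempt: u32, max: Duration) -> Duration {
    Duration::from_secs(1)
        .checked_mul(2u32.saturating_pow(attempt))
        .unwrap_or(max)
        .min(max)
}

fn main() {
    assert!(is_retryable(&Outcome::Status(503)));
    assert!(!is_retryable(&Outcome::Status(404)));
    assert_eq!(backoff_delay(3, Duration::from_secs(30)), Duration::from_secs(8));
}
```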

@tgeoghegan removed the p1 (Must be fixed for the corresponding milestone to be reached) label Dec 3, 2020
@tgeoghegan removed this from the Production readiness milestone Dec 3, 2020
@tgeoghegan added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Dec 11, 2020
@tgeoghegan
Contributor Author

More good material on AWS retry best practices: rusoto/rusoto#234 (comment)

@tgeoghegan added the api-client-retries (Our Rust cloud API clients need better retry logic) label Feb 4, 2021
@tgeoghegan added this to the Spring 2021 reliability milestone Apr 5, 2021
tgeoghegan added a commit that referenced this issue Apr 21, 2021
Adapts `aws_credentials::retry_request` to use mod `retries` (and hence
`backoff::ExponentialBackoff`) so that we get more graceful retry
behavior, in line with AWS developer guidelines. Additionally, our
assessment of which errors are retryable is more sophisticated: we now
retry on all credential fetch failures, on server errors (i.e., HTTP
5xx), and when throttled.

Resolves #41, #354
tgeoghegan added a commit that referenced this issue Apr 26, 2021
Adapts `aws_credentials::retry_request` to use mod `retries` (and hence
`backoff::ExponentialBackoff`) so that we get more graceful retry
behavior, in line with AWS developer guidelines. Additionally, our
assessment of which errors are retryable is more sophisticated: we now
retry on all credential fetch failures, on server errors (i.e., HTTP
5xx), and when throttled.

Resolves #41, #354
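For reference, here is a minimal sketch of what routing a request through `backoff::ExponentialBackoff` looks like, assuming the backoff crate's 0.4-style `retry` and `Error::transient` API; the project's actual mod `retries` wrapper and its error classification are not reproduced here:

```rust
use backoff::{retry, Error, ExponentialBackoff};

fn main() {
    let mut attempts = 0;
    // Transient errors are retried with exponentially increasing delays
    // (plus jitter) until the default policy's elapsed-time limit is hit.
    let result: Result<&str, Error<&str>> = retry(ExponentialBackoff::default(), || {
        attempts += 1;
        if attempts < 3 {
            // A retryable failure, e.g. throttling or an HTTP 5xx.
            Err(Error::transient("throttled"))
        } else {
            Ok("uploaded")
        }
    });
    println!("{:?} after {} attempts", result, attempts);
}
```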