
Implement retries in S3 requests #41

Closed

tgeoghegan opened this issue Oct 6, 2020 · 3 comments · Fixed by #597
Labels
api-client-retries (Our Rust cloud API clients need better retry logic) · enhancement (New feature or request) · good first issue (Good for newcomers)

Comments

@tgeoghegan
Contributor

As a workaround for a mismatch between connection pool timeouts in Hyper and AWS S3, #36 constructs HTTP clients with a carefully chosen timeout value. For more robustness, we should implement retries.
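For context, here is a minimal sketch of the kind of client construction #36 performs, assuming hyper's 0.14-style `client::Builder` API; the 10-second idle timeout comes from the commit message further down, and the names here are illustrative rather than the project's actual code:

```rust
use std::time::Duration;

use hyper::{client::HttpConnector, Body, Client};

fn main() {
    // Keep pooled connections idle for less time than S3's server-side
    // idle timeout (~20 seconds per the commit message below), so Hyper
    // never hands us a connection that S3 has already closed.
    let _client: Client<HttpConnector, Body> = Client::builder()
        .pool_idle_timeout(Duration::from_secs(10))
        .build_http();
}
```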

@yuriks
Contributor

yuriks commented Oct 8, 2020

@yuriks added this to the Production readiness milestone Oct 13, 2020
@tgeoghegan added the p1 (Must be fixed for the corresponding milestone to be reached) label, and added and then removed the internal (Improvements to project infrastructure or code that don't directly affect the end result) label, Oct 19, 2020
tgeoghegan added a commit that referenced this issue Dec 3, 2020
During Narnia testing, we encountered cases where we failed to upload
validation batches to S3 because the ingestion batches were _just_ big
enough so that roughly ten seconds would elapse between the
CreateMultipartUpload call and the first UploadPart. We had previously
set the Hyper connection pool idle timeout to 10 seconds to avoid having
S3 close connections on us after 20 seconds, so sometimes we would lose
the race with Hyper and have uploads spuriously fail. Our fix is to make
up to three attempts to upload. This is far from a perfect solution to
problem #41, because (1) we should only retry on errors that are known
to be retryable, and (2) we should use exponential backoff.
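As a rough illustration of the stopgap described above, and of why it falls short, here is a hedged sketch of a bounded retry loop in plain Rust. `with_retries` and `UploadError` are hypothetical names, not the project's actual API:

```rust
use std::{thread, time::Duration};

/// Hypothetical error type standing in for the real S3 client error.
#[derive(Debug)]
struct UploadError(String);

/// Run `op` up to `max_attempts` times, sleeping briefly between tries.
/// A better version would retry only errors known to be retryable and
/// back off exponentially, as the commit message notes.
fn with_retries<T>(
    max_attempts: usize,
    mut op: impl FnMut() -> Result<T, UploadError>,
) -> Result<T, UploadError> {
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) => {
                last_err = Some(e);
                if attempt < max_attempts {
                    thread::sleep(Duration::from_secs(1));
                }
            }
        }
    }
    Err(last_err.expect("max_attempts must be at least 1"))
}

fn main() {
    let mut calls = 0;
    // Simulate an upload that fails once (e.g., Hyper hands us a
    // connection S3 already closed) and then succeeds on retry.
    let result = with_retries(3, || {
        calls += 1;
        if calls == 1 {
            Err(UploadError("connection closed by peer".into()))
        } else {
            Ok(())
        }
    });
    assert!(result.is_ok());
}
```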
@tgeoghegan
Contributor Author

tgeoghegan commented Dec 3, 2020

#260 is progress towards this, but I'm leaving this open because that PR only handles HTTP dispatch errors. We should have more comprehensive error handling as described in Amazon docs, and should also do exponential backoff. That work is not a P1 for production, though, so downgrading this ticket.
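To make the remaining work more concrete, here is a hedged sketch of the two missing pieces: deciding which failures are retryable (roughly following the AWS guidance referenced here) and computing exponential backoff delays. The outcome enum and thresholds are illustrative assumptions, not the project's real error model:

```rust
use std::time::Duration;

/// Hypothetical classification of an S3 request outcome, not the
/// project's real error type.
enum Outcome {
    Ok,
    Status(u16),       // HTTP status returned by S3
    Dispatch(String),  // transport-level failure (the case #260 retries)
}

/// Retry throttling (429), server errors (5xx), and dispatch errors;
/// give up on other client errors such as 403 or 404.
fn is_retryable(outcome: &Outcome) -> bool {
    match outcome {
        Outcome::Ok => false,
        Outcome::Status(429) => true,
        Outcome::Status(code) => *code >= 500,
        Outcome::Dispatch(_) => true,
    }
}

/// Exponential backoff: 1s, 2s, 4s, ... capped at `max`. A production
/// version would also add jitter, as the AWS guidance recommends.
fn backoff_delay(attempt: u32, max: Duration) -> Duration {
    Duration::from_secs(1)
        .checked_mul(2u32.saturating_pow(attempt))
        .unwrap_or(max)
        .min(max)
}

fn main() {
    assert!(is_retryable(&Outcome::Status(503)));
    assert!(!is_retryable(&Outcome::Status(404)));
    assert_eq!(backoff_delay(3, Duration::from_secs(30)), Duration::from_secs(8));
}
```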

@tgeoghegan removed the p1 (Must be fixed for the corresponding milestone to be reached) label Dec 3, 2020
@tgeoghegan removed this from the Production readiness milestone Dec 3, 2020
@tgeoghegan added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Dec 11, 2020
@tgeoghegan
Contributor Author

More good material on AWS retry best practices: rusoto/rusoto#234 (comment)

@tgeoghegan added the api-client-retries (Our Rust cloud API clients need better retry logic) label Feb 4, 2021
@tgeoghegan added this to the Spring 2021 reliability milestone Apr 5, 2021
tgeoghegan added a commit that referenced this issue Apr 21, 2021
Adapts `aws_credentials::retry_request` to use mod `retries` (and hence
`backoff::ExponentialBackoff`) so that we get more graceful retry
behavior, in line with AWS developer guidelines. Additionally, our
assessment of which errors are retryable is more sophisticated: we now
retry on all credential fetch failures, on server errors (i.e., HTTP
5xx), and when throttled.

Resolves #41, #354
tgeoghegan added a commit that referenced this issue Apr 26, 2021
Adapts `aws_credentials::retry_request` to use mod `retries` (and hence
`backoff::ExponentialBackoff`) so that we get more graceful retry
behavior, in line with AWS developer guidelines. Additionally, our
assessment of which errors are retryable is more sophisticated: we now
retry on all credential fetch failures, on server errors (i.e., HTTP
5xx), and when throttled.

Resolves #41, #354
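For reference, here is a minimal sketch of what routing a request through `backoff::ExponentialBackoff` looks like, assuming the backoff crate's 0.4-style `retry` and `Error::transient` API; the project's actual mod `retries` wrapper and its error classification are not reproduced here:

```rust
use backoff::{retry, Error, ExponentialBackoff};

fn main() {
    let mut attempts = 0;
    // Transient errors are retried with exponentially increasing delays
    // (plus jitter) until the default policy's elapsed-time limit is hit.
    let result: Result<&str, Error<&str>> = retry(ExponentialBackoff::default(), || {
        attempts += 1;
        if attempts < 3 {
            // A retryable failure, e.g. throttling or an HTTP 5xx.
            Err(Error::transient("throttled"))
        } else {
            Ok("uploaded")
        }
    });
    println!("{:?} after {} attempts", result, attempts);
}
```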