
Add support for retries if request fails #708

Closed
starred-gijs opened this issue Jul 1, 2020 · 11 comments · Fixed by #726 or #824

@starred-gijs

Hi, I started using this package instead of the official AWS SDK over a week ago, and since then I have been having connection issues while sending new messages to SQS.

I run a serverless setup ("SQS => Lambda trigger") and execute Symfony Messenger with bref/symfony-messenger.
During the handling of those messages, new messages are pushed to SQS.
Very occasionally this fails, but I don't understand why. The only difference with the AWS SDK I can think of is that the SDK implements a retry scheme.

Does anyone else experience these occasional failures? Some of the error messages I extracted:

So I was considering: would it be possible/wise/useful to create a Retry middleware of some sort that would handle these failures and retry if appropriate?
The HttpClient should have some support for middlewares according to symfony/symfony#36779.
The retry mechanism could mimic e.g. https://github.com/aws/aws-sdk-php/blob/master/src/RetryMiddleware.php
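A very rough illustration of that idea, as a decorator around any Symfony HttpClient (everything here is hypothetical: the class name, retry count, and backoff values are made up, and calling getStatusCode() forces the response to complete, so this sketch gives up the lazy/async behavior):

```php
<?php

use Symfony\Contracts\HttpClient\Exception\TransportExceptionInterface;
use Symfony\Contracts\HttpClient\HttpClientInterface;
use Symfony\Contracts\HttpClient\ResponseInterface;
use Symfony\Contracts\HttpClient\ResponseStreamInterface;

final class NaiveRetryHttpClient implements HttpClientInterface
{
    private $inner;
    private $maxRetries;

    public function __construct(HttpClientInterface $inner, int $maxRetries = 3)
    {
        $this->inner = $inner;
        $this->maxRetries = $maxRetries;
    }

    public function request(string $method, string $url, array $options = []): ResponseInterface
    {
        for ($attempt = 0; ; ++$attempt) {
            try {
                $response = $this->inner->request($method, $url, $options);
                // Responses are lazy: asking for the status code forces the
                // request to complete, surfacing transport errors here.
                if ($response->getStatusCode() < 500 || $attempt >= $this->maxRetries) {
                    return $response;
                }
            } catch (TransportExceptionInterface $e) {
                if ($attempt >= $this->maxRetries) {
                    throw $e;
                }
            }

            // Exponential backoff: 100ms, 200ms, 400ms, ...
            usleep(100000 * (2 ** $attempt));
        }
    }

    public function stream($responses, float $timeout = null): ResponseStreamInterface
    {
        return $this->inner->stream($responses, $timeout);
    }
}
```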

@jderusse
Member

jderusse commented Jul 1, 2020

I didn't implement retries at first because I thought AWS was stable enough not to need them.

But if you encounter too many errors and are impacted by that, it can be worth it. Moreover, Symfony provides primitives to do it in a clean way.

Ready for a PR?

@bohanyang

@jderusse

I didn't implement retries at first because I thought AWS was stable enough not to need them.

Yes, but every service MAY go down.
Moreover, using DynamoDB with low read/write capacity REQUIRES a retry algorithm for practical use.

But if you encounter too many errors and are impacted by that, it can be worth it. Moreover, Symfony provides primitives to do it in a clean way.

Do you mean making use of the new AsyncDecoratorTrait?
I'm really curious about

  1. How we can implement retry support with AsyncDecoratorTrait (a rough sketch follows below)
  2. Whether it's possible to do retries using just Symfony 5.1 code (without AsyncDecoratorTrait)

Sorry if what I said made no sense...haha
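For the first question, a rough, untested sketch of what the AsyncDecoratorTrait approach could look like (the trait landed in symfony/http-client 5.2; the class name and retry limit are made up, and network-level errors are not handled here):

```php
<?php

use Symfony\Component\HttpClient\AsyncDecoratorTrait;
use Symfony\Component\HttpClient\Response\AsyncContext;
use Symfony\Component\HttpClient\Response\AsyncResponse;
use Symfony\Contracts\HttpClient\ChunkInterface;
use Symfony\Contracts\HttpClient\HttpClientInterface;
use Symfony\Contracts\HttpClient\ResponseInterface;

final class RetryOn5xxHttpClient implements HttpClientInterface
{
    use AsyncDecoratorTrait;

    public function request(string $method, string $url, array $options = []): ResponseInterface
    {
        $retries = 0;

        $passthru = function (ChunkInterface $chunk, AsyncContext $context) use (&$retries, $method, $url, $options) {
            if ($chunk->isFirst() && 500 <= $context->getStatusCode() && $retries < 3) {
                ++$retries;
                $context->getResponse()->cancel();                 // drop the failed response
                $context->replaceRequest($method, $url, $options); // and replay the request

                return;
            }

            yield $chunk; // anything else is forwarded to the caller as-is
        };

        return new AsyncResponse($this->client, $method, $url, $options, $passthru);
    }
}
```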

@starred-gijs
Author

starred-gijs commented Jul 2, 2020

But if you encounter too many errors and are impacted by that, it can be worth it.

When those errors occurred, I swapped the CurlHttpClient for the NativeHttpClient, because I wanted to check if that made a difference. It did not. Yesterday I swapped it back, and I have not received any errors since. I will keep monitoring.

Ready for a PR?

TBH, I will not have time in the coming days/weeks...

@starred-gijs
Author

starred-gijs commented Jul 17, 2020

I got in touch with AWS support about the connectivity issues, and they replied:

A very small % of requests are expected to timeout due to transient network issues. We recommend retrying these requests and use a lower connection timeout so that the clients can fail fast and retry. While AWS SDKs do have a retry logic builtin, sometimes the default retry logic may not be sufficient. You can always tweak the retry logic for the client to make the function code more resilient to these transient failures. Below is a good read on this topic.

https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-retry-timeout-sdk/

AWS X-Ray tracing on Lambda function can help in confirming if any particular segments of code are causing these timeouts. For any timeout of calls over network to downstream services, please retry these calls.

https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html

I'm now in doubt whether I should attempt to implement the retry mechanism (time-consuming, uncertain result) or revert to the SDK (quick change, predictable result)...

In the long run, this package should implement a retry/failure mechanism if it wants to match the resiliency of the SDK?
@jderusse @Nyholm what are your thoughts on this?
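To make AWS's "fail fast" advice concrete: AsyncAws clients accept a pre-configured HttpClient, so lowering the timeouts is already possible today (the values below are arbitrary examples, not recommendations):

```php
<?php

use AsyncAws\Sqs\SqsClient;
use Symfony\Component\HttpClient\HttpClient;

// Fail fast on transient network hangs, per AWS support's advice.
$httpClient = HttpClient::create([
    'timeout' => 2.0,       // idle timeout, in seconds
    'max_duration' => 10.0, // hard cap on the whole request
]);

$sqs = new SqsClient(['region' => 'eu-west-1'], null, $httpClient);
```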

@dcsg
Contributor

dcsg commented Oct 9, 2020

I believe this issue should be re-opened. I'm currently reverting to the AWS SDK due to the lack of retry support like the AWS SDK has, especially in the session handler, where retries are critical; otherwise users will have a poor experience when it fails. And let me tell you, it fails sometimes. The solution that was merged is not enough and is directly coupled to Symfony, and Symfony 5.2 has not been released yet. I would really love to continue using this lib, but in its current state it is not usable.

@starred-gijs
Author

FYI:
We have been running the RetryableHttpClient in a production environment with this package for 2 weeks now, and the number of errors reported went down, so I'm really happy with that! 👍
Still seeing some errors like

Idle timeout reached for "https://sqs.eu-west-1.amazonaws.com/".

But we only use the SQS API client on Lambda (with retry on SQS/Lambda itself), with limited business impact, so no big deal.

Unfortunately, the difference in resilience stops us from moving away from the heavy official AWS SDK to these clients.
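For anyone wanting to reproduce that setup, the wiring is roughly the following (Symfony 5.2's RetryableHttpClient with its default strategy of 3 retries; the region is just an example):

```php
<?php

use AsyncAws\Sqs\SqsClient;
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\HttpClient\RetryableHttpClient;

// Wrap the transport in RetryableHttpClient and hand it to AsyncAws.
$httpClient = new RetryableHttpClient(HttpClient::create());

$sqs = new SqsClient(['region' => 'eu-west-1'], null, $httpClient);
```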

@jderusse
Member

jderusse commented Oct 9, 2020

Still seeing some errors like Idle timeout reached for

Hmm, it's weird those errors are not retried. It's probably a bug.

the difference in resilience stops us from moving away

Apart from the bug reported above, what kind of improvements can we make to fix that?

@starred-gijs
Author

Hmm, it's weird those errors are not retried. It's probably a bug.

Maybe this trace from Sentry will help. I'm more than happy to track down whether this is a bug or not. Contact me directly.
[Sentry screenshot of the error trace]

I ran into two 'issues' earlier: DynamoDB marshalling, and the DynamoDB capacity limits mentioned above.
I'm not sure if/how the DynamoDB capacity issue is fixed by the RetryableHttpClient; I haven't checked yet. And marshalling is not that big an issue either; I just didn't have the time to look into it at the time.

@jderusse
Member

This PR symfony/symfony#38426 will change the way requests are retried, to only retry idempotent methods. The issue is: Async-AWS always uses POST (even when the operation is safe), meaning the following cases will not be retried anymore:

  • status codes 500, 504, 507, 510
  • an exception thrown after the request is sent (network issue)
  • an idle timeout after the request is sent

@starred-gijs Did you keep a dump of the response or exception/error before you switched to the RetryableHttpClient?

Possible solutions:

  • ignore the default behavior of Symfony and treat POST as idempotent (we accept that an SQS message or email may be sent multiple times)
  • add a parameter in manifest.json to list the operations that are idempotent
  • do nothing; let people inject an HttpClient that retries even for POST (add documentation for this; see the sketch after this list)
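For the last option, the injected client could look roughly like this (a sketch: listing a status code as a plain integer in GenericRetryStrategy, instead of `code => [methods]`, removes the idempotent-method restriction; the status-code list below mirrors Symfony's defaults):

```php
<?php

use AsyncAws\Sqs\SqsClient;
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\HttpClient\Retry\GenericRetryStrategy;
use Symfony\Component\HttpClient\RetryableHttpClient;

// Retry these status codes for every HTTP method, POST included.
$strategy = new GenericRetryStrategy([423, 425, 429, 500, 502, 503, 504, 507, 510]);

$httpClient = new RetryableHttpClient(HttpClient::create(), $strategy);

$sqs = new SqsClient(['region' => 'eu-west-1'], null, $httpClient);
```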

jderusse reopened this Oct 14, 2020
@starred-gijs
Author

Sorry, no dumps stored. I did search in Sentry:

[Two Sentry screenshots of the captured errors]

I hope this helps a bit?

@jderusse
Member

Yes, it helps a lot: we now know those are 500 and 503 errors.
503 will be retried, but 500 will not be.

We can trust AWS and override the Symfony strategy to retry 500 errors on POST methods. (A sketch follows below.)
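That override could be as small as the following (a sketch; the class name is hypothetical, and GenericRetryStrategy ships with Symfony 5.2):

```php
<?php

use Symfony\Component\HttpClient\Response\AsyncContext;
use Symfony\Component\HttpClient\Retry\GenericRetryStrategy;
use Symfony\Contracts\HttpClient\Exception\TransportExceptionInterface;

final class AwsRetryStrategy extends GenericRetryStrategy
{
    public function shouldRetry(AsyncContext $context, ?string $responseContent, ?TransportExceptionInterface $exception): ?bool
    {
        if (500 === $context->getStatusCode()) {
            return true; // trust AWS: retry a 500 even on POST
        }

        return parent::shouldRetry($context, $responseContent, $exception);
    }
}
```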
