Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: When consumer fails to connect to SQS due to connectivity instant retry occurs #490

Open
fraserbenjamin opened this issue Apr 29, 2024 · 2 comments

Comments

@fraserbenjamin
Copy link

fraserbenjamin commented Apr 29, 2024

Describe the bug

Apologies for raising another issue though in this case the key error case hasn't been resolved despite PR #485 which adds additional support for SQS Error types, currently if the consumer cannot access the queue due to internet connectivity or an invalid URL being passed in, it will throw an error that looks like the one below which won't be caught by isConnectionError and cause an infinite loop.

{"name":"SQSError","code":"Error","time":"2024-04-29T08:08:58.721Z"}

This is why in the previous PR I added code SQSError to the potential PR to resolve it or you could use code Error in the current version as its such a generic message output from the AWS SDK.

If the consumer cannot connect to SQS due to no internet connection or the unlikely event SQS is down, if pollingWaitTimeMs is 0 (set by default), it will instantly retry the connection causing a continual flurry of requests. There is currently no option to resolve this other than by setting pollingWaitTimeMs to a higher value however this also isn't ideal since it adds a longer wait between polls too. Currently authentication errors are caught and can have a back-off configured with authenticationErrorTimeout though this doesn't exist for general connection errors.

Your minimal, reproducible example

https://stackblitz.com/edit/github-pwpmkg?file=index.js

Steps to reproduce

Method 1 (Error)

  1. Clone the code from StackBlitz (essential as you can't disable connectivity in StackBlitz)
  2. Disable device internet connectivity
  3. Run the script using node index.js
  4. Error will occur spamming the logs and creating a high number of retries

Method 2 (TimeoutError)

  1. Open the codesandbox project
  2. Run the script using node index.js
  3. TimeoutError will occur spamming the logs and creating a high number of retries

Expected behavior

As a user I expected these retries to be throttled after a failure but instead I see requests continuously being made and error messages being sent to the console at an excessive rate.

How often does this bug happen?

Every time

Screenshots or Videos

Screenshot 2024-04-29 at 09 33 11

Platform

Occurs on all platforms and tested with the issue happening on Node versions 16-20.

Package version

10.1.0

AWS SDK version

3.564.0

@nicholasgriffintn
Copy link
Member

nicholasgriffintn commented Apr 29, 2024

Hey, yeah we can't react to the code Error or the error type SQSError overall as this would make every error from SQS a connection error that's not what we're looking to do.

One possible way of detecting might be to look for ENOTFOUND, but that's also a little hacky and not super supported.

I don't think we can't ultimately support every possible use case with this library, and we're not really looking to, it's main aim is to be simple boilerplate that we used internally and shared externally.

@fraserbenjamin
Copy link
Author

fraserbenjamin commented Apr 30, 2024

Hey, yeah we can't react to the code Error or the error type SQSError overall as this would make every error from SQS a connection error that's not what we're looking to do.

One possible way of detecting might be to look for ENOTFOUND, but that's also a little hacky and not super supported.

I don't think we can't ultimately support every possible use case with this library, and we're not really looking to, it's main aim is to be simple boilerplate that we used internally and shared externally.

Yeah 100% agree and appreciate your time on this so far thanks, with high traffic that could cause a big delay if the odd message fails for a genuine reason. Totally understand not being able to support every use case although surely a malformed url or no connection are 2 very common cases especially if deploying to cloud where a malformed security group or bad env variable could then cause high cpu/memory and lots of potentially expensive logging to occur without people realising straight away. Especially since the loop can be every 1-2ms which is a lot of events.

In the first draft PR I raised I added another configurable option that a timeout happens on any errors which whilst not perfect I'd say is a better alternative than having the potential for the issues mentioned above. Would be keen to know your opinion on adding something like that back in, happy to do the work but want to check its the way you'd like to go first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants