Sqs deletion issue 360 #364

azenk · 2021-02-12T23:53:00Z

Issue #360

Fixes #360 by iterating over all messages retrieved from SQS. The tests are also updated to process all three test events at once without spawning a goroutine.

I have at least one open question:

Should further processing of messages from the queue continue if an error is encountered?
Currently those events are silently dropped for future retry. SQS will bury the unprocessed events until the visibility timeout expires and then redeliver them. A few possible options:

* Just let them be ignored. This slows down event handling and is likely suboptimal.
* Continue processing the next message in the queue. This changes the assumed semantics of Monitor() a bit, since errors couldn't be surfaced unless they're merged together.
* Forget processing multiple messages per receive. Set the maximum number of messages per SQS receive call to 1, which would bury only the messages that generate errors and retain the existing semantics, at the cost of a few extra API calls under load.
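To make the second option concrete, here is a minimal sketch of "keep going and merge the errors". It is not the code in this PR: `message`, `process`, and `processAll` are hypothetical names, and it assumes Go 1.20 or later for `errors.Join`.

```go
package main

import (
	"errors"
	"fmt"
)

// message is a hypothetical stand-in for a received SQS message.
type message struct {
	ID   string
	Body string
}

// process is a hypothetical handler; here it only rejects empty bodies.
func process(m message) error {
	if m.Body == "" {
		return fmt.Errorf("message %s: empty body", m.ID)
	}
	return nil
}

// processAll handles every message in the batch and merges the failures
// into one error instead of stopping at the first one.
func processAll(msgs []message) (handled int, err error) {
	var errs []error
	for _, m := range msgs {
		if perr := process(m); perr != nil {
			errs = append(errs, perr)
			continue
		}
		handled++
	}
	// errors.Join returns nil when errs is empty.
	return handled, errors.Join(errs...)
}

func main() {
	handled, err := processAll([]message{{"a", "ok"}, {"b", ""}, {"c", "ok"}})
	fmt.Println("handled:", handled, "err:", err)
}
```

The caller still gets a single error value, but it can no longer tell from the return alone which messages succeeded, which is the semantic change mentioned above.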

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* Wait for events to be received
* Try sending multiple events
* Use subtests

Process all messages received one at a time

The existing tests all assumed that errors should be returned from Monitor() if something fails. Now those are logged and no event is generated. Refactor tests to remove go functions and use a channel with non-zero capacity to buffer generated events.
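As a rough illustration of the test pattern described here (a sketch, not the actual test code), a channel whose capacity covers every possible send lets the code under test run synchronously, with no goroutine needed to receive. `event`, `monitor`, and `drain` are hypothetical names.

```go
package main

import "fmt"

// event is a hypothetical stand-in for an interruption event.
type event struct{ ID string }

// monitor is a hypothetical stand-in for Monitor(): it emits one event per
// non-empty body and skips the rest (the real code would log the error).
func monitor(bodies []string, out chan<- event) {
	for i, b := range bodies {
		if b == "" {
			continue
		}
		out <- event{ID: fmt.Sprintf("%d:%s", i, b)}
	}
}

// drain collects whatever is currently buffered without blocking.
func drain(ch <-chan event) []event {
	var got []event
	for {
		select {
		case e := <-ch:
			got = append(got, e)
		default:
			return got
		}
	}
}

func main() {
	bodies := []string{"asg", "", "spot"}
	// Capacity equals the largest number of events monitor can send,
	// so the sends never block even though nothing is receiving yet.
	events := make(chan event, len(bodies))
	monitor(bodies, events)
	fmt.Println("events received:", len(drain(events)))
}
```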

haugenj · 2021-02-17T20:02:28Z

Thanks for writing this and for your patience in getting a response. We're based out of central Texas and are dealing with a bad winter storm right now.

The intention with returning those errors is that we can kill NTH if the same error keeps occurring, the idea being something has been misconfigured and NTH can't function properly.

With your change we still cover errors in receiving messages, it's just the processing of messages that is trickier to handle now. I see you updated the PR to ignore the errors in that case, which I think is a fine solution.

Another idea is that we could perform a duplicate error check within sqs-monitor.go, and panic from within that file if we get too many consecutive duplicate errors. Essentially, copying the panic functionality from node-termination-handler.go into sqs-monitor.go. I don't think this is necessary, but am curious to hear your thoughts

azenk · 2021-02-17T20:03:58Z

Based on the suggestion that @haugenj made to follow the pattern used for scheduled events, I opted to simply continue processing the next message in the queue after logging any errors. This makes the current testing a little less than ideal. As things stand right now, most errors aren't actually checked. The tests simply verify that Monitor() doesn't error and no events are generated.

Much of the error testing should probably be pushed into the internal tests running directly against processSQSMessage() or the various event building functions. I'd like to do this work, but it's a larger effort than I have time for at the moment.

As an aside, have you considered using something like testify for assertions? I found the current helper library a bit confusing. The testify output is also very clean and easy to read.

azenk · 2021-02-17T20:15:53Z

> Another idea is that we could perform a duplicate error check within sqs-monitor.go, and panic from within that file if we get too many consecutive duplicate errors. Essentially, copying the panic functionality from node-termination-handler.go into sqs-monitor.go. I don't think this is necessary, but am curious to hear your thoughts

How about returning something like AllMessageProcessingFailed if we can't process any of the messages in the queue? That should prevent us from getting stuck forever since bad messages eventually get removed. If the duplicate error threshold is above the SQS delivery retry maximum, a single bad message should never cause an issue. Perhaps that should be documented if we go that route?

haugenj · 2021-02-17T20:33:24Z

That sounds good to me. I think we might as well also increase the number of SQS messages taken in a single receive request now that we can properly handle them. Perhaps up from 2 to 5 now?

If none of the messages received from SQS can be processed, return an error. This will allow the NTH to detect repeated issues processing the queue.

haugenj

👍 LGTM. Thanks for going the extra mile improving the tests!

* Refactor sqs message handling

  Process all messages received one at a time

* Update tests for error handling

  The existing tests all assumed that errors should be returned from Monitor() if something fails. Now those are logged and no event is generated. Refactor tests to remove go functions and use a channel with non-zero capacity to buffer generated events.

* Return error if no messages can be processed

  If none of the messages received from SQS can be processed, return an error. This will allow the NTH to detect repeated issues processing the queue.

azenk added 4 commits February 12, 2021 14:56

Update testing for sqs tasks

19e02f8

* Wait for events to be received
* Try sending multiple events
* Use subtests

Refactor sqs message handling

6afa35b

Process all messages received one at a time

Update tests for error handling

72e01cd

The existing tests all assumed that errors should be returned from Monitor() if something fails. Now those are logged and no event is generated. Refactor tests to remove go functions and use a channel with non-zero capacity to buffer generated events.

Ignore events that generate errors and log them

27e5737

haugenj self-requested a review February 17, 2021 20:02

azenk marked this pull request as ready for review February 17, 2021 20:04

azenk added 2 commits February 17, 2021 15:30

Return error if no messages can be processed

f9c83ce

If none of the messages received from SQS can be processed, return an error. This will allow the NTH to detect repeated issues processing the queue.

Increase max received SQS messages to 5

1cb49a3

haugenj approved these changes Feb 18, 2021


haugenj merged commit f718727 into aws:main Feb 18, 2021

ec2-bot mentioned this pull request Mar 2, 2021

🥳 node-termination-handler v1.12.1 Automated Release! 🥑 aws/eks-charts#469

Merged
