Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sqs deletion issue 360 #364

Merged
merged 6 commits into from Feb 18, 2021
Merged

Sqs deletion issue 360 #364

merged 6 commits into from Feb 18, 2021

Conversation

azenk
Copy link
Contributor

@azenk azenk commented Feb 12, 2021

Issue #360

Fixes #360 by iterating over all messages retrieved from SQS. The tests are also updated to process all three test events at once without spawning a go process.

I have at least one open question:

Should further processing of messages from the queue continue if an error is encountered?
Currently those events are silently dropped for future retry. SQS will bury the unprocessed events until the visibility timeout expires and then redeliver them. A few possible options:

  • Just let them get ignored. This slows down handling of events, and is likely sub-optimal
  • Continue processing the next message in the queue, but that changes the assumed semantics of Monitor() a bit, as errors couldn't be surfaced unless they're merged together.
  • Forget processing multiple messages per receive. Set the max sqs receive count to 1, which would allow it to only bury messages that generate errors and retain existing semantics at the cost of a few extra API calls under load.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* Wait for events to be received
* Try sending multiple events
* Use subtests
Process all messages received one at a time
The existing tests all assumed that errors should be returned from Monitor() if something fails.  Now those are logged and no event is generated.

Refactor tests to remove go functions and use a channel with non-zero capacity to buffer generated events.
@haugenj
Copy link
Contributor

haugenj commented Feb 17, 2021

Thanks for writing this and for your patience in getting a response. We're based out of central Texas and are dealing with a bad winter storm right now.

The intention with returning those errors is that we can kill NTH if the same error keeps occurring, the idea being something has been misconfigured and NTH can't function properly.

With your change we still cover errors in receiving messages, it's just the processing of messages that is trickier to handle now. I see you updated the PR to ignore the errors in that case, which I think is a fine solution.

Another idea is that we could perform a duplicate error check within sqs-monitor.go, and panic from within that file if we get too many consecutive duplicate errors. Essentially, copying the panic functionality from node-termination-handler.go into sqs-monitor.go. I don't think this is necessary, but am curious to hear your thoughts

@haugenj haugenj self-requested a review February 17, 2021 20:02
@azenk
Copy link
Contributor Author

azenk commented Feb 17, 2021

Based on the suggestion that @haugenj made to follow the pattern used for scheduled events, I opted to simply continue processing the next message in the queue after logging any errors. This makes the current testing a little less than ideal. As things stand right now, most errors aren't actually checked. The tests simply verify that Monitor() doesn't error and no events are generated.

Much of the error testing should probably be pushed into the internal tests running directly against processSQSMessage() or the various event building functions. I'd like to do this work, but it's a larger effort than I have time for at the moment.

As an aside, have you considered using something like testify for assertions? I found the current helper library a bit confusing. The testify output is also very clean and easy to read.

@azenk azenk marked this pull request as ready for review February 17, 2021 20:04
@azenk
Copy link
Contributor Author

azenk commented Feb 17, 2021

Another idea is that we could perform a duplicate error check within sqs-monitor.go, and panic from within that file if we get too many consecutive duplicate errors. Essentially, copying the panic functionality from node-termination-handler.go into sqs-monitor.go. I don't think this is necessary, but am curious to hear your thoughts

How about returning something like AllMessageProcessingFailed if we can't process any of the messages in the queue? That should prevent us from getting stuck forever since bad messages eventually get removed. If the duplicate error threshold is above the SQS delivery retry maximum, a single bad message should never cause an issue. Perhaps that should be documented if we go that route?

@haugenj
Copy link
Contributor

haugenj commented Feb 17, 2021

That sounds good to me. I think we might as well also increase the number of SQS messages taken in a single receive request now that we can properly handle them. Perhaps up from 2 to 5 now?

If none of the messages received from SQS can be processed, return an error.  This will allow the NTH to detect repeated issues processing the queue.
Copy link
Contributor

@haugenj haugenj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM. Thanks for going the extra mile improving the tests!

@haugenj haugenj merged commit f718727 into aws:main Feb 18, 2021
haugenj pushed a commit to haugenj/aws-node-termination-handler that referenced this pull request Feb 19, 2021
* Refactor sqs message handling

Process all messages received one at a time

* Update tests for error handling

The existing tests all assumed that errors should be returned from Monitor() if something fails.  Now those are logged and no event is generated.

Refactor tests to remove go functions and use a channel with non-zero capacity to buffer generated events.

* Return error if no messages can be processed

If none of the messages received from SQS can be processed, return an error.  This will allow the NTH to detect repeated issues processing the queue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQS messages deleted without being processed
2 participants