
Amazon SQS input stalls on new queue flush timeout defaults #37754

Closed
faec opened this issue Jan 25, 2024 · 14 comments
Labels
bug, Team:Cloud-Monitoring (Label for the Cloud Monitoring team)

Comments

@faec
Contributor

faec commented Jan 25, 2024

Short version, if you're here because your SQS ingestion slowed down after installing 8.12: if your configuration uses a performance preset, switch it to preset: latency. If you use no preset, or a custom preset, set queue.mem.flush.timeout: 1.
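For a standalone Beat, a minimal sketch of that workaround looks like the following (the Elasticsearch host is a placeholder; keep your real output settings, and only one of the two options is needed):

```yaml
# Option 1: if you already use a performance preset, switch it to "latency".
output.elasticsearch:
  hosts: ["https://localhost:9200"]  # placeholder
  preset: latency

# Option 2: if you use no preset, or a custom one, restore the 1-second flush.
queue.mem.flush.timeout: 1
```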

Long version:

In 8.12 the default memory queue flush interval was raised from 1 second to 10 seconds. In many configurations this improves performance because it allows the output to batch more events per round trip, which improves efficiency. However, the SQS input has an extra bottleneck that interacts badly with the new value.

The SQS input is configured with a number of input workers, by default 5. Each worker reads one message from the SQS queue, fetches and publishes the events it references, waits for those events to be acknowledged upstream, and then deletes the original message. The worker will not proceed to handling the next message until the previous one is fully acknowledged.

Now suppose we are using default settings, and each SQS message corresponds to 200 events. 5 workers will read 5 SQS messages and publish 1000 events. However, this is less than the queue's flush.min_events value of 1600, so the queue will continue waiting for a full 10 seconds before making those events available to the output. Once it does, the output will need to fully ingest and acknowledge those events before the input workers resume. So no matter how fast the reading and ingestion is, the pipeline will be capped at 5 SQS messages every 10 seconds.
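To make the arithmetic concrete, here is a sketch of the relevant 8.12 defaults (expressed as standalone Beat settings) and how they interact in this scenario:

```yaml
# 8.12 memory queue defaults
queue.mem.flush.min_events: 1600  # flush early once this many events are buffered
queue.mem.flush.timeout: 10s      # otherwise wait this long before flushing

# SQS input default: 5 workers, each handling one SQS message at a time
# max_number_of_messages: 5

# 5 workers x ~200 events per message = ~1000 buffered events, which is below
# flush.min_events (1600), so every batch waits out the full 10s timeout:
# roughly 5 SQS messages per 10 seconds, no matter how fast the output is.
```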

The pipeline expects backpressure to come from the outputs as their throughput is saturated, to propagate from there to the queue, and then to block the input's Publish calls once the queue becomes full. However, in many scenarios the current SQS input will never use more than a tiny fraction of the queue, and will be entirely dependent on the queue's flush interval to make progress.

One important question is whether the current approach is de facto imposed by Amazon APIs, rate limits, or similar. If that's the case then we'll need to look for other workarounds based on those constraints. However, the ideal solution would be for the SQS input to decouple message cleanup from the worker lifecycle, by saving acknowledgment metadata and moving on to the next SQS message before the previous one has been fully acknowledged. This would let the input take full advantage of the configured queue to accumulate data and improve ingestion performance. It would also improve performance beyond the existing baseline in some scenarios (even before 8.12, an SQS queue with small payloads could never be processed faster than 5 messages per second, no matter how fast the actual ingestion was).

@faec added the bug, Team:Elastic-Agent, and Team:Cloud-Monitoring labels on Jan 25, 2024
@cmacknz
Member

cmacknz commented Jan 25, 2024

Looks like @andrewkroh did the original implementation here in #27199; he might be the best person to comment on whether this is something we can improve in the implementation.

CC @lucabelluccini. Also CC @elastic/obs-cloud-monitoring since it doesn't look like the team label did anything.

I think we'll want to document this recommendation in:

  1. The 8.12 release notes for beats and agent.
  2. The preset documentation for beats and agent.
  3. The support knowledgebase.

One complication with relying on documentation alone is that the awss3 input is an implementation detail of several integrations, so users might not realize they are affected until they observe the performance regression.

@andrewkroh
Member

The max_number_of_messages configuration option controls the number of SQS messages that can be in flight (received from a queue by a consumer, but not yet deleted from the queue) at any time for the input. Each queue has an in-flight quota associated with it. Ideally you would keep the number of inputs * max_number_of_messages below the quota, but in practice this isn't a problem because the quota is high AND ReceiveMessage will silently stop handing out more SQS messages until you fall back below the quota.

max_number_of_messages also implicitly controls the number of goroutines that are used to process messages. I think this is where there is some flexibility to decouple control of the maximum number of in-flight SQS messages from the maximum number of goroutines. In fact there is a number_of_workers setting with an aligned definition, but it's only used in S3 listing mode.

If both max_number_of_messages and number_of_workers were available for use with SQS mode then you could set max_number_of_messages to like 100 while keeping number_of_workers at a more conservative 5 to account for large internal queues with long flush intervals.

Additionally, with the two concepts separated, there would be an opportunity to concurrently process the multiple S3 objects that can be contained in a single SQS notification. Today, the S3 objects contained in one SQS message are processed serially by one goroutine.
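If the two settings were decoupled along these lines, an SQS-mode configuration might look something like the sketch below. This is the proposed shape, not working configuration: today number_of_workers is only honored in S3 listing mode, and the queue URL is a placeholder.

```yaml
filebeat.inputs:
  - type: aws-s3
    queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue  # placeholder
    # Proposed: cap on SQS messages in flight (received but not yet deleted).
    max_number_of_messages: 100
    # Proposed: number of goroutines actually processing messages/objects.
    number_of_workers: 5
```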

@strawgate
Contributor

Is there a higher value of max_number_of_messages that we would feel comfortable defaulting to, to restore at least some of the lost performance?

@aspacca
Contributor

aspacca commented Jan 26, 2024

If both max_number_of_messages and number_of_workers were available for use with SQS mode then you could set max_number_of_messages to like 100 while keeping number_of_workers at a more conservative 5 to account for large internal queues with long flush intervals.

there's a very old PR that does something similar: #33659

It's not exactly the same, as far as I understand what you are proposing, @andrewkroh

The changes in the PR create number_of_workers goroutines, each consuming max_number_of_messages SQS messages.

Whereas, if I understand correctly, you propose number_of_workers goroutines, each consuming max_number_of_messages / number_of_workers messages. Is that correct?

Or is it something different again?

The PR introduces number_of_sqs_consumers instead of reusing number_of_workers, but that's just a minor detail we can get rid of.

@lucabelluccini
Contributor

Thanks @cmacknz for the notification ❤️
I think the actions you propose are great (I can cover the knowledge article part).
This problem can affect:

  • Beats users (who use the AWS input explicitly)
  • Beats module users (who use the AWS input almost implicitly)
  • Elastic Agent integration users (who use the AWS SQS input almost implicitly)

I think just warning users about this recommendation is tricky, as the chances it will be missed are high.

This is one example of what can happen with Elastic Agent:

As a consequence, if the user has the AWS SQS input in any of the deployed integrations, they get the performance regression detailed here.

@strawgate
Contributor

The personalized settings didn't include queue.mem.flush.timeout.

As an additional piece of information: it was not possible to customize the queue settings, including the flush timeout, via Fleet output settings before 8.12.

@cmacknz
Member

cmacknz commented Jan 26, 2024

@strawgate
Contributor

strawgate commented Jan 31, 2024

max_number_of_messages also implicitly controls the number of goroutines that are used to process messages. I think this is where there is some flexibility to decouple control of the maximum number of in-flight SQS messages from the maximum number of goroutines. In fact there is a number_of_workers setting with an aligned definition, but it's only used in S3 listing mode.

If both max_number_of_messages and number_of_workers were available for use with SQS mode then you could set max_number_of_messages to like 100 while keeping number_of_workers at a more conservative 5 to account for large internal queues with long flush intervals.

It sounds like we'll need to coordinate between teams to get this implemented and resolve the core performance issue with the SQS input. @jlind23 can we figure out how to divvy this up to get this fix together?

@nimarezainia removed the Team:Elastic-Agent label on Feb 1, 2024
@nimarezainia
Contributor

Removing the agent label. Hoping it gets routed properly this time.

@nimarezainia
Contributor

@aspacca's PR above may be worth another look. Based on a novice read, it seems to suggest the same fix that @andrewkroh has mentioned. Wouldn't configurable max_number_of_messages and number_of_workers settings be a fix for this?

(I don't know why that PR was closed.)

@jlind23
Collaborator

jlind23 commented Feb 2, 2024

It sounds like we'll need to coordinate between teams to get this implemented and resolve the core performance issue with the SQS input. @jlind23 can we figure out how to divvy this up to get this fix together?

@strawgate Nima escalated this to the o11y team that owns the SQS input; there is an ongoing mail thread to get this sorted out.

@bturquet you probably want to track this issue on your end.

@jlind23
Collaborator

jlind23 commented Feb 12, 2024

@bturquet Shall we add this to one of your boards for prioritisation purposes?

@bturquet

@jlind23 we are tracking our progress in a separate issue for S3 input:

We are still in the performance testing stage (trying different mixes of parameters and versions). We plan to have conclusions before the end of the week and will then decide whether we need to change the input logic.

cc @aspacca for coordination and communication

@jlind23
Collaborator

jlind23 commented Feb 13, 2024

I'm closing this one then in favour of yours.
