
Remove max_number_of_messages cap in Filebeat s3 input #32278

Closed
kaiyan-sheng opened this issue Jul 8, 2022 · 7 comments
Labels
enhancement Team:Cloud-Monitoring Label for the Cloud Monitoring team

Comments

@kaiyan-sheng
Contributor

In the Filebeat aws-s3 input, when SQS notification is set up along with the S3 bucket, the max_number_of_messages configuration parameter can be used to control the maximum number of SQS messages in flight at any time. Right now max_number_of_messages has a default value of 5 and the maximum accepted value is 10. The maximum of 10 comes from a limitation of the AWS ReceiveMessage API call, which retrieves one or more messages, up to 10, from the specified queue.
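For context, the 10-message ceiling comes from the SQS API itself rather than from Filebeat. A minimal standalone sketch of a single ReceiveMessage call using the AWS SDK for Go v2 (the queue URL is a placeholder and field types may differ slightly between SDK versions; this is not the Filebeat source):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := sqs.NewFromConfig(cfg)

	// A single ReceiveMessage call returns at most 10 messages; the API
	// rejects larger values of MaxNumberOfMessages.
	out, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
		QueueUrl:            aws.String("https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"), // placeholder
		MaxNumberOfMessages: 10,
		WaitTimeSeconds:     20, // long polling
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("received %d messages\n", len(out.Messages))
}
```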

With this limitation, the only way to scale the s3 input is to run multiple Filebeat instances in parallel pointing to the same SQS queue. @aspacca and I talked about this: it would be good to remove the cap of 10 on the max_number_of_messages parameter, so that a larger value can be specified and the poller can start max_number_of_messages / 10 goroutines to scale.
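A rough sketch of that scaling idea, assuming a hypothetical receiveBatch/processMessage split (not the actual Filebeat code): running max_number_of_messages / 10 poller goroutines, each issuing its own ReceiveMessage calls of at most 10 messages, keeps up to max_number_of_messages messages in flight.

```go
package main

import (
	"context"
	"sync"
)

// startPollers is a hypothetical sketch of the proposal: one poller goroutine
// per 10 messages of the configured max_number_of_messages.
func startPollers(
	ctx context.Context,
	maxNumberOfMessages int,
	receiveBatch func(ctx context.Context, n int) []string, // wraps sqs.ReceiveMessage
	processMessage func(msg string),
) {
	numPollers := maxNumberOfMessages / 10
	if numPollers < 1 {
		numPollers = 1
	}

	var wg sync.WaitGroup
	for i := 0; i < numPollers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				// Each poller asks for at most 10 messages per API call.
				for _, m := range receiveBatch(ctx, 10) {
					processMessage(m)
				}
			}
		}()
	}
	wg.Wait()
}
```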

@botelastic added the needs_team label (indicates that the issue/PR needs a Team:* label) on Jul 8, 2022
@kaiyan-sheng added the enhancement and Team:Cloud-Monitoring labels on Jul 8, 2022
@botelastic removed the needs_team label on Jul 8, 2022
@nimarezainia
Contributor

there's also a PR somewhat relevant to this: #31614

@andrewkroh
Member

andrewkroh commented Jul 12, 2022

As I mentioned in #31614 (comment), I don't believe our input is limited to a max_number_of_messages of 10. It will request at most 10 from the API, but you can configure max_number_of_messages > 10 and it will continue to make SQS ReceiveMessage requests until that number is reached. Since 7.15, our documentation for max_number_of_messages has not listed an upper limit because there should not be one in the code.

Have you encountered some kind of error while using max_number_of_messages > 10?

You can confirm the number of inflight SQS messages by looking at the sqs_messages_inflight_gauge metric (see https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-aws-s3.html#_metrics).

@nimarezainia
Contributor

@andrewkroh the team is trying to address the slower than expected ingestion of SQS logs as reported here: https://github.com/elastic/sdh-beats/issues/2238

Some notable observations:

  • the agent seems to be throttled somehow, with underutilized CPU
  • the SQS queue has built up
  • adding an additional integration seemed to increase throughput

@aspacca
Contributor

aspacca commented Jul 13, 2022

I misinterpreted the cap (I thought that steps 2 to 4 below were capped at 10 goroutines in total)

this is the behaviour I've found in the code (a simplified sketch follows the walkthrough):

  • first iteration of the cycle
  1. acquire 100 workers - blocking
  2. fetch 10 messages (the min between the workers from step 1 and 10) - blocking
  3. release 100 - 10 = 90 workers - blocking
  4. process 10 messages - threads
  5. release one worker at a time as each thread finishes - threads
  • second iteration:
  1. acquire 100 workers (only between 90 and 100 are available, depending on the status of the previous iteration's threads at step 5) - blocking
  2. fetch 10 messages (the min between the workers from step 1 and 10) - blocking
  3. release 10 workers - blocking
  4. process 10 messages - threads
  5. release one worker at a time as each thread finishes - threads
  • third iteration:
  1. acquire 100 workers (only between 80 and 100 are available, depending on the status of the previous iterations' threads at step 5) - blocking
  2. fetch 10 messages (the min between the workers from step 1 and 10) - blocking
  3. release 10 workers - blocking
  4. process 10 messages - threads
  5. release one worker at a time as each thread finishes - threads

etc etc

we should reach the 10th iteration where, if all but a few of the threads at step 5 haven't finished yet, steps 2 to 4 will handle somewhere between 0 and 10 messages instead of a full 10
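For readers following along, here's a simplified model of the acquire/fetch/release cycle described above, using a plain token channel as the worker semaphore (not the actual Filebeat source; receive and process stand in for the SQS fetch and the per-message handler):

```go
package main

import "context"

// pollLoop models the loop described above; maxInflight plays the role of
// max_number_of_messages (e.g. 100). Not the actual Filebeat implementation.
func pollLoop(ctx context.Context, maxInflight int, receive func(n int) []string, process func(msg string)) {
	// tokens represents free worker slots; start with maxInflight free slots.
	tokens := make(chan struct{}, maxInflight)
	for i := 0; i < maxInflight; i++ {
		tokens <- struct{}{}
	}

	for ctx.Err() == nil {
		// step 1: acquire every currently free slot, blocking until at least one is free
		acquired := 0
		select {
		case <-tokens:
			acquired++
		case <-ctx.Done():
			return
		}
	drain:
		for {
			select {
			case <-tokens:
				acquired++
			default:
				break drain
			}
		}

		// step 2: a ReceiveMessage call can return at most 10 messages,
		// so fetch min(acquired, 10)
		want := acquired
		if want > 10 {
			want = 10
		}
		msgs := receive(want)

		// step 3: release the slots that were not used by this batch
		for i := 0; i < acquired-len(msgs); i++ {
			tokens <- struct{}{}
		}

		// steps 4-5: process each message in its own goroutine and release
		// its slot when the goroutine finishes
		for _, m := range msgs {
			m := m
			go func() {
				defer func() { tokens <- struct{}{} }()
				process(m)
			}()
		}
	}
}
```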

so yes, @andrewkroh, setting max_number_of_messages > 0 will result in at most max_number_of_messages concurrent goroutines processing the polled SQS messages, built up over multiple iterations

it's interesting anyway that, according to @nimarezainia, having multiple integrations increased the throughput.
@nimarezainia, was a test run with a single integration where max_number_of_messages = X, with X being the sum of the max_number_of_messages = N values set in the 3 different integrations?

if not, it should produce a similar increase in throughput. if it doesn't, I suspect the blocking nature of sqs.ReceiveMessage is affecting throughput because of network I/O or similar

there is maybe still space for optimisation (that was my initial thought): having up to max_number_of_messages concurrent goroutines from the first iteration, instead of waiting for them to build up over iterations.
if I got it right, when every goroutine finishes fast enough, ending just after the new messages are received, we will effectively have only 10 concurrent goroutines most of the time

this could also explain the different throughput in the two scenarios

what do you think @andrewkroh ?

@aspacca
Contributor

aspacca commented Jul 13, 2022

to summarise:
it seems to me there are two different throughputs to consider:

  1. the one for receiving the messages
  2. the one for processing the received messages

both happen in batches

  1. is always capped at 10 for every single batch
  2. is capped the same way, as a consequence of the capped batch at 1.

max_number_of_messages acts as a total throttling limit over 2. across different batches (since 2. is not blocking)

every batch blocks at 1.; this means that on a performant enough compute environment the effective throughput for every batch at 2. will therefore be limited to 10

this seems to be validated by the increased throughput with the 3x integrations configuration @nimarezainia reported

beware anyway that such a configuration produced instability in the agent, which eventually died

@andrewkroh
Member

andrewkroh commented Jul 13, 2022

I suspect the difference between using two input instances vs one is that each instance creates its own beat.Client (source ref). There is a lock that is held during the Publish call of the client.

This code could be changed to create one beat.Client per worker to remove any contention for the publish lock. This contention would be most noticeable when there are some local Beats processors involved because the lock is held while each event is processed.
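A rough illustration of that change, using minimal stand-in types rather than the real libbeat interfaces (treat it as a sketch of the idea, not the actual fix):

```go
package main

import "sync"

// Minimal stand-ins for the libbeat publisher pipeline types; the real
// interfaces live in libbeat's beat package.
type Event struct{ Fields map[string]interface{} }

type Client interface {
	Publish(Event)
	Close() error
}

type Pipeline interface {
	Connect() (Client, error)
}

// runWorkers gives every SQS worker its own client instead of sharing one,
// so Publish calls from different workers don't contend on a single lock.
func runWorkers(pipeline Pipeline, numWorkers int, events <-chan Event) error {
	// Connect one client per worker up front so a failure aborts cleanly.
	clients := make([]Client, 0, numWorkers)
	for i := 0; i < numWorkers; i++ {
		c, err := pipeline.Connect()
		if err != nil {
			return err
		}
		clients = append(clients, c)
	}

	var wg sync.WaitGroup
	for _, c := range clients {
		c := c
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer c.Close()
			for ev := range events {
				c.Publish(ev) // no cross-worker lock contention
			}
		}()
	}
	wg.Wait()
	return nil
}
```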

@andrewkroh
Member

andrewkroh commented Nov 21, 2022

Closing because max_number_of_messages is working correctly. And the issue that I suspected in the previous comment has proven true and is fixed in #33658 for SQS mode.
