
[AWS] S3 input consumes significant amount of memory #9463

Closed
kaiyan-sheng opened this issue Mar 28, 2024 · 6 comments · Fixed by elastic/beats#39131
@kaiyan-sheng
Contributor

The user's Filebeat memory usage increases continuously until the Agent becomes unresponsive and they restart it, after which the behavior repeats. Based on the heap profile (hprof) data, it looks like the S3 integration consumes up to 10GB of memory, mostly from GetStates. @zmoog saw similar behavior in performance testing with the aws-s3 input in polling mode: zmoog/public-notes#77

@bturquet changed the title from "[AWS] S3 input cosumes significant amount of memory" to "[AWS] S3 input consumes significant amount of memory" on Apr 4, 2024
@bturquet

bturquet commented Apr 4, 2024

Hi @andrewkroh, we are investigating a performance issue with the s3 input (direct polling mode). We have run some tests to reproduce the memory usage behaviour here:

And some examples have been shared through this SDH:

We would need your help or guidance to identify the root cause and how we could fix it, thank you!

@andrewkroh
Member

How can I help? What do you need me to do? I know the SQS mode of the aws-s3 input, but have not worked on the S3 listing mode.

If it is possible to switch to SQS mode for this use case, then I would recommend that. It's a lot simpler when AWS tells the input what to read, and it is stateless (so you can scale it horizontally).

@kaiyan-sheng
Contributor Author

Hi @andrewkroh, thanks for offering to help! We want help identifying whether there is a memory leak in the s3 input in S3 polling mode. So far, from @zmoog's perf testing, we do see an increase in memory over time, but we haven't had a chance to check whether this is a memory leak or by design.

@andrewkroh
Member

My main question is how it was designed to store state. Does it keep a record of every S3 object (that would be bad), or does it use some technique to limit state based on time? If it's the latter, then what assumptions does it make (for example, that no older data is expected to be read)?

I will try to spend some time understanding that code.

@andrewkroh
Member

andrewkroh commented Apr 11, 2024

The state tracking in the input is complex. I didn't do any tracing in a debugger or other profiling so there may be some things I'm not understanding.

Does it keep a record of every S3 object (that would be bad) or does it use some techniques to limit state based on time?

As it reads pages of the S3 listing, it stores a state for each S3 object. When it finishes writing all of the S3 objects to ES, it exchanges the individual S3 object states for the newest LastModified timestamp found on a given S3 prefix.

Then, when it lists objects again, it checks whether each object was modified after that stored timestamp. If so, it reads the object.
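
To make that concrete, here is a minimal sketch in Go of the scheme as I understand it; `poller`, `objectState`, and `scan` are illustrative names, not the actual beats types:

```go
package main

import (
	"fmt"
	"time"
)

// objectState is an illustrative stand-in for the input's per-object state.
type objectState struct {
	Key          string
	LastModified time.Time
}

type poller struct {
	cutoff time.Time              // newest LastModified seen in a fully processed scan
	states map[string]objectState // per-object state, held only while a scan is in flight
}

// scan simulates one poll of the bucket listing.
func (p *poller) scan(listing []objectState) {
	newest := p.cutoff
	for _, obj := range listing {
		// Skip objects not modified since the stored cutoff.
		if !obj.LastModified.After(p.cutoff) {
			continue
		}
		p.states[obj.Key] = obj // per-object state accumulates here while publishing
		if obj.LastModified.After(newest) {
			newest = obj.LastModified
		}
	}
	// Once every selected object has been written out, the individual states
	// are exchanged for the single newest timestamp.
	p.cutoff = newest
	p.states = map[string]objectState{}
}

func main() {
	p := &poller{states: map[string]objectState{}}
	p.scan([]objectState{{Key: "logs/a.json", LastModified: time.Now()}})
	fmt.Println("cutoff after scan:", p.cutoff)
}
```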

As far as memory usage goes, I can see where it might need a lot of memory to track all of the S3 object states while it is reading and persisting. The input holds one state object per S3 object in a slice, plus a mapping of state IDs to indices within that slice (src). Changes to the states slice, like Delete, trigger a new map allocation, which can be expensive in terms of memory use.
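
Roughly, that structure has the shape sketched below (illustrative names, not the actual source); the point is that each Delete rebuilds the ID-to-index map:

```go
package sketch

import "time"

// state is an illustrative stand-in for the input's per-object state.
type state struct {
	ID           string
	Bucket, Key  string
	LastModified time.Time
}

// states keeps every tracked state in a slice plus a lookup map from
// state ID to slice index.
type states struct {
	list []state
	idx  map[string]int
}

// Delete removes one state. Because the indices of everything after it shift,
// the whole lookup map is rebuilt, allocating a new map on every delete; with
// many tracked objects this is expensive in both CPU and memory.
func (s *states) Delete(id string) {
	i, ok := s.idx[id]
	if !ok {
		return
	}
	s.list = append(s.list[:i], s.list[i+1:]...)
	s.idx = make(map[string]int, len(s.list))
	for j, st := range s.list {
		s.idx[st.ID] = j
	}
}
```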

When the state information is persisted to the Filebeat state store, it makes a full copy of the states. This could cause a big malloc too. I think this is what was observed in this comment:

looks like S3 integration consumes up to 10GB memory, mostly from GetStates

I suspect it makes a copy to avoid needing to hold a lock and block other operations while persisting the state. This might not be an ideal tradeoff to make when there is a lot of state to copy.
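
For illustration, a snapshot-style accessor of that kind typically looks like the sketch below; this shows the trade-off only and is not the real GetStates implementation:

```go
package sketch

import "sync"

type state struct {
	Bucket, Key string
	Stored      bool
}

type states struct {
	mu   sync.Mutex
	list []state
}

// GetStates returns a snapshot of all tracked states. Copying keeps the
// critical section short, so other goroutines are not blocked during the
// registry write, but it duplicates the entire slice on every call.
func (s *states) GetStates() []state {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]state, len(s.list))
	copy(out, s.list)
	return out
}
```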


Another avenue of investigation is to check whether any S3 object states are orphaned in the registry. IIUC, upon successful completion of a poll loop, the registry should be left with only filebeat::aws-s3::writeCommit:: keys and no filebeat::aws-s3::state:: keys (src). Does anything go wrong if the input is interrupted and has to restart? Orphaned states would lead to an effective memory leak.
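
One way to check, assuming you can dump the registry keys, is to count them by prefix; the helper below is hypothetical and only illustrates the check:

```go
package sketch

import "strings"

const (
	statePrefix       = "filebeat::aws-s3::state::"
	writeCommitPrefix = "filebeat::aws-s3::writeCommit::"
)

// countRegistryKeys tallies aws-s3 registry keys by prefix. After a clean poll
// loop only writeCommit keys should remain; lingering state keys would point
// to orphaned entries.
func countRegistryKeys(keys []string) (stateKeys, writeCommitKeys int) {
	for _, k := range keys {
		switch {
		case strings.HasPrefix(k, statePrefix):
			stateKeys++
		case strings.HasPrefix(k, writeCommitPrefix):
			writeCommitKeys++
		}
	}
	return stateKeys, writeCommitKeys
}
```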

@faec

faec commented Apr 16, 2024

Met with @andrewkroh to hand off context on this one; I'll be picking up the fix as part of my overall AWS cleanup work.

faec added a commit to elastic/beats that referenced this issue Apr 29, 2024
…#39131)

This is a cleanup of concurrency and error handling in the `aws-s3` input that could cause several known bugs:

- Memory leaks ([1](elastic/integrations#9463), [2](#39052)). These were caused by the input running several scans of its S3 bucket simultaneously, which led to the cleanup routine `s3Poller.Purge` being called many times concurrently. Inefficiencies in this function created many copies of the state data, which accumulated over time and could overload process memory. Fixed by:
  * Changing the `s3Poller` run loop to only run one scan at a time, and wait for it to complete before starting the next one.
  * Having each object persist its own state after completing, instead of waiting until the end of a scan and writing an entire bucket worth of metadata at once.
    - This also allowed the removal of other metadata: there is no longer any reason to track the detailed acknowledgment state of each "listing" (page of ~1K events during bucket enumeration), so the `states` helper object is now much simpler.
- Skipped data due to buggy last-modified calculations ([3](#39065)). The most recent scanned timestamp was calculated incorrectly, causing the input to skip a growing number of events as ingestion progressed.
  * Fixed by removing the bucket-wide last modified check entirely. This feature was already risky, since objects with earlier creation timestamps can appear after ones with later timestamps, so there is always the possibility to miss objects. Since the value was calculated incorrectly and was discarded between runs, we can remove it without breaking compatibility and reimplement it more safely in the future if needed.
- Skipped data because rate limiting is treated as a permanent failure ([4](#39114)). The input treats all error types the same, which causes many objects to be skipped for ephemeral errors.
  * Fixed by creating an error, `errS3DownloadFailure`, that is returned when a processing failure is caused by a download error. In this case, the S3 workers will not persist the failure to the `states` table, so the object will be retried on the next bucket scan. When this happens the worker also sleeps (using an exponential backoff) before trying the next object (see the sketch after this list).
  * Exponential backoff was also added to the bucket scanning loop for page listing errors, so the bucket scan is not restarted needlessly.
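
A rough sketch of the download-failure handling described in the `errS3DownloadFailure` bullet above: the error value is the one named in the commit message, while the worker loop, the `s3Object` type, and the backoff cap are assumptions of this sketch, not the actual beats code.

```go
package sketch

import (
	"errors"
	"fmt"
	"time"
)

// errS3DownloadFailure marks failures caused by downloading the object (for
// example rate limiting or timeouts) that should not be recorded as permanent.
var errS3DownloadFailure = errors.New("S3 object download failed")

type s3Object struct{ Bucket, Key string }

// processObjects is an illustrative worker loop: state is persisted only on
// success, download failures are left unrecorded so the next bucket scan
// retries them, and the worker backs off before moving to the next object.
func processObjects(objs []s3Object, process func(s3Object) error, markDone func(s3Object)) {
	backoff := time.Second
	for _, obj := range objs {
		err := process(obj)
		switch {
		case err == nil:
			markDone(obj)         // persist per-object state only on success
			backoff = time.Second // reset the backoff after a success
		case errors.Is(err, errS3DownloadFailure):
			fmt.Printf("download failed for %s/%s, will retry on next scan\n", obj.Bucket, obj.Key)
			time.Sleep(backoff) // do not persist the failure; back off before the next object
			if backoff < time.Minute {
				backoff *= 2
			}
		default:
			markDone(obj) // non-download errors are treated as permanent here (an assumption of this sketch)
		}
	}
}
```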
mergify bot pushed a commit to elastic/beats that referenced this issue Apr 29, 2024
…#39131)

(cherry picked from commit e588628)

# Conflicts:
#	x-pack/filebeat/input/awss3/input.go
faec added a commit to elastic/beats that referenced this issue Apr 29, 2024
…ss in the `aws-s3` input (#39262)

* Fix concurrency bugs that could cause data loss in the `aws-s3` input (#39131)

(cherry picked from commit e588628)

# Conflicts:
#	x-pack/filebeat/input/awss3/input.go

* fix merge

---------

Co-authored-by: Fae Charlton <fae.charlton@elastic.co>