fluent-bit stops processing logs after reaching memory buffer limit #8046
Hello @robmcelhinney, I will try reproducing this behavior, but please confirm my assumptions. Is Fluent Bit running as a Docker container, with Docker using the Fluentd logging driver to send all container logs to this Fluent Bit container, and from there are you using the HTTP output plugin to send these logs to an endpoint? If this is the case, please let me know what this endpoint is.
As per your description, the behavior is expected when reaching 150MB. From https://docs.fluentbit.io/manual/administration/backpressure#mem_buf_limit: once the limit is reached, Fluent Bit will "block local buffers for the input plugin (cannot append more data)".
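For reference, this is the kind of input stanza the limit applies to. A minimal sketch, assuming a forward listener and the 150MB figure mentioned above; everything else is placeholder:

```
[INPUT]
    Name           forward
    Listen         0.0.0.0
    Port           24224
    # Assumed limit matching the 150MB described in this issue. Once the
    # in-memory chunks for this input exceed it, the engine pauses the
    # input ("block local buffers") until enough chunks are flushed.
    Mem_Buf_Limit  150MB
```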
Yes, fluent-bit is running as a docker container. We get a lot of our logs from other docker containers using the fluentd logging driver. Yes, we use the HTTP output plugin for all forward-input logs, and we also use the Splunk output plugin for the few logs we tail.
We use a SplunkCloud HTTPEvent Collector: https://docs.splunk.com/Documentation/SplunkCloud/latest/Data/UsetheHTTPEventCollector
We tend to start seeing issues once we are ingesting over ~8GB an hour. It is usually fine before then and can sometimes handle that amount too.
I don't have any right now, I'll add any to this ticket when it reoccurs.
We expect the block to happen at that stage; we're just not sure why it often fails to resolve itself even when the ingest rate drops to 0.
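A rough sketch of the dual-output setup described in this comment; the Match tags, hosts, and token variable are placeholders, not taken from the reporter's actual configuration:

```
[OUTPUT]
    Name    http
    Match   forward.*                        # assumed tag for forward-input logs
    Host    http-endpoint.example.internal   # placeholder endpoint
    Port    443
    tls     On

[OUTPUT]
    Name         splunk
    Match        tail.*                      # assumed tag for tailed logs
    Host         example.splunkcloud.com     # placeholder SplunkCloud HEC host
    Port         8088
    Splunk_Token ${SPLUNK_HEC_TOKEN}         # hypothetical env variable
    tls          On
```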
Hello @robmcelhinney Based on your configuration, I tested with a reduced Fluent Bit config file, and I can see the same behavior in the logs (mem buf over-limit pausing the input), but on the Splunk side I can see all the logs arriving. The Splunk output plugin doesn't log the HTTP 200 response code when the records are received at the endpoint; this is why nothing else is displayed in the logs when log_level is info.
If you set log_level to debug instead of info, you will see that data is flowing from Fluent Bit to Splunk, and the log messages below confirm it. As you can see in that snippet, the chunks are updated with data, then paused, and then resumed again. I will engage our dev team to get more details about the forward plugin, because we should see a message in the logs indicating that the input plugin is resumed once mem_buf_limit has cleared the pause condition.
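For anyone following along, raising the log level is a one-line change in the service section:

```
[SERVICE]
    # debug-level logging shows the chunk updates and input pause/resume
    # events that info-level logging omits
    Log_Level  debug
```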
Could you please provide a simplified version of your configuration that reproduces the issue? As I mentioned, I can see the same things in the logs that you are seeing, but the logs are reaching the endpoint through both the Splunk and HTTP output plugins. Thanks,
The Forward plugin doesn't have the callback for the resume; hence, there is no log saying it resumed.
I believe this was caused by the rewrite_tag deadlock noted in #4940 (comment), and the new version 3.0.2 seems to have fixed it after including #8473.
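For context, the deadlock in #4940 involved the rewrite_tag filter's internal emitter, which is itself an input plugin with its own memory buffer. A hypothetical filter of that shape follows; the rule and names are illustrative, not taken from this issue's configuration:

```
[FILTER]
    Name                  rewrite_tag
    Match                 forward.*
    # Rule format: $KEY  REGEX  NEW_TAG  KEEP
    Rule                  $log ^ERROR error.$TAG false
    # Records matching the rule are re-emitted through this internal input;
    # per the linked issue, it could reportedly stay paused once its own
    # buffer filled, which the fix in #8473 addresses.
    Emitter_Name          re_emitted
    Emitter_Mem_Buf_Limit 10M
```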
Bug Report
Describe the bug
We've seen many instances of our fluent-bit sidecars becoming stuck once they reach or near their memory limit (we can't use a filesystem buffer).
The only way to resolve the issue is to restart the container. No logs are emitted from the container either, so we can't find a root cause.
To Reproduce
No (info level) logs are emitted during the incident
Expected behavior
New logs are not accepted until the backlog is cleared.
Screenshots
Screenshot 2023-10-12 at 12 39 55 (expired image link)
Screenshot 2023-10-12 at 12 37 55 (expired image link)
Screenshot 2023-10-12 at 12 37 44 (expired image link)
Different host than the logs above.
Your Environment
I've posted this in Slack and received a comment about using a ring buffer. Do you have any information about that?
Configuration files: