Fluent-bit stops processing logs under high load, locks writing in ch_manager pipe #2661
Comments
Awesome, thanks @grep4error for all the help. @edsiper, this is affecting more people: #2621 & #2577. Please do the needful; we can help validate the fix. cc @tomerleib |
I've done two more tests in this area:
|
@grep4error do you have an image that you can share with me, so I can test it in my environment as well? |
However, I created an image for 1.7.0 (yes, I know it's not released, but I forgot to change the branch) and everything worked for more than 14 hours. |
We have the same issue; for now we will revert to 1.3.x, which does not seem to have this issue. |
@mtparet which 1.3.x version have you tested? |
It was 1.3.9. I cannot guarantee we had no freezes, but I did not observe any. |
Tested 1.3.9, after 28 hours one of the fluent-bit pods froze again... |
Hello @edsiper, |
FYI: I am taking a look at this (WIP) |
I am thinking about a solution. Making the socket async is right, but the EAGAIN returned when trying to write to the pipe in a full state will need extra care, since the notification from the output saying "I am done" or "I need a retry" would be missed. Work in progress |
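A minimal sketch of the concern, using plain POSIX calls rather than fluent-bit's internal API (the function and token names below are assumptions): with a non-blocking pipe, a full buffer turns into EAGAIN, and unless the caller keeps the notification and retries it later, the "done / needs retry" event is silently lost.

```c
/* Sketch only, plain POSIX: writing a task-completion token to a
 * non-blocking pipe. On EAGAIN the token must be kept and retried later
 * (e.g. queued until the event loop drains the pipe), otherwise the
 * "done / retry" notification is lost, which is the extra care mentioned. */
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Returns 1 if written, 0 if the pipe was full (caller must retry later),
 * -1 on a real error. */
static int notify_task_event(int pipe_wr_fd, uint64_t token)
{
    ssize_t n = write(pipe_wr_fd, &token, sizeof(token));
    if (n == sizeof(token)) {
        return 1;
    }
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return 0;   /* pipe full: keep the token and retry, do not drop it */
    }
    return -1;
}
```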
Indeed, at high load. Work in progress |
Thanks a lot for clarifying and for your work on that! @edsiper |
Could increasing the allocated CPU be a workaround? |
Not necessarily; the idea is to have separate channels for that specific kind of notification. |
@grep4error @avdhoot @tomerleib @mtparet @shirolimit @clebs I've pushed a possible solution in the ch_manager branch; would you please build it and test it? If you confirm the fix is good to go, I can merge it into the 1.6 series this week, but I need your help with proper feedback. About the solution: now every output plugin instance has its own channels to notify events through the event loop, so it does not saturate the main engine |
Ping, any feedback? |
I haven't taken the time to build/push/test it yet, but I will. |
I have limited experience with fluentbit but stumbled across this thread as I am experiencing the same issues. |
please supply your config files |
Seeing this issue as well. Any updates on this? Any information we can capture and provide to help get this issue moving towards a fix? It looks like it has been around for several releases and makes fluent-bit unreliable under high load, and unpredictably so, it seems. We are seeing this in multiple Kubernetes environments, and there is no viable workaround for detecting it that doesn't risk missing logs. |
Currently running into this issue with file buffering turned on, and I might have a theory why some folks are running into it. This issue might manifest depending on what the pipe buffer size is: you could end up with one of three pipe buffer sizes depending on which Linux kernel you are running and whether you are hitting the kernel's pipe buffer soft limits. The default pipe buffer size on Linux is 65536 bytes. If you are hitting the kernel's pipe buffer soft limits, your pipe size might instead be either 4096 bytes (https://github.com/torvalds/linux/blob/v4.18/fs/pipe.c#L642) or 8192 bytes (https://github.com/torvalds/linux/blob/v5.15/fs/pipe.c#L797). As a temporary workaround, you can add the CAP_SYS_RESOURCE capability to the fluent-bit Docker container, as that ensures the pipe buffer size is at least 65536 bytes. A proper solution would be to limit the number of tasks that can run based on the pipe's buffer size. |
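For anyone who wants to check which of these sizes they are actually getting, here is a small standalone sketch (not fluent-bit code) that queries and tries to grow a pipe's buffer via the Linux F_GETPIPE_SZ / F_SETPIPE_SZ fcntls; growing past the per-user soft limit is what CAP_SYS_RESOURCE permits.

```c
/* Sketch (not fluent-bit code): query and try to grow a pipe's buffer on
 * Linux. Growing past the per-user soft limit needs CAP_SYS_RESOURCE, which
 * is why adding that capability to the container guarantees the default
 * 65536-byte buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }

    printf("current pipe buffer: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));

    /* Ask for 64 KiB; without CAP_SYS_RESOURCE this may be capped or fail
     * with EPERM if the user's soft pipe limit is already consumed. */
    if (fcntl(fds[1], F_SETPIPE_SZ, 65536) < 0) {
        perror("F_SETPIPE_SZ");
    }
    printf("pipe buffer now:     %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```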
I tried lowering the maximum task count from 2048 to 512 to see how a 4k pipe buffer would perform if you hit the pipe buffer limits. With a maximum of 2048 tasks and each notification needing 8 bytes, you would need a buffer size of 16384 bytes. |
I have tried 500 tasks; the fluent-bit block still occurs (but it works normally after restarting it). Maybe the coroutine scheduler algorithm has some bugs? :) |
I believe we have been hitting this too. On a deployment of roughly 700 nodes across around 40 clusters, we were seeing around 8% of pods getting stuck. Attaching a debugger to the stuck instances, we saw a similar stack trace:
If we detach, wait and reattach it is still stuck there. We think this is maybe only affecting nodes with either large numbers of files tracked, or large log volumes, or both. When we terminate the stuck pod and it gets replaced we see it is very likely to get stuck again within the next ten to fifteen minutes. We assume that this is because it's on a node that is hosting particularly busy pods or particularly many pods but we're not sure. We have tested the CAP_SYS_RESOURCE workaround on a subset of affected clusters and it does appear to remove the problem. This is our current configuration:
I'll note that the Mem_Buf_Limit is set very high, but that was an earlier attempt to fix the problem before we understood what was going wrong. The newrelic output plugin is this: https://github.com/newrelic/newrelic-fluent-bit-output Some of the relevant environment variables:
The stack trace above is from 1.8.9 with Amazon's patches, which we had been trying to see if they fixed anything, but we have been having this problem with 1.8.12 too. |
The soft pipe limit is per-user, right? As I understand it, Kubernetes doesn't namespace users, so if Fluent Bit is running inside a container as uid 0, and lots of other processes are all running inside containers also as uid 0, then are they all sharing that same limit? That could explain why it is hard to reproduce outside of production - it needs something else running as the same uid to consume all of the soft pipe limit, and it needs heavy load to fill up the queue. |
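For reference, the limits in question live in procfs and are expressed in pages; the soft limit defaults to 16384 pages (64 MiB with 4 KiB pages) and is shared by everything running as the same uid. A tiny sketch (my own, not fluent-bit code) to print them:

```c
/* Sketch: read the kernel's per-user pipe accounting limits from procfs.
 * Values are in pages for the first two entries and bytes for the last. */
#include <stdio.h>

static long read_long(const char *path)
{
    long v = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &v) != 1) {
            v = -1;
        }
        fclose(f);
    }
    return v;
}

int main(void)
{
    printf("pipe-user-pages-soft: %ld pages\n",
           read_long("/proc/sys/fs/pipe-user-pages-soft"));
    printf("pipe-user-pages-hard: %ld pages\n",
           read_long("/proc/sys/fs/pipe-user-pages-hard"));
    printf("pipe-max-size:        %ld bytes\n",
           read_long("/proc/sys/fs/pipe-max-size"));
    return 0;
}
```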
I spent a while trying to make an isolated reproduction, but I haven't been successful yet. I am using one process running as the same user to allocate pipes and using
I note @danlenar's earlier observation that it's possible to block on a write to a pipe even if it's not empty. I would make this even more explicit: it's possible to block on a write to a pipe if the unused capacity is at least one byte and less than 4096 bytes. Once the pipe writer reaches a page boundary, it needs a completely free page that does not still contain any data waiting to be read from the pipe. You can observe this behaviour by creating a pipe, resizing it to 4096 bytes, writing 4096 bytes, reading 4095 bytes, then attempting to write 1 byte. Even though the pipe has only a single unread byte remaining, writing even a single byte into the pipe will block until that byte has been read.
Based on this, I think it should be possible to reproduce by creating the soft pipe limit scenario on a machine with kernel <5.1.4 and finding some configuration that spends a meaningful proportion of time with a task queue of two or more tasks. Sooner or later, two or more tasks should attempt to write to the (single-page-capacity) pipe as it is crossing the page boundary. |
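A compact sketch of that experiment (my own illustration; it uses O_NONBLOCK so it reports EAGAIN instead of hanging forever, but a blocking pipe would simply block at the same point):

```c
/* Shrink a pipe to a single page, fill it, drain all but one byte, then
 * show that even a 1-byte write cannot proceed: the one unread byte pins
 * the only page, so the writer has no free page to use. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[4096];
    ssize_t n;

    if (pipe2(fds, O_NONBLOCK) != 0 ||
        fcntl(fds[1], F_SETPIPE_SZ, 4096) < 0) {
        perror("setup");
        return 1;
    }

    memset(buf, 'x', sizeof(buf));
    n = write(fds[1], buf, 4096);   /* fill the single page completely */
    n = read(fds[0], buf, 4095);    /* drain all but one byte */

    /* 4095 bytes are nominally free, but the page is still occupied. */
    n = write(fds[1], "y", 1);
    if (n < 0 && errno == EAGAIN) {
        printf("1-byte write would block despite 4095 free bytes\n");
    }

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```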
I'm afraid I haven't had any more time to spend on reproducing this. I'm tending to think that there's no good way to use a pipe as a message queue for a single thread on kernels <5.1.4, since a one-page pipe basically cannot guarantee even to hold two messages without blocking. You pretty much have to consider the guaranteed usable capacity of a pipe to be one page less than it has allocated. If you get at least two pages it makes sense to limit the task queue to
Would any Fluent Bit devs comment on what might make an acceptable PR here? |
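One plausible way to express such a limit, given the 8-byte notifications mentioned earlier in the thread (illustration only, not a proposed patch; the names and constants are assumptions):

```c
/* Treat a pipe's guaranteed usable capacity as one page less than its
 * allocated size, and cap the task queue so every pending 8-byte
 * notification is guaranteed to fit without blocking. */
#include <stdio.h>

#define PAGE_SIZE        4096
#define NOTIFICATION_SZ  8      /* bytes per task notification */

static int max_tasks_for_pipe(int pipe_size_bytes)
{
    if (pipe_size_bytes <= PAGE_SIZE) {
        return 0;   /* a one-page pipe cannot guarantee even two messages */
    }
    return (pipe_size_bytes - PAGE_SIZE) / NOTIFICATION_SZ;
}

int main(void)
{
    printf("64 KiB pipe -> at most %d tasks\n", max_tasks_for_pipe(65536));
    printf(" 8 KiB pipe -> at most %d tasks\n", max_tasks_for_pipe(8192));
    return 0;
}
```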
@annettejanewilson the saturation happens because the event loop and channels are saturated. To fix the problem, just enable workers in your output plugin; each worker will have its own independent event loop and pipe/channels, so it should be fine. FYI: in 1.8.13 we just changed the defaults to enable workers in order to avoid this situation. |
example:
|
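The original example was not captured in this copy of the thread; a minimal illustration of enabling output workers (the plugin, host, and port values are placeholders) would look something like:

```
[OUTPUT]
    Name    es
    Match   *
    Host    my-elasticsearch
    Port    9200
    Workers 2
```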
I am also getting the same issue. My setup is 8-9 months old, and in my case everything was working fine until yesterday evening, when fluent-bit suddenly stopped processing logs for a few pods in only one namespace. Nothing unusual was found in the fluent-bit logs. The fluent-bit pods are still running but have stopped sending logs to the output. The fluent-bit version I am currently using is v1.8. Any fix for this? A quick response would be appreciated. |
The issue was fixed by upgrading fluent-bit from v1.8 to v1.8.1. It also worked with v1.7.9. |
@edsiper Would some kind of dedicated scale/load testing infra help find this problem faster? With input from @hh and @jeefy I can easily imagine spinning up some high-core-count / fast-network machines on the CNCF CIL (Equinix Metal) infra and getting some reproducible saturation testing (e.g. an 80-core Ampere Altra posting at 10GB to a 4-core x86 system should be a test of mettle). It would also be possible to set up either very low latency or very high latency data paths, depending on your needs. Looking at cncf/cluster#114 and wondering if there's time during KubeCon next week to map out some strategy. |
I’m here to help strategise and assist. I’m not on-site but available for a sync or async catch-up.
Cheers,
Hippie Hacker
|
hey Team, sorry for missing the note about syncing up at KubeCon - @hh @vielmetti would you be able to join the Fluent Community Meeting next week for us to discuss this? Also adding @patrick-stephens @niedbalski |
Yeah, I'm still at KubeCon at the Calyptia booth if you want a quick chat as well. Sounds like a good idea to me; it's something I've been wanting to add. |
It's an early one for me, but I'm up for it. :)
|
CAP_SYS_RESOURCE did not help. I tried increasing the number of Kafka workers (up to 2048; I didn’t go further because it began to take a lot of memory), and that didn’t help either. Version 2.0.6. Stats for 16 workers:
There is a similar problem with the out_forward plugin, but with the out_null plugin everything works fine... |
Hello, I am facing the issue below (og.20240619-164548). I am using aws-for-fluent-bit:2.32.2.20240425 and passing records to an AWS Firehose data stream. Here is a snippet from my fluent-bit config: [FILTER] [OUTPUT] |
Bug Report
My fluent-bit 1.5.7 is running in a container in a k8s (AKS) environment. It’s configured to collect Docker logs (33 tail inputs configured), apply a few filters, and send them to Elasticsearch (33 outputs).
Recently, as the amount of logs per node increased, fluent-bit started sporadically freezing up. The process would continue running, consuming 0% CPU and not processing any new logs or the filesystem storage backlog. It would, however, respond to monitoring queries on its HTTP port.
After some debugging using strace and gdb, I found that it locks up attempting to write to ch_manager pipe.
Here’s the stack trace
strace (filtered to reads from fd 13 and writes to fd 14, which are the ch_manager pipe fds)
It looks like the elasticsearch outputs may send so many responses to the inputs at the same time that the pipe fills up and blocks in write(). But the inputs run in the same thread, so they can’t read responses from the pipe, and fluent-bit locks up.
I produced a dirty fix by making the ch_manager pipe non-blocking. I also tried extending the size of the pipe (or at least querying it), but ioctl fails to get or set the pipe size. See the snippet below; I added the last line.
flb_engine.c:
There's probably a cleaner way to fix it, but this one-liner worked for me. Now I get an occasional “resource not available” error in the log, but fluent-bit survives and continues crunching logs. |
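The flb_engine.c snippet itself was not captured in this copy of the thread. As a rough illustration of the kind of one-liner described (standard POSIX calls, not fluent-bit's actual code; the variable names are assumptions):

```c
/* Rough illustration only: after creating a ch_manager-style pipe, set
 * O_NONBLOCK on the write end so a full pipe makes write() fail with
 * EAGAIN instead of blocking the single engine thread. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int ch_manager[2];               /* stand-in for the engine's ch_manager */

    if (pipe(ch_manager) != 0) {
        perror("pipe");
        return 1;
    }

    /* the "one-liner": switch the write end to non-blocking mode */
    int flags = fcntl(ch_manager[1], F_GETFL, 0);
    if (flags < 0 || fcntl(ch_manager[1], F_SETFL, flags | O_NONBLOCK) < 0) {
        perror("fcntl");
        return 1;
    }

    printf("ch_manager write end is now non-blocking\n");
    close(ch_manager[0]);
    close(ch_manager[1]);
    return 0;
}
```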
Environment
Version used: 1.5.7 (container fluent/fluent-bit:1.5.7)
kubernetes 1.16.13
docker 3.0.10+azure
Ubuntu 16.04.1