Fluent-bit stops processing logs under high load, locks writing in ch_manager pipe #2661
Comments
Awesome, thanks @grep4error for all the help. @edsiper, this is affecting more people: #2621 & #2577. Please do the needful; we can help validate the fix. cc @tomerleib |
I've done two more tests in this area:
|
@grep4error do you have an image that you can share with me, so I can test it in my environment as well? |
However, I created an image for 1.7.0 (yes, I know it's not released, but I forgot to change the branch) and everything worked for more than 14 hours. |
We have the same issue; for now we will revert to 1.3.x, which does not seem to have this issue. |
@mtparet which 1.3.x version have you tested? |
It was 1.3.9. I cannot guarantee we had no freezes, but I did not observe any. |
Tested 1.3.9, after 28 hours one of the fluent-bit pods froze again... |
Hello @edsiper, |
FYI: I am taking a look at this (WIP) |
I am thinking about a solution. Making the socket async is right, but the EAGAIN returned when trying to write to the pipe in a full state will need extra care, since the notification from the output saying "I am done" or "I need a retry" would be missed. Work in progress |
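A minimal sketch of the concern, using plain POSIX calls rather than fluent-bit's internal API (the function and token names below are assumptions): with a non-blocking pipe, a full buffer turns into EAGAIN, and unless the caller keeps the notification and retries it later, the "done / needs retry" event is silently lost.

```c
/* Sketch only, plain POSIX: writing a task-completion token to a
 * non-blocking pipe. On EAGAIN the token must be kept and retried later
 * (e.g. queued until the event loop drains the pipe), otherwise the
 * "done / retry" notification is lost, which is the extra care mentioned. */
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Returns 1 if written, 0 if the pipe was full (caller must retry later),
 * -1 on a real error. */
static int notify_task_event(int pipe_wr_fd, uint64_t token)
{
    ssize_t n = write(pipe_wr_fd, &token, sizeof(token));
    if (n == sizeof(token)) {
        return 1;
    }
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return 0;   /* pipe full: keep the token and retry, do not drop it */
    }
    return -1;
}
```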
Indeed, at high load. Work in progress |
Thanks a lot for clarifying and for your work on that! @edsiper |
Could increasing the allocated CPU be a workaround? |
Not necessarily; the idea is to have separate channels for that specific kind of notification. |
@grep4error @avdhoot @tomerleib @mtparet @shirolimit @clebs I've pushed a possible solution in the ch_manager branch; would you please build it and test it? If you confirm the fix is good to go, I can merge it into the 1.6 series this week, but I need your help with proper feedback. About the solution: now every output plugin instance has its own channels to notify events through the event loop, so it does not saturate the main engine |
Ping, any feedback? |
I haven't taken the time to build/push/test it yet, but I will. |
I have limited experience with fluentbit but stumbled across this thread as I am experiencing the same issues. |
please supply your config files |
Seeing this issue as well. Any updates on this? Any information we can capture and provide to help get this issue moving towards a fix? It looks like it has been around for several releases and makes fluent-bit unreliable under high load, and unpredictably so, it seems. We are seeing this in multiple Kubernetes environments, and there is no viable workaround for detecting it that doesn't risk missing logs. |
Currently running into this issue with file buffering turned on, and I might have a theory why some folks are running into it. This issue might manifest depending on what the pipe buffer size is: you could end up with one of three pipe buffer sizes depending on which Linux kernel you are running and whether you are hitting the kernel's pipe buffer soft limits. The default pipe buffer size on Linux is 65536 bytes. If you are hitting the kernel's pipe buffer soft limits, your pipe size might instead be either 4096 bytes (https://github.com/torvalds/linux/blob/v4.18/fs/pipe.c#L642) or 8192 bytes (https://github.com/torvalds/linux/blob/v5.15/fs/pipe.c#L797). As a temporary workaround, you can add the CAP_SYS_RESOURCE capability to the fluent-bit Docker container, as that ensures the pipe buffer size is at least 65536 bytes. A proper solution would be to limit the number of tasks that can run based on the pipe's buffer size. |
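For anyone who wants to check which of these sizes they are actually getting, here is a small standalone sketch (not fluent-bit code) that queries and tries to grow a pipe's buffer via the Linux F_GETPIPE_SZ / F_SETPIPE_SZ fcntls; growing past the per-user soft limit is what CAP_SYS_RESOURCE permits.

```c
/* Sketch (not fluent-bit code): query and try to grow a pipe's buffer on
 * Linux. Growing past the per-user soft limit needs CAP_SYS_RESOURCE, which
 * is why adding that capability to the container guarantees the default
 * 65536-byte buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }

    printf("current pipe buffer: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));

    /* Ask for 64 KiB; without CAP_SYS_RESOURCE this may be capped or fail
     * with EPERM if the user's soft pipe limit is already consumed. */
    if (fcntl(fds[1], F_SETPIPE_SZ, 65536) < 0) {
        perror("F_SETPIPE_SZ");
    }
    printf("pipe buffer now:     %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```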
I tried lowering the maximum task count from 2048 to 512 to see how a 4k pipe buffer would perform if you hit the pipe buffer limits. With a maximum of 2048 tasks and each notification needing 8 bytes, you would need a buffer size of 16384 bytes. |
I have tried 500 tasks; the fluent-bit block still occurs (but it works normally after restarting it). Maybe the coroutine scheduler algorithm has some bugs? :) |
I believe we have been hitting this too. On a deployment of roughly 700 nodes across around 40 clusters, we were seeing around 8% of pods getting stuck. Attaching a debugger to the stuck instances, we saw a similar stack trace:
If we detach, wait and reattach it is still stuck there. We think this is maybe only affecting nodes with either large numbers of files tracked, or large log volumes, or both. When we terminate the stuck pod and it gets replaced we see it is very likely to get stuck again within the next ten to fifteen minutes. We assume that this is because it's on a node that is hosting particularly busy pods or particularly many pods but we're not sure. We have tested the CAP_SYS_RESOURCE workaround on a subset of affected clusters and it does appear to remove the problem. This is our current configuration:
I'll note that the Mem_Buf_Limit is set very high, but that was an earlier attempt to fix the problem before we understood what was going wrong. The newrelic output plugin is this: https://github.com/newrelic/newrelic-fluent-bit-output Some of the relevant environment variables:
The stack trace above is from 1.8.9 with Amazon's patches, which we had been trying to see if they fixed anything, but we have been having this problem with 1.8.12 too. |
The soft pipe limit is per-user, right? As I understand it, Kubernetes doesn't namespace users, so if Fluent Bit is running inside a container as uid 0, and lots of other processes are all running inside containers also as uid 0, then are they all sharing that same limit? That could explain why it is hard to reproduce outside of production - it needs something else running as the same uid to consume all of the soft pipe limit, and it needs heavy load to fill up the queue. |
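For reference, the limits in question live in procfs and are expressed in pages; the soft limit defaults to 16384 pages (64 MiB with 4 KiB pages) and is shared by everything running as the same uid. A tiny sketch (my own, not fluent-bit code) to print them:

```c
/* Sketch: read the kernel's per-user pipe accounting limits from procfs.
 * Values are in pages for the first two entries and bytes for the last. */
#include <stdio.h>

static long read_long(const char *path)
{
    long v = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &v) != 1) {
            v = -1;
        }
        fclose(f);
    }
    return v;
}

int main(void)
{
    printf("pipe-user-pages-soft: %ld pages\n",
           read_long("/proc/sys/fs/pipe-user-pages-soft"));
    printf("pipe-user-pages-hard: %ld pages\n",
           read_long("/proc/sys/fs/pipe-user-pages-hard"));
    printf("pipe-max-size:        %ld bytes\n",
           read_long("/proc/sys/fs/pipe-max-size"));
    return 0;
}
```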
I spent a while trying to make an isolated reproduction, but I haven't been successful yet. I am using one process running as the same user to allocate pipes and using
I note @danlenar's earlier observation that it's possible to block on a write to a pipe even if it's not empty. I would make this even more explicit: it's possible to block on a write to a pipe if the unused capacity is at least one byte and less than 4096 bytes. Once the pipe writer reaches a page boundary, it needs a completely free page that does not still contain any data waiting to be read from the pipe. You can observe this behaviour by creating a pipe, resizing it to 4096 bytes, writing 4096 bytes, reading 4095 bytes, then attempting to write 1 byte. Even though the pipe has only a single unread byte remaining, writing even a single byte into the pipe will block until that byte has been read.
Based on this, I think it should be possible to reproduce by creating the soft pipe limit scenario on a machine with kernel <5.1.4 and finding some configuration that spends a meaningful proportion of time with a task queue of two or more tasks. Sooner or later, two or more tasks should attempt to write to the (single-page-capacity) pipe as it is crossing the page boundary. |
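A compact sketch of that experiment (my own illustration; it uses O_NONBLOCK so it reports EAGAIN instead of hanging forever, but a blocking pipe would simply block at the same point):

```c
/* Shrink a pipe to a single page, fill it, drain all but one byte, then
 * show that even a 1-byte write cannot proceed: the one unread byte pins
 * the only page, so the writer has no free page to use. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[4096];
    ssize_t n;

    if (pipe2(fds, O_NONBLOCK) != 0 ||
        fcntl(fds[1], F_SETPIPE_SZ, 4096) < 0) {
        perror("setup");
        return 1;
    }

    memset(buf, 'x', sizeof(buf));
    n = write(fds[1], buf, 4096);   /* fill the single page completely */
    n = read(fds[0], buf, 4095);    /* drain all but one byte */

    /* 4095 bytes are nominally free, but the page is still occupied. */
    n = write(fds[1], "y", 1);
    if (n < 0 && errno == EAGAIN) {
        printf("1-byte write would block despite 4095 free bytes\n");
    }

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```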
I'm afraid I haven't had any more time to spend on reproducing this. I'm tending to think that there's no good way to use a pipe as a message queue for a single thread on kernels <5.1.4, since a one-page pipe basically cannot guarantee even to hold two messages without blocking. You pretty much have to consider the guaranteed usable capacity of a pipe to be one page less than it has allocated. If you get at least two pages it makes sense to limit the task queue to
Would any Fluent Bit devs comment on what might make an acceptable PR here? |
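One plausible way to express such a limit, given the 8-byte notifications mentioned earlier in the thread (illustration only, not a proposed patch; the names and constants are assumptions):

```c
/* Treat a pipe's guaranteed usable capacity as one page less than its
 * allocated size, and cap the task queue so every pending 8-byte
 * notification is guaranteed to fit without blocking. */
#include <stdio.h>

#define PAGE_SIZE        4096
#define NOTIFICATION_SZ  8      /* bytes per task notification */

static int max_tasks_for_pipe(int pipe_size_bytes)
{
    if (pipe_size_bytes <= PAGE_SIZE) {
        return 0;   /* a one-page pipe cannot guarantee even two messages */
    }
    return (pipe_size_bytes - PAGE_SIZE) / NOTIFICATION_SZ;
}

int main(void)
{
    printf("64 KiB pipe -> at most %d tasks\n", max_tasks_for_pipe(65536));
    printf(" 8 KiB pipe -> at most %d tasks\n", max_tasks_for_pipe(8192));
    return 0;
}
```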
@annettejanewilson the saturation happens because the event loop and channels are saturated. To fix the problem, just enable workers in your output plugin; each worker will have its own independent event loop and pipe/channels, so it should be fine. FYI: in 1.8.13 we just changed the defaults to enable workers in order to avoid this situation. |
example:
|
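The original example was not captured in this copy of the thread; a minimal illustration of enabling output workers (the plugin, host, and port values are placeholders) would look something like:

```
[OUTPUT]
    Name    es
    Match   *
    Host    my-elasticsearch
    Port    9200
    Workers 2
```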
I am also getting the same issue. My setup is 8-9 months old, and in my case everything was working fine until yesterday evening, when fluent-bit suddenly stopped processing logs for a few pods in only one namespace. Nothing unusual was found in the fluent-bit logs. The fluent-bit pods are still running but have stopped sending logs to the output. The fluent-bit version I am currently using is v1.8. Any fix for this? A quick response would be appreciated. |
The issue was fixed by upgrading fluent-bit from v1.8 to v1.8.1. It also worked with v1.7.9. |
@edsiper Would some kind of dedicated scale/load testing infra help find this problem faster? With input from @hh and @jeefy I can easily imagine spinning up some high-core-count / fast-network machines on the CNCF CIL (Equinix Metal) infra and getting some reproducible saturation testing (e.g. an 80-core Ampere Altra posting at 10GB to a 4-core x86 system should be a test of mettle). It would also be possible to set up either very low latency or very high latency data paths, depending on your needs. Looking at cncf/cluster#114 and wondering if there's time during KubeCon next week to map out some strategy. |
I’m here to help strategise and assist. I’m not on-site but available for a sync or async catch-up.
Cheers,
Hippie Hacker
|
hey Team, sorry for missing the note about syncing up at KubeCon - @hh @vielmetti would you be able to join the Fluent Community Meeting next week for us to discuss this? Also adding @patrick-stephens @niedbalski |
Yeah, I'm still at KubeCon at the Calyptia booth if you want a quick chat as well. Sounds like a good idea to me; it's something I've been wanting to add. |
It's an early one for me, but I'm up for it. :)
|
CAP_SYS_RESOURCE did not help. I tried increasing the number of Kafka workers (up to 2048; I didn’t go further because it began to take a lot of memory), and that didn’t help either. Version 2.0.6. Stats for 16 workers:
There is a similar problem with the out_forward plugin, but with the out_null plugin everything works fine... |
Hello, I am facing the issue below (og.20240619-164548). I am using aws-for-fluent-bit:2.32.2.20240425 and passing records to an AWS Firehose data stream. Here is a snippet from my fluent-bit config: [FILTER] [OUTPUT] |
Bug Report
My fluent-bit 1.5.7 is running in a container in a k8s (AKS) environment. It’s configured to collect Docker logs (33 tail inputs configured), apply a few filters, and send them to Elasticsearch (33 outputs).
Recently, as the amount of logs per node increased, fluent-bit started sporadically freezing up. The process would continue running, consuming 0% CPU and not processing any new logs or the filesystem storage backlog. It would, however, respond to monitoring queries on its HTTP port.
After some debugging using strace and gdb, I found that it locks up attempting to write to ch_manager pipe.
Here’s the stack trace
strace (filtered to reads from fd 13 and writes to fd 14, which are the ch_manager pipe fds)
It looks like the elasticsearch outputs may send so many responses to the inputs at the same time that the pipe fills up and blocks in write(). But the inputs run in the same thread, so they can’t read responses from the pipe, and fluent-bit locks up.
I produced a dirty fix by making the ch_manager pipe non-blocking. I also tried extending the size of the pipe (or at least querying it), but ioctl fails to get or set the pipe size. See the snippet below; I added the last line.
flb_engine.c:
There's probably a cleaner way to fix it, but this one-liner worked for me. Now I get an occasional “resource not available” error in the log, but fluent-bit survives and continues crunching logs. |
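The flb_engine.c snippet itself was not captured in this copy of the thread. As a rough illustration of the kind of one-liner described (standard POSIX calls, not fluent-bit's actual code; the variable names are assumptions):

```c
/* Rough illustration only: after creating a ch_manager-style pipe, set
 * O_NONBLOCK on the write end so a full pipe makes write() fail with
 * EAGAIN instead of blocking the single engine thread. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int ch_manager[2];               /* stand-in for the engine's ch_manager */

    if (pipe(ch_manager) != 0) {
        perror("pipe");
        return 1;
    }

    /* the "one-liner": switch the write end to non-blocking mode */
    int flags = fcntl(ch_manager[1], F_GETFL, 0);
    if (flags < 0 || fcntl(ch_manager[1], F_SETFL, flags | O_NONBLOCK) < 0) {
        perror("fcntl");
        return 1;
    }

    printf("ch_manager write end is now non-blocking\n");
    close(ch_manager[0]);
    close(ch_manager[1]);
    return 0;
}
```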
Environment
Version used: 1.5.7 (container fluent/fluent-bit:1.5.7)
kubernetes 1.16.13
docker 3.0.10+azure
Ubuntu 16.04.1