
Fluent-bit stops processing logs under high load, locks writing in ch_manager pipe #2661

Open
grep4error opened this issue Oct 9, 2020 · 63 comments
Assignees
Labels
enhancement waiting-for-user Waiting for more information, tests or requested changes

Comments

@grep4error

grep4error commented Oct 9, 2020

Bug Report

My fluent-bit 1.5.7 is running in a container in a k8s (AKS) environment. It’s configured to collect docker logs (33 tail inputs), send them to elasticsearch (33 outputs), and apply a few filters.
Recently, as the volume of logs per node increased, fluent-bit started sporadically freezing up. The process would keep running, consuming 0% CPU, without processing any new logs or the filesystem storage backlog. It would, however, still respond to monitoring queries on its HTTP port.
After some debugging with strace and gdb, I found that it locks up attempting to write to the ch_manager pipe.
Here’s the stack trace

0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055a127ecad01 in flb_output_return (ret=1, th=0x7fac548cd240) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:545
#2  0x000055a127ecade9 in flb_output_return_do (x=1) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:576
#3  0x000055a127eccca1 in cb_es_flush (data=0x7fac5991e0b2, bytes=3842,
    tag=0x7fac5437fd20 "kube.gws.gws-platform-datacollector-blue-6c474d7c84-wsdcw.gws-platform-datacollector-blue.2db22adbd3daaaf836a3f6311f4b3e5ad9ec7727280458ac68868419fb758ab9", tag_len=154,
    ins=0x7fac5949e480, out_context=0x7fac56c6bc00, config=0x7fac59432c80) at /tmp/fluent-bit/plugins/out_es/es.c:748
#4  0x000055a127e72649 in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:449
#5  0x000055a1282a6907 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6  0x3039663238613165 in ?? ()
#

strace (filtered by reads from fd 13 and writes to fd 14, which are the ch_manager pipe fds)

...
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
write(14, "\0\200\25\20\2\0\0\0", 8)    = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
write(14, "\0\200\17\20\2\0\0\0", 8)    = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\0\33\20\2\0\0\0", 8)       = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = 8
write(14, "\0\300\24\20\2\0\0\0", 8)    = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
write(14, "\0\0\33\20\2\0\0\0", 8)      = 8
read(13, "\0\200\25\20\2\0\0\0", 8)     = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
write(14, "\0\200\17\20\2\0\0\0", 8)    = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
write(14, "\0@\16\20\2\0\0\0", 8)       = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
read(13, "\0\0\33\20\2\0\0\0", 8)       = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
read(13, "\0@\16\20\2\0\0\0", 8)        = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
write(14, "\0\300\24\20\2\0\0\0", 8)    = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
write(14, "\0@\16\20\2\0\0\0", 8)       = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
write(14, "\0\0\33\20\2\0\0\0", 8)      = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
read(13, "\0@\16\20\2\0\0\0", 8)        = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = ? ERESTARTSYS (To be restarted if SA_RESTART is set)

It looks like the elasticsearch outputs can send so many responses back at the same time that the pipe fills up and blocks in write(). But the inputs run on the same thread, so they can never read the responses out of the pipe, and fluent-bit locks up.
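For illustration, here is a minimal standalone C sketch of that self-deadlock pattern (this is not fluent-bit code, just the same shape: one thread is both the only writer and the only reader of a notification pipe):

    /* Minimal sketch (not fluent-bit code): one thread is both the only
     * writer and the only reader of a notification pipe.  Once the pipe
     * buffer fills, write() blocks forever, because the reads that would
     * drain it are scheduled on this same, now-blocked, thread. */
    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>

    int main(void)
    {
        int ch[2];
        uint64_t msg = 0;

        if (pipe(ch) != 0) {
            perror("pipe");
            return 1;
        }

        for (;;) {
            /* Each notification is 8 bytes, like the writes in the strace
             * output above. */
            if (write(ch[1], &msg, sizeof(msg)) != sizeof(msg)) {
                perror("write");
                return 1;
            }
            msg++;
            printf("queued notification %llu\n", (unsigned long long) msg);
            /* With the default 64 KiB pipe buffer this hangs after ~8192
             * writes; with a 4 KiB buffer (soft-limit case) after ~512. */
        }
    }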

I produced a dirty fix for it by making the ch_manager pipe non-blocking. I also tried extending the size of the pipe (or at least querying it), but ioctl failed to get or set the pipe size. See the snippet below; I added the last line.

flb_engine.c:

    /*
     * Create a communication channel: this routine creates a channel to
     * signal the Engine event loop. It's useful to stop the event loop
     * or to instruct anything else without break.
     */
    ret = mk_event_channel_create(config->evl,
                                  &config->ch_manager[0],
                                  &config->ch_manager[1],
                                  config);
    if (ret != 0) {
        flb_error("[engine] could not create manager channels");
        return -1;
    }

    flb_pipe_set_nonblocking(&config->ch_manager[1]); /* <----- I made it non-blocking ------- */

There's probably a cleaner way to fix it, but this one-liner worked for me. Now I get occasional “resource not available” errors in the log, but fluent-bit survives and continues crunching logs.
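For reference, the idea behind the one-liner can be sketched with plain POSIX calls. This assumes flb_pipe_set_nonblocking() effectively sets O_NONBLOCK on the fd; the real fluent-bit helper may differ:

    /* Sketch of the idea behind the workaround using plain POSIX calls.
     * Assumption: flb_pipe_set_nonblocking() boils down to setting
     * O_NONBLOCK; the actual implementation in fluent-bit may differ. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    static int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags == -1) {
            return -1;
        }
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* With O_NONBLOCK set, a full pipe makes write() fail with EAGAIN
     * instead of hanging the whole event loop; the cost is that the
     * notification is lost unless the caller retries it, which matches
     * the occasional "resource not available" errors mentioned above. */
    static int notify(int fd, uint64_t event)
    {
        ssize_t n = write(fd, &event, sizeof(event));

        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return -1;  /* pipe full: drop or retry the notification */
        }
        return (n == (ssize_t) sizeof(event)) ? 0 : -1;
    }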

Environment

Version used: 1.5.7 (container fluent/fluent-bit:1.5.7)
kubernetes 1.16.13
docker 3.0.10+azure
Ubuntu 16.04.1

@avdhoot

avdhoot commented Oct 9, 2020

Awesome, thanks @grep4error for all the help. @edsiper, this is affecting more people: #2621 & #2577. Please take a look; we can help validate a fix. cc @tomerleib

@tomerleib

I've done two more tests in this area:

  1. Downgraded the image to 1.3.6 - there was a slight improvement: only one pod encountered the issue and froze, while 2/3 pods of the DaemonSet continued to work.
  2. Removed all the tails and left a single input and a single stdout output - no issues; Fluent Bit has been running for almost 3 days and nothing has stopped.

@tomerleib

@grep4error do you have an image that you can share with me and I will test it in my environment as well?

@tomerleib

tomerleib commented Oct 12, 2020

No changes for me with 1.5.7 and this line.
Still stopped processing after 20-30 minutes.

However, I created an image for 1.7.0 (yes, I know it's not released, but I forgot to change the branch) and everything worked for more than 14 hours.

@edsiper edsiper self-assigned this Oct 12, 2020
@mtparet

mtparet commented Oct 12, 2020

We have the same issue; for now we will revert to 1.3.x, which does not seem to have this problem.

@tomerleib

@mtparet which 1.3.x version have you tested?
I tested with 1.3.7 and saw the same issue, although not across the board (2/3 pods kept running and 1 froze), but still.

@mtparet

mtparet commented Oct 12, 2020

It was 1.3.9. I cannot guarantee we had no freezes, but I did not observe any.

@tomerleib

Tested 1.3.9, after 28 hours one of the fluent-bit pods froze again...

@mtparet

mtparet commented Nov 16, 2020

Hello @edsiper,
Do you acknowledge the issue, meaning we need to find a fix for it? Or should this not happen at all, with something else outside of fluent-bit being broken?

@edsiper
Member

edsiper commented Nov 16, 2020

FYI: I am taking a look at this (WIP)

@edsiper
Member

edsiper commented Nov 16, 2020

I am thinking about a solution. Making the socket async is right, but when EAGAIN is returned on a write to a full pipe, extra care is needed, since the notification from the output saying "I am done" or "I need a retry" would otherwise be missed.

Work in progress

@edsiper
Member

edsiper commented Nov 16, 2020

Indeed, at high load ch_manager gets saturated; the obvious solution seems to be implementing a pool of pipes/channels for the output plugins. Until now we have been abusing the internal event loop's channel manager, which was quite OK up to a certain load. I will work on a POC with independent notification channels.

Work in progress

@mtparet

mtparet commented Nov 16, 2020

Thanks a lot for clarifying, and for your work on this! @edsiper

@mtparet

mtparet commented Nov 16, 2020

Indeed at high load ch_manager gets saturated

Could increasing the allocated CPU be a workaround?

@edsiper
Member

edsiper commented Nov 16, 2020

Not necessarily; the idea is to have separate channels for that specific kind of notification.

edsiper added a commit that referenced this issue Nov 17, 2020
Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
edsiper added a commit that referenced this issue Nov 17, 2020
Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
@edsiper
Member

edsiper commented Nov 17, 2020

@grep4error @avdhoot @tomerleib @mtparet @shirolimit @clebs

I've pushed a possible solution in the ch_manager branch; would you please build it and test it? If you confirm the fix is good to go, I can merge it into the 1.6 series this week, but I need your help with proper feedback.

About the solution: every output plugin instance now has its own channels to notify events through the event loop, so it no longer saturates the main engine ch_manager.

@edsiper edsiper added enhancement waiting-for-user Waiting for more information, tests or requested changes labels Nov 17, 2020
@edsiper
Member

edsiper commented Nov 20, 2020

ping, any feedback ?

edsiper added a commit that referenced this issue Nov 21, 2020
Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
edsiper added a commit that referenced this issue Nov 21, 2020
Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
@mtparet

mtparet commented Nov 23, 2020

I didn't take the time to build/push/test it, I will.

edsiper added a commit that referenced this issue Nov 23, 2020
Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
edsiper added a commit that referenced this issue Nov 23, 2020
Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
@github-actions github-actions bot removed the Stale label Apr 28, 2021
@Caroga

Caroga commented Apr 28, 2021

I have limited experience with fluentbit but stumbled across this thread as I am experiencing the same issues.
Anything I could do for debugging/insights?

@edsiper
Member

edsiper commented Apr 28, 2021

please supply your config files

@elruwen

elruwen commented Apr 30, 2021

@elruwen also with 1.7.3?

Sorry, I mixed up the issues. I have an issue very similar to this one, see #3014. That one is still present with 1.7.3, and I wouldn't be surprised if the two issues are related. It contains everything needed to reproduce it.

@frankreno

frankreno commented Sep 22, 2021

Seeing this issue as well. Any updates on this? Is there any information we can capture and provide to help get this issue moving towards a fix? It looks like it has been around for several releases and makes fluent-bit unreliable under high load, and unpredictably so. We are seeing this in multiple Kubernetes environments, and there is no viable workaround for detecting it that doesn't risk missing logs.

@danlenar
Contributor

danlenar commented Nov 5, 2021

Currently running into this issue with file buffering turned on, and I might have a theory why some folks are running into it.

This issue might manifest depending on what the pipe buffer size is.

You could end up with one of three pipe buffer sizes, depending on which Linux kernel you are running and whether you are hitting the kernel's pipe buffer soft limits.

The default pipe buffer size on Linux is 65536 bytes.
Since the maximum number of tasks is 2048, this pipe size can easily handle all the tasks without the pipe blocking.

If you are hitting the kernel pipe buffer soft limits, your pipe size might be either 4096 bytes (https://github.com/torvalds/linux/blob/v4.18/fs/pipe.c#L642) or 8192 bytes (https://github.com/torvalds/linux/blob/v5.15/fs/pipe.c#L797).
On CentOS/RHEL 7/8 your pipe size would be 4096 bytes, while newer kernels give 8192 bytes.
So you can run into a scenario where you add too many tasks/coroutines at once and the pipe blocks on writing.
This means the pipe can only really handle 512 or 1024 task notifications and doesn't have room for the current maximum task count of 2048.
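To check which of these cases a given node is in, here is a small illustrative sketch (not fluent-bit code) using the Linux-specific F_GETPIPE_SZ/F_SETPIPE_SZ fcntl commands:

    /* Sketch: check what buffer size the kernel actually gave a new pipe.
     * On a node that has hit the per-user soft limit, new pipes come back
     * at one page (4096 bytes) instead of the default 65536.  F_GETPIPE_SZ
     * and F_SETPIPE_SZ are Linux-specific. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int ch[2];
        int size;

        if (pipe(ch) != 0) {
            perror("pipe");
            return 1;
        }

        size = fcntl(ch[1], F_GETPIPE_SZ);
        if (size > 0) {
            printf("pipe buffer: %d bytes = room for %d 8-byte notifications\n",
                   size, size / 8);
        }

        /* Growing the buffer needs headroom under the soft limit, or
         * CAP_SYS_RESOURCE to exceed it -- otherwise this fails. */
        if (fcntl(ch[1], F_SETPIPE_SZ, 65536) == -1) {
            perror("F_SETPIPE_SZ");
        }
        return 0;
    }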

As a temporary workaround, you could add the CAP_SYS_RESOURCE capability to the fluent-bit container, as that ensures the pipe buffer size is at least 65536 bytes.
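For Kubernetes deployments, a rough sketch of how that capability might be granted in the DaemonSet pod spec (the container name and image tag below are placeholders):

    # Sketch only: grant CAP_SYS_RESOURCE to the fluent-bit container.
    containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.8.12
        securityContext:
          capabilities:
            add: ["SYS_RESOURCE"]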

A proper solution would be to limit the number of tasks that can run based on the pipe's buffer size.

@danlenar
Contributor

danlenar commented Nov 8, 2021

I tried lowering the maximum task count from 2048 to 512 to see how a 4k pipe buffer would perform when you hit the pipe buffer limits.
I ran into this issue, fixed in Linux kernel 5.14 (torvalds/linux@46c4c9d), where a pipe write may still block even if the pipe is not completely full.

With 2048 tasks max and each notification needing 8 bytes, you would need a buffer size of 16384 bytes.
If the CAP_SYS_RESOURCE capability is not an option, perhaps adding an array of 4k pipe buffers to write to would do the trick.

@zhanghe9702

Tried lower task size from 2048 to 512 to see how a pipe buffer of 4k would perform if you hit the pipe buffer limits. I ran into into this issue that has been fixed in the Linux kernel 5.14 (torvalds/linux@46c4c9d) where you may still block on a pipe write even if the pipe is not fully full.

With 2048 tasks max with each notification needing 8 bytes, you would need a buffer size of 16384 bytes. If CAP_SYS_RESOURCE capability is not an option, perhaps adding an array of 4k pipe buffers to write should do the trick.

I have tried 500 tasks and the fluent-bit block still occurs (though it works normally after a restart). Maybe the coroutine scheduler algorithm has some bugs? :)

@annettejanewilson

annettejanewilson commented Feb 25, 2022

I believe we have been hitting this too. On a deployment of roughly 700 nodes across around 40 clusters, we were seeing around 8% of pods getting stuck. Attaching a debugger to the stuck instances, we saw a similar stack trace:

Thread 8 (Thread 0x7f25b61fd700 (LWP 13)):
#0  0x00007f25c231c459 in write () from /lib64/libpthread.so.0
#1  0x00000000004ff3e7 in flb_output_return (ret=1, co=0x7f25b5217308) at /tmp/fluent-bit-1.8.9/include/fluent-bit/flb_output.h:649
#2  0x00000000004ff443 in flb_output_return_do (x=1) at /tmp/fluent-bit-1.8.9/include/fluent-bit/flb_output.h:685
#3  0x00000000004ff4eb in flb_proxy_cb_flush (data=0x7f257439cd80, bytes=790, tag=0x7f25b41cab80 "kube.var.log.containers.off-boarding-manager-6aapa-74c94c54b5-k2nz5_apollo_off-boarding-manager-3477301c92e8e9da2d9c6769983a2c06107c2ebf139548995e7aca30777f639d.log", tag_len=164, i_ins=0x7f25bda3a680, out_context=0x7f25bda22330, config=0x7f25bda19980) at /tmp/fluent-bit-1.8.9/src/flb_plugin_proxy.c:65
#4  0x00000000004d6973 in output_pre_cb_flush () at /tmp/fluent-bit-1.8.9/include/fluent-bit/flb_output.h:514
#5  0x00000000009d3667 in co_init () at /tmp/fluent-bit-1.8.9/lib/monkey/deps/flb_libco/amd64.c:117

If we detach, wait and reattach it is still stuck there.

We think this may only be affecting nodes with either a large number of tracked files, large log volumes, or both.

When we terminate the stuck pod and it gets replaced we see it is very likely to get stuck again within the next ten to fifteen minutes. We assume that this is because it's on a node that is hosting particularly busy pods or particularly many pods but we're not sure.

We have tested the CAP_SYS_RESOURCE workaround on a subset of affected clusters and it does appear to remove the problem.

This is our current configuration:

    [SERVICE]
        Flush         1
        Log_Level     ${LOG_LEVEL}
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020


    [INPUT]
        Name              tail
        Tag               kube.*
        Path              ${PATH}
        Parser            ${LOG_PARSER}
        DB                ${FB_DB}
        Mem_Buf_Limit     200Mb
        Skip_Long_Lines   On
        Refresh_Interval  10


    [FILTER]
        Name                kubernetes
        Match               kube.*
        Use_Kubelet         On
        Buffer_Size         2Mb
        K8S-Logging.Exclude On
        K8S-Logging.Parser  On
        Merge_Log           On
        Keep_Log            Off
        tls.verify          On

    [FILTER]
        Name          nest
        Match         *
        Operation     lift
        Nested_under  kubernetes

    [FILTER]
        Name          nest
        Match         *
        Operation     lift
        Nested_under  labels
        Add_prefix    labels.

    [FILTER]
        Name             modify
        Match            *
        Rename           labels.app                       service.name
        Rename           labels.app.kubernetes.io/name    service.name
        Rename           labels.app.kubernetes.io/version service.version
        Rename           trace_id                         trace.id
        Remove_wildcard  labels.
        Remove_wildcard  annotations
        Remove           container_hash
        Remove           docker_id
        Remove           pod_id

    [FILTER]
        Name           record_modifier
        Match          *
        Record         cluster_name ${CLUSTER_NAME}


    [OUTPUT]
        Name           newrelic
        Match          *
        licenseKey     ${LICENSE_KEY}
        endpoint       ${ENDPOINT}
        lowDataMode    true

I'll note that the Mem_Buf_Limit is set very high, but that was an earlier attempt to fix the problem before we understood what was going wrong. The newrelic output plugin is this: https://github.com/newrelic/newrelic-fluent-bit-output

Some of the relevant environment variables:

        - name: LOG_LEVEL
          value: info
        - name: LOG_PARSER
          value: docker
        - name: FB_DB
          value: /var/log/flb_kube.db
        - name: PATH
          value: /var/log/containers/*.log

The stack trace above is from 1.8.9 with Amazon's patches, which we had been trying to see if they fixed anything, but we have been having this problem with 1.8.12 too.

@annettejanewilson

The soft pipe limit is per-user, right? As I understand it, Kubernetes doesn't namespace users, so if Fluent Bit is running inside a container as uid 0, and lots of other processes are also running inside containers as uid 0, are they all sharing the same limit? That could explain why it is hard to reproduce outside of production: it needs something else running as the same uid to consume the soft pipe limit, and it needs heavy load to fill up the queue.
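For what it's worth, the relevant accounting can be inspected on the node itself; a small illustrative sketch that just prints the standard Linux pipe sysctls:

    /* Sketch: print the kernel's per-user pipe accounting limits on the
     * node.  pipe-user-pages-soft/-hard are in pages and are shared by
     * every process running as the same uid, regardless of container. */
    #include <stdio.h>

    static void show(const char *path)
    {
        char line[64];
        FILE *f = fopen(path, "r");

        if (f != NULL) {
            if (fgets(line, sizeof(line), f) != NULL) {
                printf("%s = %s", path, line);
            }
            fclose(f);
        }
    }

    int main(void)
    {
        show("/proc/sys/fs/pipe-user-pages-soft"); /* soft limit, in pages */
        show("/proc/sys/fs/pipe-user-pages-hard"); /* 0 means no hard limit */
        show("/proc/sys/fs/pipe-max-size");        /* max size of one pipe */
        return 0;
    }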

@annettejanewilson

annettejanewilson commented Mar 2, 2022

I spent a while trying to build an isolated reproduction, but I haven't been successful yet. I am using one process running as the same user to allocate pipes and resize them with fcntl F_SETPIPE_SZ until hitting the limit, at which point all newly allocated pipes can be observed (via fcntl F_GETPIPE_SZ) to be 4096 bytes (one page). At this point I think what is required to reproduce is a sustained backlog of tasks, but I haven't figured out how to generate one in an isolated setup that doesn't depend on Kubernetes or New Relic.

I note @danlenar's earlier observation that it's possible to block on a write to a pipe even if it's not full. I would make this even more explicit: it's possible to block on a write to a pipe if the unused capacity is at least one byte and less than 4096 bytes. Once the pipe writer reaches a page boundary, it needs a completely free page that does not still contain any data waiting to be read from the pipe. You can observe this behaviour by creating a pipe, resizing it to 4096 bytes, writing 4096 bytes, reading 4095 bytes, then attempting to write 1 byte. Even though the pipe has only a single unread byte remaining, writing even a single byte into it will block until that byte has been read.
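A minimal sketch of that experiment (Linux-specific F_SETPIPE_SZ; illustrative only):

    /* Sketch of the experiment: a one-page pipe with 4095 bytes free still
     * blocks a 1-byte write on kernels without the 5.14 fix, because the
     * writer needs a completely unused page. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int ch[2];
        char buf[4096] = { 0 };

        if (pipe(ch) != 0) {
            perror("pipe");
            return 1;
        }
        fcntl(ch[1], F_SETPIPE_SZ, 4096);   /* shrink to a single page */

        write(ch[1], buf, sizeof(buf));     /* fill the page completely */
        read(ch[0], buf, sizeof(buf) - 1);  /* leave one unread byte */

        printf("writing 1 byte into a pipe with 4095 bytes free...\n");
        fflush(stdout);
        write(ch[1], buf, 1);               /* hangs on affected kernels */
        printf("write returned\n");
        return 0;
    }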

Based on this, I think it should be possible to reproduce by creating the soft-pipe-limit scenario on a machine with a kernel older than 5.14, and finding some configuration that spends a meaningful proportion of time with a task queue of two or more tasks. Sooner or later two or more tasks should attempt to write to the (single-page-capacity) pipe as it crosses a page boundary.

@elruwen

elruwen commented Mar 3, 2022

Maybe something that helps with reproducing: #4162

I created it for this issue: #3014

Chunks get stuck under load. Maybe it helps...

@annettejanewilson

I'm afraid I haven't had any more time to spend on reproducing this.

I'm tending to think that there's no good way to use a pipe as a message queue for a single thread on kernels older than 5.14, since a one-page pipe basically cannot guarantee to hold even two messages without blocking. You pretty much have to consider the guaranteed usable capacity of a pipe to be one page less than it has allocated. If you get at least two pages, it makes sense to limit the task queue to (pipe_capacity - page_size) / message_size, as sketched below. I think at the bare minimum it would be valuable to emit a warning on startup when the task queue is larger than this limit, even (especially) when this limit is calculated to be zero.
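A sketch of that limit calculation, under the assumption that each notification is the 8-byte value seen in the strace output earlier in this thread:

    /* Sketch: derive a safe task-queue length from the pipe the engine
     * actually got, treating one page of the buffer as unusable and each
     * notification as 8 bytes. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int ch[2];
        long page_size, pipe_capacity, message_size = 8;
        long max_tasks;

        if (pipe(ch) != 0) {
            perror("pipe");
            return 1;
        }

        page_size     = sysconf(_SC_PAGESIZE);
        pipe_capacity = fcntl(ch[1], F_GETPIPE_SZ);

        max_tasks = (pipe_capacity - page_size) / message_size;
        if (max_tasks < 0) {
            max_tasks = 0;  /* one-page pipe: no length can be guaranteed */
        }
        printf("pipe=%ld bytes, page=%ld bytes -> safe task queue: %ld\n",
               pipe_capacity, page_size, max_tasks);
        return 0;
    }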

Would any Fluent Bit devs comment on what might make an acceptable PR here?

@edsiper
Member

edsiper commented Mar 4, 2022

@annettejanewilson saturation happens because the event loop and channels are saturated. To fix the problem, just enable workers in your output plugin; each worker will have its own independent event loop and pipe/channels, so it should be fine.

FYI: in 1.8.13 we just changed the defaults to enable workers, to avoid this situation.

@edsiper
Member

edsiper commented Mar 4, 2022

@annettejanewilson

example:

    [OUTPUT]
        Name           newrelic
        Match          *
        licenseKey     ${LICENSE_KEY}
        endpoint       ${ENDPOINT}
        lowDataMode    true
        workers         4

@Tafsiralam

I am also getting the same issue. My setup is 8-9 months old, and everything was working fine until yesterday evening, when fluent-bit suddenly stopped processing logs for a few pods in a single namespace. Nothing unusual was found in the fluent-bit logs; the fluent-bit pods are still running but have stopped sending logs to the output.
When I restart the fluent-bit service it starts sending logs to the output again, but after 10-15 minutes it stops once more.

The fluent-bit version I am currently using is v1.8.
I also tried upgrading to v1.8.13, the latest stable version, but I am still seeing the same issue.

Any fix for this? A quick response will be appreciated.
Thanks

@Tafsiralam

The issue is fixed by upgrading the fluent-bit version from v1.8 to v1.8.1. This also worked for v1.7.9.

@vielmetti

@edsiper Would some kind of dedicated scale/load-testing infrastructure help find this problem faster? With input from @hh and @jeefy we could easily imagine spinning up some high-core-count / fast-network machines on the CNCF CIL (Equinix Metal) infra and getting some reproducible saturation testing (e.g. an 80-core Ampere Altra posting at 10GB to a 4-core x86 system should be a test of mettle). It would also be possible to set up either very low-latency or very high-latency data paths, depending on your needs.

Looking at cncf/cluster#114 and wondering if there's time during Kubecon next week to map out some strategy.

@hh

hh commented May 18, 2022 via email

@agup006
Member

agup006 commented May 20, 2022

Hey team, sorry for missing the note about syncing up at KubeCon - @hh @vielmetti, would you be able to join the Fluent Community Meeting next week so we can discuss this? Also adding @patrick-stephens @niedbalski

@patrick-stephens
Contributor

Yeah, I'm still at KubeCon at the Calyptia booth if you want a quick chat as well.

Sounds like a good idea to me; it's something I've been wanting to add.

@hh

hh commented Oct 11, 2022 via email

@yackushevas

yackushevas commented Dec 10, 2022

CAP_SYS_RESOURCE did not help, and I tried increasing the number of kafka workers (up to 2048; I didn't go further because it began to consume a lot of memory) - that didn't help either. Version 2.0.6.

Stats for 16 workers:

 Performance counter stats for process id '2320621':

         71,949.12 msec task-clock                #    1.165 CPUs utilized
            16,389      context-switches          #    0.228 K/sec
             4,143      cpu-migrations            #    0.058 K/sec
           821,492      page-faults               #    0.011 M/sec
   235,908,381,214      cycles                    #    3.279 GHz                      (49.97%)
     6,466,760,759      stalled-cycles-frontend   #    2.74% frontend cycles idle     (49.99%)
    45,908,914,952      stalled-cycles-backend    #   19.46% backend cycles idle      (50.00%)
   598,812,184,507      instructions              #    2.54  insn per cycle
                                                  #    0.08  stalled cycles per insn  (50.03%)
   129,253,350,302      branches                  # 1796.455 M/sec                    (50.01%)
       489,237,729      branch-misses             #    0.38% of all branches          (50.00%)

      61.742176196 seconds time elapsed

There is a similar problem with the out_forward plugin, but with the out_null plugin everything works fine...

@rishabhToshniwal

Hello,

I am facing the issue below:

og.20240619-164548
[2024/06/19 16:45:56] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=101819671 watch_fd=54
[2024/06/19 16:46:08] [error] [plugins/in_tail/tail_file.c:1432 errno=2] No such file or directory
[2024/06/19 16:46:08] [error] [plugins/in_tail/tail_fs_inotify.c:147 errno=2] No such file or directory
[2024/06/19 16:46:08] [error] [input:tail:tail.0] inode=101819674 cannot register file /var/log/containers/plt-cbc-cluster-0000_aws-bs-clusters_logging-37b552124109c95e6926378d795abbd4bd00ebee5f89f99c903ee5ea9166f3ff.log
[2024/06/19 16:46:20] [error] [plugins/in_tail/tail_file.c:1432 errno=2] No such file or directory
[2024/06/19 16:46:20] [error] [plugins/in_tail/tail_fs_inotify.c:147 errno=2] No such file or directory
[2024/06/19 16:46:20] [error] [input:tail:tail.0] inode=101819671 cannot register file /var/log/containers/plt-cbc-cluster-0000_aws-bs-clusters_logging-37b552124109c95e6926378d795abbd4bd00ebee5f89f99c903ee5ea9166f3ff.log

I am using aws-for-fluent-bit:2.32.2.20240425 and passing records to aws firehose data stream

Here is a snippet from my fluent-bit config:

    [FILTER]
        Name                  rewrite_tag
        Match                 kube.*
        Rule                  $kubernetes['labels']['couchbase_cluster'] ^.*$ couchbase false
        Emitter_Name          re_emitted_couchbase
        Emitter_Mem_Buf_Limit 50M

    [OUTPUT]
        Name            kinesis_firehose
        Match           couchbase
        region          eu-west-1
        endpoint        https://xyz.firehose.eu-west-1.vpce.amazonaws.com
        delivery_stream my-firehose-log-stream
