
Multiline filter crashes after buffer limit is reached causing high cpu and memory usage #4940

Closed
staniondaniel opened this issue Feb 28, 2022 · 50 comments · Fixed by fluent/fluent-operator#1069

Comments

@staniondaniel

Bug Report

Describe the bug

Hello

The multiline filter crashes on pods that generate a large volume of logs, once Emitter_Mem_Buf_Limit is reached. On pods with a normal/low volume of logs it works without problems.

To Reproduce

[2022/02/25 15:40:06] [ info] [input:tail:tail.0] inode=4540032 handle rotation(): /var/log/containers/xxxxx.log => /var/lib/docker/containers/xxxxx/xxxxx-json.log.5
[2022/02/25 15:40:06] [ info] [input:tail:tail.0] inotify_fs_add(): inode=4540032 watch_fd=113 name=/var/lib/docker/containers/xxxxx/xxxxx-json.log.5
[2022/02/25 15:40:07] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4592513 watch_fd=111
[2022/02/25 15:40:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=3353782 watch_fd=112
[2022/02/25 15:40:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4540032 watch_fd=113
[2022/02/25 15:40:27] [ info] [input:tail:tail.0] inode=3353789 handle rotation(): /var/log/containers/xxxxx.log => /var/lib/docker/containers/xxxxx/xxxxx-json.log.7
[2022/02/25 15:40:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=3353789 watch_fd=114 name=/var/lib/docker/containers/xxxxx/xxxxx-json.log.7
[2022/02/25 15:40:27] [ warn] [input] emitter.3 paused (mem buf overlimit)
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log

  • Steps to reproduce the problem:

This is my configuration (I left only the relevant parts):

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        DB /var/log/containers/fluentbit_db.db
        Parser docker
        Tag kube.*
        Mem_Buf_Limit 10MB
        Buffer_Chunk_Size 256k
        Buffer_Max_Size 256k
        Skip_Long_Lines On
        Refresh_Interval 1
        multiline.parser docker,cri
......


    [FILTER]
        name multiline
        Emitter_Mem_Buf_Limit 2.4GB
        match kube.*
        multiline.key_content log
        multiline.parser java,go,python
    [FILTER]
        Name kubernetes
        Buffer_Size 512k
        Match kube.*
        Merge_Log On
        Merge_Log_Key log_json
        Merge_Log_Trim On
        Keep_Log On
        K8S-Logging.Parser On
        K8S-Logging.Exclude On

....

Your Environment

  • Version used: Fluent Bit v1.8.12
  • Environment name and version (e.g. Kubernetes? What version?): v1.20.4-eks
  • CPU and memory load after enabling the multiline filter: I tried increasing the memory limit and the emitter buffer limit to a few GB and the processes were still crashing.


Additional context

The Fluent Bit container keeps crashing once it reaches the memory limit configured for that container. In addition, a lot of logs like

[error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag:

are flooding the Fluent Bit logs.

@erenming

erenming commented Mar 3, 2022

Same issue. I generate test logs at about 10 MB/s, and fluent-bit logs a bunch of these errors.

Is there any way to work around this? @edsiper

@erenming

erenming commented Mar 4, 2022

Hi, I think this issue may have been introduced by PR #4383.
When I downgrade fluent-bit to 1.8.11, it works pretty well.

@madmarcinadamski

We are experiencing exactly the same issue; the main problem is that we actually need the functionality of the previously mentioned PR for the multiline logs.
We use amazon/aws-for-fluent-bit and run it next to our own containers using the built-in FireLens integration.
[2022/03/07 10:10:52] [error] [input:emitter:emitter_for_multiline.2] error registering chunk with tag: xxx-firelens-GUID
The container has 50MB of allocated memory, which used to be more than enough; now it uses up the memory in 3 days and then just returns errors.

@PettitWesley could you have a look to see whether there are any issues in your implementation, or whether there are any workarounds?

@PettitWesley PettitWesley self-assigned this Mar 15, 2022
@PettitWesley
Contributor

I have added this to my personal backlog, which unfortunately is very long right now.

My recommendation is to take a look at past issues involving the rewrite_tag filter. With the new code, the multiline filter with buffering is very similar to rewrite_tag, so this same issue should have arisen with that filter as well.

@PettitWesley
Contributor

So when I send multiline logs at a very high throughput, all I get is this:

[2022/04/04 03:38:14] [ warn] [input] emitter.1 paused (mem buf overlimit)

Which can be fixed by increasing the mem_buf_limit for the emitter:

[FILTER]
     name                  multiline
     match                 *
     multiline.key_content log
     multiline.parser	   go, multiline-regex-test

     emitter_mem_buf_limit 1G

@PettitWesley
Contributor

Please see: #5235

If you have more than one multiline filter definition and they match the same records, it can cause all sorts of trouble.

@PettitWesley
Contributor

So I want to direct everyone to this issue again: #5235

It is not clear to me that there is a widespread use case that truly requires multiple filter definitions for the same logs. Please post in that issue if you do not understand or you think you have a use case that requires multiple filter definitions that match the same logs.

@staniondaniel
Author

staniondaniel commented Apr 7, 2022

#5235

It is not clear to me what you mean by multiple filter definitions. Do you mean more than one of these here:
multiline.parser java,go,python

or multiple configs like the above in different sections:

    [INPUT]
        multiline.parser docker,cri

    [FILTER]
        multiline.parser java,go,python

I am using the above config because if I try to add the line to the INPUT section like this:

    [INPUT]
        multiline.parser java,go,python

it simply doesn't work and the logs are not merged.

So this is why I am doing it in two stages:

  • first stage in [INPUT] to merge Docker logs (because I am using EKS)
  • second stage in the normal [FILTER] section so it merges logs that can come from various applications written in Go, Java, or Python

@PettitWesley
Contributor

@staniondaniel Multiple filters means what I show in the issue, like this:

[FILTER]
     name                  multiline
     match                 *
     multiline.key_content log
     multiline.parser	   go

[FILTER]
     name                  multiline
     match                 *
     multiline.key_content log
     multiline.parser      multiline-regex-test

Instead, the CORRECT way is what you show: multiple parsers in a comma-separated list.

It simply doesn't work and the logs are not merged

That's interesting. In this case what you are doing is fine; you only have one filter definition. It's fine to use both the filter and the tail input multiline.parser keys.
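
For reference, a minimal sketch of that layout (the built-in docker/cri handling on the tail input plus a single multiline filter for language stack traces), assembled from the configs already posted in this thread; the values are illustrative, not prescriptive:

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Tag               kube.*
        multiline.parser  docker,cri

    [FILTER]
        name                  multiline
        match                 kube.*
        multiline.key_content log
        multiline.parser      java,go,python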

@erenming

I found there is a new buffer config parameter; setting it to off seems to avoid this issue.
Docs: https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace#configuration-parameters

[FILTER]
    Name                   multiline
    match                  kube.*
    multiline.key_content  log
    multiline.parser       java
    buffer                 off

@PettitWesley
Contributor

PettitWesley commented May 24, 2022

@erenming With buffer off, the filter will only work with a few inputs, mainly tail, as noted in the docs.

@erenming

@PettitWesley Thanks a lot for the reminder. Actually, we only use tail, so I think buffer off will be okay.
But over the last few days I may have found an issue with buffer off: multiline seems not to be working :(
This is my sample config:

[SERVICE]
    flush        0.5
    daemon       Off
    grace        5
    log_level    info
    parsers_file parsers.conf
    http_server  On
    http_listen  0.0.0.0
    http_port    2020

[INPUT]
    name                 tail
    path                 test.log
    DB                   flb_k8s.db
    Tag                  kube.*
    read_from_head       true
    #ignore_older         5m
    Buffer_Chunk_Size    1MB
    Buffer_Max_Size      10MB
    Skip_Long_Lines      On
    Skip_Empty_Lines     On
    Refresh_Interval     30
    Rotate_Wait          60
    DB.locking           true
    DB.journal_mode      WAL

    Mem_Buf_Limit        300MB
    multiline.parser     docker, cri

[FILTER]
    Name                   multiline
    match                  *
    multiline.key_content  log
    multiline.parser       java
    buffer                 Off

[OUTPUT]
    name stdout
    match *
    format json

and sample log:

{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice        : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:25:28.826167874Z"}
{"log":"io.terminus.trantorframework.exception.BusinessException: Schedule.request.param.is.illegal\n","stream":"stdout","time":"2021-08-16T08:24:28.826172954Z"}
{"log":"\u0009at io.terminus.gaaia.schedule.util.ValidateUtil.expectedTrue(ValidateUtil.java:39)\n","stream":"stdout","time":"2021-08-16T08:24:28.82617719Z"}
{"log":"\u0009at io.terminus.gaaia.schedule.func.schedule.validate.ValidateScheduleParamForCalculationFuncImpl.execute(ValidateScheduleParamForCalculationFuncImpl.java:47)\n","stream":"stdout","time":"2021-08-16T08:24:28.826182475Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice        : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice        : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}

Is there any mistake in my configuration?

@staniondaniel
Author

Is there any mistake in my configuration?
I believe he meant that you need to use the parser config directly in the INPUT section and not in the [FILTER] section, something like:

[INPUT]
    name                 tail
......
    Mem_Buf_Limit        300MB
    multiline.parser     docker, cri, java

@PettitWesley As an update to the original problem: after updating to the latest Fluent Bit version, I noticed that using only "multiline.parser docker,cri" in the INPUT section, without any [FILTER] config, like below:

    [INPUT]
        Name tail
......
        multiline.parser docker,cri

the pods don't crash anymore due to high memory usage.

As soon as I add the [FILTER] config as below:

   [FILTER]
        name multiline
        Emitter_Mem_Buf_Limit 2.4GB
        match kube.*
        multiline.key_content log
        multiline.parser java,go,python

the problem reoccurs and memory slowly rises until it hits the buffer limit, which causes the logs to be flooded with:
[input:emitter:emitter_for_multiline.0] error registering chunk with tag:
or the pod gets killed by Kubernetes when its memory usage hits the limits set for it.

From what I noticed, this is simply an issue of the multiline filter not being able to process the data fast enough, or not releasing the buffer once the data is processed, which causes high CPU usage and eventually leads to a crash.

@rajeev-netomi

rajeev-netomi commented Jun 22, 2022

@PettitWesley We have been facing the same issue with aws-for-fluent-bit:2.25.0. Our custom Fluent Bit configuration is below.

[SERVICE]
    flush 1
    storage.path /var/log/app-storage/
    storage.sync normal

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock
    Mem_Buf_Limit 50MB
    storage.type filesystem

[FILTER]
    name multiline
    match *
    multiline.key_content log
    multiline.parser docker,cri,java
    Emitter_Mem_Buf_Limit 50M
    emitter_storage.type filesystem

Firelens configuration

logConfiguration = {
  logDriver = "awsfirelens"
  options = {
    "Name": "datadog",
    "apikey": "",
    "Host": "http-intake.logs.datadoghq.com",
    "net.keepalive" : "false",
    "storage.total_limit_size" : "5G",
    "dd_service": "",
    "dd_source": "ecs-fargate-dev",
    "dd_message_key": "",
    "dd_tags": "env:dev",
    "TLS": "on",
    "provider": "ecs"
  }
}

The logs do not show any error; it just fails with a segmentation fault.

---------------------------------------------------------------------------------------------------------------------------------------------------------

timestamp message
1656096324683 [2022/06/24 18:45:24] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096325650 [2022/06/24 18:45:25] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096326704 [2022/06/24 18:45:26] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096329689 [2022/06/24 18:45:29] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096331692 [2022/06/24 18:45:31] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096332695 [2022/06/24 18:45:32] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096333726 [2022/06/24 18:45:33] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096334685 [2022/06/24 18:45:34] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096336749 [2022/06/24 18:45:36] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096338648 [2022/06/24 18:45:38] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096344687 [2022/06/24 18:45:44] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096346709 [2022/06/24 18:45:46] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096347677 [2022/06/24 18:45:47] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096348703 [2022/06/24 18:45:48] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096349674 [2022/06/24 18:45:49] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096350678 [2022/06/24 18:45:50] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096351692 [2022/06/24 18:45:51] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096352721 [2022/06/24 18:45:52] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096353671 [2022/06/24 18:45:53] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096354685 [2022/06/24 18:45:54] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096355694 [2022/06/24 18:45:55] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096361716 [2022/06/24 18:46:01] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096363714 [2022/06/24 18:46:03] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096364712 [2022/06/24 18:46:04] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096365644 [2022/06/24 18:46:05] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096368704 [2022/06/24 18:46:08] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096369701 [2022/06/24 18:46:09] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096372686 [2022/06/24 18:46:12] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096375691 [2022/06/24 18:46:15] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096376685 [2022/06/24 18:46:16] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096378678 [2022/06/24 18:46:18] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096380698 [2022/06/24 18:46:20] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096383700 [2022/06/24 18:46:23] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096384679 [2022/06/24 18:46:24] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096386694 [2022/06/24 18:46:26] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096388690 [2022/06/24 18:46:28] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096390698 [2022/06/24 18:46:30] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096391696 [2022/06/24 18:46:31] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096392705 [2022/06/24 18:46:32] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096393696 [2022/06/24 18:46:33] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096395689 [2022/06/24 18:46:35] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096396682 [2022/06/24 18:46:36] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096399692 [2022/06/24 18:46:39] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096401703 [2022/06/24 18:46:41] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096403693 [2022/06/24 18:46:43] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096404708 [2022/06/24 18:46:44] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096405694 [2022/06/24 18:46:45] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096406682 [2022/06/24 18:46:46] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096410688 [2022/06/24 18:46:50] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096412651 [2022/06/24 18:46:52] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096413683 [2022/06/24 18:46:53] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096414671 [2022/06/24 18:46:54] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096416720 [2022/06/24 18:46:56] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096418662 [2022/06/24 18:46:58] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096419663 [2022/06/24 18:46:59] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096420682 [2022/06/24 18:47:00] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096421696 [2022/06/24 18:47:01] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096422655 [2022/06/24 18:47:02] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096425696 [2022/06/24 18:47:05] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096428686 [2022/06/24 18:47:08] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096430710 [2022/06/24 18:47:10] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096431661 [2022/06/24 18:47:11] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096434709 [2022/06/24 18:47:14] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096435655 [2022/06/24 18:47:15] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096437670 [2022/06/24 18:47:17] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096438655 [2022/06/24 18:47:18] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096439675 [2022/06/24 18:47:19] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096440691 [2022/06/24 18:47:20] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096441701 [2022/06/24 18:47:21] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096443747 [2022/06/24 18:47:23] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096445704 [2022/06/24 18:47:25] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096449695 [2022/06/24 18:47:29] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096450679 [2022/06/24 18:47:30] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096451681 [2022/06/24 18:47:31] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096452669 [2022/06/24 18:47:32] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096453645 [2022/06/24 18:47:33] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096454693 [2022/06/24 18:47:34] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096455683 [2022/06/24 18:47:35] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096456675 [2022/06/24 18:47:36] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096457698 [2022/06/24 18:47:37] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096459704 [2022/06/24 18:47:39] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096460664 [2022/06/24 18:47:40] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096462681 [2022/06/24 18:47:42] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096463700 [2022/06/24 18:47:43] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096464691 [2022/06/24 18:47:44] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096465694 [2022/06/24 18:47:45] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096466652 [2022/06/24 18:47:46] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096467653 [2022/06/24 18:47:47] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096468696 [2022/06/24 18:47:48] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096469747 [2022/06/24 18:47:49] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096472717 [2022/06/24 18:47:52] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096473665 [2022/06/24 18:47:53] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096474674 [2022/06/24 18:47:54] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096475721 [2022/06/24 18:47:55] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096476664 [2022/06/24 18:47:56] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096477653 [2022/06/24 18:47:57] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096480673 [2022/06/24 18:48:00] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096481704 [2022/06/24 18:48:01] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096482634 [2022/06/24 18:48:02] [engine] caught signal (SIGSEGV)
1656096483998 AWS for Fluent Bit Container Image Version 2.25.0
---------------------------------------------------------------------------------------------------------------------------------------------------------

@PettitWesley
Contributor

Sorry folks, I may not be able to take a look at this immediately.

If someone is willing to try these techniques to get a stack trace for me that would help speed up the fix: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#segfaults-and-crashes-sigsegv

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Sep 26, 2022
@rajeev-netomi

Not stale.

@github-actions github-actions bot removed the Stale label Sep 27, 2022
@PRIHLOP

PRIHLOP commented Oct 20, 2022

unstale

@jujubetsz

jujubetsz commented Oct 25, 2022

Having the same issue here. The container uses its entire CPU request, causing high CPU load.

    service: |
      [SERVICE]
          Daemon Off
          Flush 1
          Log_Level info
          Parsers_File parsers.conf
          Parsers_File custom_parsers.conf
          HTTP_Server On
          HTTP_Listen 0.0.0.0
          HTTP_Port 2020
          Health_Check On

    inputs: |
      [INPUT]
          Name tail	
          Tag kube.*	
          Path /var/log/containers/*.log	
          DB /var/log/flb_kube.db	
          Mem_Buf_Limit 128MB	
          Skip_Long_Lines On	
          Refresh_Interval 10	
          Buffer_Chunk_Size 256k
          Buffer_Max_Size 100MB
          multiline.parser docker, cri

    filters: |
      [FILTER]
          Name kubernetes	
          Match kube.*	
          Kube_URL https://kubernetes.default.svc:443	
          Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt	
          Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token	
          Kube_Tag_Prefix kube.var.log.containers.	
          Merge_Log Off	
          Merge_Log_Key log_processado	
          Merge_Log_Trim On	
          Keep_Log Off	
          K8S-Logging.Parser Off	
          K8S-Logging.Exclude Off	
          Labels On	
          Annotations Off
      [FILTER]
          Name                                 multiline
          Match                                kube.*
          multiline.key_content                log
          multiline.parser                     go, java, python
          buffer                               On
          emitter_mem_buf_limit                64M

fluentbit 1.9.9

@idanovo

idanovo commented Dec 15, 2022

I'm facing the same issue when working with the multiline parser together with the rewrite_tag filter.
I'm using version 0.21.6 of the Fluent Bit helm chart, which uses Fluent Bit 2.0.6.

I'm using the tail input:

    [INPUT]
      Name              tail
      Tag               kube.*
      Path              /var/log/containers/my-app*.log
      multiline.parser  docker, cri
      Mem_Buf_Limit     10MB
      Refresh_Interval  10

And using the rewrite_tag filter like this:

   [FILTER]
      Name    rewrite_tag
      Match   kube.*
      Rule    $log .*\"log_audit\":true.*$ audit-logs.$kubernetes_namespace_name false

I can't really use the exclude filter to drop these logs, since I need them; I just want them to go to a different output than the other logs.
Any idea how I can solve this? Is there any workaround for this?

This is my complete fluentbit configuration:

config:
  service: |
    [SERVICE]
      Daemon       Off
      Flush        {{ .Values.flush }}
      Log_Level    {{ .Values.logLevel }}
      Parsers_File parsers.conf
      Parsers_File custom_parsers.conf
      HTTP_Server  On
      HTTP_Listen 0.0.0.0
      HTTP_Port   {{ .Values.metricsPort }}
      Health_Check On

  inputs: |
    [INPUT]
      Name              tail
      Tag               kube.*
      Path              /var/log/containers/my-app-server*.log
      multiline.parser  docker, cri
      Mem_Buf_Limit     10MB
      Refresh_Interval  10

    [INPUT]
      Name              tail
      Tag               kube.*
      Path              /var/log/containers/my-other-app*.log
      multiline.parser  docker, cri
      Mem_Buf_Limit     10MB
      Refresh_Interval  10

  filters: |
    [FILTER]
      Name                kubernetes
      Match               kube.*
      Kube_URL            https://kubernetes.default.svc:443
      Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
      Merge_Log           On
      Merge_Log_Key       data
      K8S-Logging.Parser  On
      K8S-Logging.Exclude On
    
    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes
      Add_prefix    kubernetes_

    [FILTER]
        Name          nest
        Match         kube.*
        Operation     lift
        Nested_under  kubernetes_labels
        Add_prefix    kubernetes_labels_
    
    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes_annotations
      Add_prefix    kubernetes_annotations_
    
    [FILTER]
      Name    rewrite_tag
      Match   kube.*
      Rule    $log .*\"log_audit\":true.*$ audit-logs.$kubernetes_namespace_name false

  outputs: |
    [OUTPUT]
      Name                s3
      Match               audit-logs.*
      bucket              my-bucket-name
      region              {{ .Values.region }}
      total_file_size     10M
      upload_timeout      1m
      use_put_object      on
      compression         gzip
      auto_retry_requests true
      s3_key_format       /namespace=$TAG[1]/year=%Y/month=%m/day=%d/hour=%H/%H:%M:%S-$UUID.gz
      role_arn            {{ .Values.fluentbit_role_arn }}
    
    [OUTPUT]
      Name                          s3
      Match                         kube.*
      bucket                        lakefs-audit-logs-{{ .Values.region }}-{{ .Values.environment }}
      region                        {{ .Values.region }}
      total_file_size               10M
      upload_timeout                1m
      use_put_object                on
      compression                   gzip
      auto_retry_requests           true
      s3_key_format_tag_delimiters ._
      s3_key_format                 /namespace=$TAG[5]/year=%Y/month=%m/day=%d/hour=%H/%H:%M:%S-$UUID.gz
      role_arn                      {{ .Values.fluentbit_role_arn }}
    
    [OUTPUT]
      Name                   loki
      Match                  kube.*
      Host                   {{ .Values.loki.endpoint }}
      port                   443
      tls                    on
      tls.verify             on
      http_user              {{ .Values.loki.username }}
      http_passwd            {{ .Values.loki.password }}
      auto_kubernetes_labels true
      Labels                 collector=fluent-bit, namespace=$kubernetes_namespace_name, pod=$kubernetes_pod_name, container=$kubernetes_container_name, Region={{ .Values.region }}

@jujubetsz

jujubetsz commented Jan 31, 2023

I know it's ugly, but since I really need this working, I set up a cronjob to rollout-restart the fluent-bit pods every day at 03:00 am.

@rajeev-netomi

@PettitWesley Any idea if a similar solution is possible for AWS ECS Fargate, as suggested by @jujubetsz?

@leonardo-albertovich
Collaborator

leonardo-albertovich commented Mar 4, 2023

I'm currently making some unrelated improvements that involve modifying the multiline core machinery, so if anyone is able to put together a somewhat reliable reproduction (it would be awesome if it were minimized, but understandable if that's not possible) with sample input data, I could look into this.

If you need help putting together the reproduction case, just tag me here or talk to me in Slack.

@PettitWesley
Contributor

@rajeev-netomi I don't recommend restarting FLB at a certain time if it's crashing... I'm not sure of a good solution here, but I think you want to restart after it crashes, or see if you can find an older version which doesn't crash. Sorry!

@github-actions github-actions bot removed the Stale label Aug 3, 2023
@terencebor

terencebor commented Aug 16, 2023

I have just opened a related issue. I found that somehow we have a regression after upgrading to 2.1.2 from 1.9.7: #7782

I am wondering if others have noticed a similar issue, and if downgrading to 1.9.x fixes it for you.

Same issue: upgraded from 2.0.9 to 2.1.2 and immediately saw high CPU, pod restarts, and "error registering chunk with tag" messages. The issue is gone once I downgrade back to 2.0.9. I have just one multiline [FILTER] with a custom [MULTILINE_PARSER] in the config.
"buffer off" eliminates the errors on version 2.1.2, but the multiline filter stops working with "buffer off" configured in [FILTER], even though it is used with the tail input.

@ryan65

ryan65 commented Aug 17, 2023

I have just opened a related issue. I found that somehow we have a regression after upgrading to 2.1.2 from 1.9.7: #7782
I am wondering if others have noticed a similar issue, and if downgrading to 1.9.x fixes it for you.

Same issue: upgraded from 2.0.9 to 2.1.2 and immediately saw high CPU, pod restarts, and "error registering chunk with tag" messages. The issue is gone once I downgrade back to 2.0.9. I have just one multiline [FILTER] with a custom [MULTILINE_PARSER] in the config. "buffer off" eliminates the errors on version 2.1.2, but the multiline filter stops working with "buffer off" configured in [FILTER], even though it is used with the tail input.

Hi Terence,
Thanks for the update. Indeed you are right; I just checked 2.0.9 with my application and CPU is OK.
So it seems the regression started after 2.0.9. I'm really surprised no one is looking at this yet. I would say this is a critical bug.
#7782

@terencebor

Tried Fluent Bit v2.1.9 - I don't see the "error registering chunk" errors anymore, but I still have the high CPU problem.

@ryan65

ryan65 commented Sep 19, 2023

Is there any update on the status of this issue? I would say this is pretty critical.

@adl3n

adl3n commented Oct 1, 2023

Hello everyone. I still get the error on version 2.1.9. I would appreciate assistance in resolving this issue. I've tried various configurations, and the problem appeared after upgrading from 1.8.6 to 2.1.x.

@ryan65

ryan65 commented Oct 2, 2023

Hello everyone. I still get the error on version 2.1.9. I would appreciate assistance in resolving this issue. I've tried various configurations, and the problem appeared after upgrading from 1.8.6 to 2.1.x.

I must say it is strange that no one seems to be looking into this.
Anyway, I suggest you either return to your original version for now, or upgrade to 2.0.6; I think that is the last version that doesn't have this issue.

@lucianosarra

I installed version 2.1.10 but I still have the problem of high CPU consumption. I downgraded to 1.9.10

@ryan65

ryan65 commented Oct 26, 2023

I installed version 2.1.10 but I still have the problem of high CPU consumption. I downgraded to 1.9.10

Hi Luciano. From what I tested, the problem started after 2.0.9, so if you like you can try that version.
I wish someone would look into this already.

@lucianosarra

lucianosarra commented Oct 27, 2023

I installed version 2.1.10 but I still have the problem of high CPU consumption. I downgraded to 1.9.10

Hi Luciano. From what I tested, the problem started after 2.0.9, so if you like you can try that version. I wish someone would look into this already.

Hi Ryan, I confirm that version 2.0.9 doesn't have the CPU issue.

@damien-jacinto

Same here: 2.0.9 works. I tried 2.1.x without success; haven't tried 2.2.x yet.

@alexku7

alexku7 commented Jan 15, 2024

2.2.2 is still not working

@ryan65

ryan65 commented Jan 16, 2024

2.2.2 is still not working

I'm absolutely shocked that no one is handling this issue. Quite worrying; we simply can't upgrade.

@saurabhgcect

Any workaround or update on the above issue?

@braydonk
Contributor

braydonk commented Feb 22, 2024

I took a look at this issue today.

The root problem: the internal in_emitter plugin that the multiline filter uses to emit the joined logs does not respect the FLB_INPUT_PAUSED state. This state is set and checked automatically in flb_input_chunk.c, and those routines will defensively refuse to append to chunks if the plugin is paused. This is not a problem for most input plugins, as they simply pause and resume their collectors when the overall input instance is paused. That is not the case for the emitter, where any plugin can append records to it at any time. So the multiline plugin indiscriminately slams the emitter with new logs, causing the emitter plugin to fail to append most of them and drop the logs. Worth noting that this is different from the documented behaviour of the emitter_mem_buf_limit field.
(screenshot: the emitter_mem_buf_limit description from the link above)
I actually have no idea how to implement the documented behaviour; I couldn't find a way for a filter to wait for an input plugin to resume without blocking the entire program (essentially deadlocking it, since the emitter wouldn't be able to process the logs it has in order to make room, so it could never resume).

This problem exists for both multiline and rewrite_tag, as their usage of emitter (just adding records directly to it) is the same.

I opened a PR (shown above) that cuts out the middleman and checks whether the emitter is paused before even trying to write to it. This will at least stop the spam of error logs. Technically this issue is about high CPU, and that PR does not solve it. The multiline processor is very CPU intensive in general, so when it's run as a filter on the main thread it's no wonder it takes a whole CPU. I'm not sure what the solution to that is.


As a user, with the design of this plugin it is basically necessary to take steps to make it resilient to higher throughputs. This is relatively the same as what you'd need to do for any input plugin; it's just a bit odd because you wouldn't think of needing those kinds of strategies for a filter plugin. However, to use this plugin effectively, it is essentially necessary to understand the under-the-hood detail that the emitter is also an input plugin, which can suffer the same problems as any other input plugin.

Here are the solutions I would recommend, which are not dissimilar to any input plugin:

  • As suggested above, increasing the memory buffer size for the emitter is a good first step. With a larger memory buffer, it will be less likely to get paused and thus will theoretically drop fewer logs.
  • You could switch the emitter to filesystem buffering mode instead of memory. Filesystem buffering is designed for the scenario where you want to make sure you keep extra logs while the input plugin is busy. In this case, the plugin is even less likely to pause; it pauses based on the storage.max_chunks_up amount. These numbers all depend on what you set for the global storage settings in [SERVICE] (side note: it would be great to be able to set these for the emitter); see the sketch after this list. Read more at https://docs.fluentbit.io/manual/administration/buffering-and-storage
  • If you are on a later version of Fluent Bit, you can run the input plugin threaded and run the multiline filter as a processor. This would mean the multiline filter doesn't take up all the air in the room by clogging up the main engine thread. This is only possible if you use YAML configs, I think, and it might not be possible based on how you've architected your pipeline. Read more at https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/yaml/configuration-file#config_processors
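
For concreteness, a minimal sketch of the first two suggestions combined (a larger emitter buffer plus filesystem buffering for the emitter), assembled from options already shown earlier in this thread; the paths and sizes are illustrative only:

    [SERVICE]
        storage.path          /var/log/flb-storage/
        storage.sync          normal
        storage.max_chunks_up 128

    [FILTER]
        name                  multiline
        match                 kube.*
        multiline.key_content log
        multiline.parser      java,go,python
        buffer                on
        emitter_mem_buf_limit 100M
        emitter_storage.type  filesystem

With emitter_storage.type set to filesystem, pausing is governed by how many chunks can stay up in memory (storage.max_chunks_up) rather than by the memory limit alone, as described in the buffering docs linked above.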

@saurabhgcect

@braydonk: I switched the emitter to filesystem buffering but ran into another issue where Fluent Bit on Kubernetes does not recover after a Fluentd restart; chunks were stuck in storage.
My setup is fluentbit (2.0.6) -> fluentd (1.16.3) -> ES/Kibana. After restarting Fluent Bit, all the chunks were processed from storage.
I enabled debug mode and it looks like flushing is very slow. I tried the scenario below:

  1. Ran a log generator pod
  2. Configured Fluent Bit to capture the logs and forward them to Fluentd
  3. Stopped the Fluentd service
  4. Checked the logs with debug mode enabled -- Fluent Bit kept trying to connect to Fluentd and, in parallel, chunks kept being written to the filesystem
  5. When the count of chunks reached 3k+ (around 2.5 GB), I started the Fluentd service
  6. Chunks flushed to Fluentd, but it took around 5-6 hrs

I repeated the same steps with one addition, restarting the Fluent Bit service, and saw that the chunks were processed in less than 10 minutes.

@braydonk
Contributor

@saurabhgcect This sounds unrelated to this issue. I'd recommend opening a separate issue with config and logs included.

@braydonk
Contributor

braydonk commented Feb 22, 2024

#8473 is the solution that should be used for this. I've closed my PR in favour of it.

Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label May 23, 2024
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 28, 2024
@braydonk
Contributor

I believe this should be fixed as of v3.0.3, which includes #8473. I'll re-open and close this properly since the issue should be fixed, but if it is not, feel free to re-open with details.

@braydonk braydonk reopened this May 28, 2024