in_tail: plugin does not pick up rotated logs under heavy load #1108

Closed
holycheater opened this issue Feb 13, 2019 · 16 comments

@holycheater

Bug Report

Describe the bug
tail_fs_event receives IN_Q_OVERFLOW inotify events from time to time, thus missing IN_MOVE_SELF events.

To Reproduce
Tail a large number of files matched by a pattern while they are being written to heavily. The setup I have reads around 30 nginx access log files by pattern.

Example config:

[SERVICE]
    Flush           1
    Daemon          off
    Log_Level       error
    Parsers_File vertis_parsers.conf
    HTTP_Server  On
    HTTP_Listen  127.0.0.1
    HTTP_Port    2020
[INPUT]
    Name tail
    Tag tail_nginx
    Path /var/log/nginx/*/access.log
    DB /var/lib/fluent-bit/nginx.sqlite
    Parser nginx_vertis
    Buffer_Chunk_Size 256kb
    Buffer_Max_Size 256kb
    Mem_Buf_Limit 64mb
[FILTER]
    Name record_modifier
    Match *
    Record _level INFO
    Record _service nginx
    Record _container_name ${HOSTNAME}
[OUTPUT]
    Name http
    Match *
    Port 10223
    URI /fluent-bit
    Format json
    Retry_Limit False

Example fluent-bit log with traces (I had to add some extra debug output to track down the problem):

[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log read=9428 lines=8
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d2/access.log event
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log read=1449 lines=2
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log read=2277 lines=8
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log read=1037 lines=1
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log read=761 lines=3
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log read=9603 lines=12
[2019/02/10 08:55:30] [trace] inotify_mask: 4000
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log read=894 lines=3
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d2/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d2/access.log read=438 lines=1
[2019/02/10 08:55:30] [trace] inotify_mask: 2
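
For readers decoding the trace: assuming the mask is printed in hex by the extra debug output, 2 is IN_MODIFY (content appended), 4000 is IN_Q_OVERFLOW (0x4000, the kernel dropped queued events), and the rotation event IN_MOVE_SELF (0x800) never shows up. A minimal standalone sketch of how these masks appear when reading an inotify descriptor (not fluent-bit code; the function name drain_inotify is illustrative):

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

/* Read one batch of inotify events and report what they mean. */
static int drain_inotify(int fd)
{
    /* Buffer aligned for struct inotify_event, as inotify(7) recommends */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    ssize_t len = read(fd, buf, sizeof(buf));
    if (len <= 0) {
        return -1;
    }

    for (char *p = buf; p < buf + len; ) {
        struct inotify_event *ev = (struct inotify_event *) p;

        if (ev->mask & IN_MODIFY)      /* 0x2: the "inotify_mask: 2" lines  */
            printf("file modified, read the new bytes\n");
        if (ev->mask & IN_MOVE_SELF)   /* 0x800: watched file was renamed   */
            printf("file rotated, reopen / rescan the pattern\n");
        if (ev->mask & IN_Q_OVERFLOW) {/* 0x4000: the "inotify_mask: 4000"  */
            /* The kernel dropped events because the queue hit
             * fs.inotify.max_queued_events; a rename queued behind the limit
             * is gone and must be recovered by a stat()-based rescan. */
            fprintf(stderr, "inotify queue overflow, events were lost\n");
        }
        p += sizeof(struct inotify_event) + ev->len;
    }
    return 0;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file-to-watch>\n", argv[0]);
        return 1;
    }

    int fd = inotify_init();
    if (fd < 0 || inotify_add_watch(fd, argv[1], IN_MODIFY | IN_MOVE_SELF) < 0) {
        perror("inotify");
        return 1;
    }
    while (drain_inotify(fd) == 0) {
        /* keep draining; a warning is printed whenever the queue overflows */
    }
    return 0;
}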

sysctl params related to inotify:

fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 524288
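
IN_Q_OVERFLOW is generated when a single inotify instance accumulates more than fs.inotify.max_queued_events pending events. Raising that limit is not something the thread settled on, but it is the obvious knob to experiment with; the value below is only an example:

# Raise the per-instance inotify queue limit at runtime (example value);
# this only delays overflow under sustained load, it does not prevent it.
sysctl -w fs.inotify.max_queued_events=65536
# Persist the setting across reboots
echo 'fs.inotify.max_queued_events = 65536' > /etc/sysctl.d/99-inotify.conf
sysctl --system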

Expected behavior
Reload files after log rotation.

Your Environment

  • Version used: 1.0.3
  • Operating System and version: ubuntu 16.04
  • Filters and plugins: in_tail

Additional context
We are trying to collect access logs on several nginx machines. Every morning after log rotation the tail plugin fails to pick up most of the rename events (more often than not, all of them).
Basically, it looks like this happens because files are tailed synchronously with receiving inotify events (reading, filling the buffer, processing content), while other events pile up in the queue.

@bogdanov1609

We have the same problem :\
Any news on this?

@bogdanov1609

ping?

@l2dy (Contributor) commented May 21, 2019

There's a timer that checks for deleted/rotated files every 2.5 seconds; did it work for you?

/* Set a manual timer to check deleted/rotated files every 2.5 seconds */
ret = flb_input_set_collector_time(in, tail_fs_check,
                                   2, 500000000, config);

@holycheater (Author)

@l2dy, no it did not, because there is an inotify version of this code, and that is what runs in fluent-bit:

/* File System events based on Inotify(2). Linux >= 2.6.32 is suggested */
int flb_tail_fs_init(struct flb_input_instance *in,

@l2dy (Contributor) commented May 22, 2019

Indeed. If you use Fluent Bit built with -DFLB_INOTIFY=Off, can you reproduce this issue?
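
For anyone wanting to test that suggestion, a rough build sketch, assuming the usual CMake-based build of Fluent Bit from source (paths are illustrative):

# Build Fluent Bit with the inotify backend disabled so in_tail falls back
# to the interval/stat()-based file checks
cd fluent-bit/build
cmake -DFLB_INOTIFY=Off ..
make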

@singalravi

@holycheater Did you find any solution to this problem?

@holycheater (Author)

We made a hack to restart fluent-bit after rotating the logs. Building fluent-bit without inotify (so it just stat()s the watched files at a regular interval) should work too.
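
A minimal sketch of that restart hack as a logrotate hook, assuming fluent-bit runs under systemd; the path and rotation policy are illustrative only:

/var/log/nginx/*/access.log {
    daily
    rotate 7
    compress
    postrotate
        # keep any existing nginx log-reopen commands here as well
        # workaround only: bounce fluent-bit so it re-scans the Path pattern
        systemctl restart fluent-bit
    endscript
}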

@Helmut-Onna

Having a similar issue in a gcloud k8s cluster, using fluent-bit v1.2.0.
After a couple of days of running, the number of files opened minus files closed by the tail plugin stays over 11k.
The file counts are taken from curl -s http://127.0.0.1:2020/api/v1/metrics | jq.

The node then becomes problematic: other pods/apps are unable to start due to a misleading "no space left on device" error, which actually means the inotify watches limit has been reached.

Increasing the watches limit is just a workaround that delays the issue, since the number of actual files never gets that high (after restarting fluent-bit the reported number drops to ~120 files actually open).
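
For reference, those counters come from the monitoring endpoint quoted above; a quick way to inspect them (the files_opened/files_closed/files_rotated field names are what recent in_tail versions expose, so treat them as an assumption for your build):

# Dump per-input metrics; for a tail input, files_opened - files_closed
# approximates how many file handles (and inotify watches) are still held.
curl -s http://127.0.0.1:2020/api/v1/metrics | jq '.input'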

@krancour (Contributor)

I've been encountering this as well. I send a million lines to stdout very quickly from a single Docker container and reliably get about the first ~650,000 lines. The last line I receive is always the last line written before the log was rotated; I never get even one line from the new file after rotation. Popping into the sqlite database to poke around, the record shows no recognition of the rotation and an offset that exceeds the new file's length.
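
For reference, that offset check can be done directly against the tail position DB (the table and column names below assume the default in_tail_files schema of the 1.x series; the DB path is the one from the reporter's config and will differ per setup):

# A stored offset larger than the current file size, with rotated = 0,
# matches the behaviour described above.
sqlite3 /var/lib/fluent-bit/nginx.sqlite \
  'SELECT name, offset, inode, rotated FROM in_tail_files;'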

@edsiper any insight on this?

@rashmichandrashekar

I am seeing the same issue at high scale. Any workarounds/updates?

@akashmantry commented Dec 14, 2020

We had the same issue running fluent-bit on EKS, with logs coming in at around 50k req/s. After updating fluent-bit to v1.6.8, bumping up the resources, and using a combination of memory and filesystem buffering for messages, we were able to avoid the log loss. We also modified the log rotation on EKS to rotate files only after they reach 10 GB.
Here is the fluent-bit config we have:

  fluent-bit-input.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  1
        Skip_Long_Lines   On
        Buffer_Chunk_Size 10MB
        Buffer_Max_Size 10MB
        Rotate_wait 60
        storage.type filesystem
        DB /var/log/flb_kube.db
        Read_from_Head On
        Mem_Buf_Limit 500MB
  fluent-bit-output.conf: |2

    [OUTPUT]
        Name kafka
        Match *
        Brokers bg-kafka-private-cp-kafka.default.svc.cluster.local:9092
        Topics kubernetes_cluster
        Timestamp_Format  iso8601
        storage.total_limit_size  10G
        rdkafka.batch.num.messages 10000
        rdkafka.batch.size 10000000
        rdkafka.request.required.acks 1
        rdkafka.linger.ms 2000
        rdkafka.compression.codec lz4
  fluent-bit-service.conf: |
    [SERVICE]
        Flush        1
        Daemon       Off
        Log_Level    info
        Parsers_File parsers.conf
        Parsers_File parsers_custom.conf
        storage.path /var/log/flb-storage/
        storage.backlog.mem_limit 500MB

@loburm (Contributor) commented Mar 8, 2021

@akashmantry, our team also experienced a similar issue in GKE. After a very spiky load (10 MB/s) for a few seconds, the file is rotated, but it seems that Fluent Bit does not track the new file (it only detects a new file after one more rotation). Do you know whether upgrading to 1.6.8 by itself could resolve this issue, or should we combine all of the methods?

@akashmantry

@loburm, you might have to try a combination of these to figure out what works for your use case. Switching to the latest version is always recommended. For us, moving to Kafka and increasing the log file size eliminated the log loss problems.

@loburm (Contributor) commented Mar 11, 2021

@edsiper I think we had a similar case in GKE recently. First there was heavy load (multiple MB per second), during which log rotation happened. Fluent Bit did not manage to detect the new file and only resumed reading logs after the next rotation.

@github-actions (bot)

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

The github-actions bot added the Stale label on Jan 22, 2022.
@github-actions (bot)

This issue was closed because it has been stalled for 5 days with no activity.
