in_tail: plugin does not pick up rotated logs under heavy load #1108

Closed
holycheater opened this issue Feb 13, 2019 · 16 comments

@holycheater

Bug Report

Describe the bug
tail_fs_event receives IN_Q_OVERFLOW inotify events from time to time, thus missing IN_MOVE_SELF events.

To Reproduce
Tail a large number of files matched by a pattern while they are being written to heavily. The setup I have reads around 30 nginx access log files by pattern.

Example config:

[SERVICE]
    Flush           1
    Daemon          off
    Log_Level       error
    Parsers_File vertis_parsers.conf
    HTTP_Server  On
    HTTP_Listen  127.0.0.1
    HTTP_Port    2020
[INPUT]
    Name tail
    Tag tail_nginx
    Path /var/log/nginx/*/access.log
    DB /var/lib/fluent-bit/nginx.sqlite
    Parser nginx_vertis
    Buffer_Chunk_Size 256kb
    Buffer_Max_Size 256kb
    Mem_Buf_Limit 64mb
[FILTER]
    Name record_modifier
    Match *
    Record _level INFO
    Record _service nginx
    Record _container_name ${HOSTNAME}
[OUTPUT]
    Name http
    Match *
    Port 10223
    URI /fluent-bit
    Format json
    Retry_Limit False

Example fluent-bit log with traces (I had to add some extra debug output to track down the problem):

[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log read=9428 lines=8
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d2/access.log event
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log read=1449 lines=2
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log read=2277 lines=8
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log read=1037 lines=1
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log read=761 lines=3
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log read=9603 lines=12
[2019/02/10 08:55:30] [trace] inotify_mask: 4000
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log read=894 lines=3
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d2/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d2/access.log read=438 lines=1
[2019/02/10 08:55:30] [trace] inotify_mask: 2
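
For readers decoding the trace: assuming the mask is printed in hex by the extra debug output, 2 is IN_MODIFY (content appended), 4000 is IN_Q_OVERFLOW (0x4000, the kernel dropped queued events), and the rotation event IN_MOVE_SELF (0x800) never shows up. A minimal standalone sketch of how these masks appear when reading an inotify descriptor (not fluent-bit code; the function name drain_inotify is illustrative):

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

/* Read one batch of inotify events and report what they mean. */
static int drain_inotify(int fd)
{
    /* Buffer aligned for struct inotify_event, as inotify(7) recommends */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    ssize_t len = read(fd, buf, sizeof(buf));
    if (len <= 0) {
        return -1;
    }

    for (char *p = buf; p < buf + len; ) {
        struct inotify_event *ev = (struct inotify_event *) p;

        if (ev->mask & IN_MODIFY)      /* 0x2: the "inotify_mask: 2" lines  */
            printf("file modified, read the new bytes\n");
        if (ev->mask & IN_MOVE_SELF)   /* 0x800: watched file was renamed   */
            printf("file rotated, reopen / rescan the pattern\n");
        if (ev->mask & IN_Q_OVERFLOW) {/* 0x4000: the "inotify_mask: 4000"  */
            /* The kernel dropped events because the queue hit
             * fs.inotify.max_queued_events; a rename queued behind the limit
             * is gone and must be recovered by a stat()-based rescan. */
            fprintf(stderr, "inotify queue overflow, events were lost\n");
        }
        p += sizeof(struct inotify_event) + ev->len;
    }
    return 0;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file-to-watch>\n", argv[0]);
        return 1;
    }

    int fd = inotify_init();
    if (fd < 0 || inotify_add_watch(fd, argv[1], IN_MODIFY | IN_MOVE_SELF) < 0) {
        perror("inotify");
        return 1;
    }
    while (drain_inotify(fd) == 0) {
        /* keep draining; a warning is printed whenever the queue overflows */
    }
    return 0;
}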

sysctl params related to inotify:

fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 524288
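
IN_Q_OVERFLOW is generated when a single inotify instance accumulates more than fs.inotify.max_queued_events pending events. Raising that limit is not something the thread settled on, but it is the obvious knob to experiment with; the value below is only an example:

# Raise the per-instance inotify queue limit at runtime (example value);
# this only delays overflow under sustained load, it does not prevent it.
sysctl -w fs.inotify.max_queued_events=65536
# Persist the setting across reboots
echo 'fs.inotify.max_queued_events = 65536' > /etc/sysctl.d/99-inotify.conf
sysctl --system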

Expected behavior
Reload files after log rotation.

Your Environment

  • Version used: 1.0.3
  • Operating System and version: ubuntu 16.04
  • Filters and plugins: in_tail

Additional context
We are trying to collect access logs on several nginx machines. Every morning after log rotation the tail plugin fails to pick up most of the rename events (more often than not, all of them).
Basically, it looks like this happens because files are tailed synchronously with receiving inotify events (reading, filling the buffer, processing content), while other events pile up in the queue.

@bogdanov1609

We have the same problem :\
Any news on this?

@bogdanov1609

ping?

@l2dy (Contributor) commented May 21, 2019

There's a timer that checks for deleted/rotated files every 2.5 seconds; did it work for you?

/* Set a manual timer to check deleted/rotated files every 2.5 seconds */
ret = flb_input_set_collector_time(in, tail_fs_check,
                                   2, 500000000, config);

@holycheater (Author)

@l2dy, no it did not, because there is an inotify version of this code, and that is what runs in fluent-bit:

/* File System events based on Inotify(2). Linux >= 2.6.32 is suggested */
int flb_tail_fs_init(struct flb_input_instance *in,

@l2dy (Contributor) commented May 22, 2019

Indeed. If you use Fluent Bit built with -DFLB_INOTIFY=Off, can you reproduce this issue?
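
For anyone wanting to test that suggestion, a rough build sketch, assuming the usual CMake-based build of Fluent Bit from source (paths are illustrative):

# Build Fluent Bit with the inotify backend disabled so in_tail falls back
# to the interval/stat()-based file checks
cd fluent-bit/build
cmake -DFLB_INOTIFY=Off ..
make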

@singalravi

@holycheater Did you find any solution to this problem?

@holycheater (Author)

We made a hack to restart fluent-bit after rotating the logs. Building fluent-bit without inotify (so it just stat()s the watched files at a regular interval) should work too.
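
A minimal sketch of that restart hack as a logrotate hook, assuming fluent-bit runs under systemd; the path and rotation policy are illustrative only:

/var/log/nginx/*/access.log {
    daily
    rotate 7
    compress
    postrotate
        # keep any existing nginx log-reopen commands here as well
        # workaround only: bounce fluent-bit so it re-scans the Path pattern
        systemctl restart fluent-bit
    endscript
}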

@Helmut-Onna

Having a similar issue in a gcloud k8s cluster, using fluent-bit v1.2.0.
After a couple of days of running, the number of files opened minus files closed by the tail plugin stays over 11k.
The file counts are taken from curl -s http://127.0.0.1:2020/api/v1/metrics | jq.

The node then becomes problematic: other pods/apps are unable to start due to a misleading "no space left on device" error, which actually means the inotify watches limit has been reached.

Increasing the watches limit is just a workaround that delays the issue, since the number of actual files never gets that high (after restarting fluent-bit the reported number drops to ~120 files actually open).
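
For reference, those counters come from the monitoring endpoint quoted above; a quick way to inspect them (the files_opened/files_closed/files_rotated field names are what recent in_tail versions expose, so treat them as an assumption for your build):

# Dump per-input metrics; for a tail input, files_opened - files_closed
# approximates how many file handles (and inotify watches) are still held.
curl -s http://127.0.0.1:2020/api/v1/metrics | jq '.input'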

@krancour (Contributor)

I've been encountering this as well. I send a million lines to stdout very quickly from a single Docker container and reliably get about the first ~650,000 lines. The last line I receive is always the last line written before the log was rotated; I never get even one line from the new file after rotation. Popping into the sqlite database to poke around, the record shows no recognition of the rotation and an offset that exceeds the new file's length.
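
For reference, that offset check can be done directly against the tail position DB (the table and column names below assume the default in_tail_files schema of the 1.x series; the DB path is the one from the reporter's config and will differ per setup):

# A stored offset larger than the current file size, with rotated = 0,
# matches the behaviour described above.
sqlite3 /var/lib/fluent-bit/nginx.sqlite \
  'SELECT name, offset, inode, rotated FROM in_tail_files;'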

@edsiper any insight on this?

@rashmichandrashekar

I am seeing the same issue at high scale. Any workarounds/updates?

@akashmantry commented Dec 14, 2020

We had the same issue running fluent-bit on EKS, with logs coming in at around 50k req/s. After updating fluent-bit to v1.6.8, bumping up the resources, and using a combination of memory and filesystem buffering for messages, we were able to avoid the log loss. We also modified the log rotation on EKS to rotate files only after they reach 10 GB.
Here is the fluent-bit config we have:

  fluent-bit-input.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  1
        Skip_Long_Lines   On
        Buffer_Chunk_Size 10MB
        Buffer_Max_Size 10MB
        Rotate_wait 60
        storage.type filesystem
        DB /var/log/flb_kube.db
        Read_from_Head On
        Mem_Buf_Limit 500MB
  fluent-bit-output.conf: |2

    [OUTPUT]
        Name kafka
        Match *
        Brokers bg-kafka-private-cp-kafka.default.svc.cluster.local:9092
        Topics kubernetes_cluster
        Timestamp_Format  iso8601
        storage.total_limit_size  10G
        rdkafka.batch.num.messages 10000
        rdkafka.batch.size 10000000
        rdkafka.request.required.acks 1
        rdkafka.linger.ms 2000
        rdkafka.compression.codec lz4
  fluent-bit-service.conf: |
    [SERVICE]
        Flush        1
        Daemon       Off
        Log_Level    info
        Parsers_File parsers.conf
        Parsers_File parsers_custom.conf
        storage.path /var/log/flb-storage/
        storage.backlog.mem_limit 500MB

@loburm (Contributor) commented Mar 8, 2021

@akashmantry, our team also experienced a similar issue in GKE. After a very spiky load (10 MB/s) for a few seconds, the file is rotated, but it seems that Fluent Bit does not track the new file (it only detects a new file after one more rotation). Do you know whether upgrading to 1.6.8 by itself could resolve this issue, or should we combine all of the methods?

@akashmantry

@loburm, you might have to try a combination of these to figure out what works for your use case. Switching to the latest version is always recommended. For us, moving to Kafka and increasing the log file size eliminated the log loss problems.

@loburm (Contributor) commented Mar 11, 2021

@edsiper I think we had a similar case in GKE recently. First there was heavy load (multiple MB per second), during which log rotation happened. Fluent Bit did not manage to detect the new file and only resumed reading logs after the next rotation.

@github-actions (bot)

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

The github-actions bot added the Stale label on Jan 22, 2022.
@github-actions (bot)

This issue was closed because it has been stalled for 5 days with no activity.
