Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] in_tail plugin isn't continue watch log file after logrotate was ran on k8s logs file. #375

Closed
duythinht opened this issue Sep 12, 2017 · 27 comments
Assignees

Comments

@duythinht
Copy link

duythinht commented Sep 12, 2017

I have run fluent-bit for k8s, but after run logrotate, in_tail is not watch log file, which has been rotated.
My fluentbit config:

[SERVICE]
  Flush        5
  Daemon       Off
  Log_Level    debug
  Parsers_File parsers.conf

[INPUT]
  Name          tail
  Tag           kube.*
  Path          /log/*_default_*.log
  DB            /log/kube.db
  Mem_Buf_Limit 5MB
[OUTPUT]
  Name stdout
  Match *

My logrotate config

/var/lib/docker/containers/*/*.log {
    daily
    rotate 1
    copytruncate
    compress
    delaycompress
    missingok
    notifempty
}

I thinks something was wrong after logs file has changed outside container, how I reproduce: I run a fluent-bit containers in docker, mount volume [current_folder]:/log

  • Append log inside containers:
# Run this command inside fluent-bit container
$ echo '{"log":"hello world","stream":"stdout","time":"2017-08-14T07:16:44.331313977Z"}' >> message_default_inside.log
# This work!
  • append log outside of container
# Run this command outside of fluent-bit container
$ echo '{"log":"hello world","stream":"stdout","time":"2017-08-14T07:16:44.331313977Z"}' >> message_default_outside.log
# This not work!

Trying start, restart sometime.

@edsiper
Copy link
Member

edsiper commented Sep 26, 2017

note that when a third-party tool rotate a file Fluent Bit catch this event (which is a file rename), and what it does is to keep monitoring the rotated file for the next 5 seconds (Rotate_Wait option), after that is not longer monitored.

Now when a file is rotated, likely the original application that create the logs will re-create the file (same name), but in order to let Fluent Bit catch that file creation it needs to re-scan the path, this operation is handled by the Refresh_Interval option, by default it re-scan every 60 seconds, I suggest to keep this value low as 5 seconds.

If the issue mentioned do not address the problem explained above, please provide detailed steps to try to reproduce the problem.

@edsiper edsiper self-assigned this Sep 26, 2017
@edsiper
Copy link
Member

edsiper commented Oct 13, 2017

@duythinht is there any pending question/issue on your side ?

@duythinht
Copy link
Author

I'm still troubleshoot this issue. Sometime tail keep working, sometime it's not working (after logrotate running)

@hdiass
Copy link

hdiass commented Oct 25, 2017

I'm also with same issue.
When my app rotates the file fluent-bit container provides this error plugins/in_tail/tail_file.c:688 errno=2
and the log stop being monitored and fluent-bit container gets frozen.
Trying today to change the refresh-interval as @edsiper mentioned and then i will provide feedback.

@hdiass
Copy link

hdiass commented Oct 26, 2017

Changed the refresh-interval didn't helped.. when file rotated fluent-bit didn't monitored it anymore, needed to restart the fluent container.

@edsiper
Copy link
Member

edsiper commented Nov 2, 2017

@hdiass 0.12.7 has been released, please upgrade to that version and let us know if the issue persists.

@hdiass
Copy link

hdiass commented Nov 7, 2017

Hello @edsiper, i upgraded fluent-bit but even though same issue, when file rotates its read anymore by fluent-bit and stays in loop trying to read the file. i've turned on the debug log level to post here the behaviour, if it helps.

[2017/11/06 22:03:07] [debug] [task] destroy task=0x7fca0023c0e0 (task_id=0)
[2017/11/06 22:03:07] [debug] [dyntag tail.0] 0x7fca0028b120 destroy (tag=tail.0)
[2017/11/06 22:03:34] [debug] [in_tail] rotated: /some/directory/file.log -> /some/directory/file.log
[2017/11/06 22:03:34] [debug] [in_tail] removed /some/directory/file.log
[2017/11/06 22:03:36] [debug] [in_tail] append new file: /some/directory/file.log
[2017/11/06 22:03:36] [debug] [in_tail] add to scan queue /some/directory/file.log, offset=10487070
[2017/11/06 22:03:36] [debug] [in_tail] file=/some/directory/file.log promote to TAIL_EVENT
[2017/11/06 22:03:36] [debug] [in_tail] file=/some/directory/file.log cannot promote, unregistering
[2017/11/06 22:03:41] [debug] [in_tail] append new file: /some/directory/file.log
[2017/11/06 22:03:41] [debug] [in_tail] add to scan queue /some/directory/file.log, offset=10487070
[2017/11/06 22:03:41] [debug] [in_tail] file=/some/directory/file.log promote to TAIL_EVENT
[2017/11/06 22:03:41] [debug] [in_tail] file=/some/directory/file.log cannot promote, unregistering
[2017/11/06 22:03:46] [debug] [in_tail] append new file: /some/directory/file.log
[2017/11/06 22:03:46] [debug] [in_tail] add to scan queue /some/directory/file.log, offset=10487070
[2017/11/06 22:03:46] [debug] [in_tail] file=/some/directory/file.log promote to TAIL_EVENT
[2017/11/06 22:03:46] [debug] [in_tail] file=/some/directory/file.log cannot promote, unregistering

@edsiper
Copy link
Member

edsiper commented Nov 7, 2017

@hdiass what kind of rotation mode are you using, copytruncate ?

[2017/11/06 22:03:34] [debug] [in_tail] rotated: /some/directory/file.log -> /some/directory/file.log

why the rotated file have the same name ?

[2017/11/06 22:03:36] [debug] [in_tail] file=/some/directory/file.log promote to TAIL_EVENT
[2017/11/06 22:03:36] [debug] [in_tail] file=/some/directory/file.log cannot promote, unregistering

that means that a file was promoted for inotify but then it failed, mostly because it was deleted.

Looks like your file are being rotated faster than the refresh_interval, please set a refresh_interval of 5 seconds

@hdiass
Copy link

hdiass commented Nov 7, 2017

@edsiper, the application that i want to monitor handles the log file itself, not using logrotate from the system. What the app does for what i can see is create a "backup" file with the old log file and recreates a new log file with the same name. Ok i'll set the refresh interval for that value and test again,

Thanks a lot,

@hdiass
Copy link

hdiass commented Nov 7, 2017

@edsiper I was checking and i already had refresh interval option set on 5, so that will not help

you can find the the config file i'm using below.

[SERVICE]
    Flush          1
    Daemon         off
    Log_Level      debug
    #Log_File      /var/log/fluentbit.log
    Parsers_File   parsers.conf

[INPUT]
    Name              tail
    Path              /some/directory/file.log
    Parser            json-test
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     10MB
    Buffer_Chunk_Size 8k
    Buffer_Max_Size   32k
    Refresh_Interval  5


[OUTPUT]
    Name   es
    Match  *
    Host   some.elastic.search.com
    Port   1111
    Logstash_Format On
    Retry_Limit False
    Logstash_Prefix some-index-logs
    Index some-index-logs
    Time_Key _@timestamp

also maybe good for you to know, the timestamp between old file last log is really like miliseconds difference from the first timestamp on the new log file.

old log file last line time stamp : "@timestamp":"2017-11-06T22:03:06.198+00:00"
newly created log file first line: "@timestamp":"2017-11-06T22:03:34.274+00:00"

If you can somehow tell me what is the best config here to fluent-bit correcty follow the log after the rotation

@epcim
Copy link
Contributor

epcim commented Feb 5, 2018

I challenge the similar behaviour. We discovered it's related to logrotate "copytruncate" option.

copytruncate

          Truncate the original log file in place after creating a copy, instead of moving the old log file and optionally creating a new one.  It can be used when some program cannot be told to
          close  its logfile and thus might continue writing (appending) to the previous log file forever.  Note that there is a very small time slice between copying the file and truncating it,
          so some logging data might be lost.  When this option is used, the create option will have no effect, as the old log file stays in place.

Fluent bit should recognize number of lines in file, and if that is < then the previous value, it should re-read the file from scratch + reset it's position (whatever to get un-blocked).

@rmoriz
Copy link

rmoriz commented Feb 16, 2018

Can confirm the issue using Fluent-Bit v0.12.13. sqlite3 db keeps the counter even when the log file itself was logrotated ans reset to 0 bytes.

Only workaround I was able to come up with is not to use the DB option. 😒

@edsiper
Copy link
Member

edsiper commented Feb 16, 2018

Copytruncate mode is dangerous and should be avoided in this scenario, in general it leads to data loss.

Counting the number of lines is not a solution since that will mean: for every read(2) go to the beginning of the file and count the number of line breaks (\n). The other solution would be to check for the file size on every read using stat(2), again ..it will be performance killer and a constant pain. It's times better to use a different log rotation mode than copytruncate.

Note that also copytruncate is done by a third party tool, so there is high chances that truncation is done when the application is writing data to the file, there is no "sync". A workaround would be to let Docker handle rotation. There is relevant discussion on this topic on Kubernetes repo:

kubernetes/kubernetes#38495

@rmoriz
Copy link

rmoriz commented Feb 16, 2018

We're using fluent-bit outside of kubernetes/docker. Usually "logrotate" is responsible for logrotation (Debian/Ubuntu). In some cases we're still using "remote_syslog2" which claims to handle this scenario https://github.com/papertrail/remote_syslog2#log-rotation-and-the-behavior-of-remote_syslog - maybe an inspiration?

@edsiper
Copy link
Member

edsiper commented Feb 16, 2018

@rmoriz thanks for the hints.

Actually the papertrail client does specifically the workaround mentioned above: "stat(2) the file when some 'write' operation was done":

https://github.com/papertrail/remote_syslog2/blob/master/vendor/github.com/papertrail/go-tail/follower/follower.go#L170

Note that the workaround will only work if the tool that generated the original log file did not open the file using O_APPEND mode.

I will try an enhancement

edsiper added a commit that referenced this issue Feb 16, 2018
Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
@edsiper
Copy link
Member

edsiper commented Feb 16, 2018

@rmoriz @epcim

I pushed some improvements on GIT master to handle file truncation. Would you please re-build and test ?

edsiper added a commit that referenced this issue Feb 17, 2018
Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
@epcim
Copy link
Contributor

epcim commented Feb 22, 2018

Confirm 0.13 Dev, tested for a while and seems it really works with logrotate and the above options. No freezes yet.

@rmoriz
Copy link

rmoriz commented Mar 1, 2018

Deployed + tested one week. looks good so far.

@edsiper
Copy link
Member

edsiper commented Mar 1, 2018

thanks everyone for helping on this issue.

Closing as Fixed.

@edsiper edsiper closed this as completed Mar 1, 2018
@sandstrom
Copy link

sandstrom commented Mar 5, 2018

Awesome @edsiper 🏆

Will this be released in the 0.12.x line? It would be very helpful!

@edsiper
Copy link
Member

edsiper commented Mar 6, 2018

@sandstrom
Copy link

@edsiper Awesome, thanks! 💯

micklonge pushed a commit to micklonge/elk_fluent-bit that referenced this issue Jun 26, 2018
Signed-off-by: Eduardo Silva <eduardo@treasure-data.com>
@suckowbiz
Copy link

ping @edsiper

The official documentation here https://fluentbit.io/documentation/0.13/input/tail.html states:

Rotation with truncation (e.g. logrotate's copytruncate mode) is not supported.

Is the documentation outdated or is there still an issue with logrotate and copytruncate?

@edsiper
Copy link
Member

edsiper commented Jul 20, 2018

Documentation needs to be updated, in the other side the note the following requirement:

Note that the workaround will only work if the tool that generated the original log file did not open the file using O_APPEND mode.

@matrey
Copy link

matrey commented Jan 23, 2019

@edsiper FYI the documentation (even for 1.0: https://docs.fluentbit.io/manual/input/tail) still mentions "Rotation with truncation (e.g. logrotate's copytruncate mode) is not supported."

Also, regarding your remark that it "will only work if the tool that generated the original log file did not open the file using O_APPEND mode": does that mean we can expect logs rotated through logrotate's copytruncate to work or not?

@edsiper
Copy link
Member

edsiper commented Jul 2, 2019

Updating the docs now, thanks for catching that.

rawahars pushed a commit to rawahars/fluent-bit that referenced this issue Oct 24, 2022
…fluent#375)

* out_stackdriver: fix incorrect format of local_resource_id, add ending dot

Signed-off-by: Jeff Luo <jeffluoo@google.com>

* out_stackdriver: add missing option severity_key

Signed-off-by: Jeff Luo <jeffluoo@google.com>

Co-authored-by: Eduardo Silva <eduardo@treasure-data.com>
@WqyJh
Copy link

WqyJh commented Nov 24, 2022

What about the copied file, would it be consume from start?
If I had a log file named a.log which was half processed and was copied to a.1.log, the truncated a.log would be processed correctly, but what would happen to a.1.log?
Does its content would be re-consumed or just ignored?
Note that it's possible that content in a.1.log is half processed which means the unprocessed parts should continue to be processed and the processed parts shouldn't be re-consumed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants