Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add follow_inodes flag to tail files by inodes, not by names #1660

Closed
wants to merge 6 commits into from

Conversation

OleksiiDuzhyi
Copy link

Use case

We have a lot of servers with up to 300 application deployed on each server (mostly Java/Scala). Each application writes log files using log4j library. Each log file is rotated daily in the same directory, with up to 10 back up files possible. Events are fed to realtime processing system(critical to time delays up to 1 second) and to HBase. We want to replace current heavy log collector with lightweight solution.

Problem

We tried to use default tail plugin. Most basic configuration will look like this:

<source>
  @type tail
  read_from_head true
  path /var/log/*/app*.log  
  tag app.log.*
  format none
  refresh_interval 10
  pos_file /var/log/pos/app.log.pos
  rotate_wait 3600
  path_key path
  limit_recently_modified 86400
</source>

According to fluentd in_tail plugin documentation - every time file is rotated for such configuration, it will be treated by in_tail plugin as new one, and will be read from it's head, even though file inode has not changed, and in position file we have latest read position (but just for last line before rotation; during rotate_wait time position stored in memory only - https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/in_tail.rb#L554)

This doesn't suit our needs, because:

  • we will generate a lot of unneeded traffic, which will be irrelevant to realtime part
  • disk I/O increase
  • longer ingestion into HBase

One of the possible solution proposed by @repeatedly in #1613 is to avoid '*' in file pattern discovery. This also doesn't suit us, because we'd like to reliably store events with no or minimum loss because they are very crucial for business analysis. Possible data loss scenario:

  1. Fluentd goes down at 23:45 UTC
  2. Log file is rotated at 00:00:00 UTC, app.log is renamed to app.2017-08-02.log (just for example)
  3. Fluentd goes up at 00:15 UTC
  4. Tail plugin reads all data from in new app.log file
  5. Data from app.2017-08-02.log file from 23:45 - 00:00 UTC is lost. No way for us to find out at least latest offset in app.2017-08-02.log because it has been overwritten in position file.

Proposed solution

Add ability to track inodes inside fluentd in_tail plugin to enable combination of '*' in path configuration with rotation inside same directory and read_from_head set to true. In proposed solution we add follow_inodes flag (set to false by default to preserve current behavior). When follow_inodes is set to true and file rotation is detected - we do not reread file, instead we are checking if such inode is present inside pos file - if yes, we don't recreate TailWatcher. TailWatcher is closed after rotate_wait period and file marked as unwatched inside position file after limit_recently_modified time passed. Pos file is cleaned on startup as it was before, but also delete entries for non-existing files.

Please review, and let me know what do you think of such approach?

@repeatedly
Copy link
Member

Thanks for the patch!
I will check it later.

@@ -213,8 +219,30 @@ def expand_paths
paths << path
end
}
excluded = @exclude_path.map { |path| path = date.strftime(path); path.include?('*') ? Dir.glob(path) : path }.flatten.uniq
paths - excluded
excluded = @exclude_path.map {|path| path = date.strftime(path); path.include?('*') ? Dir.glob(path) : path}.flatten.uniq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't change syntax/indent.

excluded = @exclude_path.map {|path| path = date.strftime(path); path.include?('*') ? Dir.glob(path) : path}.flatten.uniq
to_return = paths - excluded
# filter out non existing files, so in case pattern is without '*' we don't do unnecessary work
to_return = to_return.select {|path|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(paths - excluded).to_return.select {|path| seems enough

to_return = paths - excluded
# filter out non existing files, so in case pattern is without '*' we don't do unnecessary work
to_return = to_return.select {|path|
Pathname.new(path).exist?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileTest#exist? is more lightweight because no object instantiation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thx, didn't know about that option

to_return = to_return.select {|path|
Pathname.new(path).exist?
}
Hash[to_return.map {|path|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long lines in Hash[] seems hard to read.
This map can be moved to above line. {}.select {}.map {}.

end

def existence_path
Hash[@tails.keys.map {|path_ino|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With lots of files, e.g. 1000+ files, this map creates lots of object.
So avoiding map is better for the performance.

Of course, expand_paths's map is also same.

@@ -910,5 +996,26 @@ def read_inode
@inode
end
end

class PathInodeTuple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check eql? and other method behaviour of struct object but can't use struct instead of class for this case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked eql? of struct - it work as expected, will change to Struct

@repeatedly
Copy link
Member

BTW, all most tests are failed.
So the something makes tests unstable.

@OleksiiDuzhyi
Copy link
Author

I can't figure why tests are failing, maybe you could give me a hand in this?

In my test I do the following:

File.open("#{TMP_DIR}/tail.txt", "wb") {|f|
    f.puts "test1"
    f.puts "test2"
}
............
cleanup_file("#{TMP_DIR}/tail.txt")
File.open("#{TMP_DIR}/tail.txt", "wb") {|f|

cleanup_file code:

  def cleanup_file(path)
    FileUtils.rm_f(path, secure: true)
    if File.exist?(path)
      # ensure files are closed for Windows, on which deleted files
      # are still visible from filesystem
      GC.start(full_mark: true, immediate_mark: true, immediate_sweep: true)
      FileUtils.remove_entry_secure(path, true)
    end
  end

But for some reason I can't open a file for writing after I've deleted it. Even though I was able to write this file few lines above. I got this exception:

Errno::EACCES: Permission denied @ rb_sysopen - C:/projects/fluentd/test/plugin/../tmp/tail/tail.txt

I have no clue why I can't reopen file when I've deleted it. Should I remove file in a different way?

@esselius
Copy link

Will this be merged anytime soon?

@repeatedly
Copy link
Member

@esselius in_tail plugin is widely used so we want to test this patch on several environment before merging.

@OleksiiDuzhyi
Copy link
Author

We decided to move away from fluentd for log collection, also relying on inodes - is not 100% reliable, inodes could be reused which could cause issues

@repeatedly
Copy link
Member

@OleksiiDuzhyi

We decided to move away from fluentd for log collection

What the tool do you use?
And how to solve rotation issue with your tools?

@OleksiiDuzhyi
Copy link
Author

@repeatedly
We decided to go with Apache Flume. From our tests - it shows bigger memory footprint but 2-3 times less CPU usage, which was critical for us.

As input to Flume we wrote a program in C which does what I tried to implement in this PR - tracks inodes, keep state of inode, etc.

@repeatedly
Copy link
Member

@OleksiiDuzhyi Thanks! So inode issue is still remained but flume resolves CPU concern.
Hmm... we hope fluent-bit resovles such case with lower CPU and memory resource.

@OleksiiDuzhyi
Copy link
Author

@repeatedly
Memory usage of fluentd is small as in documentation - in our case for around 1 mln events/min - it was around 80 mb.

@metayd
Copy link

metayd commented Apr 4, 2018

@repeatedly Hello, I have this problem too, is there any plan to merge this patch into master?

@sanjuicgithub
Copy link

@repeatedly , is this patch still in consideration or does fluentd already has any such feature to address this issue ?

@repeatedly
Copy link
Member

repeatedly commented Apr 17, 2020

is this patch still in consideration

Current patch is incomplete because using only inode has a problem for inode reuse case.
So inode and path or something is needed for more reliable file trace.
If we can ignore such situation, we can accept this patch approach.

kenhys added a commit to kenhys/fluentd that referenced this pull request Sep 29, 2020
Based on closed PR fluent#1660 which
is posted by OleksiiDuzhyi because of incomplete test cases.
cosmo0920 pushed a commit to cosmo0920/fluentd that referenced this pull request Nov 11, 2020
Based on closed PR fluent#1660 which
is posted by OleksiiDuzhyi because of incomplete test cases.
cosmo0920 pushed a commit to cosmo0920/fluentd that referenced this pull request Nov 13, 2020
Based on closed PR fluent#1660 which
is posted by OleksiiDuzhyi because of incomplete test cases.
cosmo0920 pushed a commit to cosmo0920/fluentd that referenced this pull request Nov 17, 2020
Based on closed PR fluent#1660 which
is posted by OleksiiDuzhyi because of incomplete test cases.
cosmo0920 pushed a commit to cosmo0920/fluentd that referenced this pull request Nov 20, 2020
Based on closed PR fluent#1660 which
is posted by OleksiiDuzhyi because of incomplete test cases.
cosmo0920 pushed a commit to cosmo0920/fluentd that referenced this pull request Nov 24, 2020
Based on closed PR fluent#1660 which
is posted by OleksiiDuzhyi because of incomplete test cases.
cosmo0920 pushed a commit to cosmo0920/fluentd that referenced this pull request Nov 24, 2020
Based on closed PR fluent#1660 which
is posted by OleksiiDuzhyi because of incomplete test cases.
cosmo0920 pushed a commit that referenced this pull request Nov 25, 2020
Based on closed PR #1660 which
is posted by OleksiiDuzhyi because of incomplete test cases.

Co-authored-by: alexeyd <alexeyd@wix.com>

Signed-off-by: Hiroshi Hatake <hatake@clear-code.com>
cosmo0920 pushed a commit that referenced this pull request Nov 25, 2020
Based on closed PR #1660 which
is posted by OleksiiDuzhyi because of incomplete test cases.

Co-authored-by: alexeyd <alexeyd@wix.com>

Signed-off-by: Hiroshi Hatake <hatake@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants