
[tail] in_tail plugin doesn't refresh newly added files during processing. #573

Closed
Neozaru opened this issue Apr 8, 2015 · 3 comments

Neozaru commented Apr 8, 2015

Hello.

I'm using Fluentd for some quite heavy processing (between 5 and 10 plugins) on large files.
I have an "in_tail" source with "path" set to "input*.log".

Today, I tested a scenario.

  1. Create an empty input1.log file.
  2. Start Fluentd.
  3. Send 50k lines into "input1.log". (cat x >> y)
  4. Send 50k more lines into "input1.log".
  5. Send 50k more lines into "input2.log".
  6. Send 50k more lines into "input3.log".

Each 50k batch takes 3 minutes to process.
The result: the 100k lines of "input1.log" are processed in 6 minutes.
During that processing, "input2.log" and "input3.log" are not detected.

After 6 minutes (end of the first processing), the two other files are finally detected, but their existing contents will never be processed: Fluentd just waits for new lines in these files (which it does process if I add some at that point).
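For reference, the scenario above can be reproduced with a short shell script (paths and line counts are illustrative; a running Fluentd tailing input*.log in the same directory is assumed):

```shell
# Reproduction sketch for the scenario above. Fluentd is assumed to be
# tailing input*.log in this directory while the script runs; without
# Fluentd the script still just builds the three input files.
set -e
dir=$(mktemp -d)
cd "$dir"

: > input1.log                      # 1. empty file exists before Fluentd starts

seq 1 50000      >> input1.log      # 3. append 50k lines
seq 50001 100000 >> input1.log      # 4. append 50k more lines

seq 1 50000 > input2.log            # 5. new file appears mid-processing
seq 1 50000 > input3.log            # 6. another new file appears

wc -l input*.log                    # input1.log now holds 100k lines
```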

In my environment, a new file is created every hour. If Fluentd is busy processing when the file appears, some log lines are lost.
I tried different refresh_interval values, but it made no difference.

Am I right to say that Fluentd should at least register newly created files (and set their cursor position to 0) as soon as refresh_interval is reached?

Thank you.

EDIT: Ubuntu 14.04, Fluentd 0.12.7, old-style config

repeatedly (Member) commented

Could you show me your configuration?
Do you set read_from_head true?


Neozaru commented Apr 9, 2015

With the read_from_head option I don't lose any log lines, since these files don't rotate, so there is nothing to "tail".
Files are processed in "alphanumerical" order, which happens to suit my use case, but it's not the behavior I expected: to have multiple files watched and processed in parallel.

In fact, file refresh only occurs when the processing loop is over for all current inputs (am I right?)

EDIT: Here is the input configuration, for reference:

<source>
  type tail
  path /data/input*.*.log
  pos_file /data/input.pos
  format /^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>.*))?" (?<code>[^ ]*) (?<size>[^ ]*) "(?<referer>[^\"]*)" "(?<agent>[^\"]*)"$/
  time_format %d/%b/%Y:%H:%M:%S %z
  tag input.apache
  refresh_interval 5
  read_from_head true
</source>
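One way to check which files in_tail has actually registered is to inspect the pos_file. As a sketch (the record layout — path, hex byte offset, hex inode, tab-separated — is an assumption based on Fluentd 0.12's pos_file behavior), the offsets can be decoded like this:

```shell
# Decode a pos_file (assumed layout: path<TAB>hex-offset<TAB>hex-inode,
# one record per watched file). A sample record is fabricated here so the
# snippet is self-contained; point it at /data/input.pos in practice.
posfile=$(mktemp)
printf '/data/input1.log\t0000000000002710\t0000000000a1b2c3\n' > "$posfile"

while IFS=$'\t' read -r path off ino; do
  printf '%s offset=%d inode=%d\n' "$path" "$((16#$off))" "$((16#$ino))"
done < "$posfile"
```

If a newly created file has no record here, in_tail has not registered it yet, which matches the symptom described above.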

repeatedly (Member) commented

> Files are processed in "alphanumerical" order

It depends on the Ruby implementation. Fluentd doesn't guarantee "alphanumerical" order.

> To have multiple files watched and processed in parallel.

Fluentd focuses on streaming log processing, not parallel processing.
in_tail launches only one thread internally. If Fluentd launched a thread per file, it would consume lots of memory and CPU.
If you want to process large files in parallel, use Embulk instead: http://www.embulk.org/docs/
Note that Embulk is for bulk loading, so it doesn't have an in_tail-like watching feature.

> In fact, file refresh only occurs when the processing loop is over for all current inputs (am I right?)

It depends on the order in which events are triggered.
If the refresh timer event is triggered before another IO event, the refresh occurs before the actual IO call.
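This order dependence can be illustrated with a toy single-threaded dispatcher (purely a sketch of the idea, not Fluentd's actual event loop):

```shell
# Toy single-threaded dispatch: each event handler runs to completion
# before the next one starts, so whether new files get registered before
# or after a long read depends only on which event was triggered first.
run_queue() {
  for ev in "$@"; do
    case "$ev" in
      io)      echo "io: drain existing file (loop blocked until done)" ;;
      refresh) echo "refresh: rescan the path glob for new files" ;;
    esac
  done
}

echo "-- io event triggered first:"
run_queue io refresh        # refresh waits until the (long) read returns
echo "-- refresh timer triggered first:"
run_queue refresh io        # new files registered before the read starts
```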
