
[tail] in_tail plugin doesn't refresh newly added files during processing. #573

Closed
Neozaru opened this issue Apr 8, 2015 · 3 comments

Neozaru commented Apr 8, 2015

Hello.

I'm using Fluentd for some quite heavy processing (between 5 and 10 plugins) on large files.
I have an "in_tail" source with "path" set to "input*.log".

Today, I tested a scenario.

  1. Create an empty input1.log file.
  2. Start Fluentd.
  3. Send 50k lines into "input1.log". (cat x >> y)
  4. Send 50k more lines into "input1.log".
  5. Send 50k more lines into "input2.log".
  6. Send 50k more lines into "input3.log".

Each 50k batch takes 3 minutes to process.
The result: the 100k lines of "input1.log" are processed in 6 minutes.
During that processing, "input2.log" and "input3.log" are not detected.

After 6 minutes (end of the first processing), the two other files are finally detected, but their existing contents will never be processed: Fluentd just waits for new lines in these files (which it does process if I add some at that point).
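For reference, the scenario above can be reproduced with a short shell script (paths and line counts are illustrative; a running Fluentd tailing input*.log in the same directory is assumed):

```shell
# Reproduction sketch for the scenario above. Fluentd is assumed to be
# tailing input*.log in this directory while the script runs; without
# Fluentd the script still just builds the three input files.
set -e
dir=$(mktemp -d)
cd "$dir"

: > input1.log                      # 1. empty file exists before Fluentd starts

seq 1 50000      >> input1.log      # 3. append 50k lines
seq 50001 100000 >> input1.log      # 4. append 50k more lines

seq 1 50000 > input2.log            # 5. new file appears mid-processing
seq 1 50000 > input3.log            # 6. another new file appears

wc -l input*.log                    # input1.log now holds 100k lines
```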

In my environment, a new file is created every hour. If Fluentd is busy processing when the file appears, some log lines are lost.
I tried different refresh_interval values, but it made no difference.

Am I right to say that Fluentd should at least register newly created files (and set their cursor position to 0) as soon as refresh_interval is reached?

Thank you.

EDIT: Ubuntu 14.04, Fluentd 0.12.7, old-style config

repeatedly (Member) commented

Could you show me your configuration?
Do you set read_from_head true?


Neozaru commented Apr 9, 2015

With the read_from_head option I don't lose any log lines, since these files don't rotate, so there is nothing to "tail".
Files are processed in "alphanumerical" order, which happens to suit my use case, but it's not the behavior I expected: to have multiple files watched and processed in parallel.

In fact, file refresh only occurs when the processing loop is over for all current inputs (am I right?)

EDIT: Here is the input configuration, for reference:

<source>
  type tail
  path /data/input*.*.log
  pos_file /data/input.pos
  format /^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>.*))?" (?<code>[^ ]*) (?<size>[^ ]*) "(?<referer>[^\"]*)" "(?<agent>[^\"]*)"$/
  time_format %d/%b/%Y:%H:%M:%S %z
  tag input.apache
  refresh_interval 5
  read_from_head true
</source>
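One way to check which files in_tail has actually registered is to inspect the pos_file. As a sketch (the record layout — path, hex byte offset, hex inode, tab-separated — is an assumption based on Fluentd 0.12's pos_file behavior), the offsets can be decoded like this:

```shell
# Decode a pos_file (assumed layout: path<TAB>hex-offset<TAB>hex-inode,
# one record per watched file). A sample record is fabricated here so the
# snippet is self-contained; point it at /data/input.pos in practice.
posfile=$(mktemp)
printf '/data/input1.log\t0000000000002710\t0000000000a1b2c3\n' > "$posfile"

while IFS=$'\t' read -r path off ino; do
  printf '%s offset=%d inode=%d\n' "$path" "$((16#$off))" "$((16#$ino))"
done < "$posfile"
```

If a newly created file has no record here, in_tail has not registered it yet, which matches the symptom described above.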

repeatedly (Member) commented

> Files are processed in "alphanumerical" order

It depends on the Ruby implementation. Fluentd doesn't guarantee "alphanumerical" order.

> To have multiple files watched and processed in parallel.

Fluentd focuses on streaming log processing, not parallel processing.
in_tail launches only one thread internally. If Fluentd launched a thread per file, it would consume lots of memory and CPU.
If you want to process large files in parallel, use Embulk instead: http://www.embulk.org/docs/
Note that Embulk is for bulk loading, so it doesn't have an in_tail-like watching feature.

> In fact, file refresh only occurs when the processing loop is over for all current inputs (am I right?)

It depends on the order in which events are triggered.
If the refresh timer event is triggered before another IO event, the refresh occurs before the actual IO call.
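This order dependence can be illustrated with a toy single-threaded dispatcher (purely a sketch of the idea, not Fluentd's actual event loop):

```shell
# Toy single-threaded dispatch: each event handler runs to completion
# before the next one starts, so whether new files get registered before
# or after a long read depends only on which event was triggered first.
run_queue() {
  for ev in "$@"; do
    case "$ev" in
      io)      echo "io: drain existing file (loop blocked until done)" ;;
      refresh) echo "refresh: rescan the path glob for new files" ;;
    esac
  done
}

echo "-- io event triggered first:"
run_queue io refresh        # refresh waits until the (long) read returns
echo "-- refresh timer triggered first:"
run_queue refresh io        # new files registered before the read starts
```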
