Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FLUME-3101 Add maxBatchCount config property to Taildir Source. #240

Closed
wants to merge 3 commits into from

Conversation

turcsanyip
Copy link
Contributor

If there are multiple files in the path(s) that need to be tailed and there
is a file written by high frequency, then Taildir can read the batchSize size
events from that file every time. This can lead to an endless loop and Taildir
will only read data from the busy file, while other files will not be
processed.
Another problem is that in this case TaildirSource will be unresponsive to
stop requests too.

This commit handles this situation by introducing a new config property called
maxBatchCount. It controls the number of batches being read consecutively
from the same file. After reading maxBatchCount rounds from a file, Taildir
will switch to another file / will have a break in the processing.

This change is based on hunshenshi's patch.

If there are multiple files in the path(s) that need to be tailed and there
is a file written by high frequency, then Taildir can read the batchSize size
events from that file every time. This can lead to an endless loop and Taildir
will only read data from the busy file, while other files will not be
processed.
Another problem is that in this case TaildirSource will be unresponsive to
stop requests too.

This commit handles this situation by introducing a new config property called
maxBatchCount. It controls the number of batches being read consecutively
from the same file. After reading maxBatchCount rounds from a file, Taildir
will switch to another file / will have a break in the processing.

This change is based on hunshenshi's patch.
Inconsistent locking of the new maxBatchCount instance variable.The same
problem affects other variables in the class too. It will be handled later.
Copy link
Contributor

@szaboferee szaboferee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @turcsanyip
It looks good to me, only some minor changes needed.

@@ -1401,6 +1402,7 @@ Example for agent named a1:
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.ri.numThreshold = 1000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the example is different than the parameter name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching it. Fixed.

@@ -1375,6 +1375,7 @@ skipToEnd false Whether to sk
idleTimeout 120000 Time (ms) to close inactive files. If the closed file is appended new lines to, this source will automatically re-open it.
writePosInterval 3000 Interval time (ms) to write the last position of each file on the position file.
batchSize 100 Max number of lines to read and send to the channel at a time. Using the default is usually fine.
maxBatchCount Long.MAX_VALUE Controls the number of batches being read consecutively from the same file. It can be used to prevent reading a file in endless loop when the file is written by high frequency.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a little bit more explanation. for example:
If the source is tailing multiple files and one of them is written at a fast rate, it can prevent other files to be read. In this case lower this value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified.

Copy link
Contributor

@majorendre majorendre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, +1.

- fix parameter name in the example
- refine parameter description
@asfgit asfgit closed this in b252267 Nov 22, 2018
szaboferee pushed a commit to szaboferee/flume that referenced this pull request Jan 16, 2019
If there are multiple files in the path(s) that need to be tailed and there
is a file written by high frequency, then Taildir can read the batchSize size
events from that file every time. This can lead to an endless loop and Taildir
will only read data from the busy file, while other files will not be
processed.
Another problem is that in this case TaildirSource will be unresponsive to
stop requests too.

This commit handles this situation by introducing a new config property called
maxBatchCount. It controls the number of batches being read consecutively
from the same file. After reading maxBatchCount rounds from a file, Taildir
will switch to another file / will have a break in the processing.

This change is based on hunshenshi's patch.

This closes apache#240

Reviewers: Ferenc Szabo, Endre Major

(Peter Turcsanyi via Ferenc Szabo)
waidr pushed a commit to waidr/flume that referenced this pull request Jul 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants