FLUME-3101 Add maxBatchCount config property to Taildir Source. #240
If multiple files in the configured path(s) need to be tailed and one of them is written at high frequency, Taildir can read a full batch of batchSize events from that file on every read. This can lead to an endless loop in which Taildir only reads data from the busy file and the other files are never processed. In this state TaildirSource is also unresponsive to stop requests.

This commit handles the situation by introducing a new config property called maxBatchCount. It controls the number of batches read consecutively from the same file. After reading maxBatchCount batches from a file, Taildir switches to another file or takes a break in the processing.

This change is based on hunshenshi's patch.
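To illustrate the behaviour described above, here is a minimal Java sketch. It is not the code from this patch: files are modelled as in-memory queues of lines, and the names `MaxBatchCountSketch`, `processOnce` and `fileWithLines` are hypothetical. It only shows how a per-file batch counter can stop a busy file from starving the others.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only: files are modelled as in-memory queues of lines.
// Class and method names are hypothetical and are not taken from the Flume patch.
class MaxBatchCountSketch {

    // Builds a fake "file" holding the given number of unread lines.
    public static Deque<String> fileWithLines(int lineCount) {
        Deque<String> file = new ArrayDeque<>();
        for (int i = 0; i < lineCount; i++) {
            file.add("line-" + i);
        }
        return file;
    }

    // Makes one pass over all files and returns the index of the file
    // each batch was read from, in read order.
    public static List<Integer> processOnce(List<Deque<String>> files,
                                            int batchSize, long maxBatchCount) {
        List<Integer> batchOrder = new ArrayList<>();
        for (int fileIndex = 0; fileIndex < files.size(); fileIndex++) {
            Deque<String> file = files.get(fileIndex);
            long batchCount = 0;
            while (!file.isEmpty()) {
                // Read up to batchSize lines from the current file.
                int read = 0;
                while (read < batchSize && !file.isEmpty()) {
                    file.poll();
                    read++;
                }
                batchOrder.add(fileIndex);
                if (read < batchSize) {
                    break; // short batch: nothing more to read right now
                }
                if (++batchCount >= maxBatchCount) {
                    break; // limit reached: give the next file a turn
                }
            }
        }
        return batchOrder;
    }

    public static void main(String[] args) {
        List<Deque<String>> files = new ArrayList<>();
        files.add(fileWithLines(10)); // the "busy" file
        files.add(fileWithLines(2));  // a quiet file
        // batchSize = 2, maxBatchCount = 2
        System.out.println(processOnce(files, 2, 2)); // prints [0, 0, 1]
        System.out.println(files.get(0).size());      // prints 6
    }
}
```

In this sketch, leaving maxBatchCount at Long.MAX_VALUE would drain all ten lines of the first file before the second file is touched, which is the starvation case the new property is meant to avoid.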
Inconsistent locking of the new maxBatchCount instance variable. The same problem affects other variables in the class too. It will be handled later.
Thank you for the contribution @turcsanyip
It looks good to me, only some minor changes needed.
@@ -1401,6 +1402,7 @@ Example for agent named a1:
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.ri.numThreshold = 1000
The property name in the example differs from the parameter name.
Thanks for catching it. Fixed.
@@ -1375,6 +1375,7 @@ skipToEnd false Whether to sk
idleTimeout       120000          Time (ms) to close inactive files. If the closed file is appended new lines to, this source will automatically re-open it.
writePosInterval  3000            Interval time (ms) to write the last position of each file on the position file.
batchSize         100             Max number of lines to read and send to the channel at a time. Using the default is usually fine.
maxBatchCount     Long.MAX_VALUE  Controls the number of batches being read consecutively from the same file. It can be used to prevent reading a file in endless loop when the file is written by high frequency.
I would add a little more explanation, for example:
If the source is tailing multiple files and one of them is written at a fast rate, it can prevent other files from being read. In this case, lower this value.
Modified.
LGTM, +1.
- fix parameter name in the example
- refine parameter description
This closes apache#240
Reviewers: Ferenc Szabo, Endre Major
(Peter Turcsanyi via Ferenc Szabo)