NIFI-1118 Update SplitText Processor - add support for split size limits and header line markers. #280
jskora
commented
Mar 15, 2016
- Add "Maximum Fragment Size" property. A new split file will be created if the next line to be added to the current split file exceeds this user-defined maximum file size. In the case where an input line is longer than the fragment size, this line will be output in a separate split file that will exceed the maximum fragment size.
- Add "Header Line Marker Character" property. Lines that begin with these user-defined character(s) will be considered header line(s) rather than a predetermined number of lines. The existing property "Header Line Count" must be zero for this new property and behavior to be used.
- Deprecated the "Remove Trailing Newlines" property.
- Fixed conditional that incorrectly suppressed splits where the content line count equaled the header line count and did not remove empty splits from the session.
- Minor formatting cleanup.
- Exclude test files from RAT check in pom.xml.
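The two new properties described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: the class name `SplitSketch` and the methods `countHeaderLines` and `splitBySize` are hypothetical, sizes are measured as UTF-8 bytes, and header detection is assumed to apply only to leading lines.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names come from SplitText itself.
final class SplitSketch {

    // Counts the leading lines that begin with the marker string.
    // Assumption: an empty marker means "no marker-based header".
    static int countHeaderLines(List<String> lines, String marker) {
        if (marker.isEmpty()) {
            return 0;
        }
        int count = 0;
        while (count < lines.size() && lines.get(count).startsWith(marker)) {
            count++;
        }
        return count;
    }

    // Groups lines into fragments. A new fragment starts when adding the next
    // line would push the current fragment past maxFragmentBytes. A single line
    // longer than the limit is emitted alone in an oversized fragment.
    static List<List<String>> splitBySize(List<String> lines, long maxFragmentBytes) {
        List<List<String>> fragments = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String line : lines) {
            long lineBytes = line.getBytes(StandardCharsets.UTF_8).length;
            if (!current.isEmpty() && currentBytes + lineBytes > maxFragmentBytes) {
                fragments.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(line);
            currentBytes += lineBytes;
        }
        if (!current.isEmpty()) {
            fragments.add(current);
        }
        return fragments;
    }
}
```

With a marker of "#" and a limit of 8 bytes, an input of two "#" lines followed by lines of 5, 3, 13, and 2 bytes would yield a two-line header and three body fragments, the second of which holds only the 13-byte line and exceeds the limit.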
@Tags({"split", "text"})
@InputRequirement(Requirement.INPUT_REQUIRED)
@CapabilityDescription("Splits a text file into multiple smaller text files on line boundaries, each having up to a configured number of lines")
//@CapabilityDescription("Splits a text file into multiple smaller text files on line boundaries, each having up to a configured number of lines")
The commented-out line should probably be removed :)
* Fix Pull Request issues.
}

numLines++;
if (totalBytes >= maxByteCount && numLines > maxNumLines) {
The logic here appears to be incorrect.
numLines is incremented on each iteration of the loop (unless we return before it is incremented).
This means that numLines <= i.
The loop's condition requires i < maxNumLines.
So numLines <= i < maxNumLines.
It is therefore always the case that numLines < maxNumLines, and this condition can never be satisfied because numLines will never be > maxNumLines.
From reading the code and doing a bit of testing, this does not appear to produce an incorrect result, since countBytesToSplitPoint handles the logic appropriately itself, but it should be fixed before this is merged.
Agreed. The "&& numLines > maxNumLines" condition can be removed.
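The point made in this thread can be checked with a small stand-alone sketch. It is not the SplitText code: the class `SplitConditionSketch`, both method names, and the `lineSizes` array are hypothetical, and the loop shape (a 0-based counter bounded by maxNumLines, with numLines incremented before the check) is assumed from the diff fragment and the review comment. Under that assumption numLines can reach maxNumLines but never exceed it, so the extra clause only prevents the byte check from ever firing.

```java
// Hypothetical illustration of the loop discussed in the review.
final class SplitConditionSketch {

    // Counts how often "numLines > maxNumLines" holds inside a loop that is
    // itself bounded by maxNumLines. Under the assumed loop shape this is 0.
    static int timesLineLimitExceeded(int maxNumLines) {
        int numLines = 0;
        int exceeded = 0;
        for (int i = 0; i < maxNumLines; i++) {
            numLines++;
            if (numLines > maxNumLines) {
                exceeded++;
            }
        }
        return exceeded;
    }

    // Simplified check with the redundant clause removed: stop as soon as the
    // accumulated byte count reaches the limit, or the line limit ends the loop.
    static int linesReadBeforeLimit(int[] lineSizes, long maxByteCount, int maxNumLines) {
        long totalBytes = 0;
        int numLines = 0;
        for (int i = 0; i < maxNumLines && i < lineSizes.length; i++) {
            totalBytes += lineSizes[i];
            numLines++;
            if (totalBytes >= maxByteCount) {
                break;
            }
        }
        return numLines;
    }
}
```

In this sketch the loop bound alone enforces the line limit, which is why dropping the second clause does not change which limit applies.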
* Fix Pull Request issues
- clean up conditional logic.
- countBytesToSplitPoint(): remove buffering if no output stream is provided and add null checks before I/O.
- remove unused method and super call in static class SplitInfo.
* Fix Pull Request issues - tests covering a couple of missed edge cases.
I think the issues are all covered now and tests have been expanded to get all critical logic and edge cases.
* Fix Pull Request issues - tests covering a couple of missed edge cases.
- add new test file to RAT exclude list.
@jskora do you think this PR can be closed now, given the updates made to fix the underlying defects found? A new PR could be submitted that adds the proposed features, or they could go into ReplaceText or a new processor.
Thanks @markobean and @jskora. What do you think about making ignore newlines only be honored/supported when not using the new features you're planning to include, or only in very specific configurations? I ask because this, admittedly mistaken, feature is used a lot. Ultimately, if that seems too unwieldy, we can punt that feature in 1.0, add your new capabilities, and support end-of-line removal in ReplaceText instead. We just need to remember to document this in the migration guide for 1.0, as this could cause some pretty funky behavior changes for folks. What do you think?
Yes, this should be closed. Done.