Skip to content

Conversation

@jskora
Copy link
Contributor

@jskora jskora commented Mar 15, 2016

  • Add "Maximum Fragment Size" property. A new split file will be created if the next line to be added to the current split file exceeds this user-defined maximum file size. In the case where an input line is longer than the fragment size, this line will be output in a separate split file that will exceed the maximum fragment size.
  • Add "Header Line Marker Character" property. Lines that begin with these user-defined character(s) will be considered header line(s) rather than a predetermined number of lines. The existing property "Header Line Count" must be zero for this new property and behavior to be used.
  • Deprecated the "Remove Trailing Newlines" property.
  • Fixed conditional that incorrectly suppressed splits where the content line count equaled the header line count and did not remove empty splits from the session.
  • Minor formatting cleanup.
  • Exclude test files from RAT check in pom.xml.

…its and header line markers.

* Add "Maximum Fragment Size" property.  A new split file will be created if the next line to be added to the current split file exceeds this user-defined maximum file size.  In the case where an input line is longer than the fragment size, this line will be output in a separate split file that will exceed the maximum fragment size.
* Add "Header Line Marker Character" property.  Lines that begin with these user-defined character(s) will be considered header line(s) rather than a predetermined number of lines.  The existing property "Header Line Count" must be zero for this new property and behavior to be used.
* Fixed conditional that incorrectly suppressed splits where the content line count equaled the header line count and did not remove empty splits from the session.
* Minor formatting cleanup.
* Exclude test files from RAT check in pom.xml.
@Tags({"split", "text"})
@InputRequirement(Requirement.INPUT_REQUIRED)
@CapabilityDescription("Splits a text file into multiple smaller text files on line boundaries, each having up to a configured number of lines")
//@CapabilityDescription("Splits a text file into multiple smaller text files on line boundaries, each having up to a configured number of lines")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commented-out line should probably be removed :)

* Fix Pull Request issues.
}

numLines++;
if (totalBytes >= maxByteCount && numLines > maxNumLines) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here appears to be incorrect.
numLines is incremented for each iteration of the loop (unless we return before it is incremented).
This means that numLines <= i
The loop's condition indicates i < maxNumLines
So numLines <= i < maxNumLines
So it is always the case that numLines < maxNumLines, so this condition will never be satisfied because numLines will never be > maxNumLines

Now, looking through the code and doing a bit of testing, this does not appear to return an incorrect result, since countBytesToSplitPoint will handle the logic appropriately itself, but this should be fixed before it is merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. "&& numLines > maxNumLines" condition can be removed.

jskora added 2 commits March 16, 2016 10:35
* Fix Pull Request issues
   - cleanup conditionals logic.
   - countBytesToSplitPoint() - remove buffering if no output stream provided and add null checks before I/O.
   - remove unused method and super call in static class SplitInfo.
* Fix Pull Request issues
   - tests to covering a couple missed edge cases.
@jskora
Copy link
Contributor Author

jskora commented Mar 16, 2016

I think the issues are all covered now and tests have been expanded to get all critical logic and edge cases.

jskora added 2 commits March 16, 2016 12:29
* Fix Pull Request issues
   - tests to covering a couple missed edge cases.
   - add new test file to RAT exclude list.
@joewitt
Copy link
Contributor

joewitt commented Apr 19, 2016

@jskora do you think this PR can be closed now given the updates made to fix the underlying defects found? A new PR could be submitted which adds the proposed features or goes into ReplaceText or a new processor.

@markobean
Copy link
Contributor

@joewitt Talked with @jskora and concur this PR can be closed. A new PR will be opened later with the new features and also with the Return Trailing Newlines bug fix included (if RTN is still included.)

@joewitt
Copy link
Contributor

joewitt commented Apr 19, 2016

thanks @markobean and @jskora . What do you think about making ignore newlines only be honored/supported when not using the new features you're planning to include or only in very specific configurations? I ask because this, admittedly mistaken, feature is used a lot. Ultimately if that seems to unwieldy we can punt that feature in 1.0, add your new capabilities, and support end of line removal on ReplaceText instead. We just need to remember to document this in the migration guide for 1.0 as this could cause some pretty funky behavior changes for folks.

What do you think?

@mosermw
Copy link
Member

mosermw commented Jun 1, 2016

@jskora Are we closing this PR? It appears the latest PR for NIFI-1118 is PR#444

@jskora
Copy link
Contributor Author

jskora commented Jun 22, 2016

Yes, this should be closed. Done.

@jskora jskora closed this Jun 22, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants