NIFI-1118 Update SplitText Processor - add support for split size limits and header line markers. #280

jskora · 2016-03-15T15:37:45Z

Add "Maximum Fragment Size" property. A new split file is started when adding the next line would push the current split file past this user-defined maximum size. If a single input line is longer than the fragment size, it is written to its own split file, which will exceed the maximum fragment size.
Add "Header Line Marker Character" property. Lines that begin with these user-defined character(s) will be considered header line(s) rather than a predetermined number of lines. The existing property "Header Line Count" must be zero for this new property and behavior to be used.
Deprecated the "Remove Trailing Newlines" property.
Fixed conditional that incorrectly suppressed splits where the content line count equaled the header line count and did not remove empty splits from the session.
Minor formatting cleanup.
Exclude test files from RAT check in pom.xml.
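
A rough illustration of the two new behaviors described above, not the code in this PR. The class and method names (`SplitSketch`, `splitBySize`, `countHeaderLines`) are hypothetical and do not come from SplitText.java. The sketch assumes sizes are measured in UTF-8 bytes including the line terminator, and that only consecutive leading lines starting with the marker count as header lines.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of size-limited splitting and marker-based headers. */
class SplitSketch {

    /** Groups lines into fragments that stay within maxFragmentBytes where possible. */
    static List<List<String>> splitBySize(List<String> lines, long maxFragmentBytes) {
        List<List<String>> fragments = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String line : lines) {
            long lineBytes = line.getBytes(StandardCharsets.UTF_8).length;
            // Start a new fragment if this line would push the current one past the limit.
            if (!current.isEmpty() && currentBytes + lineBytes > maxFragmentBytes) {
                fragments.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            // A line longer than the limit is still emitted; it ends up alone in an
            // oversized fragment because the next line always triggers a new fragment.
            current.add(line);
            currentBytes += lineBytes;
        }
        if (!current.isEmpty()) {
            fragments.add(current);
        }
        return fragments;
    }

    /** Counts the leading lines that begin with the marker (assumed header rule). */
    static int countHeaderLines(List<String> lines, String marker) {
        if (marker.isEmpty()) {
            return 0; // no marker configured, so no marker-based header
        }
        int count = 0;
        for (String line : lines) {
            if (!line.startsWith(marker)) {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "#name,value\n", "aaaa\n", "bbbb\n", "cccccccccccccc\n", "d\n");
        int headerLines = countHeaderLines(lines, "#");
        List<String> body = lines.subList(headerLines, lines.size());
        System.out.println("header lines: " + headerLines);
        System.out.println("fragments: " + splitBySize(body, 10));
    }
}
```

With a 10-byte limit, the two 5-byte lines share a fragment, while the 15-byte line is emitted alone in a fragment that exceeds the limit.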


markap14 · 2016-03-15T20:21:59Z

...le/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java

 @Tags({"split", "text"})
 @InputRequirement(Requirement.INPUT_REQUIRED)
-@CapabilityDescription("Splits a text file into multiple smaller text files on line boundaries, each having up to a configured number of lines")
+//@CapabilityDescription("Splits a text file into multiple smaller text files on line boundaries, each having up to a configured number of lines")


The commented-out line should probably be removed :)

* Fix Pull Request issues.

markap14 · 2016-03-16T12:55:31Z

...le/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java

            }
-
            numLines++;
+            if (totalBytes >= maxByteCount && numLines > maxNumLines) {


The logic here appears to be incorrect.
numLines is incremented for each iteration of the loop (unless we return before it is incremented).
This means that numLines <= i.
The loop's condition indicates i < maxNumLines.
So numLines <= i < maxNumLines.
It is therefore always the case that numLines < maxNumLines, and this condition will never be satisfied because numLines will never be > maxNumLines.

Now, looking through the code and doing a bit of testing, this does not appear to return an incorrect result, since countBytesToSplitPoint will handle the logic appropriately itself, but this should be fixed before it is merged.

markobean · 2016-03-16

Agreed. The "&& numLines > maxNumLines" condition can be removed.
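
The argument in this thread can be checked with a small sketch. The loop shape below is assumed from the review comment (a counter bounded by `i < maxNumLines`, with `numLines` incremented at most once per iteration); it is not copied from SplitText.java, and `DeadConditionSketch` and `timesConditionHolds` are hypothetical names.

```java
/** Hypothetical sketch: shows that numLines > maxNumLines cannot hold inside the loop. */
class DeadConditionSketch {

    /** Returns how often the suspect condition is true for a given maxNumLines. */
    static int timesConditionHolds(int maxNumLines) {
        int hits = 0;
        int numLines = 0;
        for (int i = 0; i < maxNumLines; i++) {
            // At most one increment per iteration, so numLines <= i + 1 <= maxNumLines.
            numLines++;
            if (numLines > maxNumLines) {
                hits++; // unreachable under the assumptions above
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (int max = 0; max <= 1000; max++) {
            if (timesConditionHolds(max) != 0) {
                throw new AssertionError("condition held for maxNumLines=" + max);
            }
        }
        System.out.println("condition never held");
    }
}
```

Because the second operand of the `&&` is never true under these assumptions, the whole conjunction is dead code, which is why dropping it leaves a check on `totalBytes >= maxByteCount` alone.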

* Fix Pull Request issues
- cleanup conditionals logic.
- countBytesToSplitPoint() - remove buffering if no output stream provided and add null checks before I/O.
- remove unused method and super call in static class SplitInfo.

* Fix Pull Request issues - tests covering a couple of missed edge cases.

jskora · 2016-03-16T15:42:13Z

I think the issues are all covered now, and the tests have been expanded to cover all critical logic and edge cases.

* Fix Pull Request issues - tests covering a couple of missed edge cases.

- add new test file to RAT exclude list.

joewitt · 2016-04-19T04:39:23Z

@jskora do you think this PR can be closed now given the updates made to fix the underlying defects found? A new PR could be submitted which adds the proposed features or goes into ReplaceText or a new processor.

markobean · 2016-04-19T12:59:54Z

@joewitt Talked with @jskora and concur this PR can be closed. A new PR will be opened later with the new features and also with the Remove Trailing Newlines bug fix included (if RTN is still included).

joewitt · 2016-04-19T14:43:57Z

Thanks @markobean and @jskora. What do you think about making ignore newlines only be honored/supported when not using the new features you're planning to include, or only in very specific configurations? I ask because this, admittedly mistaken, feature is used a lot. Ultimately, if that seems too unwieldy, we can punt that feature in 1.0, add your new capabilities, and support end-of-line removal on ReplaceText instead. We just need to remember to document this in the migration guide for 1.0, as this could cause some pretty funky behavior changes for folks.

What do you think?

mosermw · 2016-06-01T13:52:07Z

@jskora Are we closing this PR? It appears the latest PR for NIFI-1118 is PR#444

jskora · 2016-06-22T12:35:11Z

Yes, this should be closed. Done.

markap14 reviewed Mar 15, 2016

NIFI-1118 Update SplitText Processor

eaef479

* Fix Pull Request issues.

markap14 reviewed Mar 16, 2016

jskora added 2 commits March 16, 2016 10:35

NIFI-1118 Update SplitText Processor

ab79757

* Fix Pull Request issues
- cleanup conditionals logic.
- countBytesToSplitPoint() - remove buffering if no output stream provided and add null checks before I/O.
- remove unused method and super call in static class SplitInfo.

NIFI-1118 Update SplitText Processor

3d4da30

* Fix Pull Request issues - tests covering a couple of missed edge cases.

jskora added 2 commits March 16, 2016 12:29

NIFI-1118 Update SplitText Processor

cbaa7e9

* Fix Pull Request issues - tests covering a couple of missed edge cases.

NIFI-1118 Update SplitText Processor

f849d0b

- add new test file to RAT exclude list.

jskora closed this Jun 22, 2016
