Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-2802] Support multi-byte custom separator in TextIO #3779

Closed
wants to merge 3 commits into from

Conversation

echauchot
Copy link
Contributor

@echauchot echauchot commented Aug 29, 2017

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

R: @jkff

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.02%) to 69.937% when pulling 16dda22 on echauchot:TextIO-Record-Separator into e33cc24 on apache:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.01%) to 69.947% when pulling 16dda22 on echauchot:TextIO-Record-Separator into e33cc24 on apache:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.03%) to 69.928% when pulling 16dda22 on echauchot:TextIO-Record-Separator into e33cc24 on apache:master.

PAssert.that(output).containsInAnyOrder(expected);
p.run();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a few tests using SourceTestUtils.assertSplitAtFractionExhaustive, similar to testSplittingSourceWithMixedDelimitersAndNonEmptyBytesAtEnd and ones around it.

I tried adding one such test and it failed - implementing this properly is very tricky, especially if you allow arbitrary delimiters that can self-overlap, such as || in your example: how should we interpret abc|||xyz - as abc|, xyz - or as abc, |xyz? And how do we consistently enforce this interpretation if the file is split by the runner into chunks differently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points, thanks for pointing me at them! I'll try to add those tests. It is tricky indeed. I understand that you did not have the overlap problem with \r and \n because any combination of \r and \n , no matter where we split, will result in a split and no \r or \n in the records. This is not the case with fixed custom multi-byte separator as you pointed. I'll think about this problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have fixed the parser to work no matter what the source split point is and I have added the corresponding test with the custom separator.
What I have now to support is the overlap that you were talking about. Obviously I cannot rewind the offset of (separator.size + 1) to allow only one byte overlap, neither can I rewind the offset of (2*sperator.size) to allow maximum overlap because it might produce duplicate record if a record.size < separator.size. I cannot either catch anything to state that the file format is wrong in case of overlap because I will get no exception, just flaky record content depending on the runner / source split point. Do you have any idea on how to support overlap of separator?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about simply prohibiting separators that can have self-overlap? A separator can have self-overlap, and hence cause ambiguous parsing, if it has a suffix that is also its prefix - i.e. there exists "i" < separator.length such that the last "i" bytes == the first "i" bytes - can simply check that in the "withSeparator" method using a quadratic algorithm cause separator is unlikely to be more than a couple of bytes.

Copy link
Contributor Author

@echauchot echauchot Sep 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I like this, it's simple and efficient. still, we have (odd) client use cases where separator might be \n xxxxxx \n but in worse cases 10 bytes, so adding a verification algorithm that is O(10²) complex but that is run only once at pipeline construction time will do. Thanks.

@echauchot
Copy link
Contributor Author

Thanks for your review once again Eugene!

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.002%) to 69.957% when pulling ce69935 on echauchot:TextIO-Record-Separator into e33cc24 on apache:master.

@echauchot
Copy link
Contributor Author

rebased on master

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.02%) to 69.722% when pulling 43e9faa on echauchot:TextIO-Record-Separator into c9653f2 on apache:master.

@echauchot
Copy link
Contributor Author

I did the check that provided separator does not self-overlaps. PTAL

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.004%) to 69.734% when pulling a01a231 on echauchot:TextIO-Record-Separator into c9653f2 on apache:master.

Copy link
Contributor

@jkff jkff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do a couple of minor touch-ups and merge.

@asfgit asfgit closed this in b844126 Sep 1, 2017
@jkff
Copy link
Contributor

jkff commented Sep 1, 2017

Merged with changes:

  • Renamed separator to delimiter for consistency with previous terminology
  • More compact code for detecting self-overlap
  • More extensive exhaustive splitting test (code still passed the test without changes!)

@echauchot
Copy link
Contributor Author

echauchot commented Sep 4, 2017

Thanks @jkff !

  • yes, I had moved delimiter to separator because of the name of the method findSeparatorBounds(), as soon as there is only one name, it's fine for me.
  • for the isSelfOverlaping() implementation: it is more compact and maintainable but a little less performant :) (if we mesure complexity using number of comparison of bytes the new implementation will always do 1+2+ ...+n comparison if overlaps is n bytes or n = length -1 if no-overlap) but it is not important as this method will be run only once per text file, I agree maintainability is more important than performance in that case, just kidding :)

@echauchot echauchot deleted the TextIO-Record-Separator branch September 4, 2017 09:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants