[BEAM-2802] Support multi-byte custom separator in TextIO #3779

echauchot · 2017-08-29T11:32:09Z

Follow this checklist to help us incorporate your contribution quickly and easily:

Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
Each commit in the pull request should have a meaningful subject line and body.
Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

coveralls · 2017-08-29T13:09:24Z

Coverage decreased (-0.02%) to 69.937% when pulling 16dda22 on echauchot:TextIO-Record-Separator into e33cc24 on apache:master.

coveralls · 2017-08-29T13:13:06Z

Coverage decreased (-0.01%) to 69.947% when pulling 16dda22 on echauchot:TextIO-Record-Separator into e33cc24 on apache:master.

coveralls · 2017-08-29T13:29:27Z

Coverage decreased (-0.03%) to 69.928% when pulling 16dda22 on echauchot:TextIO-Record-Separator into e33cc24 on apache:master.

jkff · 2017-08-29T21:38:12Z

sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOReadTest.java

+    PAssert.that(output).containsInAnyOrder(expected);
+    p.run();
+  }
+


Please add a few tests using SourceTestUtils.assertSplitAtFractionExhaustive, similar to testSplittingSourceWithMixedDelimitersAndNonEmptyBytesAtEnd and ones around it.

I tried adding one such test and it failed - implementing this properly is very tricky, especially if you allow arbitrary delimiters that can self-overlap, such as || in your example: how should we interpret abc|||xyz - as abc|, xyz - or as abc, |xyz? And how do we consistently enforce this interpretation if the file is split by the runner into chunks differently?
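To make the ambiguity concrete, here is a minimal sketch in plain Java (not Beam code). The class `DelimiterAmbiguityDemo` and its `split` helper are hypothetical names introduced for illustration; the `scanStart` parameter only imitates a reader that begins looking for delimiters at a different byte offset, as could happen when a runner splits the file differently.

```java
import java.util.ArrayList;
import java.util.List;

class DelimiterAmbiguityDemo {
  /**
   * Splits data on a non-empty delimiter, but only starts looking for
   * delimiters at scanStart. Everything before scanStart stays in the
   * first record.
   */
  public static List<String> split(String data, String delimiter, int scanStart) {
    List<String> records = new ArrayList<>();
    int recordStart = 0;
    int from = scanStart;
    while (true) {
      int hit = data.indexOf(delimiter, from);
      if (hit < 0) {
        // No more delimiters: the rest of the data is the last record.
        records.add(data.substring(recordStart));
        return records;
      }
      records.add(data.substring(recordStart, hit));
      recordStart = hit + delimiter.length();
      from = recordStart;
    }
  }

  public static void main(String[] args) {
    // Scanning from offset 0 matches the first two bars.
    System.out.println(split("abc|||xyz", "||", 0)); // [abc, |xyz]
    // Scanning from offset 4 matches the last two bars.
    System.out.println(split("abc|||xyz", "||", 4)); // [abc|, xyz]
  }
}
```

Both results are valid readings of the same bytes, which is why the outcome can depend on where the scan happens to start.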

echauchot · 2017-08-30

Good points, thanks for pointing me at them! I'll try to add those tests. It is tricky indeed. I understand that you did not have the overlap problem with \r and \n because any combination of \r and \n, no matter where we split, will result in a split and no \r or \n in the records. This is not the case with a fixed custom multi-byte separator, as you pointed out. I'll think about this problem.

echauchot · 2017-08-31

I have fixed the parser so that it works wherever the source is split, and I have added the corresponding test with the custom separator.
What I now have to support is the overlap you mentioned. Obviously I cannot rewind the offset by (separator.size + 1) to allow only a one-byte overlap, nor can I rewind it by (2*separator.size) to allow maximum overlap, because that might produce a duplicate record if record.size < separator.size. I also cannot detect that the file format is wrong in case of overlap, because no exception is thrown; I would just get record content that varies with the runner and the source split point. Do you have any idea how to support separator overlap?

jkff · 2017-08-31

How about simply prohibiting separators that can self-overlap? A separator can self-overlap, and hence cause ambiguous parsing, if it has a suffix that is also its prefix, i.e. there exists 0 < i < separator.length such that the last i bytes equal the first i bytes. We can simply check that in the "withSeparator" method using a quadratic algorithm, because the separator is unlikely to be more than a couple of bytes.
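The check described above could look roughly like the following sketch. It is written from the definition in this comment, not taken from the merged code; the class name `SelfOverlapCheck` and the method name `isSelfOverlapping` are placeholders chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

class SelfOverlapCheck {
  /**
   * Returns true if some proper, non-empty suffix of the delimiter is also
   * a prefix of it. Quadratic in the delimiter length.
   */
  public static boolean isSelfOverlapping(byte[] delimiter) {
    int n = delimiter.length;
    // i is the candidate overlap length: 0 < i < n.
    for (int i = 1; i < n; i++) {
      boolean matches = true;
      for (int j = 0; j < i; j++) {
        // Compare the first i bytes with the last i bytes.
        if (delimiter[j] != delimiter[n - i + j]) {
          matches = false;
          break;
        }
      }
      if (matches) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(isSelfOverlapping("||".getBytes(StandardCharsets.UTF_8)));   // true
    System.out.println(isSelfOverlapping("\r\n".getBytes(StandardCharsets.UTF_8))); // false
    System.out.println(isSelfOverlapping("abab".getBytes(StandardCharsets.UTF_8))); // true
  }
}
```

One consequence of this definition is that any delimiter of two or more bytes whose first and last bytes are equal is rejected, since the overlap length i = 1 already matches.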

echauchot · 2017-09-01

Yes, I like this, it's simple and efficient. Still, we have (odd) client use cases where the separator might be \n xxxxxx \n, but in the worst case 10 bytes, so a verification algorithm of O(10²) complexity that runs only once at pipeline construction time will do. Thanks.

echauchot · 2017-08-30T08:37:36Z

Thanks for your review once again Eugene!

coveralls · 2017-08-31T16:24:05Z

Coverage decreased (-0.002%) to 69.957% when pulling ce69935 on echauchot:TextIO-Record-Separator into e33cc24 on apache:master.

echauchot · 2017-09-01T08:52:37Z

rebased on master

coveralls · 2017-09-01T10:23:22Z

Coverage decreased (-0.02%) to 69.722% when pulling 43e9faa on echauchot:TextIO-Record-Separator into c9653f2 on apache:master.

echauchot · 2017-09-01T13:40:54Z

I added the check that the provided separator does not self-overlap. PTAL

coveralls · 2017-09-01T16:24:42Z

Coverage decreased (-0.004%) to 69.734% when pulling a01a231 on echauchot:TextIO-Record-Separator into c9653f2 on apache:master.

jkff

Thanks, will do a couple of minor touch-ups and merge.

jkff · 2017-09-01T19:08:35Z

Merged with changes:

Renamed separator to delimiter for consistency with previous terminology
More compact code for detecting self-overlap
More extensive exhaustive splitting test (code still passed the test without changes!)

echauchot · 2017-09-04T08:17:24Z

Thanks @jkff !

Yes, I had changed delimiter to separator because of the name of the method findSeparatorBounds(); as long as there is only one name, it's fine with me.
For the isSelfOverlaping() implementation: it is more compact and maintainable but a little less performant :) (if we measure complexity as the number of byte comparisons, the new implementation will always do 1+2+...+n comparisons, where n is the overlap length in bytes, or n = length - 1 if there is no overlap). But it is not important, as this method will run only once per text file. I agree that maintainability is more important than performance in that case, just kidding :)
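As a rough illustration of the comparison count mentioned above, the sketch below sums 1+2+...+n for a delimiter with no overlap, assuming every candidate overlap length i costs exactly i byte comparisons (no early exit). `OverlapCostSketch` and `worstCaseComparisons` are hypothetical names, and the cost model is an assumption, not a measurement of the merged implementation.

```java
class OverlapCostSketch {
  /**
   * Upper bound on byte comparisons for a delimiter of the given length,
   * assuming overlap length i always costs i comparisons and no overlap
   * is found, so every i from 1 to length - 1 is tried.
   */
  public static int worstCaseComparisons(int delimiterLength) {
    int total = 0;
    for (int i = 1; i < delimiterLength; i++) {
      total += i;
    }
    return total;
  }

  public static void main(String[] args) {
    // For the 10-byte case discussed earlier: 1 + 2 + ... + 9.
    System.out.println(worstCaseComparisons(10)); // 45
  }
}
```

Under that assumption the bound is n(n+1)/2 with n = length - 1, which stays tiny for delimiters of a few bytes.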

echauchot force-pushed the TextIO-Record-Separator branch from 974966a to fc5a627 Compare August 29, 2017 11:33

jkff reviewed Aug 29, 2017

[BEAM-2802] Support multi-byte custom separator in TextIO

a47ddd8

echauchot force-pushed the TextIO-Record-Separator branch from ce69935 to a47ddd8 Compare September 1, 2017 08:52

Checkstyle

43e9faa

Forbid using separators that can self-overlap

a01a231

jkff approved these changes Sep 1, 2017

asfgit closed this in b844126 Sep 1, 2017

echauchot deleted the TextIO-Record-Separator branch September 4, 2017 09:27
