[BEAM-2802] Support multi-byte custom separator in TextIO #3779
974966a to fc5a627
    PAssert.that(output).containsInAnyOrder(expected);
    p.run();
  }
Please add a few tests using SourceTestUtils.assertSplitAtFractionExhaustive, similar to testSplittingSourceWithMixedDelimitersAndNonEmptyBytesAtEnd and the ones around it.
I tried adding one such test and it failed. Implementing this properly is very tricky, especially if you allow arbitrary delimiters that can self-overlap, such as || in your example: how should we interpret abc|||xyz: as abc| and xyz, or as abc and |xyz? And how do we consistently enforce this interpretation if the runner splits the file into chunks differently?
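To make the ambiguity concrete, here is a minimal sketch (not code from this PR; the class and method names are hypothetical) that lists every offset at which a separator matches. When two matches overlap, a reader that starts scanning at a different offset can pick a different record boundary.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of TextIO. */
public final class OverlapAmbiguity {
  /** Returns every offset at which the separator occurs, including overlapping matches. */
  static List<Integer> matchOffsets(String data, String separator) {
    List<Integer> offsets = new ArrayList<>();
    // Advance by one character (not by separator length) so overlapping matches are found.
    for (int i = data.indexOf(separator); i >= 0; i = data.indexOf(separator, i + 1)) {
      offsets.add(i);
    }
    return offsets;
  }
}
```

For abc|||xyz with separator ||, this reports matches at offsets 3 and 4. Choosing offset 3 yields abc and |xyz; choosing offset 4 yields abc| and xyz.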
Good points, thanks for pointing them out! I'll try to add those tests. It is tricky indeed. I understand that you did not have the overlap problem with \r and \n because any combination of \r and \n, no matter where we split, results in a record boundary with no \r or \n left in the records. This is not the case with a fixed custom multi-byte separator, as you pointed out. I'll think about this problem.
I have fixed the parser to work no matter where the source split point is, and I have added the corresponding test with the custom separator.
What I still have to support is the overlap that you were talking about. Obviously I cannot rewind the offset by (separator.size + 1) to allow only a one-byte overlap, nor can I rewind the offset by (2*separator.size) to allow maximum overlap, because that might produce a duplicate record if record.size < separator.size. I also cannot detect that the file format is wrong in case of overlap, because I will get no exception, just flaky record content depending on the runner / source split point. Do you have any idea how to support overlapping separators?
How about simply prohibiting separators that can self-overlap? A separator can self-overlap, and hence cause ambiguous parsing, if it has a proper suffix that is also its prefix, i.e. there exists "i" with 0 < i < separator.length such that the last "i" bytes == the first "i" bytes. We can simply check that in the "withSeparator" method using a quadratic algorithm, because the separator is unlikely to be more than a couple of bytes.
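A minimal sketch of such a check, assuming the separator is available as a byte array. The class and method names are hypothetical, and this is not necessarily the code that was merged.

```java
/** Hypothetical helper for illustration only; not the merged TextIO code. */
public final class SeparatorCheck {
  /**
   * Returns true if, for some i with 0 < i < separator.length, the first i bytes
   * equal the last i bytes. Quadratic in the separator length.
   */
  static boolean hasSelfOverlap(byte[] separator) {
    for (int i = 1; i < separator.length; i++) {
      boolean match = true;
      for (int j = 0; j < i; j++) {
        // Compare prefix byte j with the corresponding byte of the length-i suffix.
        if (separator[j] != separator[separator.length - i + j]) {
          match = false;
          break;
        }
      }
      if (match) {
        return true;
      }
    }
    return false;
  }
}
```

Under this sketch, || and abcab are rejected as self-overlapping, while |# and any single-byte separator pass. A caller could throw an IllegalArgumentException at pipeline construction time when the method returns true.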
Yes, I like this; it's simple and efficient. Still, we have (odd) client use cases where the separator might be \n xxxxxx \n, but in the worst case it is 10 bytes, so a verification algorithm that is O(10²) and runs only once at pipeline construction time will do. Thanks.
Thanks for your review once again, Eugene!
ce69935 to a47ddd8
Rebased on master.
I added the check that the provided separator does not self-overlap. PTAL
Thanks, will do a couple of minor touch-ups and merge.
Merged with changes:
Thanks @jkff!
Follow this checklist to help us incorporate your contribution quickly and easily:
[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
R: @jkff