Long lines are corrupted when read by _readNextSlow #14
Oh. That's obviously not intentional. I'll be happy to fix that. Do you happen to have a simple reproduction? (I assume you encountered this with some code, so anything you have would help -- I will need a regression test at any rate, to ensure it won't break again.)
Reproduction is very simple: create a file with at least one line of length 40000 (or anything > 2 x 16000). The attached example file has a single line composed of 16000 '0', 16000 '1' and 15990 '2'. When you sort this file with TextFileSorter, which calls RawTextLineReader (and _readNextSlow), the result is a file with a single line containing only the '0's and '2's; the '1's have been discarded because they occupy the second 16000-byte block.
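Such a reproduction file can be generated with a few lines of code. The sketch below uses Python purely for illustration (the project itself is not a Python library), and the file name long_line.txt is an arbitrary choice:

```python
# Build one line longer than two buffer lengths (2 x 16000 bytes),
# laid out as described above: 16000 '0', 16000 '1', 15990 '2'.
line = b"0" * 16000 + b"1" * 16000 + b"2" * 15990

# "long_line.txt" is a hypothetical path; any writable location works.
with open("long_line.txt", "wb") as f:
    f.write(line + b"\n")
```

Feeding this file through the affected reader should reproduce the corruption: the middle block of '1's disappears because that chunk contains no CR/LF.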
How embarrassing. :) Thank you for reporting the issue; I added a simple test, fixed the issue, and will release 1.0.1 next.
Thanks for the quick fix.
RawTextFileReader corrupts lines that span more than 32000 bytes (twice the size of the allocated buffer).
When no CR/LF is found within the first 16000 bytes, _readNextSlow is called. This method has a while loop, but the loop ignores any 16000-byte chunk that does not contain a CR/LF, leading to corrupted input.
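To make the failure mode concrete, here is a minimal sketch of how a slow path should accumulate chunks until a terminator appears, with a comment marking where the reported bug dropped data. It is written in Python purely for illustration; the buffer size, stream API, and function name are assumptions, not the project's actual code.

```python
import io

BUFFER_SIZE = 16000  # assumed to match the reader's allocated buffer


def read_long_line(stream):
    """Read one line, keeping every chunk until a terminator is found."""
    pending = bytearray()
    while True:
        chunk = stream.read(BUFFER_SIZE)
        if not chunk:  # end of stream
            return bytes(pending) if pending else None
        newline = chunk.find(b"\n")
        if newline == -1:
            # The reported bug: chunks like this one, containing no CR/LF,
            # were effectively ignored instead of being kept as part of the
            # current line. The fix is simply to append them.
            pending += chunk
            continue
        pending += chunk[:newline]
        # A real reader would also retain chunk[newline + 1:] for the next
        # call; that bookkeeping is omitted to keep the sketch short.
        return bytes(pending)


# With the reproduction line described above, every byte survives.
line = b"0" * 16000 + b"1" * 16000 + b"2" * 15990
assert read_long_line(io.BytesIO(line + b"\n")) == line
```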