Long lines are corrupted when read by _readNextSlow #14
Oh. That's obviously not intentional. I'll be happy to fix that. Do you happen to have a simple reproduction? (I assume you encountered this with some code, so anything you have would help -- I will need a regression test at any rate, to ensure it won't break again.)
Reproduction is very simple: create a file with at least one line of length 40000 (or anything > 2 x 16000). The attached example file has a single line composed of 16000 '0', 16000 '1' and 15990 '2'. When you sort this file with TextFileSorter, which calls RawTextLineReader (and _readNextSlow), the result is a file with a single line containing only the '0's and '2's; the '1's have been discarded because they occupy the second 16000-byte block.
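Such a reproduction file can be generated with a few lines of code. The sketch below uses Python purely for illustration (the project itself is not a Python library), and the file name long_line.txt is an arbitrary choice:

```python
# Build one line longer than two buffer lengths (2 x 16000 bytes),
# laid out as described above: 16000 '0', 16000 '1', 15990 '2'.
line = b"0" * 16000 + b"1" * 16000 + b"2" * 15990

# "long_line.txt" is a hypothetical path; any writable location works.
with open("long_line.txt", "wb") as f:
    f.write(line + b"\n")
```

Feeding this file through the affected reader should reproduce the corruption: the middle block of '1's disappears because that chunk contains no CR/LF.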
How embarrassing. :) Thank you for reporting the issue; I added a simple test, fixed the issue, and will release 1.0.1 next.
Thanks for the quick fix.
RawTextFileReader corrupts lines that span more than 32000 bytes (twice the size of the allocated buffer).
When no CR/LF is found within the first 16000 bytes, _readNextSlow is called. This method has a while loop, but the loop ignores any 16000-byte chunk that does not contain a CR/LF, leading to corrupted input.
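To make the failure mode concrete, here is a minimal sketch of how a slow path should accumulate chunks until a terminator appears, with a comment marking where the reported bug dropped data. It is written in Python purely for illustration; the buffer size, stream API, and function name are assumptions, not the project's actual code.

```python
import io

BUFFER_SIZE = 16000  # assumed to match the reader's allocated buffer


def read_long_line(stream):
    """Read one line, keeping every chunk until a terminator is found."""
    pending = bytearray()
    while True:
        chunk = stream.read(BUFFER_SIZE)
        if not chunk:  # end of stream
            return bytes(pending) if pending else None
        newline = chunk.find(b"\n")
        if newline == -1:
            # The reported bug: chunks like this one, containing no CR/LF,
            # were effectively ignored instead of being kept as part of the
            # current line. The fix is simply to append them.
            pending += chunk
            continue
        pending += chunk[:newline]
        # A real reader would also retain chunk[newline + 1:] for the next
        # call; that bookkeeping is omitted to keep the sketch short.
        return bytes(pending)


# With the reproduction line described above, every byte survives.
line = b"0" * 16000 + b"1" * 16000 + b"2" * 15990
assert read_long_line(io.BytesIO(line + b"\n")) == line
```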