Iterator parser #97

vovapolu · 2016-07-21T19:27:07Z

This pull request adds a new kind of parsing, Iterator parsing, which can process streaming data without holding all of it in memory.

It introduces a new class, ParserInput, which can hold two kinds of data: an IndexedSeq or an Iterator of IndexedSeqs. This class is now used as the source for parsing, instead of the IndexedSeq or String used in previous versions. Its most interesting and important method is dropBuffer, which has special behavior in Iterator mode: it drops a "prefix" of the data from the inner buffer of ParserInput, reducing memory consumption during parsing. The details of how it works are described in the docs of ParserInput.
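To make the prefix-dropping idea concrete, here is a minimal sketch of an iterator-backed input. `BufferedInput` and all of its members are hypothetical names chosen for illustration; this is not the ParserInput class from the PR, whose API may differ.

```scala
// Hypothetical stand-in for an iterator-backed ParserInput; not fastparse's class.
final class BufferedInput(chunks: Iterator[String]) {
  private val buffer = new StringBuilder // elements currently held in memory
  private var firstIdx = 0               // absolute index of buffer(0)

  // Pull chunks until absolute index `index` is buffered or the input ends.
  private def fill(index: Int): Unit =
    while (index >= firstIdx + buffer.length && chunks.hasNext)
      buffer.append(chunks.next())

  // Read the element at an absolute index; dropped indices are an error.
  def apply(index: Int): Char = {
    require(index >= firstIdx, s"index $index was already dropped")
    fill(index)
    buffer.charAt(index - firstIdx)
  }

  // Forget everything before `index`, freeing that part of the buffer.
  def dropBuffer(index: Int): Unit =
    if (index > firstIdx) {
      val n = math.min(index - firstIdx, buffer.length)
      buffer.delete(0, n)
      firstIdx += n
    }

  def bufferSize: Int = buffer.length
}
```

In this sketch the buffer only ever holds the span between the last dropped index and the furthest index read so far, which is the property the PR relies on to bound memory.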

Another modification adds new flags to the ParseCtx class: isFork, isNoCut and isCapturing. They control when dropBuffer is called, and they change when parsing enters the corresponding parser unit. isFork is turned on in the Either, Optional and Repeat parsers, isNoCut in NoCut, and isCapturing in Capturing and in WhitespaceApi.CustomSequence (to completely block any dropBuffer).
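One plausible reading of how these flags gate dropBuffer is sketched below. `DropFlags`, `mayDrop` and the `enter*` helpers are hypothetical, and the rule "drop only when no flag is set" is a simplifying assumption; the actual conditions in the PR (for example how a cut interacts with isFork) may be more refined.

```scala
// Hypothetical, simplified stand-in for the flag part of ParseCtx.
final case class DropFlags(
  isFork: Boolean = false,
  isNoCut: Boolean = false,
  isCapturing: Boolean = false
) {
  // Assumed rule: a drop is safe only if no enclosing parser can backtrack
  // (fork, no-cut) or still needs the consumed input (capturing).
  def mayDrop: Boolean = !isFork && !isNoCut && !isCapturing
}

object DropFlagsExample {
  // Entering a parser unit switches its flag on for everything nested inside it.
  def enterEither(f: DropFlags): DropFlags = f.copy(isFork = true)
  def enterNoCut(f: DropFlags): DropFlags = f.copy(isNoCut = true)
  def enterCapturing(f: DropFlags): DropFlags = f.copy(isCapturing = true)
}
```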

Benchmarks:

ScalaParse performance test on master (in avg.)

The first number is the count of regular parses of a fairly big file within 10 seconds; the second is the same count for an invalid file, with tracing of the resulting Failure.

76 15

ScalaParse performance test on this PR (in avg.)

72 15
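The "count of parses within a time window" measurement used for these numbers can be sketched as follows. `Throughput.countRuns` is a hypothetical helper, not the code in perftests; it only illustrates the shape of the measurement and does no warm-up or averaging.

```scala
object Throughput {
  // Count how many times `run` completes before the time window closes.
  def countRuns(durationMillis: Long)(run: () => Any): Int = {
    val deadline = System.nanoTime() + durationMillis * 1000000L
    var n = 0
    while (System.nanoTime() < deadline) {
      run()
      n += 1
    }
    n
  }
}
```

A real benchmark would pass the parse call as `run` and use a 10 second window, for example `Throughput.countRuns(10000)(() => parser.parse(data))`, where `parser` and `data` are whatever is being measured.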

ScalaParse Iterator tests

The ScalaParse input was wrapped in an Iterator by dividing it into batches of varying size. Using this iterator, the number of passes within 10 seconds was measured as usual, along with the maximum size of the ParserInput inner buffer.

Size of batch	Avg. result	Max buffer size
1	48	1587
2	51	1587
4	50	1587
16	54	1587
64	57	1635
1024	58	2523
4096	57	5595
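The batching behind the table above can be sketched with the standard library's `grouped`. `BatchExample.batches` is a hypothetical helper name; the perf tests in the PR may build the iterator differently.

```scala
object BatchExample {
  // Split the input into consecutive batches of at most `batchSize` characters.
  def batches(input: String, batchSize: Int): Iterator[String] =
    input.grouped(batchSize)
}
```

For example, `BatchExample.batches("abcdefg", 3).toList` yields `List("abc", "def", "g")`; the last batch is shorter when the input length is not a multiple of the batch size.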

Distribution of buffer sizes during drops when parsing iterator data

The Iterator with batch size 1 was chosen for testing the distribution of the inner buffer size during parsing.

Buffer Length	Count
0	5139
1	17564
2	3456
3	1056
4	1525
5	421
6	1404
7	311
8	2382
9	1448
10	1638
11-33	6513
34-56	898
57-79	464
80-102	321
103-136	74
138-181	96
185-245	103
246-312	103
315-477	113
485-958	88
996-1587	22

lihaoyi · 2016-08-04T13:21:45Z

byteparse/jvm/src/test/scala/byteparse/LargeBmpIteratorTests.scala

+import BmpParser._
+
+object LargeBmpIteratorTests extends TestSuite {
+  val url = "https://raw.githubusercontent.com/lihaoyi/fastparse/master/byteparse/jvm/src/test/resources/lena.bmp"


Could we read this off disk as an InputStream rather than reading off the web? Otherwise we'll end up in odd situations where the tests pass on the PR, but after you commit it the repo on GitHub changes and the tests fail, or vice versa.
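A minimal sketch of what reading from an InputStream in chunks could look like, using only `java.io`. `StreamChunks.chunks` is a hypothetical helper, not part of the PR; how its output would be handed to the parser is left open here.

```scala
import java.io.{ByteArrayInputStream, InputStream}

object StreamChunks {
  // Lazily read `in` in chunks of at most `chunkSize` bytes until end of stream.
  def chunks(in: InputStream, chunkSize: Int): Iterator[Array[Byte]] = {
    require(chunkSize > 0, "chunkSize must be positive")
    val buf = new Array[Byte](chunkSize)
    Iterator
      .continually(in.read(buf))                 // number of bytes read, or -1 at end
      .takeWhile(_ != -1)
      .map(n => java.util.Arrays.copyOf(buf, n)) // copy, since `buf` is reused
  }
}
```

In a test, the stream could come from a file on disk, for example a `java.io.FileInputStream` opened on the checked-in `lena.bmp` resource (the exact path relative to the test's working directory is an assumption), or from a `ByteArrayInputStream` for small in-memory cases.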

lihaoyi · 2016-08-04T13:56:48Z

Left some high-level notes of issues we should fix before doing a more detailed in-the-weeds review.

As part of this review, I'd like you to run some performance benchmarks on various parts of the design space and post the results here, for future reference. For example:

What perf impact does this diff have on non-iterator String/Array[Byte] parsing? We have extra code in some of the main code paths setting flags, after all, and I'd like to know if, and by how much, it slowed things down.
What perf impact do the various chunk sizes in Iterator[T] parsing have? E.g. for chunk sizes 1, 2, 4, 8... 512, 1024, 2048 chars/bytes, on streaming ScalaParse, CssParse, BmpParse and JsonParse? And how does it compare to the non-Iterator parsing performance?
What is the distribution of lengths of the buffer when dropBuffer is called, for ScalaParse, CssParse, BmpParse, ClassParse (?), for the tiniest possible buffer (1 byte/char)?

While tabulating this information is not strictly part of the "code", it would really help us understand and review the actual behavioral characteristics of this change, and will serve as a baseline of facts that future discussions around fastparse can use.
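Tabulating the distribution of buffer lengths could be done along these lines, assuming the lengths observed at each dropBuffer call have already been collected into a sequence. `DropHistogram.histogram` is a hypothetical helper name.

```scala
object DropHistogram {
  // Count how often each buffer length was observed, sorted by length.
  def histogram(lengths: Seq[Int]): Seq[(Int, Int)] =
    lengths
      .groupBy(identity)
      .map { case (len, xs) => len -> xs.size }
      .toSeq
      .sortBy(_._1)
}
```

Larger lengths would typically be grouped into ranges before printing, which this sketch leaves out.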

lihaoyi · 2016-08-04T14:27:26Z

fastparse/shared/src/test/scala/fastparse/iterator/IteratorTests.scala

+import fastparse.all._
+import utest._
+
+object IteratorTests extends TestSuite {


I don't think the unit tests we have here are sufficient; we should add some unit tests that (or modify these to) subclass ParserInput and hook into dropBuffer in order to verify that dropBuffer indeed gets called at exactly the right moments within these tests, and that the buffer does not grow past whatever size we think it should not grow past.

That will let us be confident that we're actually being as streaming as we should be, and not accidentally loading the whole thing into memory before dealing with it.
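The kind of hook described here might look like the sketch below. Both classes are hypothetical: `ChunkedInput` is a minimal stand-in for ParserInput (whose real constructor and members may differ), and `RecordingInput` is the test double that overrides dropBuffer to record each call.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical minimal input standing in for ParserInput.
class ChunkedInput(chunks: Iterator[String]) {
  protected val buffer = new StringBuilder
  protected var firstIdx = 0 // absolute index of buffer(0)

  def read(index: Int): Char = {
    while (index >= firstIdx + buffer.length && chunks.hasNext)
      buffer.append(chunks.next())
    buffer.charAt(index - firstIdx)
  }

  def dropBuffer(index: Int): Unit = {
    val n = math.max(0, math.min(index - firstIdx, buffer.length))
    buffer.delete(0, n)
    firstIdx += n
  }
}

// Test double: records (index, buffer length just before the drop) per call.
class RecordingInput(chunks: Iterator[String]) extends ChunkedInput(chunks) {
  val drops = ArrayBuffer.empty[(Int, Int)]

  override def dropBuffer(index: Int): Unit = {
    drops += ((index, buffer.length))
    super.dropBuffer(index)
  }

  // Largest buffer length observed at any drop; 0 if nothing was dropped.
  def maxBufferAtDrop: Int = if (drops.isEmpty) 0 else drops.map(_._2).max
}
```

A unit test can then assert both on the exact indices at which dropBuffer was called and on an upper bound for `maxBufferAtDrop`.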

lihaoyi · 2016-08-16T12:29:17Z

perftests/shared/src/main/scala/perftests/Utils.scala

+            () => parser.parse(dataFail).asInstanceOf[Parsed.Failure[ElemType]].extra.traced)).flatten)
+    println(results.map(_.mkString(" ")).mkString("\n"))
+
+    val sizes = Seq(1, 2,/* 4, 16, 64,*/ 1024/*, 4096*/)


Can we leave all of these un-commented?

vovapolu: We can, but it could take a lot of time to do all of these benchmarks.

lihaoyi: Oh ok, let's leave them out then.

lihaoyi · 2016-08-16T12:39:34Z

fastparse/shared/src/test/scala/fastparse/iterator/IteratorTests.scala

+        val Parsed.Success(res, _) = all.parseInput(input)
+        assert(res == "abded")
+      }
+    }


Could we add some small test cases to test the behavior of dropBuffer using the WhitespaceApi? Given that it was the cause of some bugs and frustration, we need to make sure that it keeps working.

As part of that, I would like a test case for the whitespace-ignoring a ~ (b ~ (Pass ~/ Pass)) case. As discussed in chat, this currently would not correctly dropBuffer at the ~/. Regardless, we should still verify that fact, and if/when we decide to fix this behavior, we can update the test.

vovapolu changed the title from "Iterator Parser" to "**WIP** Iterator Parser" Jul 21, 2016

vovapolu changed the title from "**WIP** Iterator Parser" to "WIP Iterator Parser" Jul 21, 2016

lihaoyi mentioned this pull request Jul 24, 2016

true stream parsing #94

Closed

Added ParseInput and changed the error messages format

384bd71

vovapolu force-pushed the master branch from cbfc984 to 384bd71 Compare July 24, 2016 13:46

vovapolu changed the title from "WIP Iterator Parser" to "New error messages format" Jul 24, 2016

Vladimir Polushin and others added 4 commits July 24, 2016 18:49

Fixed tests and added pretty index in log messages

cc1f223

Reverted tests

3f64ec4

Initial commit for Iterator parser

62092ad

Added basic unit-tests

45852a1

vovapolu changed the title from "New error messages format" to "Iterator parser" Jul 28, 2016

lihaoyi reviewed Aug 4, 2016

vovapolu added 7 commits August 6, 2016 17:00

Some fixes and docs for ParserInput

8a76b8d

Added performance iterator tests and fixed main parsers so that they require less memory during parsing

c5e0208

Uncommented perf tests

8c47124

Made the parse method accept original type instead of IndexedSeq

f14d26b

Extented iterator tests and fixed WhitespaceApi bug with early dropBuffer

600cb3d

Updated the docs of ParserInput

427c2c3

Fixed WhitespaceApi.CustomSequence bug by adding checks for 0-sized dropBuffer

ff75669

lihaoyi reviewed Aug 16, 2016

Added whitespaceApi unit-tests to the Iterator Tests and minor fixes

05bb2b5

vovapolu merged commit 1b7f5ff into com-lihaoyi:master Aug 17, 2016

vovapolu mentioned this pull request Aug 17, 2016

Zero elements check in Iterator Parsing #102

Closed
