Iterator parser #97

Merged
merged 13 commits into from
Aug 17, 2016

vovapolu
Collaborator

@vovapolu vovapolu commented Jul 21, 2016

This pull request adds a new type of parsing, Iterator parsing, which can process streaming data without holding all of it in memory.

It introduces a new class, ParserInput, which can hold two kinds of data: an IndexedSeq or an Iterator of IndexedSeqs. This class is now used as the source for parsing, instead of the IndexedSeq or String used in previous versions. Its most interesting and important method is dropBuffer, which has special behavior in Iterator mode: it drops a "prefix" of the data from the inner buffer of ParserInput, reducing memory consumption during parsing. The details of how it works are described in the ParserInput docs.
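
To make the buffer-dropping idea concrete, here is a minimal self-contained sketch. It is not the ParserInput from this PR: the class name `SketchIteratorInput` and the members `isReachable` and `bufferSize` are hypothetical, and the input is assumed to be an `Iterator[String]` for simplicity.

```scala
// Hypothetical sketch, not fastparse's ParserInput. All names are illustrative.
final class SketchIteratorInput(chunks: Iterator[String]) {
  private val buffer = new StringBuilder // elements currently held in memory
  private var firstIdx = 0               // absolute index of the first buffered element

  /** Pull chunks until absolute index `index` is buffered or the input runs out. */
  private def fill(index: Int): Unit =
    while (index >= firstIdx + buffer.length && chunks.hasNext)
      buffer.append(chunks.next())

  /** True if absolute index `index` lies within the input. */
  def isReachable(index: Int): Boolean = {
    fill(index)
    index < firstIdx + buffer.length
  }

  /** Element at absolute index `index`; fails if that index was already dropped. */
  def apply(index: Int): Char = {
    require(index >= firstIdx, s"index $index was already dropped")
    fill(index)
    buffer.charAt(index - firstIdx)
  }

  /** Forget everything before `index`; the caller promises not to backtrack past it. */
  def dropBuffer(index: Int): Unit =
    if (index > firstIdx) {
      val n = math.min(index - firstIdx, buffer.length)
      buffer.delete(0, n)
      firstIdx += n
    }

  /** Number of elements currently held in memory. */
  def bufferSize: Int = buffer.length
}
```

With `"abcdef".grouped(2)` as input, reading index 3 buffers four characters, and `dropBuffer(3)` leaves only one in memory while index 3 and later remain readable.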

Another modification adds new flags to the ParseCtx class: isFork, isNoCut and isCapturing. They control when dropBuffer is called, and they are set when parsing enters the corresponding parser: isFork is turned on in the Either, Optional and Repeat parsers, isNoCut in NoCut, and isCapturing in Capturing and in WhitespaceApi.CustomSequence (to completely block any dropBuffer).
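
As a rough illustration of how such flags could gate dropBuffer, consider the sketch below. The flag names come from the description above, but `SketchCtx`, `SketchDropRule`, the `afterCut` parameter and the exact rule in `mayDrop` are assumptions made for this example, not the logic in this PR.

```scala
// Hypothetical sketch. The flag names mirror the description; the rule is an assumption.
final case class SketchCtx(
  isFork: Boolean = false,      // inside Either, Optional or Repeat: another branch may backtrack
  isNoCut: Boolean = false,     // inside NoCut: cuts are ignored
  isCapturing: Boolean = false  // inside Capturing: the captured slice must stay buffered
)

object SketchDropRule {
  /** Assumed rule: never drop while capturing; inside a fork, drop only after an effective cut. */
  def mayDrop(ctx: SketchCtx, afterCut: Boolean): Boolean = {
    val effectiveCut = afterCut && !ctx.isNoCut
    !ctx.isCapturing && (effectiveCut || !ctx.isFork)
  }
}
```

A parser entering a fork would then pass `ctx.copy(isFork = true)` to its children, so the flag is scoped to that subtree.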

Benchmarks:

ScalaParse performance test on master (in avg.)

The first number is the count of regular parses of a fairly large file completed within 10 seconds; the second is the same count for an invalid file, including tracing of the resulting Failure.

76 15

ScalaParse performance test on this PR (in avg.)

72 15 
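
For reference, the "count of parses within a time window" measurement can be sketched roughly as follows. This is not the benchmark harness used in the repo: `BenchSketch`, `countRuns` and the injectable `clock` parameter are hypothetical names for this example.

```scala
// Hypothetical sketch of a "runs completed within a time window" benchmark.
object BenchSketch {
  /** Count how many times `run` completes before `durationMs` of `clock` time has elapsed. */
  def countRuns(durationMs: Long, clock: () => Long = () => System.currentTimeMillis())(
      run: () => Unit): Int = {
    val deadline = clock() + durationMs
    var count = 0
    while (clock() < deadline) {
      run()
      count += 1
    }
    count
  }
}
```

Calling `BenchSketch.countRuns(10000)(() => runOneParse())`, where `runOneParse` is a placeholder for a single parse of the test file, would give a number comparable in spirit to the ones above.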

ScalaParse Iterator tests

The ScalaParse input was wrapped in an Iterator by splitting it into batches of varying size. Using this iterator, the number of passes completed within 10 seconds was measured as before, along with the maximum size of the ParserInput inner buffer.
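
Splitting an input into fixed-size batches can be done with the standard library alone. The helper below is an illustrative sketch (`BatchSketch` and `batches` are hypothetical names), not the code used for these measurements.

```scala
// Hypothetical helper: wrap a String in an Iterator of fixed-size batches.
object BatchSketch {
  /** Split `input` into consecutive batches of at most `batchSize` characters. */
  def batches(input: String, batchSize: Int): Iterator[String] =
    input.grouped(batchSize)
}
```

The last batch may be shorter than `batchSize`, and concatenating all batches gives back the original input.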

| Size of batch | Avg. result | Max buffer size |
|---|---|---|
| 1 | 48 | 1587 |
| 2 | 51 | 1587 |
| 4 | 50 | 1587 |
| 16 | 54 | 1587 |
| 64 | 57 | 1635 |
| 1024 | 58 | 2523 |
| 4096 | 57 | 5595 |

Distribution of buffer size at each drop when parsing iterator data

An Iterator with a batch size of 1 was used to measure the distribution of the inner buffer size during parsing.

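| Buffer length | Count |
|---|---|
| 0 | 5139 |
| 1 | 17564 |
| 2 | 3456 |
| 3 | 1056 |
| 4 | 1525 |
| 5 | 421 |
| 6 | 1404 |
| 7 | 311 |
| 8 | 2382 |
| 9 | 1448 |
| 10 | 1638 |
| 11-33 | 6513 |
| 34-56 | 898 |
| 57-79 | 464 |
| 80-102 | 321 |
| 103-136 | 74 |
| 138-181 | 96 |
| 185-245 | 103 |
| 246-312 | 103 |
| 315-477 | 113 |
| 485-958 | 88 |
| 996-1587 | 22 |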
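
A distribution like this can be tabulated from the buffer lengths recorded at each drop. The sketch below shows one way to do the counting; `HistogramSketch` is a hypothetical name, and how the lengths are recorded in the first place is left out.

```scala
// Hypothetical sketch: count how often each buffer length was observed.
object HistogramSketch {
  /** Pairs of (buffer length, number of observations), sorted by length. */
  def histogram(lengths: Seq[Int]): Seq[(Int, Int)] =
    lengths
      .groupBy(identity)
      .map { case (len, hits) => len -> hits.size }
      .toSeq
      .sortBy(_._1)
}
```

Wider buckets such as 11-33 would need an extra step that maps each length to its bucket before grouping.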

@vovapolu vovapolu changed the title Iterator Parser **WIP** Iterator Parser Jul 21, 2016
@vovapolu vovapolu changed the title **WIP** Iterator Parser WIP Iterator Parser Jul 21, 2016
@lihaoyi lihaoyi mentioned this pull request Jul 24, 2016
@vovapolu vovapolu changed the title WIP Iterator Parser New error messages format Jul 24, 2016
@vovapolu vovapolu changed the title New error messages format Iterator parser Jul 28, 2016
import BmpParser._

object LargeBmpIteratorTests extends TestSuite {
val url = "https://raw.githubusercontent.com/lihaoyi/fastparse/master/byteparse/jvm/src/test/resources/lena.bmp"
Member

Could we read this off disk as an InputStream rather than reading it off the web? Otherwise we'll end up in odd situations where the tests pass on the PR, but after you commit it the repo on GitHub changes and the tests fail, or vice versa.
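
One way to turn an InputStream into an iterator of byte chunks, sketched with the standard library only. `StreamChunks` and `chunks` are hypothetical names, and loading the file as a classpath resource (for example via `getClass.getResourceAsStream`) is an assumption about how the test would obtain the stream.

```scala
import java.io.InputStream

// Hypothetical helper: read an InputStream lazily as chunks of at most `chunkSize` bytes.
object StreamChunks {
  def chunks(in: InputStream, chunkSize: Int): Iterator[Array[Byte]] = {
    val buf = new Array[Byte](chunkSize) // reused between reads, copied before handing out
    Iterator
      .continually(in.read(buf))               // number of bytes read, or -1 at end of stream
      .takeWhile(_ != -1)
      .filter(_ > 0)                           // skip empty reads
      .map(n => java.util.Arrays.copyOf(buf, n))
  }
}
```

Each chunk is a fresh copy, so the consumer can keep it after the next read overwrites the shared buffer.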

@lihaoyi
Member

lihaoyi commented Aug 4, 2016

Left some high-level notes of issues we should fix before doing a more detailed in-the-weeds review.

As part of this review, I'd like you to run some performance benchmarks on various parts of the design space and post the results here, for future reference. For example:

  • What perf impact does this diff have on non-iterator String/Array[Byte] parsing? We have extra code in some of the main code paths setting flags, after all, and I'd like to know whether, and by how much, it slowed things down
  • What perf impact do the variously sized chunks in Iterator[T] parsing have? E.g. for chunk sizes of 1, 2, 4, 8... 512, 1024, 2048 chars/bytes, on streaming ScalaParse, CssParse, BmpParse and JsonParse? And how does it compare to the non-Iterator parsing performance?
  • What is the distribution of lengths of the buffer when dropBuffer is called, for ScalaParse, CssParse, BmpParse, ClassParse (?), for the tiniest possible buffer (1 byte/char)?

While tabulating this information is not strictly part of the "code", it would really help us understand and review the actual behavioral characteristics of this change, and would serve as a baseline of facts that future discussions around fastparse can use.

import fastparse.all._
import utest._

object IteratorTests extends TestSuite {
Member

I don't think the unit tests we have here are sufficient; we should add some unit tests that (or modify these to) subclass ParserInput and hook into dropBuffer in order to verify that dropBuffer indeed gets called at exactly the right moments within these tests, and that the buffer does not grow past whatever size we think it should not grow past.

That will let us be confident that we're actually being as streaming as we should be, and not accidentally loading the whole thing into memory before dealing with it.
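
The kind of hook described here could look roughly like the sketch below. `SketchInput` is a hypothetical stand-in for an iterator-backed input (it is not the ParserInput from this PR), and `RecordingInput` is a hypothetical test double that records every drop and the largest buffer size seen.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for an iterator-backed input; not fastparse's ParserInput.
class SketchInput(chunks: Iterator[String]) {
  private val buffer = new StringBuilder
  private var firstIdx = 0 // absolute index of the first buffered element

  def apply(index: Int): Char = {
    while (index >= firstIdx + buffer.length && chunks.hasNext) buffer.append(chunks.next())
    buffer.charAt(index - firstIdx)
  }

  def dropBuffer(index: Int): Unit = {
    val n = math.max(0, math.min(index - firstIdx, buffer.length))
    buffer.delete(0, n)
    firstIdx += n
  }

  def bufferSize: Int = buffer.length
}

// Test double: records where dropBuffer was called and the largest buffer observed.
class RecordingInput(chunks: Iterator[String]) extends SketchInput(chunks) {
  val dropsAt = ArrayBuffer.empty[Int]
  var maxBufferSize = 0

  override def apply(index: Int): Char = {
    val c = super.apply(index)
    maxBufferSize = math.max(maxBufferSize, bufferSize)
    c
  }

  override def dropBuffer(index: Int): Unit = {
    dropsAt += index
    super.dropBuffer(index)
  }
}
```

A test would then run the parser over a `RecordingInput` and assert on `dropsAt` and `maxBufferSize`, instead of only checking the parse result.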

() => parser.parse(dataFail).asInstanceOf[Parsed.Failure[ElemType]].extra.traced)).flatten)
println(results.map(_.mkString(" ")).mkString("\n"))

val sizes = Seq(1, 2,/* 4, 16, 64,*/ 1024/*, 4096*/)
Member

Can we leave all of these un-commented?

Collaborator Author

We can, but it could take a lot of time to do all of these benchmarks.

Member

Oh ok, let's leave them out then

val Parsed.Success(res, _) = all.parseInput(input)
assert(res == "abded")
}
}
Member

Could we add some small test cases to test the behavior of dropBuffer using the WhitespaceApi? Given that that was the cause of some bugs and frustration, we need to make sure that it keeps working.

As part of that, I would like a test case for the whitespace-ignoring a ~ (b ~ (Pass ~/ Pass)) case. As discussed in chat, this currently would not correctly dropBuffer at the ~/. Regardless, we should still verify that fact, and if/when we decide to fix this behavior, we can update the test.

@vovapolu vovapolu merged commit 1b7f5ff into com-lihaoyi:master Aug 17, 2016