-
-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Iterator parser #97
Iterator parser #97
Conversation
import BmpParser._ | ||
|
||
object LargeBmpIteratorTests extends TestSuite { | ||
val url = "https://raw.githubusercontent.com/lihaoyi/fastparse/master/byteparse/jvm/src/test/resources/lena.bmp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we read this off disk as an InputStream rather than reading off the web? Otherwise we'll end up in odd situations where the tests will pass on the PR but after you commit it the repo on github changes and the tests fail, or vice versa
Left some high-level notes of issues we should fix before doing a more detailed in-the-weeds review. As part of this review, I'd like you to run some performance benchmarks on various parts of the design space and post the results here, for future reference. For example:
While tabulating this information is not strictly part of the "code", they would really help us understand and review the actual behavioral characteristics of this change, and will serve as a baseline of facts that future discussions around fastparse can use. |
import fastparse.all._ | ||
import utest._ | ||
|
||
object IteratorTests extends TestSuite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the unit tests we have here are sufficient; we should add some unit tests that (or modify these to) subclass ParserInput
and hook into dropBuffer
in order to verify that dropBuffer
indeed gets called at exactly the right moments within these tests, and that the buffer does not grow past whatever size we think it should not grow past.
That will let us be confident that we're actually being as streaming as we should be streaming, and not accidentally loading the whole thing into memory before dealing with it
…require less memory during parsing
() => parser.parse(dataFail).asInstanceOf[Parsed.Failure[ElemType]].extra.traced)).flatten) | ||
println(results.map(_.mkString(" ")).mkString("\n")) | ||
|
||
val sizes = Seq(1, 2,/* 4, 16, 64,*/ 1024/*, 4096*/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we leave all of these un-commented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can, but it could take a lot of time to do all of these benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok, let's leave them out then
val Parsed.Success(res, _) = all.parseInput(input) | ||
assert(res == "abded") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add one some small test cases to test the behavior of dropBuffer
using the WhitespaceApi
? Given that that was the cause of some bugs and frustration, we need to make sure that it keeps working.
As part of that, I would like a test case for the whitespace-ignoring a ~ (b ~ (Pass ~/ Pass))
case. As discussed in chat, this currently would not correctly dropBuffer
at the ~/
. Regardless we should still verify that fact, and if/when we decide to fix this behavior, we can update the test
This pull-request adds new type of parsing -
Iterator
parsing that can process streaming data without holding all data in the memory.It introduces a new class -
ParserInput
which is able to contain two types of data:IndexedSeq
andIterator
ofIndexedSeq
s. This class now is used as a source for parsing instead ofIndexedSeq
orString
in previous versions. The most interesting and important method in it isdropBuffer
that has special behavior forIterator
mode, it drops "prefix" of data from inner buffer inParserInput
, thus reducing the memory consumption during parsing. The details of how it works are described in the docs ofParserInput
.Another modification is adding new flags into the
ParseCtx
class. They areisFork
,isNoCut
andisCapturing
. They control calling ofdropBuffer
and they are changed when the parsing process goes into corresponding parser unit. 'isFork' turns on inEither
,Optional
andRepeat
parsers,isNoCut
inNoCut
,isCapturing
inCapturing
and inWhitespaceApi.CustomSequence
(to completely block anydropBuffer
).Benchmarks:
ScalaParse performance test on master (in avg.)
First number is a count of regular parsings of a quite big file during 10 secs, seconds is the same parsings, but with invalid file and tracing of the result
Failure
.ScalaParse performance test on this PR (in avg.)
ScalaParse Iterator tests
The input of ScalaParse was wrapped into the
Iterator
by dividing it in batches with varying size. And using this iterator, as usual, the number of passes within 10 seconds was measured with the maximum size ofParserInput
inner buffer.Distibutions of buffer size during drops in parsing an iterator data
The 1-sized
Iterator
was chosen for testing distribution of inner buffer size during parsing.