
Peeking lexer optimizations #263

Merged · 4 commits · Dec 3, 2022

Conversation

@petee-d (Contributor) commented Aug 13, 2022

Hi @alecthomas,

when rebasing #213 onto master, I decided it would be best to commit my work in smaller chunks. The first chunk is some optimizations I made to PeekingLexer, as it was becoming the bottleneck for the generated parser implementation.

  1. Optimizing PeekingLexer.Clone - Checkpoint
    Clone does an allocation that could be avoided. One option is to have parseContext and my generated code store the PeekingLexer by value (keeping the pointer receiver for its methods). That alone helps quite a lot, but I went a bit further in my generated code: I wrapped the internal cursors in a public lexer.Checkpoint type. When a branch needs a backup of the PeekingLexer state, it only has to copy this small struct and restore it if the branch is not accepted - that's even a little faster than copying and replacing the entire PeekingLexer struct (see the sketch after this list).

  2. Optimizing PeekingLexer.Clone - removing it
    Using the Checkpoint pattern in the reflection parser is tough, as it needs to explicitly apply an accepted branch instead of reverting to a backup on a failed branch (the style that was nicer to use in the generated code). So my proposal would be to keep the Checkpoint pattern only for generated code. However, I added a commit (independent of my other changes) that stores PeekingLexer in parseContext by value and avoids the Clone allocation altogether; I can include it if you like.

  3. Optimizing PeekingLexer.Peek
    Currently Peek (used quite often by the reflective parser and even more often by my generated parser) has to skip elided tokens every time it's called. It also checks the length of PeekingLexer.tokens and returns p.eof in a lot of places - I think it's more elegant to make the EOF token the last token and prevent Next from advancing beyond it.
    This optimization does that work once, in Upgrade and Next, which makes Peek trivial; it also removes the eof field. The performance improvement in the Thrift benchmark is barely noticeable, but a separate benchmark I added shows a >3x improvement for Peek calls, and it was much more noticeable in my generated parser.
    I mentioned this in point 2 of Generating parser code #213 (comment), as the initial implementation changed the behavior of RawCursor and thus also Range. This new implementation adds a separate cursor that tracks the next non-elided token, making it a bit slower but preserving the old behavior (see the sketch after this list).

  4. Working with lexer.Token as a pointer
    Token is not a small struct, and in most cases it's more efficient to work with it as a pointer instead of copying it - assuming, of course, no allocations are needed, as the tokens are already stored in PeekingLexer.tokens. Changing the Token methods to a pointer receiver alone improved CPU performance in the Thrift benchmark by ~2.5%; also making Next, Peek and other methods return a pointer raised the improvement to ~4%.
    Considering that would be a breaking change to a public (somewhat internal, but still public) API, I decided it wasn't worth it. But if you think it's OK, I added a new method Current, a version of Peek that returns a pointer; BenchmarkPeekingLexer_Current shows it's almost 3x faster than Peek. If you don't like it, I don't need to include it. I also used pointers inside the PeekingLexer methods at least, and got a tiny CPU improvement in the Thrift benchmark out of it.

  5. BatchLexer interface
    This one is actually unrelated to code generation, but when playing with a manually written lexer for Jinja, I found it both more convenient and more efficient (thanks to far fewer function calls and more efficient allocations) to be able to yield multiple tokens at a time - in my case up to 50. The interface and the related implementation are just a soft suggestion; if you don't like it, I can remove it. It's entirely optional - a Lexer that chooses to implement it can gain extra performance, and one that doesn't loses nothing.
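
To make points 1 and 3 concrete, here is a minimal sketch of the Checkpoint pattern. Only the Checkpoint type and the EOF-last invariant come from this PR; the field names and cursor bookkeeping are illustrative assumptions, not the exact implementation:

// Checkpoint holds the mutable cursor state of the PeekingLexer.
type Checkpoint struct {
	rawCursor  int // index of the next raw token, elided or not
	nextCursor int // index of the next non-elided token (point 3)
}

type PeekingLexer struct {
	Checkpoint
	tokens []Token // all tokens, with EOF guaranteed to be last
}

// Peek returns the next non-elided token without consuming it. Next
// maintains nextCursor eagerly and EOF is always the final token, so
// no skipping or bounds checks are needed here.
func (p *PeekingLexer) Peek() Token {
	return p.tokens[p.nextCursor]
}

A generated-parser branch can then back up and restore state with two copies of a small struct and no allocation:

checkpoint := p.Lex.Checkpoint // back up before attempting the branch
if !parseBranch(p) {
	p.Lex.Checkpoint = checkpoint // branch not accepted; restore
}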

With all but the last optimization applied (BatchLexer isn't used by any built-in lexer - yet), the final comparison looks like this:

BenchmarkParticipleThriftGenerated-Before-8   	   12105	    292277 ns/op	        27.46 MiB/s	  178816 B/op	    2167 allocs/op
BenchmarkParticipleThriftGenerated-After-8   	   12597	    269844 ns/op	        28.58 MiB/s	  161920 B/op	    1903 allocs/op

Perhaps even more important is that, according to a profiler, the parser now spends almost 7x less time in the PeekingLexer methods I set out to optimize.

@alecthomas (Owner)

Welcome back :)

These don't seem unreasonable, but do you have any benchmark comparisons for just lexing (vs. lexing+parsing)?

// Checkpoint wraps the mutable state of the PeekingLexer.
//
// Copying and restoring just this state is a bit faster than copying the entire PeekingLexer.
type Checkpoint struct {
@alecthomas (Owner):

In general I try to weigh up the cost of having a permanent public API that will need to be supported "forever" vs. the benefit. In this case a very marginal improvement in performance doesn't seem worth it, so I'd prefer just to copy PeekingLexer.

@alecthomas (Owner) commented Aug 13, 2022:

Also I think let's get in the macro optimisations first, then worry about micro-optimisations. The generalised downside of micro-optimisations is that they usually make the code less clear, so I'd prefer to avoid them for now.

// Clone creates a clone of this PeekingLexer at its current token.
//
// The parent and clone are completely independent.
func (p *PeekingLexer) Clone() *PeekingLexer {
@alecthomas (Owner):

It's fine to remove this, we're still in beta!

lexer/api.go Outdated
@@ -47,6 +47,11 @@ type Lexer interface {
Next() (Token, error)
}

// BatchLexer consumes and returns a batch of tokens at a time. A Lexer may choose to implement this for extra speed.
type BatchLexer interface {
@alecthomas (Owner):

Doesn't seem unreasonable, but how about we put this in a separate PR. Do you have any benchmarks against a lexer before/after implementing this?
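
For illustration, a lexer might opt in along these lines; this is a sketch only, and the NextBatch name and signature are assumptions (the diff above is truncated), not necessarily the PR's actual API:

// BatchLexer is implemented by lexers that can return several tokens
// per call, amortizing call and allocation overhead.
type BatchLexer interface {
	Lexer
	NextBatch() ([]Token, error) // one or more tokens, ending with EOF
}

// batchingLexer wraps any token-at-a-time Lexer and yields up to 50
// tokens per NextBatch call, as described for the Jinja lexer above.
type batchingLexer struct {
	inner Lexer
}

func (l *batchingLexer) Next() (Token, error) { return l.inner.Next() }

func (l *batchingLexer) NextBatch() ([]Token, error) {
	batch := make([]Token, 0, 50)
	for len(batch) < cap(batch) {
		tok, err := l.inner.Next()
		if err != nil {
			return nil, err
		}
		batch = append(batch, tok)
		if tok.EOF() {
			break // EOF stays the last token of the final batch
		}
	}
	return batch, nil
}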

@klondikedragon (Contributor)

@petee-d -- there has been some good progress on getting the codegen lexer into shape: it now follows the pattern suggested by #264 (serialize the lexer definition to JSON and have a separate participle utility program turn the JSON into code), and some conformance tests were added along with some fixes. The generated lexer is now passing the test suite for my app (a few hundred test cases) and delivering an 8x lexing speedup with O(1) allocations. When I used your experimental parser codegen branch before, I saw similar speedups during parsing, so I'm excited about the possibility of combining the two. :)

I'm motivated to contribute and help get codegen for parsing across the finish line, and I wanted to check in with you and @alecthomas about what might make sense as next steps. If you're able to finish rebasing the experimental parser generator branch onto current participle (lots of changes, I know, and it looks like you've made good progress in this PR!), I could then try to take it from there and adapt it to the suggestions in #264. I know you (and Alec!) are doing this work in your personal time, so I'm super grateful for anything you might have time to contribute, and I understand if that's not possible right now. Let me know, and I could also try to take it from where the experimental branch currently is.

@alecthomas - I'm curious to get your thoughts on this overall direction as well. Thanks!

@petee-d (Contributor, Author) commented Nov 3, 2022

Hey guys, yeah, it's been very difficult for me to find capacity to work on this in the last few months, especially since any capacity I could dedicate to open-source projects went to a different library. But that's becoming less and less demanding, and I should have time to finish this.

Let's start with this PR. @alecthomas, could you sum up your recommendations on which of these changes to keep and which to separate out (either to possibly include later, or to abandon completely)? My personal preference after reading your comments would be:

  1. Optimizing PeekingLexer.Clone - Checkpoint - probably keep. It's not just about copying less data, I personally find the explicit methods to create and restore checkpoints more readable than just replacing the entire peeking lexer with the copy. In the generated code, it looks like p.Lex.LoadCheckpoint(branchCheckpoint), with copying the entire lexer it would look like *p.Lex = branchLexer or something like that - I don't know, but it feels less readable to me.
  2. Optimizing PeekingLexer.Clone - removing it - keep - if you don't mind removing it, I think it would be nice
  3. Optimizing PeekingLexer.Peek - keep - probably no reason not to
  4. Working with lexer.Token as pointer - keep - same
  5. BatchLexer interface - skip - I agree it's not really necessary and have no problem removing this for now, maybe I'll come back to it later

What do you think about each?

@petee-d (Contributor, Author) commented Nov 7, 2022

Regarding 1., I just realized I didn't include MakeCheckpoint and LoadCheckpoint in this PR. Basically, there are these 3 options:

  1. Copying the entire lexer, would look like: branchLexer := *p.Lex and *p.Lex = branchLexer
  2. Checkpoint struct manipulated using methods: branchCheckpoint := p.Lex.MakeCheckpoint() and p.Lex.LoadCheckpoint(branchCheckpoint)
  3. Checkpoint struct used directly to save/load a checkpoint: branchCheckpoint := p.Lex.Checkpoint and p.Lex.Checkpoint = branchCheckpoint

I think when I started rebasing my changes (which initially used approach 2), I decided to go with approach 3 instead, but kind of forgot about it. I can't say I care too much about which of these approaches we ultimately pick. I lean towards 3 at the moment - it seems the most elegant, plus it arguably has that tiny performance advantage of not copying the immutable PeekingLexer fields.
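
For concreteness, approaches 2 and 3 differ only in whether the copy goes through trivial accessor methods; a sketch, with the method bodies being my assumption:

// Approach 2: explicit methods, easily inlined by the compiler.
func (p *PeekingLexer) MakeCheckpoint() Checkpoint {
	return p.Checkpoint
}

func (p *PeekingLexer) LoadCheckpoint(checkpoint Checkpoint) {
	p.Checkpoint = checkpoint
}

// Approach 3: callers read and write the exported field directly, e.g.
// branchCheckpoint := p.Lex.Checkpoint ... p.Lex.Checkpoint = branchCheckpoint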

By the way, I finally found time for coding today and finished resolving conflicts against the current participle master. I gave up on using git merge (the conflicts were just too massive and difficult to read) and instead manually copy-pasted the changes from the original branch diff - it was actually faster that way. :D My test suite for the Jinja template language is now passing.

I'm not even close to done, though: I still need to implement support for the 2 newly added node types (custom and union), check whether there were any changes in participle behavior I need to reflect in my generated code, work my way through tens of TODOs I left for fixing some obscure differences from the native parser behavior (especially error reporting and lookahead), and then actually make generating the parser reasonably easy. I will also need your input in #264.

@klondikedragon (Contributor)

@petee-d - it's exciting you've got some time to spend on this effort, and that you have finished an initial rebasing! Thank you! If you publish your updated branch, I could give it a try on my grammar / test suite as well. I'm not using the custom/union node types, so it feels there is a chance it could pass the test suite in its current state. :)

@alecthomas is certainly the authority so I'm only offering my 0.02 on the other items :)

Items 2, 3, and 4 feel like pretty "local" changes, so there perhaps isn't a big debate about whether to accept them. Item 2, avoiding an allocation in a hot path, sounds pretty valuable. For items 3 and 4 it's hard to tell whether they affect the parser-level benchmark, but they mostly just change a few internal functions of PeekingLexer.

It seems like the main open question with this PR is item 1: whether to just copy the whole PeekingLexer struct or to have this separate Checkpoint struct. A slice header is a pointer plus two integers, and a map I believe is just a pointer, so splitting out the struct saves copying maybe a few extra words vs. approach 1. Is that generating any noticeable difference in macro (parser-level) benchmarks? If not, it sounds like Alec preferred the readability/maintainability of just copying the whole struct. If it does have a major macro-level effect, then perhaps it's worth the change. I'm just thinking that a few percentage points of CPU speed may not be a big thing to include (vs. the maintainability cost of another public API on PeekingLexer) if the generated parser gets a 10x-20x speedup (and the generated lexer gets 8x+ vs. reflection).

@petee-d (Contributor, Author) commented Nov 8, 2022

@klondikedragon I ran a benchmark (parsing only, on Jinja) and the Checkpoint optimization was actually even more impactful than I remembered. Benchmarks were run with -test.count=3 -test.benchtime=3s, picking the best of 3; each line starts with the approach number from my post above.

1: BenchmarkParseOnly-8   	   24804	    145576 ns/op	   99288 B/op	     792 allocs/op
2: BenchmarkParseOnly-8   	   26631	    135601 ns/op	   99288 B/op	     792 allocs/op
3: BenchmarkParseOnly-8   	   25071	    135069 ns/op	   99288 B/op	     792 allocs/op

Unsurprisingly, 2. and 3. were equal as the MakeCheckpoint and LoadCheckpoint methods are inlined anyway. Comparing just copying the entire lexer vs Checkpoint:

BenchmarkPeekingLexer_CopyWhole-8   		829165962	         1.421 ns/op
BenchmarkPeekingLexer_CopyCheckpoint-8   	1000000000	         0.3934 ns/op

I got curious and tried adding an unused map or slice to Checkpoint; the slice actually slowed it down the most, noticeably more than adding 24 bytes of non-pointer data. I'm not entirely sure what goes on in the background, but maybe Go has to be more careful manipulating pointer data because of garbage collection? (There probably is a reason why there is a separate memclrNoHeapPointers in the runtime.)
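
A micro-benchmark along these lines (names and layout are mine, for illustration) shows what is being compared: copying two structs of equal size, one with and one without a pointer-bearing field:

package lexer_test

import "testing"

type withSlice struct {
	a, b int
	s    []int // slice header: pointer + len + cap (3 words)
}

type noPointers struct {
	a, b, c, d, e int // same total size, no pointers
}

var (
	sinkA withSlice
	sinkB noPointers
)

func BenchmarkCopyWithSlice(b *testing.B) {
	v := withSlice{a: 1, b: 2, s: make([]int, 4)}
	for i := 0; i < b.N; i++ {
		sinkA = v // copying a pointer may involve a GC write barrier
	}
}

func BenchmarkCopyNoPointers(b *testing.B) {
	v := noPointers{a: 1, b: 2}
	for i := 0; i < b.N; i++ {
		sinkB = v // pointer-free copy needs no barrier
	}
}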

Anyway, I wouldn't say performance is the main reason I prefer approach 3 - I just find it more readable. The maintainability cost isn't really that high: it's just a new public type and a new Checkpoint field on PeekingLexer. There isn't really a way to mess up the lexer state by manipulating it incorrectly, as the fields of Checkpoint are still private - you can only use it to restore a checkpoint or reset the lexer to the start.

@klondikedragon (Contributor)

@petee-d -- pretty interesting that the copy of the slice was the biggest source of slowdown in copying the struct; I wouldn't have guessed that! It makes me curious to learn more about how Go does GC. Apparently there is a build flag that outputs escape analysis info: go build -gcflags="-m" ...

@alecthomas - @petee-d makes a compelling case for the "Checkpoint" style modification to PeekingLexer, with a ~7% overall speedup of parsing, and the "3" approach described above is copying/restoring the Checkpoint field of the struct instead of the whole struct itself, so pretty similar semantics to the current code. What do you think?

@petee-d (Contributor, Author) commented Nov 10, 2022

I pushed the rebased branch with the last commit (BatchLexer) dropped, so this is ready to go if @alecthomas agrees with the plan.

@alecthomas (Owner)

> Optimizing PeekingLexer.Clone - Checkpoint - probably keep. It's not just about copying less data, I personally find the explicit methods to create and restore checkpoints more readable than just replacing the entire peeking lexer with the copy. In the generated code, it looks like p.Lex.LoadCheckpoint(branchCheckpoint), with copying the entire lexer it would look like *p.Lex = branchLexer or something like that - I don't know, but it feels less readable to me.
> Optimizing PeekingLexer.Clone - removing it - keep - if you don't mind removing it, I think it would be nice

I like the checkpoint approach, works for me 👍

> Checkpoint struct manipulated using methods: branchCheckpoint := p.Lex.MakeCheckpoint() and p.Lex.LoadCheckpoint(branchCheckpoint)

I prefer this approach because it makes the API more explicit.

> Optimizing PeekingLexer.Peek - keep - probably no reason not to
> Working with lexer.Token as pointer - keep - same

I think I'd prefer to either switch Peek() (and Next()) to return pointers and drop Current(), or keep the current by-value interface. From memory, the only reasons Peek() returns a value are historical and for compatibility with the lexer interface. Neither of those is relevant anymore, and as you say, because all tokens are in memory, just returning a pointer might be faster. I do wonder if this is true, though - what is the performance difference of just this change?
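
For reference, the pointer-returning variant is tiny; this sketch reuses the assumed cursor fields from the earlier sketch and is not necessarily what #283 implements:

// Peek returns a pointer into the token slice instead of copying the
// Token struct. Safe as long as tokens is never reallocated after the
// PeekingLexer is constructed.
func (p *PeekingLexer) Peek() *Token {
	return &p.tokens[p.nextCursor]
}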

> BatchLexer interface - skip - I agree it's not really necessary and have no problem removing this for now, maybe I'll come back to it later

👍

@petee-d (Contributor, Author) commented Nov 16, 2022

> I prefer this approach because it makes the API more explicit.

OK, approach 2 wasn't my favorite anymore, but I will add those methods. :)

> (Current() pointers) What is the performance difference of just this change?

A benchmark with the generated parser for Jinja, on pre-lexed input:

BenchmarkParseOnly-8-Pointer   	   21108	    155065 ns/op	   99288 B/op	     792 allocs/op
BenchmarkParseOnly-8-ByValue   	   18817	    191377 ns/op	   99288 B/op	     792 allocs/op

I.e. quite significant.

Changing Next() and Peek() to return pointers and refactoring the preexisting code would be possible, although it's a bit more than I bargained for. :) But I could do it if you want.

@alecthomas (Owner)

I have the pointer conversion done here: #283

@alecthomas (Owner)

I'll merge that in.

@alecthomas (Owner)

Thanks for doing the benchmarks BTW 🙏

Peter Dolak and others added 4 commits on November 16, 2022, 17:56:

- The state itself is still in private fields, so Checkpoint being a public field doesn't break encapsulation.
- Roughly 5% less CPU time and 12% fewer allocations in the Thrift benchmark with the generated lexer.
- The attached benchmark is over 3x faster after the optimization.
@petee-d (Contributor, Author) commented Nov 18, 2022

Forgot to mention that I updated the pull request after rebasing onto your changes to the PeekingLexer (not a pleasant conflict; I wish you had done that after merging this PR 😅). IMHO it's ready for merging.

@petee-d (Contributor, Author) commented Dec 3, 2022

@alecthomas Do you agree this can be merged? If so, can you merge it?

Writing tests revealed quite a lot of differences between the errors & partial parse trees reported by the reflective parser and by my generated code. I'm slowly getting through them; I fixed the vast majority by modifying the code generator and a few by improving participle itself (small changes in functionality that only affect edge cases of error reporting and should be beneficial). My WIP code is at https://github.com/alecthomas/participle/compare/master...petee-d:parser-code-generation?expand=1 and it's not yet ready for review (there will be some refactoring of the code generator, and it needs documenting once it stabilizes).

But what I would like you to look at is the public API for generating parser code and using it. I documented it here, and you can also see it in action in the Thrift parser. The point of the custom type embedding participle.GeneratedParserBase is being able to delete the generated code file while still being able to compile everything. Latest benchmark for the Thrift parser:

BenchmarkGoThriftParser      	        	    2156	    528231 ns/op	         1.631 MiB/s	  148952 B/op	    2628 allocs/op
BenchmarkParticipleThriftNative            	    2887	    411767 ns/op	         2.183 MiB/s	  172405 B/op	    2034 allocs/op
BenchmarkParticipleThriftGeneratedLexer   	    3655	    298197 ns/op	         2.764 MiB/s	  168008 B/op	    1798 allocs/op
BenchmarkParticipleThriftGeneratedBoth   	   16824	     69713 ns/op	        12.72 MiB/s	   45896 B/op	      98 allocs/op

The last line combines the lexer generated with current master and my generated parser. Lexing is the slow part here, taking around 70% of the time.
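
As a rough sketch of the embedding pattern described above (only participle.GeneratedParserBase is named in this thread, so the file layout and the shadowing mechanism here are assumptions):

package thrift

import "github.com/alecthomas/participle/v2"

// thrift.go - always compiled. The embedded base presumably supplies
// fallback methods via Go's method promotion, so this package still
// compiles when the generated file is deleted.
type ThriftParser struct {
	participle.GeneratedParserBase
}

// thrift_generated.go (emitted by the generator, freely deletable)
// would define methods directly on ThriftParser that shadow the
// promoted fallbacks, switching parsing onto the generated fast path.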

@petee-d (Contributor, Author) commented Dec 3, 2022

Oh and @klondikedragon, I would love to hear your feedback after trying it on your grammar as well. Especially any bugs or usability issues you notice.

@alecthomas merged commit 5adbb7c into alecthomas:master on Dec 3, 2022
@alecthomas (Owner)

Nice!

@klondikedragon (Contributor)

@petee-d @alecthomas - exciting progress! I've tried out the WIP parsergen code with nice results, I'll post details to #213 shortly.
