Don't lex the entire input immediately #1110
Merged
fitzgen
merged 4 commits into
bytecodealliance:main
from
alexcrichton:dont-store-tokens-in-memory
Jul 11, 2023
Conversation
This is in preparation for an upcoming change where lexing will be performed lazily rather than eagerly. In this situation lexing is possibly done during a `peek` operation which means that errors can happen at that time. This commit purely updates the related methods to return `Result<Option<T>>` instead of `Option<T>`, but it doesn't actually introduce any functional change other than that. The return value from these methods is always `Ok(...)` at this time, and a future commit will change that to possibly being an error. This is split out from the future change since it's a fair amount of churn with lots of new `?` operators popping up everywhere.
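The shape of that signature change can be sketched roughly as follows. This is a minimal illustration, not the crate's actual API: `Token`, `LexError`, and `expect_keyword` are hypothetical names, and the eager token buffer stands in for whatever storage the real parser uses.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    LParen,
    RParen,
    Keyword(String),
}

#[derive(Debug)]
struct LexError(String);

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    // Before: `fn peek(&self) -> Option<&Token>`.
    // After: lexing may happen (and fail) inside `peek`, so callers see
    // `Result<Option<...>>` and propagate errors with `?`.
    fn peek(&self) -> Result<Option<&Token>, LexError> {
        // In the eager version tokens are already in memory, so this is
        // always `Ok` -- exactly the "no functional change yet" state the
        // commit describes.
        Ok(self.tokens.get(self.pos))
    }

    // Callers now need a `?` where they previously matched on the `Option`
    // alone -- the "churn" the commit message mentions.
    fn expect_keyword(&mut self, kw: &str) -> Result<bool, LexError> {
        match self.peek()? {
            Some(Token::Keyword(k)) if k == kw => {
                self.pos += 1;
                Ok(true)
            }
            _ => Ok(false),
        }
    }
}
```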
This commit fixes a few issues with annotation-related support in the wasm text format:

* The `@producers` section is now parsed and recognized for components, where previously it was ignored.
* Similarly, the `@name` annotation is now properly recognized for components instead of being ignored in a few places.
* Registration of annotations has been moved to top-level parsers to avoid adding and re-adding entries to the known-annotations map.
This commit is a change to the internals of parsing for wasm text files. Previously the entire input would be lex'd and stored into an array for parsing to access. This is, however, the largest contributing factor to the peak memory usage reported in bytecodealliance#1095. Tokens are only used a handful of times, so buffering the entire file in memory is quite wasteful.

This change was tricky to apply, however, because the original rationale for lexing in this manner was performance related. The recursive-descent style `Parser` trait encourages `peek`-ing tokens and will often involve attempting a parse only to later unwind and try something else instead. This means that each individual token, especially whitespace and comments, would otherwise naively be lexed many times. For example, when the buffer of all tokens was "simply" removed, instrumented analysis showed that over half of all tokens in the input file were lexed more than once, so simply removing the buffer resulted in a performance regression.

This performance regression is in some sense inherent to a lazy-lexing strategy. I implemented a fixed-width cache of the latest tokens lex'd, but it still didn't perform as well as caching all tokens. I think this is because lexing is quite fast, and the added cache layer spent a lot of time checking and managing the cache. While this performance regression may not be 100% fixable, I've settled on a strategy that's a bit more of a half-measure.

The general idea is that the `Parser` now stores the current position in the file plus the next "significant" token at the same time. Here "significant" means, for example, not-whitespace and not-comments. This enables the parser to always know the next token, having pre-skipped whitespace and comments. Thus any single-token "peek" operations don't actually need to do any lexing; they can instead look at the current token and decide what to do based on that. This is enabled with a few new `Cursor::peek_*` methods that avoid generating the next `Cursor`, which would otherwise require a lexing operation.

Overall this means that the majority of tokens in the case of bytecodealliance#1095 are lexed only once. There are still ~10% of tokens that are lexed 2+ times, but the performance numbers are: before this commit, parsing that file took 7.6s with 4G of memory; after this commit it takes 7.9s with 2G of memory. This means that for a 4% regression in time we're reducing memory usage by half.
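The "always hold the next significant token" idea can be sketched like this. Everything here is an illustrative toy, not the crate's implementation: `RawToken`, `Lexer`, `next_significant`, and the simplified lexing rules (a single `;` starts a line comment, where real wat uses `;;`) are all assumptions made for the example.

```rust
#[derive(Debug, Clone, PartialEq)]
enum RawToken {
    Whitespace,
    Comment,
    LParen,
    RParen,
    Word(String),
}

struct Lexer<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Lexer<'a> {
    // Lex one raw token, including whitespace and comments.
    fn next_raw(&mut self) -> Option<RawToken> {
        let b = *self.input.get(self.pos)?;
        match b {
            b' ' | b'\t' | b'\n' => {
                while self
                    .input
                    .get(self.pos)
                    .map_or(false, |&c| c == b' ' || c == b'\t' || c == b'\n')
                {
                    self.pos += 1;
                }
                Some(RawToken::Whitespace)
            }
            // Simplified: any ';' starts a line comment.
            b';' => {
                while self.input.get(self.pos).map_or(false, |&c| c != b'\n') {
                    self.pos += 1;
                }
                Some(RawToken::Comment)
            }
            b'(' => {
                self.pos += 1;
                Some(RawToken::LParen)
            }
            b')' => {
                self.pos += 1;
                Some(RawToken::RParen)
            }
            _ => {
                let start = self.pos;
                while self
                    .input
                    .get(self.pos)
                    .map_or(false, |&c| !b" \t\n();".contains(&c))
                {
                    self.pos += 1;
                }
                let word = String::from_utf8_lossy(&self.input[start..self.pos]);
                Some(RawToken::Word(word.into_owned()))
            }
        }
    }
}

// Skip whitespace/comments so the parser only ever sees significant tokens.
fn next_significant(lexer: &mut Lexer) -> Option<RawToken> {
    loop {
        match lexer.next_raw()? {
            RawToken::Whitespace | RawToken::Comment => continue,
            tok => return Some(tok),
        }
    }
}

struct Parser<'a> {
    lexer: Lexer<'a>,
    // The next significant token is always held here, so a single-token
    // `peek` never needs to lex anything.
    current: Option<RawToken>,
}

impl<'a> Parser<'a> {
    fn new(input: &'a str) -> Parser<'a> {
        let mut lexer = Lexer { input: input.as_bytes(), pos: 0 };
        let current = next_significant(&mut lexer);
        Parser { lexer, current }
    }

    // Free: no lexing, just look at the cached token.
    fn peek(&self) -> Option<&RawToken> {
        self.current.as_ref()
    }

    // Consuming a token is when the next one gets lexed.
    fn advance(&mut self) -> Option<RawToken> {
        let next = next_significant(&mut self.lexer);
        std::mem::replace(&mut self.current, next)
    }
}
```

With this layout, repeated `peek` calls between decisions cost nothing, and each significant token is lexed exactly once along a successful parse path; re-lexing only reappears when the parser unwinds and retries, which matches the residual ~10% described above.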
Taking a look.
fitzgen
approved these changes
Jul 11, 2023
This was wordy, but actually very nice in the end. Thanks!
Note that this PR has a few lead-up commits separated out from the "main" change to make it a bit easier to review, but they're required to enable the final change of not lexing the whole input immediately. The first commit plumbs results through more places. The second commit is fallout from storing the "next" token at all times: annotations that would previously be registered before the next token was reached would now accidentally be skipped over, so annotations had to be registered earlier in the parsing process, which is where those issues cropped up.