Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - Lexer string interning #1758

Closed
wants to merge 4 commits into from
Closed

Conversation

Razican
Copy link
Member

@Razican Razican commented Dec 22, 2021

This Pull Request is part of #279.

It adds a string interner to Boa, which allows many types to not contain heap-allocated strings, and just contain a NonZeroUsize instead. This can move types to the stack (hopefully I'll be able to move Token, for example, maybe some Node types too.

Note that the internet is for now only available in the lexer. Next steps (in this PR or future ones) would include also using interning in the parser, and finally in execution. The idea is that strings should be represented with a Sym until they are displayed.

Talking about display. I have changed the ParseError type in order to not contain anything that could contain a Sym (basically tokens), which might be a bit faster, but what is important is that we don't depend on the interner when displaying errors.

The issue I have now is in order to display tokens. This requires the interner if we want to know identifiers, for example. The issue here is that Rust doesn't allow using a fmt::Formatter (only in nightly), which is making my head hurt. Maybe someone of you can find a better way of doing this.

Then, about cursor.expect(), this is the only place where we don't have the expected token type as a static string, so it's failing to compile. We have the option of changing the type definition of ParseError to contain an owned string, but maybe we can avoid this by having a &'static str come from a TokenKind with the default values, such as "identifier" for an identifier. I wanted for you to think about it and maybe we can just add that and avoid allocations there.

Oh, and this depends on the VM-only branch, so that has to be merged before :)

Another thing to check: should the interner be in its own module?

@Razican Razican added enhancement New feature or request help wanted Extra attention is needed performance Performance related changes and issues parser Issues surrounding the parser lexer Issues surrounding the lexer memory PRs and Issues related to the memory management or memory footprint. execution Issues or PRs related to code execution ast Issue surrounding the abstract syntax tree Internal Category for changelog labels Dec 22, 2021
@codecov
Copy link

codecov bot commented Dec 24, 2021

Codecov Report

Merging #1758 (0d38f91) into main (76a27ce) will decrease coverage by 1.29%.
The diff coverage is 59.01%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1758      +/-   ##
==========================================
- Coverage   57.02%   55.72%   -1.30%     
==========================================
  Files         199      201       +2     
  Lines       16842    17336     +494     
==========================================
+ Hits         9604     9661      +57     
- Misses       7238     7675     +437     
Impacted Files Coverage Δ
boa/src/builtins/function/mod.rs 33.57% <0.00%> (-2.65%) ⬇️
boa/src/builtins/typed_array/mod.rs 3.79% <0.00%> (-0.01%) ⬇️
boa/src/object/mod.rs 28.76% <ø> (-0.53%) ⬇️
boa_cli/src/main.rs 6.25% <0.00%> (+0.36%) ⬆️
boa_tester/src/exec/js262.rs 0.00% <ø> (ø)
boa_unicode/src/lib.rs 69.69% <ø> (ø)
boa_wasm/src/lib.rs 0.00% <0.00%> (ø)
boa/src/syntax/parser/statement/mod.rs 39.87% <33.03%> (-2.01%) ⬇️
boa/src/syntax/parser/function/mod.rs 45.28% <40.00%> (-4.37%) ⬇️
...arser/expression/primary/object_initializer/mod.rs 44.48% <41.75%> (-4.97%) ⬇️
... and 116 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 76a27ce...0d38f91. Read the comment docs.

@Razican Razican changed the base branch from main to justVM December 24, 2021 11:49
@bors bors bot changed the base branch from justVM to main December 25, 2021 17:56
@github-actions
Copy link

github-actions bot commented Dec 25, 2021

Test262 conformance changes

VM implementation

Test result main count PR count difference
Total 87,200 87,200 0
Passed 40,828 40,828 0
Ignored 19,493 19,493 0
Failed 26,879 26,879 0
Panics 0 0 0
Conformance 46.82% 46.82% 0.00%

@Razican
Copy link
Member Author

Razican commented Dec 25, 2021

I will try to solve #503 with this.

@Razican Razican added this to In progress in String interning Dec 26, 2021
@Razican Razican changed the title String interning Lexer string interning Dec 26, 2021
@Razican
Copy link
Member Author

Razican commented Dec 26, 2021

I tried to start implementing this for the parser, but the changes are huge. I created a project and will create new issues to implement the interner for the parser, the compiler and the executor, but I think this is ready for review and merge.

I did some benchmarks to see which backend was faster, and for now, I selected the fastest one in my machine. This might change once we implement the interner in more places of the engine.

@Razican Razican removed the execution Issues or PRs related to code execution label Dec 26, 2021
@Razican Razican added this to the v0.14.0 milestone Dec 26, 2021
Copy link
Member

@RageKnify RageKnify left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the interner in a global hidden by an API might make it easier to develop boa (since we wouldn't need to pass its reference around everywhere), but I don't feel too strongly either way.

@raskad
Copy link
Member

raskad commented Dec 28, 2021

Having the interner in a global hidden by an API might make it easier to develop boa (since we wouldn't need to pass its reference around everywhere), but I don't feel too strongly either way.

@Razican correct me if I'm wrong, but I think that would not work, because the interner must be specific to one parsed block of code, so it can be deallocated when the code block is being dropped right?

@Razican
Copy link
Member Author

Razican commented Dec 28, 2021

Having the interner in a global hidden by an API might make it easier to develop boa (since we wouldn't need to pass its reference around everywhere), but I don't feel too strongly either way.

@Razican correct me if I'm wrong, but I think that would not work, because the interner must be specific to one parsed block of code, so it can be deallocated when the code block is being dropped right?

If you don't de-allocate, you could use the same interner multiple times, in theory, but if the application is running for long, you might have a huge memory usage.

@RageKnify
Copy link
Member

RageKnify commented Dec 28, 2021

Yeah, I guess depending on the use case you would want to drop the Interner to avoid a never shrinking heap structure.
If boa were embedded in something like node then I don't think this would matter, right? Since once everything is parsed we shouldn't intern that many more strings.
Even in the browser, the only problem is a browser tab which keeps evaluating JS with new identifiers, right?

As I said, I can accept this choice, just not sure it makes much of a difference, if the Interner is part of the Context then the cli and future browsers (in the context of a given browser tab) would only have 1 at a time any ways.

Maybe for game engines or other embedding use cases it would be relevant to not use just 1 (Interner) since it would make sense to create new Contexts and drop old ones when the new ones are created. In that case I can see the possibility of the heap growing because the user keeps changing the name of their variables/functions with "updates" to their script.

(I feel this last point is good enough reason to keep it in the Context)

@Razican
Copy link
Member Author

Razican commented Dec 28, 2021

I'm also thinking on servers using it as their scripting language of choice. If they can run multiple scripts, in different days or so, I think this approach could be better.

Copy link
Member

@jasonwilliams jasonwilliams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I agree with Rageknify's concern around this being passed everywhere, but I don't see any other option from what has been discussed

/// The string interner for Boa.
///
/// This is a type alias that makes it easier to reference it in the code.
pub type Interner = StringInterner<BucketBackend<Sym>>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come you chose bucketbackend over stringbackend? Im guessing because the use of static

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Statics seem to be the reason. It probably makes sense that we try out the stringbackend when we use the interner in the parser.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason was that I tried locally all backends and this was the one giving better results, but I have no personal preference xD in the future once we use the interner everywhere we can benchmark again.

@raskad
Copy link
Member

raskad commented Jan 22, 2022

bors r+

bors bot pushed a commit that referenced this pull request Jan 22, 2022
This Pull Request is part of #279.

 It adds a string interner to Boa, which allows many types to not contain heap-allocated strings, and just contain a `NonZeroUsize` instead. This can move types to the stack (hopefully I'll be able to move `Token`, for example, maybe some `Node` types too.

Note that the internet is for now only available in the lexer. Next steps (in this PR or future ones) would include also using interning in the parser, and finally in execution. The idea is that strings should be represented with a `Sym` until they are displayed.

Talking about display. I have changed the `ParseError` type in order to not contain anything that could contain a `Sym` (basically tokens), which might be a bit faster, but what is important is that we don't depend on the interner when displaying errors.

The issue I have now is in order to display tokens. This requires the interner if we want to know identifiers, for example. The issue here is that Rust doesn't allow using a `fmt::Formatter` (only in nightly), which is making my head hurt. Maybe someone of you can find a better way of doing this.

Then, about `cursor.expect()`, this is the only place where we don't have the expected token type as a static string, so it's failing to compile. We have the option of changing the type definition of `ParseError` to contain an owned string, but maybe we can avoid this by having a `&'static str` come from a `TokenKind` with the default values, such as "identifier" for an identifier. I wanted for you to think about it and maybe we can just add that and avoid allocations there.

Oh, and this depends on the VM-only branch, so that has to be merged before :)

Another thing to check: should the interner be in its own module?
@bors
Copy link

bors bot commented Jan 22, 2022

Build failed:

@Razican
Copy link
Member Author

Razican commented Jan 22, 2022

Build failed:

I will rebase this, it seems to fail due to Rust 1.58 lints.

@Razican
Copy link
Member Author

Razican commented Jan 22, 2022

LGTM

I agree with Rageknify's concern around this being passed everywhere, but I don't see any other option from what has been discussed

One option here would be to have the interner in the parser, as a reference. Might make sense actually, and I might do so in the other PR, if you're OK with it.

@Razican
Copy link
Member Author

Razican commented Jan 22, 2022

I had to revert the upgrade to Wasm-bindgen 0.2.79 due to rustwasm/wasm-bindgen#2774.

@RageKnify
Copy link
Member

bors r+

bors bot pushed a commit that referenced this pull request Jan 22, 2022
This Pull Request is part of #279.

 It adds a string interner to Boa, which allows many types to not contain heap-allocated strings, and just contain a `NonZeroUsize` instead. This can move types to the stack (hopefully I'll be able to move `Token`, for example, maybe some `Node` types too.

Note that the internet is for now only available in the lexer. Next steps (in this PR or future ones) would include also using interning in the parser, and finally in execution. The idea is that strings should be represented with a `Sym` until they are displayed.

Talking about display. I have changed the `ParseError` type in order to not contain anything that could contain a `Sym` (basically tokens), which might be a bit faster, but what is important is that we don't depend on the interner when displaying errors.

The issue I have now is in order to display tokens. This requires the interner if we want to know identifiers, for example. The issue here is that Rust doesn't allow using a `fmt::Formatter` (only in nightly), which is making my head hurt. Maybe someone of you can find a better way of doing this.

Then, about `cursor.expect()`, this is the only place where we don't have the expected token type as a static string, so it's failing to compile. We have the option of changing the type definition of `ParseError` to contain an owned string, but maybe we can avoid this by having a `&'static str` come from a `TokenKind` with the default values, such as "identifier" for an identifier. I wanted for you to think about it and maybe we can just add that and avoid allocations there.

Oh, and this depends on the VM-only branch, so that has to be merged before :)

Another thing to check: should the interner be in its own module?
@bors
Copy link

bors bot commented Jan 22, 2022

Pull request successfully merged into main.

Build succeeded:

@bors bors bot changed the title Lexer string interning [Merged by Bors] - Lexer string interning Jan 22, 2022
@bors bors bot closed this Jan 22, 2022
String interning automation moved this from In progress to Done Jan 22, 2022
@bors bors bot deleted the feature/interner branch January 22, 2022 18:19
@Razican
Copy link
Member Author

Razican commented Jan 22, 2022

LGTM
I agree with Rageknify's concern around this being passed everywhere, but I don't see any other option from what has been discussed

One option here would be to have the interner in the parser, as a reference. Might make sense actually, and I might do so in the other PR, if you're OK with it.

Actually, now that I check it, the only way to do this would be to change all the methods to receive a &mut Parser, which would include a &mut Interner, but it might not be easy. I guess this should be done in a different PR.

bors bot pushed a commit that referenced this pull request Jan 23, 2022
This builds on top of #1758 to try to bring #1763 to life.

Something that should probably be done here would be to convert `JsString` to a `Sym` internally. Then, further optimizations could be done adding common strings to a custom interner type (those that we know statically).

This is definitely work in progress, but I would like to have feedback on the API, and feel free to contribute.

Co-authored-by: raskad <32105367+raskad@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ast Issue surrounding the abstract syntax tree enhancement New feature or request Internal Category for changelog lexer Issues surrounding the lexer memory PRs and Issues related to the memory management or memory footprint. parser Issues surrounding the parser performance Performance related changes and issues
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants