Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String interning #279

Closed
HalidOdat opened this issue Mar 17, 2020 · 11 comments
Closed

String interning #279

HalidOdat opened this issue Mar 17, 2020 · 11 comments
Assignees
Labels
memory PRs and Issues related to the memory management or memory footprint. performance Performance related changes and issues
Milestone

Comments

@HalidOdat
Copy link
Member

Will there be a string interner for more affective memory use and string comparisons.

Like rustc does it.

@HalidOdat
Copy link
Member Author

More specifically rustc interner.

Rustc Symbol doc:

An interned string.

Internally, a `Symbol` is implemented as an index, and all operations
(including hashing, equality, and ordering) operate on that index. The use
of `rustc_index::newtype_index!` means that `Option<Symbol>` only takes up 4 bytes
...

Which makes comparisons between symbols trivial, it is just a comparison between two u32.

The cost is when interning a string, since it requires to do a HashMap lookup.
Getting the string from a symbol is trivial since it is only indexing into an array(Vec).

I found a implementation of this string-interner

@jasonwilliams
Copy link
Member

Yep, this is interesting.
The plan was always to move to something more efficient but start with strings first. (just to make it easier for implementation against the spec)

I had a conversation the with Spidermonkey devs about how we use strings for internal slots, but later want to move to something more faster. I think they do something like this.

I don't know very much about https://crates.io/crates/string-interner so this will be new to me, but im happy to let this issue be the conversation point.

@jasonwilliams
Copy link
Member

https://matklad.github.io/2020/03/22/fast-simple-rust-interner.html

@HalidOdat
Copy link
Member Author

Nice! This is pretty mush how rustc does it. and how string-interner crate is implemented.

@Razican Razican added the performance Performance related changes and issues label Apr 1, 2020
@Razican
Copy link
Member

Razican commented Apr 13, 2020

I think that the string-interner crate is the way to go. It already provides a really nice, safe interner, and they are even discussing the optimisations laid out in the post you mentioned in Robbepop/string-interner#16.

This could probably allow avoiding the use of String in TokenKind, which would make it Copy pretty easily, and reduce the memory footprint and improve the CPU cache usage.

@Razican
Copy link
Member

Razican commented Apr 21, 2020

I will give this a look, and maybe #336 too, at the same time.

@Razican Razican self-assigned this Apr 29, 2020
@HalidOdat
Copy link
Member Author

HalidOdat commented Apr 30, 2020

Hi @Razican

Correct me if I'm wrong!!!

String interning only has a good effect if the string that's it is interning is supposed to be immutable, like identifiers.
I don't think string interning should be used in ValueData::String since it can mutate.
For Example:

var x = "";
for(var i = 0; i < 100; i++) {
	x += "1"
}

Without string interning in ValueData::String, x will be one string in memory.
With string interning in ValueData::String there will be in memory "", "1", "11", "111", etc.
And these values will last till interner goes out of scope (pretty much the end of the program).
So were in a way leaking memory (not really we still have ponters to the strings in memory).

@Razican
Copy link
Member

Razican commented Apr 30, 2020

String interning only has a good effect if the string that's it is interning is supposed to be immutable, like identifiers.

Yes, I was just trying a couple of things, but it does seem that way!

@Razican
Copy link
Member

Razican commented Jul 15, 2020

Interesting review of interners here. This could maybe co-exist with #565, or replace it.

With this review in mind, it's maybe worth taking a look at lasso too.

@Razican
Copy link
Member

Razican commented Jan 11, 2021

This might partially conflict with #736. We might need to create our own interner that allows using UTF-16 strings.

@Razican Razican added the memory PRs and Issues related to the memory management or memory footprint. label Dec 22, 2021
@Razican Razican added this to In progress in String interning Dec 26, 2021
bors bot pushed a commit that referenced this issue Jan 22, 2022
This Pull Request is part of #279.

 It adds a string interner to Boa, which allows many types to not contain heap-allocated strings, and just contain a `NonZeroUsize` instead. This can move types to the stack (hopefully I'll be able to move `Token`, for example, maybe some `Node` types too.

Note that the internet is for now only available in the lexer. Next steps (in this PR or future ones) would include also using interning in the parser, and finally in execution. The idea is that strings should be represented with a `Sym` until they are displayed.

Talking about display. I have changed the `ParseError` type in order to not contain anything that could contain a `Sym` (basically tokens), which might be a bit faster, but what is important is that we don't depend on the interner when displaying errors.

The issue I have now is in order to display tokens. This requires the interner if we want to know identifiers, for example. The issue here is that Rust doesn't allow using a `fmt::Formatter` (only in nightly), which is making my head hurt. Maybe someone of you can find a better way of doing this.

Then, about `cursor.expect()`, this is the only place where we don't have the expected token type as a static string, so it's failing to compile. We have the option of changing the type definition of `ParseError` to contain an owned string, but maybe we can avoid this by having a `&'static str` come from a `TokenKind` with the default values, such as "identifier" for an identifier. I wanted for you to think about it and maybe we can just add that and avoid allocations there.

Oh, and this depends on the VM-only branch, so that has to be merged before :)

Another thing to check: should the interner be in its own module?
bors bot pushed a commit that referenced this issue Jan 22, 2022
This Pull Request is part of #279.

 It adds a string interner to Boa, which allows many types to not contain heap-allocated strings, and just contain a `NonZeroUsize` instead. This can move types to the stack (hopefully I'll be able to move `Token`, for example, maybe some `Node` types too.

Note that the internet is for now only available in the lexer. Next steps (in this PR or future ones) would include also using interning in the parser, and finally in execution. The idea is that strings should be represented with a `Sym` until they are displayed.

Talking about display. I have changed the `ParseError` type in order to not contain anything that could contain a `Sym` (basically tokens), which might be a bit faster, but what is important is that we don't depend on the interner when displaying errors.

The issue I have now is in order to display tokens. This requires the interner if we want to know identifiers, for example. The issue here is that Rust doesn't allow using a `fmt::Formatter` (only in nightly), which is making my head hurt. Maybe someone of you can find a better way of doing this.

Then, about `cursor.expect()`, this is the only place where we don't have the expected token type as a static string, so it's failing to compile. We have the option of changing the type definition of `ParseError` to contain an owned string, but maybe we can avoid this by having a `&'static str` come from a `TokenKind` with the default values, such as "identifier" for an identifier. I wanted for you to think about it and maybe we can just add that and avoid allocations there.

Oh, and this depends on the VM-only branch, so that has to be merged before :)

Another thing to check: should the interner be in its own module?
@Razican
Copy link
Member

Razican commented Jan 31, 2022

I think we can close this issue, as we are now using an interner in most of the engine.

@Razican Razican closed this as completed Jan 31, 2022
String interning automation moved this from In progress to Done Jan 31, 2022
@Razican Razican added this to the v0.14.0 milestone Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
memory PRs and Issues related to the memory management or memory footprint. performance Performance related changes and issues
Projects
No open projects
Development

No branches or pull requests

3 participants