Consider UTF-8? #6570

Open
ppenzin opened this issue Jan 9, 2021 · 12 comments

Comments

@ppenzin
Member

ppenzin commented Jan 9, 2021

We are using UTF-16, which has some cross-platform compatibility issues, starting with the different types used on different platforms: wchar_t on Windows and char16_t everywhere else. The former is a fundamental type (not requiring an include), while the latter is defined as an unsigned short. This is thanks to 'nix toolchains picking UTF-32 (which is excessive for most uses) as the default Unicode type.
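To make the type difference concrete, here is a minimal sketch. It assumes a C++11 or later compiler where char16_t is a built-in type (a PAL-style typedef to unsigned short would behave differently), and `Utf16Unit`, `wcharBits` and `utf16UnitBits` are hypothetical names, not existing ChakraCore identifiers.

```cpp
#include <climits>
#include <cstddef>
#include <type_traits>

// Hypothetical alias (not an existing ChakraCore name) for one UTF-16 code unit.
using Utf16Unit = char16_t;

// wchar_t and char16_t are distinct types, even where both are 16 bits wide,
// so overloads and templates treat them differently.
static_assert(!std::is_same<wchar_t, char16_t>::value, "distinct types");

// Width of wchar_t in bits: 16 with MSVC on Windows,
// typically 32 with GCC/Clang on Linux and macOS.
constexpr std::size_t wcharBits() { return sizeof(wchar_t) * CHAR_BIT; }

// Width of the portable code unit type: 16 on mainstream platforms.
constexpr std::size_t utf16UnitBits() { return sizeof(Utf16Unit) * CHAR_BIT; }
```

Code written against an alias like this would not change width when moving between Windows and 'nix toolchains, which is the portability problem described above.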

UTF-8 is more portable across platforms and has been supported on Windows since Windows 10. For prior versions of Windows it should be possible to convert to UTF-16 for things that require it. A quick search shows that V8 might use UTF-8.
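For the "convert to UTF-16 for things that require it" part, a rough sketch of such a conversion might look like the following. The helper name `utf8ToUtf16` is hypothetical, and replacing malformed input with U+FFFD is an assumed policy, not something decided in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-16 code units.
// Malformed input (bad lead byte, truncated or invalid continuation,
// overlong form, encoded surrogate, value above U+10FFFF) becomes U+FFFD.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;   // continuation bytes expected after b0
        std::uint32_t min = 0;   // smallest code point valid for this length
        if (b0 < 0x80)                { cp = b0;        extra = 0; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; min = 0x10000; }
        else { out.push_back(0xFFFD); ++i; continue; }

        bool ok = i + extra < in.size();
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (!ok || cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(0xFFFD);
            ++i;  // resynchronise on the next byte
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    // Every consumed byte sequence yields at most as many units as it has bytes.
    assert(out.size() <= in.size());
    return out;
}
```

On Windows itself, the existing MultiByteToWideChar API with CP_UTF8 covers the same job, so a hand-written decoder like this would mainly matter for portable code paths.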

I am not sure whether or not UTF-8 would be easier for our text-related projects, like RegEx support. @rhuanjl do you have any thoughts?

@Fly-Style
Contributor

I think it's a good idea. UTF-8 has good cross-platform compatibility and is easier to support.
But I would also like to hear @rhuanjl's opinion.

@rhuanjl
Collaborator

rhuanjl commented Jan 9, 2021

Four reservations:

  1. The scale of this kind of change: this is not quick or easy.
  2. Existing APIs: primarily the GetStringPointer legacy API that some projects may use. It gives a pointer to the internal UTF-16 string, and currently it's an incredibly quick function call. Admittedly it's in the ChakraWindows header, which has been deprecated in its entirety for some time, but the Microsoft strategy had been to retain it indefinitely for backwards compat.
  3. RegExp: there is a feature that ChakraCore does not currently implement but is meant to, which is unicode-mode RegExp matching. In this mode each character is meant to be examined as a surrogate pair, which, if I understand correctly, only works properly in UTF-16.
  4. String methods like charCodeAt are all defined (in the JS spec) based on UTF-16.
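To illustrate reservation 4, here is a rough sketch of what charCodeAt would have to do if the backing store were UTF-8. The helper `charCodeAtUtf8` is hypothetical and assumes well-formed UTF-8. The point is that the index is a UTF-16 code unit index, so the lookup becomes a linear scan unless extra index data is kept.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, assuming `utf8` holds well-formed UTF-8.
// Returns the UTF-16 code unit that charCodeAt(index) would report,
// or -1 when index is past the end (JS itself returns NaN there).
int charCodeAtUtf8(const std::string& utf8, std::size_t index) {
    std::size_t unit = 0;  // UTF-16 index of the code point starting at byte i
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        const std::size_t len = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
        assert(i + len <= utf8.size());  // well-formed input is assumed

        // Payload bits of the lead byte, then 6 bits per continuation byte.
        std::uint32_t cp = (len == 1) ? b0 : (b0 & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }

        if (cp < 0x10000) {
            if (unit == index) return static_cast<int>(cp);
            unit += 1;
        } else {
            // Supplementary code points occupy two UTF-16 code units.
            cp -= 0x10000;
            if (unit == index) return static_cast<int>(0xD800 + (cp >> 10));
            if (unit + 1 == index) return static_cast<int>(0xDC00 + (cp & 0x3FF));
            unit += 2;
        }
        i += len;
    }
    return -1;
}
```

With UTF-16 storage the same operation is a single array read, which is the trade-off behind this reservation.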

This abandoned PR attempted to add support for UTF8 strings alongside UTF16: #5348

If we can come up with good answers to those reservations I think it could be a good change BUT it won't be quick and those are not easy issues.

@ppenzin
Member Author

ppenzin commented Jan 10, 2021

Yeah, I don't think this would be a quick change. Though UTF-16 in the JS spec would be a major barrier :) Sorry, I didn't know that.

Do you know what other engines do in this respect? I found something in V8 that is very similar to #5348 (thanks for pointing it out, by the way).

If we should not use UTF-8 as the main encoding, I can work on making our UTF-16 handling independent from PAL. Maybe we can add a convenience UTF-8 API as well, though it won't be very high priority.

@Penguinwizzard
Contributor

Penguinwizzard commented Jan 11, 2021

Moving to UTF-8 internally would significantly reduce memory usage in many scenarios, simply because you save 50% of the memory for strings of Basic Latin characters, which work out to 99.99% of all strings in JS.
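A small sketch of the size arithmetic behind this claim; `utf8ByteLength` and `utf16ByteLength` are hypothetical helper names. It shows the 50% saving for ASCII-only strings, and also that BMP text outside Latin (for example CJK) costs 3 bytes per character in UTF-8 versus 2 in UTF-16, while supplementary characters such as emoji cost 4 bytes in both.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes `s` would occupy if re-encoded as UTF-8.
std::size_t utf8ByteLength(const std::u16string& s) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u < 0x80) {
            bytes += 1;                       // ASCII
        } else if (u < 0x800) {
            bytes += 2;                       // e.g. Latin-1 Supplement, Greek, Cyrillic
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size() &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;                       // surrogate pair: one supplementary code point
            ++i;
        } else {
            bytes += 3;                       // rest of the BMP; a lone surrogate is counted as 3
        }
    }
    // UTF-8 never needs fewer bytes than the string has UTF-16 code units.
    assert(bytes >= s.size());
    return bytes;
}

// Bytes occupied by the same string stored as UTF-16.
std::size_t utf16ByteLength(const std::u16string& s) {
    return s.size() * sizeof(char16_t);
}
```

For `u"hello"` this gives 5 bytes against 10, for two CJK characters 6 against 4, so the net effect depends on the actual mix of strings an embedder sees.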

@ljharb
Collaborator

ljharb commented Jan 11, 2021

What evidence do you have to support that UTF-8 would cover 99.99% of all strings in JS?

@rhuanjl
Collaborator

rhuanjl commented Jan 11, 2021

> Moving to UTF-8 internally would significantly reduce memory usage in many scenarios, simply because you save 50% of the memory for strings of Basic Latin characters, which work out to 99.99% of all strings in JS.

IMO this is a really strong argument for it

How to ensure that the observable behaviour matches UTF16 for points defined in the spec is the key challenge though.

One option (if memory efficiency is the aim) would be to stick to 8-bit chars for ASCII characters and use UTF-16 strings whenever non-ASCII was needed, though this would be a complex change AND wouldn't target the initial motivation of this discussion (getting rid of the wchar type).
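A very rough sketch of that hybrid option, only to show the shape of it. `HybridString` is a hypothetical class, not an existing ChakraCore type, and a real engine string would need far more (ropes, substrings, GC integration).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical class: stores ASCII-only strings in 8-bit units and
// everything else as UTF-16, while exposing UTF-16 semantics either way.
class HybridString {
public:
    explicit HybridString(const std::u16string& s) {
        for (char16_t u : s) {
            if (u >= 0x80) { isNarrow_ = false; break; }
        }
        if (isNarrow_) {
            narrow_.reserve(s.size());
            for (char16_t u : s) narrow_.push_back(static_cast<char>(u));
        } else {
            wide_ = s;
        }
    }

    bool isNarrow() const { return isNarrow_; }

    // Length in UTF-16 code units, as the JS spec defines it.
    std::size_t length() const { return isNarrow_ ? narrow_.size() : wide_.size(); }

    // Constant time in both representations.
    char16_t charCodeAt(std::size_t i) const {
        assert(i < length());
        if (isNarrow_) {
            return static_cast<char16_t>(static_cast<unsigned char>(narrow_[i]));
        }
        return wide_[i];
    }

    // Bytes used by the character data itself.
    std::size_t payloadBytes() const {
        return isNarrow_ ? narrow_.size() : wide_.size() * sizeof(char16_t);
    }

private:
    bool isNarrow_ = true;
    std::string narrow_;
    std::u16string wide_;
};
```

Indexing stays O(1) and spec-visible behaviour is unchanged, but every string operation now has two code paths, which is where the complexity mentioned above comes from.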

@Penguinwizzard
Contributor

@ljharb nothing I can share, unfortunately. Also remember that it's not 99.99% by size, just by count: there's a ton of short strings used in a lot of places, both in benchmarks (nice, but not terribly impactful) and in popular web frameworks (hugely impactful, but complex for perf tests). Even in non-English pages, Latin characters tend to be the majority of JavaScript strings, due to things such as the internals of ad network and tracking code being larger than the site's actual content. If you run a crawler and take snapshots of memory, and then grab string objects from the JavaScript engine in that data, you can try to reproduce this info.

@rhuanjl
Collaborator

rhuanjl commented Jan 11, 2021

Our equation may be different, as ChakraCore is unlikely to be used in a web browser again anytime soon, BUT the memory efficiency would be a nice win for other apps that may embed it, if we can do it.

@Penguinwizzard
Contributor

There's the option of having multiple types, and using profile data to use the "right" type in each situation, but there are a lot of complications, e.g. the encoding of broken surrogate pairs if you try to transition to UTF-8.
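On the broken surrogate pair problem: JS strings may contain unpaired surrogates, and those have no valid UTF-8 encoding, so a transition would need a policy for them (replacing with U+FFFD loses information, a generalized encoding in the style of WTF-8 keeps it). A minimal sketch of detecting the case, with `hasLoneSurrogate` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` contains a surrogate code unit that is
// not part of a well-formed high+low pair.
bool hasLoneSurrogate(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {  // high surrogate
            if (i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                ++i;                       // well-formed pair, skip the low half
                continue;
            }
            return true;                   // high surrogate with no low half after it
        }
        if (u >= 0xDC00 && u <= 0xDFFF) {
            return true;                   // low surrogate with no high half before it
        }
    }
    return false;
}

// Usage sketch.
void loneSurrogateExamples() {
    assert(!hasLoneSurrogate(u"plain ascii"));
    assert(!hasLoneSurrogate(u"\U0001F600"));  // a proper pair
    // What "\uD83D" alone looks like inside a JS string.
    const std::u16string lone(1, static_cast<char16_t>(0xD83D));
    assert(hasLoneSurrogate(lone));
}
```

Strings that pass this check can round-trip through strict UTF-8; the rest would need the special handling mentioned above.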

@ppenzin
Member Author

ppenzin commented Jan 12, 2021

> How to ensure that the observable behaviour matches UTF16 for points defined in the spec is the key challenge though.

I think we would have to do some conversions between UTF-8 and UTF-16 when needed, for example for codePointAt.
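For reference, this is roughly the behaviour a UTF-8 backed implementation would have to reproduce for codePointAt, sketched here directly over UTF-16 code units. The free function `codePointAt` is a hypothetical stand-in, not engine code. Note that the position is a code unit index, and that an index landing on the second half of a pair returns that half on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for String.prototype.codePointAt over UTF-16 storage.
// Precondition: i < s.size() (JS returns undefined otherwise).
char32_t codePointAt(const std::u16string& s, std::size_t i) {
    assert(i < s.size());
    const char16_t first = s[i];
    if (first >= 0xD800 && first <= 0xDBFF && i + 1 < s.size()) {
        const char16_t second = s[i + 1];
        if (second >= 0xDC00 && second <= 0xDFFF) {
            // Combine a well-formed surrogate pair into one code point.
            return 0x10000 + ((static_cast<char32_t>(first) - 0xD800) << 10) +
                   (second - 0xDC00);
        }
    }
    // BMP code unit, or an unpaired surrogate returned as-is.
    return first;
}

// Usage sketch.
void codePointAtExamples() {
    const std::u16string s = u"a\U0001F600";    // 'a' followed by one emoji
    assert(codePointAt(s, 0) == U'a');
    assert(codePointAt(s, 1) == U'\U0001F600'); // index of the high half
    assert(codePointAt(s, 2) == 0xDE00u);       // index of the low half
}
```

With UTF-8 storage the same call would first need to map the code unit index to a byte offset, which is the conversion cost being discussed.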

> One option (if memory efficiency is the aim) would be to stick to 8-bit chars for ASCII characters and use UTF-16 strings whenever non-ASCII was needed, though this would be a complex change AND wouldn't target the initial motivation of this discussion (getting rid of the wchar type).

This is somewhat close to how UTF-8 works: it is backwards-compatible with ASCII, but adds more characters outside of that set.

On a side note, going browserless might tilt the balance even more towards the ASCII-like subset of Unicode, though it may depend on actual use cases.

@ljharb
Collaborator

ljharb commented Jan 12, 2021

With codePointAt, with String.fromCodePoint, with String.prototype[Symbol.iterator], with any regular expression operations with the u flag, any usage of Intl, and likely many many other instances.

@ljharb
Collaborator

ljharb commented Jan 12, 2021

Also, I'm very skeptical that the very high usage of emoji in vernacular discussion hasn't leaked into JS strings.
