Consider UTF-8? #6570
Comments
I think it's a good idea. UTF-8 has good cross-platform compatibility and is easier to support.
Four reservations:
This abandoned PR attempted to add support for UTF-8 strings alongside UTF-16: #5348
If we can come up with good answers to those reservations, I think it could be a good change, but it won't be quick and those are not easy issues.
Yeah, I don't think this would be a quick change. Though UTF-16 in the JS spec would be a major barrier :) Sorry, I didn't know that. Do you know what other engines do in this respect? I found something in V8 that is very similar to #5348 (thanks for pointing it out, by the way). In case we should not use UTF-8 as the main encoding, I can work on making our UTF-16 handling independent from PAL. Maybe we can add a convenience UTF-8 API as well, though it won't be very high priority.
Moving to UTF-8 internally would significantly reduce memory usage in many scenarios, simply because you save 50% of the memory for Basic Latin characters, which work out to 99.99% of all strings in JS.
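To make the 50% figure concrete, here is a minimal sketch that compares the storage a string needs in each encoding. The helper names `utf8ByteLength` and `utf16ByteLength` are hypothetical and not part of ChakraCore; counting a lone surrogate as 3 bytes (the size of U+FFFD) is an assumption of the sketch.

```typescript
// Bytes needed to store a JS string as UTF-8, computed from its UTF-16 code units.
function utf8ByteLength(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1; // ASCII: one byte
    } else if (unit < 0x800) {
      bytes += 2; // e.g. Latin-1 Supplement, Greek, Cyrillic
    } else if (
      unit >= 0xd800 && unit <= 0xdbff &&
      i + 1 < s.length &&
      s.charCodeAt(i + 1) >= 0xdc00 && s.charCodeAt(i + 1) <= 0xdfff
    ) {
      bytes += 4; // valid surrogate pair: one 4-byte sequence
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // rest of the BMP; lone surrogates counted as U+FFFD (assumption)
    }
  }
  return bytes;
}

// Bytes needed to store the same string as UTF-16: two per code unit.
function utf16ByteLength(s: string): number {
  return s.length * 2;
}
```

For an ASCII identifier such as `getElementById` this gives 14 bytes against 28, which is the 50% saving. The saving is not universal: a three-character CJK string takes 9 bytes in UTF-8 against 6 in UTF-16.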
What evidence do you have to support that UTF-8 would cover 99.99% of all strings in JS?
IMO this is a really strong argument for it. The key challenge, though, is how to ensure that the observable behaviour matches UTF-16 for the points defined in the spec. One option (if memory efficiency is the aim) would be to stick to 8-bit chars for ASCII characters and use UTF-16 strings whenever non-ASCII is needed, though this would be a complex change and wouldn't address the initial motivation of this discussion (getting rid of the wchar type).
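A rough sketch of that two-representation idea follows. The `FlatString` type and both helpers are hypothetical names for illustration, not ChakraCore code; the point is only that index-based access can return the same UTF-16 code unit whichever storage is chosen.

```typescript
// Hypothetical string storage: 8-bit units when every character is ASCII,
// 16-bit units otherwise.
type FlatString =
  | { kind: "ascii"; units: Uint8Array }
  | { kind: "utf16"; units: Uint16Array };

function makeFlatString(s: string): FlatString {
  let ascii = true;
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 0x7f) {
      ascii = false;
      break;
    }
  }
  if (ascii) {
    const narrow = new Uint8Array(s.length);
    for (let i = 0; i < s.length; i++) narrow[i] = s.charCodeAt(i);
    return { kind: "ascii", units: narrow };
  }
  const wide = new Uint16Array(s.length);
  for (let i = 0; i < s.length; i++) wide[i] = s.charCodeAt(i);
  return { kind: "utf16", units: wide };
}

// Observable behaviour is unchanged: both storages are indexed by UTF-16
// code unit, and out-of-range reads yield NaN like String.prototype.charCodeAt.
function flatCharCodeAt(fs: FlatString, index: number): number {
  const unit = fs.units[index];
  return unit === undefined ? NaN : unit;
}
```

Because an ASCII character is a single code unit in UTF-16 as well, lengths and indices agree between the two storages, which is what keeps this option compatible with the spec. It does not remove the 16-bit character type from the codebase.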
@ljharb Nothing I can share, unfortunately, and remember that it's not 99.99% by size, just by count: there are a ton of short strings used in a lot of places, both in benchmarks (nice, but not terribly impactful) and in popular web frameworks (hugely impactful, but complex for perf tests). Even in non-English pages, Latin characters tend to be the majority of JavaScript strings, due to things such as the internals of ad network and tracking code being larger than the site's actual content. If you run a crawler, take snapshots of memory, and then grab string objects from the JavaScript engine in that data, you can try to reproduce this info.
Our equation may be different, as ChakraCore is unlikely to be used in a web browser again anytime soon, but the memory efficiency would be a nice win for other apps that may embed it, if we can do it.
There's the option of having multiple types, and using profile data to pick the "right" type in each situation, but there are a lot of complications, e.g. the encoding of broken surrogate pairs if you try to transition to UTF-8.
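The surrogate problem is easy to reproduce: a JS string is a sequence of 16-bit code units and may legally contain half of a surrogate pair, while well-formed UTF-8 has no encoding for a surrogate code point. A minimal check for such strings, with the hypothetical helper name `hasLoneSurrogate`, could look like this:

```typescript
// Returns true if the string contains a high surrogate without a following
// low surrogate, or a low surrogate without a preceding high surrogate.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // well-formed pair, skip the low half
        continue;
      }
      return true; // unpaired high surrogate
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // unpaired low surrogate
    }
  }
  return false;
}
```

A string such as `"\uD83D"` has length 1 and is a valid JS value, so any UTF-8 backed representation has to either fall back to UTF-16 for it or use a non-standard extension of UTF-8 that can round-trip the lone unit.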
I think we would have to do some conversions between UTF-8 and UTF-16 when needed, for example for
This is somewhat close to how UTF-8 works: it is backwards-compatible with ASCII, but adds more characters outside of that set. On a side note, going browserless might tilt the balance even more towards the ASCII-like subset of Unicode, though it may depend on actual use cases.
With codePointAt, with String.fromCodePoint, with |
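For reference, the kind of UTF-16 to UTF-8 conversion discussed above can be sketched in a few lines. `encodeUtf8` is a hypothetical helper, not an engine API, and it assumes that unpaired surrogates are replaced with U+FFFD, which makes the conversion lossy for such strings.

```typescript
// Encode a JS string (UTF-16 code units) as an array of UTF-8 byte values.
function encodeUtf8(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    let cp = s.charCodeAt(i);
    // Combine a valid surrogate pair into one code point above U+FFFF.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < s.length) {
      const low = s.charCodeAt(i + 1);
      if (low >= 0xdc00 && low <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (low - 0xdc00);
        i++;
      }
    }
    // Assumption of this sketch: unpaired surrogates become U+FFFD.
    if (cp >= 0xd800 && cp <= 0xdfff) cp = 0xfffd;

    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return out;
}
```

ASCII input passes through byte for byte, which is the backwards compatibility mentioned above. Everything else needs the branch logic, so any API whose result is defined in terms of UTF-16 code units would pay for a conversion or an index translation on a UTF-8 backed string.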
Also, I'm very skeptical that the very high usage of emoji in vernacular discussion hasn't leaked into JS strings. |
We are using UTF-16, which has some cross-platform compatibility issues, starting with the different types used on different platforms: wchar_t on Windows and char16_t everywhere else, with the former being a fundamental type (not requiring an include) but the latter defined as an unsigned short. This is thanks to 'nix toolchains picking UTF-32 (which is excessive for most uses) as the default Unicode type.

UTF-8 is more cross-platform, and has been supported in Windows since Win10. For prior versions of Windows it should be possible to convert to UTF-16 for things that require it. A quick search shows that V8 might use UTF-8.
I am not sure whether or not UTF-8 would be easier for our text-related projects, like RegEx support. @rhuanjl do you have any thoughts?