Consider UTF-8? #6570

ppenzin · 2021-01-09T03:10:28Z

We are using UTF-16, which has some cross-platform compatibility issues, starting with the different types used on different platforms: wchar_t on Windows and char16_t everywhere else, with the former being a fundamental type (not requiring an include) and the latter defined as an unsigned short. This is thanks to 'nix toolchains picking UTF-32 (which is excessive for most uses) as the default Unicode type.

UTF-8 is more cross-platform, and has been supported in Windows since Win10. For prior versions of Windows it should be possible to convert to UTF-16 for things that require it. A quick search shows that V8 might use UTF-8.

I am not sure whether or not UTF-8 would be easier for our text-related projects, like RegEx support. @rhuanjl do you have any thoughts?

Fly-Style · 2021-01-09T07:32:18Z

I think it's a good idea. UTF-8 has good cross-platform compatibility and is easier to support.
But I would also like to hear @rhuanjl's opinion.

rhuanjl · 2021-01-09T11:42:50Z

Four reservations:

The scale of this kind of change - this is not quick or easy.
Existing APIs - primarily the GetStringPointer legacy API that some projects may use. It gives a pointer to the internal UTF-16 string, and it is currently an incredibly quick function call. Admittedly it's in the ChakraWindows header, which has been deprecated in its entirety for some time, but the Microsoft strategy had been to retain it indefinitely for backwards compatibility.
RegExp - there is a feature that ChakraCore does not currently implement but is meant to: unicode-mode RegExp matching. In this mode each character is meant to be examined as a surrogate pair, which, if I understand correctly, only works properly in UTF-16.
String methods like charCodeAt are all defined (in the JS spec) based on UTF-16.
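
To make the last two reservations concrete, here is a minimal sketch of how a code point maps to UTF-16 code units, and how index-based accessors see those units. The helper names (`encodeUtf16`, `charCodeAt`, `codePointAt`) are hypothetical stand-ins that loosely mirror the JS methods; they are not ChakraCore functions.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Code points above U+FFFF become a surrogate pair (two code units).
std::u16string encodeUtf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}

// Loosely mirrors charCodeAt: indexes code units, not characters.
// The caller must pass an index below s.size().
char16_t charCodeAt(const std::u16string& s, std::size_t i) {
    return s[i];
}

// Loosely mirrors codePointAt: combines a valid surrogate pair starting at i,
// otherwise returns the single code unit at i.
char32_t codePointAt(const std::u16string& s, std::size_t i) {
    char16_t hi = s[i];
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.size()) {
        char16_t lo = s[i + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                           + (static_cast<char32_t>(lo) - 0xDC00);
        }
    }
    return hi;
}
```

With UTF-16 storage both accessors are constant-time array reads, which is the property any other internal encoding would have to emulate.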

This abandoned PR attempted to add support for UTF8 strings alongside UTF16: #5348

If we can come up with good answers to those reservations I think it could be a good change BUT it won't be quick and those are not easy issues.

ppenzin · 2021-01-10T05:28:48Z

Yeah, I don't think this would be a quick change. Though UTF-16 in JS spec would be a major barrier :) Sorry, I didn't know that.

Do you know what other engines do in this respect? I found something in V8 that is very similar to #5348 (thanks for pointing it out, by the way).

In case we should not use UTF-8 as the main encoding - I can work on making our UTF-16 handling independent from PAL. Maybe we can add a convenience UTF-8 API as well, though it won't be very high priority.

Penguinwizzard · 2021-01-11T20:21:17Z

Moving to using UTF-8 internally would significantly reduce memory usage in many scenarios, just because you end up saving 50% of memory usage for basic latin characters, which work out to 99.99% of all strings in js.
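
As a rough illustration of the 50% figure, the sketch below compares the storage needed for the same text in UTF-16 and in UTF-8. The function names are hypothetical, and the input is assumed to be a `std::u16string` rather than any ChakraCore string type.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to store s as UTF-16: two bytes per code unit.
std::size_t utf16ByteLength(const std::u16string& s) {
    return s.size() * sizeof(char16_t);
}

// Bytes needed to store the same text as UTF-8.
// A lone surrogate is counted as 3 bytes here (a simplification).
std::size_t utf8ByteLength(const std::u16string& s) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        if (u < 0x80) {
            bytes += 1;                       // ASCII
        } else if (u < 0x800) {
            bytes += 2;
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size() &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;                       // surrogate pair -> one 4-byte sequence
            ++i;
        } else {
            bytes += 3;                       // rest of the BMP
        }
    }
    return bytes;
}
```

For `u"hello"` this gives 5 bytes in UTF-8 against 10 in UTF-16. The saving is specific to ASCII-range text: a three-character CJK string takes 9 bytes in UTF-8 against 6 in UTF-16.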

ljharb · 2021-01-11T20:51:24Z

What evidence do you have to support that UTF-8 would cover 99.99% of all strings in JS?

rhuanjl · 2021-01-11T21:18:48Z

> Moving to using UTF-8 internally would significantly reduce memory usage in many scenarios, just because you end up saving 50% of memory usage for basic latin characters, which work out to 99.99% of all strings in js.

IMO this is a really strong argument for it

How to ensure that the observable behaviour matches UTF16 for points defined in the spec is the key challenge though.

One option (if memory efficiency is the aim) would be to stick to 8 bit chars for ascii characters and use utf16 strings whenever non-ascii was needed - though this would be a complex change AND wouldn't target the initial motivation of this discussion (getting rid of the wchar type)
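
A minimal sketch of that option, assuming a hypothetical `HybridString` type that picks its storage once at construction time. None of these names exist in ChakraCore; the point is only that `charCodeAt` stays a constant-time lookup in both representations.

```cpp
#include <cstddef>
#include <string>

// Hypothetical two-representation string: 8-bit storage when every code unit
// is ASCII, UTF-16 storage otherwise. Only one of the two members is used.
struct HybridString {
    bool isAscii;
    std::string narrow;     // used when isAscii is true
    std::u16string wide;    // used when isAscii is false
};

HybridString makeHybrid(const std::u16string& s) {
    bool ascii = true;
    for (char16_t u : s) {
        if (u >= 0x80) { ascii = false; break; }
    }
    HybridString h{ascii, {}, {}};
    if (ascii) {
        for (char16_t u : s) h.narrow.push_back(static_cast<char>(u));
    } else {
        h.wide = s;
    }
    return h;
}

// Observable behaviour is the same UTF-16 code unit either way.
// The caller must pass an index within bounds.
char16_t charCodeAt(const HybridString& h, std::size_t i) {
    return h.isAscii ? static_cast<char16_t>(h.narrow[i]) : h.wide[i];
}

// Payload size in bytes, ignoring the struct's own overhead.
std::size_t payloadBytes(const HybridString& h) {
    return h.isAscii ? h.narrow.size() : h.wide.size() * sizeof(char16_t);
}
```

The cost of this design is that every string operation needs a branch (or a second code path) on the representation, which is where the complexity mentioned above comes from.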

Penguinwizzard · 2021-01-11T22:12:32Z

@ljharb nothing I can share, unfortunately, and remember that it's not 99.99% by size, just by count - there's a ton of short strings used in a lot of places, both in benchmarks (nice, but not terribly impactful) and in popular web frameworks (hugely impactful, but complex for perf tests). Even in non-English pages, latin characters tend to be the majority of javascript strings due to things such as the internals of ad network and tracking code being larger than the site's actual content. If you run a crawler and take snapshots of memory, and then grab string objects from the javascript engine in that data, you can try to reproduce this info.

rhuanjl · 2021-01-11T22:14:51Z

Our equation may be different, as ChakraCore is unlikely to be used in a web browser again anytime soon, BUT the memory efficiency would be a nice win for other apps that may embed it, if we can do it.

Penguinwizzard · 2021-01-11T22:21:04Z

There's the option of having multiple types, and using profile data to use the "right" type in each situation, but there are a lot of complications - e.g. encoding of broken surrogate pairs if you try to transition to UTF-8.
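
The broken-surrogate problem arises because a JS string may hold an unpaired surrogate code unit, while well-formed UTF-8 has no encoding for code points in the range U+D800 to U+DFFF. A small sketch of a check that a converter would need before transitioning a string; `hasLoneSurrogate` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>

// Returns true if s contains a surrogate code unit that is not part of a
// valid high+low pair. Such a string cannot be converted to well-formed UTF-8
// without losing information or using a non-standard encoding.
bool hasLoneSurrogate(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                ++i;  // skip the low half of a valid pair
                continue;
            }
            return true;
        }
        if (u >= 0xDC00 && u <= 0xDFFF) {
            return true;  // low surrogate with no preceding high surrogate
        }
    }
    return false;
}
```

A string that fails this check would have to stay in UTF-16 or use some agreed-upon escape, which is one reason a multi-type scheme needs a fallback path.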

ppenzin · 2021-01-12T04:53:11Z

> How to ensure that the observable behaviour matches UTF16 for points defined in the spec is the key challenge though.

I think we would have to do some conversions between UTF-8 and UTF-16 when needed, for example for codePointAt.
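
A sketch of what such a conversion could look like for an index-based lookup, assuming the string is stored as well-formed UTF-8 in a `std::string`. It uses the simpler charCodeAt-style case, and the names `decodeUtf8` and `charCodeAtUtf8` are hypothetical. The function walks the bytes from the start and counts UTF-16 code units until it reaches the requested index.

```cpp
#include <cstddef>
#include <string>

// Decode one code point starting at pos and advance pos past it.
// Assumes well-formed UTF-8; no validation is done in this sketch.
char32_t decodeUtf8(const std::string& s, std::size_t& pos) {
    unsigned char b0 = static_cast<unsigned char>(s[pos++]);
    if (b0 < 0x80) return b0;
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;  // continuation bytes
    char32_t cp = b0 & (0x3F >> extra);                   // payload bits of the lead byte
    for (int k = 0; k < extra; ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    }
    return cp;
}

// Returns the UTF-16 code unit at the given UTF-16 index, or -1 if the index
// is out of range. Linear in the length of the string.
int charCodeAtUtf8(const std::string& utf8, std::size_t index) {
    std::size_t pos = 0;
    std::size_t unit = 0;  // UTF-16 code units seen so far
    while (pos < utf8.size()) {
        char32_t cp = decodeUtf8(utf8, pos);
        if (cp < 0x10000) {
            if (unit == index) return static_cast<int>(cp);
            unit += 1;
        } else {
            char32_t v = cp - 0x10000;
            if (unit == index) return static_cast<int>(0xD800 + (v >> 10));
            if (unit + 1 == index) return static_cast<int>(0xDC00 + (v & 0x3FF));
            unit += 2;
        }
    }
    return -1;
}
```

The lookup is O(n) instead of O(1), so a real implementation would likely need an ASCII fast path or a cached index to avoid turning simple loops over a string into quadratic work.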

> One option (if memory efficiency is the aim) would be to stick to 8 bit chars for ascii characters and use utf16 strings whenever non-ascii was needed - though this would be a complex change AND wouldn't target the initial motivation of this discussion (getting rid of the wchar type)

This is somewhat close to how UTF-8 works: it is backwards-compatible with ASCII, but adds more characters outside of that set.

On a side note, going browserless might tilt the balance even more towards the ASCII-like subset of Unicode, though it may depend on actual use cases.

ljharb · 2021-01-12T04:56:20Z

With codePointAt, with String.fromCodePoint, with String.prototype[Symbol.iterator], with any regular expression operations with the u flag, any usage of Intl, and likely many many other instances.

ljharb · 2021-01-12T04:57:00Z

Also, I'm very skeptical that the very high usage of emoji in vernacular discussion hasn't leaked into JS strings.

ppenzin mentioned this issue Jan 9, 2021

PAL compatibility issues #6404

Open

rhuanjl added the Discussion label Mar 14, 2021
