Why should strings be lists of Unicode Scalar Values? #135
Comments
Wrapping my head around why this is necessary, I found it funny that the reason this issue even exists is that we have a range of values dedicated to indicating extension in a 16-bit scenario, values that do not overlap with the encoded values (each has a unique value) regardless of whether we are actually using a 16-bit encoding. Most encodings that provide extension values overlap with the values being encoded: in UTF-8, for example, byte values 0x80..0xFF indicate that an extension is present, but these are not unique as code points, since 0x80..0xFF are actual characters when interpreted as code points. So this question never comes up there. UTF-16 would have been better off with an overlapping encoding, but I guess that would have made it less easy for software to ignore that UTF-16 is actually variable-width.
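To make the contrast concrete, here is a small illustrative sketch (added for this writeup, using only standard JS APIs):

```js
// UTF-8's "extension" bytes overlap the code point space: 0x80-0xFF serve as
// lead/continuation byte values, yet they are also real code points.
new TextEncoder().encode("é");      // Uint8Array [0xC3, 0xA9] - a multi-byte sequence
String.fromCodePoint(0xE9);         // "é" - 0xE9 is itself a valid code point

// UTF-16's extension markers (surrogates) are carved out of the code point
// space itself: 0xD800-0xDFFF are never valid scalar values on their own.
"😀".charCodeAt(0).toString(16);    // "d83d" - a high surrogate, meaningful only in a pair
String.fromCodePoint(0xD83D);       // a lone surrogate: storable in a JS string,
                                    // but not a Unicode Scalar Value
```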
Interesting thought :) Makes me wonder in turn if the Unicode standard could reasonably settle this decades-long issue by adopting what WTF-8 does, and perhaps, say, assigning the visual representation � to (lone) surrogate code points as well. Then it would overlap and round-trip both ways (that is, if I am not missing something), and most notably render future discussions for or against obsolete. Perhaps not labeling it "potentially ill-formed" but "relaxed", "lenient" or "practical" would have helped the situation as well, heh. It would require, however, that UTF-8 implementations merge previously split surrogates upon concatenation, and yeah, that would be a considerably large nudge. Too large, most likely.
Thanks for the thorough writeup, Luke :)
I have been one of the against votes, so perhaps it makes sense to explain why I voted this way. My impression was that it was too early to decide on all the next steps. String semantics/encodings in particular have been hotly discussed in the past without resolution so far, so it felt a little odd to me to tie this question (i.e. concretely proposing USVs/UTF-8 as if it were natural) to the component model. I worried that the group had not been sufficiently informed on this particular ingredient before polling, which was my motivation for rushing out my presentation with my concerns the weekend before, to provide some background. I recognize that this may not have been your intention, of course, but this was my thought process at the time. I appreciated that you clarified during the meeting that we were not actually polling on strings, but now I must admit that I am a little confused, as the presence of the component model is being used as an argument. I also voted neutral on the general direction of the component model, because I do not see how it helps the more Web-focused use cases I am seeing and anticipating. Now, I was not against it (if others want components, I would be fine with it), but if it turns out that its existence is used to justify harming other straightforward use cases (say, where one's component is basically the combination of Wasm + JavaScript), I would decide differently in the future when similar questions arise.
Here, I believe the "robust" part goes both ways:
I understand of course that
but I think this is a rather weak argument when weighed against making occasional breakage the default for many languages. I, and perhaps others as well, would prefer a component that works 100% of the time with an additional check over a component that works just 99.9% of the time while risking anything from annoyances to hazards otherwise. A fuzzer, for instance, will find this reliably, and so will millions of unintentionally fuzzing users who are not necessarily aware of all the ins and outs of string encodings. Also
may be true, but is in my opinion not a very compelling precedent for what should happen in between two function calls. The more modular Wasm becomes, the harder it will be to tell where a function lives, or whether a string argument/return crosses a boundary or not. Plus, what may work today may stop working in the future, and we are certainly on a trajectory towards more breakage, not less. As such I do not think that basing
on the above two reasons is very meaningful. The typical case for these languages will most likely be to interface with modules written in the same language, or with JavaScript. Perhaps also WASI here or there, but WASI mostly consumes strings, say as file system paths, for which a "not found" result is fine, while otherwise returning either raw bytes or well-formed strings anyhow. As such I would question the practical value of this reason. And of course this does not only apply to
in that some languages, by string API design, make it overly easy to accidentally split a surrogate pair in half (it can be as easy as a
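(The concrete call is elided above; as a hypothetical JavaScript illustration, a plain code-unit slice is already enough:)

```js
// U+1F600 is one code point but two UTF-16 code units (0xD83D 0xDE00), so
// slicing at a code-unit index can strand half of the pair.
const s = "ab😀";
const cut = s.slice(0, 3);        // "ab" plus the lone high surrogate 0xD83D
cut.charCodeAt(2).toString(16);   // "d83d" - an unpaired surrogate
```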
I can appreciate the direction here of at least keeping the door open to solve it in the Web embedding, but I would once more want to emphasize that this is not a problem exclusive to the Web embedding. I do not see, for instance, how a toolchain would decide whether to utilize a different type or just use a string.

On a more general note, it was obvious to me since the beginning that higher-level languages like Java will have a better stab at seamlessly integrating with the Web platform, not only because some of them share a string encoding with JavaScript, but also because they already have matching concepts of, say, strings being references, that in the ideal case can be shared with JavaScript, say with GC where everything lives in a common heap anyway. The component model with its restrictions, on the other hand, seems as if it is on a different trajectory that may in the future even influence other proposals as a kind of precedent, and this makes me really sad, because I always hoped we could embrace the sheer potential of a future where JavaScript and WebAssembly become one ("component").

Lastly, in the presence of a proper escape hatch for affected languages, I would be fine with a default
I don't think it's possible to decide an issue like this one in the absence of an agreed-upon set of goals, use cases and requirements, which is what "the component model" refers to. Now that we have strong CG agreement on this scope, it gives us the appropriate context in which to discuss this question. I don't think there's an alternative approach to hard questions like this.
It's totally reasonable not to be particularly motivated by the component model -- it's not expected to address 100% of use cases or be a universal answer to all interoperability questions; that's why we've adopted a layered approach. However, it's clear that there are many other folks who are strongly motivated by these goals, so I don't think we can simply set aside the component model in this discussion. For a different set of goals, a different proposal is appropriate.
It won't be random, as it will happen quite regularly and independently of the context in which the component is embedded. In contrast, allowing surrogates to cross component boundaries would lead to random (from the perspective of an individual component) failures. This is the crux of the matter when you consider the full goals of the component model. I agree that if you're restricting your set of goals to focus more exclusively on JS and the Web this is less of a concern, but that's not the context of this layered proposal.

I want to reemphasize another point, which is: like wasm, the component model is not a one-shot standard that has to have everything from day 1. Like wasm, our goal is to start with an MVP and iterate based on real-world experience. Thus, the question to ask isn't "do there exist any use cases for passing surrogates?" but, rather, "will the initial release of the component model not be viable without the ability to pass surrogates between components?". The data we have here, the years of experience of folks working on Web standards, suggests that surrogates are not necessary, which is supported by all the standards evolution described above.
Given Murphy's law, what you are proposing seems unnecessarily risky. The real-world experience we would be obtaining is, to some extent, (rare) breakage, and for me that is not a proper foundation to build a house on. I would once more like to remind everyone of the claim that post-MVP "is just an optimization", which is what we voted on, but which as far as I am concerned remains unproven, plus not everyone in the group may have been properly informed about it when placing their vote. As such I would argue that we are better off starting with the more inclusive string semantics that are true to the just-an-optimization aspect and give us the foundation we need to iterate in the future. This adds three more options at the end of the day:
Pre-existing "experience of folks working on Web standards" doesn't convince me at least, especially because we are going to compile a lot of stuff to the Web that hasn't been there before. Btw, it could be as integrated as adding an option like
Today, whenever JS talks to the outside world (through HTTP APIs, JSON, gRPC, etc.), surrogates are replaced with replacement characters, and no one considers this data loss because that's simply the expectation when talking to the outside world. As shown in this slide, the component model is not meant to take the place of language-specific modules/packages/libraries but, rather, to encapsulate linked collections of these (making components more like lightweight processes). Thus, the component model is explicitly adopting the same "talking to the outside world" model, where surrogates are not expected. If we are really worried about silent breakage, though, the fix would be to have surrogates trap instead of producing replacement characters, so the errors could be caught early and fixed easily. I'm open to discussing that more. (Also, just to clarify, "post-MVP" doesn't refer to a singular follow-up proposal (adding adapter functions), but rather a long sequence of feature proposals over time, the same as with core wasm. Thus, post-MVP is not restricted to only being for optimization by any means.)
I am still not convinced that these APIs, which require well-formedness under the hood by definition of being HTTP APIs,

The theoretical other extreme would be to consider that every string API call should sanitize, which would break languages straight away. We are somewhere in the middle, and given that there are even more APIs in JS for instance (that can be considered modules), and that these deliberately do not sanitize since there is no reason to risk that, I would say we are much closer to function calls here. Unless we want to encourage shipping monoliths only, perhaps, but I am not sure that's a goal :) (Btw, I'd absolutely prefer replacement over erroring, for separate reasons, if my viewpoint cannot find consensus. I appreciate UTF-16/Latin-1 being considered.)
I think there is plenty of room in the wasm ecosystem for new ABIs specifically designed for allowing closely-related languages to integrate more tightly than the component model allows. This is already the case with the existing tooling-conventions ABI, which is the basis for C/C++/Rust(/FORTRAN?) to link together and pass pointers to memory and functions back and forth. I can imagine another, totally different ABI designed specifically around native-JS integration that could be more like what you want. But for the component model, the virtualization goals imply that a component should never assume it knows the language (and whether it's wasm or native) of its imports nor of the caller of its exports.
Do you think this could be designed in a way that it becomes composable? Say, either use just Interface Types to achieve something lightweight as I am envisioning, or Component Model over Interface Types for stronger guarantees? Like, so far it really only differs in

One way to achieve this perhaps could be if JS could participate in a Component (achieving Wasm + JS inside) as if it were just one of many modules, while outside of the component we'd enforce stricter guarantees that are useful in more complex scenarios.
Some use cases I have in mind are:
On the other hand, there is one fun use case made possible by the USV restriction: wasm-string-sanitize, a practically zero-code string sanitizer that works universally across every environment supporting Interface Types. Not saying that someone should build that, as it would lead the whole thing ad absurdum, but perhaps it is good to keep in mind that someone could indeed build this.
I apologize if this is a digression (but maybe it can help?). It seems to me that there are really two use cases here:
(Btw, note that wasm on the web would use option 2 when it wants to use interface types to communicate with wasm components; option 1 is just about interop with JS.) I think there is no perfect way to have optimal web interop as well as optimal UTF usage at the same time / with the same code. We will have the downside of people compiling differently when JS interop is their preference, but I believe that will be the same with a wasm VM embedded in the JVM or CLR, where, again, using the native string may be important for speed and compatibility. So in practice we may see these two patterns of usage,
Sadly, option 1 is not feasible yet, and probably won't be for a long time. One pain point, for example, is that modules typically ship with static string data: without Interface Types, without GC, with neither of them considering WTF-16 a proper encoding, and with no ETA on Type Imports (or knowledge of what these can actually do to initialize strings), I am not seeing anything of significance happening in the near future. This is all connected at the end of the day, and simply choosing UTF-8 now because it is "the best" encoding, even though I presented reasonable concerns and the solution to them is rather trivial, would be something I would not understand. Given how important strings are, this sentiment has the potential to ruin it for some languages, including AssemblyScript, from double re-encoding to occasional breakage and whatnot, and if that's what the open standard chooses, even though it asked me for my first-hand implementer feedback, then I really don't know what I am doing here anymore.
I don't follow your example with static string data. Can you not use a

Overall I believe option 1 works today (in browsers with reference types), just without inlining it is somewhat slow if you do operations on the string. If you mean it is not feasible without inlining, then I think a simple inlining proposal, which is just a performance hint, parallel to branch-hinting, could in fact ship before Interface Types. (Not that the order really matters, but just to respond to your concern.) I may work on this myself, in fact, given the multiple use cases that have come up around it.

I'm not ignoring the rest of your comment, but I don't have the unicode expertise to comment on the UTF-8/WTF-16 details here. My point is that, regardless of that debate, if I were compiling a language to wasm GC with the goal of JS interop, then I would use
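As a rough sketch of "option 1" (illustrative only; the import names and wasmBytes are assumptions, not an established ABI), a module can keep strings as opaque externref values and import the string operations it needs from JS:

```js
// Strings stay JS strings, passed into wasm as externref; the module calls
// back out for every string operation. Hypothetical import layout.
const imports = {
  str: {
    length: (s) => s.length,
    charCodeAt: (s, i) => s.charCodeAt(i),
    concat: (a, b) => a + b,
    fromCharCode: (c) => String.fromCharCode(c),
  },
};
const { instance } = await WebAssembly.instantiate(wasmBytes, imports);
// Without inlining, each operation pays for a wasm->JS call, which is the
// performance concern mentioned above.
```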
Sadly no,

On the other hand, if we could just pass strings over Interface Types boundaries without double re-encoding and potential breakage, we'd be much better off, also because we could then keep specializing our string methods for the static type system we are using. And all it takes to achieve that is to make well-formedness the default, while giving us the option to
If timeliness is the concern, my expectation, which I included in the CG pres, is that browser native implementations will trail significantly behind actual use of components and thus, for the first few years, the way components are used on the Web will be via AOT transpilation into core wasm + JS API performed by bundlers (which is the same path ESMs took). So interface types is by no means the "fast path" for solving the JS-specific problems of how to efficiently produce and consume JS strings; proposing additions to the JS API could be both faster and more appropriately scoped.
All I am asking for is that Interface Types considers JavaScript/C#/Java/AssemblyScript string encoding important enough to avoid unnecessary double re-encoding and not to force potential breakage on these languages. If it refuses to account for this concern, we'll enter an era of C, Rust and similar languages being the only ones viable for a long time. I don't know how you think about this, but this has the potential to backfire so very badly that I cannot even describe it. And AssemblyScript would probably still find a way around it, say by switching to UTF-8 in very unpleasant ways, only because JS-like languages are simply not viable, but still, wow. What's even happening here, after I informed you about the problem for 3 1/2 years :(
+1 for IT not being a fast path for anything on the Web. In that same vein, there's plenty of time and appetite to take implementor feedback into account both for the MVP and for follow-on extensions as implementations start to appear. @dcodeIO, I think we all understand that IT will be less appropriate (in terms of correctness, performance, and convenience) for the use cases you are concerned about and that it will be more appropriate for other use cases instead. I don't think it's clear that this will be as catastrophic as you're arguing it will be, though. If it does turn out to cause as much developer pain as you anticipate, we certainly can and should act on that. In short, I think this conversation would benefit greatly from developer feedback on real prototypes and on an MVP, and I don't anticipate anyone changing their minds without that feedback.
Right, it's neither correct, nor performant, nor convenient, while it is all of these for the first-class club of languages. And JavaScript is not one of them. To me that is catastrophic on many levels :( At this point, though, I do not know what to say anymore. Thank you for taking the time to discuss this with me :)
Interesting, and surprising. I see there is a long list of existing encodings (including
Only for
AssemblyScript is catching up in popularity (3rd place) among languages targeting WebAssembly for one main reason: the similarity it has with JavaScript, the most widely-used programming language in the world (plus AS is done very well!). This is a huge boon for JavaScript web developers: (source)

Simple interop will be a great thing to have for the JavaScript-WebAssembly use case, which is definitely growing thanks to AssemblyScript, and if it keeps going at this pace, it is going to pass Rust and C++. C programmers can write string encoders with their eyes closed; I'd bet it is better to make the string stuff easy for the JavaScript devs right from the get-go instead. 😊 I'm nowhere near an expert on this topic, but that's the feeling I get from reading this thread.
If we want to consider
This is a fixable bug in how Chromium and Gecko interface with Windows. (They should switch from pre-XP native events to XP-or-later native events.) It's not fundamental to the Web Platform given that Trident/EdgeHTML didn't have this problem on Windows and engines on other operating systems don't need to replicate this oddity. It would be really bad to design for this Web engine Windows integration bug. Notably, there is no use case for holding onto the ephemeral
Regardless of the supported value space, as long as the representation in Wasm memory is either UTF-8 or WTF-8, this seems really inefficient. Realistically, if there were C#/Java-like string manipulation routines operating on this level of granularity and worthwhile exposing to JS, chances are that it would make sense to expose them in a manner that explicitly exposed the 16-bit-code-unit representation. Also, this use case is for a family of languages that have the same string value space among themselves and isn't an appropriate motivation for interface types for general cross-language interop.
The UTFs themselves make the loss of unpaired surrogates mandatory, so "Unicode-like" here would have to mean "wobbly" for this argument to make sense, which would make this use case tautological: requiring wobbliness in order to support wobbly encodings, but that's not a use case that would explain why a wobbly encoding would be needed.
For those who haven't seen it yet, we are about to "Poll for maintaining single list-of-USV string type".

I would appreciate it if we could at least talk about my suggested solutions first and establish a definitive commitment to UTF-16 support in the canonical ABI before polling, as I think that would lead to a more constructive outcome than what is being proposed currently. IMO it is too early to decide the USV question without that.
In particular I'd like to talk about "Integrated W/UTF-any" as an alternative to single list-of-USVs.
It largely preserves what is proposed here as its default (except conceptually lifting "List of Unicode Code Points amending surrogate pairs"), but simplifies matters for users and toolchains in that all of the following questions do not have to be asked and answered (by WTF-16 languages in particular), due to not having to statically determine what kind of mechanism to use:
On the contrary, with "Integrated W/UTF-any", a consumer can categorically set the

Wouldn't that be generally preferable in the current and future landscape of languages we want to support?
I believe that, when you ask what is the net effect of such a design, where you have an ecosystem of components with some setting
Yeah, I guess I have a concern for the chairs, thanks.
I looked through the WebIDL and found the following APIs which use USVString: new Blob

Most of these APIs are dealing with URLs, the filesystem, or server requests/responses. However, TextEncoder/TextDecoder are quite important, because they are used to send strings between JS/Wasm. This means that right now it is impossible to send unpaired surrogates from JS to Wasm (or from Wasm to JS). So effectively this enforces that all strings must be valid Unicode. If Interface Types allows unpaired surrogates, then we should also change TextEncoder/TextDecoder to accept unpaired surrogates.
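Concretely, the current behavior looks like this (illustrative):

```js
// TextEncoder's argument is a USVString, so a lone surrogate is replaced
// with U+FFFD before encoding; the UTF-8 round trip is therefore lossy.
const lone = "a\uD800b";                       // perfectly valid as a JS string value
const bytes = new TextEncoder().encode(lone);  // [0x61, 0xEF, 0xBF, 0xBD, 0x62]
new TextDecoder().decode(bytes);               // "a\uFFFDb", not "a\uD800b"
```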
I think this is a little too strong since
As @lukewagner alluded to earlier, adding a WTF-16 encoder and decoder specific to some JS/Web API is exactly the sort of language-aware feature that could be pursued at a level lower than the component model. Is this something you'd be interested in?
I'd be much more interested in not carelessly breaking stuff that cannot change anyhow ;). P.S. 6 posts left until my presentation goes out of view.
@dcodeIO, unless I'm misinterpreting, it looks like AS actually performed replacement of isolated surrogates when transferring a string from Wasm to JS, until this recent patch to the loader in June. It was previously using

I appreciate that you've recently patched the code to always preserve isolated surrogates, but did anyone complain about the character replacement before, or notice the new preservation behaviour?
Yes, I noticed that this was broken by a change that was supposed to be an optimization. Apart from that, interfacing with JS from AS is such a major pain, because this is all so broken and nobody cares, that only I and a couple of former Wasm enthusiasts are really using it.
Yes, it looks like this behaviour has been in place from September 2020 to June 2021 AssemblyScript/assemblyscript#1471 - the offending "optimization".
Excuse me, but are you trying to suggest...?
Not at all - I'm giving the context that this was quite a long-lived bug. In fact I agree with you that given the desired behaviour, the supposed "optimization" was not semantics-preserving. That being said, it doesn't seem like it broke anyone's AS code to the point that they raised it as an issue. EDIT: the above comment has been destructively edited in a way that changes the meaning of my reply - there are certainly some things I'm trying to suggest.
Really?
Sometimes github is more dramatic than TV. Now if I tell my friends I'm watching a drama about unpaired surrogates, I wonder what they will think I mean...?
Yes, "offending" in the sense that it was the cause of your bug, and "supposed" in the sense that it wasn't really an optimization (since it didn't preserve the semantics you wanted). Apologies if you took any other meaning. I think the key point here is that, while it was clearly a bug, it somewhat demonstrates that the body of existing AS code out there was not broken by long-term changes in sanitisation behaviour. I should also note that consensus on the USV abstraction doesn't mean that AS has to again change its sanitisation behaviour when interop-ing with JS (just don't do it through a component boundary).
To whoever is responsible for code of conduct compliance... I continue to be dumbfounded by this issue.
I apologize if this is a little bit of a red herring, but having read the thread, I think it might be worth asking. People generally recognize that UTF-16/UCS-2 were kind of an evolutionary dead-end that arose due to historical accident, right? And that WTF-8/WTF-16 are a somewhat weird way of coping with that? My question is simply: why are we continuing to carry the baggage of this evolutionary oddity? UTF-8 is around 30 years old, is without a doubt the de facto standard, and has won the user-facing encoding wars. Can we not give a gift to our future selves and future implementors by dropping support for other forms?
This has already been discussed many times in side threads. Calling UTF-16 obsolete will not change the encoding used by languages such as Java, C#, Dart, JavaScript, etc. They would all have to convert UTF-16 to UTF-8 and back, even between themselves or with a browser. This is slow, it requires validation in user space, and no one can guarantee that the validation was done at all, or done correctly. Once again, even UTF-8 takes the legacy of UTF-16 into account (yep!): the range associated with surrogates is invalid in UTF-8 as well.
@crertel This issue isn't really about encodings per se (since Interface Types supports all encodings, including UTF-16, UTF-32, etc.) Interface Types must support every encoding, because various languages use different encodings (Java, C#, and JavaScript use WTF-16... Python and Haskell use Unicode code points... Swift uses grapheme clusters... Rust, C, and C++ use UTF-8... etc.) It is not an option to change every language (and every library), so Wasm must interop with the native string types of many different languages (and many different encodings). In order to achieve this, every language creates an adapter function, which converts from its native string type into a list of Unicode code points. This allows every language to use its own native encoding and still interop with every other language. However, this means that all strings must be valid Unicode. Invalid Unicode (such as unpaired surrogates) cannot be used. Normally this isn't a problem, since invalid Unicode is generally a bug. However, in some very niche situations languages will want to preserve the invalid Unicode, and they can't do that with the Interface Types proposal. In particular, JavaScript uses WTF-16, and so JavaScript strings can contain invalid Unicode. With the current proposal, it's not possible for JS to send an invalid string to Wasm (or for Wasm to send an invalid string to JS). Some people think that's a good thing, and that invalid strings should not be allowed. Other people think that invalid strings should be allowed, because JS allows them.
@Pauan, thank you for the very clear and concise summary!
The concern here is, a bit more precisely, that Interface Types wants to restrict its
...unpaired surrogates (pairs are valid and are transformed normally), which can exist due to how many UTF-16 language APIs are designed for backwards-compatibility reasons, and which would either need to be replaced or trigger an exception. That is not only unnecessary but can lead to many problems for these languages when communicating with themselves or JavaScript. My general suggestion is that we should allow them to pass their idiomatic strings (i.e.

More context on why list-of-USV is not necessarily a good idea can be found in my summary slides above if there is interest.
I've added one more slide to my summary slides outlining the suggestion of adding a separate
In WebAssembly/gc#145 (comment), @littledan nicely summarizes my perspective on both the rationale for and the concerns against the decision made:
In my own terms, I would say that the cost of (sometimes redundantly) validating strings seems likely reasonable in inter-trust settings, which is the new layer that the earlier poll on the component model has added to Interface Types. But in settings that are inter-language but still intra-trust, such as the examples that @littledan gives in his comment (e.g. wasm-JS interop/integration), these validations come with costs but not proportionate benefits. So maybe what these discussions mean is that there's room and need for something that enables (more efficient) inter-language (and shared-nothing) linking and exchange without the burdens inherent in crossing trust boundaries.

If that's the case, then there's good news. In developing session types for Interface Types, I also figured out the lower-level constructs that could be added to WebAssembly (or as a layer just above WebAssembly) to make efficient shared-nothing exchange possible. Interface Types can then be trust-bridging libraries built on top of these primitives, where the lift/lower instructions are actually exports of the libraries (that can still be fused despite being abstract exports). And people can build their own libraries (such as one like what @kmiller68 suggested in the meeting, and others in other threads, where strings are host/GC references managed by the library) on top of these primitives.

If there's interest in developing such an inter-language extension to WebAssembly, I'm happy to develop the technology more. But I don't have the time to drum up interest, so that work would have to be done by others, whether it's IT people wanting to develop a common core for engines to support to facilitate IT, or people seeing needs not served by IT, or the two groups together; I'm happy whichever way. And if there's no interest, I'm fine with sitting on the tech (and maybe turning it into a research paper).
@RossTate I am more than interested to hear more about your idea. I believe that for tight intra-component linking of small modules with a shared encoding, there is definitely some interesting space to explore, from IT over externref to ABIs. What about creating a new issue for this?
Glad to hear this might fit the space you were looking to fill! Unfortunately I'm in the midst of a big paper deadline (on efficient cross-language interop, albeit of a different sort), so I don't have the time to go into more detail on this for the next couple weeks. But how about you give me a poke if I've failed to follow up on this in a month? Would that be alright?
@RossTate of course!
The CG voted in favor of maintaining the single list-of-USV string type in the 08/03 meeting (notes), and the polls can be found here. Closing this issue, as its original intent has been resolved; but as there are outstanding issues/approaches that still need discussion, please file new issues so these can be resolved/discussed.
This issue lays out the reasoning for why I think strings should be lists of Unicode Scalar Values (as currently written in the explainer). This is a fairly nuanced question with the reasoning currently scattered around a number of issues, repos and specs, so I thought it would be useful to collect it all into one focused issue for discussion. The issue reflects discussions with a bunch of folks recently and over the years (@annevk, @hsivonen, @sunfishcode, @fgmccabe, @tschneidereit, @domenic), so I won’t claim credit for the reasoning. Also, to be clear, this issue only answers half of the overall question about string encoding, but I think it’s the first question we have to answer before we can meaningfully talk about string encodings.
(Note: I intend to update the OP in-place if there are any inaccuracies so that it represents a coherent argument.)
First, a bit of context:
Current proposal
As background, the Unicode standard provides two relevant definitions:
Based on these definitions, the current explainer proposes:
- the char interface type is a USV
- the string interface type is an abbreviation for list char
Thus, string, as currently proposed, contains no surrogates (not just no lone surrogates). For reference: a pair of surrogate Code Units in a valid UTF-16 string is decoded into a single USV, and thus valid UTF-16-encoded strings will never decode to strings containing any surrogates.

This is not an encoding or in-memory-representation question
The question of whether strings are lists of Unicode Scalar Values is not a question of encoding or memory representation; rather, it’s a question of: “what are the abstract string values produced by decoding and consumed by encoding?”. Without precisely defining what the set of possible abstract string values is, we can’t even begin to discuss string encoding/decoding since we don’t even know what it is we’re trying to encode or decode. This is especially true in the context of Interface Types, where our goal is to support (via adapter functions) fully programmable encoding/decoding in the future.
Thus, if we’re talking about the abstract strings represented by languages like Java, JS and C#, we’re not talking about “WTF-16” (which is an encoding); we’re talking about “lists of code points not containing surrogate pairs (but potentially containing lone surrogates)”, which for brevity I’ll call Wobbly strings, since Wobbly strings are what a Java/JS/C# string can be faithfully decoded into and encoded from. In particular, a Wobbly string can be encoded by either WTF-8 or WTF-16. Note that the set of Wobbly strings is subtly different and smaller than “lists of Code Points” because surrogate pairs decode into necessarily-non-surrogate code points, so there is no way for a Java/JS/C# string to decode into a surrogate pair. The only major languages I know of whose abstract strings are actually “lists of Code Points” are Python 3 and Haskell.
This is a Component Model question
As of our recent CG-05-25 polls, the Interface Types proposal now has the goals and requirements of the Component Model (as presented and summarized). Concretely, this means we're explicitly concerned with cross-language/toolchain composition, virtualizability and embeddability, which means we're very much concerned with whether interfaces using string will be consumable and implementable by a wide variety of languages and hosts with robust, portable behavior. Thus, use cases exclusively focused on particular combinations of languages+hosts may need to be solved by separate proposals targeting those specific languages+hosts if they are in conflict with the explicit goals of broad language/host interoperability.

With all this context in place, I'll finally get to the reasons for defining string to be a list of USVs:

Reason 1: many languages have no good way to consume surrogates
I think there are a few categories of affected languages (this is based on brief spelunking, so let me know if I got this wrong and I’ll update it):
First, there are languages that simply fix UTF-8 for their built-in string type, in some cases exposing UTF-8 representation details directly in their string operations. The popular languages I found in this category are: Elixir, Julia, Rust and Swift.
Second, there are languages which define strings as “arbitrary arrays of bytes”, leaving the interpretation up to the library functions that operate on them. For the languages in this category that I looked into, the default encoding (for source text and string literals and sometimes built-in syntax like iteration) is increasingly implicitly assumed to be UTF-8 (due to the fact that, as detailed below, most I/O data is UTF-8). While it may seem like these languages have the most flexibility (and thus ability to accommodate surrogates), when porting existing code, the implicit dependency on UTF-8 (in the form of calls to UTF-8-assuming library functions scattered around the codebase) makes targeting anything other than UTF-8 challenging. The popular languages I found in this category are: C/C++, Go, Lua, PHP and Zig.
Third, there are languages that support a variety of encodings and conversion between them, but still disallow surrogates (among other reasons being that they aren’t generally transcodable). The popular languages I found in this category are: R and Ruby.
In all of these categories, the author of the toolchain that is binding the language to the Interface Types string has no great general option for what to do when given a surrogate:

For any particular use case, one of these options may be obvious. However, toolchains have to handle the general case, providing good defaults. In addition to the variable ecosystem cost of the different options, there is also a fixed non-negligible cost in wasted time for the N teams working on the N language toolchains, each of which will have to page in this whole problem space and wade through the space of options. In contrast, with a list of USVs, all the above languages can just do the obvious thing they're already doing.
Reason 2: strings will often need to be serialized over standardized protocols and media formats, which usually disallow surrogates
A common use of Interface Types will be to describe I/O APIs (e.g., for passing data over networks or reading/writing different media formats). Additionally, several of the component model’s virtualizability use cases involve mocking non-I/O APIs in terms of I/O (e.g., turning a normal import call into an RPC, logging call parameters and results, etc). In both these cases, surrogates run in direct conflict with the binary formats of most existing standard network protocols and standard media formats.
In particular, just considering Web-relevant character sets:
On the Web, new APIs and formats created over the last 10 years simply mandate UTF-8, including:
- the json and text getter functions of the fetch, XHR and Blob APIs.
Thus, insofar as a string needs to transit over any of these protocols, formats or APIs, surrogates will be a problem, and the implementer of the mapping will have roughly the same bad options listed above as the language toolchains have.

While it's tempting to say "that's just a specific precondition of particular APIs, not the string type's problem", the virtualization goals of the component model mean that any interface might be virtualized, so the fact that a string is being used for one of the above is not a detail of the API. In contrast, all these protocols and formats can easily represent lists of USVs.

Reason 3: even the WTF-16 languages will have a bad time if they actually try to pass surrogates across a component boundary
Because of the above two reasons, from the perspective of a WTF-16-language-implemented component, it is a very risky proposition to pass a surrogate across a component boundary (as the parameter of an import or the result of an export). Why? Because there's no telling whether the other side will trap, convert the surrogate into a replacement character, mangle it, or trigger undefined/untested behavior. As an author of a component, there's also not a fixed set of clients or hosts (that's the point of components).
Thus, to produce widely-reusable and portable components, even a toolchain for a language that allows lone surrogates would be advised to conservatively scrub these before passing strings to the outside world. In a sense, this is nothing new on the Web: despite JSON being derived from JS, JSON doesn't allow surrogates while JS does, thus there is an inherent scrubbing process that happens when JS communicates with the outside world via JSON (and similarly with WebSockets, fetch(), etc.). Accordingly, the WTF spec specifically advises against its ever being used outside of "self-contained systems".
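For illustration, the conservative scrub described here is the operation that newer JS engines expose directly as String.prototype.toWellFormed() (ES2024):

```js
// Replace each lone surrogate with U+FFFD before handing a string to the
// outside world; well-formed surrogate pairs are untouched.
const scrub = (s) => s.toWellFormed();

scrub("a\uD800b");  // "a\uFFFDb" - lone surrogate replaced
scrub("ab😀");      // "ab😀"     - the surrogate pair is preserved
```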
As an illustrative example: consider instead defining string to be a list of Code Points. As explained above, this would mean string was a superset of the Wobbly strings supported by Java/JS/C#. Why might we do this? For one thing, it would capture the full expressive range of Python 3 and Haskell strings and APIs (which is the same argument for supporting Wobbly strings, just for a smaller set of languages). For another, it would give us a simple definition of char (= Code Point) and string (= list char), which has a number of practical benefits (in contrast to Wobbly strings, which cannot be a "list char" for any definition of "char"). However, now the vast majority of languages and hosts would have to resort to a variant of the abovementioned workarounds, which means Python 3 and Haskell would have a Bad Time attempting to actually take advantage of this increased string expressivity. Thus, there would be a distributed cost without a commensurate distributed benefit. I think the situation is the same with Wobbly strings, even if the partitioning of languages is different.

What about binary data in strings?
One potential argument for surrogates is that they may be necessary to capture arbitrary binary data, particularly on the Web. To speak to this concern, it's important to first clarify something: Web IDL has a ByteString type that is used for APIs (predominantly HTTP header methods like Headers.get()), where a ByteString is intentionally an arbitrary array of bytes. However, ByteString does this not by interpreting a JS string as a raw array of uint16s (which would have a problem representing byte strings of odd length), but by requiring each JS string element (a uint16 value) to be in the range [0, 255], throwing an exception otherwise. Since surrogates are outside the range [0, 255], this means that in the one place in the Web Platform where binary data is actually appropriate, surrogates are irrelevant.

Outside ByteString use cases, there's still a theoretical possibility of wanting to round-trip binary data through DOMString APIs. Talking to folks who have worked for years on the Web IDL and Encoding specs (@annevk, @hsivonen, @domenic), they're not aware of any valid use cases for such usage of DOMString. Indeed, the TextDecoder API does not provide any way to produce a non-USVString, due to this same lack of use cases. In fact, there is currently no direct way (i.e., not involving String.fromCharCode et al) to decode an array of bytes into a non-USVString on the Web Platform today.

Instead, the natural way for a component to pass binary data is a list u8 or list u16, using JS glue code to convert the byte array into a JS string. If these use cases were found, and found to be on performance-sensitive paths in real workloads on the Web, then it seems like a Web-specific solution would be appropriate, and I can think of a number of options for how to optimize this path by adding things to the JS API. But ultimately, as an optimization, I don't think this is something we should preemptively add without compelling data.
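To spell out the ByteString mechanics mentioned above (a simplified sketch; the normative algorithm lives in the Web IDL spec):

```js
// Web IDL ByteString conversion, roughly: every UTF-16 code unit must fit in
// one byte, otherwise a TypeError is thrown. Since surrogates are >= 0xD800,
// they can never appear in a ByteString.
function toByteString(s) {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (u > 0xFF) throw new TypeError("not a valid ByteString");
    bytes[i] = u;
  }
  return bytes;
}
toByteString("OK\x01");  // Uint8Array [0x4F, 0x4B, 0x01]
```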