Why should strings be lists of Unicode Scalar Values? #135
Comments
Wrapping my head around why this is necessary, I found it funny that the reason this issue even exists is that we have a range of values dedicated to indicating extension in a 16-bit scenario, values that do not overlap with the encoded values (each has a unique value) regardless of whether we are actually using a 16-bit encoding. Most encodings that provide extension values overlap with the values being encoded: in UTF-8, for example, byte values 0x80..0xFF indicate that an extension is present, but these are not unique as code points, since 0x80..0xFF are actual characters when interpreted as code points. So this question never comes up there. UTF-16 would have been better off with an overlapping encoding, but I guess that would have made it less easy for software to ignore that UTF-16 is actually variable-width.
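To make the contrast concrete, here is a small illustrative sketch (added for this writeup, using only standard JS APIs):

```js
// UTF-8's "extension" bytes overlap the code point space: 0x80-0xFF serve as
// lead/continuation byte values, yet they are also real code points.
new TextEncoder().encode("é");      // Uint8Array [0xC3, 0xA9] - a multi-byte sequence
String.fromCodePoint(0xE9);         // "é" - 0xE9 is itself a valid code point

// UTF-16's extension markers (surrogates) are carved out of the code point
// space itself: 0xD800-0xDFFF are never valid scalar values on their own.
"😀".charCodeAt(0).toString(16);    // "d83d" - a high surrogate, meaningful only in a pair
String.fromCodePoint(0xD83D);       // a lone surrogate: storable in a JS string,
                                    // but not a Unicode Scalar Value
```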
Interesting thought :) Makes me wonder in turn if the Unicode standard could reasonably settle this decades-long issue by adopting what WTF-8 does, and perhaps, say, assigning the visual representation � to (lone) surrogate code points as well. Then it would overlap and round-trip both ways (that is, if I am not missing something), and most notably render future discussions for or against obsolete. Perhaps not labeling it "potentially ill-formed" but "relaxed", "lenient" or "practical" would have helped the situation as well, heh. It would require, however, that UTF-8 implementations merge previously split surrogates upon concatenation, and yeah, that would be a considerably large nudge. Too large, most likely.
Thanks for the thorough writeup, Luke :)
I have been one of the against votes, so perhaps it makes sense to explain why I voted this way. My impression was that it was too early to decide on all the next steps. String semantics/encodings in particular have been hotly discussed in the past without resolution so far, so it felt a little odd to me to tie this question (i.e. concretely proposing USVs/UTF-8 as if it were natural) to the component model. I worried that the group had not been sufficiently informed on this particular ingredient before polling, which was my motivation for rushing out my presentation with my concerns the weekend before, to provide some background. I recognize that this may not have been your intention, of course, but this was my thought process at the time. I appreciated that you clarified during the meeting that we were not actually polling on strings, but now I must admit that I am a little confused, as the presence of the component model is being used as an argument. I also voted neutral on the general direction of the component model, because I do not see how it helps the more Web-focused use cases I am seeing and anticipating. Now, I was not against it (if others want components, I would be fine with it), but if it turns out that its existence is used to justify harming other straightforward use cases (say, where one's component is basically the combination of Wasm + JavaScript), I would decide differently in the future when similar questions arise.
Here, I believe the "robust" part goes both ways:
I understand of course that
but I think this is a rather weak argument when weighed against making occasional breakage the default for many languages. I, and perhaps others as well, would prefer a component that works 100% of the time with an additional check over a component that works just 99.9% of the time while risking anything from annoyances to hazards otherwise. A fuzzer, for instance, will find this reliably, and so will millions of unintentionally fuzzing users who are not necessarily aware of all the ins and outs of string encodings. Also
may be true, but is in my opinion not a very compelling precedent for what should happen in between two function calls. The more modular Wasm becomes, the harder it will be to tell where a function lives, or whether a string argument/return crosses a boundary or not. Plus, what may work today may stop working in the future, and we are certainly on a trajectory towards more breakage, not less. As such I do not think that basing
on the above two reasons is very meaningful. The typical case for these languages will most likely be to interface with modules written in the same language, or with JavaScript. Perhaps also WASI here or there, but WASI mostly consumes strings, say as file system paths, for which a "not found" result is fine, while otherwise returning either raw bytes or well-formed strings anyhow. As such I would question the practical value of this reason. And of course this does not only apply to
in that some languages, by string API design, make it overly easy to accidentally split a surrogate pair in half (it can be as easy as a
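(The concrete call is elided above; as a hypothetical JavaScript illustration, a plain code-unit slice is already enough:)

```js
// U+1F600 is one code point but two UTF-16 code units (0xD83D 0xDE00), so
// slicing at a code-unit index can strand half of the pair.
const s = "ab😀";
const cut = s.slice(0, 3);        // "ab" plus the lone high surrogate 0xD83D
cut.charCodeAt(2).toString(16);   // "d83d" - an unpaired surrogate
```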
I can appreciate the direction here of at least keeping the door open to solve it in the Web embedding, but I would once more want to emphasize that this is not a problem exclusive to the Web embedding. I do not see, for instance, how a toolchain would decide whether to utilize a different type or just use a string.

On a more general note, it was obvious to me since the beginning that higher-level languages like Java will have a better stab at seamlessly integrating with the Web platform, not only because some of them share a string encoding with JavaScript, but also because they already have matching concepts of, say, strings being references, that in the ideal case can be shared with JavaScript, say with GC where everything lives in a common heap anyway. The component model with its restrictions, on the other hand, seems as if it is on a different trajectory that may in the future even influence other proposals as a kind of precedent, and this makes me really sad, because I always hoped we could embrace the sheer potential of a future where JavaScript and WebAssembly become one ("component").

Lastly, in the presence of a proper escape hatch for affected languages, I would be fine with a default
I don't think it's possible to decide an issue like this one in the absence of an agreed-upon set of goals, use cases and requirements, which is what "the component model" refers to. Now that we have strong CG agreement on this scope, it gives us the appropriate context in which to discuss this question. I don't think there's an alternative approach to hard questions like this.
It's totally reasonable not to be particularly motivated by the component model -- it's not expected to address 100% of use cases or be a universal answer to all interoperability questions; that's why we've adopted a layered approach. However, it's clear that there are many other folks who are strongly motivated by these goals, so I don't think we can simply set aside the component model in this discussion. For a different set of goals, a different proposal is appropriate.
It won't be random, as it will happen quite regularly and independently of the context in which the component is embedded. In contrast, allowing surrogates to cross component boundaries would lead to random (from the perspective of an individual component) failures. This is the crux of the matter when you consider the full goals of the component model. I agree that if you're restricting your set of goals to focus more exclusively on JS and the Web this is less of a concern, but that's not the context of this layered proposal.

I want to reemphasize another point, which is: like wasm, the component model is not a one-shot standard that has to have everything from day 1. Like wasm, our goal is to start with an MVP and iterate based on real-world experience. Thus, the question to ask isn't "do there exist any use cases for passing surrogates?" but, rather, "will the initial release of the component model not be viable without the ability to pass surrogates between components?". The data we have here, the years of experience of folks working on Web standards, suggests that surrogates are not necessary, which is supported by all the standards evolution described above.
Given Murphy's law, what you are proposing seems unnecessarily risky. The real-world experience we would be obtaining is, to some extent, (rare) breakage, and for me that is not a proper foundation to build a house on. I would once more like to remind everyone of the claim that post-MVP "is just an optimization", which is what we voted on, but which as far as I am concerned remains unproven, plus not everyone in the group may have been properly informed about it when placing their vote. As such I would argue that we are better off starting with the more inclusive string semantics that are true to the just-an-optimization aspect and give us the foundation we need to iterate in the future. This adds three more options at the end of the day:
Pre-existing "experience of folks working on Web standards" doesn't convince me at least, especially because we are going to compile a lot of stuff to the Web that hasn't been there before. Btw, it could be as integrated as adding an option like
Today, whenever JS talks to the outside world (through HTTP APIs, JSON, gRPC, etc.), surrogates are replaced with replacement characters, and no one considers this data loss because that's simply the expectation when talking to the outside world. As shown in this slide, the component model is not meant to take the place of language-specific modules/packages/libraries but, rather, to encapsulate linked collections of these (making components more like lightweight processes). Thus, the component model is explicitly adopting the same "talking to the outside world" model, where surrogates are not expected. If we are really worried about silent breakage, though, the fix would be to have surrogates trap instead of producing replacement characters, so the errors could be caught early and fixed easily. I'm open to discussing that more. (Also, just to clarify, "post-MVP" doesn't refer to a singular follow-up proposal (adding adapter functions), but rather a long sequence of feature proposals over time, the same as with core wasm. Thus, post-MVP is not restricted to only being for optimization by any means.)
I am still not convinced that these APIs, which require well-formedness under the hood by definition of being HTTP APIs,

The theoretical other extreme would be to consider that every string API call should sanitize, which would break languages straight away. We are somewhere in the middle, and given that there are even more APIs in JS for instance (that can be considered modules), and that these deliberately do not sanitize since there is no reason to risk that, I would say we are much closer to function calls here. Unless we want to encourage shipping monoliths only, perhaps, but I am not sure that's a goal :) (Btw, I'd absolutely prefer replacement over erroring, for separate reasons, if my viewpoint cannot find consensus. I appreciate UTF-16/Latin-1 being considered.)
I think there is plenty of room in the wasm ecosystem for new ABIs specifically designed for allowing closely-related languages to integrate more tightly than the component model allows. This is already the case with the existing tooling-conventions ABI, which is the basis for C/C++/Rust(/FORTRAN?) to link together and pass pointers to memory and functions back and forth. I can imagine another, totally different ABI designed specifically around native-JS integration that could be more like what you want. But for the component model, the virtualization goals imply that a component should never assume it knows the language (and whether it's wasm or native) of its imports nor of the caller of its exports.
Do you think this could be designed in a way that it becomes composable? Say, either use just Interface Types to achieve something lightweight as I am envisioning, or Component Model over Interface Types for stronger guarantees? Like, so far it really only differs in

One way to achieve this perhaps could be if JS could participate in a Component (achieving Wasm + JS inside) as if it were just one of many modules, while outside of the component we'd enforce stricter guarantees that are useful in more complex scenarios.
Some use cases I have in mind are:
On the other hand, there is one fun use case made possible by the USV restriction: wasm-string-sanitize, a practically zero-code string sanitizer that works universally across every environment supporting Interface Types. Not saying that someone should build that, as it would lead the whole thing ad absurdum, but perhaps it is good to keep in mind that someone could indeed build this.
I apologize if this is a digression (but maybe it can help?). It seems to me that there are really two use cases here:
(Btw, note that wasm on the web would use option 2 when it wants to use interface types to communicate with wasm components; option 1 is just about interop with JS.) I think there is no perfect way to have optimal web interop as well as optimal UTF usage at the same time / with the same code. We will have the downside of people compiling differently when JS interop is their preference, but I believe that will be the same with a wasm VM embedded in the JVM or CLR, where, again, using the native string may be important for speed and compatibility. So in practice we may see these two patterns of usage,
Sadly, option 1 is not feasible yet, and probably won't be for a long time. One pain point, for example, is that modules typically ship with static string data: without Interface Types, without GC, with neither of them considering WTF-16 a proper encoding, and with no ETA on Type Imports (or knowledge of what these can actually do to initialize strings), I am not seeing anything of significance happening in the near future. This is all connected at the end of the day, and simply choosing UTF-8 now because it is "the best" encoding, even though I presented reasonable concerns and the solution to them is rather trivial, would be something I would not understand. Given how important strings are, this sentiment has the potential to ruin it for some languages, including AssemblyScript, from double re-encoding to occasional breakage and whatnot, and if that's what the open standard chooses, even though it asked me for my first-hand implementer feedback, then I really don't know what I am doing here anymore.
I don't follow your example with static string data. Can you not use a

Overall I believe option 1 works today (in browsers with reference types), just without inlining it is somewhat slow if you do operations on the string. If you mean it is not feasible without inlining, then I think a simple inlining proposal, which is just a performance hint, parallel to branch-hinting, could in fact ship before Interface Types. (Not that the order really matters, but just to respond to your concern.) I may work on this myself, in fact, given the multiple use cases that have come up around it.

I'm not ignoring the rest of your comment, but I don't have the unicode expertise to comment on the UTF-8/WTF-16 details here. My point is that, regardless of that debate, if I were compiling a language to wasm GC with the goal of JS interop, then I would use
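As a rough sketch of "option 1" (illustrative only; the import names and wasmBytes are assumptions, not an established ABI), a module can keep strings as opaque externref values and import the string operations it needs from JS:

```js
// Strings stay JS strings, passed into wasm as externref; the module calls
// back out for every string operation. Hypothetical import layout.
const imports = {
  str: {
    length: (s) => s.length,
    charCodeAt: (s, i) => s.charCodeAt(i),
    concat: (a, b) => a + b,
    fromCharCode: (c) => String.fromCharCode(c),
  },
};
const { instance } = await WebAssembly.instantiate(wasmBytes, imports);
// Without inlining, each operation pays for a wasm->JS call, which is the
// performance concern mentioned above.
```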
Sadly no,

On the other hand, if we could just pass strings over Interface Types boundaries without double re-encoding and potential breakage, we'd be much better off, also because we could then keep specializing our string methods for the static type system we are using. And all it takes to achieve that is to make well-formedness the default, while giving us the option to
If timeliness is the concern, my expectation, which I included in the CG pres, is that browser native implementations will trail significantly behind actual use of components and thus, for the first few years, the way components are used on the Web will be via AOT transpilation into core wasm + JS API performed by bundlers (which is the same path ESMs took). So interface types is by no means the "fast path" for solving the JS-specific problems of how to efficiently produce and consume JS strings; proposing additions to the JS API could be both faster and more appropriately scoped.
All I am asking for is that Interface Types considers JavaScript/C#/Java/AssemblyScript string encoding important enough to avoid unnecessary double re-encoding and not to force potential breakage on these languages. If it refuses to account for this concern, we'll enter an era of C, Rust and similar languages being the only ones viable for a long time. I don't know how you think about this, but this has the potential to backfire so very badly that I cannot even describe it. And AssemblyScript would probably still find a way around it, say by switching to UTF-8 in very unpleasant ways, only because JS-like languages are simply not viable, but still, wow. What's even happening here, after I informed you about the problem for 3 1/2 years :(
+1 for IT not being a fast path for anything on the Web. In that same vein, there's plenty of time and appetite to take implementor feedback into account both for the MVP and for follow-on extensions as implementations start to appear. @dcodeIO, I think we all understand that IT will be less appropriate (in terms of correctness, performance, and convenience) for the use cases you are concerned about and that it will be more appropriate for other use cases instead. I don't think it's clear that this will be as catastrophic as you're arguing it will be, though. If it does turn out to cause as much developer pain as you anticipate, we certainly can and should act on that. In short, I think this conversation would benefit greatly from developer feedback on real prototypes and on an MVP, and I don't anticipate anyone changing their minds without that feedback.
Right, it's neither correct, nor performant, nor convenient, while it is all of these for the first-class club of languages. And JavaScript is not one of them. To me that is catastrophic on many levels :( At this point, though, I do not know what to say anymore. Thank you for taking the time to discuss this with me :)
Interesting, and surprising. I see there is a long list of existing encodings (including
Only for
AssemblyScript is catching up in popularity (3rd place) among languages targeting WebAssembly for one main reason: the similarity it has with JavaScript, the most widely-used programming language in the world (plus AS is done very well!). This is a huge boon for JavaScript web developers: (source)

Simple interop will be a great thing to have for the JavaScript-WebAssembly use case, which is definitely growing thanks to AssemblyScript, and if it keeps going at this pace, it is going to pass Rust and C++. C programmers can write string encoders with their eyes closed; I'd bet it is better to make the string stuff easy for the JavaScript devs right from the get-go instead. 😊 I'm nowhere near an expert on this topic, but that's the feeling I get from reading this thread.
If we want to consider
This is a fixable bug in how Chromium and Gecko interface with Windows. (They should switch from pre-XP native events to XP-or-later native events.) It's not fundamental to the Web Platform given that Trident/EdgeHTML didn't have this problem on Windows and engines on other operating systems don't need to replicate this oddity. It would be really bad to design for this Web engine Windows integration bug. Notably, there is no use case for holding onto the ephemeral
Regardless of the supported value space, as long as the representation in Wasm memory is either UTF-8 or WTF-8, this seems really inefficient. Realistically, if there were C#/Java-like string manipulation routines operating on this level of granularity and worthwhile exposing to JS, chances are that it would make sense to expose them in a manner that explicitly exposed the 16-bit-code-unit representation. Also, this use case is for a family of languages that have the same string value space among themselves and isn't an appropriate motivation for interface types for general cross-language interop.
The UTFs themselves make the loss of unpaired surrogates mandatory, so "Unicode-like" here would have to mean "wobbly" for this argument to make sense, which would make this use case tautological: requiring wobbliness in order to support wobbly encodings, but that's not a use case that would explain why a wobbly encoding would be needed.
For those who haven't seen it yet, we are about to "Poll for maintaining single list-of-USV string type".

I would appreciate it if we could at least talk about my suggested solutions first and establish a definitive commitment to UTF-16 support in the canonical ABI before polling, as I think that would lead to a more constructive outcome than what is being proposed currently. IMO it is too early to decide the USV question without that.
In particular I'd like to talk about "Integrated W/UTF-any" as an alternative to single list-of-USVs.
It largely preserves what is proposed here as its default (except conceptually lifting "List of Unicode Code Points amending surrogate pairs"), but simplifies matters for users and toolchains in that all of the following questions do not have to be asked and answered (by WTF-16 languages in particular), due to not having to statically determine what kind of mechanism to use:
On the contrary, with "Integrated W/UTF-any", a consumer can categorically set the

Wouldn't that be generally preferable in the current and future landscape of languages we want to support?
I believe that, when you ask what is the net effect of such a design, where you have an ecosystem of components with some setting
Yeah, I guess I have a concern for the chairs, thanks.
I looked through the WebIDL and found the following APIs which use USVString: new Blob

Most of these APIs are dealing with URLs, the filesystem, or server requests/responses. However, TextEncoder/TextDecoder are quite important, because they are used to send strings between JS/Wasm. This means that right now it is impossible to send unpaired surrogates from JS to Wasm (or from Wasm to JS). So effectively this enforces that all strings must be valid Unicode. If Interface Types allows unpaired surrogates, then we should also change TextEncoder/TextDecoder to accept unpaired surrogates.
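Concretely, the current behavior looks like this (illustrative):

```js
// TextEncoder's argument is a USVString, so a lone surrogate is replaced
// with U+FFFD before encoding; the UTF-8 round trip is therefore lossy.
const lone = "a\uD800b";                       // perfectly valid as a JS string value
const bytes = new TextEncoder().encode(lone);  // [0x61, 0xEF, 0xBF, 0xBD, 0x62]
new TextDecoder().decode(bytes);               // "a\uFFFDb", not "a\uD800b"
```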
I think this is a little too strong since
As @lukewagner alluded to earlier, adding a WTF-16 encoder and decoder specific to some JS/Web API is exactly the sort of language-aware feature that could be pursued at a level lower than the component model. Is this something you'd be interested in?
I'd be much more interested in not carelessly breaking stuff that cannot change anyhow ;). P.S. 6 posts left until my presentation goes out of view.
@dcodeIO, unless I'm misinterpreting, it looks like AS actually performed replacement of isolated surrogates when transferring a string from Wasm to JS, until this recent patch to the loader in June. It was previously using

I appreciate that you've recently patched the code to always preserve isolated surrogates, but did anyone complain about the character replacement before, or notice the new preservation behaviour?
Yes, I noticed that this was broken by a change that was supposed to be an optimization. Apart from that, interfacing with JS from AS is such a major pain, because this is all so broken and nobody cares, that only I and a couple of former Wasm enthusiasts are really using it.
Yes, it looks like this behaviour has been in place from September 2020 to June 2021 AssemblyScript/assemblyscript#1471 - the offending "optimization".
Excuse me, but are you trying to suggest...?
Not at all - I'm giving the context that this was quite a long-lived bug. In fact I agree with you that given the desired behaviour, the supposed "optimization" was not semantics-preserving. That being said, it doesn't seem like it broke anyone's AS code to the point that they raised it as an issue. EDIT: the above comment has been destructively edited in a way that changes the meaning of my reply - there are certainly some things I'm trying to suggest.
Really?
Sometimes github is more dramatic than TV. Now if I tell my friends I'm watching a drama about unpaired surrogates, I wonder what they will think I mean...?
Yes, "offending" in the sense that it was the cause of your bug, and "supposed" in the sense that it wasn't really an optimization (since it didn't preserve the semantics you wanted). Apologies if you took any other meaning. I think the key point here is that, while it was clearly a bug, it somewhat demonstrates that the body of existing AS code out there was not broken by long-term changes in sanitisation behaviour. I should also note that consensus on the USV abstraction doesn't mean that AS has to again change its sanitisation behaviour when interop-ing with JS (just don't do it through a component boundary).
To whoever is responsible for code of conduct compliance... I continue to be dumbfounded by this issue.
I apologize if this is a little bit of a red herring, but having read the thread, I think it might be worth asking. People generally recognize that UTF-16/UCS-2 were kind of an evolutionary dead-end that arose due to historical accident, right? And that WTF-8/WTF-16 are a somewhat weird way of coping with that? My question is simply: why are we continuing to carry the baggage of this evolutionary oddity? UTF-8 is around 30 years old, is without a doubt the de facto standard, and has won the user-facing encoding wars. Can we not give a gift to our future selves and future implementors by dropping support for other forms?
This has already been discussed many times in side threads. Calling UTF-16 obsolete will not change the encoding used by languages such as Java, C#, Dart, JavaScript, etc. They would all have to convert UTF-16 to UTF-8 and back, even between themselves or with a browser. This is slow, it requires validation in user space, and no one can guarantee that the validation was done at all, or done correctly. Once again, even UTF-8 takes the legacy of UTF-16 into account (yep!): the range associated with surrogates is invalid in UTF-8 as well.
@crertel This issue isn't really about encodings per se (since Interface Types supports all encodings, including UTF-16, UTF-32, etc.) Interface Types must support every encoding, because various languages use different encodings (Java, C#, and JavaScript use WTF-16... Python and Haskell use Unicode code points... Swift uses grapheme clusters... Rust, C, and C++ use UTF-8... etc.) It is not an option to change every language (and every library), so Wasm must interop with the native string types of many different languages (and many different encodings). In order to achieve this, every language creates an adapter function, which converts from its native string type into a list of Unicode code points. This allows every language to use its own native encoding and still interop with every other language. However, this means that all strings must be valid Unicode. Invalid Unicode (such as unpaired surrogates) cannot be used. Normally this isn't a problem, since invalid Unicode is generally a bug. However, in some very niche situations languages will want to preserve the invalid Unicode, and they can't do that with the Interface Types proposal. In particular, JavaScript uses WTF-16, and so JavaScript strings can contain invalid Unicode. With the current proposal, it's not possible for JS to send an invalid string to Wasm (or for Wasm to send an invalid string to JS). Some people think that's a good thing, and that invalid strings should not be allowed. Other people think that invalid strings should be allowed, because JS allows them.
@Pauan, thank you for the very clear and concise summary!
The concern here is, a bit more precisely, that Interface Types wants to restrict its
...unpaired surrogates (pairs are valid and are transformed normally), which can exist due to how many UTF-16 language APIs are designed for backwards-compatibility reasons, and which would either need to be replaced or trigger an exception. That is not only unnecessary but can lead to many problems for these languages when communicating with themselves or JavaScript. My general suggestion is that we should allow them to pass their idiomatic strings (i.e.

More context on why list-of-USV is not necessarily a good idea can be found in my summary slides above if there is interest.
I've added one more slide to my summary slides outlining the suggestion of adding a separate
In WebAssembly/gc#145 (comment), @littledan nicely summarizes my perspective on both the rationale for and the concerns against the decision made:
In my own terms, I would say that the cost of (sometimes redundantly) validating strings seems likely reasonable in inter-trust settings, which is the new layer that the earlier poll on the component model has added to Interface Types. But in settings that are inter-language but still intra-trust, such as the examples that @littledan gives in his comment (e.g. wasm-JS interop/integration), these validations come with costs but not proportionate benefits. So maybe what these discussions mean is that there's room and need for something that enables (more efficient) inter-language (and shared-nothing) linking and exchange without the burdens inherent in crossing trust boundaries.

If that's the case, then there's good news. In developing session types for Interface Types, I also figured out the lower-level constructs that could be added to WebAssembly (or as a layer just above WebAssembly) to make efficient shared-nothing exchange possible. Interface Types can then be trust-bridging libraries built on top of these primitives, where the lift/lower instructions are actually exports of the libraries (that can still be fused despite being abstract exports). And people can build their own libraries (such as one like what @kmiller68 suggested in the meeting, and others in other threads, where strings are host/GC references managed by the library) on top of these primitives.

If there's interest in developing such an inter-language extension to WebAssembly, I'm happy to develop the technology more. But I don't have the time to drum up interest, so that work would have to be done by others, whether it's IT people wanting to develop a common core for engines to support to facilitate IT, or people seeing needs not served by IT, or the two groups together; I'm happy whichever way. And if there's no interest, I'm fine with sitting on the tech (and maybe turning it into a research paper).
@RossTate I am more than interested to hear more about your idea. I believe that for tight intra-component linking of small modules with a shared encoding, there is definitely some interesting space to explore, from IT over externref to ABIs. What about creating a new issue for this?
Glad to hear this might fit the space you were looking to fill! Unfortunately I'm in the midst of a big paper deadline (on efficient cross-language interop, albeit of a different sort), so I don't have the time to go into more detail on this for the next couple weeks. But how about you give me a poke if I've failed to follow up on this in a month? Would that be alright?
@RossTate of course!
The CG voted in favor of maintaining the single list-of-USV string type in the 08/03 meeting (notes), and the polls can be found here. Closing this issue, as its original intent has been resolved; but as there are outstanding issues/approaches that still need discussion, please file new issues so these can be resolved/discussed.
This issue lays out the reasoning for why I think strings should be lists of Unicode Scalar Values (as currently written in the explainer). This is a fairly nuanced question with the reasoning currently scattered around a number of issues, repos and specs, so I thought it would be useful to collect it all into one focused issue for discussion. The issue reflects discussions with a bunch of folks recently and over the years (@annevk, @hsivonen, @sunfishcode, @fgmccabe, @tschneidereit, @domenic), so I won’t claim credit for the reasoning. Also, to be clear, this issue only answers half of the overall question about string encoding, but I think it’s the first question we have to answer before we can meaningfully talk about string encodings.
(Note: I intend to update the OP in-place if there are any inaccuracies so that it represents a coherent argument.)
First, a bit of context:
Current proposal
As background, the Unicode standard provides two relevant definitions:
Based on these definitions, the current explainer proposes:
- the char interface type is a USV
- the string interface type is an abbreviation for list char
Thus, string, as currently proposed, contains no surrogates (not just no lone surrogates). For reference: a pair of surrogate Code Units in a valid UTF-16 string is decoded into a single USV, and thus valid UTF-16-encoded strings will never decode to strings containing any surrogates.

This is not an encoding or in-memory-representation question
The question of whether strings are lists of Unicode Scalar Values is not a question of encoding or memory representation; rather, it’s a question of: “what are the abstract string values produced by decoding and consumed by encoding?”. Without precisely defining what the set of possible abstract string values is, we can’t even begin to discuss string encoding/decoding since we don’t even know what it is we’re trying to encode or decode. This is especially true in the context of Interface Types, where our goal is to support (via adapter functions) fully programmable encoding/decoding in the future.
Thus, if we’re talking about the abstract strings represented by languages like Java, JS and C#, we’re not talking about “WTF-16” (which is an encoding); we’re talking about “lists of code points not containing surrogate pairs (but potentially containing lone surrogates)”, which for brevity I’ll call Wobbly strings, since Wobbly strings are what a Java/JS/C# string can be faithfully decoded into and encoded from. In particular, a Wobbly string can be encoded by either WTF-8 or WTF-16. Note that the set of Wobbly strings is subtly different and smaller than “lists of Code Points” because surrogate pairs decode into necessarily-non-surrogate code points, so there is no way for a Java/JS/C# string to decode into a surrogate pair. The only major languages I know of whose abstract strings are actually “lists of Code Points” are Python 3 and Haskell.
This is a Component Model question
As of our recent CG-05-25 polls, the Interface Types proposal now has the goals and requirements of the Component Model (as presented and summarized). Concretely, this means we're explicitly concerned with cross-language/toolchain composition, virtualizability and embeddability, which means we're very much concerned with whether interfaces using string will be consumable and implementable by a wide variety of languages and hosts with robust, portable behavior. Thus, use cases exclusively focused on particular combinations of languages+hosts may need to be solved by separate proposals targeting those specific languages+hosts if they are in conflict with the explicit goals of broad language/host interoperability.

With all this context in place, I'll finally get to the reasons for defining string to be a list of USVs:

Reason 1: many languages have no good way to consume surrogates
I think there are a few categories of affected languages (this is based on brief spelunking, so let me know if I got this wrong and I’ll update it):
First, there are languages that simply fix UTF-8 for their built-in string type, in some cases exposing UTF-8 representation details directly in their string operations. The popular languages I found in this category are: Elixir, Julia, Rust and Swift.
Second, there are languages which define strings as “arbitrary arrays of bytes”, leaving the interpretation up to the library functions that operate on them. For the languages in this category that I looked into, the default encoding (for source text and string literals and sometimes built-in syntax like iteration) is increasingly implicitly assumed to be UTF-8 (due to the fact that, as detailed below, most I/O data is UTF-8). While it may seem like these languages have the most flexibility (and thus ability to accommodate surrogates), when porting existing code, the implicit dependency on UTF-8 (in the form of calls to UTF-8-assuming library functions scattered around the codebase) makes targeting anything other than UTF-8 challenging. The popular languages I found in this category are: C/C++, Go, Lua, PHP and Zig.
Third, there are languages that support a variety of encodings and conversion between them, but still disallow surrogates (among other reasons being that they aren’t generally transcodable). The popular languages I found in this category are: R and Ruby.
In all of these categories, the author of the toolchain that is binding the language to the Interface Types string has no great general option for what to do when given a surrogate:

For any particular use case, one of these options may be obvious. However, toolchains have to handle the general case, providing good defaults. In addition to the variable ecosystem cost of the different options, there is also a fixed non-negligible cost in wasted time for the N teams working on the N language toolchains, each of which will have to page in this whole problem space and wade through the space of options. In contrast, with a list of USVs, all the above languages can just do the obvious thing they're already doing.
Reason 2: strings will often need to be serialized over standardized protocols and media formats, which usually disallow surrogates
A common use of Interface Types will be to describe I/O APIs (e.g., for passing data over networks or reading/writing different media formats). Additionally, several of the component model’s virtualizability use cases involve mocking non-I/O APIs in terms of I/O (e.g., turning a normal import call into an RPC, logging call parameters and results, etc). In both these cases, surrogates run in direct conflict with the binary formats of most existing standard network protocols and standard media formats.
In particular, just considering Web-relevant character sets:
On the Web, new APIs and formats created over the last 10 years simply mandate UTF-8, including:
- the json and text getter functions of the fetch, XHR and Blob APIs.
Thus, insofar as a string needs to transit over any of these protocols, formats or APIs, surrogates will be a problem, and the implementer of the mapping will have roughly the same bad options listed above as the language toolchains have.

While it's tempting to say "that's just a specific precondition of particular APIs, not the string type's problem", the virtualization goals of the component model mean that any interface might be virtualized, so the fact that a string is being used for one of the above is not a detail of the API. In contrast, all these protocols and formats can easily represent lists of USVs.

Reason 3: even the WTF-16 languages will have a bad time if they actually try to pass surrogates across a component boundary
Because of the above two reasons, from the perspective of a WTF-16-language-implemented component, it is a very risky proposition to pass a surrogate across a component boundary (as the parameter of an import or the result of an export). Why? Because there's no telling whether the other side will trap, convert the surrogate into a replacement character, mangle it, or trigger undefined/untested behavior. As an author of a component, there's also not a fixed set of clients or hosts (that's the point of components).
Thus, to produce widely-reusable and portable components, even a toolchain for a language that allows lone surrogates would be advised to conservatively scrub these before passing strings to the outside world. In a sense, this is nothing new on the Web: despite JSON being derived from JS, JSON doesn't allow surrogates while JS does, thus there is an inherent scrubbing process that happens when JS communicates with the outside world via JSON (and similarly with WebSockets, fetch(), etc.). Accordingly, the WTF spec specifically advises against its ever being used outside of "self-contained systems".
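For illustration, the conservative scrub described here is the operation that newer JS engines expose directly as String.prototype.toWellFormed() (ES2024):

```js
// Replace each lone surrogate with U+FFFD before handing a string to the
// outside world; well-formed surrogate pairs are untouched.
const scrub = (s) => s.toWellFormed();

scrub("a\uD800b");  // "a\uFFFDb" - lone surrogate replaced
scrub("ab😀");      // "ab😀"     - the surrogate pair is preserved
```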
As an illustrative example: consider instead defining string to be a list of Code Points. As explained above, this would mean string was a superset of the Wobbly strings supported by Java/JS/C#. Why might we do this? For one thing, it would capture the full expressive range of Python 3 and Haskell strings and APIs (which is the same argument for supporting Wobbly strings, just for a smaller set of languages). For another, it would give us a simple definition of char (= Code Point) and string (= list char), which has a number of practical benefits (in contrast to Wobbly strings, which cannot be a "list char" for any definition of "char"). However, now the vast majority of languages and hosts would have to resort to a variant of the abovementioned workarounds, which means Python 3 and Haskell would have a Bad Time attempting to actually take advantage of this increased string expressivity. Thus, there would be a distributed cost without a commensurate distributed benefit. I think the situation is the same with Wobbly strings, even if the partitioning of languages is different.

What about binary data in strings?
One potential argument for surrogates is that they may be necessary to capture arbitrary binary data, particularly on the Web. To speak to this concern, it's important to first clarify something: Web IDL has a ByteString type that is used for APIs (predominantly HTTP header methods like Headers.get()), where a ByteString is intentionally an arbitrary array of bytes. However, ByteString does this not by interpreting a JS string as a raw array of uint16s (which would have a problem representing byte strings of odd length), but by requiring each JS string element (a uint16 value) to be in the range [0, 255], throwing an exception otherwise. Since surrogates are outside the range [0, 255], this means that in the one place in the Web Platform where binary data is actually appropriate, surrogates are irrelevant.

Outside ByteString use cases, there's still a theoretical possibility of wanting to round-trip binary data through DOMString APIs. Talking to folks who have worked for years on the Web IDL and Encoding specs (@annevk, @hsivonen, @domenic), they're not aware of any valid use cases for such usage of DOMString. Indeed, the TextDecoder API does not provide any way to produce a non-USVString, due to this same lack of use cases. In fact, there is currently no direct way (i.e., not involving String.fromCharCode et al) to decode an array of bytes into a non-USVString on the Web Platform today.

Instead, the natural way for a component to pass binary data is a list u8 or list u16, using JS glue code to convert the byte array into a JS string. If these use cases were found, and found to be on performance-sensitive paths in real workloads on the Web, then it seems like a Web-specific solution would be appropriate, and I can think of a number of options for how to optimize this path by adding things to the JS API. But ultimately, as an optimization, I don't think this is something we should preemptively add without compelling data.
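To spell out the ByteString mechanics mentioned above (a simplified sketch; the normative algorithm lives in the Web IDL spec):

```js
// Web IDL ByteString conversion, roughly: every UTF-16 code unit must fit in
// one byte, otherwise a TypeError is thrown. Since surrogates are >= 0xD800,
// they can never appear in a ByteString.
function toByteString(s) {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (u > 0xFF) throw new TypeError("not a valid ByteString");
    bytes[i] = u;
  }
  return bytes;
}
toByteString("OK\x01");  // Uint8Array [0x4F, 0x4B, 0x01]
```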