This repository has been archived by the owner on Aug 17, 2022. It is now read-only.

STRING i32 pairs for UTF-16LE #13

Closed
dcodeIO opened this issue Dec 21, 2017 · 169 comments

@dcodeIO

dcodeIO commented Dec 21, 2017

Regarding

JavaScript hosts might additionally provide:

STRING | Converts the next two arguments from a pair of i32s to a utf8 string. It treats the first as an address in linear memory of the string bytes, and the second as a length.

Any chance that there'll be support for JS-style strings (UTF-16LE) as well? I know this doesn't really fit into the C/C++ world, but languages approaching things the other way around will most likely benefit when not having to convert back and forth on every host binding call.
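A minimal sketch of what the quoted STRING operator and a UTF-16LE counterpart could look like on a JavaScript host. The helper name `liftStringPair`, the `PairEncoding` type, and the choice to measure the length in bytes are assumptions made for illustration and are not part of the proposal; `TextDecoder` is the standard Encoding API.

```typescript
// Hypothetical host-side helper; not defined by any proposal.
type PairEncoding = "utf-8" | "utf-16le";

function liftStringPair(
  memory: Uint8Array,
  ptr: number,
  byteLength: number, // assumption: the second i32 counts bytes
  encoding: PairEncoding,
): string {
  // subarray creates a view, so no bytes are copied before decoding.
  const bytes = memory.subarray(ptr, ptr + byteLength);
  return new TextDecoder(encoding).decode(bytes);
}

// "hi" stored twice in a stand-in for linear memory:
// UTF-8 at offset 0, UTF-16LE at offset 8.
const linearMemory = new Uint8Array(16);
linearMemory.set([0x68, 0x69], 0);
linearMemory.set([0x68, 0x00, 0x69, 0x00], 8);

const fromUtf8 = liftStringPair(linearMemory, 0, 2, "utf-8");
const fromUtf16 = liftStringPair(linearMemory, 8, 4, "utf-16le");
console.log(fromUtf8, fromUtf16);
```

In its default mode `TextDecoder` replaces malformed input with U+FFFD, so this sketch is not lossless for arbitrary 16-bit data.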

@dcodeIO
Author

dcodeIO commented Jun 30, 2019

Since this hasn't received any comments yet, allow me to bump this: I am still curious if UTF-16 strings can be supported. In AssemblyScript's case all strings are UTF-16LE already, so only having the option to re-encode (potentially twice if the bound API wants UTF-16) does seem like it should be taken into account.

@fgmccabe
Contributor

fgmccabe commented Jul 1, 2019 via email

@dcodeIO
Author

dcodeIO commented Jul 1, 2019

If most hosts would require copying unicode16 into utf8 anyway, you may have trouble with (a).

To me it looks like having such an operator can lead to significantly less work where UTF-16 is already present on both sides of the equation, while any case where either side is UTF-8 can easily be handled by re-encoding conditionally. Hence, the module would choose the operator that fits its internal string layout ideally, and the host would do whatever is necessary to make it fit into theirs. This leaves us with these cases:

  1. Both UTF-8: Essentially memcpy
  2. Both UTF-16: Essentially memcpy
  3. One UTF-8, the other UTF-16: Reencoding once

while avoiding the very unfortunate case of

  • Module UTF-16, host UTF-16: Reencode twice because UTF-8 is all the bindings understand
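The cases above can be sketched as a small counting function. This is only an illustration of the argument: the names `reencodeCount` and `reencodeCountWithMatchingOperator` are hypothetical, and the model assumes exactly one conversion whenever two adjacent sides disagree on the encoding.

```typescript
type Encoding = "utf-8" | "utf-16le";

// Number of re-encoding passes for one call, given the module's encoding,
// the encoding the binding operator accepts, and the host API's encoding.
function reencodeCount(
  moduleEnc: Encoding,
  bindingEnc: Encoding,
  hostEnc: Encoding,
): number {
  let count = 0;
  if (moduleEnc !== bindingEnc) count++; // module converts before the call
  if (bindingEnc !== hostEnc) count++; // host converts after the call
  return count;
}

// With one operator per encoding, the module picks the operator that
// matches its own layout, so only the host side can need a conversion.
function reencodeCountWithMatchingOperator(
  moduleEnc: Encoding,
  hostEnc: Encoding,
): number {
  return reencodeCount(moduleEnc, moduleEnc, hostEnc);
}

// UTF-8-only bindings with UTF-16 on both sides: two conversions.
const utf8OnlyBinding = reencodeCount("utf-16le", "utf-8", "utf-16le");
// Encoding-aware bindings: zero when both sides agree, one otherwise.
const bothUtf16 = reencodeCountWithMatchingOperator("utf-16le", "utf-16le");
const mixed = reencodeCountWithMatchingOperator("utf-8", "utf-16le");
console.log(utf8OnlyBinding, bothUtf16, mixed);
```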

But, essentially, it's a bit early to consider committing to any operators at the moment.

I see, yet I thought it might make sense to raise this early so once committing to operators, this case is well thought through :)

@jgravelle-google
Contributor

while avoiding the very unfortunate case of

  • Module UTF-16, host UTF-16: Reencode twice because UTF-8 is all the bindings understand

Yeah that would be unfortunate.

Any chance that there'll be support for JS-style strings (UTF-16LE)

The real question is, can we skip JS entirely? At which point, what does the host API use internally? For example a similarly bad outcome would be

  • Source UTF-8 -> JS UTF-16 -> Web API UTF-8

So I think the sanest way to handle that is with a declarative API on the bindings layer. Which is what you said earlier:

Hence, the module would choose the operator that fits its internal string layout ideally, and the host would do whatever is necessary to make it fit into theirs.

So the higher-level point is that we should be able to adequately describe the most common/reasonable ways to encode strings, so that we can minimize the number of encodings in the best case.


But, essentially, it's a bit early to consider committing to any operators at the moment.

Agree + disagree. On the one hand, it's all in the sketch stage at the moment, where we're feeling out the rough edges. So from a managing-expectations point of view, this makes sense to say.

On the other hand, it's kind of incongruous to say "everything's up in the air, so don't raise any design issues." I don't think that was the intent, but that's kind of how it sounded. A more real translation of how I heard it was "don't worry about this now, we'll figure it out later." To which I would say, as a general principle, that yes we'll figure it out later, but we should raise it now to figure out if we should worry about it now. Especially because multiple people can think about different bits of the spec asynchronously.

@jgravelle-google
Contributor

Also something I should mention explicitly:

I find it incredibly likely that we will default to 1 binding expression per ⛄-type per wasm representation (e.g. 1 for linear memory and 1 for gc), which is to say the MVP of ⛄-bindings will have one binding expression per type, because gc will probably not be shipped yet. On that basis, we will probably start with only UTF-8-encoding (I imagine we will drop the utf8-cstr binding too, for similar reasons).

My general mental model here is that we can always add bindings in the future as we find a need for them. And it may be the case that in practice, the re-encoding from UTF-16 isn't enough of a bottleneck to be worth it. Unless it is, at which point we can add that binding, and it will be more obviously useful because we'll have much more real-world data.


Also for AssemblyScript specifically, would it be reasonable to change the internal string representation from UTF-16 to UTF-8 in the presence of ⛄-bindings? It is, after all, "Definitely not a TypeScript to WebAssembly compiler" 😄

@dcodeIO
Author

dcodeIO commented Jul 1, 2019

And it may be the case that in practice, the re-encoding from UTF-16 isn't enough of a bottleneck to be worth it. Unless it is, at which point we can add that binding, and it will be more obviously useful because we'll have much more real-world data.

At the end of the day we are just building tools here and one can't know the use case of everyone. Like, any use case extensively calling bound functions with string arguments would hit this, and my expectation would be that this'll happen anyway (in certain use cases). Like, if we'd wait, this'll surface sooner or later, so it can as well be addressed from the start, instead of having to tell everyone running into this that their use case is currently not well-supported even though we did see it coming. Especially since specification and implementation of new operators can take a long time again.

Also for AssemblyScript specifically, would it be reasonable to change the internal string representation from UTF-16 to UTF-8 in the presence of ⛄️-bindings? It is, after all, "Definitely not a TypeScript to WebAssembly compiler" 😄

I'm sorry, the "⛄️-bindings" term is new to me. Would you point me in the right direction where I can learn about it? :)

Regarding UTF-8: In fact we have been thinking about this, but it doesn't seem feasible, because we are re-implementing String after the JS API (with other stdlib components relying on it), and going with something other than a UCS-2 representation seems suboptimal since the API is so deeply rooted in the language that mimicking UCS-2 would cost too much perf-wise. After all, we are trying to stay as close to TS as reasonable to make picking up AssemblyScript a smooth experience. I would also like to note that this isn't exclusively an AssemblyScript thing, as other languages are using UTF-16LE as well, like everything in the .NET/Mono space.

@jgravelle-google
Contributor

Like, if we'd wait, this'll surface sooner or later, so it can as well be addressed from the start, instead of having to tell everyone running into this that their use case is currently not well-supported even though we did see it coming. Especially since specification and implementation of new operators can take a long time again.

It's ultimately a tradeoff. My thoughts here are that it will be strictly easier to spec and implement a bindings proposal that defines 8 operators, as opposed to one that defines 40. So we could just add UTF-16, but we could also just add C-strings and we could just add Scheme cons-list strings and we could just add Haskell lazy cons thunks, and so on. So for MVP I think we need to be really strict as to what exactly is "minimal", and in this context Minimal means "we can reason about strings at all".

We also need to balance the "viable" portion. Originally I was thinking we should avoid reasoning about strings and allocators at all, due to the complexity they add. Further discussion on this (see: #25) made me realize that not having an answer for allocators would compromise the viability of the proposal entirely. On that basis, not having UTF-16 support from day 1 is unlikely to leave the bindings proposal dead in the water.

By means of analogy, I would rather we ship anyref without waiting for the full gc proposal, because anyref on its own is a very enabling feature. It is in many ways suboptimal, but it is more useful than what we had before. On that basis, I want to be very cautious about adding scope to the bindings MVP, especially when that scope is separable to a v2 that describes an expanded set of binding expressions.

I'm sorry, the "snowman-bindings" term is new to me. Would you point me in the right direction where I can learn about it? :)

Sure, @lukewagner presented at the June CG meeting, and here's the slide deck: https://docs.google.com/presentation/d/1wtAknL-UJWDoIgSbyF5paTBSpVVj-fKU4tiHMxJbSzE/edit

tl;dr does this wasm binding layer we're describing need to reason about WebIDL at its core, or is WebIDL another target with a produce/consume pair? If the latter, and we suspect that is the case, then we're free to design an IDL that better matches what we're trying to do, rather than try to retrofit that on top of WebIDL.

Full notes of the accompanying discussion here: https://github.com/WebAssembly/meetings/blob/master/2019/CG-06.md#webidl-bindings-1-2-hrs

Also would like to note that this isn't exclusively an AssemblyScript thing

Didn't mean to sound like I was saying it was :x, sorry. I was thinking that if AssemblyScript was using UTF-16 for easier FFI with JS, then in the presence of something-bindings it would be possible to decouple that ABI. And also that AssemblyScript would probably have an easier time of making that ABI switch than a more-ossified target like .NET, on account of it's a younger platform.

@dcodeIO
Author

dcodeIO commented Jul 2, 2019

My thoughts here are that it will be strictly easier to spec and implement a bindings proposal that defines 8 operators, as opposed to one that defines 40

Makes sense, yeah. Though, to me it seems not overly complex to have a (potentially extensible) immediate operand on str (/ alloc-str) that indicates a well-known encoding. I'd consider UTF-8, UTF-16LE and maybe ASCII here (not sure), with length always provided by the caller (even if null-terminated), but I'm certainly not an expert in this regard.

By means of analogy, I would rather we ship anyref without waiting for the full gc proposal, because anyref on its own is a very enabling feature. It is in many ways suboptimal, but it is more useful than what we had before.

I totally agree with the anyref mention, but don't entirely agree on the comparison with encodings. anyref is a useful feature on its own with everything else building upon it, while not addressing encoding challenges on introduction of the feature that would need to deal with it leads to half a feature that unnecessarily limits what certain ecosystems with (imo) perfectly legit use case scenarios like UTF-16 can do efficiently.

Sure, @lukewagner presented at the June CG meeting, and here's the slide deck: https://docs.google.com/presentation/d/1wtAknL-UJWDoIgSbyF5paTBSpVVj-fKU4tiHMxJbSzE/edit

Thanks! :)

So, looking at this slide it mentions utf8 exclusively similar to what we have with WebIDL. Not quite sure how it would solve the underlying issue, that is making a compatible string from raw bytes, if it moves the problem from "directly allocating a string compatible with WebIDL bindings" to "creating a DOMString/anyref compatible with ⛄️-bindings" (if I understood this correctly?). For instance, TextEncoder doesn't support UTF-16LE (anymore), but TextDecoder does.
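To illustrate the TextEncoder/TextDecoder asymmetry mentioned here: `TextEncoder` only produces UTF-8, so a UTF-16LE byte sequence has to be written by hand, while `TextDecoder("utf-16le")` can read it back. The function name `encodeUtf16le` is hypothetical and the code is a sketch, not something the bindings would provide.

```typescript
// Write a JS string's 16-bit code units as little-endian bytes.
function encodeUtf16le(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < text.length; i++) {
    view.setUint16(i * 2, text.charCodeAt(i), true); // true = little-endian
  }
  return bytes;
}

const sample = "h\u00e9llo"; // "héllo" with a precomposed U+00E9

// TextEncoder has no encoding parameter; it always emits UTF-8.
const sampleUtf8 = new TextEncoder().encode(sample);
// The manual path yields two bytes per code unit.
const sampleUtf16 = encodeUtf16le(sample);
// TextDecoder accepts the "utf-16le" label and restores the string.
const roundTripped = new TextDecoder("utf-16le").decode(sampleUtf16);
console.log(sampleUtf8.length, sampleUtf16.length, roundTripped);
```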

I'd expect that at some point in either implementation "making a compatible string from raw bytes" will be necessary anyway if the primary string implementation is provided by the module, which is likely. Please correct me if I'm missing something here. Ultimately, the issue doesn't have to be solved in the WebIDL spec, but any other spec solving it would be perfectly fine as well - as long as it is solved.

Didn't mean to sound like I was saying it was :x, sorry. I was thinking that if AssemblyScript was using UTF-16 for easier FFI with JS, then in the presence of something-bindings it would be possible to decouple that ABI. And also that AssemblyScript would probably have an easier time of making that ABI switch than a more-ossified target like .NET, on account of it's a younger platform.

All good, your point makes perfect sense. Just wanted to emphasize that, even if AssemblyScript would make this change, this is a broader problem than what it might look like from this issue alone :)

@MaxGraey

MaxGraey commented Jul 2, 2019

I thought that if WebAssembly implements WebIDL bindings, it should follow the WebIDL spec, which supports three types of strings: DOMString, ByteString and USVString. Most WebIDL definitions related to Web APIs use DOMString, which is commonly interpreted as a UTF-16 encoded string [RFC2781]. ByteString is actually ASCII, and lastly there is USVString, which does not require a concrete encoding format. An additional note about USVString from the WebIDL spec:

Specifications should only use USVString for APIs that perform text processing and need a string of Unicode scalar values to operate on. Most APIs that use strings should instead be using DOMString, which does not make any interpretations of the code units in the string. When in doubt, use DOMString.

@Pauan

Pauan commented Jul 2, 2019

@dcodeIO I'd consider UTF-8, UTF-16LE and maybe ASCII here (not sure)

UTF-8 was intentionally designed as a strict super-set of ASCII, therefore UTF-8 can be used to efficiently transfer ASCII text.
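A small sketch of that superset property: text restricted to the ASCII range produces the same bytes whether it is written as ASCII or as UTF-8, so no separate ASCII operator is needed to transfer it. The helpers `isAscii` and `encodeAscii` are hypothetical names used only for this illustration.

```typescript
// True when every byte is in the ASCII range 0x00..0x7F.
function isAscii(bytes: Uint8Array): boolean {
  return bytes.every((b) => b < 0x80);
}

// Encode ASCII-only text as one byte per character.
function encodeAscii(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code > 0x7f) throw new RangeError("non-ASCII character at index " + i);
    bytes[i] = code;
  }
  return bytes;
}

const requestLine = "GET /index.html";
const asAscii = encodeAscii(requestLine);
const asUtf8 = new TextEncoder().encode(requestLine);

// Byte-for-byte comparison of the two encodings.
const identical =
  asAscii.length === asUtf8.length && asAscii.every((b, i) => b === asUtf8[i]);
console.log(identical, isAscii(asUtf8));
```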

@dcodeIO
Author

dcodeIO commented Jul 2, 2019

UTF-8 was intentionally designed as a strict super-set of ASCII, therefore UTF-8 can be used to efficiently transfer ASCII text.

Yeah, tried to be careful there (in regards to C-strings) but the more I think about it the less I believe that this distinction is necessary, especially since any API being bound will very likely be reasonably modern anyway. So that'd leave us with UTF-8 and UTF-16LE. Anything else you could imagine would fit there in terms of "well-known encodings" (in context of modern programming languages)?

@MaxGraey

MaxGraey commented Jul 2, 2019

@Pauan WebIDL (except ByteString) and JavaScript do not use ASCII at all. Strings in JavaScript are represented as UTF-16LE by default, but V8, for example, can represent strings in different ways and encodings internally. For example, during concatenation strings can be represented as a rope structure, which is flattened to a "normal" string before serialization / further conversion or before passing to a Web API. But that doesn't mean we should use a rope structure as the default structure for strings, for example. The same goes for UTF-8.

@dcodeIO
Author

dcodeIO commented Jul 2, 2019

Side note: USVString looks like it can be described in terms of UTF-32 (not sure if that makes sense as I don't know anything using it for its internal representation). But maybe the least common denominator is UTF here?

@MaxGraey

MaxGraey commented Jul 2, 2019

About ByteString in WebIDL

Specifications should only use ByteString for interfacing with protocols that use bytes and strings interchangeably, such as HTTP. In general, strings should be represented with DOMString values, even if it is expected that values of the string will always be in ASCII or some 8 bit character encoding. Sequences or frozen arrays with octet or byte elements, Uint8Array, or Int8Array should be used for holding 8 bit data rather than ByteString.

@Pauan

Pauan commented Jul 2, 2019

@MaxGraey I am aware. The purpose of WebIDL bindings is to allow many different languages to use WebIDL APIs without using JavaScript.

Since each language does things differently, that means there needs to be a way to convert from one type to another type.

That's why there's a UTF-8 -> WebIDL string conversion, to allow for languages like Rust to use WebIDL bindings (since Rust uses UTF-8).

@MaxGraey

MaxGraey commented Jul 2, 2019

That's why there's a UTF-8 -> WebIDL string conversion, to allow for languages like Rust to use WebIDL bindings (since Rust uses UTF-8).

So every browser that has already implemented WebIDL bindings for JavaScript, and the rest of the languages like C#/Mono, Java, Python and others that are still popular today, should change their internal string representation? I guess all these languages in total are much more popular than Rust, no matter how awesome it is)

@MaxGraey

MaxGraey commented Jul 2, 2019

I don't mind utf8-str, but I think the proposal should care about utf16le-str as well =)

@MaxGraey

MaxGraey commented Jul 2, 2019

The WebIDL bindings proposal already cares about null-terminated strings (utf8‑cstr), which are pretty special and still used only in C/C++. So it already cares about backward compatibility for legacy approaches)

@Pauan

Pauan commented Jul 2, 2019

So every browser that has already implemented WebIDL bindings for JavaScript, and the rest of the languages like C#/Mono, Java, Python and others that are still popular today, should change their internal string representation?

I'm not sure where you got that idea... you seem to be misunderstanding how all of this works. I suggest you read the recent slides, especially slide 29.

The way that it works is that the browser implements WebIDL strings (using whatever representation it wants, just like how it does right now). And then there are various "binding operators" which convert from other string types to/from the WebIDL strings.

So you can have a binding operator which converts from UTF-8 to WebIDL strings, or a binding operator which converts from UTF-16 to WebIDL strings. The browser doesn't need to change its internal string representation, it just needs to implement a simple conversion function.

I'm also not sure why you're bringing up languages like C#/Mono, Java, or Python... they are also implemented in WebAssembly linear memory, and so they need binding operators. The binding operators are not a "Rust-only" thing, they benefit all languages. That's why it's a UTF-8 conversion, so it can be used by all languages which use UTF-8 strings.

@dcodeIO
Author

dcodeIO commented Jul 2, 2019

I'm also not sure why you're bringing up languages like C#/Mono, Java, or Python... they are also implemented in WebAssembly linear memory, and so they need binding operators. The binding operators are not a "Rust-only" thing, they benefit all languages. That's why it's a UTF-8 conversion, so it can be used by all languages which use UTF-8 strings.

I believe the point he wanted to make is that all those languages use UTF-16LE internally so all of them would face the potential performance penalty this issue is about.

@Pauan

Pauan commented Jul 2, 2019

I believe the point he wanted to make is that all those languages use UTF-16LE internally so all of them would face the potential performance penalty this issue is about.

Okay, but I never spoke about UTF-16 (which I am in favor of).

I only said that languages which use ASCII do not need a special "ASCII binding operator", since they can use UTF-8 instead.

@MaxGraey

MaxGraey commented Jul 2, 2019

I only said that languages which use ASCII do not need a special "ASCII binding operator", since they can use UTF-8 instead.

Yes, just one note: that's C (and probably C++ as well), and it should use utf8-cstr, the null-terminated version of utf8-str: https://github.com/WebAssembly/webidl-bindings/blob/master/proposals/webidl-bindings/Explainer.md#binding-operators-and-expressions

@dcodeIO
Author

dcodeIO commented Jul 2, 2019

So, to recap my perspective a little here: one way to avoid re-encoding on every host-binding call, which discriminates against languages following another UTF standard, could be to make the encoding kind an immediate operand of utf-str and alloc-utf-str (dropping the 8), with valid encodings being UTF-8 (& UTF-8-zero-terminated?), UTF-16LE and potentially UTF-32 (USVString <-> USVString fallback?). Based on the pair of (source-encoding, target-encoding), the host would either preserve the representation if both are equal, or convert into either one depending on what it deems appropriate.

Since those encodings are relatively similar, I'd say that the implementation isn't a significant burden, while solving the issue for most modern programming languages for good.
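A rough sketch of the host behaviour described in this recap, limited to UTF-8 and UTF-16LE. The `StrEncoding` type stands in for the proposed immediate operand, and `passString` and `encodeAs` are hypothetical names; none of this is taken from the proposal text.

```typescript
// Stand-in for the encoding immediate on utf-str / alloc-utf-str.
type StrEncoding = "utf-8" | "utf-16le";

// Encode a JS string into the requested target encoding.
function encodeAs(text: string, target: StrEncoding): Uint8Array {
  if (target === "utf-8") return new TextEncoder().encode(text);
  const out = new Uint8Array(text.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < text.length; i++) {
    view.setUint16(i * 2, text.charCodeAt(i), true); // little-endian
  }
  return out;
}

// Preserve the representation when both sides agree, transcode otherwise.
function passString(
  bytes: Uint8Array,
  source: StrEncoding,
  target: StrEncoding,
): Uint8Array {
  if (source === target) return bytes.slice(); // plain copy, no re-encoding
  return encodeAs(new TextDecoder(source).decode(bytes), target);
}

const hiUtf16 = new Uint8Array([0x68, 0x00, 0x69, 0x00]);
const copied = passString(hiUtf16, "utf-16le", "utf-16le");
const transcoded = passString(hiUtf16, "utf-16le", "utf-8");
console.log(copied.length, transcoded.length);
```

The transcoding path goes through `TextDecoder`, which substitutes U+FFFD for malformed input, whereas the copy path passes the bytes through unchanged.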

If it is decided that WebIDL-bindings should not provide string operations, that'd be fine, but in this case whatever is decided-upon as the alternative should take it into account (note that anything based upon TextEncoder currently doesn't).

Hope that makes sense :)

@annevk
Member

annevk commented Jul 2, 2019

Note that JavaScript strings are not UTF-16, they're 16-bit buffers. UTF-16 has constraints that JavaScript does not impose.

@MaxGraey

MaxGraey commented Jul 2, 2019

Yes, in JavaScript most operations are not "Unicode safe" and interpret those 16 bits as UCS-2, except String.fromCodePoint, String#codePointAt, String#toUpperCase/String#toLowerCase and several others. But UTF-16LE and UCS-2 have the same 16-bit storage, so for simplicity most people call that UTF-16 encoding.

@annevk
Member

annevk commented Jul 2, 2019

The distinction is nonetheless important because you could imagine a language having support for UTF-16 the way Rust has support for UTF-8 (8-bit buffer with constraints) and that's not a good fit for what OP is asking for.

@MaxGraey

MaxGraey commented Jul 2, 2019

UCS-2 is a strict subset of UTF-16. It means that if we use UCS-2, we can always reinterpret it as UTF-16 without any caveats, as long as both have the same endianness. UTF-16 just understands surrogate pairs; UCS-2 doesn't.

UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided

So I don't think it's a big deal for the current topic.

@annevk
Member

annevk commented Jul 2, 2019

Surrogate pairs are not the issue, lone surrogates are.
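A short sketch of the distinction being drawn: a JavaScript string may hold a surrogate code unit without its partner, which is a valid 16-bit buffer but not well-formed UTF-16. The function name `hasLoneSurrogate` is hypothetical; the `TextEncoder` behaviour shown (substituting U+FFFD) is how the standard encoder treats such input.

```typescript
// Returns true if the string contains a surrogate code unit without its partner.
function hasLoneSurrogate(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the second half of a valid pair
      } else {
        return true;
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

const pairedString = "\ud83d\ude04"; // U+1F604 as a valid surrogate pair
const loneString = "a\ud83d"; // legal JS string, not well-formed UTF-16

// Converting to UTF-8 cannot represent the lone surrogate, so the
// encoder substitutes U+FFFD (bytes EF BF BD) and the original is lost.
const lossyBytes = new TextEncoder().encode(loneString);
console.log(hasLoneSurrogate(pairedString), hasLoneSurrogate(loneString), lossyBytes);
```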

@MaxGraey

MaxGraey commented Feb 17, 2021

The host cannot expose its internal representation to Wasm, because that would cause massive security problems. In addition, the browser has a very complex internal implementation (Latin1, WTF-16, ropes, etc.) which changes over time, so a single representation will not work.

Yes and no. V8 uses a special kind of string (external strings, which are two-byte and flattened) for FFI. These follow the DOMString or USVString interfaces of WebIDL and are equivalent to WTF-16/UTF-16 respectively.

@Pauan

Pauan commented Feb 17, 2021

@MaxGraey V8 is not the only JS engine (other engines do things differently), and DOM APIs are not the only APIs that use strings. And that just furthers my point that host strings are complicated and use different representations at different times. And that's ignoring the other hosts like the JVM or CLR.

In any case, even if every JS engine used a single representation for strings, it still could not expose it to Wasm for security (and backwards compatibility) reasons, so externref is still mandatory.

@MaxGraey

V8 is not the only JS engine (other engines do things differently), and DOM APIs are not the only APIs that use strings. And that just furthers my point that host strings are complicated and use different representations at different times. And that's ignoring the other hosts like the JVM or CLR.

Do you agree that the most popular encodings are WTF-8 and WTF-16? Both encodings are the minimal common denominator for all languages. Isn't this true?

@Pauan

Pauan commented Feb 17, 2021

Do you agree that the most popular encodings are WTF-8 and WTF-16? Both encodings are the minimal common denominator for all languages. Isn't this true?

No that is not true. One of the most popular languages in the world (Python) does not fit neatly into WTF-8 vs WTF-16.

In addition, I explained in detail why this problem is not about encodings. I will not repeat myself.

@MaxGraey

MaxGraey commented Feb 17, 2021

First of all, Python supports arbitrary encodings (1-byte, 2-byte and 4-byte). Second, Python is not required to be fast. Third, Python is an interpreted language (not AOT-compiled), so string encoding is not the main problem at all in terms of performance and interoperability.
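As I understand it, the 1-byte, 2-byte and 4-byte forms refer to CPython's flexible string representation (PEP 393), where each string uses the narrowest fixed width that fits its largest code point. The function below is a hypothetical sketch of that selection rule applied to a JavaScript string, not CPython's actual code.

```typescript
// Bytes per code point a PEP 393-style string would need for this text.
function storageWidth(text: string): 1 | 2 | 4 {
  let width: 1 | 2 = 1;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    if (isHigh && next >= 0xdc00 && next <= 0xdfff) {
      return 4; // a surrogate pair encodes a code point above U+FFFF
    }
    if (unit > 0xff) width = 2; // outside the Latin-1 range
  }
  return width;
}

const latinWidth = storageWidth("h\u00e9llo"); // all code points <= U+00FF
const bmpWidth = storageWidth("\u20ac5"); // contains U+20AC
const astralWidth = storageWidth("ok \ud83d\ude04"); // contains U+1F604
console.log(latinWidth, bmpWidth, astralWidth);
```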

@Pauan

Pauan commented Feb 17, 2021

First of all, Python supports arbitrary encodings (1-byte, 2-byte and 4-byte).

Yes! Exactly! Different languages have radically different string implementations. If it was just about UTF-8 vs UTF-16 then this problem would be far easier to fix, but it's not about encoding.

Second, Python is not required to be fast.

Python still mandates O(1) indexing for code points and O(1) length, otherwise it will turn O(n) algorithms into O(n^2) algorithms.
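To make the complexity argument concrete, here is a sketch of code point indexing over a variable-width encoding such as UTF-8. Each lookup has to scan from the start, so a loop that indexes every position performs O(n) work n times. The function name `codePointAtUtf8` is hypothetical and the code assumes well-formed input.

```typescript
// Return the code point at code-point index `index` in well-formed UTF-8.
// Every call walks the bytes from the beginning, so one lookup is O(n).
function codePointAtUtf8(bytes: Uint8Array, index: number): number {
  let pos = 0;
  for (let seen = 0; pos < bytes.length; seen++) {
    const lead = bytes[pos];
    // Sequence length is determined by the lead byte.
    const size = lead < 0x80 ? 1 : lead < 0xe0 ? 2 : lead < 0xf0 ? 3 : 4;
    if (seen === index) {
      if (size === 1) return lead;
      let cp = lead & (0xff >> (size + 1)); // payload bits of the lead byte
      for (let k = 1; k < size; k++) {
        cp = (cp << 6) | (bytes[pos + k] & 0x3f); // 6 bits per continuation byte
      }
      return cp;
    }
    pos += size;
  }
  throw new RangeError("index out of range");
}

// "a", U+20AC and U+1F604 encoded as UTF-8 (1 + 3 + 4 bytes).
const mixedUtf8 = new Uint8Array([0x61, 0xe2, 0x82, 0xac, 0xf0, 0x9f, 0x98, 0x84]);
const first = codePointAtUtf8(mixedUtf8, 0);
const second = codePointAtUtf8(mixedUtf8, 1);
const third = codePointAtUtf8(mixedUtf8, 2);
console.log(first.toString(16), second.toString(16), third.toString(16));
```

A fixed-width representation avoids the scan because the byte offset is simply the index multiplied by the width.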

Python does actually care about performance, perhaps not as much as languages like Rust or Go, but it still cares. A lot of high-performance data science is done in Python (e.g. using NumPy), Python is not a toy.

Python being interpreted has nothing to do with it, it still uses raw byte strings for performance, and the Python interpreter (which does all the string operations) is written in C.

In addition, the behavior of Python differs from WTF-8 / WTF-16, it is more than just performance. That's why they couldn't just run Python on the JVM, they had to create a fork (Jython). The point of Wasm is that languages should be able to work without needing to create a fork of the existing language.

You are so convinced that the only problem is UTF-8 vs UTF-16, but it's simply not true, and I have already explained in detail why it is not true, so I will not repeat myself.

@MaxGraey

V8 is not the only JS engine (other engines do things differently)

The same for FF: https://github.com/mozilla/gecko-dev/blob/master/js/src/vm/StringType.cpp#L2054

@MaxGraey

MaxGraey commented Feb 17, 2021

A lot of high-performance data science is done in Python (e.g. using NumPy)

I know how CPython works under the hood. NumPy is just a thin wrapper over C libs like pocketfft and CBLAS.

@Pauan

Pauan commented Feb 17, 2021

The same for FF: https://github.com/mozilla/gecko-dev/blob/master/js/src/vm/StringType.cpp#L2054

  1. That is completely irrelevant, JS strings will always be externref, end of story. I don't know why you're even bringing that up, it makes no difference at all.
  2. That only applies to external strings, in that very file you can see JSString, JSLinearString, JSRope, JSExternalString, and you can see multiple encodings being used. Do you think Wasm only uses external strings?

Since you are clearly not willing to argue in good faith, and you have consistently ignored what I write, I will not argue with you anymore. Goodbye.

@MaxGraey

Again, JSString, JSLinearString and JSRope are all internal representations of a specific JS engine, but the external string API is pretty much the same for all JS engines, and it follows the WebIDL interface.

@fgmccabe
Contributor

The tone in this discussion has not always been helpful.
I believe that part of the reason is that there is a difference between (a) recognizing the current reality (many different representations of strings) and (b) promoting a particular future (a single uniform representation of strings across different languages).
IMO, even if one agrees with (b), this is not the appropriate forum to pursue this objective.

@ttraenkler

ttraenkler commented Feb 17, 2021

❤️ Let me just express some love for the efforts made on both sides with good intentions.

The only thing I can add to the good arguments made is that I think a lot of the build-up of frustration comes from this being a written conversation about a problem with high complexity and far-reaching implications. Writing does not convey the emotional intent of a comment, or how it is received on the other side of the planet in a completely different context and background. And let's not forget we're in the midst of a pandemic and lockdown.

A good summary of this from the 90s IRC guidelines may be a good read, and it would be good to continue this conversation in a more informal setting, in good will and with a person able to see and relay both sides: https://freenode.net/changuide

@dcodeIO
Author

dcodeIO commented Feb 17, 2021

I appreciate the good vibes, but this has been going on for three years now, with influential parts of the CG effectively mobbing and encouraging others to mob one independent person for ages. So I'm sorry, I don't think that taking a deep breath can resolve this. In my opinion this requires to be sanctioned, and should have required someone stepping in long ago, but nobody did that because nobody cared. As I expect this behavior to continue, the only option I have left here is to leave out of sheer protest, also for my own sanity, and in the hope that this is never again being done to anyone coming after me, as it certainly has the potential to seriously harm people who are not as resistant to this stuff as I am.

@aminya

aminya commented Feb 18, 2021

@Pauan

So the end result is that if the two modules share the same representation (such as both using externref), it can efficiently pass it without copying, and in the cases where the representations differ it will automatically copy/re-encode. This is the best we can do, there is no way to improve that.

So, you say that the whole WASM ecosystem is going to be made based on an optimization that is not even mentioned in the specs?

Naturally that means if Wasm modules want to ensure efficient transferring, then they should agree on a single representation. Since JS strings are already mandatory if you want to efficiently call host APIs, that seems like the obvious choice to use.

But that optimization is not mentioned anywhere, and we can't build efficient apps based on something that "might" happen.

With regards to the universal strings proposal, it is completely missing the point of why a single universal string type cannot happen.
This is simply not possible. There are too many different representations, and it goes far beyond encodings... we're talking about nul vs non-nul, GC vs non-GC, mutable vs immutable, ropes, multi-encodings, hybrid-encodings, etc.

The process of standardization means agreeing on a "unified interface" that benefits everyone. For example, there is already an effort by Bjarne Stroustrup with the Flats library.
https://youtu.be/ERzENfQ51Ck?t=75

If people wanted to look at the problems the way you look at them then literally any effort for standardization becomes meaningless.

The UniversalString proposal is not going to replace what a language is already using. It is offering an interface so different languages can communicate together using a unified interface.

JS strings will always be externref, end of story.

Is this what the community has decided, or is it a personal opinion?

@Pauan

Pauan commented Feb 18, 2021

@aminya So, you say that the whole WASM ecosystem is going to be made based on an optimization that is not even mentioned in the specs?

I'm not sure what you mean, externref is fully specced and already implemented.

Interface types are not fully specced or implemented, and you absolutely should not rely upon it. The interface types proposal has already changed radically, and it might change again.

Interface types do not replace externref, interface types are built on top of externref. Host types (such as JS strings/arrays/objects, and DOM objects) will always use externref (or an interface type wrapper around externref).

So if you use externref, then you will be future-compatible with interface types, since interface types will have lift/lower instructions for externref.

The process of standardization means agreeing on a "unified interface" that benefits everyone.

Standardization only works when there is common ground which can be standardized on. When there is radical deviation, standardization becomes impossible.

Interface types standardizes an abstract string type, and provides ways for languages to convert to/from that string type, and makes it possible for languages to efficiently share strings if they use the same representation. And in the future pre-imports will allow for creating a standardized method API. This sort of standardization is the only way to support the wide variety of string types which exist.
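As an illustration of the "copy when the representations match, re-encode when they differ" behavior described here, the following is a minimal sketch. It is not the interface types API: `Encoding`, `StringBytes`, `encodeUtf16le`, and `transfer` are hypothetical names invented for this example, and only UTF-8 and UTF-16LE are modelled.

```typescript
// Hypothetical names throughout; this is not the interface types API.
type Encoding = "utf-8" | "utf-16le";

interface StringBytes {
  encoding: Encoding;
  bytes: Uint8Array;
}

// Encode a JS string as UTF-16LE: two bytes per code unit, low byte first.
function encodeUtf16le(text: string): Uint8Array {
  const out = new Uint8Array(text.length * 2);
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    out[2 * i] = unit & 0xff;
    out[2 * i + 1] = unit >> 8;
  }
  return out;
}

function encode(text: string, encoding: Encoding): Uint8Array {
  return encoding === "utf-8" ? new TextEncoder().encode(text) : encodeUtf16le(text);
}

// Hand a string from a producer to a consumer that expects `target`.
function transfer(src: StringBytes, target: Encoding): StringBytes {
  if (src.encoding === target) {
    // Same representation on both sides: a plain byte copy is enough.
    return { encoding: target, bytes: src.bytes.slice() };
  }
  // Different representations: decode, then re-encode for the consumer.
  const text = new TextDecoder(src.encoding).decode(src.bytes);
  return { encoding: target, bytes: encode(text, target) };
}
```

In this sketch the module picks the encoding that matches its internal layout, and the re-encoding cost is only paid when the two sides disagree.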

But as I said to @dcodeIO, if you think it is possible to create one string type to rule them all, then please do so, but it is out of scope for interface types, so you should make a new proposal.

Even if universal strings existed, they would not replace externref or linear memory. Some languages would choose to use universal strings and others (such as C/C++/Rust) would not. And many languages would use externref because they care about efficient interop with the host. So you will simply get this situation.

Universal strings are simply one of many concrete types. Interface types is the abstract glue that allows for all the different concrete strings to seamlessly interop with each other. Interface types is at a higher level than universal strings. Universal strings cannot replace interface types, it is a completely separate thing, and so there is no point in arguing about it in the interface types proposal.

For example, there is already an effort by Bjarne Stroustrup with the Flats library.

That's very cool, but it's just another serialization library (like FlatBuffers, Protocol Buffers, Cap'n Proto, etc.)

Despite many standardized serialization libraries already existing, Bjarne chose to create a new one instead, presumably because the existing standards did not work for his needs. Having multiple standards is very common and completely normal. It is rare for a single standard to work for everyone.

The same exact thing happens with strings, each language chooses a string representation based on their individual needs.

The UniversalString proposal is not going to replace what a language is already using. It is offering an interface so different languages can communicate together using a unified interface.

If you want a unified interface, that's what interface types already provides (it's called (list char)).

Universal strings are not an abstract type, it is a concrete type with a concrete representation, just the same as externref. So universal strings cannot be used as a unified type between different languages.

Universal strings only works for a tiny handful of languages, interface types works for everything, without requiring languages to change their representation, because interface types are abstract.

Is this what the community has decided, or is it a personal opinion?

The "community" does not decide things, things are decided by the WebAssembly Working Group. And yes it was decided years ago that externref is the only viable way for the host to expose things to Wasm, because of security reasons.

Wasm is (intentionally) a sandbox, which does not have privileged access to the host. Wasm's memory is separate from the host's memory, and this is necessary in order to prevent all sorts of massive security problems on the web. That will not change.
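To make the memory separation concrete, here is a minimal sketch of what a JS host has to do when a string lives in Wasm linear memory: it copies and decodes the bytes into a host string. The helper name `readUtf16le`, the pointer value, and the choice to pass the length in UTF-16 code units are assumptions made for this example, not part of any proposal.

```typescript
// Hypothetical helper: read a UTF-16LE string out of linear memory,
// given a byte offset and a length in 16-bit code units.
function readUtf16le(memory: WebAssembly.Memory, ptr: number, lengthInUnits: number): string {
  const bytes = new Uint8Array(memory.buffer, ptr, lengthInUnits * 2);
  // Decoding creates a new host string; the host never aliases Wasm memory.
  return new TextDecoder("utf-16le").decode(bytes);
}

// Simulate a module that has written a string into its own memory.
const memory = new WebAssembly.Memory({ initial: 1 }); // one 64 KiB page
const view = new DataView(memory.buffer);
const text = "héllo";
const ptr = 16; // arbitrary offset chosen for this example
for (let i = 0; i < text.length; i++) {
  view.setUint16(ptr + 2 * i, text.charCodeAt(i), true); // true = little-endian
}

const roundTripped = readUtf16le(memory, ptr, text.length);

// Overwriting linear memory afterwards does not affect the host string,
// because the decode step produced an independent copy.
view.setUint16(ptr, "X".charCodeAt(0), true);
```

A string held as an externref avoids this copy on the host side, since the module then only holds an opaque reference to the host's own string.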

@aminya

aminya commented Feb 18, 2021

JS strings will always be externref, end of story.

Is this what the community has decided, or is it a personal opinion?

The "community" does not decide things, things are decided by the WebAssembly Working Group. And yes it was decided years ago that externref is the only viable way for the host to expose things to Wasm, because of security reasons.

That is not what I read in your sentence. The tone of the comment does not seem to match the code of conduct, and so it has created many questions for me and those who have already left this community.

https://www.w3.org/Consortium/cepc/

@Pauan

Pauan commented Feb 18, 2021

@aminya You asked who had made the design and decision for externref, and I answered.

The community of course is free to share their opinions, give advice, give insight, etc. but WebAssembly is not a democracy, decisions are not made by voting, decisions are not made by popular opinion, all decisions are made by consensus of the Working Group. That is how it has always worked.

Note that I am not a part of the Working Group, I have no authority to make decisions (so I am just the same as you). So please stop this "us vs them" mentality, it is not helpful.

The Working Group has decades of experience with the web. Most Working Group members are from Mozilla, Google, Microsoft, and Apple. They also work on the browsers and JS engines, so they are deeply familiar with both the web and the implementation within the browsers. They also have a lot of experience with non-JS languages and non-JS runtimes, and they are well versed in computer science academia. They are the most qualified and experienced to make decisions. You should respect them rather than assuming that they are wrong. Maybe they know things that you don't, so you should listen when they try to explain things.

It is not a violation of the code of conduct for me to answer your question and explain how the system works. Disagreement is also not a violation of the code of conduct (and it is not even morally wrong), especially when the disagreement is on technical matters.

You will not be able to get your way by trying to bully other people by falsely claiming that they are "insulting you" or "mobbing you" or "violating the code of conduct". Please do show the part of the code of conduct which you think I have violated. Quote the section of the code of conduct which I have violated, and quote my message which violates that section.

@conrad-watt

The community of course is free to share their opinions, give advice, give insight, etc. but WebAssembly is not a democracy, decisions are not made by voting, decisions are not made by popular opinion, all decisions are made by consensus of the Working Group. That is how it has always worked.

To be clear, the advancement of a proposal through all but the last phase of standardisation is based on Community Group consensus, and our process hasn't been tested yet by a divergence between Community Group and Working Group consensus (which would only matter for the final phase of standardisation). My expectation would be that such a divisive proposal would be rejected at an earlier phase, since Working Group members also contribute to the consensus of the Community Group.

The proposal here could be brought before the Community Group in an attempt to seek consensus for phase 1, and no one, Working Group member or otherwise, is going to stand in the way of this process. It would be the job of the proposal champion to convince the community to achieve consensus for each phase transition.

Without pre-supposing whether the Community Group would be able to form consensus, I'd point out that all the people involved in this conversation are part of this community, and the same debates regarding the extent to which the proposal privileges one set of string representations will likely be played out again in any meeting. These issue conversations can only ever be small-scale attempts to "feel for" how a Community Group consensus-seeking exercise would play out.

@dcodeIO
Author

dcodeIO commented Feb 18, 2021

@Pauan I can give one:

It is incredibly unreasonable and arrogant to tell them to "just rewrite your code to use this different string type".

First and foremost, nobody said what's in quotation marks there, and while putting words into another participant's mouth you even have the audacity to insult them for it?

I am much more concerned however with the roles the respective CG members, including the champion of this proposal, who is a co-creator of WebAssembly, or one Google employee in another issue where I asked about GC arrays, played in the broader picture of systematic bullying for all these years. This may well be a W3C group failing to uphold minimal standards, and tolerating that for an extended period of time. In my opinion this abuse can only be resolved by decisively expelling those involved for repeatedly and intentionally violating the values the CG once pledged to uphold.

@aminya

aminya commented Feb 18, 2021

@Pauan

Note that I am not a part of the Working Group, I have no authority to make decisions (so I am just the same as you).

You say:

#13 (comment)
JS strings will always be externref, end of story.
Since you are clearly not willing to argue in good faith

If you are not a member of the working group, you are not eligible for specifying the end of the story, specifying if people are arguing in good faith, or even specifying that a proposal is good or not. You can just say:
"In my opinion, this proposal does not satisfy my standards".

The impression I get from your above comments is that you are the only decision-maker for the whole community.

@Pauan

Pauan commented Feb 18, 2021

@dcodeIO Universal strings (under your current proposal) would require languages to rewrite countless thousands of libraries to accommodate the new string type. I have explained the many technical flaws with your proposal, yet you refuse to change your universal strings proposal.

So the only conclusion I can see is that you think these costs are justified, and that languages like Python should just be ignored. In fact both you and @MaxGraey have outright stated that languages like Python are unimportant and can be ignored. That is obviously unacceptable for a universal string type, especially since Python is one of the most popular languages in the world.

Perhaps I could have phrased it better, and I apologize for that, but look at your own comments for many examples of extremely disrespectful conduct. You have repeatedly accused the Working Group of sabotaging JS, and repeatedly made many baseless claims (such as claiming that we are dismissing universal strings for no reason, even though we have given extensive technical reasons for why it will not work). You also claimed that you are being "mobbed" (even though that has never happened), and you accused the Working Group of encouraging your mobbing (even though that never happened either).


@aminya If you are not a member of the working group, you are not eligible for specifying the end of the story

That is the conclusion that the Working Group came to, I am simply explaining their conclusion. Incidentally, I agree with their conclusion, because the host and Wasm's memory cannot intermingle, therefore externref is the only logical way to support host types. You have not presented any arguments to the contrary.

specifying if people are arguing in good faith

Yes, he was not arguing in good faith. He completely ignored all of the technical points I brought up, and he kept mentioning that external strings are UTF-16 (which is completely irrelevant and changes absolutely nothing about any of my arguments). That is a logical fallacy. Whether something is a logical fallacy or not is objective, it is not a subjective opinion. I have talked extensively with @MaxGraey , and he has consistently engaged in this behavior, this is not a one-time thing.

or even specifying that a proposal is good or not. You can just say: "In my opinion, this proposal does not satisfy my standards".

The arguments I have made are technical and objective, they are not subjective opinions. None of you have engaged with any of the technical arguments I have made.

It should not be necessary to constantly repeat "in my opinion", because that should be assumed when somebody is stating their opinion.

The impression I get from your above comments is that you are the only decision-maker for the whole community.

No, I have simply repeated the conclusions that others have already made, and given objective technical explanations for why those decisions were made, and for why universal strings will not work. No decisions were made, only explanations were given. And rather than engaging with those arguments, instead you try to divert the conversation into political areas.

That will not work. Even if you are right, and I did violate the code of conduct... that still does not make your arguments correct, and it does not make universal strings correct. The only way you can prove universal strings correct is to argue on a technical level, which you have refused to do.

If you are convinced that universal strings are a good idea, then you must create a proposal for it and champion it.

@MaxGraey

MaxGraey commented Feb 18, 2021

So the only conclusion I can see is that you think these costs are justified, and that languages like Python should just be ignored. In fact both you and @MaxGraey have outright stated that languages like Python are unimportant and can be ignored. That is obviously unacceptable for a universal string type, especially since Python is one of the most popular languages in the world.

Do you agree that compiling JavaScript to Wasm doesn't make sense? It can be done by embedding a whole VM like QuickJS or JSC, and this is used by Figma's plugins, for example. But it's impractical for the general case. Exactly the same goes for Python. Btw JS has at least the same popularity as Python. I just want to demonstrate to you that this is not an argument at all. There are languages that compile very well to WebAssembly, but there are also those that it makes absolutely no sense to pull into WebAssembly. Of course, there are projects like pyodide. But these are rather exceptions to the rule. And the speed of the interop does not really matter there. Also, I think I clearly explained that Python supports Latin1, UTF-8, UTF-16 / UCS-2 and UCS-4 seamlessly and simultaneously.

@Pauan

Pauan commented Feb 18, 2021

@MaxGraey Btw JS has at least the same popularity as Python. I just want to demonstrate to you that this is not an argument at all.

Yes... it is an argument, because the point of universal strings is to act as a universal type for interop between various Wasm languages.

It cannot fulfill that goal if it is excluding a large number of popular languages. A "universal string" type which only works for 5 languages is strictly worse than interface types, since interface types allow all languages to interop together, without giving special preference to only some languages.

There are languages that compile very well to WebAssembly, but there are also those that it makes absolutely no sense to pull into WebAssembly.

Generally speaking that isn't true: Wasm is low level enough that almost any language can run at close to native speeds on it. And in the cases where that isn't true, new proposals are created to fix that (e.g. tail calls). Wasm tries hard to not give preferential treatment to languages or groups of languages.

In any case, that is not Wasm's place to decide. Wasm does not decide which languages are "worthy" of being on Wasm, instead it should try to treat all languages as equally as it can.

Imagine the amount of drama that would happen if Wasm started privileging certain languages over others. Or if it started excluding very popular languages like Python.

Even this very thread was started because of the UTF-8 preference in the old proposal for interface types. And yet you are trying to do the same sort of exclusion.

And the speed of the interop does not really matter there.

You do not speak on behalf of those languages, or the projects using those languages. Many of them care very much about speed.

@aminya

aminya commented Feb 18, 2021

@Pauan

The words you use do matter to me and many other community members. Please read the code of conduct

That is the conclusion that the Working Group came to, I am simply explaining their conclusion.

If you want to restate someone's opinion, there are ways for that.
You should say: "The working group has concluded that JsStrings will be only externref" instead of saying "JS strings will always be externref, end of story"

Yes, he was not arguing in good faith. He completely ignored all of the technical points I brought up, and he kept mentioning that external strings are UTF-16 (which is completely irrelevant and changes absolutely nothing about any of my arguments). That is a logical fallacy. Whether something is a logical fallacy or not is objective, it is not a subjective opinion. I have talked extensively with @MaxGraey , and he has consistently engaged in this behavior, this is not a one-time thing.

I don't have the same idea, and I'm sure many others are the same, and because you are not the community manager, you are not eligible to conclude things just based on your personal opinion.

The arguments I have made are technical and objective, they are not subjective opinions. None of you have engaged with any of the technical arguments I have made.

Anything is subjective. Even the rules of physics don't hold in all the conditions.

It should not be necessary to constantly repeat "in my opinion", because that should be assumed when somebody is stating their opinion.

We can't assume things. That is not how language works. Based on the code of conduct, you should be as clear as possible.

Even if you are right, and I did violate the code of conduct.

Based on the code of conduct, you should

  1. Acknowledge that you've done something improper
  2. Briefly apologize. Don't try to explain yourself or minimize the issue
  3. If possible, edit your message, restate your communication in a better way or withdraw your statement. Publicly revising your statement helps define the culture for others

https://www.w3.org/Consortium/cepc/#mistake

If you don't intend to do that, I have no choice but to raise the issue directly with the Ombudspeople as a group or individually.

@MaxGraey

MaxGraey commented Feb 18, 2021

Generally speaking that isn't true: Wasm is low level enough that almost any language can run at close to native speeds on it. And in the cases where that isn't true, new proposals are created to fix that (e.g. tail calls). Wasm tries hard to not give preferential treatment to languages or groups of languages.

Sorry, but you really don't understand what WebAssembly is and how it is produced from different languages.

We seem to have already agreed that we will ignore each other? But you still continue to discuss with me. Why?

@dcodeIO
Author

dcodeIO commented Feb 18, 2021

In fact both you and @MaxGraey have outright stated that languages like Python are unimportant and can be ignored

And yet you are doing it again. I never said that. I'd be very interested to look into the concrete implementation details, actually, and see where I may perhaps be able to help. It may turn out that this is not possible of course, but what you are saying is untrue, and sure it is supported by those encouraging you to target me with it instead of stepping in.

@Pauan

Pauan commented Feb 18, 2021

@aminya You should say: "The working group has concluded that JsStrings will be only externref" instead of saying "JS strings will always be externref, end of story"

I agree that I could have phrased it better, but even then I do not think that my words were against the code of conduct.

I don't have the same idea, and I'm sure many others are the same, and because you are not the community manager, you are not eligible to conclude things just based on your personal opinion.

You can disagree if you like, but fallacies are objective.

I do not know why you are bringing up community managers. Anybody is free to make whatever discussions or conclusions they like, and also to repeat the conclusions of others, that is not something that only community managers are allowed to do.

Any discussions in this thread are not decisions, because decisions are made by consensus of the Working Group, which is handled via face-to-face meetings.

Anything is subjective. Even the rules of physics don't hold in all the conditions.

That is blatantly false, even more so in this discussion, since computers are intentionally designed to be objective and logical, and we are discussing computer systems.

We can't assume things. That is not how language works.

Language is based almost entirely on assumptions and context, it is very fuzzy. That is how humans work, humans are not robots. You seem to have a very extreme view of the code of conduct, you are wielding it as a weapon to use against anybody you disagree with, so I encourage you to contact an Ombudsperson.

I am not going to argue this matter anymore, since it's clear that you are unwilling to engage in any technical arguments, and are simply trying to strongarm others into silence.

@aminya

aminya commented Feb 18, 2021

That is blatantly false

That's not how I think. Based on the code of conduct, you are not allowed to use that word (blatantly) against anyone's opinion.

Language is based almost entirely on assumptions and context, it is very fuzzy. That is how humans work, humans are not robots.

That's not how the code of conduct is written, and you can't change the code of conduct based on your personal opinion.

I am not going to argue this matter anymore, since it's clear that you are unwilling to engage in any technical arguments, and are simply trying to strongarm others into silence.

Three years of technical arguments have not been enough, and so the people are so tired that they want to leave this community.

You seem to have a very extreme view of the code of conduct, you are wielding it as a weapon to use against anybody you disagree with, so I encourage you to contact an Ombudsperson.

I am giving you a chance to fix the issue informally.

Based on section 5 of the code of conduct:

"If you don't understand what you did wrong, assume that the hurt party has good cause and accept it."

@dtig
Member

dtig commented Feb 18, 2021

Adding a note here that personal attacks on technical discussions are not okay. I will be locking this issue down for now as it could use a cool down period. For technical discussions that need an avenue, I suggest adding agenda items to an upcoming CG meeting.

@WebAssembly WebAssembly locked as too heated and limited conversation to collaborators Feb 18, 2021