Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

passStyleOf must validate a string isWellFormed in order to be Passable #1739

Open
gibson042 opened this issue Aug 26, 2023 · 31 comments · Fixed by #2002
Open

passStyleOf must validate a string isWellFormed in order to be Passable #1739

gibson042 opened this issue Aug 26, 2023 · 31 comments · Fixed by #2002
Assignees
Labels
bug Something isn't working kriskowal-reviewed-2024-01 Issues that kriskowal is satisfied do not need attention from team bug review as of January, 2024 ocapn

Comments

@gibson042
Copy link
Contributor

gibson042 commented Aug 26, 2023

Describe the bug

JavaScript represents strings as sequences of 16-bit integers, each ostensibly representing a UTF-16 code unit, and the native relational operators and Array.prototype.sort compare strings as such rather than by code point:

const nativeCompare = (a, b) => a < b ? -1 : a > b ? 1 : 0;

const sortedCodePoints = [0x026A0, 0x0FB05, 0xFF4A, 0x1D306, 0x1F4A9, 0x1F525];

sortedCodePoints.map(cp => String.fromCodePoint(cp));
// => [ '⚠', 'ſt', 'j', '𝌆', '💩', '🔥' ]

sortedCodePoints.sort(nativeCompare).map(cp => String.fromCodePoint(cp));
// => [ '⚠', 'ſt', 'j', '𝌆', '💩', '🔥' ]

sortedCodePoints.map(cp => String.fromCodePoint(cp)).sort(nativeCompare);
// => [ '⚠', '𝌆', '💩', '🔥', 'ſt', 'j' ]

This leaks into places like the wire formats associated with CapData and passable encodings, and also into internal representations derived from them and relevant to e.g. pattern matching (all of which are relevant for interoperability, cf. #1587).

Steps to reproduce

import "@endo/init";
import { makeMarshal } from "@endo/marshal";
import { makeCopySet, matches, M } from "@endo/patterns";

// Decode a CopySet whose wire-format payload is an array of strings reverse-sorted by code point,
// as might be produced by a remote non-JS system.
// It does not match an `M.set()` pattern.
const body = `{"#tag":"copySet","payload":["🔥","j"]}`;
const capData = JSON.parse(`{"body":"#${JSON.stringify(body).slice(1, -1)}","slots":[]}`);
const set1 = makeMarshal().fromCapData(capData);
// => Object [copySet] { payload: [ '🔥', 'j' ] }
matches(set1, M.set());
// => false

// A locally-constructed clone reverse-sorts the same payload by code unit,
// and *does* match `M.set()`.
const set2 = makeCopySet(set1.payload);
// => Object [copySet] { payload: [ 'j', '🔥' ] }
matches(set2, M.set());
// => true

Expected behavior

I guess we need to decide on whether the necessary sorting is based on 16-bit code units or on code points, or alternatively that it is a responsibility of decoding (but that requires understanding the tag, because this lack of user-specifiable ordering is a special feature of CopySets, CopyBags, and CopyMaps, and in the latter case requires coordinating parallel key/value arrays).

Or alternatively, we could put something into the encoding layer below tagged typing to explicitly expose sets and/or maps in the data model such that equivalence ignores wire-format key ordering—perhaps in CapData, any array with a privileged first element is a set (e.g., { "body": "#[\"(\", \"foo\", \"bar\", \"baz\"]", "slots": [] } might encode the logical set of "foo", "bar", and "baz" using smallcaps). But note that the wire format of a map or bag would need to be something like an array of [key, value] entries rather than independent (and thus independently reorderable) collections of keys and values.

And either way, that decision should propagate into the OCapN discussion.

@gibson042 gibson042 added the bug Something isn't working label Aug 26, 2023
@erights
Copy link
Contributor

erights commented Aug 27, 2023

In XS, which uses CESU-8 for builtin strings for us at our request, how much more expensive would it be to convert two strings to UTF-8 (or rather, UTF-8 together with only unpaired surrogates) and then compare those?

Attn @phoddie @patrick-soquet

My sense is that we should NOT ask Moddable to switch us back to their UTF-8 support as the representation of builtin strings, as that support will never be consistent with JS. What happens when appending a string that ends with an unpaired surrogate with a string that starts with an unpaired surrogate? With CESU-8, we maintain reasonable compat with JS. With UTF-8 for builtin strings, we cannot.

Besides, if we relied on native UTF-8 string representation from the engine, our code would not be portable to other JS engines. So really, our only choice is conversion on comparison in userland, possibly with some userland finite memo (since we cannot use weakness. Or perhaps a transient memo?). This is the best we should try to do if we want to repair the sort-order issue towards Unicode. If it is cheap enough, we should do that. If it is not cheap enough, we need to figure out what we want and what to negotiate with ocapn.

@erights
Copy link
Contributor

erights commented Aug 27, 2023

Hi @gibson042 I assigned this to the two of us, as we seem to have the most interest -- or patience! -- with this topic ;)

@mhofman
Copy link
Contributor

mhofman commented Aug 27, 2023

Can't we leverage a host implemented compare function? With a fallback to a JS implementation if an optimized host compare is not available.

I think at the end of the day we should make sure none of our logic relies on the internal encoding of strings. I guess there is the question of what should the behavior be if the encoding allows invalid Unicode codepoints.

@phoddie
Copy link

phoddie commented Aug 27, 2023

If I'm understanding this correctly...

Cases where the result is different from the standard, here as a result of the unusual string encoding choice, are conformance bugs. They should be fixed in the engine.

As for strings "on the wire", the situation is the same as with any engine:

  • Scripts can efficiently export a UTF-8 encoding of a string using TextEncoder and efficiently import UTF-8 encoded strings with TextDecoder
  • Native code that accesses a JavaScript string is responsible for converting from the internal format (here CESU-8) to whatever the wire format uses

FWIW – the CESU-8 encoding, if I recall correctly, has better conformance as measured by test262 than the UTF-8 encoding. I don't think we should move backwards.

@kriskowal
Copy link
Member

kriskowal commented Aug 27, 2023

The crux of this particular issue is that the concern of key order and comparison needs to be consistent and priced right for both the memory and wire models, although those representations necessarily differ, and be a good compromise between multiple languages. I do not think we need to transcode in memory to compare.

@erights
Copy link
Contributor

erights commented Aug 27, 2023

FWIW – the CESU-8 encoding, if I recall correctly, has better conformance as measured by test262 than the UTF-8 encoding. I don't think we should move backwards.

I fully agree. That's the point I was making with the string append example. With UTF-8, there's no way the behavior of that example would be compat with JS. I am very grateful that our internal encoding is CESU-8.

@erights
Copy link
Contributor

erights commented Aug 27, 2023

Can't we leverage a host implemented compare function? With a fallback to a JS implementation if an optimized host compare is not available.

That sounds good to me. Or better, if TextEncoder/TextDecoder is fast enough to use to prepare the data for comparison, then the remaining need would be for a fast comparison by byte data, which we will eventually need anyway. (No rush on that.)

I think at the end of the day we should make sure none of our logic relies on the internal encoding of strings. I guess there is the question of what should the behavior be if the encoding allows invalid Unicode codepoints.

What does TextEncoder/TextDecoder do with invalid Unicode codepoints (i.e., lone surrogates)? What are they supposed to do?

@kriskowal
Copy link
Member

kriskowal commented Aug 27, 2023

I can only imagine TextEncoder and TextDecoder as performing well enough as a stopgap for having an as-if-UTF-8 native comparison function. Avoiding allocation is usually the first big win when performance comes to roost.

@mhofman
Copy link
Contributor

mhofman commented Aug 28, 2023

Right a comparison function should not allocate. Also to be pedantic, we need a Unicode codepoint comparison. I believe an advantage of UTF-8 encoding is that comparing bytes result in the same as comparing codepoints, but we don't need to encode to UTF-8 to perform that comparison.

Also given that there is no comparison for bytearrays, I don't think TextEncoder really helps here. Iterating code units is probably sufficient, as long as we can handle both CESU-8 and UCS-2.

@erights
Copy link
Contributor

erights commented Aug 28, 2023

I just gotta gripe for a moment:

UTF-8 does preserve Unicode ordering
CESU-8 does preserve UTF-16 ordering
When UTF-16 was designed, AFAICT it could have been designed to have preserved Unicode ordering, in which case all these difficulties would have been avoided. CESU-8 and UTF-8 would have had the same ordering. Sigh. Missed opportunity.

If someone knows the history of the decision, I'd be curious. Was this considered? Had 16 bit Unicode already made commitments that prevented UTF-16 from preserving 21 bit Unicode order?

@gibson042
Copy link
Contributor Author

Had 16 bit Unicode already made commitments that prevented UTF-16 from preserving 21 bit Unicode order?

As I understand it, the short answer is "yes". Unicode 1.1 had already claimed the top of the range, defining U+FFFD REPLACEMENT CHARACTER and annotating U+FFFE and U+FFFF as "Not character codes" (the former of which is the transposition of the U+FEFF byte order mark used for endianness identification in UTF-16 and possibly in its UCS-2 predecessor), so any 16-bit scheme that preserved lexicographic ordering by code point would need to start with a U+FFFF prefix followed by lower code units and thereby run afoul of the "overlap" problems mentioned in the encoding forms FAQ (barring a backwards-incompatible move that wasn't going to be palatable after what RFC 2781 refers to as the "Korean mess").

@phoddie
Copy link

phoddie commented Aug 28, 2023

Given the flow of this conversation, I feel like I'm overlooking a key assumption. I don't see how a reasonable behavior can be achieved solely by script-level patches. That would work to provide a different default comparison function for sort. But it won't change the comparison operators (<, >, etc) so custom comparison functions and scripts that perform string comparisons would still encounter non-conforming behavior.

@gibson042
Copy link
Contributor Author

@phoddie The contributors to this repository don't need ECMAScript built-in methods or operators to deviate from the specification (and in fact they must not), but we do need to come to some explicit decisions (for producing, always sort by code unit vs. always sort by code point, and for consuming, insist sorting by code unit vs. insist sorting by code point vs. normalize at ingress) and consistent implementation thereof to support interchange.

I think you were mentioned to weigh in on the best available strategy for comparing ECMAScript strings lexicographically by code point in XS and how it performs relative to native operators comparing by code unit.

@mhofman
Copy link
Contributor

mhofman commented Aug 28, 2023

Can someone confirm my understanding of the following:

  • CESU-8 is an internal representation used by XS, which has the same ordering as UCS-2 (UTF-16 code unit), and thus is not visible to JS programs
    • What about the length of strings in XS?
  • Some packages in endo currently rely on the JS lexical ordering of strings, which is not portable to other systems which don't use UTF-16 code unit order. These operation to become interoperable would need to switch to a different comparator which uses Unicode code points instead.
    • This comparator function does not need to do any allocation to perform its comparison. I could be implemented as a native function

@phoddie
Copy link

phoddie commented Aug 28, 2023

@gibson042 Thank you for the notes. I understand now that the concern here is with standard JavaScript language behavior, not a conformance issue with the XS implementation.

What you want is a comparison function that sorts as if the strings were UTF-8 encoded. That will be used to ease interoperability with systems that are not JavaScript-based. For efficiency, this UTF-8 sort should be implemented in native code when running on XS. That will be fast and not increase the burden on the GC. Even straightforward, unoptimized C will perform better.

@phoddie
Copy link

phoddie commented Aug 28, 2023

What about the length of strings in XS?

They are conformant. That was among the benefits of moving to CESU-8.

@phoddie
Copy link

phoddie commented Aug 30, 2023

Now that I think I understand the goal (thank you for your patience), a solution would look something like this?

  • A JavaScript implementation of a comparison function that gives UTF-8 sorting. This would be used when running on Node.
  • A C implementation of an equivalent function for use with XS. This would be not perform allocations.
  • The C function will be integrated in xsnap. The precedent seems to be as a global (see xsBuildAgent).

My quick search for an implementation of a UTF-8 comparison function in JavaScript came up empty. We can surely write a C version from scratch, but a working algorithm as a starting point would be preferred.

@kriskowal
Copy link
Member

I very much doubt the existence of a code-point-order comparator for JavaScript strings. If we can’t find one (and please forgive me if this is obvious), we can probably convert a UTF-8 encoder into a comparator that co-iterates over a pair of string internals.

@phoddie
Copy link

phoddie commented Aug 30, 2023

I very much doubt the existence of a code-point-order comparator for JavaScript strings

Your intuition matches my search results. So.

I very much doubt the existence of a code-point-order comparator for JavaScript strings. If we can’t find one (and please forgive me if this is obvious), we can probably convert a UTF-8 encoder into a comparator that co-iterates over a pair of string internals.

Yes, something like that would be straightforward. There's no problem to do that, just looking to avoid reinvention if possible. That's probably optimistic at this point.

@gibson042
Copy link
Contributor Author

gibson042 commented Aug 30, 2023

const compareByCodeUnits = (left, right) => {
  for (let i = 0; ; ) {
    const leftCodepoint = left.codePointAt(i);
    const rightCodepoint = right.codePointAt(i);
    if (leftCodepoint === undefined && rightCodepoint === undefined) {
      return 0;
    } else if (leftCodepoint === undefined) {
      // left is a prefix of right.
      return -1;
    } else if (rightCodepoint === undefined) {
      // right is a prefix of left
      return 1;
    } else if (leftCodepoint < rightCodepoint) {
      return -1;
    } else if (leftCodepoint > rightCodepoint) {
      return 1;
    }
    i += leftCodepoint <= 0xffff ? 1 : 2;
  }
};


[ '⚠', 'ſt', 'j', '𝌆', '💩', '🔥' ].sort(compareByCodeUnits);
// => [ '⚠', 'ſt', 'j', '𝌆', '💩', '🔥' ]

[ '⚠', '𝌆', '💩', '🔥', 'ſt', 'j' ].sort(compareByCodeUnits);
// => [ '⚠', 'ſt', 'j', '𝌆', '💩', '🔥' ]

[ '⚠', '𝌆', '💩', '🔥', 'ſt', 'j' ].sort((a, b) => compareByCodeUnits(b, a));
// => [ '🔥', '💩', '𝌆', 'j', 'ſt', '⚠' ]

[ '⚠', '𝌆', '💩', '🔥', 'ſt', 'j' ]
  .flatMap((a, i, arr) => [...arr.map(b => b + a), a])
  .sort(compareByCodeUnits);
/* =>
[
  '⚠',    '⚠⚠',   '⚠ſt',   '⚠j',  '⚠𝌆',
  '⚠💩',  '⚠🔥',  'ſt',    'ſt⚠',   'ſtſt',
  'ſtj',  'ſt𝌆',   'ſt💩',  'ſt🔥',  'j',
  'j⚠',  'jſt',  'jj', 'j𝌆',  'j💩',
  'j🔥', '𝌆',    '𝌆⚠',   '𝌆ſt',   '𝌆j',
  '𝌆𝌆',   '𝌆💩',  '𝌆🔥',  '💩',   '💩⚠',
  '💩ſt',  '💩j', '💩𝌆',  '💩💩', '💩🔥',
  '🔥',   '🔥⚠',  '🔥ſt',  '🔥j', '🔥𝌆',
  '🔥💩', '🔥🔥'
]
*/

@phoddie
Copy link

phoddie commented Aug 30, 2023

Perfect. Thank you!

@kriskowal
Copy link
Member

@gibson042 is it safe to assume that the left and right indices will always be the same, since the algorithm short-circuits at a divergence?

@gibson042
Copy link
Contributor Author

Yes, good observation. Updated, and also added a demonstration for strings of more than one code point.

@kriskowal
Copy link
Member

@gibson042 What’s the status of this issue? Did we make a choice?

@erights
Copy link
Contributor

erights commented Jan 8, 2024

Skimming back over the above thread, the meeting of minds seems to be that we want compareRank/compareKeys/compactOrdered to converge on Unicode code point order, which would be the same as a lexicographic byte order of UTF8 encoding of the strings. We use @gibson042 's compareByCodeUnits above as executable specification, as well as the implementation on non-XS platforms. Perhaps we also use it as an interim implementation on XS if it is fast enough for that purpose, although the thread above assumes (probably correctly) that it would be painfully slow for that purpose.

In any case, the plan would be to use a native implementation of compareByCodeUnits on XS. Until that native implementation exists, perhaps we stick with UTF16 code unit order. Though, obviously, the longer we wait before switching compareRank/compareKeys/compactOrdered, the more painful the compat risk will be when we do switch.

@erights
Copy link
Contributor

erights commented Jan 8, 2024

It is less clear to me whether we have a meeting of the minds on unpaired surrogates.

@erights
Copy link
Contributor

erights commented Jan 8, 2024

Obviously, to be coordinated with OCapN effort as well. Note: OCapN meeting this Tuesday. Topic for agenda?

@phoddie
Copy link

phoddie commented Jan 8, 2024

Until that native implementation exists, perhaps we stick with UTF16 code unit order. Though, obviously, the longer we wait before switching compareRank/compareKeys/compactOrdered, the more painful the compat risk will be when we do switch.

It looks like a native implementation of compareByCodeUnits should be straightforward. Is that sufficient for you to implement compareRank/compareKeys/compactOrdered?

It is less clear to me whether we have a meeting of the minds on unpaired surrogates.

The code from @gibson042 is clear on that. But, maybe you are referring to how that is defined in the spec of the wire format?

@erights
Copy link
Contributor

erights commented Jan 8, 2024

The code from @gibson042 is clear on that.

Looked. It relies on codePointAt, so I looked at

https://tc39.es/ecma262/multipage/text-processing.html#sec-string.prototype.codepointat

which says

If a valid UTF-16 surrogate pair does not begin at pos, the result is the code unit at pos.

Given that, yes it is clear and is in any case my preferred semantics for unpaired surrogates.

It looks like a native implementation of compareByCodeUnits should be straightforward. Is that sufficient for you to implement compareRank/compareKeys/compactOrdered?

I think so. Any objections?

@kriskowal kriskowal added the kriskowal-reviewed-2024-01 Issues that kriskowal is satisfied do not need attention from team bug review as of January, 2024 label Jan 10, 2024
@erights
Copy link
Contributor

erights commented Jan 20, 2024

Jan 2024 OCapN meeting notes record that we agreed that strings can only be well-formed Unicode, i.e., cannot contain unpaired surrogates. For JavaScript, if a string does not pass the isWellFormed predicate, then it is not a Passable string. Agoric has yet to implement this validation, but will.

In light of that, I'm renaming this issue to keep track of that need.

@erights erights changed the title JavaScript string details leak into data representation JavaScript strings must be well formed Unicode strings to be Passable Jan 20, 2024
@erights erights changed the title JavaScript strings must be well formed Unicode strings to be Passable passStyleOf must validate a string isWellFormed in order to be Passable Jan 20, 2024
@erights
Copy link
Contributor

erights commented Mar 13, 2024

Reopening, because endojs/endo#2002 itself hides the feature behind a feature flag that we currently have disabled by default. Once we change this to default to enabled, then we can close this issue.

@erights erights reopened this Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working kriskowal-reviewed-2024-01 Issues that kriskowal is satisfied do not need attention from team bug review as of January, 2024 ocapn
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants