Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] string.ord or string.lt + string.gt #37

Open
MaxGraey opened this issue Jul 8, 2022 · 4 comments
Open

[Feature] string.ord or string.lt + string.gt #37

MaxGraey opened this issue Jul 8, 2022 · 4 comments

Comments

@MaxGraey
Copy link

MaxGraey commented Jul 8, 2022

I don't think string.eq alone will be enough. For string sorting we also need string.lt (a < b) and string.gt (a > b) operations.

Or ideally replacing string.eq (or at least addition) with the more generalized string.ord which perform lexicographical comparison and return -1 if (a < b), +1 if (a > b) and 0 if (a == b).

WDYT?

@wingo
Copy link
Collaborator

wingo commented Jul 11, 2022

Related to #5.

I think the question is really, "enough for what". I'm sympathetic to the use case, but I think that we may be able to get by with something more limited until we have more data indicating that more instructions would be enough (that word again!) of a win that we should standardize them. Three points:

Firstly, note that string.eq is inherently faster for its use case than computing lexicographic sort order; for different-length strings, we know immediately that they are unequal, but we don't know their order until examining their contents. (Of course optimizing compilers could recover the == 0 pattern but it's probably best for this proposal to include more predictable primitives.)

Secondly, I think there may be OK MVP solutions. Just trying to put myself in your shoes here -- If I were implementing a compiler for a language that required lexicographic ordering and I knew that I were targetting the web, I would implement an MVP string.ord by calling out to JS. Easy to implement and runs fast. If I wanted to polyfill in wasm based on the existing stringref MVP, I'd probably call out to an internal helper that makes two iterator views and implements the ordering. I think it's reasonable to hope that escape analysis at the optimizing compiler tier would eventually eliminate the iterator allocations, if iterator view performance ever became a priority. Or, to avoid allocation, I might encode WTF-8 to linear memory (or a cached i8 array) and use memcmp, if that's already part of the runtime.

Finally, as you might start requiring more and more elaborate string algorithms (collation, line breaking, normalization, etc), I'd definitely want to be leaning on some system library, be it another WASI component or JS/Intl. I can see how having string.ord in the stringrefs MVP might be useful but with the broader view I could also see it being quite acceptable from both a performance and maintenance POV in the system's string support library.

@MaxGraey
Copy link
Author

MaxGraey commented Jul 11, 2022

I don't think string.ord is as complicated as toUpperCase / toLowerCase. It's a very basic operation just combining the two and could be polyfilled as (a > b) - (a < b). But stringref doesn't provide these operations either. I suggest keeping string.eq but also adding string.ord. Of course the user can create his own implementation of string.ord compare each podepoint individually during iterating whole minimal string. But this would be extremely inefficient. And given that string.ord is likely to be used to sort an array of strings, this would be frustratingly slow. In fact, I see no problem adding string.org to the MVP. It's much easier, considering that many Wasm engines already have similar functionality for JS code

@rossberg
Copy link
Member

I believe string ordering on Unicode strings is locale-dependent and generally a pretty complex operation.

@MaxGraey
Copy link
Author

MaxGraey commented Jul 11, 2022

string.ord could be locale independent (for local-dependent comparision we need Intl which can be out of scope for now and probably reserve some immediate flag/hint for future). And for latin1 and WTF16 it's mostly similar as memory comparision (which btw also doesn't support in WebAssembly and bulk memory ops). For UTF16/UTF8/WTF8 it may be a little more complicated but still relatively simple. See this article for example: https://icu-project.org/docs/papers/utf16_code_point_order.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants