[BasicCABI] Pass compiler-rt 128-bit return values in memory #223

aheejin · 2024-03-19T02:22:12Z

Currently our LLVM Wasm backend returns 128-bit values as two i64s in case multivalue is enabled:
https://github.com/llvm/llvm-project/blob/a5f576e5961ecc099bd7ccf8565da090edc84b0d/llvm/lib/Target/WebAssembly/WebAssemblyRuntimeLibcallSignatures.cpp#L697-L700

But given that neither emscripten nor wasm-sdk seem to provide a multivalue version of compiler-rt, it looks this has not been working so far and the reason we haven't heard complaints was likely that no one was using compiler-rt with multivalue enabled.

Maintaining and providing two different versions of compiler-rt is a cumbersome thing for toolchains, and emscripten already has to provide multiple versions of many libraries (e.g. threaded vs. non-threaded, debug vs. release, exception-enabled vs. disabled, ...).

Also enabling the multivalue return on several compiler-rt functions that have a 128-bit return value wouldn't affect performance in a meaningful way, given that there are not many of them.

I had a chance to chat with several people who contribute here offline this morning, and it looked we agreed that there is not much benefit to enabling multivalue return in compiler-rt functions. One thing I'm not sure is whether we decided to disable 128-bit multivalue returns for only compiler-rt functions or for all user functions. This PR currently says we do that only for compiler-rt; please let me know if you think we should do otherwise.

Currently our LLVM Wasm backend returns 128-bit values as two `i64`s in case multivalue is enabled: https://github.com/llvm/llvm-project/blob/a5f576e5961ecc099bd7ccf8565da090edc84b0d/llvm/lib/Target/WebAssembly/WebAssemblyRuntimeLibcallSignatures.cpp#L697-L700 But given that neither emscripten nor wasm-sdk seem to provide a multivalue version of compiler-rt, it looks this has not been working so far and the reason we haven't heard complaints was likely that no one was using compiler-rt with multivalue enabled. Maintaining and providing two different versions of compiler-rt is a cumbersome thing for toolchains, and emscripten already has to provide multiple versions of many libraries (e.g. threaded vs. non-threaded, debug vs. release, exception-enabled vs. disabled, ...). Also enabling the multivalue return on several compiler-rt functions that have a 128-bit return value wouldn't affect performance in a meaningful way, given that there are not many of them. I had a chance to chat with several people who contribute here offline this morning, and it looked we agreed that there is not much benefit to enabling multivalue return in compiler-rt functions. One thing I'm not sure is whether we decided to disable 128-bit multivalue returns for only compiler-rt functions or for all user functions. This PR currently says we do that only for compiler-rt; please let me know if you think we should do otherwise.

tlively · 2024-03-19T04:14:53Z

One thing I'm not sure is whether we decided to disable 128-bit multivalue returns for only compiler-rt functions or for all user functions. This PR currently says we do that only for compiler-rt; please let me know if you think we should do otherwise.

Since the general principle we've been discussing is that changing feature flags should never change the ABI, I think it makes sense to disable 128-bit multivalue returns everywhere. If we want, we can continue allowing them in the experimental multivalue ABI, though, since that is a separate ABI.

fitzgen · 2024-03-19T15:34:41Z

Also enabling the multivalue return on several compiler-rt functions that have a 128-bit return value wouldn't affect performance in a meaningful way, given that there are not many of them.

In Wasmtime, we've seen a lot of bottlenecks on compiler-rt's __muilti3, especially in various cryptography functions, and the two big reasons this is a bottleneck is because (1) we don't do inlining in Cranelift and assume that the toolchain has already done it, which isn't generally the case for compiler-rt functions because they are so late in the pipeline IIUC, and (2) because the returns are passed through memory rather than on the Wasm stack, and we can't generally prove that the memory accesses won't trap or that the stores aren't otherwise unobserved, so we have to perform them.

More details/investigation by @jameysharp over here: bytecodealliance/wasmtime#4077 (comment)

It would be a shame if we precluded returning i28s directly from compiler-rt, and standardized (2) as a permanent bottleneck, barring new core wasm instructions specifically for eliminating __multi3: WebAssembly/design#1495.

TerrorJack · 2024-03-19T15:36:58Z

This will also hurt users that do compile their own entire sysroot with experimental mv abi and expect compiler-rt to not have in-memory overhead with i128s.

sbc100 · 2024-03-19T16:41:37Z

which isn't generally the case for compiler-rt functions because they are so late in the pipeline IIUC,

Doesn't wasm-opt take care of inlining that function? Or is there some toolchain that doesn't include running wasm-opt on the final binary?

jedisct1 · 2024-03-19T16:45:09Z

Doesn't wasm-opt take care of inlining that function? Or is there some toolchain that doesn't include running wasm-opt on the final binary?

The Zig toolchain always recompiles the sysroot with the same flags (including wasm extensions) as the rest of the code, and doesn't run wasm-opt.

sbc100 · 2024-03-19T16:45:34Z

BasicCABI.md

+[1] In case of parameters, `long long double`, `__int128`, and `__float128` are
+passed directly as two `i64` values. In case of return values, for
+non-compiler-rt functions, they are passed directly as two `i64` values if
+[multivalue](https://github.com/WebAssembly/multi-value) is enabled and


I don't think this is true is it? Enable multi-value doesn't change the ABI, right?

We do have an experimental multi-value ABI but its not official, and we probably don't want to be mentioning it at all here.

Will fix this once we decide what to do with the feature flag / experimental flag / ABI.

sbc100 · 2024-03-19T16:48:13Z

Doesn't wasm-opt take care of inlining that function? Or is there some toolchain that doesn't include running wasm-opt on the final binary?

The Zig toolchain always recompiles the sysroot with the same flags (including wasm extensions) as the rest of the code, and doesn't run wasm-opt.

So is there room to add a wasm-opt step there to optimize the final output? Or does zig already take care of things like inlining itself? Regardless of the toolchain it should be possible to do whole program optimizing such as inlining __multi3, right?

jedisct1 · 2024-03-19T17:28:52Z

So is there room to add a wasm-opt step there to optimize the final output?

This is very unlikely. The toolchain is self-contained, and doesn't spawn external commands.

Or does zig already take care of things like inlining itself? Regardless of the toolchain it should be possible to do whole program optimizing such as inlining __multi3, right?

For apps written in Zig, it can take care of inlining. Not for apps written in C/C++.

__int128 mul(__int128 a, __int128 b) {  return a * b; }

zig cc -O3 -c --target=wasm32-freestanding -mcpu=baseline+multivalue a.c

(import "env" "__multi3" (func $fimport$0 (param i64 i64 i64 i64) (result i64 i64)))

sbc100 · 2024-03-19T17:46:29Z

So is there room to add a wasm-opt step there to optimize the final output?

This is very unlikely. The toolchain is self-contained, and doesn't spawn external commands.

Or does zig already take care of things like inlining itself? Regardless of the toolchain it should be possible to do whole program optimizing such as inlining __multi3, right?

For apps written in Zig, it can take care of inlining. Not for apps written in C/C++.
__int128 mul(__int128 a, __int128 b) {  return a * b; }
zig cc -O3 -c --target=wasm32-freestanding -mcpu=baseline+multivalue a.c

Can I ask what baseline+multivalue does? Does it turn simple enable the multi-value feature, or does it turn on the experimental multi-value C ABI in clang?

Our current plan for clang/llvm is that just turning on the feature shouldn't effect the C ABI at all.

jedisct1 · 2024-03-19T18:04:16Z

Can I ask what baseline+multivalue does? Does it turn simple enable the multi-value feature, or does it turn on the experimental multi-value C ABI in clang?

baseline is a synonym for mvp and adding +multivalue to the target CPU does the same thing as -mmultivalue.

kripken · 2024-03-19T20:01:43Z

BasicCABI.md

+array                        | indirect      | N/A                |
+
+[1] In case of parameters, `long long double`, `__int128`, and `__float128` are
+passed directly as two `i64` values. In case of return values, for


[1] appears once in the "Parameter" column, so I'm confused about seeing "in case of return values" here. Should that part be in [2]?

Right, it should go to [2]. Given that whether we are gonna do this or not is unclear, will fix it later (if we go this route)

kripken · 2024-03-19T20:14:49Z

The Zig toolchain always recompiles the sysroot with the same flags (including wasm extensions) as the rest of the code, and doesn't run wasm-opt.

LTO is another way to get such functions inlined. Does Zig support that?

Otherwise, in general, a huge advantage of wasm over typical targets is exactly that it is very simple to inline calls even without LTO. So even if a toolchain like Zig doesn't want to run an external tool like wasm-opt (fair enough) then it can implement such inlining itself, without much code or effort. (Inlining heuristics can be tricky, but for a specific toolchain like Zig it could just say "always inline __muilti3" which it controls anyhow.)

For those reasons I'd suggest not worrying much about the inlining aspect here.

aheejin · 2024-03-19T22:54:26Z

These are my current thoughts:

I agree with #223 (comment) that we shouldn't treat compiler-rt and other libraries differently. One big motivator for this PR was it's hard for toolchains to maintain two different versions of libraries, and it turns out not only compiler-rt but other libraries like libc and libc++ also have i128 return values. But from the compiler's point of view, those library functions, many of which are not listed in https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/RuntimeLibcalls.def, are not any different from user functions, so there's no way of disabling multivalue returns for only libraries and not user functions. If we disable it only for the runtime libcalls listed in https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/RuntimeLibcalls.def, it doesn't end up solving the two-libraries problem in the toolchain. (In practice emscripten already maintains several different versions of certain libraries (e.g. EH vs. non-EH, multithreaded vs. non-multhreaded, ...), so adding multivalue vs. non-multivalue will double the number of already existing combinations of builds.)

Unless you are not building all your libraries from scratch by yourself, which Zig seems to do, disabling multivalue return doesn't regress the status quo for most of the users, given that neither emscripten nor wasi-sdk currently provides multivalue returning versions of libraries. But I also understand the concern that it's suboptimal to preclude the possibility of returning multivalue in the future for better performance. I had assumed the usage of those functions wouldn't affect the performance in a meaningful way but apparently there were cases they did.

I chatted w/ @dschuff offline and we agreed that while enabling a certain feature can (and usually do) change the output code, it might not be desirable for that to change the ABI, and changing the ABI should be dependent on a separate flag. And we don't need to distinguish compiler-rt vs. non-compiler-rt or library vs. user code. We already have an experimental flag that enables multivalue for struct return values: -target-abi -Xclang experimental-mv. Currently this only controls structs and not 128-bit values, which are controlled by -mmultivalue instead, which we think should be changed.

So our suggestion is, how about making the lowering of 128-bit return values also depend on -target-abi -Xclang experimental-mv? This is a clang flag and transferring this info available to the backend may involve some hack but I think it's doable (@tlively, do you think it's unreasonably hard?) .

Downsides of this approach can be 1. current users of multivalue-returning feature have to opt in to use that option 2.
if they want to enable multivalue returns for 128-bit values, they have to enable multivalue returns for structs too, because the single option controls both. I think this can be a reasonable compromise and I'm not really sure if we want one more experimental flag that exclusively controls 128-bit value returns. One thing I'd like to fix before doing this is, according to @tlively, we don't have a limit on the size of struct that can be returned by multivalue, so if we have a function that returns a struct with 100 i32s, it will return 100 values via multivalue. I think we should put some limit on this, which can be some arbitrary number that sounds reasonable, e.g., 4 or something. I remember @tlively did some experiment on this a long time ago and don't remember the result was, but I don't think we want to invest more time on redoing the experiments to pick an optimal number now. @tlively, do you have a suggestion on a reasonable limit?

I'd like to hear what people think about this way forward (making -target-abi -Xclang experimental-mv control all multivalue return). Thank you for all the feedback!

TerrorJack · 2024-03-19T23:02:00Z

As long as -mmultivalue -Xclang -target-abi -Xclang experimental-mv does not regress from current behavior of always passing i128 via multi value (even for compiler-rt when compiled from source) then it's fine for us. It would be even nicer to mention this in the ABI document, along with a warning sign that says it's highly experimental and require recompiling sysroot, but that's just my two cents.

I think we should put some limit on this

That's better discussed on another thread, but my option is such limits, if any, should be treated as "Implementation Limitations" in core wasm spec.

alexcrichton · 2024-03-19T23:11:41Z

If I might throw out another possible alternative, could it perhaps also be possible to write this down in source itself?

For example compiler-rt is specific to a toolchain version (or so I believe) which means that even if the default C ABI is to not use multi-value compiler-rt itself could use multi-value. That would enable defining __multi3, for example, using a multi-value-based-return without affecting everyone else's C ABI. This wouldn't require multiple builds with/without multivalue, instead it's a decision one day of whether __multi3, and only __multi3 (or other compiler-rt intrinsics) should use multi-value for their return values.

This would also solve a problem of being able to intermix multi-value-returns-and-not. For example the outermost function might want to use multi-value as it's trying to match an exact API "shape" of the wasm module being created, but everything else just wants to interoperate with the system which is likely not going to use multi-value for some time now. Enabling the two ABIs to be present in the same module would help avoid making multi-value an all-or-nothing decision.

In such a world I would imagine that -target-abi -Xclang experimental-mv also would not change, it's switching the default ABI away to something else.

sbc100 · 2024-03-19T23:19:23Z

In Wasmtime, we've seen a lot of bottlenecks on compiler-rt's __muilti3, especially in various cryptography functions

Is the problem here that the cryptography functions are compiling with the assumption that they have native int128 support? Perhaps if they would prefer a different codepath if they knew that int128 were being software emulated? i.e. is there some HAVE_NATIVE_INT128 macro that they are getting wrong for wasm?

aheejin · 2024-03-19T23:47:36Z

@TerrorJack

I think we should put some limit on this

That's better discussed on another thread, but my option is such limits, if any, should be treated as "Implementation Limitations" in core wasm spec.

The limitation we want to impose here is not the "limitation" in the core spec sense, which is used for the maximum number of something that the engine implementation supports:
https://webassembly.github.io/spec/core/appendix/implementation.html
https://webassembly.github.io/spec/js-api/index.html#limits

What I suggested, the number of items in a struct that will be returned as multivalue, which I assume to be less than 10, is more of an internal parameter that drives the optimization. It's not that the implementation cannot handle 1000 return values; it's just that it's bad for the code quality and we wouldn't want to do that. I think it might be worth mentioning it in this ABI document, but I don't think that should go into the core spec.

aheejin · 2024-03-19T23:51:42Z

@alexcrichton

If I might throw out another possible alternative, could it perhaps also be possible to write this down in source itself?

For example compiler-rt is specific to a toolchain version (or so I believe) which means that even if the default C ABI is to not use multi-value compiler-rt itself could use multi-value. That would enable defining __multi3, for example, using a multi-value-based-return without affecting everyone else's C ABI. This wouldn't require multiple builds with/without multivalue, instead it's a decision one day of whether __multi3, and only __multi3 (or other compiler-rt intrinsics) should use multi-value for their return values.

If __multi3's return type changes, wouldn't the caller (user code) need know about it to match the signature? If __multi3 is user code we would have a declaration in the user code somewhere, but if it's compiler-rt, we don't know the signature, and that's why we have the list of compiler-rt functions and their signatures here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyRuntimeLibcallSignatures.cpp Doesn't this count as a change of ABI?

sbc100 · 2024-03-19T23:58:16Z

@alexcrichton

If I might throw out another possible alternative, could it perhaps also be possible to write this down in source itself?
For example compiler-rt is specific to a toolchain version (or so I believe) which means that even if the default C ABI is to not use multi-value compiler-rt itself could use multi-value. That would enable defining __multi3, for example, using a multi-value-based-return without affecting everyone else's C ABI. This wouldn't require multiple builds with/without multivalue, instead it's a decision one day of whether __multi3, and only __multi3 (or other compiler-rt intrinsics) should use multi-value for their return values.

If __multi3's return type changes, wouldn't the caller (user code) need know about it to match the signature?

I believe the libcall function such as __multi3 are not designed to be directly callable from C code. They only need to be callable from the llvm backend, when it generates calls to them programatically.

So in theory we can use whatever ABI we like and I think it is allowed to differ from the normal C ABI. See https://github.com/llvm/llvm-project/blob/12a2bc301fe83eea3b214428827d712c8cfb28a9/compiler-rt/lib/builtins/int_lib.h#L23-L31

fitzgen · 2024-03-20T15:29:44Z

I believe the libcall function such as __multi3 are not designed to be directly callable from C code. They only need to be callable from the llvm backend, when it generates calls to them programatically.

So in theory we can use whatever ABI we like and I think it is allowed to differ from the normal C ABI. See https://github.com/llvm/llvm-project/blob/12a2bc301fe83eea3b214428827d712c8cfb28a9/compiler-rt/lib/builtins/int_lib.h#L23-L31

Using a custom, libcall-specific ABI that avoids touching memory for __multi3 seems like it would be ideal in this case.

sbc100 · 2024-03-20T15:47:34Z

I believe the libcall function such as __multi3 are not designed to be directly callable from C code. They only need to be callable from the llvm backend, when it generates calls to them programatically.
So in theory we can use whatever ABI we like and I think it is allowed to differ from the normal C ABI. See https://github.com/llvm/llvm-project/blob/12a2bc301fe83eea3b214428827d712c8cfb28a9/compiler-rt/lib/builtins/int_lib.h#L23-L31

Using a custom, libcall-specific ABI that avoids touching memory for __multi3 seems like it would be ideal in this case.

To be clear, are you proposing that this libcall-specific ABI would vary with -mmulti-value, so toolchains would all need to provider specific -mmulti-value versions of their libcall library? (regardless of whether we ever ship a new multi-value C ABI).

I addition to the complexity of shipping and selecting such as version of compiler-rt, I think building it could be very tricky since it would there is no value we could use COMPILER_RT_ABI that would cause the compiler to generate this ABI. We would need to come up with kind of intermediate ABI that compiler-rt could be compiled with. That sound like a fairly significant amount of work.

fitzgen · 2024-03-20T16:54:46Z

I was assuming that in a world with universal support for multi-value (effectively the state of the world now, but I recognize that this is a separate, larger discussion) that the __multi3/libcall ABI would always use a multi-value or v128 return (doesn't matter which one, so long as it is not using a return pointer) regardless what C ABI (today's vs a future multi-value ABI) is in use.

If I understand the constraints correctly, and given that assumption of universal multi-value support regardless of C ABI in use, this wouldn't require maintaining multiple compiler-rt versions nor concerns around selecting the correct version. Is that correct?

sbc100 · 2024-03-20T17:01:48Z

I was assuming that in a world with universal support for multi-value (effectively the state of the world now, but I recognize that this is a separate, larger discussion) that the __multi3/libcall ABI would always use a multi-value or v128 return (doesn't matter which one, so long as it is not using a return pointer) regardless what C ABI (today's vs a future multi-value ABI) is in use.

If I understand the constraints correctly, and given that assumption of universal multi-value support regardless of C ABI in use, this wouldn't require maintaining multiple compiler-rt versions nor concerns around selecting the correct version. Is that correct?

True, once multi-value is universally available we could just assume it when we build compiler-rt.

In emscripten we will want to continue to support targets without multivalue (i.e. user of -mno-multivalue), however the way we tend to handle this kind thing today is to build with the feature enabled and then use a binaryen pass to lower it away. So that shouldn't be a problem for emscripten anyway.

I am still curious to hear about from you regarding the importance of optimizing the software emulation of int128: #223 (comment). Are you sure we really need to care about it enough to invent a new ABI that differs from the normal C ABI?

jedisct1 · 2024-03-20T18:04:01Z

In Wasmtime, we've seen a lot of bottlenecks on compiler-rt's __muilti3, especially in various cryptography functions

Is the problem here that the cryptography functions are compiling with the assumption that they have native int128 support? Perhaps if they would prefer a different codepath if they knew that int128 were being software emulated? i.e. is there some HAVE_NATIVE_INT128 macro that they are getting wrong for wasm?

The assumption is that if the TI mode is supported or the __i128t type is available, we should use it, even if this is a software emulation, because on native targets, this will be inlined. A different code path would reimplement the same thing.

The intuition is that WebAssembly would behave the same way. The fact that this is not the case, even with LTO, was not expected https://gist.github.com/jedisct1/dfc18913680c24824595b50b8eb04db3

fitzgen · 2024-03-20T18:11:29Z

Is the problem here that the cryptography functions are compiling with the assumption that they have native int128 support? Perhaps if they would prefer a different codepath if they knew that int128 were being software emulated? i.e. is there some HAVE_NATIVE_INT128 macro that they are getting wrong for wasm?

I'm personally not sure about these particulars. @jameysharp might have a better idea, as he was the one doing the original investigation.

sbc100 · 2024-03-20T18:12:27Z

because on native targets, this will be inlined

Can you explain what you mean here? Have you seen native platform target that provide bitcode/LTO-able versions of compiler-rt?

As far as I know there are no native targets that provide LTO-able versions of compiler-rt functions. The reason I think this is that in emscripten (where are pushing the limits of what we make LTO-able) we still haven't found a way to make compiler-rt function compile as bitcode (i.e. LTO-able). See https://github.com/emscripten-core/emscripten/blob/10827283603d0d8dba0a813827b5ee5f82d1adf5/tools/system_libs.py#L922-L924

aheejin · 2024-03-20T18:25:24Z

@sbc100

True, once multi-value is universally available we could just assume it when we build compiler-rt.

In emscripten we will want to continue to support targets without multivalue (i.e. user of -mno-multivalue), however the way we tend to handle this kind thing today is to build with the feature enabled and then use a binaryen pass to lower it away. So that shouldn't be a problem for emscripten anyway.

Can we lower multivalue away completely? We can do that within a function or across functions in static linking + non-exported/imported functions, but can we do that with imported/exported functions or dynamic linking?

sbc100 · 2024-03-20T18:55:57Z

@sbc100

True, once multi-value is universally available we could just assume it when we build compiler-rt.
In emscripten we will want to continue to support targets without multivalue (i.e. user of -mno-multivalue), however the way we tend to handle this kind thing today is to build with the feature enabled and then use a binaryen pass to lower it away. So that shouldn't be a problem for emscripten anyway.

Can we lower multivalue away completely? We can do that within a function or across functions in static linking + non-exported/imported functions, but can we do that with imported/exported functions or dynamic linking?

Before we enable multi-value by default we would need some way to lower it away for users targeting older browsers. If we can't lower it away then that would be a one reason not to support the use of multi-value for libcalls purely on the basis of the feature exsiting (-mmulti-value)

jedisct1 · 2024-03-20T20:36:34Z

because on native targets, this will be inlined

Can you explain what you mean here? Have you seen native platform target that provide bitcode/LTO-able versions of compiler-rt?

I haven't, but at least on mips64, riscv64, aarch64 and x86_64, a 128-bit multiplication doesn't call the compiler-rt implementation at all.

$ for target in mips64-linux riscv64-linux aarch64-linux x86_64-linux; do
       zig cc -S -target "$target" -O3 a.c && ug -q __multi3 a.s && echo "[$target]: __multi3 called";
   done

(empty)

But on WebAssembly, even if the __multi3 is inlined, will WebAssembly compilers understand the pattern and be able to take advantage of CPU instructions for wide multiplications?

aheejin · 2024-03-20T23:21:20Z

@alexcrichton

If I might throw out another possible alternative, could it perhaps also be possible to write this down in source itself?
For example compiler-rt is specific to a toolchain version (or so I believe) which means that even if the default C ABI is to not use multi-value compiler-rt itself could use multi-value. That would enable defining __multi3, for example, using a multi-value-based-return without affecting everyone else's C ABI. This wouldn't require multiple builds with/without multivalue, instead it's a decision one day of whether __multi3, and only __multi3 (or other compiler-rt intrinsics) should use multi-value for their return values.

If __multi3's return type changes, wouldn't the caller (user code) need know about it to match the signature?

I believe the libcall function such as __multi3 are not designed to be directly callable from C code. They only need to be callable from the llvm backend, when it generates calls to them programatically.

So in theory we can use whatever ABI we like and I think it is allowed to differ from the normal C ABI. See https://github.com/llvm/llvm-project/blob/12a2bc301fe83eea3b214428827d712c8cfb28a9/compiler-rt/lib/builtins/int_lib.h#L23-L31

Ah I see. Yeah I think code annotations can be one way to support different ABIs for certain performance-critical functions. But if we decide to do this in future, it seems also compatible with the clang option I suggested in #223 (comment). Also given that some people (especially who own the whole toolchain) would want to enable multivalue lowering for all potential candidates and wouldn't want to annotate every function that has multiple or 128-bit return values, having that kind of option can be convenient.

dschuff · 2024-03-21T00:06:31Z

I think it does make sense in general to allow the calling convention ABI to be determined on a per-function basis, with a clang attribute you can attach to a function declaration that propagates down through the IR via an attribute or some similar mechanism. This is how ABIs work on ARM, which has a similar proliferation of optional features and calling conventions (e.g. you can mark a function with __attribute__((pcs("aapcs"))) to make it use the AAPCS calling convention). The nice thing about that is that, as @alexcrichton mentioned above, you can then mix functions or object files with different conventions and everything works as long as the declarations are correct.
I think this would be a good idea in general. It doesn't quite solve the problem for compiler-rt functions on its own because compiler-rt functions don't have any declarations, they are just emitted directly by the compiler. So if we wanted to support more than 1 ABI for compiler-rt we'll still need some kind of flag to control which ABI it uses.
Right now, the only people using multivalue who would regress with what @aheejin proposed above are already using the experimental multivalue ABI for everything. So as a first step, I do think it would make sense to make the expirimental-mv-abi flag control compiler-rt as well. That will decouple the calling convention from just enabling the feature, which I think everyone agrees is desirable.

Then we can address the best way to optimize the performance and use of the calling convention and performance of __multi3 and other compiler-rt functions. We could decide to use MV by default if available upstream, which would be going back to what we have today, except that for emscripten we'd ether provide some other flag to control it, or use a Binaryen pass to lower it away. Or we can decide to decouple the flags, or decide to just tweak the current MV ABI and use it in more toolchains, but even going back-and-forth like this would still be better since it decouples the timeline for that work from blocking exnref (which is what precipitated all this in the first place).

Separately I think we should also just add widening multiply, add-with-carry, and maybe a few other related instructions to wasm. Those have always been fairly uncontroversial but just punted until MV is available everywhere, so maybe that time is now. That's obviously a bit slower process than these toolchain changes could be, so maybe we do both.

Emscripten EH/SjLj uses invoke wrappers in JS, such as `invoke_vi`, from which user functions are called indirectly, to emulate exceptions and setjmp/longjmp. But in case the invoked function returns multiple values or a 128-bit value, its the JS invoke wrappers cannot return multivalue because JS doesn't support that. So we should not enable multivalue returns for the JS invoke wrappers and also the functions called by those JS wrappers because their signature has to match with the JS wrapper. For example, if `func` returns `{i32, i32}` and we have ```ll invoke {i32, i32} @func() ... ``` while LowerEmscriptenEHSjLj will lower it down to something like ```ll %0 = call { i32, i32 } @"__invoke_{i32.i32}_void"(ptr @func) ... ``` we should eventually lower both the invoke wrapper (whose name will be changed later to `invoke_vi`) and `func` down to a signature that indirectly returns multiple values by memory parameter, because JS invoke wrappers do support multiple return values. So we need to disable multivalue returns for JS invoke wrappers and functions called by them. I think we have three ways to do that: 1. Make a set and add all functions that are invoked by JS invoke wrappers in LowerEmscriptenEHSjLj and pass it to the backend using an auxiliary data structure. We have a precedent of this kind of structure already, which is used for Wasm EH: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/CodeGen/WasmEHFuncInfo.h We can even add this set to `WasmEHFuncInfo` maybe, given that this is also a way of handling exceptions in Wasm. - Pros: Most precise - Cons: Auxiliary structure needed 2. Unless we record the invoked functions in LowerEmscriptenEHSjLj like 1, we don't have a way of precisely knowing the set of invoked functions. But they are all indirectly invoked, so in the backend we can check whether a given function is ever indirectly used (i.e., its pointer taken) by traversing its `users()` in the IR and if it is, we don't allow multivalue returns for them. - Pros: No auxiliary structure - Cons: IR checking overhead. More conservative than 1. 3. Disallow all multivalue returns when Emscripten EH or SjLj is enabled. - Pros: Simplest - Cons: Most conservative. This PR is doing 3. While it is the most conservative and possibly disallowing multivalue returns from more functions than needed, I chose this because it is the simplest, and given that hopefully more people will adopt Wasm EH going forward, I don't think there will be many people who would use multivalue and Emscripten EH/SjLj together and want the whatever performance benefit that multivalue return can bring, given that Emscripten EH/SjLj has already a huge performance cost. This is separate from whether we should make the multivalue return dependent on the multivalue feature or something else like a clang flag, which is being partly discussed in WebAssembly/tool-conventions#223. Whichever way we decide on that front, we still need to disable multivalue returns in case Emscripten EH/SjLj is used.

waterlens · 2024-03-21T11:59:50Z

I think it does make sense in general to allow the calling convention ABI to be determined on a per-function basis, with a clang attribute you can attach to a function declaration that propagates down through the IR via an attribute or some similar mechanism. This is how ABIs work on ARM, which has a similar proliferation of optional features and calling conventions (e.g. you can mark a function with __attribute__((pcs("aapcs"))) to make it use the AAPCS calling convention). The nice thing about that is that, as @alexcrichton mentioned above, you can then mix functions or object files with different conventions and everything works as long as the declarations are correct. I think this would be a good idea in general. It doesn't quite solve the problem for compiler-rt functions on its own because compiler-rt functions don't have any declarations, they are just emitted directly by the compiler. So if we wanted to support more than 1 ABI for compiler-rt we'll still need some kind of flag to control which ABI it uses. Right now, the only people using multivalue who would regress with what @aheejin proposed above are already using the experimental multivalue ABI for everything. So as a first step, I do think it would make sense to make the expirimental-mv-abi flag control compiler-rt as well. That will decouple the calling convention from just enabling the feature, which I think everyone agrees is desirable.

Then we can address the best way to optimize the performance and use of the calling convention and performance of __multi3 and other compiler-rt functions. We could decide to use MV by default if available upstream, which would be going back to what we have today, except that for emscripten we'd ether provide some other flag to control it, or use a Binaryen pass to lower it away. Or we can decide to decouple the flags, or decide to just tweak the current MV ABI and use it in more toolchains, but even going back-and-forth like this would still be better since it decouples the timeline for that work from blocking exnref (which is what precipitated all this in the first place).

Separately I think we should also just add widening multiply, add-with-carry, and maybe a few other related instructions to wasm. Those have always been fairly uncontroversial but just punted until MV is available everywhere, so maybe that time is now. That's obviously a bit slower process than these toolchain changes could be, so maybe we do both.

I agree with @dschuff that we should push things like widening multiplication or similar things into wasm. There's a related issue: WebAssembly/design#1495.

tlively · 2024-03-25T21:32:47Z

So our suggestion is, how about making the lowering of 128-bit return values also depend on -target-abi -Xclang experimental-mv? This is a clang flag and transferring this info available to the backend may involve some hack but I think it's doable (@tlively, do you think it's unreasonably hard?) .

No, I agree it's hacky but doable.

I remember @tlively did some experiment on this a long time ago and don't remember the result was, but I don't think we want to invest more time on redoing the experiments to pick an optimal number now. @tlively, do you have a suggestion on a reasonable limit?

I did some experiments that showed performance wins from multivalue function return, but those experiments did not say what an optimal size would be. Basically multivalue return is always faster than returning through memory, but at some point the potential code size bloat would not be worth it any more. Dan's suggestion of 4 seems reasonable to me.

Multivalue feature of WebAssembly has been standardized for several years now. I think it makes sense to be able to enable it in the feature section by default for our clang/llvm-produced binaries so that the multivalue feature can be used as necessary when necessary within our toolchain and also when running other optimizers (e.g. wasm-opt) after the LLVM code generation. But some WebAssembly toolchains, such as Emscripten, do not provide both mulvalue-returning and not-multivalue-returning versions of libraries. Also allowing the uses of multivalue in the features section does not necessarily mean we generate them whenever we can to the fullest, which is a different code generation / optimization option. So this makes the lowering of multivalue returns conditional on the use of 'experimental-mv' target ABI. This ABI is turned off by default and turned on by passing `-Xclang -target-abi -Xclang experimental-mv` to `clang`, or `-target-abi experimental-mv` to `clang -cc1` or `llc`. But the purpose of this PR is not tying the multivalue lowering to this specific 'experimental-mv'. 'experimental-mv' is just one multivalue ABI we currently have, and it is still experimental, meaning it is not very well optimized or tuned for performance. (e.g. it does not have the limitation of the max number of multivalue-lowered values, which can be detrimental to performance.) We may change the name of this ABI, or improve it, or add a new multivalue ABI in the future. Also I heard that WASI is planning to add their multivalue ABI soon. So the plan is, whenever any one of multivalue ABIs is enabled, we enable the lowering of multivalue returns in the backend. We currently have only 'experimental-mv' in the repo so we only check for that in this PR. Related past discussions: llvm#82714 WebAssembly/tool-conventions#223 (comment)

aheejin · 2024-04-12T09:54:39Z

I submitted a PR to do what I proposed above: llvm/llvm-project#88492

This makes the lowering multivalue returns in the backend conditional to the use of 'experimental-mv' option. But I plan to extend this to other multivalue ABIs as they are added; if WASI adds their multivalue ABI, the backend can also detect that ABI and enable multivalue return lowering.

jameysharp · 2024-04-12T16:39:30Z

Is the problem here that the cryptography functions are compiling with the assumption that they have native int128 support? Perhaps if they would prefer a different codepath if they knew that int128 were being software emulated? i.e. is there some HAVE_NATIVE_INT128 macro that they are getting wrong for wasm?

I'm personally not sure about these particulars. @jameysharp might have a better idea, as he was the one doing the original investigation.

In case this still matters: I was testing a Rust implementation of RSA, compiled to WebAssembly and then run on Wasmtime on x86-64. What I found was a ~10% improvement in the speed of RSA operations by changing the underlying multi-precision integer library from 64-bit limbs to 32-bit, which fortunately it had a build-time configuration option for.

When that library did multiplication, it needed double-width temporaries, so with 64-bit limbs it used 128-bit multiplies implemented by libcalls to __multi3. With 32-bit limbs it used 64-bit multiplies which are natively available in WebAssembly. I opened an issue with that library about preferring 32-bit limbs on WebAssembly targets but I don't think it went anywhere.

I believe, but haven't proven yet, that in Wasmtime we have the ability to write optimization rules which will transform the arithmetic in __multi3 back into a native 128-bit multiply. But there's no way we can optimize away the memory accesses that result from returning the pair of results on the shadow stack, because by the time we receive this code, we're constrained by WebAssembly semantics to faithfully implement those writes to linear memory. A post-compilation inlining pass has the same problem. So I'm in favor of any plan that allows these libcalls to use multivalue returns.

See WebAssembly/tool-conventions#223 (comment)

Seems to be at least relevant for RSA See WebAssembly/tool-conventions#223 (comment)

…88492) Multivalue feature of WebAssembly has been standardized for several years now. I think it makes sense to be able to enable it in the feature section by default for our clang/llvm-produced binaries so that the multivalue feature can be used as necessary when necessary within our toolchain and also when running other optimizers (e.g. wasm-opt) after the LLVM code generation. But some WebAssembly toolchains, such as Emscripten, do not provide both mulvalue-returning and not-multivalue-returning versions of libraries. Also allowing the uses of multivalue in the features section does not necessarily mean we generate them whenever we can to the fullest, which is a different code generation / optimization option. So this makes the lowering of multivalue returns conditional on the use of 'experimental-mv' target ABI. This ABI is turned off by default and turned on by passing `-Xclang -target-abi -Xclang experimental-mv` to `clang`, or `-target-abi experimental-mv` to `clang -cc1` or `llc`. But the purpose of this PR is not tying the multivalue lowering to this specific 'experimental-mv'. 'experimental-mv' is just one multivalue ABI we currently have, and it is still experimental, meaning it is not very well optimized or tuned for performance. (e.g. it does not have the limitation of the max number of multivalue-lowered values, which can be detrimental to performance.) We may change the name of this ABI, or improve it, or add a new multivalue ABI in the future. Also I heard that WASI is planning to add their multivalue ABI soon. So the plan is, whenever any one of multivalue ABIs is enabled, we enable the lowering of multivalue returns in the backend. We currently have only 'experimental-mv' in the repo so we only check for that in this PR. Related past discussions: #82714 WebAssembly/tool-conventions#223 (comment)

aheejin requested review from sbc100, dschuff, sunfishcode and tlively March 19, 2024 02:22

sbc100 reviewed Mar 19, 2024

View reviewed changes

kripken reviewed Mar 19, 2024

View reviewed changes

aheejin mentioned this pull request Mar 21, 2024

[WebAssembly] Don't return multivalue when Emscripten EH/SjLj is used llvm/llvm-project#86048

Closed

aheejin mentioned this pull request Apr 12, 2024

[WebAssembly] Enable multivalue return when multivalue ABI is used llvm/llvm-project#88492

Merged

jedisct1 added a commit to jedisct1/libsodium that referenced this pull request Apr 12, 2024

Avoid 128-bit arithmetic when on WebAssembly targets

7478140

See WebAssembly/tool-conventions#223 (comment)

jedisct1 added a commit to jedisct1/openssl-wasm that referenced this pull request Apr 12, 2024

Prefer 32-bit arithmetic on wasm32-wasi

ac9c294

Seems to be at least relevant for RSA See WebAssembly/tool-conventions#223 (comment)

[BasicCABI] Pass compiler-rt 128-bit return values in memory #223

Are you sure you want to change the base?

[BasicCABI] Pass compiler-rt 128-bit return values in memory #223

Conversation

aheejin commented Mar 19, 2024

tlively commented Mar 19, 2024

fitzgen commented Mar 19, 2024

TerrorJack commented Mar 19, 2024

sbc100 commented Mar 19, 2024

jedisct1 commented Mar 19, 2024

sbc100 Mar 19, 2024

Choose a reason for hiding this comment

aheejin Mar 19, 2024

Choose a reason for hiding this comment

sbc100 commented Mar 19, 2024

jedisct1 commented Mar 19, 2024

sbc100 commented Mar 19, 2024

jedisct1 commented Mar 19, 2024

kripken Mar 19, 2024

Choose a reason for hiding this comment

aheejin Mar 19, 2024

Choose a reason for hiding this comment

kripken commented Mar 19, 2024

aheejin commented Mar 19, 2024

TerrorJack commented Mar 19, 2024

alexcrichton commented Mar 19, 2024

sbc100 commented Mar 19, 2024

aheejin commented Mar 19, 2024 • edited

aheejin commented Mar 19, 2024

sbc100 commented Mar 19, 2024

fitzgen commented Mar 20, 2024

sbc100 commented Mar 20, 2024 • edited

fitzgen commented Mar 20, 2024

sbc100 commented Mar 20, 2024 • edited

jedisct1 commented Mar 20, 2024

fitzgen commented Mar 20, 2024

sbc100 commented Mar 20, 2024

aheejin commented Mar 20, 2024

sbc100 commented Mar 20, 2024

jedisct1 commented Mar 20, 2024

aheejin commented Mar 20, 2024

dschuff commented Mar 21, 2024

waterlens commented Mar 21, 2024

tlively commented Mar 25, 2024

aheejin commented Apr 12, 2024

jameysharp commented Apr 12, 2024

aheejin commented Mar 19, 2024 •

edited

sbc100 commented Mar 20, 2024 •

edited

sbc100 commented Mar 20, 2024 •

edited