#372 Integer Sign/Zero Extension for {8,16}->{32,64} #395
Conversation
Extra notes about performance and implementation (left for posterity)

**ARM64 with int64**

On ARM: according to llvm-mca, tbl/sshr should have identical performance to that of 2 sshlls, since they use the exact same ports with the same latency and the same number of instructions. This suggests there's a potential benefit for the 8 to 64 bit case with signed integers.

**Signed data on x64 without SSE4**

On architectures that don't support SSE4, it can make sense to spill the vector to memory, load the values into individual registers, move them back to vectors, and unpack. Since machines lacking SSE4 seem to be such an edge case, this should provide reasonably good fallback behavior. Example:

```assembly
movaps xmmword ptr [rsp - 128], xmm0
movsx r8, byte ptr [rsp - 128]
movsx rcx, byte ptr [rsp - 127]
movq xmm0, rcx
movq xmm1, r8
punpcklqdq xmm1, xmm0 # xmm1 = xmm1[0],xmm0[0]
```
|
Updated the assembly above to provide comments on spill and load options as well for using |
And another option which has the potential to double the signed 64 bit output depending on how it's used:

```assembly
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI9_7] # xmm0 = zero,zero,zero,xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero,zero
vpsrad xmm1, xmm0, 24
vpsrad xmm0, xmm0, 31
vpunpckldq xmm0, xmm0, xmm1
```
|
The instruction set checks for ARMv7 with NEON are done. The method seems to port nicely.

```assembly
vtbl.8 d3, {d1}, d16
vtbl.8 d2, {d0}, d17
vshr.s64 q0, q1, #56
```

Will update the rest of the documentation for this PR later today. Should cover ARM64, ARMv7+NEON, x64/SSE4 (including SSSE3), AVX, and SSE2. Updated: This is done. |
hey @ngzhian, On our call @Maratyszcza had a question about v8 preserving constants with respect to this proposal. This behavior appears to exist on x64 for any constant parameter, and v8 will pregenerate them at the beginning of the code block. Will this also apply for ARMv7 and ARM64? The biggest benefit with respect to this proposal is making sure that the masks that are used are only loaded once. |
These constant masks will remain in registers and be reused as long as they are not spilled. Same for ARMv7 and ARM64. |
That's awesome and should make this really efficient. |
It turns out that the TBL approach with SSHR can be more efficient than SSHLL when the algorithm is adjusted so that SSHR is only called once. According to the ARM Cortex-A76 software optimization guide, shift operations can only occur in one instruction per cycle, but TBL operations with two table vectors can occur twice per cycle. Whether the Cortex-A76 is a representative testbed remains to be decided; however, it gave me an idea for a new implementation that uses fewer instructions for signed conversion and leverages the performance of TBL for these integer conversions. The biggest difference in the implementation is the mask that's required for signed conversion. Here's a Godbolt example. |
Thanks for your suggestion and the detailed implementation guide. Couple of notes:
First and foremost, thanks for looking at this. I know this is a doozy of a proposal. There are 24 variants masquerading as 6 instructions even if most of them are masks. I'm going to take your questions a bit out of order, so you can understand how this came to be, and what the benefits will be.
As it stands today, every integer conversion requires stepwise conversion in the WASM SIMD instruction set. Thus the initial premise reduces the minimum required instructions to go from 8 to 64 bits for 8 results from 14 WASM SIMD instructions to 8. For 8 to 32, it's 4 instead of 6. This proposal can neatly do that for unsigned values with PSHUFB/TBL equivalents, assuming masks are present. For signed data types, the underlying implementation is equally efficient on x86/x64, even though there are more instructions, by virtue of completely different port usage. And, if there's even a remote possibility that the ARM support can be implemented like this for signed data types, ARM will receive all of the same benefits as well. While all of those cases get clear direct benefits when in vectors, the largest benefits come from the operations that come directly from memory. V8 leverages this functionality for x64 and can do an in-flight LoadTransform (see here) for single-step integer type conversion, but can't do it for multi-step. With these new instructions, load transformation could apply universally for x86/x64 without any interaction from the programmer and without the need for masks, while still giving a very performant solution for ARM.
This is correct (mostly) with a couple of caveats. It doesn't take advantage of any of the underlying LoadTransform stuff listed above, and it has to deal with the less than efficient swizzle implementation that doesn't recognize that the input parameters themselves are constant. If we can come up with an optimization like proposed in #403, swizzle wouldn't be a bad option, but it'll never be as good as a load and shuffle or the loadtransform above.
I have some ideas on how to make loading memory constants work nicely inside the current architecture of v8 with minimal changes to the code. I just need some time to flesh them out a bit. For runtime generation, there's a bunch of ways to do it that are better than individual inserts. If you need some samples, please let me know. Even the insert strategy isn't so bad as long as the masks are only generated once and reused by subsequent calls.
Yes. I'll update this thread with some examples when I have a minute. |
Are any of those compiling to wasm or on the way to compiling? |
Yes sir. Simdpp (header-only) is up first. @tlively is there a preprocessor macro to detect the emscripten / wasm implementation? |
Yep, you can check for the |
Yup external references (the link you sent) are arch-independent. |
@omnisip you mentioned you will get some numbers if this is prototyped. Which instructions are you planning to make use of, and for which architectures? This is a lot to prototype. |
Also, simdpp is a simd header library; I wouldn't consider it a use case according to our inclusion criteria (since, as a library, it necessarily includes more instructions). |
The most interesting instructions to me are the 8 to 32s. I added the 64 bit variants for orthogonality. The 8 to 32 case for unsigned stands out most, since on x64 I have to use swizzle four times, yielding at least 8 shuffles, 4 movs and 4 adds. With the shuffle method, assuming I'm using a second vector with zeros, it doesn't look much better. I have a prefix sum / scan calculation that leverages quite a bit of this with simdpp, even if it's not posted yet. This will be a WASM-first library for SSIM calculation. |
Before any further action, I would like to see more support for this set of instructions, e.g. community members saying that this is useful for them. It will also be better if existing use cases can immediately benefit if we have this set of instructions, rather than new developments. |
@ngzhian -- Please see the meeting notes from 11/13/2020 where this was discussed in detail. Specifically, this proposal is necessary because the conversions on x64 from 8 to 32 bit are difficult and expensive to perform with our existing instruction set. No option exists without at least two shuffle ops for any conversion, and all of the widen high variants require at least 2 (alignr/psrlq + pmovsx...). For every conversion from 8 to 32 on x64, it takes a minimum of 3 shuffle ops to get from 1x8x16 to 2x16x8, then another 6 to go from 2x16x8 to 4x32x4, yielding 9 instructions and 9 shuffle ops. The other options -- swizzle and shuffle -- are worse, since no pattern matches will occur for these. If that weren't problematic enough, it gets really messy with the signed conversion cases -- where you end up with a shuffle like this: shuffle(0,16,16,16, 1,17,17,17, 2,18,18,18, 3,19,19,19); All of that said -- ARM's performance improvement should be as good as the performance improvement for x64 on all of the unsigned cases today. It'll be even better once the proposal for lifting reused constant intermediates is finished. This turns the signed cases into a net 5 instruction solution -- instead of 8. Here are some extra use cases that show how these are used elsewhere: |
I looked at the meetings notes, main takeaways:
- this is going to take a while; until then we have to live with the performance cliffs
I don't understand this part, as above:
Is a single shuffle. How are you getting 9 instructions? If we were to ignore X->64 for a second, all the 8->32 instructions look like convenience wrappers or groupings around instructions we already have. My main point is that our existing use cases don't benefit from this set of instructions (especially ->64). Pushing this to post-MVP will help reduce the surface area we need to work on to get to Phase 4, which makes SIMD more useful because we get it closer to the hands of all users. |
The 9 instructions and 9 shuffles is what it takes without these proposed instructions. |
If you prototype these for me, you can ditch the 64 bit ones. This was drafted to be fully complete by the submission deadline. That leaves only two instructions. With the external reference support in v8 making it possible to do aligned loads, we can (and probably should) implement these with memory arguments. The performance should be excellent and it'll provide good support for 8 bit to float conversions which is often a subsequent step for these. @Maratyszcza are there any outstanding proposals that would justify keeping the 16 to 64 variants? What would stand out would be something that allowed conversion of i64s to doubles. |
Instead of the stepwise conversion, you can emit the single |
Also, |
Yep. I just did emsdk install tot an hour ago. I assumed I had to compile LLVM from source, so I pointed emscripten at my new LLVM build. Is that wrong?
I don't know if that is what's going on, but LLVM just rolled the version from 12 to 13 very recently (maybe even yesterday), maybe emscripten's version detection hasn't gotten the memo. |
Weird, it looks like that expectation was updated yesterday. @omnisip, did you do |
Don't recall doing emsdk update-tags, but I think I did emsdk update. I'll redo it again if that helps. |
Same issue, but the warning flags are gone now -- so that's a plus. dan@dl360:~/applications/wrapper/xnnpack$ EMCC_DEBUG=1 /opt/emsdk/upstream/emscripten/emcc -o bazel-out/wasm-dbg/bin/elu_bench bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o bazel-out/wasm-dbg/bin/libXNNPACK.a bazel-out/wasm-dbg/bin/libmemory_planner.a bazel-out/wasm-dbg/bin/liboperator_run.a bazel-out/wasm-dbg/bin/liboperators.a bazel-out/wasm-dbg/bin/libindirection.a bazel-out/wasm-dbg/bin/liblogging_utils.a bazel-out/wasm-dbg/bin/libpacking.a bazel-out/wasm-dbg/bin/libscalar_ukernels.a bazel-out/wasm-dbg/bin/libwasm_ukernels.a bazel-out/wasm-dbg/bin/libtables.a bazel-out/wasm-dbg/bin/libasm_ukernels.a bazel-out/wasm-dbg/bin/libbench_utils.a bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-dbg/bin/external/clog/libclog.a bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a -s 'ASSERTIONS=1' -s 'ERROR_ON_UNDEFINED_SYMBOLS=1' -s 'EXIT_RUNTIME=1' -s 'ALLOW_MEMORY_GROWTH=1' -s 'TOTAL_MEMORY=268435456' --pre-js ./preamble.js.lds -pthread -msimd128 -g -sWASM_BIGINT -sERROR_ON_WASM_CHANGES_AFTER_LINK -msimd128 -s 'USE_PTHREADS=0' -s 'ERROR_ON_UNDEFINED_SYMBOLS=0' '-Wl,--export=__heap_base' '-Wl,--export=__data_end'
tools.filelock:DEBUG: Attempting to acquire lock 139907621271968 on /tmp/emscripten_temp/emscripten.lock
tools.filelock:DEBUG: Lock 139907621271968 acquired on /tmp/emscripten_temp/emscripten.lock
emcc:WARNING: invocation: /opt/emsdk/upstream/emscripten/emcc.py -o bazel-out/wasm-dbg/bin/elu_bench bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o bazel-out/wasm-dbg/bin/libXNNPACK.a bazel-out/wasm-dbg/bin/libmemory_planner.a bazel-out/wasm-dbg/bin/liboperator_run.a bazel-out/wasm-dbg/bin/liboperators.a bazel-out/wasm-dbg/bin/libindirection.a bazel-out/wasm-dbg/bin/liblogging_utils.a bazel-out/wasm-dbg/bin/libpacking.a bazel-out/wasm-dbg/bin/libscalar_ukernels.a bazel-out/wasm-dbg/bin/libwasm_ukernels.a bazel-out/wasm-dbg/bin/libtables.a bazel-out/wasm-dbg/bin/libasm_ukernels.a bazel-out/wasm-dbg/bin/libbench_utils.a bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-dbg/bin/external/clog/libclog.a bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a -s ASSERTIONS=1 -s ERROR_ON_UNDEFINED_SYMBOLS=1 -s EXIT_RUNTIME=1 -s ALLOW_MEMORY_GROWTH=1 -s TOTAL_MEMORY=268435456 --pre-js ./preamble.js.lds -pthread -msimd128 -g -sWASM_BIGINT -sERROR_ON_WASM_CHANGES_AFTER_LINK -msimd128 -s USE_PTHREADS=0 -s ERROR_ON_UNDEFINED_SYMBOLS=0 -Wl,--export=__heap_base -Wl,--export=__data_end (in /home/dan/applications/wrapper/xnnpack)
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/clang --version
cache:DEBUG: PID 208108 acquiring multiprocess file lock to Emscripten cache at /opt/emsdk/upstream/emscripten/cache
tools.filelock:DEBUG: Attempting to acquire lock 139907621272112 on /opt/emsdk/upstream/emscripten/cache/cache.lock
tools.filelock:DEBUG: Lock 139907621272112 acquired on /opt/emsdk/upstream/emscripten/cache/cache.lock
cache:DEBUG: done
shared:DEBUG: sanity file up-to-date but check forced: /opt/emsdk/upstream/emscripten/cache/sanity.txt
shared:DEBUG: successfully executed /opt/emsdk/node/12.18.1_64bit/bin/node --version
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/llc --version
shared:INFO: (Emscripten: Running sanity checks)
shared:DEBUG: successfully executed /opt/emsdk/node/12.18.1_64bit/bin/node -e console.log("hello")
tools.filelock:DEBUG: Attempting to release lock 139907621272112 on /opt/emsdk/upstream/emscripten/cache/cache.lock
tools.filelock:DEBUG: Lock 139907621272112 released on /opt/emsdk/upstream/emscripten/cache/cache.lock
cache:DEBUG: PID 208108 released multiprocess file lock to Emscripten cache at /opt/emsdk/upstream/emscripten/cache
diagnostics:DEBUG: disabled warning: use of legacy setting: TOTAL_MEMORY (setting renamed to INITIAL_MEMORY) [-Wlegacy-settings]
emcc:DEBUG: compiling to bitcode
emcc:DEBUG: emcc step "parse arguments and setup" took 0.12 seconds
emcc:DEBUG: using object file: bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libXNNPACK.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libmemory_planner.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/liboperator_run.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/liboperators.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libindirection.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/liblogging_utils.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libpacking.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libscalar_ukernels.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libwasm_ukernels.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libtables.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libasm_ukernels.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libbench_utils.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/clog/libclog.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a
emcc:DEBUG: emcc step "compile inputs" took 0.00 seconds
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libXNNPACK.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libmemory_planner.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/liboperator_run.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/liboperators.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libindirection.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/liblogging_utils.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libpacking.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libscalar_ukernels.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libwasm_ukernels.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libtables.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libasm_ukernels.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libbench_utils.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/clog/libclog.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a
system_libs:DEBUG: adding dependency on malloc due to deps-info on realloc
system_libs:DEBUG: adding dependency on free due to deps-info on realloc
system_libs:DEBUG: adding dependency on malloc due to deps-info on getenv
system_libs:DEBUG: adding dependency on free due to deps-info on getenv
system_libs:DEBUG: adding dependency on malloc due to deps-info on gmtime_r
system_libs:DEBUG: adding dependency on free due to deps-info on gmtime_r
system_libs:DEBUG: adding dependency on _get_tzname due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on _get_daylight due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on _get_timezone due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on malloc due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on free due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on malloc due to deps-info on pthread_create
system_libs:DEBUG: adding dependency on free due to deps-info on pthread_create
system_libs:DEBUG: adding dependency on emscripten_main_thread_process_queued_calls due to deps-info on pthread_create
system_libs:DEBUG: adding dependency on malloc due to deps-info on calloc
system_libs:DEBUG: adding dependency on free due to deps-info on calloc
system_libs:DEBUG: including libgl (libgl.a)
system_libs:DEBUG: including libal (libal.a)
system_libs:DEBUG: including libhtml5 (libhtml5.a)
system_libs:DEBUG: including libc (libc.a)
system_libs:DEBUG: including libcompiler_rt (libcompiler_rt.a)
system_libs:DEBUG: including libc++ (libc++-noexcept.a)
system_libs:DEBUG: including libc++abi (libc++abi-noexcept.a)
system_libs:DEBUG: including libmalloc (libdlmalloc.a)
system_libs:DEBUG: including libc_rt_wasm (libc_rt_wasm.a)
system_libs:DEBUG: including libsockets (libsockets.a)
emcc:DEBUG: emcc step "calculate system libraries" took 0.46 seconds
emcc:DEBUG: linking: ['bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o', 'bazel-out/wasm-dbg/bin/libXNNPACK.a', 'bazel-out/wasm-dbg/bin/libmemory_planner.a', 'bazel-out/wasm-dbg/bin/liboperator_run.a', 'bazel-out/wasm-dbg/bin/liboperators.a', 'bazel-out/wasm-dbg/bin/libindirection.a', 'bazel-out/wasm-dbg/bin/liblogging_utils.a', 'bazel-out/wasm-dbg/bin/libpacking.a', 'bazel-out/wasm-dbg/bin/libscalar_ukernels.a', 'bazel-out/wasm-dbg/bin/libwasm_ukernels.a', 'bazel-out/wasm-dbg/bin/libtables.a', 'bazel-out/wasm-dbg/bin/libasm_ukernels.a', 'bazel-out/wasm-dbg/bin/libbench_utils.a', 'bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a', 'bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a', 'bazel-out/wasm-dbg/bin/external/clog/libclog.a', 'bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a', '--export=__heap_base', '--export=__data_end', '-L/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libgl.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libal.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libhtml5.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libcompiler_rt.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++-noexcept.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++abi-noexcept.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libdlmalloc.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc_rt_wasm.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libsockets.a']
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/wasm-ld -o bazel-out/wasm-dbg/bin/elu_bench.wasm bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o bazel-out/wasm-dbg/bin/libXNNPACK.a bazel-out/wasm-dbg/bin/libmemory_planner.a bazel-out/wasm-dbg/bin/liboperator_run.a bazel-out/wasm-dbg/bin/liboperators.a bazel-out/wasm-dbg/bin/libindirection.a bazel-out/wasm-dbg/bin/liblogging_utils.a bazel-out/wasm-dbg/bin/libpacking.a bazel-out/wasm-dbg/bin/libscalar_ukernels.a bazel-out/wasm-dbg/bin/libwasm_ukernels.a bazel-out/wasm-dbg/bin/libtables.a bazel-out/wasm-dbg/bin/libasm_ukernels.a bazel-out/wasm-dbg/bin/libbench_utils.a bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-dbg/bin/external/clog/libclog.a bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a --export=__heap_base --export=__data_end -L/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libgl.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libal.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libhtml5.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libcompiler_rt.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++-noexcept.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++abi-noexcept.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libdlmalloc.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc_rt_wasm.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libsockets.a -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr --allow-undefined --export main --export emscripten_stack_get_end --export emscripten_stack_get_free --export 
emscripten_stack_init --export stackSave --export stackRestore --export stackAlloc --export __wasm_call_ctors --export fflush --export __errno_location --export malloc --export free --export _get_tzname --export _get_daylight --export _get_timezone --export emscripten_main_thread_process_queued_calls --export-table -z stack-size=5242880 --initial-memory=268435456 --no-entry --max-memory=2147483648 --global-base=1024
emcc:DEBUG: emcc step "link" took 0.12 seconds
emcc:DEBUG: emscript
building:DEBUG: saving debug copy /tmp/emscripten_temp/emcc-0-base.wasm
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/wasm-opt --version
[parse exception: invalid code after SIMD prefix: 103 (at 0:222205)]
Fatal: error in parsing input
emcc: error: '/opt/emsdk/upstream/bin/wasm-emscripten-finalize --detect-features --minimize-wasm-changes -g --bigint --no-dyncalls --no-legalize-javascript-ffi --dwarf bazel-out/wasm-dbg/bin/elu_bench.wasm' failed (1)
tools.filelock:DEBUG: Attempting to release lock 139907621271968 on /tmp/emscripten_temp/emscripten.lock
tools.filelock:DEBUG: Lock 139907621271968 released on /tmp/emscripten_temp/emscripten.lock |
Looks like I'll have to do a Binaryen implementation after all. Sorry about that. I should have a PR up soon. |
As proposed in WebAssembly/simd#395. Note that the other instructions in the proposal have not been implemented in LLVM or in V8, so there is no need to implement them in Binaryen right now either. This PR introduces a new expression class for the new instructions because they uniquely take an immediate argument identifying which portion of the input vector to widen.
I wasn't clear with my comments on the meetings issue, following up here so we can discuss in more detail. The number of operations introduced here, and the number of constant masks required, combined with the fact that I can't seem to narrow down whether there are production use cases for these operations make me lean towards these not being a good candidate for MVP. A couple of things that I think will make a more compelling case for the inclusion of these operations -
I have little objection to dropping the 64 bit variants. I don't use them. I originally proposed them for completeness, to be discussed before the November cut-off, but didn't know if they served any value. With respect to the i8->i32 use cases, many are used for conversion from i8->i32->f32. It depends on the algorithm and the level of precision in how this is done. For instance, my Structural Similarity (SSIM) calculation depends on 2D prefix sums (scans) being calculated efficiently. In that case, I convert from i8->i32 before going to f32 for the final calculations. However, there are plenty of libraries that go from i8->i32->f32 in one go. For benchmark purposes, I plan on presenting the performance time improvement or loss based on the modifications in prefix sum calculations and select XNNPack sections that use load_8x8 today. |
@omnisip Sounds good, then let's limit this issue and the subsequent vote only to the non-64x2 variants. I'm not adding a vote here as it would make sense to vote after looking at the data. Adding this to this week's agenda. |
**Test Case**

2D Prefix Sum with 10K images, each at 1920x1080 resolution. Each test was repeated 5 times to mitigate any errors or jitter in the results set.

**Preliminary Results Table** (updated 2021-02-04 23:34Z)
**Analysis**

While widen_i8x16 is significantly better than swizzle, it's not significantly better than stepwise expansion (i8x16->i16x8->i32x4). This is likely because the memory constants have to be reloaded each time. When compilers implement reuse of intermediate constants, performance is likely to improve on the swizzle variants such that it will be better than both the stepwise and widen implementations. A simple test to verify this can be performed by testing swizzle with the proposed masks on ARM64 against the stepwise instructions, or by patching i8x16 swizzle for x64 (the method used to show the results above). |
With https://crrev.com/c/2664994 this is prototyped on arm64 as well, using the double shifts. Just saw the latest comment sharing the benchmarks, would you prefer the arm64 code sequence to use tbl instead? |
@ngzhian -- I appreciate the offer, but I don't think that's necessary. After going through this and benchmarking it a few different ways, there's a better way to address this specific issue and provide a solution/optimization that makes swizzle on x64 competitive with the ARM64 implementation. In essence, not only would it make for a faster swizzle for i8x16 -> i32x4, it would also make a whole host of other applications using swizzle more performant. To do that, we should consider detecting v128 Consts as arguments and performing optimizations in the Instruction Selector. If we can detect that there's an S128Constant through the graph using a NodeMatcher, we can predetermine before the code-generation step whether the input to swizzle is constant, and if so, whether or not the parameter needs modification. If we can determine that the top bit is already set on any out-of-range values, we can emit a pshufb without any additional movdqu, pshufd, and paddusb instructions -- effectively making swizzle a single-op instruction and eliminating the need for this instruction set. |
Good suggestion; we have a tracking bug for such optimizations at https://crbug.com/v8/10992. This is a slow case that has shown up multiple times and an optimization we would like to get to. |
I updated the table above to show what the performance would be if the optimization was in-place. |
Indeed, SpiderMonkey translates swizzle-with-constant-mask into shuffle-with-zero, which is then subject to the usual pattern matching optimizations: https://searchfox.org/mozilla-central/source/js/src/jit/MIR.cpp#4313. |
This question may sound a bit silly, but what indices should I be using with the zero vector for this test or does it not matter? |
@abrown, @dtig, @ngzhian -- Here are the benchmarks for shuffle:
This week (hopefully today or tomorrow), I'm going to put together a self-contained sample for everyone to test and file the ticket. |
Closing as per #436. |
As proposed in WebAssembly/simd#395 and matching the opcodes used in V8: https://chromium-review.googlesource.com/c/v8/v8/+/2617385/4/src/wasm/wasm-opcodes.h Differential Revision: https://reviews.llvm.org/D95557
Introduction
This proposal mirrors #290 to add new variants of existing widen instructions and extends the 32 and 64 bit widen instructions to include support from 16 and 8-bit integers. The practical use case for this is signal processing -- specifically audio and image processing -- but the use cases are pretty broad in general. For a non-image-processing use case, these could be very helpful any time someone wants to convert an 8-bit value to a floating-point number. Currently, this requires multiple conversion steps between integers before converting to float, but modern architectures provide operations to convert from just about any integer size to another. Due to the non-binary relationship between 8 bits and 64 bits, this instruction will introduce new terminology that replaces the high/low terminology with a constant parameter immediate. This PR supersedes #372 to provide the implementation guidelines for this proposal.
Use Cases
Notable Applications and Libraries
Proposed Instructions
Withdrawn instructions
Performance and Portability Considerations
The principal implementation is a shuffle/swizzle plus a shift for signed data, and merely a shuffle/swizzle for unsigned data.
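The shuffle-plus-shift idea for signed data can be sketched in portable C. This is a simulation of the lowering principle, not engine code; the byte placement and the hypothetical immediate `c` are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the signed lowering principle: a byte shuffle places each
 * selected source byte into the top byte of its 32-bit lane (all other
 * bytes zeroed), then an arithmetic shift right by 24 replicates the
 * sign bit across the lane. `c` is the hypothetical ImmLaneIdx4
 * immediate selecting which group of four bytes to widen. */
static void widen_i8x16_s_via_shuffle_shift(const uint8_t a[16], int c,
                                            int32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        /* shuffle step: byte a[c*4+i] -> bits 31..24 of lane i */
        uint32_t lane = (uint32_t)a[c * 4 + i] << 24;
        /* shift step: >> 24 on a signed value is arithmetic on
         * mainstream compilers, sign-extending the original byte */
        out[i] = (int32_t)lane >> 24;
    }
}
```

The unsigned variant drops the shift entirely: the shuffle places each byte in the low position of its lane with the rest zeroed, which is already a zero-extension.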
Analysis of the efficacy of this proposal is described here, and is demonstrated here for 8 to 32-bit and here for 8 to 64-bit. There is considerable room for compiler optimization depending on how the subsequent code operates. For instance, the primary advantage of the `tbl` approach (on ARM64) arises when a mask already exists and does not require a load from memory; in other cases, it may make more sense to take the `ushll` or `sshll` routes. Whether a benefit is achieved depends on the port utilization of the subsequent code and on how much out-of-order and instruction-level parallelism can be obtained. This caveat does not appear to apply to x64 chips, which gain a benefit as long as the number of shuffles is reduced. In such cases, if a compiler detects a load followed by a convert, it can immediately optimize it upstream with `movzx****` or `movsx****` directly to the target register, providing maximum instruction-level parallelism and minimal port usage. Wherever performance with this method does not exceed that of incremental conversions, the incremental conversion method may be used in its place. Similarly, any system or architecture that benefits from this conversion method over incremental conversion can use any of the masks described herein as constants provided to the existing `v128.swizzle` operation.

Mapping To Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience. Compliant WebAssembly implementations do not have to follow the same code generation patterns.
Masks or Tables relevant to x64 and ARM Implementations
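As an illustration of the kind of mask involved (the exact tables from this section are not reproduced here), a `pshufb`/`tbl`-style shuffle zeroes any destination byte whose index has its high bit set, so a mask alone can perform zero-extension. Below is a portable C simulation of that shuffle semantics; the mask shown is a hypothetical example for `i32x4.widen_i8x16_u` with immediate `c = 0`, assuming little-endian lane layout.

```c
#include <assert.h>
#include <stdint.h>

/* Portable simulation of a pshufb/tbl-style byte shuffle: an index byte
 * with its high bit set (e.g. 0x80) produces zero; otherwise the low
 * four bits select a source byte. */
static void byte_shuffle(const uint8_t src[16], const uint8_t mask[16],
                         uint8_t dst[16]) {
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Hypothetical mask for i32x4.widen_i8x16_u, c = 0: each source byte
 * lands in the low byte of its 32-bit lane (little-endian), with the
 * remaining three bytes zeroed -- i.e., a zero-extension. */
static const uint8_t widen_u_c0_mask[16] = {
    0, 0x80, 0x80, 0x80,   /* lane 0 <- byte 0 */
    1, 0x80, 0x80, 0x80,   /* lane 1 <- byte 1 */
    2, 0x80, 0x80, 0x80,   /* lane 2 <- byte 2 */
    3, 0x80, 0x80, 0x80,   /* lane 3 <- byte 3 */
};
```

Masks for the other values of `c`, for 16-bit sources, and for 64-bit destinations follow the same pattern with different source indices and lane widths.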
Withdrawn lowerings
x86/x86-64 processors with AVX instruction set
i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128
i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128
Withdrawn lowerings
i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128
i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128
i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128
i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128
x86/x86-64 processors with SSE4 instruction set
i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128
i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128
Withdrawn lowerings
i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128
i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128
i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128
i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128
x86/x86-64 processors with SSE2 instruction set
i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128
i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128
Withdrawn lowerings
i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128
i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128
i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128
i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128
on ARM64
i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128
i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128
Withdrawn lowerings
i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128
i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128
i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128
i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128
on ARMv7 with NEON
i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128
i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128
Withdrawn lowerings
i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128
i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128
i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128
i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128