unify _mm_hsub_* functions #432

howjmay · 2021-05-27T20:31:47Z

The implementations of the _mm_hsub_* functions vary. Maybe we should unify them on the fastest one.

marktwtn · 2021-10-18T06:43:48Z

The generated assembly code of the different implementations:
Compiler: ARM64 GCC 11.1
Optimization: -O2

ARM32 with unzip vector intrinsic:
https://godbolt.org/z/rhdcP7Khe

ARM64 with unzip vector intrinsic:
https://godbolt.org/z/ehvn51To3

Extract narrow and shift implementation:
https://godbolt.org/z/7Ybof1Y1K

The unzip vector implementation generates less assembly code.
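For readers who do not want to open the Godbolt links, here is a minimal sketch of the unzip idea for the 16-bit case. It is not the sse2neon source. The function names `hsub_epi16_reference`, `hsub_epi16_unzip`, `wrap_sub16`, and `hsub_self_check` are hypothetical and exist only for illustration. The sketch assumes the usual `_mm_hsub_epi16` semantics (adjacent pairs of `a` fill the low half of the result, adjacent pairs of `b` fill the high half, with wrapping subtraction). The NEON path is compiled only on AArch64; elsewhere a scalar fallback models the same two unzip steps.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Wrapping 16-bit subtraction, matching non-saturating SIMD behaviour. */
static int16_t wrap_sub16(int16_t x, int16_t y)
{
    return (int16_t) (uint16_t) ((uint16_t) x - (uint16_t) y);
}

/* Pairwise reference: out = {a0-a1, a2-a3, a4-a5, a6-a7, b0-b1, ...}. */
static void hsub_epi16_reference(const int16_t a[8],
                                 const int16_t b[8],
                                 int16_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[i] = wrap_sub16(a[2 * i], a[2 * i + 1]);
        out[i + 4] = wrap_sub16(b[2 * i], b[2 * i + 1]);
    }
}

/* Unzip formulation: gather even lanes, gather odd lanes, subtract once. */
static void hsub_epi16_unzip(const int16_t a[8],
                             const int16_t b[8],
                             int16_t out[8])
{
#if defined(__aarch64__)
    int16x8_t va = vld1q_s16(a);
    int16x8_t vb = vld1q_s16(b);
    int16x8_t even = vuzp1q_s16(va, vb); /* a0 a2 a4 a6 b0 b2 b4 b6 */
    int16x8_t odd = vuzp2q_s16(va, vb);  /* a1 a3 a5 a7 b1 b3 b5 b7 */
    vst1q_s16(out, vsubq_s16(even, odd));
#else
    /* Scalar model of the same two unzip steps (assumption: no NEON). */
    int16_t even[8], odd[8];
    for (int i = 0; i < 4; i++) {
        even[i] = a[2 * i];
        even[i + 4] = b[2 * i];
        odd[i] = a[2 * i + 1];
        odd[i + 4] = b[2 * i + 1];
    }
    for (int i = 0; i < 8; i++)
        out[i] = wrap_sub16(even[i], odd[i]);
#endif
}

/* Usage example: both formulations should agree on a sample input. */
static void hsub_self_check(void)
{
    const int16_t a[8] = {1, 3, 6, 10, 15, 21, 28, 36};
    const int16_t b[8] = {0, -1, 100, 50, INT16_MIN, 1, 7, 7};
    int16_t expected[8], actual[8];

    hsub_epi16_reference(a, b, expected);
    hsub_epi16_unzip(a, b, actual);
    assert(memcmp(expected, actual, sizeof expected) == 0);
}
```

On ARMv7 the same shape should be reachable with `vuzpq_s16`, which returns both the even and odd vectors in one `int16x8x2_t`; the saturating `_mm_hsubs_*` variants would presumably swap `vsubq_s16` for `vqsubq_s16`.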

Unify the implementation of _mm_hsub[s]_* with the unzip vector intrinsic.

The old implementation:
https://godbolt.org/z/7Ybof1Y1K

The better implementation, with less assembly code for ARM32 and ARM64:
https://godbolt.org/z/rhdcP7Khe
https://godbolt.org/z/ehvn51To3

Extract variable declarations for readability.
Replace the transpose vector intrinsic with the unzip vector intrinsic for unification.

Close #432.

howjmay self-assigned this Aug 8, 2021

marktwtn mentioned this issue Oct 18, 2021

perf: Improve and unify _mm_hsub[s]_* intrinsics #498

Merged

jserv closed this as completed in #498 Oct 18, 2021
