After adding an inline assembly implementation of the AVX2 and AVX encoders in #104 and #108, it turns out that we can easily repeat the trick for SSSE3. The code looks a lot like the AVX implementation. Benchmarking across a few machines consistently shows a speedup of around 10-20%.
One caveat is that the inline assembly codepath will only be available on 64-bit machines. 32-bit machines with SSSE3 support (rare, but they exist; I own one) have eight XMM registers instead of sixteen, and that is not enough to implement a proper pipelined, unrolled loop.
I did try to write inline assembly that uses only eight XMM registers, but under those constraints I could not implement a parallelized loop, and in fact I could not even beat the compiler for speed.
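The 64-bit restriction could be expressed as a compile-time gate. The following is only a minimal sketch of that idea, not the code from this change: the macro `HAVE_SSSE3_INLINE_ASM` and the helpers `xmm_register_count()` and `ssse3_encoder_path()` are hypothetical names invented for illustration, and the sketch assumes a GCC-compatible compiler for the inline assembly path.

```c
/* Hypothetical feature macro (not taken from the project): set to 1 only
 * when building for x86-64 with a GCC-compatible compiler, where sixteen
 * XMM registers (xmm0..xmm15) are available to inline assembly. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#  define HAVE_SSSE3_INLINE_ASM 1
#else
#  define HAVE_SSSE3_INLINE_ASM 0
#endif

/* Number of XMM registers on the build target:
 * 16 on x86-64, 8 on SSE-capable 32-bit x86, 0 on anything else. */
static int xmm_register_count(void)
{
#if defined(__x86_64__) || defined(_M_X64)
	return 16;
#elif defined(__i386__) || defined(_M_IX86)
	return 8;
#else
	return 0;
#endif
}

/* Name of the encoder path a build would select under this sketch:
 * inline assembly on 64-bit targets, compiler-generated code
 * (e.g. from intrinsics) everywhere else. */
static const char *ssse3_encoder_path(void)
{
#if HAVE_SSSE3_INLINE_ASM
	return "inline-asm";
#else
	return "intrinsics";
#endif
}
```

With a gate like this, a 32-bit SSSE3 build keeps using compiler-generated code, which matches the observation above that hand-written assembly limited to eight registers did not beat the compiler.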