Improve generated x86 code for AVX targets #138
@pdimov Peter, MSVC-14.0 fails with an ICE in this CI run in
I'll look into it.
Codecov Report
@@ Coverage Diff @@
## develop #138 +/- ##
========================================
Coverage 83.92% 83.92%
========================================
Files 15 15
Lines 678 678
Branches 156 156
========================================
Hits 569 569
Misses 18 18
Partials 91 91
Ping @jeking3.
Force-pushed from 86b1ae5 to e7aa746.
Prefer movdqu to lddqu on CPUs supporting SSE4.1 and later. lddqu has one extra cycle of latency on Skylake and later Intel CPUs, and with AVX, vlddqu is not merged into the following instructions as a memory operand, which makes the code slightly larger. Legacy SSE3 lddqu is still preferred when SSE4.1 is not enabled, because it is faster on Prescott and the same as movdqu on AMD CPUs. There it also doesn't affect code size, because movdqu cannot be folded into a memory operand either: memory operands of legacy SSE instructions are required to be aligned. Closes boostorg#137.
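The compile-time selection this commit message describes can be sketched with intrinsics: `_mm_lddqu_si128` maps to `lddqu`, and `_mm_loadu_si128` is the usual intrinsic for an unaligned `movdqu` load (compilers may emit the equivalent `movups` instead). This is a minimal sketch, not the Boost.UUID code: the helper name `load_unaligned_16` is hypothetical, and it keys off the GCC/Clang predefined macros `__SSE3__` and `__SSE4_1__` rather than whatever configuration macros the library actually uses.

```cpp
#include <emmintrin.h>  // SSE2: _mm_loadu_si128 (movdqu)
#include <pmmintrin.h>  // SSE3: _mm_lddqu_si128 (lddqu)
#include <cstdint>
#include <cassert>

// Hypothetical helper: load 16 possibly unaligned bytes into an XMM register.
// __SSE3__ / __SSE4_1__ are GCC/Clang predefined macros (set by -msse3,
// -msse4.1, -march=...). MSVC defines neither, so it always takes movdqu here.
inline __m128i load_unaligned_16(const std::uint8_t* p)
{
    const __m128i* src = reinterpret_cast<const __m128i*>(p);
#if defined(__SSE3__) && !defined(__SSE4_1__)
    // SSE3 but not SSE4.1: lddqu (faster on Prescott, same as movdqu on AMD).
    return _mm_lddqu_si128(src);
#else
    // SSE2 baseline, or SSE4.1 and later: plain movdqu.
    return _mm_loadu_si128(src);
#endif
}
```

Both branches return the same 16 bytes; only the instruction differs, which is why the choice can be made purely on performance and code-size grounds.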
Force-pushed from e7aa746 to 432763d.
Rebased and updated the code to prefer movdqu starting with SSE4.1. It makes no difference on CPUs without AVX, but SSE4.1 code may well run on a modern CPU that does prefer movdqu to lddqu.
Why not just use
Because lddqu is better on NetBurst CPUs. There's also a workaround for an MSVC codegen bug.
Really? Have you seen one recently (as in, in the last decade)? :-) I had a Thinkpad with a P4D that I gave away; its battery lasted about half an hour.
Well, I'm fine with dropping support for NetBurst CPUs, but as I said, there's also the MSVC bug, so the code wouldn't get much simpler anyway.
It's still a simplification, even if we still pretend to care about VS 2008. Not many parts of Boost still work with it, because it's not tested. (I still test msvc-9.0 on Appveyor in old libraries as a matter of habit, but that's more of an exception and is not going to last.)
The bug was only fixed in VS2015; VS2013 and earlier are affected. Anyway, I've pushed a commit to use movdqu universally, except for the MSVC workaround.
Yes, but without the VS2008 path, it's a single ifdef over _ReadWriteBarrier.
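With movdqu used universally, the only remaining conditional is the MSVC one. A minimal sketch of that shape follows, assuming that `_MSC_VER < 1900` identifies the affected compilers (VS2015 is `_MSC_VER` 1900) and that the workaround is a compiler barrier placed right after the load so it is not combined with the instructions that follow. The helper name is hypothetical, and the exact form of the workaround in Boost.UUID may differ.

```cpp
#include <emmintrin.h>  // SSE2: _mm_loadu_si128 (movdqu)
#include <cstdint>
#include <cassert>
#if defined(_MSC_VER) && _MSC_VER < 1900
#include <intrin.h>     // _ReadWriteBarrier
#endif

// Hypothetical helper: movdqu everywhere, plus the MSVC-only barrier.
inline __m128i load_unaligned_16(const std::uint8_t* p)
{
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
#if defined(_MSC_VER) && _MSC_VER < 1900
    // VS2013 and earlier (assumed nature of the bug): keep the compiler from
    // merging this unaligned load into a following instruction's operand.
    _ReadWriteBarrier();
#endif
    return v;
}
```

On GCC, Clang and VS2015+ the function reduces to a single unaligned load.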
Force-pushed from c7e0a70 to 685eed4.
This effectively drops the optimization for NetBurst CPUs and instead prefers code that is slightly better on Skylake and later Intel CPUs, even when the code is compiled for SSE3 and not SSE4.1.
Force-pushed from 685eed4 to 5e7637b.
OK, done. I don't want to remove the VS2008 workaround yet.
Here's some discussion on lddqu vs movdqu, for reference: https://community.intel.com/t5/Intel-ISA-Extensions/LDDQU-vs-MOVDQU-guidelines/m-p/1178965
Yeah, I forgot about that discussion; thanks for digging it up. It didn't result in a definitive answer, though, since the Intel reps didn't comment. In the end I was left with the opinion I had when I started it: use
@pdimov Since Boost.UUID apparently is once again not actively maintained, maybe you could merge this? The Codecov failure does not seem to be caused by this PR.
I was going to wait until after the release, but I can merge it now if you insist.
No, after the release is fine. Thanks.
A gentle reminder about this PR.
Prefer vmovdqu to vlddqu on CPUs supporting AVX. vlddqu has one extra cycle of latency on Skylake and later Intel CPUs and is not merged into the following instructions as a memory operand, which makes the code slightly larger. Legacy SSE3 lddqu is still preferred because it is faster on Prescott and the same as movdqu on AMD CPUs. It also doesn't affect code size, because movdqu cannot be folded into a memory operand either: memory operands of legacy SSE instructions are required to be aligned. Closes #137.
Also, reformat the test code for MSVC bug 981648; no functional changes.
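To illustrate what being "merged into the following instructions as a memory operand" means, here is a hypothetical 16-byte equality check of the kind a UUID comparison needs (the name `equal_16` is illustrative, not the library's). With AVX enabled (for example `-mavx` on GCC/Clang), the compiler is free to fold the second unaligned load into the memory operand of `vpcmpeqb`, because VEX-encoded instructions accept unaligned memory operands. Legacy SSE `pcmpeqb` requires an aligned memory operand, so without AVX the load stays a separate `movdqu`. Whether folding actually happens depends on the compiler and optimization level; inspecting the generated assembly is the way to check.

```cpp
#include <emmintrin.h>  // SSE2: loadu, cmpeq_epi8, movemask_epi8
#include <cstdint>
#include <cassert>

// Hypothetical 16-byte equality check built from two unaligned loads,
// a bytewise compare and a movemask.
inline bool equal_16(const std::uint8_t* a, const std::uint8_t* b)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    // Under AVX this load is a candidate for folding into vpcmpeqb.
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i eq = _mm_cmpeq_epi8(va, vb);     // 0xFF in each equal byte lane
    return _mm_movemask_epi8(eq) == 0xFFFF;  // true only if all 16 lanes match
}
```

Loading through `vlddqu` instead would force a separate load instruction even under AVX, which is the code-size cost the description refers to.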