
Improve generated x86 code for AVX targets #138

Merged

merged 3 commits into boostorg:develop from feature/avx_optimization on Apr 15, 2023

Conversation

Lastique
Member

Prefer vmovdqu to vlddqu on CPUs supporting AVX. vlddqu has one extra cycle of latency on Skylake and later Intel CPUs, and it is not folded into the following instruction as a memory operand, which makes the code slightly larger. Legacy SSE3 lddqu is still preferred because it is faster on Prescott and the same as movdqu on AMD CPUs. This doesn't affect code size either, because movdqu could not be folded into a memory operand anyway: SSE memory operands are required to be aligned.

Closes #137.

Also, re-format the test code for MSVC bug 981648; no functional changes.
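
For illustration, here is a minimal sketch of the selection logic described above, assuming GCC/Clang feature macros (__AVX__, __SSE3__). It is not the actual Boost.UUID implementation, which lives in include/boost/uuid/detail/uuid_x86.ipp and also handles MSVC (where __SSE3__ is not defined); the helper name load_unaligned_si128 is hypothetical.

```cpp
#include <emmintrin.h>   // SSE2: _mm_loadu_si128 (movdqu / vmovdqu)
#if defined(__SSE3__)
#include <pmmintrin.h>   // SSE3: _mm_lddqu_si128 (lddqu)
#endif

static inline __m128i load_unaligned_si128(const void* p)
{
#if defined(__AVX__)
    // With AVX enabled, _mm_loadu_si128 emits vmovdqu, which the compiler
    // can fold into a following VEX-encoded instruction as an unaligned
    // memory operand; vlddqu cannot be folded and has an extra cycle of
    // latency on Skylake and later.
    return _mm_loadu_si128(static_cast<const __m128i*>(p));
#elif defined(__SSE3__)
    // Without AVX, lddqu is never worse than movdqu and is faster on
    // Prescott. There is no code size cost: SSE memory operands must be
    // aligned, so movdqu could not be folded either.
    return _mm_lddqu_si128(static_cast<const __m128i*>(p));
#else
    return _mm_loadu_si128(static_cast<const __m128i*>(p));
#endif
}
```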

@Lastique
Member Author

@pdimov Peter, MSVC-14.0 fails with an ICE in this CI run, in the is_contiguous_range implementation. Could you take a look, please?

@pdimov
Member

pdimov commented Oct 26, 2022

I'll look into it.

@codecov

codecov bot commented Oct 27, 2022

Codecov Report

Merging #138 (86b1ae5) into develop (9df4da9) will not change coverage.
The diff coverage is n/a.

❗ Current head 86b1ae5 differs from the pull request's most recent head c7e0a70. Consider uploading reports for commit c7e0a70 to get more accurate results.


@@           Coverage Diff            @@
##           develop     #138   +/-   ##
========================================
  Coverage    83.92%   83.92%           
========================================
  Files           15       15           
  Lines          678      678           
  Branches       156      156           
========================================
  Hits           569      569           
  Misses          18       18           
  Partials        91       91           
Impacted Files | Coverage Δ
include/boost/uuid/detail/uuid_x86.ipp | 100.00% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9df4da9...c7e0a70.

@Lastique
Member Author

Ping @jeking3.

Prefer movdqu to lddqu on CPUs supporting SSE4.1 and later. lddqu has one
extra cycle of latency on Skylake and later Intel CPUs, and with AVX vlddqu
is not folded into the following instruction as a memory operand, which makes
the code slightly larger. Legacy SSE3 lddqu is still preferred when SSE4.1
is not enabled because it is faster on Prescott and the same as movdqu on
AMD CPUs. This doesn't affect code size either, because movdqu could not be
folded into a memory operand anyway: SSE memory operands are required to be
aligned.

Closes boostorg#137.
@Lastique
Member Author

Rebased and updated the code to prefer movdqu starting with SSE4.1. It doesn't matter on CPUs not supporting AVX, but it is possible that SSE4.1 code will run on a modern CPU that prefers movdqu to lddqu.
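
In terms of the hypothetical sketch above, the rebased selection presumably becomes something like the following (again an illustration only; __SSE4_1__ is the GCC/Clang macro, and the real code has additional MSVC handling):

```cpp
#include <emmintrin.h>
#if defined(__SSE3__)
#include <pmmintrin.h>
#endif

static inline __m128i load_unaligned_si128(const void* p)
{
#if defined(__SSE4_1__) || defined(__AVX__)
    // Prefer movdqu/vmovdqu once SSE4.1 is enabled: SSE4.1 code may well
    // run on a modern CPU where movdqu is the better choice.
    return _mm_loadu_si128(static_cast<const __m128i*>(p));
#elif defined(__SSE3__)
    // Plain SSE3 targets keep lddqu (faster on Prescott, equal elsewhere).
    return _mm_lddqu_si128(static_cast<const __m128i*>(p));
#else
    return _mm_loadu_si128(static_cast<const __m128i*>(p));
#endif
}
```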

@pdimov
Member

pdimov commented Mar 15, 2023

Why not just use movdqu everywhere?

@Lastique
Member Author

Because lddqu is better on NetBurst CPUs. And there's also a workaround for an MSVC codegen bug.

@pdimov
Member

pdimov commented Mar 15, 2023

NetBurst CPUs

Really? Have you seen one recently (as in, in the last decade)? :-)

I had a Thinkpad with a P4D that I gave away; its battery lasted about half an hour.

@Lastique
Member Author

Lastique commented Mar 15, 2023

Well, I'm fine with dropping support for NetBurst CPUs, but as I said, there's also the MSVC bug, so the code wouldn't get much simpler anyway.

@pdimov
Member

pdimov commented Mar 15, 2023

It's still a simplification, even if we pretend to care about VS 2008. Not many parts of Boost work with it any more, because it's not tested. (I still test msvc-9.0 on AppVeyor in old libraries as a matter of habit, but that's more of an exception and is not going to last.)

@Lastique
Member Author

The bug is only fixed in VS2015; VS2013 and before are affected.

Anyway, I've pushed a commit to use movdqu universally, except for the MSVC workaround.

@pdimov
Member

pdimov commented Mar 15, 2023

Yes, but without the VS2008 path it's a single ifdef over _ReadWriteBarrier.

This effectively drops the optimization for NetBurst CPUs and instead
prefers code that is slightly better on Skylake and later Intel CPUs,
even when the code is compiled for SSE3 and not SSE4.1.
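
A hedged sketch of where this lands: movdqu unconditionally, with the MSVC _ReadWriteBarrier workaround retained. The thread says only that the workaround involves an ifdef over _ReadWriteBarrier; its exact form and placement below are guesses, not the actual uuid_x86.ipp code.

```cpp
#include <emmintrin.h>
#if defined(_MSC_VER)
#include <intrin.h>      // _ReadWriteBarrier
#endif

static inline __m128i load_unaligned_si128(const void* p)
{
#if defined(_MSC_VER) && _MSC_VER < 1900
    // Guessed shape of the workaround: a compiler-only barrier to stop
    // VS2013 and earlier from miscompiling the unaligned load (the
    // codegen bug is fixed in VS2015, i.e. _MSC_VER >= 1900).
    _ReadWriteBarrier();
#endif
    return _mm_loadu_si128(static_cast<const __m128i*>(p));
}
```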
@Lastique
Member Author

Ok, done. I don't want to remove the VS2008 workaround yet.

@Lastique Lastique closed this Mar 20, 2023
@Lastique Lastique reopened this Mar 20, 2023
@pdimov
Member

pdimov commented Mar 21, 2023

Here's some discussion on lddqu vs movdqu, for reference: https://community.intel.com/t5/Intel-ISA-Extensions/LDDQU-vs-MOVDQU-guidelines/m-p/1178965

@Lastique
Member Author

Yeah, I forgot about that discussion; thanks for digging it up. It didn't result in a definitive answer, though: the Intel reps didn't comment. In the end I was left with the opinion I had when I started it: use lddqu up until AVX and vmovdqu with AVX and later, since before AVX lddqu is not worse than movdqu and is sometimes better.

@Lastique
Member Author

@pdimov Since Boost.UUID is apparently once again not actively maintained, maybe you could merge this? The Codecov failure does not seem to be caused by this PR.

@pdimov
Member

pdimov commented Mar 21, 2023

I was going to wait until after the release, but I can merge it now if you insist.

@Lastique
Member Author

No, after the release is fine. Thanks.

@Lastique
Member Author

A gentle reminder about this PR.

@pdimov pdimov merged commit 1a4e7ed into boostorg:develop Apr 15, 2023
81 of 82 checks passed
@Lastique Lastique deleted the feature/avx_optimization branch April 15, 2023 20:11
Development

Successfully merging this pull request may close these issues.

x86 optimized operators are slower than generic version (#137)
2 participants