
Improve generated x86 code for AVX targets #138

Merged

merged 3 commits into boostorg:develop from feature/avx_optimization on Apr 15, 2023

Conversation

Lastique
Member

Prefer vmovdqu to vlddqu on CPUs supporting AVX. vlddqu has one extra cycle of latency on Skylake and later Intel CPUs, and it is not folded into the following instruction as a memory operand, which makes the code slightly larger. Legacy SSE3 lddqu is still preferred because it is faster on Prescott and the same as movdqu on AMD CPUs. This doesn't affect code size either, because movdqu could not be folded into a memory operand anyway: SSE memory operands are required to be aligned.

Closes #137.

Also, re-format the test code for MSVC bug 981648; no functional changes.
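
For illustration, here is a minimal sketch of the selection logic described above, assuming GCC/Clang feature macros (__AVX__, __SSE3__). It is not the actual Boost.UUID implementation, which lives in include/boost/uuid/detail/uuid_x86.ipp and also handles MSVC (where __SSE3__ is not defined); the helper name load_unaligned_si128 is hypothetical.

```cpp
#include <emmintrin.h>   // SSE2: _mm_loadu_si128 (movdqu / vmovdqu)
#if defined(__SSE3__)
#include <pmmintrin.h>   // SSE3: _mm_lddqu_si128 (lddqu)
#endif

static inline __m128i load_unaligned_si128(const void* p)
{
#if defined(__AVX__)
    // With AVX enabled, _mm_loadu_si128 emits vmovdqu, which the compiler
    // can fold into a following VEX-encoded instruction as an unaligned
    // memory operand; vlddqu cannot be folded and has an extra cycle of
    // latency on Skylake and later.
    return _mm_loadu_si128(static_cast<const __m128i*>(p));
#elif defined(__SSE3__)
    // Without AVX, lddqu is never worse than movdqu and is faster on
    // Prescott. There is no code size cost: SSE memory operands must be
    // aligned, so movdqu could not be folded either.
    return _mm_lddqu_si128(static_cast<const __m128i*>(p));
#else
    return _mm_loadu_si128(static_cast<const __m128i*>(p));
#endif
}
```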

@Lastique
Member Author

@pdimov Peter, MSVC-14.0 fails with an ICE in this CI run, in the is_contiguous_range implementation. Could you take a look, please?

@pdimov
Member

pdimov commented Oct 26, 2022

I'll look into it.

@codecov

codecov bot commented Oct 27, 2022

Codecov Report

Merging #138 (86b1ae5) into develop (9df4da9) will not change coverage.
The diff coverage is n/a.

❗ Current head 86b1ae5 differs from the pull request's most recent head c7e0a70. Consider uploading reports for commit c7e0a70 to get more accurate results.


@@           Coverage Diff            @@
##           develop     #138   +/-   ##
========================================
  Coverage    83.92%   83.92%           
========================================
  Files           15       15           
  Lines          678      678           
  Branches       156      156           
========================================
  Hits           569      569           
  Misses          18       18           
  Partials        91       91           
Impacted Files | Coverage Δ
include/boost/uuid/detail/uuid_x86.ipp | 100.00% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9df4da9...c7e0a70.

@Lastique
Member Author

Ping @jeking3.

Prefer movdqu to lddqu on CPUs supporting SSE4.1 and later. lddqu has one
extra cycle of latency on Skylake and later Intel CPUs, and with AVX vlddqu
is not folded into the following instruction as a memory operand, which makes
the code slightly larger. Legacy SSE3 lddqu is still preferred when SSE4.1
is not enabled because it is faster on Prescott and the same as movdqu on
AMD CPUs. This doesn't affect code size either, because movdqu could not be
folded into a memory operand anyway: SSE memory operands are required to be
aligned.

Closes boostorg#137.
@Lastique
Member Author

Rebased and updated the code to prefer movdqu starting with SSE4.1. It doesn't matter on CPUs not supporting AVX, but it is possible that SSE4.1 code will run on a modern CPU that prefers movdqu to lddqu.
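
In terms of the hypothetical sketch above, the rebased selection presumably becomes something like the following (again an illustration only; __SSE4_1__ is the GCC/Clang macro, and the real code has additional MSVC handling):

```cpp
#include <emmintrin.h>
#if defined(__SSE3__)
#include <pmmintrin.h>
#endif

static inline __m128i load_unaligned_si128(const void* p)
{
#if defined(__SSE4_1__) || defined(__AVX__)
    // Prefer movdqu/vmovdqu once SSE4.1 is enabled: SSE4.1 code may well
    // run on a modern CPU where movdqu is the better choice.
    return _mm_loadu_si128(static_cast<const __m128i*>(p));
#elif defined(__SSE3__)
    // Plain SSE3 targets keep lddqu (faster on Prescott, equal elsewhere).
    return _mm_lddqu_si128(static_cast<const __m128i*>(p));
#else
    return _mm_loadu_si128(static_cast<const __m128i*>(p));
#endif
}
```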

@pdimov
Member

pdimov commented Mar 15, 2023

Why not just use movdqu everywhere?

@Lastique
Member Author

Because lddqu is better on NetBurst CPUs. And there's also a workaround for an MSVC codegen bug.

@pdimov
Member

pdimov commented Mar 15, 2023

NetBurst CPUs

Really? Have you seen one recently (as in, in the last decade)? :-)

I had a Thinkpad with a P4D that I gave away; its battery lasted about half an hour.

@Lastique
Member Author

Lastique commented Mar 15, 2023

Well, I'm fine with dropping support for NetBurst CPUs, but as I said, there's also the MSVC bug, so the code wouldn't get much simpler anyway.

@pdimov
Member

pdimov commented Mar 15, 2023

It's still a simplification, even if we pretend to care about VS 2008. Not many parts of Boost work with it any more, because it's not tested. (I still test msvc-9.0 on AppVeyor in old libraries as a matter of habit, but that's more of an exception and is not going to last.)

@Lastique
Member Author

The bug is only fixed in VS2015; VS2013 and before are affected.

Anyway, I've pushed a commit to use movdqu universally, except for the MSVC workaround.

@pdimov
Member

pdimov commented Mar 15, 2023

Yes, but without the VS2008 path it's a single ifdef over _ReadWriteBarrier.

This effectively drops the optimization for NetBurst CPUs and instead
prefers code that is slightly better on Skylake and later Intel CPUs,
even when the code is compiled for SSE3 and not SSE4.1.
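
A hedged sketch of where this lands: movdqu unconditionally, with the MSVC _ReadWriteBarrier workaround retained. The thread says only that the workaround involves an ifdef over _ReadWriteBarrier; its exact form and placement below are guesses, not the actual uuid_x86.ipp code.

```cpp
#include <emmintrin.h>
#if defined(_MSC_VER)
#include <intrin.h>      // _ReadWriteBarrier
#endif

static inline __m128i load_unaligned_si128(const void* p)
{
#if defined(_MSC_VER) && _MSC_VER < 1900
    // Guessed shape of the workaround: a compiler-only barrier to stop
    // VS2013 and earlier from miscompiling the unaligned load (the
    // codegen bug is fixed in VS2015, i.e. _MSC_VER >= 1900).
    _ReadWriteBarrier();
#endif
    return _mm_loadu_si128(static_cast<const __m128i*>(p));
}
```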
@Lastique
Member Author

Ok, done. I don't want to remove the VS2008 workaround yet.

@Lastique Lastique closed this Mar 20, 2023
@Lastique Lastique reopened this Mar 20, 2023
@pdimov
Member

pdimov commented Mar 21, 2023

Here's some discussion on lddqu vs movdqu, for reference: https://community.intel.com/t5/Intel-ISA-Extensions/LDDQU-vs-MOVDQU-guidelines/m-p/1178965

@Lastique
Member Author

Yeah, I forgot about that discussion; thanks for digging it up. It didn't result in a definitive answer, though: the Intel reps didn't comment. In the end I was left with the opinion I had when I started it: use lddqu up until AVX and vmovdqu with AVX and later, since before AVX lddqu is not worse than movdqu and is sometimes better.

@Lastique
Member Author

@pdimov Since Boost.UUID is apparently once again not actively maintained, maybe you could merge this? The Codecov failure does not seem to be caused by this PR.

@pdimov
Member

pdimov commented Mar 21, 2023

I was going to wait until after the release, but I can merge it now if you insist.

@Lastique
Member Author

No, after the release is fine. Thanks.

@Lastique
Member Author

A gentle reminder about this PR.

@pdimov pdimov merged commit 1a4e7ed into boostorg:develop Apr 15, 2023
81 of 82 checks passed
@Lastique Lastique deleted the feature/avx_optimization branch April 15, 2023 20:11
Development

Successfully merging this pull request may close these issues.

x86 optimized operators are slower than generic version (#137)
2 participants