
Add support for AVX2 #165

Merged
merged 3 commits into from Oct 10, 2018
Conversation

osschar
Collaborator

@osschar osschar commented Sep 20, 2018

Some timing measurements (all done on phi3):
Notation, 32 / 16 = 32 threads with 16 events in flight

Single thread, 100 evs, build time only:

AVX-512  5.188  5.188 5.191 => 5.19
AVX2     6.143  6.154 6.130 => 6.14
AVX      7.882  7.881 7.879 => 7.88

32 / 16, 5000 evs, wall time:

AVX-512  20.584 20.542      => 20.6
AVX2     19.919 19.844      => 19.9
AVX      23.448 23.573      => 23.5

64 / 16, 5000 evs, wall time:

AVX-512  16.777 16.900      => 16.8
AVX2     16.244 16.255      => 16.2
AVX      18.910 18.942      => 18.9

128 / 32, 5000 evs, wall time:

AVX-512  29.902 27.163      => 28.5
AVX2     23.919 24.730      => 24.3
AVX      28.804 28.669      => 28.7
@kmcdermo
Collaborator

@osschar This is nice! Can you please run the standard benchmarking?

Then, for funsies, can you change the make options ${mOpt} in ./xeon_scripts/benchmark-cmssw-ttbar-fulldet-build.sh to AVX2 for phi2 and phi3? And do the same for ${mVal} in ./val_scripts/validation-cmssw-benchmarks.sh? I expect the latter to make no difference, but it will be an important check that changing the vector instruction set does not change the physics!

@srlantz
Collaborator

srlantz commented Sep 21, 2018

@osschar I edited your initial comment to include the explanation of your notation that you gave us during today's call.

@osschar
Collaborator Author

osschar commented Oct 5, 2018

Benchmarks for this PR (with nevents=100):
http://xrd-cache-1.t2.ucsd.edu/matevz/PKF/165-avx2/

Standard - AVX-512 on phi2 and phi3
http://xrd-cache-1.t2.ucsd.edu/matevz/PKF/165-avx2/std-avx512/

AVX2 on phi2 and phi3 (note: we also use MPT_SIZE=16 there, which means we actually go over the matriplex in two passes):
http://xrd-cache-1.t2.ucsd.edu/matevz/PKF/165-avx2/avx2/

@kmcdermo
Collaborator

kmcdermo commented Oct 8, 2018

Thanks for the benchmarks! A few things to note:

  1. Physics is unchanged, as expected.
  2. Perhaps unexpectedly, VU time is unchanged on phi2 and phi3 for AVX2.
  3. And when fully loaded, times do not change on phi2 and phi3 with AVX2.

@srlantz If all looks good on the code side, can you give one last review and hit merge?

@srlantz
Collaborator

srlantz commented Oct 10, 2018

This PR looks fine to me on the code side. In fact, I'm glad Matevz went with integer-vector intrinsics for setting the mask for vgather, because the resulting code is much clearer. (I forgot we would need the int version of the mask anyway for the store ops.)
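
For illustration, here is a minimal, hypothetical sketch of the pattern described above (the names are invented; this is not the actual Matriplex code): the mask is built with an integer-vector compare, cast once for the AVX2 gather, and the same integer mask is reused directly by the masked store.

```cpp
#include <immintrin.h>

// Hypothetical sketch: gather up to 8 floats through an index vector, with
// lanes beyond n_active masked off. The mask comes from an integer compare,
// so the same __m256i value also serves the masked store at the end.
void gather_active(const float* base, const int* idx, int n_active, float* out)
{
  const __m256i lane  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
  const __m256i limit = _mm256_set1_epi32(n_active);
  // All-ones in lanes where lane < n_active, zeros elsewhere.
  const __m256i imask = _mm256_cmpgt_epi32(limit, lane);

  const __m256i vindex = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx));
  // The gather wants the mask as a float vector; a cast reuses the int mask.
  const __m256 vals = _mm256_mask_i32gather_ps(_mm256_setzero_ps(), base, vindex,
                                               _mm256_castsi256_ps(imask), 4);

  // The store op takes the integer mask directly.
  _mm256_maskstore_ps(out, imask, vals);
}
```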

Like Kevin, I am surprised that the plots of the computational performance of AVX2 and AVX-512 are almost indistinguishable. I think this warrants further investigation. The icc vectorizer does sometimes elect to use AVX2 over AVX-512 on Skylakes, but we are also specifying -qopt-zmm-usage=high, which means the compiler should be favoring AVX-512 in such cases.

It's not specifically part of this PR, but I note that in the current Makefiles, AVX-512 corresponds to -xHost, which means to tailor the code for whatever processor is doing the compilation. Thus the semantics don't work out right if we compile AVX-512 code on phi1, for example (presumably for running elsewhere). Perhaps under the ifdef AVX_512 we should have VEC_ICC := -xCOMMON-AVX512 -qopt-zmm-usage=high, to be consistent with VEC_GCC := -mavx512f -mavx512cd?

Here's a wilder suggestion (at least for code where we're not using intrinsics): instead of relying on these ifdefs, we could consider building fat binaries that have AVX instructions as their base, but also include more recent instructions that will be used preferentially on processors that support them. icc can do this trick; I don't know about gcc.
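
As an aside on the gcc question: gcc does offer function multiversioning via the target_clones attribute, which emits several variants of a function and picks the best one at run time on the executing CPU. A minimal sketch (purely illustrative, not proposed for this PR):

```cpp
// Illustrative only: gcc builds one clone per listed target plus a resolver,
// so the same binary can run AVX2 or AVX-512 code depending on the host CPU.
__attribute__((target_clones("default", "avx2", "avx512f")))
void axpy(float a, const float* x, float* y, int n)
{
  // The loop is auto-vectorized separately for each clone.
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}
```

This only covers compiler-vectorized code, though; hand-written intrinsics would still need their own dispatch.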

Nothing that I've said above constitutes a reason to hold up the pull request.

(By the way, I edited Kevin's comment above to fix an incorrect link)

@kmcdermo
Collaborator

I will take that as a +1 :) and merge this.

@kmcdermo kmcdermo merged commit 825e264 into trackreco:devel Oct 10, 2018
@osschar
Collaborator Author

osschar commented Oct 10, 2018

About fat binaries ... when one uses MPLEX_USE_INTRINSICS, the code in Matriplex::SlurpIn() and in the auto-generated code is limited to the single instruction set selected at compile time by the AVX_512 and AVX2 flags (or KNC_BUILD, I think ... but that is rotting away now).
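
To make the limitation concrete, here is a schematic (not the actual Matriplex source) of the kind of compile-time selection being described; because the preprocessor picks the path, a fat binary cannot re-dispatch it at run time:

```cpp
#include <immintrin.h>

// Schematic only; the real Matriplex::SlurpIn() differs. With
// MPLEX_USE_INTRINSICS the instruction set is baked in by the preprocessor
// (AVX_512 / AVX2), so only one of these paths exists in the binary.
inline void slurp_in_sketch(float* dst, const char* base, int stride_bytes)
{
#if defined(MPLEX_USE_INTRINSICS) && defined(AVX_512)
  // AVX-512: gather 16 floats in one pass.
  const __m512i vi = _mm512_mullo_epi32(
      _mm512_set1_epi32(stride_bytes),
      _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15));
  _mm512_storeu_ps(dst, _mm512_i32gather_ps(vi, base, 1));
#elif defined(MPLEX_USE_INTRINSICS) && defined(AVX2)
  // AVX2: gather 16 floats in two 8-wide passes (cf. the MPT_SIZE=16 remark above).
  const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
  for (int pass = 0; pass < 2; ++pass) {
    const __m256i vi = _mm256_mullo_epi32(
        _mm256_set1_epi32(stride_bytes),
        _mm256_add_epi32(lane, _mm256_set1_epi32(8 * pass)));
    _mm256_storeu_ps(dst + 8 * pass,
                     _mm256_i32gather_ps(reinterpret_cast<const float*>(base), vi, 1));
  }
#else
  // Generic fallback, left to the compiler's auto-vectorizer.
  for (int i = 0; i < 16; ++i)
    dst[i] = *reinterpret_cast<const float*>(base + i * stride_bytes);
#endif
}
```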

I don't know what the best options for SKL would be ... definitely something worth exploring in painful detail :) Maybe something we could ask Boyana and her students to look into, including an AVX2 vs AVX_512 comparison when running under full load?

@kmcdermo
Collaborator

It's not specifically part of this PR, but I note that in the current Makefiles, AVX-512 corresponds to -xHost, which means to tailor the code for whatever processor is doing the compilation. Thus the semantics don't work out right if we compile AVX-512 code on phi1, for example (presumably for running elsewhere). Perhaps under the ifdef AVX_512 we should have VEC_ICC := -xCOMMON-AVX512 -qopt-zmm-usage=high, to be consistent with VEC_GCC := -mavx512f -mavx512cd?

To be fair, we are now compiling code natively on each platform before running tests on those platforms. Cross-compilation died when we turned off the original KNC cards. But, as you suggest, it might be better to make this more general.

@dan131riley
Collaborator

Regarding -xHost vs. -xCOMMON-AVX512: PR #140 originally had the latter, but that was changed to the former to avoid an apparent compiler bug that killed the hit-finding efficiency; issue #139 has the details. We should check whether newer compilers have fixed this.

@srlantz
Collaborator

srlantz commented Oct 10, 2018

I will open a couple of GitHub issues to track the points that have been made in this thread.

By the way, credit where credit is due: I spotted the trick of setting a mask through a comparison at https://tech.io/playgrounds/283/sse-avx-vectorization/masking-and-conditional-load
