Add support for AVX2 #165
Conversation
Some timing measurements (all done on phi3). Notation: 32 / 16 = 32 threads with 16 events in flight.

Single thread, 100 evs, build time only:
AVX-512  5.188  5.188  5.191  => 5.19
AVX2     6.143  6.154  6.130  => 6.14
AVX      7.882  7.881  7.879  => 7.88

32 / 16, 5000 evs, wall time:
AVX-512  20.584  20.542  => 20.6
AVX2     19.919  19.844  => 19.9
AVX      23.448  23.573  => 23.5

64 / 16, 5000 evs, wall time:
AVX-512  16.777  16.900  => 16.8
AVX2     16.244  16.255  => 16.2
AVX      18.910  18.942  => 18.9

128 / 32, 5000 evs, wall time:
AVX-512  29.902  27.163  => 28.5
AVX2     23.919  24.730  => 24.3
AVX      28.804  28.669  => 28.7
@osschar This is nice! Can you please run the standard benchmarking? Then, for funsies, can you change the make options in ./xeon_scripts/benchmark-cmssw-ttbar-fulldet-build.sh:
@osschar Edited your initial comment to include the explanations of your notation, as you told us during today's call.
Benchmarks for this PR (with nevents=100):
- Standard: AVX-512 on phi2 and phi3
- AVX2 on phi2 and phi3 (note, we also use MPT_SIZE=16, which means we actually go over the matriplex in two passes there):
This PR looks fine to me on the code side. In fact, I'm glad Matevz went with integer-vector intrinsics for setting the mask for vgather, because the resulting code is much clearer. (I forgot we would need the int version of the mask anyway for the store ops.)

Like Kevin, I am surprised that the plots of the computational performance of AVX2 and AVX-512 are almost indistinguishable; I think this warrants further investigation. The icc vectorizer does sometimes elect to use AVX2 over AVX-512 on Skylakes, but we are also specifying -qopt-zmm-usage=high, which means the compiler should be favoring AVX-512 in such cases.

It's not specifically part of this PR, but I note that in the current Makefiles, AVX-512 corresponds to -xHost, which means to tailor the code for whatever processor is doing the compilation. Thus the semantics don't work out right if we compile AVX-512 code on phi1, for example (presumably for running elsewhere). Perhaps under the

Here's a wilder suggestion (at least for code where we're not using intrinsics): instead of relying on these ifdefs, we could consider building fat binaries that have AVX instructions as their base, but also include more recent instructions that will be used preferentially on processors that support them. icc can do this trick; I don't know about gcc.

Nothing that I've said above constitutes a reason to hold up the pull request. (By the way, I edited Kevin's comment above to fix an incorrect link.)
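For concreteness, a minimal sketch of the kind of preprocessor-based ISA selection referred to above; the typedef name is illustrative, and the project's own build flags (AVX_512, AVX2, MPLEX_USE_INTRINSICS) gate analogous blocks rather than these exact lines:

```cpp
#include <immintrin.h>

// Illustrative only: pick a vector width at compile time from the
// compiler-defined ISA macros. Each build is locked to one ISA,
// which is exactly the limitation that fat binaries would lift.
#if defined(__AVX512F__)
typedef __m512 vfloat_t;   // 16 floats per 512-bit register
#elif defined(__AVX2__)
typedef __m256 vfloat_t;   // 8 floats per 256-bit register
#else
typedef float vfloat_t;    // scalar fallback
#endif
```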
I will take that as a +1 :), and merge this.
About fat binaries ... when one uses MPLEX_USE_INTRINSICS, this limits the code in Matriplex::SlurpIn() and in the auto-generated code to only that instruction set (as defined by the AVX_512 and AVX2 flags, or KNC_BUILD, I think ... but that is rotting away now). I don't know what the best options for SKL would be ... definitely something worth exploring in painful detail :) Maybe something we could ask Boyana and her students to look into, including an AVX2 vs AVX-512 comparison when running under full load.
To be fair, we are now compiling code natively on each platform before running tests on those platforms. Cross-compilation died when we turned off the original KNC cards. But, as you suggest, it might be better to make this more general.
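On the gcc side of the fat-binary question: gcc (6 and later, on x86) does offer function multi-versioning via target_clones, its analogue of icc's -ax auto-dispatch. A hedged sketch, with an illustrative function not taken from the project:

```cpp
// target_clones builds one clone of the function per listed ISA, plus a
// runtime resolver that dispatches to the best clone the host CPU supports.
__attribute__((target_clones("avx512f", "avx2", "avx", "default")))
void axpy(float a, const float* x, float* y, int n)
{
  // A plain loop: each clone is auto-vectorized for its own ISA.
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}
```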
I will open a couple of GitHub issues to track the points that have been made in this thread. By the way, credit where credit is due: here's where I spotted the trick of setting a mask through a comparison: https://tech.io/playgrounds/283/sse-avx-vectorization/masking-and-conditional-load
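For reference, a minimal sketch of that comparison-based masking trick in AVX2. The function and variable names are illustrative, not taken from the Matriplex code; it builds a per-lane mask by comparing a lane-index vector against a count, uses it for a masked gather, and reuses the integer form for a masked store (the point noted above about the store ops):

```cpp
#include <immintrin.h>

// Gather the first n of 8 lanes from base[idx[i]], zeroing the rest,
// then store the active lanes to out using the same integer mask.
__m256 gather_first_n(const float* base, const int* idx, float* out, int n)
{
  const __m256i lane  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
  // Lanes with lane < n compare to all-ones (active); the others to zero.
  const __m256i imask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
  const __m256i vidx  = _mm256_loadu_si256(
      reinterpret_cast<const __m256i*>(idx));

  // AVX2 masked gather: lanes whose mask sign bit is clear keep src (zero).
  __m256 v = _mm256_mask_i32gather_ps(_mm256_setzero_ps(), base, vidx,
                                      _mm256_castsi256_ps(imask),
                                      /*scale=*/4);

  // The integer version of the mask drives the masked store directly.
  _mm256_maskstore_ps(out, imask, v);
  return v;
}
```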