Add support for AVX2 #165
Conversation
Some timing measurements (all done on phi3). Notation: 32 / 16 = 32 threads with 16 events in flight.

Single thread, 100 evs, build time only:
AVX-512  5.188  5.188  5.191  => 5.19
AVX2     6.143  6.154  6.130  => 6.14
AVX      7.882  7.881  7.879  => 7.88

32 / 16, 5000 evs, wall time:
AVX-512  20.584  20.542  => 20.6
AVX2     19.919  19.844  => 19.9
AVX      23.448  23.573  => 23.5

64 / 16, 5000 evs, wall time:
AVX-512  16.777  16.900  => 16.8
AVX2     16.244  16.255  => 16.2
AVX      18.910  18.942  => 18.9

128 / 32, 5000 evs, wall time:
AVX-512  29.902  27.163  => 28.5
AVX2     23.919  24.730  => 24.3
AVX      28.804  28.669  => 28.7
@osschar This is nice! Can you please run the standard benchmarking? Then, for funsies, can you change the make options in ./xeon_scripts/benchmark-cmssw-ttbar-fulldet-build.sh:
@osschar Edited your initial comment to include the explanations of your notation, as you told us during today's call.
Benchmarks for this PR (with nevents=100):
- Standard: AVX-512 on phi2 and phi3
- AVX2 on phi2 and phi3 (note, we also use MPT_SIZE=16, which means we actually go over the matriplex in two passes there):
This PR looks fine to me on the code side. In fact, I'm glad Matevz went with integer-vector intrinsics for setting the mask for vgather, because the resulting code is much clearer. (I forgot we would need the int version of the mask anyway for the store ops.)

Like Kevin, I am surprised that the plots of the computational performance of AVX2 and AVX-512 are almost indistinguishable; I think this warrants further investigation. The icc vectorizer does sometimes elect to use AVX2 over AVX-512 on Skylakes, but we are also specifying -qopt-zmm-usage=high, which means the compiler should be favoring AVX-512 in such cases.

It's not specifically part of this PR, but I note that in the current Makefiles, AVX-512 corresponds to -xHost, which means to tailor the code for whatever processor is doing the compilation. Thus the semantics don't work out right if we compile AVX-512 code on phi1, for example (presumably for running elsewhere). Perhaps under the

Here's a wilder suggestion (at least for code where we're not using intrinsics): instead of relying on these ifdefs, we could consider building fat binaries that have AVX instructions as their base, but also include more recent instructions that will be used preferentially on processors that support them. icc can do this trick; I don't know about gcc.

Nothing that I've said above constitutes a reason to hold up the pull request. (By the way, I edited Kevin's comment above to fix an incorrect link.)
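For concreteness, a minimal sketch of the kind of preprocessor-based ISA selection referred to above; the typedef name is illustrative, and the project's own build flags (AVX_512, AVX2, MPLEX_USE_INTRINSICS) gate analogous blocks rather than these exact lines:

```cpp
#include <immintrin.h>

// Illustrative only: pick a vector width at compile time from the
// compiler-defined ISA macros. Each build is locked to one ISA,
// which is exactly the limitation that fat binaries would lift.
#if defined(__AVX512F__)
typedef __m512 vfloat_t;   // 16 floats per 512-bit register
#elif defined(__AVX2__)
typedef __m256 vfloat_t;   // 8 floats per 256-bit register
#else
typedef float vfloat_t;    // scalar fallback
#endif
```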
I will take that as a +1 :), and merge this.
About fat binaries ... when one uses MPLEX_USE_INTRINSICS, this limits the code in Matriplex::SlurpIn() and in the auto-generated code to only that instruction set (as defined by the AVX_512 and AVX2 flags, or KNC_BUILD, I think ... but that is rotting away now). I don't know what the best options for SKL would be ... definitely something worth exploring in painful detail :) Maybe something we could ask Boyana and her students to look into, including an AVX2 vs AVX-512 comparison when running under full load.
To be fair, we are now compiling code natively on each platform before running tests on those platforms. Cross-compilation died when we turned off the original KNC cards. But, as you suggest, it might be better to make this more general.
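On the gcc side of the fat-binary question: gcc (6 and later, on x86) does offer function multi-versioning via target_clones, its analogue of icc's -ax auto-dispatch. A hedged sketch, with an illustrative function not taken from the project:

```cpp
// target_clones builds one clone of the function per listed ISA, plus a
// runtime resolver that dispatches to the best clone the host CPU supports.
__attribute__((target_clones("avx512f", "avx2", "avx", "default")))
void axpy(float a, const float* x, float* y, int n)
{
  // A plain loop: each clone is auto-vectorized for its own ISA.
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}
```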
I will open a couple of GitHub issues to track the points that have been made in this thread. By the way, credit where credit is due: here's where I spotted the trick of setting a mask through a comparison: https://tech.io/playgrounds/283/sse-avx-vectorization/masking-and-conditional-load
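For reference, a minimal sketch of that comparison-based masking trick in AVX2. The function and variable names are illustrative, not taken from the Matriplex code; it builds a per-lane mask by comparing a lane-index vector against a count, uses it for a masked gather, and reuses the integer form for a masked store (the point noted above about the store ops):

```cpp
#include <immintrin.h>

// Gather the first n of 8 lanes from base[idx[i]], zeroing the rest,
// then store the active lanes to out using the same integer mask.
__m256 gather_first_n(const float* base, const int* idx, float* out, int n)
{
  const __m256i lane  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
  // Lanes with lane < n compare to all-ones (active); the others to zero.
  const __m256i imask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
  const __m256i vidx  = _mm256_loadu_si256(
      reinterpret_cast<const __m256i*>(idx));

  // AVX2 masked gather: lanes whose mask sign bit is clear keep src (zero).
  __m256 v = _mm256_mask_i32gather_ps(_mm256_setzero_ps(), base, vidx,
                                      _mm256_castsi256_ps(imask),
                                      /*scale=*/4);

  // The integer version of the mask drives the masked store directly.
  _mm256_maskstore_ps(out, imask, v);
  return v;
}
```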