AVX-512 Broken? #139

Closed · kmcdermo opened this issue May 7, 2018 · 23 comments
kmcdermo commented May 7, 2018

For some reason, even with all the latest fixes protecting against NaNs, we see a serious loss of hits/track on phi2 (KNL) with AVX-512 enabled (at least with max nThreads): https://kmcdermo.web.cern.ch/kmcdermo/catching-nans-4-pt2/PlotsFromDump/CMSSW_TTbar_PU70_CE_nHits.png

This is true for BH, STD, CE. This most likely explains the apparently enormous vectorization speedup on phi2: https://kmcdermo.web.cern.ch/kmcdermo/catching-nans-4-pt2/Benchmarks/KNL_CMSSW_TTbar_PU70_VU_speedup.png

A first test would be to make the same plot with nTh=1, to isolate multithreading from AVX-512, and perhaps to make the same plot for nVU=2,4,8 with nTh=1.

If I remember correctly, @slava77 previously reported seeing lots of new NaNs with max vectorization that @osschar did not see, but perhaps that was because we focused our efforts on phiphi and not phi2.

kmcdermo commented May 8, 2018

So, I ran some tests on SKL-Ag (Cornell Silver, lnx4108), SKL-Au (UCSD Gold, phi3), and KNL (phi2), and it appears that running at full vector width is truly broken across the board.

I ran the following tests on each platform:

  • nVU=1, nTH=1
  • nVU=2, nTH=1
  • nVU=4, nTH=1
  • nVU=8, nTH=1
  • nVU=16, nTH=1
  • nVU=16+intrinsics, nTH=1
  • nVU=16+intrinsics, nTH=$max_th

Where $max_th is:

  • SKL-Ag: 48
  • SKL-Au: 64
  • KNL: 256

And in all three nVU=16 tests, the number of hits is drastically different from the other vector widths on all platforms.

Attached plots: knl_cmssw_ttbar_pu70_ce_nhits, skl-ag_cmssw_ttbar_pu70_ce_nhits, skl-au_cmssw_ttbar_pu70_ce_nhits
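
For reference, the scan above can be driven with a loop along these lines. This is a hypothetical sketch: the MPT_SIZE build knob and the --num-thr option are assumed names for the vector-width and thread-count settings, and may not match our Makefile/CLI exactly; ${INPUT} is a placeholder for the memory file.

    # hypothetical scan driver -- MPT_SIZE and --num-thr are assumed names
    for nvu in 1 2 4 8 16; do
        make clean && make -j 32 MPT_SIZE=${nvu}     # compile-time vector width
        ./mkFit/mkFit --cmssw-n2seeds --build-ce --geom CMS-2017 \
                      --num-thr 1 --input-file ${INPUT}   # nTH=1 isolates the VU effect
    done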

IHateLinus commented May 8, 2018 via email

kmcdermo commented May 8, 2018

@IHateLinus, yes, to be sure, this does not affect the numbers @slava77 reported in his email from March 19 (which used PR135).

The plots for pr135 are here: https://kmcdermo.web.cern.ch/kmcdermo/pr135/

With the relevant plot here:
[image attachment]

So, as you can see, we are "safe" with respect to Slava's numbers at 80 Hz. We can rest a bit easier in this regard.

kmcdermo commented May 8, 2018

(And because I was tired of hunting for this email, I am posting it here as a PDF dump.)
fully_loaded_knl.pdf

osschar commented May 8, 2018

Thanks Kevin! Good to know it's not only intrinsics that are broken. I'm looking into this now.

osschar commented May 8, 2018

OK, see my notes below: the problem is -xCOMMON-AVX512; using -xCORE-AVX512 solves it.

I told y'all I hadn't changed anything that could warrant changes in vectorization performance :)

I'll leave it to the original committer to figure out the correct fix ;)

Debugging the VU=16 performance problem

Working on phi3.

mkFit/mkFit --cmssw-n2seeds --build-ce --geom CMS-2017 --start-event 1 --num-events 1 --quality-val --input-file /data2/slava77/samples/2017/pass-4874f28/initialStep/PU70/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFullINPUT+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU/memoryFile.fv3.clean.writeAll.recT.072617.bin

  • gcc-7, default opts:

Read complete, 9362 simtracks on file.
Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362
found tracks=728 in pT 10%=680 in pT 20%=723 no_mc_assoc=470
nH >= 80% =707 in pT 10%=659 in pT 20%=702

  • icc, default opts (-O2 for ConformalFitter only), or -O2 for everything; same result either way:

Read complete, 9362 simtracks on file.
Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362
found tracks=24 in pT 10%=19 in pT 20%=22 no_mc_assoc=1174
nH >= 80% =3 in pT 10%=0 in pT 20%=2

  • icc with -xCORE-AVX512 (instead of -xCOMMON-AVX512):

Read complete, 9362 simtracks on file.
Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362
found tracks=728 in pT 10%=680 in pT 20%=723 no_mc_assoc=470
nH >= 80% =707 in pT 10%=659 in pT 20%=702
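
In Makefile terms, this fix amounts to a one-line change along the lines below. This is a sketch: the VEC_ICC variable follows the convention quoted later in this thread, but the actual layout of our Makefile.config may differ.

    # sketch of the flag change -- variable name per the VEC_ICC convention below
    #VEC_ICC := -xCOMMON-AVX512   # broken: lowest-common-denominator AVX-512
    VEC_ICC  := -xCORE-AVX512     # works: SKX-targeted AVX-512 (or -xHost on SKL)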

srlantz commented May 8, 2018 via email

osschar commented May 8, 2018

Kevin ran on phi2 (icc 18.0.1), phi3 (icc 18.0.2), and Cornell SKL (???); all had the same issue.

I remember getting a mail that a new Intel compiler version is available; let me check. Oh, it says 2017, update 7.

I think you're right and we're looking at a compiler bug/issue here.

dan131riley commented

Weird. -xCOMMON-AVX512 was fine before your changes, so it's the combination of our changes that triggers it. It appears to be something in MkFinder.

dan131riley commented

I guess the expedient thing to do is -xHost. The compiler does produce a lot of different instruction sequences with common vs. core, especially for the copy routines. Consider this, which looks like an example of what Steve was talking about. Common first, not vectorized:

        movl      (%rax,%r8,8), %r11d                           #98.24
        movl      %r11d, (%r10,%r9)                             #98.10
        movl      4(%rax,%r8,8), %r11d                          #98.24
        incq      %r8                                           #96.7
        movl      %r11d, 64(%r10,%r9)                           #98.10
        addq      $128, %r10                                    #96.7
        cmpq      %rdi, %r8                                     #96.7
        jb        ..B1.10       # Prob 64%                      #96.7

vs. core:

        vpcmpeqb  %xmm0, %xmm0, %k1                             #98.10
        lea       (%rdi,%r11), %r14                             #98.10
        vpcmpeqb  %xmm0, %xmm0, %k2                             #98.10
        vmovups   (%rax,%r9,4), %ymm1                           #98.24
        vmovups   32(%rax,%r9,4), %ymm2                         #98.24
        vscatterdps %ymm1, (%r14,%ymm0,8){%k1}                  #98.10
        vscatterdps %ymm2, 512(%r14,%ymm0,8){%k2}               #98.10
        addq      $16, %r9                                      #96.7
        addq      $1024, %r11                                   #96.7
        cmpq      %r10, %r9                                     #96.7
        jb        ..B1.10       # Prob 82%                      #96.7
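
For context, the source behind this assembly is a strided copy loop of roughly the shape below. This is a hypothetical sketch of a Matriplex-style copy, not the exact source; names are illustrative, and the mapping to the #96/#98 annotations is inferred. With core the compiler emits vscatterdps for the stride-N stores; with common it falls back to scalar movl.

    // hypothetical sketch of the copy-loop shape; names are illustrative
    const int N = 16;                      // matriplex width, one track per lane
    void copy_in(float* __restrict dst, const float* __restrict src, int n)
    {
      for (int i = 0; i < n; ++i)          // loop (annotation #96.7)
        dst[i * N] = src[i];               // stride-N store (annotation #98.10)
    }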

osschar commented May 8, 2018

VEC_ICC := -xHost works on phi3. What do we do? Make the change and ask Kevin nicely to run the tests again?

dan131riley commented

It'll be in my pull request adding phi3 to the benchmarks, coming soon.

srlantz commented May 8, 2018 via email

osschar commented May 8, 2018

I think Kevin changed that to build on each machine. And we dropped KNC.

pwittich commented May 8, 2018

We need to find someone besides Kevin to run these tests; Kevin will graduate in the not-too-distant future.

IHateLinus commented May 8, 2018 via email

kmcdermo commented May 8, 2018

> VEC_ICC := -xHost works on phi3. What do we do? Make the change and ask Kevin nicely to run the tests again?

I can do this for sure, but let's wait and see what Dan comes back with in the new benchmarking scripts :), since he is addressing this issue. The plots above were a one-off and overly detailed for general benchmarking, but they can easily be made again if needed for debugging; maybe I will add them to a dedicated script...

> Don't the benchmark scripts do all their compiling on one machine and ship binaries around to be run on other machines?

Matevz is correct: once we dropped KNC, we tar up the working directory, ship it to the native platform, and compile it there.

This is the case for phiphi (SNB), where the benchmarks are launched, and also for phi2 (KNL), where the tarball is shipped. With phi3/lnx4108, we could tar and compile on each machine, since they both have the Intel compiler. Let's see what Dan has in the new benchmarking scripts for the Skylakes :).
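
The tar-and-ship step itself is nothing fancy; something like the sketch below, where the host name, paths, and job count are placeholders rather than the real benchmark scripts:

    # sketch of tar-and-ship -- placeholders, not the actual scripts
    tar czf mkfit.tgz --exclude='*.o' .      # pack the working directory
    scp mkfit.tgz phi2:                      # ship to the native platform
    ssh phi2 'mkdir -p mkfit && tar xzf mkfit.tgz -C mkfit && cd mkfit && make -j 64'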

> Can't you put an iron ball on his leg?

Don't know if I can get through airport security with it though!

dan131riley commented

-xHost works fine for phi3, but phi2 still hits the bug! (On KNL, -xHost implies -xMIC-AVX512, which shows the same miscompilation as -xCOMMON-AVX512.)

[attached plot: cmssw_ttbar_pu70_ce_nhits]

dan131riley commented

This seems to work around the problem:

diff --git a/mkFit/MkFinder.cc b/mkFit/MkFinder.cc
index 29e0ae7..7376935 100644
--- a/mkFit/MkFinder.cc
+++ b/mkFit/MkFinder.cc
@@ -228,7 +228,7 @@ void MkFinder::SelectHitIndices(const LayerOfHits &layer_of_
     }

     dphi = std::min(std::abs(dphi), L.max_dphi());
-    dq   = std::min(std::max(dq, L.min_dq()), L.max_dq());
+    dq   = clamp(dq, L.min_dq(), L.max_dq());

     qv[itrack] = q;
     phiv[itrack] = phi;

Apparently with -xCOMMON-AVX512 or -xMIC-AVX512 the nested min/max gets vectorized incorrectly. I looked at the assembly, but there were enough differences that I couldn't identify exactly where it went wrong.
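
A minimal standalone reproducer along these lines (hypothetical, not from the repo) would be useful for a compiler bug report; build it once with -xCOMMON-AVX512 and once with -xCORE-AVX512 and diff the output:

    // hypothetical reproducer sketch -- not from the repo
    #include <algorithm>
    #include <cstdio>

    int main()
    {
      constexpr int N = 16;
      alignas(64) float dq[N], out[N];
      const float lo = 0.125f, hi = 1.0f;
      for (int i = 0; i < N; ++i) dq[i] = -2.0f + 0.3f * i;

      for (int i = 0; i < N; ++i)                    // vectorizable loop
        out[i] = std::min(std::max(dq[i], lo), hi);  // the suspect pattern

      for (int i = 0; i < N; ++i)
        std::printf("%2d % .3f -> % .3f\n", i, dq[i], out[i]);
      return 0;
    }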

dan131riley commented

[image attachment]

kmcdermo commented May 9, 2018

Are there other places that use min/max together where we should use clamp instead?

Also, should we replace clamp with std::clamp?
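
For reference, a local clamp with std::clamp's argument order and semantics can be written as below. This is a sketch; the repo's actual helper may differ in details.

    // sketch of a clamp helper with std::clamp (C++17) semantics;
    // the repo's actual helper may differ
    template <typename T>
    inline const T& clamp(const T& v, const T& lo, const T& hi)
    {
      return (v < lo) ? lo : ((v > hi) ? hi : v);
    }

    // usage, as in the diff above:
    //   dq = clamp(dq, L.min_dq(), L.max_dq());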

osschar commented May 9, 2018

Yay, good catch! Thanks! :)

dan131riley commented

@kmcdermo When I added clamp() I looked for places it could be used; dunno if I got all of them. When I reviewed #137 I had even thought of flagging this case, but put it off! std::clamp is addressed in #140.
