AVX-512 Broken? #139

Closed · kmcdermo opened this issue May 7, 2018 · 23 comments
kmcdermo commented May 7, 2018

For some reason, even with all the latest fixes protecting against NaNs, we see a serious loss of hits/track on phi2 (KNL) with AVX-512 enabled (at least with max nThreads): https://kmcdermo.web.cern.ch/kmcdermo/catching-nans-4-pt2/PlotsFromDump/CMSSW_TTbar_PU70_CE_nHits.png

This is true for BH, STD, CE. This most likely explains the apparently enormous vectorization speedup on phi2: https://kmcdermo.web.cern.ch/kmcdermo/catching-nans-4-pt2/Benchmarks/KNL_CMSSW_TTbar_PU70_VU_speedup.png

A first test would be to make the same plot with nTh=1, to isolate multithreading from AVX-512, and perhaps to make the same plot for nVU=2,4,8 with nTh=1.

If I remember correctly, @slava77 previously reported seeing lots of new NaNs with max vectorization that @osschar did not see, but perhaps that was because we focused our efforts on phiphi and not phi2.

kmcdermo commented May 8, 2018

So, I ran some tests on SKL-Ag (Cornell Silver, lnx4108), SKL-Au (UCSD Gold, phi3), and KNL (phi2), and it appears that running at full vector width is truly broken across the board.

I ran the following tests on each platform:

  • nVU=1, nTH=1
  • nVU=2, nTH=1
  • nVU=4, nTH=1
  • nVU=8, nTH=1
  • nVU=16, nTH=1
  • nVU=16+intrinsics, nTH=1
  • nVU=16+intrinsics, nTH=$max_th

Where $max_th is:

  • SKL-Ag: 48
  • SKL-Au: 64
  • KNL: 256

And in all three nVU=16 tests, the number of hits is drastically different from the other vector widths on all platforms.

Attached plots: knl_cmssw_ttbar_pu70_ce_nhits, skl-ag_cmssw_ttbar_pu70_ce_nhits, skl-au_cmssw_ttbar_pu70_ce_nhits
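
For reference, the scan above can be driven with a loop along these lines. This is a hypothetical sketch: the MPT_SIZE build knob and the --num-thr option are assumed names for the vector-width and thread-count settings, and may not match our Makefile/CLI exactly; ${INPUT} is a placeholder for the memory file.

    # hypothetical scan driver -- MPT_SIZE and --num-thr are assumed names
    for nvu in 1 2 4 8 16; do
        make clean && make -j 32 MPT_SIZE=${nvu}     # compile-time vector width
        ./mkFit/mkFit --cmssw-n2seeds --build-ce --geom CMS-2017 \
                      --num-thr 1 --input-file ${INPUT}   # nTH=1 isolates the VU effect
    done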

IHateLinus commented May 8, 2018 via email

kmcdermo commented May 8, 2018

@IHateLinus, yes, to be sure, this does not affect the numbers @slava77 reported in his email from March 19 (which used PR135).

The plots for pr135 are here: https://kmcdermo.web.cern.ch/kmcdermo/pr135/

With the relevant plot here:
[image attachment]

So, as you can see, we are "safe" with respect to Slava's numbers at 80 Hz. We can rest a bit easier in this regard.

kmcdermo commented May 8, 2018

(And because I was tired of hunting for this email, I am posting it here as a PDF dump.)
fully_loaded_knl.pdf

osschar commented May 8, 2018

Thanks Kevin! Good to know it's not only intrinsics that are broken. I'm looking into this now.

osschar commented May 8, 2018

OK, see my notes below: the problem is -xCOMMON-AVX512; using -xCORE-AVX512 solves it.

I told y'all I hadn't changed anything that could warrant changes in vectorization performance :)

I'll leave it to the original committer to figure out the correct fix ;)

Debugging the VU=16 performance problem

Working on phi3.

mkFit/mkFit --cmssw-n2seeds --build-ce --geom CMS-2017 --start-event 1 --num-events 1 --quality-val --input-file /data2/slava77/samples/2017/pass-4874f28/initialStep/PU70/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFullINPUT+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU/memoryFile.fv3.clean.writeAll.recT.072617.bin

  • gcc-7, default opts:

Read complete, 9362 simtracks on file.
Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362
found tracks=728 in pT 10%=680 in pT 20%=723 no_mc_assoc=470
nH >= 80% =707 in pT 10%=659 in pT 20%=702

  • icc, default opts (-O2 for ConformalFitter only), or -O2 for everything; same result either way:

Read complete, 9362 simtracks on file.
Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362
found tracks=24 in pT 10%=19 in pT 20%=22 no_mc_assoc=1174
nH >= 80% =3 in pT 10%=0 in pT 20%=2

  • icc with -xCORE-AVX512 (instead of -xCOMMON-AVX512):

Read complete, 9362 simtracks on file.
Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362
found tracks=728 in pT 10%=680 in pT 20%=723 no_mc_assoc=470
nH >= 80% =707 in pT 10%=659 in pT 20%=702
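
In Makefile terms, this fix amounts to a one-line change along the lines below. This is a sketch: the VEC_ICC variable follows the convention quoted later in this thread, but the actual layout of our Makefile.config may differ.

    # sketch of the flag change -- variable name per the VEC_ICC convention below
    #VEC_ICC := -xCOMMON-AVX512   # broken: lowest-common-denominator AVX-512
    VEC_ICC  := -xCORE-AVX512     # works: SKX-targeted AVX-512 (or -xHost on SKL)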

srlantz commented May 8, 2018 via email

osschar commented May 8, 2018

Kevin ran on phi2 (icc 18.0.1), phi3 (icc 18.0.2), and Cornell SKL (???); all had the same issue.

I remember getting a mail that a new Intel compiler version is available; let me check. Oh, it says 2017, update 7.

I think you're right and we're looking at a compiler bug/issue here.

dan131riley commented

Weird. -xCOMMON-AVX512 was fine before your changes, so it's the combination of our changes that triggers it. It appears to be something in MkFinder.

dan131riley commented

I guess the expedient thing to do is -xHost. The compiler does produce a lot of different instruction sequences with common vs. core, especially for the copy routines. Consider this, which looks like an example of what Steve was talking about. Common first, not vectorized:

        movl      (%rax,%r8,8), %r11d                           #98.24
        movl      %r11d, (%r10,%r9)                             #98.10
        movl      4(%rax,%r8,8), %r11d                          #98.24
        incq      %r8                                           #96.7
        movl      %r11d, 64(%r10,%r9)                           #98.10
        addq      $128, %r10                                    #96.7
        cmpq      %rdi, %r8                                     #96.7
        jb        ..B1.10       # Prob 64%                      #96.7

vs. core:

        vpcmpeqb  %xmm0, %xmm0, %k1                             #98.10
        lea       (%rdi,%r11), %r14                             #98.10
        vpcmpeqb  %xmm0, %xmm0, %k2                             #98.10
        vmovups   (%rax,%r9,4), %ymm1                           #98.24
        vmovups   32(%rax,%r9,4), %ymm2                         #98.24
        vscatterdps %ymm1, (%r14,%ymm0,8){%k1}                  #98.10
        vscatterdps %ymm2, 512(%r14,%ymm0,8){%k2}               #98.10
        addq      $16, %r9                                      #96.7
        addq      $1024, %r11                                   #96.7
        cmpq      %r10, %r9                                     #96.7
        jb        ..B1.10       # Prob 82%                      #96.7
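
For context, the source behind this assembly is a strided copy loop of roughly the shape below. This is a hypothetical sketch of a Matriplex-style copy, not the exact source; names are illustrative, and the mapping to the #96/#98 annotations is inferred. With core the compiler emits vscatterdps for the stride-N stores; with common it falls back to scalar movl.

    // hypothetical sketch of the copy-loop shape; names are illustrative
    const int N = 16;                      // matriplex width, one track per lane
    void copy_in(float* __restrict dst, const float* __restrict src, int n)
    {
      for (int i = 0; i < n; ++i)          // loop (annotation #96.7)
        dst[i * N] = src[i];               // stride-N store (annotation #98.10)
    }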

osschar commented May 8, 2018

VEC_ICC := -xHost works on phi3. What do we do? Make the change and ask Kevin nicely to run the tests again?

dan131riley commented

It'll be in my pull request adding phi3 to the benchmarks, coming soon.

srlantz commented May 8, 2018 via email

osschar commented May 8, 2018

I think Kevin changed that to build on each machine. And we dropped KNC.

pwittich commented May 8, 2018

We need to find someone besides Kevin to run these tests; Kevin will graduate in the not-too-distant future.

IHateLinus commented May 8, 2018 via email

kmcdermo commented May 8, 2018

> VEC_ICC := -xHost works on phi3. What do we do? Make the change and ask Kevin nicely to run the tests again?

I can do this for sure, but let's wait and see what Dan comes back with in the new benchmarking scripts :), since he is addressing this issue. The plots above were a one-off and overly detailed for general benchmarking, but they can easily be made again if needed for debugging; maybe I will add them to a dedicated script...

> Don't the benchmark scripts do all their compiling on one machine and ship binaries around to be run on other machines?

Matevz is correct: once we dropped KNC, we tar up the working directory, ship it to the native platform, and compile it there.

This is the case for phiphi (SNB), where the benchmarks are launched, and also for phi2 (KNL), where the tarball is shipped. With phi3/lnx4108, we could tar and compile on each machine, since they both have the Intel compiler. Let's see what Dan has in the new benchmarking scripts for the Skylakes :).
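
The tar-and-ship step itself is nothing fancy; something like the sketch below, where the host name, paths, and job count are placeholders rather than the real benchmark scripts:

    # sketch of tar-and-ship -- placeholders, not the actual scripts
    tar czf mkfit.tgz --exclude='*.o' .      # pack the working directory
    scp mkfit.tgz phi2:                      # ship to the native platform
    ssh phi2 'mkdir -p mkfit && tar xzf mkfit.tgz -C mkfit && cd mkfit && make -j 64'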

> Can't you put an iron ball on his leg?

Don't know if I can get through airport security with it though!

dan131riley commented

-xHost works fine for phi3, but phi2 still hits the bug! (On KNL, -xHost implies -xMIC-AVX512, which shows the same miscompilation as -xCOMMON-AVX512.)

[attached plot: cmssw_ttbar_pu70_ce_nhits]

dan131riley commented

This seems to work around the problem:

diff --git a/mkFit/MkFinder.cc b/mkFit/MkFinder.cc
index 29e0ae7..7376935 100644
--- a/mkFit/MkFinder.cc
+++ b/mkFit/MkFinder.cc
@@ -228,7 +228,7 @@ void MkFinder::SelectHitIndices(const LayerOfHits &layer_of_
     }

     dphi = std::min(std::abs(dphi), L.max_dphi());
-    dq   = std::min(std::max(dq, L.min_dq()), L.max_dq());
+    dq   = clamp(dq, L.min_dq(), L.max_dq());

     qv[itrack] = q;
     phiv[itrack] = phi;

Apparently with -xCOMMON-AVX512 or -xMIC-AVX512 the nested min/max gets vectorized incorrectly. I looked at the assembly, but there were enough differences that I couldn't identify exactly where it went wrong.
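
A minimal standalone reproducer along these lines (hypothetical, not from the repo) would be useful for a compiler bug report; build it once with -xCOMMON-AVX512 and once with -xCORE-AVX512 and diff the output:

    // hypothetical reproducer sketch -- not from the repo
    #include <algorithm>
    #include <cstdio>

    int main()
    {
      constexpr int N = 16;
      alignas(64) float dq[N], out[N];
      const float lo = 0.125f, hi = 1.0f;
      for (int i = 0; i < N; ++i) dq[i] = -2.0f + 0.3f * i;

      for (int i = 0; i < N; ++i)                    // vectorizable loop
        out[i] = std::min(std::max(dq[i], lo), hi);  // the suspect pattern

      for (int i = 0; i < N; ++i)
        std::printf("%2d % .3f -> % .3f\n", i, dq[i], out[i]);
      return 0;
    }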

dan131riley commented

[image attachment]

kmcdermo commented May 9, 2018

Are there other places that use min/max together where we should use clamp instead?

Also, should we replace clamp with std::clamp?
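
For reference, a local clamp with std::clamp's argument order and semantics can be written as below. This is a sketch; the repo's actual helper may differ in details.

    // sketch of a clamp helper with std::clamp (C++17) semantics;
    // the repo's actual helper may differ
    template <typename T>
    inline const T& clamp(const T& v, const T& lo, const T& hi)
    {
      return (v < lo) ? lo : ((v > hi) ? hi : v);
    }

    // usage, as in the diff above:
    //   dq = clamp(dq, L.min_dq(), L.max_dq());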

osschar commented May 9, 2018

Yay, good catch! Thanks! :)

dan131riley commented

@kmcdermo When I added clamp() I looked for places it could be used; dunno if I got all of them. When I reviewed #137 I had even thought of flagging this case, but put it off! std::clamp is addressed in #140.
