-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVX-512 Broken? #139
Comments
So, I ran some tests on SKL-Ag (Cornell Silver, lnx4108), SKL-Au (UCSD Gold, phi3), and KNL (phi2), and it appears it is truly that running at full vector width is broken across the board. I ran the following tests on each platform:
Where $max_th is:
And in all three nVU=16 tests... the number of hits is drastically different compared to the other vector widths for all platforms. |
Hi Kevin,
I think you said in a mtg that this is a “new development” and that the timing and performance numbers we have been talking about (e.g. ~80Hz for ttbar+70PU on fully loaded KNL), were derived when this hit distribution was “healthy”
Can you please confirm and attach those plots so we establish a baseline and also make sure wr are not fulling ourselves (and others) by those earlier values…
Thanks, Avi.
… On May 7, 2018, at 9:34 PM, Kevin McDermott ***@***.***> wrote:
So, I ran some tests on SKL-Ag (Cornell Silver, lnx4108), SKL-Au (UCSD Gold, phi3), and KNL (phi2), and it appears it is truly that running at full vector width is broken across the board.
I ran the following tests on each platform:
nVU=1, nTH=1
nVU=1, nTH=1
<https://user-images.githubusercontent.com/7645989/39738149-7be8b5d4-5257-11e8-9207-42603fb950fd.png>
<https://user-images.githubusercontent.com/7645989/39738150-7bfae18c-5257-11e8-88de-f72ae0c268d3.png>
<https://user-images.githubusercontent.com/7645989/39738151-7c1331d8-5257-11e8-9a56-90938b8aee3f.png>
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub <#139 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEdT8UGCaJjttyHyYk19Na8GMpAnajz1ks5twSBugaJpZM4T1sEj>.
|
@IHateLinus , yes, to be sure, this does not affect the numbers @slava77 reported in his email from March 19 (which was using PR135). The plots for pr135 are here: https://kmcdermo.web.cern.ch/kmcdermo/pr135/ So as you can see, we are "safe" for Slava's numbers at 80Hz. We can rest a bit easy in this regard. |
(and because I was tired of looking for this email I am posting it here as a pdf dump) |
Thanks Kevin! Good to know it's not only intrinsics that are broken. I'm looking into this now. |
OK, see my notes below, the problem is xCOMMON-AVX512, using xCORE solves the problem. I told y'all that I haven't changed anything that could warrant changes in vectorization performance :) I'll leave it to original committer to figure out the correct fix ;) Debugging VU=16 performance problemWorking on phi3. mkFit/mkFit --cmssw-n2seeds --build-ce --geom CMS-2017 --start-event 1 --num-events 1 --quality-val --input-file /data2/slava77/samples/2017/pass-4874f28/initialStep/PU70/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFullINPUT+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU/memoryFile.fv3.clean.writeAll.recT.072617.bin
Read complete, 9362 simtracks on file.
Read complete, 9362 simtracks on file.
Read complete, 9362 simtracks on file. |
Cool discovery--or rather, cool clue. Obviously we wouldn't want the choice of instruction set to change the results like this.
It may simply be the case that the CORE option is letting us avoid a compiler bug that is only present in the COMMON path. Did you try giving the COMMON option to different compiler versions, perhaps on other machines? Because if that works, then for sure we're looking at a compiler bug.
Otherwise... I'm not sure whether it helps, but in case it does, the difference between -xCOMMON-AVX512 and -xCORE-AVX512 is that the latter adds the following groups of instructions: *
· AVX512BW - Byte and word (8 and 16-bit) operations that enhance integer operations. (They make use of all 64 bits in the mask registers, too.)
· AVX512DQ - Doubleword and quadword (32 and 64-bit) operations that enhance integer and floating-point operations.
· AVX512VL - Vector Length Extensions. They provide for most AVX-512 instructions to operate on 128 or 256 bits, instead of only 512. The use of Vector Length Extensions extends most AVX-512 operations to also operate on XMM (128-bit, SSE) registers and YMM (256-bit, AVX) registers. It also permits access to registers 16..31, to be applied to XMM and YMM registers instead of only to ZMM registers.
Thus with the CORE flag the compiler will vectorize loops that involve more types such as int and long long; more operations such as bit manipulations; and shorter vectors.
Steve
*Abridged descriptions from https://software.intel.com/en-us/blogs/additional-avx-512-instructions
From: Matevž Tadel [mailto:notifications@github.com]
Sent: Tuesday, May 08, 2018 12:38 PM
To: cerati/mictest
Cc: Subscribed
Subject: Re: [cerati/mictest] AVX-512 Broken? (#139)
OK, see my notes below, the problem is xCOMMON-AVX512, using xCORE solves the problem.
I told y'all that I haven't changed anything that could warrant changes in vectorization performance :)
I'll leave it to original committer to figure out the correct fix ;)
Debugging VU=16 performance problem
Working on phi3.
mkFit/mkFit --cmssw-n2seeds --build-ce --geom CMS-2017 --start-event 1 --num-events 1 --quality-val --input-file /data2/slava77/samples/2017/pass-4874f28/initialStep/PU70/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFullINPUT+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU/memoryFile.fv3.clean.writeAll.recT.072617.bin
* gcc-7, default opts:
Read complete, 9362 simtracks on file.
Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362
found tracks=728 in pT 10%=680 in pT 20%=723 no_mc_assoc=470
nH >= 80% =707 in pT 10%=659 in pT 20%=702
* icc, default opts (-O2 for ConformalFitter); or -O2 for all
Read complete, 9362 simtracks on file.
Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362
found tracks=24 in pT 10%=19 in pT 20%=22 no_mc_assoc=1174
nH >= 80% =3 in pT 10%=0 in pT 20%=2
* icc with -xCORE-AVX512 (instead of xCOMMON)
Read complete, 9362 simtracks on file.
Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362
found tracks=728 in pT 10%=680 in pT 20%=723 no_mc_assoc=470
nH >= 80% =707 in pT 10%=659 in pT 20%=702
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub<#139 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AHooysOpfvVKLtB2A6u2JOGIDbKbhjCEks5twcoEgaJpZM4T1sEj>.
|
Kevin ran on phi2 (18.0.1), phi3 (18.0.2) and Cornell SKL (???) - all had the same issue. I remember getting a mail that a new intel compiler version is available, let me check. Oh, it says 2017, update 7. I think you're right and we're looking at a compiler bug/issue here. |
Weird. '-xCOMMON-AVX512' was fine before your changes, so it's the combination of our changes. Appears to be something in MkFinder. |
I guess the expedient thing to do is '-xHost'. The compiler does produce a lot of different instruction sequences with common vs. core, especially for the copy routines...consider this, which looks like an example of what Steve was talking about...common first, not vectorized:
vs. core:
|
VEC_ICC := -xHost works on phi3. What do we do? Make the change and ask Kevin nicely to run the tests again? |
It'll be in my pull request adding phi3 to the benchmarks, coming soon. |
Don't the benchmark scripts do all their compiling on one machine and ship binaries around to be run on other machines. Or did that change? If I'm right, -xHost wouldn't be compatible with how our benchmarks get run.
From: Dan Riley [mailto:notifications@github.com]
Sent: Tuesday, May 08, 2018 3:25 PM
To: cerati/mictest
Cc: Steve Lantz; Comment
Subject: Re: [cerati/mictest] AVX-512 Broken? (#139)
It'll be in my pull request adding phi3 to the benchmarks, coming soon.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub<#139 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AHooytsuFqiUS3vcdyjRv7s5-MyUPiSlks5twfESgaJpZM4T1sEj>.
|
I think Kevin changed that to build on each machine. And we dropped KNC. |
we need to find someone besides Kevin to run these tests -- Kevin will graduate not too far in the future. |
that is very sad. Cant you pit an iron ball on his leg?
… On May 8, 2018, at 12:32 PM, Peter Wittich ***@***.***> wrote:
we need to find someone besides Kevin to run these tests -- Kevin will graduate not too far in the future.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#139 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEdT8U6O_VslRh-27HSE01IhDIjEqksAks5twfLqgaJpZM4T1sEj>.
|
I can do this for sure, but let's wait to see what Dan comes back with for the new benchmarking scripts :), since he is addressing this issue. The plots above were a one-off and overly pedantic for general benchmarking, but can be made easily again in case of need for debugging -- maybe I will add them to a dedicated script...
Matevz is correct: once we dropped KNC, we tar up the working directory and ship it to the native platform and compile it there. This is the case for phiphi (SNB), where the benchmarks are launched, and also shipped to phi2 (KNL). With phi3/lnx4108, we could tar and compile on each machine, since they both have the intel compiler. Let's see what Dan has in the new benchmarking scripts for the skylakes :).
Don't know if I can get through airport security with it though! |
This seems to work around the problem:
Apparently with common-avx512 or mic-avx512 the nested min/max gets vectorized incorrectly. I looked at the assembly, but there were enough differences that I couldn't identify where it went wrong. |
Are there other places that use min/max together that we should use std::clamp instead? Also, should we replace clamp with std::clamp? |
Yay, good catch! Thanks! :) |
For some reason, even with all the latest fixes protecting against NaNs, we see a serious loss of hits/track on phi2 (KNL) with AVX-512 enabled (at least with max nThreads): https://kmcdermo.web.cern.ch/kmcdermo/catching-nans-4-pt2/PlotsFromDump/CMSSW_TTbar_PU70_CE_nHits.png
This is true for BH, STD, CE. This most likely explains the enormous vectorization speedup on phi2: https://kmcdermo.web.cern.ch/kmcdermo/catching-nans-4-pt2/Benchmarks/KNL_CMSSW_TTbar_PU70_VU_speedup.png
A first test would be to make the same plot with nTh=1, to isolate multithreading from AVX-512, and perhaps making the same plot for nVU=2,4,8 nTh=1.
If I remember correctly, @slava77 reported seeing lots of new NaNs with max vectorization previously that @osschar did not see, but perhaps this was because we focused efforts on phiphi and not phi2.
The text was updated successfully, but these errors were encountered: