Results for other systems #6
Thanks! No arguments are necessary. BTW, I have never tested this with AVX-512, so I have no idea whether it will all work flawlessly - fingers crossed :)
OK, I will run 2x with:
FWIW, on my SKL system I get some wrong results, like:
Where a latency of 0.5 is ... unlikely. I guess the problem is that CULT doesn't know that the first argument to
Yeah, I think it's the opposite - BTW: you don't have to use
@kobalicek - oops, good point, I forgot that
Here's another one I noticed:
I also got a lot of 0.2 reciprocal throughput results, which should be wrong (max 4 ops/cycle), but they seemed to go away after I turned off turbo. Do I need to turn off turbo to get good results?
Yeah, I'm also getting 0.2 reciprocal throughput on some instructions on Ryzen, but apparently Ryzen is capable of executing 5 instructions per cycle if they are in the uop cache. However, if it says 0.2 it's probably true even on Intel, although it's possible that I miscalculate the cycles wasted on each loop iteration, which is currently set to 1 cycle. It's hard to say whether that could cause reporting 0.2 instead of 0.25 in such cases.
I have opened #7 to track the latency issue
Yes, on Ryzen that is expected. It's definitely not 0.2 on Intel, though; I've tested this stuff exhaustively down to the cycle using lots of different calibration and cycle-measurement techniques, and I have never seen any case where you can do 5 ops/cycle. As I mentioned, it could be turbo effects - how are you doing the timing? Do you use clock-based timing and then convert to cycles using a calibration based on a well-known timing, say a loop of dependent instructions?
My first CNL results look all wrong:
I will try to turn off turbo.

Update: Looks OK with turbo off.
Hmm, I don't know how to fix this, though. It seems the readings are incorrect in that case. It uses
Yes, but
I think the manual I followed was written when turbo didn't exist :) Do you have any suggestions about improving it? The logic is in
The "fix" is either to force the user to turn off turbo - you can see how I do this programmatically here:
https://github.com/travisdowns/uarch-bench/blob/master/uarch-bench.sh#L66
Or to do a calibration that allows you to convert from "nominal cycles" as read by
https://github.com/travisdowns/avx-turbo/blob/master/tsc-support.cpp
Yeah, many moons ago there was no frequency scaling (neither turbo nor anti-turbo, i.e., scaling below the nominal freq), so
Then there was a brief period after Intel added frequency scaling where
Turning off turbo is good because you get much more stable results, since you don't get the forced frequency switches when another core spins up (the current core has to slow down because modern chips have turbo multipliers that depend on how many cores are running), but there are also a lot of problems, like even figuring out how to turn off turbo on all systems, the user has to be root, etc.
It's kind of a pity that I don't have Intel hardware at the moment. I would experiment with this a bit, but it's impossible to get it right on the first try. I will research this a bit, though. Thinking about it, this will probably never be a 100% reliable tool, but if I can make it close enough I would be happy.
@kobalicek - my experience with uarch-bench indicates that the calibration approach is fairly robust. At most you sometimes get a wrong calibration due to a wrong assumption: e.g., when I ran on POWER9 I found out that dependent instructions always have a latency of at least 2, so the calculated frequency was half of the real frequency - but at least the error was obvious, and you can correct it once you notice it. Do you have AMD hardware, or something non-x86? I may be interested in some AMD numbers for some random microbenchmarks, since I don't have easy access to AMD to test.
BTW, this is now running in parallel on SKX, SKL and CNL; results should be available in a few more minutes. FWIW, here's the script I used, which might be useful for anyone else who wants to automate this (heavily based on your README):
You run it like
Nice, thanks! I have reduced all my machines to only one, a Ryzen 1700, atm (but I'm planning an upgrade to 16c/32t at the end of the year). Other than that I only have ARM devices like a Raspberry Pi for testing; I'm interested in RISC-V, though.
BTW, I don't want to waste more of your time on this. I would have to fix the timing issues if I want better numbers; I really didn't know it could be that far off initially.
Don't worry, I turned off turbo and the numbers seem good. |
Thanks a lot! I have updated the web-app with the new data here: https://asmjit.com/asmgrid/

- The architectures look pretty similar to me.
- Selecting a few architectures and enabling "Hide equal cols" will show only the rows that differ, which is useful when looking at differences between microarchitectures.

I think I have some work to do here, as I can see that AVX-512 instructions that use k and zmm registers are not executed, but that would take me some time as it's not that high a priority for me at the moment.

I'll take a look at why they don't run.
No need - I have to iterate over instruction signatures instead of doing what I do at the moment; asmjit now has all the information I need in cult to do it properly.

Ah, OK!
Ping me if it gets fixed and I can redo the runs.
There are still some things that are not proper (for example, it's hard to test the latency of cmp, test, bt, and similar instructions, as the result is just flags; I will think of something, but it's a minority of instructions, so it's not that severe, I think). I have also added
Cool! Would you like me to run it on any systems? In addition to the ones above I now have access to Zen 2 and Ice Lake.
Right. Have you seen what uops.info does? They consider each instruction to have a matrix of latencies, one for each combination of input and output, for a typical instruction like
Here's cmp, and they show the latency to the flags output (which is 1 from either input in this case, but other cases are more interesting). This is how I think of instruction latency now, although admittedly it often simplifies to the "single figure" for many instructions with N register inputs and 1 register output where the latency is the same for each input. Not all instructions fit that pattern, though, particularly instructions with more than 1 uop.
The TSC frequency alone doesn't do that, it just lets you convert
Then you run your benchmark, measure realtime, and use the conversion factor to get cycles. Of course, this only works if the CPU frequency is the same during the calibration and the benchmark. That's not always the case. Approaches that are robust against that problem include:
@travisdowns If you have time to run cult on any Intel hardware, I would be interested in the results. I have updated cult to test more stuff, including memory ops, etc. There are still instructions where the latency is wrong (write-only memory ops don't create a dependency, for example), but these are things I will fix in the future, and they don't bother me much, as you can clearly see in the results that the timings are impossible. At the moment I only have a Zen4 desktop and a Tigerlake laptop, so any other arch would help me improve asmgrid, as I will have to delete all the previous tables.
As I mentioned on HN, I can run this on SKL, SKX and CNL (CannonLake) for you.
Are there any specific arguments or format you want the results in, or just capture the output of cult and include it in this issue?