[SYSTEMDS-2718] Matrix Mult Accelerator Comparison #1095
Baunsgaard wants to merge 6 commits into apache:master
Conversation
This commit initializes the performance test suite with a micro benchmark comparing the performance of MKL and the default matrix multiplication. The construction allows easy execution on different hardware platforms to check FP-OPS, comparing the measured throughput against the theoretical hardware specification.
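For reference, here is a minimal sketch of the kind of DML script such a micro benchmark could run (an illustrative example only, not the actual script added in this PR; the 5000x5000 dimensions and 3 repetitions are assumptions based on the discussion below):

# hypothetical micro-benchmark sketch: repeated dense matrix multiplication
X = rand(rows=5000, cols=5000)
Y = rand(rows=5000, cols=5000)
s = 0.0
for(i in 1:3) {
  # consuming the product via sum() ensures the multiplication
  # is actually executed and not optimized away
  s = s + sum(X %*% Y)
}
print("checksum: " + s)

The comparison then amounts to running the same script once with the default Java matrix multiplication and once with the native MKL backend, and comparing the elapsed times and hardware counters reported below.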
On my laptop I get:
Total elapsed time: 21.939 sec.
1 ba+* 21.214 3
Total elapsed time: 23.765 sec.
1 ba+* 22.940 3
Total elapsed time: 24.867 sec.
1 ba+* 24.023 3
Total elapsed time: 26.128 sec.
1 ba+* 25.239 3
Total elapsed time: 25.375 sec.
1 ba+* 24.495 3
342.403,88 msec task-clock # 13,748 CPUs utilized ( +- 3,22% )
911.071.333.279 cycles # 2,661 GHz ( +- 0,60% ) (30,77%)
1.099.760.123.229 instructions # 1,21 insn per cycle ( +- 0,03% ) (38,47%)
Total elapsed time: 4.513 sec.
1 ba+* 3.616 3
Total elapsed time: 4.465 sec.
1 ba+* 3.579 3
Total elapsed time: 4.542 sec.
1 ba+* 3.594 3
Total elapsed time: 4.808 sec.
1 ba+* 3.966 3
Total elapsed time: 4.493 sec.
1 ba+* 3.611 3
34.108,67 msec task-clock # 6,727 CPUs utilized ( +- 1,57% )
85.121.161.235 cycles # 2,496 GHz ( +- 1,27% ) (30,78%)
202.393.564.071 instructions # 2,38 insn per cycle ( +- 0,59% ) (38,46%)
Total elapsed time: 5.259 sec.
1 ba+* 4.333 3
Total elapsed time: 5.359 sec.
1 ba+* 4.471 3
Total elapsed time: 5.086 sec.
1 ba+* 4.175 3
Total elapsed time: 5.277 sec.
1 ba+* 4.316 3
Total elapsed time: 5.495 sec.
1 ba+* 4.493 3
71.032,45 msec task-clock # 12,188 CPUs utilized ( +- 0,96% )
158.323.022.360 cycles # 2,229 GHz ( +- 1,12% ) (30,79%)
232.735.293.658 instructions # 1,47 insn per cycle ( +- 1,33% ) (38,47%)
I would recommend running it with larger matrices and/or many more operations to give the JIT a chance for tiered (cold, warm, hot) just-in-time compilation into native code.
Never mind, I read the results as ms, not seconds. Then I suspect what you see is the vector width of AVX512.
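As a rough sanity check on the magnitudes (assuming two dense 5000x5000 inputs and 3 repetitions, as mentioned further down): one multiplication is about 2 * 5000^3 = 2.5e11 FLOP, so 3 repetitions amount to roughly 7.5e11 FLOP. With the MKL runs on the laptop spending around 3.6 seconds in ba+*, that is on the order of 200 GFLOP/s, a rate that indeed requires wide SIMD/FMA units across all cores.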
On an AMD EPYC 7302 16-Core Processor:
Total elapsed time: 9.860 sec.
1 ba+* 8.542 3
Total elapsed time: 9.754 sec.
1 ba+* 8.410 3
Total elapsed time: 9.542 sec.
1 ba+* 8.217 3
Total elapsed time: 9.568 sec.
1 ba+* 8.349 3
Total elapsed time: 9.738 sec.
1 ba+* 8.330 3
244244.54 msec task-clock # 23.411 CPUs utilized ( +- 0.52% )
615634864321 cycles # 2.521 GHz ( +- 0.13% ) (33.32%)
1099917247652 instructions # 1.79 insn per cycle
Total elapsed time: 4.087 sec.
1 ba+* 2.758 3
Total elapsed time: 4.067 sec.
1 ba+* 2.701 3
Total elapsed time: 3.880 sec.
1 ba+* 2.656 3
Total elapsed time: 4.617 sec.
1 ba+* 3.300 3
Total elapsed time: 4.175 sec.
1 ba+* 2.841 3
50723.62 msec task-clock # 9.997 CPUs utilized ( +- 1.29% )
94475106229 cycles # 1.863 GHz ( +- 1.02% ) (33.37%)
214603521520 instructions # 2.27 insn per cycle
Total elapsed time: 4.084 sec.
1 ba+* 2.683 3
Total elapsed time: 4.100 sec.
1 ba+* 2.701 3
Total elapsed time: 4.030 sec.
1 ba+* 2.670 3
Total elapsed time: 3.828 sec.
1 ba+* 2.651 3
Total elapsed time: 3.976 sec.
1 ba+* 2.660 3
88602.91 msec task-clock # 17.962 CPUs utilized ( +- 0.33% )
165132494019 cycles # 1.864 GHz ( +- 0.68% ) (33.30%)
251221873378 instructions # 1.52 insn per cycle
On a two-socket 2x 28-core Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz:
Total elapsed time: 5.208 sec.
1 ba+* 4.210 3
Total elapsed time: 5.189 sec.
1 ba+* 4.234 3
Total elapsed time: 5.215 sec.
1 ba+* 4.255 3
Total elapsed time: 5.352 sec.
1 ba+* 4.333 3
Total elapsed time: 5.485 sec.
1 ba+* 4.321 3
403205.65 msec task-clock # 68.960 CPUs utilized ( +- 2.07% )
1045022715778 cycles # 2.592 GHz ( +- 2.16% ) (30.72%)
1111149539772 instructions # 1.06 insn per cycle ( +- 0.08% ) (38.41%)
Total elapsed time: 3.651 sec.
1 ba+* 2.165 3
Total elapsed time: 2.687 sec.
1 ba+* 1.535 3
Total elapsed time: 3.269 sec.
1 ba+* 1.924 3
Total elapsed time: 3.148 sec.
1 ba+* 1.818 3
Total elapsed time: 2.899 sec.
1 ba+* 1.547 3
131080.09 msec task-clock # 35.417 CPUs utilized ( +- 7.67% )
327562943703 cycles # 2.499 GHz ( +- 8.49% ) (30.69%)
176246317254 instructions # 0.54 insn per cycle ( +- 3.37% ) (38.37%)
Total elapsed time: 3.132 sec.
1 ba+* 1.490 3
Total elapsed time: 2.763 sec.
1 ba+* 1.505 3
Total elapsed time: 3.474 sec.
1 ba+* 1.524 3
Total elapsed time: 2.930 sec.
1 ba+* 1.516 3
Total elapsed time: 3.184 sec.
1 ba+* 1.662 3
153409.48 msec task-clock # 41.661 CPUs utilized ( +- 7.83% )
397930300980 cycles # 2.594 GHz ( +- 9.09% ) (30.66%)
250592134329 instructions # 0.63 insn per cycle ( +- 1.03% ) (38.33%)
I did try larger matrices, but the JIT compilation times did not increase. Since a 5k x 5k matrix multiplication is this fast (the CPUs take 1 to 8 seconds for 3 reps), is it big enough if we do more than 3 repetitions? Or would you rather suggest using larger matrices and fewer repetitions? We could also do both?
No, your experiments are fine - I misread it as milliseconds without looking at the dimensions. So JIT compilation will already happen during the first matrix multiplication. For me the take-away is that we should look into the influence of the higher turbo frequency, different cache sizes, and other characteristics of your laptop to see if there is something we can do about it.
One difference that might affect performance here is the associativity of the L2 cache - the Intel 6238R setup has 16-way, the AMD EPYC 7302 has 8-way, and your i9-9980HK has 4-way associativity. This might matter because our dense mm cache blocking largely optimizes for common L2 sizes and behavior.