[SYSTEMDS-2718] Matrix Mult Accelerator Comparison #1095
Baunsgaard wants to merge 6 commits into apache:master
Conversation
This commit initializes the performance test suite with a micro benchmark comparing the performance of MKL and the default matrix multiplication. The construction allows easy execution on different hardware platforms to check FP-OPS, comparing the measured throughput against the theoretical hardware specification.
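For reference, here is a minimal sketch of the kind of DML script such a micro benchmark could run (an illustrative example only, not the actual script added in this PR; the 5000x5000 dimensions and 3 repetitions are assumptions based on the discussion below):

# hypothetical micro-benchmark sketch: repeated dense matrix multiplication
X = rand(rows=5000, cols=5000)
Y = rand(rows=5000, cols=5000)
s = 0.0
for(i in 1:3) {
  # consuming the product via sum() ensures the multiplication
  # is actually executed and not optimized away
  s = s + sum(X %*% Y)
}
print("checksum: " + s)

The comparison then amounts to running the same script once with the default Java matrix multiplication and once with the native MKL backend, and comparing the elapsed times and hardware counters reported below.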
On my laptop I get:
Total elapsed time: 21.939 sec.
1 ba+* 21.214 3
Total elapsed time: 23.765 sec.
1 ba+* 22.940 3
Total elapsed time: 24.867 sec.
1 ba+* 24.023 3
Total elapsed time: 26.128 sec.
1 ba+* 25.239 3
Total elapsed time: 25.375 sec.
1 ba+* 24.495 3
342.403,88 msec task-clock # 13,748 CPUs utilized ( +- 3,22% )
911.071.333.279 cycles # 2,661 GHz ( +- 0,60% ) (30,77%)
1.099.760.123.229 instructions # 1,21 insn per cycle ( +- 0,03% ) (38,47%)
Total elapsed time: 4.513 sec.
1 ba+* 3.616 3
Total elapsed time: 4.465 sec.
1 ba+* 3.579 3
Total elapsed time: 4.542 sec.
1 ba+* 3.594 3
Total elapsed time: 4.808 sec.
1 ba+* 3.966 3
Total elapsed time: 4.493 sec.
1 ba+* 3.611 3
34.108,67 msec task-clock # 6,727 CPUs utilized ( +- 1,57% )
85.121.161.235 cycles # 2,496 GHz ( +- 1,27% ) (30,78%)
202.393.564.071 instructions # 2,38 insn per cycle ( +- 0,59% ) (38,46%)
Total elapsed time: 5.259 sec.
1 ba+* 4.333 3
Total elapsed time: 5.359 sec.
1 ba+* 4.471 3
Total elapsed time: 5.086 sec.
1 ba+* 4.175 3
Total elapsed time: 5.277 sec.
1 ba+* 4.316 3
Total elapsed time: 5.495 sec.
1 ba+* 4.493 3
71.032,45 msec task-clock # 12,188 CPUs utilized ( +- 0,96% )
158.323.022.360 cycles # 2,229 GHz ( +- 1,12% ) (30,79%)
232.735.293.658 instructions # 1,47 insn per cycle ( +- 1,33% ) (38,47%)
I would recommend running it with larger matrices and/or many more operations to give the JIT a chance for tiered (cold, warm, hot) just-in-time compilation into native code.
Never mind, I read the results as ms, not seconds. Then I suspect what you see is the vector width of AVX512.
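As a rough sanity check on the magnitudes (assuming two dense 5000x5000 inputs and 3 repetitions, as mentioned further down): one multiplication is about 2 * 5000^3 = 2.5e11 FLOP, so 3 repetitions amount to roughly 7.5e11 FLOP. With the MKL runs on the laptop spending around 3.6 seconds in ba+*, that is on the order of 200 GFLOP/s, a rate that indeed requires wide SIMD/FMA units across all cores.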
On an AMD EPYC 7302 16-Core Processor:
Total elapsed time: 9.860 sec.
1 ba+* 8.542 3
Total elapsed time: 9.754 sec.
1 ba+* 8.410 3
Total elapsed time: 9.542 sec.
1 ba+* 8.217 3
Total elapsed time: 9.568 sec.
1 ba+* 8.349 3
Total elapsed time: 9.738 sec.
1 ba+* 8.330 3
244244.54 msec task-clock # 23.411 CPUs utilized ( +- 0.52% )
615634864321 cycles # 2.521 GHz ( +- 0.13% ) (33.32%)
1099917247652 instructions # 1.79 insn per cycle
Total elapsed time: 4.087 sec.
1 ba+* 2.758 3
Total elapsed time: 4.067 sec.
1 ba+* 2.701 3
Total elapsed time: 3.880 sec.
1 ba+* 2.656 3
Total elapsed time: 4.617 sec.
1 ba+* 3.300 3
Total elapsed time: 4.175 sec.
1 ba+* 2.841 3
50723.62 msec task-clock # 9.997 CPUs utilized ( +- 1.29% )
94475106229 cycles # 1.863 GHz ( +- 1.02% ) (33.37%)
214603521520 instructions # 2.27 insn per cycle
Total elapsed time: 4.084 sec.
1 ba+* 2.683 3
Total elapsed time: 4.100 sec.
1 ba+* 2.701 3
Total elapsed time: 4.030 sec.
1 ba+* 2.670 3
Total elapsed time: 3.828 sec.
1 ba+* 2.651 3
Total elapsed time: 3.976 sec.
1 ba+* 2.660 3
88602.91 msec task-clock # 17.962 CPUs utilized ( +- 0.33% )
165132494019 cycles # 1.864 GHz ( +- 0.68% ) (33.30%)
251221873378 instructions # 1.52 insn per cycle
On a two-socket 2x 28-core Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz:
Total elapsed time: 5.208 sec.
1 ba+* 4.210 3
Total elapsed time: 5.189 sec.
1 ba+* 4.234 3
Total elapsed time: 5.215 sec.
1 ba+* 4.255 3
Total elapsed time: 5.352 sec.
1 ba+* 4.333 3
Total elapsed time: 5.485 sec.
1 ba+* 4.321 3
403205.65 msec task-clock # 68.960 CPUs utilized ( +- 2.07% )
1045022715778 cycles # 2.592 GHz ( +- 2.16% ) (30.72%)
1111149539772 instructions # 1.06 insn per cycle ( +- 0.08% ) (38.41%)
Total elapsed time: 3.651 sec.
1 ba+* 2.165 3
Total elapsed time: 2.687 sec.
1 ba+* 1.535 3
Total elapsed time: 3.269 sec.
1 ba+* 1.924 3
Total elapsed time: 3.148 sec.
1 ba+* 1.818 3
Total elapsed time: 2.899 sec.
1 ba+* 1.547 3
131080.09 msec task-clock # 35.417 CPUs utilized ( +- 7.67% )
327562943703 cycles # 2.499 GHz ( +- 8.49% ) (30.69%)
176246317254 instructions # 0.54 insn per cycle ( +- 3.37% ) (38.37%)
Total elapsed time: 3.132 sec.
1 ba+* 1.490 3
Total elapsed time: 2.763 sec.
1 ba+* 1.505 3
Total elapsed time: 3.474 sec.
1 ba+* 1.524 3
Total elapsed time: 2.930 sec.
1 ba+* 1.516 3
Total elapsed time: 3.184 sec.
1 ba+* 1.662 3
153409.48 msec task-clock # 41.661 CPUs utilized ( +- 7.83% )
397930300980 cycles # 2.594 GHz ( +- 9.09% ) (30.66%)
250592134329 instructions # 0.63 insn per cycle ( +- 1.03% ) (38.33%)
I did try larger matrices, but the JIT compilation times did not increase. Since a 5k x 5k matrix multiplication is this fast (the CPUs take 1 to 8 seconds for 3 reps), is it big enough if we do more than 3 repetitions? Or would you rather suggest using larger matrices and fewer repetitions? We could also do both?
No, your experiments are fine - I misread it as milliseconds without looking at the dimensions. So JIT compilation will already happen during the first matrix multiplication. For me the take-away is that we should look into the influence of the higher turbo frequency, different cache sizes, and other characteristics of your laptop to see if there is something we can do about it.
One difference that might affect performance here is the associativity of the L2 cache - the Intel 6238R setup has 16-way, the AMD EPYC 7302 has 8-way, and your i9-9980HK has 4-way associativity. This might matter because our dense mm cache blocking largely optimizes for common L2 sizes and behavior.