[SYSTEMDS-2718] Matrix Mult Accelerator Comparison #1095

Closed
Baunsgaard wants to merge 6 commits into apache:master from Baunsgaard:NewPerfTest

Conversation

@Baunsgaard
Contributor

This commit initializes the performance test suite with a micro benchmark
comparing the performance of MKL and the default matrix multiplication.
The setup allows easy execution on different hardware platforms
to check FP-OPS, comparing measured throughput against the theoretical
hardware specification.
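To put the elapsed times reported below in FP-OPS terms, the achieved rate can be estimated from the matrix size and the wall-clock time; `matmul_gflops` is a hypothetical helper sketched here, assuming the 5k x 5k dense matrices discussed later in the thread:

```python
def matmul_gflops(n, seconds, reps=1):
    # A dense n x n matrix multiply costs 2*n^3 floating-point
    # operations (one multiply + one add per inner-product step).
    flops = 2.0 * n ** 3 * reps
    return flops / seconds / 1e9

# e.g. three 5000^3 multiplies in ~21.2 s (the first "normal" laptop run)
print(round(matmul_gflops(5000, 21.214, reps=3), 1))  # -> 35.4
```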

@Baunsgaard
Contributor Author

Baunsgaard commented Nov 6, 2020

On my laptop I get the results below, of which the first 3 are normal, then 3 MKL, and then 3 OpenBLAS.
If anyone would like to review, I would like measurements from your machine as well.
It should be as easy as running the script in scripts/perftest/runAll.sh.
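For comparing the repeated runs below, the per-run elapsed times can be averaged from the benchmark log; `mean_elapsed` is a hypothetical helper (not part of this PR), assuming the "Total elapsed time" line format shown in the output:

```python
import re

# Hypothetical helper (not part of the PR): average the
# "Total elapsed time: X sec." lines printed by each benchmark run.
def mean_elapsed(log):
    times = [float(t) for t in
             re.findall(r"Total elapsed time:\s*([\d.]+) sec", log)]
    return sum(times) / len(times) if times else 0.0

sample = "Total elapsed time:\t\t21.939 sec.\nTotal elapsed time:\t\t23.765 sec.\n"
print(round(mean_elapsed(sample), 3))  # -> 22.852
```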

Total elapsed time:		21.939 sec.
 1  ba+*          21.214      3
Total elapsed time:		23.765 sec.
 1  ba+*          22.940      3
Total elapsed time:		24.867 sec.
 1  ba+*          24.023      3
Total elapsed time:		26.128 sec.
 1  ba+*          25.239      3
Total elapsed time:		25.375 sec.
 1  ba+*          24.495      3
        342.403,88 msec task-clock                #   13,748 CPUs utilized            ( +-  3,22% )
   911.071.333.279      cycles                    #    2,661 GHz                      ( +-  0,60% )  (30,77%)
 1.099.760.123.229      instructions              #    1,21  insn per cycle           ( +-  0,03% )  (38,47%)
Total elapsed time:		4.513 sec.
 1  ba+*           3.616      3
Total elapsed time:		4.465 sec.
 1  ba+*           3.579      3
Total elapsed time:		4.542 sec.
 1  ba+*           3.594      3
Total elapsed time:		4.808 sec.
 1  ba+*           3.966      3
Total elapsed time:		4.493 sec.
 1  ba+*           3.611      3
         34.108,67 msec task-clock                #    6,727 CPUs utilized            ( +-  1,57% )
    85.121.161.235      cycles                    #    2,496 GHz                      ( +-  1,27% )  (30,78%)
   202.393.564.071      instructions              #    2,38  insn per cycle           ( +-  0,59% )  (38,46%)
Total elapsed time:		5.259 sec.
 1  ba+*           4.333      3
Total elapsed time:		5.359 sec.
 1  ba+*           4.471      3
Total elapsed time:		5.086 sec.
 1  ba+*           4.175      3
Total elapsed time:		5.277 sec.
 1  ba+*           4.316      3
Total elapsed time:		5.495 sec.
 1  ba+*           4.493      3
         71.032,45 msec task-clock                #   12,188 CPUs utilized            ( +-  0,96% )
   158.323.022.360      cycles                    #    2,229 GHz                      ( +-  1,12% )  (30,79%)
   232.735.293.658      instructions              #    1,47  insn per cycle           ( +-  1,33% )  (38,47%)

@mboehm7
Contributor

mboehm7 commented Nov 6, 2020

I would recommend to run it with larger matrices and/or much more operations to give the JIT a chance for tiered (cold, warm, hot) just-in-time compilation into native code.

@mboehm7
Contributor

mboehm7 commented Nov 6, 2020

Never mind, I read the results as ms, not seconds. Then I suspect what you see is the vector width of AVX512.
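The vector-width effect can be estimated with a back-of-the-envelope peak FLOP-per-cycle figure; the two-FMA-units-per-core assumption below is illustrative and varies by CPU model:

```python
# Peak double-precision FLOP/cycle per core, given SIMD vector width.
# The fma_units=2 default is an assumption; actual counts vary by SKU.
def peak_flop_per_cycle(vector_bits, fma_units=2, element_bits=64):
    lanes = vector_bits // element_bits  # doubles per vector register
    return lanes * 2 * fma_units         # FMA = 1 multiply + 1 add per lane

print(peak_flop_per_cycle(512))  # AVX-512 -> 32
print(peak_flop_per_cycle(256))  # AVX2    -> 16
```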

@Baunsgaard
Contributor Author

On an AMD EPYC 7302 16-Core Processor

Total elapsed time:		9.860 sec.
 1  ba+*           8.542      3
Total elapsed time:		9.754 sec.
 1  ba+*           8.410      3
Total elapsed time:		9.542 sec.
 1  ba+*           8.217      3
Total elapsed time:		9.568 sec.
 1  ba+*           8.349      3
Total elapsed time:		9.738 sec.
 1  ba+*           8.330      3
         244244.54 msec task-clock                #   23.411 CPUs utilized            ( +-  0.52% )
      615634864321      cycles                    #    2.521 GHz                      ( +-  0.13% )  (33.32%)
     1099917247652      instructions              #    1.79  insn per cycle         
Total elapsed time:		4.087 sec.
 1  ba+*           2.758      3
Total elapsed time:		4.067 sec.
 1  ba+*           2.701      3
Total elapsed time:		3.880 sec.
 1  ba+*           2.656      3
Total elapsed time:		4.617 sec.
 1  ba+*           3.300      3
Total elapsed time:		4.175 sec.
 1  ba+*           2.841      3
          50723.62 msec task-clock                #    9.997 CPUs utilized            ( +-  1.29% )
       94475106229      cycles                    #    1.863 GHz                      ( +-  1.02% )  (33.37%)
      214603521520      instructions              #    2.27  insn per cycle         
Total elapsed time:		4.084 sec.
 1  ba+*           2.683      3
Total elapsed time:		4.100 sec.
 1  ba+*           2.701      3
Total elapsed time:		4.030 sec.
 1  ba+*           2.670      3
Total elapsed time:		3.828 sec.
 1  ba+*           2.651      3
Total elapsed time:		3.976 sec.
 1  ba+*           2.660      3
          88602.91 msec task-clock                #   17.962 CPUs utilized            ( +-  0.33% )
      165132494019      cycles                    #    1.864 GHz                      ( +-  0.68% )  (33.30%)
      251221873378      instructions              #    1.52  insn per cycle         

On a two-socket 2x 28-core Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz

Total elapsed time:		5.208 sec.
 1  ba+*           4.210      3
Total elapsed time:		5.189 sec.
 1  ba+*           4.234      3
Total elapsed time:		5.215 sec.
 1  ba+*           4.255      3
Total elapsed time:		5.352 sec.
 1  ba+*           4.333      3
Total elapsed time:		5.485 sec.
 1  ba+*           4.321      3
         403205.65 msec task-clock                #   68.960 CPUs utilized            ( +-  2.07% )
     1045022715778      cycles                    #    2.592 GHz                      ( +-  2.16% )  (30.72%)
     1111149539772      instructions              #    1.06  insn per cycle           ( +-  0.08% )  (38.41%)
Total elapsed time:		3.651 sec.
 1  ba+*           2.165      3
Total elapsed time:		2.687 sec.
 1  ba+*           1.535      3
Total elapsed time:		3.269 sec.
 1  ba+*           1.924      3
Total elapsed time:		3.148 sec.
 1  ba+*           1.818      3
Total elapsed time:		2.899 sec.
 1  ba+*           1.547      3
         131080.09 msec task-clock                #   35.417 CPUs utilized            ( +-  7.67% )
      327562943703      cycles                    #    2.499 GHz                      ( +-  8.49% )  (30.69%)
      176246317254      instructions              #    0.54  insn per cycle           ( +-  3.37% )  (38.37%)
Total elapsed time:		3.132 sec.
 1  ba+*           1.490      3
Total elapsed time:		2.763 sec.
 1  ba+*           1.505      3
Total elapsed time:		3.474 sec.
 1  ba+*           1.524      3
Total elapsed time:		2.930 sec.
 1  ba+*           1.516      3
Total elapsed time:		3.184 sec.
 1  ba+*           1.662      3
         153409.48 msec task-clock                #   41.661 CPUs utilized            ( +-  7.83% )
      397930300980      cycles                    #    2.594 GHz                      ( +-  9.09% )  (30.66%)
      250592134329      instructions              #    0.63  insn per cycle           ( +-  1.03% )  (38.33%)

@Baunsgaard
Contributor Author

I would recommend to run it with larger matrices and/or much more operations to give the JIT a chance for tiered (cold, warm, hot) just-in-time compilation into native code.

I did try larger matrices, but the JIT compilation times did not increase.

Since a 5k x 5k matrix multiplication is this fast on these CPUs (taking 1 to 8 seconds for 3 reps), is that big enough if we do more than 3 repetitions? Or would you rather suggest using larger matrices and fewer repetitions? We could also do both.

@mboehm7
Contributor

mboehm7 commented Nov 6, 2020

No, your experiments are fine - I misread it as milliseconds without looking at the dimensions. So JIT will already happen during the first matrix multiplication.

For me the take-away is that we should look into the influence of the higher turbo frequency, different cache sizes, and other characteristics of your laptop to see if there is something we can do about it.

@mboehm7
Contributor

mboehm7 commented Nov 6, 2020

One difference that might affect performance here is the associativity of the L2 cache - the Intel 6238R setup has 16-way, the AMD EPYC 7302 has 8-way, and your i9-9980HK has 4-way. This might matter because our dense mm cache blocking largely optimizes for common L2 sizes and behavior.
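The cache blocking mentioned above can be sketched as a tiled triple loop; this is an illustrative Python sketch, not SystemDS's actual kernel, and the block size 64 is an assumed tuning parameter that would be chosen to fit the L2 cache:

```python
# Illustrative cache-blocked dense matmul (NOT SystemDS's real kernel):
# iterate over bs x bs tiles so each tile of A, B, and C stays cache-resident
# while it is reused, instead of streaming whole rows through the cache.
def blocked_matmul(A, B, n, bs=64):
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for k0 in range(0, n, bs):
            for j0 in range(0, n, bs):
                for i in range(i0, min(i0 + bs, n)):
                    for k in range(k0, min(k0 + bs, n)):
                        a_ik = A[i][k]  # hoisted scalar, reused across j
                        for j in range(j0, min(j0 + bs, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```

The ikj inner-loop order keeps the innermost access to B and C row-contiguous, which is the same locality idea the dense mm blocking exploits.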

@Baunsgaard Baunsgaard closed this in 22375bb Nov 6, 2020
@Baunsgaard Baunsgaard deleted the NewPerfTest branch November 7, 2020 10:12