Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMDS-2760] Transpose micro benchmark #1127

Closed
wants to merge 2 commits into from

Conversation

Baunsgaard
Copy link
Contributor

This PR contains a simple addition to the micro benchmarks.
This time transpose of a matrix is measured.

3 basic cases:

"skinny" with 2.5mil rows 50 cols
"wide" with 50 cols and 2.5 mil rows
"full" with 20000 rows and 5000 cols

run sparse and dense.

on 3 different systems the time it takes for a transpose varies drastically, and currently not as expected:

The one case that is very inefficient currently is a sparse wide matrix, that takes 4-5 times longer on many core machines compared to a few cores.

Alpha:

scripts/perftest/results/transpose-wide-0.1.log
Total elapsed time:		22.405 sec.
 1  r'            20.578      5
Total elapsed time:		24.200 sec.
 1  r'            22.324      5
Total elapsed time:		27.786 sec.
 1  r'            25.900      5
Total elapsed time:		25.258 sec.
 1  r'            23.406      5
Total elapsed time:		22.654 sec.
 1  r'            20.861      5
        1532631.15 msec task-clock                #   61.009 CPUs utilized            ( +-  4.47% )
     4621538496317      cycles                    #    3.015 GHz                      ( +-  4.35% )  (30.76%)
      659745166202      instructions              #    0.14  insn per cycle           ( +- 11.77% )  (38.45%)

xps:

scripts/perftest/results/transpose-wide-0.1.log
Total elapsed time:             6.269 sec.
 1  r'             4.907      5
Total elapsed time:             6.356 sec.
 1  r'             5.004      5
Total elapsed time:             6.482 sec.
 1  r'             5.108      5
Total elapsed time:             6.511 sec.
 1  r'             5.022      5
Total elapsed time:             6.428 sec.
 1  r'             5.061      5
         60.976,97 msec task-clock                #    8,790 CPUs utilized            ( +-  0,75% )
   187.499.949.636      cycles                    #    3,075 GHz                      ( +-  0,29% )  (30,65%)
    92.529.928.550      instructions              #    0,49  insn per cycle           ( +-  0,36% )  (38,34%)

tango:

scripts/perftest/results/transpose-wide-0.1.log
Total elapsed time:		23.525 sec.
 1  r'            21.169      5
Total elapsed time:		21.440 sec.
 1  r'            19.291      5
Total elapsed time:		23.235 sec.
 1  r'            21.014      5
Total elapsed time:		23.051 sec.
 1  r'            21.037      5
Total elapsed time:		22.883 sec.
 1  r'            20.551      5
         454278.39 msec task-clock                #   19.145 CPUs utilized            ( +-  1.71% )
     1203991163914      cycles                    #    2.650 GHz                      ( +-  1.64% )  (33.34%)
      236261779038      instructions              #    0.20  insn per cycle      

For refference why i'm addressing this is because compression is doing a transpose in the beginning.
On my airlines dataset, this transpose takes 16 seconds on alpha while, on my laptop 1 second.

@mboehm7
Copy link
Contributor

mboehm7 commented Dec 17, 2020

I'll have a look tonight and see what we can do. Airline was dense, right?

@Baunsgaard
Copy link
Contributor Author

The large 15 mil case seems to have little to no difference.

But there still is a bug somewhere.

XPS:

scripts/perftest/results/transpose-large.log
Total elapsed time:             7.377 sec.
 1  r'             4.352      1
Total elapsed time:             7.835 sec.
 1  r'             4.649      1
Total elapsed time:             7.659 sec.
 1  r'             4.398      1
Total elapsed time:             7.903 sec.
 1  r'             4.677      1
Total elapsed time:             7.723 sec.
 1  r'             4.445      1
         36.435,71 msec task-clock                #    4,264 CPUs utilized            ( +-  1,27% )
   134.881.449.707      cycles                    #    3,702 GHz                      ( +-  0,43% )  (30,65%)
   119.303.817.112      instructions              #    0,88  insn per cycle           ( +-  0,37% )  (38,39%)

Alpha:

scripts/perftest/results/transpose-large.log
Total elapsed time:		8.531 sec.
 1  r'             5.459      1
Total elapsed time:		8.366 sec.
 1  r'             5.412      1
Total elapsed time:		10.413 sec.
 1  r'             7.507      1
Total elapsed time:		8.373 sec.
 1  r'             5.420      1
Total elapsed time:		8.254 sec.
 1  r'             5.394      1
         100414.75 msec task-clock                #   10.271 CPUs utilized            ( +-  5.07% )
      314073685855      cycles                    #    3.128 GHz                      ( +-  4.82% )  (30.86%)
      127951221368      instructions              #    0.41  insn per cycle           ( +-  3.10% )  (38.62%)

@Baunsgaard
Copy link
Contributor Author

Baunsgaard commented Dec 17, 2020

I'll have a look tonight and see what we can do. Airline was dense, right?

Yes airline is dense, and i don't seem to be able to reproduce the bad performance calling transpose in a script.

dimensions on airline is:
14.5mil row, 29 col, 220 mil nnz

@mboehm7
Copy link
Contributor

mboehm7 commented Dec 17, 2020

ok, I just pushed some minor performance improvements for sparse-sparse transpose operations which reduced the execution time of ten 2.5M x 50 (sparsity=0.1, seed=12) operations from 2.7 to 1.9s. Furthermore, I found the following:

  • For dense transpose operations, we have two significant parts: allocating the dense output, and the multi-threaded transpose operation. On a box with 112 vcores, the allocation is 10x more expensive than the actual transpose operation. The conclusion would be an in-place transpose wherever possible. For example, compression is injected directly after the persistent read which makes it safe to use in-place by default for both local and distributed compression. This approach would not just improve compression times but also eliminate the unnecessary temporary memory requirements. I leave this up to you though.
  • In all my tests the sparse-sparse operations were slower for tall&skinny compared to short&wide of the same dimensions. Looking at the implementation, it also makes sense because the sparse-sparse transpose parallelizes over input columns which allows unsynchronized output row updates. Did the above results maybe originate from earlier experiments where the scripts inverted the input arguments?

This micro benchmark considers multiple cases, tallskinny, shortwide
and "normal" matrices. It gives an indication of if the transpose is
parallelizing and using the hardware appropriately.
@Baunsgaard
Copy link
Contributor Author

When looking at before and after (the way i tested it was dropping the transpose commit from the history.) it looks like i might have done something wrong in the initial tests. That said, it does not look like the changes had any impact, but it did make me notice the difference between executions on the wide transpose is large. Sometimes it takes 5 seconds sometimes 2.5 I'm guessing it has to do with the two NUMA nodes?

The Full transpose micro benchmark:

After change Alpha

scripts/perftest/results/transpose-skinny-1.0.log
Total elapsed time:		5.177 sec.
 1  r'             2.567      5
Total elapsed time:		5.592 sec.
 1  r'             2.487      5
Total elapsed time:		5.394 sec.
 2  r'             2.393      5
Total elapsed time:		5.607 sec.
 1  r'             2.496      5
Total elapsed time:		5.361 sec.
 1  r'             2.531      5
         195735.81 msec task-clock                #   31.188 CPUs utilized            ( +-  3.50% )
      595845281584      cycles                    #    3.044 GHz                      ( +-  3.34% )  (30.75%)
       67405027834      instructions              #    0.11  insn per cycle           ( +-  2.26% )  (38.51%)
scripts/perftest/results/transpose-wide-1.0.log
Total elapsed time:		4.870 sec.
 1  r'             2.439      5
Total elapsed time:		5.466 sec.
 1  r'             2.418      5
Total elapsed time:		5.381 sec.
 1  r'             2.393      5
Total elapsed time:		5.257 sec.
 1  r'             2.343      5
Total elapsed time:		4.880 sec.
 1  r'             2.453      5
         197370.59 msec task-clock                #   32.701 CPUs utilized            ( +-  6.74% )
      598434626116      cycles                    #    3.032 GHz                      ( +-  6.70% )  (30.76%)
       70128163005      instructions              #    0.12  insn per cycle           ( +-  1.65% )  (38.51%)
scripts/perftest/results/transpose-full-1.0.log
Total elapsed time:		3.736 sec.
 2  r'             1.343      5
Total elapsed time:		3.858 sec.
 2  r'             1.326      5
Total elapsed time:		3.500 sec.
 2  r'             1.299      5
Total elapsed time:		3.894 sec.
 2  r'             1.305      5
Total elapsed time:		3.526 sec.
 2  r'             1.304      5
         104490.76 msec task-clock                #   22.819 CPUs utilized            ( +-  1.56% )
      320478636150      cycles                    #    3.067 GHz                      ( +-  1.69% )  (30.80%)
       62146562879      instructions              #    0.19  insn per cycle           ( +-  1.59% )  (38.55%)
scripts/perftest/results/transpose-skinny-0.1.log
Total elapsed time:		2.701 sec.
 1  r'             1.437      5
Total elapsed time:		2.659 sec.
 1  r'             1.141      5
Total elapsed time:		3.174 sec.
 1  r'             1.761      5
Total elapsed time:		2.705 sec.
 1  r'             1.103      5
Total elapsed time:		3.112 sec.
 1  r'             1.472      5
         152922.25 msec task-clock                #   43.917 CPUs utilized            ( +-  5.32% )
      473697710114      cycles                    #    3.098 GHz                      ( +-  5.32% )  (31.11%)
       75871932728      instructions              #    0.16  insn per cycle           ( +-  2.13% )  (38.92%)
scripts/perftest/results/transpose-wide-0.1.log
Total elapsed time:		7.215 sec.
 1  r'             5.376      5
Total elapsed time:		6.703 sec.
 1  r'             4.871      5
Total elapsed time:		4.625 sec.
 1  r'             2.815      5
Total elapsed time:		4.400 sec.
 1  r'             2.592      5
Total elapsed time:		5.506 sec.
 1  r'             3.721      5
         214645.79 msec task-clock                #   33.943 CPUs utilized            ( +- 18.68% )
      658068071617      cycles                    #    3.066 GHz                      ( +- 18.75% )  (30.71%)
       78768925872      instructions              #    0.12  insn per cycle           ( +- 21.76% )  (38.42%)
scripts/perftest/results/transpose-full-0.1.log
Total elapsed time:		1.368 sec.
 1  r'             0.583      5
Total elapsed time:		1.365 sec.
 1  r'             0.574      5
Total elapsed time:		1.724 sec.
 1  r'             0.835      5
Total elapsed time:		1.564 sec.
 1  r'             0.708      5
Total elapsed time:		1.404 sec.
 1  r'             0.522      5
          79268.38 msec task-clock                #   38.130 CPUs utilized            ( +-  8.03% )
      239815721367      cycles                    #    3.025 GHz                      ( +-  7.83% )  (30.85%)
       32295607242      instructions              #    0.13  insn per cycle           ( +-  2.61% )  (38.69%)
scripts/perftest/results/transpose-large.log
Total elapsed time:		34.586 sec.
 1  r'            31.577      5
Total elapsed time:		31.789 sec.
 1  r'            28.383      5
Total elapsed time:		31.772 sec.
 1  r'            28.304      5
Total elapsed time:		31.899 sec.
 1  r'            28.529      5
Total elapsed time:		32.218 sec.
 1  r'            28.521      5
         220380.73 msec task-clock                #    6.530 CPUs utilized            ( +-  4.50% )
      702976272821      cycles                    #    3.190 GHz                      ( +-  4.32% )  (30.77%)
      341876674221      instructions              #    0.49  insn per cycle           ( +-  1.53% )  (38.51%)

Alpha Before:

scripts/perftest/results/transpose-skinny-1.0.log
Total elapsed time:		4.930 sec.
 1  r'             2.404      5
Total elapsed time:		5.457 sec.
 2  r'             2.394      5
Total elapsed time:		5.097 sec.
 1  r'             2.435      5
Total elapsed time:		5.163 sec.
 1  r'             2.422      5
Total elapsed time:		4.820 sec.
 1  r'             2.399      5
         168393.69 msec task-clock                #   28.362 CPUs utilized            ( +-  5.87% )
      509712089539      cycles                    #    3.027 GHz                      ( +-  5.78% )  (30.68%)
       64186798883      instructions              #    0.13  insn per cycle           ( +-  2.02% )  (38.44%)
scripts/perftest/results/transpose-wide-1.0.log
Total elapsed time:		5.288 sec.
 1  r'             2.408      5
Total elapsed time:		4.944 sec.
 1  r'             2.386      5
Total elapsed time:		5.192 sec.
 1  r'             2.440      5
Total elapsed time:		4.996 sec.
 1  r'             2.410      5
Total elapsed time:		5.010 sec.
 1  r'             2.450      5
         179656.42 msec task-clock                #   30.310 CPUs utilized            ( +-  4.60% )
      543794617678      cycles                    #    3.027 GHz                      ( +-  4.54% )  (30.82%)
       68647994631      instructions              #    0.13  insn per cycle           ( +-  1.66% )  (38.59%)
scripts/perftest/results/transpose-full-1.0.log
Total elapsed time:		4.217 sec.
 2  r'             1.321      5
Total elapsed time:		3.806 sec.
 2  r'             1.304      5
Total elapsed time:		3.456 sec.
 2  r'             0.864      5
Total elapsed time:		4.261 sec.
 2  r'             1.303      5
Total elapsed time:		3.254 sec.
 2  r'             0.853      5
         117925.13 msec task-clock                #   25.265 CPUs utilized            ( +-  7.35% )
      358782539233      cycles                    #    3.042 GHz                      ( +-  7.25% )  (30.63%)
       59148304387      instructions              #    0.16  insn per cycle           ( +-  1.26% )  (38.40%)
scripts/perftest/results/transpose-skinny-0.1.log
Total elapsed time:		3.027 sec.
 1  r'             1.638      5
Total elapsed time:		3.016 sec.
 1  r'             1.583      5
Total elapsed time:		2.768 sec.
 1  r'             1.461      5
Total elapsed time:		3.227 sec.
 1  r'             1.709      5
Total elapsed time:		2.434 sec.
 1  r'             1.421      5
         103834.93 msec task-clock                #   29.698 CPUs utilized            ( +-  6.79% )
      324467854857      cycles                    #    3.125 GHz                      ( +-  6.93% )  (30.72%)
       47190326093      instructions              #    0.15  insn per cycle           ( +-  1.21% )  (38.48%)
scripts/perftest/results/transpose-wide-0.1.log
Total elapsed time:		4.556 sec.
 1  r'             2.705      5
Total elapsed time:		4.808 sec.
 1  r'             3.000      5
Total elapsed time:		4.250 sec.
 1  r'             2.398      5
Total elapsed time:		7.544 sec.
 1  r'             5.691      5
Total elapsed time:		5.221 sec.
 1  r'             3.368      5
         179756.00 msec task-clock                #   30.373 CPUs utilized            ( +- 24.85% )
      548210698316      cycles                    #    3.050 GHz                      ( +- 24.92% )  (30.83%)
       71613380037      instructions              #    0.13  insn per cycle           ( +- 24.12% )  (38.56%)
scripts/perftest/results/transpose-full-0.1.log
Total elapsed time:		1.314 sec.
 1  r'             0.533      5
Total elapsed time:		1.489 sec.
 1  r'             0.629      5
Total elapsed time:		1.621 sec.
 1  r'             0.823      5
Total elapsed time:		1.269 sec.
 1  r'             0.518      5
Total elapsed time:		1.346 sec.
 1  r'             0.532      5
          69956.45 msec task-clock                #   35.093 CPUs utilized            ( +-  8.36% )
      212798679757      cycles                    #    3.042 GHz                      ( +-  8.18% )  (30.91%)
       33221409066      instructions              #    0.16  insn per cycle           ( +-  3.80% )  (38.69%)
scripts/perftest/results/transpose-large.log
Total elapsed time:		34.882 sec.
 1  r'            32.116      5
Total elapsed time:		31.270 sec.
 1  r'            28.360      5
Total elapsed time:		32.466 sec.
 1  r'            28.763      5
Total elapsed time:		33.783 sec.
 1  r'            30.827      5
Total elapsed time:		34.388 sec.
 1  r'            31.564      5
         226007.94 msec task-clock                #    6.518 CPUs utilized            ( +-  4.21% )
      719315156077      cycles                    #    3.183 GHz                      ( +-  4.15% )  (30.82%)
      350177806352      instructions              #    0.49  insn per cycle           ( +-  1.34% )  (38.56%)

@Baunsgaard
Copy link
Contributor Author

* For dense transpose operations, we have two significant parts: allocating the dense output, and the multi-threaded transpose operation. On a box with 112 vcores, the allocation is 10x more expensive than the actual transpose operation. The conclusion would be an in-place transpose wherever possible. For example, compression is injected directly after the persistent read which makes it safe to use in-place by default for both local and distributed compression. This approach would not just improve compression times but also eliminate the unnecessary temporary memory requirements. I leave this up to you though.

I will look at this! 👍

@mboehm7
Copy link
Contributor

mboehm7 commented Dec 18, 2020

Just as a side note: these operations are best profiled with more operations, and printing the execution time per operation (simply decomment that in LibMatrixReorg) then you'll see that often the variance comes from individual operations (e.g., first 2-3 times due to JIT, or arbitrary ones due to GC). Generally, we try to optimize for the steady state after JIT and without one-time GC influences. Anyway, thanks for inspiring the sparse-sparse improvements.

@Baunsgaard Baunsgaard closed this Dec 21, 2020
@Baunsgaard Baunsgaard deleted the TransposeMiscroBenchmark branch January 14, 2021 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants