[SYSTEMDS-2760] Transpose micro benchmark #1127
Conversation
I'll have a look tonight and see what we can do. Airline was dense, right?
The large 15 mil case seems to have little to no difference. But there still is a bug somewhere.

XPS: scripts/perftest/results/transpose-large.log

```
Total elapsed time: 7.377 sec.
1 r' 4.352 1
Total elapsed time: 7.835 sec.
1 r' 4.649 1
Total elapsed time: 7.659 sec.
1 r' 4.398 1
Total elapsed time: 7.903 sec.
1 r' 4.677 1
Total elapsed time: 7.723 sec.
1 r' 4.445 1
36.435,71 msec task-clock        # 4,264 CPUs utilized   ( +- 1,27% )
134.881.449.707 cycles           # 3,702 GHz             ( +- 0,43% ) (30,65%)
119.303.817.112 instructions     # 0,88 insn per cycle   ( +- 0,37% ) (38,39%)
```

Alpha: scripts/perftest/results/transpose-large.log

```
Total elapsed time: 8.531 sec.
1 r' 5.459 1
Total elapsed time: 8.366 sec.
1 r' 5.412 1
Total elapsed time: 10.413 sec.
1 r' 7.507 1
Total elapsed time: 8.373 sec.
1 r' 5.420 1
Total elapsed time: 8.254 sec.
1 r' 5.394 1
100414.75 msec task-clock        # 10.271 CPUs utilized  ( +- 5.07% )
314073685855 cycles              # 3.128 GHz             ( +- 4.82% ) (30.86%)
127951221368 instructions        # 0.41 insn per cycle   ( +- 3.10% ) (38.62%)
```
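To make the two perf logs above directly comparable (the XPS log uses European number formatting), the instructions-per-cycle figures can be recomputed from the raw counts. This is plain arithmetic on the numbers in the logs, not SystemDS code:

```python
# Raw counters copied from the two perf logs above.
xps_cycles, xps_instructions = 134_881_449_707, 119_303_817_112
alpha_cycles, alpha_instructions = 314_073_685_855, 127_951_221_368

xps_ipc = xps_instructions / xps_cycles        # ~0.88, matches the XPS log
alpha_ipc = alpha_instructions / alpha_cycles  # ~0.41, matches the Alpha log

# Both machines retire a similar number of instructions (~7% apart),
# but Alpha burns ~2.3x the cycles, so the gap is stalls, not extra work.
print(round(xps_ipc, 2), round(alpha_ipc, 2), round(alpha_cycles / xps_cycles, 2))
```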
Yes, airline is dense, and I don't seem to be able to reproduce the bad performance when calling transpose in a script. The dimensions of airline are:
OK, I just pushed some minor performance improvements for sparse-sparse transpose operations, which reduced the execution time of ten 2.5M x 50 (sparsity=0.1, seed=12) operations from 2.7s to 1.9s. Furthermore, I found the following:
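The workload described above can be approximated outside SystemDS. The following is a minimal SciPy sketch of the same shape of experiment (matrix shrunk 10x so it runs quickly; the timing loop and `tocsr()` materialization are my own assumptions, not the SystemDS benchmark code):

```python
import time
import scipy.sparse as sp

# Stand-in for the case above: a tall sparse matrix with sparsity 0.1
# (seed 12), transposed ten times. Shrunk from 2.5M to 250k rows here.
X = sp.random(250_000, 50, density=0.1, format="csr", random_state=12)

start = time.perf_counter()
for _ in range(10):
    # tocsr() forces the transpose to be materialized in row-major layout,
    # mirroring a real sparse-sparse transpose rather than a lazy view.
    T = X.T.tocsr()
elapsed = time.perf_counter() - start
print(f"10 transposes: {elapsed:.3f}s")
```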
This micro benchmark considers multiple cases: tallskinny, shortwide, and "normal" matrices. It gives an indication of whether the transpose is parallelizing and using the hardware appropriately.
Force-pushed from f457ddf to ce706c8.
Looking at before and after (I tested this by dropping the transpose commit from the history), it looks like I might have done something wrong in the initial tests. That said, it does not look like the changes had any impact, but it did make me notice that the variance between executions of the wide transpose is large: sometimes it takes 5 seconds, sometimes 2.5. I'm guessing it has to do with the two NUMA nodes?

The full transpose micro benchmark, after the change, on Alpha:

Alpha before:
I will look at this! 👍
Just as a side note: these operations are best profiled with more operations, printing the execution time per operation (simply uncomment that in
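The advice above — more repetitions, with per-operation timing — can be sketched outside SystemDS like this (the repetition count, warmup, and matrix size are my own assumptions, not values from the benchmark script):

```python
import time
import numpy as np

def time_per_op(op, reps=20, warmup=3):
    """Run op() repeatedly and print the time of each repetition, so
    outliers (warmup effects, NUMA placement) stay visible instead of
    being averaged away."""
    for _ in range(warmup):
        op()
    times = []
    for i in range(reps):
        start = time.perf_counter()
        op()
        times.append(time.perf_counter() - start)
        print(f"rep {i}: {times[-1] * 1000:.2f} ms")
    return times

X = np.random.rand(2000, 500)
times = time_per_op(lambda: np.ascontiguousarray(X.T))
```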
This PR contains a simple addition to the micro benchmarks: this time, the transpose of a matrix is measured.

Three basic cases:

- "skinny" with 2.5 mil rows and 50 cols
- "wide" with 50 rows and 2.5 mil cols
- "full" with 20000 rows and 5000 cols

Each is run both sparse and dense.
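A rough sketch of the three cases, with NumPy standing in for the actual DML benchmark script and the sizes scaled down 10x so it finishes quickly (an illustration of the benchmark's shape, not the PR's code):

```python
import time
import numpy as np

# The three shapes from the description above, scaled down 10x;
# the real benchmark uses 2.5M / 20000 x 5000.
cases = {
    "skinny": (250_000, 50),
    "wide": (50, 250_000),
    "full": (2_000, 500),
}

for name, (rows, cols) in cases.items():
    X = np.random.rand(rows, cols)
    start = time.perf_counter()
    T = np.ascontiguousarray(X.T)  # force actual data movement
    elapsed = time.perf_counter() - start
    print(f"{name:6s} {rows}x{cols}: {elapsed * 1000:.2f} ms")
```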
On 3 different systems, the time a transpose takes varies drastically, and currently not as expected. The one case that is currently very inefficient is a sparse wide matrix, which takes 4-5 times longer on many-core machines than on machines with few cores.
Alpha:
xps:
tango:
For reference, the reason I'm addressing this is that compression performs a transpose at the beginning. On my airline dataset, this transpose takes 16 seconds on Alpha, while on my laptop it takes 1 second.