I am running the COSMA miniapp on a 72-core Xeon machine with the following parameters:

$ parallel_cosma -m 8688 -n 8688 -k 8688 -r 3

The last line of the stdout reads:

COSMA TIMES [ms] = 458 460 771
I am curious about the large spread between the fastest and slowest multiplication. The fastest time corresponds to about 40 GFLOP/s per core, which is a good number for this machine, while the slowest implies only about 23 GFLOP/s per core.
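As a sanity check on those figures, here is a minimal sketch that recomputes the per-core rates from the reported times, assuming the usual 2·m·n·k flop count for a general matrix multiplication and the 72 cores mentioned above:

```cpp
// Recompute per-core GFLOP/s from the COSMA TIMES reported above,
// assuming 2*m*n*k floating-point operations for the multiplication.
#include <cstdio>

int main() {
    const double m = 8688, n = 8688, k = 8688;
    const double flops = 2.0 * m * n * k;       // ~1.31e12 FLOPs
    const int cores = 72;
    const double times_ms[] = {458, 460, 771};  // reported run times
    for (double t : times_ms) {
        double gflops_per_core = flops / (t * 1e-3) / cores / 1e9;
        std::printf("%4.0f ms -> %.1f GFLOP/s per core\n", t, gflops_per_core);
    }
    return 0;
}
```

This prints roughly 39.8, 39.6, and 23.6 GFLOP/s per core, consistent with the 40 vs. 23 figures quoted.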
Am I right that there is a ~300 ms overhead for finding the optimal "parallelization strategy"? Which of the two numbers would be fair to compare against other libraries such as ScaLAPACK?
I am aware that this is an extreme example, but a spread of 10-20% between the fastest and slowest runs is very typical.
> Am I right that there is a 300 ms overhead finding the optimal "parallelization strategy"?
No, the overhead is very likely due to library initialization during the first run of the miniapp.
Multithreaded MKL is usually the library that introduces the most overhead, since it has to initialize the OpenMP environment and allocate some memory during the first library calls. On some systems, MPI also adds overhead during the first communications.
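For a fair comparison against other libraries, the usual approach is to perform one untimed warm-up call so that these one-time initialization costs are excluded. A minimal sketch of that pattern is below; `dummy_multiply()` is a hypothetical placeholder for the multiplication under test, not the COSMA API:

```cpp
// Sketch of the usual warm-up pattern for benchmarking: the first call pays
// one-time costs (OpenMP thread setup in MKL, first MPI communications,
// buffer allocation), so it is run once and excluded from the timed repetitions.
// dummy_multiply() is a placeholder workload, not the COSMA API.
#include <chrono>
#include <cstdio>

static double sink = 0.0;  // prevents the compiler from removing the dummy work

void dummy_multiply() {
    // Stand-in for the multiplication under test.
    double acc = 0.0;
    for (int i = 0; i < 10000000; ++i) acc += static_cast<double>(i) * 1e-9;
    sink += acc;
}

int main() {
    dummy_multiply();  // warm-up run: absorbs first-call initialization overhead

    for (int rep = 0; rep < 3; ++rep) {
        auto start = std::chrono::steady_clock::now();
        dummy_multiply();
        auto stop = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        std::printf("rep %d: %.1f ms\n", rep, ms);
    }
    return 0;
}
```

With a warm-up in place, the remaining spread between repetitions reflects run-to-run variation rather than initialization cost.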