kostrzewa edited this page Apr 5, 2012 · 4 revisions

Testing the current hybrid code on Jugene, I've obtained the following results when running on just one compute card. All results are in Mflops from the benchmark application. The local lattice size is 2x24x24x24, with j_max preset to 2048. MPI parallelisation is in one dimension only. All results are from a single run each, so statistical anomalies may be present.

The hybrid / OMP version of the code only compiles cleanly with -O2 or with -O3 -qstrict. (With -O3 alone, the compiler stalls in the IPA step and just sits there for hours, using more and more memory.)

## C99 Complex

### disable-halfspinor

So far, the halfspinor version of the BG/X code does not seem to work with OpenMP.

4 MPI tasks (-np 4 -mode VN)

645 / nocom: 771

Bandwidth ~ 548 MB/s

2 MPI tasks (-np 2 -mode VN)

717 / nocom: 789

Bandwidth ~ 1100 MB/s

1 MPI task (-np 1 -mode VN)

754 / nocom: 803

Bandwidth ~ 1700 MB/s

2 MPI tasks / 2 OMP threads (-np 2 -mode DUAL, ompnumthreads=2)

1142 per task (571 per thread) / nocom: 1466 per task (733 per thread)

Bandwidth ~ 720 MB/s

2 MPI tasks / 1 OMP thread (-np 2 -mode DUAL, ompnumthreads=1)

638 / nocom: 755

Bandwidth ~ 570 MB/s

2 MPI tasks / 1 OMP thread (-np 2 -mode VN, ompnumthreads=1)

633 / nocom: 749

Bandwidth ~ 570 MB/s

4 MPI tasks / 1 OMP thread (-np 4 -mode VN, ompnumthreads=1)

575 / nocom: 734

Bandwidth ~ 370 MB/s

4 OMP threads (configure --enable-mpi, -np 1 -mode SMP, ompnumthreads=4)

These runs elucidate the overhead of the MPI calls when a task exchanges data with itself.

2182 (545 per thread) / nocom: 2918 (729 per thread)

Bandwidth ~ 1200 MB/s (of course, this figure is nonsensical, since the exchange never leaves the node)

2 OMP threads (configure --enable-mpi, -np 1 -mode SMP, ompnumthreads=2)

1231 (615 per thread) / nocom: 1497 (748 per thread)

1 OMP thread (configure --enable-mpi, -np 1 -mode SMP, ompnumthreads=1)

662 / nocom: 762

Bandwidth ~ 700 MB/s

4 OMP threads (configure --disable-mpi, llrun -np 1 -mode SMP, ompnumthreads=4)

This is a special test because it may show whether the coexistence of MPI and OpenMP causes a slowdown. At the same time, it indicates how well clock() works as a timer on the BG/P.

3129 (782 per thread)

This seems to be by far the fastest setup on one compute card, but I'm not sure how reliable the time measurement is, since without MPI I'm forced to use clock() to measure timings.

2 OMP threads (configure --disable-mpi, llrun -np 1 -mode SMP, ompnumthreads=2)

1593 (796 per thread)

1 OMP thread (configure --disable-mpi, llrun -np 1 -mode SMP, ompnumthreads=1)

808

#### More than one card

4 MPI tasks / 4 OMP threads (llrun -np 4 -mode SMP, ompnumthreads=4)

1307 (326 per thread) / nocom: 2918 (729 per thread)

Bandwidth ~ 328 MB/s

16 MPI tasks (1dim MPI, T=32)

569 / nocom: 771

Bandwidth ~ 300 MB/s

256 MPI tasks / 4 OMP threads (llrun -np 256 -mode SMP, ompnumthreads=4, 3 dim MPI)

NrTProcs=8, NrXProcs=8, NrYProcs=4, 16x32^3

1059 (264 per thread) / nocom: 2731 (682 per thread)

Bandwidth ~ 420 MB/s

512 MPI tasks / 2 OMP threads (llrun -np 512 -mode DUAL, ompnumthreads=2, 3 dim MPI)

NrTProcs=16, NrXProcs=8, NrYProcs=4, 32^4

762 (381 per thread) / nocom: 1403 (701 per thread)

Bandwidth ~ 405 MB/s

1024 MPI tasks (llrun -np 1024 -mode VN)

NrTProcs=8, NrXProcs=8, NrYProcs=4, NrZProcs=4

Fails with MPI null pointers despite sufficient local lattice size!

### enable halfspinor

## Old Code

### disable halfspinor

#### More than one card

### enable halfspinor

## Conclusions

The hybrid code is really slow on the BG/P. This will be investigated further with Scalasca.
