
The prototype for the replacement of Juropa can currently be accessed via juropatest.fz-juelich.de.

The system is equipped with 70 compute nodes, each with two 14-core (!) Xeon E5-2695 v3 processors running at 2.3 GHz, which support FMA and simultaneous multithreading (Hyperthreading). The resulting theoretical peak performance is 18.4 GFlop/s per core (but this is a very synthetic number).
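This figure can presumably be reconstructed as follows (an assumption about how the number was obtained, counting one 256-bit double-precision FMA per cycle):

2.3 GHz * 4 doubles per AVX2 vector * 2 flops per FMA = 18.4 GFlop/s per core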

Compilation

The first step is to load the Intel environment module, which provides the Intel compiler.

$ module load intel
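To verify that the Intel toolchain has actually been picked up (a quick sanity check; the exact module and wrapper names may differ on this machine), something like the following can be used:

$ module list
$ icc --version
$ mpicc -show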

Pure MPI

To configure tmLQCD in pure MPI mode with 4D parallelization, we proceed as follows with the master branch of github.com/etmc/tmLQCD:

$CODEPATH/configure --disable-omp --enable-mpi --with-mpidimension=4 --enable-alignment=32 \
--with-lapack="-L/usr/local/software/juropatest/Stage1/software/MPI/intel/2015.0.090/impi/5.0.1.035/imkl/11.2.0.090/mkl/lib/intel64 \
-lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_sequential -lmkl_intel_lp64" \
--with-limedir=$yourlimedir --with-lemondir=$yourlemondir --disable-sse2 --disable-sse3 --enable-gaugecopy \
--enable-halfspinor CC=mpicc CFLAGS="-fma -axCORE-AVX2 -O3 -std=c99" F77=ifort

and in one line for easy copying:

$CODEPATH/configure --disable-omp --enable-mpi --with-mpidimension=4 --enable-alignment=32 --with-lapack="-L/usr/local/software/juropatest/Stage1/software/MPI/intel/2015.0.090/impi/5.0.1.035/imkl/11.2.0.090/mkl/lib/intel64 -lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_sequential -lmkl_intel_lp64" --with-limedir=$yourlimedir --with-lemondir=$yourlemondir --disable-sse2 --disable-sse3 --enable-gaugecopy --enable-halfspinor CC=mpicc CFLAGS="-fma -axCORE-AVX2 -O3 -std=c99" F77=ifort
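After configuring, the build itself is just the usual make step; a sketch (the -j value is only a suggestion and can be adapted to the number of cores available on the login node):

$ make -j 28

The usual tmLQCD executables (hmc_tm, invert) should then appear in the build directory.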

Hybrid

The pesky factor of 7 in the number of cores per processor means that it might make sense to use the hybrid (MPI+OpenMP) code, with or without overlapping of communication and computation (to be tested!). The overlapping variant performs three volume loops and, because of this overhead, is not necessarily faster.

Overlapping

This mode currently seems to be somewhat slower due to increased overhead; the reasons have not yet been investigated. To use it, grab the InterleavedNDTwistedClover branch from github.com/urbach/tmLQCD and configure the code like so:

$CODEPATH/configure --enable-omp --enable-mpi --with-mpidimension=4 --enable-alignment=32 --enable-threadoverlap --with-lapack="-L/usr/local/software/juropatest/Stage1/software/MPI/intel/2015.0.090/impi/5.0.1.035/imkl/11.2.0.090/mkl/lib/intel64 -lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_intel_thread -lmkl_intel_lp64" --with-limedir=$YOURLIMEPATH --with-lemondir=$YOURLEMONPATH --disable-sse2 --disable-sse3 --enable-gaugecopy --enable-halfspinor CC=mpicc CFLAGS="-fma -axCORE-AVX2 -O3 -fopenmp -std=c99" F77=ifort LDFLAGS=-fopenmp

No Overlapping

For this, use the code from the master branch of github.com/etmc/tmLQCD and configure it like this:

$CODEPATH/configure --enable-omp --enable-mpi --with-mpidimension=4 --enable-alignment=32 --with-lapack="-L/usr/local/software/juropatest/Stage1/software/MPI/intel/2015.0.090/impi/5.0.1.035/imkl/11.2.0.090/mkl/lib/intel64 -lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_intel_thread -lmkl_intel_lp64" --with-limedir=$YOURLIMEPATH --with-lemondir=$YOURLEMONPATH --disable-sse2 --disable-sse3 --enable-gaugecopy --enable-halfspinor CC=mpicc CFLAGS="-fma -axCORE-AVX2 -O3 -fopenmp -std=c99" F77=ifort LDFLAGS=-fopenmp
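For either hybrid build, the OpenMP thread count and pinning need to be set at run time. A minimal sketch using the standard OpenMP variable and the Intel runtime's affinity variable (the values are examples only and should be tuned; the thread count may also be settable via the OmpNumThreads parameter in the tmLQCD input file, depending on the version used):

$ export OMP_NUM_THREADS=14
$ export KMP_AFFINITY=granularity=fine,compact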

MPI, Overlap?

Each node of this machine can host up to 56 MPI processes (14 cores * 2 sockets * 2 hardware threads) thanks to Hyperthreading. Even if your lattice geometry does not contain a factor of 7, it might make sense to simply leave a few hardware threads idle and run 48 or 50 processes per node; which code is fastest really needs to be tested on a case-by-case basis. In some cases it might be easiest to absorb the factor of 7 with OpenMP threads and run 2, 4 or 8 processes per node with 28, 14 or 7 threads per process, respectively. Whether the overlapping or the non-overlapping code provides the best performance must likewise be tested case by case.
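As a command-line sketch of these layouts with SLURM (the srun options are standard, but the executable and input file names are placeholders):

# pure MPI, leaving some hardware threads idle: 48 ranks per node
$ srun --ntasks-per-node=48 ./hmc_tm -f hmc.input

# hybrid: 4 ranks per node with 14 threads each, absorbing the factor of 7 in OpenMP
$ export OMP_NUM_THREADS=14
$ srun --ntasks-per-node=4 --cpus-per-task=14 ./hmc_tm -f hmc.input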

Running

So far, only interactive runs have been explored. The batch system is SLURM and there is some documentation on the JSC User Info page regarding the writing of job scripts.
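As a starting point for batch jobs, a minimal SLURM script sketch (partition, account, resource values and executable name are placeholders and should be checked against the JSC documentation):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=14
#SBATCH --time=01:00:00
#SBATCH --job-name=tmlqcd-test

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./hmc_tm -f hmc.input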