GROMACS is a molecular dynamics package with an extensive array of modeling, simulation and analysis capabilities. While primarily developed for the simulation of biochemical molecules, its broad adoption includes research fields such as non-biological chemistry, metadynamics and mesoscale physics. One of the key aspects characterizing GROMACS is its strong focus on high performance and resource efficiency, making use of state-of-the-art algorithms and optimized low-level programming techniques for CPUs and GPUs.
As a test case, we select the 3M-atom system from the HECBioSim benchmark suite for Molecular Dynamics:
A pair of hEGFR tetramers of 1IVO and 1NQL:
* Total number of atoms = 2,997,924
* Protein atoms = 86,996
* Lipid atoms = 867,784
* Water atoms = 2,041,230
* Ions = 1,914
The simulation is carried out using single precision, 1 MPI process per node, and 12 OpenMP threads per MPI process. We measured runtimes on 4, 8, 16, 32 and 64 compute nodes. The input file to download for this test case is 3000k-atoms/benchmark.tpr.
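The runs at the different node counts can be scripted; the sketch below is an illustrative job fragment, assuming the SLURM options used in this page and that $INPUT points at the directory holding the benchmark data (both are placeholders, not a prescribed setup):

```shell
#!/bin/bash
# Illustrative sketch: run the benchmark on increasing node counts.
# Assumes a Piz Daint-style SLURM setup with the 'gpu' constraint;
# $INPUT is a placeholder for the directory containing benchmark.tpr.
for nodes in 4 8 16 32 64; do
    srun -C gpu -N${nodes} sarus run --mpi \
        ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 \
        /usr/local/gromacs/bin/mdrun_mpi -s ${INPUT}/benchmark.tpr -ntomp 12
done
```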
Assuming that the benchmark.tpr input data is present in a directory which Sarus is configured to automatically mount inside the container (referred to here by the arbitrary variable $INPUT), we can run the container on 16 nodes as follows:
srun -C gpu -N16 sarus run --mpi \
ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 \
/usr/local/gromacs/bin/mdrun_mpi -s ${INPUT}/benchmark.tpr -ntomp 12
A typical output will look like:
:-) GROMACS - mdrun_mpi, 2018.3 (-:
...
Using 4 MPI processes
Using 12 OpenMP threads per MPI process

On host nid00001 1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
  PP:0

NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'Her1-Her1'
10000 steps,     20.0 ps.
               Core t (s)   Wall t (s)        (%)
       Time:    20878.970      434.979     4800.0
                 (ns/day)    (hour/ns)
Performance:        3.973        6.041
GROMACS reminds you: "Shake Yourself" (YES)
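The performance figure can also be pulled out of the log programmatically; a small sketch using awk, with the relevant log line embedded directly for illustration:

```shell
# Extract the ns/day value from a GROMACS mdrun log.
# The 'Performance:' line is embedded here for illustration; in practice
# you would pipe the actual log file into awk instead.
log_line='Performance:        3.973        6.041'
ns_per_day=$(echo "$log_line" | awk '/^Performance:/ {print $2}')
echo "$ns_per_day"
# prints 3.973
```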
If the system administrator did not configure Sarus to mount the input data location during container setup, we can use the --mount option:
srun -C gpu -N16 sarus run --mpi \
--mount=type=bind,src=<path-to-input-directory>,dst=/gromacs-data \
ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 \
/usr/local/gromacs/bin/mdrun_mpi -s /gromacs-data/benchmark.tpr -ntomp 12
CSCS provides and supports GROMACS on Piz Daint. This documentation page gives more details on how to run GROMACS as a native application. For this test case, the GROMACS/2018.3-CrayGNU-18.08-cuda-9.1 modulefile was loaded.
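For reference, a native run with that modulefile could be launched roughly as sketched below; the exact launch line is an assumption (the binary name and additional site modules may differ on Piz Daint), and $INPUT is again a placeholder:

```shell
# Illustrative job fragment for the native baseline (not the verified
# launch line): load the modulefile named above and run on 16 nodes.
module load GROMACS/2018.3-CrayGNU-18.08-cuda-9.1
srun -C gpu -N16 mdrun_mpi -s ${INPUT}/benchmark.tpr -ntomp 12
```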
The container image ethcscs/gromacs:2018.3-cuda9.2_mpich3.1.4 (based on cuda/9.2 and mpich/3.1) used for this test case can be pulled from the CSCS DockerHub or rebuilt with the Dockerfile at /cookbook/dockerfiles/gromacs/Dockerfile_2018.3.

Used OCI hooks:
- NVIDIA Container Runtime hook
- Native MPI hook (MPICH-based)
We measure wall clock time (in seconds) and performance (in ns/day) as reported by the application logs. The speedup values are computed using the wall clock time averages for each data point, taking the native execution time at 4 nodes as baseline. The results of our experiments are illustrated in the following figure:
We observe that the containerized application is up to 6% faster than the native deployment, with a small but consistent performance advantage and comparable standard deviations across the different node counts.
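The speedup computation described above (wall-clock average divided into the native 4-node baseline) can be reproduced with a short sketch; the wall-time values below are hypothetical placeholders, not the measured results:

```shell
# Compute speedups relative to the native 4-node wall-clock average.
# All values are illustrative placeholders, not measured data.
baseline=1740.0   # hypothetical native wall time at 4 nodes (s)
for wall in 1740.0 900.0 435.0; do
    awk -v b="$baseline" -v w="$wall" 'BEGIN {printf "%.2f\n", b / w}'
done
# prints 1.00, 1.93, 4.00 (one per line)
```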