updated plan_mpi_omp and documentation (#1076)
* Updated plan_mpi_omp and documentation.
* Reformatted Markdown to match rules.
hfp committed Sep 2, 2020
1 parent e21da2a commit 4861590
Showing 2 changed files with 113 additions and 77 deletions.
152 changes: 104 additions & 48 deletions tools/plan_mpi_omp/README.md

## Overview

DBCSR's block sparse matrix multiplication (Cannon algorithm) prefers a
square number for the total rank-count (2d communication pattern). This is
not to be confused with a Power-of-Two (POT) rank-count, which usually leads
to a trivial work distribution (MPI).
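
For illustration only (a sketch, not part of `plan.sh`), a given total
rank-count can be checked for the two properties as follows:

```bash
# Sketch: check a total rank-count for the square-number and POT properties.
NRANKS=${1:-36}
ROOT=$(echo "sqrt(${NRANKS})" | bc)
[ $((ROOT * ROOT)) -eq "${NRANKS}" ] && echo "${NRANKS} is a square number"
[ $((NRANKS & (NRANKS - 1))) -eq 0 ] && echo "${NRANKS} is a power of two (POT)"
```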

It can be more efficient to leave CPU cores unused in order to achieve this
square-number property rather than to use all cores with a "non-preferred"
total rank-count (sometimes a frequency upside over an "all-core turbo"
emphasizes this property further). Counter-intuitively, even an unbalanced
rank-count per node, i.e., different rank-counts per socket, can be an
advantage. Pinning MPI processes and placing threads requires extra care on a
per-node basis to load a dual-socket system in a balanced fashion or to set up
space between ranks for the OpenMP threads.

Because of the above-mentioned complexity, a script for planning
MPI/OpenMP-hybrid execution (`plan.sh`) is available. Here is a first example
of running the PSMP binary on an SMT-enabled (Hyper-Threading) dual-socket
system with 24 cores per processor/socket (96 hardware threads in total). At
first, a run with 48 ranks and 2 threads per core comes to mind (`48x2`).
However, 16 ranks with 6 threads per rank, for instance, can be better for
performance (`16x6`). To easily place the ranks, Intel MPI is used:

```bash
mpirun -np 16 \
  -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_PIN_ORDER=bunch \
  -genv OMP_PLACES=threads -genv OMP_PROC_BIND=SPREAD \
  -genv OMP_NUM_THREADS=6 -genv I_MPI_DEBUG=4 \
  exe/Linux-x86-64-intelx/cp2k.psmp workload.inp
```

**NOTE**: For hybrid codes, `I_MPI_PIN_DOMAIN=auto` is recommended as it spaces
the ranks according to the number of OpenMP threads (`OMP_NUM_THREADS`). It is
neither necessary nor recommended to build a rather complicated
`I_MPI_PIN_PROCESSOR_LIST` for hybrid codes (MPI plus OpenMP). To display and
to log the pinning and thread affinitization at the startup of an application,
`I_MPI_DEBUG=4` can be used with no performance penalty. The recommended
`I_MPI_PIN_ORDER=bunch` ensures that the ranks per node are split as evenly as
possible with respect to sockets (e.g., 36 ranks on a 2x20-core system
are placed as 2x18 ranks rather than 20+16 ranks).
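
The same controls can also be exported as environment variables rather than
passed with `-genv`; a minimal sketch, assuming the launcher propagates the
calling environment (Intel MPI does so by default):

```bash
# Sketch: set the pinning/affinity controls via the environment instead of -genv.
export I_MPI_PIN_DOMAIN=auto I_MPI_PIN_ORDER=bunch I_MPI_DEBUG=4
export OMP_PLACES=threads OMP_PROC_BIND=SPREAD OMP_NUM_THREADS=6
mpirun -np 16 exe/Linux-x86-64-intelx/cp2k.psmp workload.inp
```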

To achieve a similar placement with OpenMPI, ranks are mapped to "execution
slots" (`--map-by slot`) while specifying the number of processing elements
(`PE`) per rank. By default, execution slots are counted in physical cores,
which yields `--map-by slot:PE=3` for the same system as mentioned above.

```bash
mpirun -np 16 --map-by slot:PE=3 \
-x OMP_PLACES=threads -x OMP_PROC_BIND=SPREAD \
-x OMP_NUM_THREADS=6 \
exe/Linux-x86-64-intelx/cp2k.psmp workload.inp
```

**NOTE**: The effect of Intel MPI's `I_MPI_PIN_ORDER=bunch`, i.e., balancing
the number of ranks between sockets (see above), appears hard to achieve with
OpenMPI; therefore, undersubscribing a system may not be recommended. To
display and to log the pinning and thread affinitization at the startup of an
application, `mpirun --report-bindings` can be used.

The end of the next section continues with our example and extends execution
to multiple nodes of the above-mentioned system.

## Plan Script

To configure the plan-script, the metric of the compute nodes can be given for
future invocations so that only the node-count is required as an argument. The
script's help output (`-h` or `--help`) initially shows the "system metric" of
the computer the script is invoked on. For a system with 48&#160;cores (two
sockets, SMT/HT enabled), setting up the "system metric" looks like (`plan.sh
[num-nodes] [ncores-per-node] [nthreads-per-core] [nsockets-per-node]`):

```bash
./plan.sh 1 48 2 2
```

With the system metric in place, only the node-count is required for
subsequent invocations, e.g., running the script for eight nodes of the
above-mentioned system (`plan.sh 8`).

The first group of the output displays POT-style (trivial) MPI/OpenMP
configurations (penalty denotes potential communication overhead), whereas the
second group (if present) shows rank/thread combinations with a total
rank-count that hits a square number (penalty denotes compute wasted by not
filling each node). For the given example, 8&#160;ranks per node with
12&#160;threads per rank is chosen (`8x12`) and MPI-executed:

```bash
mpirun -perhost 8 -host node1,node2,node3,node4,node5,node6,node7,node8 \
-genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_PIN_ORDER=bunch -genv I_MPI_DEBUG=4 \
-genv OMP_PLACES=threads -genv OMP_PROC_BIND=SPREAD -genv OMP_NUM_THREADS=12 \
exe/Linux-x86-64-intelx/cp2k.psmp workload.inp
```

**NOTE**: For Intel&#160;MPI as well as OpenMPI, mpirun's host-list (`mpirun
-host`) is set up with unique node-names, and this is the only style explained
in this article. There is a competing style where node names are duplicated
for the sake of enumerating available ranks (or "execution slots" in the case
of OpenMPI), which is not exercised in this article.
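
For comparison only, here is a sketch of the shape of the two host-list styles
(hypothetical node names; all other options elided):

```bash
# Unique node-names (the style used throughout this article):
mpirun -host node1,node2 ...

# Competing style (not used here): names repeated once per rank/execution slot.
mpirun -host node1,node1,node2,node2 ...
```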

The quantity (per node) of the previously mentioned "execution slots"
(measured in number of physical cores) is sometimes not known to OpenMPI
(depending on the cluster/scheduler setup). For instance, `mpirun` may
complain about an attempt to use too many execution slots simply because
OpenMPI believes every system represents a single such slot (instead of 2x24
cores, it only "sees" a single core per system). In such a case, it is not
recommended to "oversubscribe" the system (`mpirun --oversubscribe`) because
the rank/thread affinity would likely be wrong. Instead, the list of unique
node names (`-host`) may be augmented with the number of physical cores on
each of the nodes (e.g., ":48" in our case):

```bash
mpirun -npernode 8 -host node1:48,node2:48,node3:48,node4:48,node5:48,node6:48,node7:48,node8:48 \
--map-by slot:PE=6 --report-bindings \
-x OMP_PLACES=threads -x OMP_PROC_BIND=SPREAD -x OMP_NUM_THREADS=12 \
exe/Linux-x86-64-intelx/cp2k.psmp workload.inp
```

**NOTE**: It can still be insufficient to augment the node names with the
expected number of slots (`:48`). If OpenMPI's mpirun keeps complaining, the
cause (and the fix) may lie with the job scheduler. For example, `qsub` (PBS)
may be instructed with `-l select=8:mpiprocs=48` in the above case (`mpirun`
inside this job can still use fewer than 48 ranks per node).
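
A minimal PBS job script along these lines is sketched below; the job name and
walltime are placeholders, and it assumes OpenMPI picks up the node allocation
from PBS:

```bash
#!/bin/bash
#PBS -N cp2k-plan
#PBS -l select=8:mpiprocs=48
#PBS -l walltime=01:00:00

cd "${PBS_O_WORKDIR}"

# The job requests 48 MPI slots per node, but mpirun may still launch fewer
# ranks per node, e.g., the 8x12 configuration planned above.
mpirun -npernode 8 --map-by slot:PE=6 --report-bindings \
  -x OMP_PLACES=threads -x OMP_PROC_BIND=SPREAD -x OMP_NUM_THREADS=12 \
  exe/Linux-x86-64-intelx/cp2k.psmp workload.inp
```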

The plan-script also suggests close-by configurations (lower and higher
node-counts) that can hit the square-property ("Try also the following node
counts"). The example exercised above illustrates how the script works;
however, the script can be very helpful when running jobs, especially on CPUs
whose core-count has few prime factors. Remember, the latter can also be the
case for virtualized environments that reserve some of the cores to run the
hypervisor, i.e., report fewer cores to the (guest) Operating System than the
physical core-count.
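
As a simplified sketch of what such a search boils down to (not the actual
`plan.sh` logic), the following loop lists, for a given node-count and cores
per node, the ranks-per-node values whose total rank-count is a square number:

```bash
# Sketch: list ranks-per-node whose total rank-count across all nodes is square.
NODES=${1:-8}; NCORES=${2:-48}
for RPN in $(seq 1 "${NCORES}"); do
  if [ $((NCORES % RPN)) -eq 0 ]; then  # keep cores-per-rank integral
    TOTAL=$((NODES * RPN)); ROOT=$(echo "sqrt(${TOTAL})" | bc)
    if [ $((ROOT * ROOT)) -eq "${TOTAL}" ]; then
      echo "${RPN} ranks/node -> ${TOTAL} ranks in total (square), $((NCORES / RPN)) cores per rank"
    fi
  fi
done
```

The real script additionally weighs the penalties (communication overhead
versus unused cores) as described above.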

## References

[https://github.com/hfp/xconfigure/raw/master/config/cp2k/plan.sh](https://github.com/hfp/xconfigure/raw/master/config/cp2k/plan.sh)
[https://xconfigure.readthedocs.io/cp2k/](https://xconfigure.readthedocs.io/cp2k/)
[https://software.intel.com/content/www/us/en/develop/articles/pinning-simulator-for-intel-mpi-library.html](https://software.intel.com/content/www/us/en/develop/articles/pinning-simulator-for-intel-mpi-library.html)
38 changes: 9 additions & 29 deletions tools/plan_mpi_omp/plan.sh
#!/bin/bash
###############################################################################
# Copyright (c) Intel Corporation - All rights reserved. #
# This file is part of the XCONFIGURE project. #
# #
# For information on the license, see the LICENSE file. #
# Further information: https://github.com/hfp/xconfigure/ #
# SPDX-License-Identifier: BSD-3-Clause #
###############################################################################
# Hans Pabst (Intel Corp.)
###############################################################################

# number of systems (cluster nodes)
TOTALNUMNODES=1