From 477e0974f6a883172cd827d17b6ceb135a95eb35 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 12:43:56 +0200 Subject: [PATCH 01/22] Add more links to libfabric section --- docs/software/communication/cray-mpich.md | 2 ++ docs/software/communication/libfabric.md | 18 ++++++++++++++++++ 2 files changed, 20 insertions(+) diff --git a/docs/software/communication/cray-mpich.md b/docs/software/communication/cray-mpich.md index c50f21ae..e06880fd 100644 --- a/docs/software/communication/cray-mpich.md +++ b/docs/software/communication/cray-mpich.md @@ -58,12 +58,14 @@ See [this page][ref-slurm-gh200] for more information on configuring SLURM to us Alternatively, if you wish to not use GPU-aware MPI, either unset `MPICH_GPU_SUPPORT_ENABLED` or explicitly set it to `0` in your launch scripts. +[](){#ref-communication-cray-mpich-known-issues} ## Known issues This section documents known issues related to Cray MPICH on Alps. Resolved issues are also listed for reference. ### Existing Issues +[](){#ref-communication-cray-mpich-cache-monitor-disable} #### Cray MPICH hangs Cray MPICH may sometimes hang on larger runs. diff --git a/docs/software/communication/libfabric.md b/docs/software/communication/libfabric.md index 27f1ab6b..3d0c760d 100644 --- a/docs/software/communication/libfabric.md +++ b/docs/software/communication/libfabric.md @@ -4,4 +4,22 @@ [Libfabric](https://ofiwg.github.io/libfabric/), or Open Fabrics Interfaces (OFI), is a low level networking library that abstracts away various networking backends. It is used by Cray MPICH, and can be used together with OpenMPI, NCCL, and RCCL to make use of the [Slingshot network on Alps][ref-alps-hsn]. +## Using libfabric + +If you are using a uenv provided by CSCS, such as [prgenv-gnu][ref-uenv-prgenv-gnu], [Cray MPICH][ref-communication-cray-mpich] is linked to libfabric and the high speed network will be used. +No changes are required in applications. 
+ +If you are using containers, the system libfabric can be loaded into your container using the [CXI hook provided by the container engine][ref-ce-cxi-hook]. +Using the hook is essential to make full use of the Alps network. + +## Tuning libfabric + +Tuning libfabric (particularly together with [Cray MPICH][ref-communication-cray-mpich], [OpenMPI][ref-communication-openmpi], [NCCL][ref-communication-nccl], and [RCCL][ref-communication-rccl]) depends on many factors, including the application, workload, and system. +For a comprehensive overview libfabric options for the CXI provider (the provider for the Slingshot network), see the [`fi_cxi` man pages](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_cxi.7.html). +Note that the exact version deployed on Alps may differ, and not all options may be applicable on Alps. + +See the [Cray MPICH known issues page][ref-communication-cray-mpich-known-issues] for issues when using Cray MPICH together with libfabric. +For example, certain applications may hang at scale unless [the `FI_MR_CACHE_MONITOR=disabled`][ref-communication-cray-mpich-cache-monitor-disable] option is set. + !!! todo + More options? From 7a871b3c875effcf2b163489dd1280ac091d5f54 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 13:12:45 +0200 Subject: [PATCH 02/22] Add a few environment variables for OpenMPI on Alps --- docs/software/communication/openmpi.md | 36 +++++++++++++++++++++++++- 1 file changed, 35 insertions(+), 1 deletion(-) diff --git a/docs/software/communication/openmpi.md b/docs/software/communication/openmpi.md index 9625902b..937b827c 100644 --- a/docs/software/communication/openmpi.md +++ b/docs/software/communication/openmpi.md @@ -6,5 +6,39 @@ However, [OpenMPI](https://www.open-mpi.org/) can be used as an alternative in s To use OpenMPI on Alps, it must be built against [libfabric][ref-communication-libfabric] with support for the [Slingshot 11 network][ref-alps-hsn]. +## Using OpenMPI + +!!! 
warning + Building and using OpenMPI on Alps is still [work in progress](https://eth-cscs.github.io/cray-network-stack/). + The instructions found on this page may be inaccurate, but are a good starting point to using OpenMPI on Alps. + +!!! todo + Deploy experimental uenv. + !!! todo - Building OpenMPI for Alps is still work in progress: https://eth-cscs.github.io/cray-network-stack/. + Document OpenMPI uenv next to prgenv-gnu, prgenv-nvfortran, and linalg? + +OpenMPI is provided through a [uenv][ref-uenv] similar to [`prgenv-gnu`][ref-uenv-prgenv-gnu]. +Once the uenv is loaded, compiling and linking with OpenMPI and libfabric is transparent. +At runtime, some additional options must be set to correctly use the Slingshot network. + +First, when launching applications through slurm, [PMIx](https://pmix.github.com) must be used for application launching. +This is done with the `--mpi` flag of `srun`: +```bash +srun --mpi=pmix ... +``` + +Additionally, the following environment variables should be set: +```bash +export PMIX_MCA_psec="native" # (1) +export FI_PROVIDER="lnx" # (2) +export FI_LNX_PROV_LINKS="shm+cxi" # (3) +export OMPI_MCA_pml="^ucx" # (4) +export OMPI_MCA_mtl="ofi" # (5) +``` + +1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup. +2. Use the [libfabric LINKx](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_lnx.7.html) provider, to allow using different libfabric providers for inter- and intra-node communication. +3. Use the shared memory provider for intra-node communication and the CXI (Slingshot) provider for inter-node communication. +4. Use anything except [UCX](https://openucx.org/documentation/) for [point-to-point communication](https://docs.open-mpi.org/en/v5.0.x/mca.html#selecting-which-open-mpi-components-are-used-at-run-time). +5. Use libfabric for the [Matching Transport Layer](https://docs.open-mpi.org/en/v5.0.x/mca.html#frameworks). 
From ea89fe0bdf0f5882a2572e6de1c63419f9fe8cf3 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 13:33:43 +0200 Subject: [PATCH 03/22] Expand NCCL and RCCL pages --- docs/software/communication/nccl.md | 18 +++++++++++++++--- docs/software/communication/rccl.md | 4 ++++ docs/software/container-engine.md | 1 + 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index d58c9329..1d830380 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -4,7 +4,19 @@ [NCCL](https://developer.nvidia.com/nccl) is an optimized inter-GPU communication library for NVIDIA GPUs. It is commonly used in machine learning frameworks, but traditional scientific applications can also benefit from NCCL. +## Using NCCL + +To use the Slingshot network on Alps, the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin must be used. +With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it. + +While the container engine does this automatically, regardless of application, the following environment variable should always be set when using NCCL: + +```bash +export NCCL_NET_PLUGIN="ofi" +``` + +This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. +Conversely, if the plugin can not be found, applications will fail to start instead of falling back to e.g. TCP, which would be significantly slower than with the plugin. + !!! todo - - high level description - - libfabric/aws-ofi-nccl plugin - - configuration options + More options? 
diff --git a/docs/software/communication/rccl.md b/docs/software/communication/rccl.md index a42b968d..4e33fb3a 100644 --- a/docs/software/communication/rccl.md +++ b/docs/software/communication/rccl.md @@ -8,3 +8,7 @@ It provides equivalent functionality to [NCCL][ref-communication-nccl] for AMD G - high level description - libfabric/aws-ofi-rccl plugin - configuration options + +!!! info + RCCL uses many of the same [configuration options as NCCL](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html), with the `NCCL` prefix, not `RCCL`. + Refer to NCCL documentation to tune RCCL. diff --git a/docs/software/container-engine.md b/docs/software/container-engine.md index 6231d38f..d14aeaf0 100644 --- a/docs/software/container-engine.md +++ b/docs/software/container-engine.md @@ -533,6 +533,7 @@ Container hooks let you customize container behavior to fit system-specific need ### AWS OFI NCCL Hook  The [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl) is a software extension that allows the [NCCL](https://developer.nvidia.com/nccl) and [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) libraries to use libfabric as a network provider and, through libfabric, to access the Slingshot high-speed interconnect. +Also see [NCCL][ref-communication-nccl] and [libfabric][ref-communication-libfabric] for more information on using the libraries on Alps. The Container Engine includes a hook program to inject the AWS OFI NCCL plugin in containers; since the plugin must also be compatible with the GPU programming software stack being used, the `com.hooks.aws_ofi_nccl.variant` annotation is used to specify a plugin variant suitable for a given container image. At the moment of writing, 4 plugin variants are configured: `cuda11`, `cuda12` (to be used on NVIDIA GPU nodes), `rocm5`, and `rocm6` (to be used on AMD GPU nodes alongside RCCL). 
From 4b2a984234521930547a3ab7efd0809a9da6c6cb Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 13:38:44 +0200 Subject: [PATCH 04/22] Add note box in container engine docs --- docs/software/container-engine.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/software/container-engine.md b/docs/software/container-engine.md index d14aeaf0..691b3359 100644 --- a/docs/software/container-engine.md +++ b/docs/software/container-engine.md @@ -437,7 +437,9 @@ If a libfabric library is already present in the container filesystem (for examp !!! note Due to the nature of Slingshot and the mechanism implemented by the CXI hook, container applications need to use a communication library which supports libfabric in order to benefit from usage of the hook. -> Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI) or could be dynamically available at runtime (as is the case with NCCL - see also [this][ref-ce-aws-ofi-hook] section for more details). + +!!! note + Libfabric support might have to be defined at compilation time (as is the case for some MPI implementations, like MPICH and OpenMPI) or could be dynamically available at runtime (as is the case with NCCL - see also [this][ref-ce-aws-ofi-hook] section for more details). 
The hook is activated by setting the `com.hooks.cxi.enabled` annotation, which can be defined in the EDF, as shown in the following example: From 59f5ba2460dd1cdc3dd800a48491fef8374c3356 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 13:38:57 +0200 Subject: [PATCH 05/22] Add more codeowners to communication pages --- .github/CODEOWNERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 0beb2604..3b243f79 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1,5 +1,5 @@ * @bcumming @msimberg @RMeli docs/services/firecrest @jpdorsch @ekouts -docs/software/communication @msimberg +docs/software/communication @biddisco @Madeeks @msimberg docs/software/prgenv/linalg.md @finkandreas @msimberg docs/software/sciapps/cp2k.md @abussy @RMeli From 8f15929e5d2fc769bf48d1141c276ca414bd3549 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 14:06:57 +0200 Subject: [PATCH 06/22] Update docs/software/communication/nccl.md --- docs/software/communication/nccl.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index 1d830380..2d99ae1d 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -18,5 +18,12 @@ export NCCL_NET_PLUGIN="ofi" This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. Conversely, if the plugin can not be found, applications will fail to start instead of falling back to e.g. TCP, which would be significantly slower than with the plugin. +!!! warning "GPU-aware MPI with NCCL" + Using GPU-aware MPI together with NCCL [can easily lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi). + Unless care is taken to ensure that the two methods of communication are not used concurrently, we recommend not using GPU-aware MPI with NCCL. 
+ To explicitly disable GPU-aware MPI with Cray MPICH, set `MPICH_GPU_SUPPORT_ENABLED=0`. + Note that this option may be set to `1` by default on some Alps clusters. + See [the Cray MPICH documentation][ref-communication-cray-mpich] for more details on GPU-aware MPI with Cray MPICH. + !!! todo More options? From 259fd4bea0dbda01d1b66a834484eac7ae4f0224 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 14:21:13 +0200 Subject: [PATCH 07/22] Recommend cxi over lnx when using OpenMPI --- docs/software/communication/openmpi.md | 30 +++++++++++++++++++------- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/docs/software/communication/openmpi.md b/docs/software/communication/openmpi.md index 937b827c..c474421f 100644 --- a/docs/software/communication/openmpi.md +++ b/docs/software/communication/openmpi.md @@ -31,14 +31,28 @@ srun --mpi=pmix ... Additionally, the following environment variables should be set: ```bash export PMIX_MCA_psec="native" # (1) -export FI_PROVIDER="lnx" # (2) -export FI_LNX_PROV_LINKS="shm+cxi" # (3) -export OMPI_MCA_pml="^ucx" # (4) -export OMPI_MCA_mtl="ofi" # (5) +export FI_PROVIDER="cxi" # (2) +export OMPI_MCA_pml="^ucx" # (3) +export OMPI_MCA_mtl="ofi" # (4) ``` 1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup. -2. Use the [libfabric LINKx](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_lnx.7.html) provider, to allow using different libfabric providers for inter- and intra-node communication. -3. Use the shared memory provider for intra-node communication and the CXI (Slingshot) provider for inter-node communication. -4. Use anything except [UCX](https://openucx.org/documentation/) for [point-to-point communication](https://docs.open-mpi.org/en/v5.0.x/mca.html#selecting-which-open-mpi-components-are-used-at-run-time). -5. Use libfabric for the [Matching Transport Layer](https://docs.open-mpi.org/en/v5.0.x/mca.html#frameworks). +2. 
Use the CXI (Slingshot) provider. +3. Use anything except [UCX](https://openucx.org/documentation/) for [point-to-point communication](https://docs.open-mpi.org/en/v5.0.x/mca.html#selecting-which-open-mpi-components-are-used-at-run-time). +4. Use libfabric for the [Matching Transport Layer](https://docs.open-mpi.org/en/v5.0.x/mca.html#frameworks). + +!!! info "CXI provider does all communication through the network interface cards (NICs)" + When using the libfabric CXI provider, all communication goes through NICs, including intra-node communication. + This means that intra-node communication can not make use of shared memory optimizations and the maximum bandwidth will not be severely limited. + + Libfabric has a new [LINKx](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_lnx.7.html) provider, which allows using different libfabric providers for inter- and intra-node communication. + This provider is not as well tested, but can in theory perform better for intra-node communication, because it can use shared memory. + To use the LINKx provider, set the following, instead of `FI_PROVIDER=cxi`: + + ```bash + export FI_PROVIDER="lnx" # (1) + export FI_LNX_PROV_LINKS="shm+cxi" # (2) + ``` + + 1. Use the libfabric LINKx provider, to allow using different libfabric providers for inter- and intra-node communication. + 2. Use the shared memory provider for intra-node communication and the CXI (Slingshot) provider for inter-node communication. 
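Review note on PATCH 07/22: the choice between the CXI and LINKx providers described above could be wrapped in a small shell helper, sketched below. The function name `set_openmpi_alps_env` and its `cxi`/`lnx` argument are hypothetical and not part of OpenMPI, libfabric, or any CSCS tooling. Only the variable names and values are taken from the patch.

```bash
# Hypothetical helper: export the OpenMPI-related settings documented in the patch.
# Argument: "cxi" (default, recommended by the patch) or "lnx" (less tested LINKx provider).
set_openmpi_alps_env() {
    provider="${1:-cxi}"

    export PMIX_MCA_psec="native"   # same security domain as Slurm
    export OMPI_MCA_pml="^ucx"      # anything except UCX for point-to-point
    export OMPI_MCA_mtl="ofi"       # libfabric for the matching transport layer

    case "$provider" in
        cxi)
            export FI_PROVIDER="cxi"
            unset FI_LNX_PROV_LINKS  # only meaningful for the LINKx provider
            ;;
        lnx)
            export FI_PROVIDER="lnx"
            export FI_LNX_PROV_LINKS="shm+cxi"
            ;;
        *)
            echo "unknown provider: $provider" >&2
            return 1
            ;;
    esac
}

set_openmpi_alps_env cxi
# The application would then be launched through PMIx, for example:
#   srun --mpi=pmix ./my_app    # ./my_app is a placeholder
```

Keeping both variants behind one argument makes it easy to compare intra-node performance of the two providers without editing the launch script.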
From 30901d1adc4c83c2889d3739f918412840e19b18 Mon Sep 17 00:00:00 2001 From: boeschf <48126478+boeschf@users.noreply.github.com> Date: Thu, 3 Apr 2025 16:03:10 +0200 Subject: [PATCH 08/22] perf variables --- docs/software/communication/nccl.md | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index 2d99ae1d..e7e79085 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -9,15 +9,26 @@ It is commonly used in machine learning frameworks, but traditional scientific a To use the Slingshot network on Alps, the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin must be used. With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it. -While the container engine does this automatically, regardless of application, the following environment variable should always be set when using NCCL: +While the container engine does this automatically, regardless of application, the following environment variables should always be set when using NCCL: ```bash -export NCCL_NET_PLUGIN="ofi" +export NCCL_NET="AWS Libfabric" ``` This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. Conversely, if the plugin can not be found, applications will fail to start instead of falling back to e.g. TCP, which would be significantly slower than with the plugin. +For optimal performance, the following environment variables should also be set (these are set automatically by the container engine): + +```bash +export NCCL_NET_GDR_LEVEL=PHB +export FI_CXI_DISABLE_HOST_REGISTER=1 +export FI_MR_CACHE_MONITOR=userfaultfd +export FI_CXI_DEFAULT_CQ_SIZE=131072 +export FI_CXI_DEFAULT_TX_SIZE=32768 +export FI_CXI_RX_MATCH_MODE=software +``` + !!! 
warning "GPU-aware MPI with NCCL" Using GPU-aware MPI together with NCCL [can easily lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi). Unless care is taken to ensure that the two methods of communication are not used concurrently, we recommend not using GPU-aware MPI with NCCL. From c93b4df6286a6da465ed8f845241a2d72461faf0 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 16:34:42 +0200 Subject: [PATCH 09/22] Add links to NCCL docs from GB docs --- docs/guides/gb2025.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/guides/gb2025.md b/docs/guides/gb2025.md index daf0de46..765cb68a 100644 --- a/docs/guides/gb2025.md +++ b/docs/guides/gb2025.md @@ -86,3 +86,7 @@ Without these settings, we have observed application slowdown due to poor thread !!! todo write a guide on which versions to use, environment variables to set, etc. + +See [the container engine documentation][ref-ce-aws-ofi-hook] for information on using NCCL in containers. +The [NCCL][ref-communication-nccl] page contains general information on configuring NCCL. +This information is especially important when using uenvs, as the environment variables are not set automatically. From 988c24a361ba5899b9364a33f5ffe591f35247c5 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 17:09:27 +0200 Subject: [PATCH 10/22] Refactor NCCL docs, add uenv notes --- docs/software/communication/nccl.md | 32 +++++++++++++++++------------ 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index e7e79085..35373928 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -9,26 +9,32 @@ It is commonly used in machine learning frameworks, but traditional scientific a To use the Slingshot network on Alps, the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin must be used. 
With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it. -While the container engine does this automatically, regardless of application, the following environment variables should always be set when using NCCL: +Most uenvs, like [`prgenv-gnu`][ref-uenv-prgenv-gnu] also contain the NCCL plugin. +When using e.g. the `default` view of `prgenv-gnu` the `aws-ofi-nccl` plugin will be available in the environment. +Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment. +The environment variables described below must still be set to ensure that NCCL uses the plugin. -```bash -export NCCL_NET="AWS Libfabric" -``` - -This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. -Conversely, if the plugin can not be found, applications will fail to start instead of falling back to e.g. TCP, which would be significantly slower than with the plugin. - -For optimal performance, the following environment variables should also be set (these are set automatically by the container engine): +While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optial performance when using NCCL: ```bash -export NCCL_NET_GDR_LEVEL=PHB -export FI_CXI_DISABLE_HOST_REGISTER=1 -export FI_MR_CACHE_MONITOR=userfaultfd -export FI_CXI_DEFAULT_CQ_SIZE=131072 +export NCCL_NET="AWS Libfabric" # (1) +export NCCL_NET_GDR_LEVEL=PHB # (2) +export FI_CXI_DEFAULT_CQ_SIZE=131072 # (3) export FI_CXI_DEFAULT_TX_SIZE=32768 +export FI_CXI_DISABLE_HOST_REGISTER=1 export FI_CXI_RX_MATCH_MODE=software +export FI_MR_CACHE_MONITOR=userfaultfd +export MPICH_GPU_SUPPORT_ENABLED=0 # (3) ``` +1. This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. 
If the plugin can not be found, applications will fail to start. With the default value, applications would instead fall back to e.g. TCP, which would be significantly slower than with the plugin. [More information about `NCCL_NET`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net). +2. Use GPU Direct RDMA when GPU and NIC are on the same NUMA node. [More information about `NCCL_NET_GDR_LEVEL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net-gdr-level-formerly-nccl-ib-gdr-level). +3. This and the other `FI` (libfabric) environment variables have been found to give the best performance on the Alps network across a wide range of applications. Specific applications may perform better with other values. +4. Disable GPU-aware MPI explicitly, to avoid potential deadlocks between MPI and NCCL. + +!!! warning "Using NCCL with uenvs" + The environment variables listed above are not set automatically when using uenvs. + !!! warning "GPU-aware MPI with NCCL" Using GPU-aware MPI together with NCCL [can easily lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi). Unless care is taken to ensure that the two methods of communication are not used concurrently, we recommend not using GPU-aware MPI with NCCL. 
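Review note on PATCH 10/22: since the patch stresses that uenvs do not set these variables automatically, a pre-launch check along the following lines might help catch a missing setting. This is only a sketch. The function name `check_nccl_env` is hypothetical, and the expected values are simply copied from the list in the patch, which itself notes that specific applications may perform better with other values.

```bash
# Hypothetical pre-launch check: report which of the variables listed in the
# patch are unset or differ from the documented values.
check_nccl_env() {
    status=0
    # Each line of the here-document is NAME=EXPECTED_VALUE.
    while IFS='=' read -r name expected; do
        [ -n "$name" ] || continue
        eval "actual=\${$name-}"    # indirect lookup, empty if unset
        if [ "$actual" != "$expected" ]; then
            echo "warning: $name is '$actual', documented value is '$expected'" >&2
            status=1
        fi
    done <<'EOF'
NCCL_NET=AWS Libfabric
NCCL_NET_GDR_LEVEL=PHB
FI_CXI_DEFAULT_CQ_SIZE=131072
FI_CXI_DEFAULT_TX_SIZE=32768
FI_CXI_DISABLE_HOST_REGISTER=1
FI_CXI_RX_MATCH_MODE=software
FI_MR_CACHE_MONITOR=userfaultfd
MPICH_GPU_SUPPORT_ENABLED=0
EOF
    return "$status"
}
```

Calling `check_nccl_env` at the top of a job script prints one warning per mismatching variable and returns a non-zero status if any were found.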
From 79c51c20cab7c2d2e48f61119e7f3a864173c908 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 17:23:41 +0200 Subject: [PATCH 11/22] Add comma --- docs/software/communication/nccl.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index 35373928..e7dc2ff8 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -9,7 +9,7 @@ It is commonly used in machine learning frameworks, but traditional scientific a To use the Slingshot network on Alps, the [`aws-ofi-nccl`](https://github.com/aws/aws-ofi-nccl) plugin must be used. With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be used to load the plugin into the container and configure NCCL to use it. -Most uenvs, like [`prgenv-gnu`][ref-uenv-prgenv-gnu] also contain the NCCL plugin. +Most uenvs, like [`prgenv-gnu`][ref-uenv-prgenv-gnu], also contain the NCCL plugin. When using e.g. the `default` view of `prgenv-gnu` the `aws-ofi-nccl` plugin will be available in the environment. Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment. The environment variables described below must still be set to ensure that NCCL uses the plugin. From 9ae67444df43dd61316a0b7a3a6adfab9cf6a6b9 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Thu, 3 Apr 2025 17:30:24 +0200 Subject: [PATCH 12/22] Fix tyop --- docs/software/communication/nccl.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index e7dc2ff8..fc1c3739 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -14,7 +14,7 @@ When using e.g. 
the `default` view of `prgenv-gnu` the `aws-ofi-nccl` plugin wil Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment. The environment variables described below must still be set to ensure that NCCL uses the plugin. -While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optial performance when using NCCL: +While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL: ```bash export NCCL_NET="AWS Libfabric" # (1) From 4b7ae6b7df1dfdd3bfdbbf1c95c86e497aee26e4 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Fri, 4 Apr 2025 11:20:11 +0200 Subject: [PATCH 13/22] Add more examples and warnings about aws ofi nccl plugin not loading correctly --- docs/software/communication/nccl.md | 38 +++++++++++++++++++++++++++-- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index fc1c3739..0920e81a 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -42,5 +42,39 @@ export MPICH_GPU_SUPPORT_ENABLED=0 # (3) Note that this option may be set to `1` by default on some Alps clusters. See [the Cray MPICH documentation][ref-communication-cray-mpich] for more details on GPU-aware MPI with Cray MPICH. -!!! todo - More options? +!!! warning "`invalid usage` error with `NCCL_NET="AWS Libfabric"`" + If you are getting error messages such as: + ```console + nid006352: Test NCCL failure common.cu:958 'invalid usage (run with NCCL_DEBUG=WARN for details) + ``` + this may be due to the plugin not being found by NCCL. 
+ If this is the case, running the application with the recommended `NCCL_DEBUG=WARN` should print something similar to the following: + ```console + nid006352:34157:34217 [1] net.cc:626 NCCL WARN Error: network AWS Libfabric not found. + ``` + When using uenvs like `prgenv-gnu`, make sure you are either using the `default` view which loads `aws-ofi-nccl` automatically, or, if using the `modules` view, load the `aws-ofi-nccl` module with `module load aws-ofi-nccl`. + If the plugin is found correctly, running the application with `NCCL_DEBUG=INFO` should print: + ```console + nid006352:34610:34631 [0] NCCL INFO Using network AWS Libfabric + ``` + +!!! warning "`NCCL_NET_PLUGIN="ofi"` with uenvs" + When using uenvs, do not set `NCCL_NET_PLUGIN="ofi"` instead of, or in addition to, `NCCL_NET="AWS Libfabric"`. + If you do, your application will fail to start since NCCL will: + + 1. fail to find the plugin because of the name of the shared library in the uenv, and + 2. prefer `NCCL_NET_PLUGIN` over `NCCL_NET`, so it will fail to find the plugin even if `NCCL_NET="AWS Libfabric"` is correctly set. + + When both environment variables are set the error message, with `NCCL_DEBUG=WARN`, will look similar to when the plugin isn't available: + ```console + nid006365:179857:179897 [1] net.cc:626 NCCL WARN Error: network AWS Libfabric not found. + ``` + + With `NCCL_DEBUG=INFO`, NCCL will print: + ```console + nid006365:180142:180163 [0] NCCL INFO NET/Plugin: Could not find: ofi libnccl-net-ofi.so. Using internal network plugin. + ... + nid006365:180142:180163 [0] net.cc:626 NCCL WARN Error: network AWS Libfabric not found. + ``` + + If you only set `NCCL_NET="ofi"`, NCCL may silently fail to load the plugin but fall back to the default implementation. 
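Review note on PATCH 13/22: the diagnostic messages quoted in this patch lend themselves to a quick automated check of a job log. The sketch below assumes the job was run with `NCCL_DEBUG=INFO` and that its output was saved to a file. The function name `nccl_plugin_status` and its output labels are hypothetical, and the matched strings are the ones quoted in the patch, which may differ between NCCL versions.

```bash
# Hypothetical helper: classify an NCCL debug log using the messages quoted in the patch.
# Argument: path to a file holding the job output (run with NCCL_DEBUG=INFO).
nccl_plugin_status() {
    log="$1"
    if grep -q "NCCL WARN Error: network AWS Libfabric not found" "$log"; then
        echo "plugin-not-found"
        return 1
    elif grep -q "NCCL INFO Using network AWS Libfabric" "$log"; then
        echo "plugin-loaded"
    else
        echo "unknown"
        return 2
    fi
}
```

A job script could call `nccl_plugin_status slurm-<jobid>.out` after the run, where the log file name is a placeholder, and abort follow-up steps when the plugin was not picked up.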
From f0b7e1de091c9965b3cb4cf5b9c3637fb5096e60 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Fri, 4 Apr 2025 11:22:26 +0200 Subject: [PATCH 14/22] Fix annotation numbering in NCCL docs --- docs/software/communication/nccl.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index 0920e81a..be7a3c2c 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -24,7 +24,7 @@ export FI_CXI_DEFAULT_TX_SIZE=32768 export FI_CXI_DISABLE_HOST_REGISTER=1 export FI_CXI_RX_MATCH_MODE=software export FI_MR_CACHE_MONITOR=userfaultfd -export MPICH_GPU_SUPPORT_ENABLED=0 # (3) +export MPICH_GPU_SUPPORT_ENABLED=0 # (4) ``` 1. This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. If the plugin can not be found, applications will fail to start. With the default value, applications would instead fall back to e.g. TCP, which would be significantly slower than with the plugin. [More information about `NCCL_NET`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net). From 18aee3f8e56e370046d5e165168d669adc78f634 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Fri, 4 Apr 2025 11:24:21 +0200 Subject: [PATCH 15/22] Add more text about NCCL_NET_PLUGIN --- docs/software/communication/nccl.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index be7a3c2c..c14e0921 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -58,7 +58,8 @@ export MPICH_GPU_SUPPORT_ENABLED=0 # (4) nid006352:34610:34631 [0] NCCL INFO Using network AWS Libfabric ``` -!!! warning "`NCCL_NET_PLUGIN="ofi"` with uenvs" +!!! warning "Do not use `NCCL_NET_PLUGIN="ofi"` with uenvs" + NCCL has an alternative way of specifying what plugin to use: `NCCL_NET_PLUGIN`. 
When using uenvs, do not set `NCCL_NET_PLUGIN="ofi"` instead of, or in addition to, `NCCL_NET="AWS Libfabric"`. If you do, your application will fail to start since NCCL will: From 49af1cc0e025632b83de7f186a0a80e86fe61237 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Fri, 4 Apr 2025 11:25:33 +0200 Subject: [PATCH 16/22] Remove biddisco from communication code owners --- .github/CODEOWNERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 3b243f79..4dc7bab5 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1,5 +1,5 @@ * @bcumming @msimberg @RMeli docs/services/firecrest @jpdorsch @ekouts -docs/software/communication @biddisco @Madeeks @msimberg +docs/software/communication @Madeeks @msimberg docs/software/prgenv/linalg.md @finkandreas @msimberg docs/software/sciapps/cp2k.md @abussy @RMeli From b1e6b3a904677cf0ca4661f643e2a8b412253d44 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Fri, 4 Apr 2025 12:46:03 +0200 Subject: [PATCH 17/22] Update docs/software/communication/libfabric.md Co-authored-by: Rocco Meli --- docs/software/communication/libfabric.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/software/communication/libfabric.md b/docs/software/communication/libfabric.md index 3d0c760d..a8dd80d8 100644 --- a/docs/software/communication/libfabric.md +++ b/docs/software/communication/libfabric.md @@ -19,7 +19,6 @@ For a comprehensive overview libfabric options for the CXI provider (the provide Note that the exact version deployed on Alps may differ, and not all options may be applicable on Alps. See the [Cray MPICH known issues page][ref-communication-cray-mpich-known-issues] for issues when using Cray MPICH together with libfabric. -For example, certain applications may hang at scale unless [the `FI_MR_CACHE_MONITOR=disabled`][ref-communication-cray-mpich-cache-monitor-disable] option is set. !!! todo More options? 
From 36262c8720056441a80464bc0e74fd7ac280239f Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Fri, 4 Apr 2025 12:46:26 +0200 Subject: [PATCH 18/22] Update docs/software/communication/nccl.md Co-authored-by: Rocco Meli --- docs/software/communication/nccl.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index c14e0921..ffd9ae0f 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -12,7 +12,7 @@ With the container engine, the [AWS OFI NCCL hook][ref-ce-aws-ofi-hook] can be u Most uenvs, like [`prgenv-gnu`][ref-uenv-prgenv-gnu], also contain the NCCL plugin. When using e.g. the `default` view of `prgenv-gnu` the `aws-ofi-nccl` plugin will be available in the environment. Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment. -The environment variables described below must still be set to ensure that NCCL uses the plugin. +The environment variables described below must be set to ensure that NCCL uses the plugin. 
While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL: From 20a8b3c9fcd240532a66f7d9c5b93c47b9d6e52e Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Fri, 4 Apr 2025 12:48:12 +0200 Subject: [PATCH 19/22] Update docs/software/communication/nccl.md Co-authored-by: Rocco Meli --- docs/software/communication/nccl.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index ffd9ae0f..ca1c9295 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -17,14 +17,14 @@ The environment variables described below must be set to ensure that NCCL uses t While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL: ```bash -export NCCL_NET="AWS Libfabric" # (1) -export NCCL_NET_GDR_LEVEL=PHB # (2) -export FI_CXI_DEFAULT_CQ_SIZE=131072 # (3) +export NCCL_NET="AWS Libfabric" # (1)! +export NCCL_NET_GDR_LEVEL=PHB # (2)! +export FI_CXI_DEFAULT_CQ_SIZE=131072 # (3)! export FI_CXI_DEFAULT_TX_SIZE=32768 export FI_CXI_DISABLE_HOST_REGISTER=1 export FI_CXI_RX_MATCH_MODE=software export FI_MR_CACHE_MONITOR=userfaultfd -export MPICH_GPU_SUPPORT_ENABLED=0 # (4) +export MPICH_GPU_SUPPORT_ENABLED=0 # (4)! ``` 1. This forces NCCL to use the libfabric plugin, enabling full use of the Slingshot network. If the plugin can not be found, applications will fail to start. With the default value, applications would instead fall back to e.g. TCP, which would be significantly slower than with the plugin. [More information about `NCCL_NET`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net). 
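Reviewer note on the NCCL settings touched by the patches above: the two rules they state (set `NCCL_NET="AWS Libfabric"`, and do not set `NCCL_NET_PLUGIN="ofi"` in a uenv) can be checked mechanically before a job starts. The following is a minimal sketch only. The function name `check_nccl_env` is hypothetical and not part of the documented environment, and it looks only at the two variables named in the patches.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not part of the CSCS docs):
# sanity-check the NCCL-related environment before launching from a uenv.
check_nccl_env() {
    local status=0

    # The patched docs require NCCL_NET="AWS Libfabric" so that NCCL fails
    # to start rather than silently falling back to e.g. TCP.
    if [ "${NCCL_NET:-}" != "AWS Libfabric" ]; then
        echo "warning: NCCL_NET is '${NCCL_NET:-<unset>}', expected 'AWS Libfabric'" >&2
        status=1
    fi

    # The patched docs warn against NCCL_NET_PLUGIN="ofi" with uenvs.
    if [ "${NCCL_NET_PLUGIN:-}" = "ofi" ]; then
        echo "warning: NCCL_NET_PLUGIN=ofi is set; the docs advise against this with uenvs" >&2
        status=1
    fi

    return "$status"
}
```

In a launch script this could be called as `check_nccl_env || exit 1` ahead of the `srun` line, so that a misconfigured environment is reported before any nodes are used.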
From 2b2ba8c7ae99681c040c57d7bf2e80724b0f0643 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Fri, 4 Apr 2025 12:51:05 +0200 Subject: [PATCH 20/22] Update docs/software/communication/openmpi.md Co-authored-by: Rocco Meli --- docs/software/communication/openmpi.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/software/communication/openmpi.md b/docs/software/communication/openmpi.md index c474421f..2b3300d3 100644 --- a/docs/software/communication/openmpi.md +++ b/docs/software/communication/openmpi.md @@ -50,8 +50,8 @@ export OMPI_MCA_mtl="ofi" # (4) To use the LINKx provider, set the following, instead of `FI_PROVIDER=cxi`: ```bash - export FI_PROVIDER="lnx" # (1) - export FI_LNX_PROV_LINKS="shm+cxi" # (2) + export FI_PROVIDER="lnx" # (1)! + export FI_LNX_PROV_LINKS="shm+cxi" # (2)! ``` 1. Use the libfabric LINKx provider, to allow using different libfabric providers for inter- and intra-node communication. From 4ea05bc389d554fe65eacdec5321ab3dd020383f Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Fri, 4 Apr 2025 12:51:35 +0200 Subject: [PATCH 21/22] Update docs/software/communication/openmpi.md --- docs/software/communication/openmpi.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/software/communication/openmpi.md b/docs/software/communication/openmpi.md index 2b3300d3..e5258a9c 100644 --- a/docs/software/communication/openmpi.md +++ b/docs/software/communication/openmpi.md @@ -30,11 +30,10 @@ srun --mpi=pmix ... Additionally, the following environment variables should be set: ```bash -export PMIX_MCA_psec="native" # (1) -export FI_PROVIDER="cxi" # (2) -export OMPI_MCA_pml="^ucx" # (3) -export OMPI_MCA_mtl="ofi" # (4) -``` +export PMIX_MCA_psec="native" # (1)! +export FI_PROVIDER="cxi" # (2)! +export OMPI_MCA_pml="^ucx" # (3)! +export OMPI_MCA_mtl="ofi" # (4)! 1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup. 2. 
Use the CXI (Slingshot) provider. From 4b9a49cd2bc995d1c9092d1328588e2c4bab21b7 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Fri, 4 Apr 2025 12:52:56 +0200 Subject: [PATCH 22/22] Update docs/software/communication/openmpi.md --- docs/software/communication/openmpi.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/communication/openmpi.md b/docs/software/communication/openmpi.md index e5258a9c..4e067dfd 100644 --- a/docs/software/communication/openmpi.md +++ b/docs/software/communication/openmpi.md @@ -37,7 +37,7 @@ export OMPI_MCA_mtl="ofi" # (4)! 1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup. 2. Use the CXI (Slingshot) provider. -3. Use anything except [UCX](https://openucx.org/documentation/) for [point-to-point communication](https://docs.open-mpi.org/en/v5.0.x/mca.html#selecting-which-open-mpi-components-are-used-at-run-time). +3. Use anything except [UCX](https://openucx.org/documentation/) for [point-to-point communication](https://docs.open-mpi.org/en/v5.0.x/mca.html#selecting-which-open-mpi-components-are-used-at-run-time). The `^` signals that OpenMPI should exclude all listed components. 4. Use libfabric for the [Matching Transport Layer](https://docs.open-mpi.org/en/v5.0.x/mca.html#frameworks). !!! info "CXI provider does all communication through the network interface cards (NICs)"