2 changes: 1 addition & 1 deletion docs/running/slurm.md
@@ -9,21 +9,21 @@

<div class="grid cards" markdown>

- :fontawesome-solid-mountain-sun: __Configuring jobs__

Specific guidance for configuring Slurm jobs on different node types.

[:octicons-arrow-right-24: GH200 nodes (Daint, Clariden, Santis)][ref-slurm-gh200]

[:octicons-arrow-right-24: AMD CPU-only nodes (Eiger)][ref-slurm-amdcpu]

- :fontawesome-solid-mountain-sun: __Node sharing__

    Guides on how to effectively use all resources on nodes by running more than one job per node.

[:octicons-arrow-right-24: Node sharing][ref-slurm-sharing]

[:octicons-arrow-right-24: Multiple MPI jobs per node][ref-slurm-exclusive]

</div>

@@ -68,7 +68,7 @@
!!! note
The flags `--account` and `-Cmc` that were required on the old [Eiger][ref-cluster-eiger] cluster are no longer required.

## Prioritization and scheduling

Job priorities are determined based on each project's resource usage relative to its quarterly allocation, as well as in comparison to other projects.
An aging factor is also applied to each job in the queue to ensure fairness over time.
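
To see how these factors contribute to a pending job's priority, Slurm's `sprio` utility (where exposed on the cluster) reports the per-factor breakdown:

```console
$ sprio -l -u $USER    # long format: one row per pending job, with age and fair-share columns
```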
@@ -138,7 +138,7 @@

3. Enable CUDA support on systems that provide NVIDIA GPUs.

4. Enable ROCm support on systems that provide AMD GPUs.

The build generates the following executables:

@@ -219,7 +219,7 @@

1. Test GPU affinity: note how all 4 ranks see the same 4 GPUs.

2. Test GPU affinity: note how the `--gpus-per-task=1` parameter assigns a unique GPU to each rank.

!!! info "Quick affinity checks"

@@ -242,7 +242,7 @@

The [GH200 nodes on Alps][ref-alps-gh200-node] have four GPUs per node, and Slurm job submissions must be configured appropriately to best make use of the resources.
Applications that can saturate the GPUs with a single process per GPU should generally prefer this mode.
[Configuring Slurm jobs to use a single GPU per rank][ref-slurm-gh200-single-rank-per-gpu] is also the most straightforward setup.

Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process Service (MPS)] to oversubscribe GPUs with multiple ranks per GPU.

The best Slurm configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
@@ -254,12 +254,12 @@
Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously.
This also means that different ranks on the same node can inadvertently use the same GPU, leading to suboptimal performance or unused GPUs, rather than job failures.

Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][ref-slurm-gh200-multi-rank-per-gpu] in these cases.

If you are unsure which GPU is being used by a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. the `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
If the variable is unset or empty, all GPUs are visible to the rank, and the rank will in most cases use only the first GPU.
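
A quick way to do this without a helper program is to echo the variables from each rank (a sketch for a 4-rank job on a single GH200 node):

```console
$ srun -N1 -n4 --gpus-per-task=1 bash -c 'echo "node $SLURM_NODEID rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```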

[](){#ref-slurm-gh200-single-rank-per-gpu}

### One rank per GPU

Configuring Slurm to use one GH200 GPU per rank is most easily done using the `--ntasks-per-node=4` and `--gpus-per-task=1` Slurm flags.
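
In a batch script this configuration might look as follows (a sketch; `./myapp` is a placeholder for your executable):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1

# 2 nodes x 4 ranks per node = 8 ranks, each assigned its own GH200 GPU
srun ./myapp
```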
@@ -278,7 +278,7 @@

Omitting the `--gpus-per-task` results in `CUDA_VISIBLE_DEVICES` being unset, which will lead to most applications using the first GPU on all ranks.

[](){#ref-slurm-gh200-multi-rank-per-gpu}

### Multiple ranks per GPU

Using multiple ranks per GPU can improve the performance of applications that don't generate enough work for a GPU with a single rank, or that scale badly to all 72 cores of the Grace CPU.
@@ -347,13 +347,13 @@
[](){#ref-slurm-amdcpu}
## AMD CPU nodes

Alps has nodes with two AMD EPYC Rome CPU sockets per node for CPU-only workloads, most notably in the [Eiger][ref-cluster-eiger] cluster provided by the [HPC Platform][ref-platform-hpcp].

For a detailed description of the node hardware, see the [AMD Rome node][ref-alps-zen2-node] hardware documentation.

??? info "Node description"
- The node has 2 x 64 core sockets
- Each socket is divided into 4 NUMA regions

    - The 16 cores in each NUMA region have faster access to their own 32 GB of memory

- Each core has two processing units (PUs)

![Screenshot](../images/slurm/eiger-topo.png)
@@ -401,7 +401,7 @@
srun --nodes=4 --ntasks-per-node=2
```

It is often more efficient to only run one task per core instead of the default two PU, which can be achieved using the `--hint=nomultithreading` option.
It is often more efficient to only run one task per core instead of the default two PU, which can be achieved using the `--hint=nomultithread` option.
```console title="One MPI rank per socket with 1 PU per core"
$ srun -n2 -N1 -c64 --hint=nomultithread ./affinity.mpi
affinity test for 2 MPI ranks
@@ -413,8 +413,8 @@
The best configuration for performance is highly application specific, with no one-size-fits-all configuration.
Take the time to experiment with `--hint=nomultithread`.

Memory on the node is divided into NUMA (non-uniform memory access) regions.

The 256 GB of a standard-memory node are divided into 8 NUMA nodes of 32 GB, with 16 cores associated with each node:


* memory access is optimal when all the cores of a rank are on the same NUMA node;
* memory access to NUMA regions on the other socket is significantly slower.
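
One common layout that respects these constraints is one rank per NUMA region, i.e. 8 ranks of 16 cores each on a node (a sketch; `./myapp` is a placeholder for your executable):

```console
$ srun --nodes=1 --ntasks-per-node=8 --cpus-per-task=16 --hint=nomultithread ./myapp
```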
@@ -491,7 +491,7 @@
In the above examples all threads on each rank can run on any of that rank's cores -- we are effectively allowing the OS to schedule the threads on the available set of cores as it sees fit.
This often gives the best performance; however, it is sometimes beneficial to bind threads to explicit cores.

The OpenMP threading runtime provides additional options for controlling the pinning of threads to the cores assigned to each MPI rank.

Use the `--omp` flag with `affinity.mpi` to get more detailed information about OpenMP thread affinity.
For example, four MPI ranks on one node with four cores and four OpenMP threads:
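
A sketch of such a run, using the standard OpenMP `OMP_NUM_THREADS`, `OMP_PROC_BIND`, and `OMP_PLACES` variables to pin each thread to its own core (output omitted, as it varies by node):

```console
$ OMP_NUM_THREADS=4 OMP_PROC_BIND=close OMP_PLACES=cores srun -N1 -n4 -c4 ./affinity.mpi --omp
```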