
[Manual] Devito on ARCHER2


Installing Devito on ARCHER2

Before you start, some important reading:
https://docs.archer2.ac.uk/quick-start/quickstart-users/
https://docs.archer2.ac.uk/user-guide/tuning/
https://docs.archer2.ac.uk/user-guide/scheduler/#interconnect-locality
https://docs.archer2.ac.uk/user-guide/energy/

Important: Parallel jobs on ARCHER2 should be run from the work file systems as the home file systems are not available on the compute nodes - you will see a chdir or file not found error if you try to access data on the home file system within a parallel job running on the compute nodes.
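As a quick sanity check, you can confirm that your current directory is on the work file system before launching anything parallel (a minimal sketch):

# Warn if the current directory is not visible from the compute nodes
case "$(pwd -P)" in
  /work/*) echo "OK: on /work, compute nodes can see this path" ;;
  *)       echo "WARNING: not on /work; parallel jobs launched from here will fail" ;;
esac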

# After completing the registration,
# ssh to a login node (your registered private key is needed)
ssh your_username@login.archer2.ac.uk -vv

# Change directory to the work file system
cd /work/"project-number"/"project-number"/"username"/
module load cray-python
# Create a Python 3 virtual environment and activate it
# (the path matches the batch scripts further down this page)
python3 -m venv environments/python3-env
source environments/python3-env/bin/activate

# If Devito is not already cloned:
git clone https://github.com/devitocodes/devito
cd devito
# TODO: build with PrgEnv-gnu
pip3 install -e .


# Load Cray MPICH (see https://docs.nersc.gov/development/programming-models/mpi/cray-mpich/)
module load cray-mpich
# Build mpi4py against Cray MPICH using the Cray compiler wrapper
# (the craype version in the path may differ; `which cc` shows the current wrapper)
env MPICC=/opt/cray/pe/craype/2.7.6/bin/cc pip3 install --force-reinstall --no-cache-dir -r requirements-mpi.txt
# Pin OpenMP threads to physical cores
export OMP_PLACES=cores
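To verify that mpi4py was built against Cray MPICH rather than some other MPI, a quick sanity check (a sketch; the srun line must be run from inside an allocation, not on a login node):

# Show the build-time configuration mpi4py was compiled with
python3 -c "import mpi4py; print(mpi4py.get_config())"
# From within an allocation, print the MPI library version seen at runtime
srun -n 1 python3 -c "from mpi4py import MPI; print(MPI.Get_library_version())"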

Example batch script:

#!/bin/bash

# Slurm job options (job-name, compute nodes, job time)
#SBATCH --job-name=Example_MPI_Job
#SBATCH --time=0:20:0
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16

# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code] 
#SBATCH --partition=standard
#SBATCH --qos=standard


# Set the number of threads to 16 and specify placement
#   There are 16 OpenMP threads per MPI process
#   We want one thread per physical core
export OMP_NUM_THREADS=16
export OMP_PLACES=cores

# Launch the parallel job
#   Using 32 MPI processes
#   8 MPI processes per node
#   16 OpenMP threads per MPI process
#   Additional srun options to pin one thread per physical core
srun --hint=nomultithread --distribution=block:block ./my_mixed_executable.x arg1 arg2
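To submit and monitor the job (standard Slurm usage; the script file name is illustrative):

sbatch example_mpi_job.slurm   # submit the script above
squeue -u $USER                # check the state of your queued and running jobs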

# Interactive alternative: request an allocation (replace the account with your budget code)
salloc --nodes=2 --ntasks-per-node=8 --cpus-per-task=16 --time=01:00:00 --partition=standard --qos=standard --account=d011

# Once inside the interactive job:

# Across the full allocation (all allocated tasks)
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 1024 1024 1024 --tn 512 -so 12 -a aggressive

# On 1 node (8 MPI ranks)
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun -n 8 --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 512 512 512 --tn 100

# On 2 nodes (16 MPI ranks)
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun -n 16 --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 512 512 512 --tn 512 -so 8

Remember to enable autotuning in the runs above; it is very important for performance (see the example below).

Note: autotuning may lead to performance variance from run to run, since the block shapes it selects are not guaranteed to be the same each time.
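For instance, the 1-node run above can be repeated with aggressive autotuning enabled via the benchmark flag already used elsewhere on this page:

# Same 1-node run as above, with aggressive autotuning
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun -n 8 --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 512 512 512 --tn 100 -a aggressive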

For interactive jobs:

https://docs.archer2.ac.uk/user-guide/scheduler/#interactive-jobs

Notes:

# Devito-specific env variables
export DEVITO_ARCH=cray
export DEVITO_LANGUAGE=openmp
export DEVITO_LOGGING=BENCH
export DEVITO_MPI=1
export DEVITO_PROFILING=advanced2

# ARCHER2-specific
export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_RMA_STARTUP_CONNECT=1
export FI_OFI_RXM_SAR_LIMIT=524288
export FI_OFI_RXM_BUFFER_SIZE=131072
export MPICH_SMP_SINGLE_COPY_SIZE=16384
export CRAY_OMP_CHECK_AFFINITY=TRUE
export SLURM_CPU_FREQ_REQ=2250000
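To keep interactive and batch runs consistent, the exports above can be kept in a small file and sourced where needed (a sketch; the file name is illustrative):

# Save the exports above as devito_archer2_env.sh, then in job scripts or interactive sessions:
source /work/"project-number"/"project-number"/"username"/devito_archer2_env.sh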

TO TRY:

module swap craype-network-ofi craype-network-ucx 
module swap cray-mpich cray-mpich-ucx 
For example, to place processes sequentially on nodes but round-robin on the 16-core NUMA regions in a single node, you would use the --distribution=block:cyclic option to srun. This type of process placement can be beneficial when a code is memory bound.
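For example, within an allocation (a sketch based on the interactive commands above):

# Round-robin MPI ranks over the NUMA regions within each node
srun -n 16 --distribution=block:cyclic --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 512 512 512 --tn 100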
Implications
An MPI program will typically run faster if MPI_Send is implemented asynchronously using the eager protocol, since synchronisation between sender and receiver is much reduced.
However, you should never assume that MPI_Send buffers your message, so if you have concerns about deadlock you will need to use the non-blocking variant MPI_Isend to guarantee that the send routine returns control to you immediately, even if there is no matching receive.
It is not enough to say that deadlock is only an issue in principle and that, because the code runs OK on your laptop, there is no problem in practice: the eager limit is system-dependent, so the fact that a message happens to be buffered on your laptop is no guarantee it will be buffered on ARCHER2.
To check that you have a correct code, replace all instances of MPI_Send / MPI_Isend with MPI_Ssend / MPI_Issend, as sketched below. A correct MPI program should still run correctly when all references to standard send are replaced by synchronous send (since MPI is allowed to implement standard send as synchronous send).
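A minimal sketch of that check for a hypothetical C code base (directory names are illustrative; this applies to hand-written MPI code rather than anything Devito generates for you):

# Work on a copy of the sources, swapping standard sends for synchronous sends
cp -r src src-ssend-check
find src-ssend-check -name '*.c' -exec sed -i 's/\bMPI_Isend\b/MPI_Issend/g; s/\bMPI_Send\b/MPI_Ssend/g' {} +
# Rebuild and rerun: a correct program must still complete without deadlocking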

ARCHER2 after the upgrade (June 2023): example batch script

#!/bin/bash

# Slurm job options (job-name, compute nodes, job time)
#SBATCH --job-name=Devito_MPI_Job
#SBATCH --time=1:20:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --switches=1@360 # Each group has 128 nodes

# Replace the account below with your own project code (e.g. t01)
#SBATCH --account=n03-devito
#SBATCH --partition=standard
#SBATCH --qos=standard
#SBATCH -o ./jobs-full/output-4-full.%j.out # STDOUT

# Propagate the cpus-per-task setting from script to srun commands
#    By default, Slurm does not propagate this setting from the sbatch
#    options to srun commands in the job script. If this is not done,
#    process/thread pinning may be incorrect leading to poor performance
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

module load cray-python
module load cray-mpich
source environments/python3-env/bin/activate
cd devito

# Set the number of threads to 16 and specify placement
#   There are 16 OpenMP threads per MPI process
#   We want one thread per physical core
export OMP_NUM_THREADS=16
export OMP_PLACES=cores

# Devito-specific env variables
export DEVITO_ARCH=cray
export DEVITO_LANGUAGE=openmp
export DEVITO_LOGGING=DEBUG
export DEVITO_MPI=full
export DEVITO_PROFILING=advanced2

# ARCHER2-specific
export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_RMA_STARTUP_CONNECT=1
export FI_OFI_RXM_SAR_LIMIT=524288
export FI_OFI_RXM_BUFFER_SIZE=131072
export MPICH_SMP_SINGLE_COPY_SIZE=16384
export CRAY_OMP_CHECK_AFFINITY=TRUE
export SLURM_CPU_FREQ_REQ=2250000

# Launch the parallel job
#   Using nodes x ntasks-per-node MPI processes
#   8 MPI processes per node
#   16 OpenMP threads per MPI process
#   Additional srun options to pin one thread per physical core
srun --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 1024 1024 1024 --tn 512 -so 4 -a aggressive
srun --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 1024 1024 1024 --tn 512 -so 8 -a aggressive
srun --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 1024 1024 1024 --tn 512 -so 12 -a aggressive
srun --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 1024 1024 1024 --tn 512 -so 16 -a aggressive

srun --distribution=block:block --hint=nomultithread python examples/seismic/elastic/elastic_example.py -d 1024 1024 1024 --tn 512 -so 4 -a aggressive
srun --distribution=block:block --hint=nomultithread python examples/seismic/elastic/elastic_example.py -d 1024 1024 1024 --tn 512 -so 8 -a aggressive
srun --distribution=block:block --hint=nomultithread python examples/seismic/elastic/elastic_example.py -d 1024 1024 1024 --tn 512 -so 12 -a aggressive
srun --distribution=block:block --hint=nomultithread python examples/seismic/elastic/elastic_example.py -d 1024 1024 1024 --tn 512 -so 16 -a aggressive

srun --distribution=block:block --hint=nomultithread python examples/seismic/tti/tti_example.py -d 1024 1024 1024 --tn 512 -so 4 -a aggressive
srun --distribution=block:block --hint=nomultithread python examples/seismic/tti/tti_example.py -d 1024 1024 1024 --tn 512 -so 8 -a aggressive
srun --distribution=block:block --hint=nomultithread python examples/seismic/tti/tti_example.py -d 1024 1024 1024 --tn 512 -so 12 -a aggressive
srun --distribution=block:block --hint=nomultithread python examples/seismic/tti/tti_example.py -d 1024 1024 1024 --tn 512 -so 16 -a aggressive
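Before submitting, make sure the directory used by the #SBATCH -o option exists, since Slurm will not create it for you (assuming the script is submitted from the project work directory, as above; the script file name is illustrative):

mkdir -p jobs-full               # output directory referenced by #SBATCH -o
sbatch devito_benchmarks.slurm   # submit the script above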

Roofline:

module load amd-uprof

AMDuProfPcm roofline -X -o /tmp/myroof.csv -- /usr/bin/srun --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 1024 1024 1024 --tn 512 -so 4 -a aggressive

AMDuProf is problematic as it requires access to hardware performance counters, which we cannot get on ARCHER2 since no root access is available.

# Alternative: LIKWID performance counters (MEM_SP group) under MPI
OMP_NUM_THREADS=4 DEVITO_MPI=1 DEVITO_JIT_BACKDOOR=0 DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG mpirun -n 2 likwid-perfctr -c L:N:0 -g MEM_SP -o test_%h_%p.txt python examples/seismic/acoustic/acoustic_example.py -d 200 200 200 --tn 200 -so 16


# The same measurement driven through likwid-mpirun
OMP_NUM_THREADS=4 DEVITO_MPI=1 DEVITO_JIT_BACKDOOR=0 DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG likwid-mpirun -n 2 -g MEM_SP python examples/seismic/acoustic/acoustic_example.py -d 200 200 200 --tn 200 -so 16

Trying with ERT (Empirical Roofline Tool); see: https://github.com/hpc-uk/build-instructions/blob/main/utils/ERT/install_ert_1.1.0_archer2.md

