# Using HIP on Setonix

## Introduction 

Setonix is a world-class supercomputer, delivering over 27 petaflops of floating point performance using AMD EPYC CPUs and Instinct MI250X GPUs. As of November 2023 the GPU nodes of Setonix sit in place 25 on the [TOP 500](https://top500.org/system/180123/) list of the world's most powerful computers, and the CPU nodes sit at place 462.

## Official documentation

The [Pawsey Documentation Portal](https://support.pawsey.org.au/documentation/) should be your first port of call when looking for documentation. That source must take priority if there is any discrepancy between the official documentation and this material. This [page](https://support.pawsey.org.au/documentation/display/US/Setonix+GPU+Partition+Quick+Start) has specific documentation for using GPUs on Setonix.

## Access to Setonix

Firstly, you need a username and password to access Setonix. Your **username** and **password** will be given to you prior to the beginning of this workshop. If you are using your regular Pawsey account then you can reset your password [here](https://support.pawsey.org.au/password-reset/).

Access to Setonix is via Secure SHell (SSH). On Linux, macOS, and Windows 10 and higher, an SSH client is available from the command line or terminal application. Otherwise you need to use a client program like [PuTTY](https://www.putty.org/) or [MobaXterm](https://mobaxterm.mobatek.net/download-home-edition.html).

### Access with SSH on the command line

On the command line use **ssh** to access Setonix.

```bash
ssh -Y <username>@setonix.pawsey.org.au
```

#### Passwordless login with SSH

To avoid typing a password on each login you can generate a keypair on your computer, like this:

```bash
ssh-keygen -t rsa
```

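Then copy the public key (the file that ends in `.pub`) to your account on Setonix and append it to the `authorized_keys` file in `${HOME}/.ssh`. On your machine run this command, taking care to keep the trailing colon so that the file is copied to your home directory on Setonix rather than to a local file:

```bash
scp <filename>.pub <username>@setonix.pawsey.org.au:
```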

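Then log in to Setonix and run these commands:

```bash
mkdir -p ${HOME}/.ssh
cat <filename>.pub >> ${HOME}/.ssh/authorized_keys
# The directory needs the execute bit to be entered; keep both private to you
chmod 700 ${HOME}/.ssh
chmod 600 ${HOME}/.ssh/authorized_keys
```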

Then you can run 

```bash
ssh <username>@setonix.pawsey.org.au
```

without a password.
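
As an optional convenience, a host alias in `~/.ssh/config` on your own machine lets you type `ssh setonix` instead of the full address. The sketch below only prints a stanza for you to review and paste. The alias `setonix`, the key file name `~/.ssh/id_rsa` (the default for `ssh-keygen -t rsa`), and the 60 second keepalive are illustrative assumptions, not Pawsey requirements.

```bash
# Placeholders: replace with your own details before using the stanza.
SETONIX_USER="<username>"
KEY_FILE='~/.ssh/id_rsa'   # assumed key name; ssh expands the ~ itself

# Build the stanza as a string so it can be reviewed before it is saved.
stanza="Host setonix
    HostName setonix.pawsey.org.au
    User ${SETONIX_USER}
    IdentityFile ${KEY_FILE}
    ServerAliveInterval 60"

# Print it; paste the result into ~/.ssh/config on your own machine.
printf '%s\n' "${stanza}"
```

With the stanza saved, `ssh setonix` should behave like the full command above.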

### Access from Windows with the MobaXterm client

Windows 11 and recent builds of Windows 10 have an ssh client that is accessible from the command prompt application. If you can't use `ssh` from the command line then just download **MobaXterm Home (Portable Edition)** from [this location](https://mobaxterm.mobatek.net/download-home-edition.html). Extract the Zip file and run the application. You might need to accept a firewall notification. 

Now go to **Settings -> SSH** and uncheck **"Enable graphical SSH-browser"** in the SSH-browser settings pane. Also enable **"SSH keepalive"** to keep SSH connections active.

<figure style="margin-bottom: 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:100%">
    <img style="vertical-align:middle" src="../images/MobaXTerm_Settings.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: MobaXTerm settings.</figcaption>
</figure>

Close the MobaXTerm settings and start a local terminal.

## Hardware environment on Setonix

On Setonix there are two main kinds of compute nodes:

* CPU nodes with 2 sockets and 128 cores, 256 threads.
* GPU nodes with 1 CPU socket with 64 cores, 128 threads, and 4 MI250X GPU sockets. Each MI250X GPU socket contains two GPU compute devices.

### CPU nodes

CPU nodes are based on the AMD<span>&trade;</span> EPYC<span>&trade;</span> 7763 processor in a dual-socket configuration. Each processor has a multi-chip design with 8 chiplets (Core CompleXes). Shown below is a near infrared image of an EPYC processor, showing 8 chiplets and an IO die.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:50%;">
    <img src="images/EPYC_7702_delidded.jpg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Near infrared photograph of a de-lidded AMD EPYC CPU with chiplets and IO die. Image credit: <a href="https://commons.wikimedia.org/wiki/File:AMD_Epyc_7702_delidded.jpg">Wikipedia.</a> </figcaption>
</figure>

Each chiplet has 8 cores, and these cores share access to a 32 MB L3 cache. Every core has its own L1 and L2 cache, provides 2 hardware threads, and has access to SIMD units that can perform floating point math on vectors up to 256 bits (8x32-bit floats) wide in a single clock cycle. There are 16 hardware threads available per chiplet. Since every processor has 8 chiplets, there are a total of 64 cores and 128 threads per processor, and 128 cores and 256 threads per node. Here is some cache and performance information for the AMD EPYC 7763 CPU.

| Node | CPU | Base clock freq(GHz) | Peak clock freq (GHz) | Cores | Hardware threads | L1 Cache (KB) | L2 Cache (KB) | L3 cache (MB) | FP SIMD width (bits) | Peak TFLOPs (FP32) |
|:----:|:----:|-----:| -----: | -----: | :----: | :----: | :----: | :----: | :----: | :---: |
| CPU |AMD EPYC 7763 | 2.45 | 3.50 | 64 | 128 | 64x32 | 64x512 | 8x32 | 256 | ~1.79 |

Below is an image of a CPU compute blade on Setonix. In this shot there are 8 CPU heatsinks, for a total of four nodes per blade.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="images/cpu_blade.jpg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">A CPU blade on Setonix, showing four compute nodes per blade. Each compute node has two CPU sockets.</figcaption>
</figure>
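
The peak FP32 figure in the CPU table above appears consistent with a simple product of cores, peak clock, and SIMD lanes. The sketch below reproduces that arithmetic. The assumption of one FP32 operation per lane per clock cycle is ours, chosen because it matches the tabulated value; real throughput depends on the instruction mix.

```bash
# Values taken from the AMD EPYC 7763 table above.
cores=64
peak_clock_ghz=3.50
simd_bits=256
lanes=$(( simd_bits / 32 ))   # 8 x 32-bit floats per SIMD vector

# Assumption: one FP32 operation per lane per clock cycle.
peak_tflops=$(awk -v c="${cores}" -v f="${peak_clock_ghz}" -v l="${lanes}" \
    'BEGIN { printf "%.2f", c * f * l / 1000 }')
echo "Estimated peak: ${peak_tflops} TFLOPs (FP32) per processor"
```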

### GPU nodes

GPU nodes on Setonix have one AMD 7A53 'Trento' CPU processor and **four** MI250X GPU processors. The CPU is a specially-optimized version of the EPYC processor used in the CPU nodes, but otherwise has the same design and architecture. The Instinct<span>&trade;</span> MI250X processor is also a Multi-Chip Module (MCM) design, with two GPU dies (otherwise known as Graphics Compute Dies) per processor, as shown below.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="../images/MI250x.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">AMD Instinct<span>&trade;</span> MI250X compute architecture, showing two GPU devices per processor. Image credit: <a href="https://hc34.hotchips.org/">AMD Instinct<span>&trade;</span> MI200 Series Accelerator and Node Architectures | Hot Chips 34</a></figcaption>
</figure>

Each of the two GCDs in an MI250X appears to HIP and SLURM as an **individual compute device** with its own 64 GB of global memory and 8 MB of L2 cache. Since there are four MI250Xs, **there are a total of 8 GPU compute devices visible to HIP per GPU node**. Each compute device has 110 **compute units**, and every compute unit executes instructions over a bank of 4 Execution Units, each with 16 shader cores. The Execution Units in a compute unit share a 16 KB L1 cache and 64 KB of shared memory in the Local Data Share, as seen below:

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="images/Setonix-GPU-Compute-Unit.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Close-up of an AMD Instinct MI250X compute unit.</figcaption>
</figure>

Each Execution Unit of 16 shader cores is responsible for progressing the 64 threads of a wavefront over four clock cycles, and can have up to 8 wavefronts active (ready to execute instructions). Therefore each compute unit can execute four wavefronts over four clock cycles and have up to 32 wavefronts active at full occupancy.
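
The occupancy figures above follow from a few multiplications. The sketch below restates them using only the numbers quoted in this section; the variable names are ours.

```bash
execution_units=4        # Execution Units per compute unit
shader_cores=16          # shader cores per Execution Unit
wavefront_size=64        # threads per wavefront
active_per_unit=8        # active wavefronts per Execution Unit
compute_units=110        # compute units per compute device (GCD)

cycles_per_wavefront=$(( wavefront_size / shader_cores ))
wavefronts_per_cu=$(( execution_units * active_per_unit ))
threads_per_cu=$(( wavefronts_per_cu * wavefront_size ))
threads_per_device=$(( threads_per_cu * compute_units ))

echo "Clock cycles to progress one wavefront:   ${cycles_per_wavefront}"
echo "Active wavefronts per compute unit:       ${wavefronts_per_cu}"
echo "Active threads per compute unit:          ${threads_per_cu}"
echo "Active threads per compute device:        ${threads_per_device}"
```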

One notable feature of these compute units is that both 64-bit and 32-bit floating point instructions are executed natively **at the same rate**. The main performance cost of 64-bit arithmetic is therefore the extra bandwidth needed to move 64-bit numbers around. Below is a table of performance numbers for each of the four dual-GPU MI250X processors in a GPU node.

| Card | Boost clock (GHz)| Compute Units | FP32 Processing Elements | FP64 Processing Elements (equivalent compute capacity) | L1 Cache (KB) | LDS (shared mem, KB)  | L2 Cache (MB) | device memory (GB) | Peak Tflops (FP32)| Peak Tflops (FP64)|
|:----:|:-----| :----- | :---- | :----- | :---- | :---- | :---- | :---- | :---- | :---- |
| AMD Instinct MI250X |1.7 | 2x110 | 2x7040 | 2x7040 | 2x110x16 | 2x110x64 | 2x8 | 2x64 | 47.9 | 47.9 |

Below is an installation image of a GPU compute blade with two nodes. Each node has 1 CPU socket and four GPU sockets.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="images/gpu_blade.jpg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">A GPU blade on Setonix, showing two GPU nodes, each node has one CPU socket and four GPU sockets.</figcaption>
</figure>
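
As a cross-check of the GPU table above, the peak figure can be approximately recovered from the compute unit count and the boost clock. This sketch assumes each processing element retires one fused multiply-add (counted as two floating point operations) per clock cycle. That assumption is ours, chosen because it reproduces the tabulated number.

```bash
gcds=2                   # compute devices (GCDs) per MI250X
compute_units=110        # compute units per GCD
elements_per_cu=64       # 4 Execution Units x 16 shader cores
boost_clock_ghz=1.7
flops_per_cycle=2        # assumption: one fused multiply-add per cycle

elements=$(( gcds * compute_units * elements_per_cu ))   # 2 x 7040
peak_tflops=$(awk -v n="${elements}" -v f="${boost_clock_ghz}" -v k="${flops_per_cycle}" \
    'BEGIN { printf "%.1f", n * f * k / 1000 }')
echo "Processing elements per MI250X: ${elements}"
echo "Estimated peak: ${peak_tflops} TFLOPs per MI250X"
```

Because 32-bit and 64-bit instructions execute at the same rate, the same estimate applies to both precisions.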

## Job queues

On Setonix the command `scontrol show partitions` shows the job queues that are available and how many nodes they contain. The following CPU queues are available for general use. 

|Queue| Max time limit| Nodes | CPU cores per node  | Available host memory per node (GB) |
| --- | --- | --- | --- | :---: |
| work | 24 hours | 1396 | 128 | 245 |
| long | 96 hours | 8 | 128 | 245 | 
| copy | 48 hours | 4 | 32 | 120 |
| debug | 1 hour | 8 | 128 | 245 |
| highmem | 96 hours | 8 | 128 | 1020 | 

A special account is needed to access the `gpu` queues. This will usually be your project name followed by the suffix **-gpu**. Here are the GPU queues available for general use.

|Queue| Max time limit| Nodes | CPU cores per node  | Available host memory per node (GB) | Number of HIP devices | Memory per HIP device (GB) |
| --- | --- | --- | --- | :---: | --- | --- | 
| gpu | 24 hours | 134 | 64 | 245 | 8 | 64 |
| gpu-dev | 4 hours | 20 | 64 | 245 | 8 | 64 |
| gpu-highmem | 24 hours | 38 | 64 | 490 | 8 | 64 |

## Interactive jobs on GPU nodes

When compiling software or running test jobs on Setonix it is sometimes helpful to have interactive access to a GPU node. Jobs in the GPU queues are charged to a separate GPU account, as described above. The following command reserves an `allocation pack` consisting of a GPU, a connected chiplet of 8 cores, and one eighth of the available memory on a node, for interactive use. You can use this to compile software and run interactive jobs on a GPU node of Setonix, but for the workshop you might need to use the `salloc` command in the welcome letter.
```bash
salloc --account=${PAWSEY_PROJECT}-gpu --nodes=1 --time=2:00:00 --gres=gpu:1 --partition=gpu-dev
```

## Building software for Setonix

The main complexity with building HIP-enabled applications on Setonix arises when you also need support for MPI. Otherwise you can simply load the **rocm** module and use **hipcc**. Here are some suggested workflows if you need MPI support.

### Software modules


#### Programming environment

There are three main programming environments available on Setonix. Each provides C/C++ and Fortran compilers that build software with knowledge of the MPI libraries available on Setonix. The **PrgEnv-gnu** programming environment loads the GNU compilers for best software compatibility, the **PrgEnv-aocc** environment loads the AMD **aocc** optimising compiler to try to get the best performance from the AMD CPUs on Setonix, and the **PrgEnv-cray** environment loads the well-supported compilers from Cray. Use these commands to find which module to load.

| Programming environment | command to use |
| :--- | :--- |
| AMD | ```module avail PrgEnv-aocc``` |
| Cray | ```module avail PrgEnv-cray``` |
| GNU | ```module avail PrgEnv-gnu``` |

When compiling C/C++ HIP sources you have the choice of either the ROCm **hipcc** compiler wrapper or the Cray compiler wrapper **CC** from **PrgEnv-cray**. If you use the Cray compiler wrapper you need to swap to the module **PrgEnv-cray**, as the GNU programming environment (**PrgEnv-gnu**) is loaded by default.

```bash
module swap PrgEnv-gnu PrgEnv-cray
```

Then the following compiler wrappers are available for use to compile source files:

| Command | Explanation |
| :--- | :--- |
| cc | C compiler |
| CC | C++ compiler |
| ftn | Fortran compiler |

In order to use a GPU-aware MPI library from Cray you also need to load the **craype-accel-amd-gfx90a** module, which is available in all three programming environments.  Load the module with this command.

```bash
module load craype-accel-amd-gfx90a
```

Then set this environment variable to enable GPU support with MPI.

```bash
export MPICH_GPU_SUPPORT_ENABLED=1
```

#### ROCM libraries

Finally, in order to have ROCm software (such as hipcc and rocgdb) and libraries available you need to load the **rocm/5.2.3** module.

```bash
module load rocm/5.2.3
```

The default **rocm** module (version 5.2.3) is independent of the programming environment and is the production version of ROCM on Setonix. There is also an experimental module `rocm/5.4.3` but we don't use it for most of the course, because it has a build error when building HIP sources with OpenMP.
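
Before compiling, it can help to confirm that the environment variables used by the compile and link flags later in this section are actually set. The following is a minimal sketch of such a check. The helper name `check_var` is ours, and since this material does not say which module sets which variable, the hint in the message is deliberately generic.

```bash
# Report whether an exported environment variable is set and non-empty.
check_var() {
    name="$1"
    if value=$(printenv "${name}") && [ -n "${value}" ]; then
        echo "ok:      ${name}=${value}"
    else
        echo "missing: ${name} (is the matching module loaded?)"
        return 1
    fi
}

# Variables referenced by the flags and settings in this section.
missing=0
for name in MPICH_DIR CRAY_MPICH_ROOTDIR ROCM_PATH MPICH_GPU_SUPPORT_ENABLED; do
    check_var "${name}" || missing=$(( missing + 1 ))
done
echo "${missing} variable(s) missing"
```

On a login or compute node with the modules above loaded, one would hope to see every line reported as `ok`.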

#### Omnitrace support

[Omnitrace](https://github.com/AMDResearch/omnitrace) is a tool for using rocprof to collect **traces**, or information on **when** an application component starts using compute resources, and **for how long** it uses those resources. After loading rocm you will need this module loaded to access the Omnitrace tools.

```bash
module load omnitrace/1.10.2
```

#### Omniperf support

[Omniperf](https://github.com/AMDResearch/omniperf) is a tool to make low level information collected by **rocprof** accessible. It can perform feats like creating [roofline models](https://en.wikipedia.org/wiki/Roofline_model) of how well your kernels are performing in relation to the theoretical capability of the compute hardware. The following command will help you access the experimental Omniperf tools.

```bash
module load omniperf/1.0.6
```

### Compiling software with HIP and MPI support

According to this [documentation](https://docs.amd.com/bundle/HIP-Programming-Guide-v5.0/page/Transitioning_from_CUDA_to_HIP.html), the AMD compiler wrapper **hipcc** can be used for compiling C/C++ and HIP source files and is the suggested linker for program objects. The Cray C++ compiler **also** has the ability to compile HIP source code through adding the compiler option `-x hip` to CC, but you need to have the **PrgEnv-cray** environment loaded in order for this to work.

To give the best chance of avoiding compiler issues it is **best practice to compile from a GPU node**, either from a batch or interactive job. If you use **hipcc** to compile HIP source, then you can use another compiler to compile other sources and then use **hipcc** to link them. Note that for Setonix you need to pass the `--offload-arch=gfx90a` flag to `hipcc`.

#### Option 1: Compiling and linking MPI sources with **hipcc**

You can use these compiler flags with **hipcc** to bring in the MPI headers and make sure **hipcc** compiles kernels for the MI250X architecture on Setonix. These flags work with **hipcc** in all of the programming environments. 

| Function | flags |
| :--- | :--- |
| Compile | ```-fPIC -I${MPICH_DIR}/include --offload-arch=gfx90a``` |
| Link (with MPI) | ```-L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa``` |
| Debug (compile and link) | ```-g -ggdb -O0 --offload-arch=gfx90a``` |
| OpenMP (compile and link) | ```-fopenmp``` (only supported by the rocm/5.2.3 module at present) |

If you want **hipcc** to behave like the compiler wrapper **CC** from your chosen programming environment then make sure the **craype-accel-amd-gfx90a** module is also loaded. Then add the output of this command,

```bash
$(CC --cray-print-opts=cflags)
```

to the hipcc compile flags, and the output of this command,

```bash
$(CC --cray-print-opts=libs)
```

to the hipcc linker flags.
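
One way to keep these flags manageable is to collect them in shell variables. The sketch below is a dry run that only prints the resulting commands. The variable names `HIP_CXXFLAGS` and `HIP_LDFLAGS`, the source file `example.cpp`, and the placeholder paths used when the MPI variables are unset are all illustrative.

```bash
# Fall back to placeholder paths so the sketch also runs off Setonix.
mpich_dir="${MPICH_DIR:-/path/to/mpich}"
mpich_root="${CRAY_MPICH_ROOTDIR:-/path/to/cray-mpich}"

# Compile and link flags from the table in this section.
HIP_CXXFLAGS="-fPIC -I${mpich_dir}/include --offload-arch=gfx90a"
HIP_LDFLAGS="-L${mpich_dir}/lib -lmpi -L${mpich_root}/gtl/lib -lmpi_gtl_hsa"

# Dry run: print the commands rather than executing them.
echo "hipcc ${HIP_CXXFLAGS} -c example.cpp -o example.o"
echo "hipcc example.o -o example ${HIP_LDFLAGS}"
```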

#### Option 2: Compiling and linking with the Cray **CC** compiler wrapper 

If you are using the C++ compiler wrapper **CC** from the **PrgEnv-cray** environment you can add these flags to compile and link HIP code for the MI250X GPUs on Setonix.

| Function | flags |
| :--- | :--- |
| Compile | ```-x hip -fPIC -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a --rocm-path=${ROCM_PATH}``` |
| Link | ```--rocm-path=${ROCM_PATH} -L${ROCM_PATH}/lib -lamdhip64``` |
| Debug (compile and link) | ```-g``` |
| OpenMP (compile and link) | Not supported when ```-x hip``` is used; for non-HIP sources you can use ```-fopenmp``` |

The `-fPIC` flag is useful for sources that are part of shared libraries, but potentially has some overhead. Sometimes you also need this flag if you get relocation errors when linking. This [page](https://support.hpe.com/hpesc/public/docDisplay?docId=a00115299en_us&page=HIP_Support_and_Options.html) from the HPE documentation explains what the other compiler options are for. The flag `-x hip` informs `CC` that the file is HIP source. The option `-D__HIP_ROCclr__` is necessary to use the ROCm Common Language Runtime interface, and the flags `-D__HIP_ARCH_GFX90A__=1` and `--offload-arch=gfx90a` enable specific settings and device code for the `gfx90a` architecture in the MI250X GPUs. Finally, note that OpenMP is **not supported** by the Cray compiler wrapper for HIP source files (those compiled with `-x hip`).

##### Mixing hipcc and Cray compilation

According to this [documentation](https://docs.amd.com/bundle/HIP-Programming-Guide-v5.0/page/Transitioning_from_CUDA_to_HIP.html), whenever you mix compilers it is important to ensure that **all code** links to the same C++ standard libraries. The command ```hipconfig --cxx``` generates extra compile flags that might be useful to include in the build process with the Cray wrappers.

## Exercise: compile and run your first MPI-enabled HIP application

The files [hello_devices_mpi.cpp](hello_devices_mpi.cpp) and [hello_devices_mpi_onefile.cpp](hello_devices_mpi_onefile.cpp) implement an MPI-enabled HIP application that reports on devices and fills a vector. The difference between the two is that [hello_devices_mpi.cpp](hello_devices_mpi.cpp) has its kernel located in a separate file, [kernels.hip.cpp](kernels.hip.cpp). Your task is to compile these sources into executables: **hello_devices_mpi** from the two-file version, and **hello_devices_mpi_onefile_hipcc** and **hello_devices_mpi_onefile_CC** from the one-file version.

### Compilation steps

#### Task 1. Login and setup

* Log into **setonix.pawsey.org.au**.
```bash
ssh <username>@setonix.pawsey.org.au
```
* Change directory to your space on /scratch
```bash
cd $MYSCRATCH
```
* Get the course material from GitHub if you don't already have it.
```bash
git clone https://github.com/pelagos-consulting/HIP_Course.git
cd HIP_Course/course_material/L2_Using_HIP_On_Setonix
```
* Get an interactive GPU job on Setonix. The correct command to use will be in the welcome letter, and looks something like this. 
```bash
salloc --account=${PAWSEY_PROJECT}-gpu --nodes=1 --time=2:00:00 --gres=gpu:1 --partition=gpu-dev
```
* Swap out the **PrgEnv-gnu** module for the **PrgEnv-cray** module, then load the **rocm** and **craype-accel-amd-gfx90a** modules.
```bash
module swap PrgEnv-gnu PrgEnv-cray
module load rocm/5.2.3
module load craype-accel-amd-gfx90a
```

#### Task 2. Compile the kernel and main program in separate files
* Compile the kernel file [kernels.hip.cpp](kernels.hip.cpp). This creates an object file `kernels.o` for later linking.
```bash
hipcc -c kernels.hip.cpp -fPIC -I${MPICH_DIR}/include --offload-arch=gfx90a -o kernels.o
```
* Use **CC** to compile the file [hello_devices_mpi.cpp](hello_devices_mpi.cpp). 
```bash
CC -x hip -c -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --rocm-path=${ROCM_PATH} --offload-arch=gfx90a hello_devices_mpi.cpp -o hello_devices_mpi.o
```
* Now use **hipcc** to link the two object files together in a way that is aware of the MPI library.
```bash
hipcc kernels.o hello_devices_mpi.o -o hello_devices_mpi --offload-arch=gfx90a -L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa
```

#### Task 3. Compile the combined file in one go using **hipcc**

This should work with any programming environment.

```bash
hipcc -I${MPICH_DIR}/include --offload-arch=gfx90a hello_devices_mpi_onefile.cpp -o hello_devices_mpi_onefile_hipcc -L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa
```

#### Task 4. Compile the combined file in one go using **CC**

This only works with the **PrgEnv-cray** environment.

```bash
CC -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --rocm-path=${ROCM_PATH} --offload-arch=gfx90a -x hip hello_devices_mpi_onefile.cpp -o hello_devices_mpi_onefile_CC -L${ROCM_PATH}/lib -lamdhip64
```

#### Task 5. Run the compiled software

If you are in an interactive or batch job then the proper number of compute devices should appear when we run these commands.

```bash
srun -N 1 -n 1 -c 8 ./hello_devices_mpi
srun -N 1 -n 1 -c 8 ./hello_devices_mpi_onefile_hipcc
srun -N 1 -n 1 -c 8 ./hello_devices_mpi_onefile_CC
```

#### Task 6. Compile and run hello_jobstep

The file `hello_jobstep.cpp` is adapted from a [program](https://code.ornl.gov/olcf/hello_jobstep) by Thomas Papatheodore from ORNL. It is an application to print the OpenMP thread ID, Hardware Thread ID, GPU Bus ID and NUMA ID. We compile the application in one go using `hipcc` and `-fopenmp`:

```bash
hipcc -fopenmp -I${MPICH_DIR}/include --offload-arch=gfx90a hello_jobstep.cpp -o hello_jobstep -L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa
```

Now try changing the number of GPUs in your request for resources for the interactive job from 1 to 2 (that is, `--gres=gpu:2` in the `salloc` command). Then run the program with this command.

```bash
srun -N 1 -n 2 --gpus-per-task=1 -c 8 ./hello_jobstep
```

### Makefile solution

If you get stuck, the example [Makefile](Makefile) contains the above compilation steps. Assuming you loaded the Cray programming environment, the make command is run as follows:

```bash
module swap PrgEnv-gnu PrgEnv-cray
module load rocm/5.2.3
module load craype-accel-amd-gfx90a
make clean; make
```

The script **run_compile.sh** contains the necessary commands to load the appropriate modules and run the **make** command.

```bash
chmod 700 run_compile.sh
./run_compile.sh
```

### CMake solution

The CMake build system should have built the files in this exercise. If you source the script `../env` then these programs are in the `$PATH` and available in the `$RUN_DIR` directory. 

## Batch jobs with HIP on GPU nodes

Pawsey has extensive documentation available for running jobs, at this [site](https://pawsey.atlassian.net/wiki/spaces/US/pages/51928618/Setonix+GPU+Partition+Quick+Start). Here is some information that is specific to making best use of the GPU nodes on Setonix.

### GPU node configuration

On the GPU nodes of Setonix there is 1 CPU and 8 Graphics Compute Dies (GCDs). Each of the 8 chiplets in the CPU is intended to have optimal access to one of the 8 available GCDs (compute devices). To avoid confusion, SLURM regards each of the 8 GCDs in a node as a "unique GPU", and we will do the same. Shown below is a hardware diagram of a compute node, where each chiplet is connected optimally to one compute device.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="images/Setonix-GPU-Node.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Overall view of a Setonix GPU node, showing the placement of hardware threads and the closest available compute device.</figcaption>
</figure>

From the above diagram we see that the best use of the GPUs occurs when a chiplet accesses the GPU that is closest to it. It then makes sense to always allocate a chiplet of CPU cores to a GPU. Pawsey has simplified resource requests on GPU nodes such that allocations are made in packs. Each `allocation pack` is granted the following discrete chunk of resources:

* 1 chiplet (8 cores)
* 1/8 the memory in a node
* 1 GPU compute device

Then you specify the number of nodes you need with `--nodes`, and the number of GCDs (allocation packs) per node with `--gres=gpu:<number>`. You must request enough allocation packs to cover whichever of your chiplet, GPU, or memory needs is largest, even if the other resources in those packs go unused. You may optionally add `--exclusive` to use the whole node, but be aware that the job may then wait longer in the queue and will incur the maximum charge to your service units.
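
To see what a request translates to, note that the resources granted scale linearly with the number of allocation packs. The sketch below works through that arithmetic for one node. The 245 GB figure is the available host memory from the `gpu` queue table, and the per-pack memory is our estimate from dividing it by eight; the request for three packs is just an example.

```bash
packs=3                       # allocation packs requested with --gres=gpu:3
cores_per_pack=8              # one chiplet
node_memory_gb=245            # available host memory on a gpu node

cores=$(( packs * cores_per_pack ))
# Estimate: each pack carries one eighth of the node's host memory.
memory_gb=$(awk -v p="${packs}" -v m="${node_memory_gb}" \
    'BEGIN { printf "%.1f", p * m / 8 }')

echo "--gres=gpu:${packs} grants ${packs} GPU compute devices,"
echo "${cores} CPU cores and about ${memory_gb} GB of host memory."
```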

Within the sbatch job you now need to run your parallelised program with `srun`, and pass it the full range of options to exert the control you need for your job, namely:

|Option|explanation|
|:---|:---|
| `--nodes` | Total number of nodes to request. |
| `--ntasks=8` | Total number of MPI tasks to request. Should be `<nodes>`*`<allocation_packs_per_node>`/`<gpus_per_task>`. |
| `--gres=gpu:8` | Number of Graphics Compute Dies (and allocation packs) requested per node. |
| `--cpus-per-task=8` | Number of cores to allocate per task, even if they aren't used. |
| `--gpus-per-task=1` | Number of GPUs to make available per task. |
| `--gpu-bind=closest` | Attempt to bind each GPU to its closest chiplet. |

The environment variable `OMP_NUM_THREADS` is used to fine-tune the number of OpenMP threads per MPI task. Furthermore, the environment variable `MPICH_GPU_SUPPORT_ENABLED=1` enables GPU-aware communication with MPI. You can use `--gpu-bind=closest` with srun to try to make sure each GPU compute device is paired with its closest chiplet, however this might not always work. In such instances see [this Pawsey resource](https://support.pawsey.org.au/documentation/display/US/Example+Slurm+Batch+Scripts+for+Setonix+on+GPU+Compute+Nodes) for detailed information on how to map CPU cores to GPUs.
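
The task count formula in the table above can be worked through as follows. This is a dry run that prints an `srun` line rather than launching anything. The executable name `./my_program` is hypothetical, and scaling `--cpus-per-task` with the number of GPUs per task is our assumption, based on each allocation pack carrying 8 cores.

```bash
nodes=2
packs_per_node=8         # value given to --gres=gpu:<number>
gpus_per_task=1
cores_per_pack=8

# <nodes> * <allocation_packs_per_node> / <gpus_per_task>
ntasks=$(( nodes * packs_per_node / gpus_per_task ))
# Assumption: a task that owns several GPUs also gets their chiplets.
cpus_per_task=$(( cores_per_pack * gpus_per_task ))

# Dry run: print the srun line instead of launching anything.
echo "srun --nodes=${nodes} --ntasks=${ntasks} --cpus-per-task=${cpus_per_task}" \
     "--gres=gpu:${packs_per_node} --gpus-per-task=${gpus_per_task}" \
     "--gpu-bind=closest ./my_program"
```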

### Example job script

The suggested job script below will allocate an MPI task for every compute device on a node of Setonix. Then it will allocate 8 OpenMP threads to each MPI task. We can use the helper program [hello_jobstep.cpp](hello_jobstep.cpp) adapted from a [program](https://code.ornl.gov/olcf/hello_jobstep) by Thomas Papatheodore from ORNL. Every software thread executed by the program reports the MPI rank, OpenMP thread, the CPU hardware thread, as well as the GPU and bus IDs of the GPU hardware.

```bash
#!/bin/bash -l

#SBATCH --account=<account>-gpu    # your account
#SBATCH --partition=gpu            # Using the gpu partition
#SBATCH --nodes=1                  # Total number of nodes
#SBATCH --gres=gpu:8               # The number of GPU's (and associated allocation packs) per node
#SBATCH --exclusive                # Use this to request all the resources on a node
#SBATCH --time=00:05:00

module swap PrgEnv-gnu PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm

export MPICH_GPU_SUPPORT_ENABLED=1 # Enable GPU-aware MPI communication
export OMP_NUM_THREADS=8    # Set the number of OpenMP threads per task
export OMP_PLACES=cores     # To bind OpenMP threads to cores 
export OMP_PROC_BIND=close  # To bind (fix) threads (allocating them as close as possible). This option works together with the "places" indicated above, then: allocates threads in closest cores.
 
# Temporary workaround for avoiding Slingshot issues on shared nodes:
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)

# Compile the software
make clean
make

# Run one MPI task per GPU, binding each GPU to its closest chiplet
srun --nodes=$SLURM_JOB_NUM_NODES --ntasks=8 --cpus-per-task=8 \
	--gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest \
	./hello_jobstep | sort 
```

The file [jobscript.sh](jobscript.sh) contains the batch script above. Edit the `<account>` field to include the account to charge to. The value to use will be in the environment variable `$PAWSEY_PROJECT`.

```bash
echo $PAWSEY_PROJECT
```

Then submit the script to the batch queue with this command

```bash
sbatch jobscript.sh
```

Use this command to check on the progress of your job

```bash
squeue --me
```

Then, if you need to and you know the job ID, you can cancel a job with this command

```bash
scancel <jobID>
```

Once the job is done, have a look at the `*.out` file and examine how the threads and GPUs are placed.

## Summary

In this section we covered using HIP on the Pawsey supercomputer Setonix. This included logins with SSH; the hardware and software environments; and access to the job queues through interactive and batch jobs. We concluded with the HIP software compilation process on Setonix, and then how to get the best performance in batch jobs by scheduling MPI tasks close to the available compute devices.

<address>
Written by Dr. Toby Potter of <a href="https://www.pelagos-consulting.com">Pelagos Consulting and Education</a>, for the <a href="https://pawsey.org.au">Pawsey Supercomputing Centre</a>, and with contributions from the Pawsey team. All trademarks mentioned in this teaching series belong to their respective owners.
</address>