# Using OpenCL on Setonix

## Access to Setonix

Firstly you need a username and password to access Setonix. Your **username** and **password** will be given to you prior to the beginning of this workshop. If you are using your regular Pawsey account then you can reset your password [here](https://support.pawsey.org.au/password-reset/).

Access to Setonix is via Secure SHell (SSH). On Linux, Mac OS, and Windows 10 and higher, an SSH client is available from the command line or terminal application. Otherwise you need to use a client program like [Putty](https://www.putty.org/) or [MobaXterm](https://mobaxterm.mobatek.net/download-home-edition.html).

### Access with SSH on the command line

On the command line use the command **ssh** to access Setonix.

```bash
ssh -Y <username>@setonix.pawsey.org.au
```

#### Passwordless login with SSH

To avoid typing a password on each login you can generate an SSH key pair (a private key and a matching public key) on your computer, using the following on the command line.

```bash
ssh-keygen -t rsa
```

Then copy the public key (the file that ends in \*.pub) to your account on Setonix and append it to the authorized keys in .ssh. On your machine run this command:

```bash
scp <filename>.pub <username>@setonix.pawsey.org.au:
```

Then log in to Setonix and run these commands:

```bash
mkdir -p ${HOME}/.ssh
cat <filename>.pub >> ${HOME}/.ssh/authorized_keys
chmod 700 ${HOME}/.ssh
chmod 600 ${HOME}/.ssh/authorized_keys
```

Finally, if you are using macOS or Linux you can add these lines to ${HOME}/.ssh/config on your computer:

```text
Host setonix
    Hostname setonix.pawsey.org.au
    IdentityFile <private_key_file>
    User <username>
    ForwardX11 yes
    ForwardAgent yes
    ServerAliveInterval 300
    ServerAliveCountMax 2
    TCPKeepAlive no
```

Then you can run 

```bash
ssh setonix
```

without a password.

### Access from Windows with the MobaXterm client

If you have an operating system that is older than Windows 10 and need a client in a hurry, just download **MobaXterm Home (Portable Edition)** from [this location](https://mobaxterm.mobatek.net/download-home-edition.html). Extract the Zip file and run the application. You might need to accept a firewall notification.

Now go to **Settings -> SSH** and uncheck **"Enable graphical SSH-browser"** in the SSH-browser settings pane. Also enable **"SSH keepalive"** to keep SSH connections active.

<figure style="margin-bottom: 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:100%">
    <img style="vertical-align:middle" src="../images/MobaXTerm_Settings.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: MobaXTerm settings.</figcaption>
</figure>

Then close the settings and start a local terminal.

## Hardware environment on Setonix

On Setonix there are two main kinds of compute nodes:

* CPU nodes with 2 sockets and 256 threads.
* GPU nodes with 1 CPU socket with 128 threads and 4 MI250X GPU sockets. Each GPU socket contains two GPU compute devices.

### CPU nodes

CPU nodes are based on the AMD<span>&trade;</span> EPYC<span>&trade;</span> 7763 processor in a dual-socket configuration. Each processor is a multi-chip design with 8 chiplets (Core Complexes). Each chiplet has 8 cores and its own 32 MB L3 cache. Every core in a chiplet has its own L1 and L2 cache and provides 2 hardware threads. That gives 16 hardware threads per chiplet, 64 cores and 128 threads per processor, and 128 cores and 256 threads per node. Here is some cache and performance information for an individual CPU.

| Node | CPU | Base clock freq(GHz) | Peak clock freq (GHz) | Cores | Hardware threads | L1 Cache (KB) | L2 Cache (KB) | L3 cache (MB) | FP SIMD width (bits) | Peak TFLOPs (FP32) |
|:----:|:----:|-----:| -----: | -----: | :----: | :----: | :----: | :----: | :----: | :---: |
| CPU |AMD EPYC 7763 | 2.45 | 3.50 | 64 | 128 | 64x32 | 64x512 | 8x32 | 256 | ~1.79 |
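
The core and thread totals quoted above follow directly from the chiplet layout. As a quick sanity check, the arithmetic can be reproduced with shell integer arithmetic. The variable names here are only illustrative.

```bash
# Core and thread counts for a Setonix CPU node, from the figures above
chiplets_per_cpu=8
cores_per_chiplet=8
threads_per_core=2
sockets_per_node=2

cores_per_cpu=$(( chiplets_per_cpu * cores_per_chiplet ))
threads_per_cpu=$(( cores_per_cpu * threads_per_core ))
cores_per_node=$(( cores_per_cpu * sockets_per_node ))
threads_per_node=$(( threads_per_cpu * sockets_per_node ))

echo "Per CPU:  ${cores_per_cpu} cores, ${threads_per_cpu} hardware threads"
echo "Per node: ${cores_per_node} cores, ${threads_per_node} hardware threads"
```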

Below is an image of a CPU compute blade on Setonix. In this shot there are 8 CPU heatsinks, two per node, for a total of four nodes per blade.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="images/cpu_blade.jpg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">A CPU blade on Setonix, showing four compute nodes per blade. Each compute node has two CPU sockets.</figcaption>
</figure>

### GPU nodes

GPU nodes on Setonix have one AMD 7A53 'Trento' CPU processor and **four** MI250X GPU processors. The CPU is a specially optimized version of the EPYC processor used in the CPU nodes, but otherwise has the same design and architecture. The Instinct<span>&trade;</span> MI250X processor is also a Multi-Chip Module (MCM) design, with two graphics dies (otherwise known as Graphics Complex Dies) per processor, as shown below.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="../images/MI250x.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">AMD Instinct<span>&trade;</span> MI250X compute architecture, showing two GPU devices per processor. Image credit: <a href="https://hc34.hotchips.org/">AMD Instinct<span>&trade;</span> MI200 Series Accelerator and Node Architectures | Hot Chips 34</a></figcaption>
</figure>

Each of the two dies (GCDs) in an MI250X appears to OpenCL as an **individual compute device** with its own 64 GB of global memory and 8 MB of L2 cache. Since there are four MI250X processors, **there are a total of 8 GPU compute devices visible to OpenCL per GPU node**. Each of the 8 compute devices has 110 **compute units**, and every compute unit executes instructions over a bank of 4x16 floating point SIMD units that share a 16 KB L1 cache, as seen below:

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="images/Setonix-GPU-Compute-Unit.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Close-up of an AMD Instinct MI250X compute unit.</figcaption>
</figure>

The interesting thing to note about these compute units is that 64-bit and 32-bit floating point instructions are executed natively **at the same rate**. Therefore the only performance consideration when using 64-bit numbers is the increased bandwidth required to move them around. Below is a table of performance numbers for each of the four MI250X processors in a node.

| Card | Boost clock (GHz) | Compute Units | FP32 Processing Elements | FP64 Processing Elements (equivalent compute capacity) | L1 Cache (KB) | L2 Cache (MB) | Device memory (GB) | Peak TFLOPs (FP32) | Peak TFLOPs (FP64) |
|:----:|:-----| :----- | :----- | :---- | :---- | :---- | :---- | :---- | :---- |
| AMD Instinct MI250X | 1.7 | 2x110 | 2x7040 | 2x7040 | 2x110x16 | 2x8 | 2x64 | 47.9 | 47.9 |
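
The peak figure in the table can be approximately reproduced from the other columns. This sketch assumes that each processing element retires one fused multiply-add (counted as 2 floating point operations) per clock cycle at the boost clock. That assumption is not stated above and is used here only for illustration. Clock speed is expressed in MHz so that the shell's integer arithmetic is sufficient.

```bash
gcds_per_card=2
compute_units_per_gcd=110
pes_per_compute_unit=64        # 4 x 16 SIMD lanes per compute unit
flops_per_pe_per_cycle=2       # ASSUMPTION: one fused multiply-add per cycle
boost_clock_mhz=1700           # 1.7 GHz

pes_per_card=$(( gcds_per_card * compute_units_per_gcd * pes_per_compute_unit ))
peak_mflops=$(( pes_per_card * flops_per_pe_per_cycle * boost_clock_mhz ))
peak_gflops=$(( peak_mflops / 1000 ))

printf 'Processing elements per card: %d\n' "${pes_per_card}"
printf 'Estimated peak: %d.%03d TFLOPs\n' $(( peak_gflops / 1000 )) $(( peak_gflops % 1000 ))
```

Under these assumptions the estimate comes to 47.872 TFLOPs, which rounds to the 47.9 quoted in the table.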

Below is an installation image of a GPU compute blade with two nodes. Each node has 1 CPU socket and four GPU sockets.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="images/gpu_blade.jpg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">A GPU blade on Setonix, showing two GPU nodes. Each node has one CPU socket and four GPU sockets.</figcaption>
</figure>

## Job queues

On Setonix the following queues are available for general use:

| Queue | Max time limit | Processing elements (CPU) | Sockets | Cores per socket | Processing elements per CPU core | Available memory (GB) | Number of OpenCL devices | Memory per OpenCL device (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| work | 24 hours | 256 | 2 | 64 | 2 | 230 | 0 | 0 |
| long | 96 hours | 256 | 2 | 64 | 2 | 230 | 0 | 0 |
| debug | 1 hour | 256 | 2 | 64 | 2 | 230 | 0 | 0 |
| highmem | 24 hours | 256 | 2 | 64 | 2 | 980 | 0 | 0 |
| copy | 24 hours | 32 | 1 | 64 | 2 | 118 | 0 | 0 |
| gpu | 24 hours | 128 | 1 | 64 | 2 | 230 | 4x2 | 64 |

## Interactive jobs on GPU nodes

When compiling software or running test jobs with GPUs it is helpful to have access to a "live" node. Jobs on the **gpu** queue of Setonix are charged to a separate allocation. At present this is the account name followed by **-gpu**. The following command sets up an interactive job on the **gpu** queue, which you can use to compile software and run test jobs on a GPU node of Setonix.

```bash
salloc --account ${PAWSEY_PROJECT}-gpu --ntasks 1 --mem 4GB --cpus-per-task 1 --time 1:00:00 --gpus-per-task 1 --partition gpu
```
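
Once the interactive job starts, it can be reassuring to confirm what was actually allocated before compiling or running anything. The sketch below prints a few environment variables that Slurm normally sets inside a job. `report_allocation` is a hypothetical helper name, and the presence of `ROCR_VISIBLE_DEVICES` is an assumption that depends on how Slurm's GPU support is configured.

```bash
# Print selected job variables, or a placeholder if they are not set
report_allocation() {
    for name in SLURM_JOB_ID SLURM_JOB_PARTITION SLURM_CPUS_PER_TASK ROCR_VISIBLE_DEVICES; do
        # printenv returns a non-zero status when the variable is not exported
        value=$(printenv "${name}") || value="<unset>"
        echo "${name}=${value}"
    done
}

report_allocation
```

If `SLURM_JOB_PARTITION` does not report **gpu**, the shell is probably not running inside the interactive job.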

## Building software for Setonix

The main complexity in building OpenCL-enabled applications on Setonix arises when you also need MPI support. Otherwise you can load the **rocm** module and simply use **hipcc**. The modules described in this section are the extra ones to load if you also need MPI support.

### Software modules

There are three main programming environments available on Setonix. Each provides C/C++ and Fortran compilers that build software with knowledge of the MPI libraries available on Setonix. The **PrgEnv-gnu** programming environment uses the GNU compilers, **PrgEnv-aocc** uses the AMD **aocc** optimising compiler to try to get the best performance from the AMD CPUs on Setonix, and **PrgEnv-cray** uses the compilers from Cray. Use these commands to find which module to load.

| Programming environment | command to use |
| :--- | :--- |
| AMD | ```module avail PrgEnv-aocc``` |
| Cray | ```module avail PrgEnv-cray``` |
| GNU | ```module avail PrgEnv-gnu``` |

When compiling OpenCL sources you have the choice of either the ROCM **hipcc** compiler wrapper or the Cray compiler wrapper **CC** from **PrgEnv-cray**. If you use the Cray compiler wrapper you need to swap to the module **PrgEnv-cray**, as the GNU programming environment (**PrgEnv-gnu**) is loaded by default.

```bash
module swap PrgEnv-gnu PrgEnv-cray
```

Then the following compiler wrappers are available for use to compile source files:

| Command | Explanation |
| :--- | :--- |
| cc | C compiler |
| CC | C++ compiler |
| ftn | Fortran compiler |

In order to use the GPU-aware MPI library from Cray you also need to load the **craype-accel-amd-gfx90a** module, which works in all three programming environments. To see which version to load run this command.

```bash
module avail craype-accel-amd-gfx90a
```

Load the module **craype-accel-amd-gfx90a** then set the environment variable

```bash
export MPICH_GPU_SUPPORT_ENABLED=1
```

Finally, in order to have ROCM software (such as hipcc and rocgdb) and libraries available you need to have the **rocm** module loaded. To see which one to load, run this command:

```bash
module avail rocm
```

The **rocm** module is independent of the programming environment module loaded. 
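
Putting the steps in this section together, a session setup might look like the sketch below. It uses only the module names and the environment variable mentioned above, and it loads default versions (check `module avail` if you need a specific one). The `setup_status` variable is illustrative, and the guard means the snippet does nothing on a machine without the `module` command.

```bash
setup_status="skipped"
if command -v module >/dev/null 2>&1; then
    module swap PrgEnv-gnu PrgEnv-cray      # only needed if you want the Cray compilers
    module load craype-accel-amd-gfx90a     # needed for the GPU-aware MPI library
    module load rocm                        # hipcc, rocgdb and the ROCM libraries
    export MPICH_GPU_SUPPORT_ENABLED=1
    setup_status="loaded"
fi
echo "Module setup: ${setup_status}"
```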

### Compiling software with OpenCL and MPI support

According to this [documentation](https://docs.amd.com/bundle/HIP-Programming-Guide-v5.0/page/Transitioning_from_CUDA_to_HIP.html), the AMD compiler wrapper **hipcc** can be used for compiling OpenCL source files and is the suggested linker for program objects. To reduce the chance of compiler issues it is good practice to compile **while on a gpu node**, either from a batch or interactive job.

#### Compiling and linking with the **hipcc** compiler wrapper

You can use these compiler flags to bring in the MPI headers and make sure **hipcc** compiles kernels for the MI250X GPU's on Setonix. 

| Function | flags |
| :--- | :--- |
| Compile | ```-I${MPICH_DIR}/include --offload-arch=gfx90a ``` |
| Link | ```-L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}``` |
| Debug (compile and link) | ```-g -ggdb``` |
| OpenMP (compile and link)| ```-fopenmp``` |
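
As a sketch of how the flags in the table fit together, the snippet below assembles them into shell variables and prints the resulting **hipcc** commands without running them. The file names `my_app.cpp`, `my_app.o` and `my_app.exe` are hypothetical, and the fallback value for `MPICH_DIR` is only a placeholder so that the snippet also runs away from Setonix.

```bash
# MPICH_DIR and the PE_MPICH_GTL_* variables are expected to come from the loaded modules
mpich_dir="${MPICH_DIR:-/path/to/mpich}"    # placeholder if MPICH_DIR is unset

compile_flags="-I${mpich_dir}/include --offload-arch=gfx90a"
link_flags="-L${mpich_dir}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a:-} ${PE_MPICH_GTL_LIBS_amd_gfx90a:-}"

# Print the commands instead of executing them (a dry run)
echo "hipcc ${compile_flags} -c my_app.cpp -o my_app.o"
echo "hipcc my_app.o -o my_app.exe ${link_flags}"
```

Printing the commands first makes it easy to spot an empty variable, for example when the **craype-accel-amd-gfx90a** module has not been loaded.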

If you want **hipcc** to behave like Cray **CC**, make sure the **PrgEnv-cray** and **craype-accel-amd-gfx90a** modules are also loaded. Then you can add the output of this command,

```bash
$(CC --cray-print-opts=cflags)
```

to the hipcc compile flags, and the output of this command,

```bash
$(CC --cray-print-opts=libs)
```

to the hipcc linker flags.

#### Compiling and linking with the Cray **CC** compiler wrapper 

If you are using the Cray compiler wrapper **CC** you can add these flags to compile and link OpenCL code for the MI250X GPU's on Setonix. You need to have the **rocm** module loaded.

| Function | flags |
| :--- | :--- |
| Compile | ```-D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -x hip``` |
| Link |  |
| Debug (compile and link) | ```-g``` |
| OpenMP (compile and link)| ```-fopenmp``` |

#### Mixing hipcc and Cray compilation

According to this [documentation](https://docs.amd.com/bundle/HIP-Programming-Guide-v5.0/page/Transitioning_from_CUDA_to_HIP.html), it is important to ensure that all code links back to the same C++ standard libraries. The command ```hipconfig --cxx``` generates extra compile flags that might be useful to include in the build process with the Cray wrapper.

## Exercise: compile and run your first MPI-enabled OpenCL application

The files [hello_devices_mpi.cpp](hello_devices_mpi.cpp) and [hello_devices_mpi_onefile.cpp](hello_devices_mpi_onefile.cpp) implement an MPI-enabled HIP application that reports on devices and fills a vector. The difference between the two is that [hello_devices_mpi.cpp](hello_devices_mpi.cpp) has its kernel located in a separate file, [kernels.hip.cpp](kernels.hip.cpp). Your task is to compile these files into the executables **hello_devices_mpi.exe**, **hello_devices_mpi_onefile_hipcc.exe** and **hello_devices_mpi_onefile_CC.exe**.

### Compilation steps

1. Log into **setonix.pawsey.org.au**.
1. Use **cd** to change directory to your temporary file location in /scratch.
1. Clone the course material from GitHub if you don't already have it.
    <br></br>
    ```git clone git@github.com:pelagos-consulting/HIP_Course.git```
    <br></br>
1. Change directory to **course_material/L2_Using_HIP_On_Setonix**.
1. Get an interactive job on the GPU queue of Setonix with this command:
    <br></br>
    ```salloc --account ${PAWSEY_PROJECT}-gpu --ntasks 1 --mem 4GB --cpus-per-task 1 --time 1:00:00 --gpus-per-task 1 --partition gpu```
    <br></br>
1. Load the **rocm** module
    <br></br>
    ```module load rocm```
    <br></br>    
1. Swap out the **PrgEnv-gnu** module for the **PrgEnv-cray** module
    <br></br>
    ```module swap PrgEnv-gnu PrgEnv-cray```
    <br></br>
1. Load the **craype-accel-amd-gfx90a** module
    <br></br>
    ```module load craype-accel-amd-gfx90a```
    
#### Compile the kernel and main program in separate files
1. Compile the kernel file [kernels.hip.cpp](kernels.hip.cpp)
    <br></br>
    ```hipcc -c kernels.hip.cpp --offload-arch=gfx90a -o kernels.o```
    <br></br>
1. Use **CC** to compile the file [hello_devices_mpi.cpp](hello_devices_mpi.cpp). Make sure to include the location of the **hip_helper.hpp** library, located in ../include.
    <br></br>
    ```CC -c -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -I../include -x hip hello_devices_mpi.cpp -o hello_devices_mpi.o```
    <br></br>
1. Use **hipcc** to link the object files together in a way that is aware of the MPI library.
    <br></br>
    ```hipcc kernels.o hello_devices_mpi.o -o hello_devices_mpi.exe -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}```
<br></br>

#### Compile the combined file in one go using **hipcc**

```hipcc -I${MPICH_DIR}/include -I../include --offload-arch=gfx90a hello_devices_mpi_onefile.cpp -o hello_devices_mpi_onefile_hipcc.exe -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}```
<br></br>

#### Compile the combined file in one go using **CC**

```CC -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -I../include -x hip hello_devices_mpi_onefile.cpp -o hello_devices_mpi_onefile_CC.exe```


### Run the compiled code

Using **srun** we can run the executable files. If we run a program directly, without **srun**, then it will pick up all the GPUs on a node.

1. ```srun ./hello_devices_mpi.exe```
1. ```srun ./hello_devices_mpi_onefile_hipcc.exe```
1. ```srun ./hello_devices_mpi_onefile_CC.exe```

### Bonus task: Try running these programs with and without **srun** to see what happens.

### The answer

If you get stuck, the example [Makefile](Makefile) contains the above compilation steps. Assuming you loaded the right modules defined above, the make command is run as follows:

```make clean; make```

The script **run_compile.sh** contains the necessary commands to load the appropriate modules and run the **make** command.

```bash
   chmod 700 run_compile.sh
   ./run_compile.sh
```

## Tips for batch jobs with HIP on GPU nodes

Pawsey has extensive documentation available for running jobs, at this [site](https://support.pawsey.org.au/documentation/display/US/Running+Jobs+in+Setonix). Here is some information that is specific to making best use of the GPU nodes on Setonix.

### GPU node configuration

On the GPU nodes of Setonix there is 1 CPU and 8 compute devices. Each of the 8 chiplets in the CPU is intended to have optimal access to one of the 8 available GPU compute devices. Shown below is a hardware diagram of a compute node, where each chiplet has optimal access to one compute device.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:100%;">
    <img src="images/Setonix-GPU-Node.png">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Overall view of a Setonix GPU node, showing the placement of hardware threads and the closest available compute device.</figcaption>
</figure>

Work is still being done on making sure that MPI processes map optimally to available compute devices. These suggestions will help space out the MPI tasks so each task resides on its own chiplet.

* Use **--ntasks-per-node=8** to allocate up to 8 MPI tasks per node, one per compute device.
* Use **--gpus-per-task=1** to allocate 1 compute device per MPI task.
* Use **--cpus-per-task=8** and **--threads-per-core=1** to allocate all available threads in a chiplet to a single MPI process.
* Use the **--gpu-bind=closest** option to bind each compute device to the closest MPI task.
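
To see why these values fit a GPU node, note that 8 tasks with 8 cores each account for all 64 cores of the CPU, giving one chiplet and one compute device per task. The arithmetic can be checked as follows. The variable names are illustrative.

```bash
ntasks_per_node=8       # --ntasks-per-node
cpus_per_task=8         # --cpus-per-task, with --threads-per-core=1
gpus_per_task=1         # --gpus-per-task

cores_used=$(( ntasks_per_node * cpus_per_task ))
devices_used=$(( ntasks_per_node * gpus_per_task ))

echo "Cores used per node: ${cores_used} of 64"
echo "Compute devices used per node: ${devices_used} of 8"
```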

### Example job script

The suggested job script below allocates an MPI task for every compute device on a node of Setonix, then allocates 8 OpenMP threads to each MPI task. It runs the helper program [hello_jobstep.cpp](hello_jobstep.cpp), adapted from a [program](https://code.ornl.gov/olcf/hello_jobstep) by Thomas Papatheodore from ORNL. Every software thread executed by the program reports its MPI rank, OpenMP thread, and CPU hardware thread, as well as the GPU and bus IDs of the GPU hardware.

```bash
#!/bin/bash -l

#SBATCH --account=<account>-gpu    # your account
#SBATCH --partition=gpu            # Using the gpu partition
#SBATCH --ntasks=8                 # Total number of tasks
#SBATCH --ntasks-per-node=8        # Set this for 1 mpi task per compute device
#SBATCH --cpus-per-task=8          # How many OpenMP threads per MPI task
#SBATCH --threads-per-core=1       # How many hardware threads to use per core
#SBATCH --gpus-per-task=1          # How many OpenCL compute devices to allocate to a task
#SBATCH --gpu-bind=closest         # Bind each MPI task to the nearest GPU
#SBATCH --mem=4000M                # Amount of memory per node when asking for shared resources
#SBATCH --time=01:00:00

module swap PrgEnv-gnu PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # Number of OpenMP threads per MPI task, in this case 8
export OMP_PLACES=cores     # Place OpenMP threads on cores
export OMP_PROC_BIND=close  # Bind threads to places (cores here) that are as close to each other as possible

# Temporary workaround for avoiding Slingshot issues on shared nodes:
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)

# Run a job with task placement and $BIND_OPTIONS
#srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS $BIND_OPTIONS  ./hello_jobstep.exe
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./hello_jobstep.exe | sort
```

The file [jobscript.sh](jobscript.sh) contains the batch script above. Edit the \<account\> information to include the account to charge to and then run the script with

```bash
sbatch jobscript.sh
```

Have a look at the .out file and examine how the threads and GPUs are placed.
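
By default Slurm writes a job's output to a file named `slurm-<jobid>.out` in the directory you submitted from, unless the job script sets `--output`. As an alternative to the plain **sbatch** call, the sketch below captures the job ID at submission and works out that file name. `slurm_out_name` is a hypothetical helper, and the submission is guarded so that the snippet is harmless on a machine without Slurm.

```bash
# Build the default Slurm output file name for a given job ID
slurm_out_name() {
    echo "slurm-$1.out"
}

if command -v sbatch >/dev/null 2>&1; then
    # --parsable makes sbatch print just the job ID (plus ";cluster" on multi-cluster setups)
    jobid=$(sbatch --parsable jobscript.sh | cut -d';' -f1)
    echo "Once job ${jobid} has finished, view the results with: cat $(slurm_out_name "${jobid}")"
else
    echo "sbatch not found; run this on Setonix"
fi
```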