Merged
10 changes: 5 additions & 5 deletions Pragma_Examples/OpenMP/Fortran/do_concurrent/1_saxpy/README.md
@@ -34,8 +34,7 @@ There are three folders:
cd 0_port_yourself
```

-This is a plain serial saxpy. The task is to replace the `do` loops with `do concurrent` loops, you may also want to replace the initialization with such loops.
-and compile with the appropriate `-fdo-concurrent-to-openmp=` flag.
+This is a plain serial saxpy. The task is to replace the `do` loops with `do concurrent` loops. You may also want to replace the initialization with such loops and compile with the appropriate `-fdo-concurrent-to-openmp=` flag.

Build and run the serial version (only `-fopenmp` is needed for `omp_get_wtime`):

@@ -63,7 +62,7 @@ Build and run:
make
./saxpy
```
-It is recommended to set `OMP_NUM_THREADS=24` (or other number, 24 makes sense for 1 MI300A) is needed to run in parallel. For best performance affinity should be set (system dependent) for example `OMP_PROC_BIND=close numactl -C 0-23 -m 0 ./saxpy`.
+It is recommended to set `OMP_NUM_THREADS=24` (or another number; 24 makes sense for 1 MI300A) to run in parallel. For best performance, affinity should be set (system dependent), for example `OMP_PROC_BIND=close numactl -C 0-23 -m 0 ./saxpy`.



@@ -115,7 +114,8 @@ The kernel names contain `__omp_offloading`, confirming the compiler transformation.
Two kernels are launched: one for the initialization `do concurrent` loop (`_QQmain_l48`)
and one for the saxpy computation `do concurrent` loop (`_QMsaxpymodPsaxpy_l22`).

-Note: For best performance setting affinity is important (and system dependent!) for example ROCR_VISIBLE_DEVICES=0 OMP_PROC_BIND=close numactl -C 0 -m 0 ./saxpy
+Note: For best performance, setting affinity is important (and system dependent!), for example: `ROCR_VISIBLE_DEVICES=0 OMP_PROC_BIND=close numactl -C 0 -m 0 ./saxpy`

### Running with `HSA_XNACK=0` and `HSA_XNACK=1`

`HSA_XNACK` controls whether the GPU uses page migration (unified shared memory) or
@@ -147,7 +147,7 @@ LIBOMPTARGET_KERNEL_TRACE=1 ./saxpy

On MI300A you should see both runs succeed. Compare the kernel times — with `HSA_XNACK=1`
the runtime only needs to map pointers rather than copy entire arrays, which can result in
-faster execution after the first touch of the data. For the `HSA_XNACK=0` case you will see that data migration relies on the default implict mapping in OpenMP.
+faster execution after the first touch of the data. For the `HSA_XNACK=0` case you will see that data migration relies on the default implicit mapping in OpenMP.

This makes `do concurrent` a portable, pragma-free way to express parallelism in
standard Fortran while still leveraging GPU hardware through the OpenMP offload infrastructure.
41 changes: 21 additions & 20 deletions Pragma_Examples/OpenMP/Fortran/do_concurrent/2_reduction/README.md
@@ -1,6 +1,6 @@
## `do concurrent` with `REDUCE` (OpenMP-style reduction)

-**Layout**
+### Exercise overview

| Path | Description |
|------|-------------|
@@ -17,49 +17,51 @@ end do

This is lowered to OpenMP target code, similar in spirit to a `REDUCE` clause on a parallel `do` loop.

---

### Toolchain and environment
-This example requires at least Fortran Drop 23.2.0 (April 2024, beta release). There is no offical rocm version yet which enables REDUCE correctly.
+This example requires at least Fortran Drop 23.2.0 (April 2026, beta release). There is no official ROCm version yet which enables REDUCE correctly.
See [here](#beta-compiler-release-which-enables-reduce) for more details on how to install this version.

```bash
-module load rocm-therock/23.2.0
+module load rocm/therock-23.2.0
export FC=amdflang
```
Either use `HSA_XNACK=1` or `HSA_XNACK=0`; you can also experiment with both settings.


-**Required `do concurrent` and offload flags** (as in `1_do_concurrent_reduce/Makefile`):
+### Required compiler flags
+As in `1_do_concurrent_reduce/Makefile`, you need to pass additional flags to the compiler to enable `do concurrent` on the GPU:

-- `-fdo-concurrent-to-openmp=device` map `do concurrent` to OpenMP *device* regions
-- `-fopenmp --offload-arch=<arch>` OpenMP offload compile and link
+- `-fdo-concurrent-to-openmp=device`: map `do concurrent` to OpenMP *device* regions
+- `-fopenmp --offload-arch=<arch>`: enable OpenMP offload compile and link

-The `Makefile` sets `ROCM_GPU` to the first `rocminfo` line that contains a `gfx` token (see `1_do_concurrent_reduce/Makefile`). On CPU login nodes this value is empty. In that case pass **`ROCM_GPU`** manually, e.g. `make ROCM_GPU=gfx942` for MI300-series parts (use the arch that matches your GPU).
+The `Makefile` sets `ROCM_GPU` to the first `rocminfo` line that contains a `gfx` token (see `1_do_concurrent_reduce/Makefile`). On CPU login nodes this value is empty. In that case pass **`ROCM_GPU`** manually, e.g. `make ROCM_GPU=gfx942` for MI300-series GPUs (use the arch that matches your GPU).

-**Build the serial starting point (any Fortran compiler would do, but use the latest pre release Fortran Drop 23.2.0 (April 2026) as required for the next step):**

+### Build and run the example
+First, build the serial starting point (any Fortran compiler would do, but use the latest pre-release Fortran Drop 23.2.0 (April 2026) as required for the next step):
```bash
-module load rocm-therock/23.2.0
+module load rocm/therock-23.2.0
cd 0_port_yourself
make
./freduce
```

Expected result: `sum= 100000.0` (or similar formatting).

-**Build the device reference solution:**
-Compare the code changes you made to the solution. Run the solution:
+Next, compare the code changes you made to the solution. Run the solution with:
```bash
cd 1_do_concurrent_reduce
-module load rocm-therock/23.2.0
+module load rocm/therock-23.2.0
export FC=amdflang
make # or: make ROCM_GPU=gfx942 if not on a compute node
-./freduce #needs to run on a compute node!
+./freduce # needs to run on a compute node!
```

It should print the same sum, `100000.0` (summing 100,000 values of 1.0).

-### Optional: confirm do concurrent is leveraging OpenMP offload
+### Optional: confirm `do concurrent` is leveraging OpenMP offload
Set the `LIBOMPTARGET_KERNEL_TRACE=1` environment variable to enable additional output of the OpenMP runtime:

```bash
cd 1_do_concurrent_reduce
@@ -69,11 +71,10 @@ LIBOMPTARGET_KERNEL_TRACE=1 ./freduce
You should see traces for kernels whose names include `__omp_offloading`, indicating the `do concurrent` (including `REDUCE`) path was lowered to OpenMP target code.


-**Beta compiler release which enables REDUCE:**
-This feature was very recently enabled in the compiler. Today (April 2026) it only works with this pre release version:
+## Beta compiler release which enables `REDUCE`:
+This feature was enabled very recently in the compiler. Today (April 2026) it only works with this pre-release version:

-- https://repo.radeon.com/rocm/misc/flang/therock-afar-23.2.0-gfxX-7.13.0-663ad81964a.txt
+- (https://repo.radeon.com/rocm/misc/flang/therock-afar-23.2.0-gfxX-7.13.0-663ad81964a.txt)

Read that file for download locations and install notes for your GPU architecture (`gfx*`).
-New pre-release Fortran Drops are published (in)frequently here: https://repo.radeon.com/rocm/misc/flang
+New pre-release Fortran Drops are published infrequently here: (https://repo.radeon.com/rocm/misc/flang)