Merged
10 changes: 5 additions & 5 deletions Pragma_Examples/OpenMP/Fortran/do_concurrent/1_saxpy/README.md
@@ -34,8 +34,7 @@ There are three folders:
cd 0_port_yourself
```

-This is a plain serial saxpy. The task is to replace the `do` loops with `do concurrent` loops, you may also want to replace the initialization with such loops.
-and compile with the appropriate `-fdo-concurrent-to-openmp=` flag.
+This is a plain serial saxpy. The task is to replace the `do` loops with `do concurrent` loops. You may also want to replace the initialization with such loops and compile with the appropriate `-fdo-concurrent-to-openmp=` flag.

Build and run the serial version (only `-fopenmp` is needed for `omp_get_wtime`):

@@ -63,7 +62,7 @@ Build and run:
make
./saxpy
```
-It is recommended to set `OMP_NUM_THREADS=24` (or other number, 24 makes sense for 1 MI300A) is needed to run in parallel. For best performance affinity should be set (system dependent) for example `OMP_PROC_BIND=close numactl -C 0-23 -m 0 ./saxpy`.
+It is recommended to set `OMP_NUM_THREADS=24` (or another number; 24 makes sense for 1 MI300A) to run in parallel. For best performance, affinity should be set (system dependent), for example `OMP_PROC_BIND=close numactl -C 0-23 -m 0 ./saxpy`.



@@ -115,7 +114,8 @@ The kernel names contain `__omp_offloading`, confirming the compiler transformation.
Two kernels are launched: one for the initialization `do concurrent` loop (`_QQmain_l48`)
and one for the saxpy computation `do concurrent` loop (`_QMsaxpymodPsaxpy_l22`).

-Note: For best performance setting affinity is important (and system dependent!) for example ROCR_VISIBLE_DEVICES=0 OMP_PROC_BIND=close numactl -C 0 -m 0 ./saxpy
+Note: For best performance, setting affinity is important (and system dependent!), for example: `ROCR_VISIBLE_DEVICES=0 OMP_PROC_BIND=close numactl -C 0 -m 0 ./saxpy`

### Running with `HSA_XNACK=0` and `HSA_XNACK=1`

`HSA_XNACK` controls whether the GPU uses page migration (unified shared memory) or
@@ -147,7 +147,7 @@ LIBOMPTARGET_KERNEL_TRACE=1 ./saxpy

On MI300A you should see both runs succeed. Compare the kernel times — with `HSA_XNACK=1`
the runtime only needs to map pointers rather than copy entire arrays, which can result in
-faster execution after the first touch of the data. For the `HSA_XNACK=0` case you will see that data migration relies on the default implict mapping in OpenMP.
+faster execution after the first touch of the data. For the `HSA_XNACK=0` case you will see that data migration relies on the default implicit mapping in OpenMP.

This makes `do concurrent` a portable, pragma-free way to express parallelism in
standard Fortran while still leveraging GPU hardware through the OpenMP offload infrastructure.
41 changes: 21 additions & 20 deletions Pragma_Examples/OpenMP/Fortran/do_concurrent/2_reduction/README.md
@@ -1,6 +1,6 @@
## `do concurrent` with `REDUCE` (OpenMP-style reduction)

-**Layout**
+### Exercise overview

| Path | Description |
|------|-------------|
@@ -17,49 +17,51 @@ end do

This is lowered to OpenMP target code, similar in spirit to a `REDUCE` clause on a parallel `do` loop.

---

### Toolchain and environment
-This example requires at least Fortran Drop 23.2.0 (April 2024, beta release). There is no offical rocm version yet which enables REDUCE correctly.
+This example requires at least Fortran Drop 23.2.0 (April 2026, beta release). There is no official ROCm version yet which enables REDUCE correctly.
See [here](#beta-compiler-release-which-enables-reduce) for more details on how to install this version.

```bash
-module load rocm-therock/23.2.0
+module load rocm/therock-23.2.0
export FC=amdflang
```
Either use `HSA_XNACK=1` or `HSA_XNACK=0`; you can also experiment with both settings.


-**Required `do concurrent` and offload flags** (as in `1_do_concurrent_reduce/Makefile`):
+### Required compiler flags
+As in `1_do_concurrent_reduce/Makefile`, you need to pass additional flags to the compiler to enable `do concurrent` on the GPU:

-- `-fdo-concurrent-to-openmp=device` map `do concurrent` to OpenMP *device* regions
-- `-fopenmp --offload-arch=<arch>` OpenMP offload compile and link
+- `-fdo-concurrent-to-openmp=device`: map `do concurrent` to OpenMP *device* regions
+- `-fopenmp --offload-arch=<arch>`: enable OpenMP offload compile and link

-The `Makefile` sets `ROCM_GPU` to the first `rocminfo` line that contains a `gfx` token (see `1_do_concurrent_reduce/Makefile`). On CPU login nodes this value is empty. In that case pass **`ROCM_GPU`** manually, e.g. `make ROCM_GPU=gfx942` for MI300-series parts (use the arch that matches your GPU).
+The `Makefile` sets `ROCM_GPU` to the first `rocminfo` line that contains a `gfx` token (see `1_do_concurrent_reduce/Makefile`). On CPU login nodes this value is empty. In that case pass **`ROCM_GPU`** manually, e.g. `make ROCM_GPU=gfx942` for MI300-series GPUs (use the arch that matches your GPU).

-**Build the serial starting point (any Fortran compiler would do, but use the latest pre release Fortran Drop 23.2.0 (April 2026) as required for the next step):**

+### Build and run the example
+First, build the serial starting point (any Fortran compiler would do, but use the latest pre-release Fortran Drop 23.2.0 (April 2026) as required for the next step):
```bash
-module load rocm-therock/23.2.0
+module load rocm/therock-23.2.0
cd 0_port_yourself
make
./freduce
```

Expected result: `sum= 100000.0` (or similar formatting).

-**Build the device reference solution:**
-Compare the code changes you made to the solution. Run the solution:
+Next, compare the code changes you made to the solution. Run the solution with:
```bash
cd 1_do_concurrent_reduce
-module load rocm-therock/23.2.0
+module load rocm/therock-23.2.0
export FC=amdflang
make # or: make ROCM_GPU=gfx942 if not on a compute node
-./freduce #needs to run on a compute node!
+./freduce # needs to run on a compute node!
```

It should print the same sum, `100000.0` (summing 100,000 values of 1.0).

-### Optional: confirm do concurrent is leveraging OpenMP offload
+### Optional: confirm `do concurrent` is leveraging OpenMP offload
Set the `LIBOMPTARGET_KERNEL_TRACE=1` environment variable to enable additional output of the OpenMP runtime:

```bash
cd 1_do_concurrent_reduce
@@ -69,11 +71,10 @@ LIBOMPTARGET_KERNEL_TRACE=1 ./freduce
You should see traces for kernels whose names include `__omp_offloading`, indicating the `do concurrent` (including `REDUCE`) path was lowered to OpenMP target code.


-**Beta compiler release which enables REDUCE:**
-This feature was very recently enabled in the compiler. Today (April 2026) it only works with this pre release version:
+## Beta compiler release which enables `REDUCE`:
+This feature was enabled very recently in the compiler. Today (April 2026) it only works with this pre-release version:

-- https://repo.radeon.com/rocm/misc/flang/therock-afar-23.2.0-gfxX-7.13.0-663ad81964a.txt
+- (https://repo.radeon.com/rocm/misc/flang/therock-afar-23.2.0-gfxX-7.13.0-663ad81964a.txt)

Read that file for download locations and install notes for your GPU architecture (`gfx*`).
-New pre-release Fortran Drops are published (in)frequently here: https://repo.radeon.com/rocm/misc/flang
+New pre-release Fortran Drops are published infrequently here: (https://repo.radeon.com/rocm/misc/flang)