# JSC OpenACC Course 2024

* Date: 29 - 31 October 2024
* Location: _online_
* Institute: Jülich Supercomputing Centre

## Session 6: Multi-GPU Programming with MPI and OpenACC

### Task 1: Apply domain decomposition

* Handle GPU affinity: assign each MPI rank its own GPU
* Implement the halo exchange between neighboring ranks
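
The GPU affinity step can be sketched as follows. This is a minimal illustration, not the exercise solution: the helper names `device_for_local_rank` and `bind_gpu` are hypothetical, and the node-local rank is assumed to have been obtained beforehand (for example by splitting `MPI_COMM_WORLD` with `MPI_Comm_split_type` and `MPI_COMM_TYPE_SHARED`). The OpenACC runtime calls are only compiled when OpenACC is enabled, so the mapping logic also builds with a plain C compiler.

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Map a node-local rank onto one of the GPUs visible on this node.
 * Hypothetical helper: round-robin assignment via modulo. */
int device_for_local_rank(int local_rank, int num_devices)
{
    assert(local_rank >= 0 && num_devices > 0);
    return local_rank % num_devices;
}

/* Bind the calling process to its GPU.
 * Returns the selected device number, or -1 if no device was selected
 * (built without OpenACC, or no NVIDIA device found). */
int bind_gpu(int local_rank)
{
#ifdef _OPENACC
    int num_devices = acc_get_num_devices(acc_device_nvidia);
    if (num_devices <= 0) {
        return -1;
    }
    int device = device_for_local_rank(local_rank, num_devices);
    acc_set_device_num(device, acc_device_nvidia);
    return device;
#else
    (void)local_rank; /* unused in a host-only build */
    return -1;
#endif
}
```

With four tasks per node and four GPUs per node, as in the example run below, this mapping would give every rank a distinct device. The device should be selected before the first OpenACC data or compute construct is reached.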

Follow `TODO`s in `6-Multi-GPU-Programming-with-MPI_and_OpenACC/exercises/[C|FORTRAN]/task1/poisson2d.[c|F03]`
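
As orientation for the `TODO`s, the row-wise decomposition and the periodic neighbors can be sketched like this. The struct `row_slab` and the function `decompose_rows` are hypothetical names, and the chunking rule (ceiling division over all rows, then clamping away the boundary rows) is an assumption; the split used in the exercise may differ. The `top` and `bottom` expressions are the ones visible in the compiler warnings of the example output.

```c
#include <assert.h>

/* Rows [iy_start, iy_end) owned by one rank, plus its periodic neighbors. */
typedef struct {
    int iy_start; /* first owned row (inclusive) */
    int iy_end;   /* one past the last owned row */
    int top;      /* rank owning the rows above   */
    int bottom;   /* rank owning the rows below   */
} row_slab;

/* Split ny rows among `size` ranks; rows 0 and ny-1 are boundary rows
 * and are never owned. Assumes ny is large enough that every rank
 * receives at least one row. */
row_slab decompose_rows(int ny, int rank, int size)
{
    assert(ny > 2 && size > 0 && rank >= 0 && rank < size);

    int chunk = (ny + size - 1) / size; /* ceil(ny / size) */

    row_slab s;
    s.iy_start = rank * chunk;
    s.iy_end = s.iy_start + chunk;

    /* Do not process the global boundary rows. */
    if (s.iy_start < 1) s.iy_start = 1;
    if (s.iy_end > ny - 1) s.iy_end = ny - 1;

    /* Periodic neighbors, as in poisson2d.c. */
    s.top = (rank == 0) ? (size - 1) : rank - 1;
    s.bottom = (rank == (size - 1)) ? 0 : rank + 1;
    return s;
}
```

Given such a slab, the halo exchange would send row `iy_start` to `top` while receiving halo row `iy_end` from `bottom`, and send row `iy_end - 1` to `bottom` while receiving halo row `iy_start - 1` from `top`. One common way to do this is a pair of `MPI_Sendrecv` calls inside `#pragma acc host_data use_device(A)`, assuming a CUDA-aware MPI.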

#### Make Targets

* `run`: run `poisson2d` (default)
* `poisson2d`: build `poisson2d` binary
* `profile`: profile with [Nsight Systems](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-profiling)
* `*.solution`: the targets above, applied to the solution (`poisson2d.solution.*`)

#### Example Output

```console
[kraus1@jwlogin21 task1]$ make
mpicc -c -DUSE_DOUBLE -Minfo=accel -fast -acc=gpu -gpu=cc80 poisson2d_serial.c -o poisson2d_serial.o
poisson2d_serial:
     37, Generating present(Anew[:],rhs[:],Aref[:])
     39, Generating update device(rhs[:ny*nx],Aref[:ny*nx])
     41, Generating Tesla code
         44, #pragma acc loop gang /* blockIdx.x */
             Generating implicit reduction(max:error)
         46, #pragma acc loop vector(128) /* threadIdx.x */
     41, Generating implicit copy(error) [if not already present]
     46, Loop is parallelizable
     52, Generating Tesla code
         55, #pragma acc loop gang /* blockIdx.x */
         57, #pragma acc loop vector(128) /* threadIdx.x */
     57, Loop is parallelizable
     61, Generating Tesla code
         65, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     69, Generating Tesla code
         71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     82, Generating update self(Aref[:ny*nx])
mpicc -DUSE_DOUBLE -Minfo=accel -fast -acc=gpu -gpu=cc80 poisson2d.c poisson2d_serial.o -o poisson2d
poisson2d.c:
"poisson2d.c", line 164: warning: variable "top" was declared but never referenced
          int top    = (rank == 0) ? (size-1) : rank-1;
              ^

"poisson2d.c", line 165: warning: variable "bottom" was declared but never referenced
          int bottom = (rank == (size-1)) ? 0 : rank+1;
              ^

main:
     72, Generating enter data create(rhs[:ny*nx],Aref[:ny*nx],A[:ny*nx],Anew[:ny*nx])
     77, Generating present(Aref[:],Anew[:],A[:])
         Generating Tesla code
         81, #pragma acc loop gang /* blockIdx.x */
         83, #pragma acc loop vector(128) /* threadIdx.x */
     83, Loop is parallelizable
     93, Generating update self(A[:ny*nx],Anew[:ny*nx],Aref[:ny*nx])
    128, Generating update device(rhs[nx*iy_start:nx*(iy_end-iy_start)],A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])
    130, Generating present(A[:],rhs[:],Anew[:])
         Generating Tesla code
        133, #pragma acc loop gang /* blockIdx.x */
             Generating implicit reduction(max:error)
        135, #pragma acc loop vector(128) /* threadIdx.x */
    130, Generating implicit copy(error) [if not already present]
    135, Loop is parallelizable
    145, Generating present(Anew[:],A[:])
         Generating Tesla code
        148, #pragma acc loop gang /* blockIdx.x */
        150, #pragma acc loop vector(128) /* threadIdx.x */
    150, Loop is parallelizable
    154, Generating present(A[:])
         Generating Tesla code
        159, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    179, Generating present(A[:])
         Generating Tesla code
        182, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    193, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])
    212, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])
/p/software/juwelsbooster/stages/2020/software/binutils/2.36.1-GCCcore-10.3.0/bin/ld: warning: /p/software/juwelsbooster/stages/2020/software/NVHPC/21.9-GCC-10.3.0/Linux_x86_64/21.9/compilers/lib/nvhpc.ld contains output sections; did you forget -T?
srun --ntasks-per-node 4 -n 4 ./poisson2d
Jacobi relaxation Calculation: 8192 x 8192 mesh
Calculate reference solution and time serial execution.
    0, 0.250000
  100, 0.249985
  200, 0.249970
  300, 0.249955
  400, 0.249940
  500, 0.249925
  600, 0.249911
  700, 0.249896
  800, 0.249881
  900, 0.249866
Parallel execution.
    0, 0.250000
  100, 0.249985
  200, 0.249970
  300, 0.249955
  400, 0.249940
  500, 0.249925
  600, 0.249911
  700, 0.249896
  800, 0.249881
  900, 0.249866
Num GPUs: 4.
8192x8192: 1 GPU:   2.3271 s, 4 GPUs:   2.3426 s, speedup:     0.99, efficiency:    24.84%
MPI time:   0.0000 s, inter GPU BW:  6604.84 GiB/s
```

### Task 2: Hide MPI communication time

* Start the copy loop asynchronously
* Wait for the asynchronous copy loop once the MPI communication is done
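
The two steps above can be sketched as a pair of helpers. This is an illustration under stated assumptions, not the exercise solution: the function names `start_copy_async` and `wait_for_copy` are hypothetical, the arrays are assumed to be present on the device already, and the MPI calls themselves are left out so that the sketch stays self-contained. Without OpenACC support the pragmas are ignored and the code runs serially on the host.

```c
#include <assert.h>

/* Launch the copy A <- Anew for the owned rows [iy_start, iy_end).
 * The async clause lets the host continue immediately, so the MPI
 * communication can overlap with the copy kernel. */
void start_copy_async(double *A, const double *Anew,
                      int nx, int iy_start, int iy_end)
{
    assert(nx > 2 && iy_start >= 1 && iy_start < iy_end);

    #pragma acc parallel loop present(A, Anew) async
    for (int iy = iy_start; iy < iy_end; iy++) {
        #pragma acc loop
        for (int ix = 1; ix < nx - 1; ix++) {
            A[iy * nx + ix] = Anew[iy * nx + ix];
        }
    }
}

/* Block until all outstanding asynchronous work has finished.
 * Call this after the MPI communication, before A is read again. */
void wait_for_copy(void)
{
    #pragma acc wait
}

/* Assumed call order inside the iteration loop:
 *
 *   start_copy_async(A, Anew, nx, iy_start, iy_end);
 *   ... halo exchange with MPI ...
 *   wait_for_copy();
 */
```

For the overlap to be race-free, the communication must not touch rows that the copy kernel is still writing. One way to satisfy this is to send the boundary rows from `Anew` and receive into the halo rows `iy_start - 1` and `iy_end` of `A`, which the copy loop never writes.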

Follow `TODO`s in `6-Multi-GPU-Programming-with-MPI_and_OpenACC/exercises/[C|FORTRAN]/task2/poisson2d.[c|F03]`

#### Make Targets

* `run`: run `poisson2d` (default)
* `poisson2d`: build `poisson2d` binary
* `profile`: profile with [Nsight Systems](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-profiling)
* `*.solution`: the targets above, applied to the solution (`poisson2d.solution.*`)

#### Example Output

```console
[kraus1@jwlogin21 task2]$ make
mpicc -c -DUSE_DOUBLE -Minfo=accel -fast -acc=gpu -gpu=cc80 poisson2d_serial.c -o poisson2d_serial.o
poisson2d_serial:
     37, Generating present(Anew[:],rhs[:],Aref[:])
     39, Generating update device(rhs[:ny*nx],Aref[:ny*nx])
     41, Generating Tesla code
         44, #pragma acc loop gang /* blockIdx.x */
             Generating implicit reduction(max:error)
         46, #pragma acc loop vector(128) /* threadIdx.x */
     41, Generating implicit copy(error) [if not already present]
     46, Loop is parallelizable
     52, Generating Tesla code
         55, #pragma acc loop gang /* blockIdx.x */
         57, #pragma acc loop vector(128) /* threadIdx.x */
     57, Loop is parallelizable
     61, Generating Tesla code
         65, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     69, Generating Tesla code
         71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     82, Generating update self(Aref[:ny*nx])
mpicc -DUSE_DOUBLE -Minfo=accel -fast -acc=gpu -gpu=cc80 poisson2d.c poisson2d_serial.o -o poisson2d
poisson2d.c:
main:
     82, Generating enter data create(Aref[:ny*nx],rhs[:ny*nx],A[:ny*nx],Anew[:ny*nx])
     93, Generating present(Aref[:],Anew[:],A[:])
         Generating Tesla code
         97, #pragma acc loop gang /* blockIdx.x */
         99, #pragma acc loop vector(128) /* threadIdx.x */
     99, Loop is parallelizable
    109, Generating update self(A[:ny*nx],Aref[:ny*nx],Anew[:ny*nx])
    144, Generating update device(rhs[nx*iy_start:nx*(iy_end-iy_start)],A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])
    146, Generating present(A[:],rhs[:],Anew[:])
         Generating Tesla code
        149, #pragma acc loop gang /* blockIdx.x */
             Generating implicit reduction(max:error)
        151, #pragma acc loop vector(128) /* threadIdx.x */
    146, Generating implicit copy(error) [if not already present]
    151, Loop is parallelizable
    161, Generating present(Anew[:],A[:])
         Generating Tesla code
        165, #pragma acc loop gang /* blockIdx.x */
        167, #pragma acc loop vector(128) /* threadIdx.x */
    167, Loop is parallelizable
    190, Generating present(A[:])
         Generating Tesla code
        194, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    205, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])
    224, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])
/p/software/juwelsbooster/stages/2020/software/binutils/2.36.1-GCCcore-10.3.0/bin/ld: warning: /p/software/juwelsbooster/stages/2020/software/NVHPC/21.9-GCC-10.3.0/Linux_x86_64/21.9/compilers/lib/nvhpc.ld contains output sections; did you forget -T?
srun --ntasks-per-node 4 -n 4 ./poisson2d
Jacobi relaxation Calculation: 8192 x 8192 mesh
Calculate reference solution and time serial execution.
    0, 0.250000
  100, 0.249985
  200, 0.249970
  300, 0.249955
  400, 0.249940
  500, 0.249925
  600, 0.249911
  700, 0.249896
  800, 0.249881
  900, 0.249866
Parallel execution.
    0, 0.250000
  100, 0.249985
  200, 0.249970
  300, 0.249955
  400, 0.249940
  500, 0.249925
  600, 0.249911
  700, 0.249896
  800, 0.249881
  900, 0.249866
Num GPUs: 4.
8192x8192: 1 GPU:   2.3268 s, 4 GPUs:   0.6757 s, speedup:     3.44, efficiency:    86.09%
MPI time:   0.0367 s, inter GPU BW:     6.65 GiB/s
```
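
The speedup and efficiency figures in the last lines of the output are consistent with the usual definitions: speedup is the single-GPU time divided by the multi-GPU time, and parallel efficiency is the speedup divided by the number of GPUs. The sketch below reproduces that arithmetic; the function names are hypothetical, and the formulas are inferred from the printed numbers rather than taken from `poisson2d.c`.

```c
#include <assert.h>

/* Speedup of the multi-GPU run relative to the single-GPU run. */
double speedup(double t_single, double t_multi)
{
    assert(t_single > 0.0 && t_multi > 0.0);
    return t_single / t_multi;
}

/* Parallel efficiency in percent: speedup per GPU. */
double efficiency_percent(double t_single, double t_multi, int num_gpus)
{
    assert(num_gpus > 0);
    return 100.0 * speedup(t_single, t_multi) / num_gpus;
}
```

With the timings printed above (2.3268 s on 1 GPU, 0.6757 s on 4 GPUs), `speedup` gives about 3.44 and `efficiency_percent` about 86.09. Small deviations in the last digit are possible because the printed timings are rounded.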