![Multi_GPU_Slide](img/HPCP_MultiGPU/Folie1.PNG) 

![Multi_GPU_Slide](img/HPCP_MultiGPU/Folie2.PNG) 

![Multi_GPU_Slide](img/HPCP_MultiGPU/Folie3.PNG) 

![Multi_GPU_Slide](img/HPCP_MultiGPU/Folie4.PNG)

In [14]:
!mkdir -p scripts


In [17]:
%%writefile scripts/run_script.sh
#!/bin/bash

module load anaconda
cd /psi/home/${USER}/HS25/04_Advanced/scripts
export PYTHONPATH=/psi/home/${USER}/HS25/04_Advanced/scripts:$PYTHONPATH
conda activate summer-school-hpc-2025
mpirun python ${SCRIPT_NAME}.py

Overwriting scripts/run_script.sh


In [16]:
!chmod +x scripts/run_script.sh

We will use this submission script for all our tests, like this:

```bash
cd HS25/04_Advanced/scripts
SCRIPT_NAME=single_cp sbatch --cluster=gmerlin6 --partition=gpu-short --gpus=2 --output=log.out --reservation=psicourse01 run_script.sh
```
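
The `SCRIPT_NAME=... sbatch ...` form works because `sbatch` forwards the submitting shell's environment to the job by default (unless `--export` says otherwise), so `run_script.sh` can read `${SCRIPT_NAME}`. As a minimal sketch, the same submission can be assembled from Python. `build_submission` is a hypothetical helper, not part of the course material, and it only builds the command without submitting anything.

```python
import os
import shlex


def build_submission(script_name, gpus=2, log="log.out"):
    """Return (argv, env) for submitting run_script.sh with SCRIPT_NAME set."""
    argv = [
        "sbatch",
        "--cluster=gmerlin6",
        "--partition=gpu-short",
        f"--gpus={gpus}",
        f"--output={log}",
        "--reservation=psicourse01",
        "run_script.sh",
    ]
    # Copy the current environment and add SCRIPT_NAME, mirroring
    # `SCRIPT_NAME=<name> sbatch ...` on the command line.
    env = dict(os.environ, SCRIPT_NAME=script_name)
    return argv, env


argv, env = build_submission("single_cp")
print(f"SCRIPT_NAME={env['SCRIPT_NAME']} {shlex.join(argv)}")
# To actually submit: subprocess.run(argv, env=env, check=True)
```

Printing the command first makes it easy to compare against the shell version above before anything is submitted.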

Let us discuss the following simple example.

In [9]:
%%writefile scripts/single_cp.py
import cupy as cp
import numpy as np

print(f"Hello from single_job. I see {cp.cuda.runtime.getDeviceCount()} devices.")

# Without an explicit device context, arrays are allocated on the current device (GPU 0 by default).
A = cp.random.random((1024,1024), dtype=cp.float32)

with cp.cuda.Device(0):   # select GPU 0
    A = cp.random.random((1024,1024), dtype=cp.float32)
    B = cp.random.random((1024,1024), dtype=cp.float32)
    res1_gpu = A @ B 

with cp.cuda.Device(1):   # select GPU 1
    C = cp.random.random((1024,1024), dtype=cp.float32)
    D = cp.random.random((1024,1024), dtype=cp.float32)
    res2_gpu = C @ D 

res1_host = cp.asnumpy(res1_gpu) 
res2_host = cp.asnumpy(res2_gpu) 
print(f"Done. res1 preview: {res1_host[0:5]} \nand res2 preview: {res2_host[0:5]}")

Overwriting scripts/single_cp.py


In [19]:
!cat scripts/log.out

Hello from single_job. I see 2 devices.
Done. res1 preview: [[243.97412 248.05185 249.9029  ... 261.0253  261.1789  255.34895]
 [253.66034 248.2228  252.39926 ... 262.94287 266.6966  250.7923 ]
 [261.63257 256.16394 263.0752  ... 269.082   274.162   262.13193]
 [252.28265 245.14882 252.74081 ... 257.71622 262.12726 255.1165 ]
 [246.93062 240.59546 245.90204 ... 253.8186  255.9559  251.0174 ]] and res2 preview: [[254.87265 258.94232 255.34753 ... 266.2976  254.3287  262.9796 ]
 [244.70996 262.54272 254.03656 ... 260.9129  250.0023  259.76602]
 [246.9503  253.97754 251.55046 ... 258.73566 252.97365 255.2243 ]
 [255.05783 262.63297 262.20947 ... 270.6292  256.50345 262.2771 ]
 [249.85574 264.8111  260.9229  ... 256.94852 257.7815  264.13885]]


Let us move some data from GPU 0 to GPU 1. 

In [1]:
%%writefile scripts/p2p_cp.py
import cupy as cp
from cupy.cuda import runtime, Device, Stream

src_dev, dst_dev = 0, 1

# Enable P2P access in both directions
with Device(dst_dev):
    runtime.deviceEnablePeerAccess(src_dev)
with Device(src_dev):
    runtime.deviceEnablePeerAccess(dst_dev)

with Device(src_dev):
    a = cp.arange(10_000_000, dtype=cp.float32)

# Allocate the destination on GPU 1, then copy from GPU 0 to GPU 1
with Device(dst_dev):
    b = cp.empty_like(a)
    runtime.memcpyPeerAsync(b.data.ptr, dst_dev, a.data.ptr, src_dev, a.nbytes, Stream.null.ptr)
    Stream.null.synchronize()  # wait until done (inefficient, but OK for this example)

b_host = cp.asnumpy(b)
print(f"Done copying data from Device 0 to 1. b: {b_host[0:5]}")

Overwriting scripts/p2p_cp.py


```bash
SCRIPT_NAME=p2p_cp sbatch --cluster=gmerlin6 --partition=gpu-short --gpus=A5000:2 --output=log.out --reservation=psicourse01 run_script.sh
```

Managing peer access and copies by hand is pretty tedious. It is better to use a library that provides common communication patterns.

![Multi_GPU_Slide](img/HPCP_MultiGPU/Folie5.PNG)

![Multi_GPU_Slide](img/HPCP_MultiGPU/Folie6.PNG) 

![Multi_GPU_Slide](img/HPCP_MultiGPU/Folie7.PNG) 

![Multi_GPU_Slide](img/HPCP_MultiGPU/Folie8.PNG) 

#### Task 1 — Implement `allGather` with NCCL

Follow the official documentation:  
- CuPy NCCL docs: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.nccl.NcclCommunicator.html  
- NVIDIA NCCL user guide: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html  

**Steps:**
1. Create send and receive buffers on all GPUs.  
2. Initialize NCCL across all GPUs with `initAll`.  
3. Call the `allGather` operation on each GPU.  
   - *Hint:* Wrap these calls inside `groupStart()` and `groupEnd()`.  
4. Synchronize, then print part of the result to verify correctness.  
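
Before writing the NCCL version, it can help to pin down what `allGather` is expected to produce. The following is a CPU-only model in plain Python (no CuPy or NCCL involved). `simulate_all_gather` is a hypothetical helper, and the per-rank send buffers are made-up values. Each rank contributes `count` elements, and every rank ends up with the concatenation of all contributions in rank order.

```python
def simulate_all_gather(send_buffers):
    """Model allGather: every rank receives all send buffers, ordered by rank."""
    counts = {len(buf) for buf in send_buffers}
    if len(counts) != 1:
        raise ValueError("allGather expects the same element count on every rank")
    gathered = [x for buf in send_buffers for x in buf]
    # One receive buffer per rank, each of size n_ranks * count.
    return [list(gathered) for _ in send_buffers]


n_ranks, count = 2, 5
# Hypothetical input: rank r sends [r*count, ..., r*count + count - 1].
send = [[float(r * count + i) for i in range(count)] for r in range(n_ranks)]
recv = simulate_all_gather(send)

for rank, buf in enumerate(recv):
    print(f"rank {rank}: {buf}")
```

The main practical consequence for the GPU version is the buffer sizing: each receive buffer needs room for `n_ranks * count` elements, while the `count` passed to `allGather` is the per-rank send count.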

In [None]:
%%writefile scripts/nccl_allgather.py


Overwriting scripts/nccl_allgather.py


```bash
SCRIPT_NAME=nccl_allgather sbatch --cluster=gmerlin6 --partition=gpu-short --gpus=2 --output=log.out --reservation=psicourse01 run_script.sh
```

In [9]:
!cat scripts/log.out

[<cupy_backends.cuda.libs.nccl.NcclCommunicator object at 0x14976ab7b830>, <cupy_backends.cuda.libs.nccl.NcclCommunicator object at 0x14976ab7bd90>]
Start allGatrher on: <cupy_backends.cuda.libs.nccl.NcclCommunicator object at 0x14976ab7b830>
Start allGatrher on: <cupy_backends.cuda.libs.nccl.NcclCommunicator object at 0x14976ab7bd90>
sync
sync
Done: [0. 1. 2. 3. 4.]


#### Task 2

Write an NCCL script where each GPU is controlled by its own process.  
You can achieve this by either:
- running a single task with **N cores and N GPUs**, or  
- running **N tasks**, each bound to one GPU.  

**Requirements:**
- Broadcast an array from **rank/GPU 0** to all other ranks.  
- No `groupStart()` needed, but **each process must initialize** its own `NcclCommunicator`.  
- Use `comm_id = nccl.get_unique_id()` to initialize the communicator.  
- Use `cuda.Stream.null.ptr` as the CUDA stream (default stream).  
- Use `n_devices = int(os.environ["SLURM_GPUS_ON_NODE"])` to get the number of GPUs, **not** `cp.cuda.runtime.getDeviceCount()`. Calling the CUDA runtime in the parent process before starting the worker processes will lead to an error.
- Use `comm.destroy()` to destroy the `NcclCommunicator` in each process at the end.
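
As a small illustration of the `SLURM_GPUS_ON_NODE` point, the sketch below derives a rank-to-device plan from the environment without importing CuPy or touching the CUDA runtime in the parent process. `plan_workers` and the fallback value are assumptions made for this example; the NCCL calls themselves are left to the task.

```python
import os


def plan_workers(default_gpus=1):
    """Return one (rank, device_id) pair per GPU that Slurm granted on this node."""
    # SLURM_GPUS_ON_NODE is set by Slurm inside a GPU job; the fallback
    # (an assumption for this sketch) lets the code run outside of a job.
    n_devices = int(os.environ.get("SLURM_GPUS_ON_NODE", default_gpus))
    # One process per GPU: rank r will later select device r.
    return [(rank, rank) for rank in range(n_devices)]


if __name__ == "__main__":
    for rank, device_id in plan_workers():
        # A real script would start one process per entry here and create
        # the NcclCommunicator inside that process, not in the parent.
        print(f"rank {rank} -> GPU {device_id}")
```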

In [None]:
%%writefile scripts/nccl_broadcast.py


Overwriting scripts/nccl_broadcast.py


```bash
SCRIPT_NAME=nccl_broadcast sbatch --cluster=gmerlin6 --partition=gpu-short --gpus=4 --output=log.out --reservation=psicourse01 run_script.sh
```

In [24]:
!cat scripts/log.out

Found 4 GPUs
Rank 3 started.
Rank 2 started.
Rank 1 started.
Rank 1 finished: [1. 1. 1. 1. 1.]
Rank 2 finished: [1. 1. 1. 1. 1.]
Rank 3 finished: [1. 1. 1. 1. 1.]
Rank 0 successfully finished.


![Multi_GPU_Slide](img/HPCP_MultiGPU/Folie9.PNG)

In [29]:
%%writefile scripts/hello_mpi.py
from mpi4py import MPI
import cupy as cp
from cupy import cuda
import socket

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

num_gpus = cuda.runtime.getDeviceCount()
print(f"Hello from rank {rank} on {socket.gethostname()}, I see {num_gpus} GPU(s).")
cp.cuda.Device(rank % num_gpus).use()  # Round-robin: each rank selects one of the visible GPUs

Overwriting scripts/hello_mpi.py


```bash
SCRIPT_NAME=hello_mpi sbatch --cluster=gmerlin6 --partition=gpu-short --ntasks=8 --gpus-per-task=1 --output=log.out --reservation=psicourse01 run_script.sh
```

In [22]:
!cat scripts/log.out

Hello from rank 1 on merlin-g-014.psi.ch, I see 8 GPUS.
Hello from rank 3 on merlin-g-014.psi.ch, I see 8 GPUS.
Hello from rank 6 on merlin-g-014.psi.ch, I see 8 GPUS.
Hello from rank 7 on merlin-g-014.psi.ch, I see 8 GPUS.
Hello from rank 0 on merlin-g-014.psi.ch, I see 8 GPUS.
Hello from rank 5 on merlin-g-014.psi.ch, I see 8 GPUS.
Hello from rank 4 on merlin-g-014.psi.ch, I see 8 GPUS.
Hello from rank 2 on merlin-g-014.psi.ch, I see 8 GPUS.
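
Since every rank sees all 8 GPUs here, the line `cp.cuda.Device(rank % num_gpus).use()` is what spreads the ranks over the devices. A plain-Python sketch of that mapping (no MPI or CuPy needed; `gpu_for_rank` is a hypothetical helper name) shows what happens when there are more ranks than GPUs:

```python
def gpu_for_rank(rank, num_gpus):
    """Round-robin mapping used above: rank -> device index."""
    if num_gpus < 1:
        raise ValueError("need at least one visible GPU")
    return rank % num_gpus


# 8 ranks on 8 GPUs: one rank per device.
one_to_one = [gpu_for_rank(r, 8) for r in range(8)]
# 8 ranks on 4 GPUs: two ranks share each device.
shared = [gpu_for_rank(r, 4) for r in range(8)]

print(one_to_one)
print(shared)
```

When a job spans more than one node, the node-local task index (for example `SLURM_LOCALID`) is usually a safer input to this mapping than the global rank.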


#### Task 3
Understand how Slurm parameters affect resource allocation and the rank→GPU mapping.
Launch short jobs with different combinations of:
* `--ntasks`
* `--gpus-per-task`
* `--ntasks-per-node`
* `--gpus`
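
One way to see the effect of these options is to let every task report what Slurm handed to it. The sketch below uses only the standard library. `describe_task` is a hypothetical helper, and variables that are not set (for example outside of a job) are reported as `None`. Which variables are set, and whether they differ per rank, depends on how the tasks are launched (`srun` versus `mpirun`), so treat the output as a diagnostic rather than a guarantee.

```python
import os
import socket

# Environment variables commonly set by Slurm or used by CUDA inside a job.
VARIABLES = (
    "SLURM_PROCID",          # task index assigned by Slurm
    "SLURM_LOCALID",         # task index within the node
    "SLURM_NTASKS",          # total number of tasks
    "SLURM_GPUS_ON_NODE",    # GPUs allocated on this node
    "CUDA_VISIBLE_DEVICES",  # device indices visible to this process
)


def describe_task():
    """Collect the host name and selected environment variables for this task."""
    info = {"host": socket.gethostname()}
    for name in VARIABLES:
        info[name] = os.environ.get(name)  # None if the variable is not set
    return info


if __name__ == "__main__":
    print(describe_task())
```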

In [None]:
# TODO: experiment with the options listed above

#### Task 4
Use two MPI ranks (0 and 1) with one GPU per rank. Create a CuPy buffer on rank 0 and send it to rank 1.

* Initialize MPI and map each rank to a GPU.
* On rank 0: create a CuPy array (e.g., `cp.arange(...)`) on its GPU. Since `MPI.FLOAT` is a 32-bit float, create the array with `dtype=cp.float32`.
* Send that device buffer to rank 1 using `comm.Send` (CUDA-aware path: `comm.Send([data, MPI.FLOAT], ...)`).
* On rank 1: receive into a CuPy buffer on its GPU with `comm.Recv` and verify the contents (`cp.allclose`, print a small slice).

In [None]:
%%writefile scripts/send_mpi.py



Overwriting scripts/send_mpi.py


```bash
SCRIPT_NAME=send_mpi sbatch --cluster=gmerlin6 --partition=gpu-short --ntasks=2 --gpus-per-task=1 --output=log.out --reservation=psicourse01 run_script.sh
```

In [28]:
!cat scripts/log.out

Hello from 0. I see 2 devices.
Hello from 1. I see 2 devices.
0 sent: [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
1 recv: [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]


![Multi_GPU_Slide](img/HPCP_MultiGPU/Folie10.PNG)