- references
    - https://pytorch.org/tutorials/intermediate/dist_tuto.html
    - https://mlbench.github.io/2020/09/08/communication-backend-comparison/

In [2]:
from IPython.display import Image

In [1]:
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

## `dist.init_process_group`

```
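def init_process(rank, size, fn, backend='nccl'):
    """ Initialize the distributed environment, then run `fn` on this rank. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    # `fn` is the per-rank workload, e.g. one of the `run` functions below.
    fn(rank, size)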
```

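- Distributed process group
    - Master node: `127.0.0.1` (localhost) refers to the local machine.
    - `MASTER_ADDR` / `MASTER_PORT`: the address and port of the master node, used mainly to coordinate the distributed job.
    - They must be set explicitly even on a single machine (single node) with multiple GPUs, because this is how each process finds the master node.
- With one machine and two GPUs there are 2 processes, and each process calls `init_process` to initialize the distributed environment.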
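
The rendezvous settings above can be tried without any GPU. The following is a minimal sketch, assuming the Gloo backend and a process group of size 1; `find_free_port` and `init_single_process_group` are hypothetical helper names introduced only for this illustration.

```python
import os
import socket

import torch.distributed as dist


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (helper for this sketch only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def init_single_process_group():
    """Initialize a one-process group and return (rank, world_size)."""
    # Every process reads the master node's address from these two variables.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(find_free_port())
    # Assumption: Gloo is used so the sketch also runs on a CPU-only machine.
    dist.init_process_group("gloo", rank=0, world_size=1)
    try:
        return dist.get_rank(), dist.get_world_size()
    finally:
        # Release the group so the function can be called again.
        dist.destroy_process_group()


if __name__ == "__main__":
    print(init_single_process_group())
```

With more than one process, every rank would set the same `MASTER_ADDR` and `MASTER_PORT` and pass its own `rank`.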

- Backends for inter-process communication (communication backend)
    - NCCL: NVIDIA Collective Communications Library
    - Gloo
    - MPI


| Backend |                    Comm. Functions                    | Optimized for | Float32 | Float16 |
|:-------:|:-----------------------------------------------------:|:-------------:|:-------:|:-------:|
| MPI     | All                                                   | CPU, GPU      | Yes     | No      |
| GLOO    | All (on CPU), broadcast & all-reduce (on GPU)         | CPU           | Yes     | Yes     |
| NCCL    | broadcast, all reduce, reduce and all gather (on GPU) | GPU only      | Yes     | Yes     |


PyTorch (built from source) comes with NCCL and GLOO pre-installed, so using one of those two is usually more convenient. MPI, by contrast, has to be compiled and installed on the machine separately.
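
Which backends a given PyTorch build ships with can be checked at runtime. The sketch below is one possible selection rule, not a prescribed one: it prefers NCCL when a CUDA GPU is visible and falls back to Gloo otherwise. `pick_backend` is a hypothetical helper name.

```python
import torch
import torch.distributed as dist


def pick_backend() -> str:
    """Prefer NCCL when CUDA GPUs are present, otherwise fall back to Gloo."""
    if torch.cuda.is_available() and dist.is_nccl_available():
        return "nccl"
    if dist.is_gloo_available():
        return "gloo"
    raise RuntimeError("neither NCCL nor Gloo is available in this PyTorch build")


if __name__ == "__main__":
    # MPI support is reported separately because it is not always compiled in.
    print("MPI built in:", dist.is_mpi_available())
    print("selected backend:", pick_backend())
```

The returned string can be passed as the `backend` argument of `dist.init_process_group`.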

## Point-to-Point Communication

In [6]:
Image(url='https://pytorch.org/tutorials/_images/send_recv.png', width=400)

- send & recv
    - `recv` is blocking

```
def run(rank, size):
    torch.cuda.set_device(rank)
    tensor = torch.zeros(1).to(rank)
    if rank == 0:
        tensor += 1
        # Send the tensor to process 1
        dist.send(tensor=tensor, dst=1)
    else:
        # Receive tensor from process 0
        print('init tensor', tensor)
        dist.recv(tensor=tensor, src=0)
    print(f'Rank: {rank}, has data {tensor}')
```

```
init tensor tensor([0.], device='cuda:1')
Rank: 1, has data tensor([1.], device='cuda:1')
Rank: 0, has data tensor([1.], device='cuda:0')
```

- Because `recv` is blocking, rank 1 cannot reach the final `print` before the data has arrived, so `Rank: 1, has data tensor([0.], device='cuda:1')` is never printed.
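
For contrast, `torch.distributed` also offers the non-blocking counterparts `isend` and `irecv`, which return a request object right away. The sketch below is a CPU variant of the example above, assuming the Gloo backend so that it runs without GPUs; `free_port`, `worker`, and `launch` are hypothetical names.

```python
import os
import socket

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def free_port() -> int:
    """Ask the OS for an unused TCP port (helper for this sketch only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def worker(rank: int, size: int, port: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    # Assumption: Gloo with CPU tensors instead of NCCL with GPU tensors.
    dist.init_process_group("gloo", rank=rank, world_size=size)
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        req = dist.isend(tensor=tensor, dst=1)  # returns immediately
    else:
        req = dist.irecv(tensor=tensor, src=0)  # returns immediately
        # `tensor` must not be read yet: the transfer may still be in flight.
    req.wait()  # block until the transfer has completed
    assert tensor.item() == 1.0
    print(f"Rank: {rank}, has data {tensor}")
    dist.destroy_process_group()


def launch(size: int = 2) -> None:
    # mp.spawn calls worker(rank, size, port) for each rank and joins them.
    mp.spawn(worker, args=(size, free_port()), nprocs=size, join=True)


if __name__ == "__main__":
    launch()
```

Only after `req.wait()` is it safe to read the received tensor, which is the guarantee that the blocking `recv` provides implicitly.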

## Collective Communication

In [8]:
Image(url='https://pytorch.org/tutorials/_images/all_reduce.png', width=400)

```
""" All-Reduce example."""
def run(rank, size):
    """ Simple collective communication. """
    group = dist.new_group([0, 1])
    if rank == 0:
        tensor = torch.tensor([1., 2., 3.])
    else:
        tensor = torch.tensor([4., 5., 6.])
    tensor = tensor.to(rank)
    print(f'Rank: {rank}, random tensor: {tensor}')
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    print(f'Rank: {rank}, has data: {tensor}')
```

```
Rank: 1, random tensor: tensor([4., 5., 6.], device='cuda:1')
Rank: 0, random tensor: tensor([1., 2., 3.], device='cuda:0')
Rank: 1, has data: tensor([5., 7., 9.], device='cuda:1')
Rank: 0, has data: tensor([5., 7., 9.], device='cuda:0')
```
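
The output above can be reproduced on paper: with `ReduceOp.SUM`, every rank ends up holding the element-wise sum of all ranks' tensors. The sketch below simulates that semantics in a single process with no communication at all; `simulate_all_reduce_sum` is a hypothetical helper, not part of `torch.distributed`.

```python
import torch


def simulate_all_reduce_sum(per_rank):
    """Return what each rank would hold after all_reduce with op=SUM."""
    total = torch.stack(per_rank).sum(dim=0)
    # Every rank receives its own copy of the same reduced tensor.
    return [total.clone() for _ in per_rank]


if __name__ == "__main__":
    inputs = [torch.tensor([1., 2., 3.]), torch.tensor([4., 5., 6.])]
    for rank, out in enumerate(simulate_all_reduce_sum(inputs)):
        print(f"Rank: {rank}, has data: {out}")
```

Unlike this simulation, the real `dist.all_reduce` modifies each rank's tensor in place.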

## Multi-process management

```
if __name__ == "__main__":
    size = 2
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
    
    print('finished')
```

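- `mp.set_start_method("spawn")` sets the process start method to `spawn`.
    - With `spawn`, every child process is started as a brand-new Python interpreter process. This matters here because CUDA cannot be used in a child created with `fork`.
    - On an Ubuntu system, `mp.get_start_method()` returned `fork` as the default.
    - https://docs.python.org/zh-cn/3/library/multiprocessing.html
- Calling `join()` on every process in a loop makes the main process wait until all child processes have finished before it continues.
    - `join()` guarantees that the main process does not move on before all children have completed their work, which is an important mechanism for data integrity and for avoiding race conditions.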
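
The start-method and `join()` behavior can be observed with the standard library alone. This is a minimal sketch, assuming a trivial stand-in workload; `work` and `run_all` are hypothetical names, and a real job would call `init_process` instead.

```python
import multiprocessing as mp


def work(rank: int) -> None:
    """Stand-in for a per-rank job; a real one would call init_process."""
    print(f"rank {rank} running in a fresh interpreter")


def run_all(size: int = 2) -> list:
    # get_context selects spawn without changing the global start method.
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=work, args=(rank,)) for rank in range(size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # blocks until this child has exited
    # exitcode is 0 for a clean exit and non-zero if the child raised.
    return [p.exitcode for p in procs]


if __name__ == "__main__":
    print("available start methods:", mp.get_all_start_methods())
    print("exit codes:", run_all())
```

Checking the exit codes after `join()` is a simple way to notice that a rank crashed, since an exception in a child process does not propagate to the parent on its own.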