## Message passing

### Point-to-Point Communication
<!--
* send/recv
* isend/irecv
-->

![Send and receive](./figs/send_recv.png)


A transfer of data from one process to another is called a point-to-point communication. It is performed with the `send` and `recv` functions or their *immediate* counterparts, `isend` and `irecv`.

```python
"""Blocking point-to-point communication."""

def run_blocking(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        # Send the tensor to process 1
        dist.send(tensor=tensor, dst=1)
    else:
        # Receive tensor from process 0
        dist.recv(tensor=tensor, src=0)
    print('Rank ', rank, ' has data ', tensor[0])

```

In the above example, both processes start with a zero tensor, then process 0 increments the tensor and sends it to process 1 so that they both end up with 1.0. Notice that process 1 needs to allocate memory in order to store the data it will receive.

Also notice that `send`/`recv` are **blocking**: both processes stop until the communication is completed. The immediates, on the other hand, are **non-blocking**: the script continues its execution, and the methods return a `DistributedRequest` object on which we can choose to `wait()`.

```python
"""Non-blocking point-to-point communication."""

def run_non_blocking(rank, size):
    tensor = torch.zeros(1)
    req = None
    if rank == 0:
        tensor += 1
        # Send the tensor to process 1
        req = dist.isend(tensor=tensor, dst=1)
        print('Rank 0 started sending')
    else:
        # Receive tensor from process 0
        req = dist.irecv(tensor=tensor, src=0)
        print('Rank 1 started receiving')
        print('Rank 1 has data ', tensor[0])
    req.wait()
    print('Rank ', rank, ' has data ', tensor[0])

```

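Running the above function might result in process 1 still printing 0.0 even though it has already started receiving, because the transfer may not have finished yet. For the same reason, we should neither modify the sent tensor nor read the received tensor before `req.wait()` has completed. Once `req.wait()` has returned, we are guaranteed that the communication took place and that the value stored in `tensor[0]` is 1.0.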
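The role of `wait()` can be imitated without any distributed setup. The following is a toy model built on Python threads, not on `torch.distributed`: `ToyRequest` and `toy_irecv` are hypothetical names invented for this sketch, and the artificial delay only stands in for network latency.

```python
import threading
import time


class ToyRequest:
    """Hypothetical stand-in for the request object returned by isend/irecv."""

    def __init__(self, thread):
        self._thread = thread

    def wait(self):
        # Block the caller until the background transfer has finished.
        self._thread.join()


def toy_irecv(buffer, source, delay=0.05):
    """Start copying `source` into `buffer` in the background, return at once."""

    def transfer():
        time.sleep(delay)   # pretend the network is slow
        buffer[:] = source  # the data lands in the pre-allocated buffer

    thread = threading.Thread(target=transfer)
    thread.start()
    return ToyRequest(thread)


buffer = [0.0]
req = toy_irecv(buffer, source=[1.0])
early = buffer[0]  # may still be 0.0: the transfer can be in flight
req.wait()
late = buffer[0]   # the transfer is complete once wait() has returned
print('before wait:', early, 'after wait:', late)
```

The value read before `wait()` depends on timing, which is exactly why the receive buffer should only be trusted after `wait()`.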

Point-to-point communication is useful when we want fine-grained control over the communication of our processes. It can be used to implement fancy algorithms, such as the one used in [Baidu's DeepSpeech](https://github.com/baidu-research/baidu-allreduce) or [Facebook's large-scale experiments](https://research.fb.com/publications/imagenet1kin1h/).
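To give a feel for how a collective can be assembled from neighbour-to-neighbour transfers alone, here is a minimal single-process simulation of a ring all-reduce. It is only a sketch under simplifying assumptions, not the code from the projects linked above: ranks are modelled as Python lists, every vector has exactly one element per rank, and the function name `ring_all_reduce` is made up for this example.

```python
def ring_all_reduce(per_rank):
    """Simulate a sum all-reduce where rank r only ever sends to rank r + 1.

    per_rank[r] is the vector held by rank r. For simplicity each vector
    must have exactly one element (one "chunk") per rank.
    """
    n = len(per_rank)
    data = [list(vector) for vector in per_rank]
    if any(len(vector) != n for vector in data):
        raise ValueError('each vector needs one element per rank')

    # Phase 1 (scatter-reduce): after n - 1 steps, rank r holds the
    # complete sum for chunk (r + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so the sends happen "simultaneously".
        messages = []
        for r in range(n):
            chunk = (r - step) % n
            messages.append((chunk, data[r][chunk]))
        for r, (chunk, value) in enumerate(messages):
            data[(r + 1) % n][chunk] += value

    # Phase 2 (all-gather): pass the finished chunks around the ring.
    for step in range(n - 1):
        messages = []
        for r in range(n):
            chunk = (r + 1 - step) % n
            messages.append((chunk, data[r][chunk]))
        for r, (chunk, value) in enumerate(messages):
            data[(r + 1) % n][chunk] = value

    return data


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

In a real implementation each "message" would be an `isend`/`irecv` pair between neighbouring ranks; the appeal of the ring layout is that every process only talks to its two neighbours.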

### Collective Communication
<!--
* gather
* reduce
* broadcast
* scatter
* all_reduce
-->

<table>
<tbody>
<tr>

<td align='center'>
<img src='./figs/scatter.png' width=100% /><br/>
<b>Broadcast</b>
</td>

<td align='center'>
<img src='./figs/all_gather.png' width=100% /><br/>
<b>AllGather</b>
</td>

</tr><tr>

<td align='center'>
<img src='./figs/reduce.png' width=100% /><br/>
<b>Reduce</b>
</td>

<td align='center'>
<img src='./figs/all_reduce.png' width=100% /><br/>
<b>AllReduce</b>
</td>

</tr>
<tr>

<td align='center'>
<img src='./figs/scatter.png' width=100% /><br/>
<b>Scatter</b>
</td>

<td align='center'>
<img src='./figs/gather.png' width=100% /><br/>
<b>Gather</b>
</td>

</tr>
</tbody>
</table>

As opposed to point-to-point communication, collectives allow for communication patterns across all processes in a **group**. A group is a subset of all our processes. To create a group, we can pass a list of ranks to `dist.new_group`. By default, collectives are executed on all processes, also known as the **world**. For example, in order to obtain the sum of all tensors at all processes, we can use the `dist.all_reduce(tensor, op, group)` collective.

```python
""" All-Reduce example."""
def run_all_reduce(rank, size):
    """Sum a one-element tensor across all processes in the group."""
    group = dist.new_group([0, 1])
    tensor = torch.ones(1)
    dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
    print('Rank ', rank, ' has data ', tensor[0])
```

Since we want the sum of all tensors in the group, we use `dist.reduce_op.SUM` as the reduce operator (recent PyTorch releases spell it `dist.ReduceOp.SUM` and deprecate the `reduce_op` alias). Generally speaking, any commutative mathematical operation can be used as an operator. Out of the box, PyTorch comes with 4 such operators, all working element-wise:

* `dist.reduce_op.SUM`,
* `dist.reduce_op.PRODUCT`,
* `dist.reduce_op.MAX`,
* `dist.reduce_op.MIN`.
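What "element-wise" means for these four operators can be shown with plain Python lists standing in for the tensors held by each process. This is an illustrative sketch only: `REDUCE_OPS` and `elementwise_reduce` are names invented here, and no PyTorch code is involved.

```python
from functools import reduce
import operator

# Plain-Python counterparts of the four built-in reduce operators.
REDUCE_OPS = {
    'SUM': operator.add,
    'PRODUCT': operator.mul,
    'MAX': max,
    'MIN': min,
}


def elementwise_reduce(tensors, op_name):
    """Combine one list per process, position by position."""
    op = REDUCE_OPS[op_name]
    # zip(*tensors) groups the i-th element of every process together.
    return [reduce(op, column) for column in zip(*tensors)]


# Three "processes", each holding a two-element "tensor".
tensors = [[1, 5], [2, 4], [3, 3]]
for name in REDUCE_OPS:
    print(name, elementwise_reduce(tensors, name))
```

Each output position depends only on the same position in every input, so the result always has the same shape as the individual tensors.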

Counting `dist.all_reduce(tensor, op, group)`, a total of 6 collectives are currently implemented in PyTorch.

* `dist.broadcast(tensor, src, group)`: Copies `tensor` from `src` to all other processes.
* `dist.reduce(tensor, dst, op, group)`: Applies `op` to the `tensor` of every process and stores the result on process `dst`.
* `dist.all_reduce(tensor, op, group)`: Same as reduce, but the result is stored in all processes.
* `dist.scatter(tensor, src, scatter_list, group)`: Copies the $i^{\text{th}}$ tensor `scatter_list[i]` from process `src` into `tensor` on the $i^{\text{th}}$ process.
* `dist.gather(tensor, dst, gather_list, group)`: Copies `tensor` from all processes into `gather_list` on process `dst`.
* `dist.all_gather(tensor_list, tensor, group)`: Copies `tensor` from all processes to `tensor_list`, on all processes.
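The data movement behind the six collectives listed above can be summarised with a small reference model. It is a sketch under simplifying assumptions rather than PyTorch code: each process holds a single number, the list index plays the role of the rank, and all function names (including `reduce_to`) are hypothetical.

```python
from functools import reduce as fold
import operator


def broadcast(values, src):
    """Every rank ends up with the value held by rank `src`."""
    return [values[src] for _ in values]


def reduce_to(values, dst, op=operator.add):
    """Only rank `dst` receives the combined result.

    This model leaves the other ranks unchanged; only the value on
    `dst` should be considered meaningful.
    """
    result = fold(op, values)
    return [result if rank == dst else v for rank, v in enumerate(values)]


def all_reduce(values, op=operator.add):
    """Every rank receives the combined result."""
    result = fold(op, values)
    return [result for _ in values]


def scatter(scatter_list):
    """Rank i receives scatter_list[i] from the source rank."""
    return list(scatter_list)


def gather(values, dst):
    """Rank `dst` collects one value per rank; the others get nothing."""
    return [list(values) if rank == dst else None
            for rank in range(len(values))]


def all_gather(values):
    """Every rank collects one value per rank."""
    return [list(values) for _ in values]


values = [1, 2, 3, 4]  # values[rank] is what each rank starts with
print(broadcast(values, src=0))
print(reduce_to(values, dst=0))
print(all_reduce(values))
print(scatter([10, 20, 30, 40]))
print(gather(values, dst=0))
print(all_gather(values))
```

Reading the printed lists side by side makes the pairing visible: `all_reduce` and `all_gather` give every rank what `reduce` and `gather` give to `dst` only.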

## Code example

```python
"""send_receive.py"""

import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process


def init_processes(rank, size, fn, backend='tcp'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    processes = []
    for rank in range(size):
        # Options: run_blocking, run_non_blocking, run_all_reduce
        p = Process(target=init_processes, args=(rank, size, run_non_blocking))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

```

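The above script spawns two processes that each set up the distributed environment, initialize the process group (`dist.init_process_group`), and finally execute the given `run` function. That function must be defined in the same file; here it is one of the `run_*` functions from the previous sections.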

The `init_processes` function ensures that every process can coordinate through a master, using the same IP address and port. Note that we used the TCP backend, but we could have used [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) or [Gloo](http://github.com/facebookincubator/gloo) instead. The `tcp` backend only exists in early PyTorch releases; on later ones, pass `backend='gloo'`.


## Launch command

! python code/send_receive.py

Rank 0 started sending
Rank 1 started receiving
Rank  0  has data  tensor(1.)
Rank  1  has data  tensor(1.)
