# Distributed Programming

Large-scale models are too big to fit on a single GPU, so they must be split across multiple GPUs. The pieces of the model then need to communicate over a network to exchange values during computation. This approach, distributing a large computation across multiple computers or devices, is known as **distributed processing**.

In this session, we will learn the fundamentals of distributed programming using PyTorch.


## 1. Multi-processing with PyTorch

Before diving into the distributed programming tutorial, we will first go through a tutorial on multi-processing applications implemented using PyTorch. Concepts such as threads and processes are typically covered in operating systems courses for computer science majors, so we will omit detailed explanations here. If you are not familiar with these concepts, we recommend searching online or reading an article such as:
https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/

### Basic terminology used in multi-process communication
- **Node**: You can think of this as a computer. For example, three nodes mean three computers.
- **Global Rank**: The unique index of a process among all processes in the job. Since training jobs usually run one process per GPU, **in machine learning it can be thought of as the GPU ID**.
- **Local Rank**: The index of a process among the processes on its own node. For the same reason, **in machine learning it refers to the GPU ID within a node**.
- **World Size**: Refers to the total number of processes.
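
As a rough illustration of how these terms relate, the sketch below computes the world size and a global rank from a node index and a local rank. It assumes every node runs the same number of processes (one per GPU) and that ranks are assigned node by node. The helper names `world_size` and `global_rank` are hypothetical and are not part of PyTorch.

```python
def world_size(num_nodes: int, procs_per_node: int) -> int:
    """Total number of processes across all nodes."""
    return num_nodes * procs_per_node


def global_rank(node_rank: int, local_rank: int, procs_per_node: int) -> int:
    """Global rank, assuming ranks are assigned node by node in order."""
    return node_rank * procs_per_node + local_rank


# Example layout: 3 nodes with 4 GPUs (processes) each.
print(world_size(3, 4))      # 12
print(global_rank(2, 1, 4))  # 9 (second GPU on the third node)
```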

<br>

![](../images/process_terms.jpeg)

<br>

### How to run a multi-process application
There are two main ways to run a multi-process application implemented with PyTorch:

1. The user's code acts as the main process and spawns specific functions as subprocesses.
2. The PyTorch launcher acts as the main process and spawns the entire user program as subprocesses.

We will examine both approaches. The term *"spawn"* here refers to a process acting as a parent and launching multiple subprocesses simultaneously.

<br>

### 1) The user's code acts as the main process and spawns specific functions as subprocesses

In this approach, the user's code becomes the main process and spawns a specific function as subprocesses.

![](../images/multi_process_1.png)

<br>

In general, there are two ways to spawn subprocesses: `Spawn` and `Fork`.

- **`Spawn`**
  - Does not inherit resources from the main process; only the necessary resources are newly allocated to the subprocess.
  - Slower but safer.
- **`Fork`**
  - Shares all resources of the main process with the subprocess when starting the process.
  - Faster but more dangerous.

p.s. In practice, there is also a `Forkserver` method, but it is less commonly used and relatively unfamiliar, so it is omitted here.
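
Which start methods exist depends on the operating system. As a small sketch, the standard-library `multiprocessing` module (which `torch.multiprocessing` wraps) can be queried for the available methods. The helper name `pick_start_method` is hypothetical and only serves this illustration.

```python
import multiprocessing


def pick_start_method(preferred: str = "spawn") -> str:
    """Return `preferred` if the platform supports it, else the platform default."""
    available = multiprocessing.get_all_start_methods()
    if preferred in available:
        return preferred
    return available[0]  # The first entry is the platform default.


method = pick_start_method("spawn")
# A context binds a start method without changing the global setting.
ctx = multiprocessing.get_context(method)
print(multiprocessing.get_all_start_methods())  # Contents vary by platform.
print(ctx.get_start_method())
```

Using a context object instead of `set_start_method` avoids changing the start method for the whole program, which can matter when a library has already chosen one.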


In [None]:
"""
src/ch2_distributed_programming/multi_process_1.py

Note:
Jupyter Notebook has many limitations when running multiprocessing applications.
Therefore, in most cases, only the code is included here, and execution should be done
using the code inside the `src` directory.
Please run the actual code from the `src` folder.
"""

import torch.multiprocessing as mp
# Typically abbreviated as `mp`


# Code executed concurrently in subprocesses
def fn(rank, param1, param2):
    print(f"{param1} {param2} - rank: {rank}")


# Main process
if __name__ == "__main__":
    processes = []
    # Set the start method
    mp.set_start_method("spawn")

    for rank in range(4):
        process = mp.Process(target=fn, args=(rank, "A0", "B1"))
        # Create a subprocess
        process.daemon = False
        # Whether the process is a daemon (terminates when the main process exits)
        process.start()
        # Start the subprocess
        processes.append(process)

    for process in processes:
        process.join()
        # Wait for the subprocess to finish before the main process continues


In [None]:
!python ../src/ch2_distributed_programming/multi_process_1.py

The `torch.multiprocessing.spawn` function makes this process significantly easier to implement.


In [None]:
"""
src/ch2_distributed_programming/multi_process_2.py
"""

import torch.multiprocessing as mp


# Code executed concurrently in subprocesses
def fn(rank, param1, param2):
    # `rank` is provided automatically. `param1` and `param2` are passed when calling `spawn`.
    print(f"{param1} {param2} - rank: {rank}")


# Main process
if __name__ == "__main__":
    mp.spawn(
        fn=fn,
        args=("A0", "B1"),
        nprocs=4,  # Number of processes to create
        join=True,  # Whether to join the processes
        daemon=False,  # Whether the processes are daemons
        start_method="spawn",  # Set the start method
    )


In [None]:
!python ../src/ch2_distributed_programming/multi_process_2.py

In [None]:
"""
Note: torch/multiprocessing/spawn.py

The `mp.spawn` function operates as shown below.
"""


def start_processes(fn, args=(), nprocs=1, join=True, daemon=False, start_method='spawn'):
    _python_version_check()
    mp = multiprocessing.get_context(start_method)
    error_queues = []
    processes = []
    for i in range(nprocs):
        error_queue = mp.SimpleQueue()
        process = mp.Process(
            target=_wrap,
            args=(fn, i, args, error_queue),
            daemon=daemon,
        )
        process.start()
        error_queues.append(error_queue)
        processes.append(process)

    context = ProcessContext(processes, error_queues)
    if not join:
        return context

    # Loop on join until it returns True or raises an exception.
    while not context.join():
        pass

### 2) The PyTorch launcher acts as the parent process and spawns the entire user program as subprocesses

This approach uses the multiprocessing launcher built into PyTorch to execute the entire user program as subprocesses, making it a very convenient method.

![](../images/multi_process_2.png)

<br>

This method uses a command such as:
`python -m torch.distributed.launch --nproc_per_node=n OOO.py`


In [None]:
"""
src/ch2_distributed_programming/multi_process_3.py
"""

# The entire code becomes a subprocess.
import os

# Variables such as RANK, LOCAL_RANK, and WORLD_SIZE are set automatically.
print(f"hello world, {os.environ['RANK']}")


In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/multi_process_3.py
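
A script started by the launcher usually reads its rank information from the environment. The sketch below parses those variables into integers and falls back to single-process defaults when they are missing, for example when the script is run with plain `python`. The helper `read_dist_env` and its fallback values are assumptions made for illustration and are not part of PyTorch.

```python
import os


def read_dist_env(environ=os.environ):
    """Parse launcher-provided variables, defaulting to a single process."""
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }


# Simulate the environment the launcher would give the third of four processes.
fake_env = {"RANK": "2", "LOCAL_RANK": "2", "WORLD_SIZE": "4"}
print(read_dist_env(fake_env))  # {'rank': 2, 'local_rank': 2, 'world_size': 4}
print(read_dist_env({}))        # {'rank': 0, 'local_rank': 0, 'world_size': 1}
```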

## 2. Distributed Programming with PyTorch
### Concept of Message Passing

Message passing refers to a method in which multiple processes that do not share the same address space exchange data indirectly through messages. For example, if Process-1 is coded to send data with a specific tag to a message queue, and Process-2 is coded to receive that data, the two processes can exchange information without sharing any memory space. Most distributed communication techniques used in large-scale model development rely on this message passing approach.

![](../images/message_passing.png)
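
The send-and-receive pattern described above can be mimicked inside a single Python program. In this sketch, two threads stand in for Process-1 and Process-2, and a `queue.Queue` stands in for the message queue. Real message passing happens between processes with separate address spaces, so this only illustrates the tagged send/receive flow. The function name `exchange` and the tag value are hypothetical.

```python
import queue
import threading


def exchange(payload, tag="greeting"):
    """Send `payload` with `tag` through a mailbox and return what the receiver got."""
    mailbox = queue.Queue()
    received = []

    def process_1():
        mailbox.put((tag, payload))  # Send a tagged message.

    def process_2():
        msg_tag, msg_payload = mailbox.get()  # Block until a message arrives.
        if msg_tag == tag:
            received.append(msg_payload)

    # The receiver is started first to show that it simply waits for the message.
    workers = [threading.Thread(target=process_2), threading.Thread(target=process_1)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return received[0]


print(exchange([1.0, 2.0, 3.0]))  # [1.0, 2.0, 3.0]
```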

<br>

### MPI (Message Passing Interface)
MPI refers to a standard interface for message passing. MPI defines various operations used for message passing between processes (e.g., broadcast, reduce, scatter, gather, ...), and a well-known open-source implementation is OpenMPI.

![](../images/open_mpi.png)

<br>

### NCCL & GLOO
In practice, libraries such as NCCL or GLOO are used more often than OpenMPI.

- **NCCL (NVIDIA Collective Communication Library)**
  - A GPU-optimized message passing library developed by NVIDIA (pronounced "nickel").
  - Known to deliver significantly higher performance on NVIDIA GPUs compared to other tools.
- **GLOO (Facebook's Collective Communication Library)**
  - A message passing library developed by Facebook.
  - In `torch`, it is mainly recommended for CPU-based distributed processing.

<br>

### Backend Library Selection Guide
Unless there is a specific reason to use OpenMPI, NCCL or GLOO is generally preferred: use **NCCL for GPU-based workloads** and **GLOO for CPU-based workloads**. For more detailed information, please refer to:
https://pytorch.org/docs/stable/distributed.html

The operations supported by each backend are shown below.

![](../images/backends.png)

<br>

### The `torch.distributed` Package
While directly using libraries such as `gloo`, `nccl`, or `openmpi` can be a valuable learning experience, it is not feasible to cover all of them due to time constraints. Instead, we will proceed using the `torch.distributed` package, which wraps these libraries. In practical applications, developers typically use high-level packages like `torch.distributed` rather than interacting directly with low-level libraries such as `nccl`.


### Process Group

Managing a large number of processes can be challenging, so process groups are used to simplify management. When `init_process_group` is called, a default process group (`default_pg`) that includes all processes is created. The `init_process_group` function, which initializes a process group, **must be executed in subprocesses**. If you want to create an additional group that includes only a specific subset of processes, you can call `new_group`.


In [None]:
"""
src/ch2_distributed_programming/process_group_1.py
"""

import torch.distributed as dist
# Typically abbreviated as `dist`

dist.init_process_group(backend="nccl", rank=0, world_size=1)
# Initialize the process group
# In this example, we use NCCL, which is the most commonly used backend.
# You can also specify 'mpi' or 'gloo' instead of 'nccl' for the backend.

process_group = dist.new_group([0])
# Create a process group that includes process with rank 0

print(process_group)


In [None]:
!python ../src/ch2_distributed_programming/process_group_1.py

When running the code above, an error occurs because required variables such as `MASTER_ADDR` and `MASTER_PORT` are not set. We will set these values and run the code again.


In [None]:
"""
src/ch2_distributed_programming/process_group_2.py
"""

import torch.distributed as dist
import os


# These values are typically registered and used as environment variables.
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# Set the address required for communication.
os.environ["MASTER_ADDR"] = "localhost"  # Address to communicate with (usually localhost)
os.environ["MASTER_PORT"] = "29500"  # Port for communication (any available value is fine)

dist.init_process_group(backend="nccl", rank=0, world_size=1)
# Initialize the process group

process_group = dist.new_group([0])
# Create a process group that includes the process with rank 0

print(process_group)


In [None]:
!python ../src/ch2_distributed_programming/process_group_2.py

In [None]:
"""
src/ch2_distributed_programming/process_group_3.py
"""

import torch.multiprocessing as mp
import torch.distributed as dist
import os


# Code executed concurrently in subprocesses
def fn(rank, world_size):
    # `rank` is provided automatically. `world_size` is passed as an argument.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    group = dist.new_group([_ for _ in range(world_size)])
    print(f"{group} - rank: {rank}")


# Main process
if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    os.environ["WORLD_SIZE"] = "4"

    mp.spawn(
        fn=fn,
        args=(4,),  # Pass world_size
        nprocs=4,  # Number of processes to create
        join=True,  # Whether to join the processes
        daemon=False,  # Whether the processes are daemons
        start_method="spawn",  # Set the start method
    )


In [None]:
!python ../src/ch2_distributed_programming/process_group_3.py

When launching with `python -m torch.distributed.launch --nproc_per_node=n OOO.py`, the following approach is used. The `rank` and `world_size` can be retrieved via functions like `dist.get_rank()` and `dist.get_world_size()`.


In [None]:
"""
src/ch2_distributed_programming/process_group_4.py
"""

import torch.distributed as dist

dist.init_process_group(backend="nccl")
# Initialize the process group

group = dist.new_group([_ for _ in range(dist.get_world_size())])
# Create a process group

print(f"{group} - rank: {dist.get_rank()}\n")


In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/process_group_4.py

### P2P Communication (Point-to-Point)

![](../images/p2p.png)

<br>

P2P (Point-to-Point) communication refers to a communication pattern in which a specific process sends data directly to another process. This type of communication can be implemented using the `send` and `recv` functions provided by the `torch.distributed` package.


In [None]:
"""
src/ch2_distributed_programming/p2p_communication.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("gloo")
# Gloo is used here so that send/recv works on CPU tensors.
# Recent PyTorch versions also support send/recv with NCCL on CUDA tensors.

if dist.get_rank() == 0:
    tensor = torch.randn(2, 2)
    dist.send(tensor, dst=1)

elif dist.get_rank() == 1:
    tensor = torch.zeros(2, 2)
    print(f"rank 1 before: {tensor}\n")
    dist.recv(tensor, src=0)
    print(f"rank 1 after: {tensor}\n")

else:
    raise RuntimeError("wrong rank")


In [None]:
!python -m torch.distributed.launch --nproc_per_node=2 ../src/ch2_distributed_programming/p2p_communication.py

It is important to note that these operations perform **synchronous communication**. For asynchronous (non-blocking) communication, you can use `isend` and `irecv`. Since they operate asynchronously, you must call the `wait()` method and wait for the communication with the other process to complete before accessing the data.


In [None]:
"""
src/ch2_distributed_programming/p2p_communication_non_blocking.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("gloo")
# Gloo is used here so that isend/irecv works on CPU tensors.
# Recent PyTorch versions also support these operations with NCCL on CUDA tensors.

if dist.get_rank() == 0:
    tensor = torch.randn(2, 2)
    request = dist.isend(tensor, dst=1)
elif dist.get_rank() == 1:
    tensor = torch.zeros(2, 2)
    request = dist.irecv(tensor, src=0)
else:
    raise RuntimeError("wrong rank")

request.wait()

print(f"rank {dist.get_rank()}: {tensor}")


In [None]:
!python -m torch.distributed.launch --nproc_per_node=2 ../src/ch2_distributed_programming/p2p_communication_non_blocking.py

<br>

### Collective Communication

Collective communication refers to communication in which multiple processes participate together. While there are various collective operations, the basic set consists of the following four operations: `broadcast`, `scatter`, `gather`, and `reduce`.

![](../images/collective.png)

In addition, we will cover a total of eight operations, including composite operations such as `all-reduce`, `all-gather`, and `reduce-scatter`, as well as the synchronization operation `barrier`. Furthermore, if you want to execute these operations in asynchronous mode, you can set the `async_op` parameter to `True` when performing each operation.

<br>

#### 1) Broadcast

Broadcast is an operation that copies data from a specific process to all processes within a group.

![](../images/broadcast.png)


In [None]:
"""
src/ch2_distributed_programming/broadcast.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
# By setting the device, you can later access the device corresponding to the rank.

if rank == 0:
    tensor = torch.randn(2, 2).to(torch.cuda.current_device())
else:
    tensor = torch.zeros(2, 2).to(torch.cuda.current_device())

print(f"before rank {rank}: {tensor}\n")
dist.broadcast(tensor, src=0)
print(f"after rank {rank}: {tensor}\n")

In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/broadcast.py

When P2P operations such as `send` and `recv` are not supported, `broadcast` is sometimes used as an alternative for point-to-point communication. For example, when `src = 0` and `dst = 1`, creating a group with `new_group([0, 1])` and performing a `broadcast` operation is equivalent to P2P communication from rank 0 to rank 1.


In [None]:
"""
Note: deepspeed/deepspeed/runtime/pipe/p2p.py
"""

def send(tensor, dest_stage, async_op=False):
    global _groups
    assert async_op == False, "Doesnt support async_op true"
    src_stage = _grid.get_stage_id()
    _is_valid_send_recv(src_stage, dest_stage)

    dest_rank = _grid.stage_to_global(stage_id=dest_stage)
    if async_op:
        global _async
        op = dist.isend(tensor, dest_rank)
        _async.append(op)
    else:

        if can_send_recv():
            return dist.send(tensor, dest_rank)
        else:
            group = _get_send_recv_group(src_stage, dest_stage)
            src_rank = _grid.stage_to_global(stage_id=src_stage)
            return dist.broadcast(tensor, src_rank, group=group, async_op=async_op)
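
To make the equivalence concrete, here is a pure-Python sketch that models each process's tensor as a list in a dictionary keyed by rank. It does not use `torch.distributed`; the function `simulate_broadcast` is a hypothetical stand-in for the semantics of a broadcast inside a group.

```python
def simulate_broadcast(buffers, src, group):
    """Copy the buffer of `src` into every rank listed in `group`.

    `buffers` maps rank -> list of values and stands in for per-process tensors.
    """
    result = {rank: list(values) for rank, values in buffers.items()}
    for rank in group:
        result[rank] = list(buffers[src])
    return result


buffers = {0: [1.5, 2.5], 1: [0.0, 0.0], 2: [0.0, 0.0]}
# Broadcasting inside the two-member group [0, 1] only changes rank 1,
# which is the same effect as a send from rank 0 to rank 1.
print(simulate_broadcast(buffers, src=0, group=[0, 1]))
# {0: [1.5, 2.5], 1: [1.5, 2.5], 2: [0.0, 0.0]}
```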

<br>

#### 2) Reduce
Reduce is an operation that applies a specified computation to the data held by each process and collects the result on a single device. Common operations include sum, max, and min.

![](../images/reduce.png)


In [None]:
"""
src/ch2_distributed_programming/reduce_sum.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

tensor = torch.ones(2, 2).to(torch.cuda.current_device()) * rank
# rank==0 => [[0, 0], [0, 0]]
# rank==1 => [[1, 1], [1, 1]]
# rank==2 => [[2, 2], [2, 2]]
# rank==3 => [[3, 3], [3, 3]]

dist.reduce(tensor, op=torch.distributed.ReduceOp.SUM, dst=0)

if rank == 0:
    print(tensor)

In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/reduce_sum.py

In [None]:
"""
src/ch2_distributed_programming/reduce_max.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

tensor = torch.ones(2, 2).to(torch.cuda.current_device()) * rank
# rank==0 => [[0, 0], [0, 0]]
# rank==1 => [[1, 1], [1, 1]]
# rank==2 => [[2, 2], [2, 2]]
# rank==3 => [[3, 3], [3, 3]]

dist.reduce(tensor, op=torch.distributed.ReduceOp.MAX, dst=0)

if rank == 0:
    print(tensor)

In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/reduce_max.py
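
The values that rank 0 should print in the two examples above can be worked out by hand. The sketch below does this in plain Python, modeling each rank's 2x2 tensor as a flat list of four numbers. `simulate_reduce` is a hypothetical helper, not a PyTorch function. Note that with `dist.reduce`, only the destination rank is guaranteed to hold the final result.

```python
from functools import reduce as fold


def simulate_reduce(per_rank, op):
    """Combine equally sized per-rank lists element by element with `op`."""
    return [fold(op, column) for column in zip(*per_rank)]


world_size = 4
# Each rank holds a flattened 2x2 tensor filled with its own rank.
per_rank = [[float(rank)] * 4 for rank in range(world_size)]

print(simulate_reduce(per_rank, lambda a, b: a + b))  # [6.0, 6.0, 6.0, 6.0]
print(simulate_reduce(per_rank, max))                 # [3.0, 3.0, 3.0, 3.0]
```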

#### 3) Scatter
Scatter is an operation that splits a collection of elements on one process and sends one piece to each device.


![](../images/scatter.png)

In [None]:
"""
src/ch2_distributed_programming/scatter.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("gloo")
rank = dist.get_rank()
torch.cuda.set_device(rank)


output = torch.zeros(1)
print(f"before rank {rank}: {output}\n")

if rank == 0:
    inputs = torch.tensor([10.0, 20.0, 30.0, 40.0])
    inputs = torch.split(inputs, dim=0, split_size_or_sections=1)
    # (tensor([10]), tensor([20]), tensor([30]), tensor([40]))
    dist.scatter(output, scatter_list=list(inputs), src=0)
else:
    dist.scatter(output, src=0)

print(f"after rank {rank}: {output}\n")

In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/scatter.py

Since NCCL does not support the `scatter` operation, the following approach is used to perform a scatter operation.


In [None]:
"""
src/ch2_distributed_programming/scatter_nccl.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

inputs = torch.tensor([10.0, 20.0, 30.0, 40.0])
inputs = torch.split(tensor=inputs, dim=-1, split_size_or_sections=1)
output = inputs[rank].contiguous().to(torch.cuda.current_device())
print(f"after rank {rank}: {output}\n")

In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/scatter_nccl.py
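
The local-slicing workaround gives the same result as a true scatter only when every rank already holds an identical copy of the full input. The following plain-Python sketch checks that equivalence for the values used above. Both helper functions are hypothetical, and the slicing helper assumes the input length is divisible by the world size.

```python
def scatter_from_src(chunks):
    """Rank i receives chunks[i] from the source rank (what `scatter` does)."""
    return {rank: chunk for rank, chunk in enumerate(chunks)}


def slice_locally(full_input, rank, world_size):
    """Each rank cuts its own slice out of a full copy it already holds."""
    chunk = len(full_input) // world_size
    return full_input[rank * chunk:(rank + 1) * chunk]


data = [10.0, 20.0, 30.0, 40.0]
world_size = 4
scattered = scatter_from_src([[x] for x in data])
sliced = {rank: slice_locally(data, rank, world_size) for rank in range(world_size)}
print(scattered == sliced)  # True
print(sliced[2])            # [30.0]
```

The trade-off is that local slicing needs no communication at all, but it cannot distribute data that exists on only one rank.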

In [None]:
"""
Note: megatron-lm/megatron/mpu/mappings.py
"""

def _split(input_):
    """Split the tensor along its last dimension and keep the
    corresponding slice."""

    world_size = get_tensor_model_parallel_world_size()
    # Bypass the function if we are using only 1 GPU.
    if world_size==1:
        return input_

    # Split along last dimension.
    input_list = split_tensor_along_last_dim(input_, world_size)

    # Note: torch.split does not create contiguous tensors by default.
    rank = get_tensor_model_parallel_rank()
    output = input_list[rank].contiguous()

    return output

class _ScatterToModelParallelRegion(torch.autograd.Function):
    """Split the input and keep only the corresponding chuck to the rank."""

    @staticmethod
    def symbolic(graph, input_):
        return _split(input_)

    @staticmethod
    def forward(ctx, input_):
        return _split(input_)

    @staticmethod
    def backward(ctx, grad_output):
        return _gather(grad_output)

<br>

#### 4) Gather
Gather is an operation that collects tensors from multiple devices and combines them into a single tensor.

![](../images/gather.png)


In [None]:
"""
src/ch2_distributed_programming/gather.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("gloo")
# NCCL does not support gather.
rank = dist.get_rank()
torch.cuda.set_device(rank)

input = torch.ones(1) * rank
# rank == 0 => [0]
# rank == 1 => [1]
# rank == 2 => [2]
# rank == 3 => [3]

if rank == 0:
    outputs_list = [torch.zeros(1), torch.zeros(1), torch.zeros(1), torch.zeros(1)]
    dist.gather(input, gather_list=outputs_list, dst=0)
    print(outputs_list)
else:
    dist.gather(input, dst=0)


In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/gather.py

<br>

#### 5) All-reduce
Operations prefixed with **All-** perform the specified operation and then broadcast the result to all devices. As shown in the figure below, All-reduce first performs a reduce operation and then copies the computed result to every device.

![](../images/allreduce.png)


In [None]:
"""
src/ch2_distributed_programming/allreduce_sum.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

tensor = torch.ones(2, 2).to(torch.cuda.current_device()) * rank
# rank==0 => [[0, 0], [0, 0]]
# rank==1 => [[1, 1], [1, 1]]
# rank==2 => [[2, 2], [2, 2]]
# rank==3 => [[3, 3], [3, 3]]

dist.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM)

print(f"rank {rank}: {tensor}\n")

In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/allreduce_sum.py

In [None]:
"""
src/ch2_distributed_programming/allreduce_max.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

tensor = torch.ones(2, 2).to(torch.cuda.current_device()) * rank
# rank==0 => [[0, 0], [0, 0]]
# rank==1 => [[1, 1], [1, 1]]
# rank==2 => [[2, 2], [2, 2]]
# rank==3 => [[3, 3], [3, 3]]

dist.all_reduce(tensor, op=torch.distributed.ReduceOp.MAX)

print(f"rank {rank}: {tensor}\n")

In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/allreduce_max.py

#### 6) All-gather
All-gather performs a gather operation and then copies the collected result to all devices.

![](../images/allgather.png)


In [None]:
"""
src/ch2_distributed_programming/allgather.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

input = torch.ones(1).to(torch.cuda.current_device()) * rank
# rank==0 => [0]
# rank==1 => [1]
# rank==2 => [2]
# rank==3 => [3]

outputs_list = [
    torch.zeros(1, device=torch.device(torch.cuda.current_device())),
    torch.zeros(1, device=torch.device(torch.cuda.current_device())),
    torch.zeros(1, device=torch.device(torch.cuda.current_device())),
    torch.zeros(1, device=torch.device(torch.cuda.current_device())),
]

dist.all_gather(tensor_list=outputs_list, tensor=input)
print(outputs_list)


In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/allgather.py
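
The difference between gather and all-gather is only in who ends up holding the collected list. The plain-Python sketch below models this with one value per rank. The two helper functions are hypothetical and only imitate the semantics; they do not call `torch.distributed`.

```python
def simulate_gather(per_rank, dst):
    """Only `dst` ends up with the list of every rank's value."""
    return {
        rank: (list(per_rank) if rank == dst else None)
        for rank in range(len(per_rank))
    }


def simulate_all_gather(per_rank):
    """Every rank ends up with the list of every rank's value."""
    return {rank: list(per_rank) for rank in range(len(per_rank))}


per_rank = [0.0, 1.0, 2.0, 3.0]  # Rank r contributes the value r.
print(simulate_gather(per_rank, dst=0))
# {0: [0.0, 1.0, 2.0, 3.0], 1: None, 2: None, 3: None}
print(simulate_all_gather(per_rank)[3])  # [0.0, 1.0, 2.0, 3.0]
```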

#### 7) Reduce-scatter
Reduce-scatter performs a reduce operation and then splits the result, returning a portion of the output to each device.

![](../images/reduce_scatter.png)


In [None]:
"""
src/ch2_distributed_programming/reduce_scatter.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

input_list = torch.tensor([1, 10, 100, 1000]).to(torch.cuda.current_device()) * rank
input_list = torch.split(input_list, dim=0, split_size_or_sections=1)
# rank==0 => [0, 0, 0, 0]
# rank==1 => [1, 10, 100, 1000]
# rank==2 => [2, 20, 200, 2000]
# rank==3 => [3, 30, 300, 3000]

output = torch.tensor([0], device=torch.device(torch.cuda.current_device()),)

dist.reduce_scatter(
    output=output,
    input_list=list(input_list),
    op=torch.distributed.ReduceOp.SUM,
)

print(f"rank {rank}: {output}\n")


In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/reduce_scatter.py
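
The expected output of the example above can be derived by hand: rank `i` receives the sum, over all ranks, of the `i`-th chunk. The plain-Python sketch below reproduces that arithmetic. `simulate_reduce_scatter` is a hypothetical helper that assumes a SUM reduction and one scalar chunk per destination rank.

```python
def simulate_reduce_scatter(input_lists):
    """input_lists[r][i] is the chunk that rank r contributes toward rank i."""
    world_size = len(input_lists)
    return [
        sum(input_lists[r][i] for r in range(world_size))
        for i in range(world_size)
    ]


# Rank r holds [1, 10, 100, 1000] * r, as in the example above.
input_lists = [[1 * r, 10 * r, 100 * r, 1000 * r] for r in range(4)]
print(simulate_reduce_scatter(input_lists))  # [6, 60, 600, 6000]
```

Entry `i` of the printed list is the value that rank `i` would hold in `output`.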

#### 8) Barrier
A barrier is used to synchronize processes. Processes that reach the barrier first will wait until all processes have reached the same point.


In [None]:
"""
src/ch2_distributed_programming/barrier.py
"""
import time
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()

if rank == 0:
    seconds = 0
    while seconds <= 3:
        time.sleep(1)
        seconds += 1
        print(f"rank 0 - seconds: {seconds}\n")

print(f"rank {rank}: no-barrier\n")
dist.barrier()
print(f"rank {rank}: barrier\n")

In [None]:
!python -m torch.distributed.launch --nproc_per_node=4 ../src/ch2_distributed_programming/barrier.py
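
The same waiting behavior can be observed without GPUs by using threads and `threading.Barrier` from the standard library. This is only an analogy for `dist.barrier()`: threads stand in for processes, and the function name `run_with_barrier` and its parameters are hypothetical.

```python
import threading
import time


def run_with_barrier(world_size=4, slow_rank=0, delay=0.2):
    """Return the order of events when every rank waits at a shared barrier."""
    barrier = threading.Barrier(world_size)
    events = []
    lock = threading.Lock()

    def worker(rank):
        if rank == slow_rank:
            time.sleep(delay)  # This rank arrives late.
        with lock:
            events.append(("before", rank))
        barrier.wait()  # Nobody continues until all ranks have arrived.
        with lock:
            events.append(("after", rank))

    threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(world_size)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


events = run_with_barrier()
labels = [label for label, _ in events]
print(labels[:4])  # ['before', 'before', 'before', 'before']
print(labels[4:])  # ['after', 'after', 'after', 'after']
```

Even though one rank sleeps before reaching the barrier, no "after" event is recorded until every rank has recorded its "before" event.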

### That's a lot, right...? 😅
You only need to remember the four basic operations below; most others can be inferred from them.

![](../images/collective.png)

Based on these four operations, keep the following points in mind:

- `all-reduce` and `all-gather` can be thought of as performing the corresponding operation first, followed by a `broadcast`.
- `reduce-scatter` literally means taking the result of a `reduce` operation and then **scattering (splitting)** it.
- `barrier` works exactly like its name suggests: a wall. Processes that arrive early are blocked, like hitting a wall, until all processes reach the same point.
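
These relationships can be written down as a small plain-Python model in which the composite operations are built only from the four basic ones. This is a conceptual sketch with hypothetical function names: each function returns a list whose entry `i` is what rank `i` would hold, and real backends typically use more efficient algorithms than this literal composition.

```python
def broadcast(value, world_size):
    """Every rank receives a copy of `value`."""
    return [value for _ in range(world_size)]


def reduce_sum(per_rank):
    """Element-wise sum of the per-rank lists (held by one rank)."""
    return [sum(column) for column in zip(*per_rank)]


def gather(per_rank):
    """One rank collects the data of every rank."""
    return list(per_rank)


def scatter(chunks):
    """Rank i receives chunks[i]."""
    return list(chunks)


def all_reduce_sum(per_rank):
    return broadcast(reduce_sum(per_rank), len(per_rank))


def all_gather(per_rank):
    return broadcast(gather(per_rank), len(per_rank))


def reduce_scatter_sum(per_rank):
    return scatter(reduce_sum(per_rank))


per_rank = [[1, 2], [3, 4]]  # Two ranks, two elements each.
print(all_reduce_sum(per_rank))      # [[4, 6], [4, 6]]
print(all_gather(per_rank))          # [[[1, 2], [3, 4]], [[1, 2], [3, 4]]]
print(reduce_scatter_sum(per_rank))  # [4, 6]
```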
