# Pipeline Parallelism
In this session, we will explore pipeline parallelism.

## 1. Inter-layer Model Parallelism
Pipeline parallelism is an improvement over inter-layer model parallelism. Inter-layer model parallelism is a method in which specific layers are assigned to specific GPUs, as shown below. In the figure, layers 1, 2, and 3 are assigned to GPU 1, while layers 4 and 5 are assigned to GPU 2. Each partitioned segment is called a **stage**. In this example, the model is split into two stages.

![](../images/inter_layer.png)

However, due to the nature of neural networks (where the output of a previous layer becomes the input to the next layer), computation on one GPU must finish before another GPU can begin its computation. In other words, as illustrated below, inter-layer model parallelism has a critical limitation: **only one GPU can be utilized at a time**.

![](../images/inter_layer_2.png)
![](../images/inter_layer_3.gif)
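
To make the limitation concrete, here is a minimal sketch that builds a timeline for a single forward pass and measures how much of the available GPU time is actually used. The helper names `naive_timeline` and `utilization` and the unit stage times are assumptions made for illustration only.

```python
def naive_timeline(stage_times):
    """Return (gpu, start, end) tuples when stages run strictly one after another."""
    timeline, clock = [], 0.0
    for gpu, duration in enumerate(stage_times):
        # A stage cannot start until the previous stage has produced its output.
        timeline.append((gpu, clock, clock + duration))
        clock += duration
    return timeline


def utilization(timeline, num_gpus):
    """Fraction of the total GPU-time (num_gpus * makespan) spent computing."""
    makespan = max(end for _, _, end in timeline)
    busy = sum(end - start for _, start, end in timeline)
    return busy / (num_gpus * makespan)


print(utilization(naive_timeline([1.0, 1.0]), num_gpus=2))
print(utilization(naive_timeline([1.0, 1.0, 1.0, 1.0]), num_gpus=4))
```

Under these assumptions the utilization is always `1 / num_gpus`, so adding more stages only makes the idle share larger.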


## 2. GPipe
GPipe is a pipeline parallelism technique developed by Google. It was introduced to reduce GPU idle time in inter-layer model parallelism. GPipe works by further splitting a mini-batch into micro-batches and pipelining the training process.

![](../images/gpipe_1.png)

<br>
<br>

![](../images/pipeline_parallelism2.png)

<br>

### Micro-batch
- A **mini-batch** is a subset of samples obtained by dividing the entire dataset into *n* parts.
- A **micro-batch** is a further subdivision of a mini-batch into *m* smaller subsets.

![](../images/gpipe_2.png)
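
As a rough illustration of the second split, the sketch below divides a mini-batch (a plain Python list here) into *m* contiguous micro-batches. The function name `split_into_micro_batches` is hypothetical; real libraries operate on tensors rather than lists.

```python
def split_into_micro_batches(mini_batch, num_micro_batches):
    """Split a list of samples into contiguous, near-equal micro-batches."""
    size, remainder = divmod(len(mini_batch), num_micro_batches)
    chunks, start = [], 0
    for i in range(num_micro_batches):
        # The first `remainder` micro-batches take one extra sample.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(mini_batch[start:end])
        start = end
    return chunks


mini_batch = list(range(8))
print(split_into_micro_batches(mini_batch, 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```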

<br>

### Pipelining
GPipe splits a mini-batch into micro-batches and pipelines the computation. The red regions (where GPUs are idle) are referred to as **bubble time**. As the number of micro-batches per mini-batch increases (that is, as each micro-batch gets smaller), the share of time lost to bubbles decreases.

![](../images/gpipe_3.gif)
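
The sketch below simulates the forward pass of such a pipeline on a grid of time slots and measures the idle share for different numbers of micro-batches. It assumes every stage takes exactly one time slot per micro-batch and ignores communication; `forward_grid` and `bubble_fraction` are illustrative names, not part of any library.

```python
def forward_grid(num_stages, num_micro_batches):
    """Return grid[stage][time_slot] holding a micro-batch id, or None when idle."""
    num_slots = num_micro_batches + num_stages - 1
    grid = [[None] * num_slots for _ in range(num_stages)]
    for stage in range(num_stages):
        for mb in range(num_micro_batches):
            # A stage can start micro-batch `mb` only after the previous stage
            # has finished it, i.e. `stage` slots after it entered the pipeline.
            grid[stage][stage + mb] = mb
    return grid


def bubble_fraction(grid):
    """Share of (stage, time_slot) cells in which a GPU is idle."""
    cells = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return idle / cells


for m in (1, 4, 16):
    print(m, round(bubble_fraction(forward_grid(4, m)), 3))
```

With `p` stages and `m` micro-batches the idle share in this simplified model is `(p - 1) / (m + p - 1)`. If the backward pass is assumed to mirror the forward pass, the same fraction applies to the whole step.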


### GPipe with PyTorch
You can easily use GPipe by leveraging `torchgpipe`, which was released by KakaoBrain. However, only models wrapped with `nn.Sequential` are supported, and the input and output types of all modules are restricted to `torch.Tensor` or `Tuple[torch.Tensor]`. As a result, implementing models with this approach can be quite challenging.


In [None]:
"""
src/ch5_pipeline_parallelism/gpipe.py
"""

import torch
import torch.nn as nn
from datasets import load_dataset
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchgpipe import GPipe
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers.models.gpt2.modeling_gpt2 import GPT2Block as GPT2BlockBase


class GPT2Preprocessing(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_dim = config.hidden_size
        self.wte = nn.Embedding(config.vocab_size, self.embed_dim)
        self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)
        self.drop = nn.Dropout(config.embd_pdrop)

    def forward(self, input_ids):
        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1])
        position_ids = torch.arange(
            0, input_shape[-1], dtype=torch.long, device=input_ids.device
        )
        position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
        inputs_embeds = self.wte(input_ids)
        position_embeds = self.wpe(position_ids)
        hidden_states = inputs_embeds + position_embeds
        hidden_states = self.drop(hidden_states)
        return hidden_states


class GPT2Block(GPT2BlockBase):
    def forward(self, hidden_states):
        hidden_states = super(GPT2Block, self).forward(
            hidden_states=hidden_states,
        )
        return hidden_states[0]


class GPT2Postprocessing(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_f = nn.LayerNorm(
            config.hidden_size,
            eps=config.layer_norm_epsilon,
        )
        self.lm_head = nn.Linear(
            config.hidden_size,
            config.vocab_size,
            bias=False,
        )

    def forward(self, hidden_states):
        hidden_states = self.ln_f(hidden_states)
        lm_logits = self.lm_head(hidden_states)
        return lm_logits


def create_model_from_pretrained(model_name):
    pretrained = GPT2LMHeadModel.from_pretrained(model_name)
    preprocess = GPT2Preprocessing(pretrained.config)
    preprocess.wte.weight = pretrained.transformer.wte.weight
    preprocess.wpe.weight = pretrained.transformer.wpe.weight

    blocks = pretrained.transformer.h
    for i, block in enumerate(blocks):
        block.__class__ = GPT2Block

    postprocess = GPT2Postprocessing(pretrained.config)
    postprocess.ln_f.weight = pretrained.transformer.ln_f.weight
    postprocess.ln_f.bias = pretrained.transformer.ln_f.bias
    postprocess.lm_head.weight.data = pretrained.lm_head.weight.data.clone()

    return nn.Sequential(preprocess, *blocks, postprocess)


if __name__ == "__main__":
    world_size = 4

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    model = create_model_from_pretrained(model_name="gpt2")
    model = GPipe(
        model,
        balance=[4, 3, 3, 4],
        devices=[0, 1, 2, 3],
        chunks=world_size,
    )

    datasets = load_dataset("squad").data["train"]["context"]
    datasets = [str(sample) for sample in datasets]
    data_loader = DataLoader(datasets, batch_size=8, num_workers=8)

    optimizer = Adam(model.parameters(), lr=3e-5)
    loss_fn = nn.CrossEntropyLoss()

    for i, data in enumerate(data_loader):
        optimizer.zero_grad()
        tokens = tokenizer(data, return_tensors="pt", truncation=True, padding=True)
        input_ids = tokens.input_ids.to(0)
        labels = tokens.input_ids.to(world_size - 1)

        lm_logits = model(input_ids)
        shift_logits = lm_logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        loss = loss_fn(
            shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
        )
        loss.backward()
        optimizer.step()

        if i % 10 == 0:
            print(f"step: {i}, loss: {loss.item()}")
        if i == 300:
            break


In [None]:
# !python -m torch.distributed.launch --nproc_per_node=4 ../src/gpipe.py
!python ../src/ch5_pipeline_parallelism/gpipe.py

## 3. 1F1B Pipelining (PipeDream)

`PipeDream`, released by Microsoft, performs pipelining in a slightly different way compared to `GPipe`. This approach is commonly referred to as **1F1B**. Unlike GPipe, which performs the backward pass only after all forward passes are completed, PipeDream alternates between **forward** and **backward** passes.

<img src="../images/1f1b.png" width=600>

There are two main challenges in 1F1B pipelining:
1. Weight version management  
2. Work partitioning  

<br>

### 1) Weight Version Management
In the case of GPipe, only a single version of the weights is maintained, but periodic **pipeline flushes** occur. A pipeline flush means waiting until every in-flight micro-batch has finished its forward and backward passes, and only then updating the parameters with the computed gradients. While the pipeline drains before the update and refills after it, some GPUs have no forward or backward work to do, which reduces processing efficiency.

<img src="../images/pipeline_flush.png" width=600>

PipeDream continuously updates parameters without requiring such flushes. As a result, the idle time between forward and backward passes is largely eliminated once the pipeline is full. However, this requires maintaining **multiple versions of the model parameters**. If only the latest version of the parameters were stored, a stage's weights could be updated while earlier micro-batches are still in flight, so the backward pass of a micro-batch would use different weights from its forward pass.

<img src="../images/1f1b.gif" width=800>

To prevent this issue, multiple versions of the weights are stored and managed. This, however, consumes a significant amount of memory, leading to a trade-off:

- **GPipe**: Memory-efficient, but less efficient in processing  
- **PipeDream**: Memory-inefficient, but more efficient in processing  
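
The following toy sketch illustrates the idea of keeping a weight version per in-flight micro-batch. It is not PipeDream's implementation: the stage is a single scalar multiplication `y = w * x`, and the class name `StashingStage` and its methods are assumptions made for illustration.

```python
class StashingStage:
    """Toy pipeline stage (y = w * x) that stashes the weight used in each forward pass."""

    def __init__(self, weight, lr=0.5):
        self.weight = weight
        self.lr = lr
        self.stash = {}  # micro-batch id -> weight version used in its forward pass

    def forward(self, mb_id, x):
        self.stash[mb_id] = self.weight  # remember which version this micro-batch saw
        return self.weight * x

    def backward(self, mb_id, x, grad_out):
        w_used = self.stash.pop(mb_id)   # same version as in the forward pass
        grad_w = grad_out * x            # d(w * x) / dw
        grad_x = grad_out * w_used       # d(w * x) / dx, computed with the stashed weight
        self.weight -= self.lr * grad_w  # the update is applied to the latest weights
        return grad_x


stage = StashingStage(weight=2.0)
stage.forward(0, x=1.0)
stage.forward(1, x=1.0)                  # two micro-batches in flight, both saw w = 2.0
stage.backward(0, x=1.0, grad_out=1.0)   # updates the weight to 1.5
grad_x = stage.backward(1, x=1.0, grad_out=1.0)
print(grad_x, stage.weight)              # 2.0 1.0
```

Micro-batch 1 still gets a gradient computed with `w = 2.0` even though the live weight had already moved to `1.5`. The cost is that the stash grows with the number of in-flight micro-batches, which is the memory overhead described above.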

<br>

### 2) Work Partitioning
The second challenge concerns how to partition the neural network. Simply assigning the same number of layers to each partition is not always the best solution. The most important goal is to **minimize idle time**, which requires the execution time of each partition to be as balanced as possible. In addition, factors such as parameter size and activation memory must also be considered.

<img src="../images/pipe_dream.png" width=600>

PipeDream determines an optimal partitioning strategy through **profiling and optimization**.
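
To give a feel for the optimization step, here is a simplified sketch that splits profiled per-layer times into contiguous stages so that the slowest stage is as fast as possible. This is not PipeDream's actual algorithm: it ignores communication cost, parameter size, and activation memory, and the function name and the layer times are hypothetical.

```python
from functools import lru_cache


def partition_layers(layer_times, num_stages):
    """Return (layers per stage, bottleneck time) minimizing the slowest stage."""
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(start, stages_left):
        # Best split of layers[start:] into `stages_left` contiguous stages.
        if stages_left == 1:
            return prefix[n] - prefix[start], (n - start,)
        result = None
        # Leave at least one layer for each of the remaining stages.
        for end in range(start + 1, n - stages_left + 2):
            head = prefix[end] - prefix[start]
            tail_cost, tail_sizes = best(end, stages_left - 1)
            candidate = (max(head, tail_cost), (end - start,) + tail_sizes)
            if result is None or candidate[0] < result[0]:
                result = candidate
        return result

    bottleneck, sizes = best(0, num_stages)
    return list(sizes), bottleneck


# Hypothetical profile: the last two layers are twice as slow as the others.
print(partition_layers([2, 2, 2, 2, 4, 4], num_stages=2))  # ([4, 2], 8.0)
```

Splitting by layer count would give `[3, 3]` with stage times 6 and 10, so the pipeline would run at the pace of the 10-unit stage. The balanced split `[4, 2]` brings the bottleneck down to 8.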

<br><br>


## 4. Variations of 1F1B Pipelining

Here, we introduce two pipeline strategies that improve upon PipeDream's 1F1B pipelining.

<br>

### 1) PipeDream 2BW (2-buffered weight update)
PipeDream 2BW was introduced to address the memory inefficiency of the original PipeDream. The core idea is to perform **gradient accumulation** during pipelining. By accumulating multiple gradients and applying updates all at once, it mitigates memory inefficiency. Unlike the original approach, 2BW only needs to maintain **two versions of the weights**.

![](../images/pipe_dream_2bw.png)

<br>

### 2) PipeDream Flush
PipeDream Flush is a pipelining method that combines 1F1B with **pipeline flush**. Because flushes occur, the idle time is similar to that of GPipe, but the amount of **activation memory that must be maintained during the forward–backward process is reduced**. Since PipeDream Flush performs flushes, there is no need to manage multiple versions of parameters. As a result, only a single set of weights needs to be maintained, making it even more memory-efficient than PipeDream 2BW. (Among the techniques introduced so far, this is the **most memory-efficient**.)

![](../images/pipe_dream_flush.png)

![](../images/pipe_dream_flush_2.png)
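
The activation-memory difference can be sketched by counting how many micro-batches a stage has run forward but not yet backward. The code below is a simplified model, not DeepSpeed's scheduler: it assumes a warm-up of `num_stages - stage - 1` forward passes followed by alternating forward and backward passes, and all function names are illustrative.

```python
def gpipe_order(num_micro_batches):
    """All forward passes first, then all backward passes (order of B is irrelevant here)."""
    m = num_micro_batches
    return [("F", i) for i in range(m)] + [("B", i) for i in reversed(range(m))]


def one_f_one_b_order(stage, num_stages, num_micro_batches):
    """Operation order for one stage under 1F1B with a flush at the end."""
    warmup = min(num_stages - stage - 1, num_micro_batches)
    ops = [("F", i) for i in range(warmup)]
    fwd, bwd = warmup, 0
    while fwd < num_micro_batches:       # steady state: one forward, one backward
        ops.append(("F", fwd))
        fwd += 1
        ops.append(("B", bwd))
        bwd += 1
    while bwd < num_micro_batches:       # cool-down: drain the remaining backward passes
        ops.append(("B", bwd))
        bwd += 1
    return ops


def peak_in_flight(ops):
    """Largest number of micro-batches whose activations are held at the same time."""
    in_flight = peak = 0
    for kind, _ in ops:
        in_flight += 1 if kind == "F" else -1
        peak = max(peak, in_flight)
    return peak


p, m = 4, 8
print("GPipe:", peak_in_flight(gpipe_order(m)))
print("1F1B :", [peak_in_flight(one_f_one_b_order(s, p, m)) for s in range(p)])
```

In this model a GPipe stage holds activations for all `m` micro-batches, while a 1F1B stage holds at most `min(m, p - stage)`, which is why the flush variant needs less activation memory for the same number of micro-batches.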

<br>


### Wait a moment... what is activation memory?
Most layers keep the tensors they need from the forward pass (their inputs or outputs) until the backward pass is executed. Those who have used `torch.autograd.Function` may already be familiar with this behavior: the forward method saves these tensors on the `ctx` object, and the backward method reads them back.


In [None]:
"""
Note: https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_custom_function.html
"""

import torch


class ReLU(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        # The input is saved here so that backward can use it.
        
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

This is because the values used during the forward pass are required to compute gradients during differentiation. Let's look at the following example.

![](../images/max_pooling.png)

The figure above shows a max pooling operation and the corresponding gradient computation. During the backward pass, a (2, 2) tensor such as `[[0.8, 1.2], [0.9, 0.5]]` is given as input. Using this input, we must compute the gradient matrix shown on the right. To do this, the original (4, 4) tensor from the forward pass is required.

For this reason, the tensor from the forward pass must be stored in memory. The memory required to store tensors that were used during the forward pass in order to perform the backward pass is called **activation memory**.
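
The max pooling case can be sketched in plain Python. The 4x4 input values below are made up for illustration (they are not taken from the figure), and the function names are hypothetical. The forward pass records where each maximum came from, and the backward pass cannot route the gradient without that saved information.

```python
def max_pool_2x2_forward(x):
    """2x2 max pooling with stride 2; returns the output and the saved argmax positions."""
    out, saved = [], []
    for i in range(0, len(x), 2):
        out_row, saved_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in (0, 1) for dj in (0, 1)]
            value, position = max(window)  # ties resolve to the later position
            out_row.append(value)
            saved_row.append(position)
        out.append(out_row)
        saved.append(saved_row)
    return out, saved


def max_pool_2x2_backward(grad_output, saved, input_shape):
    """Route each output gradient back to the position that produced the maximum."""
    rows, cols = input_shape
    grad_input = [[0.0] * cols for _ in range(rows)]
    for gi, saved_row in enumerate(saved):
        for gj, (i, j) in enumerate(saved_row):
            grad_input[i][j] = grad_output[gi][gj]
    return grad_input


x = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 7, 2],
     [6, 2, 3, 1]]
out, saved = max_pool_2x2_forward(x)
grad_input = max_pool_2x2_backward([[0.8, 1.2], [0.9, 0.5]], saved, (4, 4))
print(out)
for row in grad_input:
    print(row)
```

Here `saved` plays the role of activation memory: it exists only because the backward pass needs it, and it must stay alive for as long as the micro-batch is in flight.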


Now that we understand what activation memory is, shall we try practicing PipeDream? **PipeDream Flush is implemented in Microsoft's distributed training library, DeepSpeed.**  
(Reference: https://github.com/microsoft/DeepSpeed/issues/1110)  
So, let's use DeepSpeed.

### How to Use DeepSpeed Commands
Before that, let's first look at a very convenient feature provided by `deepspeed`. Previously, we used  
`python -m torch.distributed.launch --nproc_per_node=n OOO.py`  
for distributed training, but it was long and inconvenient. DeepSpeed provides much simpler commands such as `deepspeed` or `ds`.

- `ds --num_gpus=n OOO.py`
- `deepspeed --num_gpus=n OOO.py`

Running these commands launches one process per GPU, much like `torch.distributed.launch` does. From now on, we will use DeepSpeed commands for all distributed programs. (Honestly, `torch.distributed.launch` is just too long 😭)


In [None]:
"""
src/ch5_pipeline_parallelism/pipe_dream.py
"""
import deepspeed
import torch
import torch.nn as nn
from datasets import load_dataset
from deepspeed import PipelineModule
from torch.optim import Adam
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers.models.gpt2.modeling_gpt2 import GPT2Block as GPT2BlockBase
import torch.distributed as dist


class GPT2Preprocessing(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_dim = config.hidden_size
        self.wte = nn.Embedding(config.vocab_size, self.embed_dim)
        self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)
        self.drop = nn.Dropout(config.embd_pdrop)

    def forward(self, input_ids):
        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1])
        position_ids = torch.arange(
            0, input_shape[-1], dtype=torch.long, device=input_ids.device
        )
        position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
        inputs_embeds = self.wte(input_ids)
        position_embeds = self.wpe(position_ids)
        hidden_states = inputs_embeds + position_embeds
        hidden_states = self.drop(hidden_states)
        return hidden_states


class GPT2Block(GPT2BlockBase):
    def forward(self, hidden_states):
        hidden_states = super(GPT2Block, self).forward(
            hidden_states=hidden_states,
        )
        return hidden_states[0]


class GPT2Postprocessing(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_f = nn.LayerNorm(
            config.hidden_size,
            eps=config.layer_norm_epsilon,
        )
        self.lm_head = nn.Linear(
            config.hidden_size,
            config.vocab_size,
            bias=False,
        )

    def forward(self, hidden_states):
        hidden_states = self.ln_f(hidden_states)
        lm_logits = self.lm_head(hidden_states)
        return lm_logits


def create_model_from_pretrained(model_name):
    pretrained = GPT2LMHeadModel.from_pretrained(model_name)
    preprocess = GPT2Preprocessing(pretrained.config)
    preprocess.wte.weight = pretrained.transformer.wte.weight
    preprocess.wpe.weight = pretrained.transformer.wpe.weight

    blocks = pretrained.transformer.h
    for i, block in enumerate(blocks):
        block.__class__ = GPT2Block

    postprocess = GPT2Postprocessing(pretrained.config)
    postprocess.ln_f.weight = pretrained.transformer.ln_f.weight
    postprocess.ln_f.bias = pretrained.transformer.ln_f.bias
    postprocess.lm_head.weight.data = pretrained.lm_head.weight.data.clone()

    return nn.Sequential(preprocess, *blocks, postprocess)


def collate_fn(batch):
    batch_encoding = tokenizer.pad(
        {"input_ids": batch}, padding="max_length", max_length=1024
    )
    return batch_encoding.input_ids


def batch_fn(data):
    input_ids = data
    labels = data
    return input_ids, labels


def loss_fn(logits, labels):
    logits = logits[..., :-1, :].contiguous()
    labels = labels[..., 1:].contiguous()

    return nn.CrossEntropyLoss()(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
    )


if __name__ == "__main__":
    dist.init_process_group("nccl")
    world_size, rank = dist.get_world_size(), dist.get_rank()
    batch_size, train_steps = 16, 300
    train_samples = batch_size * train_steps

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    model = PipelineModule(
        create_model_from_pretrained(model_name="gpt2"),
        loss_fn=loss_fn,
        num_stages=world_size,
        partition_method="type:GPT2Block"
        # You can choose which layers to parallelize using partition_method.
    )
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=Adam(model.parameters(), lr=3e-5),
        config={
            "train_batch_size": batch_size,
            "steps_per_print": 9999999,
            # Turn off logging: https://github.com/microsoft/DeepSpeed/issues/1119
        },
    )
    engine.set_batch_fn(batch_fn)

    datasets = load_dataset("squad").data["train"]["context"]
    datasets = [str(sample) for i, sample in enumerate(datasets) if i < train_samples]
    datasets = [
        tokenizer(data, return_tensors="pt", truncation=True, max_length=1024).input_ids[0]
        for data in tqdm(datasets)
    ]
    data_loader = iter(
        DataLoader(
            sorted(datasets, key=len, reverse=True),
            # Uniform-length batching
            # https://mccormickml.com/2020/07/29/smart-batching-tutorial/
            batch_size=batch_size,
            num_workers=8,
            collate_fn=collate_fn,
            shuffle=False,
        )
    )

    for i in range(train_steps):
        loss = engine.train_batch(data_loader)

        if i % 10 == 0 and rank == 0:
            print(f"step: {i}, loss: {loss}")


In [None]:
!ds --num_gpus=4 ../src/ch5_pipeline_parallelism/pipe_dream.py

<br><br>

## 5. Interleaved Scheduling
Previously, each stage (a contiguous set of layers) was computed sequentially to produce outputs. For example, if there are 8 layers and 2 devices, layers 1–4 are typically assigned to device 1, and layers 5–8 to device 2. In this setup, device 1 executes layers 1–4 sequentially and then outputs the result. (Both GPipe and 1F1B operate in this manner.)

![](../images/interleaved_1.png)

However, **interleaved scheduling gives each device several smaller chunks of layers instead of one contiguous block, which can substantially reduce bubble time**. For instance, rather than holding layers 1–4, device 1 could hold layers 1–2 and 5–6 while device 2 holds layers 3–4 and 7–8, so each micro-batch visits every device twice and each device can start useful work sooner. This approach reduces bubble time, but it increases communication overhead, so careful tuning is required. (There is a trade-off.)
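
The trade-off can be estimated with a back-of-the-envelope model. The sketch below assumes the commonly cited idealization that splitting each device's layers into `v` chunks shrinks the pipeline bubble by a factor of `v`, and it ignores the extra communication entirely. The function name and the example numbers are illustrative.

```python
def bubble_fraction(num_stages, num_micro_batches, chunks_per_device=1):
    """Idle share of a step: bubble / (bubble + useful work), in the idealized model."""
    p, m, v = num_stages, num_micro_batches, chunks_per_device
    # Useful work is proportional to m; the bubble is proportional to (p - 1) / v.
    return (p - 1) / (v * m + p - 1)


for v in (1, 2, 4):
    print(v, round(bubble_fraction(num_stages=4, num_micro_batches=8, chunks_per_device=v), 3))
```

With `v = 1` this reduces to the non-interleaved case. Larger `v` keeps lowering the estimate, but every extra chunk boundary adds another activation transfer between devices, which is the communication cost mentioned above and is not captured by this formula.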
