# Pipeline Parallelism

In this session, we will learn about **Pipeline Parallelism**.

## 1. Inter-layer Model Parallelism

Pipeline Parallelism is an improvement over **Inter-layer model parallelism**.

In Inter-layer model parallelism, different layers are assigned to different GPUs.  
Each group of layers allocated to a GPU is called a **stage**.

In the example below:
- GPU 1 contains layers (1, 2, 3)
- GPU 2 contains layers (4, 5)
- The model is divided into **2 stages**

<br>

![Inter-layer Model Parallelism](../images/inter_layer.png)

<br>

### Limitation of Inter-layer Model Parallelism

Because neural networks depend on the output of previous layers,  
the next stage **cannot start until the previous stage finishes**.

This results in:

- Only one GPU being active at a time  
- Other GPUs remaining idle  
- Very poor hardware utilization  

<br>

![Sequential Execution](../images/inter_layer_2.png)

<br>

![Idle Time Illustration](../images/inter_layer_3.gif)

### Key Problem

Inter-layer model parallelism behaves like a relay race: only **one GPU works at any given time**.
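
The relay-race behavior can be made concrete with a small simulation. This is a minimal sketch, assuming every stage takes one time unit for its forward pass and one for its backward pass (in practice the backward pass usually costs more); the function names are hypothetical and are not part of any library.

```python
def sequential_timeline(num_stages):
    """Return the (stage, phase) pair that runs in each time slot.

    Without pipelining, the forward pass visits stages 0..S-1 one after
    another, and the backward pass then visits them in reverse.
    """
    forward = [(stage, "F") for stage in range(num_stages)]
    backward = [(stage, "B") for stage in reversed(range(num_stages))]
    return forward + backward


def utilization(timeline, num_stages):
    """Fraction of the time slots in which each stage is busy."""
    busy = [0] * num_stages
    for stage, _phase in timeline:
        busy[stage] += 1
    return [count / len(timeline) for count in busy]


timeline = sequential_timeline(num_stages=4)
print(timeline)
print(utilization(timeline, num_stages=4))  # each GPU is busy 2 of 8 slots
```

Under these assumptions each of the 4 GPUs is busy for only a quarter of the step, and the fraction shrinks as more stages are added.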

### Summary

| Aspect | Inter-layer Parallelism |
|--------|------------------------|
| Execution style | Sequential |
| GPU utilization | Very low |
| Synchronization | Blocking |
| Scalability | Poor |



## 2. GPipe

GPipe is a pipeline parallelism technique developed by Google.  
It was introduced to reduce GPU idle time in inter-layer model parallelism by further splitting a **mini-batch into micro-batches** and executing them in a pipelined manner.

<br>

![GPipe Overview](../images/gpipe_1.png)


<br>

### Micro-batch

- A **mini-batch** is a subset of the dataset used for one training step.
- A **micro-batch** is a further division of a mini-batch into smaller pieces.
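
The relationship between the two can be sketched in plain Python. This is a minimal illustration, assuming the mini-batch is a list of samples and that the split follows the same rule as `torch.chunk` (equal-sized pieces, with a smaller last piece when the size does not divide evenly); `split_into_micro_batches` is a hypothetical helper, not a `torchgpipe` function.

```python
import math


def split_into_micro_batches(mini_batch, chunks):
    """Split a mini-batch (a list of samples) into at most `chunks` pieces."""
    if not mini_batch:
        return []
    size = math.ceil(len(mini_batch) / chunks)
    return [mini_batch[i:i + size] for i in range(0, len(mini_batch), size)]


mini_batch = list(range(8))  # stand-in for 8 training samples
for micro_batch in split_into_micro_batches(mini_batch, chunks=4):
    print(micro_batch)
```

With 8 samples and 4 chunks, each micro-batch holds 2 samples.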

<br>

![Micro-batch Definition](../images/gpipe_2.png)

<br>

### Pipelining

GPipe splits mini-batches into micro-batches and schedules them through pipeline stages.

The red region in the diagram is called **bubble time**, representing time during which GPUs remain idle.

As the number of micro-batches per mini-batch increases (that is, as each micro-batch gets smaller):
- Pipeline utilization improves
- Bubble time decreases
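
The effect on bubble time can be estimated with a short calculation. This is a rough sketch, assuming every stage takes the same time per micro-batch: with `S` stages and `M` micro-batches, each stage is busy for `M` of the `M + S - 1` time slots, so the idle fraction is `(S - 1) / (M + S - 1)`. The function name is hypothetical.

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Idle fraction of an ideal pipeline with equal-cost stages."""
    total_slots = num_micro_batches + num_stages - 1
    return (num_stages - 1) / total_slots


for m in (1, 2, 4, 8, 32):
    print(f"stages=4, micro-batches={m:2d}, bubble={bubble_fraction(4, m):.3f}")
```

With a single micro-batch the pipeline is idle 75% of the time on 4 stages; with 32 micro-batches the bubble drops below 10%. The estimate ignores the fact that very small micro-batches may use each GPU less efficiently.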

<br>

![Bubble Time Visualization](../images/gpipe_3.gif)


### GPipe with PyTorch

You can use GPipe in PyTorch through `torchgpipe`, a library released by KakaoBrain.

However, there are several important limitations:

- Only models wrapped with `nn.Sequential` are supported.
- Every module’s input and output must be either:
  - a single `torch.Tensor`, or  
  - a tuple of tensors (`Tuple[torch.Tensor, ...]`)
- Complex model architectures (e.g., branching, multiple inputs) are difficult to implement.

As a result, implementing GPipe in practice can be **quite restrictive and challenging**.
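
The script below passes `balance=[4, 3, 3, 4]` to `GPipe`, which states how many consecutive modules of the `nn.Sequential` go to each device. The following sketch shows how such a list maps module indices to stages; `assign_layers` is a hypothetical helper written for illustration and does not reproduce the internals of `torchgpipe`.

```python
def assign_layers(num_layers, balance):
    """Map consecutive layer indices to stages according to `balance`."""
    if sum(balance) != num_layers:
        raise ValueError(
            f"balance sums to {sum(balance)}, but the model has {num_layers} layers"
        )
    stages, start = [], 0
    for count in balance:
        stages.append(list(range(start, start + count)))
        start += count
    return stages


# GPT-2 small as built in the script: 1 preprocessing module,
# 12 transformer blocks, and 1 postprocessing module = 14 modules.
for device, layers in enumerate(assign_layers(14, [4, 3, 3, 4])):
    print(f"device {device}: modules {layers}")
```

The entries of `balance` have to add up to the number of modules in the sequential model, which is why the list changes whenever the model depth changes.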


In [None]:
"""
src/gpipe.py
"""

import torch
import torch.nn as nn
from datasets import load_dataset
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchgpipe import GPipe
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers.models.gpt2.modeling_gpt2 import GPT2Block as GPT2BlockBase


class GPT2Preprocessing(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_dim = config.hidden_size
        self.wte = nn.Embedding(config.vocab_size, self.embed_dim)
        self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)
        self.drop = nn.Dropout(config.embd_pdrop)

    def forward(self, input_ids):
        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1])
        position_ids = torch.arange(
            0, input_shape[-1], dtype=torch.long, device=input_ids.device
        )
        position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
        inputs_embeds = self.wte(input_ids)
        position_embeds = self.wpe(position_ids)
        hidden_states = inputs_embeds + position_embeds
        hidden_states = self.drop(hidden_states)
        return hidden_states


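class GPT2Block(GPT2BlockBase):
    def forward(self, hidden_states):
        # GPipe stages pass a single tensor along, so keep only the hidden
        # states and drop the other outputs of the base block.
        hidden_states = super(GPT2Block, self).forward(
            hidden_states=hidden_states,
        )
        return hidden_states[0]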


class GPT2Postprocessing(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_f = nn.LayerNorm(
            config.hidden_size,
            eps=config.layer_norm_epsilon,
        )
        self.lm_head = nn.Linear(
            config.hidden_size,
            config.vocab_size,
            bias=False,
        )

    def forward(self, hidden_states):
        hidden_states = self.ln_f(hidden_states)
        lm_logits = self.lm_head(hidden_states)
        return lm_logits


def create_model_from_pretrained(model_name):
    pretrained = GPT2LMHeadModel.from_pretrained(model_name)

    # Embedding stage: reuse the pretrained token and position embeddings.
    preprocess = GPT2Preprocessing(pretrained.config)
    preprocess.wte.weight = pretrained.transformer.wte.weight
    preprocess.wpe.weight = pretrained.transformer.wpe.weight

    # Swap each block's class so that forward() takes and returns one tensor.
    blocks = pretrained.transformer.h
    for block in blocks:
        block.__class__ = GPT2Block

    # Output stage: final layer norm followed by the language-model head.
    postprocess = GPT2Postprocessing(pretrained.config)
    postprocess.ln_f.weight = pretrained.transformer.ln_f.weight
    postprocess.ln_f.bias = pretrained.transformer.ln_f.bias
    postprocess.lm_head.weight.data = pretrained.lm_head.weight.data.clone()

    # GPipe expects an nn.Sequential: 1 preprocess + N blocks + 1 postprocess.
    return nn.Sequential(preprocess, *blocks, postprocess)


if __name__ == "__main__":
    world_size = 4

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    model = create_model_from_pretrained(model_name="gpt2")
    model = GPipe(
        model,
        balance=[4, 3, 3, 4],
        devices=[0, 1, 2, 3],
        chunks=world_size,
    )

    datasets = load_dataset("squad").data["train"]["context"]
    datasets = [str(sample) for sample in datasets]
    data_loader = DataLoader(datasets, batch_size=8, num_workers=8)

    optimizer = Adam(model.parameters(), lr=3e-5)
    loss_fn = nn.CrossEntropyLoss()

    for i, data in enumerate(data_loader):
        optimizer.zero_grad()
        tokens = tokenizer(data, return_tensors="pt", truncation=True, padding=True)
        # Inputs enter the pipeline on the first GPU; the logits come out on
        # the last GPU, so the labels are placed there.
        input_ids = tokens.input_ids.to(0)
        labels = tokens.input_ids.to(world_size - 1)

        lm_logits = model(input_ids)
        # Shift by one position so that each token predicts the next token.
        shift_logits = lm_logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        loss = loss_fn(
            shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
        )
        loss.backward()
        optimizer.step()

        if i % 10 == 0:
            print(f"step: {i}, loss: {loss}")
        if i == 300:
            break


In [47]:
# !python -m torch.distributed.launch --nproc_per_node=4 ../src/gpipe.py
!python ../src/gpipe.py

Reusing dataset squad (/home/ubuntu/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)
100%|█████████████████████████████████████████████| 2/2 [00:00<00:00, 55.94it/s]
step: 0, loss: 6.084661483764648
step: 10, loss: 3.2574026584625244
step: 20, loss: 2.796205759048462
step: 30, loss: 2.5538008213043213
step: 40, loss: 2.8463237285614014
step: 50, loss: 2.3466761112213135
step: 60, loss: 2.5407633781433105
step: 70, loss: 2.2434418201446533
step: 80, loss: 2.4792842864990234
step: 90, loss: 2.9400510787963867
step: 100, loss: 2.8163280487060547
step: 110, loss: 2.4787795543670654
step: 120, loss: 2.9588236808776855
step: 130, loss: 2.3893203735351562
step: 140, loss: 2.9571073055267334
step: 150, loss: 3.9219329357147217
step: 160, loss: 3.023880958557129
step: 170, loss: 3.018484592437744
step: 180, loss: 1.6825034618377686
step: 190, loss: 3.5461761951446533
step: 200, loss: 3.6606838703155518
step: 210, loss: 3.527740

## 3. 1F1B Pipelining (PipeDream)

PipeDream, released by Microsoft, performs pipeline parallelism in a different way from GPipe.  
This method is commonly referred to as **1F1B (One Forward One Backward)**.

Unlike GPipe, which runs the forward passes of all micro-batches before starting any backward pass,  
PipeDream alternates between forward and backward passes on each stage.
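
The difference in ordering can be sketched per stage. This is a simplified illustration, not PipeDream's actual scheduler: it assumes a finite stream of micro-batches, a warm-up of `num_stages - stage - 1` forward passes before a stage starts alternating, and a reverse-order backward phase for the GPipe-style schedule. All function names are hypothetical.

```python
def gpipe_order(num_micro_batches):
    """All forward passes first, then all backward passes."""
    forwards = [f"F{i}" for i in range(num_micro_batches)]
    backwards = [f"B{i}" for i in reversed(range(num_micro_batches))]
    return forwards + backwards


def one_f_one_b_order(stage, num_stages, num_micro_batches):
    """Warm-up forwards, then alternate one forward with one backward."""
    warmup = min(num_stages - stage - 1, num_micro_batches)
    order = [f"F{i}" for i in range(warmup)]
    next_forward, next_backward = warmup, 0
    while next_forward < num_micro_batches:
        order.append(f"F{next_forward}")
        order.append(f"B{next_backward}")
        next_forward += 1
        next_backward += 1
    # Drain the backward passes that are still outstanding.
    order += [f"B{i}" for i in range(next_backward, num_micro_batches)]
    return order


def peak_in_flight(order):
    """Largest number of micro-batches whose activations are held at once."""
    in_flight = peak = 0
    for op in order:
        in_flight += 1 if op.startswith("F") else -1
        peak = max(peak, in_flight)
    return peak


print(gpipe_order(6), peak_in_flight(gpipe_order(6)))
for stage in range(4):
    order = one_f_one_b_order(stage, num_stages=4, num_micro_batches=6)
    print(stage, order, peak_in_flight(order))
```

In this sketch the GPipe-style order holds activations for all 6 micro-batches at its peak, while the alternating order on the first of 4 stages holds at most 4, and the last stage holds only 1.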

<img src="../images/1f1b.png" width=600>


### Challenges in 1F1B Pipelining

1F1B pipelining introduces two major challenges:
1. Weight version management  
2. Work partitioning  

<br>

### 1) Weight Version Management

GPipe operates using a single version of weights, but it performs a **pipeline flush** at the end of every mini-batch.

Pipeline flush:
- Waits until every in-flight micro-batch has finished its backward pass
- Updates parameters using accumulated gradients
- While the pipeline drains and refills, some stages have **no forward or backward computation to run**
- Results in reduced GPU utilization

<img src="../images/pipeline_flush.png" width=600>

PipeDream eliminates pipeline flushes by continuously updating parameters.  
As a result, forward and backward passes are always active, significantly improving utilization.

However, this introduces another problem.
If each stage keeps only its latest weights, those weights may be updated between a micro-batch's forward pass and its backward pass.  
The backward pass would then use **newer weights** than the ones that produced the activations, causing inconsistent gradients.


To solve this, PipeDream **maintains multiple versions of weights**.  
This increases memory usage, leading to a trade-off:

- **GPipe**: Memory efficient, Compute inefficient  
- **PipeDream**: Compute efficient, Memory inefficient  
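
The idea of keeping one weight version per in-flight micro-batch (the PipeDream paper calls this weight stashing) can be sketched with a toy stage. This is a minimal illustration, assuming a single scalar weight, the operation `y = weight * x`, and plain SGD; `StashingStage` is a hypothetical class and does not reflect PipeDream's implementation.

```python
class StashingStage:
    """Toy one-parameter stage computing y = weight * x."""

    def __init__(self, weight, lr):
        self.weight = weight
        self.lr = lr
        self.stash = {}  # micro-batch id -> (weight used in forward, input)

    def forward(self, mb_id, x):
        # Remember which weight version produced this micro-batch's output.
        self.stash[mb_id] = (self.weight, x)
        return self.weight * x

    def backward(self, mb_id, grad_out):
        # Use the stashed version, not the possibly newer current weight.
        stashed_weight, x = self.stash.pop(mb_id)
        grad_x = grad_out * stashed_weight
        grad_w = grad_out * x
        self.weight -= self.lr * grad_w  # update immediately, no flush
        return grad_x


stage = StashingStage(weight=2.0, lr=0.1)
stage.forward(0, 3.0)
stage.forward(1, 5.0)
versions_in_memory = len(stage.stash)   # two micro-batches are in flight
stage.backward(0, 1.0)                  # changes stage.weight
grad_x_mb1 = stage.backward(1, 1.0)     # still uses the weight from forward
print(versions_in_memory, grad_x_mb1, stage.weight)
```

The second backward pass uses the weight `2.0` that its forward pass saw, even though the current weight has already changed. The cost is the extra memory: one stashed version per in-flight micro-batch.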

<br>

### 2) Work Partitioning

The second challenge is deciding how to partition the neural network.

Using the same number of layers per GPU is not always optimal.  
The primary goal is to **minimize idle time** across all pipeline stages.

To accomplish this:
- Each partition should have similar execution time
- Parameter size and activation memory must be considered

<img src="../images/pipe_dream.png" width=600>

PipeDream uses profiling and optimization techniques  
to determine an optimal partitioning strategy.
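
A simplified version of the partitioning problem can be sketched as follows. This is an illustration under strong assumptions, not PipeDream's optimizer: it takes hypothetical per-layer execution times as input, ignores communication cost and memory limits, and splits the layers into contiguous stages so that the slowest stage is as fast as possible.

```python
from functools import lru_cache


def partition_layers(layer_times, num_stages):
    """Return (layers per stage, time of the slowest stage)."""
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(end, stages):
        # Best split of layers[:end] into `stages` non-empty contiguous groups.
        if stages == 1:
            return prefix[end], ()
        best_cost, best_cuts = float("inf"), ()
        for cut in range(stages - 1, end):
            left_cost, left_cuts = best(cut, stages - 1)
            cost = max(left_cost, prefix[end] - prefix[cut])
            if cost < best_cost:
                best_cost, best_cuts = cost, left_cuts + (cut,)
        return best_cost, best_cuts

    bottleneck, cuts = best(n, num_stages)
    bounds = (0,) + cuts + (n,)
    balance = [bounds[i + 1] - bounds[i] for i in range(num_stages)]
    return balance, bottleneck


# Hypothetical profile: four cheap layers followed by two expensive ones.
balance, bottleneck = partition_layers([1, 1, 1, 1, 4, 4], num_stages=3)
print(balance, bottleneck)
```

For this profile an even split of two layers per stage would leave the last stage with a cost of 8, while the balanced split `[4, 1, 1]` brings the slowest stage down to 4.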

<br><br>
