
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM #46

@eagle705


Authors

  • Deepak Narayanan‡★, Mohammad Shoeybi†, Jared Casper†, Patrick LeGresley†,
  • Mostofa Patwary†, Vijay Korthikanti†, Dmitri Vainbrand†, Prethvi Kashinkunti†,
  • Julie Bernauer†, Bryan Catanzaro†, Amar Phanishayee∗, Matei Zaharia‡
    • †NVIDIA, ‡Stanford University, ∗Microsoft Research

Abstract

  • propose a novel interleaved pipelining schedule that can improve throughput by 10+% with a memory footprint comparable to existing approaches.

Introduction

  • Larger models need to be split across multiple multi-GPU servers, which leads to two problems: (a) the all-reduce communication required for tensor parallelism needs to go through inter-server links, which are slower than the high-bandwidth NVLink [9] available within a multi-GPU server, and (b) a high degree of model parallelism can create small matrix multiplications (GEMMs), potentially decreasing GPU utilization (so purely from a GEMM-efficiency standpoint, a high tensor-parallel degree is not ideal)
  • pp: A batch is split into smaller microbatches, and execution is pipelined across these microbatches. Layers can be assigned to workers in various ways, and various schedules for the forward and backward passes of inputs can be used. The layer assignment and scheduling strategy results in different performance tradeoffs. Regardless of schedule, to preserve strict optimizer semantics, optimizer steps need to be synchronized across devices, leading to a pipeline flush at the end of every batch, where microbatches are allowed to complete execution (and no new microbatches are injected).
  • Q: How should parallelism techniques be combined to maximize the training throughput of large models given a batch size while retaining strict optimizer semantics? -> PTD-P
  • We also compared to ZeRO [36], and found that our approach outperforms ZeRO-3 by 70% for models with 175 and 530 billion parameters due to less cross-node communication. These models are too large to fit on a multi-GPU server.
  • High training throughput is achieved through: efficient kernel implementations that allowed most of the computation to be compute-bound as opposed to memory-bound, smart partitioning of computation graphs over the devices to reduce the number of bytes sent over network links while also limiting device idle periods, domain-specific communication optimization, and fast hardware (state-of-the-art GPUs and high-bandwidth links between GPUs on the same and different servers).

MODES OF PARALLELISM

  • combine pipeline model parallelism and tensor model parallelism (combination shown in Figure 2) with data parallelism. We call this PTD-P for short.

Data Parallelism

  • each worker has a copy of the full model, the input dataset is sharded, and workers aggregate their gradients periodically to ensure that all workers see a consistent version of the weights. Data parallelism can be used on smaller model shards
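A minimal sketch of the gradient aggregation this describes, assuming torch.distributed is already initialized and dp_group is the data-parallel process group (both hypothetical names); real implementations bucket gradients and overlap the all-reduce with the backward pass:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module, dp_group) -> None:
    """Average gradients across data-parallel replicas so every worker
    applies the same optimizer update and the weights stay consistent."""
    world_size = dist.get_world_size(group=dp_group)
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_group)
            param.grad /= world_size
```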

Pipeline Model Parallelism

  • the layers of a model are sharded across multiple devices. When used on models with the same transformer block repeated, each device can be assigned an equal number of transformer layers. (even pp)
  • A batch is split into smaller microbatches; execution is then pipelined across microbatches. Pipelining schemes need to ensure that inputs see consistent weight versions across forward and backward passes for well-defined synchronous weight update semantics.
  • To retain strict optimizer semantics, we introduce periodic pipeline flushes so that optimizer steps are synchronized across devices. At the start and end of every batch, devices are idle. We call this idle time the pipeline bubble, and want to make it as small as possible.

Default Schedule: GPipe [20]

  • proposes a schedule where the forward passes for all microbatches in a batch are first executed followed by backward passes for all microbatches (shown in Figure 3).
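Restating the paper's bubble analysis for this default schedule (with 𝑚 microbatches, 𝑝 pipeline stages, and per-microbatch forward/backward times 𝑡_f and 𝑡_b):

```latex
t_{pb} = (p - 1)\,(t_f + t_b), \qquad
t_{id} = m\,(t_f + t_b), \qquad
\text{bubble fraction} = \frac{t_{pb}}{t_{id}} = \frac{p - 1}{m}
```

So the bubble is only amortized when 𝑚 is much larger than 𝑝.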

PipeDream-Flush warmup


(The bottom of Figure 4 is the interleaved schedule..!)

  • Instead, we use the PipeDream-Flush schedule [30]. In this schedule, we first enter a warm-up phase where workers perform differing numbers of forward passes as shown in Figure 4 (top)

After Warmup, PipeDream-Flush

  • This schedule limits the number of in-flight microbatches (the number of microbatches for which the backward pass is outstanding and activations need to be maintained) to the depth of the pipeline, instead of the number of microbatches in a batch
  • After the warm-up phase, each worker then enters a steady state, where workers perform one forward pass followed by one backward pass (1F1B for short).
  • Finally, at the end of a batch, we complete backward passes for all remaining in-flight microbatches. The time spent in the bubble is the same for this new schedule, but the number of outstanding forward passes is at most the number of pipeline stages for the PipeDream-Flush schedule
  • As a result, this schedule requires activations to be stashed for 𝑝 or fewer microbatches (compared to 𝑚 microbatches for the GPipe schedule). Consequently, when 𝑚 ≫ 𝑝, PipeDream-Flush is much more memory-efficient than GPipe
  • The PipeDream-Flush schedule noticeably improves memory efficiency when training large language models.
  • Concretely, its effects are the following (a schedule sketch follows this list):
    • Bounded number of in-flight microbatches: the schedule limits the number of in-flight microbatches (those whose backward pass is still outstanding and whose activations must be kept) to the pipeline depth. This contrasts with GPipe, which has to keep activations for every microbatch in the batch.
    • Reduced memory footprint: as a result, PipeDream-Flush only needs to stash activations for at most p microbatches (p = number of pipeline stages). When the number of microbatches per batch m is much larger than p (m >> p), PipeDream-Flush is far more memory-efficient than GPipe.
    • Same pipeline bubble time: the time spent in the bubble under this new schedule is the same as under the default schedule.
  • The schedule works by first running a warm-up phase in which workers perform differing numbers of forward passes, after which each worker enters a steady state where it performs one forward pass followed by one backward pass (1F1B). Finally, at the end of the batch, backward passes are completed for all remaining in-flight microbatches.
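To make the warm-up / steady-state / cooldown structure concrete, here is a minimal sketch that just emits the per-worker order of operations (the function name and the F/B string encoding are hypothetical; a real implementation also interleaves the sends and receives between stages):

```python
def pipedream_flush_schedule(rank: int, p: int, m: int) -> list[str]:
    """Order of operations for pipeline stage `rank` (0-based) with p stages
    and m microbatches per batch, e.g. ["F0", "F1", "B0", "F2", "B1", ...]."""
    assert m >= p, "assumes at least as many microbatches as pipeline stages"
    ops, f, b = [], 0, 0
    for _ in range(p - rank - 1):     # warm-up: earlier stages run more forwards
        ops.append(f"F{f}"); f += 1
    while f < m:                      # steady state: one forward, one backward (1F1B)
        ops.append(f"F{f}"); f += 1
        ops.append(f"B{b}"); b += 1
    while b < m:                      # cooldown: drain the remaining backwards
        ops.append(f"B{b}"); b += 1
    return ops

for r in range(4):
    print(r, pipedream_flush_schedule(rank=r, p=4, m=8))
```

At any point a stage holds activations for at most p in-flight microbatches, matching the description above.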

Schedule with Interleaved Stages

  • Because each device holds non-contiguous layers (e.g., layers 1, 2, 9, 10) rather than one contiguous block, the forward pass is split into finer-grained stages, so the pipeline bubble both arrives earlier and shrinks in size -> throughput improves while the memory footprint stays roughly the same; the trade-off is that extra communication is required, so communication optimization becomes important.
  • To reduce the size of the pipeline bubble, each device can perform computation for multiple subsets of layers (called a model chunk), instead of a single contiguous set of layers. For example, if each device had 4 layers before (i.e., device 1 had layers 1−4, device 2 had layers 5−8, and so on), we could have each device perform computation for two model chunks (each with 2 layers), i.e., device 1 has layers 1,2,9,10; device 2 has layers 3,4,11,12; and so on. With this scheme, each device in the pipeline is assigned multiple pipeline stages (each pipeline stage has less computation compared to before).
  • As before, we can use an “all-forward, all-backward” version of this schedule, but this has a high memory footprint (proportional to 𝑚). Instead, we developed an interleaved schedule that adapts the memory-efficient 1F1B schedule from before. This new schedule is shown in Figure 4, and requires the number of microbatches in a batch to be an integer multiple of the degree of pipeline parallelism (number of devices in the pipeline). For example, with 4 devices, the number of microbatches in a batch must be a multiple of 4.
    • If the pipeline is split across 4 devices, the number of microbatches per batch must also be a multiple of 4..!
  • As shown in Figure 4, the pipeline flush for the same batch size happens sooner in the new schedule
  • This means that the new schedule reduces the bubble time by a factor of 𝑣 (the number of model chunks per device). This reduced pipeline bubble size, however, does not come for free: this schedule requires extra communication. Quantitatively, the amount of communication also increases by a factor of 𝑣. In the next section, we discuss how we can utilize the 8 InfiniBand networking cards in a multi-GPU server (e.g., a DGX A100 node) to reduce the impact of this extra communication.
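A small sketch of the interleaved (round-robin) stage-to-device assignment described above; the helper name is hypothetical, but with 16 layers, 𝑝 = 4 devices and 𝑣 = 2 chunks per device it reproduces the paper's example. With this assignment the bubble fraction shrinks to (1/𝑣)·(𝑝−1)/𝑚, at the cost of 𝑣× more pipeline communication:

```python
def interleaved_layer_assignment(num_layers: int, p: int, v: int) -> dict[int, list[int]]:
    """Assign layers to p pipeline devices with v model chunks per device.
    Pipeline stage s (0-based) lives on device s % p, so each device holds
    v non-contiguous blocks of layers."""
    assert num_layers % (p * v) == 0
    layers_per_stage = num_layers // (p * v)
    assignment: dict[int, list[int]] = {d: [] for d in range(p)}
    for stage in range(p * v):
        device = stage % p
        first = stage * layers_per_stage
        # 1-based layer ids to match the paper's example
        assignment[device].extend(range(first + 1, first + layers_per_stage + 1))
    return assignment

print(interleaved_layer_assignment(num_layers=16, p=4, v=2))
# {0: [1, 2, 9, 10], 1: [3, 4, 11, 12], 2: [5, 6, 13, 14], 3: [7, 8, 15, 16]}
```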

Tensor Model Parallelism

TBD


PERFORMANCE ANALYSIS OF PARALLELIZATION CONFIGURATION

3.4 Microbatch Size

  • The choice of the microbatch size 𝑏 also affects model-training throughput. For example, we see in Figure 7 that per-GPU throughput increases by up to 1.3× with a larger microbatch size on a single GPU. We now want to determine the optimal microbatch size 𝑏 given a parallel configuration (𝑝, 𝑡, 𝑑) and batch size 𝐵.
  • The microbatch size thus affects both the arithmetic intensity of operations as well as the pipeline bubble size (by affecting 𝑚). Figure 8 shows estimated throughput (equation (1) used to estimate processing time) for a GPT model with a billion parameters and (𝑝, 𝑡) = (8, 8). The optimal 𝑏 for both batch sizes is 4.
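As a rough illustration of the kind of estimate referred to here, the batch-time model follows from the bubble analysis above: 𝑚 + 𝑝 − 1 pipeline slots, each costing one forward plus one backward pass at microbatch size 𝑏′, ignoring communication. The per-microbatch timings below are made up:

```python
def estimated_batch_time(b: int, b_prime: int, p: int, t_f, t_b) -> float:
    """Estimated time per batch for one pipeline: (m + p - 1) * (t_f + t_b),
    where m = b / b_prime is the number of microbatches and t_f/t_b map a
    microbatch size to measured forward/backward times per microbatch."""
    m = b // b_prime
    return (m + p - 1) * (t_f(b_prime) + t_b(b_prime))

# Hypothetical measured timings (seconds per microbatch) at each microbatch size.
t_f = {1: 0.010, 2: 0.016, 4: 0.028, 8: 0.052}.get
t_b = {1: 0.021, 2: 0.033, 4: 0.058, 8: 0.108}.get

p, b = 8, 128   # pipeline depth and per-pipeline batch size (made up)
for b_prime in (1, 2, 4, 8):
    t = estimated_batch_time(b, b_prime, p, t_f, t_b)
    print(f"b'={b_prime}: est. {t:.2f} s/batch, {b / t:.1f} sequences/s")
```

Larger microbatches raise arithmetic intensity (less time per sample) but reduce 𝑚 and therefore enlarge the relative bubble, which is why an intermediate value wins.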

3.5 Activation Recomputation

  • Activation recomputation [12, 18, 20, 21] is an optional technique that trades an increase in the number of compute operations performed for a lower memory footprint, by running the forward pass a second time just before the backward pass (and stashing only the input activations for a given pipeline stage, as opposed to the entire set of intermediate activations, which is much larger).
  • Activation recomputation is required to train reasonably large models with pipeline parallelism to keep memory footprint acceptably low. (It is used to save memory, and for sufficiently large models there is effectively no choice but to use it..)
  • The number of activation checkpoints does not impact throughput, but impacts memory footprint. Let 𝐴_input be the size of the input activations of a layer, and 𝐴_intermediate be the size of intermediate activations per layer. If a model stage has 𝑙 layers, and if 𝑐 is the number of checkpoints, the total memory footprint is going to be 𝑐 · 𝐴_input + (𝑙/𝑐) · 𝐴_intermediate. The minimum value of this function is obtained when 𝑐 = √(𝑙 · 𝐴_intermediate / 𝐴_input). In practice, we measure 𝐴_intermediate empirically. For most cases, checkpointing every 1 or 2 transformer layers is optimal (a small numeric sketch follows this list).
  • Figure 17 shows throughput with and without activation recomputation for a GPT model with 145 billion parameters (80 transformer layers, 96 attention heads, hidden size of 12288) using 128 A100 GPUs, (𝑡, 𝑝) = (8, 16), and a range of batch sizes. For small batch sizes, activation recomputation leads to up to 33% lower throughput (in sequences per second) due to the extra forward pass that needs to be executed during the backward pass. However, activation recomputation is needed to support larger batch sizes. Throughput at large batch sizes with activation recomputation is up to 2× higher than the best throughput achieved without activation recomputation (for a smaller batch size) due to a smaller pipeline bubble.
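A small numeric sketch of the checkpoint-count trade-off above (the activation sizes are hypothetical; in practice 𝐴_intermediate is measured):

```python
import math

def activation_footprint(c: float, l: int, a_input: float, a_intermediate: float) -> float:
    # Memory model from the notes above: keep c input checkpoints, and at most
    # l/c layers' worth of intermediate activations are live during recomputation.
    return c * a_input + (l / c) * a_intermediate

l, a_input, a_intermediate = 8, 1.0, 4.0          # sizes in units of A_input (made up)
c_opt = math.sqrt(l * a_intermediate / a_input)   # minimizer of the model above
print(f"c_opt ~ {c_opt:.2f}, footprint ~ {activation_footprint(c_opt, l, a_input, a_intermediate):.2f}")
# c_opt ~ 5.66 for 8 layers, i.e. roughly one checkpoint every 1-2 layers,
# consistent with the observation above.
```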

IMPLEMENTATION

4.1 Communication Optimizations

  • When using pipeline parallelism, we want to send and receive tensors in the forward and backward direction in parallel. Each DGX A100 is equipped with 8 InfiniBand (IB) networking cards. Unfortunately, sends and receives are point-to-point, and only happen between a pair of GPUs on two servers, making it hard to leverage all 8 cards for a single communication call within the pipeline
  • However, we can leverage the fact that we use both tensor model parallelism and pipeline model parallelism to reduce the overhead of cross-node communication.
  • In particular, we note that the output of each transformer layer is replicated (after 𝑔 in the MLP block, see Figure 5a) across the tensor-parallel ranks. As a result, ranks in two consecutive pipeline stages that are performing tensor model parallelism send and receive the exact same set of tensors (Figure 9a).
  • The scatter/gather optimization therefore splits each such tensor into 𝑡 equal chunks on the send side, lets each tensor-parallel rank send only its own chunk over its own InfiniBand card to the corresponding rank on the next node, and reassembles the full tensor with an all-gather over the fast intra-server NVLink on the receive side (Figure 9b); see the sketch below.
  • This optimization helps better leverage the multiple IB cards on the DGX A100 servers, and makes more communication-intensive schedules such as the interleaved one feasible.
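A minimal sketch of the scatter/gather idea, assuming torch.distributed is already initialized, tp_group is the tensor-parallel process group, and dst/src are the matching ranks in the adjacent pipeline stage (all hypothetical names); a real implementation reuses buffers, uses CUDA tensors with NCCL, and overlaps this with compute:

```python
import torch
import torch.distributed as dist

def send_scatter(output: torch.Tensor, dst: int, tp_group) -> None:
    """Instead of every tensor-parallel rank sending the full (replicated)
    activation to the next pipeline stage, each rank sends only its own
    1/t chunk, so each chunk can travel over a different NIC."""
    t = dist.get_world_size(group=tp_group)
    r = dist.get_rank(group=tp_group)
    chunk = output.chunk(t, dim=-1)[r].contiguous()
    dist.send(chunk, dst=dst)                      # cross-node point-to-point

def recv_gather(chunk_shape, dtype, src: int, tp_group) -> torch.Tensor:
    """Receive this rank's chunk, then all-gather over the fast intra-node
    links so every tensor-parallel rank ends up with the full tensor again."""
    t = dist.get_world_size(group=tp_group)
    chunk = torch.empty(chunk_shape, dtype=dtype)
    dist.recv(chunk, src=src)
    parts = [torch.empty_like(chunk) for _ in range(t)]
    dist.all_gather(parts, chunk, group=tp_group)
    return torch.cat(parts, dim=-1)
```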

4.2 Computation Optimization

  • We implemented three model-specific optimizations to the computation graph to attain high performance.
    • First, we changed the data layout in the transformer layer to avoid memory-intensive transpose operations, and to enable the use of strided batched GEMM kernels. -> Because the kernels are strided, even large data poses no problem for the computation, and presumably this also saves memory: the memory-intensive transpose operations are avoided and the strided kernels make the computation more efficient.

      • we changed the data layout from [𝑏,𝑠,𝑎,ℎ] to [𝑠,𝑏,𝑎,ℎ], where 𝑏, 𝑠, 𝑎, and ℎ are batch, sequence, attention-head, and hidden-size dimensions, respectively
      • In other words, the batch dimension is swapped with the sequence dimension.
      • The layout was changed from [b, s, a, h] to [s, b, a, h] precisely to avoid memory-intensive transpose operations and to enable the use of strided batched GEMM kernels; such optimized kernels are usually designed around a specific data layout.
      • With [s, b, a, h], the sequence length (s) is the outermost dimension and the batch (b) comes next, so the data needed for an operation at a given sequence position, across batches and attention heads, is already laid out with exactly the strides the batched GEMM expects. The GPU's compute units (e.g., Tensor Cores) can then load the data efficiently and process the batched operations together, which reduces kernel launch overhead and improves data-load efficiency and overall throughput.
        In short, changing the layout from [b, s, a, h] to [s, b, a, h] reorganizes the data to match the GPU hardware and the requirements of the optimized kernels: unnecessary memory movement (transpose operations) is removed and the high-performance strided batched GEMM kernels can be fully exploited, significantly improving the computational efficiency and throughput of large-scale LM training (see the strided-GEMM sketch after this list).
    • Second, we generated fused kernels for a sequence of element-wise operations (bias + GeLU and bias + dropout + add) using PyTorch JIT [10] (a fusion sketch follows this list).

    • Third, we created two custom kernels to enable the fusion of scale, mask, and softmax (reduction) operations.
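For the first optimization, a small sketch (hypothetical shapes) of why the [s, b, a, h] layout plays well with strided batched GEMMs: the transpose() calls below only change strides, and the batched GEMM behind torch.bmm consumes the strided views directly, so no explicit, memory-intensive transpose copy is needed:

```python
import torch

# Hypothetical sizes: s = sequence length, b = batch, a = attention heads,
# hn = hidden size per attention head; activations stored as [s, b, a, hn].
s, b, a, hn = 128, 4, 8, 64
q = torch.randn(s, b, a, hn)
k = torch.randn(s, b, a, hn)

# Fold (b, a) into a single "batch of GEMMs" dimension: [s, b*a, hn].
q = q.view(s, b * a, hn)
k = k.view(s, b * a, hn)

# transpose() returns a strided view (no data movement); the batched GEMM
# underneath torch.bmm reads these non-contiguous views directly.
scores = torch.bmm(q.transpose(0, 1),                   # [b*a, s, hn]
                   k.transpose(0, 1).transpose(1, 2))   # [b*a, hn, s]
print(scores.shape)                                     # [b*a, s, s]
```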
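For the second optimization, a hedged sketch of the kind of element-wise fusion PyTorch JIT enables (the exact fused functions in the codebase may differ; the constants below are the standard tanh-approximate GeLU):

```python
import torch

@torch.jit.script
def fused_bias_gelu(bias: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Fusing the bias-add and the (tanh-approximate) GeLU into one JIT-compiled
    # kernel avoids materializing the intermediate (bias + y) tensor in memory.
    x = bias + y
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608 * (x + 0.044715 * x * x * x)))

y = torch.randn(16, 1024)
bias = torch.randn(1024)
print(fused_bias_gelu(bias, y).shape)  # torch.Size([16, 1024])
```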

5.2 Comparison to ZeRO-3
