
DeepSeek-V3 Technical Report #40

Author: @eagle705

Abstract

  • DeepSeek-V3 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token.
  • DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2
  • DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance
  • pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities
  • Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks
  • Link: https://github.com/deepseek-ai/DeepSeek-V3

Image

1. Introduction

  • we implement two additional strategies to further enhance the model capabilities
    • Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing
    • Secondly, DeepSeek-V3 employs a multi-token prediction training objective,
      which we have observed to enhance the overall performance on evaluation benchmarks
  • support the FP8 mixed precision training and implement comprehensive optimizations for the training framework
  • design the DualPipe algorithm for efficient pipeline parallelism,
    which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap
    • This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead
    • develop efficient cross-node all-to-all communication kernels to fully utilize
      InfiniBand (IB) and NVLink bandwidths
    • optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency
  • train DeepSeek-V3 on 14.8T high-quality and diverse tokens.
  • conduct a two-stage context length extension for DeepSeek-V3.
    • In the first stage, the maximum context length is extended to 32K, and
    • in the second stage, it is further extended to 128K.
  • conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3
    • During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.
  • emphasize again the economical training costs of DeepSeek-V3, summarized in
    Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs
    • Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours
    • Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training

Image

Image

Image

Image

2. Architecture

2.1. Basic Architecture

Image
  • For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance

2.1.1. Multi-Head Latent Attention

  • If queries are not cached anyway, why bother projecting them down to a latent space? Couldn't they just be left as plain MHA, the way GQA or MQA leave queries untouched..?
    • For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training
Me:
In the DeepSeek-V3 paper there is the passage
"For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training"
— why does low-rank compression reduce activation memory?

GPT:
Low-rank compression reduces activation memory because it effectively lowers the dimensionality of the query representations that must be stored during training. In standard attention, the queries are high-dimensional and their full activations need to be cached for the backward pass. By compressing them into a lower-dimensional (low-rank) representation, fewer elements are stored, which directly cuts down on the memory required for activations. This reduction is especially significant during training, where storing large intermediate tensors can be a major memory bottleneck.

Me:
But if it has to be up-projected again for the backward pass, doesn't that mean it must be recomputed?

GPT:
That's right. But the up-projection is just a matrix multiplication, so its compute cost is very low. The low-rank compression greatly reduces overall memory usage, and the overhead of recomputing the up-projection in the backward pass when needed is small compared to the memory savings.
  • Why is RoPE applied to the query from the latent vector, while for the key it is applied directly to the hidden state?
Image Image

2.1.2. DeepSeekMoE with Auxiliary-Loss-Free Load Balancing

  • In the end: take the dot product between each routed expert's centroid vector and the token, apply a sigmoid to get an affinity score used as the routing probability, keep the Top-K of those scores as gating values, normalize the selected gating values among themselves, and then, when computing the shared-expert and routed-expert FFNs, combine the routed outputs as a weighted sum with those gating values to produce the token's output h (quite involved; a rough sketch follows below)
    Image

Image
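To pin the gating math down for myself, a minimal NumPy sketch of how I read the routing (all names are mine; Top-K handling is simplified):

```python
import numpy as np

def deepseek_moe_gate(u, centroids, top_k=8):
    """Routing for one token u: sigmoid affinity -> Top-K -> normalized gates.

    u:         (d,) token hidden state
    centroids: (N_r, d) one centroid vector per routed expert
    """
    s = 1.0 / (1.0 + np.exp(-(centroids @ u)))   # affinity scores s_i
    topk_idx = np.argsort(s)[-top_k:]            # keep only the Top-K experts
    g = np.zeros_like(s)
    g[topk_idx] = s[topk_idx]
    return g / g.sum()                           # normalize the selected gates

def moe_layer_output(u, centroids, shared_ffns, routed_ffns, top_k=8):
    """h = u + sum(shared FFNs) + sum_i g_i * routed_FFN_i(u)."""
    g = deepseek_moe_gate(u, centroids, top_k)
    h = u + sum(ffn(u) for ffn in shared_ffns)
    for i, g_i in enumerate(g):
        if g_i > 0.0:
            h = h + g_i * routed_ffns[i](u)
    return h
```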

Auxiliary-Loss-Free Load Balancing

  • The auxiliary-loss-free load balancing adds a per-expert bias term to the affinity score, correcting the routing itself; the bias only affects the Top-K selection and never enters the final gating value
    • During training, at the end of each step the expert loads are monitored: the bias of overloaded experts is lowered and that of under-used experts is raised, and the hyperparameter 𝛾 that controls this is called the bias update speed!! (neat concept; sketched below)
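A rough sketch of my understanding of the bias update (the exact overload criterion and update rule in the paper may differ; `gamma` is the bias update speed):

```python
import numpy as np

def update_routing_bias(bias, tokens_per_expert, gamma=0.001):
    """End-of-step bias adjustment for auxiliary-loss-free load balancing.

    bias:              (N_r,) per-expert bias added to the affinity scores,
                       used ONLY when picking the Top-K, never in the gate value.
    tokens_per_expert: (N_r,) how many tokens each expert received this step.
    """
    overloaded = tokens_per_expert > tokens_per_expert.mean()
    # Overloaded experts get pushed down by gamma, under-used ones pulled up.
    return bias - gamma * overloaded + gamma * (~overloaded)

# During routing, the Top-K indices come from (s + bias),
# but the gate values themselves are still computed from s alone.
```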

Complementary Sequence-Wise Auxiliary Loss

  • The complementary sequence-wise auxiliary loss prevents experts from being over-used within a single sequence (per sequence, not per batch): it encourages balanced expert assignment at the sequence level, and the loss grows when any one expert receives too many of the sequence's tokens, pushing the routing toward an even allocation..?

Image
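A small sketch of how I picture the sequence-wise balance loss (my reconstruction; `alpha` is the balance factor, which is supposed to be very small):

```python
import numpy as np

def sequence_balance_loss(affinity, top_k=8, alpha=1e-4):
    """Balance loss for ONE sequence, growing when a few experts dominate it.

    affinity: (T, N_r) sigmoid affinity scores for the sequence's T tokens.
    """
    T, n_experts = affinity.shape
    p = affinity / affinity.sum(axis=1, keepdims=True)      # normalized probs
    topk_idx = np.argsort(affinity, axis=1)[:, -top_k:]     # per-token Top-K
    selected = np.zeros_like(affinity)
    np.put_along_axis(selected, topk_idx, 1.0, axis=1)
    f = selected.sum(axis=0) * n_experts / (top_k * T)      # load fraction f_i
    P = p.mean(axis=0)                                      # mean prob P_i
    return alpha * float(f @ P)
```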

Node-Limited Routing

  • Limit dispatch to at most M nodes per token to cap the communication overhead
    • uses a restricted routing mechanism to limit communication costs during training
    • ensure that each token will be sent to at most 𝑀 nodes, which are selected according to the sum of the highest (𝐾𝑟/𝑀) affinity scores of the experts distributed on each node
    • Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap
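A sketch of how I imagine the node-limited Top-K selection works (names and tie-breaking are mine):

```python
import numpy as np

def node_limited_topk(s, expert_to_node, top_k=8, max_nodes=4):
    """Pick Top-K experts for one token while touching at most `max_nodes` nodes.

    s:              (N_r,) affinity scores for the token.
    expert_to_node: (N_r,) id of the node hosting each expert.
    A node's score is the sum of its highest (top_k / max_nodes) affinities.
    """
    per_node = top_k // max_nodes
    node_ids = np.unique(expert_to_node)
    node_scores = {
        n: np.sort(s[expert_to_node == n])[-per_node:].sum() for n in node_ids
    }
    kept = sorted(node_scores, key=node_scores.get)[-max_nodes:]
    # Experts on non-selected nodes are masked out before the global Top-K.
    masked = np.where(np.isin(expert_to_node, kept), s, -np.inf)
    return np.argsort(masked)[-top_k:]
```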

No Token-Dropping

  • Due to the effective load balancing strategy, DeepSeek-V3 keeps a good load balance during its full training. Therefore, DeepSeek-V3 does not drop any tokens during training

2.2. Multi-Token Prediction

Image

  • extends the prediction scope to multiple future tokens at each position
    • On the one hand, an MTP objective densifies the training signals and may improve data efficiency.
    • On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens 🤔
  • Different from Gloeckle et al. (2024), which parallelly predicts 𝐷 additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth.
    • The difference from the prior method: instead of predicting 𝐷 tokens in parallel, the additional tokens are predicted sequentially, preserving the causal chain

MTP Modules

  • The input embedding layer and the output head are shared with the main model, so each MTP module sees a consistent representation while also saving parameters
Image Image
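My sketch of what one MTP module at depth k does, based on the figure (the projection, block, and norm objects are placeholders; the embedding and output head are the shared ones):

```python
import numpy as np

def mtp_module(h_prev, shifted_token_ids, emb, out_head, proj_k, block_k, rmsnorm):
    """One MTP module at prediction depth k (my reading of the figure).

    h_prev:            (T, d) hidden states from depth k-1 (main model for k=1).
    shifted_token_ids: (T,)   the input tokens shifted by k positions (t_{i+k}).
    emb, out_head:     SHARED with the main model (the param-sharing trick).
    proj_k:            (2d, d) combining matrix; block_k: this depth's Transformer block.
    """
    e = rmsnorm(emb(shifted_token_ids))                       # shared embedding
    h = np.concatenate([rmsnorm(h_prev), e], axis=-1) @ proj_k
    h = block_k(h)                                            # causal chain preserved
    return h, out_head(h)                                     # h feeds depth k+1
```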

MTP Training Objective

  • Cross-entropy is computed at each prediction depth (as expected)
  • At first glance k = 1 looks like the main module, but the loss has separate MTP terms, so k = 1 is the loss of the first MTP module
  • Why does the index start at 2 + k? 🤔 (see the note below)
Image
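My reconstruction of the objective (indices may be slightly off from the paper). It also suggests an answer to the 2+k question: the depth-k module predicts k tokens further ahead, so its earliest target is the (2+k)-th token, whereas the main model's earliest target is t_2.

```math
\mathcal{L}_{\text{MTP}}^{k} = -\frac{1}{T}\sum_{i=2+k}^{T+1} \log P_i^{k}\!\left[t_i\right],
\qquad
\mathcal{L}_{\text{MTP}} = \frac{\lambda}{D}\sum_{k=1}^{D} \mathcal{L}_{\text{MTP}}^{k}
```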

MTP in Inference

  • Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally
    • At inference time the MTP modules can simply be dropped and the model still works; they mainly exist to train the main model better
  • Additionally, we can also repurpose these MTP modules for speculative decoding to further improve the generation latency.
    • But they can also be repurposed for speculative decoding

3. Infrastructures

3.1. Compute Clusters

  • trained on a cluster equipped with 2048 NVIDIA H800 GPUs.
  • Each node in the H800 cluster contains 8 GPUs connected by NVLink and NVSwitch within nodes.
  • Across different nodes, InfiniBand (IB) interconnects are utilized to facilitate communications

3.2. Training Framework

  • supported by the HAI-LLM framework
  • DeepSeek-V3 applies 16-way Pipeline Parallelism (PP) (Qi et al., 2023a),
  • 64-way Expert Parallelism (EP) (Lepikhin et al., 2021) spanning 8 nodes, and
    • 64-way EP spanning 8 nodes.. that is a lot of experts spread around
  • ZeRO-1 Data Parallelism (DP) (Rajbhandari et al., 2020).
  • Optimizations
    • Firstly, we design the DualPipe algorithm for efficient pipeline parallelism
      • Compared with existing PP methods, DualPipe has fewer pipeline bubbles
      • More importantly, it overlaps the computation and communication phases across forward and backward processes, thereby addressing the challenge of heavy communication overhead introduced by cross-node expert parallelism.
        • Reducing pipeline bubbles is nice, but the key point seems to be overlapping the computation and communication phases of forward and backward, which removes the overhead of the communication-heavy 8-node expert parallelism
    • Secondly, we develop efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths and conserve Streaming Multiprocessors (SMs) dedicated to communication
      • This reads mostly as them showing off their engineering
    • Finally, we meticulously optimize the memory footprint during training, thereby enabling us to train DeepSeek-V3 without using costly Tensor Parallelism (TP)
      • Not using TP at all is somewhat surprising...?!

3.2.1. DualPipe and Computation-Communication Overlap

Image
  • the communication overhead introduced by cross-node expert parallelism results in an inefficient computation-to-communication ratio of approximately 1:1
  • To tackle this challenge, we design an innovative pipeline parallelism algorithm called DualPipe, which not only accelerates model training by effectively overlapping forward and backward computation-communication phases, but also reduces the pipeline bubbles
    • Since the H800's NVLink bandwidth is reduced, cutting pipeline bubbles this way also raises memory cost, so on GPUs with ample NVLink bandwidth it might not even be considered; still, it matters here because it hides the cost of cross-node expert-parallelism communication
  • The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks.
  • To be specific, we divide each chunk into four components:
    • attention
    • all-to-all dispatch
    • MLP
    • all-to-all combine
  • Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b).
    • Makes sense, since attention and the MLP are the heavy parts
  • we have a PP communication component.
  • for a pair of forward and backward chunks, we rearrange these components and manually adjust the ratio of GPU SMs dedicated to communication versus computation
    • They manually tuned the ratio of SMs dedicated to communication vs. computation.. hmm
  • both all-to-all and PP communication can be fully hidden during execution
  • Given the efficient overlapping strategy, the full DualPipe scheduling is illustrated in Figure 5
    • employs a bidirectional pipeline scheduling, which feeds micro-batches from both ends of the pipeline simultaneously and a significant portion of communications can be fully overlapped
      • Feeding from both ends seems to be the point; interesting that each device sees micro-batches arriving from different directions.. what exactly does "two directions" mean here?
    • This overlap also ensures that, as the model further scales up, as long as we maintain a constant
      computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.
Image
  • In addition, even in more general scenarios without a heavy communication burden, DualPipe still exhibits efficiency advantages.
    • In Table 2, we summarize the pipeline bubbles and memory usage across different PP methods. As shown in the table, compared with ZB1P (Qi et al., 2023b) and 1F1B (Harlap et al., 2018), DualPipe significantly reduces the pipeline bubbles while only increasing the peak activation memory by 1/PP times
Image

  • Although DualPipe requires keeping **two copies of the model parameters**, this does not significantly increase the memory consumption since we use a large EP size during training.
    • So this answers the "two directions" question: a second copy of the model is placed on the devices in reverse stage order, so the two pipelines hold the model in opposite directions across the PP ranks. It is still inter-layer parallelism, just mirrored.. and since EP shards the experts so finely, the duplicated parameters supposedly do not add that much memory. Tricky.
  • DualPipe only requires that the pipeline stages and **micro-batches be divisible by 2**, without requiring micro-batches to be divisible by pipeline stages.
    • Apparently keeping the extra copy is what makes this relaxation possible.
  • In addition, for DualPipe, neither the bubbles nor activation memory will increase as the number of micro-batches grows.
    • (The bubbles presumably stay flat because everything cycles within the fixed set of devices; I'm less sure why activation memory does not grow.)

3.2.2. Efficient Implementation of Cross-Node All-to-All Communication

  • customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication
  • The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster.
  • cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s).
    • For reference:
      • NVLink: 900 GB/s of bandwidth on H100, 600 GB/s on A100
      • IB: 400 Gbps on H100, 200 Gbps on A100
  • To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic
  • For each token, when its routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node index on its target nodes
    • For example, data on GPU 0 of the current node is sent to GPU 0 of the target node
    • Once it reaches the target nodes, we will endeavor to ensure that it is instantaneously forwarded via NVLink to specific GPUs that host their target experts, without being blocked by subsequently arriving tokens
  • In this way, communications via IB and NVLink are fully overlapped, and each token can efficiently select an average of 3.2 experts per node without incurring additional overhead from NVLink.
    • So because a token lands on the GPU with the same in-node index as soon as it reaches the target node, the NVLink side only has to forward it on to the expert's GPU and has nothing else to do -> this seems to be the core of the all-to-all communication design (see the sketch at the end of this subsection)
  • This implies that, although DeepSeek-V3 selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts (4 nodes ×3.2 experts/node) while preserving the same communication cost.
    • Presumably because NVLink has 3.2x the bandwidth of IB, that many more experts can be reached at the same IB cost..?!
  • Overall, under such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.
    • In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels
      • During the dispatching process
        • (1) IB sending,
        • (2) IB-to-NVLink forwarding, and
        • (3) NVLink receiving are handled by respective warps
          • On a GPU, a "warp" is the smallest unit of threads that execute together: usually 32 threads in lockstep running the same instruction, which maximizes parallel efficiency
      • during the combining process,
        • (1) NVLink sending,
        • (2) NVLink-to-IB forwarding and accumulation, and
        • (3) IB receiving and accumulation are also handled by dynamically adjusted warps
    • In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
      • PTX is a fundamental part of CUDA -> hand-tuning at the PTX level does happen from time to time
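To make the two-hop path concrete for myself, a toy sketch that only plans the hops (all names, and the placement dict, are made up; the real kernels do this with warps, not Python):

```python
def plan_dispatch(target_experts, my_node, my_gpu_index, expert_placement):
    """Sketch of the two-hop all-to-all dispatch path (IB first, then NVLink).

    expert_placement: dict expert_id -> (node, gpu). Returns the list of hops a
    token's data would take.
    """
    hops = []
    target_nodes = sorted({expert_placement[e][0] for e in target_experts})
    for node in target_nodes:                      # at most 4 nodes per token
        if node != my_node:
            # Hop 1: IB transfer to the GPU with the SAME in-node index on the
            # target node, so IB traffic does not depend on the expert layout.
            hops.append(("IB", my_node, my_gpu_index, node, my_gpu_index))
        # Hop 2: NVLink forwarding inside the target node to each GPU that
        # actually hosts one of this token's experts.
        for e in target_experts:
            n, g = expert_placement[e]
            if n == node and g != my_gpu_index:
                hops.append(("NVLink", node, my_gpu_index, node, g))
    return hops

# Example: experts 3 and 7 live on node 1 (GPUs 2 and 5); we are node 0, GPU 2.
print(plan_dispatch([3, 7], my_node=0, my_gpu_index=2,
                    expert_placement={3: (1, 2), 7: (1, 5)}))
```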

3.2.3. Extremely Memory Saving with Minimal Overhead

  • In order to reduce the memory footprint during training, we employ the following techniques
  • [1] Recomputation of RMSNorm and MLA Up-Projection
    • recompute all RMSNorm operations and MLA up-projections during back-propagation, thereby eliminating the need to persistently store their output activations. With a minor overhead, this strategy significantly reduces memory requirements for storing activations.
      • RMSNorm and the MLA up-projections are recomputed during backprop. The extra compute looks wasteful at first, but these matrix ops are apparently cheap enough that recomputing beats holding all of their output activations in memory
  • [2] Exponential Moving Average in CPU.
    • preserve the Exponential Moving Average (EMA) of the model parameters for early estimation of the model performance after learning rate decay. The EMA parameters are stored in CPU memory and are updated asynchronously after each training step. This method allows us to maintain EMA parameters without incurring additional memory or time overhead
      • So the EMA of the parameters lives on the CPU; updating it there is fine, but is there no cpu<->gpu overhead when it is brought back? (see the sketch after this list)
        • The EMA copy only needs to be moved to the GPU occasionally, at evaluation time, so the CPU -> GPU transfer overhead is negligible compared to the total training time
  • [3] Shared Embedding and Output Head for Multi-Token Prediction.
    • With the DualPipe strategy, we deploy the shallowest layers (including the embedding layer) and deepest layers (including the output head) of the model on the same PP rank.
      • With DualPipe the embedding layer and the output head sit on the same PP rank, which makes MTP's sharing easy. Normally PP is inter-layer parallelism, so layers that far apart would never land on the same GPU; this looks like a deliberate trick, helped by the fact that two model copies are kept anyway..?!
    • This arrangement enables the physical sharing of parameters and gradients, of the shared embedding and output head, between the MTP module and the main model. This physical sharing mechanism further enhances our memory efficiency.
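A minimal PyTorch-style sketch of what I imagine the CPU-side EMA looks like (the decay value is my assumption; the real system overlaps the copy asynchronously with the next step):

```python
import torch

class CPUEMA:
    """Keep an exponential moving average of the parameters in CPU memory."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Shadow copy lives on the CPU, so it costs no extra GPU memory.
        self.shadow = {name: p.detach().cpu().clone()
                       for name, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # Called after each optimizer step; in the paper this runs asynchronously.
        for name, p in model.named_parameters():
            self.shadow[name].mul_(self.decay).add_(p.detach().cpu(),
                                                    alpha=1.0 - self.decay)
```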

3.3. FP8 Training

  • While low-precision training holds great promise, it is often limited by the presence of outliers in activations, weights, and gradients
Image

3.3.1. Mixed Precision Framework

  • most compute-density operations are conducted in FP8, while a few key operations are strategically maintained in their original data formats to balance training efficiency and numerical stability
    • Basically: run only the heavy ops in FP8~
  • Firstly, in order to accelerate model training, the majority of core computation kernels, i.e.,
    GEMM operations, are implemented in FP8 precision
    • These GEMM operations accept FP8 tensors as inputs and produce outputs in BF16 or FP32
    • As depicted in Figure 6, all three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are executed in FP8
    • Additionally, the FP8 Wgrad GEMM allows activations to be stored in FP8 for use in the backward pass. This significantly reduces memory consumption
      • Storing activations in FP8 for the weight backward pass saves memory.. OK, makes sense
  • maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators
    • These components are numerically sensitive and need high accuracy for quality, so they are left at higher precision
    • these high-precision components incur some memory overheads, their impact can be minimized
      through efficient sharding across multiple DP ranks in our distributed training system.
      • What does DP have to do with this..? (presumably the high-precision tensors are sharded across the DP ranks, ZeRO-style, so each rank stores only a slice)

3.3.2. Improved Precision from Quantization and Multiplication

Image

Fine-Grained Quantization.

  • In low-precision training frameworks, overflows and underflows are common challenges due to the limited dynamic range of the FP8 format, which is constrained by its reduced exponent bits.
  • the input distribution is aligned to the representable range of the FP8 format by scaling the maximum absolute value of the input tensor to the maximum representable value of FP8. This method makes low-precision training highly sensitive to activation outliers, which can heavily degrade quantization accuracy.
    • i.e., scaling by the tensor-wide max alone is sensitive to outliers, which is the problem
  • (their method) To solve this, we propose a fine-grained quantization method that applies scaling at a more granular level
    • So they group into tiles and scale per tile?! Rather than a single max value per tensor, the scaling is done at a finer granularity.. applied to the activations and to the weights' input/output channel blocks
  • As illustrated in Figure 7 (a),
    • (1) for activations, we group and scale elements on a 1x128 tile basis (i.e., per token per 128 channels); and
    • (2) for weights, we group and scale elements on a 128x128 block basis (i.e., per 128 input channels per 128 output channels).
  • the quantization process can better accommodate outliers by adapting the scale according to smaller groups of elements
  • One key modification in our method is the introduction of per-group scaling factors along
    the inner dimension of GEMM operations.
  • our fine-grained quantization strategy is highly consistent with the idea of microscaling formats (Rouhani et al., 2023b), while the Tensor Cores of NVIDIA next-generation GPUs (Blackwell series) have announced the support for microscaling formats with smaller quantization granularity (NVIDIA, 2024a). We hope our design can serve as a reference for future work to keep pace with the latest GPU architectures
    • They even pitch this as input for future GPU design.. a nice touch
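A NumPy sketch of the tile/block-wise scaling as I understand it (the rounding here just stands in for the actual cast to FP8 E4M3; the point is where the scales live):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in E4M3

def quantize_activations(x, tile=128):
    """1x128 tile-wise scaling: one scale per token per `tile` channels."""
    t, c = x.shape
    x = x.reshape(t, c // tile, tile)
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-12) / FP8_E4M3_MAX
    return np.round(x / scale), scale   # round() stands in for the FP8 cast

def quantize_weights(w, block=128):
    """128x128 block-wise scaling: one scale per (output block, input block)."""
    o, i = w.shape
    w = w.reshape(o // block, block, i // block, block)
    scale = np.maximum(np.abs(w).max(axis=(1, 3), keepdims=True), 1e-12) / FP8_E4M3_MAX
    return np.round(w / scale), scale

# The GEMM then consumes the quantized values together with their per-group
# scales along the inner dimension, instead of a single per-tensor scale.
```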

Increasing Accumulation Precision

  • FP8 arithmetic is attractive for speed and memory, but its accumulation uses lower precision than FP32 (roughly 14 bits), so when the inner dimension K is large (e.g., K = 4096) the relative error grows; in their preliminary test the maximum relative error was nearly 2%. Some FP8 frameworks use this limited accumulation precision by default, which can cap training accuracy.

  • To address this, DeepSeek-V3 adopts a "promotion to CUDA Cores" strategy: the Tensor Cores perform the MMA (Matrix Multiply-Accumulate) with limited-precision partial accumulation, and at a fixed interval (N_C = 128 elements, i.e., every 4 WGMMAs) the partial result is copied to FP32 registers on the CUDA cores, where accumulation continues at full FP32 precision.

  • In short, FP8's limited accumulation precision can hurt accuracy in large matrix multiplications, so DeepSeek-V3 periodically promotes the intermediate results to FP32, securing the accuracy and stability of the overall computation.

  • Low-precision GEMM operations often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, which is commonly performed in an FP32 precision

  • However, we observe that the accumulation precision of FP8 GEMM on NVIDIA H800 GPUs is limited to retaining around 14 bits (this seems to refer to the mantissa), which is significantly lower than FP32 accumulation precision

    • This problem will become more pronounced when the inner dimension K is large (Wortsman et al., 2023), a typical scenario in large-scale model training where the batch size and model width are increased.
    • Taking GEMM operations of two random matrices with K = 4096 for example, in our preliminary test, the limited accumulation precision in Tensor Cores results in a maximum relative error of nearly 2%. Despite these problems, the limited accumulation precision is still the default option in a few FP8 frameworks (NVIDIA, 2024b), severely constraining the training accuracy.
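A purely numerical sketch of the "promotion" idea for a single output element (float16 stands in for the Tensor Cores' limited ~14-bit accumulation; the 128-element interval matches the N_C = 128 mentioned above):

```python
import numpy as np

def dot_with_promotion(a, b, interval=128):
    """Accumulate a dot product in low precision, promoting to FP32 every `interval` elements."""
    acc_fp32 = np.float32(0.0)
    for start in range(0, len(a), interval):
        partial = np.float16(0.0)                    # limited-precision partial sum
        for x, y in zip(a[start:start + interval], b[start:start + interval]):
            partial = np.float16(partial + np.float16(x) * np.float16(y))
        acc_fp32 += np.float32(partial)              # promotion step
    return acc_fp32
```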

Image

Mantissa over Exponents

  • In contrast to the hybrid FP8 format adopted by prior work (NVIDIA, 2024b; Peng et al., 2023b; Sun et al., 2019b), which uses E4M3 (4-bit exponent and 3-bit mantissa) in Fprop and E5M2 (5-bit exponent and 2-bit mantissa) in Dgrad and Wgrad, we adopt the E4M3 format on all tensors for higher precision.
  • By operating on smaller element groups, our methodology effectively shares exponent bits among these grouped elements, mitigating the impact of the limited dynamic range

Online Quantization (i.e., current scaling computed on the fly, rather than delayed scaling)

  • Delayed quantization is employed in tensor-wise quantization frameworks (NVIDIA, 2024b; Peng et al., 2023b), which maintains a history of the maximum absolute values across prior iterations to infer the current value
  • In order to ensure accurate scales and simplify the framework, we calculate the maximum absolute value online for each 1x128 activation tile or 128x128 weight block
  • Based on it, we derive the scaling factor and then quantize the activation or weight online into the FP8 format.

3.3.3. Low-Precision Storage and Communication

  • further reduce the memory consumption and communication overhead by compressing cached activations and optimizer states into lower-precision formats
    • So cached activations are compressed, and even the optimizer states are kept in a lower-precision format (these are usually stored at full precision..); they trimmed a lot

Low-Precision Optimizer States

  • adopt the BF16 data format instead of FP32 to track the first and second moments in the AdamW
  • However, the master weights (stored by the optimizer) and gradients (used for batch size accumulation) are still retained in FP32 to ensure numerical stability throughout training
  • So the first and second moments are squeezed to BF16, while the critical master weights and gradients stay in FP32; only the trend statistics are compressed

Low-Precision Activation

  • As illustrated in Figure 6, the Wgrad operation is performed in FP8.
  • it is a natural choice to cache activations in FP8 format for the backward pass of the Linear operator
    Image
  • (1) Inputs of the Linear after the attention operator.
    • The inputs to the Linear layers after the attention operator are reused in the backward pass, so their precision matters a lot; storing them in plain FP8 risks larger errors.
    • A customized format (E5M6) is used: more mantissa bits than the standard FP8 formats (5 exponent bits and 6 mantissa bits) to recover precision.
    • Quantization tile transposition: these activations are stored as 1x128 quantization tiles and converted into 128x1 tiles for the backward pass.
    • Why the scaling factors are integral powers of 2:
      • A power-of-two scale is represented exactly in binary, so multiplying and dividing by it amount to exact exponent shifts. This avoids introducing extra rounding error during quantization and dequantization, and is also cheap to compute.
  • For these sensitive operations, the customized format (E5M6) and power-of-two scaling factors reduce additional quantization error and keep training stable
  • (2) Inputs of the SwiGLU operator in MoE

Image

Low-Precision Communication

  • Communication bandwidth is a critical bottleneck in the training of MoE models.
  • To alleviate this challenge, we quantize the activation before MoE up-projections into FP8 and then apply dispatch components, which is compatible with FP8 Fprop in MoE up-projections
  • Like the inputs of the Linear after the attention operator, scaling factors for this activation are integral powers of 2 (why is a power of 2 better..? see the note above: it is exact in binary, so no extra rounding error)
  • A similar strategy is applied to the activation gradient before MoE down-projections.
  • For both the forward and backward combine components, we retain them in BF16 to preserve training precision

3.4. Inference and Deployment

  • deploy DeepSeek-V3 on the H800 cluster, where GPUs within each node are interconnected using NVLink, and all GPUs across the cluster are fully interconnected via IB
  • To simultaneously ensure both the Service-Level Objective (SLO) for online services and high throughput, we employ the following deployment strategy that separates the prefilling and decoding stages.

3.4.1. Prefilling

  • The minimum deployment unit of the prefilling stage consists of 4 nodes with 32 GPUs. The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8)
    • Its small TP size of 4 limits the overhead of TP communication.
    • Interesting that TP is used here (for inference)
  • For the MoE part, we use 32-way Expert Parallelism (EP32), which ensures that each expert processes a sufficiently large batch size, thereby enhancing computational efficiency.
    • For the MoE all-to-all communication
      • first transferring tokens across nodes via IB, and then forwarding among the intra-node GPUs via NVLink. In particular, we use 1-way Tensor Parallelism for the dense MLPs in shallow layers to save TP communication.
        • What does TP1 mean here? Apparently the large attention computation gets TP4 while the shallow dense MLPs use TP1, i.e., no tensor-parallel splitting at all.. interesting
  • To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens
  • To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly
    • So the heavily loaded experts are duplicated as dedicated redundant experts, and which ones are "hot" is re-measured periodically (every 10 minutes) after deployment.. neat
  • The high-load experts are detected based on statistics collected during the online
    deployment and are adjusted periodically (e.g., every 10 minutes)
  • After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads, striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead.
    • So within a node, the heavily loaded experts are rearranged across the node's GPUs to even out the load while keeping cross-node all-to-all traffic unchanged. Easy to say, but how exactly is it done..
  • For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.
    • The prefilling stage uses 32 redundant experts: each GPU keeps its original 8 experts and hosts 1 additional redundant (hot) expert, so every GPU has one frequently used expert and the load stays roughly balanced across GPUs
  • simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.
    • Dispatch: sending each token to the GPUs that host its target experts according to the routing decision, first across nodes via InfiniBand (IB) and then within the node via NVLink to the expert's GPU.
    • Combine: gathering the partial results computed by the expert GPUs and accumulating them back into the token's final representation.
  • exploring a dynamic redundancy strategy for experts, where each GPU hosts
    more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly
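A tiny sketch of the redundant-expert selection as I picture it (just "take the hottest experts from the load statistics"; the actual within-node rearrangement is the hard part and is not shown):

```python
import numpy as np

def pick_redundant_experts(expert_load, n_redundant=32):
    """Return the ids of the most heavily loaded experts to duplicate.

    expert_load: (N_r,) token counts per expert from the online statistics,
    refreshed periodically (e.g., every 10 minutes).
    """
    return np.argsort(expert_load)[-n_redundant:]
```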

3.4.2. Decoding

  • treat the shared expert as a routed one
  • From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs
  • The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320
  • For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts
    • So one expert per GPU.. then what are the 64 GPUs for? -> One expert per GPU is right, and 64 of the GPUs are the ones hosting the redundant and shared experts, which are the frequently selected ones.
      • With EP320 in the decoding stage, each GPU hosts exactly one expert: the 256 routed experts occupy 256 GPUs, and the remaining 64 GPUs host the redundant experts and the shared experts.
      • The shared expert behaves like a heavy-load expert that is always selected, so from the routing point of view each token selects 9 experts (8 routed + 1 shared), while in hardware terms each GPU still computes only one expert.
      • In short: every MoE GPU holds one expert, 64 of them serve the redundant/shared roles, and although a token selects 9 experts, no single GPU ever holds more than one.
  • Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service
  • However, we do not need to rearrange experts since each GPU only hosts one expert.
    • In prefilling, experts were rearranged to reduce node communication, but in decoding each GPU hosts a single expert under this EP setting (a different parallelism setup), so rearranging is unnecessary
  • exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.
  • exploring processing two micro-batches with similar computational workloads
    simultaneously in the decoding stage. Unlike prefilling, attention consumes a larger portion of time in the decoding stage
    • Therefore, we overlap the attention of one micro-batch with the dispatch+MoE+combine of another.
    • Wait, why does attention take a larger share of time in decoding than in prefilling? (presumably because decoding attention has to read the growing KV cache for every generated token, while each expert sees only a tiny batch)
    • In the decoding stage, the batch size per expert is relatively small (usually within 256 tokens), and the bottleneck is memory access rather than computation. Since the MoE part only needs to load the parameters of one expert, the memory access overhead is minimal, so using fewer SMs will not significantly affect the overall performance. Therefore, to avoid impacting the computation speed of the attention part, we can allocate only a small portion of SMs to dispatch+MoE+combine

3.5. Suggestions on Hardware Design

3.5.2. Compute Hardware

Support for Tile- and Block-Wise Quantization.

Support for Online Quantization

Support for Transposed GEMM Operations

4. Pre-Training

4.1. Data Construction

  • 14.8T tokens
  • FIM
  • BBPE 128k vocab

4.2. Hyper-Parameters

Model Hyper-Parameters

  • the number of Transformer layers to 61
  • the hidden dimension to 7168
  • the number of attention heads n_h to 128
  • the per-head dimension d_h to 128
  • The KV compression dimension d_c is set to 512
  • the query compression dimension d_c' is set to 1536 (interesting that the query-side compression dimension is larger)
  • For the decoupled queries and key, we set the per-head dimension d_h^R to 64
  • substitute all FFNs except for the first three layers with MoE layers
    • So the first three layers keep dense FFNs?!
  • Each MoE layer consists of 1 shared expert and 256 routed experts (why 256..?!), where the intermediate hidden dimension of each expert is 2048
  • 8 experts will be activated for each token, and each token will be ensured to be sent to at most 4 nodes
  • The multi-token prediction depth D is set to 1
  • employs additional RMSNorm layers after the compressed latent vectors, and multiplies additional scaling factors at the width bottlenecks
  • DeepSeek-V3 comprises 671B total parameters, of which 37B are activated for each token
