[Public release 26/04] Introducing EPv2: faster EP, and Engram/PP/CP support #605
sphish (Contributor) approved these changes on Apr 23, 2026:
build failed on CUDA 12.8 dependency
With the evolution of hardware, networking, and model architectures, the previous DeepEP V1 had accumulated significant legacy baggage and performance issues. Today, we are excited to introduce DeepEP V2, which includes a complete refactoring of Expert Parallelism — achieving extreme performance with several times fewer SM resources than V1, while supporting significantly larger scale-up and scale-out domains — as well as experimental 0-SM Engram, 0-SM Pipeline Parallelism, and 0-SM Context Parallelism all-gather.
We are also happy to announce that we have switched from the NVSHMEM backend to the more lightweight NCCL Gin backend.
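To make the two communication phases concrete, here is a pure-Python sketch of what EP dispatch and combine do logically: dispatch replicates each token to its top-k experts, and combine weight-sums the per-expert outputs back into token order. The function names and data layout are illustrative only, not the EPv2 API.

```python
# Pure-Python sketch of EP dispatch/combine semantics; names and
# data layout are illustrative assumptions, not the EPv2 API.

def dispatch(topk_ids, num_experts):
    """Route each token index to its top-k experts (one copy per expert)."""
    buckets = {e: [] for e in range(num_experts)}
    for tok_idx, experts in enumerate(topk_ids):
        for e in experts:
            buckets[e].append(tok_idx)
    return buckets

def combine(buckets, expert_outputs, topk_weights, num_tokens):
    """Weighted-sum per-expert outputs back into original token order."""
    out = [0.0] * num_tokens
    for e, tok_ids in buckets.items():
        for pos, tok_idx in enumerate(tok_ids):
            out[tok_idx] += topk_weights[tok_idx][e] * expert_outputs[e][pos]
    return out

# Tiny example: 2 tokens, 2 experts; expert e scales its input by (e + 1).
tokens = [1.0, 2.0]
topk_ids = [[0, 1], [0, 1]]
topk_weights = [{0: 0.5, 1: 0.5}, {0: 0.25, 1: 0.75}]
buckets = dispatch(topk_ids, 2)
expert_outputs = {e: [tokens[t] * (e + 1) for t in ids] for e, ids in buckets.items()}
print(combine(buckets, expert_outputs, topk_weights, len(tokens)))  # [1.5, 3.5]
```

In the real kernels, dispatch and combine are the two network phases the FP8/BF16 figures below refer to: token payloads are sent in FP8 on dispatch, and expert outputs are reduced in BF16 on combine.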
New Features
Notes
Features Still in Progress
Performance
Following V3's configuration, we tested with 8K tokens per batch, 7168 hidden dimensions, top 8 experts, FP8 dispatching, and BF16 combining, and obtained the following results:
Note that the results are logical bandwidth. For example, in the EP 8 x 2 case, the 90 GB/s figure includes local-rank traffic.
Compared with V1, V2 achieves up to 1.3x peak performance while using up to 4x fewer SMs.
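As a back-of-the-envelope check, the dispatch payload implied by this configuration can be computed directly, under the assumption that logical bandwidth means total routed bytes (including copies delivered to local ranks) divided by elapsed time. The elapsed time below is a hypothetical placeholder, not a measured number.

```python
# Dispatch payload arithmetic for the benchmark config in this post.
# Assumption: logical bandwidth = total routed bytes / elapsed time,
# counting local-rank copies. The elapsed time is hypothetical.

tokens = 8192        # tokens per batch ("8K")
hidden = 7168        # hidden dimension
topk = 8             # experts per token
fp8_bytes = 1        # FP8 dispatch payload per element

# Each token is replicated to its top-8 experts during dispatch.
dispatch_bytes = tokens * hidden * topk * fp8_bytes
print(f"dispatch payload per rank: {dispatch_bytes / 2**30:.2f} GiB")

elapsed_s = 5e-3     # hypothetical dispatch time, NOT a measurement
logical_bw = dispatch_bytes / elapsed_s / 1e9
print(f"logical bandwidth: {logical_bw:.1f} GB/s")
```

With these inputs each dispatch moves roughly 0.44 GiB per rank, so logical bandwidth in the tens of GB/s corresponds to per-dispatch latencies of a few milliseconds.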
We omit results for larger EP configurations for the time being, but encourage interested users to benchmark them directly. Based on our internal experience, we expect the kernel to continue saturating hardware bandwidth at scale.
Contributors