Optimization of the standalone permute path by Autumn1998 · Pull Request #625 · deepseek-ai/DeepEP

Autumn1998 · 2026-05-07T03:48:21Z

Add a new permute kernel and optimize performance in the overlap case.
Unify the preprocessing step of the fuse path and the non-fuse path.
Fix an out-of-bounds error in the scan kernel.

dongwang4096 · 2026-05-11T02:38:40Z

From what I can tell, num_permuted_tokens is used to size the output tensors, but the standalone permute kernel still writes according to the actual destinations generated by preprocessing. If the provided value is smaller than the real padded token count, this could lead to writes past the allocated output buffer.

Would it make sense to either validate the provided capacity against the actual padded token count in blocking mode, or apply the same clipping/overflow handling there as well?
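The validation option suggested above could look roughly like the following host-side check. This is only a sketch: `padded_token_count`, `check_capacity`, `tokens_per_expert`, and `pad_multiple` are hypothetical names for illustration and are not taken from the DeepEP code, and it assumes that "padded" means each expert's token count is rounded up to a multiple of some alignment.

```python
def padded_token_count(tokens_per_expert, pad_multiple):
    """Total output rows if each expert's count is rounded up to pad_multiple.

    Assumption: padding is applied per expert; the real layout may differ.
    """
    return sum(-(-n // pad_multiple) * pad_multiple for n in tokens_per_expert)


def check_capacity(num_permuted_tokens, tokens_per_expert, pad_multiple=1):
    """Raise if the caller-provided capacity cannot hold the padded output."""
    required = padded_token_count(tokens_per_expert, pad_multiple)
    if num_permuted_tokens < required:
        raise ValueError(
            f"num_permuted_tokens={num_permuted_tokens} is smaller than the "
            f"padded token count {required}"
        )
    return required
```

A check like this needs the real counts on the host, which is why it would only fit a blocking mode where those counts have already been synchronized.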

Autumn1998 · 2026-05-12T03:30:41Z

> From what I can tell, num_permuted_tokens is used to size the output tensors, but the standalone permute kernel still writes according to the actual destinations generated by preprocessing. If the provided value is smaller than the real padded token count, this could lead to writes past the allocated output buffer.
>
> Would it make sense to either validate the provided capacity against the actual padded token count in blocking mode, or apply the same clipping/overflow handling there as well?

This truncation happens during the scan: after preprocessing, tokens beyond num_permuted_tokens are marked as 'do not send' in the routing map, so the subsequent dispatch and permute operations communicate based on the trimmed routing information.
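To make the trimming described above concrete, here is a minimal CPU-side sketch in Python. It is not the PR's CUDA scan kernel: `trim_routing_map` is a hypothetical name, the expert-major traversal order is an assumption, and padding is ignored. It only illustrates the idea that a running prefix sum assigns destination slots and any routing entry whose slot would fall outside `num_permuted_tokens` is cleared.

```python
def trim_routing_map(routing_map, num_permuted_tokens):
    """Clear routing entries whose destination slot would exceed capacity.

    routing_map[t][e] is True if token t is routed to expert e.
    Returns (trimmed_map, destinations), where destinations maps
    (token, expert) to an output row strictly below num_permuted_tokens.
    """
    num_tokens = len(routing_map)
    num_experts = len(routing_map[0]) if routing_map else 0
    trimmed = [[False] * num_experts for _ in range(num_tokens)]
    destinations = {}
    next_slot = 0  # running count of kept entries, i.e. the scan value

    # Assumption: slots are assigned in expert-major order.
    for e in range(num_experts):
        for t in range(num_tokens):
            if not routing_map[t][e]:
                continue
            if next_slot < num_permuted_tokens:
                trimmed[t][e] = True
                destinations[(t, e)] = next_slot
                next_slot += 1
            # Otherwise the entry stays False: 'do not send'.
    return trimmed, destinations


routing_map = [
    [True, False],
    [True, True],
    [False, True],
]
trimmed, destinations = trim_routing_map(routing_map, num_permuted_tokens=3)
```

With a capacity of 3, the fourth routed entry (token 2 to expert 1) is cleared, so a permute step driven by `trimmed` never produces a destination outside the allocated output.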

Tong Liu (Engrg-Hardware 1) added 4 commits April 29, 2026 03:45

add new permute

33b29d4

code clean

512cc09

unblock sm count

aa5f03a

add policy and fix

0dbaaf0

Autumn1998 changed the title ~~Tongliu new permute~~ Optimization of the standalone permute path May 7, 2026

Fix DOCA build/path regressions from NIXL integration (deepseek-ai#606)…

892a1c4

… (deepseek-ai#616)

zhongbozhu mentioned this pull request May 12, 2026

[bug] Investigate convergence of performance features with Qwen3.5 VL as proxy model NVIDIA-NeMo/Megatron-Bridge#3801

Open

jershi425 merged commit 17cfb81 into deepseek-ai:hybrid-ep May 13, 2026

aroshanghias-nvd mentioned this pull request May 14, 2026

[bug] Nemotron 3 Super NVFP4 pretraining with HybridEP fails with "Assertion `args.output_tokens_ptr != nullptr' failed in 26.04.00" NVIDIA-NeMo/Megatron-Bridge#3599

Open

Victarry mentioned this pull request May 15, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap NVIDIA/Megatron-LM#4815

Open

69 tasks

jershi425 merged 5 commits into deepseek-ai:hybrid-ep from Autumn1998:tongliu_new_permute

4 participants
