Optimization of the standalone permute path #625

Merged
jershi425 merged 5 commits into deepseek-ai:hybrid-ep from Autumn1998:tongliu_new_permute
May 13, 2026
Conversation

@Autumn1998
Collaborator

  1. Add a new permute kernel that improves performance in the overlap case.
  2. Unify the preprocessing step of the fused and non-fused paths.
  3. Fix an out-of-bounds error in the scan kernel.

@Autumn1998 changed the title from "Tongliu new permute" to "Optimization of the standalone permute path" on May 7, 2026
@dongwang4096
Contributor

From what I can tell, num_permuted_tokens is used to size the output tensors, but the standalone permute kernel still writes according to the actual destinations generated by preprocessing. If the provided value is smaller than the real padded token count, this could lead to writes past the allocated output buffer.

Would it make sense to either validate the provided capacity against the actual padded token count in blocking mode, or apply the same clipping/overflow handling there as well?
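To make the first option concrete, here is a minimal host-side sketch of such a capacity check. It is not taken from this PR: the function names, the list-of-lists routing map, and the `pad_multiple` alignment parameter are all hypothetical, and it assumes that "padded token count" means each expert's token count rounded up to a fixed multiple.

```python
def padded_token_count(routing_map, pad_multiple=1):
    """Rows needed if each expert's token count is rounded up to pad_multiple.

    routing_map[t][e] is True when token t is routed to expert e
    (hypothetical layout, chosen only for this illustration).
    """
    if not routing_map:
        return 0
    num_experts = len(routing_map[0])
    total = 0
    for e in range(num_experts):
        count = sum(1 for row in routing_map if row[e])
        # Round the per-expert count up to the next multiple of pad_multiple.
        total += -(-count // pad_multiple) * pad_multiple
    return total


def check_capacity(routing_map, num_permuted_tokens, pad_multiple=1):
    """Raise if the caller-provided capacity cannot hold the padded tokens."""
    required = padded_token_count(routing_map, pad_multiple)
    if num_permuted_tokens < required:
        raise ValueError(
            f"num_permuted_tokens={num_permuted_tokens} is smaller than "
            f"the padded token count {required}"
        )
    return required
```

A check like this needs the real counts on the host, so it would only fit the blocking mode mentioned above.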

@Autumn1998
Collaborator Author

> From what I can tell, num_permuted_tokens is used to size the output tensors, but the standalone permute kernel still writes according to the actual destinations generated by preprocessing. If the provided value is smaller than the real padded token count, this could lead to writes past the allocated output buffer.
>
> Would it make sense to either validate the provided capacity against the actual padded token count in blocking mode, or apply the same clipping/overflow handling there as well?

The truncation happens during the scan. Tokens beyond num_permuted_tokens are marked as 'do not send' in the routing map after preprocessing, so the subsequent dispatch and permute operations communicate based on the trimmed routing information.
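As a rough illustration of that behavior, the sketch below is a sequential Python reference, not the actual scan kernel. The function name, the list-of-lists routing map, the expert-major ordering, and the `-1` marker for "do not send" are assumptions made for the example; padding is ignored.

```python
def scan_with_capacity(routing_map, num_permuted_tokens):
    """Assign output rows with a running count and clip at the capacity.

    routing_map[t][e] is True when token t is routed to expert e.
    Returns (trimmed_map, dest): trimmed_map is the routing map with
    over-capacity entries cleared, and dest[t][e] is the output row for
    that entry, or -1 when the entry is marked "do not send".
    """
    num_tokens = len(routing_map)
    num_experts = len(routing_map[0]) if num_tokens else 0
    trimmed = [[False] * num_experts for _ in range(num_tokens)]
    dest = [[-1] * num_experts for _ in range(num_tokens)]

    next_row = 0  # running (exclusive) prefix sum over routed entries
    for e in range(num_experts):  # expert-major order (assumed)
        for t in range(num_tokens):
            if routing_map[t][e] and next_row < num_permuted_tokens:
                trimmed[t][e] = True
                dest[t][e] = next_row
                next_row += 1
            # Entries past the capacity keep False / -1, so no later
            # stage can produce a write beyond num_permuted_tokens rows.
    return trimmed, dest
```

Because every destination comes from the clipped scan, a permute step that only follows `trimmed` and `dest` stays inside the allocated output in this model.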


4 participants