Optimization of the standalone permute path #625
Autumn1998 commented May 7, 2026
- Add a new permute kernel and optimize performance in the overlap case
- Unify the preprocessing of the fuse path and the non-fuse path
- Fix an out-of-bounds error in the scan kernel
Would it make sense to either validate the provided capacity against the actual padded token count in blocking mode, or to apply the same clipping/overflow handling there as well?
This truncation occurs during the scan: tokens exceeding num_permute_tokens are marked as 'do not send' in the routing map after preprocessing, so the subsequent dispatch and permute operations communicate based on the trimmed routing information.
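To make the truncation concrete, here is a minimal CPU-side sketch in Python. It is not the kernel from this PR: the function name `clip_routing_map`, the token-major scan order, and the single global capacity are all assumptions made for illustration. Only `num_permute_tokens` and the idea of marking overflow entries as 'do not send' come from the discussion above.

```python
from itertools import accumulate


def clip_routing_map(routing_map, num_permute_tokens):
    """Trim a boolean routing map to at most num_permute_tokens routed entries.

    routing_map: list of rows (one per token), each a list of bools (one per expert).
    Returns (trimmed_map, dropped_count). Hypothetical helper, not the PR's API.
    """
    num_experts = len(routing_map[0]) if routing_map else 0

    # Flatten in token-major order (an assumption; the real kernel may scan differently).
    flat = [int(flag) for row in routing_map for flag in row]

    # Inclusive prefix sum: inclusive[i] counts routed entries up to and including i.
    inclusive = list(accumulate(flat))

    trimmed = []
    for t, row in enumerate(routing_map):
        new_row = []
        for e, flag in enumerate(row):
            # Output slot this entry would occupy if it is routed.
            slot = inclusive[t * num_experts + e] - 1
            # Entries whose slot falls beyond the capacity become 'do not send'.
            new_row.append(bool(flag) and slot < num_permute_tokens)
        trimmed.append(new_row)

    total = inclusive[-1] if inclusive else 0
    dropped = max(0, total - num_permute_tokens)
    return trimmed, dropped


if __name__ == "__main__":
    routing = [[True, False], [True, True], [False, True]]
    trimmed, dropped = clip_routing_map(routing, num_permute_tokens=3)
    print(trimmed, dropped)
```

Because the clipping is applied to the routing map itself, any later stage that reads the trimmed map sees a consistent view of which entries are sent. The returned `dropped` count is one way a blocking-mode caller could detect that the provided capacity was too small, if validation is preferred over silent clipping.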