cuda: add launch-bounded tile8 MoE down path #145
Open
amarrmb wants to merge 2 commits into
Adds an optional CUDA MoE down path for prefill, guarded by DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN=1.
This is stacked on top of the Blackwell F16 dispatch change from #121. Default behavior is unchanged unless the env flag is set.
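As a rough illustration of the opt-in behavior described above, here is a minimal host-side sketch of env-flag gating. The helper `env_flag_enabled`, the enum `MoeDownPath`, and `select_moe_down_path` are hypothetical names for this example only; the actual flag parsing and dispatch code in this PR may differ (for instance, it may cache the lookup or accept other values).

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not from the PR): true only when the variable
// is set to exactly "1". Unset or any other value counts as disabled.
static bool env_flag_enabled(const char *name) {
    const char *value = std::getenv(name);
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical dispatch tag for this sketch.
enum class MoeDownPath { Default, Tile8Rowspan };

// Pick the tile8 row-span path only when the user opted in,
// so default behavior is unchanged when the flag is absent.
static MoeDownPath select_moe_down_path() {
    return env_flag_enabled("DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN")
               ? MoeDownPath::Tile8Rowspan
               : MoeDownPath::Default;
}
```

Treating anything other than "1" as disabled is a design choice of this sketch; it keeps the default path in force when the variable is unset or mistyped.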
Why
Nsight Systems showed MoE down as one of the dominant CUDA kernels during prefill on Blackwell-class devices. The existing tile16 row-span path is fast, but the launch stats show high register and shared-memory pressure.
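To put the register-pressure concern in numbers, the sketch below estimates the per-thread register budget implied by a launch configuration of 256 threads per block with 2 resident blocks per SM, the shape requested via __launch_bounds__(256, 2). It assumes a register file of 65,536 32-bit registers per SM and a cap of 255 registers per thread, which is typical of recent NVIDIA architectures but is not stated in this PR. It also ignores register allocation granularity and shared-memory limits, so it is an upper-bound estimate only. The function and constant names are illustrative.

```cpp
#include <cstdint>

// Assumption: 65,536 32-bit registers per SM (not stated in the PR).
constexpr std::uint32_t kRegistersPerSm = 65536;
// Assumption: architectural cap of 255 registers per thread.
constexpr std::uint32_t kMaxRegistersPerThread = 255;

// Upper bound on registers per thread that still lets
// `min_blocks_per_sm` blocks of `max_threads_per_block` threads
// be resident on one SM at the same time.
constexpr std::uint32_t register_budget_per_thread(
    std::uint32_t max_threads_per_block, std::uint32_t min_blocks_per_sm) {
    std::uint32_t budget =
        kRegistersPerSm / (max_threads_per_block * min_blocks_per_sm);
    return budget < kMaxRegistersPerThread ? budget : kMaxRegistersPerThread;
}

// 65536 / (256 * 2) = 128 registers per thread.
static_assert(register_budget_per_thread(256, 2) == 128,
              "two resident 256-thread blocks leave 128 registers per thread");
```

Under these assumptions, a kernel that needs more than 128 registers per thread cannot keep two 256-thread blocks resident on an SM, which is the occupancy trade-off a launch-bounds qualifier asks the compiler to respect.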
This PR adds a tile8 row-span variant and uses __launch_bounds__(256, 2) to reduce the register footprint while preserving the row-span structure.
Thor NCU launch stats for the new kernel:
speed-bench
Command shape:
Results are averages over the 32 reported context points.
With DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN=1, average prefill improves by approximately +6.5% on Thor and +3.8% on DGX Spark, with generation unchanged within noise.
Tests
Passed on both Jetson Thor and DGX Spark:
Notes
This PR intentionally keeps the new path behind an env flag. It is a prefill-oriented tuning path and should not change defaults without more bake time.