cuda: add launch-bounded tile8 MoE down path by amarrmb · Pull Request #145 · antirez/ds4

amarrmb · 2026-05-14T13:52:39Z

Adds an optional CUDA MoE down path for prefill, guarded by DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN=1.

This is stacked on top of the Blackwell F16 dispatch change from #121. Default behavior is unchanged unless the env flag is set.

Why

Nsight Systems showed MoE down as one of the dominant CUDA kernels during prefill on Blackwell-class devices. The existing tile16 row-span path is fast, but the launch stats show high register and shared-memory pressure.

This PR adds a tile8 row-span variant and uses __launch_bounds__(256, 2) to reduce the register footprint while preserving the row-span structure.

Thor NCU launch stats for the new kernel:

kernel	registers/thread	static shared memory/block	grid size
existing tile16 row-span	165	37,376 bytes	608
new launch-bounded tile8 row-span	128	18,688 bytes	704

speed-bench

Command shape:

./ds4-bench --cuda -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 \
  --step-incr 2048 --gen-tokens 128

Results are averages over the 32 reported context points.

device	variant	avg prefill tok/s	avg gen tok/s
Jetson Thor	default	138.17	9.67
Jetson Thor	`DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN=1`	147.11	9.66
DGX Spark	default	341.26	13.37
DGX Spark	`DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN=1`	354.08	13.36

That is approximately +6.5% average prefill on Thor and +3.8% average prefill on DGX Spark, with generation unchanged within noise.

Tests

Passed on both Jetson Thor and DGX Spark:

make ds4-bench ds4_test
DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN=1 ./ds4_test --metal-kernels --server

Notes

This PR intentionally keeps the new path behind an env flag. It is a prefill-oriented tuning path and should not change defaults without more bake time.

amarrmb added 2 commits May 13, 2026 11:39

cuda: skip ordered f16 matmul on Blackwell

aa47b2c

cuda: add launch-bounded tile8 MoE down path

5e664f6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cuda: add launch-bounded tile8 MoE down path#145

cuda: add launch-bounded tile8 MoE down path#145
amarrmb wants to merge 2 commits into
antirez:mainfrom
amarrmb:cuda-moe-down-tile8-rowspan

amarrmb commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

amarrmb commented May 14, 2026

Why

speed-bench

Tests

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant