Add wide-token MoE prefill tiles (n64/n128 mul_mm_id) #264
Merged
Use 64/128-token expert-major tiles for Q4_K, Q2_K, and IQ2_XXS routed prefill when batch length is aligned. Cap via DS4_METAL_MOE_TILE_MAX (default 128; set 32 to force legacy tiles for A/B). Co-authored-by: Cursor <cursoragent@cursor.com>
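The selection rule in this commit message (widest aligned tile, capped by DS4_METAL_MOE_TILE_MAX) can be sketched roughly as below. This is only an illustration: the names moe_tile_cap_from_env and moe_tile_n are hypothetical, the real helper is ds4_gpu_moe_mm_tile_n() and its body is not shown on this page, the handling of invalid environment values is an assumption, and the quant-type check is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: read the cap from DS4_METAL_MOE_TILE_MAX.
 * Assumption: only 32, 64 and 128 are accepted; anything else
 * (including an unset variable) falls back to the default of 128. */
uint32_t moe_tile_cap_from_env(void) {
    const char *s = getenv("DS4_METAL_MOE_TILE_MAX");
    if (s == NULL || *s == '\0') return 128u;
    long v = strtol(s, NULL, 10);
    if (v == 32 || v == 64 || v == 128) return (uint32_t)v;
    return 128u;
}

/* Hypothetical helper: pick the widest tile that divides n_tokens
 * and does not exceed the cap; otherwise keep the legacy 32 tile. */
uint32_t moe_tile_n(uint32_t n_tokens, uint32_t cap) {
    assert(cap >= 32u);
    if (n_tokens == 0u) return 32u;
    if (cap >= 128u && n_tokens % 128u == 0u) return 128u;
    if (cap >= 64u && n_tokens % 64u == 0u) return 64u;
    return 32u;
}
```

Under this sketch a 2048-token chunk selects 128 by default, 64 with a cap of 64, and 32 with a cap of 32, which is the A/B switch the commit message describes.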
Owner
|
That's wonderful @beamivalice, thanks! Before merging, GPT-5.5 found a potential correctness bug that affects only the Q4_K path: when gate_type is Q4_K, the selected tile size can be 64/128 and is reused for the down projection. However, the PR only adds wide Q4_K kernels for the F32 RHS path; the F16 RHS down path still resolves to the n32 kernel_mul_mm_id_q4_K_f16 pipeline. So the host may dispatch an n32 kernel with a 64/128 tile grid. Could you confirm the issue please? |
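To make the concern concrete, here is a small sketch of the mismatch. It assumes that the host sizes the X grid as ne21 / tile_n (as the PR description states) and that each threadgroup of an nNR1 kernel processes NR1 token rows. The function name rows_covered is hypothetical and not part of the codebase.

```c
#include <assert.h>
#include <stdint.h>

/* Token rows actually processed when the grid is sized for host_tile_n
 * but the bound kernel only handles kernel_nr1 rows per threadgroup.
 * Assumption: one threadgroup per tile on the token axis. */
uint32_t rows_covered(uint32_t ne21, uint32_t host_tile_n, uint32_t kernel_nr1) {
    assert(host_tile_n > 0u && ne21 % host_tile_n == 0u);
    uint32_t threadgroups = ne21 / host_tile_n;
    uint32_t covered = threadgroups * kernel_nr1;
    return covered < ne21 ? covered : ne21;
}
```

With ne21 = 2048, a 128-wide grid bound to the n32 pipeline would process only 512 rows in this model, and a 64-wide grid only 1024, while a matching n128 pipeline processes all 2048.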
Owner
|
That's what I get after sleeping each step to adjust for thermal contamination: |
Contributor
Author
|
I'll check on it. |
…olver to match Q2_K/IQ2_XXS.
Contributor
Author
|
Fix was in, all cleared. @antirez
|
Owner
|
Merged. Great work. |
hexxyan
pushed a commit
to hexxyan/ds4
that referenced
this pull request
May 27, 2026
ivp5
added a commit
to ivp5/ds4
that referenced
this pull request
May 27, 2026
Pulls 4 antirez/main commits:
- 5224654 Add wide-token MoE prefill tiles (n64/n128 mul_mm_id)
- d517726 Fix: Adding wide Q4_K F16 kernels + host pipeline resolver
- 805368e Merge PR antirez#264
- 9ca9013 Harden wide MoE tile dispatch

ds4_metal.m conflict resolution (6 markers):
- 1-4: take antirez (function declaration + body updates for ds4_gpu_encode_mul_mm_id_mapped_tile + ds4_gpu_routed_mm_pipeline*)
- 5-6: spurious conflict (merge tool conflated my softplus_sqrt MTL4 port with antirez's wide-tile MoE encode block). Take HEAD (my port is correct); antirez's wide-tile encode body needs separate wire-up.

Post-merge fixups:
- Add backward-compat wrappers ds4_gpu_routed_mm_pipeline + ds4_gpu_routed_mm_f16_rhs_pipeline (call _for_tile with 32u)
- Add legacy tile_n=32u to 3 ds4_gpu_encode_mul_mm_id_mapped_tile call sites in routed-MoE encode body

Post-merge canary validation: all 7 session MTL4 ports still pass bit-identical (cpy_f32_f32, cpy_f32_f16, cpy_f16_f32, sum_rows_f32_f32, set_rows_f32_i32, mul_mm_id_map0_ne20_8, repeat_f32).

ds4_gpu_moe_mm_tile_n() lives as unused warning; wide-tile selection wire-up at call sites pending follow-up commit.
ivp5
added a commit
to ivp5/ds4
that referenced
this pull request
May 27, 2026
Adds ds4_mtl4_wide_tile_audit_run + ds4_gpu_mtl4_wide_tile_audit_iq2_xxs that routes R tokens to a single expert and tests all 3 width pipelines (n32, n64, n128) at the same R. Reports per-token mismatch + boundary token where failure begins.

CRITICAL FINDING: my earlier "all-widths-bit-identical" canaries used ne21=1 per expert, which kept nr1=1 inside the kernel regardless of NR1. The wide-tile kernel's mc[8] accumulator only covers 32 rows of its NR1 stripe; the remaining NR1-32 rows are never written to shmem and the output-write loop reads uninitialized zeros.

Audit results (M=64 K=256, IQ2_XXS, single expert):
R=1 → n32 PASS, n64 PASS, n128 PASS (no rows past 32 exercised)
R=32 → n32 PASS, n64 PASS, n128 PASS (exactly fills the safe band)
R=33 → n32 PASS, n64 FAIL, n128 FAIL (token 32 = 0.0 vs 128.0 ref)
R=64 → n32 PASS, n64 FAIL 2048/4096, n128 FAIL 2048/4096
R=128 → n32 PASS, n64 FAIL 4096/8192, n128 FAIL 6144/8192

The boundary is exactly at token 32 in every failing case.

Root cause: kernel_mul_mm_id template body in metal/moe.metal hardcodes

simdgroup_float8x8 mc[8];

The 8-tile array covers 32 cols × 16 rows per simdgroup (= 64 × 32 across 4 sgitgs). For NR1=64 each threadgroup needs to cover 64 × 64 cells but the kernel only produces 64 × 32 worth, leaving the upper half garbage.

Antirez's PR antirez#264 inherited the same structure. My MTL4 ports mirror it faithfully, so the bug exists in both classic Metal and MTL4 paths. Production wide-tile dispatches only produce correct output when the DS4 routing distribution keeps per-expert neh1 ≤ 32; above that threshold, half (n64) or three-quarters (n128) of routed-token output is silently corrupted.

Practical impact:
- Bit-identical-across-widths canaries (#735/#739/#740 et al.) were insufficient: they all ran in the safe band.
- Production prefill with large batched routing has been writing zeros for tokens whose intra-expert position is past 32.
- The advertised wide-tile speedup is unrealized as currently implemented (kernel under-utilizes its dispatch).

Fix path (deferred to follow-up):
- mc array must scale: mc[8 * NR1/32]
- mma loop and store loop bounds scale accordingly
- mb[2 * NR1/32] for matching col-tile count
- temp_str row offset becomes (NR1/2)*(sgitg>>1)*NR0
- shmem_bytes must grow to NR0*NR1*sizeof(float) for temp_str
- For NR1=128, may need threadgroup_size=256 for sb storage

CLI: --wide-tile-audit M K R (defaults 64 256 64).

Coverage: 86 unique MTL4 pipelines + 1 audit harness.
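The failure counts in the audit table are consistent with a simple model in which each NR1-row stripe only gets its first 32 rows written. The sketch below encodes that model. The function name predicted_mismatches is hypothetical, and the model is an assumption drawn from this commit message, not from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Predicted number of mismatching outputs for r tokens routed to one
 * expert with m outputs per token, assuming that within every stripe of
 * nr1 token rows only the first 32 rows are written correctly. */
uint32_t predicted_mismatches(uint32_t m, uint32_t r, uint32_t nr1) {
    assert(nr1 >= 32u);
    uint32_t bad_rows = 0u;
    for (uint32_t start = 0u; start < r; start += nr1) {
        /* Rows that fall into this stripe (the last stripe may be partial). */
        uint32_t rows = (r - start < nr1) ? (r - start) : nr1;
        if (rows > 32u) bad_rows += rows - 32u;
    }
    return bad_rows * m;
}
```

With M=64 this model gives 2048 mismatches for R=64 at both n64 and n128, and 4096 (n64) versus 6144 (n128) for R=128, matching the reported table; n32 never leaves the safe band.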
ivp5
added a commit
to ivp5/ds4
that referenced
this pull request
May 27, 2026
tmp/20260528_wide_tile_audit/finding.md captures:
- Audit reproducer + per-R failure table
- Root cause (mc[8] hardcode + temp_str offset 16*(sgitg>>1))
- Provenance (antirez PR antirez#264 inherited the structure)
- Why my bit-identical-across-widths canaries missed it (nr1=1 stays in safe band; output-write loop only reads shmem+0)
- 4-step speedup ladder: fix wide-tile (2-4×) × persistent threadgroups (10-100×) × ICB consolidation (4-10×) × dequant hot-cache (5-10×) = 500× = 2.7 OOM ceiling on overhead-dominated small-batch paths

The audit itself is the 10× higher accuracy result silv asked for. The aberration (boundary token=32) was previously unnoticed by production tests and my own ne21=1 canaries. Fix path is sketched in the doc but deferred: it requires threadgroup_size=256 + sa/sb storage restructuring + mc/mb array scaling.
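As a quick check of the ladder arithmetic, and assuming the four factors compose multiplicatively as the commit message does, the product of the quoted ranges and the order-of-magnitude figure work out as follows:

```latex
\[
2 \times 10 \times 4 \times 5 = 400
\;\le\; \text{combined speedup} \;\le\;
4 \times 100 \times 10 \times 10 = 40{,}000,
\qquad
\log_{10} 500 \approx 2.70
\]
```

The quoted 500× therefore sits near the low end of the range implied by the individual factors.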
cchuter
added a commit
to cchuter/ds4
that referenced
this pull request
May 28, 2026
…stream improvements (#11) * fix: repair unterminated DSML tool calls during long generations During long tool-call generations (2000+ tokens), the model's attention degrades and drops closing DSML tags before reaching max_tokens. This causes finish=error with 'unterminated tool call', aborting the turn. Fix: before returning error, attempt to repair by appending missing closing tags (parameter -> invoke -> tool_calls in nesting order), then re-parse to verify the repair produces valid tool calls. - Add try_repair_dsml() to detect and fix unclosed DSML blocks - Integrate repair at the unterminated tool call error path - Add test_dsml_repair_produces_parseable_calls() with 7 scenarios covering all three DSML styles and multiple truncation patterns - Tests verify structural accuracy: tool name and arguments are correct Results: 0 finish=error across 156+ requests, 100% repair success rate on unterminated tool calls. * fix: repair malformed DSML tool calls in long-context generations Long-context generations produce malformed DSML that parse_generated_message cannot parse, causing "invalid tool call" and breaking the agent loop. Three failure modes observed in stress testing (256K, q4-imatrix): Mode 1 (unterminated): model stops mid-DSML, missing closing tags Mode 2 (malformed closed): outer tags balanced but inner tags broken Mode 3 (hallucinated): tool_calls tags wrap plain reasoning text This commit addresses modes 1 and 2 via try_repair_dsml(): single-pass tag counting (O(n)) followed by appending missing closing tags in reverse nesting order (parameter -> invoke -> tool_calls). Also adds unit tests. Mode 3 is handled by antirez's commit 037ee39 which prevents DSML inside thinking from being detected as executable tool calls. Also adds orphan end tag guard: when toe>tos or ioe>ios or poe>pos, the size_t subtraction would underflow. Return false early. 
Signed-off-by: Rui Gu <jackygurui@gmail.com> * log: stderr message when thinking never closes When parse_generated_message_ex is called with require_thinking_closed=true and the model never outputs </thinking>, the entire generation is treated as reasoning and any DSML inside is silently ignored. This stderr log makes the gate visible for debugging. Refs: antirez#167, commit 037ee39 (Ignore tool calls emitted inside thinking) Signed-off-by: Rui Gu <jackygurui@gmail.com> * fix: try_repair_dsml ignores DSML tags inside thinking try_repair_dsml scanned the full generated text for DSML tags. When the model discusses DSML syntax in its reasoning (e.g. explaining the DSML tags), those text mentions inflate the tag counts, causing false positive repairs (appending unnecessary closing tags). Fix: find the last </thinking> boundary and start counting only from there. DSML mentioned inside reasoning is model text, not executable tool calls — matches the same approach used by parse_generated_message_ex (commit 037ee39). Also updated the hallucinated strip path to copy the thinking section verbatim and only strip from the post-thinking region. Real-world validation: observed this exact false positive in production. The model was explaining how try_repair_dsml works and quoted the DSML tag syntax in its explanation. The parser mistook the quote for a real tool call and the tag counting inflated, causing a failed repair. Metrics from production use (after all fixes in this branch): Tool calls: 169 | Invalid: 2 | Repaired: 38 | Orphan: 3 Only 2 cases remain unrecoverable; the other 41 (38 repaired + 3 orphan) are now gracefully handled instead of causing finish=error. Signed-off-by: Rui Gu <jackygurui@gmail.com> * Fix restored KV cache state boundaries * Add ds4-agent non-interactive mode Support --non-interactive for headless agent use. With --prompt, the agent runs one turn through the normal worker/tool loop and exits. 
Without --prompt, stdin becomes a simple persistent protocol: +DWARFSTAR_WAITING marks readiness, input is collected until a 200 ms quiet window, and later input received while the model is busy is queued with +DWARFSTAR_QUEUED. Keep the implementation on the existing append-only worker path so DSML parsing, tools, compaction, and KV state remain shared with the TUI. Mark accepted worker submissions busy immediately to avoid a race where non-interactive mode could exit before the worker reached prefill. Avoid terminal cursor-control escapes in plain stdout mode. * Fix restored KV cache state boundaries * Fix Metal short prompt prefill * Fix DSML repair edge cases * Handle malformed DSML with model retry * Metal Neural Acceleration initial implementation Squashes Ivan Fioravanti's Metal4/M5 Neural Acceleration scaffold, benchmark tooling, drift diagnostics, eval trace regrading, and initial Tensor/MPP kernel work. * NAX speedups and Metal4 cleanup Squashes Salvatore Sanfilippo's Metal4/NAX speedups and cleanup work: direct-RHS dense Tensor kernels, routed MoE Tensor coverage, full-512 indexer preservation, NAX prefill indexer scores, indexed-attention half shadow cache, and removal of rejected experimental paths. * Store Metal attention compressed KV cache in F16 * Preserve CUDA compressed KV write path * Fix compiler warnings in agent and TUI code * Fix CUDA batched MoE prefill signature * Fix agent session restore history * Add fine-grained prefill progress callback * Fix agent prefill progress bar sizing * agent: add working-directory option * fix wrong benchmark answer the answer was outside of the claimed energy precision. 
the evaluation after the fix (with smooth distribution over the tokens) ``` $ ./ds4-eval --temp 3.0 --min-p 0.25 --nothink ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s ds4: cuda backend initialized for graph diagnostics ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000) ds4-eval: context buffers 479.38 MiB (ctx=16777, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=4196) ds4-eval: 17/92 passed, 1 failed, runtime 00h:34m # state prompt gen total given correct test 1 PASSED 201 733 934 B B GPQA Diamond/recNu3MXkvWUzHZr9 2 PASSED 149 87 236 C C SuperGPQA/001b51d76b4d422988f2c11f104a2c6c 3 PASSED 81 574 655 70 70 AIME2025/aime2025-01 4 PASSED 313 239 552 C C GPQA Diamond/recoiTJPGUmzAkief 5 PASSED 272 177 449 J J SuperGPQA/b7e20eac98764fb0bf30e8366d951daa 6 PASSED 146 1140 1286 468 468 AIME2025/aime2025-16 7 PASSED 156 646 802 B B GPQA Diamond/rec4UqStf9WUVif1f 8 PASSED 127 52 179 E E SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8 9 PASSED 633 4780 5413 588 588 AIME2025/aime2025-02 10 PASSED 182 322 504 B B GPQA Diamond/recgI6tUQ7RLJRWGx 11 PASSED 137 68 205 A A SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09 12 PASSED 165 747 912 16 16 AIME2025/aime2025-03 13 PASSED 149 672 821 A A GPQA Diamond (modified)/recDytVnNYZe2HuUU 14 PASSED 167 68 235 J J SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a 15 FAILED 305 4837 5142 86 82 AIME2025/aime2025-18 16 PASSED 131 671 802 D D GPQA Diamond/recNFJjE5PPTqVJGv 17 PASSED 175 67 242 I I SuperGPQA/7ca71b86327744b78e93185a45bc5cef 18 PASSED 102 1199 1301 117 117 AIME2025/aime2025-04 19 STOPPED 187 80 267 - B GPQA Diamond/rec2UlKqC6RFHdcro 20 PENDING 0 0 0 - E SuperGPQA/d44b94f7749345a39a65f6312bda8764 21 PENDING 0 0 0 - 106 AIME2025/aime2025-19 22 PENDING 0 0 0 - B GPQA Diamond/recv7GsQg3f0fvB1f 23 PENDING 0 0 
0 - B SuperGPQA/febe406f44d74a40b50bb5b7c69d5dc1 ``` * Highlight agent code output * Avoid routed MoE TensorOps SwiGLU instability Full routed-MoE TensorOps enabled the gate, up, and down projections. The regression was isolated to the gate projection: enabling TensorOps for gate is sufficient to send a sensitive AIME continuation into a repeated wrong answer, while TensorOps for up+down remains stable. The kernel-side cause is small but real arithmetic drift in mpp::tensor_ops::matmul2d relative to the legacy simdgroup MMA contraction. A same-input routed-MoE probe showed no address/layout corruption: TensorOps gate was close to legacy, but not bit-identical. An isolated same-tile primitive probe confirmed the source outside DS4 routing and quantization: legacy simdgroup_multiply_accumulate matched a CPU FP32 serial dot-product reference on the tested tile, while TensorOps produced close nonzero FP32 differences. MTLMathModeSafe and the tested TensorOps descriptor variant did not remove the drift. That normally tiny drift matters here because MoE routing has discontinuous top-k expert selection. In the failing path the first observed safe-vs-full routing change was layer 3, token row 11: the selected sixth expert changed from 96 to 50 across a margin of only about 8e-4. Once an expert changes, the transformer state is no longer a smooth local perturbation, and autoregressive decoding can fall into a bad repetition basin. Attempts that preserved the full gate TensorOps speed did not produce a zero-drift or stable fix: forcing the routed intermediate to F32, using the older generic TensorOps routed matmul instead of the expert-major fast layout, changing the TensorOps descriptor mode, and compiling with strict Metal math all left the gate drift or the bad continuation in place. 
Retaining TensorOps for up and down keeps most of the MoE speedup, but gate stays on the legacy path because it feeds the nonlinear silu(gate) * up branch and is the projection that can flip later router decisions. * Fix F16 routed MoE graph dumps * Add GPU power throttling * Simplify agent edit tools * Add runtime power commands * Guard attention output TensorOps full tiles * Disable routed MoE TensorOps Remove the routed-MoE TensorOps/NAX path completely instead of leaving it as a gated-off mode. Semantic evals showed that gate, up, and down TensorOps routed-MoE variants can each move the model into bad continuations, while the full-tile-only expert-major experiment was correctness-interesting but slower than the legacy simdgroup path. Keeping the dead kernels around risks accidental re-enablement without a trustworthy correctness story. The routed MoE grouped matmul now always uses the legacy 32-token expert-major simdgroup kernel. Other Metal4 TensorOps paths, such as the attention-output projection, remain enabled independently. * Apply agent power changes while busy * Improve agent edit tooling * Fix anchored edit tail matching * TUI improvements: prompt growth, colors, commands Preserve generated output while linenoise prompt/history entries grow by scrolling the output region and keeping the output cursor column. Polish streaming colors by highlighting [upto], restoring active text attributes after prompt redraws, and showing throttled power in the status bar without duplicate messages. Keep unknown slash commands editable by beeping and restoring the input instead of printing an error or sending it to the model. * Refine agent tool prompt reminders * Improve agent session management * Stabilize agent session IDs Persist agent session titles in the KV file trailer and derive the saved session ID from title plus created_at so resaves keep a stable identity. 
Preserve and display titles in /list, keep stripped sessions readable, and migrate legacy rendered-text sessions on their next successful save. Also fix stripped-session reloads to accept the retokenized rendered text count, and adjust the /list footer wording to use session IDs. * Fix queued agent status bar fill Keep the status-row reset in linenoise when the agent footer is multiline for queued prompts, so the padded tail of the status bar retains the grey background. * Fix agent queued prompts and manual compaction * Clarify agent edit prompt and reminder logging * Add browser-backed web tools to ds4-agent * Inject session start time in ds4-agent * Stabilize ds4-agent web page extraction * Improve rendered page extraction * Avoid focusing Chrome during web tools * Improve browser page scrolling heuristics * Add timed yes-no prompts for browser approval * Deliver queued prompts after tool results * Stop generation on in-think tool calls * Improve prefill progress callbacks * Stabilize agent prefill label * Terminate displayed bash output with newline * DeepSeek v4 PRO support Add DeepSeek V4 Pro support on top of main. - add Flash/Pro shape selection and Pro Metal inference support - fix shared SwiGLU clamp and indexer QAT behavior for Flash and Pro - update eval/context tooling and imatrix quantization for Pro - tag KV caches/sessions with model ids so Flash and Pro state cannot cross-load - remove misc/PRO.md from tracked files while leaving it as a local scratch log * Support PRO official continuation collection * Document model-specific continuation data * Use parsed tensor span for Metal model views * Validate DS4 compression layout by shape * Document PRO support and model downloads Expose Flash and PRO model IDs as server compatibility aliases for the loaded GGUF. Add download targets for the mixed Flash q2/q4 quant and PRO GGUFs, update README guidance, and add the M3 Ultra PRO benchmark data. 
* Project renamed to DwarfStar, without the "4" * Document experimental PRO support * Fix PRO routed MoE expert mapping * fix(server): evict disk KV entry that fails prefill After an unclean shutdown the on-disk KV checkpoint can be intact (header, hash, token count all valid) but leave Metal in a state where prefill fails. Since the file keeps passing load-time checks it gets reloaded on every request, looping forever until the user manually deletes the cache directory. On prefill failure, if the prefix came from a disk entry, unlink it and invalidate the session. Next request gets a clean cache miss. Closes antirez#251 * Improve KV cache pre-store eviction Evict before writing a new KV checkpoint so the incoming entry cannot be selected as its own victim. Pass incoming-checkpoint context into eviction scoring and devalue compatible continued-prefix waypoints when making room for a longer checkpoint. Inspired by the KV-cache observations in antirez#174/antirez#175 and antirez#176/antirez#177. Co-authored-by: Salvatore Sanfilippo <antirez@gmail.com> Co-authored-by: unsaltedbutter-ai <261676361+unsaltedbutter-ai@users.noreply.github.com> * Harden disk KV cache compatibility checks Reject stale KVC graph payloads before loading them, bump the session payload ABI after recent runtime layout changes, and keep PR antirez#253's prefill-failure recovery as a final safety net. The server now discards the exact disk checkpoint that produced a restored-state prefill failure, resets the disk continuation marker, and logs unlink errors instead of silently ignoring them. Tests cover stale payload ABI rejection. Closes antirez#251 * Add wide-token MoE prefill tiles (n64/n128 mul_mm_id). Use 64/128-token expert-major tiles for Q4_K, Q2_K, and IQ2_XXS routed prefill when batch length is aligned. Cap via DS4_METAL_MOE_TILE_MAX (default 128; set 32 to force legacy tiles for A/B). 
Co-authored-by: Cursor <cursoragent@cursor.com> * Fix - Adding wide Q4_K F16 kernels and updating the host pipeline resolver to match Q2_K/IQ2_XXS. * Harden wide MoE tile dispatch * Revert "Harden wide MoE tile dispatch" This reverts commit 9ca9013. * Revert "Merge PR antirez#264: Add wide-token MoE prefill tiles" This reverts commit 805368e, reversing changes made to e8e8779. * Add local golden inference drift test * Guard MoE Metal tile shape * fix(metal): short-circuit tier-0 multi-tier allocators on Metal/CPU builds Half-B (PR #6) added per-tier graph tensor allocation via ds4_gpu_tensor_alloc_ptr_on(tier, bytes) and its _managed variant. On Metal and CPU builds these are stubs in ds4.c (no CUDA multi-tier to route to). The stubs returned NULL unconditionally — but metal_graph_alloc_raw_cap calls _ptr_on(g->head_tier, ...) and _ptr_on(g->emb_tier, ...) directly to populate Class H / Class E slots (g->output_pre_by_tier, g->logits_by_tier, g->prefill_tokens_by_tier, etc.). On Metal, every one of those calls returned NULL, the giant validation chain at end of alloc_raw_cap failed, and ds4_session_create returned 1 with no error message. Net effect: Mac ds4-server, ds4-cli, ds4-agent, ds4-bench have ALL been broken for fresh-session creation since PR #6 (May 27). The only thing that "worked" was processes already started before PR #6 landed (they were running the pre-Half-B code). Fix: stubs short-circuit tier == 0 to the legacy single-device allocators (ds4_gpu_tensor_alloc / ds4_gpu_tensor_alloc_managed, both implemented in ds4_metal.m). Byte-equivalent to pre-multi-tier Metal — same allocator, same behavior. Multi-tier on Metal/CPU remains unsupported (other tiers still return NULL), which matches the design (no CUDA → no multi-tier). 
Smoke verified on Mac (M3 Ultra, Metal): $ ./ds4-server --metal -m IQ2XXS.gguf --port 8081 --ctx 8192 \ --tokens 4096 --warm-weights ds4-server: listening on http://127.0.0.1:8081 $ curl /v1/chat/completions -d '{"messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":16}' → "We need to answer the question: \"What is 2+2?\" This" * chore: strip internal phase-tracker references from comments Drop "Half-A", "Half-B", "wave 1/2/3", "wave 3a/3b", "mgpu-*" task names, "codex round-N" review references from comments throughout the multi-GPU codebase. These were useful for tracking our internal implementation phases but are noise for any reader — including potential upstream review. Comments now describe the technical reality directly. Build clean on Mac (Metal). Functional smoke test PASS — multi-tier server starts and serves /v1/chat/completions correctly with the fix from 46e123c. No code semantics changed; only comments and string-free comment text. * chore: strip internal review-process references from comments Drop the three remaining "codex plan-review round N finding #M" / "issue #M" tags from ds4.c comments. The actual technical content stays — only the tracking metadata about which internal review round introduced the constraint goes. Note: "Codex" references in ds4_server.c are external — they refer to OpenAI's Codex CLI client which consumes our /v1/responses API. Those stay. 
--------- Signed-off-by: Rui Gu <jackygurui@gmail.com> Co-authored-by: Rui Gu <jackygurui@gmail.com> Co-authored-by: user <user@studio1.l0st.space> Co-authored-by: antirez <antirez@gmail.com> Co-authored-by: Fabio Malpezzi <fabio@MacBook-Pro-di-Fabio.fritz.box> Co-authored-by: Ivan Fioravanti <ivan.fioravanti@gmail.com> Co-authored-by: Giovanni Montana <giovanni.montana@gmail.com> Co-authored-by: alantsev <alantsev@users.noreply.github.com> Co-authored-by: unsaltedbutter-ai <261676361+unsaltedbutter-ai@users.noreply.github.com> Co-authored-by: Theinruj Toranavikrai <my.beam@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hi, I've been experimenting with many ways to speed up prefill, and I've nailed one that can increase prefill throughput by up to +62% for a short 2048-token burst, ~30% sustained up to 32k context, and ~16% at 64k context. The point is that Apple chips do better with wider token tiles than the default of 32. Increasing the tile width reduces the routed-MoE bottleneck from 2742 ms (59%) to 1328 ms (41%).
As a bonus, we also gained a ~4% token-generation speed improvement at 64k, but that could be noise. I'm using an M5 Max 128GB, by the way. I do not know whether this will affect other machines, so you can use DS4_METAL_MOE_TILE_MAX=XX to tune for yours.
Thanks @antirez for this wonderful project.
(AI-assisted summary below)
PR: MoE prefill wide token tiles (n64 / n128 mul_mm_id)

Branch: moe-wide-tiles
Base: e8e8779 (current main)

Summary

Adds wider token tiles (64 and 128) to the batched routed-MoE prefill path (kernel_mul_mm_id_*) for Q4_K, Q2_K, and IQ2_XXS (including F16-RHS down). Aligned prefill chunks (e.g. 2048 tokens) use fewer, larger threadgroups, which cuts per-launch overhead and improves simdgroup MMA utilization on the dominant prefill cost.

Scope: metal/moe.metal (kernel template exports) + ds4_metal.m (tile selection, pipeline dispatch, encode). No flash, gate+up pair, docs-only churn, or bench scripts in the commit.

What changed

- kernel_mul_mm_id_{q2_K,q4_K,iq2_xxs}_{f32,f16}_n64 and _n128.
- ds4_gpu_moe_mm_tile_n() picks 128 when n_tokens % 128 == 0, else 64 when % 64 == 0, for the quant types above.
- ds4_gpu_routed_mm_pipeline_for_tile() / ds4_gpu_routed_mm_f16_rhs_pipeline_for_tile() select the matching pipeline.
- ds4_gpu_encode_mul_mm_id_mapped_tile() passes moe_mm_tile_n into the grid (ne21 / tile_n threadgroups on X).

Decode and tiny batches are unchanged (still mul_mv_id or 32-tile fallback).

A/B for maintainers

- DS4_METAL_MOE_TILE_MAX=32
- DS4_METAL_MOE_TILE_MAX=64
- DS4_METAL_MOE_TILE_MAX=128

Benchmark setup

- Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf
- Prompt: speed-bench/promessi_sposi.txt
- Command: ./ds4-bench --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 --warm-weights

Throughput: before vs after

Before (baseline main @ e8e8779): ds4cursor/speed-bench/deepseek_v4_flash_iq2_xxs_w2q2k_ds4main_m5_max.csv
After (this PR, wide tiles on): speed-bench/q2_m5_max.csv (graph: speed-bench/q2_m5_max_ts.svg)

Generation throughput is essentially flat; the win is prefill-only, as intended.
At 2048 context the model goes from ~421 → ~683 incremental prefill tok/s on this IQ2 variant.

Layer-stage profile @ 2048 (why it is fast)

Profile: DS4_METAL_LAYER_STAGE_PROFILE=1, single 2048-token prefill, all layers summed.
Summarize: python3 speed-bench/summarize_layer_profile.py <log>

MOE_TILE_MAX=32 (old)

Takeaways:
- mul_mm_id.

Logs: speed-bench/q2_profile_tile32_2048.log, speed-bench/q2_profile_wide_tiles_2048.log
Rollup notes: speed-bench/q2_profile_buckets.txt

Why wide tiles help (short)

Prefill MoE uses kernel_mul_mm_id with template parameter NR1 = tokens per threadgroup along the expert-batch dimension. At 2048 tokens, n128 uses 4× fewer threadgroups on that axis than n32 (2048/128 = 16 vs 2048/32 = 64). Each threadgroup pays a fixed cost (barriers, Q2_K/IQ2 dequant into threadgroup memory, expert id setup) before simdgroup MMA; wider tiles amortize that over more rows. Because routed MoE is ~half of profiled prefill GPU time even after the change, total prefill tok/s jumps sharply.

Test plan

- make ds4-bench && make ds4_test on Apple Silicon
- DS4_METAL_MOE_TILE_MAX=32 reproduces ~baseline prefill on 2048
- DS4_METAL_MOE_STAGE_PROFILE=1 on one layer: confirm path=mm_id and lower gate/up/down ms vs tile32

Files

- metal/moe.metal: mul_mm_id templates
- ds4_metal.m
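The threadgroup-count argument in "Why wide tiles help" can be checked with a few lines of arithmetic. This is a minimal sketch assuming one threadgroup per NR1-token tile on the token axis; the function name threadgroups_on_token_axis is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of threadgroups along the token axis when each threadgroup
 * covers nr1 token rows (rounded up for unaligned batches). */
uint32_t threadgroups_on_token_axis(uint32_t n_tokens, uint32_t nr1) {
    assert(nr1 > 0u);
    return (n_tokens + nr1 - 1u) / nr1;
}
```

For a 2048-token chunk this gives 64 threadgroups at n32, 32 at n64, and 16 at n128, so the per-threadgroup fixed cost on that axis is paid a quarter as often with n128 as with n32.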