Add wide-token MoE prefill tiles (n64/n128 mul_mm_id) #264

Merged: antirez merged 2 commits into antirez:main from beamivalice:moe-wide-tiles on May 27, 2026.

beamivalice commented May 27, 2026

Hi, I've been experimenting with several ways to speed up prefill and found one that increases prefill throughput by up to +62% for a short 2048-token burst, ~30% sustained up to 32k context, and ~16% at 64k context. Apple chips do better with wider token tiles than the default of 32. Widening the tiles cuts the routed-MoE bottleneck from 2742 ms (59%) to 1328 ms (41%).

As a bonus, we also gained ~4% token generation speed at 64k, but that could be noise. I'm using an M5 Max with 128GB, by the way. I don't know whether this will affect other machines, so you can use DS4_METAL_MOE_TILE_MAX=XX to tune for yours.

Thanks @antirez for this wonderful project.


(AI-assisted summary below)

PR: MoE prefill wide token tiles (n64 / n128 mul_mm_id)

Branch: moe-wide-tiles
Base: e8e8779 (current main)

Summary

Adds wider token tiles (64 and 128) to the batched routed-MoE prefill path (kernel_mul_mm_id_*) for Q4_K, Q2_K, and IQ2_XXS (including F16-RHS down). Aligned prefill chunks (e.g. 2048 tokens) use fewer, larger threadgroups, which cuts per-launch overhead and improves simdgroup MMA utilization on the dominant prefill cost.

Scope: metal/moe.metal (kernel template exports) + ds4_metal.m (tile selection, pipeline dispatch, encode). No flash, gate+up pair, docs-only churn, or bench scripts in the commit.

What changed

  • New Metal instantiations: kernel_mul_mm_id_{q2_K,q4_K,iq2_xxs}_{f32,f16}_n64 and _n128.
  • ds4_gpu_moe_mm_tile_n() picks 128 when n_tokens % 128 == 0, else 64 when % 64 == 0, for the quant types above.
  • ds4_gpu_routed_mm_pipeline_for_tile() / ds4_gpu_routed_mm_f16_rhs_pipeline_for_tile() select the matching pipeline.
  • ds4_gpu_encode_mul_mm_id_mapped_tile() passes moe_mm_tile_n into the grid (ne21 / tile_n threadgroups on X).

Decode and tiny batches are unchanged (still mul_mv_id or 32-tile fallback).

A/B for maintainers

| Env | Effect |
| --- | --- |
| (default) | Wide tiles up to 128 when batch length is aligned |
| DS4_METAL_MOE_TILE_MAX=32 | Force legacy 32-token MoE tiles (pre-PR behavior) |
| DS4_METAL_MOE_TILE_MAX=64 | Cap at n64 only |
| DS4_METAL_MOE_TILE_MAX=128 | 128 for typical 2048-token steps |
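
The selection rule and the environment cap described above can be modeled in a few lines. This is a Python sketch of the logic only, not the code in ds4_metal.m: the function name mirrors ds4_gpu_moe_mm_tile_n(), but the integer parsing of the variable, the `env` parameter, and the string quant-type names are assumptions made for illustration.

```python
import os

# Quant types for which the PR adds n64/n128 kernels.
WIDE_TILE_TYPES = {"Q4_K", "Q2_K", "IQ2_XXS"}


def moe_mm_tile_n(n_tokens, quant_type, env=None):
    """Pick the token-tile width for one batched routed-MoE prefill step."""
    env = os.environ if env is None else env
    # Assumption: the cap is read as a plain integer and defaults to 128.
    tile_max = int(env.get("DS4_METAL_MOE_TILE_MAX", "128"))
    if quant_type in WIDE_TILE_TYPES:
        # Prefer the widest tile that divides the batch and respects the cap.
        for tile in (128, 64):
            if tile <= tile_max and n_tokens % tile == 0:
                return tile
    return 32  # legacy tile width


for cap in ("128", "64", "32"):
    tile = moe_mm_tile_n(2048, "IQ2_XXS", env={"DS4_METAL_MOE_TILE_MAX": cap})
    print(f"cap={cap:>3}  tile={tile}")
```

A batch of 192 tokens would get tile 64 under this rule (it is a multiple of 64 but not of 128), and any unaligned batch falls back to 32.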

Benchmark setup

  • Hardware: Apple M5 Max, 128 GiB RAM
  • Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf
  • Prompt: speed-bench/promessi_sposi.txt
  • Bench: ./ds4-bench --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 --warm-weights
  • Metric: incremental prefill tok/s at each context frontier (2048 new tokens per step)

Throughput: before vs after

Before (baseline main @ e8e8779):
ds4cursor/speed-bench/deepseek_v4_flash_iq2_xxs_w2q2k_ds4main_m5_max.csv

After (this PR, wide tiles on):
speed-bench/q2_m5_max.csv — graph: speed-bench/q2_m5_max_ts.svg

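| ctx | Baseline prefill (tok/s) | Wide tiles prefill (tok/s) | Δ prefill | Baseline gen (tok/s) | Wide gen (tok/s) |
| --- | --- | --- | --- | --- | --- |
| 2048 | 420.6 | 683.0 | +62% | 37.7 | 37.8 |
| 4096 | 379.7 | 592.8 | +56% | 31.4 | 31.9 |
| 8192 | 401.2 | 572.2 | +43% | 31.4 | 31.5 |
| 16384 | 383.9 | 485.0 | +26% | 30.4 | 30.3 |
| 32768 | 322.0 | 415.7 | +29% | 28.4 | 26.6 |
| 65536 | 281.7 | 326.4 | +16% | 26.2 | 25.0 |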

Generation throughput is essentially flat; the win is prefill-only, as intended.

At 2048 context the model goes from ~421 → ~683 incremental prefill tok/s on this IQ2 variant.
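
For anyone re-running the sweep, the Δ prefill column is just the ratio of the two throughput columns. A small sketch that recomputes it from the figures quoted in this PR (the helper name `gain_percent` is made up for illustration):

```python
# (ctx, baseline prefill tok/s, wide-tile prefill tok/s) as reported in this PR.
ROWS = [
    (2048, 420.6, 683.0),
    (4096, 379.7, 592.8),
    (8192, 401.2, 572.2),
    (16384, 383.9, 485.0),
    (32768, 322.0, 415.7),
    (65536, 281.7, 326.4),
]


def gain_percent(baseline, wide):
    """Relative throughput change of `wide` over `baseline`, in percent."""
    return (wide / baseline - 1.0) * 100.0


for ctx, base, wide in ROWS:
    print(f"{ctx:>6}  {gain_percent(base, wide):+.1f}%")
```

Swapping in the rows of another machine's CSV gives a directly comparable column.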

Layer-stage profile @ 2048 (why it is fast)

Profile: DS4_METAL_LAYER_STAGE_PROFILE=1, single 2048-token prefill, all layers summed.
Summarize: python3 speed-bench/summarize_layer_profile.py <log>

| Bucket | MOE_TILE_MAX=32 (old) | Wide tiles (PR) |
| --- | --- | --- |
| Routed MoE | 2742 ms (59%) | 1328 ms (41%) |
| Flash attention | 493 ms (11%) | 495 ms (15%) |
| Attn output proj | 336 ms (7%) | 334 ms (10%) |
| Q path | 284 ms (6%) | 279 ms (9%) |
| Other + shared + compressor | 776 ms (17%) | 774 ms (24%) |
| Total profiled GPU | 4631 ms | 3215 ms (−31%) |

Takeaways:

  1. Routed MoE absolute time ~halves (−51%); its share of GPU drops from ~59% → 41% because other stages are unchanged while MoE shrinks.
  2. Flash / Q / attn-out do not move much — the optimization is correctly scoped to batched mul_mm_id.
  3. MoE sub-stages (all layers): gate+up+down matmul encodes benefit together (three passes per layer all use the same tile width).
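
The first takeaway is plain arithmetic on the profile table: the MoE share falls because the numerator shrinks while everything else stays put. A minimal sketch using only the reported totals (variable names are illustrative):

```python
# Reported milliseconds from the layer-stage profile at 2048 tokens.
moe_old, moe_new = 2742.0, 1328.0
total_old, total_new = 4631.0, 3215.0

share_old = moe_old / total_old * 100.0             # MoE share with 32-token tiles
share_new = moe_new / total_new * 100.0             # MoE share with wide tiles
moe_change = (moe_new / moe_old - 1.0) * 100.0      # change in MoE time
total_change = (total_new / total_old - 1.0) * 100.0  # change in total GPU time

# Everything that is not routed MoE, before and after.
other_old = total_old - moe_old
other_new = total_new - moe_new

print(f"MoE share: {share_old:.1f}% -> {share_new:.1f}%")
print(f"MoE time: {moe_change:+.1f}%, total: {total_change:+.1f}%")
print(f"non-MoE time: {other_old:.0f} ms -> {other_new:.0f} ms")
```

The non-MoE remainder is nearly identical in both runs, which is what the second takeaway relies on.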

Logs: speed-bench/q2_profile_tile32_2048.log, speed-bench/q2_profile_wide_tiles_2048.log
Rollup notes: speed-bench/q2_profile_buckets.txt

Why wide tiles help (short)

Prefill MoE uses kernel_mul_mm_id with template parameter NR1 = tokens per threadgroup along the expert-batch dimension. At 2048 tokens, n128 uses 4× fewer threadgroups on that axis than n32 (2048/128 = 16 vs 2048/32 = 64). Each threadgroup pays a fixed cost (barriers, Q2_K/IQ2 dequant into threadgroup memory, expert id setup) before simdgroup MMA; wider tiles amortize that over more rows. Because routed MoE was ~59% of profiled prefill GPU time before the change (and is still ~41% after it), total prefill tok/s jumps sharply.
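
The threadgroup counts behind that argument are easy to check. The sketch below assumes the grid on the token axis is the token count divided by the tile width, rounded up; the rounding is an assumption, since the PR only states ne21 / tile_n for aligned batches.

```python
def threadgroups_on_token_axis(n_tokens, tile_n):
    """Threadgroups needed along the token axis (ceiling division)."""
    return -(-n_tokens // tile_n)


N_TOKENS = 2048
counts = {tile: threadgroups_on_token_axis(N_TOKENS, tile) for tile in (32, 64, 128)}

for tile, n in counts.items():
    # Each threadgroup pays its fixed setup cost once, so fewer threadgroups
    # means that cost is spread over more token rows.
    print(f"tile={tile:>3}  threadgroups={n:>2}  rows per fixed cost={tile}")
```

For a 2048-token step this gives 64, 32 and 16 threadgroups for tiles of 32, 64 and 128.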

Test plan

  • make ds4-bench && make ds4_test on Apple Silicon
  • IQ2 GGUF: prefill sweep matches table above within run-to-run noise
  • DS4_METAL_MOE_TILE_MAX=32 reproduces ~baseline prefill on 2048
  • Short generation smoke (no garbled output) on IQ2 + q4half if available
  • Optional: DS4_METAL_MOE_STAGE_PROFILE=1 on one layer — confirm path=mm_id and lower gate/up/down ms vs tile32

Files

| File | Change |
| --- | --- |
| metal/moe.metal | Export n64/n128 mul_mm_id templates |
| ds4_metal.m | Tile selection, pipelines, encode dispatch |

Use 64/128-token expert-major tiles for Q4_K, Q2_K, and IQ2_XXS routed
prefill when batch length is aligned. Cap via DS4_METAL_MOE_TILE_MAX
(default 128; set 32 to force legacy tiles for A/B).

Co-authored-by: Cursor <cursoragent@cursor.com>

antirez commented May 27, 2026

That's wonderful @beamivalice, thanks! Before merging, GPT-5.5 found a potential correctness bug that affects only the Q4_K path: when gate_type is Q4_K, the selected tile size can be 64/128 and is reused for the down projection. However, the PR only adds wide Q4_K kernels for the F32 RHS path; the F16 RHS down path still resolves to the n32 kernel_mul_mm_id_q4_K_f16 pipeline. So the host may dispatch an n32 kernel with a 64/128 tile grid. Could you confirm the issue please?
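
To see why such a mismatch matters, here is a toy coverage model. It is not the kernel's indexing, which is not shown in this thread: it assumes the host sizes the grid from the selected tile width, while each threadgroup of the kernel actually bound processes a contiguous stripe of its own compiled tile width.

```python
def rows_covered(n_tokens, grid_tile, kernel_tile):
    """Token rows that get computed when grid and kernel tile widths differ.

    Assumption: threadgroup i handles rows [i*kernel_tile, (i+1)*kernel_tile),
    and the host launches n_tokens // grid_tile threadgroups.
    """
    n_threadgroups = n_tokens // grid_tile
    return min(n_tokens, n_threadgroups * kernel_tile)


for grid_tile in (32, 64, 128):
    covered = rows_covered(2048, grid_tile, kernel_tile=32)
    print(f"grid tile {grid_tile:>3} + n32 kernel: {covered}/2048 rows computed")
```

Under these assumptions an n32 kernel launched on a tile-128 grid would compute only a quarter of the rows, which is why the pipeline and the grid have to be resolved from the same tile width.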


antirez commented May 27, 2026

That's what I get after sleeping each step to adjust for thermal contamination:

  ┌───────┬────────────────┬─────────────────┬────────┬────────────┐
  │   ctx │ tile32 prefill │ tile128 prefill │   gain │ gen impact │
  ├───────┼────────────────┼─────────────────┼────────┼────────────┤
  │  2048 │         469.36 │          683.21 │ +45.6% │       flat │
  │  8192 │         441.10 │          630.99 │ +43.0% │       flat │
  │ 14336 │         407.12 │          591.62 │ +45.3% │       flat │
  │ 20480 │         391.65 │          552.18 │ +41.0% │       flat │
  │ 26624 │         383.15 │          517.45 │ +35.1% │       flat │
  │ 32768 │         373.41 │          492.89 │ +32.0% │       flat │
  └───────┴────────────────┴─────────────────┴────────┴────────────┘

@beamivalice (Contributor, Author)

I'll check on it.

@beamivalice (Contributor, Author)

The fix is in, all clear, @antirez.

  • Added wide Q4_K F16 kernels and updated the host pipeline resolver to match Q2_K/IQ2_XXS.

@antirez antirez merged commit 805368e into antirez:main May 27, 2026

antirez commented May 27, 2026

Merged. Great work.

hexxyan pushed a commit to hexxyan/ds4 that referenced this pull request May 27, 2026
ivp5 added a commit to ivp5/ds4 that referenced this pull request May 27, 2026
Pulls 4 antirez/main commits:
- 5224654 Add wide-token MoE prefill tiles (n64/n128 mul_mm_id)
- d517726 Fix: Adding wide Q4_K F16 kernels + host pipeline resolver
- 805368e Merge PR antirez#264
- 9ca9013 Harden wide MoE tile dispatch

ds4_metal.m conflict resolution (6 markers):
- 1-4: take antirez (function declaration + body updates for
  ds4_gpu_encode_mul_mm_id_mapped_tile + ds4_gpu_routed_mm_pipeline*)
- 5-6: spurious conflict (merge tool conflated my softplus_sqrt MTL4
  port with antirez's wide-tile MoE encode block). Take HEAD (my port
  is correct); antirez's wide-tile encode body needs separate wire-up.

Post-merge fixups:
- Add backward-compat wrappers ds4_gpu_routed_mm_pipeline +
  ds4_gpu_routed_mm_f16_rhs_pipeline (call _for_tile with 32u)
- Add legacy tile_n=32u to 3 ds4_gpu_encode_mul_mm_id_mapped_tile
  call sites in routed-MoE encode body

Post-merge canary validation: all 7 session MTL4 ports still pass
bit-identical (cpy_f32_f32, cpy_f32_f16, cpy_f16_f32, sum_rows_f32_f32,
set_rows_f32_i32, mul_mm_id_map0_ne20_8, repeat_f32).

ds4_gpu_moe_mm_tile_n() currently triggers an unused-function warning; the
wide-tile selection wire-up at call sites is pending a follow-up commit.
ivp5 added a commit to ivp5/ds4 that referenced this pull request May 27, 2026
Adds ds4_mtl4_wide_tile_audit_run + ds4_gpu_mtl4_wide_tile_audit_iq2_xxs
that routes R tokens to a single expert and tests all 3 width pipelines
(n32, n64, n128) at the same R. Reports per-token mismatch + boundary
token where failure begins.

CRITICAL FINDING: my earlier "all-widths-bit-identical" canaries used
ne21=1 per expert, which kept nr1=1 inside the kernel regardless of
NR1. The wide-tile kernel's mc[8] accumulator only covers 32 rows of
its NR1 stripe — the remaining NR1-32 rows are never written to shmem
and the output-write loop reads uninitialized zeros.

Audit results (M=64 K=256, IQ2_XXS, single expert):
  R=1   → n32 PASS, n64 PASS, n128 PASS  (no rows past 32 exercised)
  R=32  → n32 PASS, n64 PASS, n128 PASS  (exactly fills the safe band)
  R=33  → n32 PASS, n64 FAIL, n128 FAIL  (token 32 = 0.0 vs 128.0 ref)
  R=64  → n32 PASS, n64 FAIL 2048/4096, n128 FAIL 2048/4096
  R=128 → n32 PASS, n64 FAIL 4096/8192, n128 FAIL 6144/8192

The boundary is exactly at token 32 in every failing case.
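
The failure counts in the audit table follow from a simple model of the finding. The sketch assumes a single expert, tokens packed into tiles in order, M output values per token, and that only the first 32 rows of each tile are ever written; the function name is hypothetical and this is not the audit harness itself.

```python
SAFE_ROWS = 32  # rows of each NR1 stripe the mc[8] accumulator covers, per the audit


def expected_mismatches(r_tokens, tile_n, m_rows=64):
    """Return (mismatching output values, total output values) for one expert."""
    # A token is lost when its position inside its tile is past the safe band.
    bad_tokens = sum(1 for t in range(r_tokens) if t % tile_n >= SAFE_ROWS)
    return bad_tokens * m_rows, r_tokens * m_rows


for r in (1, 32, 33, 64, 128):
    cells = [expected_mismatches(r, tile) for tile in (32, 64, 128)]
    summary = "  ".join(f"n{tile}: {bad}/{total}" for tile, (bad, total) in zip((32, 64, 128), cells))
    print(f"R={r:<3} {summary}")
```

With M=64 the model yields 2048/4096 for both wide widths at R=64, and 4096/8192 (n64) and 6144/8192 (n128) at R=128, matching the reported counts.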

Root cause: kernel_mul_mm_id template body in metal/moe.metal hardcodes
  simdgroup_float8x8 mc[8];
The 8-tile array covers 32 cols × 16 rows per simdgroup (= 64 × 32
across 4 sgitgs). For NR1=64 each threadgroup needs to cover 64 × 64
cells but the kernel only produces 64 × 32 worth, leaving the upper
half garbage.

Antirez's PR antirez#264 inherited the same structure. My MTL4 ports mirror
it faithfully, so the bug exists in both classic Metal and MTL4 paths.
Production wide-tile dispatches only produce correct output when the
DS4 routing distribution keeps per-expert neh1 ≤ 32 — above that
threshold, half (n64) or three-quarters (n128) of routed-token output
is silently corrupted.

Practical impact:
- Bit-identical-across-widths canaries (#735/#739/#740 et al.) were
  insufficient — they all ran in the safe band.
- Production prefill with large batched routing has been writing zeros
  for tokens whose intra-expert position is past 32.
- The advertised wide-tile speedup is unrealized as currently
  implemented (kernel under-utilizes its dispatch).

Fix path (deferred to follow-up):
- mc array must scale: mc[8 * NR1/32]
- mma loop and store loop bounds scale accordingly
- mb[2 * NR1/32] for matching col-tile count
- temp_str row offset becomes (NR1/2)*(sgitg>>1)*NR0
- shmem_bytes must grow to NR0*NR1*sizeof(float) for temp_str
- For NR1=128, may need threadgroup_size=256 for sb storage
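
The scaling rules in this list translate into concrete sizes per tile width. A rough sketch, assuming NR0 = 64 (implied by the "64 × 64 cells" figure above) and 4-byte floats; the helper name and dictionary keys are made up for illustration.

```python
def wide_tile_requirements(nr1, nr0=64, sizeof_float=4):
    """Accumulator counts and temp_str size implied by the fix-path formulas."""
    scale = nr1 // 32
    return {
        "mc_tiles": 8 * scale,                       # mc[8 * NR1/32]
        "mb_tiles": 2 * scale,                       # mb[2 * NR1/32]
        "temp_str_bytes": nr0 * nr1 * sizeof_float,  # NR0*NR1*sizeof(float)
    }


for nr1 in (32, 64, 128):
    req = wide_tile_requirements(nr1)
    print(f"NR1={nr1:>3}  mc={req['mc_tiles']:>2}  mb={req['mb_tiles']}  "
          f"temp_str={req['temp_str_bytes']} bytes")
```

Each doubling of NR1 doubles both the per-simdgroup accumulator count and the threadgroup memory needed for the store stage, which is the register and shared-memory pressure the last bullet hints at.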

CLI: --wide-tile-audit M K R (defaults 64 256 64).

Coverage: 86 unique MTL4 pipelines + 1 audit harness.
ivp5 added a commit to ivp5/ds4 that referenced this pull request May 27, 2026
tmp/20260528_wide_tile_audit/finding.md captures:
- Audit reproducer + per-R failure table
- Root cause (mc[8] hardcode + temp_str offset 16*(sgitg>>1))
- Provenance (antirez PR antirez#264 inherited the structure)
- Why my bit-identical-across-widths canaries missed it (nr1=1 stays
  in safe band; output-write loop only reads shmem+0)
- 4-step speedup ladder: fix wide-tile (2-4×) × persistent
  threadgroups (10-100×) × ICB consolidation (4-10×) × dequant
  hot-cache (5-10×) = 500× = 2.7 OOM ceiling on overhead-dominated
  small-batch paths

The audit itself is the 10× higher accuracy result silv asked for.
The aberration (boundary token=32) was previously unnoticed by
production tests and my own ne21=1 canaries. Fix path is sketched
in the doc but deferred — requires threadgroup_size=256 + sa/sb
storage restructuring + mc/mb array scaling.
antirez added a commit that referenced this pull request May 27, 2026
This reverts commit 805368e, reversing
changes made to e8e8779.
cchuter added a commit to cchuter/ds4 that referenced this pull request May 28, 2026
…stream improvements (#11)

* fix: repair unterminated DSML tool calls during long generations

During long tool-call generations (2000+ tokens), the model's attention
degrades and drops closing DSML tags before reaching max_tokens. This
causes finish=error with 'unterminated tool call', aborting the turn.

Fix: before returning error, attempt to repair by appending missing
closing tags (parameter -> invoke -> tool_calls in nesting order),
then re-parse to verify the repair produces valid tool calls.

- Add try_repair_dsml() to detect and fix unclosed DSML blocks
- Integrate repair at the unterminated tool call error path
- Add test_dsml_repair_produces_parseable_calls() with 7 scenarios
  covering all three DSML styles and multiple truncation patterns
- Tests verify structural accuracy: tool name and arguments are correct

Results: 0 finish=error across 156+ requests, 100% repair success rate
on unterminated tool calls.

* fix: repair malformed DSML tool calls in long-context generations

Long-context generations produce malformed DSML that parse_generated_message
cannot parse, causing "invalid tool call" and breaking the agent loop.

Three failure modes observed in stress testing (256K, q4-imatrix):

  Mode 1 (unterminated): model stops mid-DSML, missing closing tags
  Mode 2 (malformed closed): outer tags balanced but inner tags broken
  Mode 3 (hallucinated): tool_calls tags wrap plain reasoning text

This commit addresses modes 1 and 2 via try_repair_dsml(): single-pass tag
counting (O(n)) followed by appending missing closing tags in reverse
nesting order (parameter -> invoke -> tool_calls). Also adds unit tests.

Mode 3 is handled by antirez's commit 037ee39 which prevents DSML inside
thinking from being detected as executable tool calls.

Also adds orphan end tag guard: when toe>tos or ioe>ios or poe>pos, the
size_t subtraction would underflow. Return false early.
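
The counting-and-closing idea, including the orphan guard, can be sketched as follows. The tag strings are simplified placeholders, not the real DSML syntax, and this Python version scans once per tag rather than in a single pass; it only illustrates the approach described for try_repair_dsml().

```python
# Innermost tag first: the order in which missing closers are appended.
# Placeholder tag names; the real DSML markup is not shown in this commit message.
NESTING = ("parameter", "invoke", "tool_calls")


def try_repair(text):
    """Append missing closing tags, or return None if the text is not repairable."""
    suffix = []
    for tag in NESTING:
        opens = text.count(f"<{tag}")
        closes = text.count(f"</{tag}>")
        if closes > opens:
            # Orphan end tag: more closers than openers, so bail out early
            # instead of computing a negative "missing" count.
            return None
        suffix.append(f"</{tag}>" * (opens - closes))
    return text + "".join(suffix)


truncated = '<tool_calls><invoke name="ls"><parameter name="path">/tmp'
print(try_repair(truncated))
```

As in the commit, the repaired text would still need to be re-parsed before it is trusted as a valid tool call.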

Signed-off-by: Rui Gu <jackygurui@gmail.com>

* log: stderr message when thinking never closes

When parse_generated_message_ex is called with require_thinking_closed=true
and the model never outputs </thinking>, the entire generation is treated
as reasoning and any DSML inside is silently ignored. This stderr log
makes the gate visible for debugging.

Refs: antirez#167, commit 037ee39 (Ignore tool calls emitted inside thinking)
Signed-off-by: Rui Gu <jackygurui@gmail.com>

* fix: try_repair_dsml ignores DSML tags inside thinking

try_repair_dsml scanned the full generated text for DSML tags. When the
model discusses DSML syntax in its reasoning (e.g. explaining the DSML
tags), those text mentions inflate the tag counts, causing false positive
repairs (appending unnecessary closing tags).

Fix: find the last </thinking> boundary and start counting only from
there. DSML mentioned inside reasoning is model text, not executable
tool calls — matches the same approach used by parse_generated_message_ex
(commit 037ee39).

Also updated the hallucinated strip path to copy the thinking section
verbatim and only strip from the post-thinking region.

Real-world validation: observed this exact false positive in production.
The model was explaining how try_repair_dsml works and quoted the DSML
tag syntax in its explanation. The parser mistook the quote for a real
tool call and the tag counting inflated, causing a failed repair.

Metrics from production use (after all fixes in this branch):
  Tool calls: 169 | Invalid: 2 | Repaired: 38 | Orphan: 3
Only 2 cases remain unrecoverable; the other 41 (38 repaired + 3 orphan)
are now gracefully handled instead of causing finish=error.

Signed-off-by: Rui Gu <jackygurui@gmail.com>

* Fix restored KV cache state boundaries

* Add ds4-agent non-interactive mode

Support --non-interactive for headless agent use. With --prompt, the agent runs one turn through the normal worker/tool loop and exits. Without --prompt, stdin becomes a simple persistent protocol: +DWARFSTAR_WAITING marks readiness, input is collected until a 200 ms quiet window, and later input received while the model is busy is queued with +DWARFSTAR_QUEUED.

Keep the implementation on the existing append-only worker path so DSML parsing, tools, compaction, and KV state remain shared with the TUI. Mark accepted worker submissions busy immediately to avoid a race where non-interactive mode could exit before the worker reached prefill. Avoid terminal cursor-control escapes in plain stdout mode.

* Fix restored KV cache state boundaries

* Fix Metal short prompt prefill

* Fix DSML repair edge cases

* Handle malformed DSML with model retry

* Metal Neural Acceleration initial implementation

Squashes Ivan Fioravanti's Metal4/M5 Neural Acceleration scaffold, benchmark tooling, drift diagnostics, eval trace regrading, and initial Tensor/MPP kernel work.

* NAX speedups and Metal4 cleanup

Squashes Salvatore Sanfilippo's Metal4/NAX speedups and cleanup work: direct-RHS dense Tensor kernels, routed MoE Tensor coverage, full-512 indexer preservation, NAX prefill indexer scores, indexed-attention half shadow cache, and removal of rejected experimental paths.

* Store Metal attention compressed KV cache in F16

* Preserve CUDA compressed KV write path

* Fix compiler warnings in agent and TUI code

* Fix CUDA batched MoE prefill signature

* Fix agent session restore history

* Add fine-grained prefill progress callback

* Fix agent prefill progress bar sizing

* agent: add working-directory option

* fix wrong benchmark answer

the answer was outside of the claimed energy precision.

the evaluation after the fix
(with smooth distribution over the tokens)

```
$ ./ds4-eval --temp 3.0 --min-p 0.25 --nothink
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000)
ds4-eval: context buffers 479.38 MiB (ctx=16777, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=4196)
ds4-eval: 17/92 passed, 1 failed, runtime 00h:34m
#   state      prompt      gen    total given    correct  test
  1 PASSED        201      733      934 B        B        GPQA Diamond/recNu3MXkvWUzHZr9
  2 PASSED        149       87      236 C        C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3 PASSED         81      574      655 70       70       AIME2025/aime2025-01
  4 PASSED        313      239      552 C        C        GPQA Diamond/recoiTJPGUmzAkief
  5 PASSED        272      177      449 J        J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6 PASSED        146     1140     1286 468      468      AIME2025/aime2025-16
  7 PASSED        156      646      802 B        B        GPQA Diamond/rec4UqStf9WUVif1f
  8 PASSED        127       52      179 E        E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9 PASSED        633     4780     5413 588      588      AIME2025/aime2025-02
 10 PASSED        182      322      504 B        B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11 PASSED        137       68      205 A        A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12 PASSED        165      747      912 16       16       AIME2025/aime2025-03
 13 PASSED        149      672      821 A        A        GPQA Diamond (modified)/recDytVnNYZe2HuUU
 14 PASSED        167       68      235 J        J        SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a
 15 FAILED        305     4837     5142 86       82       AIME2025/aime2025-18
 16 PASSED        131      671      802 D        D        GPQA Diamond/recNFJjE5PPTqVJGv
 17 PASSED        175       67      242 I        I        SuperGPQA/7ca71b86327744b78e93185a45bc5cef
 18 PASSED        102     1199     1301 117      117      AIME2025/aime2025-04
 19 STOPPED       187       80      267 -        B        GPQA Diamond/rec2UlKqC6RFHdcro
 20 PENDING         0        0        0 -        E        SuperGPQA/d44b94f7749345a39a65f6312bda8764
 21 PENDING         0        0        0 -        106      AIME2025/aime2025-19
 22 PENDING         0        0        0 -        B        GPQA Diamond/recv7GsQg3f0fvB1f
 23 PENDING         0        0        0 -        B        SuperGPQA/febe406f44d74a40b50bb5b7c69d5dc1
```

* Highlight agent code output

* Avoid routed MoE TensorOps SwiGLU instability

Full routed-MoE TensorOps enabled the gate, up, and down projections. The
regression was isolated to the gate projection: enabling TensorOps for gate is
sufficient to send a sensitive AIME continuation into a repeated wrong answer,
while TensorOps for up+down remains stable.

The kernel-side cause is small but real arithmetic drift in
mpp::tensor_ops::matmul2d relative to the legacy simdgroup MMA contraction. A
same-input routed-MoE probe showed no address/layout corruption: TensorOps gate
was close to legacy, but not bit-identical. An isolated same-tile primitive
probe confirmed the source outside DS4 routing and quantization: legacy
simdgroup_multiply_accumulate matched a CPU FP32 serial dot-product reference on
the tested tile, while TensorOps produced close nonzero FP32 differences.
MTLMathModeSafe and the tested TensorOps descriptor variant did not remove the
drift.

That normally tiny drift matters here because MoE routing has discontinuous
top-k expert selection. In the failing path the first observed safe-vs-full
routing change was layer 3, token row 11: the selected sixth expert changed from
96 to 50 across a margin of only about 8e-4. Once an expert changes, the
transformer state is no longer a smooth local perturbation, and autoregressive
decoding can fall into a bad repetition basin.

Attempts that preserved the full gate TensorOps speed did not produce a
zero-drift or stable fix: forcing the routed intermediate to F32, using the
older generic TensorOps routed matmul instead of the expert-major fast layout,
changing the TensorOps descriptor mode, and compiling with strict Metal math all
left the gate drift or the bad continuation in place. Retaining TensorOps for
up and down keeps most of the MoE speedup, but gate stays on the legacy path
because it feeds the nonlinear silu(gate) * up branch and is the projection that
can flip later router decisions.
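
The discontinuity is easy to reproduce in isolation. In the sketch below only the expert ids 96 and 50 and the ~8e-4 margin come from the description above; the other scores, the other expert ids, and the 1e-3 drift are invented for illustration.

```python
def top_k_experts(scores, k):
    """Return the ids of the k highest-scoring experts, sorted by id."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:k])


# Five clear winners plus two near-tied candidates for the sixth slot.
scores = {3: 0.90, 11: 0.80, 27: 0.70, 41: 0.60, 64: 0.50, 96: 0.1008, 50: 0.1000}

before = top_k_experts(scores, k=6)

# A drift slightly larger than the margin on a single score is enough
# to swap which expert is selected.
drifted = dict(scores)
drifted[50] += 1e-3
after = top_k_experts(drifted, k=6)

print("before:", before)
print("after: ", after)
```

The logits barely move, but the selected set changes, so the layer output changes by a whole expert's contribution rather than by a small perturbation.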

* Fix F16 routed MoE graph dumps

* Add GPU power throttling

* Simplify agent edit tools

* Add runtime power commands

* Guard attention output TensorOps full tiles

* Disable routed MoE TensorOps

Remove the routed-MoE TensorOps/NAX path completely instead of leaving it as a gated-off mode. Semantic evals showed that gate, up, and down TensorOps routed-MoE variants can each move the model into bad continuations, while the full-tile-only expert-major experiment was correctness-interesting but slower than the legacy simdgroup path. Keeping the dead kernels around risks accidental re-enablement without a trustworthy correctness story.

The routed MoE grouped matmul now always uses the legacy 32-token expert-major simdgroup kernel. Other Metal4 TensorOps paths, such as the attention-output projection, remain enabled independently.

* Apply agent power changes while busy

* Improve agent edit tooling

* Fix anchored edit tail matching

* TUI improvements: prompt growth, colors, commands

Preserve generated output while linenoise prompt/history entries grow by scrolling the output region and keeping the output cursor column.

Polish streaming colors by highlighting [upto], restoring active text attributes after prompt redraws, and showing throttled power in the status bar without duplicate messages.

Keep unknown slash commands editable by beeping and restoring the input instead of printing an error or sending it to the model.

* Refine agent tool prompt reminders

* Improve agent session management

* Stabilize agent session IDs

Persist agent session titles in the KV file trailer and derive the saved session ID from title plus created_at so resaves keep a stable identity. Preserve and display titles in /list, keep stripped sessions readable, and migrate legacy rendered-text sessions on their next successful save.

Also fix stripped-session reloads to accept the retokenized rendered text count, and adjust the /list footer wording to use session IDs.

* Fix queued agent status bar fill

Keep the status-row reset in linenoise when the agent footer is multiline for queued prompts, so the padded tail of the status bar retains the grey background.

* Fix agent queued prompts and manual compaction

* Clarify agent edit prompt and reminder logging

* Add browser-backed web tools to ds4-agent

* Inject session start time in ds4-agent

* Stabilize ds4-agent web page extraction

* Improve rendered page extraction

* Avoid focusing Chrome during web tools

* Improve browser page scrolling heuristics

* Add timed yes-no prompts for browser approval

* Deliver queued prompts after tool results

* Stop generation on in-think tool calls

* Improve prefill progress callbacks

* Stabilize agent prefill label

* Terminate displayed bash output with newline

* DeepSeek v4 PRO support

Add DeepSeek V4 Pro support on top of main.

- add Flash/Pro shape selection and Pro Metal inference support
- fix shared SwiGLU clamp and indexer QAT behavior for Flash and Pro
- update eval/context tooling and imatrix quantization for Pro
- tag KV caches/sessions with model ids so Flash and Pro state cannot cross-load
- remove misc/PRO.md from tracked files while leaving it as a local scratch log

* Support PRO official continuation collection

* Document model-specific continuation data

* Use parsed tensor span for Metal model views

* Validate DS4 compression layout by shape

* Document PRO support and model downloads

Expose Flash and PRO model IDs as server compatibility aliases for the loaded GGUF. Add download targets for the mixed Flash q2/q4 quant and PRO GGUFs, update README guidance, and add the M3 Ultra PRO benchmark data.

* Project renamed to DwarfStar, without the "4"

* Document experimental PRO support

* Fix PRO routed MoE expert mapping

* fix(server): evict disk KV entry that fails prefill

After an unclean shutdown the on-disk KV checkpoint can be intact
(header, hash, token count all valid) but leave Metal in a state
where prefill fails.  Since the file keeps passing load-time checks
it gets reloaded on every request, looping forever until the user
manually deletes the cache directory.

On prefill failure, if the prefix came from a disk entry, unlink it
and invalidate the session.  Next request gets a clean cache miss.

Closes antirez#251

* Improve KV cache pre-store eviction

Evict before writing a new KV checkpoint so the incoming entry cannot be selected as its own victim. Pass incoming-checkpoint context into eviction scoring and devalue compatible continued-prefix waypoints when making room for a longer checkpoint.

Inspired by the KV-cache observations in antirez#174/antirez#175 and antirez#176/antirez#177.

Co-authored-by: Salvatore Sanfilippo <antirez@gmail.com>

Co-authored-by: unsaltedbutter-ai <261676361+unsaltedbutter-ai@users.noreply.github.com>

* Harden disk KV cache compatibility checks

Reject stale KVC graph payloads before loading them, bump the session payload ABI after recent runtime layout changes, and keep PR antirez#253's prefill-failure recovery as a final safety net.

The server now discards the exact disk checkpoint that produced a restored-state prefill failure, resets the disk continuation marker, and logs unlink errors instead of silently ignoring them.

Tests cover stale payload ABI rejection.

Closes antirez#251

* Add wide-token MoE prefill tiles (n64/n128 mul_mm_id).

Use 64/128-token expert-major tiles for Q4_K, Q2_K, and IQ2_XXS routed
prefill when batch length is aligned. Cap via DS4_METAL_MOE_TILE_MAX
(default 128; set 32 to force legacy tiles for A/B).

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix - Adding wide Q4_K F16 kernels and updating the host pipeline resolver to match Q2_K/IQ2_XXS.

* Harden wide MoE tile dispatch

* Revert "Harden wide MoE tile dispatch"

This reverts commit 9ca9013.

* Revert "Merge PR antirez#264: Add wide-token MoE prefill tiles"

This reverts commit 805368e, reversing
changes made to e8e8779.

* Add local golden inference drift test

* Guard MoE Metal tile shape

* fix(metal): short-circuit tier-0 multi-tier allocators on Metal/CPU builds

Half-B (PR #6) added per-tier graph tensor allocation via
ds4_gpu_tensor_alloc_ptr_on(tier, bytes) and its _managed variant.
On Metal and CPU builds these are stubs in ds4.c (no CUDA multi-tier
to route to). The stubs returned NULL unconditionally — but
metal_graph_alloc_raw_cap calls _ptr_on(g->head_tier, ...) and
_ptr_on(g->emb_tier, ...) directly to populate Class H / Class E
slots (g->output_pre_by_tier, g->logits_by_tier,
g->prefill_tokens_by_tier, etc.). On Metal, every one of those calls
returned NULL, the giant validation chain at end of alloc_raw_cap
failed, and ds4_session_create returned 1 with no error message.

Net effect: Mac ds4-server, ds4-cli, ds4-agent, ds4-bench have ALL
been broken for fresh-session creation since PR #6 (May 27). The
only thing that "worked" was processes already started before PR #6
landed (they were running the pre-Half-B code).

Fix: stubs short-circuit tier == 0 to the legacy single-device
allocators (ds4_gpu_tensor_alloc / ds4_gpu_tensor_alloc_managed,
both implemented in ds4_metal.m). Byte-equivalent to pre-multi-tier
Metal — same allocator, same behavior. Multi-tier on Metal/CPU
remains unsupported (other tiers still return NULL), which matches
the design (no CUDA → no multi-tier).

Smoke verified on Mac (M3 Ultra, Metal):
  $ ./ds4-server --metal -m IQ2XXS.gguf --port 8081 --ctx 8192 \
                  --tokens 4096 --warm-weights
  ds4-server: listening on http://127.0.0.1:8081
  $ curl /v1/chat/completions -d '{"messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":16}'
  → "We need to answer the question: \"What is 2+2?\" This"

* chore: strip internal phase-tracker references from comments

Drop "Half-A", "Half-B", "wave 1/2/3", "wave 3a/3b", "mgpu-*" task
names, "codex round-N" review references from comments throughout the
multi-GPU codebase. These were useful for tracking our internal
implementation phases but are noise for any reader — including
potential upstream review.

Comments now describe the technical reality directly. Build clean on
Mac (Metal). Functional smoke test PASS — multi-tier server starts and
serves /v1/chat/completions correctly with the fix from 46e123c.

No code semantics changed; only comments and string-free comment text.

* chore: strip internal review-process references from comments

Drop the three remaining "codex plan-review round N finding #M" /
"issue #M" tags from ds4.c comments. The actual technical content
stays — only the tracking metadata about which internal review round
introduced the constraint goes.

Note: "Codex" references in ds4_server.c are external — they refer
to OpenAI's Codex CLI client which consumes our /v1/responses API.
Those stay.

---------

Signed-off-by: Rui Gu <jackygurui@gmail.com>
Co-authored-by: Rui Gu <jackygurui@gmail.com>
Co-authored-by: user <user@studio1.l0st.space>
Co-authored-by: antirez <antirez@gmail.com>
Co-authored-by: Fabio Malpezzi <fabio@MacBook-Pro-di-Fabio.fritz.box>
Co-authored-by: Ivan Fioravanti <ivan.fioravanti@gmail.com>
Co-authored-by: Giovanni Montana <giovanni.montana@gmail.com>
Co-authored-by: alantsev <alantsev@users.noreply.github.com>
Co-authored-by: unsaltedbutter-ai <261676361+unsaltedbutter-ai@users.noreply.github.com>
Co-authored-by: Theinruj Toranavikrai <my.beam@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>