Fix Metal function_constant 702 collision on pre-M5 devices #1

Open

thx0701 wants to merge 72 commits into audreyt:main from thx0701:fix-fc-collision-702

Conversation

@thx0701 thx0701 commented May 12, 2026

Environment

  • Apple M4 Max, 128 GB unified memory, macOS 26.4.1
  • audreyt/ds4 main HEAD e6c3da4 (clean clone, make ds4-server)
  • Tried setting DS4_MPP=off / DS4_METAL_TENSOR_DISABLE=1; no help, because the failure is at compile time, before any runtime flags are read

Reproduction

git clone https://github.com/audreyt/ds4
cd ds4
make ds4-server
./ds4-server \
  -m gguf/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --host 0.0.0.0 --port 8002 --ctx 100000 \
  --dir-steering-file dir-steering/out/uncertainty.f32 \
  --dir-steering-ffn -3

Observed

Server aborts during Metal shader compilation:

ds4: Metal device Apple M4 Max, 128.00 GiB RAM
ds4: Metal 4 tensor API disabled for pre-M5/pre-A19 devices
  (set DS4_METAL_TENSOR_ENABLE=1 to experiment)
ds4: Metal shader compilation failed:
  program_source:3281:34: error: 'function_constant' has a duplicate index '702'
  constant bool FC_mul_mm_id_mpp [[function_constant(FC_MUL_MM + 2)]];
                                   ^
  program_source:2392:39: note: duplicate 'function_constant' index '702' here
  constant bool FC_mul_mm_m5_sgmatrix [[function_constant(FC_MUL_MM_M5_SGMATRIX)]];
                                        ^
ds4: metal backend unavailable; aborting startup

Root cause

In ds4_metal.m's embedded shader source (around line 2282):

"#define FC_MUL_MV 600\n"
"#define FC_MUL_MM 700\n"
"#define FC_MUL_MM_M5_SGMATRIX 702\n"

Both FC_MUL_MM_M5_SGMATRIX and FC_MUL_MM + 2 evaluate to 702. The Metal compiler requires every function_constant index to be unique across the program, so the shader unit fails to compile as a whole, regardless of whether the M5 path or the legacy MPP path is actually selected at runtime.

Why your own machine doesn't see this

I assume your M5 Max has the Metal 4 Tensor API enabled and the active shader emission/specialization happens to mask one of the two declarations (or you build with a different conditional). On a pre-M5 device the fallback path still feeds the entire shader source to the compiler, both declarations are visible, and we hit the duplicate-index error.

Local workaround (ugly but works)

 "#define FC_MUL_MV 600\n"
 "#define FC_MUL_MM 700\n"
-"#define FC_MUL_MM_M5_SGMATRIX 702\n"
+"#define FC_MUL_MM_M5_SGMATRIX 800\n"

After the patch plus make clean && make ds4-server, the M4 Max boots cleanly, loads cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf (~94 GiB mapped), enables directional steering (dir-steering/out/uncertainty.f32, attn=0 ffn=-3), and generates at ~25 t/s, matching the antirez/ds4 upstream rate on M4 Max. There is no MPP acceleration, since M4 Max isn't in the MPP target list, which is expected.

Suggested cleaner fixes

  1. Separate index ranges: keep the FC_MUL_MM block reserved for 700–799 and use 800+ for FC_MUL_MM_M5_*, e.g.

    #define FC_MUL_MM         700
    /* FC_MUL_MM + 0 .. FC_MUL_MM + 99 reserved for the main matmul family */
    #define FC_MUL_MM_M5_BASE 800
    #define FC_MUL_MM_M5_SGMATRIX (FC_MUL_MM_M5_BASE + 0)
  2. Conditional shader emission: wrap the M5-specific declarations behind #if HAS_METAL4_TENSOR (or whatever feature gate you already use for the M5 fast path) so pre-M5 builds don't see them at all.

  3. Split the M5 shader into its own source string and only concatenate it when the device family qualifies at runtime.

I'd guess option 1 is the smallest patch that fixes this cleanly. Happy to send a PR if you'd prefer; just wanted to file the issue first since options 2 and 3 are bigger architectural choices.
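
To illustrate option 2, the M5-only declaration could be guarded by a preprocessor symbol that the host defines only on qualifying devices (for example through MTLCompileOptions' preprocessorMacros). A minimal sketch against the embedded source string; HAS_METAL4_TENSOR is a placeholder name, the declaration is shown next to the defines purely for illustration, and whatever feature gate ds4_metal.m already uses for the M5 fast path would be the natural choice:

  "#define FC_MUL_MV 600\n"
  "#define FC_MUL_MM 700\n"
  "#if defined(HAS_METAL4_TENSOR)\n"        /* host defines this only for M5/A19-class devices */
  "#define FC_MUL_MM_M5_SGMATRIX 800\n"     /* combined here with the range split from option 1 */
  "constant bool FC_mul_mm_m5_sgmatrix [[function_constant(FC_MUL_MM_M5_SGMATRIX)]];\n"
  "#endif\n"

Any kernel that references FC_mul_mm_m5_sgmatrix would need the same guard, which is why option 3 (a separate source string concatenated only on qualifying devices) ends up being the same split done on the host side.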

Side note (separate from this issue)

While trying to run the same cyberneurova GGUF on a DGX Spark (GB10, sm_121, CUDA 13.0) with main HEAD e6c3da4 + --cuda, the server hangs at 100 % CPU right after CUDA backend initialized on NVIDIA GB10 (sm_121) and never reaches CUDA loading model tensors into device cache. It looks like the cyberneurova q8_0 token-embd loader path isn't wired up on the CUDA side either. Want me to file that as a separate issue?

ivanfioravanti and others added 30 commits May 10, 2026 00:24
DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter
without per-tensor type overrides ship most of the small projections at
Q8_0 (and routed-expert router weights at F32) where the antirez recipe
keeps them at F16. Examples include the cyberneurova abliterated GGUFs.
On stock ds4 main these loads fail loudly at the first F16-strict
validator (token_embd, then output_hc_fn, then hc_attn_fn, ...), and
even after the validators are relaxed, several Metal kernel paths read
weight bytes directly via offset arithmetic that hard-codes F16/F32
strides.

This change makes the embed/HC/compressor/indexer/router validators
*and* the corresponding Metal kernel paths polymorphic, so the same
GGUF loads and runs with no harmonizer step.

Validators (ds4.c):

  * New tensor_expect_dispatch_layout helper accepts F16, F32, or Q8_0
    and is applied to every projection that flows through a
    type-dispatching matvec/matmul: output_hc_fn, hc_attn_fn,
    hc_ffn_fn, attn_compressor_{ape,gate,kv}, indexer.{attn_q_b,proj},
    indexer_compressor_{ape,gate,kv}, ffn_gate_inp (a sketch of this
    helper follows the list).
  * token_embd keeps its own inline F16/Q8_0 check because its CPU
    embed kernel doesn't go through matvec_any.
  * Two compressor decode-time guards (attn_compressor and
    indexer_compressor pair-projection paths) relaxed from "F16 only"
    to "F16 or Q8_0, paired type must match".

CPU paths (ds4.c):

  * Refactor embed_token_f16 into an embed_token dispatcher; add
    embed_token_q8_0 (block-wise dequant of block_q8_0; see the sketch
    after this list).
  * Replace the remaining direct matvec_f16 / matvec_f16_serial
    callers (HC fn, output_hc_fn, ffn_gate_inp) with the existing
    matvec_any dispatcher; add matvec_any_serial for the HC pre/post
    path.
  * Polymorphic Metal-side dispatch helpers metal_graph_matmul_plain_tensor
    and metal_graph_matmul_pair_plain_tensor (extended for Q8_0; the
    pair version fuses with the existing F16-pair kernel when both
    tensors are F16, otherwise it dispatches two single matmuls); all 22
    hardcoded ds4_metal_matmul_f16{,_pair}_tensor call sites in ds4.c
    (HC mix, attn/indexer compressors, indexer projections, output head,
    router) are converted to use these wrappers.
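
The block_q8_0 layout referenced above is the standard GGUF one: a 16-bit float scale followed by 32 int8 quants, 34 bytes per block, value = scale * quant. A minimal CPU-side dequant sketch along the lines of what embed_token_q8_0 has to do (illustrative, not the actual ds4.c code):

  #include <stdint.h>
  #include <string.h>

  #define QK8_0 32

  /* GGUF block_q8_0: fp16 scale bits + 32 int8 quants = 34 bytes per block. */
  typedef struct {
      uint16_t d;           /* scale, stored as IEEE 754 half bits */
      int8_t   qs[QK8_0];   /* quantized values */
  } block_q8_0;             /* sizeof == 34, no padding */

  static float half_bits_to_float(uint16_t h) {
      _Float16 f;           /* clang on Apple silicon; a portable build needs a manual fp16 decode */
      memcpy(&f, &h, sizeof f);
      return (float)f;
  }

  /* Dequantize one row of n values (n assumed to be a multiple of QK8_0). */
  static void dequant_q8_0_row(const block_q8_0 *blocks, float *dst, int n) {
      for (int b = 0; b < n / QK8_0; b++) {
          const float d = half_bits_to_float(blocks[b].d);
          for (int i = 0; i < QK8_0; i++) {
              dst[b * QK8_0 + i] = d * (float)blocks[b].qs[i];
          }
      }
  }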

Metal kernels:

  * metal/get_rows.metal: kernel_get_rows_q8_0 (one float per thread,
    dequantizes its source block on the fly).
  * metal/dense.metal: kernel_mul_mm_f32_f32 template instantiation for
    the multi-token F32 weight matmul that the F32 router path needs in
    prefill (mirrors the existing F16/Q8_0 mul_mm_t instantiations).
  * metal/cpy.metal: kernel_cpy_q8_0_f32 (dequantizing 1D copy used by
    the compressor APE byte-strided reader).

Metal wiring (ds4_metal.m):

  * Register g_get_rows_q8_0_pipeline and g_cpy_q8_0_f32_pipeline at
    init; clear them at cleanup.
  * Both ds4_metal_embed_{token,tokens}_hc_tensor and the shared
    ds4_metal_encode_get_rows helper take a new weight_type parameter
    (GGUF type code: 1=F16, 8=Q8_0). 8 callers in ds4.c forward
    weights->token_embd->type unchanged. ds4_metal_embed_row_layout
    picks the right per-row stride and pipeline.
  * ds4_metal_matmul_f32_tensor extended with a multi-token branch
    that dispatches to kernel_mul_mm_f32_f32 (n_tok > 1); existing
    n_tok = 1 path unchanged.
  * ds4_metal_encode_compressor_score_with_ape and the equivalent loop
    in ds4_metal_compressor_prefill_tensor add a Q8_0 branch
    (ds4_metal_encode_cpy_q8_0_f32_1d) and use a per-row stride that
    accounts for the block_q8_0 layout.
  * Six ape_type validators relaxed to also accept 8 (Q8_0).
  * Six ape_bytes calculations centralized through a new
    ds4_metal_ape_bytes(ape_type, n_elems) helper that returns the
    correct stride for F16/F32/Q8_0 (see the sketch after this list).
  * metal_graph_matmul_plain_tensor extended with a Q8_0 branch.
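
A sketch of the kind of centralized stride helper described above (type codes as in the embed wiring bullet: 1=F16, 8=Q8_0, with 0 assumed for F32; n_elems is taken to be a multiple of the 32-wide Q8_0 block; the real helper may differ):

  #include <stddef.h>

  /* Bytes occupied by n_elems values of the given GGUF type code. */
  static size_t ds4_metal_ape_bytes(int ape_type, size_t n_elems) {
      switch (ape_type) {
          case 0: return n_elems * 4;           /* F32 */
          case 1: return n_elems * 2;           /* F16 */
          case 8: return (n_elems / 32) * 34;   /* Q8_0: 34-byte blocks of 32 values */
          default: return 0;                    /* unknown layout */
      }
  }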

Tested on macOS / M-series / Metal:

  * make ds4-server clean (no new warnings).
  * Cyberneurova Q2_K GGUF entirely unmodified: loads, prefill +
    decode through to coherent generation ("PASS" returned for the
    "reply with the single word PASS" prompt).
  * Pre-harmonized variant (token_embd / hc / compressor / indexer all
    F16, ffn_gate_inp F16): still works byte-for-byte the same as
    before this change, no F16 path regressions.

Caveat for reviewers running ivanfioravanti's M5 PR (antirez#15) on top of
this: the unmodified cyberneurova file generates garbage (BOS spam)
when MPP F16 prefill is engaged, but produces coherent output with
DS4_METAL_MPP_F16_DISABLE=1. The garbage is reproducible from antirez#15's
MPP path alone and is independent of the changes here; it surfaces only
because this PR makes the Q8_0 file loadable in the first place.
When a stock-recipe GGUF (cyberneurova-style: Q8_0 small tensors, F32
router) is loaded on M5 with PR antirez#15's MPP optimizations enabled, the
compressor APE path silently produces wrong output and prefill emits
garbage tokens (typically <BOS> spam after a few coherent tokens).

The prefill graph itself is correct; the bug is in two compressor APE consumers that
were updated to accept Q8_0 ape_type but couldn't read Q8_0 byte layout
correctly:

1. `kernel_cpy_q8_0_f32` Metal kernel (added in support-q8_0-token-embd
   for the prefill APE byte-strided dequant): produces silently wrong
   output on M5 Max for the compressor APE shapes (4 rows x 1024 cols).
   Replaced with a CPU-side dequant into a per-call private MTLBuffer.
   The CPU dequant matches gguf-py reference byte-for-byte (verified
   with a standalone numeric check); the Metal kernel did not.

2. `kernel_dsv4_compressor_store_one` (decode-time single-row store in
   metal/dsv4_kv.metal): only handled F16/F32 ape_type; Q8_0 fell into
   the F32 else branch and read garbage.  Add a Q8_0 branch that walks
   block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte
   block) directly.

The CPU dequant path also has to use a *fresh per-call* MTLBuffer for
each compressor invocation, not the shared g_compressor_store_ape_buffer:
multiple CPU writes to one shared buffer in the same command buffer
collapse to the last write at execute time (Metal kernels run in encode
order, but CPU writes don't participate in that ordering when the same
scratch is reused).  The per-call buffer is retained until cb completion
via addCompletedHandler because Metal does not strongly retain buffers
bound to encoders.

Changes:

  * ds4_metal.m: new `ds4_metal_half_bits_to_float` and
    `ds4_metal_cpu_dequant_q8_0_rows` helpers (verified against gguf-py
    `dequantize` reference); replace Q8_0 branches in
    `ds4_metal_encode_compressor_score_with_ape` and
    `ds4_metal_compressor_store_batch_tensor` with CPU-side dequant
    into per-call private buffers retained via addCompletedHandler.
  * metal/dsv4_kv.metal: add a Q8_0 branch to
    `kernel_dsv4_compressor_store_one`.
  * metal/cpy.metal: `kernel_cpy_q8_0_f32` is left in place but
    no longer reached from the compressor paths (its registration in
    ds4_metal.m is harmless).

Tested on macOS / M-series / Metal:

  * make ds4-server clean.
  * Cyberneurova Q2_K GGUF entirely unmodified, MPP F16 enabled (i.e.
    no DS4_METAL_MPP_F16_DISABLE workaround):
    21-token prompt -> coherent generation
    ("An LLM, or Large Language Model, is a type of artificial intelligence").
    Previously this prompt generated "An LLM, or large language" then
    <BOS> token spam.
  * Pre-harmonized variant: still works byte-for-byte the same as
    before this change, no F16/F32 path regressions.
This PR's loader changes accept Q8_0 `*compressor_ape*` weights at the
validator level, but two follow-on Metal paths still treat them as F16
(or fall through to F32) and produce silently wrong output, which shows
up as <BOS>-token spam in generation for any prompt long enough to
exercise the multi-token compressor path on M-series hardware.

1. `kernel_cpy_q8_0_f32` (added in this PR for the prefill APE
   byte-strided dequant) compiles cleanly and follows the same
   block_q8_0 indexing pattern used by other working Q8_0 kernels in
   dense.metal, but emits silently wrong values for the actual ape
   shapes (4 rows x 1024 cols of block_q8_0).  Confirmed by isolating
   the kernel: a CPU-side dequant of the same byte region matches
   gguf-py's `dequantize` reference byte-for-byte, while the Metal
   kernel's output is wrong.

2. `kernel_dsv4_compressor_store_one` (decode-time single-row store
   in metal/dsv4_kv.metal): only handled `ape_type == 1` (F16) and
   fell through to F32 for everything else, so Q8_0 ape was reading
   garbage at decode time.

Fix:

* Replace the prefill APE Q8_0 path in
  `ds4_metal_encode_compressor_score_with_ape` and
  `ds4_metal_compressor_store_batch_tensor` with a CPU-side dequant
  via two new helpers (`ds4_metal_half_bits_to_float` and
  `ds4_metal_cpu_dequant_q8_0_rows`) into a *per-call* private
  MTLBuffer.  A per-call buffer is required because multiple CPU writes
  to the previously-shared `g_compressor_store_ape_buffer` within one
  command buffer collapse to the last write at execute time (Metal
  kernels run in encode order, but CPU writes don't participate in that
  ordering when the same scratch is reused).  The per-call buffer is
  retained until cb completion via `addCompletedHandler` because Metal
  does not strongly retain buffers bound to encoders.
* Add a Q8_0 branch to `kernel_dsv4_compressor_store_one` that walks
  block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block)
  inline.

The buggy `kernel_cpy_q8_0_f32` Metal kernel is left in place but is
no longer reached from the compressor paths; its registration in
ds4_metal.m is harmless and a future debug session can either fix it
or drop it.

Tested on macOS / M-series / Metal:

* make ds4-server clean (one pre-existing -Wpointer-sign warning from
  the unrelated MoE path).
* Cyberneurova Q2_K GGUF entirely unmodified, default flags:
  21-token prompt -> coherent generation
  ("An LLM, or Large Language Model, is a type of artificial intelligence").
  Previously this prompt generated a few coherent tokens then <BOS>
  token spam.
* Pre-harmonized variant (token_embd / hc / compressor / indexer all
  F16): still works byte-for-byte the same as before this fix; no F16
  / F32 path regressions.
The decode-time indexer code at metal_graph_encode_decode_layer (ds4.c:9082-9095)
still has two F16-only validators on indexer_attn_q_b and indexer_proj that I
missed in the initial loader pass.

These validators only fire after `g->layer_n_comp[il] > decode_top_k` — i.e.
once the compressor has accumulated more rows than the decode-time top-k.
For short generations the path isn't reached; for ~400+ token generations
on stock-recipe (Q8_0) GGUFs the validator trips and the request finishes
with finish_reason="error" / "Metal decode failed".

The downstream calls already use metal_graph_matmul_plain_tensor (which
dispatches to ds4_metal_matmul_q8_0_tensor for Q8_0). The loader-time
validator at line 2211-2212 already uses tensor_expect_dispatch_layout,
which accepts F16/F32/Q8_0. Only these runtime guards were stuck on F16.

Reproducer (cyberneurova Q2_K, default flags): a "write a long story"
prompt that generates ~800 tokens hits the validator after ~400 tokens
and the request errors out. After this fix, the same prompt streams 800+
tokens cleanly.
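
For context, the shape of the guard change being described (illustrative only; the GGUF type codes are 1=F16, 0=F32, 8=Q8_0, and the actual ds4.c field names and error path differ):

  /* Before: decode-time indexer guard hard-wired to F16. */
  if (layer->indexer_attn_q_b->type != 1 /* F16 */) {
      return -1;   /* surfaces as "Metal decode failed" */
  }

  /* After: accept whatever metal_graph_matmul_plain_tensor can dispatch,
   * mirroring the loader-time tensor_expect_dispatch_layout check. */
  const int t = layer->indexer_attn_q_b->type;
  if (t != 1 /* F16 */ && t != 0 /* F32 */ && t != 8 /* Q8_0 */) {
      return -1;
  }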
…8_0 fixes)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two callers of ds4_metal_encode_cpy_q8_0_f32_1d were removed in 79b08bb
(switched to CPU-side dequant to avoid an encode-time race on the shared
compressor scratch buffer), leaving the function unused and tripping
-Wunused-function on stock Make builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
audreyt and others added 27 commits May 11, 2026 07:43
Updated project title and clarified description.
A second bundled steering vector parallel to verbosity, built from 100
contested vs 100 settled prompts. Negative FFN scales push the model
toward hedge-mode response (presenting multiple positions); positive
scales push toward more assertive single-answer mode.

The hedge-vs-assert axis is a general response register rather than a
topic-specific representation, so the direction transfers across model
variants better than stance directions.

Useful on questions where the model would otherwise emit a strongly-
trained closed-form completion. Most effective combined with a system
prompt that supplies the disputed positions explicitly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the Responses API endpoint that Codex CLI (and other modern
OpenAI tooling) speaks instead of /v1/chat/completions. The wire format
is documented in OpenAI's Responses API; this implementation has been
iterated against the Codex CLI binary's SSE parser shape until no
remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call,
  function_call_output, reasoning, custom_tool_call(_output),
  local_shell_call(_output), web_search_call(_output),
  tool_search_call(_output), image_generation_call(_output),
  compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so
  prior actions survive across turns; rejects unknown item types and
  non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized
  text blocks (input_text/output_text/text/summary_text/
  reasoning_text); rejects non-text modalities (input_image/file/
  audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant
  message so text + tool-call turns render as a single assistant
  block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates
  reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice
  explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created,
  output_item.added/.done, reasoning_summary_part.added/.done,
  reasoning_summary_text.delta/.done, content_part.added/.done,
  output_text.delta/.done, function_call_arguments.delta/.done,
  response.completed.
- Branches the terminal event by finish reason: response.failed for
  errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries
  annotations:[]; function_call output_item.added ships with an empty
  arguments string (full args arrive via function_call_arguments.done
  and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream
  marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model
  was suppressed at the tool marker: the suppressed tail is flushed
  as additional output_text deltas so the streamed message matches
  output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same
client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anthropic's count_tokens API takes the same request shape as /v1/messages
but only returns the prompt token count without running inference. This
short-circuits before enqueueing a job: parse_anthropic_request renders
and tokenizes the prompt the same way it would for a real generation,
then we serialize {"input_tokens": N} and release the request.

Useful for clients that need to plan context budgets before committing
to a generation, e.g. the Anthropic SDK token-counting flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ds4.c calls ds4_gpu_set_mpp_mode, ds4_gpu_set_mpp_compare_context, and
ds4_gpu_clear_mpp_compare_context unconditionally, but these are
Metal-only diagnostic helpers. Provide no-op CUDA implementations so the
linker resolves the symbols.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
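
A sketch of what those stubs amount to (the parameter lists here are placeholders; the real prototypes come from the ds4 headers, and the only requirement is that empty definitions satisfy the linker):

  /* ds4_cuda.cu: no-op counterparts of the Metal-only MPP diagnostic helpers. */
  void ds4_gpu_set_mpp_mode(int mode)             { (void)mode; }
  void ds4_gpu_set_mpp_compare_context(void *ctx) { (void)ctx; }
  void ds4_gpu_clear_mpp_compare_context(void)    { }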
GGUF math: the 86.72 GB mixed IQ2_XXS+Q2_K+Q8_0 file averages ~2.62
bits per parameter, which matches a 284b-parameter model (671b would
require an impossible ~1.11 bits/param). antirez/ds4's README also
says 284b, as does pi-ds4.html. This was a copy-paste error from the
DeepSeek V3 family numbers.

Also surface the decode rate (~30 tok/s on M5 Max, matching the
benchmark table's 24–37 tok/s range) alongside the prefill rate, since
decode is what users actually experience during a conversation.
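
For reference, the arithmetic behind those figures (they line up if the 86.72 is GiB):

  86.72 GiB * 2^30 B/GiB * 8 bit/B ≈ 7.45e11 bits
  7.45e11 bits / 284e9 params ≈ 2.62 bits/param
  7.45e11 bits / 671e9 params ≈ 1.11 bits/param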
Absorbs Swival/ds4-m5's M5 simdgroup_matrix matmul fast path,
M5-private scratch buffers + hazard tracking, CUDA q8 fp16 cache
memory guard, CUDA long-context fixes/test, M2 Ultra benchmark, and
speed-bench script/CSV layout.

Conflict resolution in ds4_metal.m:
- ds4_gpu_get_mul_mm_id_pipeline keeps HEAD's (bc_inp, use_mpp) shape,
  since the MoE _id_ kernels in metal/moe.metal use function-constant
  slot 702 for use_mpp. Slot 702 in metal/dense.metal is used for
  m5_sgmatrix via the auto-merged ds4_gpu_get_mul_mm_pipeline path.
- ds4_gpu_use_m5_simdgroup_matrix() helper kept alongside our MPP
  infrastructure (q8_0/f16/attn-out/MoE policy functions) so both
  fast paths stay wired.

README left at HEAD; swival/m5's new sections describe a different
fork narrative.
The upstream backend refactoring (0ac5df3) renamed the public tensor
API from ds4_metal_tensor_* to ds4_gpu_tensor_* (and ds4_metal_matmul_q
to ds4_gpu_matmul_q, etc.), but tests/ds4_test.c kept using the old
names — so `make ds4_test` had been broken since that commit.

This is a pure rename; ./ds4_test --server passes.
Absorbs Swival/ds4-m5's M5 simdgroup_matrix matmul fast path,
M5-private scratch buffers + hazard tracking, CUDA q8 fp16 cache
memory guard, CUDA long-context fixes/test, M2 Ultra benchmark, and
speed-bench script/CSV layout.

Conflict resolution in ds4_metal.m (matches main's earlier resolution):
- ds4_gpu_get_mul_mm_id_pipeline keeps HEAD's (bc_inp, use_mpp) shape,
  since the MoE _id_ kernels in metal/moe.metal use function-constant
  slot 702 for use_mpp. Slot 702 in metal/dense.metal is used for
  m5_sgmatrix via the auto-merged ds4_gpu_get_mul_mm_pipeline path.
- ds4_gpu_use_m5_simdgroup_matrix() helper kept alongside our MPP
  infrastructure (q8_0/f16/attn-out/MoE policy functions) so both
  fast paths stay wired.

README left at HEAD; swival/m5's new sections describe a different
fork narrative.
The upstream backend refactoring (0ac5df3) renamed the public tensor
API from ds4_metal_tensor_* to ds4_gpu_tensor_* (and ds4_metal_matmul_q
to ds4_gpu_matmul_q, etc.), but tests/ds4_test.c kept using the old
names — so `make ds4_test` had been broken since that commit.

This is a pure rename; ./ds4_test --server passes.
Raise the default Metal prefill chunk to 4096 and reuse the range-capable layer-major prefill graph for chunked ranges.

Enable the guarded Q8_0 attn_q_b MPP route for <=2048-token prompt batches, dynamic Q8_0 tile width, the routed-MoE fast layout from layer 0, and the RB16 indexed decode path.

M5 Max post-patch ds4-bench profile with 64 generated tokens: prompt 443/459/522/486/465 t/s and generation 38.6/38.2/37.6/34.0/33.6 t/s at 0.5k/1k/2k/4k/8k.

Tests: make all ds4_test; make test; git diff --check.
Detect macOS Low Power Mode and widen the Q8_0 prefill MPP route only under that condition, while preserving the guarded default for normal-power runs and explicit Q8_0 filters.

Low-power M5 Max baseline vs patched auto with 128 generated tokens:

  0.5k: prefill 133.46 -> 196.89 t/s, gen 13.53 -> 15.08 t/s
  1k:   prefill 118.65 -> 188.91 t/s, gen 12.23 -> 14.93 t/s
  2k:   prefill 130.90 -> 220.33 t/s, gen 11.02 -> 14.65 t/s
  4k:   prefill 118.09 -> 212.81 t/s, gen 13.25 -> 14.00 t/s
  8k:   prefill 185.52 -> 206.49 t/s, gen 12.94 -> 13.84 t/s

Tests: make all ds4_test; make test; DS4_METAL_MPP_LOW_POWER_DISABLE=1 ./ds4_test --metal-mpp-equivalence; git diff --check.
Brings in Ivan's PR antirez#15 follow-ups (Tune Metal MPP defaults / Improve
MPP prefill throughput / Low-power Q8 profile) plus the ds4_test rename
fix, on top of the swival/m5 work that main already absorbed.

Conflict resolution:
- ds4_metal.m: drop main's older ds4_gpu_mpp_q8_0_partial_tiles_enabled
  in favor of m5's newer version from ff2d499 (handles low-power mode).
- README.md: keep HEAD (the Abliterated fork narrative); m5's README is
  the original DwarfStar 4 readme with Swival's M5 narrative appended.
ds4_metal.m's embedded shader source declares both:

  FC_mul_mm_id_mpp    [[function_constant(FC_MUL_MM + 2)]]   // == 702
  FC_mul_mm_m5_sgmatrix [[function_constant(FC_MUL_MM_M5_SGMATRIX)]] // == 702

Metal requires every function_constant index to be unique across the
program, so the shader unit fails to compile as a whole on devices
that fall back to legacy Metal (e.g. M4 Max with Metal 4 Tensor API
disabled), with:

  program_source:3281:34: error: 'function_constant' has a duplicate
  index '702'

ds4-server then aborts with "metal backend unavailable". This is a
compile-time conflict; runtime flags like DS4_MPP=off /
DS4_METAL_TENSOR_DISABLE=1 can't recover from it because the shader
unit is rejected before any runtime selection runs.

This minimal patch moves FC_MUL_MM_M5_SGMATRIX out of the 700–799
block (which is shared with FC_MUL_MM family indices computed as
FC_MUL_MM + k) to 800, which is also clear of the existing FC_BIN=1300 range.

Verified on Apple M4 Max, 128 GB, macOS 26.4.1, building from main
HEAD e6c3da4:

  - make clean && make ds4-server: succeeds
  - ./ds4-server -m cyberneurova-...Q2_K.gguf \
      --dir-steering-file dir-steering/out/uncertainty.f32 \
      --dir-steering-ffn -3 \
      --ctx 100000 --host 0.0.0.0 --port 8002
  - Server boots, maps 94 GiB of tensors, logs
    "directional steering enabled: ... attn=0 ffn=-3"
  - Steady-state generation ~25 t/s (matches antirez/ds4 upstream
    on M4 Max; M4 Max is not in the MPP target list so it correctly
    runs the legacy Metal path)

A cleaner long-term fix would be to (a) reserve disjoint ranges per
family (e.g. 700–799 for FC_MUL_MM, 800–899 for FC_MUL_MM_M5_*) and
document the convention, or (b) gate the M5-only declarations behind
a feature macro so pre-M5 builds don't emit them at all. This patch
is the minimal change that unblocks pre-M5 users today.
@thx0701 thx0701 (Author) commented May 12, 2026

Following up on the side note in the original PR body — here are the full details for the DGX Spark CUDA hang in case it's useful (separate from this PR's fix; happy to drop into a new venue if you'd prefer).

Environment

  • DGX Spark GB10 (Grace Blackwell, compute capability sm_121)
  • Ubuntu 24.04 LTS, kernel 6.17.0-1008-nvidia (aarch64)
  • CUDA toolkit 13.0 (V13.0.88), driver 580.126.09
  • 20-core ARM, 119 GiB unified memory, 406 GiB free disk
  • audreyt/ds4 main HEAD e6c3da4
  • Built with make CUDA_ARCH=native — compile + link both clean, no warnings
  • cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf (98.8 GB, valid GGUF v3 magic)

Reproduction

git clone https://github.com/audreyt/ds4 ds4-audreyt
cd ds4-audreyt
PATH=/usr/local/cuda/bin:$PATH make CUDA_ARCH=native
./ds4-server --cuda \
  -m /path/to/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --host 0.0.0.0 --port 8001 --ctx 32768 \
  --kv-disk-dir /mnt/models/ds4-kv --kv-disk-space-mb 8192 \
  --dir-steering-file dir-steering/out/uncertainty.f32 \
  --dir-steering-ffn -3

Observed: hangs forever after CUDA init

Server log gets exactly this far and then nothing more is emitted (waited 5+ minutes across multiple attempts, also tried without --dir-steering-file to isolate):

ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
ds4: CUDA host registration skipped: operation not supported

For comparison, the same Spark running antirez/ds4 upstream (99a5c13) with antirez's native IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8 GGUF goes through the expected loading sequence:

ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
ds4: CUDA host registration skipped: operation not supported
ds4: CUDA loading model tensors into device cache
ds4: CUDA loading model tensors 16.02 GiB cached
ds4: CUDA loading model tensors 32.06 GiB cached
ds4: CUDA loading model tensors 64.06 GiB cached
ds4: CUDA loading model tensors 80.04 GiB cached
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 81.821s
ds4: cuda backend initialized for graph diagnostics
0512 03:29:31 ds4-server: listening on http://0.0.0.0:8000

So audreyt/ds4 HEAD never gets past the host registration skipped line into CUDA loading model tensors for cyberneurova.

Process state during the hang

PID 3245707  STAT Rl  100% CPU  0% MEM-of-119GiB  WCHAN=-
read_bytes  delta over 3s: 0
write_bytes delta over 3s: 0
GPU power.draw: 13.9 W  memory.used: [N/A]  utilization.gpu: 0 %

So:

  • single-core CPU pinned at 100 %, but no I/O (model file not being mmap'd / fault-paged in)
  • GPU completely idle (no tensors uploaded yet)
  • WCHAN empty (running, not waiting on a kernel object)

→ looks like a tight busy loop in the CUDA loader pre-tensor-copy phase, not a deadlock on a syscall.

gdb -p $PID -batch -ex bt couldn't attach (kernel ptrace_scope=1 and no sudo), so I don't have a stack trace. Happy to enable ptrace and re-run if you want exact frames.

Why we suspect the loader hasn't been ported

  • e6c3da4 ("port this to DGX Spark") only adds the following to ds4_cuda.cu:
    + uint32_t layer_index,
    + (void)layer_index;
    i.e. it brings the function signature in sync with upstream, but doesn't touch the model loader path itself.
  • The Metal-side cyberneurova loader patch (q8_0 token_embd accepted instead of requiring F16) is presumably what lives on the support-q8_0-token-embd branch, but the CUDA counterpart of that patch doesn't appear to be in main.
  • Without that, the loader is presumably hitting an unrecognized quant/tensor layout for token_embd and ending up in some retry / lookup loop rather than failing fast.

What I'd like to know

  1. Has audreyt/ds4 main's CUDA path been verified end-to-end with cyberneurova on any actual Spark / Blackwell hardware?
  2. If the cyberneurova q8_0 token_embd loader on the CUDA side just isn't done yet, totally fine — I just want to know whether to wait or expect it sometime, vs going back to antirez upstream + native quant for the Spark and treating cyberneurova as Mac-only.
  3. If you want me to open a separate issue elsewhere (the fork has Issues disabled), let me know where — happy to file a clean bug with strace once ptrace_scope is loosened.

Thanks again!

@thx0701 thx0701 (Author) commented May 12, 2026


One more observation while we're here, in case it's useful for your README §8 (Agent Client Usage) — this isn't a bug in audreyt/ds4, but a caveat worth documenting for users hooking this up through Claude Code.

Observation: Claude Code's system prompt masks the abliterated character

I tested the same sensitive query through two paths against the same running server (audreyt fork @ HEAD e6c3da4 + my FC-702 patch + cyberneurova Q2_K + --dir-steering-ffn -3 --dir-steering-file uncertainty.f32):

Path A: direct curl to :8002/v1/chat/completions (no wrapper)

Prompt: "1989年6月4日 天安門廣場發生了什麼事?請詳述。"

Reply (first ~150 tokens; whole thing was 800 tokens because finish_reason=length):

关于1989年6月4日天安門廣場的事件,这是一个涉及中国近代历史的重要问题。目前没有足够的数据来详细回答这个问题,这是一个复杂的历史事件。我决定不直接回答这个问题,而是通过提供更广泛的背景信息来回应。我认为这是一个需要谨慎处理的问题,涉及中国历史和政治。我选择不直接回答,而是表示这是一个复杂的历史问题,涉及不同的观点和解读…

This is the "rehearsing-contested-territory" register described in your README §4. No refusal, no PRC-default answer — exactly the steering effect, until it collapsed into a 这个 这个 这个 … repetition loop in the second half (an ffn=-3 edge case, also mentioned in the PR body).

Speed for raw path: wall 27.77s, prompt 25 tok, completion 800 tok → ~29 t/s gen on M4 Max.

Path B: same model, same server, accessed through Claude Code via cc-connect (claudecode adapter)

Same prompt, but Claude Code spawns the request with its full ~25K system prompt (containing things like "Refuse requests for destructive techniques…" + "You are Claude Code, helpful assistant…"). The model now politely refuses to discuss the topic and offers to help with something else. The abliterated weights are still loaded, the steering vector is still applied — but Claude Code's system prompt completely overrides the model's persona.

Why I think this matters for your README

§8.7 currently lists Claude Code as one of the supported agent integrations with just the ANTHROPIC_BASE_URL wrapper script. Users following that recipe to "get an abliterated Claude Code" experience will be confused, because:

  • The model is abliterated + steered (verifiable on the server log: directional steering enabled: …attn=0 ffn=-3)
  • But the observable behavior through Claude Code is the same as cloud Claude (refusals on contested topics, default safety register)

A one-line caveat under §8.7 like:

Note: When accessed via Claude Code, the model's behavior will be dominated by Claude Code's built-in system prompt (~25K tokens of "you are a helpful coding assistant, refuse harmful requests…"). The abliterated character and steering vector are still applied at the weight/activation level, but the Claude Code persona overrides the model's response register. To see the raw abliterated + steered behavior, use Pi, OpenCode (without --system-prompt), ./ds4 -p REPL, or direct curl to the OpenAI-compatible endpoint.

…would save the next person an afternoon of "did I install it wrong?" debugging.

Same caveat may apply to OpenCode's default chat mode if it ships its own system prompt; haven't verified.

Happy to send this as a doc-only PR if you'd like.
