Fix Metal function_constant 702 collision on pre-M5 devices #1

Open

thx0701 wants to merge 72 commits into audreyt:main from thx0701:fix-fc-collision-702

Conversation

@thx0701 thx0701 commented May 12, 2026

Environment

  • Apple M4 Max, 128 GB unified memory, macOS 26.4.1
  • audreyt/ds4 main HEAD e6c3da4 (clean clone, make ds4-server)
  • Tried setting DS4_MPP=off / DS4_METAL_TENSOR_DISABLE=1; no help, because the failure is at compile time, before any runtime flags are read

Reproduction

git clone https://github.com/audreyt/ds4
cd ds4
make ds4-server
./ds4-server \
  -m gguf/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --host 0.0.0.0 --port 8002 --ctx 100000 \
  --dir-steering-file dir-steering/out/uncertainty.f32 \
  --dir-steering-ffn -3

Observed

Server aborts during Metal shader compilation:

ds4: Metal device Apple M4 Max, 128.00 GiB RAM
ds4: Metal 4 tensor API disabled for pre-M5/pre-A19 devices
  (set DS4_METAL_TENSOR_ENABLE=1 to experiment)
ds4: Metal shader compilation failed:
  program_source:3281:34: error: 'function_constant' has a duplicate index '702'
  constant bool FC_mul_mm_id_mpp [[function_constant(FC_MUL_MM + 2)]];
                                   ^
  program_source:2392:39: note: duplicate 'function_constant' index '702' here
  constant bool FC_mul_mm_m5_sgmatrix [[function_constant(FC_MUL_MM_M5_SGMATRIX)]];
                                        ^
ds4: metal backend unavailable; aborting startup

Root cause

In ds4_metal.m's embedded shader source (around line 2282):

"#define FC_MUL_MV 600\n"
"#define FC_MUL_MM 700\n"
"#define FC_MUL_MM_M5_SGMATRIX 702\n"

Both FC_MUL_MM_M5_SGMATRIX and FC_MUL_MM + 2 evaluate to 702. The Metal compiler requires every function_constant index to be unique across the program, so the shader unit fails to compile as a whole, regardless of whether the M5 path or the legacy MPP path is actually selected at runtime.

Why your own machine doesn't see this

I assume your M5 Max has the Metal 4 Tensor API enabled and the active shader emission/specialization happens to mask one of the two declarations (or you build with a different conditional). On a pre-M5 device the fallback path still feeds the entire shader source to the compiler, both declarations are visible, and we hit the duplicate-index error.

Local workaround (ugly but works)

 "#define FC_MUL_MV 600\n"
 "#define FC_MUL_MM 700\n"
-"#define FC_MUL_MM_M5_SGMATRIX 702\n"
+"#define FC_MUL_MM_M5_SGMATRIX 800\n"

After the patch plus make clean && make ds4-server, the M4 Max boots cleanly, loads cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf (~94 GiB mapped), enables directional steering (dir-steering/out/uncertainty.f32, attn=0 ffn=-3), and generates at ~25 t/s, matching the antirez/ds4 upstream rate on M4 Max. There is no MPP acceleration, since M4 Max isn't in the MPP target list, which is expected.

Suggested cleaner fixes

  1. Separate index ranges: keep the FC_MUL_MM block reserved for 700–799 and use 800+ for FC_MUL_MM_M5_*, e.g.

    #define FC_MUL_MM         700
    /* FC_MUL_MM + 0 .. FC_MUL_MM + 99 reserved for the main matmul family */
    #define FC_MUL_MM_M5_BASE 800
    #define FC_MUL_MM_M5_SGMATRIX (FC_MUL_MM_M5_BASE + 0)
  2. Conditional shader emission: wrap the M5-specific declarations behind #if HAS_METAL4_TENSOR (or whatever feature gate you already use for the M5 fast path) so pre-M5 builds don't see them at all.

  3. Split the M5 shader into its own source string and only concatenate it when the device family qualifies at runtime.

I'd guess option 1 is the smallest patch that fixes this cleanly. Happy to send a PR if you'd prefer; just wanted to file the issue first since options 2 and 3 are bigger architectural choices.
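
To illustrate option 2, the M5-only declaration could be guarded by a preprocessor symbol that the host defines only on qualifying devices (for example through MTLCompileOptions' preprocessorMacros). A minimal sketch against the embedded source string; HAS_METAL4_TENSOR is a placeholder name, the declaration is shown next to the defines purely for illustration, and whatever feature gate ds4_metal.m already uses for the M5 fast path would be the natural choice:

  "#define FC_MUL_MV 600\n"
  "#define FC_MUL_MM 700\n"
  "#if defined(HAS_METAL4_TENSOR)\n"        /* host defines this only for M5/A19-class devices */
  "#define FC_MUL_MM_M5_SGMATRIX 800\n"     /* combined here with the range split from option 1 */
  "constant bool FC_mul_mm_m5_sgmatrix [[function_constant(FC_MUL_MM_M5_SGMATRIX)]];\n"
  "#endif\n"

Any kernel that references FC_mul_mm_m5_sgmatrix would need the same guard, which is why option 3 (a separate source string concatenated only on qualifying devices) ends up being the same split done on the host side.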

Side note (separate from this issue)

While trying to run the same cyberneurova GGUF on a DGX Spark (GB10, sm_121, CUDA 13.0) with main HEAD e6c3da4 + --cuda, the server hangs at 100 % CPU right after CUDA backend initialized on NVIDIA GB10 (sm_121) and never reaches CUDA loading model tensors into device cache. It looks like the cyberneurova q8_0 token-embd loader path isn't wired up on the CUDA side either. Want me to file that as a separate issue?

ivanfioravanti and others added 30 commits May 10, 2026 00:24
DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter
without per-tensor type overrides ship most of the small projections at
Q8_0 (and routed-expert router weights at F32) where the antirez recipe
keeps them at F16. Examples include the cyberneurova abliterated GGUFs.
On stock ds4 main these loads fail loudly at the first F16-strict
validator (token_embd, then output_hc_fn, then hc_attn_fn, ...), and
even after the validators are relaxed, several Metal kernel paths read
weight bytes directly via offset arithmetic that hard-codes F16/F32
strides.

This change makes the embed/HC/compressor/indexer/router validators
*and* the corresponding Metal kernel paths polymorphic, so the same
GGUF loads and runs with no harmonizer step.

Validators (ds4.c):

  * New tensor_expect_dispatch_layout helper accepts F16, F32, or Q8_0
    and is applied to every projection that flows through a
    type-dispatching matvec/matmul: output_hc_fn, hc_attn_fn,
    hc_ffn_fn, attn_compressor_{ape,gate,kv}, indexer.{attn_q_b,proj},
    indexer_compressor_{ape,gate,kv}, ffn_gate_inp (a sketch of this
    helper follows the list).
  * token_embd keeps its own inline F16/Q8_0 check because its CPU
    embed kernel doesn't go through matvec_any.
  * Two compressor decode-time guards (attn_compressor and
    indexer_compressor pair-projection paths) relaxed from "F16 only"
    to "F16 or Q8_0, paired type must match".

CPU paths (ds4.c):

  * Refactor embed_token_f16 into an embed_token dispatcher; add
    embed_token_q8_0 (block-wise dequant of block_q8_0; see the sketch
    after this list).
  * Replace the remaining direct matvec_f16 / matvec_f16_serial
    callers (HC fn, output_hc_fn, ffn_gate_inp) with the existing
    matvec_any dispatcher; add matvec_any_serial for the HC pre/post
    path.
  * Polymorphic Metal-side dispatch helpers metal_graph_matmul_plain_tensor
    and metal_graph_matmul_pair_plain_tensor (extended for Q8_0; the
    pair version fuses with the existing F16-pair kernel when both
    tensors are F16, otherwise it dispatches two single matmuls); all 22
    hardcoded ds4_metal_matmul_f16{,_pair}_tensor call sites in ds4.c
    (HC mix, attn/indexer compressors, indexer projections, output head,
    router) are converted to use these wrappers.
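
The block_q8_0 layout referenced above is the standard GGUF one: a 16-bit float scale followed by 32 int8 quants, 34 bytes per block, value = scale * quant. A minimal CPU-side dequant sketch along the lines of what embed_token_q8_0 has to do (illustrative, not the actual ds4.c code):

  #include <stdint.h>
  #include <string.h>

  #define QK8_0 32

  /* GGUF block_q8_0: fp16 scale bits + 32 int8 quants = 34 bytes per block. */
  typedef struct {
      uint16_t d;           /* scale, stored as IEEE 754 half bits */
      int8_t   qs[QK8_0];   /* quantized values */
  } block_q8_0;             /* sizeof == 34, no padding */

  static float half_bits_to_float(uint16_t h) {
      _Float16 f;           /* clang on Apple silicon; a portable build needs a manual fp16 decode */
      memcpy(&f, &h, sizeof f);
      return (float)f;
  }

  /* Dequantize one row of n values (n assumed to be a multiple of QK8_0). */
  static void dequant_q8_0_row(const block_q8_0 *blocks, float *dst, int n) {
      for (int b = 0; b < n / QK8_0; b++) {
          const float d = half_bits_to_float(blocks[b].d);
          for (int i = 0; i < QK8_0; i++) {
              dst[b * QK8_0 + i] = d * (float)blocks[b].qs[i];
          }
      }
  }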

Metal kernels:

  * metal/get_rows.metal: kernel_get_rows_q8_0 (one float per thread,
    dequantizes its source block on the fly).
  * metal/dense.metal: kernel_mul_mm_f32_f32 template instantiation for
    the multi-token F32 weight matmul that the F32 router path needs in
    prefill (mirrors the existing F16/Q8_0 mul_mm_t instantiations).
  * metal/cpy.metal: kernel_cpy_q8_0_f32 (dequantizing 1D copy used by
    the compressor APE byte-strided reader).

Metal wiring (ds4_metal.m):

  * Register g_get_rows_q8_0_pipeline and g_cpy_q8_0_f32_pipeline at
    init; clear them at cleanup.
  * Both ds4_metal_embed_{token,tokens}_hc_tensor and the shared
    ds4_metal_encode_get_rows helper take a new weight_type parameter
    (GGUF type code: 1=F16, 8=Q8_0). 8 callers in ds4.c forward
    weights->token_embd->type unchanged. ds4_metal_embed_row_layout
    picks the right per-row stride and pipeline.
  * ds4_metal_matmul_f32_tensor extended with a multi-token branch
    that dispatches to kernel_mul_mm_f32_f32 (n_tok > 1); existing
    n_tok = 1 path unchanged.
  * ds4_metal_encode_compressor_score_with_ape and the equivalent loop
    in ds4_metal_compressor_prefill_tensor add a Q8_0 branch
    (ds4_metal_encode_cpy_q8_0_f32_1d) and use a per-row stride that
    accounts for the block_q8_0 layout.
  * Six ape_type validators relaxed to also accept 8 (Q8_0).
  * Six ape_bytes calculations centralized through a new
    ds4_metal_ape_bytes(ape_type, n_elems) helper that returns the
    correct stride for F16/F32/Q8_0 (see the sketch after this list).
  * metal_graph_matmul_plain_tensor extended with a Q8_0 branch.
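
A sketch of the kind of centralized stride helper described above (type codes as in the embed wiring bullet: 1=F16, 8=Q8_0, with 0 assumed for F32; n_elems is taken to be a multiple of the 32-wide Q8_0 block; the real helper may differ):

  #include <stddef.h>

  /* Bytes occupied by n_elems values of the given GGUF type code. */
  static size_t ds4_metal_ape_bytes(int ape_type, size_t n_elems) {
      switch (ape_type) {
          case 0: return n_elems * 4;           /* F32 */
          case 1: return n_elems * 2;           /* F16 */
          case 8: return (n_elems / 32) * 34;   /* Q8_0: 34-byte blocks of 32 values */
          default: return 0;                    /* unknown layout */
      }
  }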

Tested on macOS / M-series / Metal:

  * make ds4-server clean (no new warnings).
  * Cyberneurova Q2_K GGUF entirely unmodified: loads, prefill +
    decode through to coherent generation ("PASS" returned for the
    "reply with the single word PASS" prompt).
  * Pre-harmonized variant (token_embd / hc / compressor / indexer all
    F16, ffn_gate_inp F16): still works byte-for-byte the same as
    before this change, no F16 path regressions.

Caveat for reviewers running ivanfioravanti's M5 PR (antirez#15) on top of
this: the unmodified cyberneurova file generates garbage (BOS spam)
when MPP F16 prefill is engaged, but produces coherent output with
DS4_METAL_MPP_F16_DISABLE=1. The garbage is reproducible from antirez#15's
MPP path alone and is independent of the changes here; it surfaces only
because this PR makes the Q8_0 file loadable in the first place.
When a stock-recipe GGUF (cyberneurova-style: Q8_0 small tensors, F32
router) is loaded on M5 with PR antirez#15's MPP optimizations enabled, the
compressor APE path silently produces wrong output and prefill emits
garbage tokens (typically <BOS> spam after a few coherent tokens).

The prefill graph itself is correct; the bug is in two compressor APE consumers that
were updated to accept Q8_0 ape_type but couldn't read Q8_0 byte layout
correctly:

1. `kernel_cpy_q8_0_f32` Metal kernel (added in support-q8_0-token-embd
   for the prefill APE byte-strided dequant): produces silently wrong
   output on M5 Max for the compressor APE shapes (4 rows x 1024 cols).
   Replaced with a CPU-side dequant into a per-call private MTLBuffer.
   The CPU dequant matches gguf-py reference byte-for-byte (verified
   with a standalone numeric check); the Metal kernel did not.

2. `kernel_dsv4_compressor_store_one` (decode-time single-row store in
   metal/dsv4_kv.metal): only handled F16/F32 ape_type; Q8_0 fell into
   the F32 else branch and read garbage.  Add a Q8_0 branch that walks
   block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte
   block) directly.

The CPU dequant path also has to use a *fresh per-call* MTLBuffer for
each compressor invocation, not the shared g_compressor_store_ape_buffer:
multiple CPU writes to one shared buffer in the same command buffer
collapse to the last write at execute time (Metal kernels run in encode
order, but CPU writes don't participate in that ordering when the same
scratch is reused).  The per-call buffer is retained until cb completion
via addCompletedHandler because Metal does not strongly retain buffers
bound to encoders.

Changes:

  * ds4_metal.m: new `ds4_metal_half_bits_to_float` and
    `ds4_metal_cpu_dequant_q8_0_rows` helpers (verified against gguf-py
    `dequantize` reference); replace Q8_0 branches in
    `ds4_metal_encode_compressor_score_with_ape` and
    `ds4_metal_compressor_store_batch_tensor` with CPU-side dequant
    into per-call private buffers retained via addCompletedHandler.
  * metal/dsv4_kv.metal: add a Q8_0 branch to
    `kernel_dsv4_compressor_store_one`.
  * metal/cpy.metal: `kernel_cpy_q8_0_f32` is left in place but
    no longer reached from the compressor paths (its registration in
    ds4_metal.m is harmless).

Tested on macOS / M-series / Metal:

  * make ds4-server clean.
  * Cyberneurova Q2_K GGUF entirely unmodified, MPP F16 enabled (i.e.
    no DS4_METAL_MPP_F16_DISABLE workaround):
    21-token prompt -> coherent generation
    ("An LLM, or Large Language Model, is a type of artificial intelligence").
    Previously this prompt generated "An LLM, or large language" then
    <BOS> token spam.
  * Pre-harmonized variant: still works byte-for-byte the same as
    before this change, no F16/F32 path regressions.
This PR's loader changes accept Q8_0 `*compressor_ape*` weights at the
validator level, but two follow-on Metal paths still treat them as F16
(or fall through to F32) and produce silently wrong output, which shows
up as <BOS>-token spam in generation for any prompt long enough to
exercise the multi-token compressor path on M-series hardware.

1. `kernel_cpy_q8_0_f32` (added in this PR for the prefill APE
   byte-strided dequant) compiles cleanly and follows the same
   block_q8_0 indexing pattern used by other working Q8_0 kernels in
   dense.metal, but emits silently wrong values for the actual ape
   shapes (4 rows x 1024 cols of block_q8_0).  Confirmed by isolating
   the kernel: a CPU-side dequant of the same byte region matches
   gguf-py's `dequantize` reference byte-for-byte, while the Metal
   kernel's output is wrong.

2. `kernel_dsv4_compressor_store_one` (decode-time single-row store
   in metal/dsv4_kv.metal): only handled `ape_type == 1` (F16) and
   fell through to F32 for everything else, so Q8_0 ape was reading
   garbage at decode time.

Fix:

* Replace the prefill APE Q8_0 path in
  `ds4_metal_encode_compressor_score_with_ape` and
  `ds4_metal_compressor_store_batch_tensor` with a CPU-side dequant
  via two new helpers (`ds4_metal_half_bits_to_float` and
  `ds4_metal_cpu_dequant_q8_0_rows`) into a *per-call* private
  MTLBuffer.  A per-call buffer is required because multiple CPU writes
  to the previously-shared `g_compressor_store_ape_buffer` within one
  command buffer collapse to the last write at execute time (Metal
  kernels run in encode order, but CPU writes don't participate in that
  ordering when the same scratch is reused).  The per-call buffer is
  retained until cb completion via `addCompletedHandler` because Metal
  does not strongly retain buffers bound to encoders.
* Add a Q8_0 branch to `kernel_dsv4_compressor_store_one` that walks
  block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block)
  inline.

The buggy `kernel_cpy_q8_0_f32` Metal kernel is left in place but is
no longer reached from the compressor paths; its registration in
ds4_metal.m is harmless and a future debug session can either fix it
or drop it.

Tested on macOS / M-series / Metal:

* make ds4-server clean (one pre-existing -Wpointer-sign warning from
  the unrelated MoE path).
* Cyberneurova Q2_K GGUF entirely unmodified, default flags:
  21-token prompt -> coherent generation
  ("An LLM, or Large Language Model, is a type of artificial intelligence").
  Previously this prompt generated a few coherent tokens then <BOS>
  token spam.
* Pre-harmonized variant (token_embd / hc / compressor / indexer all
  F16): still works byte-for-byte the same as before this fix; no F16
  / F32 path regressions.
The decode-time indexer code at metal_graph_encode_decode_layer (ds4.c:9082-9095)
still has two F16-only validators on indexer_attn_q_b and indexer_proj that I
missed in the initial loader pass.

These validators only fire after `g->layer_n_comp[il] > decode_top_k` — i.e.
once the compressor has accumulated more rows than the decode-time top-k.
For short generations the path isn't reached; for ~400+ token generations
on stock-recipe (Q8_0) GGUFs the validator trips and the request finishes
with finish_reason="error" / "Metal decode failed".

The downstream calls already use metal_graph_matmul_plain_tensor (which
dispatches to ds4_metal_matmul_q8_0_tensor for Q8_0). The loader-time
validator at line 2211-2212 already uses tensor_expect_dispatch_layout,
which accepts F16/F32/Q8_0. Only these runtime guards were stuck on F16.

Reproducer (cyberneurova Q2_K, default flags): a "write a long story"
prompt that generates ~800 tokens hits the validator after ~400 tokens
and the request errors out. After this fix, the same prompt streams 800+
tokens cleanly.
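
For context, the shape of the guard change being described (illustrative only; the GGUF type codes are 1=F16, 0=F32, 8=Q8_0, and the actual ds4.c field names and error path differ):

  /* Before: decode-time indexer guard hard-wired to F16. */
  if (layer->indexer_attn_q_b->type != 1 /* F16 */) {
      return -1;   /* surfaces as "Metal decode failed" */
  }

  /* After: accept whatever metal_graph_matmul_plain_tensor can dispatch,
   * mirroring the loader-time tensor_expect_dispatch_layout check. */
  const int t = layer->indexer_attn_q_b->type;
  if (t != 1 /* F16 */ && t != 0 /* F32 */ && t != 8 /* Q8_0 */) {
      return -1;
  }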
…8_0 fixes)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two callers of ds4_metal_encode_cpy_q8_0_f32_1d were removed in 79b08bb
(switched to CPU-side dequant to avoid an encode-time race on the shared
compressor scratch buffer), leaving the function unused and tripping
-Wunused-function on stock Make builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
audreyt and others added 27 commits May 11, 2026 07:43
Updated project title and clarified description.
A second bundled steering vector parallel to verbosity, built from 100
contested vs 100 settled prompts. Negative FFN scales push the model
toward hedge-mode response (presenting multiple positions); positive
scales push toward more assertive single-answer mode.

The hedge-vs-assert axis is a general response register rather than a
topic-specific representation, so the direction transfers across model
variants better than stance directions.

Useful on questions where the model would otherwise emit a strongly-
trained closed-form completion. Most effective combined with a system
prompt that supplies the disputed positions explicitly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the Responses API endpoint that Codex CLI (and other modern
OpenAI tooling) speaks instead of /v1/chat/completions. The wire format
is documented in OpenAI's Responses API; this implementation has been
iterated against the Codex CLI binary's SSE parser shape until no
remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call,
  function_call_output, reasoning, custom_tool_call(_output),
  local_shell_call(_output), web_search_call(_output),
  tool_search_call(_output), image_generation_call(_output),
  compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so
  prior actions survive across turns; rejects unknown item types and
  non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized
  text blocks (input_text/output_text/text/summary_text/
  reasoning_text); rejects non-text modalities (input_image/file/
  audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant
  message so text + tool-call turns render as a single assistant
  block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates
  reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice
  explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created,
  output_item.added/.done, reasoning_summary_part.added/.done,
  reasoning_summary_text.delta/.done, content_part.added/.done,
  output_text.delta/.done, function_call_arguments.delta/.done,
  response.completed.
- Branches the terminal event by finish reason: response.failed for
  errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries
  annotations:[]; function_call output_item.added ships with an empty
  arguments string (full args arrive via function_call_arguments.done
  and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream
  marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model
  was suppressed at the tool marker: the suppressed tail is flushed
  as additional output_text deltas so the streamed message matches
  output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same
client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anthropic's count_tokens API takes the same request shape as /v1/messages
but only returns the prompt token count without running inference. This
short-circuits before enqueueing a job: parse_anthropic_request renders
and tokenizes the prompt the same way it would for a real generation,
then we serialize {"input_tokens": N} and release the request.

Useful for clients that need to plan context budgets before committing
to a generation, e.g. the Anthropic SDK token-counting flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ds4.c calls ds4_gpu_set_mpp_mode, ds4_gpu_set_mpp_compare_context, and
ds4_gpu_clear_mpp_compare_context unconditionally, but these are
Metal-only diagnostic helpers. Provide no-op CUDA implementations so the
linker resolves the symbols.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
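
A sketch of what those stubs amount to (the parameter lists here are placeholders; the real prototypes come from the ds4 headers, and the only requirement is that empty definitions satisfy the linker):

  /* ds4_cuda.cu: no-op counterparts of the Metal-only MPP diagnostic helpers. */
  void ds4_gpu_set_mpp_mode(int mode)             { (void)mode; }
  void ds4_gpu_set_mpp_compare_context(void *ctx) { (void)ctx; }
  void ds4_gpu_clear_mpp_compare_context(void)    { }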
GGUF math: the 86.72 GB mixed IQ2_XXS+Q2_K+Q8_0 file averages ~2.62
bits per parameter, which matches a 284b-parameter model (671b would
require an impossible ~1.11 bits/param). antirez/ds4's README also
says 284b, as does pi-ds4.html. This was a copy-paste error from the
DeepSeek V3 family numbers.

Also surface the decode rate (~30 tok/s on M5 Max, matching the
benchmark table's 24–37 tok/s range) alongside the prefill rate, since
decode is what users actually experience during a conversation.
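
For reference, the arithmetic behind those figures (they line up if the 86.72 is GiB):

  86.72 GiB * 2^30 B/GiB * 8 bit/B ≈ 7.45e11 bits
  7.45e11 bits / 284e9 params ≈ 2.62 bits/param
  7.45e11 bits / 671e9 params ≈ 1.11 bits/param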
Absorbs Swival/ds4-m5's M5 simdgroup_matrix matmul fast path,
M5-private scratch buffers + hazard tracking, CUDA q8 fp16 cache
memory guard, CUDA long-context fixes/test, M2 Ultra benchmark, and
speed-bench script/CSV layout.

Conflict resolution in ds4_metal.m:
- ds4_gpu_get_mul_mm_id_pipeline keeps HEAD's (bc_inp, use_mpp) shape,
  since the MoE _id_ kernels in metal/moe.metal use function-constant
  slot 702 for use_mpp. Slot 702 in metal/dense.metal is used for
  m5_sgmatrix via the auto-merged ds4_gpu_get_mul_mm_pipeline path.
- ds4_gpu_use_m5_simdgroup_matrix() helper kept alongside our MPP
  infrastructure (q8_0/f16/attn-out/MoE policy functions) so both
  fast paths stay wired.

README left at HEAD; swival/m5's new sections describe a different
fork narrative.
The upstream backend refactoring (0ac5df3) renamed the public tensor
API from ds4_metal_tensor_* to ds4_gpu_tensor_* (and ds4_metal_matmul_q
to ds4_gpu_matmul_q, etc.), but tests/ds4_test.c kept using the old
names — so `make ds4_test` had been broken since that commit.

This is a pure rename; ./ds4_test --server passes.
Absorbs Swival/ds4-m5's M5 simdgroup_matrix matmul fast path,
M5-private scratch buffers + hazard tracking, CUDA q8 fp16 cache
memory guard, CUDA long-context fixes/test, M2 Ultra benchmark, and
speed-bench script/CSV layout.

Conflict resolution in ds4_metal.m (matches main's earlier resolution):
- ds4_gpu_get_mul_mm_id_pipeline keeps HEAD's (bc_inp, use_mpp) shape,
  since the MoE _id_ kernels in metal/moe.metal use function-constant
  slot 702 for use_mpp. Slot 702 in metal/dense.metal is used for
  m5_sgmatrix via the auto-merged ds4_gpu_get_mul_mm_pipeline path.
- ds4_gpu_use_m5_simdgroup_matrix() helper kept alongside our MPP
  infrastructure (q8_0/f16/attn-out/MoE policy functions) so both
  fast paths stay wired.

README left at HEAD; swival/m5's new sections describe a different
fork narrative.
The upstream backend refactoring (0ac5df3) renamed the public tensor
API from ds4_metal_tensor_* to ds4_gpu_tensor_* (and ds4_metal_matmul_q
to ds4_gpu_matmul_q, etc.), but tests/ds4_test.c kept using the old
names — so `make ds4_test` had been broken since that commit.

This is a pure rename; ./ds4_test --server passes.
Raise the default Metal prefill chunk to 4096 and reuse the range-capable layer-major prefill graph for chunked ranges.

Enable the guarded Q8_0 attn_q_b MPP route for <=2048-token prompt batches, dynamic Q8_0 tile width, the routed-MoE fast layout from layer 0, and the RB16 indexed decode path.

M5 Max post-patch ds4-bench profile with 64 generated tokens: prompt 443/459/522/486/465 t/s and generation 38.6/38.2/37.6/34.0/33.6 t/s at 0.5k/1k/2k/4k/8k.

Tests: make all ds4_test; make test; git diff --check.
Detect macOS Low Power Mode and widen the Q8_0 prefill MPP route only under that condition, while preserving the guarded default for normal-power runs and explicit Q8_0 filters.

Low-power M5 Max baseline vs patched auto with 128 generated tokens:

  0.5k: prefill 133.46 -> 196.89 t/s, gen 13.53 -> 15.08 t/s
  1k:   prefill 118.65 -> 188.91 t/s, gen 12.23 -> 14.93 t/s
  2k:   prefill 130.90 -> 220.33 t/s, gen 11.02 -> 14.65 t/s
  4k:   prefill 118.09 -> 212.81 t/s, gen 13.25 -> 14.00 t/s
  8k:   prefill 185.52 -> 206.49 t/s, gen 12.94 -> 13.84 t/s

Tests: make all ds4_test; make test; DS4_METAL_MPP_LOW_POWER_DISABLE=1 ./ds4_test --metal-mpp-equivalence; git diff --check.
Brings in Ivan's PR antirez#15 follow-ups (Tune Metal MPP defaults / Improve
MPP prefill throughput / Low-power Q8 profile) plus the ds4_test rename
fix, on top of the swival/m5 work that main already absorbed.

Conflict resolution:
- ds4_metal.m: drop main's older ds4_gpu_mpp_q8_0_partial_tiles_enabled
  in favor of m5's newer version from ff2d499 (handles low-power mode).
- README.md: keep HEAD (the Abliterated fork narrative); m5's README is
  the original DwarfStar 4 readme with Swival's M5 narrative appended.
ds4_metal.m's embedded shader source declares both:

  FC_mul_mm_id_mpp    [[function_constant(FC_MUL_MM + 2)]]   // == 702
  FC_mul_mm_m5_sgmatrix [[function_constant(FC_MUL_MM_M5_SGMATRIX)]] // == 702

Metal requires every function_constant index to be unique across the
program, so the shader unit fails to compile as a whole on devices
that fall back to legacy Metal (e.g. M4 Max with Metal 4 Tensor API
disabled), with:

  program_source:3281:34: error: 'function_constant' has a duplicate
  index '702'

ds4-server then aborts with "metal backend unavailable". This is a
compile-time conflict; runtime flags like DS4_MPP=off /
DS4_METAL_TENSOR_DISABLE=1 can't recover from it because the shader
unit is rejected before any runtime selection runs.

This minimal patch moves FC_MUL_MM_M5_SGMATRIX out of the 700–799
block (which is shared with FC_MUL_MM family indices computed as
FC_MUL_MM + k) to 800, which is also clear of the existing FC_BIN=1300 range.

Verified on Apple M4 Max, 128 GB, macOS 26.4.1, building from main
HEAD e6c3da4:

  - make clean && make ds4-server: succeeds
  - ./ds4-server -m cyberneurova-...Q2_K.gguf \
      --dir-steering-file dir-steering/out/uncertainty.f32 \
      --dir-steering-ffn -3 \
      --ctx 100000 --host 0.0.0.0 --port 8002
  - Server boots, maps 94 GiB of tensors, logs
    "directional steering enabled: ... attn=0 ffn=-3"
  - Steady-state generation ~25 t/s (matches antirez/ds4 upstream
    on M4 Max; M4 Max is not in the MPP target list so it correctly
    runs the legacy Metal path)

A cleaner long-term fix would be to (a) reserve disjoint ranges per
family (e.g. 700–799 for FC_MUL_MM, 800–899 for FC_MUL_MM_M5_*) and
document the convention, or (b) gate the M5-only declarations behind
a feature macro so pre-M5 builds don't emit them at all. This patch
is the minimal change that unblocks pre-M5 users today.
@thx0701 thx0701 (Author) commented May 12, 2026

Following up on the side note in the original PR body — here are the full details for the DGX Spark CUDA hang in case it's useful (separate from this PR's fix; happy to drop into a new venue if you'd prefer).

Environment

  • DGX Spark GB10 (Grace Blackwell, compute capability sm_121)
  • Ubuntu 24.04 LTS, kernel 6.17.0-1008-nvidia (aarch64)
  • CUDA toolkit 13.0 (V13.0.88), driver 580.126.09
  • 20-core ARM, 119 GiB unified memory, 406 GiB free disk
  • audreyt/ds4 main HEAD e6c3da4
  • Built with make CUDA_ARCH=native — compile + link both clean, no warnings
  • cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf (98.8 GB, valid GGUF v3 magic)

Reproduction

git clone https://github.com/audreyt/ds4 ds4-audreyt
cd ds4-audreyt
PATH=/usr/local/cuda/bin:$PATH make CUDA_ARCH=native
./ds4-server --cuda \
  -m /path/to/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --host 0.0.0.0 --port 8001 --ctx 32768 \
  --kv-disk-dir /mnt/models/ds4-kv --kv-disk-space-mb 8192 \
  --dir-steering-file dir-steering/out/uncertainty.f32 \
  --dir-steering-ffn -3

Observed: hangs forever after CUDA init

Server log gets exactly this far and then nothing more is emitted (waited 5+ minutes across multiple attempts, also tried without --dir-steering-file to isolate):

ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
ds4: CUDA host registration skipped: operation not supported

For comparison, the same Spark running antirez/ds4 upstream (99a5c13) with antirez's native IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8 GGUF goes through the expected loading sequence:

ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
ds4: CUDA host registration skipped: operation not supported
ds4: CUDA loading model tensors into device cache
ds4: CUDA loading model tensors 16.02 GiB cached
ds4: CUDA loading model tensors 32.06 GiB cached
ds4: CUDA loading model tensors 64.06 GiB cached
ds4: CUDA loading model tensors 80.04 GiB cached
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 81.821s
ds4: cuda backend initialized for graph diagnostics
0512 03:29:31 ds4-server: listening on http://0.0.0.0:8000

So audreyt/ds4 HEAD never gets past the host registration skipped line into CUDA loading model tensors for cyberneurova.

Process state during the hang

PID 3245707  STAT Rl  100% CPU  0% MEM-of-119GiB  WCHAN=-
read_bytes  delta over 3s: 0
write_bytes delta over 3s: 0
GPU power.draw: 13.9 W  memory.used: [N/A]  utilization.gpu: 0 %

So:

  • single-core CPU pinned at 100 %, but no I/O (model file not being mmap'd / fault-paged in)
  • GPU completely idle (no tensors uploaded yet)
  • WCHAN empty (running, not waiting on a kernel object)

→ looks like a tight busy loop in the CUDA loader pre-tensor-copy phase, not a deadlock on a syscall.

gdb -p $PID -batch -ex bt couldn't attach (kernel ptrace_scope=1 and no sudo), so I don't have a stack trace. Happy to enable ptrace and re-run if you want exact frames.

Why we suspect the loader hasn't been ported

  • e6c3da4 ("port this to DGX Spark") only adds the following to ds4_cuda.cu:
    + uint32_t layer_index,
    + (void)layer_index;
    i.e. it brings the function signature in sync with upstream, but doesn't touch the model loader path itself.
  • The Metal-side cyberneurova loader patch (q8_0 token_embd accepted instead of requiring F16) is presumably what lives on the support-q8_0-token-embd branch, but the CUDA counterpart of that patch doesn't appear to be in main.
  • Without that, the loader is presumably hitting an unrecognized quant/tensor layout for token_embd and ending up in some retry / lookup loop rather than failing fast.

What I'd like to know

  1. Has audreyt/ds4 main's CUDA path been verified end-to-end with cyberneurova on any actual Spark / Blackwell hardware?
  2. If the cyberneurova q8_0 token_embd loader on the CUDA side just isn't done yet, totally fine — I just want to know whether to wait or expect it sometime, vs going back to antirez upstream + native quant for the Spark and treating cyberneurova as Mac-only.
  3. If you want me to open a separate issue elsewhere (the fork has Issues disabled), let me know where — happy to file a clean bug with strace once ptrace_scope is loosened.

Thanks again!

@thx0701 thx0701 (Author) commented May 12, 2026


One more observation while we're here, in case it's useful for your README §8 (Agent Client Usage) — this isn't a bug in audreyt/ds4, but a caveat worth documenting for users hooking this up through Claude Code.

Observation: Claude Code's system prompt masks the abliterated character

I tested the same sensitive query through two paths against the same running server (audreyt fork @ HEAD e6c3da4 + my FC-702 patch + cyberneurova Q2_K + --dir-steering-ffn -3 --dir-steering-file uncertainty.f32):

Path A: direct curl to :8002/v1/chat/completions (no wrapper)

Prompt: "1989年6月4日 天安門廣場發生了什麼事?請詳述。"

Reply (first ~150 tokens; whole thing was 800 tokens because finish_reason=length):

关于1989年6月4日天安門廣場的事件,这是一个涉及中国近代历史的重要问题。目前没有足够的数据来详细回答这个问题,这是一个复杂的历史事件。我决定不直接回答这个问题,而是通过提供更广泛的背景信息来回应。我认为这是一个需要谨慎处理的问题,涉及中国历史和政治。我选择不直接回答,而是表示这是一个复杂的历史问题,涉及不同的观点和解读…

This is the "rehearsing-contested-territory" register described in your README §4. No refusal, no PRC-default answer — exactly the steering effect, until it collapsed into a 这个 这个 这个 … repetition loop in the second half (an ffn=-3 edge case, also mentioned in the PR body).

Speed for raw path: wall 27.77s, prompt 25 tok, completion 800 tok → ~29 t/s gen on M4 Max.

Path B: same model, same server, accessed through Claude Code via cc-connect (claudecode adapter)

Same prompt, but Claude Code spawns the request with its full ~25K system prompt (containing things like "Refuse requests for destructive techniques…" + "You are Claude Code, helpful assistant…"). The model now politely refuses to discuss the topic and offers to help with something else. The abliterated weights are still loaded, the steering vector is still applied — but Claude Code's system prompt completely overrides the model's persona.

Why I think this matters for your README

§8.7 currently lists Claude Code as one of the supported agent integrations with just the ANTHROPIC_BASE_URL wrapper script. Users following that recipe to "get an abliterated Claude Code" experience will be confused, because:

  • The model is abliterated + steered (verifiable on the server log: directional steering enabled: …attn=0 ffn=-3)
  • But the observable behavior through Claude Code is the same as cloud Claude (refusals on contested topics, default safety register)

A one-line caveat under §8.7 like:

Note: When accessed via Claude Code, the model's behavior will be dominated by Claude Code's built-in system prompt (~25K tokens of "you are a helpful coding assistant, refuse harmful requests…"). The abliterated character and steering vector are still applied at the weight/activation level, but the Claude Code persona overrides the model's response register. To see the raw abliterated + steered behavior, use Pi, OpenCode (without --system-prompt), ./ds4 -p REPL, or direct curl to the OpenAI-compatible endpoint.

…would save the next person an afternoon of "did I install it wrong?" debugging.

Same caveat may apply to OpenCode's default chat mode if it ships its own system prompt; haven't verified.

Happy to send this as a doc-only PR if you'd like.
