Fix Metal function_constant 702 collision on pre-M5 devices #1
Conversation
DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter
without per-tensor type overrides ship most of the small projections at
Q8_0 (and routed-expert router weights at F32) where the antirez recipe
keeps them at F16. Examples include the cyberneurova abliterated GGUFs.
On stock ds4 main these GGUFs fail loudly at the first F16-strict
validator (token_embd, then output_hc_fn, then hc_attn_fn, ...), and
even after the validators are relaxed, several Metal kernel paths read
weight bytes directly via offset arithmetic that hard-codes F16/F32
strides.
This change makes the embed/HC/compressor/indexer/router validators
*and* the corresponding Metal kernel paths polymorphic, so the same
GGUF loads and runs with no harmonizer step.
Validators (ds4.c):
* New tensor_expect_dispatch_layout helper accepts F16, F32, or Q8_0
and is applied to every projection that flows through a
type-dispatching matvec/matmul: output_hc_fn, hc_attn_fn,
hc_ffn_fn, attn_compressor_{ape,gate,kv}, indexer.{attn_q_b,proj},
indexer_compressor_{ape,gate,kv}, ffn_gate_inp.
* token_embd keeps its own inline F16/Q8_0 check because its CPU
embed kernel doesn't go through matvec_any.
* Two compressor decode-time guards (attn_compressor and
indexer_compressor pair-projection paths) relaxed from "F16 only"
to "F16 or Q8_0, paired type must match".
CPU paths (ds4.c):
* Refactor embed_token_f16 into an embed_token dispatcher; add
embed_token_q8_0 (block-wise dequant of block_q8_0).
* Replace the remaining direct matvec_f16 / matvec_f16_serial
callers (HC fn, output_hc_fn, ffn_gate_inp) with the existing
matvec_any dispatcher; add matvec_any_serial for the HC pre/post
path.
* Polymorphic Metal-side dispatch helpers metal_graph_matmul_plain_tensor
and metal_graph_matmul_pair_plain_tensor (extended for Q8_0; the
pair fuses with the existing F16-pair kernel when both tensors are
F16, otherwise dispatches to two single matmuls). All 22 hardcoded
ds4_metal_matmul_f16{,_pair}_tensor call sites in ds4.c (HC mix,
attn/indexer compressors, indexer projections, output head, router)
converted to use these wrappers.
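The fuse-or-split decision in the pair wrapper is the interesting part; a
sketch of that control flow follows, with the ds4 tensor type and both callee
signatures reduced to stand-ins because the real argument lists aren't part of
this description:

```c
/* Sketch of metal_graph_matmul_pair_plain_tensor's dispatch policy only.
 * "tensor" and the callee signatures are stand-ins, not ds4's real API. */
typedef struct { int type; /* 0=F32, 1=F16, 8=Q8_0 */ } tensor;

void ds4_metal_matmul_f16_pair_tensor(const tensor *a, const tensor *b); /* existing fused F16-pair kernel */
void metal_graph_matmul_plain_tensor(const tensor *t);                   /* single type-dispatched matmul */

void metal_graph_matmul_pair_plain_tensor(const tensor *a, const tensor *b) {
    if (a->type == 1 && b->type == 1) {
        ds4_metal_matmul_f16_pair_tensor(a, b);    /* both F16: keep the fused pair path */
    } else {
        metal_graph_matmul_plain_tensor(a);        /* anything else: two single matmuls */
        metal_graph_matmul_plain_tensor(b);
    }
}
```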
Metal kernels:
* metal/get_rows.metal: kernel_get_rows_q8_0 (one float per thread,
dequantizes its source block on the fly).
* metal/dense.metal: kernel_mul_mm_f32_f32 template instantiation for
the multi-token F32 weight matmul that the F32 router path needs in
prefill (mirrors the existing F16/Q8_0 mul_mm_t instantiations).
* metal/cpy.metal: kernel_cpy_q8_0_f32 (dequantizing 1D copy used by
the compressor APE byte-strided reader).
Metal wiring (ds4_metal.m):
* Register g_get_rows_q8_0_pipeline and g_cpy_q8_0_f32_pipeline at
init; clear them at cleanup.
* Both ds4_metal_embed_{token,tokens}_hc_tensor and the shared
ds4_metal_encode_get_rows helper take a new weight_type parameter
(GGUF type code: 1=F16, 8=Q8_0). 8 callers in ds4.c forward
weights->token_embd->type unchanged. ds4_metal_embed_row_layout
picks the right per-row stride and pipeline.
* ds4_metal_matmul_f32_tensor extended with a multi-token branch
that dispatches to kernel_mul_mm_f32_f32 (n_tok > 1); existing
n_tok = 1 path unchanged.
* ds4_metal_encode_compressor_score_with_ape and the equivalent loop
in ds4_metal_compressor_prefill_tensor add a Q8_0 branch
(ds4_metal_encode_cpy_q8_0_f32_1d) and use a per-row stride that
accounts for the block_q8_0 layout.
* Six ape_type validators relaxed to also accept 8 (Q8_0).
* Six ape_bytes calculations centralized through a new
ds4_metal_ape_bytes(ape_type, n_elems) helper that returns the
correct stride for F16/F32/Q8_0.
* metal_graph_matmul_plain_tensor extended with a Q8_0 branch.
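The stride helper itself boils down to three cases, as in the sketch below;
the 34-byte Q8_0 block size is the standard block_q8_0 layout (2-byte fp16
scale + 32 int8 quants), and the 0/1/8 type codes are the same ggml/GGUF
codes used above (0=F32 is an assumption):

```c
#include <stddef.h>

/* Sketch of ds4_metal_ape_bytes: bytes spanned by n_elems elements of the
 * given APE weight type. Q8_0 packs 32 values per 34-byte block, so n_elems
 * is assumed to be a multiple of 32 here. */
static size_t ds4_metal_ape_bytes(int ape_type, size_t n_elems) {
    switch (ape_type) {
        case 1:  return n_elems * 2;          /* F16 */
        case 0:  return n_elems * 4;          /* F32 (type code assumed) */
        case 8:  return (n_elems / 32) * 34;  /* Q8_0 */
        default: return 0;                    /* unsupported */
    }
}
```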
Tested on macOS / M-series / Metal:
* make ds4-server clean (no new warnings).
* Cyberneurova Q2_K GGUF entirely unmodified: loads, prefill +
decode through to coherent generation ("PASS" returned for the
"reply with the single word PASS" prompt).
* Pre-harmonized variant (token_embd / hc / compressor / indexer all
F16, ffn_gate_inp F16): still works byte-for-byte the same as
before this change, no F16 path regressions.
Caveat for reviewers running ivanfioravanti's M5 PR (antirez#15) on top of
this: the unmodified cyberneurova file generates garbage (BOS spam)
when MPP F16 prefill is engaged, but produces coherent output with
DS4_METAL_MPP_F16_DISABLE=1. The garbage is reproducible from antirez#15's
MPP path alone and is independent of the changes here; it surfaces only
because this PR makes the Q8_0 file loadable in the first place.
…support-q8_0-token-embd
When a stock-recipe GGUF (cyberneurova-style: Q8_0 small tensors, F32 router) is loaded on M5 with PR antirez#15's MPP optimizations enabled, the compressor APE path silently produces wrong output and prefill emits garbage tokens (typically <BOS> spam after a few coherent tokens). The rest of the prefill path is correct; the bug is in two compressor APE consumers that were updated to accept Q8_0 ape_type but couldn't read the Q8_0 byte layout correctly:

1. `kernel_cpy_q8_0_f32` Metal kernel (added in support-q8_0-token-embd for the prefill APE byte-strided dequant): produces silently wrong output on M5 Max for the compressor APE shapes (4 rows x 1024 cols). Replaced with a CPU-side dequant into a per-call private MTLBuffer. The CPU dequant matches the gguf-py reference byte-for-byte (verified with a standalone numeric check); the Metal kernel did not.

2. `kernel_dsv4_compressor_store_one` (decode-time single-row store in metal/dsv4_kv.metal): only handled F16/F32 ape_type; Q8_0 fell into the F32 else branch and read garbage. Add a Q8_0 branch that walks the block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block) directly.

The CPU dequant path also has to use a *fresh per-call* MTLBuffer for each compressor invocation, not the shared g_compressor_store_ape_buffer: multiple CPU writes to one shared buffer in the same command buffer collapse to the last write at execute time (Metal kernels run in encode order, but CPU writes don't participate in that ordering when the same scratch is reused). The per-call buffer is retained until cb completion via addCompletedHandler because Metal does not strongly retain buffers bound to encoders.

Changes:
* ds4_metal.m: new `ds4_metal_half_bits_to_float` and `ds4_metal_cpu_dequant_q8_0_rows` helpers (verified against gguf-py `dequantize` reference); replace Q8_0 branches in `ds4_metal_encode_compressor_score_with_ape` and `ds4_metal_compressor_store_batch_tensor` with CPU-side dequant into per-call private buffers retained via addCompletedHandler.
* metal/dsv4_kv.metal: add a Q8_0 branch to `kernel_dsv4_compressor_store_one`.
* metal/cpy.metal: `kernel_cpy_q8_0_f32` is left in place but no longer reached from the compressor paths (its registration in ds4_metal.m is harmless).

Tested on macOS / M-series / Metal:
* make ds4-server clean.
* Cyberneurova Q2_K GGUF entirely unmodified, MPP F16 enabled (i.e. no DS4_METAL_MPP_F16_DISABLE workaround): 21-token prompt -> coherent generation ("An LLM, or Large Language Model, is a type of artificial intelligence"). Previously this prompt generated "An LLM, or large language" then <BOS> token spam.
* Pre-harmonized variant: still works byte-for-byte the same as before this change, no F16/F32 path regressions.
This PR's loader changes accept Q8_0 `*compressor_ape*` weights at the
validator level, but two follow-on Metal paths still treat them as F16
(or fall through to F32) and produce silently wrong output, which shows
up as <BOS>-token spam in generation for any prompt long enough to
exercise the multi-token compressor path on M-series hardware.
1. `kernel_cpy_q8_0_f32` (added in this PR for the prefill APE
byte-strided dequant) compiles cleanly and follows the same
block_q8_0 indexing pattern used by other working Q8_0 kernels in
dense.metal, but emits silently wrong values for the actual ape
shapes (4 rows x 1024 cols of block_q8_0). Confirmed by isolating
the kernel: a CPU-side dequant of the same byte region matches
gguf-py's `dequantize` reference byte-for-byte, while the Metal
kernel's output is wrong.
2. `kernel_dsv4_compressor_store_one` (decode-time single-row store
in metal/dsv4_kv.metal): only handled `ape_type == 1` (F16) and
fell through to F32 for everything else, so Q8_0 ape was reading
garbage at decode time.
Fix:
* Replace the prefill APE Q8_0 path in
`ds4_metal_encode_compressor_score_with_ape` and
`ds4_metal_compressor_store_batch_tensor` with a CPU-side dequant
via two new helpers (`ds4_metal_half_bits_to_float` and
`ds4_metal_cpu_dequant_q8_0_rows`) into a *per-call* private
MTLBuffer. A per-call buffer is required because multiple CPU writes
to the previously-shared `g_compressor_store_ape_buffer` within one
command buffer collapse to the last write at execute time (Metal
kernels run in encode order, but CPU writes don't participate in that
ordering when the same scratch is reused). The per-call buffer is
retained until cb completion via `addCompletedHandler` because Metal
does not strongly retain buffers bound to encoders.
* Add a Q8_0 branch to `kernel_dsv4_compressor_store_one` that walks
block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block)
inline.
The buggy `kernel_cpy_q8_0_f32` Metal kernel is left in place but is
no longer reached from the compressor paths; its registration in
ds4_metal.m is harmless and a future debug session can either fix it
or drop it.
Tested on macOS / M-series / Metal:
* make ds4-server clean (one pre-existing -Wpointer-sign warning from
the unrelated MoE path).
* Cyberneurova Q2_K GGUF entirely unmodified, default flags:
21-token prompt -> coherent generation
("An LLM, or Large Language Model, is a type of artificial intelligence").
Previously this prompt generated a few coherent tokens then <BOS>
token spam.
* Pre-harmonized variant (token_embd / hc / compressor / indexer all
F16): still works byte-for-byte the same as before this fix; no F16
/ F32 path regressions.
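For readers who haven't stared at block_q8_0 before, the CPU-side dequant the
two new helpers perform is roughly the following. This is a generic sketch of
the layout walked above (2-byte fp16 scale + 32 int8 quants per 34-byte
block), not the literal ds4_metal_cpu_dequant_q8_0_rows /
ds4_metal_half_bits_to_float bodies:

```c
#include <stdint.h>
#include <string.h>

/* Standard ggml block_q8_0: one fp16 scale followed by 32 int8 quants (34 bytes). */
#define QK8_0 32
typedef struct { uint16_t d_bits; int8_t qs[QK8_0]; } block_q8_0;

/* Software IEEE half -> float (denormals flushed to zero for brevity). */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = (uint32_t)(h & 0x3ff);
    uint32_t bits;
    if (exp == 0)       bits = sign;                              /* zero / denormal */
    else if (exp == 31) bits = sign | 0x7f800000u | (man << 13);  /* inf / NaN */
    else                bits = sign | ((exp + 112u) << 23) | (man << 13);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Dequantize n_rows rows of row_elems block_q8_0 values into f32. */
static void cpu_dequant_q8_0_rows(const block_q8_0 *src, float *dst,
                                  int n_rows, int row_elems) {
    const int blocks_per_row = row_elems / QK8_0;
    for (int r = 0; r < n_rows; r++) {
        for (int b = 0; b < blocks_per_row; b++) {
            const block_q8_0 *blk = &src[r * blocks_per_row + b];
            const float d = half_bits_to_float(blk->d_bits);
            for (int i = 0; i < QK8_0; i++)
                dst[(r * blocks_per_row + b) * QK8_0 + i] = d * (float)blk->qs[i];
        }
    }
}
```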
The decode-time indexer code at metal_graph_encode_decode_layer (ds4.c:9082-9095) still has two F16-only validators on indexer_attn_q_b and indexer_proj that I missed in the initial loader pass. These validators only fire after `g->layer_n_comp[il] > decode_top_k` — i.e. once the compressor has accumulated more rows than the decode-time top-k. For short generations the path isn't reached; for ~400+ token generations on stock-recipe (Q8_0) GGUFs the validator trips and the request finishes with finish_reason="error" / "Metal decode failed".

The downstream calls already use metal_graph_matmul_plain_tensor (which dispatches to ds4_metal_matmul_q8_0_tensor for Q8_0). The loader-time validator at line 2211-2212 already uses tensor_expect_dispatch_layout, which accepts F16/F32/Q8_0. Only these runtime guards were stuck on F16.

Reproducer (cyberneurova Q2_K, default flags): a "write a long story" prompt that generates ~800 tokens hits the validator after ~400 tokens and the request errors out. After this fix, the same prompt streams 800+ tokens cleanly.
…-embd

# Conflicts:
#   ds4_metal.m
…8_0 fixes) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two callers of ds4_metal_encode_cpy_q8_0_f32_1d were removed in 79b08bb (switched to CPU-side dequant to avoid an encode-time race on the shared compressor scratch buffer), leaving the function unused and tripping -Wunused-function on stock Make builds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updated project title and clarified description.
A second bundled steering vector parallel to verbosity, built from 100 contested vs 100 settled prompts. Negative FFN scales push the model toward hedge-mode responses (presenting multiple positions); positive scales push toward a more assertive single-answer mode. The hedge-vs-assert axis is a general response register rather than a topic-specific representation, so the direction transfers across model variants better than stance directions. Useful on questions where the model would otherwise emit a strongly-trained closed-form completion. Most effective combined with a system prompt that supplies the disputed positions explicitly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#   README.md
Implements the Responses API endpoint that Codex CLI (and other modern OpenAI tooling) speaks instead of /v1/chat/completions. The wire format is documented in OpenAI's Responses API; this implementation has been iterated against the Codex CLI binary's SSE parser shape until no remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call, function_call_output, reasoning, custom_tool_call(_output), local_shell_call(_output), web_search_call(_output), tool_search_call(_output), image_generation_call(_output), compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so prior actions survive across turns; rejects unknown item types and non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized text blocks (input_text/output_text/text/summary_text/reasoning_text); rejects non-text modalities (input_image/file/audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant message so text + tool-call turns render as a single assistant block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates the reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created, output_item.added/.done, reasoning_summary_part.added/.done, reasoning_summary_text.delta/.done, content_part.added/.done, output_text.delta/.done, function_call_arguments.delta/.done, response.completed.
- Branches the terminal event by finish reason: response.failed for errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries annotations:[]; function_call output_item.added ships with an empty arguments string (full args arrive via function_call_arguments.done and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model was suppressed at the tool marker: the suppressed tail is flushed as additional output_text deltas so the streamed message matches output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
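For the record, the terminal-event branching described above reduces to one
small decision; here is a sketch (the finish-reason enum is hypothetical, the
three event strings are the ones listed in the message above):

```c
/* Sketch: choose the terminal Responses SSE event from the finish reason. */
typedef enum { FINISH_STOP, FINISH_LENGTH, FINISH_ERROR } finish_reason; /* hypothetical */

static const char *responses_terminal_event(finish_reason r) {
    if (r == FINISH_ERROR)  return "response.failed";
    if (r == FINISH_LENGTH) return "response.incomplete"; /* emitted with reason "max_tokens" */
    return "response.completed";
}
```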
Anthropic's count_tokens API takes the same request shape as /v1/messages
but only returns the prompt token count without running inference. This
short-circuits before enqueueing a job: parse_anthropic_request renders
and tokenizes the prompt the same way it would for a real generation,
then we serialize {"input_tokens": N} and release the request.
Useful for clients that need to plan context budgets before committing
to a generation, e.g. the Anthropic SDK token-counting flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ds4.c calls ds4_gpu_set_mpp_mode, ds4_gpu_set_mpp_compare_context, and ds4_gpu_clear_mpp_compare_context unconditionally, but these are Metal-only diagnostic helpers. Provide no-op CUDA implementations so the linker resolves the symbols. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GGUF math: the 86.72 GB mixed IQ2_XXS+Q2_K+Q8_0 file averages ~2.62 bits per parameter, which matches a 284b-parameter model (671b would require an impossible ~1.11 bits/param). antirez/ds4's README also says 284b, as does pi-ds4.html. This was a copy-paste error from the DeepSeek V3 family numbers. Also surface the decode rate (~30 tok/s on M5 Max, matching the benchmark table's 24–37 tok/s range) alongside the prefill rate, since decode is what users actually experience during a conversation.
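The arithmetic is easy to re-run; the only assumption below is reading the reported 86.72 as GiB (decimal GB would give ~2.44 bits/param for 284b, which doesn't match the figure quoted above):

```c
#include <stdio.h>

/* Reproduce the bits-per-parameter sanity check from the commit above. */
int main(void) {
    const double file_bits = 86.72 * 1024.0 * 1024.0 * 1024.0 * 8.0; /* 86.72 GiB in bits */
    printf("284b params: %.2f bits/param\n", file_bits / 284e9);     /* ~2.62, plausible for this quant mix */
    printf("671b params: %.2f bits/param\n", file_bits / 671e9);     /* ~1.11, impossible */
    return 0;
}
```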
Absorbs Swival/ds4-m5's M5 simdgroup_matrix matmul fast path, M5-private scratch buffers + hazard tracking, CUDA q8 fp16 cache memory guard, CUDA long-context fixes/test, M2 Ultra benchmark, and speed-bench script/CSV layout. Conflict resolution in ds4_metal.m: - ds4_gpu_get_mul_mm_id_pipeline keeps HEAD's (bc_inp, use_mpp) shape, since the MoE _id_ kernels in metal/moe.metal use function-constant slot 702 for use_mpp. Slot 702 in metal/dense.metal is used for m5_sgmatrix via the auto-merged ds4_gpu_get_mul_mm_pipeline path. - ds4_gpu_use_m5_simdgroup_matrix() helper kept alongside our MPP infrastructure (q8_0/f16/attn-out/MoE policy functions) so both fast paths stay wired. README left at HEAD; swival/m5's new sections describe a different fork narrative.
The upstream backend refactoring (0ac5df3) renamed the public tensor API from ds4_metal_tensor_* to ds4_gpu_tensor_* (and ds4_metal_matmul_q to ds4_gpu_matmul_q, etc.), but tests/ds4_test.c kept using the old names — so `make ds4_test` had been broken since that commit. This is a pure rename; ./ds4_test --server passes.
Absorbs Swival/ds4-m5's M5 simdgroup_matrix matmul fast path, M5-private scratch buffers + hazard tracking, CUDA q8 fp16 cache memory guard, CUDA long-context fixes/test, M2 Ultra benchmark, and speed-bench script/CSV layout. Conflict resolution in ds4_metal.m (matches main's earlier resolution): - ds4_gpu_get_mul_mm_id_pipeline keeps HEAD's (bc_inp, use_mpp) shape, since the MoE _id_ kernels in metal/moe.metal use function-constant slot 702 for use_mpp. Slot 702 in metal/dense.metal is used for m5_sgmatrix via the auto-merged ds4_gpu_get_mul_mm_pipeline path. - ds4_gpu_use_m5_simdgroup_matrix() helper kept alongside our MPP infrastructure (q8_0/f16/attn-out/MoE policy functions) so both fast paths stay wired. README left at HEAD; swival/m5's new sections describe a different fork narrative.
Raise the default Metal prefill chunk to 4096 and reuse the range-capable layer-major prefill graph for chunked ranges. Enable the guarded Q8_0 attn_q_b MPP route for <=2048-token prompt batches, dynamic Q8_0 tile width, the routed-MoE fast layout from layer 0, and the RB16 indexed decode path. M5 Max post-patch ds4-bench profile with 64 generated tokens: prompt 443/459/522/486/465 t/s and generation 38.6/38.2/37.6/34.0/33.6 t/s at 0.5k/1k/2k/4k/8k. Tests: make all ds4_test; make test; git diff --check.
Detect macOS Low Power Mode and widen the Q8_0 prefill MPP route only under that condition, while preserving the guarded default for normal-power runs and explicit Q8_0 filters. Low-power M5 Max baseline vs patched auto with 128 generated tokens: 0.5k: prefill 133.46 -> 196.89 t/s, gen 13.53 -> 15.08 t/s 1k: prefill 118.65 -> 188.91 t/s, gen 12.23 -> 14.93 t/s 2k: prefill 130.90 -> 220.33 t/s, gen 11.02 -> 14.65 t/s 4k: prefill 118.09 -> 212.81 t/s, gen 13.25 -> 14.00 t/s 8k: prefill 185.52 -> 206.49 t/s, gen 12.94 -> 13.84 t/s Tests: make all ds4_test; make test; DS4_METAL_MPP_LOW_POWER_DISABLE=1 ./ds4_test --metal-mpp-equivalence; git diff --check.
Brings in Ivan's PR antirez#15 follow-ups (Tune Metal MPP defaults / Improve MPP prefill throughput / Low-power Q8 profile) plus the ds4_test rename fix, on top of the swival/m5 work that main already absorbed. Conflict resolution: - ds4_metal.m: drop main's older ds4_gpu_mpp_q8_0_partial_tiles_enabled in favor of m5's newer version from ff2d499 (handles low-power mode). - README.md: keep HEAD (the Abliterated fork narrative); m5's README is the original DwarfStar 4 readme with Swival's M5 narrative appended.
ds4_metal.m's embedded shader source declares both:

FC_mul_mm_id_mpp      [[function_constant(FC_MUL_MM + 2)]]         // == 702
FC_mul_mm_m5_sgmatrix [[function_constant(FC_MUL_MM_M5_SGMATRIX)]] // == 702

Metal requires every function_constant index to be unique across the program, so the shader unit fails to compile as a whole on devices that fall back to legacy Metal (e.g. M4 Max with Metal 4 Tensor API disabled), with:

program_source:3281:34: error: 'function_constant' has a duplicate index '702'

ds4-server then aborts with "metal backend unavailable". This is a compile-time conflict; runtime flags like DS4_MPP=off / DS4_METAL_TENSOR_DISABLE=1 can't recover from it because the shader unit is rejected before any runtime selection runs.

This minimal patch moves FC_MUL_MM_M5_SGMATRIX out of the 700–799 block (which is shared with FC_MUL_MM family indices computed as FC_MUL_MM + k) into 800, away from the existing FC_BIN=1300 range.

Verified on Apple M4 Max, 128 GB, macOS 26.4.1, building from main HEAD e6c3da4:
- make clean && make ds4-server: succeeds
- ./ds4-server -m cyberneurova-...Q2_K.gguf \
    --dir-steering-file dir-steering/out/uncertainty.f32 \
    --dir-steering-ffn -3 \
    --ctx 100000 --host 0.0.0.0 --port 8002
- Server boots, maps 94 GiB of tensors, logs "directional steering enabled: ... attn=0 ffn=-3"
- Steady-state generation ~25 t/s (matches antirez/ds4 upstream on M4 Max; M4 Max is not in the MPP target list so it correctly runs the legacy Metal path)

A cleaner long-term fix would be to (a) reserve disjoint ranges per family (e.g. 700–799 for FC_MUL_MM, 800–899 for FC_MUL_MM_M5_*) and document the convention, or (b) gate the M5-only declarations behind a feature macro so pre-M5 builds don't emit them at all. This patch is the minimal change that unblocks pre-M5 users today.
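For orientation, the index layout this patch ends up with is roughly the following. Treating these as plain #defines (and the exact spelling of the family base) is an assumption on my part; only the values 702, 800 and 1300 are stated above:

```c
/* Hypothetical sketch of the function_constant index map after this patch.
 * FC_MUL_MM + k stays inside 700-799; the M5 simdgroup_matrix constant moves
 * to 800 so it can no longer collide with an FC_MUL_MM family slot. */
#define FC_MUL_MM             700   /* family base: FC_MUL_MM + 2 == 702 is use_mpp for the MoE _id_ kernels */
#define FC_MUL_MM_M5_SGMATRIX 800   /* was 702 (the collision); now outside the FC_MUL_MM block */
#define FC_BIN               1300   /* existing range, untouched */
```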
Following up on the side note in the original PR body — here are the full details for the DGX Spark CUDA hang in case it's useful (separate from this PR's fix; happy to drop into a new venue if you'd prefer).

Environment

Reproduction

git clone https://github.com/audreyt/ds4 ds4-audreyt
cd ds4-audreyt
PATH=/usr/local/cuda/bin:$PATH make CUDA_ARCH=native
./ds4-server --cuda \
  -m /path/to/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --host 0.0.0.0 --port 8001 --ctx 32768 \
  --kv-disk-dir /mnt/models/ds4-kv --kv-disk-space-mb 8192 \
  --dir-steering-file dir-steering/out/uncertainty.f32 \
  --dir-steering-ffn -3

Observed: hangs forever after CUDA init

Server log gets exactly this far and then nothing more is emitted (waited 5+ minutes across multiple attempts; also tried without …). For comparison, the same Spark running antirez/ds4 upstream (…) …

Process state during the hang

So:
→ looks like a tight busy loop in the CUDA loader pre-tensor-copy phase, not a deadlock on a syscall.

Why we suspect the loader hasn't been ported

What I'd like to know

Thanks again!
One more observation while we're here, in case it's useful for README §8 (Agent Client Usage) — this isn't a bug in the fork itself, just an integration note.

Observation: Claude Code's system prompt masks the abliterated character

I tested the same sensitive query through two paths against the same running server (audreyt fork @ HEAD …).

Path A: direct curl

Prompt: "1989年6月4日 天安門廣場發生了什麼事？請詳述。" ("What happened in Tiananmen Square on June 4, 1989? Please give details.")

Reply (first ~150 tokens; the whole completion was 800 tokens because …):

…

This is the "rehearsing-contested-territory" register described in README §4. No refusal, no PRC-default answer — exactly the steering effect, until the second half collapsed into ….

Speed for the raw path: wall 27.77 s, prompt 25 tok, completion 800 tok → ~29 t/s gen on M4 Max.

Path B: same model, same server, accessed through Claude Code via cc-connect (claudecode adapter)

Same prompt, but Claude Code spawns the request with its full ~25K-token system prompt (containing instructions like "Refuse requests for destructive techniques…" and "You are Claude Code, helpful assistant…"). The model now politely refuses to discuss the topic and offers to help with something else. The abliterated weights are still loaded and the steering vector is still applied — but Claude Code's system prompt completely overrides the model's persona.

Why I think this matters for the README

§8.7 currently lists Claude Code as one of the supported agent integrations and only gives the … setup steps. A one-line caveat under §8.7 like:

…would save the next person an afternoon of "did I install it wrong?" debugging. The same caveat may apply to OpenCode's default chat mode if it ships its own system prompt; I haven't verified. Happy to send this as a doc-only PR if you'd like.
Environment

- audreyt/ds4 main HEAD e6c3da4 (clean clone, make ds4-server)
- DS4_MPP=off / DS4_METAL_TENSOR_DISABLE=1 — no help, because the failure is at compile-time, before any runtime flags are read

Reproduction

git clone https://github.com/audreyt/ds4
cd ds4
make ds4-server
./ds4-server \
  -m gguf/cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --host 0.0.0.0 --port 8002 --ctx 100000 \
  --dir-steering-file dir-steering/out/uncertainty.f32 \
  --dir-steering-ffn -3

Observed

Server aborts during Metal shader compilation:

program_source:3281:34: error: 'function_constant' has a duplicate index '702'

Root cause

In ds4_metal.m's embedded shader source (around line 2282), both FC_mul_mm_id_mpp [[function_constant(FC_MUL_MM + 2)]] and FC_mul_mm_m5_sgmatrix [[function_constant(FC_MUL_MM_M5_SGMATRIX)]] are declared, and both FC_MUL_MM_M5_SGMATRIX and FC_MUL_MM + 2 evaluate to 702. The Metal compiler requires every function_constant index to be unique across the program, so the shader unit fails as a whole, regardless of whether the M5 path or the legacy MPP path is actually selected at runtime.

Why your own machine doesn't see this

I assume your M5 Max has the Metal 4 Tensor API enabled and the active shader emission/specialization happens to mask one of the two declarations (or you build with a different conditional). On a pre-M5 device the fallback path still feeds the entire shader source to the compiler, both declarations are visible, and we hit the dup-index error.

Local workaround (ugly but works)

After the patch + make clean && make ds4-server, the M4 Max boots cleanly, loads cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf (~94 GiB mapped), enables directional steering: dir-steering/out/uncertainty.f32 attn=0 ffn=-3, and runs at ~25 t/s gen (matches the antirez/ds4 upstream rate on M4 Max — no MPP acceleration since M4 Max isn't in the MPP target list, which is expected).

Suggested cleaner fixes

1. Separate index ranges: keep the FC_MUL_MM block reserved for 700–799 and use 800+ for FC_MUL_MM_M5_*.
2. Conditional shader emission: wrap the M5-specific declarations behind #if HAS_METAL4_TENSOR (or whatever feature gate you already use for the M5 fast path) so pre-M5 builds don't see them at all.
3. Split the M5 shader into its own source string and only concatenate it when the device family qualifies at runtime.

I'd guess #1 is the smallest patch that fixes this cleanly. Happy to send a PR if you'd prefer; just wanted to file the issue first since #2/#3 are bigger architectural choices.

Side note (separate from this issue)

While trying to run the same cyberneurova GGUF on a DGX Spark (GB10, sm_121, CUDA 13.0) with main HEAD e6c3da4 + --cuda, the server hangs at CPU 100% right after "CUDA backend initialized on NVIDIA GB10 (sm_121)" and never reaches "CUDA loading model tensors into device cache". Looks like the cyberneurova q8_0 token-embd loader path isn't wired up on the CUDA side either. Want me to file that as a separate issue?