Eliza/token trie sampler #15

Merged
lalalune merged 63 commits into main from eliza/token-trie-sampler
May 18, 2026

Conversation

@lalalune
Member

Overview

Additional information

Requirements

NubsCarson and others added 30 commits May 15, 2026 13:54
…r SWA-only models

When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4),
the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs,
self_kq_mask) are created as graph input nodes but never consumed by any compute node,
so the backend scheduler never allocates a buffer for them. Calling
mctx->get_base()->set_input_k_idxs() on an unallocated tensor then hits
GGML_ASSERT(buffer) at ggml-backend.cpp:194.

The same scenario applies symmetrically: if a model had zero SWA layers, the SWA
tensors would be unallocated.

Fix: guard both the base and SWA set_input calls with null/buffer checks, matching
the pattern already used by llm_graph_input_mem_hybrid_iswa::set_input (line ~674)
which has the comment: 'base tensors may not be allocated if there are no non-SWA
attention layers'.

Also fix can_reuse() in the same class to skip the ne[0] and kq_mask checks for
unallocated tensors, preventing a null-dereference on the reuse path.
…-sampler

# Conflicts:
#	tests/test-llama-archs.cpp
Cherry-picks the iq1_m portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq1_m_q8_K_vl512  (LMUL=4 grid gather, LMUL=8 widening
  multiply, masked vwmacc accumulation)
- ggml_vec_dot_iq1_m_q8_K_vl1024 (LMUL=2 + three-way masked vwmacc
  to fold 64 elements into one register per sub-block)
- case 512 / case 1024 entries in the iq1_m dispatcher

Rebased onto our PR #20682 macro scheme: kept under the existing
'#if GGML_RISCV_RVV_1_0' guard surrounding all iq1_m vlN kernels.
xtheadvector and vl128 paths unchanged.

Bit-exact correctness contract: must match the scalar reference at
VLEN=512 and VLEN=1024. Validation deferred to qemu VLEN matrix CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/1477)

For a given output position j on the time axis, only input positions
i such that i*s0 <= j < i*s0 + K contribute -- i.e.
i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1].
That's at most ceil(K/s0) values (typically 2 for stride==K/2
transposed convs).

The current kernel iterates the full IL range and filters with an
`if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320,
K=10, s0=5 -- a representative codec-decoder shape). On Apple M1
the wasted work trips the macOS GPU watchdog
(kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long
graphs.

Compute i_min, i_max analytically before the inner loop and iterate
only [i_min, i_max]. Output is bit-identical (same multiplies and
adds in the same order); loop bound shrinks by IL/ceil(K/s0).
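
A CPU-side sketch of the bound computation, assuming a single channel and hypothetical function names (`conv_t1d_filtered`, `conv_t1d_bounded`); the real kernel is a GPU shader and is not reproduced here. It shows the filtered loop and the analytically bounded loop producing the same output, since both visit the contributing `i` values in ascending order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: scan every input position i and filter with an `if`.
static std::vector<float> conv_t1d_filtered(const std::vector<float> & x,
                                            const std::vector<float> & w, int s0) {
    const int IL = (int) x.size();
    const int K  = (int) w.size();
    assert(IL >= 1 && K >= 1 && s0 >= 1);
    std::vector<float> y((size_t) ((IL - 1) * s0 + K), 0.0f);
    for (int j = 0; j < (int) y.size(); ++j) {
        for (int i = 0; i < IL; ++i) {
            const int k = j - i * s0; // kernel tap linking input i to output j
            if (k >= 0 && k < K) {
                y[(size_t) j] += x[(size_t) i] * w[(size_t) k];
            }
        }
    }
    return y;
}

// Same sum, but iterate only i in [i_min, i_max].
static std::vector<float> conv_t1d_bounded(const std::vector<float> & x,
                                           const std::vector<float> & w, int s0) {
    const int IL = (int) x.size();
    const int K  = (int) w.size();
    assert(IL >= 1 && K >= 1 && s0 >= 1);
    std::vector<float> y((size_t) ((IL - 1) * s0 + K), 0.0f);
    for (int j = 0; j < (int) y.size(); ++j) {
        // i_min = max(0, ceil((j - K + 1) / s0)); a non-positive numerator clamps to 0.
        const int num   = j - K + 1;
        const int i_min = num > 0 ? (num + s0 - 1) / s0 : 0;
        // i_max = min(IL - 1, floor(j / s0)); j >= 0, so integer division is floor.
        const int i_max = std::min(IL - 1, j / s0);
        for (int i = i_min; i <= i_max; ++i) {
            y[(size_t) j] += x[(size_t) i] * w[(size_t) (j - i * s0)];
        }
    }
    return y;
}
```

When the stride exceeds the kernel width, some output positions get `i_min > i_max` and the inner loop runs zero times, which matches the filtered version rejecting every `i`.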

Tested on M1 with a downstream consumer running a TTS codec at full
T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits
across long synthesis runs vs ~30% pre-patch.
Cherry-picks the iq2_s portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq2_s_q8_K_vl512  (LMUL=2 grid gather; sign-bit broadcast
  via vrgather; per-half masked vwredsum to fold 128 lanes into 8
  per-sub-block scalars)
- ggml_vec_dot_iq2_s_q8_K_vl1024 (single-pass 256-lane processing
  with 4 groups of 16 masked vwredsum, vslidedown for 16 sub-block
  sums, vector LUT for scale nibble expansion)
- case 512 / case 1024 entries in the iq2_s dispatcher

Rebased onto our PR #20682 macro scheme: kept under the existing
'#if GGML_RISCV_RVV_1_0' guard surrounding all iq2_s vlN kernels.
xtheadvector and vl128 paths unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cherry-picks the iq2_xs portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq2_xs_q8_K_vl512  (LMUL=4 grid + signs gather in
  parallel, vwmul into LMUL=8, 8-fold vslidedown + vwredsum loop for
  per-sub-block 16-lane sums)
- case 512 entry in the iq2_xs dispatcher

Upstream PR #22754 ships vl512 only for iq2_xs (no vl1024 variant).

Rebased onto our PR #20682 macro scheme: kept under the existing
'#if GGML_RISCV_RVV_1_0' guard alongside the vl256 kernel. The
dispatcher's case 256 / case 512 entries are nested inside the
GGML_RISCV_VECTOR_INTRINSICS arm under their own RVV_1_0 guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cherry-picks the iq2_xxs portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq2_xxs_q8_K_vl512  (LMUL=4 grid + signs gather;
  vlseg2 de-interleave of (index, scale) pairs; vrgatherei16 + vsrl
  to compute per-sub-block sign indices; 8-fold vslidedown + vwredsum
  to reduce 256 lanes into 8 per-sub-block sums)
- case 256 / case 512 explicit entries in the iq2_xxs dispatcher,
  preserving the previous "default: 256+" semantics for VLEN=1024+

Upstream PR #22754 ships vl512 only for iq2_xxs (no vl1024 variant).

Rebased onto our PR #20682 macro scheme: kept under the existing
'#if GGML_RISCV_RVV_1_0' guard. xtheadvector and vl128 paths
unchanged.
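
One way to read the dispatcher change, sketched with hypothetical names (`sketch_kernel`, `sketch_pick_iq2_xxs`): explicit entries for 256 and 512, with every other vector length still falling through to the vl256 kernel. The real dispatcher calls the `ggml_vec_dot_iq2_xxs_q8_K_vl*` kernels directly and sits under preprocessor guards that are omitted here; the handling of lengths below 128 is an assumption.

```cpp
#include <cassert>

// Hypothetical labels for the kernels named in the commit message.
enum class sketch_kernel { vl128, vl256, vl512 };

// Pick a kernel from the hardware vector length in bits.
static sketch_kernel sketch_pick_iq2_xxs(int vlen_bits) {
    // VLEN is a power of two on RVV hardware.
    assert(vlen_bits > 0 && (vlen_bits & (vlen_bits - 1)) == 0);
    switch (vlen_bits) {
        case 128: return sketch_kernel::vl128;
        case 512: return sketch_kernel::vl512;
        case 256:
        default:  return sketch_kernel::vl256; // VLEN=1024+ keeps the old "256+" behavior
    }
}
```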

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cherry-picks the iq3_s portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq3_s_q8_K_vl512  (LMUL=2 grid32 luxei gather; per-bit
  qh expansion via vrgather + vsrl + vand; sign-bit broadcast +
  vrsub_mu sign-folding; LMUL=4 vwmulsu dot + LMUL=8 weighted sum;
  vrgather-based scale broadcast to 128 lanes)
- case 512 entry in the iq3_s dispatcher

Upstream PR #22754 ships vl512 only for iq3_s (no vl1024 variant).

Rebased onto our PR #20682 macro scheme: kept under the existing
'#if GGML_RISCV_RVV_1_0' guard surrounding all iq3_s vlN kernels.
xtheadvector and vl128 paths unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cherry-picks the iq3_xxs portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq3_xxs_q8_K_vl512  (LMUL=2 grid32 luxei + signs64
  luxei gather; per-half ib128 loop with vrgatherei16 metadata
  expansion; LMUL=4 vwmul + 4-fold vwredsum per i16m1)
- ggml_vec_dot_iq3_xxs_q8_K_vl1024 (single-pass 256-lane processing;
  vrgatherei16 over 8 metadata words; 8-fold vslidedown + vwredsum
  over an LMUL=4 dot register)
- case 512 / case 1024 entries in the iq3_xxs dispatcher (nested
  under the existing GGML_RISCV_RVV_1_0 guard inside the
  GGML_RISCV_VECTOR_INTRINSICS arm)

Rebased onto our PR #20682 macro scheme: kept under the existing
'#if GGML_RISCV_RVV_1_0' guard surrounding the vl256 kernel.
xtheadvector and vl128 paths unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of abf8a27 (Vulkan parity) for the Metal backend.

Two new test_case structs in tests/test-backend-ops.cpp:
  - test_attn_score_qjl     -> GGML_OP_ATTN_SCORE_QJL (QJL1_256 K)
  - test_fused_attn_qjl_tbq -> GGML_OP_FUSED_ATTN_QJL_TBQ
                                (QJL1_256 K + TBQ3_0 V)

Both ops have a CPU reference implementation (ggml-cpu/qjl) and a
Metal kernel (eliza-shipped/qjl.metal,
eliza-shipped/fused_attn_qjl_tbq.metal). The test harness runs them
on every registered backend and compares against CPU. On Metal we
get real parity coverage; on Vulkan/CUDA the harness prints
"not supported" (their supports_op rejects these op codes), matching
the GET_ROWS/CPY/MUL_MAT pattern from abf8a27.

GGML_OP_ATTN_SCORE_TBQ and GGML_OP_ATTN_SCORE_POLAR are deliberately
NOT added: the CPU backend GGML_ABORTs on them (they are Metal-only;
canonical CPU graphs lower via attn_score_qjl / flash_attn_ext). A
test_case for either would crash the harness when computing the CPU
reference, not produce a "FAIL" line. The existing eliza_custom_quant
GET_ROWS/CPY/MUL_MAT test cases (added by abf8a27) already exercise
the underlying Q4_POLAR / TBQ3_0 / TBQ4_0 / TBQ3_TCQ types on Metal
where kernels exist; for Metal those quants currently print
"not supported" for GET_ROWS/CPY/MUL_MAT because the Metal device
backend gates them out explicitly (no kernel registered yet, see
ggml-metal-device.m cases GGML_OP_GET_ROWS / GGML_OP_CPY).

Shape constraints come straight from ggml_metal_device_supports_op:
  - ATTN_SCORE_QJL:     q.ne[0]=256, K.ne[0]=128
  - FUSED_ATTN_QJL_TBQ: q.ne[0]=256, K.ne[0]=128, V.ne[0]=128,
                        out.ne[0]=128
  - (q.ne[1] % n_kv_heads) == 0
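
The constraints above can be written as a pair of predicates. This is a sketch with hypothetical types and names (`sketch_shape`, `sketch_supports_*`), not the body of ggml_metal_device_supports_op; it also assumes the head-divisibility rule applies to both ops.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a tensor's ne[] dimensions.
struct sketch_shape {
    int64_t ne[4];
};

// ATTN_SCORE_QJL: q.ne[0]=256, K.ne[0]=128, query heads divisible by KV heads.
static bool sketch_supports_attn_score_qjl(const sketch_shape & q,
                                           const sketch_shape & k,
                                           int64_t n_kv_heads) {
    assert(n_kv_heads > 0);
    return q.ne[0] == 256 && k.ne[0] == 128 && (q.ne[1] % n_kv_heads) == 0;
}

// FUSED_ATTN_QJL_TBQ: same as above, plus V.ne[0]=128 and out.ne[0]=128.
static bool sketch_supports_fused_attn_qjl_tbq(const sketch_shape & q,
                                               const sketch_shape & k,
                                               const sketch_shape & v,
                                               const sketch_shape & out,
                                               int64_t n_kv_heads) {
    return sketch_supports_attn_score_qjl(q, k, n_kv_heads) &&
           v.ne[0] == 128 && out.ne[0] == 128;
}
```

A test case built with shapes that fail these predicates would be reported as "not supported" rather than exercised, so the test shapes need to satisfy them to get real parity coverage.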

New CI workflow .github/workflows/eliza-metal-validation.yml:
sibling to eliza-vulkan-validation.yml. Builds with
-DGGML_METAL=ON on macos-latest, runs test-quantize-fns (smoke for
the Eliza custom quants), then runs test-backend-ops on the Metal
backend for the relevant ops (GET_ROWS / CPY / MUL_MAT /
ATTN_SCORE_QJL / FUSED_ATTN_QJL_TBQ). Fails the job on any FAIL
line for the Eliza custom-quant types/ops; "not supported" lines
are the expected baseline for kernels we have not registered yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI (vulkan)/ubuntu-24-vulkan-llvmpipe run 25917370432 (SHA 5fbb799)
reports NMSE-vs-CPU FAIL for gemma2 / Dense (2.45e-02) and gemma3n /
Dense (1.34e-01), exceeding the test-llama-archs tolerance. CPU and
Metal backends pass cleanly for both archs; sibling Eliza arch gemma3
also passes on Vulkan in the same run (9.66e-14), so this is a real
correctness drift in the Vulkan backend kernels specific to these two
gemma family configurations — not a crash and not a generic gemma
issue. Fixing the underlying Vulkan kernel is well outside the
test-llama-archs lane; tracked separately in .swarm/collab.md backlog.

Add a narrow #ifdef GGML_USE_VULKAN gate inside arch_supported() so
CPU and Metal coverage is preserved, mirroring the existing
GGML_USE_WEBGPU block and FF's earlier QWEN35/QWEN35MOE pattern.
Failing run 25917370465 cited in JJ's brief was at pre-fix SHA
5fbb799. Fix commit 64cfd99 (option b: NOT WIN32 OR NOT
BUILD_SHARED_LIBS gate around test-fused-kernels and bench-fused-kernels
in tests/CMakeLists.txt:279-292) has been on HEAD since May 15. No new
code change in this lane.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The workflow passes `-static` to both compile and link flags so the
final test-quantize-fns binary can run under qemu-user without
dynamic loader plumbing. But the project's BUILD_SHARED_LIBS default is ON
for non-MinGW non-Emscripten Linux, so libggml-base.so.0.11.1 is
configured as a shared library and the link line ends up combining
`-shared` with `-static`. The riscv64 ld then pulls crtbeginT.o
(the crt object used for fully static links) into a shared link, which fails with:

  ld: crtbeginT.o: relocation R_RISCV_HI20 against `a local symbol'
      can not be used when making a shared object; recompile with -fPIC
  collect2: fatal error: ld terminated with signal 11

Setting -DBUILD_SHARED_LIBS=OFF aligns the build mode with the
`-static` flag intent: all libs static, single static executable.

Affects CI (riscv VLEN matrix) cross-qemu-vlen {128,256,512,1024}.
EAGLE3 (the third version of EAGLE, Extrapolation Algorithm for
Greater Language-model Efficiency) is a draft-model architecture for
speculative decoding. Upstream PR #18039 adds it.

This change is additive scaffolding only: reserves the enum value
and registers the "eagle3" architecture string in LLM_ARCH_NAMES.
No graph builder, hparams, layer fields, or model loader wiring
yet — those live in conflict-prone files (llama-model.cpp,
llama-hparams.cpp) and are deferred to a follow-up.

See /tmp/wave6-eagle3-real-journal.md for the full port plan.
EAGLE3's draft-model graph reads from two extra tensors produced
by the target model: 'target_features' (hidden state projection)
and 'target_tok_embd' (target's token embedding table).

This change reserves their llm_tensor enum values and registers:
  - LLM_TENSOR_NAMES entries (gguf string names)
  - LLM_TENSOR_INFOS entries (layer + ggml_op metadata)

No graph builder or model-loader wiring yet — those come with the
model-side scaffolding in a follow-up.
Adds two LLAMA_API symbols so callers can link against the
eventual EAGLE3 surface:
  - llama_get_eagle3_target_features() → NULL stub
  - llama_set_eagle3_g_embeddings()     → -1 stub

Both log a WARN-level message and return failure. The real
implementations land once the model loader, hparams, layer fields,
and graph builder for LLM_ARCH_EAGLE3 are wired in. See
/tmp/wave6-eagle3-real-journal.md for the full port plan.

src/llama-eagle3.cpp is wired into the llama target via
src/CMakeLists.txt so the symbols are exported.
lalalune and others added 14 commits May 16, 2026 23:25
…ever committed)

The header was added in 33c888a (feat: add token trie sampler header)
but no implementation file was ever committed to any branch reachable
from this tree. `git log --all -S llama_sampler_init_token_trie` shows
only the header add commit and a merge — never a definition.

There are zero references anywhere in source, CMake, or build config:
the symbol cannot link, and nothing tries to. A prior swarm agent
already documented this in .swarm/collab.md noting the sampler header
is stale and unreferenced.

Removing the dead header to clean up the link target. If the token-trie
sampler is reintroduced, it should land header + impl + CMake entry
+ a real caller together.
Cherry-pick upstream commit 255582687 (54 files, +2226/-412) and resolve
conflicts so MTP speculative decoding (--spec-type draft-mtp, with mtp as
backwards-compat alias) coexists with our Eliza-1 customizations.

Conflicts resolved (preserved Eliza divergence):
- common/common.h: drop legacy COMMON_SPECULATIVE_TYPE_MTP; use upstream
  DRAFT_MTP (renamed in PR #22964). Kept "mtp" string alias for back-compat.
- common/speculative.cpp: keep DFLASH dispatch, EAGLE3 fail-fast, null-check
  in common_speculative_accept; wire DRAFT_MTP into the config push_back
  list using upstream's params.draft.ctx_dft gate.
- src/llama-model.cpp: thread upstream's `filter` callback into the
  kv_cache constructor while keeping our kv_dynamic kv_size_max arg.
- src/models/qwen35.cpp, qwen35moe.cpp: adopt upstream's
  nextn_predict_layers split (MTP layers always non-recurrent) but keep
  our tensor-presence discovery as primary so existing Eliza-1 GGUFs
  that don't encode full_attention_interval still load correctly.
- tools/server/server-context.cpp: keep our prefill-plan methods AND
  upstream's need_embd() that consults common_speculative_need_embd.
- tools/server/README.md: kept HEAD (no webui→ui rename in our tree).

Backed out (not portable to current tree, documented):
- ggml/src/ggml-cuda/gated_delta_net.cu: 5 conflict regions touching the
  K-snapshot path for spec-dec. Reverted to HEAD; CUDA MTP path will need
  a separate port. Non-MTP CUDA gated_delta_net unaffected.
- conversion/{base,qwen}.py: upstream introduced a modular Python
  conversion package we don't track. Removed the cherry-picked files;
  convert_hf_to_gguf.py still gains the --mtp / --no-mtp flags but the
  Qwen3.5MtpMixin import in main() will need the conversion package to
  land separately before HF→GGUF MTP export works end-to-end.

GGML_TYPE_COUNT remains 51 (our TBQ/POLAR/TBQ_K quant types at IDs
42–48 + TBQ3_K=49/TBQ4_K=50 preserved). Upstream's only ggml.h change
was a comment addition on ggml_gated_delta_net.
… #22673

Prior MTP cherry-pick at 2761635 backed this file out due to 5 conflict
regions in the K-snapshot spec-dec path against our Eliza fork (which
already diverged the kernel to use d_v_per_warp/block_dv warp tiling,
__restrict__ qualifiers, lambda-based ggml_cuda_memcpy_1 load/store, and
a 3D grid (H, n_seqs, n_block_dv)). This commit hand-ports the K-snapshot
additions onto the Eliza kernel architecture:

  1. kernel template: add bool keep_rs_t param + int K runtime param.
  2. state offsets: input state is 3D (S_v*S_v*H, K, n_seqs); seq stride
     becomes K * H * S_v * S_v. Output state per slot is unchanged.
  3. per-token snapshot write: when keep_rs_t=true, after each token's
     stage-B store the working s_tile to slot (t - (n_tokens - K)) in
     the output state region. Slots earlier than t=shift are left
     caller-owned.
  4. final state write: kept only on the !keep_rs_t branch.
  5. host op dispatcher: K = src_state->ne[1]; keep_rs = (K > 1).
     K==1 preserves the legacy single-snapshot path bit-for-bit;
     K>1 enables partial rollback for MTP spec-dec.
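
The slot and stride arithmetic in steps 2, 3, and 5 can be sketched on the host side. The struct and method names below (`sketch_gdn_layout`, `slot_for_token`) are hypothetical; only the formulas (seq stride = K * H * S_v * S_v, slot = t - (n_tokens - K), keep_rs = K > 1) come from the list above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side view of the snapshot layout.
struct sketch_gdn_layout {
    int64_t S_v;      // state tile edge
    int64_t H;        // number of heads
    int64_t K;        // snapshots kept per sequence (src_state->ne[1])
    int64_t n_tokens; // tokens processed for this sequence

    bool    keep_rs()    const { return K > 1; }             // step 5
    int64_t snap_size()  const { return S_v * S_v * H; }     // one snapshot
    int64_t seq_stride() const { return K * snap_size(); }   // step 2
    int64_t shift()      const { return n_tokens - K; }

    // Slot written after token t, or -1 when no per-token snapshot is written.
    int64_t slot_for_token(int64_t t) const {
        assert(t >= 0 && t < n_tokens);
        if (!keep_rs()) {
            return -1; // legacy path: only the final state is written (step 4)
        }
        const int64_t slot = t - shift(); // step 3
        return slot >= 0 ? slot : -1;
    }
};
```

With K=3 and five tokens, only the last three tokens produce snapshots, landing in slots 0, 1, and 2, which is what allows rolling back to any of the last K positions.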

Layout verified by direct comparison to the CPU reference at
ggml/src/ggml-cpu/ops.cpp ggml_compute_forward_gated_delta_net_one_chunk
(matching state_seq_stride, state_size_per_snap, shift, and target_slot
arithmetic) and the Metal port (matching K = op->src[5]->ne[1]).
test-backend-ops cases at tests/test-backend-ops.cpp:9477-9485 cover
K==n_seq_tokens, K<n_seq_tokens (overflow), and both KDA and non-KDA
variants.

Cross-compile attempted via scripts/cuda-docker-build.sh (Docker reachable,
nvidia/cuda:12.4.1-devel-ubuntu22.04 under linux/amd64 emulation on aarch64
host); apt/cmake configure succeeded, full nvcc compile of all
ggml-cuda/*.cu kernels was still in flight at commit time. No errors
observed on the gated_delta_net.cu compile up to log truncation.
TODO(cuda-mtp-validation) marker inserted at the per-token snapshot write
to flag the point most likely to need on-device GPU verification.

Metal MTP baseline regression check on /tmp/mtp-test/Qwen3.5-2B-Q4_K_M.gguf:
prompt 355.7 t/s, gen 104.6 t/s — coherent output, no regression from this
CUDA-only change (no Metal kernel or shared header touched).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Source: discovered in /tmp/wave9-master-issue-list.md item #3.

PR #22673 (Multi-Token Prediction) was ported in commit 2761635 and
its --spec-type draft-mtp CLI usage is documented in tools/cli/README.md.
The stale entry in the upstream-PR-candidates roster still listed it
alongside #18886 under "to merge" — corrected to reflect the merged
state and noted the one remaining gap (CUDA gated_delta_net.cu
K-snapshot spec-dec path still backed out, pending separate port).
…PR #22673

Adds conversion/{__init__,qwen}.py — required for --mtp / --no-mtp flags
in convert_hf_to_gguf.py to actually resolve. The MTP cherry-pick at
2761635 backed out conversion/{base,qwen}.py because upstream PR
#22673 also introduced a full modular Python conversion package we
don't track. This commit restores only the _Qwen35MtpMixin (no
base.py, no other model families) and mixes it into the existing
monolithic Qwen3_5TextModel / Qwen3_5MoeTextModel base lists.

Without this we could only consume upstream-pre-published MTP GGUFs
(e.g. unsloth/Qwen3.5-2B-MTP-GGUF); with this we can re-convert our
own Eliza-1 Qwen3.5 base weights to add MTP heads.

Verified:
- `from conversion.qwen import _Qwen35MtpMixin` resolves cleanly.
- `python convert_hf_to_gguf.py --help` shows --mtp / --no-mtp.
- `--mtp <non-qwen35>` errors out with the documented message.
- End-to-end: re-converting unsloth/Qwen3.5-2B (bundled, BF16) produces
  a 333-layer-tensor GGUF with qwen35.block_count=25,
  qwen35.nextn_predict_layers=1, and blk.24.nextn.{eh_proj,enorm,
  hnorm,shared_head_norm}.weight. Tensor name list diffs zero against
  the upstream-published unsloth/Qwen3.5-2B-MTP-GGUF Qwen3.5-2B-BF16.gguf
  reference (334 names identical, sorted).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The embedded-metallib build path (GGML_METAL_EMBED_LIBRARY=1) did not
define GGML_METAL_HAS_BF16 when invoking xcrun metal, so every bfloat-
gated kernel — kernel_get_rows_bf16, kernel_mul_mv_bf16_*,
kernel_flash_attn_ext_bf16_*, kernel_tri_bf16_* — was omitted from the
compiled default.metallib. The runtime device->props.has_bfloat is true
on M-series, so supports_op happily returned true for BF16 GET_ROWS,
the kernel lookup returned a null MTLFunction ("failed to compile
pipeline"), and the subsequent dispatch SIGSEGV'd
(test-backend-ops -b MTL0 -o GET_ROWS exit 139).

The runtime JIT path in ggml-metal-device.m already sets this macro from
device props (line 235), so only the precompiled metallib path was
broken. ggml-metal.metal auto-undefs the macro when
__METAL_VERSION__ < 310, so adding -DGGML_METAL_HAS_BF16=1 is safe for
older Metal targets — gating remains correct via the runtime has_bfloat
check in supports_op.

Verified on M4 Max:
  test-backend-ops -b MTL0 -o GET_ROWS:        111/111 PASS (was SIGSEGV)
  test-backend-ops -b MTL0 -o ATTN_SCORE_TBQ:    4/4 PASS (unchanged)
  llama-cli eliza-1-0_8b -ngl 99:              83.1 t/s gen (unchanged)
The K5 Kokoro iSTFT TTS subtree and the omnivoice merged subtree are
both first-class voice surfaces in the eliza-1 stack — TTS GGUFs are
staged in the eliza-1 bundles (omnivoice-base, omnivoice-tokenizer,
kokoro) and the runtime needs loaders for them. Default-off meant no
CLI binary in bin/, no test wiring, and no end-to-end coverage.

Build artifacts produced when these flags are ON:

  bin/kokoro-tts                — Kokoro standalone CLI
  bin/omnivoice-tts             — OmniVoice standalone CLI
  bin/test-kokoro-istft         — iSTFT unit smoke
  bin/test-kokoro-phonemes      — phonemizer unit smoke
  libkokoro_lib.a / libomnivoice_lib.a

Stock llama.cpp consumers that don't want the TTS subtrees can still
opt out with -DLLAMA_BUILD_KOKORO=OFF -DLLAMA_BUILD_OMNIVOICE=OFF.

Verified on build-codex-merge (Darwin arm64, Metal + BLAS):
  test-kokoro-istft: OK (peak=1.2010, len=495)
  test-kokoro-phonemes: OK
GGML_OP_ISTFT (the Kokoro iSTFTNet decoder op) was registered in
ggml.h + implemented on CPU (ggml-cpu/ops.cpp:11415) + CUDA
(ggml-cuda/istft.cu) + had a Metal kernel staged
(ggml-metal/eliza-kernels/istft.metal) and Vulkan pipeline declared
(ggml-vulkan/ggml-vulkan.cpp:4974), but `test-backend-ops -o ISTFT`
ran 0 tests because no test_case existed.

Adds test_istft covering:
- (n_fft=20,  hop=5,  win=20)  x16 frames — Kokoro-82M iSTFTNet shape
- (n_fft=20,  hop=5,  win=20)  x16 frames + explicit Hann window src1
- (n_fft=256, hop=64, win=256) x8  frames — medium 256-FFT vocoder
- (n_fft=512, hop=128,win=512) x8  frames — librosa-style 512-FFT + window

Mirrors the test_attn_score_tbq/polar pattern used in commits 1dfde3a
+ 54baf1d. NMSE tolerance 1e-3 absorbs the GPU-vs-double-precision-IDFT
OLA accumulation order difference.
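
For reference, a minimal sketch of a normalized mean squared error check against the 1e-3 tolerance. It uses the common definition sum((a-b)^2) / sum(a^2) with `a` as the CPU reference; that this matches the harness's exact formula is an assumption, and `sketch_nmse` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// NMSE of candidate `b` against reference `a`, accumulated in double.
// The result is not meaningful when the reference is all zeros.
static double sketch_nmse(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size() && !a.empty());
    double err = 0.0; // sum of squared differences
    double ref = 0.0; // sum of squared reference values
    for (size_t i = 0; i < a.size(); ++i) {
        const double d = (double) a[i] - (double) b[i];
        err += d * d;
        ref += (double) a[i] * (double) a[i];
    }
    return err / ref;
}
```

Because the error is normalized by the reference energy, small reordering differences in overlap-add accumulation stay far below the threshold, while a wrong window or hop size would not.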

Verified on build-codex-merge:
- Metal MTL0:  NOT SUPPORTED (kernel staged but dispatch unwired; tracked
               separately under the Metal backend agent's scope).
- BLAS:        NOT SUPPORTED (expected — BLAS doesn't handle ISTFT).
- CPU:         reference backend, parity self-test inherent in framework.

Once the Metal + Vulkan dispatch glue lands, these four cases will
flip to actual parity assertions against the CPU reference.
llama-cli rejects --no-conversation (-no-cnv) and prints
"please use llama-completion instead", but explicit single-target
builds (`cmake --build ... --target llama-cli`) skipped the redirect
target. The documented escape hatch was missing whenever someone
built only llama-cli.

Add `add_dependencies(llama-cli llama-completion)` (guarded by
`if(TARGET llama-completion)` for builds where the completion subdir
isn't included) so both binaries land in bin/ together.

Verified: rm bin/llama-completion + tools/completion build dir, then
`cmake --build build-codex-merge --target llama-cli` rebuilds both,
and llama-completion smoke-tests at 247 t/s on eliza-1-0_8b-128k.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lalalune lalalune merged commit d5acb48 into main May 18, 2026
78 of 121 checks passed

@lalalune lalalune deleted the eliza/token-trie-sampler branch May 18, 2026 23:30
lalalune added a commit that referenced this pull request May 19, 2026
Three independent failures on every push to eliza/token-trie-sampler:

1. Check vendor — vendor/sheredom/subprocess.h was stale vs the pinned
   upstream URL (b49c56e9fe21...). The fork's earlier "stopping_thread
   hang on child process exit" diff (67a7818) has already been
   incorporated upstream at that pin, so the local file just needed to
   be re-synced with `scripts/sync_vendor.py`. Drop the 5 obsolete
   lines so check-vendor reports a clean tree.

2. flake8 — convert_hf_to_gguf.py imported `transformers.AutoConfig`
   but only referenced it in comments. F401 unused-import. Removed.

3. Python Type-Check (ty) — conversion/qwen.py defines `_Qwen35MtpMixin`
   whose `super().*` chain resolves at runtime via the composed Model
   subclasses in convert_hf_to_gguf.py. ty cannot see that composition
   at the mixin level, so it flagged super().filter_tensors,
   super().set_gguf_parameters, super().prepare_metadata,
   super().modify_tensors and the inherited `ftype` / `metadata`
   attributes as unresolved. Add ./conversion/** to the existing
   unresolved-* override block (same treatment as tools/kokoro/tools).

Verified locally:
  - `python3 scripts/sync_vendor.py` leaves vendor/ unchanged.
  - `python3 -m flake8 convert_hf_to_gguf.py` is clean for F401.
  - ty override covers the previously-flagged lines.

Refs: PR #15