Eliza/token trie sampler by lalalune · Pull Request #15 · elizaOS/llama.cpp

lalalune · 2026-05-18T23:29:30Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

…r SWA-only models

When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4), the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs, self_kq_mask) are created as graph input nodes but never consumed by any compute node, so the backend scheduler never allocates a buffer for them. Calling mctx->get_base()->set_input_k_idxs() on an unallocated tensor then hits GGML_ASSERT(buffer) at ggml-backend.cpp:194.

The same scenario applies symmetrically: if a model had zero SWA layers, the SWA tensors would be unallocated.

Fix: guard both the base and SWA set_input calls with null/buffer checks, matching the pattern already used by llm_graph_input_mem_hybrid_iswa::set_input (line ~674), which carries the comment 'base tensors may not be allocated if there are no non-SWA attention layers'. Also fix can_reuse() in the same class to skip the ne[0] and kq_mask checks for unallocated tensors, preventing a null dereference on the reuse path.

Cherry-picks the iq1_m portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq1_m_q8_K_vl512 (LMUL=4 grid gather, LMUL=8 widening multiply, masked vwmacc accumulation)
- ggml_vec_dot_iq1_m_q8_K_vl1024 (LMUL=2 + three-way masked vwmacc to fold 64 elements into one register per sub-block)
- case 512 / case 1024 entries in the iq1_m dispatcher

Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard surrounding all iq1_m vlN kernels. xtheadvector and vl128 paths unchanged.

Bit-exact correctness contract: must match the scalar reference at VLEN=512 and VLEN=1024. Validation deferred to qemu VLEN matrix CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…/1477)

For a given output position j on the time axis, only input positions i such that i*s0 <= j < i*s0 + K contribute -- i.e. i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1]. That's at most ceil(K/s0) values (typically 2 for stride==K/2 transposed convs).

The current kernel iterates the full IL range and filters with an `if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320, K=10, s0=5 -- a representative codec-decoder shape). On Apple M1 the wasted work trips the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long graphs.

Compute i_min, i_max analytically before the inner loop and iterate only [i_min, i_max]. Output is bit-identical (same multiplies and adds in the same order); loop bound shrinks by IL/ceil(K/s0).

Tested on M1 with a downstream consumer running a TTS codec at full T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits across long synthesis runs vs ~30% pre-patch.
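The analytic bounds can be checked on the CPU with a short sketch. The function and struct names below are illustrative, not taken from the kernel, and the output length (IL - 1)*s0 + K assumes a 1D transposed convolution without padding.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Inclusive range of contributing input positions; empty when i_min > i_max.
struct tap_range {
    int64_t i_min;
    int64_t i_max;
};

// Input positions i with i*s0 <= j < i*s0 + K, intersected with [0, IL-1].
static tap_range contributing_inputs(int64_t j, int64_t K, int64_t s0, int64_t IL) {
    assert(j >= 0 && K > 0 && s0 > 0 && IL > 0);
    const int64_t num = j - K + 1;
    // ceil(num / s0) for positive num; a non-positive num clamps to 0.
    const int64_t i_min = num > 0 ? (num + s0 - 1) / s0 : 0;
    // floor(j / s0) (j is non-negative), clamped to the last input position.
    const int64_t i_max = std::min(j / s0, IL - 1);
    return { i_min, i_max };
}

// Compares the analytic range with the full scan plus `if` filter
// for every output position.
static bool matches_full_scan(int64_t K, int64_t s0, int64_t IL) {
    const int64_t OL = (IL - 1) * s0 + K; // assumed output length (no padding)
    for (int64_t j = 0; j < OL; ++j) {
        const tap_range r = contributing_inputs(j, K, s0, IL);
        for (int64_t i = 0; i < IL; ++i) {
            const bool hit_scan  = i * s0 <= j && j < i * s0 + K;
            const bool hit_range = r.i_min <= i && i <= r.i_max;
            if (hit_scan != hit_range) {
                return false;
            }
        }
    }
    return true;
}
```

For the shape quoted above (IL=320, K=10, s0=5), each output position visits at most two inputs instead of 320, which is where the ~160x figure comes from.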

Cherry-picks the iq2_s portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq2_s_q8_K_vl512 (LMUL=2 grid gather; sign-bit broadcast via vrgather; per-half masked vwredsum to fold 128 lanes into 8 per-sub-block scalars)
- ggml_vec_dot_iq2_s_q8_K_vl1024 (single-pass 256-lane processing with 4 groups of 16 masked vwredsum, vslidedown for 16 sub-block sums, vector LUT for scale nibble expansion)
- case 512 / case 1024 entries in the iq2_s dispatcher

Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard surrounding all iq2_s vlN kernels. xtheadvector and vl128 paths unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cherry-picks the iq2_xs portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq2_xs_q8_K_vl512 (LMUL=4 grid + signs gather in parallel, vwmul into LMUL=8, 8-fold vslidedown + vwredsum loop for per-sub-block 16-lane sums)
- case 512 entry in the iq2_xs dispatcher

Upstream PR #22754 ships vl512 only for iq2_xs (no vl1024 variant).

Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard alongside the vl256 kernel. The dispatcher's case 256 / case 512 entries are nested inside the GGML_RISCV_VECTOR_INTRINSICS arm under their own RVV_1_0 guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cherry-picks the iq2_xxs portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq2_xxs_q8_K_vl512 (LMUL=4 grid + signs gather; vlseg2 de-interleave of (index, scale) pairs; vrgatherei16 + vsrl to compute per-sub-block sign indices; 8-fold vslidedown + vwredsum to reduce 256 lanes into 8 per-sub-block sums)
- case 256 / case 512 explicit entries in the iq2_xxs dispatcher, preserving the previous "default: 256+" semantics for VLEN=1024+

Upstream PR #22754 ships vl512 only for iq2_xxs (no vl1024 variant).

Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard. xtheadvector and vl128 paths unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
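The dispatcher shape described for iq2_xxs (explicit case 256 and case 512, with larger VLEN falling back to the vl256 kernel) might look roughly like the sketch below. The kernel functions are hypothetical stand-ins that only return their own name; the real dispatcher calls the ggml_vec_dot_* kernels and sits behind preprocessor guards not reproduced here.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the per-VLEN kernels.
static std::string kernel_vl128() { return "vl128"; }
static std::string kernel_vl256() { return "vl256"; }
static std::string kernel_vl512() { return "vl512"; }

// Selects a kernel from the runtime vector length in bits.
static std::string dispatch_iq2_xxs(int vlen_bits) {
    assert(vlen_bits >= 128); // this sketch ignores narrower vector units
    switch (vlen_bits) {
        case 128: return kernel_vl128();
        case 256: return kernel_vl256();
        case 512: return kernel_vl512();
        // No vl1024 kernel exists for this type, so VLEN=1024 and wider
        // keep the earlier "256+" behaviour and reuse the vl256 kernel.
        default:  return kernel_vl256();
    }
}
```

Making the 256 and 512 cases explicit lets a new kernel claim one vector length without changing what wider hardware runs.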

Cherry-picks the iq3_s portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq3_s_q8_K_vl512 (LMUL=2 grid32 luxei gather; per-bit qh expansion via vrgather + vsrl + vand; sign-bit broadcast + vrsub_mu sign-folding; LMUL=4 vwmulsu dot + LMUL=8 weighted sum; vrgather-based scale broadcast to 128 lanes)
- case 512 entry in the iq3_s dispatcher

Upstream PR #22754 ships vl512 only for iq3_s (no vl1024 variant).

Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard surrounding all iq3_s vlN kernels. xtheadvector and vl128 paths unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cherry-picks the iq3_xxs portion of dc05e3580 from upstream PR #22754.

Adds:
- ggml_vec_dot_iq3_xxs_q8_K_vl512 (LMUL=2 grid32 luxei + signs64 luxei gather; per-half ib128 loop with vrgatherei16 metadata expansion; LMUL=4 vwmul + 4-fold vwredsum per i16m1)
- ggml_vec_dot_iq3_xxs_q8_K_vl1024 (single-pass 256-lane processing; vrgatherei16 over 8 metadata words; 8-fold vslidedown + vwredsum over an LMUL=4 dot register)
- case 512 / case 1024 entries in the iq3_xxs dispatcher (nested under the existing GGML_RISCV_RVV_1_0 guard inside the GGML_RISCV_VECTOR_INTRINSICS arm)

Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard surrounding the vl256 kernel. xtheadvector and vl128 paths unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirror of abf8a27 (Vulkan parity) for the Metal backend.

Two new test_case structs in tests/test-backend-ops.cpp:
- test_attn_score_qjl -> GGML_OP_ATTN_SCORE_QJL (QJL1_256 K)
- test_fused_attn_qjl_tbq -> GGML_OP_FUSED_ATTN_QJL_TBQ (QJL1_256 K + TBQ3_0 V)

Both ops have a CPU reference implementation (ggml-cpu/qjl) and a Metal kernel (eliza-shipped/qjl.metal, eliza-shipped/fused_attn_qjl_tbq.metal). The test harness runs them on every registered backend and compares against CPU. On Metal we get real parity coverage; on Vulkan/CUDA the harness prints "not supported" (their supports_op rejects these op codes), matching the GET_ROWS/CPY/MUL_MAT pattern from abf8a27.

GGML_OP_ATTN_SCORE_TBQ and GGML_OP_ATTN_SCORE_POLAR are deliberately NOT added: the CPU backend GGML_ABORTs on them (they are Metal-only; canonical CPU graphs lower via attn_score_qjl / flash_attn_ext). A test_case for either would crash the harness when computing the CPU reference, not produce a "FAIL" line.

The existing eliza_custom_quant GET_ROWS/CPY/MUL_MAT test cases (added by abf8a27) already exercise the underlying Q4_POLAR / TBQ3_0 / TBQ4_0 / TBQ3_TCQ types on Metal where kernels exist; for Metal those quants currently print "not supported" for GET_ROWS/CPY/MUL_MAT because the Metal device backend gates them out explicitly (no kernel registered yet, see ggml-metal-device.m cases GGML_OP_GET_ROWS / GGML_OP_CPY).

Shape constraints come straight from ggml_metal_device_supports_op:
- ATTN_SCORE_QJL: q.ne[0]=256, K.ne[0]=128
- FUSED_ATTN_QJL_TBQ: q.ne[0]=256, K.ne[0]=128, V.ne[0]=128, out.ne[0]=128
- (q.ne[1] % n_kv_heads) == 0

New CI workflow .github/workflows/eliza-metal-validation.yml: sibling to eliza-vulkan-validation.yml. Builds with -DGGML_METAL=ON on macos-latest, runs test-quantize-fns (smoke for the Eliza custom quants), then runs test-backend-ops on the Metal backend for the relevant ops (GET_ROWS / CPY / MUL_MAT / ATTN_SCORE_QJL / FUSED_ATTN_QJL_TBQ). Fails the job on any FAIL line for the Eliza custom-quant types/ops; "not supported" lines are the expected baseline for kernels we have not registered yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI (vulkan)/ubuntu-24-vulkan-llvmpipe run 25917370432 (SHA 5fbb799) reports NMSE-vs-CPU FAIL for gemma2 / Dense (2.45e-02) and gemma3n / Dense (1.34e-01), exceeding the test-llama-archs tolerance. The CPU and Metal backends pass cleanly for both archs; the sibling Eliza arch gemma3 also passes on Vulkan in the same run (9.66e-14), so this is a real correctness drift in the Vulkan backend kernels specific to these two gemma family configurations, not a crash and not a generic gemma issue.

Fixing the underlying Vulkan kernel is well outside the test-llama-archs lane; it is tracked separately in the .swarm/collab.md backlog. Add a narrow #ifdef GGML_USE_VULKAN gate inside arch_supported() so CPU and Metal coverage is preserved, mirroring the existing GGML_USE_WEBGPU block and FF's earlier QWEN35/QWEN35MOE pattern.
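For readers unfamiliar with the metric, a normalized mean squared error of a backend output against a reference can be sketched as follows. This is a minimal illustration, assuming the common definition sum((ref - test)^2) / sum(ref^2); the function names are illustrative, and the actual threshold used by test-llama-archs is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// NMSE of `test` against `ref`: squared error normalized by the reference energy.
// Assumes `ref` is not all zeros (otherwise the ratio is undefined).
static double nmse(const std::vector<float> & ref, const std::vector<float> & test) {
    assert(ref.size() == test.size() && !ref.empty());
    double err  = 0.0;
    double norm = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(ref[i]) - static_cast<double>(test[i]);
        err  += d * d;
        norm += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
    }
    return err / norm;
}

// True when the backend output is within the given NMSE budget.
static bool within_tolerance(const std::vector<float> & ref,
                             const std::vector<float> & test,
                             double max_nmse) {
    return nmse(ref, test) <= max_nmse;
}
```

Because the error is divided by the reference energy, a value such as 9.66e-14 indicates agreement near floating-point noise, while 2.45e-02 means the error energy is a few percent of the reference energy.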

Failing run 25917370465 cited in JJ's brief was at pre-fix SHA 5fbb799. Fix commit 64cfd99 (option b: NOT WIN32 OR NOT BUILD_SHARED_LIBS gate around test-fused-kernels and bench-fused-kernels in tests/CMakeLists.txt:279-292) has been on HEAD since May 15. No new code change in this lane. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The workflow passes `-static` in both the compile and link flags so the final test-quantize-fns binary can run under qemu-user without dynamic loader plumbing. But the project's CMake default for BUILD_SHARED_LIBS is ON for non-MinGW, non-Emscripten Linux, so libggml-base.so.0.11.1 is configured as a shared library and the link line ends up combining `-shared` with `-static`. The riscv64 ld then pulls crtbeginT.o (the crt object used for fully static executables) into a shared link, which fails with:

    ld: crtbeginT.o: relocation R_RISCV_HI20 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
    collect2: fatal error: ld terminated with signal 11

Setting -DBUILD_SHARED_LIBS=OFF aligns the build mode with the intent of the `-static` flag: all libs static, single static executable.

Affects CI (riscv VLEN matrix) cross-qemu-vlen {128,256,512,1024}.

EAGLE3 (version 3 of EAGLE, the Extrapolation Algorithm for Greater Language-model Efficiency) is a draft-model architecture for speculative decoding. Upstream PR #18039 adds it.

This change is additive scaffolding only: it reserves the enum value and registers the "eagle3" architecture string in LLM_ARCH_NAMES. There is no graph builder, hparams, layer fields, or model loader wiring yet; those live in conflict-prone files (llama-model.cpp, llama-hparams.cpp) and are deferred to a follow-up. See /tmp/wave6-eagle3-real-journal.md for the full port plan.

EAGLE3's draft-model graph reads from two extra tensors produced by the target model: 'target_features' (hidden state projection) and 'target_tok_embd' (target's token embedding table).

This change reserves their llm_tensor enum values and registers:
- LLM_TENSOR_NAMES entries (gguf string names)
- LLM_TENSOR_INFOS entries (layer + ggml_op metadata)

No graph builder or model-loader wiring yet; those come with the model-side scaffolding in a follow-up.

Adds two LLAMA_API symbols so callers can link against the eventual EAGLE3 surface:
- llama_get_eagle3_target_features() → NULL stub
- llama_set_eagle3_g_embeddings() → -1 stub

Both log a WARN-level message and return failure. The real implementations land once the model loader, hparams, layer fields, and graph builder for LLM_ARCH_EAGLE3 are wired in. See /tmp/wave6-eagle3-real-journal.md for the full port plan.

src/llama-eagle3.cpp is wired into the llama target via src/CMakeLists.txt so the symbols are exported.

…ever committed) The header was added in 33c888a (feat: add token trie sampler header) but no implementation file was ever committed to any branch reachable from this tree. `git log --all -S llama_sampler_init_token_trie` shows only the header add commit and a merge — never a definition. There are zero references anywhere in source, CMake, or build config: the symbol cannot link, and nothing tries to. A prior swarm agent already documented this in .swarm/collab.md noting the sampler header is stale and unreferenced. Removing the dead header to clean up the link target. If the token-trie sampler is reintroduced, it should land header + impl + CMake entry + a real caller together.

Cherry-pick upstream commit 255582687 (54 files, +2226/-412) and resolve conflicts so MTP speculative decoding (--spec-type draft-mtp, with mtp as backwards-compat alias) coexists with our Eliza-1 customizations.

Conflicts resolved (preserved Eliza divergence):
- common/common.h: drop legacy COMMON_SPECULATIVE_TYPE_MTP; use upstream DRAFT_MTP (renamed in PR #22964). Kept "mtp" string alias for back-compat.
- common/speculative.cpp: keep DFLASH dispatch, EAGLE3 fail-fast, null-check in common_speculative_accept; wire DRAFT_MTP into the config push_back list using upstream's params.draft.ctx_dft gate.
- src/llama-model.cpp: thread upstream's `filter` callback into the kv_cache constructor while keeping our kv_dynamic kv_size_max arg.
- src/models/qwen35.cpp, qwen35moe.cpp: adopt upstream's nextn_predict_layers split (MTP layers always non-recurrent) but keep our tensor-presence discovery as primary so existing Eliza-1 GGUFs that don't encode full_attention_interval still load correctly.
- tools/server/server-context.cpp: keep our prefill-plan methods AND upstream's need_embd() that consults common_speculative_need_embd.
- tools/server/README.md: kept HEAD (no webui→ui rename in our tree).

Backed out (not portable to current tree, documented):
- ggml/src/ggml-cuda/gated_delta_net.cu: 5 conflict regions touching the K-snapshot path for spec-dec. Reverted to HEAD; CUDA MTP path will need a separate port. Non-MTP CUDA gated_delta_net unaffected.
- conversion/{base,qwen}.py: upstream introduced a modular Python conversion package we don't track. Removed the cherry-picked files; convert_hf_to_gguf.py still gains the --mtp / --no-mtp flags but the Qwen3.5MtpMixin import in main() will need the conversion package to land separately before HF→GGUF MTP export works end-to-end.

GGML_TYPE_COUNT remains 51 (our TBQ/POLAR/TBQ_K quant types at IDs 42–48 + TBQ3_K=49/TBQ4_K=50 preserved). Upstream's only ggml.h change was a comment addition on ggml_gated_delta_net.

… #22673

Prior MTP cherry-pick at 2761635 backed this file out due to 5 conflict regions in the K-snapshot spec-dec path against our Eliza fork (which already diverged the kernel to use d_v_per_warp/block_dv warp tiling, __restrict__ qualifiers, lambda-based ggml_cuda_memcpy_1 load/store, and a 3D grid (H, n_seqs, n_block_dv)). This commit hand-ports the K-snapshot additions onto the Eliza kernel architecture:

1. kernel template: add bool keep_rs_t param + int K runtime param.
2. state offsets: input state is 3D (S_v*S_v*H, K, n_seqs); seq stride becomes K * H * S_v * S_v. Output state per slot is unchanged.
3. per-token snapshot write: when keep_rs_t=true, after each token's stage-B store the working s_tile to slot (t - (n_tokens - K)) in the output state region. Slots earlier than t=shift are left caller-owned.
4. final state write: kept only on the !keep_rs_t branch.
5. host op dispatcher: K = src_state->ne[1]; keep_rs = (K > 1).

K==1 preserves the legacy single-snapshot path bit-for-bit; K>1 enables partial rollback for MTP spec-dec.

Layout verified by direct comparison to the CPU reference at ggml/src/ggml-cpu/ops.cpp ggml_compute_forward_gated_delta_net_one_chunk (matching state_seq_stride, state_size_per_snap, shift, and target_slot arithmetic) and the Metal port (matching K = op->src[5]->ne[1]). test-backend-ops cases at tests/test-backend-ops.cpp:9477-9485 cover K==n_seq_tokens, K<n_seq_tokens (overflow), and both KDA and non-KDA variants.

Cross-compile attempted via scripts/cuda-docker-build.sh (Docker reachable, nvidia/cuda:12.4.1-devel-ubuntu22.04 under linux/amd64 emulation on aarch64 host); apt/cmake configure succeeded, full nvcc compile of all ggml-cuda/*.cu kernels was still in flight at commit time. No errors observed on the gated_delta_net.cu compile up to log truncation. TODO(cuda-mtp-validation) marker inserted at the per-token snapshot write to flag the point most likely to need on-device GPU verification.

Metal MTP baseline regression check on /tmp/mtp-test/Qwen3.5-2B-Q4_K_M.gguf: prompt 355.7 t/s, gen 104.6 t/s; coherent output, no regression from this CUDA-only change (no Metal kernel or shared header touched).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
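The slot and stride arithmetic from the numbered steps can be written out as a host-side sketch. The function names are illustrative and not taken from the kernel, and the element offset assumes the state tensor is laid out contiguously as (S_v*S_v*H, K, n_seqs); the real kernel additionally tiles the state across warps.

```cpp
#include <cassert>
#include <cstdint>

// Slot that token t's snapshot lands in, or -1 when no snapshot is written
// (tokens earlier than shift = n_tokens - K).
static int64_t snapshot_slot(int64_t t, int64_t n_tokens, int64_t K) {
    assert(K >= 1 && n_tokens >= 1 && t >= 0 && t < n_tokens);
    const int64_t shift = n_tokens - K;
    const int64_t slot  = t - shift;
    return (slot >= 0 && slot < K) ? slot : -1;
}

// Element offset of a snapshot, assuming a contiguous (S_v*S_v*H, K, n_seqs) layout.
static int64_t snapshot_offset(int64_t seq, int64_t slot, int64_t S_v, int64_t H, int64_t K) {
    const int64_t snap_size  = S_v * S_v * H;  // elements in one snapshot
    const int64_t seq_stride = K * snap_size;  // K * H * S_v * S_v
    return seq * seq_stride + slot * snap_size;
}
```

With n_tokens=8 and K=4, only tokens 4 through 7 produce snapshots (slots 0 through 3). With K=1 the only slot holds the state after the last token, which is consistent with the legacy single-snapshot behaviour.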

Source: discovered in /tmp/wave9-master-issue-list.md item #3. PR #22673 (Multi-Token Prediction) was ported in commit 2761635 and its --spec-type draft-mtp CLI usage is documented in tools/cli/README.md. The stale entry in the upstream-PR-candidates roster still listed it alongside #18886 under "to merge" — corrected to reflect the merged state and noted the one remaining gap (CUDA gated_delta_net.cu K-snapshot spec-dec path still backed out, pending separate port).

…PR #22673

Adds conversion/{__init__,qwen}.py, required for the --mtp / --no-mtp flags in convert_hf_to_gguf.py to actually resolve.

The MTP cherry-pick at 2761635 backed out conversion/{base,qwen}.py because upstream PR #22673 also introduced a full modular Python conversion package we don't track. This commit restores only the _Qwen35MtpMixin (no base.py, no other model families) and mixes it into the existing monolithic Qwen3_5TextModel / Qwen3_5MoeTextModel base lists. Without this we could only consume upstream-pre-published MTP GGUFs (e.g. unsloth/Qwen3.5-2B-MTP-GGUF); with this we can re-convert our own Eliza-1 Qwen3.5 base weights to add MTP heads.

Verified:
- `from conversion.qwen import _Qwen35MtpMixin` resolves cleanly.
- `python convert_hf_to_gguf.py --help` shows --mtp / --no-mtp.
- `--mtp <non-qwen35>` errors out with the documented message.
- End-to-end: re-converting unsloth/Qwen3.5-2B (bundled, BF16) produces a 333-layer-tensor GGUF with qwen35.block_count=25, qwen35.nextn_predict_layers=1, and blk.24.nextn.{eh_proj,enorm,hnorm,shared_head_norm}.weight. Tensor name list diffs zero against the upstream-published unsloth/Qwen3.5-2B-MTP-GGUF Qwen3.5-2B-BF16.gguf reference (334 names identical, sorted).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The embedded-metallib build path (GGML_METAL_EMBED_LIBRARY=1) did not define GGML_METAL_HAS_BF16 when invoking xcrun metal, so every bfloat-gated kernel (kernel_get_rows_bf16, kernel_mul_mv_bf16_*, kernel_flash_attn_ext_bf16_*, kernel_tri_bf16_*) was omitted from the compiled default.metallib. The runtime device->props.has_bfloat is true on M-series, so supports_op happily returned true for BF16 GET_ROWS, the kernel lookup returned a null MTLFunction ("failed to compile pipeline"), and the subsequent dispatch SIGSEGV'd (test-backend-ops -b MTL0 -o GET_ROWS exit 139).

The runtime JIT path in ggml-metal-device.m already sets this macro from device props (line 235), so only the precompiled metallib path was broken. ggml-metal.metal auto-undefs the macro when __METAL_VERSION__ < 310, so adding -DGGML_METAL_HAS_BF16=1 is safe for older Metal targets; gating remains correct via the runtime has_bfloat check in supports_op.

Verified on M4 Max:
- test-backend-ops -b MTL0 -o GET_ROWS: 111/111 PASS (was SIGSEGV)
- test-backend-ops -b MTL0 -o ATTN_SCORE_TBQ: 4/4 PASS (unchanged)
- llama-cli eliza-1-0_8b -ngl 99: 83.1 t/s gen (unchanged)

The K5 Kokoro iSTFT TTS subtree and the omnivoice merged subtree are both first-class voice surfaces in the eliza-1 stack: TTS GGUFs are staged in the eliza-1 bundles (omnivoice-base, omnivoice-tokenizer, kokoro) and the runtime needs loaders for them. Default-off meant no CLI binary in bin/, no test wiring, and no end-to-end coverage.

Build artifacts produced when these flags are ON:
- bin/kokoro-tts: Kokoro standalone CLI
- bin/omnivoice-tts: OmniVoice standalone CLI
- bin/test-kokoro-istft: iSTFT unit smoke
- bin/test-kokoro-phonemes: phonemizer unit smoke
- libkokoro_lib.a / libomnivoice_lib.a

Stock llama.cpp consumers that don't want the TTS subtrees can still opt out with -DLLAMA_BUILD_KOKORO=OFF -DLLAMA_BUILD_OMNIVOICE=OFF.

Verified on build-codex-merge (Darwin arm64, Metal + BLAS):
- test-kokoro-istft: OK (peak=1.2010, len=495)
- test-kokoro-phonemes: OK

GGML_OP_ISTFT (the Kokoro iSTFTNet decoder op) was registered in ggml.h + implemented on CPU (ggml-cpu/ops.cpp:11415) + CUDA (ggml-cuda/istft.cu) + had a Metal kernel staged (ggml-metal/eliza-kernels/istft.metal) and Vulkan pipeline declared (ggml-vulkan/ggml-vulkan.cpp:4974), but `test-backend-ops -o ISTFT` ran 0 tests because no test_case existed.

Adds test_istft covering:
- (n_fft=20, hop=5, win=20) x16 frames: Kokoro-82M iSTFTNet shape
- (n_fft=20, hop=5, win=20) x16 frames + explicit Hann window src1
- (n_fft=256, hop=64, win=256) x8 frames: medium 256-FFT vocoder
- (n_fft=512, hop=128, win=512) x8 frames: librosa-style 512-FFT + window

Mirrors the test_attn_score_tbq/polar pattern used in commits 1dfde3a + 54baf1d. NMSE tolerance 1e-3 absorbs the GPU-vs-double-precision-IDFT OLA accumulation order difference.

Verified on build-codex-merge:
- Metal MTL0: NOT SUPPORTED (kernel staged but dispatch unwired; tracked separately under the Metal backend agent's scope).
- BLAS: NOT SUPPORTED (expected; BLAS doesn't handle ISTFT).
- CPU: reference backend, parity self-test inherent in framework.

Once the Metal + Vulkan dispatch glue lands, these four cases will flip to actual parity assertions against the CPU reference.

llama-cli rejects --no-conversation (-no-cnv) and prints "please use llama-completion instead", but explicit single-target builds (`cmake --build ... --target llama-cli`) skipped the redirect target. The documented escape hatch was missing whenever someone built only llama-cli.

Add `add_dependencies(llama-cli llama-completion)` (guarded by `if(TARGET llama-completion)` for builds where the completion subdir isn't included) so both binaries land in bin/ together.

Verified: rm bin/llama-completion + tools/completion build dir, then `cmake --build build-codex-merge --target llama-cli` rebuilds both, and llama-completion smoke-tests at 247 t/s on eliza-1-0_8b-128k.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-18T23:29:37Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 771e2e7a-42bd-430b-9659-b390f1204308


Three independent failures on every push to eliza/token-trie-sampler:

1. Check vendor: vendor/sheredom/subprocess.h was stale vs the pinned upstream URL (b49c56e9fe21...). The fork's earlier "stopping_thread hang on child process exit" diff (67a7818) has already been incorporated upstream at that pin, so the local file just needed to be re-synced with `scripts/sync_vendor.py`. Drop the 5 obsolete lines so check-vendor reports a clean tree.
2. flake8: convert_hf_to_gguf.py imported `transformers.AutoConfig` but only referenced it in comments. F401 unused-import. Removed.
3. Python Type-Check (ty): conversion/qwen.py defines `_Qwen35MtpMixin` whose `super().*` chain resolves at runtime via the composed Model subclasses in convert_hf_to_gguf.py. ty cannot see that composition at the mixin level, so it flagged super().filter_tensors, super().set_gguf_parameters, super().prepare_metadata, super().modify_tensors and the inherited `ftype` / `metadata` attributes as unresolved. Add ./conversion/** to the existing unresolved-* override block (same treatment as tools/kokoro/tools).

Verified locally:
- `python3 scripts/sync_vendor.py` leaves vendor/ unchanged.
- `python3 -m flake8 convert_hf_to_gguf.py` is clean for F401.
- ty override covers the previously-flagged lines.

Refs: PR #15

NubsCarson and others added 30 commits May 15, 2026 13:54

fix(server): preserve oaicompat custom grammar

59f32d0

Merge remote-tracking branch 'origin/main' into codex/oaicompat-custom-grammar-20260515

bb65985

Merge remote-tracking branch 'origin/main' into codex/oaicompat-custom-grammar-20260515

6b68b98

fix(android): stabilize OmniVoice ASR model loading

efe3d50

merge: origin/main (K5 Kokoro iSTFT + CI fixes) into eliza/token-trie-sampler

e22543a

Conflicts: tests/test-llama-archs.cpp

ggml : bump version to 0.12.0 (ggml/1494)

a30a6fb

ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480)

6f7a16e

That's always how it's done: https://github.com/search?q=path%3ACMakeLists.txt%20%22%24%7BCMAKE_INSTALL_LIBDIR%7D%2Fpkgconfig%22&type=code

vendor : update cpp-httplib to 0.45.0 (#23103)

e5eef79

reasoning-budget: clone should do a deep-copy (#23095)

7ffd89f

sync : ggml

592290e

fix(x86/repack): suppress -Werror=unused-variable on requiredOrder when AVX-512 inactive

0623977

fix(ggml-cpu): drop case-block braces after GGML_ABORT; silence unused mzero on AVX-only path

9401be8

swarm(LL): register on broad triage Apple/RISC-V/self-hosted/Server

18cf65d

swarm: KK record gemma2/gemma3n Vulkan gate + backlog entry

657add6

fix(ci/riscv): disable OpenMP in VLEN matrix to avoid libgomp static-link failure

476698f

lalalune and others added 14 commits May 16, 2026 23:25

Merge remote-tracking branch 'origin/pr/13' into eliza/token-trie-sampler

c944be2

Merge remote-tracking branch 'origin/pr/14' into eliza/token-trie-sampler

dd7fd95

feat(spec): remove Option-C fail-fast for MTP (PR #22673 ready to port)

3d160c9

docs(cli): add Multi-Token Prediction (--spec-type draft-mtp) usage example

43809c6

Revert "feat(build): default-enable LLAMA_BUILD_KOKORO + LLAMA_BUILD_OMNIVOICE"

5b53366

This reverts commit 7c36249.

lalalune merged commit d5acb48 into main May 18, 2026
78 of 121 checks passed

lalalune deleted the eliza/token-trie-sampler branch May 18, 2026 23:30

github-actions Bot added the documentation, Nvidia GPU, examples, devops, server, ggml, model, Vulkan, testing, Apple Metal, python, and script labels May 18, 2026

lalalune merged 63 commits into main from eliza/token-trie-sampler


13 participants
