Eliza/token trie sampler #15
Merged
…m-grammar-20260515
…r SWA-only models
When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4), the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs, self_kq_mask) are created as graph input nodes but never consumed by any compute node, so the backend scheduler never allocates a buffer for them. Calling mctx->get_base()->set_input_k_idxs() on an unallocated tensor then hits GGML_ASSERT(buffer) at ggml-backend.cpp:194.
The same scenario applies symmetrically: if a model had zero SWA layers, the SWA tensors would be unallocated.
Fix: guard both the base and SWA set_input calls with null/buffer checks, matching the pattern already used by llm_graph_input_mem_hybrid_iswa::set_input (line ~674), which has the comment 'base tensors may not be allocated if there are no non-SWA attention layers'.
Also fix can_reuse() in the same class to skip the ne[0] and kq_mask checks for unallocated tensors, preventing a null dereference on the reuse path.
…-sampler
# Conflicts:
#	tests/test-llama-archs.cpp
Cherry-picks the iq1_m portion of dc05e3580 from upstream PR #22754.
Adds:
- ggml_vec_dot_iq1_m_q8_K_vl512 (LMUL=4 grid gather, LMUL=8 widening multiply, masked vwmacc accumulation)
- ggml_vec_dot_iq1_m_q8_K_vl1024 (LMUL=2 + three-way masked vwmacc to fold 64 elements into one register per sub-block)
- case 512 / case 1024 entries in the iq1_m dispatcher
Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard surrounding all iq1_m vlN kernels. xtheadvector and vl128 paths unchanged.
Bit-exact correctness contract: must match the scalar reference at VLEN=512 and VLEN=1024. Validation deferred to qemu VLEN matrix CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/1477)
For a given output position j on the time axis, only input positions i such that i*s0 <= j < i*s0 + K contribute -- i.e. i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1]. That's at most ceil(K/s0) values (typically 2 for stride==K/2 transposed convs).
The current kernel iterates the full IL range and filters with an `if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320, K=10, s0=5 -- a representative codec-decoder shape). On Apple M1 the wasted work trips the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long graphs.
Compute i_min, i_max analytically before the inner loop and iterate only [i_min, i_max]. Output is bit-identical (same multiplies and adds in the same order); loop bound shrinks by IL/ceil(K/s0).
Tested on M1 with a downstream consumer running a TTS codec at full T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits across long synthesis runs vs ~30% pre-patch.
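The bound derivation in this commit can be checked with a small single-channel model. The sketch below is illustrative only: it is plain Python rather than the Metal kernel, and the function names (`contributing_range`, `conv_transpose_1d_naive`, `conv_transpose_1d_bounded`) are hypothetical, not taken from the patch.

```python
def contributing_range(j, K, s0, IL):
    """Inclusive (i_min, i_max) of input positions that reach output j."""
    # ceil((j - K + 1) / s0) computed with integer floor division,
    # then clamped to the valid input range [0, IL - 1].
    i_min = max(0, -((K - 1 - j) // s0))
    i_max = min(IL - 1, j // s0)
    return i_min, i_max


def conv_transpose_1d_naive(x, w, s0):
    """Scan every input position and filter with an `if` (pre-patch shape)."""
    IL, K = len(x), len(w)
    out = [0.0] * ((IL - 1) * s0 + K)
    for j in range(len(out)):
        for i in range(IL):
            k = j - i * s0
            if 0 <= k < K:
                out[j] += x[i] * w[k]
    return out


def conv_transpose_1d_bounded(x, w, s0):
    """Visit only [i_min, i_max]; same products, same ascending-i order."""
    IL, K = len(x), len(w)
    out = [0.0] * ((IL - 1) * s0 + K)
    for j in range(len(out)):
        i_min, i_max = contributing_range(j, K, s0, IL)
        for i in range(i_min, i_max + 1):
            out[j] += x[i] * w[j - i * s0]
    return out
```

Because both loops accumulate the same products in ascending `i`, the two functions should agree exactly rather than only within a tolerance, which is the property the commit relies on when it calls the output bit-identical.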
Cherry-picks the iq2_s portion of dc05e3580 from upstream PR #22754.
Adds:
- ggml_vec_dot_iq2_s_q8_K_vl512 (LMUL=2 grid gather; sign-bit broadcast via vrgather; per-half masked vwredsum to fold 128 lanes into 8 per-sub-block scalars)
- ggml_vec_dot_iq2_s_q8_K_vl1024 (single-pass 256-lane processing with 4 groups of 16 masked vwredsum, vslidedown for 16 sub-block sums, vector LUT for scale nibble expansion)
- case 512 / case 1024 entries in the iq2_s dispatcher
Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard surrounding all iq2_s vlN kernels. xtheadvector and vl128 paths unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cherry-picks the iq2_xs portion of dc05e3580 from upstream PR #22754.
Adds:
- ggml_vec_dot_iq2_xs_q8_K_vl512 (LMUL=4 grid + signs gather in parallel, vwmul into LMUL=8, 8-fold vslidedown + vwredsum loop for per-sub-block 16-lane sums)
- case 512 entry in the iq2_xs dispatcher
Upstream PR #22754 ships vl512 only for iq2_xs (no vl1024 variant).
Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard alongside the vl256 kernel. The dispatcher's case 256 / case 512 entries are nested inside the GGML_RISCV_VECTOR_INTRINSICS arm under their own RVV_1_0 guard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…en AVX-512 inactive
Cherry-picks the iq2_xxs portion of dc05e3580 from upstream PR #22754.
Adds:
- ggml_vec_dot_iq2_xxs_q8_K_vl512 (LMUL=4 grid + signs gather; vlseg2 de-interleave of (index, scale) pairs; vrgatherei16 + vsrl to compute per-sub-block sign indices; 8-fold vslidedown + vwredsum to reduce 256 lanes into 8 per-sub-block sums)
- case 256 / case 512 explicit entries in the iq2_xxs dispatcher, preserving the previous "default: 256+" semantics for VLEN=1024+
Upstream PR #22754 ships vl512 only for iq2_xxs (no vl1024 variant).
Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard. xtheadvector and vl128 paths unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cherry-picks the iq3_s portion of dc05e3580 from upstream PR #22754.
Adds:
- ggml_vec_dot_iq3_s_q8_K_vl512 (LMUL=2 grid32 luxei gather; per-bit qh expansion via vrgather + vsrl + vand; sign-bit broadcast + vrsub_mu sign-folding; LMUL=4 vwmulsu dot + LMUL=8 weighted sum; vrgather-based scale broadcast to 128 lanes)
- case 512 entry in the iq3_s dispatcher
Upstream PR #22754 ships vl512 only for iq3_s (no vl1024 variant).
Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard surrounding all iq3_s vlN kernels. xtheadvector and vl128 paths unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d mzero on AVX-only path
Cherry-picks the iq3_xxs portion of dc05e3580 from upstream PR #22754.
Adds:
- ggml_vec_dot_iq3_xxs_q8_K_vl512 (LMUL=2 grid32 luxei + signs64 luxei gather; per-half ib128 loop with vrgatherei16 metadata expansion; LMUL=4 vwmul + 4-fold vwredsum per i16m1)
- ggml_vec_dot_iq3_xxs_q8_K_vl1024 (single-pass 256-lane processing; vrgatherei16 over 8 metadata words; 8-fold vslidedown + vwredsum over an LMUL=4 dot register)
- case 512 / case 1024 entries in the iq3_xxs dispatcher (nested under the existing GGML_RISCV_RVV_1_0 guard inside the GGML_RISCV_VECTOR_INTRINSICS arm)
Rebased onto our PR #20682 macro scheme: kept under the existing '#if GGML_RISCV_RVV_1_0' guard surrounding the vl256 kernel. xtheadvector and vl128 paths unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of abf8a27 (Vulkan parity) for the Metal backend.
Two new test_case structs in tests/test-backend-ops.cpp:
- test_attn_score_qjl -> GGML_OP_ATTN_SCORE_QJL (QJL1_256 K)
- test_fused_attn_qjl_tbq -> GGML_OP_FUSED_ATTN_QJL_TBQ (QJL1_256 K + TBQ3_0 V)
Both ops have a CPU reference implementation (ggml-cpu/qjl) and a Metal kernel (eliza-shipped/qjl.metal, eliza-shipped/fused_attn_qjl_tbq.metal). The test harness runs them on every registered backend and compares against CPU. On Metal we get real parity coverage; on Vulkan/CUDA the harness prints "not supported" (their supports_op rejects these op codes), matching the GET_ROWS/CPY/MUL_MAT pattern from abf8a27.
GGML_OP_ATTN_SCORE_TBQ and GGML_OP_ATTN_SCORE_POLAR are deliberately NOT added: the CPU backend GGML_ABORTs on them (they are Metal-only; canonical CPU graphs lower via attn_score_qjl / flash_attn_ext). A test_case for either would crash the harness when computing the CPU reference, not produce a "FAIL" line.
The existing eliza_custom_quant GET_ROWS/CPY/MUL_MAT test cases (added by abf8a27) already exercise the underlying Q4_POLAR / TBQ3_0 / TBQ4_0 / TBQ3_TCQ types on Metal where kernels exist; for Metal those quants currently print "not supported" for GET_ROWS/CPY/MUL_MAT because the Metal device backend gates them out explicitly (no kernel registered yet, see ggml-metal-device.m cases GGML_OP_GET_ROWS / GGML_OP_CPY).
Shape constraints come straight from ggml_metal_device_supports_op:
- ATTN_SCORE_QJL: q.ne[0]=256, K.ne[0]=128
- FUSED_ATTN_QJL_TBQ: q.ne[0]=256, K.ne[0]=128, V.ne[0]=128, out.ne[0]=128
- (q.ne[1] % n_kv_heads) == 0
New CI workflow .github/workflows/eliza-metal-validation.yml: sibling to eliza-vulkan-validation.yml. Builds with -DGGML_METAL=ON on macos-latest, runs test-quantize-fns (smoke for the Eliza custom quants), then runs test-backend-ops on the Metal backend for the relevant ops (GET_ROWS / CPY / MUL_MAT / ATTN_SCORE_QJL / FUSED_ATTN_QJL_TBQ). Fails the job on any FAIL line for the Eliza custom-quant types/ops; "not supported" lines are the expected baseline for kernels we have not registered yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
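The shape constraints listed in this commit can be restated as a small predicate, which may help when choosing test shapes. This is only a sketch of the stated rules: the function `qjl_shapes_supported` and its argument layout are hypothetical and do not mirror the actual `ggml_metal_device_supports_op` code.

```python
def qjl_shapes_supported(op, q_ne, k_ne0, n_kv_heads, v_ne0=None, out_ne0=None):
    """Return True if the shapes satisfy the constraints quoted in the commit.

    q_ne is (q.ne[0], q.ne[1]); the other arguments are single dimensions.
    """
    # Constraints shared by both ops, as listed in the commit message.
    if q_ne[0] != 256 or k_ne0 != 128:
        return False
    if q_ne[1] % n_kv_heads != 0:
        return False
    if op == "ATTN_SCORE_QJL":
        return True
    if op == "FUSED_ATTN_QJL_TBQ":
        # The fused op additionally pins the V and output row widths.
        return v_ne0 == 128 and out_ne0 == 128
    return False
```

A test generator could call this before emitting a case, so that shapes a backend would report as "not supported" are filtered out early.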
CI (vulkan)/ubuntu-24-vulkan-llvmpipe run 25917370432 (SHA 5fbb799) reports NMSE-vs-CPU FAIL for gemma2 / Dense (2.45e-02) and gemma3n / Dense (1.34e-01), exceeding the test-llama-archs tolerance. CPU and Meta backends pass cleanly for both archs; sibling Eliza arch gemma3 also passes on Vulkan in the same run (9.66e-14), so this is a real correctness drift in the Vulkan backend kernels specific to these two gemma family configurations, not a crash and not a generic gemma issue.
Fixing the underlying Vulkan kernel is well outside the test-llama-archs lane; tracked separately in .swarm/collab.md backlog. Add a narrow #ifdef GGML_USE_VULKAN gate inside arch_supported() so CPU and Meta coverage is preserved, mirroring the existing GGML_USE_WEBGPU block and FF's earlier QWEN35/QWEN35MOE pattern.
Failing run 25917370465 cited in JJ's brief was at pre-fix SHA 5fbb799. Fix commit 64cfd99 (option b: NOT WIN32 OR NOT BUILD_SHARED_LIBS gate around test-fused-kernels and bench-fused-kernels in tests/CMakeLists.txt:279-292) has been on HEAD since May 15. No new code change in this lane.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The workflow passes `-static` to both compile and link flags so the
final test-quantize-fns binary can run under qemu-user without
dynamic loader plumbing. But CMake's BUILD_SHARED_LIBS default is ON
for non-MinGW non-Emscripten Linux, so libggml-base.so.0.11.1 is
configured as a shared library and the link line ends up combining
`-shared` with `-static`. The riscv64 ld then pulls in crtbeginT.o
(the crt object for fully static, non-PIE links) into a shared link, which fails with:
    ld: crtbeginT.o: relocation R_RISCV_HI20 against `a local symbol'
        can not be used when making a shared object; recompile with -fPIC
    collect2: fatal error: ld terminated with signal 11
Setting -DBUILD_SHARED_LIBS=OFF aligns the build mode with the
`-static` flag intent: all libs static, single static executable.
Affects CI (riscv VLEN matrix) cross-qemu-vlen {128,256,512,1024}.
EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency, version 3) is a draft-model architecture for speculative decoding. Upstream PR #18039 adds it.
This change is additive scaffolding only: it reserves the enum value and registers the "eagle3" architecture string in LLM_ARCH_NAMES. No graph builder, hparams, layer fields, or model loader wiring yet; those live in conflict-prone files (llama-model.cpp, llama-hparams.cpp) and are deferred to a follow-up. See /tmp/wave6-eagle3-real-journal.md for the full port plan.
EAGLE3's draft-model graph reads from two extra tensors produced by the target model: 'target_features' (hidden state projection) and 'target_tok_embd' (target's token embedding table). This change reserves their llm_tensor enum values and registers:
- LLM_TENSOR_NAMES entries (gguf string names)
- LLM_TENSOR_INFOS entries (layer + ggml_op metadata)
No graph builder or model-loader wiring yet; those come with the model-side scaffolding in a follow-up.
Adds two LLAMA_API symbols so callers can link against the eventual EAGLE3 surface:
- llama_get_eagle3_target_features() → NULL stub
- llama_set_eagle3_g_embeddings() → -1 stub
Both log a WARN-level message and return failure. The real implementations land once the model loader, hparams, layer fields, and graph builder for LLM_ARCH_EAGLE3 are wired in. See /tmp/wave6-eagle3-real-journal.md for the full port plan.
src/llama-eagle3.cpp is wired into the llama target via src/CMakeLists.txt so the symbols are exported.
…ever committed)
The header was added in 33c888a (feat: add token trie sampler header) but no implementation file was ever committed to any branch reachable from this tree. `git log --all -S llama_sampler_init_token_trie` shows only the header add commit and a merge, never a definition. There are zero references anywhere in source, CMake, or build config: the symbol cannot link, and nothing tries to. A prior swarm agent already documented this in .swarm/collab.md, noting the sampler header is stale and unreferenced.
Removing the dead header to clean up the link target. If the token-trie sampler is reintroduced, it should land header + impl + CMake entry + a real caller together.
Cherry-pick upstream commit 255582687 (54 files, +2226/-412) and resolve
conflicts so MTP speculative decoding (--spec-type draft-mtp, with mtp as
backwards-compat alias) coexists with our Eliza-1 customizations.
Conflicts resolved (preserved Eliza divergence):
- common/common.h: drop legacy COMMON_SPECULATIVE_TYPE_MTP; use upstream
DRAFT_MTP (renamed in PR #22964). Kept "mtp" string alias for back-compat.
- common/speculative.cpp: keep DFLASH dispatch, EAGLE3 fail-fast, null-check
in common_speculative_accept; wire DRAFT_MTP into the config push_back
list using upstream's params.draft.ctx_dft gate.
- src/llama-model.cpp: thread upstream's `filter` callback into the
kv_cache constructor while keeping our kv_dynamic kv_size_max arg.
- src/models/qwen35.cpp, qwen35moe.cpp: adopt upstream's
nextn_predict_layers split (MTP layers always non-recurrent) but keep
our tensor-presence discovery as primary so existing Eliza-1 GGUFs
that don't encode full_attention_interval still load correctly.
- tools/server/server-context.cpp: keep our prefill-plan methods AND
upstream's need_embd() that consults common_speculative_need_embd.
- tools/server/README.md: kept HEAD (no webui→ui rename in our tree).
Backed out (not portable to current tree, documented):
- ggml/src/ggml-cuda/gated_delta_net.cu: 5 conflict regions touching the
K-snapshot path for spec-dec. Reverted to HEAD; CUDA MTP path will need
a separate port. Non-MTP CUDA gated_delta_net unaffected.
- conversion/{base,qwen}.py: upstream introduced a modular Python
conversion package we don't track. Removed the cherry-picked files;
convert_hf_to_gguf.py still gains the --mtp / --no-mtp flags but the
Qwen3.5MtpMixin import in main() will need the conversion package to
land separately before HF→GGUF MTP export works end-to-end.
GGML_TYPE_COUNT remains 51 (our TBQ/POLAR/TBQ_K quant types at IDs
42–48 + TBQ3_K=49/TBQ4_K=50 preserved). Upstream's only ggml.h change
was a comment addition on ggml_gated_delta_net.
… #22673
Prior MTP cherry-pick at 2761635 backed this file out due to 5 conflict regions in the K-snapshot spec-dec path against our Eliza fork (which already diverged the kernel to use d_v_per_warp/block_dv warp tiling, __restrict__ qualifiers, lambda-based ggml_cuda_memcpy_1 load/store, and a 3D grid (H, n_seqs, n_block_dv)). This commit hand-ports the K-snapshot additions onto the Eliza kernel architecture:
1. kernel template: add bool keep_rs_t param + int K runtime param.
2. state offsets: input state is 3D (S_v*S_v*H, K, n_seqs); seq stride becomes K * H * S_v * S_v. Output state per slot is unchanged.
3. per-token snapshot write: when keep_rs_t=true, after each token's stage-B store the working s_tile to slot (t - (n_tokens - K)) in the output state region. Slots earlier than t=shift are left caller-owned.
4. final state write: kept only on the !keep_rs_t branch.
5. host op dispatcher: K = src_state->ne[1]; keep_rs = (K > 1).
K==1 preserves the legacy single-snapshot path bit-for-bit; K>1 enables partial rollback for MTP spec-dec.
Layout verified by direct comparison to the CPU reference at ggml/src/ggml-cpu/ops.cpp ggml_compute_forward_gated_delta_net_one_chunk (matching state_seq_stride, state_size_per_snap, shift, and target_slot arithmetic) and the Metal port (matching K = op->src[5]->ne[1]). test-backend-ops cases at tests/test-backend-ops.cpp:9477-9485 cover K==n_seq_tokens, K<n_seq_tokens (overflow), and both KDA and non-KDA variants.
Cross-compile attempted via scripts/cuda-docker-build.sh (Docker reachable, nvidia/cuda:12.4.1-devel-ubuntu22.04 under linux/amd64 emulation on aarch64 host); apt/cmake configure succeeded, full nvcc compile of all ggml-cuda/*.cu kernels was still in flight at commit time. No errors observed on the gated_delta_net.cu compile up to log truncation. TODO(cuda-mtp-validation) marker inserted at the per-token snapshot write to flag the point most likely to need on-device GPU verification.
Metal MTP baseline regression check on /tmp/mtp-test/Qwen3.5-2B-Q4_K_M.gguf: prompt 355.7 t/s, gen 104.6 t/s; coherent output, no regression from this CUDA-only change (no Metal kernel or shared header touched).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
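The slot and stride arithmetic in items 2 and 3 of this commit can be modelled in a few lines. The sketch below is an assumption-laden illustration, not the CUDA kernel: it assumes a token is snapshotted only when its slot index falls in [0, K), and the helper names (`snapshot_slots`, `state_offset`) are hypothetical.

```python
def snapshot_slots(n_tokens, K):
    """Map token index t -> output slot, for tokens that get a snapshot."""
    shift = n_tokens - K  # tokens before `shift` are not snapshotted
    slots = {}
    for t in range(n_tokens):
        slot = t - shift
        # Assumed guard: only slots inside the K-deep snapshot region are written.
        if 0 <= slot < K:
            slots[t] = slot
    return slots


def state_offset(seq, slot, H, S_v, K):
    """Element offset of (seq, slot) in a state laid out as (S_v*S_v*H, K, n_seqs)."""
    snap_size = S_v * S_v * H          # one snapshot
    seq_stride = K * snap_size         # K snapshots per sequence
    return seq * seq_stride + slot * snap_size
```

Under these assumptions the last token of a batch always lands in slot K-1, and when n_tokens is smaller than K the leading slots are left untouched, which is consistent with the commit's note that earlier slots remain caller-owned.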
Source: discovered in /tmp/wave9-master-issue-list.md item #3. PR #22673 (Multi-Token Prediction) was ported in commit 2761635 and its --spec-type draft-mtp CLI usage is documented in tools/cli/README.md. The stale entry in the upstream-PR-candidates roster still listed it alongside #18886 under "to merge" — corrected to reflect the merged state and noted the one remaining gap (CUDA gated_delta_net.cu K-snapshot spec-dec path still backed out, pending separate port).
…PR #22673
Adds conversion/{__init__,qwen}.py — required for --mtp / --no-mtp flags
in convert_hf_to_gguf.py to actually resolve. The MTP cherry-pick at
2761635 backed out conversion/{base,qwen}.py because upstream PR
#22673 also introduced a full modular Python conversion package we
don't track. This commit restores only the _Qwen35MtpMixin (no
base.py, no other model families) and mixes it into the existing
monolithic Qwen3_5TextModel / Qwen3_5MoeTextModel base lists.
Without this we could only consume upstream-pre-published MTP GGUFs
(e.g. unsloth/Qwen3.5-2B-MTP-GGUF); with this we can re-convert our
own Eliza-1 Qwen3.5 base weights to add MTP heads.
Verified:
- `from conversion.qwen import _Qwen35MtpMixin` resolves cleanly.
- `python convert_hf_to_gguf.py --help` shows --mtp / --no-mtp.
- `--mtp <non-qwen35>` errors out with the documented message.
- End-to-end: re-converting unsloth/Qwen3.5-2B (bundled, BF16) produces
a 333-layer-tensor GGUF with qwen35.block_count=25,
qwen35.nextn_predict_layers=1, and blk.24.nextn.{eh_proj,enorm,
hnorm,shared_head_norm}.weight. Tensor name list diffs zero against
the upstream-published unsloth/Qwen3.5-2B-MTP-GGUF Qwen3.5-2B-BF16.gguf
reference (334 names identical, sorted).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
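The mixin approach described in this commit relies on Python's cooperative `super()` resolution: the mixin's `super()` calls resolve against whatever class it is combined with. The following sketch shows the pattern with hypothetical stand-in classes (`TextModelBase`, `MtpMixin`, `Qwen35Text`); it is not the code in conversion/qwen.py, and the "mtp." tensor prefix is invented for illustration.

```python
class TextModelBase:
    """Hypothetical stand-in for a monolithic text-model converter class."""

    def __init__(self, n_layers):
        self.n_layers = n_layers
        self.params = {}

    def set_gguf_parameters(self):
        self.params["block_count"] = self.n_layers

    def filter_tensors(self, names):
        # In this sketch the base converter drops MTP tensors.
        return [n for n in names if not n.startswith("mtp.")]


class MtpMixin:
    """Hypothetical mixin; it has no base class that defines these methods,
    so super() only resolves once it is composed with a converter class."""

    mtp_layers = 1

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        # Count the MTP layer as an extra block and record how many there are.
        self.params["block_count"] += self.mtp_layers
        self.params["nextn_predict_layers"] = self.mtp_layers

    def filter_tensors(self, names):
        kept = super().filter_tensors(names)
        # Re-admit the MTP tensors the base class filtered out.
        return kept + [n for n in names if n.startswith("mtp.")]


class Qwen35Text(MtpMixin, TextModelBase):
    """Composition: the mixin comes first in the base list so it runs first."""
```

With 24 base layers this toy composition reports a block count of 25 and one next-token-prediction layer. A static checker looking at `MtpMixin` in isolation cannot see that `super().set_gguf_parameters` exists, since that only becomes true after composition.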
The embedded-metallib build path (GGML_METAL_EMBED_LIBRARY=1) did not
define GGML_METAL_HAS_BF16 when invoking xcrun metal, so every bfloat-
gated kernel — kernel_get_rows_bf16, kernel_mul_mv_bf16_*,
kernel_flash_attn_ext_bf16_*, kernel_tri_bf16_* — was omitted from the
compiled default.metallib. The runtime device->props.has_bfloat is true
on M-series, so supports_op happily returned true for BF16 GET_ROWS,
the kernel lookup returned a null MTLFunction ("failed to compile
pipeline"), and the subsequent dispatch SIGSEGV'd
(test-backend-ops -b MTL0 -o GET_ROWS exit 139).
The runtime JIT path in ggml-metal-device.m already sets this macro from
device props (line 235), so only the precompiled metallib path was
broken. ggml-metal.metal auto-undefs the macro when
__METAL_VERSION__ < 310, so adding -DGGML_METAL_HAS_BF16=1 is safe for
older Metal targets — gating remains correct via the runtime has_bfloat
check in supports_op.
Verified on M4 Max:
test-backend-ops -b MTL0 -o GET_ROWS: 111/111 PASS (was SIGSEGV)
test-backend-ops -b MTL0 -o ATTN_SCORE_TBQ: 4/4 PASS (unchanged)
llama-cli eliza-1-0_8b -ngl 99: 83.1 t/s gen (unchanged)
The K5 Kokoro iSTFT TTS subtree and the omnivoice merged subtree are both first-class voice surfaces in the eliza-1 stack: TTS GGUFs are staged in the eliza-1 bundles (omnivoice-base, omnivoice-tokenizer, kokoro) and the runtime needs loaders for them. Default-off meant no CLI binary in bin/, no test wiring, and no end-to-end coverage.
Build artifacts produced when these flags are ON:
- bin/kokoro-tts: Kokoro standalone CLI
- bin/omnivoice-tts: OmniVoice standalone CLI
- bin/test-kokoro-istft: iSTFT unit smoke
- bin/test-kokoro-phonemes: phonemizer unit smoke
- libkokoro_lib.a / libomnivoice_lib.a
Stock llama.cpp consumers that don't want the TTS subtrees can still opt out with -DLLAMA_BUILD_KOKORO=OFF -DLLAMA_BUILD_OMNIVOICE=OFF.
Verified on build-codex-merge (Darwin arm64, Metal + BLAS):
- test-kokoro-istft: OK (peak=1.2010, len=495)
- test-kokoro-phonemes: OK
GGML_OP_ISTFT (the Kokoro iSTFTNet decoder op) was registered in ggml.h + implemented on CPU (ggml-cpu/ops.cpp:11415) + CUDA (ggml-cuda/istft.cu) + had a Metal kernel staged (ggml-metal/eliza-kernels/istft.metal) and Vulkan pipeline declared (ggml-vulkan/ggml-vulkan.cpp:4974), but `test-backend-ops -o ISTFT` ran 0 tests because no test_case existed.
Adds test_istft covering:
- (n_fft=20, hop=5, win=20) x16 frames: Kokoro-82M iSTFTNet shape
- (n_fft=20, hop=5, win=20) x16 frames + explicit Hann window src1
- (n_fft=256, hop=64, win=256) x8 frames: medium 256-FFT vocoder
- (n_fft=512, hop=128, win=512) x8 frames: librosa-style 512-FFT + window
Mirrors the test_attn_score_tbq/polar pattern used in commits 1dfde3a + 54baf1d. NMSE tolerance 1e-3 absorbs the GPU-vs-double-precision-IDFT OLA accumulation order difference.
Verified on build-codex-merge:
- Metal MTL0: NOT SUPPORTED (kernel staged but dispatch unwired; tracked separately under the Metal backend agent's scope).
- BLAS: NOT SUPPORTED (expected; BLAS doesn't handle ISTFT).
- CPU: reference backend, parity self-test inherent in framework.
Once the Metal + Vulkan dispatch glue lands, these four cases will flip to actual parity assertions against the CPU reference.
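The 1e-3 tolerance quoted in this commit is a bound on normalized mean squared error. As a rough sketch, assuming NMSE means the squared error summed over all elements divided by the energy of the reference output (the exact definition used by the harness is not quoted here), the check looks like this. The names `nmse` and `within_tolerance` are hypothetical.

```python
def nmse(ref, out):
    """Squared error between out and ref, normalized by the energy of ref."""
    err = sum((a - b) ** 2 for a, b in zip(ref, out))
    energy = sum(a * a for a in ref)
    # Note: an all-zero reference would divide by zero; this sketch
    # assumes the reference has non-zero energy.
    return err / energy


def within_tolerance(ref, out, tol=1e-3):
    """True if the backend output is close enough to the reference."""
    return nmse(ref, out) <= tol
```

Because the error is normalized by signal energy, a reordering of overlap-add accumulation that perturbs low-order bits stays far below the threshold, while a kernel that produces structurally wrong output does not.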
llama-cli rejects --no-conversation (-no-cnv) and prints "please use llama-completion instead", but explicit single-target builds (`cmake --build ... --target llama-cli`) skipped the redirect target. The documented escape hatch was missing whenever someone built only llama-cli.
Add `add_dependencies(llama-cli llama-completion)` (guarded by `if(TARGET llama-completion)` for builds where the completion subdir isn't included) so both binaries land in bin/ together.
Verified: rm bin/llama-completion + tools/completion build dir, then `cmake --build build-codex-merge --target llama-cli` rebuilds both, and llama-completion smoke-tests at 247 t/s on eliza-1-0_8b-128k.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…OMNIVOICE" This reverts commit 7c36249.
lalalune added a commit that referenced this pull request on May 19, 2026
Three independent failures on every push to eliza/token-trie-sampler:
1. Check vendor: vendor/sheredom/subprocess.h was stale vs the pinned upstream URL (b49c56e9fe21...). The fork's earlier "stopping_thread hang on child process exit" diff (67a7818) has already been incorporated upstream at that pin, so the local file just needed to be re-synced with `scripts/sync_vendor.py`. Drop the 5 obsolete lines so check-vendor reports a clean tree.
2. flake8: convert_hf_to_gguf.py imported `transformers.AutoConfig` but only referenced it in comments. F401 unused-import. Removed.
3. Python Type-Check (ty): conversion/qwen.py defines `_Qwen35MtpMixin` whose `super().*` chain resolves at runtime via the composed Model subclasses in convert_hf_to_gguf.py. ty cannot see that composition at the mixin level, so it flagged super().filter_tensors, super().set_gguf_parameters, super().prepare_metadata, super().modify_tensors and the inherited `ftype` / `metadata` attributes as unresolved. Add ./conversion/** to the existing unresolved-* override block (same treatment as tools/kokoro/tools).
Verified locally:
- `python3 scripts/sync_vendor.py` leaves vendor/ unchanged.
- `python3 -m flake8 convert_hf_to_gguf.py` is clean for F401.
- ty override covers the previously-flagged lines.
Refs: PR #15