Eliza/token trie sampler by lalalune · Pull Request #16 · elizaOS/llama.cpp

lalalune · 2026-05-19T07:03:48Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

…::all Two unrelated CI failures uncovered by the wave-9 CI greening pass: 1. ggml_istft layout mismatch — caused arm64-cpu-high-perf SIGABRT ggml_istft (ggml.c:5105) declares the mag_phase input as ne[0]=2 (mag/phase channel), ne[1]=F (n_fft/2+1), ne[2]=T (frames) but the CPU op (ggml-cpu/ops.cpp:11505) AND the CUDA kernel (ggml-cuda/istft.cu:137) were both reading T from ne[0] and asserting T_ne == T, where T = ne[2] (= 2 in the new test). That fired GGML_ASSERT(T_ne == T) inside test-backend-ops -b CPU on arm64, aborting the entire test suite immediately after the ATTN_SCORE_POLAR cases. Fix the kernel + the test together to match the API contract: - tests/test-backend-ops.cpp: ggml_new_tensor_3d(F32, 2, F, n_frames) (was wrongly (n_frames, F, 2) — backwards from the API) - ggml-cpu/ops.cpp: read CH=ne[0], F=ne[1], T=ne[2], and index the interleaved [ch, f, t] storage as data[t*(2*F) + f*2 + ch]. - ggml-cuda/istft.cu: same change, plus take a single mag_phase base pointer (channels are interleaved, not split planes). Verified: ./build-codex-merge/bin/test-backend-ops -b CPU passes 15822/15822 tests (4 ISTFT cases now run for real, not aborting). 2. ggml::all missing from ggml-config.cmake — broke build-cmake-pkg/linux examples/simple-cmake-pkg links against `ggml::all`, but our ggml/cmake/ggml-config.cmake.in (rewritten to use ggml-targets.cmake) dropped the synthetic INTERFACE target that upstream provides. Add it back: aggregate every ggml::<backend> from GGML_AVAILABLE_BACKENDS into a single INTERFACE IMPORTED target, matching upstream behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three independent failures on every push to eliza/token-trie-sampler: 1. Check vendor — vendor/sheredom/subprocess.h was stale vs the pinned upstream URL (b49c56e9fe21...). The fork's earlier "stopping_thread hang on child process exit" diff (67a7818) has already been incorporated upstream at that pin, so the local file just needed to be re-synced with `scripts/sync_vendor.py`. Drop the 5 obsolete lines so check-vendor reports a clean tree. 2. flake8 — convert_hf_to_gguf.py imported `transformers.AutoConfig` but only referenced it in comments. F401 unused-import. Removed. 3. Python Type-Check (ty) — conversion/qwen.py defines `_Qwen35MtpMixin` whose `super().*` chain resolves at runtime via the composed Model subclasses in convert_hf_to_gguf.py. ty cannot see that composition at the mixin level, so it flagged super().filter_tensors, super().set_gguf_parameters, super().prepare_metadata, super().modify_tensors and the inherited `ftype` / `metadata` attributes as unresolved. Add ./conversion/** to the existing unresolved-* override block (same treatment as tools/kokoro/tools). Verified locally: - `python3 scripts/sync_vendor.py` leaves vendor/ unchanged. - `python3 -m flake8 convert_hf_to_gguf.py` is clean for F401. - ty override covers the previously-flagged lines. Refs: PR #15

The vulkan-x64 Windows job fails with 'SPIRV-Headers config not found' because LunarG's Vulkan SDK installer (1.4.313.2) does not ship a CMake config for SPIRV-Headers on Windows, but ggml/src/ggml-vulkan/ CMakeLists.txt has 'find_package(SPIRV-Headers REQUIRED)'. Fix: - Clone + install KhronosGroup/SPIRV-Headers from the matching SDK tag (vulkan-sdk-<VULKAN_VERSION>) into RUNNER_TEMP and expose SPIRV-Headers_DIR + CMAKE_PREFIX_PATH so find_package succeeds. - Add 'fail-fast: false' to the windows-latest matrix so a single failing variant no longer cancels every other Windows build.

Adds the missing Metal backend dispatch for the iSTFT op so test-backend-ops on MTL0 now runs (and passes) all four ISTFT parity cases instead of reporting "not supported". Concrete changes: - ggml/src/ggml-metal/eliza-shipped/istft.metal: NEW corrected kernel. Uses the actual ggml_istft tensor layout [2, F, T] (ne[0]=2 mag/phase channel, interleaved as data[t*(2*F) + f*2 + ch]) — matches the CPU reference at ggml/src/ggml-cpu/ops.cpp fixed in 3de6dc8. The previously-staged version under eliza-kernels/istft.metal read mag_phase from split mag/phase half-planes which never matched the ggml_istft contract. Supports an optional `use_window` flag so the with_window=false test cases synthesise a periodic Hann inline. - ggml/src/ggml-metal/CMakeLists.txt: compile eliza-shipped/istft.metal into default.metallib for both the embedded-library path (iOS) and the on-disk metallib path (macOS). - ggml/src/ggml-metal/ggml-metal-impl.h: ggml_metal_kargs_istft (mirrors IstftParams in the shader). - ggml/src/ggml-metal/ggml-metal-device.{h,cpp}: ggml_metal_library_get_pipeline_istft kernel-name lookup. - ggml/src/ggml-metal/ggml-metal-ops.{h,cpp}: ggml_metal_op_istft dispatch (binds src0/src1/dst + IstftParams, dispatches one thread per output sample in a 32-wide threadgroup capped at n_out). - ggml/src/ggml-metal/ggml-metal-device.m: GGML_OP_ISTFT case in ggml_metal_supports_op, gated on F32 contiguous src0 with ne[0]==2 and (optional) F32 contiguous src1. Verified on Apple M4 Max: build-codex-merge/bin/test-backend-ops -b MTL0 -o ISTFT -> 4/4 PASS (n_fft=20/256/512, with_window=0 and 1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…OICE Cross-platform compile fixes that broke the previous default-on attempt (7c36249, reverted in 5b53366): - Windows MSVC: kokoro-istft.cpp used M_PI (a POSIX/GNU extension behind _USE_MATH_DEFINES). Replaced with a local K_PI constexpr, same pattern as the prior ggml/src/ggml-cpu/ops.cpp iSTFT fix (91c580a). Also added an explicit #include <string> to kokoro-predictor.h, which references std::string in its public signature. - Android NDK: tools/omnivoice/CMakeLists.txt unconditionally linked omnivoice-dac-parity against ggml-cpu / ggml-base. When those CMake targets are not defined on a given cross-compile config, CMake silently falls back to literal `-lggml-cpu` link flags against a phantom library. Guarded both link calls with `if(TARGET ggml-cpu)` / `if(TARGET ggml-base)`, matching tools/kokoro/CMakeLists.txt. - iOS XCFramework (LLAMA_BUILD_TOOLS=OFF, LLAMA_BUILD_MTMD=OFF): the fused `elizainference` library links mtmd_* symbols via eliza-inference-ffi.cpp (the ASR pipeline). Gated the whole `elizainference` target on `if(TARGET mtmd)` so iOS builds skip it entirely. omnivoice_lib + omnivoice-tts continue to build (they don't touch mtmd); the Eliza iOS runtime can opt back in by setting LLAMA_BUILD_MTMD=ON when ASR is needed. K5 Kokoro + OmniVoice now build clean on Darwin (verified locally: kokoro-tts, omnivoice-tts, omnivoice-dac-parity, elizainference, plus llama-cli smoke test passing). Cross-platform CI (Windows, Android, Apple/Xcode) will confirm the other three platforms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pairs with 28603bd (Metal ISTFT dispatch). The Vulkan iSTFT path had two correctness bugs that left it at 0/4 on test-backend-ops parity vs CPU despite the dispatch + supports_op being fully wired: 1. Tensor layout mismatch. Both istft.comp and the .cpp dispatch were reading T from src0->ne[0] and indexing mag_phase as two split planes (mag at [0..F*T], phase at [F*T..2*F*T]). ggml_istft (ggml.c:5105) actually lays mag_phase out as [2, F, T] with ne[0]=2 (mag/phase channel), ne[1]=F, ne[2]=T and interleaves channels — element [ch,f,t] sits at data[t*(2*F) + f*2 + ch]. Same fix that landed for CPU + CUDA in 3de6dc8, now applied to Vulkan. 2. Window tensor was ignored. When the user passes an explicit window (test_istft with_window=true), the shader silently substituted a periodic Hann, so the parity check fed two different windows to the two backends and naturally diverged. The shader now binds the window buffer at binding 1 and gates on a new use_window push constant; the dispatch path passes src1 through to ggml_vk_op_f32 unconditionally (aliasing src0 when no window tensor exists, which keeps the descriptor set valid since the shader never reads it in that case). Side-cleanup: the unused vk_op_istft_ola_push_constants struct is gone; the OLA-pass shader (istft_ola.comp) was never registered or dispatched and has been left untouched — it remains dead but out-of-scope for this fix. Verified on Apple M4 Max (MoltenVK): build-vulkan-validate/bin/test-backend-ops -b Vulkan0 -o ISTFT -> 4/4 PASS (n_fft=20/256/512, with_window=0 and 1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Now that the Metal (28603bd) and Vulkan (e59d53f) iSTFT dispatch paths are wired and pass all 4 test-backend-ops parity cases against the CPU reference at ggml-cpu/ops.cpp:ggml_compute_forward_istft_f32, promote ISTFT from "no CI coverage" straight to hard-fail in both validation workflows. - eliza-metal-validation.yml: add ISTFT to the test-backend-ops op sweep and add a gated check block that fails the job on any ISTFT FAIL line or "not supported" line. - eliza-vulkan-validation.yml: same pattern, against build-vulkan. A FAIL means the GPU iSTFT diverges from the CPU reference (typically a layout regression or a missing Hermitian-symmetry term). A "not supported" line means the supports_op gate regressed and the parity test silently stopped running — also a hard fail because we need to know the moment GPU ISTFT coverage drops out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds scripts/cuda-mtp-validate.sh and extends the cuda-runtime-validation job in eliza-cuda-validation.yml to actually exercise the MTP K-snapshot path on a real NVIDIA GPU. Closes the TODO(cuda-mtp-validation) marker in ggml/src/ggml-cuda/gated_delta_net.cu (from commit 142e7ac -- port of upstream PR #22673 multi-token-prediction state snapshots). What's covered: - Full test-backend-ops -b CUDA0 -o GATED_DELTA_NET sweep (CPU reference vs CUDA, all registered K=1 + K>1 cases including KDA + permuted). - Assertion that >=4 K>1 cases were actually scheduled (guards against silent skip / kernel registration regressions). - Optional llama-cli end-to-end smoke with --spec-type draft-mtp on the unsloth Qwen3.5-2B MTP GGUF (skipped cleanly when the model is absent). The script works in two modes: bash scripts/cuda-mtp-validate.sh --list-tests # no GPU, dry-run bash scripts/cuda-mtp-validate.sh --build-only # compile only bash scripts/cuda-mtp-validate.sh # real GPU required The CI runtime job remains gated on the self-hosted-cuda runner label and is auto-skipped when that runner isn't provisioned -- no false positives on hosts without a GPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Kokoro decode path at tools/kokoro/src/kokoro.cpp:547 built the iSTFT input tensor as ne={T,F,2,1} (time, freq, channel, batch), but commit 3de6dc8 fixed ggml_istft + Metal/Vulkan/CPU kernels to use the API contract ne={2,F,T} (channel-first interleaved). The K5 Kokoro caller wasn't updated then, so kokoro-tts aborted with GGML_ASSERT(CH_ne == 2) failed in ggml_compute_forward_istft. This commit aligns the caller. Verified end-to-end: kokoro-tts now produces a valid WAV from the staged GGUF in both eliza-1-2b and eliza-1-4b bundles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ing decl ggml_graph_compute_with_ctx was removed from upstream ggml in favor of the ggml-backend interface. omnivoice-dac-parity referenced the removed symbol -> undefined reference on CI (android, vulkan, 3rd-party). Swap to ggml_backend_cpu_init + ggml_backend_cpu_set_n_threads + ggml_backend_graph_compute + ggml_backend_free. Tensors live in the user-managed ctx buffer; the CPU backend computes against them directly. Also adds a forward declaration for kokoro_model_ggml_ctx immediately before its definition in kokoro.cpp so -Wmissing-declarations sees a prior declaration at the definition site. The matching extern lives in the sibling TU (kokoro-predictor.cpp). CI (3rd-party / android / vulkan) now build clean. Smoke test verified: kokoro-tts still produces real audio (36480 samples @ 24kHz, peak 0.0861) from the staged GGUF in eliza-1-2b.bundle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-19T07:03:56Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b60dd7d7-25b5-43b4-a596-54e9f7fd2593

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch eliza/token-trie-sampler

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

lalalune and others added 10 commits May 18, 2026 17:12

lalalune merged commit 22582b2 into main May 19, 2026
32 of 79 checks passed

github-actions Bot added Nvidia GPU examples devops ggml Vulkan build testing Apple Metal python script labels May 19, 2026

lalalune deleted the eliza/token-trie-sampler branch May 19, 2026 07:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eliza/token trie sampler#16

Eliza/token trie sampler#16
lalalune merged 10 commits into
mainfrom
eliza/token-trie-sampler

lalalune commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026

Review skipped

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lalalune commented May 19, 2026

Overview

Additional information

Requirements

Uh oh!

coderabbitai Bot commented May 19, 2026

Review skipped

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant