Skip to content

Eliza/token trie sampler#16

Merged
lalalune merged 10 commits into
mainfrom
eliza/token-trie-sampler
May 19, 2026
Merged

Eliza/token trie sampler#16
lalalune merged 10 commits into
mainfrom
eliza/token-trie-sampler

Conversation

@lalalune
Copy link
Copy Markdown
Member

Overview

Additional information

Requirements

lalalune and others added 10 commits May 18, 2026 17:12
…::all

Two unrelated CI failures uncovered by the wave-9 CI greening pass:

1. ggml_istft layout mismatch — caused arm64-cpu-high-perf SIGABRT

ggml_istft (ggml.c:5105) declares the mag_phase input as
  ne[0]=2 (mag/phase channel), ne[1]=F (n_fft/2+1), ne[2]=T (frames)
but the CPU op (ggml-cpu/ops.cpp:11505) AND the CUDA kernel
(ggml-cuda/istft.cu:137) were both reading T from ne[0] and asserting
T_ne == T, where T = ne[2] (= 2 in the new test). That fired
GGML_ASSERT(T_ne == T) inside test-backend-ops -b CPU on arm64,
aborting the entire test suite immediately after the ATTN_SCORE_POLAR
cases.

Fix the kernel + the test together to match the API contract:
  - tests/test-backend-ops.cpp: ggml_new_tensor_3d(F32, 2, F, n_frames)
    (was wrongly (n_frames, F, 2) — backwards from the API)
  - ggml-cpu/ops.cpp: read CH=ne[0], F=ne[1], T=ne[2], and index the
    interleaved [ch, f, t] storage as data[t*(2*F) + f*2 + ch].
  - ggml-cuda/istft.cu: same change, plus take a single mag_phase
    base pointer (channels are interleaved, not split planes).

Verified: ./build-codex-merge/bin/test-backend-ops -b CPU passes
15822/15822 tests (4 ISTFT cases now run for real, not aborting).

2. ggml::all missing from ggml-config.cmake — broke build-cmake-pkg/linux

examples/simple-cmake-pkg links against `ggml::all`, but our
ggml/cmake/ggml-config.cmake.in (rewritten to use ggml-targets.cmake)
dropped the synthetic INTERFACE target that upstream provides.
Add it back: aggregate every ggml::<backend> from GGML_AVAILABLE_BACKENDS
into a single INTERFACE IMPORTED target, matching upstream behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three independent failures on every push to eliza/token-trie-sampler:

1. Check vendor — vendor/sheredom/subprocess.h was stale vs the pinned
   upstream URL (b49c56e9fe21...). The fork's earlier "stopping_thread
   hang on child process exit" diff (67a7818) has already been
   incorporated upstream at that pin, so the local file just needed to
   be re-synced with `scripts/sync_vendor.py`. Drop the 5 obsolete
   lines so check-vendor reports a clean tree.

2. flake8 — convert_hf_to_gguf.py imported `transformers.AutoConfig`
   but only referenced it in comments. F401 unused-import. Removed.

3. Python Type-Check (ty) — conversion/qwen.py defines `_Qwen35MtpMixin`
   whose `super().*` chain resolves at runtime via the composed Model
   subclasses in convert_hf_to_gguf.py. ty cannot see that composition
   at the mixin level, so it flagged super().filter_tensors,
   super().set_gguf_parameters, super().prepare_metadata,
   super().modify_tensors and the inherited `ftype` / `metadata`
   attributes as unresolved. Add ./conversion/** to the existing
   unresolved-* override block (same treatment as tools/kokoro/tools).

Verified locally:
  - `python3 scripts/sync_vendor.py` leaves vendor/ unchanged.
  - `python3 -m flake8 convert_hf_to_gguf.py` is clean for F401.
  - ty override covers the previously-flagged lines.

Refs: PR #15
The vulkan-x64 Windows job fails with 'SPIRV-Headers config not found'
because LunarG's Vulkan SDK installer (1.4.313.2) does not ship a
CMake config for SPIRV-Headers on Windows, but ggml/src/ggml-vulkan/
CMakeLists.txt has 'find_package(SPIRV-Headers REQUIRED)'.

Fix:
- Clone + install KhronosGroup/SPIRV-Headers from the matching SDK
  tag (vulkan-sdk-<VULKAN_VERSION>) into RUNNER_TEMP and expose
  SPIRV-Headers_DIR + CMAKE_PREFIX_PATH so find_package succeeds.
- Add 'fail-fast: false' to the windows-latest matrix so a single
  failing variant no longer cancels every other Windows build.
Adds the missing Metal backend dispatch for the iSTFT op so test-backend-ops
on MTL0 now runs (and passes) all four ISTFT parity cases instead of
reporting "not supported".

Concrete changes:
- ggml/src/ggml-metal/eliza-shipped/istft.metal: NEW corrected kernel.
  Uses the actual ggml_istft tensor layout [2, F, T] (ne[0]=2 mag/phase
  channel, interleaved as data[t*(2*F) + f*2 + ch]) — matches the CPU
  reference at ggml/src/ggml-cpu/ops.cpp fixed in 3de6dc8.  The
  previously-staged version under eliza-kernels/istft.metal read
  mag_phase from split mag/phase half-planes which never matched the
  ggml_istft contract.  Supports an optional `use_window` flag so the
  with_window=false test cases synthesise a periodic Hann inline.
- ggml/src/ggml-metal/CMakeLists.txt: compile eliza-shipped/istft.metal
  into default.metallib for both the embedded-library path (iOS) and
  the on-disk metallib path (macOS).
- ggml/src/ggml-metal/ggml-metal-impl.h: ggml_metal_kargs_istft (mirrors
  IstftParams in the shader).
- ggml/src/ggml-metal/ggml-metal-device.{h,cpp}:
  ggml_metal_library_get_pipeline_istft kernel-name lookup.
- ggml/src/ggml-metal/ggml-metal-ops.{h,cpp}: ggml_metal_op_istft
  dispatch (binds src0/src1/dst + IstftParams, dispatches one thread
  per output sample in a 32-wide threadgroup capped at n_out).
- ggml/src/ggml-metal/ggml-metal-device.m: GGML_OP_ISTFT case in
  ggml_metal_supports_op, gated on F32 contiguous src0 with ne[0]==2
  and (optional) F32 contiguous src1.

Verified on Apple M4 Max:
  build-codex-merge/bin/test-backend-ops -b MTL0 -o ISTFT
  -> 4/4 PASS (n_fft=20/256/512, with_window=0 and 1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…OICE

Cross-platform compile fixes that broke the previous default-on attempt
(7c36249, reverted in 5b53366):

- Windows MSVC: kokoro-istft.cpp used M_PI (a POSIX/GNU extension behind
  _USE_MATH_DEFINES). Replaced with a local K_PI constexpr, same pattern
  as the prior ggml/src/ggml-cpu/ops.cpp iSTFT fix (91c580a). Also
  added an explicit #include <string> to kokoro-predictor.h, which
  references std::string in its public signature.

- Android NDK: tools/omnivoice/CMakeLists.txt unconditionally linked
  omnivoice-dac-parity against ggml-cpu / ggml-base. When those CMake
  targets are not defined on a given cross-compile config, CMake
  silently falls back to literal `-lggml-cpu` link flags against a
  phantom library. Guarded both link calls with `if(TARGET ggml-cpu)` /
  `if(TARGET ggml-base)`, matching tools/kokoro/CMakeLists.txt.

- iOS XCFramework (LLAMA_BUILD_TOOLS=OFF, LLAMA_BUILD_MTMD=OFF): the
  fused `elizainference` library links mtmd_* symbols via
  eliza-inference-ffi.cpp (the ASR pipeline). Gated the whole
  `elizainference` target on `if(TARGET mtmd)` so iOS builds skip it
  entirely. omnivoice_lib + omnivoice-tts continue to build (they don't
  touch mtmd); the Eliza iOS runtime can opt back in by setting
  LLAMA_BUILD_MTMD=ON when ASR is needed.

K5 Kokoro + OmniVoice now build clean on Darwin (verified locally:
kokoro-tts, omnivoice-tts, omnivoice-dac-parity, elizainference, plus
llama-cli smoke test passing). Cross-platform CI (Windows, Android,
Apple/Xcode) will confirm the other three platforms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pairs with 28603bd (Metal ISTFT dispatch).  The Vulkan iSTFT path had
two correctness bugs that left it at 0/4 on test-backend-ops parity vs
CPU despite the dispatch + supports_op being fully wired:

1. Tensor layout mismatch.  Both istft.comp and the .cpp dispatch were
   reading T from src0->ne[0] and indexing mag_phase as two split planes
   (mag at [0..F*T], phase at [F*T..2*F*T]).  ggml_istft (ggml.c:5105)
   actually lays mag_phase out as [2, F, T] with ne[0]=2 (mag/phase
   channel), ne[1]=F, ne[2]=T and interleaves channels — element
   [ch,f,t] sits at data[t*(2*F) + f*2 + ch].  Same fix that landed for
   CPU + CUDA in 3de6dc8, now applied to Vulkan.

2. Window tensor was ignored.  When the user passes an explicit window
   (test_istft with_window=true), the shader silently substituted a
   periodic Hann, so the parity check fed two different windows to the
   two backends and naturally diverged.  The shader now binds the window
   buffer at binding 1 and gates on a new use_window push constant; the
   dispatch path passes src1 through to ggml_vk_op_f32 unconditionally
   (aliasing src0 when no window tensor exists, which keeps the
   descriptor set valid since the shader never reads it in that case).

Side-cleanup: the unused vk_op_istft_ola_push_constants struct is gone;
the OLA-pass shader (istft_ola.comp) was never registered or
dispatched and has been left untouched — it remains dead but
out-of-scope for this fix.

Verified on Apple M4 Max (MoltenVK):
  build-vulkan-validate/bin/test-backend-ops -b Vulkan0 -o ISTFT
  -> 4/4 PASS (n_fft=20/256/512, with_window=0 and 1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that the Metal (28603bd) and Vulkan (e59d53f) iSTFT dispatch
paths are wired and pass all 4 test-backend-ops parity cases against
the CPU reference at ggml-cpu/ops.cpp:ggml_compute_forward_istft_f32,
promote ISTFT from "no CI coverage" straight to hard-fail in both
validation workflows.

- eliza-metal-validation.yml: add ISTFT to the test-backend-ops op
  sweep and add a gated check block that fails the job on any
  ISTFT FAIL line or "not supported" line.
- eliza-vulkan-validation.yml: same pattern, against build-vulkan.

A FAIL means the GPU iSTFT diverges from the CPU reference (typically
a layout regression or a missing Hermitian-symmetry term).  A
"not supported" line means the supports_op gate regressed and the
parity test silently stopped running — also a hard fail because we
need to know the moment GPU ISTFT coverage drops out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds scripts/cuda-mtp-validate.sh and extends the cuda-runtime-validation
job in eliza-cuda-validation.yml to actually exercise the MTP K-snapshot
path on a real NVIDIA GPU. Closes the TODO(cuda-mtp-validation) marker in
ggml/src/ggml-cuda/gated_delta_net.cu (from commit 142e7ac -- port of
upstream PR #22673 multi-token-prediction state snapshots).

What's covered:
- Full test-backend-ops -b CUDA0 -o GATED_DELTA_NET sweep (CPU reference
  vs CUDA, all registered K=1 + K>1 cases including KDA + permuted).
- Assertion that >=4 K>1 cases were actually scheduled (guards against
  silent skip / kernel registration regressions).
- Optional llama-cli end-to-end smoke with --spec-type draft-mtp on the
  unsloth Qwen3.5-2B MTP GGUF (skipped cleanly when the model is absent).

The script works in two modes:
  bash scripts/cuda-mtp-validate.sh --list-tests   # no GPU, dry-run
  bash scripts/cuda-mtp-validate.sh --build-only   # compile only
  bash scripts/cuda-mtp-validate.sh                # real GPU required

The CI runtime job remains gated on the self-hosted-cuda runner label and
is auto-skipped when that runner isn't provisioned -- no false positives
on hosts without a GPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Kokoro decode path at tools/kokoro/src/kokoro.cpp:547 built the iSTFT
input tensor as ne={T,F,2,1} (time, freq, channel, batch), but commit
3de6dc8 fixed ggml_istft + Metal/Vulkan/CPU kernels to use the API
contract ne={2,F,T} (channel-first interleaved). The K5 Kokoro caller
wasn't updated then, so kokoro-tts aborted with
GGML_ASSERT(CH_ne == 2) failed in ggml_compute_forward_istft.

This commit aligns the caller. Verified end-to-end: kokoro-tts now
produces a valid WAV from the staged GGUF in both eliza-1-2b and
eliza-1-4b bundles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing decl

ggml_graph_compute_with_ctx was removed from upstream ggml in favor of
the ggml-backend interface. omnivoice-dac-parity referenced the removed
symbol -> undefined reference on CI (android, vulkan, 3rd-party). Swap
to ggml_backend_cpu_init + ggml_backend_cpu_set_n_threads +
ggml_backend_graph_compute + ggml_backend_free. Tensors live in the
user-managed ctx buffer; the CPU backend computes against them directly.

Also adds a forward declaration for kokoro_model_ggml_ctx immediately
before its definition in kokoro.cpp so -Wmissing-declarations sees a
prior declaration at the definition site. The matching extern lives in
the sibling TU (kokoro-predictor.cpp).

CI (3rd-party / android / vulkan) now build clean.
Smoke test verified: kokoro-tts still produces real audio
(36480 samples @ 24kHz, peak 0.0861) from the staged GGUF in
eliza-1-2b.bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b60dd7d7-25b5-43b4-a596-54e9f7fd2593

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch eliza/token-trie-sampler

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@lalalune lalalune merged commit 22582b2 into main May 19, 2026
32 of 79 checks passed
@lalalune lalalune deleted the eliza/token-trie-sampler branch May 19, 2026 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant