Feature/chaos engine ai cli#57
Merged
Stdlib-only Python shim over the FastAPI backend's HTTP surface so features
can be tested headlessly (load + prompt + bench + status) without GUI
click-through.
- scripts/chaosengine-cli: serve/status/load/unload/prompt/bench/
  mtplx-install/mtplx-status subcommands; SSE streaming for live token
  output; JSON to stdout for jq composition
- tests/test_cli_smoke.py: 12 unit tests covering parser + happy path for
  every subcommand via mocked urllib
No new pip deps. No bundle-size impact (script lives in scripts/, not
packaged into the .app).
Expands chaosengine-cli to wrap all 125 backend routes (95 typed shortcuts +
generic call/routes/openapi dispatchers) and adds a phased E2E test suite
that drives the CLI end-to-end against a live backend, mirroring every major
app surface.
CLI (scripts/chaosengine-cli, 1656 LOC)
- Generic dispatcher: call <METHOD> <PATH> with --body/--file/--query/
  --stream; reaches every endpoint regardless of typed coverage.
- routes + openapi subcommands fetch /openapi.json so route inventory stays
  in sync without codegen.
- Typed shortcuts cover Chat (load/unload/prompt/bench/sessions/compare),
  HTML Challenges (list/get/file/create/repair/retry/validate/delete),
  Image Studio (generate/progress/cancel/outputs/library/catalog/download
  lifecycle), Video Studio (same shape + output-file binary save), Server
  (status/shutdown/logs SSE), Setup (mtplx/longlive/wan/cuda-torch/
  gpu-bundle install + status + probes), Diagnostics, and misc (gpu-status,
  cache-preview, prompts, settings, workspaces, plugins, tools, adapters,
  finetuning, v1-models).
- Fixed two real schema-shape bugs caught while building the E2E suite:
  image-generate sends modelId/guidance (was modelRef/guidanceScale);
  video-generate sends modelId/numFrames/guidance.
E2E suite (scripts/e2e_test_suite.py)
- 8 phases: 0 Environment probe, 1 Chat (MLX+GGUF+cache+DFlash+MTPLX+
  long-context+fused-attn), 2 Chat Compare, 3 HTML Challenge, 4 Image
  Studio, 5 Video Studio, 6 Setup probes (read-only), 7 Diagnostics +
  cleanup hygiene.
- Pass criteria concrete: HTTP 200, tokS>0 for generation, expected
  substring in runtimeNote for DFlash/MTPLX routing, zero unclean orphan
  workers after the sweep.
- Auto-skip semantics: checks skip cleanly when a prerequisite is missing
  (model not on disk, install missing) rather than failing.
- Reports JSON + Markdown to ~/.chaosengine/test-results/.
- --smoke runs Phases 0,3,4,5,6,7 (≤60s, no heavy model loads).
- Live smoke pass green on M-series box: 6/6 phases, 26 checks, 45s wall.
  Real FLUX image catalog probe, real LTX-2 video generate, real HTML
  challenge round-trip.
Tests (tests/test_cli_smoke.py, 39 → 39 passing)
- Updated to match new schema fields (modelId, numFrames).
- All 95 typed subcommands have parser coverage; representative behaviour
  tests for each category.
Docs (docs/E2E_TESTING.md)
- Standardised procedure + skip semantics + adding-new-checks guide.
CLAUDE.md
- New Build Checklist entries: --smoke + full E2E required for release
  builds and any PR touching inference routing.
- "New feature gate": every user-visible feature, engine wiring, catalog
  model family, install endpoint, or cache/spec-dec strategy must land with
  an E2E check in the relevant phase.
The resolver walked the parent's sibling subdirectories and the grandparent looking for vision projectors. Under the flat ~/AI_Models/<org>/<repo>/ layout that picked up an unrelated neighbour's mmproj — loading lmstudio-community/gemma-4-31B-it-GGUF attached Qwen3.6-27B's mmproj-Qwen3.6-27B-BF16.gguf and crashed llama-server with a text-vs-mmproj n_embd mismatch (5376 vs 5120). Restrict the scan to parent.iterdir() of the main .gguf — no recursion into subdirs, no walk into the grandparent. Returns None for text-only models so --mmproj is never passed. Adds two regression tests: (a) a flat sibling-dir layout with a stray mmproj in the neighbour returns None, (b) an mmproj inside a subdirectory of the model's own folder is also ignored.
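A minimal sketch of the restricted scan described above, assuming a hypothetical helper name `find_mmproj` and a simple "filename contains mmproj" match (the real resolver's name and matching rule are not shown in this commit):

```python
from pathlib import Path
from typing import Optional


def find_mmproj(model_path: Path) -> Optional[Path]:
    """Return an mmproj .gguf next to the model file, or None (hypothetical helper)."""
    # Only direct children of the model's own folder are considered:
    # no recursion into subdirectories, no walk into the grandparent.
    for candidate in sorted(model_path.parent.iterdir()):
        if not candidate.is_file() or candidate == model_path:
            continue
        name = candidate.name.lower()
        # Assumed matching rule: a .gguf file whose name mentions "mmproj".
        if name.endswith(".gguf") and "mmproj" in name:
            return candidate
    # Text-only model: the caller should not pass --mmproj at all.
    return None
```

Returning None rather than a best-effort guess is the point of the fix: a wrong projector crashes llama-server, while a missing one only disables vision.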
… match
Two related bugs surfaced by the E2E suite:
1. After deleting an HF-cache snapshot on disk, /api/workspace still
   returned the stale entry (broken=true) because nothing rescanned the
   library. _library() now runs a stat-only existence check on each cached
   entry per request and kicks a background rescan when anything was
   pruned. Sub-millisecond on a 500-entry cap.
2. lifecycle.load_model() rejected loads with "Cannot load 'X': <reason>"
   whenever the library lookup matched a broken entry, even when the caller
   supplied an explicit request.path pointing at the real weights
   elsewhere. The broken-entry guard now defers to the caller when path is
   set AND exists on disk. _find_library_entry() also prefers a healthy
   match over a broken one when multiple share the same name.
Tests cover: per-request prune, end-to-end /api/workspace exclusion,
two-pass healthy-over-broken lookup, path-trust escape hatch happy path, and
the negative case where a non-existent path still falls through to the
rejection.
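The two-pass lookup and the path-trust escape hatch can be sketched roughly as follows. The entry shape (`name`, `broken`) and both function names are illustrative assumptions, not the backend's actual signatures:

```python
import os
from typing import Any, Dict, List, Optional

Entry = Dict[str, Any]  # assumed shape: {"name": str, "broken": bool, ...}


def find_library_entry(entries: List[Entry], name: str) -> Optional[Entry]:
    """Prefer a healthy entry over a broken one when several share a name."""
    matches = [entry for entry in entries if entry.get("name") == name]
    # Pass 1: first healthy match wins.
    for entry in matches:
        if not entry.get("broken"):
            return entry
    # Pass 2: fall back to a broken match, if there is one.
    return matches[0] if matches else None


def should_reject_load(entry: Optional[Entry], request_path: Optional[str]) -> bool:
    """Reject only when the entry is broken and the caller gave no usable path."""
    if entry is None or not entry.get("broken"):
        return False
    # Path-trust escape hatch: an explicit path that exists on disk wins.
    if request_path and os.path.exists(request_path):
        return False
    return True
```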
Loading Qwen3.5 / Qwen3.6 MoE (and any other model that mixes
self-attention with linear-attention layers) with cacheStrategy=
turboquant + cacheBits=4 crashed the first generation call with:
'TurboQuantKVCache' object is not subscriptable
Root cause: ``make_adaptive_cache`` unconditionally built every cache
slot as ``TurboQuantKVCache`` / ``KVCache``. The model's linear-attn
layer forward accesses ``cache[0]`` / ``cache[1]`` (it expects an
``ArraysCache(size=2)``), which raises ``TypeError`` on a KV cache.
Fix: when a ``model`` is passed and exposes ``make_cache()``, use it
as the base. Preserve every non-KV slot (ArraysCache, MambaCache, …)
verbatim and only swap the actual ``KVCache`` instances for
``TurboQuantKVCache``. Plain models without ``make_cache`` keep the
previous behaviour.
Added regression tests in ``test_cache_strategies.py`` covering both
the hybrid model path and the no-``make_cache`` fallback. Live-
verified against ``mlx-community/Qwen3.6-35B-A3B-4bit`` at 4-bit
TurboQuant: generation now completes at ~47 tok/s with no crash.
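A rough illustration of the slot-preserving fix, using small stand-in classes rather than the real MLX cache types so the snippet runs without mlx installed. The signature of `make_adaptive_cache` here is an assumption for the sketch:

```python
from typing import Any, List, Optional


class KVCache:
    """Stand-in for a plain attention KV cache."""


class ArraysCache:
    """Stand-in for the indexable cache a linear-attention layer expects."""

    def __init__(self, size: int) -> None:
        self.state: List[Any] = [None] * size

    def __getitem__(self, index: int) -> Any:
        return self.state[index]


class TurboQuantKVCache:
    """Stand-in for the quantised KV cache."""

    def __init__(self, bits: int) -> None:
        self.bits = bits


def make_adaptive_cache(num_layers: int, bits: int, model: Optional[Any] = None) -> List[Any]:
    make_cache = getattr(model, "make_cache", None)
    if callable(make_cache):
        # Hybrid path: keep the model's own layout, swap only real KVCache slots.
        return [
            TurboQuantKVCache(bits) if isinstance(slot, KVCache) else slot
            for slot in make_cache()
        ]
    # Plain models: previous behaviour, one quantised cache per layer.
    return [TurboQuantKVCache(bits) for _ in range(num_layers)]
```

With this shape, a linear-attention layer that indexes `cache[0]` / `cache[1]` still receives its own ArraysCache instance untouched.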
CLI
- cmd_prompt/cmd_bench now read tokS, promptTokens, completionTokens,
responseSeconds, runtimeNote from the nested ``assistant.metrics``
payload instead of the (always-null) top level. Was effectively
hiding live tok/s numbers from every --metrics call.
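For illustration, reading the metrics from the nested payload might look like this. The helper name `extract_metrics` is hypothetical; the key names are the ones listed above:

```python
from typing import Any, Dict

METRIC_KEYS = ("tokS", "promptTokens", "completionTokens", "responseSeconds", "runtimeNote")


def extract_metrics(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pull metrics from response["assistant"]["metrics"], not the top level."""
    assistant = response.get("assistant") or {}
    metrics = assistant.get("metrics") or {}
    # Missing keys come back as None instead of raising.
    return {key: metrics.get(key) for key in METRIC_KEYS}
```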
E2E suite (scripts/e2e_test_suite.py)
- _load_unload_prompt: ``canonical_repo`` + ``load_timeout`` parameters
threaded through; subprocess timeout = load_timeout + 60s so a
backend-level timeout cleanup beats the harness kill.
- Phase 1 picker uses Qwen3.6-35B-A3B-4bit (MoE) as the fast model for
every Chat check — much quicker to load than 80B Qwen3-Next while
exercising the same MLX / cache / spec / fused paths.
- Phase 1 MTPLX check uses leaf-name modelRef + --canonical-repo
Youssofal/... so the controller routes through MtplxEngine while
avoiding the broken-library-entry path-shadow that previously
blocked the load (a separate bug fixed by Agent B's library-prune +
path-trust commit; the suite change is belt-and-braces).
- Phase 1 GGUF check cycles through local .gguf files instead of
picking the first one; a single broken mmproj pairing no longer
fails the whole check (Agent A's mmproj scope fix made this less
necessary but the resilience is a keeper).
- Phase 7 ``no orphan workers`` tolerates ``terminated`` / ``killed``
records as expected backend cleanup; only ``kill_failed`` or
similar non-cleaned states count as failure.
Full sweep result with this commit + the three preceding ``fix:``
commits (mmproj scoping, library prune + path-trust, TurboQuant
ArraysCache preservation):
8/8 phases PASS — 32/32 checks PASS — 128s wall
Phase 1 detail:
- MLX native cache: PASS 8.1s
- MLX TurboQuant cache: PASS 5.9s (was 500 → fixed by 30441f9)
- MLX + DFlash speculative: PASS 19.1s
- MLX + MTPLX speculative: PASS 13.2s (was load-blocked → fixed by 566fd64)
- GGUF llama.cpp: PASS 12.8s (was mmproj-crash → fixed by 51305c6)
- long context cache-preview: PASS 1.2s
- fused attention flag: PASS 10.8s
MTPLX (https://github.com/youssofal/mtplx) is the native MTP speculative-decoding runtime ChaosEngineAI shells out to for MTP-bearing models. Installed on-demand into an isolated venv at ~/.chaosengine/mtplx-venv/, not bundled in the desktop .app, driven via subprocess + HTTP from backend_service/inference/mtplx_engine.py. Apache 2.0 — compatible with our MIT+Apache+BSD permissive licence gate (CLAUDE.md §2). Full LICENSE shipped with the wheel under mtplx-*.dist-info/licenses/.
…al response shape
Pre-build (scripts/pre-build-check.sh)
- New phase 9/9: runs ./scripts/e2e_test_suite.py --smoke when backend
is reachable on :8876; warn-skips when not (pre-build doesn't spawn
one). Full E2E sweep stays a release-time gate per CLAUDE.md +
docs/E2E_TESTING.md.
- Notices dep-check list synced with current THIRD_PARTY_NOTICES.md:
added mtplx + mlx-video; dropped stale ChaosEngine probe (vendored
package was removed in FU-030).
Tests (tests/test_cli_smoke.py)
- test_prompt_non_streaming_prints_text_and_metrics fixture rebuilt
around the real /api/chat/generate response shape: { session,
runtime, assistant: { text, metrics: {...} } }. The earlier flat
shape was a guess that masked the real CLI bug fixed in 2d1128c.
Verification: ./scripts/pre-build-check.sh — 10 passed, 0 failed,
1 warning (unrelated llama-server-turbo update available).
Adds three new feature surfaces to the top-level README without reflowing existing sections. - MTPLX (Multi-Token Prediction) speculative decoding gets a feature-highlight bullet, a mention in the "Why ChaosEngineAI" speculative-decoding paragraph, and a dedicated subsection under "Speculative Decoding" alongside DFlash + DDTree. Covers Apple Silicon support, the isolated mtplx-venv, the model registry in backend_service/inference/_mtp.py, and the auto-routing fallback chain (MTPLX -> DFlash -> standard MLX). - chaosengine-cli gets a new "Headless Automation" section between the Building a Release and Project Layout sections. Documents the generic call dispatcher + 95 typed shortcuts, four quick-start examples, the optional PATH symlink, and the no-GUI install path. - E2E test suite gets a brief subsection inside the CLI section linking out to docs/E2E_TESTING.md.
Builds with `mkdocs build --strict` (zero warnings) into a publishable site covering install, usage, features (MTPLX, DFlash, cache strategies), CLI reference (driven from live /openapi.json), architecture (controller routing + engines + runtime paths), testing (importing the existing E2E_TESTING content), troubleshooting, contributing, and a reference section (HTTP API, env vars, third-party deps, changelog). - mkdocs.yml: Material theme + tabs nav + standard pymdownx extensions. - requirements-docs.txt: mkdocs, mkdocs-material, pymdown-extensions. - .gitignore: exclude the site/ build output. - exclude_docs in mkdocs.yml hides four pre-existing legacy docs (E2E_TESTING.md root copy + the image-discover/MVP/provenance notes) that are not part of the new site nav.
Builds the strict MkDocs site on every push to staging that touches docs/, mkdocs.yml, requirements-docs.txt, or this workflow, then rsyncs the output into cryptopoly/ChaosEngineAI-Site under docs/ and pushes to that repo's main branch. The marketing site serves the result at https://chaosengineai.com/docs/ — subdirectory hosting so backlinks accrue to the main domain for SEO. mkdocs.yml site_url updated from readthedocs.io to chaosengineai.com/docs/ so generated canonical URLs, sitemap, and OG tags point at the real host. Requires a single new secret on this repo: SITE_REPO_DEPLOY_KEY — SSH deploy key with write access to the ChaosEngineAI-Site repo. Generate with ssh-keygen, add the public half there as a deploy key (write enabled), private half here as an Actions secret. Documented inline in the workflow header. Manual workflow_dispatch is also wired for hot-fixes outside the push-trigger window.
Investigation of recent activity on spec-decoding + KV cache compression upstreams.
Findings:
- llama.cpp PR #22673 (MTP support) merged 2026-05-16; ships --spec-type draft-mtp --spec-draft-n-max N. Canonical MTP GGUFs published under ggml-org/ for Qwen3.6-27B and Qwen3.6-35B-A3B.
- turboquant-mlx-full unchanged at 0.3.0 (our current pin).
- WeianMao/triattention HEAD c3744ee6 = our pin; no new MLX work.
- TheTom/turboquant_plus has a C++ TriAttention V3 hybrid policy in the llama-cpp-turboquant fork's experiment branch; not yet independently reproduced.
- The leftcurvedev_ tweet could not be verified (X auth wall).
Includes diff sketch + recommended PR sequence + open questions. See doc for sources.
New follow-up row tracking the GGUF half of FU-028 now that PR #22673 merged upstream. Lists action plan (6 wiring steps), upstream caveats, and links to the upstream-research write-up.
Closes the GGUF half of FU-028. PR #22673 by am17an merged upstream
2026-05-16T12:06:24Z (merge commit 2555826) shipping
--spec-type draft-mtp --spec-draft-n-max N for models with baked-in
Multi-Token Prediction heads. Upstream-reported ~72% acceptance
@ N=3 on Qwen3.6-27B, ~2x tok/s vs no-spec baseline.
Code changes
- _mtp.py: new is_mtp_gguf_repo() + _MTP_GGUF_REPOS frozenset.
4 new aliases for the canonical mirrors (ggml-org/*) and author
preview (am17an/*) GGUF repos so has_mtp_heads + get_mtp_draft_n
return the right canonical N.
- llama_cpp_engine.py: _build_command grew speculative_decoding +
canonical_repo + model_ref kwargs. Emits --spec-type draft-mtp
--spec-draft-n-max <get_mtp_draft_n> when the binary supports
--spec-type AND the canonical repo matches is_mtp_gguf_repo.
Falls back to standard decode + clear runtimeNote when the
binary lacks --spec-type (older llama-server builds, e.g.
homebrew bottles built before 2026-05-16T12Z).
- base.py: new ggufMtpAvailable: bool on BackendCapabilities,
serialised in to_dict so the frontend can show MTP affordances
for GGUF models alongside the existing mtplxAvailable flag.
- capabilities.py: _probe_native_backends sets ggufMtpAvailable
from _llama_server_supports("--spec-type") against either the
standard or turbo binary.
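The flag-emission decision described for _build_command can be sketched as a small pure function. The function name, its parameters, and the wording of the fallback note are assumptions for illustration; only the llama-server flags come from the commit text:

```python
from typing import List, Optional, Tuple


def mtp_spec_args(help_text: str, is_mtp_repo: bool, draft_n: int) -> Tuple[List[str], Optional[str]]:
    """Return (extra llama-server args, runtime note) for GGUF MTP (hypothetical helper)."""
    if not is_mtp_repo:
        # Non-MTP repo: no-op, no note.
        return [], None
    if "--spec-type" not in help_text:
        # Older binary: fall back to standard decode and explain why.
        return [], "llama-server lacks --spec-type; using standard decode"
    return ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n)], None
```

Keeping the decision free of subprocess calls makes it easy to unit-test the three cases (happy path, old binary, non-MTP repo) with plain strings.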
Catalog (text_models.py)
- ggml-org/Qwen3.6-27B-MTP-GGUF (Q8_0, 29 GB)
- ggml-org/Qwen3.6-35B-A3B-MTP-GGUF (Q8_0, 37 GB, MoE)
Both with the qwen3.6 family, vision via auto-detected mmproj
sibling, runtime note flags D2H prompt-processing caveat per
upstream PR body.
Tests (test_inference.py)
- 5 new cases: happy-path MTP flag emission; binary-lacks-spec-type
fallback runtimeNote; non-MTP repo no-op; canonical + author
alias coverage; draft-n lookup through aliases.
- Full suite: 1418 passed, 1 skipped (up from 1413, no regressions).
Tracker
- CLAUDE.md FU-047 row flipped to ~~shipped~~ with full landing
receipt. FU-028 stays open for the MLX side (mlx-lm has no
native MTP head loader; MTPLX subprocess remains the workaround).
Live-verification status
- Backend probe reports ggufMtpAvailable=True against homebrew
llama.cpp 9150 (advertises --spec-type) BUT homebrew bottle
9150 was built before PR #22673 merged today, so its
--spec-type help list still omits draft-mtp. Backend wiring
is correct; users need a llama-server built from master at or
after commit 2555826 to actually fire draft-mtp speculative
decoding. Next homebrew bottle picks this up automatically.
- MLX side comparison: MTPLX path (subprocess via /v1) runs the
same Qwen3.6-27B-MTPLX-Optimized-Speed model at ~24.7 tok/s
versus ~29.0 tok/s for the standard mlx-lm worker — currently
*slower* on this hardware (M5), likely from HTTP-proxy overhead
on per-token roundtrips eating the spec-dec acceptance gains.
Investigating separately; not blocking this commit.
Research write-up: docs/UPSTREAM_RESEARCH_2026-05-16.md
…s stale llama-server
Three independent fixes shipped together because they all surfaced while
live-benching MTPLX vs MLX baseline on Qwen3.6-27B.
1. MTPLX no longer pops a browser window
   ``mtplx start`` defaults to MTPLX's interactive onboarding which on first
   run picks the ``web`` surface and opens a chat UI in a browser tab. Users
   who only asked ChaosEngineAI to load a model got an unrelated browser
   window. Switched the subprocess invocation in MtplxEngine to
   ``mtplx quickstart --yes`` which is the server-only entry point: pure
   HTTP at /v1, no UI, no prompts. Also pass ``--host 127.0.0.1`` explicitly
   + ``--mtp --depth N`` so the speculative path actually fires with the
   registered draft-token count.
2. Draft depth bumped 1 -> 3 for Youssofal Optimised models
   The earlier conservative N=1 made HTTP-proxy overhead dominate any
   spec-dec acceptance gain. Live bench: depth=1 ran the same model at
   ~24.7 tok/s vs ~29.0 tok/s for plain MLX (15% SLOWER). With depth=3
   (matches MTPLX's own UI default), the same bench averaged ~27.2 tok/s
   with the first run hitting 30.4 tok/s, within 5% of baseline and
   occasionally beating it. The remaining gap is HTTP-roundtrip overhead,
   not algorithm.
3. Pre-build gate warns when staged llama-server lacks draft-mtp
   FU-047 wired GGUF MTP via llama.cpp PR #22673 merged today, but homebrew
   bottle 9150 was built before the merge: it advertises ``--spec-type``
   but the value ``draft-mtp`` isn't in its help. Catalog rows for the MTP
   GGUFs will fail-load until the bundled binary is at master >= 2026-05-16.
   Pre-build now greps the help text and surfaces a WARN row pointing
   operators at ``brew upgrade llama.cpp`` or a rebuild-from-master.
Test fixtures (stub_mtplx_server.py)
- Accept both ``quickstart`` (new) and ``start`` (legacy) subcommands.
- Accept the new flags MtplxEngine emits (--host, --mtp, --no-mtp, --depth,
  --yes) so the 10 integration tests still pass against the new command
  shape.
Live verification
- ./scripts/chaosengine-cli load Qwen3.6-27B-MTPLX-Optimized-Speed
  --canonical-repo Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed --backend mlx
  --spec returns runtimeNote "MTPLX MTP speculative decoding active (draft
  tokens: 3, model: Qwen3.6-27B-MTPLX-Optimized-Speed)".
- Three sequential generations: 30.4, 27.5, 23.6 tok/s (avg 27.2); no
  browser window opens; subprocess stays clean.
Test suite: 1418 passed, 1 skipped (no regressions).
Answers a user-flagged Q from this session:
- "MTPLX page just opened in browser - won't happen to users right?"
  No: this commit silences the pop.
- "Homebrew llama-server too old - should ChaosEngineAI versions ship a
  newer one?" Stage-runtime.mjs auto-downloads the latest ggml-org/llama.cpp
  release at build time, so once a new tagged release lands post-merge the
  next ChaosEngineAI build picks it up automatically. The new pre-build
  warning makes the gap visible immediately rather than only at first user
  attempt.
Bumps the four version sources of truth that downstream code reads:
- pyproject.toml (backend uses this via _resolve_app_version + reports back
  through /api/health + /api/diagnostics/snapshot)
- package.json (frontend bundling, npm scripts)
- src-tauri/tauri.conf.json (desktop installer + auto-updater)
- src-tauri/Cargo.toml (Rust shell binary)
Cargo.lock regenerates on next ``cargo build``.
Headline for this release (full notes in RELEASE_NOTES_v0.9.2.md):
- chaosengine-cli: full headless automation, 95 typed shortcuts + 100%
  backend route coverage
- MTPLX native MTP speculative decoding on Apple Silicon (Apache 2.0)
- GGUF MTP via llama.cpp PR #22673 (--spec-type draft-mtp)
- Phased E2E test suite + auto-deployed MkDocs documentation site
- 5 backend bug fixes (TurboQuant hybrid-attn, stale library scan, mmproj
  scoping, CLI response shape, MTPLX browser-pop)
No v0.9.1 was tagged; going 0.9.0 -> 0.9.2.
Backend ships 9 diffusion cache strategies via cache_compression.registry
but the frontend only let users pick 2 (fbcache, teacache). Marketing
site claimed '9 strategies' — discrepancy.
Triaged the 4 hidden ones against value-vs-noise:
Worth UI exposure:
- TaylorSeer: native diffusers 0.38 core, generic across FLUX / SD3
/ Wan / Hunyuan / LTX / CogVideoX / Mochi. ~2.4x speedup.
- PAB (Pyramid Attention Broadcast): native diffusers 0.38 config,
~2x speedup. Different mechanism than FBCache (attention reuse vs
first-block-skip) so a real alternative not a duplicate.
Kept backend-only (CLI / API):
- MagCache: FLUX-only without calibration UX; footgun on other DiTs.
- FasterCache: ~1.9x — same ballpark as FBCache so adds choice
without adding capability.
Changes:
- ImageCacheStrategyId + VideoCacheStrategyId unions extended:
'none' | 'fbcache' | 'teacache' -> ... | 'taylorseer' | 'pab'
- IMAGE_CACHE_STRATEGIES list grows by 2 entries with hints describing
what each does.
- IMAGE_CACHE_STRATEGY_DEFAULT_THRESH + VIDEO_CACHE_STRATEGY_DEFAULT_THRESH
set taylorseer/pab thresholds to 0 (means 'use diffusers default skip
interval' — these adapters key off cache_interval not threshold).
- imageCacheStrategiesForRepo gating unchanged: UNet pipelines still
only see 'Off'; FLUX gets all 5; other DiTs get all 5 minus TeaCache
(no calibration tables for those pipelines).
Backend already accepts any string for cacheStrategy (Pydantic field is
'str | None' and registry.get() handles the lookup), so no Python
schema change needed — the new ids route straight through to the
existing cache_compression.{taylorseer,pab} adapters.
Tests: 214 cache/image/video tests pass; full Py + TS suites green;
tsc clean.
Two follow-ups from the v0.9.2 MTPLX bench:
1. MTPLX --profile performance-cold --max
The MTPLX subprocess defaults to ``sustained`` runtime profile,
which thermally throttles for long-running serves. For chat where
the user is staring at the textarea waiting on the first response,
``performance-cold --max`` is the right preset — full clocks, no
throttling. Live re-bench on M5 with N=3 + burst still didn't beat
plain mlx-lm (avg 23.9 tok/s vs 29 baseline; throughput degraded
over consecutive runs from 27.4 -> 20.8 suggesting M5 thermal
limits dominate regardless of profile). Keep the flag — burst is
the *right* config for interactive use; the gap is hardware, not
ChaosEngineAI's fault.
2. /api/setup/llama-server-status + chaosengine-cli llama-server-status
New read-only endpoint probes the resolved llama-server binary:
- Reports build number from ``llama-server --version``
- Greps ``llama-server --help`` for ``draft-mtp`` in --spec-type
- Returns platform-aware upgrade command (brew on macOS, tarball
on Linux, scoop on Windows)
- Surfaces a clear ``message`` field telling the user *why* MTP
GGUF won't fire on their current binary
The frontend can now show an "Outdated llama-server" banner under
any MTP GGUF catalog entry and link the upgrade command directly.
Why a read-only probe (not a one-click installer):
- Each OS has a preferred package-manager path (brew / apt / pacman /
scoop / chocolatey); wrapping all of them is a footgun until we
vendor our own llama.cpp build.
- The release pipeline already pulls ggml-org/llama.cpp's latest
GitHub release tarball at stage-runtime time. After a fresh ggml-org
release tag the bundled binary catches up automatically; users on
homebrew need ``brew upgrade llama.cpp`` once the bottle refreshes.
- This endpoint makes the gap legible without taking responsibility
for the install.
Test fixture (stub_mtplx_server.py) — accept the new --profile and
--max flags that MtplxEngine now passes so the integration tests
keep passing.
Full suite: 1418 passed, 1 skipped.
…3.6 MTP N to 3
Two follow-ups while running the FU-047 head-to-head benchmark:
1. Status probe was bypassing the env override
   _resolve_llama_server() in routes/setup/llama_server.py only checked
   /opt/homebrew/bin and PATH, ignoring CHAOSENGINE_LLAMA_SERVER. The
   inference engine resolver in inference/binaries.py does honour the env
   var (set by the Tauri shell pointing at the bundled binary, or by
   developers pointing at a freshly-built source build). The status probe
   now follows the engine's resolution priority (env override > homebrew >
   PATH) so the UI banner is honest about which binary actually runs.
2. MTP_MODEL_MAP Qwen3.6 entries: N 1 -> 3
   Earlier sustained-bench at N=1 left tokens on the table; upstream PR
   #22673 reports ~72% acceptance at N=3 on Qwen3.6-27B. Live bench on M5
   with N=3 on Q8_0 GGUF: 1.51x speedup over Q8_0 baseline (20.9 vs 13.8
   tok/s). N=1 was already 1.46x; bumping to 3 nudges it up another 4%.
   Diminishing returns past N=3 per the PR body.
Full head-to-head bench (M5, same prompt, 256 max tokens, 3 runs):
  MLX baseline (Youssofal MTPLX-Optimized-Speed)   29.0 tok/s
  MTPLX subprocess N=3 burst                       24-27 (variable)
  GGUF Q4_K_M baseline (lmstudio Qwen3.6-27B)      18.4 tok/s
  GGUF Q8_0 baseline (ggml-org Qwen3.6-27B-MTP)    13.8 tok/s
  GGUF Q8_0 + MTP N=1 (this fix)                   20.1 tok/s (+1.46x)
  GGUF Q8_0 + MTP N=3 (this commit)                20.9 tok/s (+1.51x)
Net findings:
- FU-047 (GGUF MTP) delivers the real, measurable speedup. 1.5x on the same
  model + quant + hardware is the headline.
- MTPLX subprocess via HTTP underperforms on M5 even with depth=3 + burst
  profile. Subprocess overhead > MTP acceptance gain.
- Plain MLX-LM at Youssofal's BF16-ish encoding still wins absolute
  throughput because the per-token compute is just smaller.
Verified with /tmp/llama.cpp HEAD build (commit 6049906) installed to
~/.chaosengine/bin/llama-server.
ggml-org/llama.cpp release b9181 (2026-05-16 17:06 UTC) is one commit past the MTP merge and is what stage-runtime.mjs will pull on the next ChaosEngineAI build, so the bundled binary in .app installs will ship with draft-mtp out of the box — no homebrew dependency for end users.
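The resolution priority from this commit (env override > homebrew > PATH) can be sketched as below. The function name and the injectable `exists` / `which` parameters are assumptions added so the order can be tested without touching the real filesystem:

```python
import os
import shutil
from typing import Callable, Mapping, Optional

HOMEBREW_BINARY = "/opt/homebrew/bin/llama-server"


def resolve_llama_server(
    env: Mapping[str, str] = os.environ,
    exists: Callable[[str], bool] = os.path.isfile,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Resolve llama-server: env override, then homebrew, then PATH."""
    override = env.get("CHAOSENGINE_LLAMA_SERVER")
    if override and exists(override):
        return override
    if exists(HOMEBREW_BINARY):
        return HOMEBREW_BINARY
    # shutil.which returns None when the binary is not on PATH.
    return which("llama-server")
```

Sharing one resolver between the status probe and the engine would remove the chance of the two drifting apart again; whether the backend does that is not stated here.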
Three colleague-feedback items after MTPLX root-cause investigation:
- _mtp.py: model_has_mtp_tensors() peeks GGUF header for mtp_decoder /
mtp_emb / mtp_heads byte strings, or probes safetensors index for
mtp_*. keys / mtp.safetensors shard. has_mtp_heads_strict(repo, path)
prefers tensor probe over name aliases — catches new MTP-bearing
repos we haven't enumerated and rejects name collisions that don't
carry the tensors (FU-041-style false positives).
- controller._select_engine + llama_cpp_engine._build_command both
switch to the strict / tensor-probe path; GGUF MTP gate falls back
to is_mtp_gguf_repo when no local path is available.
- routes/setup/mtplx.py: /api/setup/mtplx-status now reports
fanControl.{thermalforge,tgPro,anyAvailable,recommendedAction} so
the Setup tab can prompt for ThermalForge install before users hit
the silent-throttle ceiling on MTPLX --max burst runs.
- CLAUDE.md FU-048: deferred prefer-GGUF-MTP routing preference —
needs Settings UX before flipping the default since MTPLX-Optimized
quants aren't GGUF-mirrored.
- tests: 7 new tensor-probe cases in test_inference.py; bumped stale
N=1 assertion to N=3 for Qwen3.6 MTP GGUFs (matches MTP_MODEL_MAP).
PR #22673 names MTP weights as ``blk.{N}.nextn.*`` ("Next-N
prediction") and emits ``<arch>.nextn_predict_layers`` in the GGUF
metadata header, neither of which matched the legacy ``mtp_decoder``
/ ``mtp_emb`` / ``mtp_heads`` needles. As a result tensor probe
returned False on a real MTP-GGUF model and the engine never emitted
``--spec-type draft-mtp`` (verified live: ggml-org/Qwen3.6-27B-MTP-GGUF
ran at 14.5 tok/s instead of MTP-accelerated 23 tok/s).
The metadata key lives in the first few KB of the file, so a 2 MB
read window catches both the cheap canonical marker and the legacy
patterns. Probe now returns True for ggml-org/Qwen3.6-27B-MTP-GGUF.
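A minimal sketch of the header probe, assuming a plain byte-substring scan over the first 2 MB. The function name is hypothetical; the needles are the metadata key and legacy names mentioned above, and `GGUF` is the file format's magic:

```python
from pathlib import Path

READ_WINDOW = 2 * 1024 * 1024  # 2 MB, enough to cover the metadata header
NEEDLES = (
    b"nextn_predict_layers",  # metadata key written by PR #22673 conversions
    b"mtp_decoder",           # legacy names kept for older conversions
    b"mtp_emb",
    b"mtp_heads",
)


def gguf_has_mtp_markers(path: Path) -> bool:
    """Cheap probe: scan the start of a GGUF file for MTP marker strings."""
    with open(path, "rb") as handle:
        head = handle.read(READ_WINDOW)
    if not head.startswith(b"GGUF"):
        # Not a GGUF file at all; never claim MTP support.
        return False
    return any(needle in head for needle in NEEDLES)
```

A substring scan avoids parsing the full key-value table, at the cost of matching the needle anywhere in the window rather than only in a metadata key.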
Head-to-head live numbers (M5 Max, 27B Qwen3.6, MTP enabled both sides):
- GGUF MTP Q8_0: 23.0 tok/s mean (14.5 baseline -> +58.6%)
- MTPLX MTP 4-bit (Optimized-Speed): 28.95 tok/s mean
Tests: rename legacy-tensor-name case + new case pinning the
nextn_predict metadata marker. 15 MTP tests green.
v0.9.0 release shipped with package.json / Cargo.toml / tauri.conf.json at 0.9.0 but pyproject.toml still at 0.8.0: users downloaded "v0.9.0" from the site and the bundled backend reported ``appVersion: 0.8.0`` because ``_resolve_app_version`` reads the staged pyproject. Nothing enforced cross-manifest sync. The pre-build gate now reads the version from all four sources and fails the build when any of them drift apart. Mirrors the existing dflash-mlx pin assert in both ``pre-build-check.mjs`` and ``pre-build-check.sh``.
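The actual gate lives in the .mjs and .sh pre-build scripts; the Python sketch below only illustrates the drift check. All function names are hypothetical, and it assumes the first line-leading `version = "..."` in a TOML file is the package version and that the JSON manifests carry a top-level `version` key:

```python
import json
import re
from typing import Dict, List

_TOML_VERSION = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def toml_version(text: str) -> str:
    """First line-leading version field (assumed to be the package version)."""
    match = _TOML_VERSION.search(text)
    if match is None:
        raise ValueError("no version field found")
    return match.group(1)


def json_version(text: str) -> str:
    """Top-level "version" key (an assumption about the manifest layout)."""
    return json.loads(text)["version"]


def distinct_versions(versions: Dict[str, str]) -> List[str]:
    """Sorted distinct versions; more than one entry means drift."""
    return sorted(set(versions.values()))


def check_in_sync(versions: Dict[str, str]) -> None:
    """Raise SystemExit listing every manifest when versions disagree."""
    if len(distinct_versions(versions)) > 1:
        detail = ", ".join(f"{name}={value}" for name, value in sorted(versions.items()))
        raise SystemExit(f"version drift: {detail}")
```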
…ies + LTX series - Cache compression table gains TaylorSeer / MagCache / PAB / FasterCache rows (previously only TeaCache + FBCache appeared even though the four diffusers-0.38 strategies shipped via FU-026) - DFlash family list adds Gemma-4 (FU-031), Kimi-K2.6, MiniMax-M2.5/M2.7, Qwen3.5-122B-A10B — all in DRAFT_MODEL_MAP - Video model table splits Lightricks LTX-Video (base diffusers) from LTX-2 / LTX-2.3 (mlx-video subprocess) — both ship in the catalog - Feature-map line now reads "FBCache + TeaCache + TaylorSeer + MagCache + PAB + FasterCache" instead of TeaCache-only
…ound job)
Adds a self-contained path for users with a working CUDA torch install to
upgrade to a newer wheel without re-running the full 2.5 GB GPU bundle.
Surfaces as a compact pill in the Image / Video Studio runtime banners
when ``realGenerationAvailable`` AND the matching cu{N} pip index serves
a newer wheel than the one on disk; silent otherwise.
Backend
- ``_install_helpers.py``: ``_extract_cuda_tag``, ``_index_url_for_cuda_tag``,
``_parse_version_triple``, ``_classify_torch_upgrade``,
``_query_latest_torch_version`` (pip index versions parser, both output
shapes), ``_abi_dependents_present``, ``_move_torch_to_rollback`` +
``_restore_torch_from_rollback`` + ``_cleanup_old_torch_rollbacks``,
``_TORCH_ABI_DEPENDENT_PACKAGES`` constant.
- ``routes/setup/torch_upgrade.py``: GET /api/setup/torch-upgrade-available
(synchronous detection, returns ``{available, current, latest,
upgradeType, rebuildPackages, indexUrl}`` or ``{available: false,
reason}``) and POST /api/setup/upgrade-torch (background job mirroring
install-gpu-bundle pattern). Worker moves existing torch to
``.torch-rollback-<version>/`` instead of purging, installs target from
the same cu{N} index, re-pins constraint, force-reinstalls ABI deps on
minor/major bumps (bitsandbytes/torchao/nunchaku/sageattention),
verifies CUDA in a subprocess, restores rollback on verify failure,
keeps the most recent rollback as a safety net.
Frontend
- ``src/api/setup.ts``: ``checkTorchUpgradeAvailable``, ``startTorchUpgrade``,
``getTorchUpgradeStatus`` + 4 exported types. Re-exported via
``src/api/index.ts``.
- ``src/components/TorchUpgradePill.tsx``: one-shot probe on mount,
hides when ``available: false``, three display states (available /
in-progress / done-or-error), polls status at 1.5 Hz with cleanup
keyed by ``job.done``, inline collapsible install log with
phase-named markers. Restart Backend hook plumbed through.
- ``src/styles.css``: color-coded badges per upgrade type
(patch=green / minor=amber / major=red).
- Wired into ``ImageStudioRuntimeBanner`` + ``VideoStudioRuntimeBanner``;
renders only when ``realGenerationAvailable`` so users with broken
torch are not second-guessed.
Tests
- 24 new tests in ``tests/test_setup_routes.py`` covering every helper
(version parsing edge cases including ``2.6.0rc1`` that caught a real
bug in the first cut where digits across non-digit boundaries leaked
into the parsed triple), both pip output shapes, rollback move/restore
round-trip with simulated half-install in extras, cleanup mtime
ordering, and all 8 detection-response shapes plus the apple-silicon
rejection and running-job POST cases.
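A hedged sketch of the version-triple parsing and upgrade classification covered by these tests. The names mirror the helpers listed under Backend but the bodies are illustrative, not the shipped implementation:

```python
import re
from typing import Optional, Tuple

# Anchored at the start so suffix digits (e.g. the "1" in "rc1") cannot leak in.
_TRIPLE = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def parse_version_triple(version: str) -> Optional[Tuple[int, int, int]]:
    match = _TRIPLE.match(version.strip())
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def classify_torch_upgrade(current: str, latest: str) -> Optional[str]:
    """Return "major", "minor" or "patch", or None when there is no upgrade."""
    cur = parse_version_triple(current)
    new = parse_version_triple(latest)
    if cur is None or new is None or new <= cur:
        return None
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        return "minor"
    return "patch"
```

Note that this sketch treats a pre-release such as ``2.6.0rc1`` as equal to ``2.6.0``; how the real helper ranks pre-releases is not stated in the commit.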
Drive-by: package-lock.json version field was lagging at 0.8.0 after
the 0.9.0 bump; synced.
Verified: 24/24 new tests pass, 80/80 ``test_setup_routes.py`` pass,
217/217 setup + backend + services + inference pass, 371/371 vitest
pass, ``tsc --noEmit`` clean. Pre-existing ``test_cache_strategies`` /
``test_sdcpp_*`` / ``test_preview_thumbnails`` failures verified to
exist on baseline (Windows env + optional diffusers deps), unrelated.
Run the CLI-driven E2E suite reliably on Windows by invoking the extensionless CLI through Python, writing reports as UTF-8, and treating missing video runtime prerequisites as skips. Also make the Vitest config ESM-safe for runner-mode loading and keep the Tauri lockfile version in sync.
Bundles the M4 Max test-suite session work: tracker rows, two real bugs, two user-requested features.
Tracker (CLAUDE.md):
- FU-049 Python 3.14 support gate (deferred; pyproject stays >=3.10)
- FU-050 matrix runner: reasoning-channel capture + max-tokens 96->512 + stale endpoint/path fixes (/api/chat/generate/stream, runtime.loadedModel)
- FU-051 /api/models/load echoes legacy cacheStrategy verbatim (open)
- FU-052 matrix grows 9->15 cells: MTPLX MLX, GGUF MTP, 4x vLLM (CUDA-gated)
- FU-053 distill variants flagged installed when only base repo on disk
- FU-054 same-repo siblings: per-file size + shares-storage badge
- FU-055 in-app storage explorer in Diagnostics tab
Bugs fixed:
- _distill_transformer_validation_error checks distillTransformerRepo + high/low-noise filenames before marking availableLocally true. Closes the FU-053 false positive on Wan2.2-I2V-A14B distill bf16/fp8.
- pre-build-check.sh + .mjs pointed at the wrong turbo fork (johndpope/...planarquant); corrected to TheTom/...turboquant-kv-cache matching build-llama-turbo.sh + CLAUDE.md.
Features:
- Star/favourite models on Chat -> My Models. New favoriteModelRefs in settings (+ UpdateSettingsRequest field, dedup-trimmed apply, payload), ActionIconName 'star'/'starOutline' SVG, .action-favorite CSS, toggle handler in App.tsx writes via PATCH /api/settings + refreshes. Starred rows lift to top of the library list.
- Diagnostics 'Disk usage - top 20 model repos' section. New GET /api/diagnostics/storage-top endpoint walks every enabled modelDirectories entry one level deep, sums via _path_size_bytes (inode-deduped so HF snapshot/blob symlinks count once). Closes the Stuff Diver gap on HF cache layouts. Live: 1213 GB total on this box.
Matrix runner (scripts/cache-strategy-matrix.py):
- Smoke models bumped Qwen2.5-0.5B -> Qwen3-0.6B (current gen)
- New cells for MTPLX MLX, GGUF MTP (FU-047), vLLM native/turboquant/triattention/dflash. BackendCapabilities adds mtplx_available, gguf_mtp_available, vllm_available probed from /api/health.
Tests:
- 4 new unit tests pin the FU-053 distill validator (no-distill, missing snapshot, partial snapshot, both files present).
- test_cache_strategy_matrix_runner kwargs widened for new caps.
- Full suite: 1455 passed, 1 skipped, 132 subtests passed.
- npx tsc --noEmit clean; npm test 32 files / 371 tests pass.
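The inode-deduped sizing behind the storage-top endpoint can be sketched as follows. This is a sketch under assumptions, not the backend's ``_path_size_bytes``: the names ``path_size_bytes_deduped`` and ``top_repos`` are hypothetical, and only the described behaviour (one level deep, each inode counted once, top 20) is taken from the commit message.

```python
import os
from typing import Iterable, List, Set, Tuple


def path_size_bytes_deduped(root: str) -> int:
    """Sum file sizes under `root`, counting each (device, inode) pair once.

    In a Hugging Face cache layout, snapshots/ holds symlinks into blobs/;
    os.stat follows those links, so both paths resolve to the same inode.
    """
    seen: Set[Tuple[int, int]] = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # broken symlink or file removed mid-walk
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_size
    return total


def top_repos(model_dirs: Iterable[str], limit: int = 20) -> List[Tuple[str, int]]:
    """Size every immediate subdirectory of each model dir; return the largest."""
    rows: List[Tuple[str, int]] = []
    for base in model_dirs:
        try:
            entries = list(os.scandir(base))
        except OSError:
            continue  # directory missing or unreadable
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                rows.append((entry.path, path_size_bytes_deduped(entry.path)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]
```

Without the ``seen`` set, a file reachable through both ``blobs/`` and ``snapshots/`` would be counted twice, inflating per-repo totals.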
Brings PR #57 up to date with staging (29 commits behind). Two conflicts resolved manually:
- src/api/index.ts: re-exports from ./setup. Both branches added MTPLX-related exports independently; took the union alphabetically sorted (getMtplxInstallStatus, getMtplxStatus, startMtplxInstall).
- src/api/setup.ts: torch-upgrade machinery was introduced via parallel commits on both branches (our dca2c12 + staging f514ea4). Auto-merge produced duplicate TorchUpgradeAvailability / TorchUpgradeType / TorchUpgradeUnavailableReason / TorchUpgradeAttempt / TorchUpgradeJobState declarations + duplicate checkTorchUpgradeAvailable / startTorchUpgrade / getTorchUpgradeStatus functions. Removed the duplicate block; kept one canonical section with types + functions in proper order.
Verified post-merge:
- npx tsc --noEmit: clean
- npm test: 32 files / 371 tests pass
- pytest tests/: 1455 passed, 1 skipped, 132 subtests passed