Feature/chaos engine ai cli by cryptopoly · Pull Request #57 · cryptopoly/ChaosEngineAI

cryptopoly · 2026-05-17T09:24:14Z

No description provided.

Stdlib-only Python shim over the FastAPI backend's HTTP surface so features can be tested headlessly (load + prompt + bench + status) without GUI click-through. - scripts/chaosengine-cli: serve/status/load/unload/prompt/bench/ mtplx-install/mtplx-status subcommands; SSE streaming for live token output; JSON to stdout for jq composition - tests/test_cli_smoke.py: 12 unit tests covering parser + happy path for every subcommand via mocked urllib No new pip deps. No bundle-size impact (script lives in scripts/, not packaged into the .app).

Expands chaosengine-cli to wrap all 125 backend routes (95 typed shortcuts + generic call/routes/openapi dispatchers) and adds a phased E2E test suite that drives the CLI end-to-end against a live backend, mirroring every major app surface. CLI (scripts/chaosengine-cli, 1656 LOC) - Generic dispatcher: call <METHOD> <PATH> with --body/--file/--query /--stream; reaches every endpoint regardless of typed coverage. - routes + openapi subcommands fetch /openapi.json so route inventory stays in sync without codegen. - Typed shortcuts cover Chat (load/unload/prompt/bench/sessions/compare), HTML Challenges (list/get/file/create/repair/retry/validate/delete), Image Studio (generate/progress/cancel/outputs/library/catalog/ download lifecycle), Video Studio (same shape + output-file binary save), Server (status/shutdown/logs SSE), Setup (mtplx/longlive/wan/ cuda-torch/gpu-bundle install + status + probes), Diagnostics, and misc (gpu-status, cache-preview, prompts, settings, workspaces, plugins, tools, adapters, finetuning, v1-models). - Fixed two real schema-shape bugs caught while building the E2E suite: image-generate sends modelId/guidance (was modelRef/ guidanceScale); video-generate sends modelId/numFrames/guidance. E2E suite (scripts/e2e_test_suite.py) - 8 phases: 0 Environment probe, 1 Chat (MLX+GGUF+cache+DFlash+ MTPLX+long-context+fused-attn), 2 Chat Compare, 3 HTML Challenge, 4 Image Studio, 5 Video Studio, 6 Setup probes (read-only), 7 Diagnostics + cleanup hygiene. - Pass criteria concrete: HTTP 200, tokS>0 for generation, expected substring in runtimeNote for DFlash/MTPLX routing, zero unclean orphan workers after the sweep. - Auto-skip semantics: checks skip cleanly when a prerequisite is missing (model not on disk, install missing) rather than failing. - Reports JSON + Markdown to ~/.chaosengine/test-results/. - --smoke runs Phases 0,3,4,5,6,7 (≤60s, no heavy model loads). - Live smoke pass green on M-series box: 6/6 phases, 26 checks, 45s wall — real FLUX image catalog probe, real LTX-2 video generate, real HTML challenge round-trip. Tests (tests/test_cli_smoke.py, 39 → 39 passing) - Updated to match new schema fields (modelId, numFrames). - All 95 typed subcommands have parser coverage; representative behaviour tests for each category. Docs (docs/E2E_TESTING.md) - Standardised procedure + skip semantics + adding-new-checks guide. CLAUDE.md - New Build Checklist entries: --smoke + full E2E required for release builds and any PR touching inference routing. - "New feature gate": every user-visible feature, engine wiring, catalog model family, install endpoint, or cache/spec-dec strategy must land with an E2E check in the relevant phase.

The resolver walked the parent's sibling subdirectories and the grandparent looking for vision projectors. Under the flat ~/AI_Models/<org>/<repo>/ layout that picked up an unrelated neighbour's mmproj — loading lmstudio-community/gemma-4-31B-it-GGUF attached Qwen3.6-27B's mmproj-Qwen3.6-27B-BF16.gguf and crashed llama-server with a text-vs-mmproj n_embd mismatch (5376 vs 5120). Restrict the scan to parent.iterdir() of the main .gguf — no recursion into subdirs, no walk into the grandparent. Returns None for text-only models so --mmproj is never passed. Adds two regression tests: (a) a flat sibling-dir layout with a stray mmproj in the neighbour returns None, (b) an mmproj inside a subdirectory of the model's own folder is also ignored.

… match Two related bugs surfaced by the E2E suite: 1. After deleting an HF-cache snapshot on disk, /api/workspace still returned the stale entry (broken=true) because nothing rescanned the library. _library() now runs a stat-only existence check on each cached entry per request and kicks a background rescan when anything was pruned. Sub-millisecond on a 500-entry cap. 2. lifecycle.load_model() rejected loads with "Cannot load 'X': <reason>" whenever the library lookup matched a broken entry, even when the caller supplied an explicit request.path pointing at the real weights elsewhere. The broken-entry guard now defers to the caller when path is set AND exists on disk. _find_library_entry() also prefers a healthy match over a broken one when multiple share the same name. Tests cover: per-request prune, end-to-end /api/workspace exclusion, two-pass healthy-over-broken lookup, path-trust escape hatch happy path, and the negative case where a non-existent path still falls through to the rejection.

Loading Qwen3.5 / Qwen3.6 MoE (and any other model that mixes self-attention with linear-attention layers) with cacheStrategy= turboquant + cacheBits=4 crashed the first generation call with: 'TurboQuantKVCache' object is not subscriptable Root cause: ``make_adaptive_cache`` unconditionally built every cache slot as ``TurboQuantKVCache`` / ``KVCache``. The model's linear-attn layer forward accesses ``cache[0]`` / ``cache[1]`` (it expects an ``ArraysCache(size=2)``), which raises ``TypeError`` on a KV cache. Fix: when a ``model`` is passed and exposes ``make_cache()``, use it as the base. Preserve every non-KV slot (ArraysCache, MambaCache, …) verbatim and only swap the actual ``KVCache`` instances for ``TurboQuantKVCache``. Plain models without ``make_cache`` keep the previous behaviour. Added regression tests in ``test_cache_strategies.py`` covering both the hybrid model path and the no-``make_cache`` fallback. Live- verified against ``mlx-community/Qwen3.6-35B-A3B-4bit`` at 4-bit TurboQuant: generation now completes at ~47 tok/s with no crash.

CLI - cmd_prompt/cmd_bench now read tokS, promptTokens, completionTokens, responseSeconds, runtimeNote from the nested ``assistant.metrics`` payload instead of the (always-null) top level. Was effectively hiding live tok/s numbers from every --metrics call. E2E suite (scripts/e2e_test_suite.py) - _load_unload_prompt: ``canonical_repo`` + ``load_timeout`` parameters threaded through; subprocess timeout = load_timeout + 60s so a backend-level timeout cleanup beat the harness kill. - Phase 1 picker uses Qwen3.6-35B-A3B-4bit (MoE) as the fast model for every Chat check — much quicker to load than 80B Qwen3-Next while exercising the same MLX / cache / spec / fused paths. - Phase 1 MTPLX check uses leaf-name modelRef + --canonical-repo Youssofal/... so the controller routes through MtplxEngine while avoiding the broken-library-entry path-shadow that previously blocked the load (separate bug fixed by Agent B's library-prune + path-trust commit, suite belt-and-braces). - Phase 1 GGUF check cycles through local .gguf files instead of picking the first one; a single broken mmproj pairing no longer fails the whole check (Agent A's mmproj scope fix made this less necessary but the resilience is a keeper). - Phase 7 ``no orphan workers`` tolerates ``terminated`` / ``killed`` records as expected backend cleanup; only ``kill_failed`` or similar non-cleaned states count as failure. Full sweep result with this commit + the three preceding ``fix:`` commits (mmproj scoping, library prune + path-trust, TurboQuant ArraysCache preservation): 8/8 phases PASS — 32/32 checks PASS — 128s wall Phase 1 detail: - MLX native cache: PASS 8.1s - MLX TurboQuant cache: PASS 5.9s (was 500 → fixed by 30441f9) - MLX + DFlash speculative: PASS 19.1s - MLX + MTPLX speculative: PASS 13.2s (was load-blocked → fixed by 566fd64) - GGUF llama.cpp: PASS 12.8s (was mmproj-crash → fixed by 51305c6) - long context cache-preview: PASS 1.2s - fused attention flag: PASS 10.8s

MTPLX (https://github.com/youssofal/mtplx) is the native MTP speculative-decoding runtime ChaosEngineAI shells out to for MTP-bearing models. Installed on-demand into an isolated venv at ~/.chaosengine/mtplx-venv/, not bundled in the desktop .app, driven via subprocess + HTTP from backend_service/inference/mtplx_engine.py. Apache 2.0 — compatible with our MIT+Apache+BSD permissive licence gate (CLAUDE.md §2). Full LICENSE shipped with the wheel under mtplx-*.dist-info/licenses/.

…al response shape Pre-build (scripts/pre-build-check.sh) - New phase 9/9: runs ./scripts/e2e_test_suite.py --smoke when backend is reachable on :8876; warn-skips when not (pre-build doesn't spawn one). Full E2E sweep stays a release-time gate per CLAUDE.md + docs/E2E_TESTING.md. - Notices dep-check list synced with current THIRD_PARTY_NOTICES.md: added mtplx + mlx-video; dropped stale ChaosEngine probe (vendored package was removed in FU-030). Tests (tests/test_cli_smoke.py) - test_prompt_non_streaming_prints_text_and_metrics fixture rebuilt around the real /api/chat/generate response shape: { session, runtime, assistant: { text, metrics: {...} } }. The earlier flat shape was a guess that masked the real CLI bug fixed in 2d1128c. Verification: ./scripts/pre-build-check.sh — 10 passed, 0 failed, 1 warning (unrelated llama-server-turbo update available).

Adds three new feature surfaces to the top-level README without reflowing existing sections. - MTPLX (Multi-Token Prediction) speculative decoding gets a feature-highlight bullet, a mention in the "Why ChaosEngineAI" speculative-decoding paragraph, and a dedicated subsection under "Speculative Decoding" alongside DFlash + DDTree. Covers Apple Silicon support, the isolated mtplx-venv, the model registry in backend_service/inference/_mtp.py, and the auto-routing fallback chain (MTPLX -> DFlash -> standard MLX). - chaosengine-cli gets a new "Headless Automation" section between the Building a Release and Project Layout sections. Documents the generic call dispatcher + 95 typed shortcuts, four quick-start examples, the optional PATH symlink, and the no-GUI install path. - E2E test suite gets a brief subsection inside the CLI section linking out to docs/E2E_TESTING.md.

Builds with `mkdocs build --strict` (zero warnings) into a publishable site covering install, usage, features (MTPLX, DFlash, cache strategies), CLI reference (driven from live /openapi.json), architecture (controller routing + engines + runtime paths), testing (importing the existing E2E_TESTING content), troubleshooting, contributing, and a reference section (HTTP API, env vars, third-party deps, changelog). - mkdocs.yml: Material theme + tabs nav + standard pymdownx extensions. - requirements-docs.txt: mkdocs, mkdocs-material, pymdown-extensions. - .gitignore: exclude the site/ build output. - exclude_docs in mkdocs.yml hides four pre-existing legacy docs (E2E_TESTING.md root copy + the image-discover/MVP/provenance notes) that are not part of the new site nav.

Builds the strict MkDocs site on every push to staging that touches docs/, mkdocs.yml, requirements-docs.txt, or this workflow, then rsyncs the output into cryptopoly/ChaosEngineAI-Site under docs/ and pushes to that repo's main branch. The marketing site serves the result at https://chaosengineai.com/docs/ — subdirectory hosting so backlinks accrue to the main domain for SEO. mkdocs.yml site_url updated from readthedocs.io to chaosengineai.com/docs/ so generated canonical URLs, sitemap, and OG tags point at the real host. Requires a single new secret on this repo: SITE_REPO_DEPLOY_KEY — SSH deploy key with write access to the ChaosEngineAI-Site repo. Generate with ssh-keygen, add the public half there as a deploy key (write enabled), private half here as an Actions secret. Documented inline in the workflow header. Manual workflow_dispatch is also wired for hot-fixes outside the push-trigger window.

Investigation of recent activity on spec-decoding + KV cache compression upstreams. Findings: - llama.cpp PR #22673 (MTP support) merged 2026-05-16; ships --spec-type draft-mtp --spec-draft-n-max N. Canonical MTP GGUFs published under ggml-org/ for Qwen3.6-27B and Qwen3.6-35B-A3B. - turboquant-mlx-full unchanged at 0.3.0 (our current pin). - WeianMao/triattention HEAD c3744ee6 = our pin; no new MLX work. - TheTom/turboquant_plus has a C++ TriAttention V3 hybrid policy in the llama-cpp-turboquant fork's experiment branch; not yet independently reproduced. - Tweet at leftcurvedev_ status unable to verify (X auth wall). Includes diff sketch + recommended PR sequence + open questions. See doc for sources.

New follow-up row tracking the GGUF half of FU-028 now that PR #22673 merged upstream. Lists action plan (6 wiring steps), upstream caveats, and links to the upstream-research write-up.

Closes the GGUF half of FU-028. PR #22673 by am17an merged upstream 2026-05-16T12:06:24Z (merge commit 2555826) shipping --spec-type draft-mtp --spec-draft-n-max N for models with baked-in Multi-Token Prediction heads. Upstream-reported ~72% acceptance @ N=3 on Qwen3.6-27B, ~2x tok/s vs no-spec baseline. Code changes - _mtp.py: new is_mtp_gguf_repo() + _MTP_GGUF_REPOS frozenset. 4 new aliases for the canonical mirrors (ggml-org/*) and author preview (am17an/*) GGUF repos so has_mtp_heads + get_mtp_draft_n return the right canonical N. - llama_cpp_engine.py: _build_command grew speculative_decoding + canonical_repo + model_ref kwargs. Emits --spec-type draft-mtp --spec-draft-n-max <get_mtp_draft_n> when the binary supports --spec-type AND the canonical repo matches is_mtp_gguf_repo. Falls back to standard decode + clear runtimeNote when the binary lacks --spec-type (older llama-server builds, e.g. homebrew bottles built before 2026-05-16T12Z). - base.py: new ggufMtpAvailable: bool on BackendCapabilities, serialised in to_dict so the frontend can show MTP affordances for GGUF models alongside the existing mtplxAvailable flag. - capabilities.py: _probe_native_backends sets ggufMtpAvailable from _llama_server_supports("--spec-type") against either the standard or turbo binary. Catalog (text_models.py) - ggml-org/Qwen3.6-27B-MTP-GGUF (Q8_0, 29 GB) - ggml-org/Qwen3.6-35B-A3B-MTP-GGUF (Q8_0, 37 GB, MoE) Both with the qwen3.6 family, vision via auto-detected mmproj sibling, runtime note flags D2H prompt-processing caveat per upstream PR body. Tests (test_inference.py) - 5 new cases: happy-path MTP flag emission; binary-lacks-spec-type fallback runtimeNote; non-MTP repo no-op; canonical + author alias coverage; draft-n lookup through aliases. - Full suite: 1418 passed, 1 skipped (up from 1413, no regressions). Tracker - CLAUDE.md FU-047 row flipped to ~~shipped~~ with full landing receipt. FU-028 stays open for the MLX side (mlx-lm has no native MTP head loader; MTPLX subprocess remains the workaround). Live-verification status - Backend probe reports ggufMtpAvailable=True against homebrew llama.cpp 9150 (advertises --spec-type) BUT homebrew bottle 9150 was built before PR #22673 merged today, so its --spec-type help list still omits draft-mtp. Backend wiring is correct; users need a llama-server built from master at or after commit 2555826 to actually fire draft-mtp speculative decoding. Next homebrew bottle picks this up automatically. - MLX side comparison: MTPLX path (subprocess via /v1) runs the same Qwen3.6-27B-MTPLX-Optimized-Speed model at ~24.7 tok/s versus ~29.0 tok/s for the standard mlx-lm worker — currently *slower* on this hardware (M5), likely from HTTP-proxy overhead on per-token roundtrips eating the spec-dec acceptance gains. Investigating separately; not blocking this commit. Research write-up: docs/UPSTREAM_RESEARCH_2026-05-16.md

…s stale llama-server Three independent fixes shipped together because they all surfaced while live-benching MTPLX vs MLX baseline on Qwen3.6-27B. 1. MTPLX no longer pops a browser window ``mtplx start`` defaults to MTPLX's interactive onboarding which on first run picks the ``web`` surface and opens a chat UI in a browser tab. Users who only asked ChaosEngineAI to load a model got an unrelated browser window. Switched the subprocess invocation in MtplxEngine to ``mtplx quickstart --yes`` which is the server-only entry point: pure HTTP at /v1, no UI, no prompts. Also pass ``--host 127.0.0.1`` explicitly + ``--mtp --depth N`` so the speculative path actually fires with the registered draft-token count. 2. Draft depth bumped 1 -> 3 for Youssofal Optimised models The earlier conservative N=1 made HTTP-proxy overhead dominate any spec-dec acceptance gain. Live bench: depth=1 ran the same model at ~24.7 tok/s vs ~29.0 tok/s for plain MLX (15% SLOWER). With depth=3 (matches MTPLX's own UI default), the same bench averaged ~27.2 tok/s with the first run hitting 30.4 tok/s — within 5% of baseline and occasionally beating it. The remaining gap is HTTP-roundtrip overhead, not algorithm. 3. Pre-build gate warns when staged llama-server lacks draft-mtp FU-047 wired GGUF MTP via llama.cpp PR #22673 merged today, but homebrew bottle 9150 was built before the merge — it advertises ``--spec-type`` but the value ``draft-mtp`` isn't in its help. Catalog rows for the MTP GGUFs will fail-load until the bundled binary is at master >= 2026-05-16. Pre-build now greps the help text and surfaces a WARN row pointing operators at ``brew upgrade llama.cpp`` or a rebuild-from-master. Test fixtures (stub_mtplx_server.py) - Accept both ``quickstart`` (new) and ``start`` (legacy) subcommands. - Accept the new flags MtplxEngine emits (--host, --mtp, --no-mtp, --depth, --yes) so the 10 integration tests still pass against the new command shape. Live verification - ./scripts/chaosengine-cli load Qwen3.6-27B-MTPLX-Optimized-Speed --canonical-repo Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed --backend mlx --spec returns runtimeNote "MTPLX MTP speculative decoding active (draft tokens: 3, model: Qwen3.6-27B-MTPLX-Optimized-Speed)". - Three sequential generations: 30.4, 27.5, 23.6 tok/s (avg 27.2); no browser window opens; subprocess stays clean. Test suite: 1418 passed, 1 skipped (no regressions). Answers a user-flagged Q from this session: - "MTPLX page just opened in browser — won't happen to users right?" No: this commit silences the pop. - "Homebrew llama-server too old — should ChaosEngineAI versions ship a newer one?" Stage-runtime.mjs auto-downloads the latest ggml-org/llama.cpp release at build time, so once a new tagged release lands post-merge the next ChaosEngineAI build picks it up automatically. The new pre-build warning makes the gap visible immediately rather than only at first user attempt.

Bumps the four version sources of truth that downstream code reads: - pyproject.toml (backend uses this via _resolve_app_version + reports back through /api/health + /api/diagnostics/snapshot) - package.json (frontend bundling, npm scripts) - src-tauri/tauri.conf.json (desktop installer + auto-updater) - src-tauri/Cargo.toml (Rust shell binary) Cargo.lock regenerates on next ``cargo build``. Headline for this release (full notes in RELEASE_NOTES_v0.9.2.md): - chaosengine-cli — full headless automation, 95 typed shortcuts + 100% backend route coverage - MTPLX native MTP speculative decoding on Apple Silicon (Apache 2.0) - GGUF MTP via llama.cpp PR #22673 (--spec-type draft-mtp) - Phased E2E test suite + auto-deployed MkDocs documentation site - 5 backend bug fixes (TurboQuant hybrid-attn, stale library scan, mmproj scoping, CLI response shape, MTPLX browser-pop) No v0.9.1 was tagged — going 0.9.0 -> 0.9.2.

Backend ships 9 diffusion cache strategies via cache_compression.registry but the frontend only let users pick 2 (fbcache, teacache). Marketing site claimed '9 strategies' — discrepancy. Triaged the 4 hidden ones against value-vs-noise: Worth UI exposure: - TaylorSeer: native diffusers 0.38 core, generic across FLUX / SD3 / Wan / Hunyuan / LTX / CogVideoX / Mochi. ~2.4x speedup. - PAB (Pyramid Attention Broadcast): native diffusers 0.38 config, ~2x speedup. Different mechanism than FBCache (attention reuse vs first-block-skip) so a real alternative not a duplicate. Kept backend-only (CLI / API): - MagCache: FLUX-only without calibration UX; footgun on other DiTs. - FasterCache: ~1.9x — same ballpark as FBCache so adds choice without adding capability. Changes: - ImageCacheStrategyId + VideoCacheStrategyId unions extended: 'none' | 'fbcache' | 'teacache' -> ... | 'taylorseer' | 'pab' - IMAGE_CACHE_STRATEGIES list grows by 2 entries with hints describing what each does. - IMAGE_CACHE_STRATEGY_DEFAULT_THRESH + VIDEO_CACHE_STRATEGY_DEFAULT_THRESH set taylorseer/pab thresholds to 0 (means 'use diffusers default skip interval' — these adapters key off cache_interval not threshold). - imageCacheStrategiesForRepo gating unchanged: UNet pipelines still only see 'Off'; FLUX gets all 5; other DiTs get all 5 minus TeaCache (no calibration tables for those pipelines). Backend already accepts any string for cacheStrategy (Pydantic field is 'str | None' and registry.get() handles the lookup), so no Python schema change needed — the new ids route straight through to the existing cache_compression.{taylorseer,pab} adapters. Tests: 214 cache/image/video tests pass; full Py + TS suites green; tsc clean.

Two follow-ups from the v0.9.2 MTPLX bench: 1. MTPLX --profile performance-cold --max The MTPLX subprocess defaults to ``sustained`` runtime profile, which thermally throttles for long-running serves. For chat where the user is staring at the textarea waiting on the first response, ``performance-cold --max`` is the right preset — full clocks, no throttling. Live re-bench on M5 with N=3 + burst still didn't beat plain mlx-lm (avg 23.9 tok/s vs 29 baseline; throughput degraded over consecutive runs from 27.4 -> 20.8 suggesting M5 thermal limits dominate regardless of profile). Keep the flag — burst is the *right* config for interactive use; the gap is hardware, not ChaosEngineAI's fault. 2. /api/setup/llama-server-status + chaosengine-cli llama-server-status New read-only endpoint probes the resolved llama-server binary: - Reports build number from ``llama-server --version`` - Greps ``llama-server --help`` for ``draft-mtp`` in --spec-type - Returns platform-aware upgrade command (brew on macOS, tarball on Linux, scoop on Windows) - Surfaces a clear ``message`` field telling the user *why* MTP GGUF won't fire on their current binary The frontend can now show an "Outdated llama-server" banner under any MTP GGUF catalog entry and link the upgrade command directly. Why a read-only probe (not a one-click installer): - Each OS has a preferred package-manager path (brew / apt / pacman / scoop / chocolatey); wrapping all of them is a footgun until we vendor our own llama.cpp build. - The release pipeline already pulls ggml-org/llama.cpp's latest GitHub release tarball at stage-runtime time. After a fresh ggml-org release tag the bundled binary catches up automatically; users on homebrew need ``brew upgrade llama.cpp`` once the bottle refreshes. - This endpoint makes the gap legible without taking responsibility for the install. Test fixture (stub_mtplx_server.py) — accept the new --profile and --max flags that MtplxEngine now passes so the integration tests keep passing. Full suite: 1418 passed, 1 skipped.

…3.6 MTP N to 3 Two follow-ups while running the FU-047 head-to-head benchmark: 1. Status probe was bypassing the env override _resolve_llama_server() in routes/setup/llama_server.py only checked /opt/homebrew/bin and PATH, ignoring CHAOSENGINE_LLAMA_SERVER. The inference engine resolver in inference/binaries.py does honour the env var (set by the Tauri shell pointing at the bundled binary, or by developers pointing at a freshly-built source build). Aligning the status probe with the engine's resolution priority — env override > homebrew > PATH — so the UI banner is honest about which binary actually runs. 2. MTP_MODEL_MAP Qwen3.6 entries: N 1 -> 3 Earlier sustained-bench at N=1 left tokens on the table; upstream PR #22673 reports ~72% acceptance at N=3 on Qwen3.6-27B. Live bench on M5 with N=3 on Q8_0 GGUF: 1.51x speedup over Q8_0 baseline (20.9 vs 13.8 tok/s). N=1 was already 1.46x; bumping to 3 nudges it up another 4%. Diminishing returns past N=3 per the PR body. Full head-to-head bench (M5, same prompt, 256 max tokens, 3 runs): MLX baseline (Youssofal MTPLX-Optimized-Speed) 29.0 tok/s MTPLX subprocess N=3 burst 24-27 (variable) GGUF Q4_K_M baseline (lmstudio Qwen3.6-27B) 18.4 tok/s GGUF Q8_0 baseline (ggml-org Qwen3.6-27B-MTP) 13.8 tok/s GGUF Q8_0 + MTP N=1 (this fix) 20.1 tok/s (+1.46x) GGUF Q8_0 + MTP N=3 (this commit) 20.9 tok/s (+1.51x) Net findings: - FU-047 (GGUF MTP) delivers the real, measurable speedup. 1.5x on the same model + quant + hardware is the headline. - MTPLX subprocess via HTTP underperforms on M5 even with depth=3 + burst profile. Subprocess overhead > MTP acceptance gain. - Plain MLX-LM at Youssofal's BF16-ish encoding still wins absolute throughput because the per-token compute is just smaller. Verified with /tmp/llama.cpp HEAD build (commit 6049906) installed to ~/.chaosengine/bin/llama-server. ggml-org/llama.cpp release b9181 (2026-05-16 17:06 UTC) is one commit past the MTP merge and is what stage-runtime.mjs will pull on the next ChaosEngineAI build, so the bundled binary in .app installs will ship with draft-mtp out of the box — no homebrew dependency for end users.

Three colleague-feedback items after MTPLX root-cause investigation: - _mtp.py: model_has_mtp_tensors() peeks GGUF header for mtp_decoder / mtp_emb / mtp_heads byte strings, or probes safetensors index for mtp_*. keys / mtp.safetensors shard. has_mtp_heads_strict(repo, path) prefers tensor probe over name aliases — catches new MTP-bearing repos we haven't enumerated and rejects name collisions that don't carry the tensors (FU-041-style false positives). - controller._select_engine + llama_cpp_engine._build_command both switch to the strict / tensor-probe path; GGUF MTP gate falls back to is_mtp_gguf_repo when no local path is available. - routes/setup/mtplx.py: /api/setup/mtplx-status now reports fanControl.{thermalforge,tgPro,anyAvailable,recommendedAction} so the Setup tab can prompt for ThermalForge install before users hit the silent-throttle ceiling on MTPLX --max burst runs. - CLAUDE.md FU-048: deferred prefer-GGUF-MTP routing preference — needs Settings UX before flipping the default since MTPLX-Optimized quants aren't GGUF-mirrored. - tests: 7 new tensor-probe cases in test_inference.py; bumped stale N=1 assertion to N=3 for Qwen3.6 MTP GGUFs (matches MTP_MODEL_MAP).

PR #22673 names MTP weights as ``blk.{N}.nextn.*`` ("Next-N prediction") and emits ``<arch>.nextn_predict_layers`` in the GGUF metadata header, neither of which matched the legacy ``mtp_decoder`` / ``mtp_emb`` / ``mtp_heads`` needles. As a result tensor probe returned False on a real MTP-GGUF model and the engine never emitted ``--spec-type draft-mtp`` (verified live: ggml-org/Qwen3.6-27B-MTP-GGUF ran at 14.5 tok/s instead of MTP-accelerated 23 tok/s). The metadata key lives in the first few KB of the file, so a 2 MB read window catches both the cheap canonical marker and the legacy patterns. Probe now returns True for ggml-org/Qwen3.6-27B-MTP-GGUF. Head-to-head live numbers (M5 Max, 27B Qwen3.6, MTP enabled both sides): - GGUF MTP Q8_0: 23.0 tok/s mean (14.5 baseline -> +58.6%) - MTPLX MTP 4-bit (Optimized-Speed): 28.95 tok/s mean Tests: rename legacy-tensor-name case + new case pinning the nextn_predict metadata marker. 15 MTP tests green.

v0.9.0 release shipped with package.json / Cargo.toml / tauri.conf.json at 0.9.0 but pyproject.toml still at 0.8.0 — users downloaded "v0.9.0" from the site and the bundled backend reported ``appVersion: 0.8.0`` because ``_resolve_app_version`` reads the staged pyproject. Nothing enforced cross-manifest sync. Pre-build gate now reads version from all four sources, fails the build when any drift apart. Mirrors the existing dflash-mlx pin assert in both ``pre-build-check.mjs`` and ``pre-build-check.sh``.

…ies + LTX series - Cache compression table gains TaylorSeer / MagCache / PAB / FasterCache rows (previously only TeaCache + FBCache appeared even though the four diffusers-0.38 strategies shipped via FU-026) - DFlash family list adds Gemma-4 (FU-031), Kimi-K2.6, MiniMax-M2.5/M2.7, Qwen3.5-122B-A10B — all in DRAFT_MODEL_MAP - Video model table splits Lightricks LTX-Video (base diffusers) from LTX-2 / LTX-2.3 (mlx-video subprocess) — both ship in the catalog - Feature-map line now reads "FBCache + TeaCache + TaylorSeer + MagCache + PAB + FasterCache" instead of TeaCache-only

…ound job) Adds a self-contained path for users with a working CUDA torch install to upgrade to a newer wheel without re-running the full 2.5 GB GPU bundle. Surfaces as a compact pill in the Image / Video Studio runtime banners when ``realGenerationAvailable`` AND the matching cu{N} pip index serves a newer wheel than the one on disk; silent otherwise. Backend - ``_install_helpers.py``: ``_extract_cuda_tag``, ``_index_url_for_cuda_tag``, ``_parse_version_triple``, ``_classify_torch_upgrade``, ``_query_latest_torch_version`` (pip index versions parser, both output shapes), ``_abi_dependents_present``, ``_move_torch_to_rollback`` + ``_restore_torch_from_rollback`` + ``_cleanup_old_torch_rollbacks``, ``_TORCH_ABI_DEPENDENT_PACKAGES`` constant. - ``routes/setup/torch_upgrade.py``: GET /api/setup/torch-upgrade-available (synchronous detection, returns ``{available, current, latest, upgradeType, rebuildPackages, indexUrl}`` or ``{available: false, reason}``) and POST /api/setup/upgrade-torch (background job mirroring install-gpu-bundle pattern). Worker moves existing torch to ``.torch-rollback-<version>/`` instead of purging, installs target from the same cu{N} index, re-pins constraint, force-reinstalls ABI deps on minor/major bumps (bitsandbytes/torchao/nunchaku/sageattention), verifies CUDA in a subprocess, restores rollback on verify failure, keeps the most recent rollback as a safety net. Frontend - ``src/api/setup.ts``: ``checkTorchUpgradeAvailable``, ``startTorchUpgrade``, ``getTorchUpgradeStatus`` + 4 exported types. Re-exported via ``src/api/index.ts``. - ``src/components/TorchUpgradePill.tsx``: one-shot probe on mount, hides when ``available: false``, three display states (available / in-progress / done-or-error), polls status at 1.5 Hz with cleanup keyed by ``job.done``, inline collapsible install log with phase-named markers. Restart Backend hook plumbed through. - ``src/styles.css``: color-coded badges per upgrade type (patch=green / minor=amber / major=red). - Wired into ``ImageStudioRuntimeBanner`` + ``VideoStudioRuntimeBanner``; renders only when ``realGenerationAvailable`` so users with broken torch are not second-guessed. Tests - 24 new tests in ``tests/test_setup_routes.py`` covering every helper (version parsing edge cases including ``2.6.0rc1`` that caught a real bug in the first cut where digits across non-digit boundaries leaked into the parsed triple), both pip output shapes, rollback move/restore round-trip with simulated half-install in extras, cleanup mtime ordering, and all 8 detection-response shapes plus the apple-silicon rejection and running-job POST cases. Drive-by: package-lock.json version field was lagging at 0.8.0 after the 0.9.0 bump; synced. Verified: 24/24 new tests pass, 80/80 ``test_setup_routes.py`` pass, 217/217 setup + backend + services + inference pass, 371/371 vitest pass, ``tsc --noEmit`` clean. Pre-existing ``test_cache_strategies`` / ``test_sdcpp_*`` / ``test_preview_thumbnails`` failures verified to exist on baseline (Windows env + optional diffusers deps), unrelated.

Run the CLI-driven E2E suite reliably on Windows by invoking the extensionless CLI through Python, writing reports as UTF-8, and treating missing video runtime prerequisites as skips. Also make the Vitest config ESM-safe for runner-mode loading and keep the Tauri lockfile version in sync.

Bundles the M4 Max test-suite session work — tracker rows, two real bugs, two user-requested features. Tracker (CLAUDE.md): - FU-049 Python 3.14 support gate (deferred; pyproject stays >=3.10) - FU-050 matrix runner: reasoning-channel capture + max-tokens 96->512 + stale endpoint/path fixes (/api/chat/generate/stream, runtime.loadedModel) - FU-051 /api/models/load echoes legacy cacheStrategy verbatim (open) - FU-052 matrix grows 9->15 cells: MTPLX MLX, GGUF MTP, 4x vLLM (CUDA-gated) - FU-053 distill variants flagged installed when only base repo on disk - FU-054 same-repo siblings: per-file size + shares-storage badge - FU-055 in-app storage explorer in Diagnostics tab Bugs fixed: - _distill_transformer_validation_error checks distillTransformerRepo + high/low-noise filenames before marking availableLocally true. Closes the FU-053 false positive on Wan2.2-I2V-A14B distill bf16/fp8. - pre-build-check.sh + .mjs pointed at the wrong turbo fork (johndpope/...planarquant); corrected to TheTom/...turboquant-kv-cache matching build-llama-turbo.sh + CLAUDE.md. Features: - Star/favourite models on Chat -> My Models. New favoriteModelRefs in settings (+ UpdateSettingsRequest field, dedup-trimmed apply, payload), ActionIconName 'star'/'starOutline' SVG, .action-favorite CSS, toggle handler in App.tsx writes via PATCH /api/settings + refreshes. Starred rows lift to top of the library list. - Diagnostics 'Disk usage - top 20 model repos' section. New GET /api/diagnostics/storage-top endpoint walks every enabled modelDirectories entry one level deep, sums via _path_size_bytes (inode-deduped so HF snapshot/blob symlinks count once). Closes the Stuff Diver gap on HF cache layouts. Live: 1213 GB total on this box. Matrix runner (scripts/cache-strategy-matrix.py): - Smoke models bumped Qwen2.5-0.5B -> Qwen3-0.6B (current gen) - New cells for MTPLX MLX, GGUF MTP (FU-047), vLLM native/turboquant/ triattention/dflash. BackendCapabilities adds mtplx_available, gguf_mtp_available, vllm_available probed from /api/health. Tests: - 4 new unit tests pin the FU-053 distill validator (no-distill, missing snapshot, partial snapshot, both files present). - test_cache_strategy_matrix_runner kwargs widened for new caps. - Full suite: 1455 passed, 1 skipped, 132 subtests passed. - npx tsc --noEmit clean; npm test 32 files / 371 tests pass.

Brings PR #57 up to date with staging (29 commits behind). Two conflicts resolved manually: - src/api/index.ts: re-exports from ./setup. Both branches added MTPLX-related exports independently; took the union alphabetically sorted (getMtplxInstallStatus, getMtplxStatus, startMtplxInstall). - src/api/setup.ts: torch-upgrade machinery was introduced via parallel commits on both branches (our dca2c12 + staging f514ea4). Auto-merge produced duplicate TorchUpgradeAvailability / TorchUpgradeType / TorchUpgradeUnavailableReason / TorchUpgradeAttempt / TorchUpgradeJobState declarations + duplicate checkTorchUpgradeAvailable / startTorchUpgrade / getTorchUpgradeStatus functions. Removed the duplicate block; kept one canonical section with types + functions in proper order. Verified post-merge: - npx tsc --noEmit: clean - npm test: 32 files / 371 tests pass - pytest tests/: 1455 passed, 1 skipped, 132 subtests passed

cryptopoly added 30 commits May 15, 2026 10:11

docs: CLAUDE.md FU-047 entry for GGUF MTP via llama.cpp #22673

c160fb3

New follow-up row tracking the GGUF half of FU-028 now that PR #22673 merged upstream. Lists action plan (6 wiring steps), upstream caveats, and links to the upstream-research write-up.

docs: v0.9.2 release notes + changelog entry

6442769

Update CLAUDE.md

d0f4b06

Update e2e_test_suite.py

7ace451

cryptopoly merged commit 687daad into staging May 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/chaos engine ai cli#57

Feature/chaos engine ai cli#57
cryptopoly merged 30 commits into
stagingfrom
feature/ChaosEngineAI-CLI

cryptopoly commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cryptopoly commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant