
Feature/mtp kvtc strategy modernization #49

Merged
cryptopoly merged 13 commits into staging from
feature/mtp-kvtc-strategy-modernization on May 11, 2026
Conversation

@cryptopoly
Owner

No description provided.

cryptopoly added 13 commits May 10, 2026 14:47
Both slots add zero value over TurboQuant, per the May 2026 landscape review.

ChaosEngine (cryptopoly/ChaosEngine, 1 commit upstream) was eclipsed by
NVIDIA's KVTC at ICLR 2026 — same PCA + adaptive quantization approach but
8–32x compression vs ChaosEngine's 3.7x, peer-reviewed, with a healthy
upstream. KVTC slot lands separately in FU-029.

RotorQuant shipped as a misleading alias for TurboQuant: same
``--cache-type-k turbo{N}`` flags, same ``turboquant`` Python module marker.
The real scrya-com RotorQuant uses Clifford Cl(3,0) rotors with its own
kernel path, which we never wired up.

Persisted user configs that still reference these ids coerce silently to
``turboquant`` via a new ``CacheStrategyRegistry.resolve_legacy_id`` helper +
module-level ``_LEGACY_STRATEGY_ALIASES`` map. Frontend mirrors the
coercion via ``LEGACY_STRATEGY_ALIASES`` + ``canonicalStrategyId`` in
runtimeSupport.ts so chip filters and incompat-reason banners work for
older session snapshots.
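
A minimal sketch of the coercion path (registry internals here are
illustrative; only the two alias keys and the two helper names come
from this change):

    _LEGACY_STRATEGY_ALIASES = {
        "chaosengine": "turboquant",
        "rotorquant": "turboquant",
    }

    class CacheStrategyRegistry:
        def __init__(self, strategies: dict):
            self._strategies = strategies  # canonical id -> strategy object

        def resolve_legacy_id(self, strategy_id: str) -> str:
            # Unknown ids pass through untouched; only retired ids coerce.
            return _LEGACY_STRATEGY_ALIASES.get(strategy_id, strategy_id)

        def get(self, strategy_id: str):
            # ``get`` accepts legacy ids too, so persisted configs resolve.
            return self._strategies[self.resolve_legacy_id(strategy_id)]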

The llama.cpp fallback chain shrank from 3-level (requested → ChaosEngine
→ native) to 2-level (requested → native) — the ChaosEngine intermediate
only ever emitted standard q-type cache flags that native already covers.

Vendored ChaosEngine bundling ripped from scripts/stage-runtime.mjs (3
helper functions removed: stageVendoredChaosEngine, ensureSetuptoolsForPep639,
resolveChaosEngineVendor). Pre-build probe now asserts the legacy-id
coercion works in CI rather than at runtime. ``[rotorquant]`` extra removed
from pyproject.toml. ``CHAOSENGINE_VENDOR_PATH`` env var dropped.

Test coverage: 1293 pytest pass, 341 vitest pass, tsc --noEmit clean.
Migration test added at tests/test_cache_strategies.py asserts both legacy
ids coerce + resolve to the TurboQuant strategy via registry.get(). New
fixture entry in tests/inference-batch-strategies.json exercises the
coercion end-to-end through the inference test runner.
Two complementary changes:

1. ``scripts/cache-strategy-matrix.py`` sweeps every supported (cache
   strategy × spec-dec method × representative model) combination through
   a running backend on port 8876 and writes a CSV + Markdown report to
   ``~/.chaosengine/test-results/``. Replaces the ad-hoc per-strategy
   smoke scripts with a single end-to-end harness, and **asserts the
   FU-030 legacy alias coercion** at runtime — runs with
   ``cacheStrategy=chaosengine`` and ``cacheStrategy=rotorquant`` must
   come back loaded as ``turboquant``, exit code 2 on regression. Skips
   cells where the strategy isn't installed, the turbo binary is missing,
   the model isn't in the local library, or the spec-dec method isn't
   supported on the chosen backend, so a fresh CI box reports honest
   skip reasons rather than failing (the skip logic is sketched after
   this list).

   Includes 20 unit tests covering the pure functions (``skip_reason``,
   ``write_csv``, ``write_markdown``, ``print_summary``, matrix
   definition checks) without standing up a backend.

2. FU-028 (MTP) and FU-029 (KVTC) tracker entries flipped from "in
   progress" to "deferred — upstream blockers" with the actual
   blockers documented:

   - **FU-028 MTP:** mlx-lm 0.31.3 has ``stream_generate(..., draft_model=...)``
     for separately-trained drafts but no native MTP-head loader (the
     Gemma-4 / Qwen3.5 MTP drafters share activations + KV cache with
     the target and cannot be loaded as a standalone ``mlx.nn.Module``).
     Verified by inspecting the installed package source. llama.cpp PR
     #22673 still in Draft. MTPLX (third-party) is HTTP-only.
     Re-evaluate when (a) mlx-lm gains native MTP-head loading, OR
     (b) llama.cpp #22673 merges, OR (c) MTPLX exposes a Python
     in-process API.

   - **FU-029 KVTC:** OnlyTerp/kvtc is CUDA-only (MLX/Metal "planned"
     but not implemented), not on PyPI (distributed as a ``src.*``
     repo), and integrates as a HuggingFace ``DynamicCache`` wrapper
     rather than a llama.cpp cache type. Apple Silicon dev box can't
     validate end-to-end. Re-evaluate when upstream ships MLX support
     or a CUDA dev box becomes available.
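
For reference, the shape of the skip logic, sketched with hypothetical
field names (the real harness derives these from the live backend):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Cell:
        # Hypothetical cell shape; real matrix rows carry more fields.
        strategy_installed: bool
        needs_turbo_binary: bool
        turbo_binary_present: bool
        model_in_library: bool
        spec_dec_supported: bool

    def skip_reason(cell: Cell) -> Optional[str]:
        """Honest skip reason for a cell, or None when it should run."""
        if not cell.strategy_installed:
            return "strategy not installed"
        if cell.needs_turbo_binary and not cell.turbo_binary_present:
            return "turbo binary missing"
        if not cell.model_in_library:
            return "model not in local library"
        if not cell.spec_dec_supported:
            return "spec-dec method unsupported on this backend"
        return None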

The honest "deferred + reasoned" tracker entries are themselves the
right output here per the project guidelines — the alternative was
landing a half-wired CUDA-only KVTC slot or an HTTP-chained MTPLX
adapter, both of which would have shipped surface area without
delivering actual quality/performance to the user.

Test totals: 1313 pytest pass (+20 new), 341 vitest pass, tsc clean.
The probe was still running ``registry.get('chaosengine').llama_cpp_cache_flags(bits)``
and asserting the emitted cache types were standard llama-server types.
After FU-030 the legacy id coerces to TurboQuant, which emits
``turbo2/turbo3/turbo4`` — those are the *correct* types for the turbo
binary but the probe rejected them as INVALID.

Replaced with: native validates standard cache types, TurboQuant must
declare the turbo binary, and both legacy ids (chaosengine + rotorquant)
must coerce to turboquant via ``registry.resolve_legacy_id`` and resolve
via ``registry.get``. Mirrors the assertion already in
``scripts/pre-build-check.mjs`` so both runners agree.
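
In sketch form, the probe's new invariant (harness shape assumed; the
``.id`` attribute is hypothetical):

    def check_legacy_coercion(registry) -> None:
        for legacy_id in ("chaosengine", "rotorquant"):
            assert registry.resolve_legacy_id(legacy_id) == "turboquant"
            strategy = registry.get(legacy_id)  # must resolve, not raise
            assert strategy.id == "turboquant"  # hypothetical .id attribute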

All 7 pre-build-check.sh gates green.
…n-sync probe

Five interlocking maintenance items found while auditing the upstream
landscape for the four repos: z-lab/dflash, bstnxbt/dflash-mlx,
youssofal/MTPLX, TheTom/turboquant_plus.

1. **dflash-mlx pin bumped from 8d8545d (v0.1.5.1) to fada1eb (HEAD).**
   11 upstream commits cover the new Gemma4 DFlash backend (commit
   05cc456 — biggest payload), v0.1.5 serving surface, live server
   metrics endpoint, prefix-cache survival test gate, async L2 writer
   fix, long-context runtime diagnostics hardening, benchmark slugging
   fixes, and a license switch to Apache-2.0. No breaking API changes
   per the upstream commit log.

2. **stage-runtime.mjs pin synced to match pyproject.toml.** Caught a
   real bug: pyproject.toml was at 8d8545d (v0.1.5.1) but
   scripts/stage-runtime.mjs was lagging on f825ffb (v0.1.4.1) — dev
   .venv ran new, but ``npm run stage:runtime`` was bundling the OLD
   binary into release builds. Both files now share fada1eb.

3. **DRAFT_MODEL_MAP extended for new z-lab drafters.** Added entries
   for google/gemma-4-31B-it, google/gemma-4-26B-A4B-it,
   Qwen/Qwen3.5-122B-A10B, MiniMaxAI/MiniMax-M2.5,
   MiniMaxAI/MiniMax-M2.7, and moonshotai/Kimi-K2.6, plus the
   mlx-community/* aliases for each so Apple Silicon quants resolve via
   the existing fuzzy-match path. 7 new unit tests in test_dflash.py
   pin the mappings.

4. **TriAttention git+url pinned to commit c3744ee.** The
   ``[triattention]`` and ``[triattention-mlx]`` extras were pulling
   ``git+...git`` HEAD with no commit pin, making fresh installs
   non-reproducible whenever upstream landed unreleased work between
   our staging snapshots. Pin matches the v0.2.0 release surface plus
   the AMD GPU port.

5. **FU-033 pin-sync probe shipped in pre-build-check.{mjs,sh}.**
   Regex-extracts the dflash-mlx commit hash from both files and fails
   the build when they diverge. Same commit also drops the orphan
   vendor/ChaosEngine staleness check from both runners (FU-030
   removed the vendored package; the probe would never resolve again).
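
The shipped probe is JS/shell; the same check sketched in Python (the
regex is an assumption about how each file spells the pin):

    import re
    import sys
    from pathlib import Path

    PIN_FILES = ("pyproject.toml", "scripts/stage-runtime.mjs")
    # Assumed spelling: a dflash-mlx reference followed by a commit hash.
    PIN_RE = re.compile(r"dflash-mlx.*?([0-9a-f]{7,40})")

    pins = {}
    for path in PIN_FILES:
        match = PIN_RE.search(Path(path).read_text())
        if match is None:
            sys.exit(f"pin-sync: no dflash-mlx pin found in {path}")
        pins[path] = match.group(1)

    if len(set(pins.values())) != 1:
        sys.exit(f"pin-sync: dflash-mlx pins diverge: {pins}")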

CLAUDE.md tracker updates: FU-006 entry rewritten to document the
fada1eb bump, three new entries (FU-031 dflash drafter expansion +
TriAttention pin; FU-032 turboquant_plus watch-closely; FU-033 pin-sync
probe shipped).

Test totals: 1321 pytest pass (+8 from previous 1313 — 7 new dflash
+ 1 housekeeping), 341 vitest pass, tsc clean, pre-build-check 8/8
gates green.
…all hash

Three related cleanups in src/components/RuntimeControls.tsx.

1. **Cache-strategy cards now hide when engine-incompatible or when the
   turbo binary is missing on GGUF.** Previously every strategy
   rendered for every model + engine combo with a greyed-out N/A
   badge. That taught users the wrong thing — a disabled card with no
   install button suggests something they could fix, when the only
   fix lived outside the app (engine mismatch is fundamental;
   ``llama-server-turbo`` build is a terminal-side script). The
   "package not installed but installable" case stays visible because
   the install button gets the user to ready in one click. ``native``
   always survives (the visibility rule is sketched after this list).

2. **DFlash speculative-decoding toggle now hides when the selected
   model has no draft in DRAFT_MODEL_MAP, or when the engine is GGUF.**
   Same principle — both cases give the user no in-app path to
   recover, so a disabled checkbox with an "N/A" badge added confusion
   without value. ``canInstallDflashForModel`` keeps the install
   affordance visible whenever the gap is the missing pip package
   (one-click install path) and the model would be supported.

3. **Hardcoded ``f825ffb`` install hint string fixed.** The DFlash
   help panel still printed the v0.1.4.1 commit hash even after the
   FU-006 / FU-033 bumps to ``fada1eb`` (v0.1.5.1). Same drift bug
   FU-033 caught between pyproject.toml + stage-runtime.mjs; now all
   three carry the same hash. Comment added so a future bump touches
   all three.
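
The card rule from item 1, expressed as a pure predicate in Python for
illustration (the shipped logic is TypeScript in RuntimeControls.tsx /
kvStrategyFilter.ts; parameter names are hypothetical):

    def should_show_strategy_card(
        strategy_id: str,
        engine_compatible: bool,
        needs_turbo_binary: bool,
        turbo_binary_present: bool,
        package_installed: bool,
        package_installable: bool,
    ) -> bool:
        if strategy_id == "native":
            return True  # native always survives
        if not engine_compatible:
            return False  # fundamental mismatch, no in-app fix
        if needs_turbo_binary and not turbo_binary_present:
            return False  # fix lives in a terminal-side build script
        # "Not installed but installable" stays visible: one-click path.
        return package_installed or package_installable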

Popover-side filter (src/components/kvStrategyFilter.ts) already
followed the hide rule, so the modal now matches. CLAUDE.md tracker
gains FU-034 entry documenting the change + the design rule for
future strategy slots.

Test totals: 1321 pytest pass, 341 vitest pass, tsc clean.
…gle row

Two visual fixes for the per-turn telemetry chips below assistant
messages.

1. **Runtime note tone now reflects actual fault state.** The
   "Using python with MLX 0.31.x and mlx-lm 0.31.y." chip used to
   render in the orange ``substrate-chip--warn`` style because the
   note slot was hardcoded to ``tone: "warn"``. That same slot also
   carries real warnings ("DFLASH unavailable", "Cache strategy
   failed. Fell back to native f16 cache.") — when every turn shows
   the orange chip, operators stop noticing it on the rare turns
   that actually flag a problem.

   New ``runtimeNoteIsWarning`` helper in SubstrateRoutingBadge.tsx
   scans for actionable tokens (``unavailable``, ``fell back``,
   ``failed``, ``error``, ``warning``, ``cannot``, etc.) and only
   then promotes the chip to the warn tone. The benign version
   banner now uses the default muted tone, matching the "MLX" /
   "Native f16" chips next to it (the token scan is sketched after
   this list).

2. **SubstrateRoutingBadge + ChatPerfStrip now share a single
   wrap-row.** Previously rendered as two sibling ``<div>`` strips,
   so the engine/cache/note chips broke onto a separate line from
   the perf chips (tok/s, CPU%, mem-free, thermal). New
   ``.message-runtime-strip`` wrapper in ChatThread.tsx is the
   outer flex container; the two inner strips switch to
   ``display: contents`` so their chips become direct flex children
   of the wrapper and flow as one continuous row, wrapping only
   when the viewport actually requires it.
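
The tone check from item 1, sketched in Python (the real helper is
TypeScript in SubstrateRoutingBadge.tsx; the token list is abridged):

    _ACTIONABLE_TOKENS = (
        "unavailable", "fell back", "failed", "error", "warning", "cannot",
    )

    def runtime_note_is_warning(note: str) -> bool:
        # Benign version banners contain none of these, so they stay muted.
        lowered = note.lower()
        return any(token in lowered for token in _ACTIONABLE_TOKENS)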

Test coverage: 10 new vitest cases in SubstrateRoutingBadge.test.ts
pin the tone-detect logic for both benign and faulty notes.

Test totals: 1321 pytest pass, 351 vitest pass (+10), tsc clean.
Previous commit (FU-035) accidentally captured a local dev flip of
``createUpdaterArtifacts: true → false`` in src-tauri/tauri.conf.json.
That flag belongs at ``true`` for release builds (without it the
auto-update channel never publishes new artifacts). Restore the
release-correct value; the FU-035 chip changes remain intact.
…erks

Two fixes for the HTML Challenge model card stream view.

1. **Stream box now fills the available model-frame height.**
   ``.html-challenge-stream`` was ``flex: 0 0 auto`` with a fixed
   ``height: clamp(280px, 38vh, 520px)``, which left a tall band of
   empty space below the streaming code while a model was generating.
   Switched to ``flex: 1 1 auto; min-height: 280px`` so the stream
   consumes the same vertical space the rendered iframe would use
   when the run completes. ``min-height`` keeps it usable on short
   viewports.

2. **Scroll-up no longer fights the user.** Two related races:
   - ``handleStreamScroll`` re-flipped ``streamAtBottom`` to true
     after every ``element.scrollTop = …`` write because the browser
     fires ``scroll`` for both user wheel input and programmatic
     writes. New ``lastProgrammaticScrollRef`` records the timestamp
     of each programmatic scroll and the handler ignores scroll
     events fired within 80ms of one — so user wheel events register
     as "stop tracking" instead of being overwritten by the post-write
     event.
   - The streaming chunk auto-scroll ``useEffect`` read
     ``streamAtBottom`` from the React closure, which lagged behind
     the user's wheel by one render. The effect now re-measures
     scroll position inside the rAF and bails (clearing tracking
     for that slot) if the user has moved away in the gap, instead
     of yanking the box back to bottom.

Net effect: scrolling up during streaming holds position, the box
takes the full panel height, and only the explicit "scroll to
bottom" button or scrolling within 32px of the tail re-engages
auto-tracking.
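
The suppression logic, sketched in Python for illustration (the real
code is TypeScript refs + rAF; the class shape here is hypothetical):

    import time

    SUPPRESS_WINDOW_S = 0.08  # ignore scroll events within 80ms of our write
    REENGAGE_PX = 32          # within 32px of the tail re-engages tracking

    class StreamScrollTracker:
        def __init__(self) -> None:
            self.at_bottom = True
            self._last_programmatic = float("-inf")

        def note_programmatic_scroll(self) -> None:
            self._last_programmatic = time.monotonic()

        def on_scroll(self, distance_from_bottom: float) -> None:
            if time.monotonic() - self._last_programmatic < SUPPRESS_WINDOW_S:
                return  # echo of our own scrollTop write, not user input
            self.at_bottom = distance_from_bottom <= REENGAGE_PX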

Test totals: 1321 pytest pass, 351 vitest pass, tsc clean.
Triggered by a real crash report: a tool-call in the Chat tab against
Qwen3-Coder-Next blanked the entire packaged macOS app. Webview reload
returned the user to the Dashboard, and any subsequent Chat
navigation crashed again with no diagnostic surface to read.

Two related root causes, fixed together.

1. **No React error boundary anywhere in the tree.** A single
   uncaught render error in one tab tore down the whole ``<main>``
   content frame. New ``src/components/ErrorBoundary.tsx`` uses
   ``getDerivedStateFromError`` + ``componentDidCatch`` to capture
   the error and render an inline fallback with:
   - the error name + message
   - the JS stack and component stack inside a collapsible
   - a "Try again" button that resets local state for transient
     errors (e.g. stale streaming buffer from an aborted tool call)
   - a "Copy details" button that writes a self-contained bug
     report to the clipboard (timestamp, UA, error, both stacks)
   The boundary wraps ``{content}`` in App.tsx keyed by
   ``activeTab`` so switching tabs unmounts the boundary entirely,
   giving the user a clean navigation-based recovery path even
   when "Try again" hits the same error.

2. **Release builds had no way to open devtools.** Tauri's
   ``devtools`` Cargo feature was in ``declared_features`` but not
   in the active ``features`` array on the ``tauri`` crate, so the
   WebKit inspector was compiled out in release. Without it, the
   only path to a JS stack was rebuilding the app via
   ``cargo tauri dev`` — useless to a user staring at a blank
   screen. Flipping the feature on adds the right-click → Inspect
   Element entry to release builds.

Surrounding work:
- CSS for ``.error-boundary`` lives next to the existing notice
  banners in src/styles.css; same colour vocabulary as
  ``.error-banner``.
- Unit tests in src/components/__tests__/ErrorBoundary.test.ts pin
  the ``getDerivedStateFromError`` contract so the boundary
  cannot silently stop catching errors.
- CLAUDE.md tracker entry (FU-037) records the root cause + fix
  for future regressions.

Test totals: 1321 pytest pass, 353 vitest pass (+2), tsc clean.
…ash alias

Three bugs surfaced by a live /api/diagnostics/snapshot payload taken
during a Qwen3-Coder-Next + Tools repro.

1. ``_free_bytes`` ImportError in diagnostics snapshot.
   backend_service/routes/diagnostics.py imported ``_free_bytes`` from
   backend_service.routes.setup, but the setup package's __init__.py
   never re-exported it from gpu_bundle.py — every snapshot reported
   ``ImportError: cannot import name '_free_bytes'`` in the ``extras``
   section. Added the re-export.

2. MallocStackLogging spam drowning the backend log.
   macOS hardened-runtime (we ship bundle.macOS.hardenedRuntime: true)
   leaked an env var into every Python subprocess, producing three
   lines of ``Python(PID) MallocStackLogging: can't turn off malloc
   stack logging because it was not enabled.`` at each spawn. With
   the metrics polling loop firing at 1 Hz, that's hundreds of lines per minute,
   drowning out the INFO / ERROR lines the Diagnostics tab is meant
   to surface. Two-pronged fix:
   - src-tauri/src/backend.rs: ``command.env_remove`` the three
     MallocStackLogging / MallocScribble vars before spawning the
     backend so NEW builds never produce the spam.
   - backend_service/routes/diagnostics.py: regex filter
     ``_LOG_NOISE_PATTERNS`` + ``_filter_log_noise`` strips the spam
     from /api/diagnostics/log-tail and the snapshot's logs section
     so OLDER builds get a clean diagnostic surface immediately
     without rebuilding. The filter reads 4x the requested window so
     200 useful lines survive even when the raw log is 50% spam
     (sketched after this list).

3. DFlash unavailable for ``mlx-community/Qwen3.6-27B-4bit``.
   Qwen3-Coder-Next was rebranded ``Qwen3.6-27B`` upstream; the
   lmstudio-community MLX conversion's HF metadata reports
   ``mlx-community/Qwen3.6-27B-4bit`` as the canonical repo and
   model_resolution.resolve_dflash_target_ref prefers canonical
   over the lmstudio alias. DRAFT_MODEL_MAP had no entry → DFlash
   silently unavailable per snapshot ("DFLASH unavailable for
   'mlx-community/Qwen3.6-27B-4bit': no compatible draft model
   is registered."). Aliased the three quant variants (4bit /
   bf16 / 8bit) back to Qwen/Qwen3-Coder-Next so the existing
   z-lab/Qwen3-Coder-Next-DFlash drafter resolves. New unit test
   pins the mapping.
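
The filter from item 2 in sketch form (helper names from this commit;
the pattern list is abridged and the window math is as described):

    import re

    _LOG_NOISE_PATTERNS = [
        re.compile(r"MallocStackLogging: can't turn off malloc stack logging"),
    ]

    def _filter_log_noise(raw_lines: list[str], wanted: int) -> list[str]:
        # Read 4x the requested window so `wanted` useful lines survive
        # even when the raw log is mostly spam; keep the newest survivors.
        window = raw_lines[-(wanted * 4):]
        kept = [
            line for line in window
            if not any(p.search(line) for p in _LOG_NOISE_PATTERNS)
        ]
        return kept[-wanted:]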

CLAUDE.md tracker gains FU-038 entry recording all three.

Test totals: 1321 pytest pass (+3 new dflash cases), 353 vitest
pass, tsc clean.
The first real bug caught by the FU-037 ErrorBoundary. User repro:
Qwen3-Coder-Next, Tools ON, prompt 'What is 17 * 23 plus the square
root of 144?'. ErrorBoundary fallback rendered:

  TypeError: Object.entries requires that input parameter not be
  null or undefined

Pinned _Y in the minified bundle to src/components/ToolCallCard.tsx
(line 116). Backend trace: Coder-Next emitted ``{"arguments": null}``
for a tool call that needed no parameters, and
``backend_service/agent.py::_execute_tool_call`` evaluated
``isinstance(None, str) -> False`` then set ``arguments = None``.
The None serialised into the persisted session, so every
subsequent render of the affected turn re-crashed the Chat tab —
the user could not even reach earlier history.

Two-layer fix.

1. Backend (root cause). ``_execute_tool_call`` coerces every
   non-dict shape (``None``, empty string, raw list, etc.) to ``{}``
   at the source. The ``arguments is always a dict`` contract now
   holds for every downstream consumer (frontend card, persisted
   session, OpenAI-compat passthrough). Four new unit tests in
   tests/test_agent.py pin the null / empty / missing-key / dict
   shapes (the coercion is sketched after this list).

2. Frontend (legacy data + belt-and-braces). ToolCallCard
   defensively wraps arguments in ``Record<string, unknown>`` with a
   default of ``{}``, and renders ``(no arguments)`` when the entries
   list is empty. Older persisted sessions that contain ``null``
   arguments from before the backend fix stop crashing without
   requiring a manual localStorage wipe.
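
The backend contract from layer 1 in sketch form (the string-parsing
branch is an assumption based on the old ``isinstance`` check):

    import json

    def _coerce_tool_arguments(raw) -> dict:
        """Every non-dict shape (None, "", raw list, ...) becomes {}."""
        if isinstance(raw, str) and raw.strip():
            try:
                raw = json.loads(raw)
            except ValueError:
                return {}
        return raw if isinstance(raw, dict) else {}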

CLAUDE.md tracker gains FU-039 entry documenting the root cause +
both layers.

Test totals: 1325 pytest pass (+4 new agent cases), 353 vitest pass,
tsc clean.
…vision tag

Three fixes surfaced by a Coder-Next chat session.

1. Tool-call parser widened to handle three real-world shapes.
   The old regex required a closing ``</tool_call>`` tag and only
   matched JSON objects. Coder-Next emitted three shapes in a
   single session:

   - canonical: ``<tool_call>{"name": ...}</tool_call>``
   - open-only: ``<tool_call>{"name": ...}`` with no close tag
   - array-shaped: ``<tool_call>[{"url": ...}]`` (hallucinated
     pseudo-results inside a call tag)

   The new parser uses ``json.JSONDecoder.raw_decode`` on each
   ``<tool_call>`` opener so it consumes exactly the next valid
   JSON value regardless of close tag, dispatches objects with a
   ``name``, drops list payloads silently (no dispatchable
   ``name``), and continues scanning so a later well-formed
   call still lands (the parser loop is sketched after this list).
   Cases (2) and (3) used to silently render the raw XML in the
   assistant bubble with no execution.

2. ``_strip_tool_call_xml`` helper removes the JSON region the
   parser consumed from ``result.text`` before the streaming
   layer hands it to the chat bubble. Without this, every
   parsed call appeared twice on screen — once as raw XML
   noise, once as the rendered ``ToolCallCard``. Applied in
   both ``run_agent_loop`` and ``run_agent_loop_streaming``.
   Excess blank lines collapsed so a mid-paragraph strip
   doesn't leave a visible gap.

3. Qwen3.6-27B + Qwen3.5 vision tag cleanup. Dense Qwen3.6-27B
   (Coder-Next branding), Qwen3.6-27B-FP8, mlx-community
   /Qwen3.6-27B-4bit, and the family-level Qwen3.6 + Qwen3.5
   entries all carried ``"vision"`` in their capabilities — a
   copy-paste bug from when the catalog was scaffolded. Vision
   lives on a separate ``Qwen3.6-27B-VL`` variant we do not
   yet ship; the stale tag was promoting
   ``supportsVision: true`` for every community quant variant,
   making ``ChatComposer`` render the "Attach image" affordance
   for a text-only model. Dropped from all five entries.
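
The parser loop from item 1 in sketch form (function shape assumed;
``raw_decode`` is the technique described above):

    import json

    _OPEN_TAG = "<tool_call>"

    def extract_tool_calls(text: str) -> list[dict]:
        calls, decoder, pos = [], json.JSONDecoder(), 0
        while (start := text.find(_OPEN_TAG, pos)) != -1:
            cursor = start + len(_OPEN_TAG)
            while cursor < len(text) and text[cursor].isspace():
                cursor += 1  # tolerate whitespace after the opener
            try:
                # Consume exactly the next valid JSON value; a closing
                # </tool_call> tag is no longer required.
                value, end = decoder.raw_decode(text, cursor)
            except ValueError:
                pos = cursor  # malformed payload: keep scanning
                continue
            if isinstance(value, dict) and "name" in value:
                calls.append(value)
            # List payloads (no dispatchable "name") drop silently;
            # scanning continues so a later well-formed call still lands.
            pos = end
        return calls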

Test coverage: 13 new agent-parser + strip tests; total 1339
pytest pass (+14), 353 vitest pass, tsc --noEmit clean. CLAUDE.md
tracker entry FU-040 records all three.
…wen3.6-27B aliases

User-spotted mismatch: their local install at
``/Users/dan/AI_Models/lmstudio-community/Qwen3-Coder-Next-MLX-4bit``
was surfacing as canonical repo ``mlx-community/Qwen3.6-27B-4bit``
in the diagnostics snapshot, picking up the wrong catalog row and
the wrong DFlash drafter. Confirmed via on-disk config.json that
the model is Qwen3-Next (architectures ``Qwen3NextForCausalLM``,
``model_type: "qwen3_next"``, sparse MoE with 512 experts,
hidden_size 2048, ~3B active per token) — fundamentally different
from the dense Qwen3.6-27B (``qwen3`` arch, hidden_size 5120, no
MoE).

Root cause: the catalog had no variant for the lmstudio-community
MLX 4-bit conversion of Coder-Next, so the fuzzy matcher in
src/utils/library.ts::libraryVariantMatchScore settled for the
closest "MLX + 4-bit + Qwen3" entry, which happened to be the
unrelated ``mlx-community/Qwen3.6-27B-4bit`` row.

Three changes.

1. Added an explicit ``lmstudio-community/Qwen3-Coder-Next-MLX-4bit``
   variant to the ``qwen3-coder-next`` family in
   backend_service/catalog/text_models.py. Correct params: 80B
   sparse / ~45 GB on disk / qwen3_next family capabilities
   (coding / agents / tool-use / reasoning / thinking). The matcher
   now scores 80+ on an exact repo-path substring hit instead of
   the previous fuzzy fallback.

2. Reverted the FU-038 DFlash aliases that wrongly pointed
   ``mlx-community/Qwen3.6-27B-4bit / bf16 / 8bit`` at
   ``Qwen/Qwen3-Coder-Next``. Those quants are the dense 27B
   Coder (text-only, ``qwen3`` arch) and have no drafter today;
   leaving them aliased to the Qwen3-Next MoE drafter would route
   DFlash to the wrong architecture and either crash at load or
   degrade silently.

3. Replaced them with the correct
   ``lmstudio-community/Qwen3-Coder-Next-MLX-4bit`` alias plus an
   ``-Instruct`` sibling.

New regression tests in tests/test_dflash.py pin (a) the new
alias resolves to ``z-lab/Qwen3-Coder-Next-DFlash`` and (b) the
dense 27B-4bit MUST NOT alias to the MoE drafter.

Test totals: 1340 pytest pass, 353 vitest pass, tsc clean.
CLAUDE.md tracker entry FU-041 records the root cause + fix.
@cryptopoly cryptopoly merged commit 3a5125d into staging May 11, 2026