
Feature/mtp kvtc strategy modernization #49

Merged
cryptopoly merged 13 commits into staging from
feature/mtp-kvtc-strategy-modernization on May 11, 2026
Conversation

@cryptopoly
Owner

No description provided.

cryptopoly added 13 commits May 10, 2026 14:47
Both slots add zero value over TurboQuant, per the May 2026 landscape review.

ChaosEngine (cryptopoly/ChaosEngine, 1 commit upstream) was eclipsed by
NVIDIA's KVTC at ICLR 2026 — same PCA + adaptive quantization approach but
8–32x compression vs ChaosEngine's 3.7x, peer-reviewed, with a healthy
upstream. KVTC slot lands separately in FU-029.

RotorQuant shipped as a misleading alias for TurboQuant: same
``--cache-type-k turbo{N}`` flags, same ``turboquant`` Python module marker.
The real scrya-com RotorQuant uses Clifford Cl(3,0) rotors with its own
kernel path, which we never wired up.

Persisted user configs that still reference these ids coerce silently to
``turboquant`` via a new ``CacheStrategyRegistry.resolve_legacy_id`` helper +
module-level ``_LEGACY_STRATEGY_ALIASES`` map. Frontend mirrors the
coercion via ``LEGACY_STRATEGY_ALIASES`` + ``canonicalStrategyId`` in
runtimeSupport.ts so chip filters and incompat-reason banners work for
older session snapshots.
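
A minimal sketch of the coercion path (registry internals here are
illustrative; only the two alias keys and the two helper names come
from this change):

    _LEGACY_STRATEGY_ALIASES = {
        "chaosengine": "turboquant",
        "rotorquant": "turboquant",
    }

    class CacheStrategyRegistry:
        def __init__(self, strategies: dict):
            self._strategies = strategies  # canonical id -> strategy object

        def resolve_legacy_id(self, strategy_id: str) -> str:
            # Unknown ids pass through untouched; only retired ids coerce.
            return _LEGACY_STRATEGY_ALIASES.get(strategy_id, strategy_id)

        def get(self, strategy_id: str):
            # ``get`` accepts legacy ids too, so persisted configs resolve.
            return self._strategies[self.resolve_legacy_id(strategy_id)]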

The llama.cpp fallback chain shrank from 3-level (requested → ChaosEngine
→ native) to 2-level (requested → native) — the ChaosEngine intermediate
only ever emitted standard q-type cache flags that native already covers.

Vendored ChaosEngine bundling ripped from scripts/stage-runtime.mjs (3
helper functions removed: stageVendoredChaosEngine, ensureSetuptoolsForPep639,
resolveChaosEngineVendor). Pre-build probe now asserts the legacy-id
coercion works in CI rather than at runtime. ``[rotorquant]`` extra removed
from pyproject.toml. ``CHAOSENGINE_VENDOR_PATH`` env var dropped.

Test coverage: 1293 pytest pass, 341 vitest pass, tsc --noEmit clean.
Migration test added at tests/test_cache_strategies.py asserts both legacy
ids coerce + resolve to the TurboQuant strategy via registry.get(). New
fixture entry in tests/inference-batch-strategies.json exercises the
coercion end-to-end through the inference test runner.
Two complementary changes:

1. ``scripts/cache-strategy-matrix.py`` sweeps every supported (cache
   strategy × spec-dec method × representative model) combination through
   a running backend on port 8876 and writes a CSV + Markdown report to
   ``~/.chaosengine/test-results/``. Replaces the ad-hoc per-strategy
   smoke scripts with a single end-to-end harness, and **asserts the
   FU-030 legacy alias coercion** at runtime — runs with
   ``cacheStrategy=chaosengine`` and ``cacheStrategy=rotorquant`` must
   come back loaded as ``turboquant``, exit code 2 on regression. Skips
   cells where the strategy isn't installed, the turbo binary is missing,
   the model isn't in the local library, or the spec-dec method isn't
   supported on the chosen backend, so a fresh CI box reports honest
   skip reasons rather than failing (the skip logic is sketched after
   this list).

   Includes 20 unit tests covering the pure functions (``skip_reason``,
   ``write_csv``, ``write_markdown``, ``print_summary``, matrix
   definition checks) without standing up a backend.

2. FU-028 (MTP) and FU-029 (KVTC) tracker entries flipped from "in
   progress" to "deferred — upstream blockers" with the actual
   blockers documented:

   - **FU-028 MTP:** mlx-lm 0.31.3 has ``stream_generate(..., draft_model=...)``
     for separately-trained drafts but no native MTP-head loader (the
     Gemma-4 / Qwen3.5 MTP drafters share activations + KV cache with
     the target and cannot be loaded as a standalone ``mlx.nn.Module``).
     Verified by inspecting the installed package source. llama.cpp PR
     #22673 still in Draft. MTPLX (third-party) is HTTP-only.
     Re-evaluate when (a) mlx-lm gains native MTP-head loading, OR
     (b) llama.cpp #22673 merges, OR (c) MTPLX exposes a Python
     in-process API.

   - **FU-029 KVTC:** OnlyTerp/kvtc is CUDA-only (MLX/Metal "planned"
     but not implemented), not on PyPI (distributed as a ``src.*``
     repo), and integrates as a HuggingFace ``DynamicCache`` wrapper
     rather than a llama.cpp cache type. Apple Silicon dev box can't
     validate end-to-end. Re-evaluate when upstream ships MLX support
     or a CUDA dev box becomes available.
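
For reference, the shape of the skip logic, sketched with hypothetical
field names (the real harness derives these from the live backend):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Cell:
        # Hypothetical cell shape; real matrix rows carry more fields.
        strategy_installed: bool
        needs_turbo_binary: bool
        turbo_binary_present: bool
        model_in_library: bool
        spec_dec_supported: bool

    def skip_reason(cell: Cell) -> Optional[str]:
        """Honest skip reason for a cell, or None when it should run."""
        if not cell.strategy_installed:
            return "strategy not installed"
        if cell.needs_turbo_binary and not cell.turbo_binary_present:
            return "turbo binary missing"
        if not cell.model_in_library:
            return "model not in local library"
        if not cell.spec_dec_supported:
            return "spec-dec method unsupported on this backend"
        return None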

The honest "deferred + reasoned" tracker entries are themselves the
right output here per the project guidelines — the alternative was
landing a half-wired CUDA-only KVTC slot or an HTTP-chained MTPLX
adapter, both of which would have shipped surface area without
delivering actual quality/performance to the user.

Test totals: 1313 pytest pass (+20 new), 341 vitest pass, tsc clean.
The probe was still running ``registry.get('chaosengine').llama_cpp_cache_flags(bits)``
and asserting the emitted cache types were standard llama-server types.
After FU-030 the legacy id coerces to TurboQuant, which emits
``turbo2/turbo3/turbo4`` — those are the *correct* types for the turbo
binary but the probe rejected them as INVALID.

Replaced with: native validates standard cache types, TurboQuant must
declare the turbo binary, and both legacy ids (chaosengine + rotorquant)
must coerce to turboquant via ``registry.resolve_legacy_id`` and resolve
via ``registry.get``. Mirrors the assertion already in
``scripts/pre-build-check.mjs`` so both runners agree.
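
In sketch form, the probe's new invariant (harness shape assumed; the
``.id`` attribute is hypothetical):

    def check_legacy_coercion(registry) -> None:
        for legacy_id in ("chaosengine", "rotorquant"):
            assert registry.resolve_legacy_id(legacy_id) == "turboquant"
            strategy = registry.get(legacy_id)  # must resolve, not raise
            assert strategy.id == "turboquant"  # hypothetical .id attribute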

All 7 pre-build-check.sh gates green.
…n-sync probe

Five interlocking maintenance items found while auditing the upstream
landscape for the four repos: z-lab/dflash, bstnxbt/dflash-mlx,
youssofal/MTPLX, TheTom/turboquant_plus.

1. **dflash-mlx pin bumped from 8d8545d (v0.1.5.1) to fada1eb (HEAD).**
   11 upstream commits cover the new Gemma4 DFlash backend (commit
   05cc456 — biggest payload), v0.1.5 serving surface, live server
   metrics endpoint, prefix-cache survival test gate, async L2 writer
   fix, long-context runtime diagnostics hardening, benchmark slugging
   fixes, and a license switch to Apache-2.0. No breaking API changes
   per the upstream commit log.

2. **stage-runtime.mjs pin synced to match pyproject.toml.** Caught a
   real bug: pyproject.toml was at 8d8545d (v0.1.5.1) but
   scripts/stage-runtime.mjs was lagging on f825ffb (v0.1.4.1) — dev
   .venv ran new, but ``npm run stage:runtime`` was bundling the OLD
   binary into release builds. Both files now share fada1eb.

3. **DRAFT_MODEL_MAP extended for new z-lab drafters.** Added entries
   for google/gemma-4-31B-it, google/gemma-4-26B-A4B-it,
   Qwen/Qwen3.5-122B-A10B, MiniMaxAI/MiniMax-M2.5,
   MiniMaxAI/MiniMax-M2.7, and moonshotai/Kimi-K2.6, plus the
   mlx-community/* aliases for each so Apple Silicon quants resolve via
   the existing fuzzy-match path. 7 new unit tests in test_dflash.py
   pin the mappings.

4. **TriAttention git+url pinned to commit c3744ee.** The
   ``[triattention]`` and ``[triattention-mlx]`` extras were pulling
   ``git+...git`` HEAD with no commit pin, making fresh installs
   non-reproducible whenever upstream landed unreleased work between
   our staging snapshots. Pin matches the v0.2.0 release surface plus
   the AMD GPU port.

5. **FU-033 pin-sync probe shipped in pre-build-check.{mjs,sh}.**
   Regex-extracts the dflash-mlx commit hash from both files and fails
   the build when they diverge. Same commit also drops the orphan
   vendor/ChaosEngine staleness check from both runners (FU-030
   removed the vendored package; the probe would never resolve again).
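
The shipped probe is JS/shell; the same check sketched in Python (the
regex is an assumption about how each file spells the pin):

    import re
    import sys
    from pathlib import Path

    PIN_FILES = ("pyproject.toml", "scripts/stage-runtime.mjs")
    # Assumed spelling: a dflash-mlx reference followed by a commit hash.
    PIN_RE = re.compile(r"dflash-mlx.*?([0-9a-f]{7,40})")

    pins = {}
    for path in PIN_FILES:
        match = PIN_RE.search(Path(path).read_text())
        if match is None:
            sys.exit(f"pin-sync: no dflash-mlx pin found in {path}")
        pins[path] = match.group(1)

    if len(set(pins.values())) != 1:
        sys.exit(f"pin-sync: dflash-mlx pins diverge: {pins}")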

CLAUDE.md tracker updates: FU-006 entry rewritten to document the
fada1eb bump, three new entries (FU-031 dflash drafter expansion +
TriAttention pin; FU-032 turboquant_plus watch-closely; FU-033 pin-sync
probe shipped).

Test totals: 1321 pytest pass (+8 from previous 1313 — 7 new dflash
+ 1 housekeeping), 341 vitest pass, tsc clean, pre-build-check 8/8
gates green.
…all hash

Three related cleanups in src/components/RuntimeControls.tsx.

1. **Cache-strategy cards now hide when engine-incompatible or when the
   turbo binary is missing on GGUF.** Previously every strategy
   rendered for every model + engine combo with a greyed-out N/A
   badge. That taught users the wrong thing — a disabled card with no
   install button suggests something they could fix, when the only
   fix lived outside the app (engine mismatch is fundamental;
   ``llama-server-turbo`` build is a terminal-side script). The
   "package not installed but installable" case stays visible because
   the install button gets the user to ready in one click. ``native``
   always survives (the visibility rule is sketched after this list).

2. **DFlash speculative-decoding toggle now hides when the selected
   model has no draft in DRAFT_MODEL_MAP, or when the engine is GGUF.**
   Same principle — both cases give the user no in-app path to
   recover, so a disabled checkbox with an "N/A" badge added confusion
   without value. ``canInstallDflashForModel`` keeps the install
   affordance visible whenever the gap is the missing pip package
   (one-click install path) and the model would be supported.

3. **Hardcoded ``f825ffb`` install hint string fixed.** The DFlash
   help panel still printed the v0.1.4.1 commit hash even after the
   FU-006 / FU-033 bumps to ``fada1eb`` (v0.1.5.1). Same drift bug
   FU-033 caught between pyproject.toml + stage-runtime.mjs; now all
   three carry the same hash. Comment added so a future bump touches
   all three.
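
The card rule from item 1, expressed as a pure predicate in Python for
illustration (the shipped logic is TypeScript in RuntimeControls.tsx /
kvStrategyFilter.ts; parameter names are hypothetical):

    def should_show_strategy_card(
        strategy_id: str,
        engine_compatible: bool,
        needs_turbo_binary: bool,
        turbo_binary_present: bool,
        package_installed: bool,
        package_installable: bool,
    ) -> bool:
        if strategy_id == "native":
            return True  # native always survives
        if not engine_compatible:
            return False  # fundamental mismatch, no in-app fix
        if needs_turbo_binary and not turbo_binary_present:
            return False  # fix lives in a terminal-side build script
        # "Not installed but installable" stays visible: one-click path.
        return package_installed or package_installable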

Popover-side filter (src/components/kvStrategyFilter.ts) already
followed the hide rule, so the modal now matches. CLAUDE.md tracker
gains FU-034 entry documenting the change + the design rule for
future strategy slots.

Test totals: 1321 pytest pass, 341 vitest pass, tsc clean.
…gle row

Two visual fixes for the per-turn telemetry chips below assistant
messages.

1. **Runtime note tone now reflects actual fault state.** The
   "Using python with MLX 0.31.x and mlx-lm 0.31.y." chip used to
   render in the orange ``substrate-chip--warn`` style because the
   note slot was hardcoded to ``tone: "warn"``. That same slot also
   carries real warnings ("DFLASH unavailable", "Cache strategy
   failed. Fell back to native f16 cache.") — when every turn shows
   the orange chip, operators stop noticing it on the rare turns
   that actually flag a problem.

   New ``runtimeNoteIsWarning`` helper in SubstrateRoutingBadge.tsx
   scans for actionable tokens (``unavailable``, ``fell back``,
   ``failed``, ``error``, ``warning``, ``cannot``, etc.) and only
   then promotes the chip to the warn tone. The benign version
   banner now uses the default muted tone, matching the "MLX" /
   "Native f16" chips next to it (the token scan is sketched after
   this list).

2. **SubstrateRoutingBadge + ChatPerfStrip now share a single
   wrap-row.** Previously rendered as two sibling ``<div>`` strips,
   so the engine/cache/note chips broke onto a separate line from
   the perf chips (tok/s, CPU%, mem-free, thermal). New
   ``.message-runtime-strip`` wrapper in ChatThread.tsx is the
   outer flex container; the two inner strips switch to
   ``display: contents`` so their chips become direct flex children
   of the wrapper and flow as one continuous row, wrapping only
   when the viewport actually requires it.
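
The tone check from item 1, sketched in Python (the real helper is
TypeScript in SubstrateRoutingBadge.tsx; the token list is abridged):

    _ACTIONABLE_TOKENS = (
        "unavailable", "fell back", "failed", "error", "warning", "cannot",
    )

    def runtime_note_is_warning(note: str) -> bool:
        # Benign version banners contain none of these, so they stay muted.
        lowered = note.lower()
        return any(token in lowered for token in _ACTIONABLE_TOKENS)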

Test coverage: 10 new vitest cases in SubstrateRoutingBadge.test.ts
pin the tone-detect logic for both benign and faulty notes.

Test totals: 1321 pytest pass, 351 vitest pass (+10), tsc clean.
Previous commit (FU-035) accidentally captured a local dev flip of
``createUpdaterArtifacts: true → false`` in src-tauri/tauri.conf.json.
That flag belongs at ``true`` for release builds (without it the
auto-update channel never publishes new artifacts). Restore the
release-correct value; the FU-035 chip changes remain intact.
…erks

Two fixes for the HTML Challenge model card stream view.

1. **Stream box now fills the available model-frame height.**
   ``.html-challenge-stream`` was ``flex: 0 0 auto`` with a fixed
   ``height: clamp(280px, 38vh, 520px)``, which left a tall band of
   empty space below the streaming code while a model was generating.
   Switched to ``flex: 1 1 auto; min-height: 280px`` so the stream
   consumes the same vertical space the rendered iframe would use
   when the run completes. ``min-height`` keeps it usable on short
   viewports.

2. **Scroll-up no longer fights the user.** Two related races:
   - ``handleStreamScroll`` re-flipped ``streamAtBottom`` to true
     after every ``element.scrollTop = …`` write because the browser
     fires ``scroll`` for both user wheel input and programmatic
     writes. New ``lastProgrammaticScrollRef`` records the timestamp
     of each programmatic scroll and the handler ignores scroll
     events fired within 80ms of one — so user wheel events register
     as "stop tracking" instead of being overwritten by the post-write
     event.
   - The streaming chunk auto-scroll ``useEffect`` read
     ``streamAtBottom`` from the React closure, which lagged behind
     the user's wheel by one render. The effect now re-measures
     scroll position inside the rAF and bails (clearing tracking
     for that slot) if the user has moved away in the gap, instead
     of yanking the box back to bottom.

Net effect: scrolling up during streaming holds position, the box
takes the full panel height, and only the explicit "scroll to
bottom" button or scrolling within 32px of the tail re-engages
auto-tracking.
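
The suppression logic, sketched in Python for illustration (the real
code is TypeScript refs + rAF; the class shape here is hypothetical):

    import time

    SUPPRESS_WINDOW_S = 0.08  # ignore scroll events within 80ms of our write
    REENGAGE_PX = 32          # within 32px of the tail re-engages tracking

    class StreamScrollTracker:
        def __init__(self) -> None:
            self.at_bottom = True
            self._last_programmatic = float("-inf")

        def note_programmatic_scroll(self) -> None:
            self._last_programmatic = time.monotonic()

        def on_scroll(self, distance_from_bottom: float) -> None:
            if time.monotonic() - self._last_programmatic < SUPPRESS_WINDOW_S:
                return  # echo of our own scrollTop write, not user input
            self.at_bottom = distance_from_bottom <= REENGAGE_PX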

Test totals: 1321 pytest pass, 351 vitest pass, tsc clean.
Triggered by a real crash report: a tool-call in the Chat tab against
Qwen3-Coder-Next blanked the entire packaged macOS app. Webview reload
returned the user to the Dashboard, and any subsequent Chat
navigation crashed again with no diagnostic surface to read.

Two related root causes, fixed together.

1. **No React error boundary anywhere in the tree.** A single
   uncaught render error in one tab tore down the whole ``<main>``
   content frame. New ``src/components/ErrorBoundary.tsx`` uses
   ``getDerivedStateFromError`` + ``componentDidCatch`` to capture
   the error and render an inline fallback with:
   - the error name + message
   - the JS stack and component stack inside a collapsible
   - a "Try again" button that resets local state for transient
     errors (e.g. stale streaming buffer from an aborted tool call)
   - a "Copy details" button that writes a self-contained bug
     report to the clipboard (timestamp, UA, error, both stacks)
   The boundary wraps ``{content}`` in App.tsx keyed by
   ``activeTab`` so switching tabs unmounts the boundary entirely,
   giving the user a clean navigation-based recovery path even
   when "Try again" hits the same error.

2. **Release builds had no way to open devtools.** Tauri's
   ``devtools`` Cargo feature was in ``declared_features`` but not
   in the active ``features`` array on the ``tauri`` crate, so the
   WebKit inspector was compiled out in release. Without it, the
   only path to a JS stack was rebuilding the app via
   ``cargo tauri dev`` — useless to a user staring at a blank
   screen. Flipping the feature on adds the right-click → Inspect
   Element entry to release builds.

Surrounding work:
- CSS for ``.error-boundary`` lives next to the existing notice
  banners in src/styles.css; same colour vocabulary as
  ``.error-banner``.
- Unit tests in src/components/__tests__/ErrorBoundary.test.ts pin
  the ``getDerivedStateFromError`` contract so the boundary
  cannot silently stop catching errors.
- CLAUDE.md tracker entry (FU-037) records the root cause + fix
  for future regressions.

Test totals: 1321 pytest pass, 353 vitest pass (+2), tsc clean.
…ash alias

Three bugs surfaced by a live /api/diagnostics/snapshot payload taken
during a Qwen3-Coder-Next + Tools repro.

1. ``_free_bytes`` ImportError in diagnostics snapshot.
   backend_service/routes/diagnostics.py imported ``_free_bytes`` from
   backend_service.routes.setup, but the setup package's __init__.py
   never re-exported it from gpu_bundle.py — every snapshot reported
   ``ImportError: cannot import name '_free_bytes'`` in the ``extras``
   section. Added the re-export.

2. MallocStackLogging spam drowning the backend log.
   macOS hardened-runtime (we ship bundle.macOS.hardenedRuntime: true)
   leaked an env var into every Python subprocess, producing three
   lines of ``Python(PID) MallocStackLogging: can't turn off malloc
   stack logging because it was not enabled.`` at each spawn. With
   the metrics polling loop firing at 1 Hz, that's hundreds of lines per minute,
   drowning out the INFO / ERROR lines the Diagnostics tab is meant
   to surface. Two-pronged fix:
   - src-tauri/src/backend.rs: ``command.env_remove`` the three
     MallocStackLogging / MallocScribble vars before spawning the
     backend so NEW builds never produce the spam.
   - backend_service/routes/diagnostics.py: regex filter
     ``_LOG_NOISE_PATTERNS`` + ``_filter_log_noise`` strips the spam
     from /api/diagnostics/log-tail and the snapshot's logs section
     so OLDER builds get a clean diagnostic surface immediately
     without rebuilding. The filter reads 4x the requested window so
     200 useful lines survive even when the raw log is 50% spam
     (sketched after this list).

3. DFlash unavailable for ``mlx-community/Qwen3.6-27B-4bit``.
   Qwen3-Coder-Next was rebranded ``Qwen3.6-27B`` upstream; the
   lmstudio-community MLX conversion's HF metadata reports
   ``mlx-community/Qwen3.6-27B-4bit`` as the canonical repo and
   model_resolution.resolve_dflash_target_ref prefers canonical
   over the lmstudio alias. DRAFT_MODEL_MAP had no entry → DFlash
   silently unavailable per snapshot ("DFLASH unavailable for
   'mlx-community/Qwen3.6-27B-4bit': no compatible draft model
   is registered."). Aliased the three quant variants (4bit /
   bf16 / 8bit) back to Qwen/Qwen3-Coder-Next so the existing
   z-lab/Qwen3-Coder-Next-DFlash drafter resolves. New unit test
   pins the mapping.
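
The filter from item 2 in sketch form (helper names from this commit;
the pattern list is abridged and the window math is as described):

    import re

    _LOG_NOISE_PATTERNS = [
        re.compile(r"MallocStackLogging: can't turn off malloc stack logging"),
    ]

    def _filter_log_noise(raw_lines: list[str], wanted: int) -> list[str]:
        # Read 4x the requested window so `wanted` useful lines survive
        # even when the raw log is mostly spam; keep the newest survivors.
        window = raw_lines[-(wanted * 4):]
        kept = [
            line for line in window
            if not any(p.search(line) for p in _LOG_NOISE_PATTERNS)
        ]
        return kept[-wanted:]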

CLAUDE.md tracker gains FU-038 entry recording all three.

Test totals: 1321 pytest pass (+3 new dflash cases), 353 vitest
pass, tsc clean.
The first real bug caught by the FU-037 ErrorBoundary. User repro:
Qwen3-Coder-Next, Tools ON, prompt 'What is 17 * 23 plus the square
root of 144?'. ErrorBoundary fallback rendered:

  TypeError: Object.entries requires that input parameter not be
  null or undefined

Pinned _Y in the minified bundle to src/components/ToolCallCard.tsx
(line 116). Backend trace: Coder-Next emitted ``{"arguments": null}``
for a tool call that needed no parameters, and
``backend_service/agent.py::_execute_tool_call`` evaluated
``isinstance(None, str) -> False`` then set ``arguments = None``.
The None serialised into the persisted session, so every
subsequent render of the affected turn re-crashed the Chat tab —
the user could not even reach earlier history.

Two-layer fix.

1. Backend (root cause). ``_execute_tool_call`` coerces every
   non-dict shape (``None``, empty string, raw list, etc.) to ``{}``
   at the source. The ``arguments is always a dict`` contract now
   holds for every downstream consumer (frontend card, persisted
   session, OpenAI-compat passthrough). Four new unit tests in
   tests/test_agent.py pin the null / empty / missing-key / dict
   shapes (the coercion is sketched after this list).

2. Frontend (legacy data + belt-and-braces). ToolCallCard
   defensively wraps arguments in ``Record<string, unknown>`` with a
   default of ``{}``, and renders ``(no arguments)`` when the entries
   list is empty. Older persisted sessions that contain ``null``
   arguments from before the backend fix stop crashing without
   requiring a manual localStorage wipe.
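
The backend contract from layer 1 in sketch form (the string-parsing
branch is an assumption based on the old ``isinstance`` check):

    import json

    def _coerce_tool_arguments(raw) -> dict:
        """Every non-dict shape (None, "", raw list, ...) becomes {}."""
        if isinstance(raw, str) and raw.strip():
            try:
                raw = json.loads(raw)
            except ValueError:
                return {}
        return raw if isinstance(raw, dict) else {}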

CLAUDE.md tracker gains FU-039 entry documenting the root cause +
both layers.

Test totals: 1325 pytest pass (+4 new agent cases), 353 vitest pass,
tsc clean.
…vision tag

Three fixes surfaced by a Coder-Next chat session.

1. Tool-call parser widened to handle three real-world shapes.
   The old regex required a closing ``</tool_call>`` tag and only
   matched JSON objects. Coder-Next emitted three shapes in a
   single session:

   - canonical: ``<tool_call>{"name": ...}</tool_call>``
   - open-only: ``<tool_call>{"name": ...}`` with no close tag
   - array-shaped: ``<tool_call>[{"url": ...}]`` (hallucinated
     pseudo-results inside a call tag)

   The new parser uses ``json.JSONDecoder.raw_decode`` on each
   ``<tool_call>`` opener so it consumes exactly the next valid
   JSON value regardless of close tag, dispatches objects with a
   ``name``, drops list payloads silently (no dispatchable
   ``name``), and continues scanning so a later well-formed
   call still lands (the parser loop is sketched after this list).
   Cases (2) and (3) used to silently render the raw XML in the
   assistant bubble with no execution.

2. ``_strip_tool_call_xml`` helper removes the JSON region the
   parser consumed from ``result.text`` before the streaming
   layer hands it to the chat bubble. Without this, every
   parsed call appeared twice on screen — once as raw XML
   noise, once as the rendered ``ToolCallCard``. Applied in
   both ``run_agent_loop`` and ``run_agent_loop_streaming``.
   Excess blank lines collapsed so a mid-paragraph strip
   doesn't leave a visible gap.

3. Qwen3.6-27B + Qwen3.5 vision tag cleanup. Dense Qwen3.6-27B
   (Coder-Next branding), Qwen3.6-27B-FP8, mlx-community
   /Qwen3.6-27B-4bit, and the family-level Qwen3.6 + Qwen3.5
   entries all carried ``"vision"`` in their capabilities — a
   copy-paste bug from when the catalog was scaffolded. Vision
   lives on a separate ``Qwen3.6-27B-VL`` variant we do not
   yet ship; the stale tag was promoting
   ``supportsVision: true`` for every community quant variant,
   making ``ChatComposer`` render the "Attach image" affordance
   for a text-only model. Dropped from all five entries.
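
The parser loop from item 1 in sketch form (function shape assumed;
``raw_decode`` is the technique described above):

    import json

    _OPEN_TAG = "<tool_call>"

    def extract_tool_calls(text: str) -> list[dict]:
        calls, decoder, pos = [], json.JSONDecoder(), 0
        while (start := text.find(_OPEN_TAG, pos)) != -1:
            cursor = start + len(_OPEN_TAG)
            while cursor < len(text) and text[cursor].isspace():
                cursor += 1  # tolerate whitespace after the opener
            try:
                # Consume exactly the next valid JSON value; a closing
                # </tool_call> tag is no longer required.
                value, end = decoder.raw_decode(text, cursor)
            except ValueError:
                pos = cursor  # malformed payload: keep scanning
                continue
            if isinstance(value, dict) and "name" in value:
                calls.append(value)
            # List payloads (no dispatchable "name") drop silently;
            # scanning continues so a later well-formed call still lands.
            pos = end
        return calls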

Test coverage: 13 new agent-parser + strip tests; total 1339
pytest pass (+14), 353 vitest pass, tsc --noEmit clean. CLAUDE.md
tracker entry FU-040 records all three.
…wen3.6-27B aliases

User-spotted mismatch: their local install at
``/Users/dan/AI_Models/lmstudio-community/Qwen3-Coder-Next-MLX-4bit``
was surfacing as canonical repo ``mlx-community/Qwen3.6-27B-4bit``
in the diagnostics snapshot, picking up the wrong catalog row and
the wrong DFlash drafter. Confirmed via on-disk config.json that
the model is Qwen3-Next (architectures ``Qwen3NextForCausalLM``,
``model_type: "qwen3_next"``, sparse MoE with 512 experts,
hidden_size 2048, ~3B active per token) — fundamentally different
from the dense Qwen3.6-27B (``qwen3`` arch, hidden_size 5120, no
MoE).

Root cause: the catalog had no variant for the lmstudio-community
MLX 4-bit conversion of Coder-Next, so the fuzzy matcher in
src/utils/library.ts::libraryVariantMatchScore settled for the
closest "MLX + 4-bit + Qwen3" entry, which happened to be the
unrelated ``mlx-community/Qwen3.6-27B-4bit`` row.

Three changes.

1. Added an explicit ``lmstudio-community/Qwen3-Coder-Next-MLX-4bit``
   variant to the ``qwen3-coder-next`` family in
   backend_service/catalog/text_models.py. Correct params: 80B
   sparse / ~45 GB on disk / qwen3_next family capabilities
   (coding / agents / tool-use / reasoning / thinking). The matcher
   now scores 80+ on an exact repo-path substring hit instead of
   the previous fuzzy fallback.

2. Reverted the FU-038 DFlash aliases that wrongly pointed
   ``mlx-community/Qwen3.6-27B-4bit / bf16 / 8bit`` at
   ``Qwen/Qwen3-Coder-Next``. Those quants are the dense 27B
   Coder (text-only, ``qwen3`` arch) and have no drafter today;
   leaving them aliased to the Qwen3-Next MoE drafter would route
   DFlash to the wrong architecture and either crash at load or
   degrade silently.

3. Replaced them with the correct
   ``lmstudio-community/Qwen3-Coder-Next-MLX-4bit`` alias plus an
   ``-Instruct`` sibling.

New regression tests in tests/test_dflash.py pin (a) the new
alias resolves to ``z-lab/Qwen3-Coder-Next-DFlash`` and (b) the
dense 27B-4bit MUST NOT alias to the MoE drafter.

Test totals: 1340 pytest pass, 353 vitest pass, tsc clean.
CLAUDE.md tracker entry FU-041 records the root cause + fix.
@cryptopoly cryptopoly merged commit 3a5125d into staging May 11, 2026