Feature/mtp kvtc strategy modernization #49
Merged
Both slots added zero value over TurboQuant after the May 2026 landscape review.
ChaosEngine (cryptopoly/ChaosEngine, 1 commit upstream) was eclipsed by
NVIDIA's KVTC at ICLR 2026 — same PCA + adaptive quantization approach but
8–32x compression vs ChaosEngine's 3.7x, peer-reviewed, with a healthy
upstream. KVTC slot lands separately in FU-029.
RotorQuant shipped as a misleading alias for TurboQuant: same
``--cache-type-k turbo{N}`` flags, same ``turboquant`` Python module marker.
Real scrya-com RotorQuant uses Clifford Cl(3,0) rotors with its own kernel
path that we never wired up.
Persisted user configs that still reference these ids coerce silently to
``turboquant`` via a new ``CacheStrategyRegistry.resolve_legacy_id`` helper +
module-level ``_LEGACY_STRATEGY_ALIASES`` map. Frontend mirrors the
coercion via ``LEGACY_STRATEGY_ALIASES`` + ``canonicalStrategyId`` in
runtimeSupport.ts so chip filters and incompat-reason banners work for
older session snapshots.
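A minimal sketch of that coercion path, assuming the surrounding registry shape (only ``_LEGACY_STRATEGY_ALIASES``, ``resolve_legacy_id``, and ``get`` are names from this change; the rest is illustrative):

```python
# Sketch only: the real CacheStrategyRegistry carries more plumbing.
_LEGACY_STRATEGY_ALIASES = {
    "chaosengine": "turboquant",
    "rotorquant": "turboquant",
}

class CacheStrategyRegistry:
    def __init__(self):
        self._strategies = {}

    def register(self, strategy_id, strategy):
        self._strategies[strategy_id] = strategy

    def resolve_legacy_id(self, strategy_id: str) -> str:
        # Unknown ids pass through unchanged; only retired ids coerce.
        return _LEGACY_STRATEGY_ALIASES.get(strategy_id, strategy_id)

    def get(self, strategy_id: str):
        # Every lookup goes through the coercion, so persisted configs
        # that still say "chaosengine"/"rotorquant" resolve silently.
        return self._strategies[self.resolve_legacy_id(strategy_id)]
```

The frontend ``canonicalStrategyId`` mirrors the same map so both sides agree on the canonical id.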
The llama.cpp fallback chain shrank from 3-level (requested → ChaosEngine
→ native) to 2-level (requested → native) — the ChaosEngine intermediate
only ever emitted standard q-type cache flags that native already covers.
Vendored ChaosEngine bundling ripped out of scripts/stage-runtime.mjs
(three helper functions removed: stageVendoredChaosEngine,
ensureSetuptoolsForPep639, resolveChaosEngineVendor). Pre-build probe now
asserts the legacy-id
coercion works in CI rather than at runtime. ``[rotorquant]`` extra removed
from pyproject.toml. ``CHAOSENGINE_VENDOR_PATH`` env var dropped.
Test coverage: 1293 pytest pass, 341 vitest pass, tsc --noEmit clean.
A migration test in tests/test_cache_strategies.py asserts that both legacy
ids coerce and resolve to the TurboQuant strategy via registry.get(). New
fixture entry in tests/inference-batch-strategies.json exercises the
coercion end-to-end through the inference test runner.
Two complementary changes:
1. ``scripts/cache-strategy-matrix.py`` sweeps every supported (cache
strategy × spec-dec method × representative model) combination through
a running backend on port 8876 and writes a CSV + Markdown report to
``~/.chaosengine/test-results/``. Replaces the ad-hoc per-strategy
smoke scripts with a single end-to-end harness, and **asserts the
FU-030 legacy alias coercion** at runtime — runs with
``cacheStrategy=chaosengine`` and ``cacheStrategy=rotorquant`` must
come back loaded as ``turboquant``, exit code 2 on regression. Skips
cells where the strategy isn't installed, the turbo binary is missing,
the model isn't in the local library, or the spec-dec method isn't
supported on the chosen backend, so a fresh CI box reports honest
skip reasons rather than failing.
Includes 20 unit tests covering the pure functions (``skip_reason``,
``write_csv``, ``write_markdown``, ``print_summary``, matrix
definition checks) without standing up a backend.
2. FU-028 (MTP) and FU-029 (KVTC) tracker entries flipped from "in
progress" to "deferred — upstream blockers" with the actual
blockers documented:
- **FU-028 MTP:** mlx-lm 0.31.3 has ``stream_generate(..., draft_model=...)``
for separately-trained drafts but no native MTP-head loader (the
Gemma-4 / Qwen3.5 MTP drafters share activations + KV cache with
the target and cannot be loaded as a standalone ``mlx.nn.Module``).
Verified by inspecting the installed package source. llama.cpp PR
#22673 still in Draft. MTPLX (third-party) is HTTP-only.
Re-evaluate when (a) mlx-lm gains native MTP-head loading, OR
(b) llama.cpp #22673 merges, OR (c) MTPLX exposes a Python
in-process API.
- **FU-029 KVTC:** OnlyTerp/kvtc is CUDA-only (MLX/Metal "planned"
but not implemented), not on PyPI (distributed as a ``src.*``
repo), and integrates as a HuggingFace ``DynamicCache`` wrapper
rather than a llama.cpp cache type. Apple Silicon dev box can't
validate end-to-end. Re-evaluate when upstream ships MLX support
or a CUDA dev box becomes available.
The honest "deferred + reasoned" tracker entries are themselves the
right output here per the project guidelines — the alternative was
landing a half-wired CUDA-only KVTC slot or an HTTP-chained MTPLX
adapter, both of which would have shipped surface area without
delivering actual quality/performance to the user.
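The skip logic from item 1 can be sketched roughly like this (``skip_reason`` is the helper named above; the cell fields and probe inputs are assumptions for illustration):

```python
# Sketch of the matrix skip logic: return a reason string to skip the
# cell honestly, or None to run it. Field names are illustrative.
def skip_reason(cell, *, installed, turbo_binary_present, local_models,
                supported_spec_dec):
    if cell["strategy"] not in installed:
        return f"strategy '{cell['strategy']}' not installed"
    if cell["strategy"] == "turboquant" and not turbo_binary_present:
        return "turbo binary missing"
    if cell["model"] not in local_models:
        return f"model '{cell['model']}' not in local library"
    if cell["spec_dec"] not in supported_spec_dec:
        return f"spec-dec '{cell['spec_dec']}' unsupported on this backend"
    return None
```

A fresh CI box with nothing installed then reports a reason for every cell rather than a wall of failures.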
Test totals: 1313 pytest pass (+20 new), 341 vitest pass, tsc clean.
The probe was still running ``registry.get('chaosengine').llama_cpp_cache_flags(bits)``
and asserting the emitted cache types were standard llama-server types.
After FU-030 the legacy id coerces to TurboQuant, which emits
``turbo2/turbo3/turbo4`` — those are the *correct* types for the turbo
binary but the probe rejected them as INVALID.
Replaced with: native validates standard cache types, TurboQuant must
declare the turbo binary, and both legacy ids (chaosengine + rotorquant)
must coerce to turboquant via ``registry.resolve_legacy_id`` and resolve
via ``registry.get``. Mirrors the assertion already in
``scripts/pre-build-check.mjs`` so both runners agree.
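A sketch of the reworked probe assertions (strategy attributes and the standard-type set are assumptions; only the legacy-id checks are taken directly from the text):

```python
# Sketch of the replacement probe. The real probe runs in CI; the
# registry/strategy attribute names here are illustrative.
STANDARD_CACHE_TYPES = {"f16", "q8_0", "q4_0"}  # illustrative subset

def run_probe(registry):
    # native must emit only standard llama-server cache types
    for flag in registry.get("native").llama_cpp_cache_flags(8):
        assert flag in STANDARD_CACHE_TYPES, f"invalid native type {flag}"
    # TurboQuant declares the turbo binary instead of standard types
    assert registry.get("turboquant").requires_turbo_binary
    # both legacy ids must coerce and resolve to TurboQuant
    for legacy in ("chaosengine", "rotorquant"):
        assert registry.resolve_legacy_id(legacy) == "turboquant"
        assert registry.get(legacy) is registry.get("turboquant")
```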
All 7 pre-build-check.sh gates green.
…n-sync probe
Five interlocking maintenance items found while auditing the upstream
landscape for the four repos: z-lab/dflash, bstnxbt/dflash-mlx,
youssofal/MTPLX, TheTom/turboquant_plus.
1. **dflash-mlx pin bumped from 8d8545d (v0.1.5.1) to fada1eb (HEAD).**
11 upstream commits cover the new Gemma4 DFlash backend (commit
05cc456 — biggest payload), v0.1.5 serving surface, live server
metrics endpoint, prefix-cache survival test gate, async L2 writer
fix, long-context runtime diagnostics hardening, benchmark slugging
fixes, and a license switch to Apache-2.0. No breaking API changes
per the upstream commit log.
2. **stage-runtime.mjs pin synced to match pyproject.toml.** Caught a
real bug: pyproject.toml was at 8d8545d (v0.1.5.1) but
scripts/stage-runtime.mjs was lagging on f825ffb (v0.1.4.1) — dev
.venv ran new, but ``npm run stage:runtime`` was bundling the OLD
binary into release builds. Both files now share fada1eb.
3. **DRAFT_MODEL_MAP extended for new z-lab drafters.** Added entries
for google/gemma-4-31B-it, google/gemma-4-26B-A4B-it,
Qwen/Qwen3.5-122B-A10B, MiniMaxAI/MiniMax-M2.5,
MiniMaxAI/MiniMax-M2.7, and moonshotai/Kimi-K2.6, plus the
mlx-community/* aliases for each so Apple Silicon quants resolve via
the existing fuzzy-match path. 7 new unit tests in test_dflash.py
pin the mappings.
4. **TriAttention git+url pinned to commit c3744ee.** The
``[triattention]`` and ``[triattention-mlx]`` extras were pulling
``git+...git`` HEAD with no commit pin, making fresh installs
non-reproducible whenever upstream landed unreleased work between
our staging snapshots. Pin matches the v0.2.0 release surface plus
the AMD GPU port.
5. **FU-033 pin-sync probe shipped in pre-build-check.{mjs,sh}.**
Regex-extracts the dflash-mlx commit hash from both files and fails
the build when they diverge. Same commit also drops the orphan
vendor/ChaosEngine staleness check from both runners (FU-030
removed the vendored package; the probe would never resolve again).
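The pin-sync probe in item 5 amounts to something like the following (the regex shape and function names are assumptions; the real probe lives in pre-build-check.{mjs,sh}):

```typescript
// Sketch of the FU-033 pin-sync probe: extract the dflash-mlx commit
// hash from both files and fail the build on divergence.
const DFLASH_PIN_RE = /dflash-mlx[^#\n]*@([0-9a-f]{7,40})/;

function extractPin(fileText: string): string {
  const match = fileText.match(DFLASH_PIN_RE);
  if (!match) throw new Error("no dflash-mlx pin found");
  return match[1];
}

function assertPinsInSync(pyprojectText: string, stageRuntimeText: string): string {
  const a = extractPin(pyprojectText);
  const b = extractPin(stageRuntimeText);
  if (a !== b) throw new Error(`dflash-mlx pin mismatch: ${a} vs ${b}`);
  return a; // the single agreed-upon pin
}
```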
CLAUDE.md tracker updates: FU-006 entry rewritten to document the
fada1eb bump, three new entries (FU-031 dflash drafter expansion +
TriAttention pin; FU-032 turboquant_plus watch-closely; FU-033 pin-sync
probe shipped).
Test totals: 1321 pytest pass (+8 from previous 1313 — 7 new dflash
+ 1 housekeeping), 341 vitest pass, tsc clean, pre-build-check 8/8
gates green.
…all hash
Three related cleanups in src/components/RuntimeControls.tsx.
1. **Cache-strategy cards now hide when engine-incompatible or when
   the turbo binary is missing on GGUF.** Previously every strategy
   rendered for every model + engine combo with a greyed-out N/A
   badge. That taught users the wrong thing — a disabled card with no
   install button suggests something they could fix, when the only
   fix lived outside the app (engine mismatch is fundamental;
   ``llama-server-turbo`` build is a terminal-side script). The
   "package not installed but installable" case stays visible because
   the install button gets the user to ready in one click. ``native``
   always survives.
2. **DFlash speculative-decoding toggle now hides when the selected
   model has no draft in DRAFT_MODEL_MAP, or when the engine is
   GGUF.** Same principle — both cases give the user no in-app path
   to recover, so a disabled checkbox with an "N/A" badge added
   confusion without value. ``canInstallDflashForModel`` keeps the
   install affordance visible whenever the gap is the missing pip
   package (one-click install path) and the model would be supported.
3. **Hardcoded ``f825ffb`` install hint string fixed.** The DFlash
   help panel still printed the v0.1.4.1 commit hash even after the
   FU-006 / FU-033 bumps to ``fada1eb`` (v0.1.5.1). Same drift bug
   FU-033 caught between pyproject.toml + stage-runtime.mjs; now all
   three carry the same hash. Comment added so a future bump touches
   all three.
Popover-side filter (src/components/kvStrategyFilter.ts) already
followed the hide rule, so the modal now matches. CLAUDE.md tracker
gains FU-034 entry documenting the change + the design rule for
future strategy slots.
Test totals: 1321 pytest pass, 341 vitest pass, tsc clean.
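The design rule reduces to one predicate: hide when there is no in-app recovery path, keep visible when a one-click install would fix it. A sketch, with all type and field names assumed for illustration:

```typescript
// Sketch of the visibility rule. Only "native always survives" and the
// hide/install distinction come from the change; names are assumptions.
interface StrategyCard {
  id: string;
  engineCompatible: boolean;  // engine mismatch is fundamental -> hide
  needsTurboBinary: boolean;  // terminal-side build -> hide when missing
  packageInstalled: boolean;  // pip gap -> stay visible, show install button
}

function isCardVisible(card: StrategyCard, turboBinaryPresent: boolean): boolean {
  if (card.id === "native") return true;      // native always survives
  if (!card.engineCompatible) return false;   // no in-app recovery path
  if (card.needsTurboBinary && !turboBinaryPresent) return false;
  // "not installed but installable" stays visible: the install button
  // gets the user to ready in one click.
  return true;
}
```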
…gle row
Two visual fixes for the per-turn telemetry chips below assistant
messages.
1. **Runtime note tone now reflects actual fault state.** The
"Using python with MLX 0.31.x and mlx-lm 0.31.y." chip used to
render in the orange ``substrate-chip--warn`` style because the
note slot was hardcoded to ``tone: "warn"``. That same slot also
carries real warnings ("DFLASH unavailable", "Cache strategy
failed. Fell back to native f16 cache.") — when every turn shows
the orange chip, operators stop noticing it on the rare turns
that actually flag a problem.
New ``runtimeNoteIsWarning`` helper in SubstrateRoutingBadge.tsx
scans for actionable tokens (``unavailable``, ``fell back``,
``failed``, ``error``, ``warning``, ``cannot``, etc.) and only
then promotes the chip to the warn tone. The benign version
banner now uses the default muted tone, matching the "MLX" /
"Native f16" chips next to it.
2. **SubstrateRoutingBadge + ChatPerfStrip now share a single
wrap-row.** Previously rendered as two sibling ``<div>`` strips,
so the engine/cache/note chips broke onto a separate line from
the perf chips (tok/s, CPU%, mem-free, thermal). New
``.message-runtime-strip`` wrapper in ChatThread.tsx is the
outer flex container; the two inner strips switch to
``display: contents`` so their chips become direct flex children
of the wrapper and flow as one continuous row, wrapping only
when the viewport actually requires it.
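The tone-detect helper in item 1 is essentially a token scan; a sketch (the token list follows the examples above, but the shipped list may differ):

```typescript
// Sketch of runtimeNoteIsWarning: only promote the note chip to the
// warn tone when the text carries an actionable token.
const WARNING_TOKENS = [
  "unavailable", "fell back", "failed", "error", "warning", "cannot",
];

function runtimeNoteIsWarning(note: string): boolean {
  const lower = note.toLowerCase();
  return WARNING_TOKENS.some((token) => lower.includes(token));
}
```

Benign version banners fall through to the default muted tone, so the orange chip regains its signal value.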
Test coverage: 10 new vitest cases in SubstrateRoutingBadge.test.ts
pin the tone-detect logic for both benign and faulty notes.
Test totals: 1321 pytest pass, 351 vitest pass (+10), tsc clean.
Previous commit (FU-035) accidentally captured a local dev flip of ``createUpdaterArtifacts: true → false`` in src-tauri/tauri.conf.json. That flag belongs at ``true`` for release builds (without it the auto-update channel never publishes new artifacts). Restore the release-correct value; the FU-035 chip changes remain intact.
…erks
Two fixes for the HTML Challenge model card stream view.
1. **Stream box now fills the available model-frame height.**
``.html-challenge-stream`` was ``flex: 0 0 auto`` with a fixed
``height: clamp(280px, 38vh, 520px)``, which left a tall band of
empty area below the streaming code while a model was generating.
Switched to ``flex: 1 1 auto; min-height: 280px`` so the stream
consumes the same vertical space the rendered iframe would use
when the run completes. ``min-height`` keeps it usable on short
viewports.
2. **Scroll-up no longer fights the user.** Two related races:
- ``handleStreamScroll`` re-flipped ``streamAtBottom`` to true
after every ``element.scrollTop = …`` write because the browser
fires ``scroll`` for both user wheel input and programmatic
writes. New ``lastProgrammaticScrollRef`` records the timestamp
of each programmatic scroll and the handler ignores scroll
events fired within 80ms of one — so user wheel events register
as "stop tracking" instead of being overwritten by the post-write
event.
- The streaming chunk auto-scroll ``useEffect`` read
``streamAtBottom`` from the React closure, which lagged behind
the user's wheel by one render. The effect now re-measures
scroll position inside the rAF and bails (clearing tracking
for that slot) if the user has moved away in the gap, instead
of yanking the box back to bottom.
Net effect: scrolling up during streaming holds position, the box
takes the full panel height, and only the explicit "scroll to
bottom" button or scrolling within 32px of the tail re-engages
auto-tracking.
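The programmatic-scroll suppression can be sketched as follows (the 80ms window and 32px tail threshold come from the text; the state shape is simplified from the React refs):

```typescript
// Sketch: record the timestamp of each programmatic scroll write, and
// have the scroll handler ignore events inside the suppression window
// so only genuine user wheel input updates tracking intent.
const PROGRAMMATIC_WINDOW_MS = 80;
const BOTTOM_THRESHOLD_PX = 32;

let lastProgrammaticScroll = 0;
let streamAtBottom = true;

function scrollToBottom(el: { scrollTop: number; scrollHeight: number }, now: number) {
  lastProgrammaticScroll = now; // record BEFORE the write; the browser
  el.scrollTop = el.scrollHeight; // fires `scroll` for this write too
}

function handleStreamScroll(distanceFromBottom: number, now: number) {
  if (now - lastProgrammaticScroll < PROGRAMMATIC_WINDOW_MS) return; // our write
  streamAtBottom = distanceFromBottom <= BOTTOM_THRESHOLD_PX; // user intent
}
```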
Test totals: 1321 pytest pass, 351 vitest pass, tsc clean.
Triggered by a real crash report: a tool-call in the Chat tab against
Qwen3-Coder-Next blanked the entire packaged macOS app. Webview reload
returned the user to the Dashboard, and any subsequent Chat
navigation crashed again with no diagnostic surface to read.
Two related root causes, fixed together.
1. **No React error boundary anywhere in the tree.** A single
uncaught render error in one tab tore down the whole ``<main>``
content frame. New ``src/components/ErrorBoundary.tsx`` uses
``getDerivedStateFromError`` + ``componentDidCatch`` to capture
the error and render an inline fallback with:
- the error name + message
- the JS stack and component stack inside a collapsible
- a "Try again" button that resets local state for transient
errors (e.g. stale streaming buffer from an aborted tool call)
- a "Copy details" button that writes a self-contained bug
report to the clipboard (timestamp, UA, error, both stacks)
The boundary wraps ``{content}`` in App.tsx keyed by
``activeTab`` so switching tabs unmounts the boundary entirely,
giving the user a clean navigation-based recovery path even
when "Try again" hits the same error.
2. **Release builds had no way to open devtools.** Tauri's
``devtools`` Cargo feature was in ``declared_features`` but not
in the active ``features`` array on the ``tauri`` crate, so the
WebKit inspector was compiled out in release. Without it, the
only path to a JS stack was rebuilding the app via
``cargo tauri dev`` — useless to a user staring at a blank
screen. Flipping the feature on adds the right-click → Inspect
Element entry to release builds.
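The boundary's pure contract can be expressed without React (the real component extends React.Component in src/components/ErrorBoundary.tsx; the report field layout here is an assumption based on the description):

```typescript
// Sketch of the boundary's pure parts: the state derivation that
// mirrors getDerivedStateFromError, and the "Copy details" payload.
interface BoundaryState { error: Error | null }

function deriveStateFromError(error: Error): BoundaryState {
  return { error }; // any uncaught render error flips to the fallback
}

function buildBugReport(error: Error, componentStack: string, userAgent: string): string {
  // Self-contained report: timestamp, UA, error, both stacks.
  return [
    `timestamp: ${new Date().toISOString()}`,
    `userAgent: ${userAgent}`,
    `error: ${error.name}: ${error.message}`,
    `stack: ${error.stack ?? "(none)"}`,
    `componentStack: ${componentStack}`,
  ].join("\n");
}
```

Keeping these pure makes the "cannot silently stop catching errors" tests cheap to pin.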
Surrounding work:
- CSS for ``.error-boundary`` lives next to the existing notice
banners in src/styles.css; same colour vocabulary as
``.error-banner``.
- Unit tests in src/components/__tests__/ErrorBoundary.test.ts pin
the ``getDerivedStateFromError`` contract so the boundary
cannot silently stop catching errors.
- CLAUDE.md tracker entry (FU-037) records the root cause + fix
for future regressions.
Test totals: 1321 pytest pass, 353 vitest pass (+2), tsc clean.
…ash alias
Three bugs surfaced by a live /api/diagnostics/snapshot payload taken
during a Qwen3-Coder-Next + Tools repro.
1. ``_free_bytes`` ImportError in diagnostics snapshot.
backend_service/routes/diagnostics.py imported ``_free_bytes`` from
backend_service.routes.setup, but the setup package's __init__.py
never re-exported it from gpu_bundle.py — every snapshot reported
``ImportError: cannot import name '_free_bytes'`` in the ``extras``
section. Added the re-export.
2. MallocStackLogging spam drowning the backend log.
With macOS hardened runtime enabled (we ship
bundle.macOS.hardenedRuntime: true), every Python subprocess
inherited a malloc-debug env var, producing three
lines of ``Python(PID) MallocStackLogging: can't turn off malloc
stack logging because it was not enabled.`` at each spawn. With
the metrics polling loop firing 1 Hz that's hundreds per minute,
drowning out the INFO / ERROR lines the Diagnostics tab is meant
to surface. Two-pronged fix:
- src-tauri/src/backend.rs: ``command.env_remove`` the three
MallocStackLogging / MallocScribble vars before spawning the
backend so NEW builds never produce the spam.
- backend_service/routes/diagnostics.py: regex filter
``_LOG_NOISE_PATTERNS`` + ``_filter_log_noise`` strips the spam
from /api/diagnostics/log-tail and the snapshot's logs section
so OLDER builds get a clean diagnostic surface immediately
without rebuilding. Filter reads 4x the requested window so
200 useful lines survive even when the raw log is 50% spam.
3. DFlash unavailable for ``mlx-community/Qwen3.6-27B-4bit``.
Qwen3-Coder-Next was rebranded ``Qwen3.6-27B`` upstream; the
lmstudio-community MLX conversion's HF metadata reports
``mlx-community/Qwen3.6-27B-4bit`` as the canonical repo and
model_resolution.resolve_dflash_target_ref prefers canonical
over the lmstudio alias. DRAFT_MODEL_MAP had no entry → DFlash
silently unavailable per snapshot ("DFLASH unavailable for
'mlx-community/Qwen3.6-27B-4bit': no compatible draft model
is registered."). Aliased the three quant variants (4bit /
bf16 / 8bit) back to Qwen/Qwen3-Coder-Next so the existing
z-lab/Qwen3-Coder-Next-DFlash drafter resolves. New unit test
pins the mapping.
CLAUDE.md tracker gains FU-038 entry recording all three.
Test totals: 1321 pytest pass (+3 new dflash cases), 353 vitest
pass, tsc clean.
The first real bug caught by the FU-037 ErrorBoundary. User repro:
Qwen3-Coder-Next, Tools ON, prompt 'What is 17 * 23 plus the square
root of 144?'. ErrorBoundary fallback rendered:
TypeError: Object.entries requires that input parameter not be
null or undefined
Pinned _Y in the minified bundle to src/components/ToolCallCard.tsx
(line 116). Backend trace: Coder-Next emitted ``{"arguments": null}``
for a tool call that needed no parameters, and
``backend_service/agent.py::_execute_tool_call`` evaluated
``isinstance(None, str) -> False`` then set ``arguments = None``.
The None serialised into the persisted session, so every
subsequent render of the affected turn re-crashed the Chat tab —
the user could not even reach earlier history.
Two-layer fix.
1. Backend (root cause). ``_execute_tool_call`` coerces every
non-dict shape (``None``, empty string, raw list, etc.) to ``{}``
at the source. The ``arguments is always a dict`` contract now
holds for every downstream consumer (frontend card, persisted
session, OpenAI-compat passthrough). Four new unit tests in
tests/test_agent.py pin the null / empty / missing-key / dict
shapes.
2. Frontend (legacy data + belt-and-braces). ToolCallCard
defensively wraps arguments in ``Record<string, unknown>`` with a
default of ``{}``, and renders ``(no arguments)`` when the entries
list is empty. Older persisted sessions that contain ``null``
arguments from before the backend fix stop crashing without
requiring a manual localStorage wipe.
CLAUDE.md tracker gains FU-039 entry documenting the root cause +
both layers.
Test totals: 1325 pytest pass (+4 new agent cases), 353 vitest pass,
tsc clean.
…vision tag
Three fixes surfaced by a Coder-Next chat session.
1. Tool-call parser widened to handle three real-world shapes.
The old regex required a closing ``</tool_call>`` tag and only
matched JSON objects. Coder-Next emitted three shapes in a
single session:
- canonical: ``<tool_call>{"name": ...}</tool_call>``
- open-only: ``<tool_call>{"name": ...}`` with no close tag
- array-shaped: ``<tool_call>[{"url": ...}]`` (hallucinated
pseudo-results inside a call tag)
The new parser uses ``json.JSONDecoder.raw_decode`` on each
``<tool_call>`` opener so it consumes exactly the next valid
JSON value regardless of close tag, dispatches objects with a
``name``, drops list payloads silently (no dispatchable
``name``), and continues scanning so a later well-formed
call still lands. Cases (2) and (3) used to silently render
the raw XML in the assistant bubble with no execution.
2. ``_strip_tool_call_xml`` helper removes the JSON region the
parser consumed from ``result.text`` before the streaming
layer hands it to the chat bubble. Without this, every
parsed call appeared twice on screen — once as raw XML
noise, once as the rendered ``ToolCallCard``. Applied in
both ``run_agent_loop`` and ``run_agent_loop_streaming``.
Excess blank lines collapsed so a mid-paragraph strip
doesn't leave a visible gap.
3. Qwen3.6-27B + Qwen3.5 vision tag cleanup. Dense Qwen3.6-27B
(Coder-Next branding), Qwen3.6-27B-FP8, mlx-community
/Qwen3.6-27B-4bit, and the family-level Qwen3.6 + Qwen3.5
entries all carried ``"vision"`` in their capabilities — a
copy-paste bug from when the catalog was scaffolded. Vision
lives on a separate ``Qwen3.6-27B-VL`` variant we do not
yet ship; the stale tag was promoting
``supportsVision: true`` for every community quant variant,
making ``ChatComposer`` render the "Attach image" affordance
for a text-only model. Dropped from all five entries.
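The widened parser from items 1 and 2 can be sketched together (``raw_decode`` on each opener is the approach named above; the function name and close-tag consumption details are assumptions):

```python
import json

# Sketch: consume exactly the next JSON value after each <tool_call>
# opener, dispatch objects with a "name", drop list payloads silently,
# and return the text with each consumed region stripped.
OPEN_TAG = "<tool_call>"
CLOSE_TAG = "</tool_call>"

def extract_tool_calls(text):
    decoder = json.JSONDecoder()
    calls, kept, pos = [], [], 0
    while True:
        start = text.find(OPEN_TAG, pos)
        if start == -1:
            kept.append(text[pos:])
            break
        kept.append(text[pos:start])
        json_start = start + len(OPEN_TAG)
        while json_start < len(text) and text[json_start].isspace():
            json_start += 1
        try:
            value, end = decoder.raw_decode(text, json_start)
        except ValueError:
            kept.append(OPEN_TAG)  # not JSON: keep the tag, keep scanning
            pos = json_start
            continue
        if isinstance(value, dict) and "name" in value:
            calls.append(value)  # dispatchable call
        # list payloads (hallucinated pseudo-results) drop silently
        pos = end
        tail = text[pos:]
        stripped = tail.lstrip()
        if stripped.startswith(CLOSE_TAG):  # consume an adjacent close tag
            pos += (len(tail) - len(stripped)) + len(CLOSE_TAG)
    return calls, "".join(kept)
```

Stripping the consumed region is what stops each call rendering twice — once as raw XML, once as the ``ToolCallCard``.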
Test coverage: 13 new agent-parser + strip tests; total 1339
pytest pass (+14), 353 vitest pass, tsc --noEmit clean. CLAUDE.md
tracker entry FU-040 records all three.
…wen3.6-27B aliases
User-spotted mismatch: their local install at
``/Users/dan/AI_Models/lmstudio-community/Qwen3-Coder-Next-MLX-4bit``
was surfacing as canonical repo ``mlx-community/Qwen3.6-27B-4bit`` in
the diagnostics snapshot, picking up the wrong catalog row and the
wrong DFlash drafter. Confirmed via on-disk config.json that the
model is Qwen3-Next (architectures ``Qwen3NextForCausalLM``,
``model_type: "qwen3_next"``, sparse MoE with 512 experts,
hidden_size 2048, ~3B active per token) — fundamentally different
from the dense Qwen3.6-27B (``qwen3`` arch, hidden_size 5120, no
MoE).
Root cause: the catalog had no variant for the lmstudio-community
MLX 4-bit conversion of Coder-Next, so the fuzzy matcher in
src/utils/library.ts::libraryVariantMatchScore settled for the
closest "MLX + 4-bit + Qwen3" entry, which happened to be the
unrelated ``mlx-community/Qwen3.6-27B-4bit`` row.
Three changes.
1. Added an explicit ``lmstudio-community/Qwen3-Coder-Next-MLX-4bit``
   variant to the ``qwen3-coder-next`` family in
   backend_service/catalog/text_models.py. Correct params: 80B
   sparse / ~45 GB on disk / qwen3_next family capabilities (coding
   / agents / tool-use / reasoning / thinking). The matcher now
   scores 80+ on an exact repo-path substring hit instead of the
   previous fuzzy fallback.
2. Reverted the FU-038 DFlash aliases that wrongly pointed
   ``mlx-community/Qwen3.6-27B-4bit / bf16 / 8bit`` at
   ``Qwen/Qwen3-Coder-Next``. Those quants are the dense 27B Coder
   (text-only, ``qwen3`` arch) and have no drafter today; leaving
   them aliased to the Qwen3-Next MoE drafter would route DFlash to
   the wrong architecture and either crash at load or degrade
   silently.
3. Replaced them with the correct
   ``lmstudio-community/Qwen3-Coder-Next-MLX-4bit`` alias plus an
   ``-Instruct`` sibling. New regression tests in
   tests/test_dflash.py pin (a) the new alias resolves to
   ``z-lab/Qwen3-Coder-Next-DFlash`` and (b) the dense 27B-4bit MUST
   NOT alias to the MoE drafter.
Test totals: 1340 pytest pass, 353 vitest pass, tsc clean. CLAUDE.md tracker entry FU-041 records the root cause + fix.