What's new in v0.9.4
New model families in Discover
Four frontier families added to the catalog:
- DeepSeek V4 — DeepSeek's latest MoE (685B total / 37B active), 128K context. MLX 4-bit (~236 GB) and GGUF Q4_K_M variants for both standard and DeepSeek-V4-0324.
- GLM-5 — ZhipuAI's GLM-5 (130B MoE, ~33B active), 128K context, strong multilingual performance. MLX 4-bit and GGUF paths.
- Gemma 4 E2B + 31B (multimodal) — Google's latest Gemma 4 generation. E2B = 2B / 128K ctx; 31B = 31B / 256K ctx. Both sizes ship
vision_config— all variants tagged multimodal. QAT GGUFs from Google's official HF org, plus MLX 8-bit and BF16 options. - MiniMax M2.7 — MiniMax's MoE flagship (256 routed experts, 8 active, ~240B total params), 200K context window. MLX mxfp4, GGUF Q4_K_M, and BF16 variants.
Chat: persistent prompt cache + lower latency
- MLX persistent prompt cache (tier 4) — the system prompt + conversation prefix KV cache survives between turns within the same session. Turn-2 onwards can reuse the cached context directly; measured ~5.6× faster time-to-first-token on repeated exchanges against the same model.
- Token-budgeted history window — the backend now tracks total tokens across the conversation and trims the sent history to stay within the model's usable context, eliminating silent truncation and the
OOM on long sessionsclass of errors. - Cross-turn KV reuse for llama.cpp — the GGUF backend now persists its KV slot across turns when no model unload occurs, matching the MLX tier-4 behaviour on the llama.cpp side.
- Flash-attention routing — automatically enables flash attention when loading via llama.cpp on supported hardware.
- DRY + XTC samplers — two additional diversity-control samplers (Don't Repeat Yourself repetition penalty, Exclude Top Choices de-repetition) exposed end-to-end: backend plumbing through to UI sliders in the Advanced panel.
- Token stream coalescing — internal SSE frames for both the standard generate path and the agent/tool-call path now batch characters before writing, reducing redundant
json.dumps+ HTTP write syscalls per token.
Platform openness
- Ollama-compatible API shim — the backend now serves
/v1endpoints in the Ollama wire format, so any tool expecting an Ollama server (scripts, OpenWebUI, etc.) works without reconfiguration. - One-click embedding & RAG — new Setup action installs the embedding stack and wires it to the document store; the Chat tab gains a context-document rail for retrieval-augmented generation.
- Model import — import models from local paths or any Hugging Face repo URL directly from the Library tab, without downloading through the Discover catalog first.
- Run any HF model — the backend can now load arbitrary HF repos that aren't in the curated catalog; the Chat launcher exposes a "Custom model" entry point.
- Server connect presets — Diagnostics tab gains a presets panel for configuring external llama.cpp / OpenAI-compatible server addresses, replacing the manual env-var workaround.
Challenge prompt library
A curated library of benchmark-style prompts accessible from the Chat composer — useful for quickly probing a newly loaded model across reasoning, coding, instruction-following, and multilingual tasks without typing them from scratch.
Spec-dec + cold-start fixes
- Fixed a stale
configure_full_attention_splitimport that silently disabled the entire MLX DFlash / DDTree / MTPLX speculative-decoding stack for all users (FU-075). - Fixed the MTP tensor probe missing top-level
mtp.*keys, which blocked MTPLX from being selected on Qwen3.5/3.6 (FU-076). - Fixed MTPLX isolated venv truncated-install verification (FU-077).
- Fixed
MtplxEnginepassing a bare HF repo id instead of the local snapshot path (FU-078). - Backend cold-start reduced from ~2.6 s → ~0.85 s by deferring all
diffusers/torchimports out of the strategy-availability probe (FU-080). - MLX capability probe ceiling raised 30 s → 45 s to prevent false
MLX unavailableon cold-boot M-series machines (FU-068).
Dependency updates
| Package | Before | After |
|---|---|---|
turboquant-mlx-full |
≥0.6.2 | ≥0.8.0 — adds Mamba/hybrid arch support (Nemotron-3), improved GPT-OSS path |
vllm |
≥0.22.1 | ≥0.23.0 |
mlx-vlm |
≥0.6.0 | ≥0.6.3 |
tauri |
2.11.0 | 2.11.2 |
tauri-build |
2.6.0 | 2.6.2 |
tauri-plugin-dialog |
2.7.0 | 2.7.1 |
tauri-plugin-opener |
2.5.3 | 2.5.4 |
rust-i18n |
3.1.2 | 4.0.0 |
serde_json |
1.0.149 | 1.0.150 |
GitHub Actions (checkout, setup-python, upload-artifact) |
v4/5 | v6/7 |