Skip to content

v0.9.4 — New model families, chat performance, platform openness

Latest

Choose a tag to compare

@cryptopoly cryptopoly released this 16 Jun 08:45
· 1 commit to staging since this release

What's new in v0.9.4

New model families in Discover

Four frontier families added to the catalog:

  • DeepSeek V4 — DeepSeek's latest MoE (685B total / 37B active), 128K context. MLX 4-bit (~236 GB) and GGUF Q4_K_M variants for both standard and DeepSeek-V4-0324.
  • GLM-5 — ZhipuAI's GLM-5 (130B MoE, ~33B active), 128K context, strong multilingual performance. MLX 4-bit and GGUF paths.
  • Gemma 4 E2B + 31B (multimodal) — Google's latest Gemma 4 generation. E2B = 2B / 128K ctx; 31B = 31B / 256K ctx. Both sizes ship vision_config — all variants tagged multimodal. QAT GGUFs from Google's official HF org, plus MLX 8-bit and BF16 options.
  • MiniMax M2.7 — MiniMax's MoE flagship (256 routed experts, 8 active, ~240B total params), 200K context window. MLX mxfp4, GGUF Q4_K_M, and BF16 variants.

Chat: persistent prompt cache + lower latency

  • MLX persistent prompt cache (tier 4) — the system prompt + conversation prefix KV cache survives between turns within the same session. Turn-2 onwards can reuse the cached context directly; measured ~5.6× faster time-to-first-token on repeated exchanges against the same model.
  • Token-budgeted history window — the backend now tracks total tokens across the conversation and trims the sent history to stay within the model's usable context, eliminating silent truncation and the OOM on long sessions class of errors.
  • Cross-turn KV reuse for llama.cpp — the GGUF backend now persists its KV slot across turns when no model unload occurs, matching the MLX tier-4 behaviour on the llama.cpp side.
  • Flash-attention routing — automatically enables flash attention when loading via llama.cpp on supported hardware.
  • DRY + XTC samplers — two additional diversity-control samplers (Don't Repeat Yourself repetition penalty, Exclude Top Choices de-repetition) exposed end-to-end: backend plumbing through to UI sliders in the Advanced panel.
  • Token stream coalescing — internal SSE frames for both the standard generate path and the agent/tool-call path now batch characters before writing, reducing redundant json.dumps + HTTP write syscalls per token.

Platform openness

  • Ollama-compatible API shim — the backend now serves /v1 endpoints in the Ollama wire format, so any tool expecting an Ollama server (scripts, OpenWebUI, etc.) works without reconfiguration.
  • One-click embedding & RAG — new Setup action installs the embedding stack and wires it to the document store; the Chat tab gains a context-document rail for retrieval-augmented generation.
  • Model import — import models from local paths or any Hugging Face repo URL directly from the Library tab, without downloading through the Discover catalog first.
  • Run any HF model — the backend can now load arbitrary HF repos that aren't in the curated catalog; the Chat launcher exposes a "Custom model" entry point.
  • Server connect presets — Diagnostics tab gains a presets panel for configuring external llama.cpp / OpenAI-compatible server addresses, replacing the manual env-var workaround.

Challenge prompt library

A curated library of benchmark-style prompts accessible from the Chat composer — useful for quickly probing a newly loaded model across reasoning, coding, instruction-following, and multilingual tasks without typing them from scratch.

Spec-dec + cold-start fixes

  • Fixed a stale configure_full_attention_split import that silently disabled the entire MLX DFlash / DDTree / MTPLX speculative-decoding stack for all users (FU-075).
  • Fixed the MTP tensor probe missing top-level mtp.* keys, which blocked MTPLX from being selected on Qwen3.5/3.6 (FU-076).
  • Fixed MTPLX isolated venv truncated-install verification (FU-077).
  • Fixed MtplxEngine passing a bare HF repo id instead of the local snapshot path (FU-078).
  • Backend cold-start reduced from ~2.6 s → ~0.85 s by deferring all diffusers / torch imports out of the strategy-availability probe (FU-080).
  • MLX capability probe ceiling raised 30 s → 45 s to prevent false MLX unavailable on cold-boot M-series machines (FU-068).

Dependency updates

Package Before After
turboquant-mlx-full ≥0.6.2 ≥0.8.0 — adds Mamba/hybrid arch support (Nemotron-3), improved GPT-OSS path
vllm ≥0.22.1 ≥0.23.0
mlx-vlm ≥0.6.0 ≥0.6.3
tauri 2.11.0 2.11.2
tauri-build 2.6.0 2.6.2
tauri-plugin-dialog 2.7.0 2.7.1
tauri-plugin-opener 2.5.3 2.5.4
rust-i18n 3.1.2 4.0.0
serde_json 1.0.149 1.0.150
GitHub Actions (checkout, setup-python, upload-artifact) v4/5 v6/7