mlx-serve v26.6.10

Latest

Latest

github-actions released this 12 Jun 12:35

· 8 commits to main since this release

cc13d28

v26.6.10 — Text diffusion lands on Apple Silicon

DiffusionGemma runs natively. Google's block-diffusion model (diffusiongemma-26B-A4B-it) writes whole 256-token blocks in parallel instead of one token at a time: the full canvas-denoising loop — entropy-bound sampling, self-conditioning, adaptive early stopping — validated tensor-by-tensor against the reference implementation. Up to 25 tokens land per forward pass, and decode runs ~30% faster than the mlx-vlm reference on the same M-series hardware (31.8 vs 24.6 tok/s on a story prompt).
Diffusion on every API surface, day one. Chat completions, Anthropic messages, Responses, and FIM completions all serve it — streaming arrives block-by-block as each canvas commits, thinking mode separates reasoning cleanly, and tool calls come out with exact JSON arguments, ready for agent loops.
NVFP4 quantized models load and serve. Checkpoints converted with MLX's NVIDIA-FP4 mode (gemma-4-31b-it-nvfp4, Qwen3.6-27B-nvfp4, Qwen3-Next-80B-A3B-Thinking-mlx-nvfp4, and the rest of the growing nvfp4 catalog) now run out of the box instead of crashing at load — output verified token-identical to the reference implementation at temperature 0. mxfp4 and mxfp8 checkpoints ride the same path.
Mixed-precision QAT checkpoints resolve per weight. NVFP4 QAT conversions that keep sensitive layers at affine 8-bit (the gemma-4 QAT series overrides the shared MLP and MoE router) dispatch each tensor to its own scheme automatically — dense, MoE expert gather, embeddings, and vision projections included.
Discovery picks them up. --model-dir folders now list nvfp4/mxfp4/mxfp8 models in /v1/models and the app's model picker instead of skipping them, and the startup banner reports the quantization mode. The Model Browser offers these repos for download too — they were still stamped "Unsupported quantization" by a stale client-side gate — and badges them with their format (NVFP4/MXFP4/MXFP8) in the quant column.
Ask your documents. Attach a folder of mixed files — chat transcripts, notes, PDFs, JSON/YAML exports — from the chat's paperclip menu and ask questions about them in plain language. The app indexes the folder in memory (nothing leaves your Mac, nothing written to disk) and the model pulls in the relevant passages automatically, citing source filenames. Works in plain chat or alongside Agent and MCP tools.
Document indexing runs on the GPU — about 5× faster, zero setup. The first time you attach a folder, the app quietly fetches a 35 MB embedding model (one-time, resumable) and registers it with the running server; from then on indexing rides the GPU — a 500-file folder indexes in ~7 s instead of ~33 s, with your CPU left free. Everything stays local: the model downloads once from Hugging Face, your documents never leave the Mac. The /v1/embeddings API got the same treatment for everyone: input arrays embed in single batched GPU passes (~1.4 ms per 1200-char passage), results identical to one-at-a-time calls, encoders hot-load beside your chat model, and /v1/load-model now accepts an absolute model path. Encoder repos (BGE, MiniLM …) are downloadable from the Model Browser too.
Agents that run colorful CLIs no longer derail the model. A tool result carrying raw terminal control codes (an interactive npm prompt, a spinner, anything ANSI) could silently break prompt construction from that turn on — Gemma 4 models would respond by hallucinating entire conversations, inventing tool calls and their results. Any byte a tool emits now round-trips safely into the conversation history, and a prompt-format downgrade is logged loudly instead of passing silently.

Assets 4