mlx-serve v26.6.11
v26.6.11 — Message your model from your phone
- Telegram bot — your model in your pocket. Make a bot in Telegram, paste its token, flip a switch, and message your local model from anywhere — no public URL, port-forwarding, or cloud relay; it works behind home Wi-Fi over your normal connection. Turn on Agent mode and it can run tools, read and write files (confined to a workspace folder), and even schedule tasks for you, all from your phone. The bot locks to the first chat that messages it, so no one else can drive your Mac.
- Paste anything straight into chat. Drop or paste an image, a PDF, or a whole folder into the message box — the same as the attach button. Folders get indexed for question-answering, PDFs have their text pulled in, and images go to vision models.
- Cleaner agent conversations. Tool calls and their results now fold into a compact, expandable summary, so a long agent run reads like a clear narrative instead of screens of raw output.
- Memory you can see — and that stops surprising you. A new memory readout in the menu-bar tray shows what your model and context are using, and a pre-flight check turns a "model too big for free RAM" crash into a clear, upfront message. On 16 GB Macs the context window and cross-request cache now size themselves to your RAM, so long agent sessions stay stable — and if a prompt genuinely won't fit, you get a plain "prompt too long" notice instead of an out-of-memory crash. The server log also shows the exact launch command at the top for easy troubleshooting.
- Run DeepSeek-V4-Flash even when it's bigger than your RAM. A new SSD weight-streaming option lets the DeepSeek-V4-Flash engine stream expert weights from disk instead of holding the whole model in memory, so the 80 GB checkpoint that used to crash at startup ("insufficient memory") now loads and serves. Turn it on in Settings when the model is larger than available RAM; the disk reads cost a little decode speed, and every other model ignores the setting.
- More Gemma 3 models supported. Flat text-only Gemma 3 checkpoints — including the popular abliterated builds — now load and run out of the box.
- Smoother Voice Mode setup, cleaner Gemma replies. Turning on Voice Mode now shows a friendly card naming exactly what's missing — the on-device dictation model, microphone access — instead of quietly failing. And a Gemma quirk that occasionally leaked a raw thinking tag into the end of a reply is fixed, so answers stay clean.
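The "locks to the first chat" behavior of the Telegram bot can be pictured with a small sketch. This is not mlx-serve's implementation: `ChatLock` and `handle_update` are hypothetical names, and the only thing taken from Telegram is the standard Bot API update shape, where a text message carries `message.chat.id` and `message.text`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatLock:
    """Remembers the first chat that messages the bot and rejects all others.

    Hypothetical helper for illustration; not part of mlx-serve.
    """
    owner_chat_id: Optional[int] = None  # None means no chat has claimed the bot yet

    def allows(self, chat_id: int) -> bool:
        # The first chat to arrive claims the bot.
        if self.owner_chat_id is None:
            self.owner_chat_id = chat_id
            return True
        # After that, only the same chat is let through.
        return chat_id == self.owner_chat_id


def handle_update(lock: ChatLock, update: dict) -> Optional[str]:
    """Return the text to forward to the model, or None to ignore the update."""
    message = update.get("message") or {}
    chat_id = (message.get("chat") or {}).get("id")
    text = message.get("text")
    if chat_id is None or text is None:
        return None  # not a plain text message, so it cannot claim the lock
    return text if lock.allows(chat_id) else None


if __name__ == "__main__":
    lock = ChatLock()
    print(handle_update(lock, {"message": {"chat": {"id": 111}, "text": "hello"}}))
    print(handle_update(lock, {"message": {"chat": {"id": 222}, "text": "let me in"}}))
```

In this sketch the second chat is silently ignored; a real bot might instead reply with a refusal, and would need to persist the owner across restarts.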
Issues / PRs
- Fix GPU-memory oversubscription crash on concurrent model switch by @ujwal-setlur in #41
- Fix download cancel/delete in Model Browser; enrich Downloaded tab by @ujwal-setlur in #34
- Report full metadata for unloaded models in /v1/models by @ujwal-setlur in #38
- Add Qwen2 (Qwen2.5) architecture support by @ujwal-setlur in #36
- Fix macOS dependency syntax for MLX Core by @pnavais in #32
- Handle GPU OOM at model load gracefully (pre-flight + legible error) by @ujwal-setlur in #45
- Fix gemma-3-4b-it-4bit load crash from head counts omitted in text_config by @ddalcu in #46
- Fix double-free SIGSEGV on the GPU-OOM pre-flight refusal path (#45/#47) by @bitworks-io in #48
- Expose ds4 SSD weight-streaming via --ssd-streaming (#39) by @bitworks-io in #49
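The pre-flight idea behind #45 (refuse a load that cannot fit, with a legible error, rather than crash) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: `preflight_check`, `ModelTooLargeError`, and the 10% headroom are all hypothetical, and the caller is assumed to supply the model size and free memory in bytes.

```python
GIB = 1024 ** 3


class ModelTooLargeError(RuntimeError):
    """Raised before loading when the model cannot fit in free memory (hypothetical)."""


def preflight_check(model_bytes: int, free_bytes: int, headroom: float = 0.10) -> None:
    """Refuse a load up front instead of letting it fail with an out-of-memory crash.

    headroom is an assumed safety margin: the fraction of free memory kept in
    reserve for the context and other allocations.
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    budget = int(free_bytes * (1.0 - headroom))
    if model_bytes > budget:
        raise ModelTooLargeError(
            f"model needs {model_bytes / GIB:.1f} GiB, "
            f"but only {budget / GIB:.1f} GiB of free memory is usable"
        )


if __name__ == "__main__":
    preflight_check(model_bytes=4 * GIB, free_bytes=12 * GIB)  # fits, returns quietly
    try:
        preflight_check(model_bytes=80 * GIB, free_bytes=12 * GIB)
    except ModelTooLargeError as err:
        print(f"refused: {err}")
```

Raising a dedicated exception keeps the refusal separate from genuine load failures, so a caller can show the message to the user without tearing down the server.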
New Contributors
- @pnavais made their first contribution in #32
- @bitworks-io made their first contribution in #48
Full Changelog: v26.6.10...v26.6.11