Release mlx-serve v26.6.11 · ddalcu/mlx-serve

v26.6.11 — Message your model from your phone

Telegram bot — your model in your pocket. Make a bot in Telegram, paste its token, flip a switch, and message your local model from anywhere — no public URL, port-forwarding, or cloud relay; it works behind home Wi-Fi over your normal connection. Turn on Agent mode and it can run tools, read and write files (confined to a workspace folder), and even schedule tasks for you, all from your phone. The bot locks to the first chat that messages it, so no one else can drive your Mac.
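The "locks to the first chat" behavior described above can be illustrated with a small sketch. This is not mlx-serve's actual code; the `ChatLock` class and its `allows` method are hypothetical names, and the only assumption is that each incoming message carries a numeric chat id.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatLock:
    """Hypothetical helper: remembers the first chat id seen and rejects all others."""

    owner_chat_id: Optional[int] = None

    def allows(self, chat_id: int) -> bool:
        # The first chat to message the bot claims it.
        if self.owner_chat_id is None:
            self.owner_chat_id = chat_id
            return True
        # Every later message must come from that same chat.
        return chat_id == self.owner_chat_id


lock = ChatLock()
print(lock.allows(111))  # first sender claims the bot
print(lock.allows(222))  # a different chat is refused
print(lock.allows(111))  # the owner keeps working
```

A real bot would also need to persist the owner id across restarts; the sketch keeps it in memory only.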
Paste anything straight into chat. Drop or paste an image, a PDF, or a whole folder into the message box — the same as the attach button. Folders get indexed for question-answering, PDFs have their text pulled in, and images go to vision models.
Cleaner agent conversations. Tool calls and their results now fold into a compact, expandable summary, so a long agent run reads like a clear narrative instead of screens of raw output.
Memory you can see — and that stops surprising you. A new memory readout in the menu-bar tray shows what your model and context are using, and a pre-flight check turns a "model too big for free RAM" crash into a clear, upfront message. On 16 GB Macs the context window and cross-request cache now size themselves to your RAM, so long agent sessions stay stable — and if a prompt genuinely won't fit, you get a plain "prompt too long" notice instead of an out-of-memory crash. The server log also shows the exact launch command at the top for easy troubleshooting.
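As a rough illustration of the pre-flight idea, the sketch below compares a model's size against free memory before loading and returns a readable message instead of failing later. The function name `preflight`, the 2 GiB headroom default, and the message wording are assumptions for illustration, not mlx-serve's implementation.

```python
from typing import Tuple

GIB = 1024 ** 3


def preflight(model_bytes: int, free_bytes: int,
              headroom_bytes: int = 2 * GIB) -> Tuple[bool, str]:
    """Hypothetical check: refuse up front if the model plus headroom exceeds free RAM."""
    needed = model_bytes + headroom_bytes  # headroom value is an assumed placeholder
    if needed > free_bytes:
        return False, (
            f"Model needs about {needed / GIB:.1f} GiB "
            f"but only {free_bytes / GIB:.1f} GiB is free."
        )
    return True, "ok"


ok, message = preflight(model_bytes=80 * GIB, free_bytes=16 * GIB)
print(ok, message)
```

Returning a status and a message, rather than raising, lets the caller show the text directly in the UI.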
Run DeepSeek-V4-Flash even when it's bigger than your RAM. A new SSD weight-streaming option lets the DeepSeek-V4-Flash engine stream expert weights from disk instead of holding the whole model in memory, so the 80 GB checkpoint that used to crash at startup ("insufficient memory") now loads and serves. Turn it on in Settings when the model is larger than available RAM; the disk reads cost a little decode speed, and every other model ignores the setting.
More Gemma 3 models supported. Flat text-only Gemma 3 checkpoints — including the popular abliterated builds — now load and run out of the box.
Smoother Voice Mode setup, cleaner Gemma replies. Turning on Voice Mode now shows a friendly card naming exactly what's missing — the on-device dictation model, microphone access — instead of quietly failing. And a Gemma quirk that occasionally leaked a raw thinking tag into the end of a reply is fixed, so answers stay clean.

Issues / PRs

Fix GPU-memory oversubscription crash on concurrent model switch by @ujwal-setlur in #41
Fix download cancel/delete in Model Browser; enrich Downloaded tab by @ujwal-setlur in #34
Report full metadata for unloaded models in /v1/models by @ujwal-setlur in #38
Add Qwen2 (Qwen2.5) architecture support by @ujwal-setlur in #36
Fix macOS dependency syntax for MLX Core by @pnavais in #32
Handle GPU OOM at model load gracefully (pre-flight + legible error) by @ujwal-setlur in #45
Fix gemma-3-4b-it-4bit load crash from head counts omitted in text_config by @ddalcu in #46
Fix double-free SIGSEGV on the GPU-OOM pre-flight refusal path (#45/#47) by @bitworks-io in #48
Expose ds4 SSD weight-streaming via --ssd-streaming (#39) by @bitworks-io in #49
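
For the /v1/models change (#38), a client might list the reported models as in the sketch below. It assumes the endpoint returns an OpenAI-style list object with a `data` array of entries that each carry an `id`; the function names and the base URL argument are hypothetical, and any extra metadata fields are left unread because the notes do not name them.

```python
import json
from typing import List
from urllib.request import urlopen


def list_model_ids(payload: dict) -> List[str]:
    """Extract model ids from an assumed OpenAI-style {"data": [{"id": ...}]} payload."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def fetch_models(base_url: str) -> dict:
    """Fetch /v1/models from a running server; base_url is supplied by the caller."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return json.load(resp)


# Offline example with a made-up payload (model names are placeholders).
sample = {"object": "list", "data": [{"id": "model-a"}, {"id": "model-b"}]}
print(list_model_ids(sample))
```

Against a live server, `list_model_ids(fetch_models(base_url))` would do the same with whatever address and port your instance listens on.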

New Contributors

@pnavais made their first contribution in #32
@bitworks-io made their first contribution in #48

Full Changelog: v26.6.10...v26.6.11
