Mesh your Apple Silicon Macs into a high-performance LLM serving cluster. Web UI, OpenAI-compatible API, one-line install.
GenMLX turns N Apple Silicon Macs (M-series) into a tensor-parallel inference cluster for large language models. Built on Apple's MLX framework, with Thunderbolt 5 mesh networking for low-latency cross-node communication. Designed to be productive in 15 minutes from curl | bash to first token.
⚠️ Pre-alpha — currently atv0.1.0.dev0(Phase 0 of the 7-phase build tov1.0.0). The architecture and roadmap below describe the target state. See the roadmap for what works today vs what's still being built.
- Why GenMLX
- Features
- Quick Start
- Architecture
- Hardware Requirements
- Supported Models
- Performance
- Documentation
- Roadmap
- How is this different from...
- Contributing
- License
- Acknowledgments
If you own multiple Apple Silicon Macs and want to:
- Serve large models locally that don't fit in a single Mac's unified memory (DeepSeek V4, Qwen3-Coder-Next, GLM-4.7, etc.)
- Keep inference private — no API keys, no rate limits, no data leaving your network
- Reuse hardware you already have — three M1 Maxes + one M3 Ultra still serve 100B+ parameter models
- Get fast time-to-first-token at long contexts thanks to disk-backed L2 KV caching
- Integrate with your existing tools — Claude Code, Cline, opencode, OpenWebUI all work out of the box via the OpenAI-compatible API
…then GenMLX is the simplest way to do that today.
It assumes a fixed, owned topology (1-6 Macs on the same private network) — that's the niche. If you need elastic, dynamic, heterogeneous device discovery across phones/laptops/desktops, look at EXO Labs instead.
- 🖥️ Web UI dashboard — manage models, serve, monitor, set up the cluster, all from
http://master:6789 - 🔌 OpenAI-compatible API —
/v1/chat/completions,/v1/completions,/v1/models, drop-in for any OpenAI client - 🤖 Tool/function calling — Hermes-style + DeepSeek-style + GLM-style tool parsing with streaming
- 🌐 Native Anthropic API adapter — Claude Code points at the cluster directly
- 🧠 Thinking-token routing —
<think>blocks correctly routed toreasoning_contentfor compatible clients
- 🧩 Master-agent over HTTP — no SSH for management plane, just bearer-token API calls
- 🚀 Auto-registration — agents announce themselves on boot; UI sees them within 30 seconds
- 🎯 Mesh setup wizard — UI generates per-node TB5 IP plans for N=1-6 nodes + verifies link-by-link; supports both full-mesh and ring topologies
- ⚡ Flexible networking — TB5 RDMA (best), TB4/TB3 RDMA, 10 GbE, or 1 GbE; mesh wizard detects + recommends per cluster.
jacclover TB, TCP backend over Ethernet. - 🧮 Mix any M-series Macs — heterogeneous RAM supported. Cluster auto-selects Tensor Parallel for homogeneous fleets, Pipeline Parallel for mixed (so a 192 GB Mac Studio + 32 GB Mac mini + 96 GB MacBook Pro all serve as one cluster, no manual sharding)
- 🛠️ Per-node multi-path storage — each node has its own configurable model storage paths (home dir, external SSDs, NAS mounts) managed from the UI
- ✅ Cross-node model presence check — UI badges show which model is on which node; blocks
Serveuntil all selected nodes have the model
- 💾 L2 disk-backed prompt cache — 200+ GB SSD cache for KV state; turns 88-minute cold prefill into 37-second L2 hit
- 🎯 L2 boundary snapshots — saves cache at the system+tools boundary so different conversations sharing a system prompt reuse it
- 🔀 Continuous batching —
BATCHED=1dispatcher serves multiple concurrent requests with prefix sharing - 📊 Smart KV quantization — optional int4/int8 KV for memory-bound workloads (default off — net loss on Apple Silicon)
- 🧩 Hybrid attention support — linear (GDN) + full attention models like Qwen3-Coder-Next handled correctly
- 📊 Live telemetry — per-node CPU/GPU/RAM/SSD via
macmonintegration; in-flight job tracking - 🔄 Browser-refresh-safe — long-running jobs (download/sync/serve) reattach when you reload the UI
- 🔑 Token-based auth — bearer tokens for all master/agent traffic; no SSH keys to manage
- 📦 Model lifecycle — download from Hugging Face, sync to all nodes (delta + resume), validate, serve, swap, delete from UI
- 🎛️ Integrations panel — copy-paste configs for Claude Code, Cline, opencode, OpenWebUI
v1.0.0-dev — public installer (
curl | bash) lands in Phase 6 of the implementation plan. For now, build from source:
git clone https://github.com/crystech/GenMLX.git
cd GenMLX
bash scripts/dev/setup.sh
source .venv/bin/activate
genmlx versionMaster node (the Mac you want to drive the UI from):
curl -fsSL https://raw.githubusercontent.com/crystech/GenMLX/main/install.sh | bash -s -- --masterThe installer:
- Installs Python 3.11 +
uv+macmonvia Homebrew - Sets up a virtual environment at
~/.genmlx/venv - Generates a 32-byte bearer token
- Registers a launchd service that starts GenMLX on boot
- Opens
http://localhost:6789in your browser
Worker nodes (additional Macs):
curl -fsSL https://genmlx.dev/install.sh | bash -s -- \
--agent \
--master-url http://<master-mac>.local:6789 \
--token gmx_<token-from-master-install-output>The agent auto-registers with the master within 30 seconds. Use the UI's Mesh Setup tab to wire up TB5 and generate the per-node IP plan.
┌──────────────────────────────────┐
│ Master Mac │
│ │
│ [Web UI on :6789 — dashboard] │
│ [REST + WebSocket API] │
│ [Mesh planner] │
│ [SQLite registry + jobs] │
│ │
│ [Dispatcher rank 0] │
└────┬────┬────┬────┬──────────────┘
│ HTTP + bearer-token auth
▼ ▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│ Agent │ │ Agent │ │ Agent │ ...up to 6 nodes
│ rank1 │ │ rank2 │ │ rank3 │
│ │ │ │ │ │
│ disp │ │ disp │ │ disp │
└───┬───┘ └───┬───┘ └───┬───┘
│ │ │
└─────────┴─────────┘
TB5 mesh — jaccl RDMA all-reduce
for the inference data plane
-
Master — The orchestrator. Hosts the UI, the REST API, the SQLite agent registry, the mesh planner, the job tracker. Always also runs
rank 0of the dispatcher. -
Agents — One per worker Mac. Lightweight HTTP daemons that respond to master commands (file sync, command exec, rank spawn, mesh configure). Stateless except for the local node's config + the dispatcher rank they're hosting.
-
Dispatcher — The serving brain. A 3000+ LOC FastAPI/
http.serverapp that wrapsmlx-lm, handles continuous batching, runs the L2 cache, parses thinking tokens + tool calls, exposes OpenAI/Anthropic APIs. The same dispatcher binary runs on every node; ranks communicate viamx.distributedover the TB5 mesh.
The split between master and agent (control plane) vs the dispatcher (data plane) is the central design choice: HTTP for low-frequency control + token-secured remote ops; jaccl/TB5 for the high-bandwidth low-latency inference traffic.
See ARCHITECTURE.md for the full design rationale.
| Component | Minimum | Recommended |
|---|---|---|
| Macs | 1 M-series Mac | 2-6 M-series Macs |
| RAM per Mac | 32 GB | 96 GB / 192 GB / 512 GB (mix is fine — see below) |
| Storage per Mac | 50 GB free | 500 GB+ (for models + L2 cache) |
| macOS | 14 Sonoma | 15 Sequoia |
| Network (single node) | — | — |
| Network (2+ nodes) | 1 GbE / Wi-Fi (degraded perf) | Thunderbolt 5 RDMA |
| Mac Studio M3 Ultra | great fit (6 TB5 ports, scales to 6-node mesh) | ⭐ ideal |
GenMLX works on a single Mac (no cluster) and scales to 6 Macs in a full TB5 mesh. Beyond 6 nodes, Mac Studio's 6 TB5 ports run out and you'd need a different topology (not supported in v1).
You don't need matching Macs. A 192 GB Mac Studio + 32 GB Mac mini + 96 GB MacBook Pro can serve as one cluster — GenMLX detects the RAM mix and picks the right strategy:
| Cluster type | Parallelism | Per-node sizing |
|---|---|---|
| Homogeneous (all RAM within ±10%) | Tensor Parallel (default) | Equal share per node |
| Heterogeneous (mixed RAM) | Pipeline Parallel (auto) | Layers weighted by per-node RAM |
The UI's cluster capacity card shows both options live (TP capacity: N × min(RAM) vs PP capacity: sum(RAM)). PP decode is ~10-30% slower than TP at the same total RAM (pipeline bubble), so for budget-conscious mixed clusters this is the right tradeoff. Override the auto-selector from the Serve modal if you want to force TP or PP.
GenMLX supports the full range of Mac networking, from Thunderbolt 5 RDMA down to 1 Gigabit Ethernet. Pipeline Parallel needs much less bandwidth than Tensor Parallel (one rank-to-next-rank send per layer, vs full all-reduce), so slower networks are workable — at honest performance caveats.
| Transport | Latency | Useable bandwidth | What works well |
|---|---|---|---|
| TB5 RDMA | <1 µs | ~80 Gbps | Everything. Best decode latency under TP. |
| TB4 / TB3 RDMA | ~5 µs | ~25-40 Gbps | TP for most models. Slightly slower decode than TB5. |
| 10 Gigabit Ethernet | ~10 µs | ~9.5 Gbps | PP works great. TP usable but ~3× slower decode than TB. |
| 1 Gigabit Ethernet | ~100 µs | ~940 Mbps | PP for small-to-mid models. TP all-reduce dominates decode — not recommended. |
The mesh wizard detects each node's fastest interface, identifies the highest-quality transport common to all nodes, and picks a strategy + topology accordingly. You'll see a "Reason" line in the UI explaining the choice — and you can always override.
Per-tier performance numbers in the docs are currently approximate (~30-40% slower vs TB5, etc.) — measured numbers land in Phase 7 after the maintainer benchmarks across actual infrastructure. Until then, the UI tooltip says "approximate, pending measurement" so you know what you're looking at.
Networking note: GenMLX does not require an external switch. TB5 is point-to-point; the mesh wizard generates a static-IP plan where every pair of directly-connected nodes lives on its own
/30subnet. Your existing LAN handles management traffic only.
Verified working as of v1.0.0-dev:
| Family | Variants | Notes |
|---|---|---|
| DeepSeek V4-Flash | mxfp4, 6bit, 8bit | Hybrid attention; works with spicyneuron/mlx-lm@fix-ds4-cache-reuse pin |
| Qwen3-Coder-Instruct | 30B-A3B, 4bit/8bit (MoE) | Continuous batching supported |
| Qwen3-Coder-Next | bf16 | Linear+full hybrid attention (qwen3_next) |
| Qwen3.6-27B | 4bit, 8bit | Dense; ideal for single-node |
| Qwen3 80B / 235B | various quants | TP=4 recommended |
| GLM-4.7 | 4bit, 8bit | 92 layers; launch-overhead bound |
| Llama 3.x | 4bit, 8bit | Standard transformer |
| Mistral / Mixtral | various | Standard architecture |
| Any MLX-compatible model | — | Use genmlx model get <hf_repo_id> |
Tip: For models with custom architectures (Qwen3-Coder-Next, V4-Flash, etc.), GenMLX ships TP sharding patches so they load correctly under tensor parallel. See
docs/models/for per-model setup notes.
Measured on a 4-Mac M3 Ultra cluster (TP=4), TB5 full-mesh, jaccl backend:
| Model | Decode (tps) | Prefill (pp-tps) | First-token | Notes |
|---|---|---|---|---|
| Qwen3.6-27B-8bit | ~50 | ~840 | <1s @ 20k | Dense, ideal for single user |
| Qwen3-Coder-30B-A3B-4bit | 50-65 | 800 | <1s @ 20k | MoE, continuous batching |
| DeepSeek-V4-Flash-mxfp4 | 38 short / 25 @ 75k | ~700 | <2s @ 20k | Hybrid attention; L2 cache shines |
| GLM-4.7-4bit | 22 | ~500 | <2s @ 20k | 92 layers, launch-bound |
| Qwen3-Coder-480B-A35B-8bit | ~33 (was supported, now retired) | ~600 | 3s @ 20k | Removed — storage cost too high |
L2 cache impact (V4-Flash, 508k-token prompt):
| Wall time | |
|---|---|
| Cold prefill | 88 minutes |
| L2 hit | 37 seconds |
| Speedup | 143× |
Numbers are not contractual — your mileage varies with model, quant, prompt length, and concurrency.
| Doc | Status |
|---|---|
| README.md — you are here | ✅ |
| CHANGELOG.md | ✅ |
| LICENSE | ✅ |
| CONTRIBUTING.md | 🚧 Phase 7 |
| ARCHITECTURE.md | 🚧 Phase 7 |
| INSTALL.md | 🚧 Phase 7 |
| MESH_SETUP.md | 🚧 Phase 7 |
docs/api/master.md (OpenAPI) |
🚧 Phase 7 |
docs/api/agent.md (OpenAPI) |
🚧 Phase 7 |
docs/models/<model>.md |
🚧 Phase 7 |
docs/performance-tuning.md |
🚧 Phase 7 |
docs/troubleshooting.md |
🚧 Phase 7 |
All docs are versioned with the code and updated alongside features. See CHANGELOG.md for the running log.
| Version | Milestone | Status |
|---|---|---|
0.1.0.dev0 |
Phase 0 — Repo scaffold, license, README, CHANGELOG, package skeleton | ✅ (current) |
0.1.0 |
Phase 1 — Clone dispatcher 1:1 from production cluster, parity test, single-node serve works | 🚧 |
0.2.0 |
Phase 2 — Common foundations (config, auth, paths) + agent read-only endpoints | 🚧 |
0.3.0 |
Phase 3 — Agent mutating endpoints (exec, file upload, sync) | 🚧 |
0.4.0 |
Phase 4 — Cluster serve orchestration (master tells agents to spawn ranks; no SSH) | 🚧 |
0.5.0 |
Phase 5 — Agent auto-registration + mesh wizard + multi-path storage management | 🚧 |
0.6.0 |
Phase 6 — Installer + launchd integration + Homebrew formula | 🚧 |
1.0.0-rc1 |
Phase 7 — Documentation, polish, lint, smoke-test on a real 4-Mac cluster | 🚧 |
1.0.0 |
Stable release after RC iteration | 🚧 |
Target for v1.0.0-rc1: 8 focused weeks from Phase 0.
Versioning follows Semantic Versioning with PEP 440 dev/rc tags. Anything 0.x may have breaking changes between minor versions; 1.x will preserve API compatibility.
- Token rotation
- Tailscale / WireGuard bundled networking
- Multi-instance serving (multiple models per cluster, route per request)
- Heterogeneous nodes (different RAM/SSD sizes properly recognized)
- Apple notarization for the installer
- Homebrew cask
- HA master (leader election, registry replication)
- Cross-platform agent (Linux GPU nodes alongside Macs)
- Auto-mesh discovery (sniff TB5 connectivity, skip manual wiring)
- Disaggregated prefill (GPU box + Mac decode)
See CHANGELOG.md for shipped features.
| GenMLX | EXO Labs | mlx_lm.distributed | llama.cpp + RPC | |
|---|---|---|---|---|
| Topology | Fixed, owned (1-6 Macs) | Dynamic discovery | Manual hostfile | Manual hostfile |
| Network | TB5 mesh (jaccl) | TB / WiFi / LAN | TB / LAN | LAN / RDMA (Linux) |
| Web UI | ✅ Full dashboard | Partial | ❌ | ❌ |
| OpenAI API | ✅ | ✅ | ❌ (needs wrapper) | ✅ (with patch) |
| Tool calling | ✅ Multiple parsers | Partial | ❌ | Partial |
| Multi-path storage | ✅ UI-managed | ❌ | ❌ | ❌ |
| L2 disk cache | ✅ Boundary snapshots | ❌ | ❌ | KV save on exit only |
| Continuous batching | ✅ | ❌ | ❌ | ✅ |
| Mixed-RAM clusters | ✅ TP + PP auto | ✅ today | ❌ | ❌ |
| Heterogeneous architectures (Mac + GPU box) | v2 | ✅ today | ❌ | ❌ |
| One-line install | ✅ | ✅ | ❌ | ❌ |
| Cross-platform | macOS (v1) | macOS + Linux | macOS + Linux | macOS + Linux + Win |
| Designed for | Mac fleet owners | Spare-device clusters | Researchers | Inference power users |
Short version: if you have several Macs and want them to act as one inference machine with the lowest-friction UX, GenMLX. If you have a mixed bag of devices and want flexible discovery, EXO. If you want the raw MLX building blocks for research, mlx_lm.distributed.
GenMLX is in active development. Contributions welcome once the v1.0.0-rc1 cuts. See CONTRIBUTING.md for setup, lint, test, and PR guidelines.
Open issues for:
- Bugs (with macOS version, Mac model, log snippet)
- Feature requests aligned with the roadmap
- Model compatibility reports
MIT. Copyright (c) 2026 GenMLX contributors.
The MIT license places no restriction on commercial use; if you build a product on GenMLX, we'd love to hear about it.
GenMLX stands on the shoulders of:
- mlx + mlx-lm — Apple's machine learning framework. The entire serving brain wraps mlx-lm.
- jaccl — Distributed all-reduce primitives over Thunderbolt + RDMA, originally from EXO Labs. We use the rltakashige/mlx-jaccl-fix-small-recv fork which adds critical Ampere-era stability fixes.
- DeepSeek-AI, Qwen Team, ZhipuAI — for open-weight models that make this whole thing worth building.
- Hugging Face — model hosting + the
huggingface_hubPython SDK.
This project also draws design inspiration from the broader MLX community — particularly EXO Labs for proving the fundamental architecture of distributed Apple Silicon inference is real, and the mlx-community Hugging Face org for keeping the model zoo populated.