GenMLX

Mesh your Apple Silicon Macs into a high-performance LLM serving cluster. Web UI, OpenAI-compatible API, one-line install.

GenMLX turns N Apple Silicon Macs (M-series) into a tensor-parallel inference cluster for large language models. Built on Apple's MLX framework, with Thunderbolt 5 mesh networking for low-latency cross-node communication. Designed to be productive in 15 minutes from curl | bash to first token.

⚠️ Pre-alpha — currently at v0.1.0.dev0 (Phase 0 of the 7-phase build to v1.0.0). The architecture and roadmap below describe the target state. See the roadmap for what works today vs what's still being built.

Why GenMLX

If you own multiple Apple Silicon Macs and want to:

Serve large models locally that don't fit in a single Mac's unified memory (DeepSeek V4, Qwen3-Coder-Next, GLM-4.7, etc.)
Keep inference private — no API keys, no rate limits, no data leaving your network
Reuse hardware you already have — three M1 Maxes + one M3 Ultra still serve 100B+ parameter models
Get fast time-to-first-token at long contexts thanks to disk-backed L2 KV caching
Integrate with your existing tools — Claude Code, Cline, opencode, OpenWebUI all work out of the box via the OpenAI-compatible API

…then GenMLX is the simplest way to do that today.

It assumes a fixed, owned topology (1-6 Macs on the same private network) — that's the niche. If you need elastic, dynamic, heterogeneous device discovery across phones/laptops/desktops, look at EXO Labs instead.

Features

Core

🖥️ Web UI dashboard — manage models, serve, monitor, set up the cluster, all from http://master:6789
🔌 OpenAI-compatible API — /v1/chat/completions, /v1/completions, /v1/models, drop-in for any OpenAI client
🤖 Tool/function calling — Hermes-style + DeepSeek-style + GLM-style tool parsing with streaming
🌐 Native Anthropic API adapter — Claude Code points at the cluster directly
🧠 Thinking-token routing — <think> blocks correctly routed to reasoning_content for compatible clients

Cluster orchestration

🧩 Master-agent over HTTP — no SSH for management plane, just bearer-token API calls
🚀 Auto-registration — agents announce themselves on boot; UI sees them within 30 seconds
🎯 Mesh setup wizard — UI generates per-node TB5 IP plans for N=1-6 nodes + verifies link-by-link; supports both full-mesh and ring topologies
⚡ Flexible networking — TB5 RDMA (best), TB4/TB3 RDMA, 10 GbE, or 1 GbE; mesh wizard detects + recommends per cluster. jaccl over TB, TCP backend over Ethernet.
🧮 Mix any M-series Macs — heterogeneous RAM supported. Cluster auto-selects Tensor Parallel for homogeneous fleets, Pipeline Parallel for mixed (so a 192 GB Mac Studio + 32 GB Mac mini + 96 GB MacBook Pro all serve as one cluster, no manual sharding)
🛠️ Per-node multi-path storage — each node has its own configurable model storage paths (home dir, external SSDs, NAS mounts) managed from the UI
✅ Cross-node model presence check — UI badges show which model is on which node; blocks Serve until all selected nodes have the model

Performance

💾 L2 disk-backed prompt cache — 200+ GB SSD cache for KV state; turns 88-minute cold prefill into 37-second L2 hit
🎯 L2 boundary snapshots — saves cache at the system+tools boundary so different conversations sharing a system prompt reuse it
🔀 Continuous batching — BATCHED=1 dispatcher serves multiple concurrent requests with prefix sharing
📊 Smart KV quantization — optional int4/int8 KV for memory-bound workloads (default off — net loss on Apple Silicon)
🧩 Hybrid attention support — linear (GDN) + full attention models like Qwen3-Coder-Next handled correctly

Operations

📊 Live telemetry — per-node CPU/GPU/RAM/SSD via macmon integration; in-flight job tracking
🔄 Browser-refresh-safe — long-running jobs (download/sync/serve) reattach when you reload the UI
🔑 Token-based auth — bearer tokens for all master/agent traffic; no SSH keys to manage
📦 Model lifecycle — download from Hugging Face, sync to all nodes (delta + resume), validate, serve, swap, delete from UI
🎛️ Integrations panel — copy-paste configs for Claude Code, Cline, opencode, OpenWebUI

Quick Start

v1.0.0-dev — public installer (curl | bash) lands in Phase 6 of the implementation plan. For now, build from source:

From source (developers)

git clone https://github.com/crystech/GenMLX.git
cd GenMLX
bash scripts/dev/setup.sh
source .venv/bin/activate
genmlx version

Once v1.0.0 ships

Master node (the Mac you want to drive the UI from):

curl -fsSL https://raw.githubusercontent.com/crystech/GenMLX/main/install.sh | bash -s -- --master

The installer:

Installs Python 3.11 + uv + macmon via Homebrew
Sets up a virtual environment at ~/.genmlx/venv
Generates a 32-byte bearer token
Registers a launchd service that starts GenMLX on boot
Opens http://localhost:6789 in your browser

Worker nodes (additional Macs):

curl -fsSL https://genmlx.dev/install.sh | bash -s -- \
  --agent \
  --master-url http://<master-mac>.local:6789 \
  --token gmx_<token-from-master-install-output>

The agent auto-registers with the master within 30 seconds. Use the UI's Mesh Setup tab to wire up TB5 and generate the per-node IP plan.

Architecture

                          ┌──────────────────────────────────┐
                          │  Master Mac                       │
                          │                                  │
                          │  [Web UI on :6789 — dashboard]   │
                          │  [REST + WebSocket API]          │
                          │  [Mesh planner]                  │
                          │  [SQLite registry + jobs]        │
                          │                                  │
                          │  [Dispatcher rank 0]             │
                          └────┬────┬────┬────┬──────────────┘
                               │ HTTP + bearer-token auth
                               ▼ ▼ ▼ ▼
                      ┌───────┐ ┌───────┐ ┌───────┐
                      │ Agent │ │ Agent │ │ Agent │  ...up to 6 nodes
                      │ rank1 │ │ rank2 │ │ rank3 │
                      │       │ │       │ │       │
                      │  disp │ │  disp │ │  disp │
                      └───┬───┘ └───┬───┘ └───┬───┘
                          │         │         │
                          └─────────┴─────────┘
                          TB5 mesh — jaccl RDMA all-reduce
                          for the inference data plane

Three layers

Master — The orchestrator. Hosts the UI, the REST API, the SQLite agent registry, the mesh planner, the job tracker. Always also runs rank 0 of the dispatcher.
Agents — One per worker Mac. Lightweight HTTP daemons that respond to master commands (file sync, command exec, rank spawn, mesh configure). Stateless except for the local node's config + the dispatcher rank they're hosting.
Dispatcher — The serving brain. A 3000+ LOC FastAPI/http.server app that wraps mlx-lm, handles continuous batching, runs the L2 cache, parses thinking tokens + tool calls, exposes OpenAI/Anthropic APIs. The same dispatcher binary runs on every node; ranks communicate via mx.distributed over the TB5 mesh.

The split between master and agent (control plane) vs the dispatcher (data plane) is the central design choice: HTTP for low-frequency control + token-secured remote ops; jaccl/TB5 for the high-bandwidth low-latency inference traffic.

See ARCHITECTURE.md for the full design rationale.

Hardware Requirements

Component	Minimum	Recommended
Macs	1 M-series Mac	2-6 M-series Macs
RAM per Mac	32 GB	96 GB / 192 GB / 512 GB (mix is fine — see below)
Storage per Mac	50 GB free	500 GB+ (for models + L2 cache)
macOS	14 Sonoma	15 Sequoia
Network (single node)	—	—
Network (2+ nodes)	1 GbE / Wi-Fi (degraded perf)	Thunderbolt 5 RDMA
Mac Studio M3 Ultra	great fit (6 TB5 ports, scales to 6-node mesh)	⭐ ideal

GenMLX works on a single Mac (no cluster) and scales to 6 Macs in a full TB5 mesh. Beyond 6 nodes, Mac Studio's 6 TB5 ports run out and you'd need a different topology (not supported in v1).

Mixing Macs of different RAM tiers ✅

You don't need matching Macs. A 192 GB Mac Studio + 32 GB Mac mini + 96 GB MacBook Pro can serve as one cluster — GenMLX detects the RAM mix and picks the right strategy:

Cluster type	Parallelism	Per-node sizing
Homogeneous (all RAM within ±10%)	Tensor Parallel (default)	Equal share per node
Heterogeneous (mixed RAM)	Pipeline Parallel (auto)	Layers weighted by per-node RAM

The UI's cluster capacity card shows both options live (TP capacity: N × min(RAM) vs PP capacity: sum(RAM)). PP decode is ~10-30% slower than TP at the same total RAM (pipeline bubble), so for budget-conscious mixed clusters this is the right tradeoff. Override the auto-selector from the Serve modal if you want to force TP or PP.

Networking — TB5 is best, but TB4/TB3/Ethernet all work ✅

GenMLX supports the full range of Mac networking, from Thunderbolt 5 RDMA down to 1 Gigabit Ethernet. Pipeline Parallel needs much less bandwidth than Tensor Parallel (one rank-to-next-rank send per layer, vs full all-reduce), so slower networks are workable — at honest performance caveats.

Transport	Latency	Useable bandwidth	What works well
TB5 RDMA	<1 µs	~80 Gbps	Everything. Best decode latency under TP.
TB4 / TB3 RDMA	~5 µs	~25-40 Gbps	TP for most models. Slightly slower decode than TB5.
10 Gigabit Ethernet	~10 µs	~9.5 Gbps	PP works great. TP usable but ~3× slower decode than TB.
1 Gigabit Ethernet	~100 µs	~940 Mbps	PP for small-to-mid models. TP all-reduce dominates decode — not recommended.

The mesh wizard detects each node's fastest interface, identifies the highest-quality transport common to all nodes, and picks a strategy + topology accordingly. You'll see a "Reason" line in the UI explaining the choice — and you can always override.

Per-tier performance numbers in the docs are currently approximate (~30-40% slower vs TB5, etc.) — measured numbers land in Phase 7 after the maintainer benchmarks across actual infrastructure. Until then, the UI tooltip says "approximate, pending measurement" so you know what you're looking at.

Networking note: GenMLX does not require an external switch. TB5 is point-to-point; the mesh wizard generates a static-IP plan where every pair of directly-connected nodes lives on its own /30 subnet. Your existing LAN handles management traffic only.

Supported Models

Verified working as of v1.0.0-dev:

Family	Variants	Notes
DeepSeek V4-Flash	mxfp4, 6bit, 8bit	Hybrid attention; works with spicyneuron/mlx-lm@fix-ds4-cache-reuse pin
Qwen3-Coder-Instruct	30B-A3B, 4bit/8bit (MoE)	Continuous batching supported
Qwen3-Coder-Next	bf16	Linear+full hybrid attention (`qwen3_next`)
Qwen3.6-27B	4bit, 8bit	Dense; ideal for single-node
Qwen3 80B / 235B	various quants	TP=4 recommended
GLM-4.7	4bit, 8bit	92 layers; launch-overhead bound
Llama 3.x	4bit, 8bit	Standard transformer
Mistral / Mixtral	various	Standard architecture
Any MLX-compatible model	—	Use `genmlx model get <hf_repo_id>`

Tip: For models with custom architectures (Qwen3-Coder-Next, V4-Flash, etc.), GenMLX ships TP sharding patches so they load correctly under tensor parallel. See docs/models/ for per-model setup notes.

Performance

Measured on a 4-Mac M3 Ultra cluster (TP=4), TB5 full-mesh, jaccl backend:

Model	Decode (tps)	Prefill (pp-tps)	First-token	Notes
Qwen3.6-27B-8bit	~50	~840	<1s @ 20k	Dense, ideal for single user
Qwen3-Coder-30B-A3B-4bit	50-65	800	<1s @ 20k	MoE, continuous batching
DeepSeek-V4-Flash-mxfp4	38 short / 25 @ 75k	~700	<2s @ 20k	Hybrid attention; L2 cache shines
GLM-4.7-4bit	22	~500	<2s @ 20k	92 layers, launch-bound
Qwen3-Coder-480B-A35B-8bit	~33 (was supported, now retired)	~600	3s @ 20k	Removed — storage cost too high

L2 cache impact (V4-Flash, 508k-token prompt):

	Wall time
Cold prefill	88 minutes
L2 hit	37 seconds
Speedup	143×

Numbers are not contractual — your mileage varies with model, quant, prompt length, and concurrency.

Documentation

Doc	Status
README.md — you are here	✅
CHANGELOG.md	✅
LICENSE	✅
CONTRIBUTING.md	🚧 Phase 7
ARCHITECTURE.md	🚧 Phase 7
INSTALL.md	🚧 Phase 7
MESH_SETUP.md	🚧 Phase 7
`docs/api/master.md` (OpenAPI)	🚧 Phase 7
`docs/api/agent.md` (OpenAPI)	🚧 Phase 7
`docs/models/<model>.md`	🚧 Phase 7
`docs/performance-tuning.md`	🚧 Phase 7
`docs/troubleshooting.md`	🚧 Phase 7

All docs are versioned with the code and updated alongside features. See CHANGELOG.md for the running log.

Roadmap

Path to v1.0.0

Version	Milestone	Status
`0.1.0.dev0`	Phase 0 — Repo scaffold, license, README, CHANGELOG, package skeleton	✅ (current)
`0.1.0`	Phase 1 — Clone dispatcher 1:1 from production cluster, parity test, single-node serve works	🚧
`0.2.0`	Phase 2 — Common foundations (config, auth, paths) + agent read-only endpoints	🚧
`0.3.0`	Phase 3 — Agent mutating endpoints (exec, file upload, sync)	🚧
`0.4.0`	Phase 4 — Cluster serve orchestration (master tells agents to spawn ranks; no SSH)	🚧
`0.5.0`	Phase 5 — Agent auto-registration + mesh wizard + multi-path storage management	🚧
`0.6.0`	Phase 6 — Installer + launchd integration + Homebrew formula	🚧
`1.0.0-rc1`	Phase 7 — Documentation, polish, lint, smoke-test on a real 4-Mac cluster	🚧
`1.0.0`	Stable release after RC iteration	🚧

Target for v1.0.0-rc1: 8 focused weeks from Phase 0.

Versioning follows Semantic Versioning with PEP 440 dev/rc tags. Anything 0.x may have breaking changes between minor versions; 1.x will preserve API compatibility.

v1.1 (planned)

Token rotation
Tailscale / WireGuard bundled networking
Multi-instance serving (multiple models per cluster, route per request)
Heterogeneous nodes (different RAM/SSD sizes properly recognized)
Apple notarization for the installer
Homebrew cask

v2.0 (long-term)

HA master (leader election, registry replication)
Cross-platform agent (Linux GPU nodes alongside Macs)
Auto-mesh discovery (sniff TB5 connectivity, skip manual wiring)
Disaggregated prefill (GPU box + Mac decode)

See CHANGELOG.md for shipped features.

How is this different from...

	GenMLX	EXO Labs	mlx_lm.distributed	llama.cpp + RPC
Topology	Fixed, owned (1-6 Macs)	Dynamic discovery	Manual hostfile	Manual hostfile
Network	TB5 mesh (jaccl)	TB / WiFi / LAN	TB / LAN	LAN / RDMA (Linux)
Web UI	✅ Full dashboard	Partial	❌	❌
OpenAI API	✅	✅	❌ (needs wrapper)	✅ (with patch)
Tool calling	✅ Multiple parsers	Partial	❌	Partial
Multi-path storage	✅ UI-managed	❌	❌	❌
L2 disk cache	✅ Boundary snapshots	❌	❌	KV save on exit only
Continuous batching	✅	❌	❌	✅
Mixed-RAM clusters	✅ TP + PP auto	✅ today	❌	❌
Heterogeneous architectures (Mac + GPU box)	v2	✅ today	❌	❌
One-line install	✅	✅	❌	❌
Cross-platform	macOS (v1)	macOS + Linux	macOS + Linux	macOS + Linux + Win
Designed for	Mac fleet owners	Spare-device clusters	Researchers	Inference power users

Short version: if you have several Macs and want them to act as one inference machine with the lowest-friction UX, GenMLX. If you have a mixed bag of devices and want flexible discovery, EXO. If you want the raw MLX building blocks for research, mlx_lm.distributed.

Contributing

GenMLX is in active development. Contributions welcome once the v1.0.0-rc1 cuts. See CONTRIBUTING.md for setup, lint, test, and PR guidelines.

Open issues for:

Bugs (with macOS version, Mac model, log snippet)
Feature requests aligned with the roadmap
Model compatibility reports

License

The MIT license places no restriction on commercial use; if you build a product on GenMLX, we'd love to hear about it.

Acknowledgments

GenMLX stands on the shoulders of:

mlx + mlx-lm — Apple's machine learning framework. The entire serving brain wraps mlx-lm.
jaccl — Distributed all-reduce primitives over Thunderbolt + RDMA, originally from EXO Labs. We use the rltakashige/mlx-jaccl-fix-small-recv fork which adds critical Ampere-era stability fixes.
DeepSeek-AI, Qwen Team, ZhipuAI — for open-weight models that make this whole thing worth building.
Hugging Face — model hosting + the huggingface_hub Python SDK.

This project also draws design inspiration from the broader MLX community — particularly EXO Labs for proving the fundamental architecture of distributed Apple Silicon inference is real, and the mlx-community Hugging Face org for keeping the model zoo populated.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GenMLX

Table of Contents

Why GenMLX

Features

Core

Cluster orchestration

Performance

Operations

Quick Start

From source (developers)

Once v1.0.0 ships

Architecture

Three layers

Hardware Requirements

Mixing Macs of different RAM tiers ✅

Networking — TB5 is best, but TB4/TB3/Ethernet all work ✅

Supported Models

Performance

Documentation

Roadmap

Path to v1.0.0

v1.1 (planned)

v2.0 (long-term)

How is this different from...

Contributing

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
.github		.github
genmlx		genmlx
packaging/launchd		packaging/launchd
scripts/dev		scripts/dev
tests		tests
vendor		vendor
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
INSTALL.md		INSTALL.md
LICENSE		LICENSE
MESH_SETUP.md		MESH_SETUP.md
README.md		README.md
SECURITY.md		SECURITY.md
install.sh		install.sh
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

GenMLX

Table of Contents

Why GenMLX

Features

Core

Cluster orchestration

Performance

Operations

Quick Start

From source (developers)

Once v1.0.0 ships

Architecture

Three layers

Hardware Requirements

Mixing Macs of different RAM tiers ✅

Networking — TB5 is best, but TB4/TB3/Ethernet all work ✅

Supported Models

Performance

Documentation

Roadmap

Path to v1.0.0

v1.1 (planned)

v2.0 (long-term)

How is this different from...

Contributing

License

Acknowledgments

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages