Skip to content

crystech/GenMLX

Repository files navigation

GenMLX

Mesh your Apple Silicon Macs into a high-performance LLM serving cluster. Web UI, OpenAI-compatible API, one-line install.

License: MIT Python 3.11+ macOS Status

GenMLX turns N Apple Silicon Macs (M-series) into a tensor-parallel inference cluster for large language models. Built on Apple's MLX framework, with Thunderbolt 5 mesh networking for low-latency cross-node communication. Designed to be productive in 15 minutes from curl | bash to first token.

⚠️ Pre-alpha — currently at v0.1.0.dev0 (Phase 0 of the 7-phase build to v1.0.0). The architecture and roadmap below describe the target state. See the roadmap for what works today vs what's still being built.


Table of Contents


Why GenMLX

If you own multiple Apple Silicon Macs and want to:

  • Serve large models locally that don't fit in a single Mac's unified memory (DeepSeek V4, Qwen3-Coder-Next, GLM-4.7, etc.)
  • Keep inference private — no API keys, no rate limits, no data leaving your network
  • Reuse hardware you already have — three M1 Maxes + one M3 Ultra still serve 100B+ parameter models
  • Get fast time-to-first-token at long contexts thanks to disk-backed L2 KV caching
  • Integrate with your existing tools — Claude Code, Cline, opencode, OpenWebUI all work out of the box via the OpenAI-compatible API

…then GenMLX is the simplest way to do that today.

It assumes a fixed, owned topology (1-6 Macs on the same private network) — that's the niche. If you need elastic, dynamic, heterogeneous device discovery across phones/laptops/desktops, look at EXO Labs instead.

Features

Core

  • 🖥️ Web UI dashboard — manage models, serve, monitor, set up the cluster, all from http://master:6789
  • 🔌 OpenAI-compatible API/v1/chat/completions, /v1/completions, /v1/models, drop-in for any OpenAI client
  • 🤖 Tool/function calling — Hermes-style + DeepSeek-style + GLM-style tool parsing with streaming
  • 🌐 Native Anthropic API adapter — Claude Code points at the cluster directly
  • 🧠 Thinking-token routing<think> blocks correctly routed to reasoning_content for compatible clients

Cluster orchestration

  • 🧩 Master-agent over HTTP — no SSH for management plane, just bearer-token API calls
  • 🚀 Auto-registration — agents announce themselves on boot; UI sees them within 30 seconds
  • 🎯 Mesh setup wizard — UI generates per-node TB5 IP plans for N=1-6 nodes + verifies link-by-link; supports both full-mesh and ring topologies
  • Flexible networking — TB5 RDMA (best), TB4/TB3 RDMA, 10 GbE, or 1 GbE; mesh wizard detects + recommends per cluster. jaccl over TB, TCP backend over Ethernet.
  • 🧮 Mix any M-series Macs — heterogeneous RAM supported. Cluster auto-selects Tensor Parallel for homogeneous fleets, Pipeline Parallel for mixed (so a 192 GB Mac Studio + 32 GB Mac mini + 96 GB MacBook Pro all serve as one cluster, no manual sharding)
  • 🛠️ Per-node multi-path storage — each node has its own configurable model storage paths (home dir, external SSDs, NAS mounts) managed from the UI
  • Cross-node model presence check — UI badges show which model is on which node; blocks Serve until all selected nodes have the model

Performance

  • 💾 L2 disk-backed prompt cache — 200+ GB SSD cache for KV state; turns 88-minute cold prefill into 37-second L2 hit
  • 🎯 L2 boundary snapshots — saves cache at the system+tools boundary so different conversations sharing a system prompt reuse it
  • 🔀 Continuous batchingBATCHED=1 dispatcher serves multiple concurrent requests with prefix sharing
  • 📊 Smart KV quantization — optional int4/int8 KV for memory-bound workloads (default off — net loss on Apple Silicon)
  • 🧩 Hybrid attention support — linear (GDN) + full attention models like Qwen3-Coder-Next handled correctly

Operations

  • 📊 Live telemetry — per-node CPU/GPU/RAM/SSD via macmon integration; in-flight job tracking
  • 🔄 Browser-refresh-safe — long-running jobs (download/sync/serve) reattach when you reload the UI
  • 🔑 Token-based auth — bearer tokens for all master/agent traffic; no SSH keys to manage
  • 📦 Model lifecycle — download from Hugging Face, sync to all nodes (delta + resume), validate, serve, swap, delete from UI
  • 🎛️ Integrations panel — copy-paste configs for Claude Code, Cline, opencode, OpenWebUI

Quick Start

v1.0.0-dev — public installer (curl | bash) lands in Phase 6 of the implementation plan. For now, build from source:

From source (developers)

git clone https://github.com/crystech/GenMLX.git
cd GenMLX
bash scripts/dev/setup.sh
source .venv/bin/activate
genmlx version

Once v1.0.0 ships

Master node (the Mac you want to drive the UI from):

curl -fsSL https://raw.githubusercontent.com/crystech/GenMLX/main/install.sh | bash -s -- --master

The installer:

  1. Installs Python 3.11 + uv + macmon via Homebrew
  2. Sets up a virtual environment at ~/.genmlx/venv
  3. Generates a 32-byte bearer token
  4. Registers a launchd service that starts GenMLX on boot
  5. Opens http://localhost:6789 in your browser

Worker nodes (additional Macs):

curl -fsSL https://genmlx.dev/install.sh | bash -s -- \
  --agent \
  --master-url http://<master-mac>.local:6789 \
  --token gmx_<token-from-master-install-output>

The agent auto-registers with the master within 30 seconds. Use the UI's Mesh Setup tab to wire up TB5 and generate the per-node IP plan.


Architecture

                          ┌──────────────────────────────────┐
                          │  Master Mac                       │
                          │                                  │
                          │  [Web UI on :6789 — dashboard]   │
                          │  [REST + WebSocket API]          │
                          │  [Mesh planner]                  │
                          │  [SQLite registry + jobs]        │
                          │                                  │
                          │  [Dispatcher rank 0]             │
                          └────┬────┬────┬────┬──────────────┘
                               │ HTTP + bearer-token auth
                               ▼ ▼ ▼ ▼
                      ┌───────┐ ┌───────┐ ┌───────┐
                      │ Agent │ │ Agent │ │ Agent │  ...up to 6 nodes
                      │ rank1 │ │ rank2 │ │ rank3 │
                      │       │ │       │ │       │
                      │  disp │ │  disp │ │  disp │
                      └───┬───┘ └───┬───┘ └───┬───┘
                          │         │         │
                          └─────────┴─────────┘
                          TB5 mesh — jaccl RDMA all-reduce
                          for the inference data plane

Three layers

  1. Master — The orchestrator. Hosts the UI, the REST API, the SQLite agent registry, the mesh planner, the job tracker. Always also runs rank 0 of the dispatcher.

  2. Agents — One per worker Mac. Lightweight HTTP daemons that respond to master commands (file sync, command exec, rank spawn, mesh configure). Stateless except for the local node's config + the dispatcher rank they're hosting.

  3. Dispatcher — The serving brain. A 3000+ LOC FastAPI/http.server app that wraps mlx-lm, handles continuous batching, runs the L2 cache, parses thinking tokens + tool calls, exposes OpenAI/Anthropic APIs. The same dispatcher binary runs on every node; ranks communicate via mx.distributed over the TB5 mesh.

The split between master and agent (control plane) vs the dispatcher (data plane) is the central design choice: HTTP for low-frequency control + token-secured remote ops; jaccl/TB5 for the high-bandwidth low-latency inference traffic.

See ARCHITECTURE.md for the full design rationale.


Hardware Requirements

Component Minimum Recommended
Macs 1 M-series Mac 2-6 M-series Macs
RAM per Mac 32 GB 96 GB / 192 GB / 512 GB (mix is fine — see below)
Storage per Mac 50 GB free 500 GB+ (for models + L2 cache)
macOS 14 Sonoma 15 Sequoia
Network (single node)
Network (2+ nodes) 1 GbE / Wi-Fi (degraded perf) Thunderbolt 5 RDMA
Mac Studio M3 Ultra great fit (6 TB5 ports, scales to 6-node mesh) ⭐ ideal

GenMLX works on a single Mac (no cluster) and scales to 6 Macs in a full TB5 mesh. Beyond 6 nodes, Mac Studio's 6 TB5 ports run out and you'd need a different topology (not supported in v1).

Mixing Macs of different RAM tiers ✅

You don't need matching Macs. A 192 GB Mac Studio + 32 GB Mac mini + 96 GB MacBook Pro can serve as one cluster — GenMLX detects the RAM mix and picks the right strategy:

Cluster type Parallelism Per-node sizing
Homogeneous (all RAM within ±10%) Tensor Parallel (default) Equal share per node
Heterogeneous (mixed RAM) Pipeline Parallel (auto) Layers weighted by per-node RAM

The UI's cluster capacity card shows both options live (TP capacity: N × min(RAM) vs PP capacity: sum(RAM)). PP decode is ~10-30% slower than TP at the same total RAM (pipeline bubble), so for budget-conscious mixed clusters this is the right tradeoff. Override the auto-selector from the Serve modal if you want to force TP or PP.

Networking — TB5 is best, but TB4/TB3/Ethernet all work ✅

GenMLX supports the full range of Mac networking, from Thunderbolt 5 RDMA down to 1 Gigabit Ethernet. Pipeline Parallel needs much less bandwidth than Tensor Parallel (one rank-to-next-rank send per layer, vs full all-reduce), so slower networks are workable — at honest performance caveats.

Transport Latency Useable bandwidth What works well
TB5 RDMA <1 µs ~80 Gbps Everything. Best decode latency under TP.
TB4 / TB3 RDMA ~5 µs ~25-40 Gbps TP for most models. Slightly slower decode than TB5.
10 Gigabit Ethernet ~10 µs ~9.5 Gbps PP works great. TP usable but ~3× slower decode than TB.
1 Gigabit Ethernet ~100 µs ~940 Mbps PP for small-to-mid models. TP all-reduce dominates decode — not recommended.

The mesh wizard detects each node's fastest interface, identifies the highest-quality transport common to all nodes, and picks a strategy + topology accordingly. You'll see a "Reason" line in the UI explaining the choice — and you can always override.

Per-tier performance numbers in the docs are currently approximate (~30-40% slower vs TB5, etc.) — measured numbers land in Phase 7 after the maintainer benchmarks across actual infrastructure. Until then, the UI tooltip says "approximate, pending measurement" so you know what you're looking at.

Networking note: GenMLX does not require an external switch. TB5 is point-to-point; the mesh wizard generates a static-IP plan where every pair of directly-connected nodes lives on its own /30 subnet. Your existing LAN handles management traffic only.


Supported Models

Verified working as of v1.0.0-dev:

Family Variants Notes
DeepSeek V4-Flash mxfp4, 6bit, 8bit Hybrid attention; works with spicyneuron/mlx-lm@fix-ds4-cache-reuse pin
Qwen3-Coder-Instruct 30B-A3B, 4bit/8bit (MoE) Continuous batching supported
Qwen3-Coder-Next bf16 Linear+full hybrid attention (qwen3_next)
Qwen3.6-27B 4bit, 8bit Dense; ideal for single-node
Qwen3 80B / 235B various quants TP=4 recommended
GLM-4.7 4bit, 8bit 92 layers; launch-overhead bound
Llama 3.x 4bit, 8bit Standard transformer
Mistral / Mixtral various Standard architecture
Any MLX-compatible model Use genmlx model get <hf_repo_id>

Tip: For models with custom architectures (Qwen3-Coder-Next, V4-Flash, etc.), GenMLX ships TP sharding patches so they load correctly under tensor parallel. See docs/models/ for per-model setup notes.


Performance

Measured on a 4-Mac M3 Ultra cluster (TP=4), TB5 full-mesh, jaccl backend:

Model Decode (tps) Prefill (pp-tps) First-token Notes
Qwen3.6-27B-8bit ~50 ~840 <1s @ 20k Dense, ideal for single user
Qwen3-Coder-30B-A3B-4bit 50-65 800 <1s @ 20k MoE, continuous batching
DeepSeek-V4-Flash-mxfp4 38 short / 25 @ 75k ~700 <2s @ 20k Hybrid attention; L2 cache shines
GLM-4.7-4bit 22 ~500 <2s @ 20k 92 layers, launch-bound
Qwen3-Coder-480B-A35B-8bit ~33 (was supported, now retired) ~600 3s @ 20k Removed — storage cost too high

L2 cache impact (V4-Flash, 508k-token prompt):

Wall time
Cold prefill 88 minutes
L2 hit 37 seconds
Speedup 143×

Numbers are not contractual — your mileage varies with model, quant, prompt length, and concurrency.


Documentation

Doc Status
README.md — you are here
CHANGELOG.md
LICENSE
CONTRIBUTING.md 🚧 Phase 7
ARCHITECTURE.md 🚧 Phase 7
INSTALL.md 🚧 Phase 7
MESH_SETUP.md 🚧 Phase 7
docs/api/master.md (OpenAPI) 🚧 Phase 7
docs/api/agent.md (OpenAPI) 🚧 Phase 7
docs/models/<model>.md 🚧 Phase 7
docs/performance-tuning.md 🚧 Phase 7
docs/troubleshooting.md 🚧 Phase 7

All docs are versioned with the code and updated alongside features. See CHANGELOG.md for the running log.


Roadmap

Path to v1.0.0

Version Milestone Status
0.1.0.dev0 Phase 0 — Repo scaffold, license, README, CHANGELOG, package skeleton ✅ (current)
0.1.0 Phase 1 — Clone dispatcher 1:1 from production cluster, parity test, single-node serve works 🚧
0.2.0 Phase 2 — Common foundations (config, auth, paths) + agent read-only endpoints 🚧
0.3.0 Phase 3 — Agent mutating endpoints (exec, file upload, sync) 🚧
0.4.0 Phase 4 — Cluster serve orchestration (master tells agents to spawn ranks; no SSH) 🚧
0.5.0 Phase 5 — Agent auto-registration + mesh wizard + multi-path storage management 🚧
0.6.0 Phase 6 — Installer + launchd integration + Homebrew formula 🚧
1.0.0-rc1 Phase 7 — Documentation, polish, lint, smoke-test on a real 4-Mac cluster 🚧
1.0.0 Stable release after RC iteration 🚧

Target for v1.0.0-rc1: 8 focused weeks from Phase 0.

Versioning follows Semantic Versioning with PEP 440 dev/rc tags. Anything 0.x may have breaking changes between minor versions; 1.x will preserve API compatibility.

v1.1 (planned)

  • Token rotation
  • Tailscale / WireGuard bundled networking
  • Multi-instance serving (multiple models per cluster, route per request)
  • Heterogeneous nodes (different RAM/SSD sizes properly recognized)
  • Apple notarization for the installer
  • Homebrew cask

v2.0 (long-term)

  • HA master (leader election, registry replication)
  • Cross-platform agent (Linux GPU nodes alongside Macs)
  • Auto-mesh discovery (sniff TB5 connectivity, skip manual wiring)
  • Disaggregated prefill (GPU box + Mac decode)

See CHANGELOG.md for shipped features.


How is this different from...

GenMLX EXO Labs mlx_lm.distributed llama.cpp + RPC
Topology Fixed, owned (1-6 Macs) Dynamic discovery Manual hostfile Manual hostfile
Network TB5 mesh (jaccl) TB / WiFi / LAN TB / LAN LAN / RDMA (Linux)
Web UI ✅ Full dashboard Partial
OpenAI API ❌ (needs wrapper) ✅ (with patch)
Tool calling ✅ Multiple parsers Partial Partial
Multi-path storage ✅ UI-managed
L2 disk cache ✅ Boundary snapshots KV save on exit only
Continuous batching
Mixed-RAM clusters ✅ TP + PP auto ✅ today
Heterogeneous architectures (Mac + GPU box) v2 ✅ today
One-line install
Cross-platform macOS (v1) macOS + Linux macOS + Linux macOS + Linux + Win
Designed for Mac fleet owners Spare-device clusters Researchers Inference power users

Short version: if you have several Macs and want them to act as one inference machine with the lowest-friction UX, GenMLX. If you have a mixed bag of devices and want flexible discovery, EXO. If you want the raw MLX building blocks for research, mlx_lm.distributed.


Contributing

GenMLX is in active development. Contributions welcome once the v1.0.0-rc1 cuts. See CONTRIBUTING.md for setup, lint, test, and PR guidelines.

Open issues for:

  • Bugs (with macOS version, Mac model, log snippet)
  • Feature requests aligned with the roadmap
  • Model compatibility reports

License

MIT. Copyright (c) 2026 GenMLX contributors.

The MIT license places no restriction on commercial use; if you build a product on GenMLX, we'd love to hear about it.

Acknowledgments

GenMLX stands on the shoulders of:

  • mlx + mlx-lm — Apple's machine learning framework. The entire serving brain wraps mlx-lm.
  • jaccl — Distributed all-reduce primitives over Thunderbolt + RDMA, originally from EXO Labs. We use the rltakashige/mlx-jaccl-fix-small-recv fork which adds critical Ampere-era stability fixes.
  • DeepSeek-AI, Qwen Team, ZhipuAI — for open-weight models that make this whole thing worth building.
  • Hugging Face — model hosting + the huggingface_hub Python SDK.

This project also draws design inspiration from the broader MLX community — particularly EXO Labs for proving the fundamental architecture of distributed Apple Silicon inference is real, and the mlx-community Hugging Face org for keeping the model zoo populated.

About

**GenMLX** turns N Apple Silicon Macs (M-series) into a tensor-parallel inference cluster for large language models. Built on Apple's MLX framework, with Thunderbolt 5 mesh networking for low-latency cross-node communication. Designed to be productive in 15 minutes from `curl | bash` to first token.

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors