Inference daemon for Apple Silicon. macOS-native MLX serving with an OpenAI-compatible HTTP API and Tailscale-aware source filtering.
`inferenced` is a small Rust daemon that runs on a macOS host. It keeps
`mlx_lm.server` (Apple's reference MLX LLM server, which uses Metal under
the hood for GPU-accelerated inference) under proper process supervision,
and adds source-CIDR filtering, Prometheus metrics, and a clean
OpenAI-compatible HTTP surface that's safe to expose on a Tailscale-only
port.
It exists because LLM inference on Apple Silicon must run as native macOS code — Apple's Virtualization framework doesn't expose Metal/MPS/ANE to Linux guests. So if your workloads (Kubernetes pods, scripts, agents) live in Linux land but you want to use the GPU your Mac mini already has, you need a daemon on the host that serves inference and a clean way for clients to call it. This is that daemon.
It pairs with `inferenced-operator`,
a Kubernetes operator that orchestrates fleets of `inferenced` instances
across a cluster of Apple Silicon hosts. You can also run `inferenced`
standalone — `curl localhost:11434/v1/chat/completions` and you're done.
```
┌──────────────────┐          ┌────────────────────┐
│    any client    │   HTTP   │     inferenced     │
│ (curl, kubectl,  ├─────────►│ (axum, supervisor, │
│  pod, script)    │          │   metrics, auth)   │
└──────────────────┘          └─────────┬──────────┘
                                        │ proxy /v1/*
                                        ▼
                              ┌────────────────────┐
                              │   mlx_lm.server    │
                              │   (Python, MLX)    │
                              └─────────┬──────────┘
                                        ▼
                              ┌────────────────────┐
                              │ Apple Silicon GPU  │
                              │     via Metal      │
                              └────────────────────┘
```
```bash
# Prereqs: Rust 1.75+, Python 3.10+, Apple Silicon Mac
brew install python@3.12
python3.12 -m pip install --user --break-system-packages mlx-lm

# Build
cargo build --release --target aarch64-apple-darwin

# Run (defaults to Qwen2.5-3B-Instruct-4bit and binds 0.0.0.0:11434)
./target/aarch64-apple-darwin/release/inferenced

# In another terminal
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",
    "messages": [{"role": "user", "content": "hello"}],
    "stream": false
  }'
```
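Streaming works the same way: every `/v1/*` route is proxied with SSE preserved, so flipping `"stream": true` yields OpenAI-style `data:` chunks (the exact chunk fields come from `mlx_lm.server`):

```bash
# Same request, streamed; -N disables curl's buffering so chunks print live
curl -N http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",
    "messages": [{"role": "user", "content": "hello"}],
    "stream": true
  }'
```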
| Doc | What it covers |
|---|---|
| Architecture | How `inferenced` fits between clients, MLX, and the rest of your infrastructure. |
| Installation | Install on a single Mac — Homebrew, Rust toolchain, `mlx-lm`, and a `launchd` LaunchDaemon for boot persistence. |
| Configuration | Every CLI flag and env var. |
| HTTP API | OpenAI-compatible `/v1/*`, plus `/healthz`, `/metrics`, `/`. |
| Metrics | Prometheus metric reference. |
| Development | Building from source, running tests, contributing. |
| Troubleshooting | "It's not starting", "I get source not allowed", "tokens/sec is bad". |
- `examples/launchd/dev.dormlab.inferenced.plist` — system-level LaunchDaemon (runs as root for Metal access); see the install sketch below.
- `examples/kubernetes/` — Service + EndpointSlice manifests so cluster pods can call your fleet of macOS hosts as a single in-cluster Service.
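A minimal install sketch for the LaunchDaemon, assuming the job label matches the plist filename (`dev.dormlab.inferenced`):

```bash
# Install at the system level so the daemon runs as root with Metal access
sudo cp examples/launchd/dev.dormlab.inferenced.plist /Library/LaunchDaemons/
sudo launchctl bootstrap system /Library/LaunchDaemons/dev.dormlab.inferenced.plist

# Verify the job is loaded and running
sudo launchctl print system/dev.dormlab.inferenced
```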
- ✅ Single static binary, ~3 MB (`cargo build --release`).
- ✅ OpenAI-compatible — every `/v1/*` route is transparently proxied; SSE streaming preserved end-to-end.
- ✅ Source-CIDR filtering — defaults to Tailscale + loopback, configurable.
- ✅ Process supervision — restarts `mlx_lm.server` with capped exponential backoff.
- ✅ Prometheus `/metrics` — request counters by route + status class.
- ✅ Healthchecks — `/healthz` validates the upstream Python process.
- ✅ Graceful shutdown — SIGTERM propagates to `mlx_lm.server`.
- ✅ `launchd` LaunchDaemon template for boot persistence.
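Both operational endpoints are easy to smoke-test from the shell (the exact metric names are whatever `/metrics` exports — see the Metrics doc for the reference):

```bash
# /healthz succeeds only while the supervised mlx_lm.server is healthy
curl -i http://localhost:11434/healthz

# /metrics is Prometheus text exposition; peek at the first lines
curl -s http://localhost:11434/metrics | head -n 20
```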
v0.1 — single model per daemon, fixed at startup via `--model`. Multi-model
hot-loading is the v0.2 goal: an admin API for `POST /admin/models/{load,unload}`
that the operator can drive. See the architecture doc.
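As a purely hypothetical sketch of that v0.2 shape (the route names come from the roadmap above; the request body is an assumption, not a settled schema):

```bash
# Hypothetical v0.2 admin API — not implemented in v0.1
curl -X POST http://localhost:11434/admin/models/load \
  -H 'Content-Type: application/json' \
  -d '{"model": "mlx-community/Qwen2.5-7B-Instruct-4bit"}'

curl -X POST http://localhost:11434/admin/models/unload \
  -H 'Content-Type: application/json' \
  -d '{"model": "mlx-community/Qwen2.5-3B-Instruct-4bit"}'
```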
MIT. See LICENSE.