inferenced

Inference daemon for Apple Silicon. macOS-native MLX serving with an OpenAI-compatible HTTP API and Tailscale-aware source filtering.

inferenced is a small Rust daemon that runs on a macOS host. It supervises mlx_lm.server (Apple's reference LLM server, which uses Metal under the hood for GPU-accelerated inference) and layers on restart-with-backoff process supervision, source-CIDR filtering, Prometheus metrics, and a clean OpenAI-compatible HTTP surface that's safe to expose on a Tailscale-only port.

It exists because LLM inference on Apple Silicon must run as native macOS code — Apple's Virtualization framework doesn't expose Metal/MPS/ANE to Linux guests. So if your workloads (Kubernetes pods, scripts, agents) live in Linux land but you want to use the GPU your Mac mini already has, you need a daemon on the host that serves inference and a clean way for clients to call it. This is that daemon.

It pairs with inferenced-operator, a Kubernetes operator that orchestrates fleets of inferenced instances across a cluster of Apple Silicon hosts. You can also run inferenced standalone — curl localhost:11434/v1/chat/completions and you're done.

   ┌──────────────────┐         ┌────────────────────┐
   │   any client     │  HTTP   │    inferenced      │
   │ (curl, kubectl,  ├────────►│  (axum, supervisor,│
   │  pod, script)    │         │   metrics, auth)   │
   └──────────────────┘         └─────────┬──────────┘
                                          │ proxy /v1/*
                                          ▼
                                ┌────────────────────┐
                                │   mlx_lm.server    │
                                │  (Python, MLX)     │
                                └─────────┬──────────┘
                                          ▼
                                ┌────────────────────┐
                                │ Apple Silicon GPU  │
                                │ via Metal          │
                                └────────────────────┘
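
Because every /v1/* route is proxied through untouched, any OpenAI-compatible client or SDK can point at inferenced unchanged. A minimal sketch, assuming the upstream mlx_lm.server exposes the standard /v1/models route:

# Ask the proxied upstream which model it is serving
curl http://localhost:11434/v1/models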

Quick start

# Prereqs: Rust 1.75+, Python 3.10+, Apple Silicon Mac
brew install python@3.12
python3.12 -m pip install --user --break-system-packages mlx-lm

# Build
cargo build --release --target aarch64-apple-darwin

# Run (defaults to Qwen2.5-3B-Instruct-4bit and binds 0.0.0.0:11434)
./target/aarch64-apple-darwin/release/inferenced

# In another terminal
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",
    "messages": [{"role": "user", "content": "hello"}],
    "stream": false
  }'
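
Once the daemon is up, /healthz reports whether the supervised mlx_lm.server process is actually responsive, which makes for a quick sanity check before pointing clients at it:

# Liveness: inferenced validates the upstream Python process
curl http://localhost:11434/healthz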

Documentation

  • Architecture — how inferenced fits between clients, MLX, and the rest of your infrastructure.
  • Installation — install on a single Mac: Homebrew, Rust toolchain, mlx-lm, and a launchd LaunchDaemon for boot persistence.
  • Configuration — every CLI flag and env var.
  • HTTP API — the OpenAI-compatible /v1/* surface, plus /healthz, /metrics, and /.
  • Metrics — Prometheus metric reference.
  • Development — building from source, running tests, contributing.
  • Troubleshooting — "It's not starting", "I get source not allowed", "tokens/sec is bad".

Examples
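
As a minimal streaming sketch building on the quick start's defaults: the same chat endpoint streams tokens over SSE when "stream" is true (curl's -N disables output buffering so tokens print as they arrive).

curl -N http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",
    "messages": [{"role": "user", "content": "write a haiku"}],
    "stream": true
  }'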

Features

  • Single static binary, ~3 MB (cargo build --release).
  • OpenAI-compatible — every /v1/* route is transparently proxied; SSE streaming preserved end-to-end.
  • Source-CIDR filtering — defaults to Tailscale + loopback, configurable.
  • Process supervision — restarts mlx_lm.server with capped exponential backoff.
  • Prometheus /metrics — request counters by route + status class (see the example after this list).
  • Health checks — /healthz validates the upstream Python process.
  • Graceful shutdown — SIGTERM propagates to mlx_lm.server.
  • launchd LaunchDaemon template for boot persistence.
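
Since /metrics speaks the Prometheus text format, a plain curl is enough to inspect it. The authoritative metric names live in the Metrics reference, so the grep below is only an illustrative filter:

# Scrape the Prometheus endpoint and filter for the request counters
curl -s http://localhost:11434/metrics | grep -i request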

Status

v0.1 — single model per daemon, fixed at startup via --model. Multi-model hot-loading is the v0.2 goal: an admin API (POST /admin/models/{load,unload}) that the operator can drive. See the architecture doc.

License

MIT. See LICENSE.
