inferenced

Inference daemon for Apple Silicon. macOS-native MLX serving with an OpenAI-compatible HTTP API and Tailscale-aware source filtering.

inferenced is a small Rust daemon that runs on a macOS host. It supervises mlx_lm.server (Apple's reference LLM server, which uses Metal under the hood for GPU-accelerated inference) and layers on restart-with-backoff process supervision, source-CIDR filtering, Prometheus metrics, and a clean OpenAI-compatible HTTP surface that's safe to expose on a Tailscale-only port.

It exists because LLM inference on Apple Silicon must run as native macOS code — Apple's Virtualization framework doesn't expose Metal/MPS/ANE to Linux guests. So if your workloads (Kubernetes pods, scripts, agents) live in Linux land but you want to use the GPU your Mac mini already has, you need a daemon on the host that serves inference and a clean way for clients to call it. This is that daemon.

It pairs with inferenced-operator, a Kubernetes operator that orchestrates fleets of inferenced instances across a cluster of Apple Silicon hosts. You can also run inferenced standalone — curl localhost:11434/v1/chat/completions and you're done.

   ┌──────────────────┐         ┌────────────────────┐
   │   any client     │  HTTP   │    inferenced      │
   │ (curl, kubectl,  ├────────►│  (axum, supervisor,│
   │  pod, script)    │         │   metrics, auth)   │
   └──────────────────┘         └─────────┬──────────┘
                                          │ proxy /v1/*
                                          ▼
                                ┌────────────────────┐
                                │   mlx_lm.server    │
                                │  (Python, MLX)     │
                                └─────────┬──────────┘
                                          ▼
                                ┌────────────────────┐
                                │ Apple Silicon GPU  │
                                │ via Metal          │
                                └────────────────────┘
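
Because every /v1/* route is proxied through untouched, any OpenAI-compatible client or SDK can point at inferenced unchanged. A minimal sketch, assuming the upstream mlx_lm.server exposes the standard /v1/models route:

# Ask the proxied upstream which model it is serving
curl http://localhost:11434/v1/models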

Quick start

# Prereqs: Rust 1.75+, Python 3.10+, Apple Silicon Mac
brew install python@3.12
python3.12 -m pip install --user --break-system-packages mlx-lm

# Build
cargo build --release --target aarch64-apple-darwin

# Run (defaults to Qwen2.5-3B-Instruct-4bit and binds 0.0.0.0:11434)
./target/aarch64-apple-darwin/release/inferenced

# In another terminal
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",
    "messages": [{"role": "user", "content": "hello"}],
    "stream": false
  }'
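
Once the daemon is up, /healthz reports whether the supervised mlx_lm.server process is actually responsive, which makes for a quick sanity check before pointing clients at it:

# Liveness: inferenced validates the upstream Python process
curl http://localhost:11434/healthz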

Documentation

  • Architecture — how inferenced fits between clients, MLX, and the rest of your infrastructure.
  • Installation — install on a single Mac: Homebrew, Rust toolchain, mlx-lm, and a launchd LaunchDaemon for boot persistence.
  • Configuration — every CLI flag and env var.
  • HTTP API — the OpenAI-compatible /v1/* surface, plus /healthz, /metrics, and /.
  • Metrics — Prometheus metric reference.
  • Development — building from source, running tests, contributing.
  • Troubleshooting — "It's not starting", "I get source not allowed", "tokens/sec is bad".

Examples
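
As a minimal streaming sketch building on the quick start's defaults: the same chat endpoint streams tokens over SSE when "stream" is true (curl's -N disables output buffering so tokens print as they arrive).

curl -N http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",
    "messages": [{"role": "user", "content": "write a haiku"}],
    "stream": true
  }'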

Features

  • Single static binary, ~3 MB (cargo build --release).
  • OpenAI-compatible — every /v1/* route is transparently proxied; SSE streaming preserved end-to-end.
  • Source-CIDR filtering — defaults to Tailscale + loopback, configurable.
  • Process supervision — restarts mlx_lm.server with capped exponential backoff.
  • Prometheus /metrics — request counters by route + status class (see the example after this list).
  • Health checks — /healthz validates the upstream Python process.
  • Graceful shutdown — SIGTERM propagates to mlx_lm.server.
  • launchd LaunchDaemon template for boot persistence.
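
Since /metrics speaks the Prometheus text format, a plain curl is enough to inspect it. The authoritative metric names live in the Metrics reference, so the grep below is only an illustrative filter:

# Scrape the Prometheus endpoint and filter for the request counters
curl -s http://localhost:11434/metrics | grep -i request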

Status

v0.1 — single model per daemon, fixed at startup via --model. Multi-model hot-loading is the v0.2 goal: an admin API (POST /admin/models/{load,unload}) that the operator can drive. See the architecture doc.

License

MIT. See LICENSE.
