A Kubernetes operator that orchestrates fleets of `inferenced` daemons across Apple Silicon hosts.
inferenced-operator watches two custom resources:

- **InferenceHost** — one per `inferenced` daemon in your fleet. Declares the daemon's HTTP endpoint and labels.
- **InferenceModel** — declares a model that should be served by the cluster. Has a host selector + replica count.
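A minimal pair of manifests might look like the following sketch. The API group and field names here are assumptions for illustration, not the actual schema; see the CRD Reference doc for the real fields.

```yaml
# Hypothetical sketch — API group and field names are assumptions.
apiVersion: inferenced.dormlab.dev/v1alpha1
kind: InferenceHost
metadata:
  name: amelia
spec:
  endpoint: http://amelia.local:11434   # the daemon's HTTP endpoint
  labels:
    chip: m2-ultra
---
apiVersion: inferenced.dormlab.dev/v1alpha1
kind: InferenceModel
metadata:
  name: qwen-3b
spec:
  model: mlx-community/Qwen2.5-3B-Instruct-4bit
  replicas: 2
  hostSelector:
    chip: m2-ultra                      # match hosts by label
```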
For each InferenceModel, the operator picks compatible hosts, calls each chosen daemon's POST /admin/models to load the model, and creates a Kubernetes Service + EndpointSlice so cluster pods can reach the fleet under a single in-cluster DNS name like qwen-3b.inferenced.svc.cluster.local:11434.
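Since the Service has no pod selector, the operator manages the EndpointSlice itself, pointing it at the daemons' host IPs. For a model named `qwen-3b`, the published objects might look roughly like this sketch (label keys, names, and addresses are illustrative assumptions):

```yaml
# Sketch of what the operator creates — details are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: qwen-3b
  namespace: inferenced
spec:                       # no selector: endpoints are operator-managed
  ports:
    - port: 11434
      protocol: TCP
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: qwen-3b-1
  namespace: inferenced
  labels:
    kubernetes.io/service-name: qwen-3b   # ties the slice to the Service
addressType: IPv4
ports:
  - port: 11434
    protocol: TCP
endpoints:
  - addresses: ["192.168.1.10"]   # e.g. mac-1
  - addresses: ["192.168.1.11"]   # e.g. mac-2
```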
```
                     ┌──────────────────────────────────────────┐
InferenceModel ────► │         inferenced-operator              │
InferenceHost  ────► │      (kube-rs Controller, Rust)          │
                     └────────────────┬─────────────────────────┘
                                      │ HTTP /admin/models
                                      ▼
  ┌──────────────────────────────────────────────────┐
  │  inferenced    ║   inferenced    ║   inferenced  │
  │   (mac-1)      ║    (mac-2)      ║    (mac-3)    │
  └─────┬──────────╨────────┬────────╨────────┬──────┘
        │ Apple GPU         │                 │
        ▼                   ▼                 ▼
    Metal/MLX           Metal/MLX         Metal/MLX
```
The operator publishes a Service named after each InferenceModel:

```sh
kubectl apply -f examples/host-amelia.yaml
kubectl apply -f examples/model-qwen-3b.yaml

kubectl get inferencemodel
# NAME      MODEL                                    REPLICAS   READY   AGE
# qwen-3b   mlx-community/Qwen2.5-3B-Instruct-4bit   2          2       30s

kubectl run -it --rm chat --image=alpine -- sh
> apk add curl
> curl -s qwen-3b.inferenced.svc.cluster.local:11434/v1/chat/completions \
    -H 'content-type: application/json' \
    -d '{"model":"mlx-community/Qwen2.5-3B-Instruct-4bit","messages":[{"role":"user","content":"hi"}]}'
```

Further reading:

| Doc             | Covers                                                             |
| --------------- | ------------------------------------------------------------------ |
| Architecture    | The reconcile loops, what the operator owns, what the daemon owns. |
| Installation    | Helm chart install + RBAC + CRD installation.                      |
| CRD Reference   | Full schema for InferenceHost and InferenceModel, with examples.   |
| Examples        | Working YAML for hosts + models.                                   |
| Development     | Building, running locally against a kubeconfig, contributing.      |
| Troubleshooting | Common reconcile failures and how to diagnose.                     |
```sh
# 1. Install the operator (CRDs included).
helm install inferenced-operator \
  oci://ghcr.io/dormlab/charts/inferenced-operator \
  --namespace inferenced \
  --create-namespace

# 2. Tell it about your hosts.
kubectl apply -f examples/host-amelia.yaml

# 3. Declare a model.
kubectl apply -f examples/model-qwen-3b.yaml

# 4. Wait for ready.
kubectl get inferencemodel -w
```

To install without Helm:

```sh
# CRDs.
kubectl apply -f crds/

# Operator (use your own image or build from source).
kubectl create namespace inferenced
kubectl apply -n inferenced -f deploy/   # see docs/installation.md
```

v0.1 — InferenceHost and InferenceModel CRDs, two reconcile loops, owner-references on Service and EndpointSlice for proper GC. Tests cover CRD parsing and reconcile error paths.
Roadmap in docs/architecture.md#roadmap.
MIT. See LICENSE.