
inferenced-operator

Kubernetes operator that orchestrates fleets of inferenced daemons across Apple Silicon hosts.

inferenced-operator watches two custom resources:

  • InferenceHost — one per inferenced daemon in your fleet. Declares the daemon's HTTP endpoint and labels.
  • InferenceModel — declares a model that should be served by the cluster. Has a host selector + replica count.
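The real schemas live in the CRD Reference and the examples/ directory; the sketch below is illustrative only — the API group/version and field names here are assumptions, not the authoritative spec:

```yaml
# Illustrative sketch — see the CRD Reference for the real schema.
apiVersion: inferenced.dormlab.dev/v1alpha1   # hypothetical group/version
kind: InferenceHost
metadata:
  name: amelia
spec:
  endpoint: http://amelia.local:11434   # the daemon's HTTP endpoint
  labels:
    chip: m2-ultra
---
apiVersion: inferenced.dormlab.dev/v1alpha1   # hypothetical group/version
kind: InferenceModel
metadata:
  name: qwen-3b
spec:
  model: mlx-community/Qwen2.5-3B-Instruct-4bit
  replicas: 2
  hostSelector:                         # which InferenceHosts may serve it
    chip: m2-ultra
```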

For each InferenceModel, the operator picks compatible hosts, calls each chosen daemon's POST /admin/models to load the model, and creates a Kubernetes Service + EndpointSlice so cluster pods can reach the fleet under a single in-cluster DNS name like qwen-3b.inferenced.svc.cluster.local:11434.
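The published objects are ordinary Kubernetes resources. Since the daemons run on machines outside the cluster, the Service presumably carries no pod selector and the operator manages the EndpointSlice itself; a sketch of what the generated Service might look like (labels and port naming are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: qwen-3b            # named after the InferenceModel
  namespace: inferenced
spec:
  # No selector: the operator writes a matching EndpointSlice with the
  # IPs of the hosts currently serving this model.
  ports:
    - port: 11434
      protocol: TCP
```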

                          ┌──────────────────────────────────────────┐
   InferenceModel ─────►  │        inferenced-operator               │
   InferenceHost  ─────►  │   (kube-rs Controller, Rust)             │
                          └────────────────┬─────────────────────────┘
                                           │ HTTP /admin/models
                                           ▼
                ┌──────────────────────────────────────────────────┐
                │  inferenced  ║  inferenced  ║  inferenced        │
                │  (mac-1)     ║  (mac-2)     ║  (mac-3)           │
                └─────┬────────╨──────┬───────╨──────┬─────────────┘
                      │ Apple GPU     │              │
                      ▼               ▼              ▼
                  Metal/MLX       Metal/MLX      Metal/MLX

The operator publishes a Service named after each InferenceModel. Apply a host and a model, then query the Service from any pod:

kubectl apply -f examples/host-amelia.yaml
kubectl apply -f examples/model-qwen-3b.yaml

kubectl get inferencemodel
# NAME      MODEL                                       REPLICAS  READY  AGE
# qwen-3b   mlx-community/Qwen2.5-3B-Instruct-4bit      2         2      30s

kubectl run -it --rm chat --image=alpine -- sh
> apk add curl
> curl -s qwen-3b.inferenced.svc.cluster.local:11434/v1/chat/completions \
    -H 'content-type: application/json' \
    -d '{"model":"mlx-community/Qwen2.5-3B-Instruct-4bit","messages":[{"role":"user","content":"hi"}]}'

Documentation

  • Architecture — the reconcile loops, what the operator owns, what the daemon owns.
  • Installation — Helm chart install, RBAC, and CRD installation.
  • CRD Reference — full schema for InferenceHost and InferenceModel, with examples.
  • Examples — working YAML for hosts and models.
  • Development — building, running locally against a kubeconfig, contributing.
  • Troubleshooting — common reconcile failures and how to diagnose them.

Quick install (Helm)

# 1. Install the operator (CRDs included).
helm install inferenced-operator \
  oci://ghcr.io/dormlab/charts/inferenced-operator \
  --namespace inferenced \
  --create-namespace

# 2. Tell it about your hosts.
kubectl apply -f examples/host-amelia.yaml

# 3. Declare a model.
kubectl apply -f examples/model-qwen-3b.yaml

# 4. Wait for ready.
kubectl get inferencemodel -w

Quick install (raw manifests, without Helm)

# CRDs.
kubectl apply -f crds/

# Operator (use your own image or build from source).
kubectl create namespace inferenced
kubectl apply -n inferenced -f deploy/  # see docs/installation.md

Status

v0.1: InferenceHost and InferenceModel CRDs, two reconcile loops, and owner references on Service and EndpointSlice for proper GC. Tests cover CRD parsing and reconcile error paths.

Roadmap in docs/architecture.md#roadmap.

License

MIT. See LICENSE.
