v0.5.0 — llama-cpp router mode, float-tags policy

a1exus released this 23 May 18:41 · 3 commits to main since this release · 4dd2e8b

Added

  • tailscale/ stack: Tailscale sidecar — third ingress path alongside the LAN (mDNS) and public (Cloudflare Tunnel) ones. Registers this host as a node on your tailnet (spark-1822.<tailnet>.ts.net) and runs Tailscale Serve (config in serve.json) to terminate TLS on tailnet :443 with a real publicly-trusted MagicDNS cert and reverse-proxy to http://traefik:80. Joins the external traefik Docker network so backend container names resolve. Userspace networking mode (no /dev/net/tun, no privileged caps); node key + machine state in a named docker volume so secrets stay out of /opt. tailscale/README.md documents the Host-header routing caveat (Traefik routers match *.spark-1822.local, so tailnet hostnames need to be added to existing rule= labels) plus a "Hardening" section calling out four deferred follow-ups vs. Tailscale's production guidance: file-mounted auth secret over plain env, OAuth client credentials over static auth keys, --advertise-tags=tag:server for ACL identity, and kernel networking mode for throughput.
  • llama-cpp/ router mode (now the default): when no MODEL_* env var is set, llama-server starts with --models-dir /models and serves every GGUF in the symlink farm built by make hf-sync, loading on demand. Up to MODELS_MAX models stay resident in VRAM; LRU evicts the rest. Per-model overrides in an auto-generated config.ini with managed-fields semantics — hf-sync owns the model = line; user-edits to other keys (ctx-size, n-gpu-layers, hand-added alias, etc.) are preserved across regenerations; removed GGUFs are moved to config.ini.orphans (restored verbatim if they come back). Classic single-model mode (MODEL_PATH / MODEL_OLLAMA / MODEL_URL) still works for one-off pinning. Each GGUF is reachable under three IDs in /v1/models: the short alias from the config section name, the bare filename, and an HF-style <org>/<repo>:<quant> the router auto-derives from the symlink target path. New LLAMA_API_KEY bearer-token auth — required when the endpoint is reachable from anywhere other than 127.0.0.1 (Cloudflare Tunnel path is genuine internet exposure); generated by openssl rand -hex 32. Two new helper scripts: llama-cpp/scripts/sync-router.sh (builds the symlink farm; enumerates via hf cache scan with a find fallback) and llama-cpp/scripts/regen-config-ini.py (managed-fields INI regenerator with atomic writes). New Make target make models pretty-prints /v1/models from the running container. make up now accepts an empty ENV= to start in router mode.
  • Spec + plan docs under docs/superpowers/specs/ and docs/superpowers/plans/ recording the router-mode rollout (locked-in decisions, design, verification plan, and post-deployment findings).

Changed

  • llama-cpp/.env.example + entrypoint.sh + the hf-sync per-variant template: default CTX_SIZE 8192 → 32768. Matches what most contemporary 27B–120B GGUFs handle comfortably without YaRN tricks; per-variant overrides still win (uncomment CTX_SIZE=… in envs/<name>.env).
  • vllm/.env.example + entrypoint.sh + the hf-sync per-variant template: default VLLM_MAX_LEN 8192 → 32768 (same rationale as llama-cpp).
  • Top-level README.md: tailscale/ added to the intro bullets, Topology diagram (three ingress paths now), Layout tree, Components table, and First-time setup (new optional step 6, mirroring the Cloudflare Tunnel step).
  • Trivy: tailscale/tailscale added to the image-scan matrix. extract-tags now reads TAILSCALE_TAG from tailscale/.env.example (same strict regex as the other tags). trivy.md jobs table updated to list the new image.
  • tailscale/serve.json: now exposes both schemes on the tailnet node. :80 → http://traefik:80 (Traefik's web entrypoint 308-redirects to HTTPS, sending the client back to :443). :443 → https+insecure://traefik:443 (Tailscale terminates TLS with the MagicDNS cert, then proxies to Traefik's websecure entrypoint; +insecure skips verification of Traefik's wildcard cert, since the inner hop is container-to-container over the traefik Docker network). The previous single-listener config (:443 → http://traefik:80) would have caused a 308 redirect loop now that the dashboard router lives behind Host-based routing only. AllowFunnel: false for both ports, so only tailnet members can reach them.
  • tailscale/services.json: new file. Tailscale VIP Service endpoint config: a separate tailnet entity from the node, with its own MagicDNS name and cert. Endpoints proxy to Traefik (tcp:80 → http://traefik:80, tcp:443 → https+insecure://traefik:443). Single shared file applied per-service with --service=svc:<name>; distinct schema from serve.json (per Tailscale's Services configuration file docs, this is the single-service ServiceDetailsFile form: flat version + endpoints, no outer wrapper).
  • tailscale/Makefile: new. make services-apply / services-status / services-clear loop over six per-backend VIP services (svc:traefik, svc:vllm, svc:llama, svc:ollama, svc:open-webui, svc:netdata), pushing services.json into the container and running tailscale serve set-config --service=<svc> + serve advertise <svc> for each. Daemon state persists in the tailscale-state volume. Service creation in the admin console (https://login.tailscale.com/admin/services) is one-time and manual, since Tailscale doesn't expose a CLI for that. Tailnet clients hit https://<svc>.<tailnet>.ts.net and land on the matching Traefik backend.
  • Every Traefik router relaxed from <svc>.spark{x:.+} to <svc>.{x:.+}. The new form accepts both the LAN names (vllm.spark-1822.local) and the per-service tailnet names (vllm.<tailnet>.ts.net) without hardcoding either domain. Slightly more permissive than before — any <svc>.<anything> hitting Traefik will match its router — but the reachability surface (LAN + tailnet + Cloudflare Tunnel) is unchanged, so it's a non-issue in practice.
  • Legacy svc:spark (single-service-for-everything) kept working for backward compatibility: the Traefik dashboard router carries an || HostRegexp(spark{x:.+}) fallback so https://spark.<tailnet>.ts.net/ still lands on the dashboard. Retire it later — see "Legacy svc:spark" in tailscale/README.md.
  • Every Traefik router's rule= switched from a hardcoded Host() literal to a single HostRegexp(<svc>.spark{x:.+}) matcher. Matches the per-service subdomain form across any LAN/tailnet/Cloudflare domain (vllm.spark-1822.local, vllm.sparky.example.com, …). Six routers updated: ollama, open-webui, vllm, llama (label-based, in each app's compose) and netdata, traefik (file-based, in traefik/dynamic/services.yml). Renames of the hostname (spark-1822 → anything starting with spark) or the LAN/tailnet/public domain no longer require touching any rule. The Traefik dashboard router (traefik in dynamic/services.yml) additionally carries a second matcher HostRegexp(spark{x:.+}), so the bare tailnet hostname Tailscale Serve forwards (spark-1822.<tailnet>.ts.net) lands on the dashboard: Tailscale Serve only knows the node's MagicDNS hostname, and the dashboard is the most useful default landing. tailscale/README.md "Host-header routing" and traefik/README.md "Add an app" sections track the final state.
  • Image-pin policy flipped: float by default, pin in production. Every committed .env.example now uses a floating tag (latest for ollama / open-webui / netdata / tailscale / vllm / cloudflared; v2 for traefik because latest would resolve to v3 and silently break the named-capture HostRegexp rules; server-cuda for ggml-org/llama.cpp since no latest tag exists for that image). Operators override in their host-local .env with an immutable tag or content-digest pin for reproducibility — each .env.example carries the previous explicit pin in a comment + a release-page link. Root README's "Conventions" section rewritten to reflect the new rule (was: "never :latest"). Trivy's extract-tags regex (^[A-Za-z0-9._@:+-]+$) already accepted floating tags — no CI logic change; only stale "pinned" wording was cleaned up in .github/workflows/trivy.md and the cron comment.
  • llama-cpp/: GPU exclusivity reclassified. Router mode and Ollama are both lazy — neither claims VRAM until a model is loaded — so they coexist on the GB10's 124 GiB. The actually-exclusive engines in this stack are vLLM (--gpu-memory-utilization 0.9 reserves ~90% of VRAM eagerly) and llama-cpp's classic single-model mode (-ngl 999 pins every layer at startup). README's old "GPU exclusivity" subsection rewritten as "GPU sharing" with a posture-by-engine table. Documents that Ollama's restart: unless-stopped brings it back on Docker daemon restarts even after an explicit stop — so hard exclusivity needs ongoing intent.
  • llama-cpp/README.md clarifications: made explicit that the OpenAI-compatible API and the built-in web UI share one port (8080) on the same Traefik router, discriminated by path — there is no separate UI port. New "Router quirks worth knowing" subsection covering: /v1/models is unauthenticated by OpenAI-compat convention (only /v1/chat/completions enforces the bearer), each part of a multi-part split GGUF appears as its own model ID (only part 1 is loadable), loading the same physical file under two IDs counts twice toward MODELS_MAX, and gpt-oss-* models need the harmony chat template (default ChatML produces a 500 on parse). README "Pinning the image" snippet updated for ghcr.io's switch from Docker manifest list to OCI image index (the old Accept: application/vnd.docker.distribution.manifest.list.v2+json now returns 404; the new snippet sends OCI first), and adds a --help sanity check to catch upstream builds with broken RPATH/RUNPATH before deployment.
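
To see why relaxing the router rules from <svc>.spark{x:.+} to <svc>.{x:.+} admits the per-service tailnet names, the named-capture templates can be approximated with Python regular expressions. This is a rough sketch, not Traefik's matcher: `host_template_to_regex` is a hypothetical helper, the `[^.]+` default for a capture without a pattern is an assumption, patterns containing braces are not handled, and `example.ts.net` stands in for a real tailnet domain.

```python
import re


def host_template_to_regex(template):
    """Translate a '{name:pattern}' host template into a compiled regex."""
    parts, pos = [], 0
    for match in re.finditer(r"\{(\w+)(?::([^{}]+))?\}", template):
        # Text between captures is literal, so dots must be escaped.
        parts.append(re.escape(template[pos:match.start()]))
        pattern = match.group(2) or r"[^.]+"  # assumed default pattern
        parts.append(f"(?P<{match.group(1)}>{pattern})")
        pos = match.end()
    parts.append(re.escape(template[pos:]))
    # Hostnames are case-insensitive; anchor both ends for a full match.
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)


old_rule = host_template_to_regex("vllm.spark{x:.+}")
new_rule = host_template_to_regex("vllm.{x:.+}")
dashboard_fallback = host_template_to_regex("spark{x:.+}")

for host in ("vllm.spark-1822.local", "vllm.example.ts.net"):
    print(host, bool(old_rule.match(host)), bool(new_rule.match(host)))
```

Under these assumptions the old form accepts only the LAN-style name, while the relaxed form accepts both; the dashboard fallback still catches a bare spark-prefixed node hostname.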

Removed

  • caddy/ stack: dropped. Traefik covers the same routing surface and is a better fit for a multi-container setup — discovery via Docker labels means adding a stack is a label drop, not a Caddyfile.d/*.caddyfile edit + reload. Eliminates the "two reverse proxies, pick one" footgun. Every sibling stack (open-webui/, vllm/, llama-cpp/, netdata/) had Caddy-as-backup phrasing scrubbed from its README + compose comments; ${CADDY_DOMAIN}-style URL templates replaced with the canonical literal https://<svc>.spark-1822.local (the HostRegexp rule accepts any <svc>.spark*.<domain>, so tailnet and Cloudflare Tunnel hostnames also resolve through the same routers). Trivy's image-scan matrix lost the caddy row. Host migration: docker compose -f /opt/caddy/docker-compose.yml down -v (also drops the caddy-data volume holding Caddy's internal CA), then sudo rm -rf /opt/caddy after the next git pull. Anything that was trusting caddy-root.crt should be re-pointed at traefik-root.crt (per-OS install steps in traefik/README.md).

Security

  • New LLAMA_API_KEY bearer-token auth on llama-cpp/ (see Added). Required for non-127.0.0.1 exposure paths. /v1/chat/completions returns 401 without it; /v1/models stays open by OpenAI-compat convention.
  • Known finding: Trivy flagged CVE-2026-33186 (gRPC-Go authorization bypass via HTTP/2 path validation, fixed in google.golang.org/grpc 1.79.3) inside cloudflare/cloudflared:2026.5.0's embedded Go binary. Cloudflare hadn't shipped a rebuild as of this release; with the new floating-tag convention, the next docker compose pull picks up the rebuild automatically once it ships. Practical risk for this host is low — cloudflared is a gRPC client to Cloudflare's edge (the vector requires an attacker-controlled server). The entry in .github/workflows/trivy.md → "Known findings" tracks it.
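
The bearer-token behaviour described above can be modelled in a few lines. This is a sketch of the documented behaviour, not llama-server's code: `new_api_key` and `status_for` are hypothetical names, and treating every path other than /v1/models as protected is an assumption (the notes only state that /v1/chat/completions returns 401 without the key).

```python
import hmac
import secrets

OPEN_PATHS = {"/v1/models"}  # per the notes, listed without auth


def new_api_key():
    """32 random bytes as 64 hex chars, the same shape as `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def status_for(path, authorization, api_key):
    """Return the HTTP status the auth check alone would produce."""
    if path in OPEN_PATHS:
        return 200
    scheme, _, token = (authorization or "").partition(" ")
    # compare_digest avoids leaking how many leading characters matched.
    if scheme.lower() == "bearer" and hmac.compare_digest(
        token.encode(), api_key.encode()
    ):
        return 200
    return 401


key = new_api_key()
print(status_for("/v1/models", None, key))
print(status_for("/v1/chat/completions", None, key))
print(status_for("/v1/chat/completions", f"Bearer {key}", key))
```

A client would send the key as an `Authorization: Bearer <key>` header on chat-completion requests.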