v0.3.0 — vLLM stack, tool-calling, env-per-variant workflow

@a1exus a1exus released this 14 May 03:06
· 85 commits to main since this release

[0.3.0] - 2026-05-13

Added

  • vllm/ stack: vLLM inference server (image vllm/vllm-openai:v0.20.2, multi-arch arm64+amd64). OpenAI-compatible API fronted by Caddy at https://vllm.${CADDY_DOMAIN}. Shares the host's HuggingFace cache (/opt/hf/.cache/huggingface) with llama-cpp/. Complements llama-cpp/ (vLLM for HF safetensors + high-throughput serving; llama.cpp for GGUF). restart: "no" for GPU exclusivity with Ollama / llama-cpp. Smoke-tested on GB10 with Qwen/Qwen3.6-27B (compute capability 12.1 just works — no source build needed); gpt-oss-* variants still fail at startup on v0.20.2 because the bundled openai-harmony fetches a vocab file from a URL that 404s upstream (unrelated to GB10).
  • vllm/entrypoint.sh: builds the vllm serve argv from env vars, mirroring the llama-cpp/ pattern. Replaces the long YAML command: list. Enables OpenAI tool-calling on the API (--enable-auto-tool-choice --tool-call-parser qwen3_xml) so agentic clients (Opencode, etc.) can issue tool-use requests. Qwen3.6 emits the XML tool-call format (<tool_call><function=NAME><parameter=PARAM>VAL</parameter></function></tool_call>), not the Hermes JSON variant. For other model families that emit a different format, the parser is a no-op (chat completions still work).
  • One-env-per-model-variant layout for llama-cpp/ and vllm/. Each stack has an envs/ directory of self-contained <name>.env variant files (image pin + HF cache + HF token + model knobs) plus a top-level Makefile (make list, make up ENV=<name>, make hf-cache, make hf-sync). make up invokes docker compose --env-file envs/<name>.env up -d directly; no rolling .env is written. Manage a running stack with plain docker compose (with the same --env-file) or with docker against the container name.
  • make hf-cache / make hf-sync (vllm + llama-cpp): list cached HF repos / GGUFs on this host, and reconcile envs/*.env against them — create envs for newly cached models, restore <name>.env from <name>.env.bak when a model returns (preserving hand edits), move <name>.env → <name>.env.bak when a model leaves. The .bak orphan path is non-destructive.
  • llama-cpp/ and vllm/: bind the engine's OpenAI-compatible API to 127.0.0.1 on the host (127.0.0.1:8080:8080 and 127.0.0.1:8000:8000). External traffic continues to flow through Caddy on the shared caddy network — the loopback bind is for direct host-side curl/benchmarks. HOST_PORT overrideable per-variant.
  • caddy/Makefile: new — make ca-cert extracts Caddy's internal root CA into ./caddy-root.crt.
  • caddy/README.md: expanded the local-CA install matrix — added macOS (CLI) (security add-trusted-cert), Fedora/RHEL, Arch (trust anchor --store), Windows (PowerShell + cmd + GUI), and a Node.js apps row (NODE_EXTRA_CA_CERTS / Node ≥22 NODE_USE_SYSTEM_CA=1) so Opencode and other Node-bundled-CA clients can trust Caddy's leaf certs.
  • "Supported model formats" sections in llama-cpp/README.md and vllm/README.md: spell out what each engine loads (GGUF vs HF safetensors) and what it doesn't, with upstream links for architecture and quantization compatibility.
  • Trivy: vllm/vllm-openai added to the image-scan matrix; new vllm_tag output from extract-tags.
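
The reconcile rule described for make hf-sync (create, restore from .bak, orphan to .bak) can be sketched as a small POSIX shell function. This is an illustration, not the repo's Makefile recipe: the function name sync_envs, the flat model names, and the one-line placeholder written into a new env file are all assumptions.

```shell
# Hypothetical sketch of the hf-sync reconcile rule; not the repo's recipe.
# usage: sync_envs ENVS_DIR NAME...   (NAME... = models currently cached)
sync_envs() {
  envs_dir=$1
  shift

  # Pass 1: every cached model ends up with an env file.
  for name in "$@"; do
    env_file="$envs_dir/$name.env"
    if [ -f "$env_file" ]; then
      continue                                   # already present, leave as-is
    elif [ -f "$env_file.bak" ]; then
      mv "$env_file.bak" "$env_file"             # model returned: keep hand edits
    else
      printf 'MODEL=%s\n' "$name" > "$env_file"  # placeholder content (assumption)
    fi
  done

  # Pass 2: env files whose model left the cache are parked, never deleted.
  for env_file in "$envs_dir"/*.env; do
    if [ ! -f "$env_file" ]; then
      continue                                   # glob matched nothing
    fi
    name=$(basename "$env_file" .env)
    keep=no
    for cached in "$@"; do
      if [ "$cached" = "$name" ]; then
        keep=yes
      fi
    done
    if [ "$keep" = no ]; then
      mv "$env_file" "$env_file.bak"
    fi
  done
}
```

Calling sync_envs envs modelA modelB leaves envs/modelA.env and envs/modelB.env in place and renames every other *.env to *.env.bak, so a hand-edited variant is still there when its model is cached again.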

Changed

  • make up VARIANT=<name> → make up ENV=<name> (and matching renames in docs). The Make variable name now matches what the files are: ".env" files.
  • llama-cpp/ + vllm/ variant workflow: each envs/<name>.env is self-contained (image pin + HF cache + HF token + model knobs in one file). make up ENV=<name> uses docker compose --env-file envs/<name>.env up -d directly — no rolling .env is written. The make down/logs/ps targets are dropped — docker compose --env-file ... / docker are the source of truth for those.
  • vllm/, llama-cpp/: one Makefile per stack. The previous envs/Makefile for HF-cache maintenance was collapsed into the top-level Makefile; recipes cd envs/ to operate on *.env.
  • caddy/: the stack now defines the shared caddy Docker network (attachable: true) instead of referencing it as external: true. Dropped the docker network create caddy one-time setup step. cd /opt/caddy && docker compose up -d creates the network on first boot. Other stacks still reference it as external: true to join.
  • open-webui/docker-compose.yml: the two persistent volumes (open-webui, open-webui-ollama) are now declared external: true to match how they exist on the host and to make sure docker compose down -v never destroys them. Silences the "already exists but was not created by Docker Compose" warnings. First-time deploys need docker volume create open-webui open-webui-ollama once — documented in open-webui/README.md.
  • open-webui Caddy vhost moved from the bare {$CADDY_DOMAIN} to open-webui.{$CADDY_DOMAIN} — matches the per-service subdomain convention used by every other stack (llama., vllm., ollama., netdata.). Requires a matching mDNS alias. Side effect: the bare spark-1822.local no longer routes to anything, so Caddy returns a clean 404 instead of the misleading 502 it served while open-webui was down.
  • llama-cpp/ and vllm/ Makefiles: small best-practice hardening — .SUFFIXES: (disable built-in implicit rules), .DELETE_ON_ERROR:, $(strip $(ENV)) to tolerate trailing whitespace, quoted env-file paths, and a SERVICE variable for the compose service name. No behavior change for existing inputs.
  • vllm/entrypoint.sh: unset the four VLLM_* helper vars (VLLM_MODEL / VLLM_SERVED_NAME / VLLM_GPU_MEM / VLLM_MAX_LEN) before exec vllm serve. They're only used to build the argv; leaving them in env triggers cosmetic "Unknown vLLM environment variable" warnings at startup.
  • Top-level README.md: rewrote the Deploy workflow. The old "scp + sudo install" pattern is gone — /opt on the host is itself a checkout of this repo, so deploy is ssh spark-1822.local 'sudo git -C /opt pull --ff-only' followed by the stack-specific apply step.
  • Renamed the shared external Docker network from web to caddy; the name reflects what the network actually is (the path Caddy proxies over). Every stack's compose file updated. Migration on the host: docker network create caddy, then docker compose up -d in each stack, then docker network rm web.
  • llama-cpp/ switched to restart: "no" (was unless-stopped). The engine eagerly grabs ~65 GiB of VRAM and conflicts with Ollama; starting it manually keeps the two from racing each other on boot. The stack's README documents the switch-engine snippets. Same change applied to vllm/.
  • HF_CACHE_HOST default moved from /home/alexus/.cache/huggingface to /opt/hf/.cache/huggingface — the host's existing system-wide HF cache (~77 GiB of models already there, including openai/gpt-oss-120b).
  • .gitignore: added **/envs/*.env (variant files are host-local artifacts), *.bak (host-local backups including hf-sync's orphaning path), and hf (the host's HuggingFace cache lives at /opt/hf/).
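
The vllm/entrypoint.sh pattern mentioned above (build the vllm serve argv from helper variables, then unset them before exec) might look roughly like the sketch below. It is not the repo's script: the function name build_vllm_argv is hypothetical, and the mapping of each VLLM_* helper to a vllm serve flag is an assumption.

```shell
# Hypothetical sketch, not the repo's vllm/entrypoint.sh.
# Assumed mapping: VLLM_SERVED_NAME -> --served-model-name,
# VLLM_GPU_MEM -> --gpu-memory-utilization, VLLM_MAX_LEN -> --max-model-len.
build_vllm_argv() {
  # Start with the subcommand and model; fail loudly if the model is missing.
  set -- serve "${VLLM_MODEL:?VLLM_MODEL must be set}"
  if [ -n "${VLLM_SERVED_NAME:-}" ]; then
    set -- "$@" --served-model-name "$VLLM_SERVED_NAME"
  fi
  if [ -n "${VLLM_GPU_MEM:-}" ]; then
    set -- "$@" --gpu-memory-utilization "$VLLM_GPU_MEM"
  fi
  if [ -n "${VLLM_MAX_LEN:-}" ]; then
    set -- "$@" --max-model-len "$VLLM_MAX_LEN"
  fi
  # Print one argument per line so the result is easy to inspect.
  printf '%s\n' "$@"
}

# A real entrypoint would keep the list in "$@" instead of printing it,
# then drop the helpers so vLLM does not warn about unknown VLLM_* variables:
#   unset VLLM_MODEL VLLM_SERVED_NAME VLLM_GPU_MEM VLLM_MAX_LEN
#   exec vllm "$@"
```

Only variables that are set contribute a flag, so a variant file can omit a knob and fall back to vLLM's own default.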

Fixed

  • llama-cpp/entrypoint.sh: marked executable (100755). The script is bind-mounted at the container's entrypoint; without the exec bit on the host file, runc failed with "permission denied" on docker compose up.
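
A quick way to guard against this class of failure is to check the exec bit before bringing the stack up. The helper below is a minimal sketch; the name ensure_executable is hypothetical and nothing like it is claimed to exist in the repo.

```shell
# Hypothetical helper: make sure a bind-mounted entrypoint has the exec bit.
ensure_executable() {
  script=$1
  if [ ! -f "$script" ]; then
    echo "missing: $script" >&2
    return 1
  fi
  if [ ! -x "$script" ]; then
    chmod +x "$script"   # fixes the working copy on this host
  fi
}

# To record the mode in git as well (it then shows up as 100755):
#   git update-index --chmod=+x llama-cpp/entrypoint.sh
```

Running ensure_executable llama-cpp/entrypoint.sh before docker compose up fixes the local file; the git update-index step is what makes the fix survive a fresh checkout.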

Removed

  • llama-cpp/envs/: dropped the 8 Ollama-blob variant files (gpt-oss-safeguard-20b, qwen3.6-35b, phi4-14b, gemma4-e4b, llama3.1-8b, deepseek-r1-8b, granite4.1-3b, tinyllama). The MODEL_OLLAMA resolution in entrypoint.sh, the /ollama:ro mount, and the env pass-through stay so an Ollama-backed variant can be added back any time without code changes.