vllm/ stack: vLLM inference server (image vllm/vllm-openai:v0.20.2, multi-arch arm64+amd64). OpenAI-compatible API fronted by Caddy at https://vllm.${CADDY_DOMAIN}. Shares the host's HuggingFace cache (/opt/hf/.cache/huggingface) with llama-cpp/. Complements llama-cpp/ (vLLM for HF safetensors + high-throughput serving; llama.cpp for GGUF). restart: "no" for GPU exclusivity with Ollama / llama-cpp. Smoke-tested on GB10 with Qwen/Qwen3.6-27B (compute capability 12.1 just works — no source build needed); gpt-oss-* variants still fail at startup on v0.20.2 because the bundled openai-harmony fetches a vocab file from a URL that 404s upstream (unrelated to GB10).
vllm/entrypoint.sh: builds the vllm serve argv from env vars, mirroring the llama-cpp/ pattern. Replaces the long YAML command: list. Enables OpenAI tool-calling on the API (--enable-auto-tool-choice --tool-call-parser qwen3_xml) so agentic clients (Opencode, etc.) can issue tool-use requests. Qwen3.6 emits the XML tool-call format (<tool_call><function=NAME><parameter=PARAM>VAL</parameter></function></tool_call>), not the Hermes JSON variant. For other model families that emit a different format, the parser is a no-op (chat completions still work).
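As a rough illustration of the env-to-argv pattern described above, the sketch below assembles a `vllm serve` argument list in POSIX sh. It is not a copy of vllm/entrypoint.sh: the function name `build_vllm_argv` is hypothetical, and the mapping of `VLLM_SERVED_NAME`, `VLLM_GPU_MEM`, and `VLLM_MAX_LEN` to specific flags is an assumption based on their names.

```shell
#!/bin/sh
# Sketch only: the variable-to-flag mapping is an assumption,
# not the actual contents of vllm/entrypoint.sh.
build_vllm_argv() {
    # The model is mandatory; abort with a message if it is missing.
    set -- serve "${VLLM_MODEL:?VLLM_MODEL must be set}"

    # Append optional flags only when the matching variable is non-empty.
    if [ -n "${VLLM_SERVED_NAME:-}" ]; then
        set -- "$@" --served-model-name "$VLLM_SERVED_NAME"
    fi
    if [ -n "${VLLM_GPU_MEM:-}" ]; then
        set -- "$@" --gpu-memory-utilization "$VLLM_GPU_MEM"
    fi
    if [ -n "${VLLM_MAX_LEN:-}" ]; then
        set -- "$@" --max-model-len "$VLLM_MAX_LEN"
    fi

    # Tool-calling flags named in the entry above.
    set -- "$@" --enable-auto-tool-choice --tool-call-parser qwen3_xml

    # Print one argument per line so the result is easy to inspect.
    printf '%s\n' "$@"
}
```

A real entrypoint would presumably end with `exec vllm "$@"` instead of printing; printing keeps the sketch safe to run on a machine without vLLM installed.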
One-env-per-model-variant layout for llama-cpp/ and vllm/. Each stack has an envs/<name>.env directory of self-contained variant files (image pin + HF cache + HF token + model knobs) plus a top-level Makefile (make list, make up ENV=<name>, make hf-cache, make hf-sync). make up invokes docker compose --env-file envs/<name>.env up -d directly — no rolling .env is written. Management via plain docker compose (with the same --env-file) or docker against the container name.
make hf-cache / make hf-sync (vllm + llama-cpp): list cached HF repos / GGUFs on this host, and reconcile envs/*.env against them — create envs for newly cached models, restore <name>.env from <name>.env.bak when a model returns (preserving hand edits), move <name>.env → <name>.env.bak when a model leaves. The .bak orphan path is non-destructive.
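The reconcile rules above (create, restore from `.bak`, orphan to `.bak`) can be sketched as a small POSIX sh function. This is an illustration under stated assumptions, not the Makefile recipe: `reconcile_envs` is a hypothetical name, the cached model names are passed in already flattened to env-file names, and the `MODEL=` line written for new files is a placeholder.

```shell
#!/bin/sh
# reconcile_envs ENVS_DIR NAME...
# Sketch of the hf-sync rules; NAME... is the list of currently cached models.
reconcile_envs() {
    envs_dir=$1
    shift

    # Pass 1: every cached model gets an env file.
    for name in "$@"; do
        env_file="$envs_dir/$name.env"
        if [ -f "$env_file" ]; then
            continue                              # already present, leave as is
        elif [ -f "$env_file.bak" ]; then
            mv "$env_file.bak" "$env_file"        # model returned: restore hand edits
        else
            printf 'MODEL=%s\n' "$name" > "$env_file"   # placeholder content
        fi
    done

    # Pass 2: env files whose model is no longer cached are moved aside.
    for env_file in "$envs_dir"/*.env; do
        [ -f "$env_file" ] || continue            # glob matched nothing
        name=$(basename "$env_file" .env)
        keep=no
        for cached in "$@"; do
            if [ "$cached" = "$name" ]; then
                keep=yes
            fi
        done
        if [ "$keep" = no ]; then
            mv "$env_file" "$env_file.bak"        # non-destructive orphaning
        fi
    done
}
```

Nothing is ever deleted: a file either stays, is created, or is renamed, which is what makes the `.bak` path safe to rerun.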
llama-cpp/ and vllm/: bind the engine's OpenAI-compatible API to 127.0.0.1 on the host (127.0.0.1:8080:8080 and 127.0.0.1:8000:8000 respectively). External traffic continues to flow through Caddy on the shared caddy network; the loopback bind is for direct host-side curl/benchmarks. HOST_PORT is overridable per variant.
caddy/Makefile: new — make ca-cert extracts Caddy's internal root CA into ./caddy-root.crt.
caddy/README.md: expanded the local-CA install matrix — added macOS (CLI) (security add-trusted-cert), Fedora/RHEL, Arch (trust anchor --store), Windows (PowerShell + cmd + GUI), and a Node.js apps row (NODE_EXTRA_CA_CERTS / Node ≥22 NODE_USE_SYSTEM_CA=1) so Opencode and other Node-bundled-CA clients can trust Caddy's leaf certs.
"Supported model formats" sections in llama-cpp/README.md and vllm/README.md: spell out what each engine loads (GGUF vs HF safetensors) and what it doesn't, with upstream links for architecture and quantization compatibility.
Trivy: vllm/vllm-openai added to the image-scan matrix; new vllm_tag output from extract-tags.
Changed
make up VARIANT=<name> → make up ENV=<name> (and matching renames in docs). The Make variable name now matches what the files are: .env files.
llama-cpp/ + vllm/ variant workflow: each envs/<name>.env is self-contained (image pin + HF cache + HF token + model knobs in one file). make up ENV=<name> uses docker compose --env-file envs/<name>.env up -d directly — no rolling .env is written. The make down/logs/ps targets are dropped — docker compose --env-file ... / docker are the source of truth for those.
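With the down/logs/ps targets gone, day-to-day management means repeating the same `--env-file` argument on every `docker compose` call. A small wrapper along these lines can save typing. It is a sketch, not part of the repo: `compose_env` and the `DOCKER` override (useful for dry runs) are hypothetical names, and it assumes it is run from the stack directory that contains `envs/`.

```shell
#!/bin/sh
# compose_env NAME ARGS...
# Run `docker compose` against envs/NAME.env, refusing unknown variants.
compose_env() {
    env_name=$1
    shift
    env_file="envs/$env_name.env"

    if [ ! -f "$env_file" ]; then
        echo "no such variant: $env_file" >&2
        return 1
    fi

    # DOCKER can be set to `echo` to print the command instead of running it.
    ${DOCKER:-docker} compose --env-file "$env_file" "$@"
}
```

For a variant file named `envs/demo.env` (an illustrative name), `compose_env demo up -d` mirrors what `make up ENV=demo` does, and `compose_env demo logs -f` or `compose_env demo down` cover the targets that were dropped.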
vllm/, llama-cpp/: one Makefile per stack. The previous envs/Makefile for HF-cache maintenance was collapsed into the top-level Makefile; recipes cd envs/ to operate on *.env.
caddy/: the stack now defines the shared caddy Docker network (attachable: true) instead of referencing it as external: true. Dropped the docker network create caddy one-time setup step. cd /opt/caddy && docker compose up -d creates the network on first boot. Other stacks still reference it as external: true to join.
open-webui/docker-compose.yml: the two persistent volumes (open-webui, open-webui-ollama) are now declared external: true to match how they exist on the host and to make sure docker compose down -v never destroys them. Silences the "already exists but was not created by Docker Compose" warnings. First-time deploys need docker volume create open-webui open-webui-ollama once — documented in open-webui/README.md.
open-webui Caddy vhost moved from the bare {$CADDY_DOMAIN} to open-webui.{$CADDY_DOMAIN} — matches the per-service subdomain convention used by every other stack (llama., vllm., ollama., netdata.). Requires a matching mDNS alias. Side effect: the bare spark-1822.local no longer routes to anything, so Caddy returns a clean 404 instead of the misleading 502 it served while open-webui was down.
llama-cpp/ and vllm/ Makefiles: small best-practice hardening — .SUFFIXES: (disable built-in implicit rules), .DELETE_ON_ERROR:, $(strip $(ENV)) to tolerate trailing whitespace, quoted env-file paths, and a SERVICE variable for the compose service name. No behavior change for existing inputs.
vllm/entrypoint.sh: unset the four VLLM_* helper vars (VLLM_MODEL / VLLM_SERVED_NAME / VLLM_GPU_MEM / VLLM_MAX_LEN) before exec vllm serve. They're only used to build the argv; leaving them in env triggers cosmetic "Unknown vLLM environment variable" warnings at startup.
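The unset-then-exec step might look roughly like the sketch below. The helper name `exec_without_helpers` is hypothetical; only the four variable names come from the entry above.

```shell
#!/bin/sh
# exec_without_helpers CMD ARGS...
# Drop the argv-building helper variables, then replace the shell with CMD
# so the server never sees them in its environment.
exec_without_helpers() {
    unset VLLM_MODEL VLLM_SERVED_NAME VLLM_GPU_MEM VLLM_MAX_LEN
    exec "$@"
}
```

Because `exec` replaces the current process, anything the script still needs from those variables has to be copied into the argument list before this point.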
Top-level README.md: rewrote the Deploy workflow. The old "scp + sudo install" pattern is gone — /opt on the host is itself a checkout of this repo, so deploy is ssh spark-1822.local 'sudo git -C /opt pull --ff-only' followed by the stack-specific apply step.
Renamed the shared external Docker network from web to caddy — the name reflects what the network actually is (the path Caddy proxies over). Every stack's compose updated. Migration on the host: docker network create caddy, docker compose up -d each stack, docker network rm web.
llama-cpp/ switched to restart: "no" (was unless-stopped). The engine eagerly grabs ~65 GiB of VRAM and conflicts with Ollama; starting it manually keeps the two engines from racing for the GPU at boot. The stack's README documents the engine-switching snippets. Same change applied to vllm/.
HF_CACHE_HOST default moved from /home/alexus/.cache/huggingface to /opt/hf/.cache/huggingface — the host's existing system-wide HF cache (~77 GiB of models already there, including openai/gpt-oss-120b).
.gitignore: added **/envs/*.env (variant files are host-local artifacts), *.bak (host-local backups including hf-sync's orphaning path), and hf (the host's HuggingFace cache lives at /opt/hf/).
Fixed
llama-cpp/entrypoint.sh: marked executable (100755). The script is bind-mounted at the container's entrypoint; without the exec bit on the host file, runc failed with "permission denied" on docker compose up.
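A pre-flight check along these lines would catch the same problem before `docker compose up` reaches runc. It is a sketch with a hypothetical function name, not something shipped in the repo.

```shell
#!/bin/sh
# check_entrypoint FILE
# Succeed only if FILE exists and carries an execute bit on the host.
check_entrypoint() {
    if [ ! -f "$1" ]; then
        echo "missing: $1" >&2
        return 1
    fi
    if [ ! -x "$1" ]; then
        echo "not executable: $1 (try: chmod 755 $1)" >&2
        return 1
    fi
    return 0
}
```

On the Git side, the mode can be recorded without touching the working tree via `git update-index --chmod=+x llama-cpp/entrypoint.sh`, which is one way to arrive at the 100755 entry mentioned above.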
Removed
llama-cpp/envs/: dropped the 8 Ollama-blob variant files (gpt-oss-safeguard-20b, qwen3.6-35b, phi4-14b, gemma4-e4b, llama3.1-8b, deepseek-r1-8b, granite4.1-3b, tinyllama). The MODEL_OLLAMA resolution in entrypoint.sh, the /ollama:ro mount, and the env pass-through stay so an Ollama-backed variant can be added back any time without code changes.