feat(dns): vanity hostname claims (first-come-first-served) by posix4e · Pull Request #145 · devopsdefender/dd

posix4e · 2026-04-19T20:14:39Z

Summary

Workloads can now declare expose.claim_hostname: "nvidia-smi" to grab a stable short URL (nvidia-smi.<domain>) directly under the zone apex, instead of the auto-labeled per-agent URL (<agent>-<label>.<domain>). DNS uniqueness is the lock — the first agent to POST the CNAME wins, subsequent callers get a deterministic conflict.
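The first-come-first-served rule above can be illustrated with a minimal Rust sketch. It models the zone as an in-memory map instead of calling the Cloudflare API, and it assumes ownership is judged by comparing the CNAME target; `Zone`, `try_claim`, `ClaimOutcome`, and `Conflict` are hypothetical names for illustration, not the types in this PR.

```rust
use std::collections::HashMap;

/// Result of a successful claim attempt.
#[derive(Debug, PartialEq)]
enum ClaimOutcome {
    /// No record existed; we created it and now hold the name.
    Created,
    /// The record already points at us (idempotent re-apply).
    AlreadyOurs,
}

/// Returned when a different owner already holds the name.
#[derive(Debug, PartialEq)]
struct Conflict {
    hostname: String,
    owner: String,
}

/// Hypothetical in-memory stand-in for the DNS zone: hostname -> CNAME target.
#[derive(Default)]
struct Zone {
    cnames: HashMap<String, String>,
}

impl Zone {
    /// Create-only claim: an existing record is never overwritten (no upsert).
    fn try_claim(&mut self, hostname: &str, target: &str) -> Result<ClaimOutcome, Conflict> {
        if let Some(existing) = self.cnames.get(hostname) {
            return if existing.as_str() == target {
                Ok(ClaimOutcome::AlreadyOurs)
            } else {
                Err(Conflict {
                    hostname: hostname.to_string(),
                    owner: existing.clone(),
                })
            };
        }
        self.cnames.insert(hostname.to_string(), target.to_string());
        Ok(ClaimOutcome::Created)
    }
}
```

In this sketch the second caller with a different target always gets the same `Conflict` value, which is the "deterministic conflict" property; a real implementation would rely on the DNS provider rejecting a duplicate record rather than on a local map.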

This is Phase 1 of the "DNS-based deployments" plan. Phase 2 (automatic relaunch on agent death) is not in this PR — for now, when the owning agent dies, the URL 404s until a fresh deploy lands.

What changes

Schema (apps/README.md, apps/web-nvidia-smi/workload.json): expose: gains a claim_hostname field that is mutually exclusive with hostname_label. nvidia-smi switches from hostname_label: "gpu" to claim_hostname: "nvidia-smi".
Wire: the DD_EXTRA_INGRESS env var now also accepts @name:port for claim entries. The agent parses it into two Vecs (extra_ingress + claims); each register/ingress_replace payload entry is either {hostname_label, port} or {claim_hostname, port}.
CF: new try_claim_cname POSTs without upsert and bubbles up a conflict if the record already exists. apply_ingress accepts a claims slice and either POSTs fresh CNAMEs or confirms ownership on re-apply.
CP: RegisterReq/IngressReplaceReq parse the new variant field; provision_agent_access creates a public-bypass CF Access app per claim at the zone apex. collector::Agent stores claims alongside extras; the collector recovers them via /health scrape after a CP restart.
Release: the collector's orphan-GC path now iterates a dead agent's claims, calls cf::release_claim (which checks ownership before deleting, so it doesn't stomp on a legitimate takeover), then sweeps the per-claim Access apps. So when the nvidia-smi agent STONITHs, the name frees up for the next deploy.
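The wire format described above can be sketched as a small parser. This is an illustration only: it assumes entries are comma-separated (the PR does not state the separator), and `parse_extra_ingress` and `Entry` are hypothetical names, not the agent's actual code.

```rust
/// One parsed ingress entry: (name, port).
type Entry = (String, u16);

/// Splits a DD_EXTRA_INGRESS-style value into (extra_ingress, claims).
/// `label:port` entries go to the first Vec, `@name:port` entries to the second.
/// ASSUMPTION: entries are comma-separated; the real separator may differ.
fn parse_extra_ingress(raw: &str) -> Result<(Vec<Entry>, Vec<Entry>), String> {
    let mut extras = Vec::new();
    let mut claims = Vec::new();
    for item in raw.split(',').map(str::trim).filter(|s| !s.is_empty()) {
        // Split on the last ':' so the port is always the trailing component.
        let (name, port) = item
            .rsplit_once(':')
            .ok_or_else(|| format!("missing ':' in {item:?}"))?;
        let port: u16 = port
            .parse()
            .map_err(|_| format!("bad port in {item:?}"))?;
        match name.strip_prefix('@') {
            // A leading '@' marks a vanity claim.
            Some(claim) if !claim.is_empty() => claims.push((claim.to_string(), port)),
            Some(_) => return Err(format!("empty claim name in {item:?}")),
            // Anything else is an auto-labeled entry.
            None if !name.is_empty() => extras.push((name.to_string(), port)),
            None => return Err(format!("empty label in {item:?}")),
        }
    }
    Ok((extras, claims))
}
```

Under these assumptions, a value such as `gpu:8080,@nvidia-smi:9000` yields one auto-labeled entry and one claim, which the agent could then forward as `{hostname_label, port}` and `{claim_hostname, port}` respectively.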

Test plan

cargo build --release / cargo clippy -- -D warnings / cargo test: clean (23 tests; 4 new for claim parsing and CF helpers)
Preview deploy: nvidia-smi spec baked into prod config.iso; confirm nvidia-smi.devopsdefender.com resolves and serves after prod agent registers
Deploy a second workload spec with the same claim_hostname — expect CP 500 upstream error mentioning "already owned"
STONITH the prod agent (virsh destroy dd-local-prod); confirm the collector's next tick frees the claim (CNAME gone, Access app gone). Next deploy can re-claim.
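The ownership check that the STONITH step exercises can be sketched as follows. This is a hedged illustration, not the PR's `cf::release_claim`: it uses an in-memory map in place of the Cloudflare API and assumes that "ownership" means the CNAME still points at the dead agent's target. The `Release` enum and the function signature are hypothetical.

```rust
use std::collections::HashMap;

/// What happened when the collector tried to free a dead agent's claim.
#[derive(Debug, PartialEq)]
enum Release {
    /// The record pointed at the dead agent and was removed.
    Deleted,
    /// The record points elsewhere (a legitimate takeover); left untouched.
    NotOurs,
    /// No record exists for this hostname.
    Missing,
}

/// Deletes the CNAME only if it still points at the dead agent's target.
/// ASSUMPTION: `zone` maps hostname -> CNAME target and stands in for the DNS API.
fn release_claim(
    zone: &mut HashMap<String, String>,
    hostname: &str,
    dead_target: &str,
) -> Release {
    let owned_by_dead = match zone.get(hostname) {
        None => return Release::Missing,
        Some(target) => target.as_str() == dead_target,
    };
    if owned_by_dead {
        zone.remove(hostname);
        Release::Deleted
    } else {
        Release::NotOurs
    }
}
```

Checking before deleting matters because another agent may have re-claimed the name between the owner dying and the collector's next tick; an unconditional delete would remove the new owner's record.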

Follow-ups (plan file)

Phase 2: automatic relaunch on agent death (CP picks eligible agent, forwards spec).
Promote cloudflared to a real EE workload so /logs/cloudflared-tunnel returns tunnel connection output.

🤖 Generated with Claude Code

Workloads can now declare `expose.claim_hostname: "nvidia-smi"` to grab a stable short URL (`nvidia-smi.<domain>`) directly under the zone apex, instead of the auto-labeled per-agent URL shape. DNS uniqueness is the lock: the first agent to POST the CNAME wins, and subsequent callers get a deterministic conflict error from the CP.

- Schema: `expose:` gains a mutually-exclusive `claim_hostname` field alongside `hostname_label`. `apps/README.md` documents both. `web-nvidia-smi` switches to `claim_hostname: "nvidia-smi"`.
- Wire: `DD_EXTRA_INGRESS` env var extends to `@name:port` for claim entries; `label:port` stays for auto-labeled. The agent parses into two Vecs and forwards each `/register` + `/ingress/replace` payload entry as either `{hostname_label, port}` or `{claim_hostname, port}`.
- CF: new `try_claim_cname` POSTs without upsert, bubbles up a conflict if the record already exists. `apply_ingress` accepts a claims slice, adds the ingress rules, and either POSTs a fresh CNAME (first caller) or confirms we already own it (idempotent re-apply); anything else fails.
- CP: `RegisterReq`/`IngressReplaceReq` gain the variant field; `provision_agent_access` creates a public-bypass CF Access app per claim at the zone apex. `collector::Agent` stores claims alongside extras for recovery via `/health` scrape.
- Release: the collector's orphan-GC path now iterates a dead agent's claims and calls `cf::release_claim` (checks ownership before deleting, which avoids stomping on a legitimate takeover), then `delete_access_apps_for` on each vanity domain. So when the nvidia-smi agent STONITHs, the name frees up for the next deploy.

No automatic failover / relaunch yet; that's Phase 2 in the plan. For now, when the owning agent dies, the URL 404s until a fresh deploy lands somewhere eligible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-19T20:17:27Z

DD preview ready

URL: https://pr-145.devopsdefender.com

Browser login: visit https://pr-145.devopsdefender.com — Cloudflare Access routes you
through GitHub OAuth. Membership (public) in the DD GitHub
org grants access; the DD_ACCESS_ADMIN_EMAIL is the
break-glass fallback.

Machine-to-machine: GitHub Actions workflows in the
DD_OWNER org pass their per-job OIDC JWT as
Authorization: Bearer … (audience dd-agent).

Register endpoint for a local agent: https://pr-145.devopsdefender.com/register
(CF-Access-bypassed; authenticated by ITA attestation).

posix4e · 2026-04-19T22:25:19Z

Closing — parked in #148 for later. No longer actively working on this.

posix4e had a problem deploying to staging April 19, 2026 20:15 — with GitHub Actions Failure

This was referenced Apr 19, 2026

chore(cf): reap orphan Access apps on CP startup #146

Closed

DNS-based deployments: vanity hostname claims (first-come-first-served) #148

Open

posix4e closed this Apr 19, 2026

posix4e deleted the feat/dns-claims branch April 19, 2026 22:35

posix4e wants to merge 1 commit into main from feat/dns-claims
