feat(dns): vanity hostname claims (first-come-first-served) #145

Closed
posix4e wants to merge 1 commit into main from feat/dns-claims

@posix4e posix4e commented Apr 19, 2026

Summary

Workloads can now declare expose.claim_hostname: "nvidia-smi" to grab a stable short URL (nvidia-smi.<domain>) directly under the zone apex, instead of the auto-labeled per-agent URL (<agent>-<label>.<domain>). DNS uniqueness is the lock: the first agent to POST the CNAME wins, and subsequent callers get a deterministic conflict error from the CP.
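As a rough illustration of the first-come-first-served rule, here is a minimal in-memory sketch. It is not the PR's code: `Zone`, `ClaimOutcome`, and `try_claim` are hypothetical names, and a `HashMap` stands in for the DNS zone that the real `try_claim_cname` would reach through the Cloudflare API.

```rust
use std::collections::HashMap;

/// Outcome of a successful claim attempt (hypothetical type, illustration only).
#[derive(Debug, PartialEq)]
enum ClaimOutcome {
    /// The record did not exist and was created for this caller.
    Created,
    /// The record already points at this caller (idempotent re-apply).
    AlreadyOurs,
}

/// Stand-in for the DNS zone: hostname -> CNAME target.
#[derive(Default)]
struct Zone {
    records: HashMap<String, String>,
}

impl Zone {
    /// Create the record only if it is absent; never overwrite (no upsert).
    fn try_claim(&mut self, name: &str, target: &str) -> Result<ClaimOutcome, String> {
        match self.records.get(name).cloned() {
            None => {
                self.records.insert(name.to_string(), target.to_string());
                Ok(ClaimOutcome::Created)
            }
            Some(existing) if existing == target => Ok(ClaimOutcome::AlreadyOurs),
            Some(existing) => Err(format!("{name} already owned by {existing}")),
        }
    }
}

fn main() {
    let mut zone = Zone::default();
    // First caller wins, a re-apply by the same caller is accepted,
    // and a different caller gets a conflict.
    println!("{:?}", zone.try_claim("nvidia-smi", "agent-a.example"));
    println!("{:?}", zone.try_claim("nvidia-smi", "agent-a.example"));
    println!("{:?}", zone.try_claim("nvidia-smi", "agent-b.example"));
}
```

The point of the sketch is that the existence check and the write are a single decision against one authoritative store, which is what makes the record itself usable as a lock.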

This is Phase 1 of the "DNS-based deployments" plan. Phase 2 (automatic relaunch on agent death) is not in this PR — for now, when the owning agent dies, the URL 404s until a fresh deploy lands.

What changes

  • Schema (apps/README.md, apps/web-nvidia-smi/workload.json): expose: gains a claim_hostname field that is mutually exclusive with hostname_label. nvidia-smi switches from hostname_label: "gpu" to claim_hostname: "nvidia-smi".
  • Wire: the DD_EXTRA_INGRESS env var extends to @name:port for claim entries; label:port stays for auto-labeled entries. The agent parses it into two Vecs (extra_ingress + claims); each /register and /ingress/replace payload entry is either {hostname_label, port} or {claim_hostname, port}.
  • CF: new try_claim_cname POSTs without upsert and bubbles up a conflict if the record already exists. apply_ingress accepts a claims slice and, for each claim, either POSTs a fresh CNAME or confirms ownership on re-apply; anything else fails.
  • CP: RegisterReq/IngressReplaceReq parse the new variant field; provision_agent_access creates a public-bypass CF Access app per claim at the zone apex. collector::Agent stores claims alongside extras; the collector recovers them via /health scrape after a CP restart.
  • Release: the collector's orphan-GC path now iterates a dead agent's claims, calls cf::release_claim (which checks ownership before deleting, so it doesn't stomp on a legitimate takeover), then sweeps the per-claim Access apps. So when the nvidia-smi agent STONITHs, the name frees up for the next deploy.
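For reviewers who want a concrete picture of the wire format, the sketch below parses a DD_EXTRA_INGRESS-style string into the two Vecs described in the Wire item. It is an illustration, not the agent's parser: `parse_extra_ingress` and the `Entry` alias are hypothetical, and the comma separator between entries is an assumption, since the PR does not state how entries are delimited.

```rust
/// One parsed entry: (name, port). Hypothetical alias, illustration only.
type Entry = (String, u16);

/// Split a DD_EXTRA_INGRESS-style string into (auto-labeled entries, claims).
/// Assumption: entries are comma-separated; the PR does not state the separator.
fn parse_extra_ingress(raw: &str) -> Result<(Vec<Entry>, Vec<Entry>), String> {
    let mut extras = Vec::new();
    let mut claims = Vec::new();
    for item in raw.split(',').map(str::trim).filter(|s| !s.is_empty()) {
        // Split on the last ':' so the port is always the final component.
        let (name, port) = item
            .rsplit_once(':')
            .ok_or_else(|| format!("missing ':' in {item:?}"))?;
        let port: u16 = port
            .parse()
            .map_err(|e| format!("bad port in {item:?}: {e}"))?;
        // A leading '@' marks a claim; anything else is an auto-labeled entry.
        match name.strip_prefix('@') {
            Some(claim) if !claim.is_empty() => claims.push((claim.to_string(), port)),
            None if !name.is_empty() => extras.push((name.to_string(), port)),
            _ => return Err(format!("empty name in {item:?}")),
        }
    }
    Ok((extras, claims))
}

fn main() {
    match parse_extra_ingress("gpu:8080, @nvidia-smi:9000") {
        Ok((extras, claims)) => {
            println!("extras: {extras:?}");
            println!("claims: {claims:?}");
        }
        Err(e) => eprintln!("parse error: {e}"),
    }
}
```

Keeping the two kinds in separate Vecs mirrors the payload shape in the Wire item, where each entry carries either hostname_label or claim_hostname but never both.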

Test plan

  • cargo build --release / cargo clippy -D warnings / cargo test — clean (23 tests; 4 new for claim parsing and CF helpers)
  • Preview deploy: nvidia-smi spec baked into prod config.iso; confirm nvidia-smi.devopsdefender.com resolves and serves after prod agent registers
  • Deploy a second workload spec with the same claim_hostname — expect CP 500 upstream error mentioning "already owned"
  • STONITH the prod agent (virsh destroy dd-local-prod); confirm the collector's next tick frees the claim (CNAME gone, Access app gone). Next deploy can re-claim.
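The release behaviour exercised by the last test item can be pictured with the following sketch of an ownership-checked delete. It only models the idea: the function here is a hypothetical in-memory stand-in with a `HashMap` as the zone, not the signature or implementation of the PR's `cf::release_claim`.

```rust
use std::collections::HashMap;

/// Delete `name` only if it still points at `owner_target`.
/// Returns true when a record was removed.
/// Hypothetical stand-in: the zone is a plain map, not the Cloudflare API.
fn release_claim(zone: &mut HashMap<String, String>, name: &str, owner_target: &str) -> bool {
    let owned = zone.get(name).map_or(false, |target| target == owner_target);
    if owned {
        zone.remove(name);
    }
    owned
}

fn main() {
    let mut zone = HashMap::new();
    zone.insert("nvidia-smi".to_string(), "agent-b.example".to_string());

    // The dead agent (agent-a) no longer owns the name, so nothing is deleted.
    let removed = release_claim(&mut zone, "nvidia-smi", "agent-a.example");
    println!("stale release removed record: {removed}");

    // The current owner's release does remove it, freeing the name.
    let removed = release_claim(&mut zone, "nvidia-smi", "agent-b.example");
    println!("owner release removed record: {removed}");
}
```

Checking the target before deleting is what keeps a late GC pass for a dead agent from removing a record that a newer agent has already claimed.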

Follow-ups (plan file)

  • Phase 2: automatic relaunch on agent death (CP picks eligible agent, forwards spec).
  • Promote cloudflared to a real EE workload so /logs/cloudflared-tunnel returns tunnel connection output.

🤖 Generated with Claude Code

Workloads can now declare `expose.claim_hostname: "nvidia-smi"` to
grab a stable short URL (`nvidia-smi.<domain>`) directly under the
zone apex, instead of the auto-labeled per-agent URL shape. DNS
uniqueness is the lock — the first agent to POST the CNAME wins,
subsequent callers get a deterministic conflict error from the CP.

  - Schema: `expose:` gains a mutually-exclusive `claim_hostname`
    field alongside `hostname_label`. `apps/README.md` documents both.
    `web-nvidia-smi` switches to `claim_hostname: "nvidia-smi"`.
  - Wire: `DD_EXTRA_INGRESS` env var extends to `@name:port` for
    claim entries; `label:port` stays for auto-labeled. The agent
    parses into two Vecs and forwards each `/register` +
    `/ingress/replace` payload entry as either `{hostname_label, port}`
    or `{claim_hostname, port}`.
  - CF: new `try_claim_cname` POSTs without upsert, bubbles up a
    conflict if the record already exists. `apply_ingress` accepts
    a claims slice, adds the ingress rules, and either POSTs a
    fresh CNAME (first caller) or confirms we already own it
    (idempotent re-apply) — anything else fails.
  - CP: `RegisterReq`/`IngressReplaceReq` gain the variant field;
    `provision_agent_access` creates a public-bypass CF Access app
    per claim at the zone apex. `collector::Agent` stores claims
    alongside extras for recovery via /health scrape.
  - Release: the collector's orphan-GC path now iterates a dead
    agent's claims and calls `cf::release_claim` (checks ownership
    before deleting — avoids stomping on a legitimate takeover),
    then `delete_access_apps_for` on each vanity domain. So when
    the nvidia-smi agent STONITHs, the name frees up for the next
    deploy.

No automatic failover / relaunch yet — that's Phase 2 in the plan.
For now, when the owning agent dies, the URL 404s until a fresh
deploy lands somewhere eligible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

DD preview ready

URL: https://pr-145.devopsdefender.com

Browser login: visit https://pr-145.devopsdefender.com — Cloudflare Access routes you
through GitHub OAuth. Membership (public) in the DD GitHub
org grants access; the DD_ACCESS_ADMIN_EMAIL is the
break-glass fallback.

Machine-to-machine: GitHub Actions workflows in the
DD_OWNER org pass their per-job OIDC JWT as
Authorization: Bearer … (audience dd-agent).

Register endpoint for a local agent: https://pr-145.devopsdefender.com/register
(CF-Access-bypassed; authenticated by ITA attestation).

posix4e commented Apr 19, 2026

Closing — parked in #148 for later. No longer actively working on this.

@posix4e posix4e closed this Apr 19, 2026
@posix4e posix4e deleted the feat/dns-claims branch April 19, 2026 22:35