
feat(talos): WireGuard server listener on control planes (WS3 phase 1) #2462

Merged: botantler-1[bot] merged 4 commits into main from claude/wireguard-server on Jul 4, 2026

@devantler (Contributor) commented Jul 4, 2026

🤖 Generated by the Daily AI Assistant

WS3, phase 1 of the "VPN in front of critical services" rollout. Adds the dial-in WireGuard server on the Talos control planes — the encrypted endpoint the home UniFi gateway (the client side, unifi#9, now on Crossplane) connects to.

What this adds

  • talos/control-planes/wireguard.yaml: wg0 at 10.200.0.1/24, listenPort 51820, no peers yet. The gateway's key pair only exists once the unifi tenant applies, and the tenant is sequenced behind this server landing, so peer registration follows in #2473 (WS3 phase 1b: register the UniFi gateway peer on the control-plane wg0 listener). A peer-less listener is safe: WireGuard answers no unauthenticated packet, so nothing can connect. The server private key is ${WG_SERVER_PRIVATE_KEY}, env-expanded by ksail and never committed; the prod environment secret is set and threaded through the deploy-prod composite (ci + cd) and the DR rebuild.
  • talos/control-planes/ingress-firewall.yaml: one world-reachable 51820/udp rule (WireGuard never replies to an unauthenticated packet, so an open source is safe; 5 rules total, well under the ENOBUFS threshold).

Scope — this is the LISTENER only

No tunnel comes up from this PR alone: the gateway peer (#2473) and onward routing/enforcement are follow-on. Merging is safe and unblocks the tenant side, which waits for this server to exist.

Safety

  • Config-only, not auto-applied. ksail's cluster update doesn't push machine-config changes to already-running nodes, so merging this changes nothing live. It makes every node correct-by-construction on create/recreate; to apply to the running control planes, patch once out-of-band (a no-reboot field):
    talosctl --nodes <cp-ips> patch mc --patch @wireguard.yaml (with ${WG_SERVER_PRIVATE_KEY} filled).
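
The "filled" step above can be sketched as a small shell helper. This is only an illustration under assumptions: render_wg_patch is a hypothetical name, the placeholder is substituted with sed rather than by ksail's own expansion, and the patch file is assumed to contain the literal text ${WG_SERVER_PRIVATE_KEY}. It refuses to render when the variable is empty, so an empty key never reaches the node.

```shell
#!/bin/sh
# render_wg_patch SRC: print SRC with the literal ${WG_SERVER_PRIVATE_KEY}
# placeholder replaced by the value of the environment variable.
# Hypothetical helper; ksail performs its own expansion during deploys.
render_wg_patch() {
  src="$1"
  if [ -z "${WG_SERVER_PRIVATE_KEY:-}" ]; then
    echo "WG_SERVER_PRIVATE_KEY is empty; refusing to render" >&2
    return 1
  fi
  # '|' is a safe sed delimiter because base64 keys never contain it.
  sed "s|[$]{WG_SERVER_PRIVATE_KEY}|${WG_SERVER_PRIVATE_KEY}|g" "$src"
}
```

The rendered output would be written to a temporary file and passed to the talosctl patch mc command quoted above. That file contains the private key, so it should be deleted afterwards and never committed.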

Next (needs a design decision — will follow separately)

Making the admin UIs (Coroot/Hubble/OpenCost/Longhorn/OpenBao/Headlamp/KSail) reachable only through the tunnel is non-trivial here: the cluster fronts them on a public Hetzner LB (no Cilium LB-IPAM/L2-announcement), and iptables masquerade rewrites tunnel-sourced traffic to the node CIDR — so the clean "internal VIP" and "source-CIDR NetworkPolicy" approaches don't work as-is. The enforcement approach will be a separate PR once we've picked a direction and can validate the datapath against the live tunnel.

devantler and others added 2 commits June 27, 2026 19:40
Adds a wg0 WireGuard server on the control planes so a home UniFi gateway can
dial in (pairs with devantler-tech/unifi PR #9, which configures the gateway as
the client). Home initiates, so the dynamic residential IP never matters.

- talos/control-planes/wireguard.yaml: wg0 at 10.200.0.1/24, listen 51820/udp,
  privateKey + gateway peer pubkey via ksail ${VAR} expansion (never committed).
  Same identity on all 3 control planes (repoint the peer endpoint on failover).
- ingress-firewall.yaml: one world-reachable 51820/udp rule (5 total, well under
  the ENOBUFS threshold the header warns about).

SCOPE: the LISTENER only. Routing tunnel traffic onward to cluster services or an
internal admin gateway VIP is follow-on datapath design, not solved here.
Applying needs a one-time `talosctl patch mc` (ksail does not push machine-config
to existing nodes), the key exchange documented in the file header, and the WG_*
secrets wired into the validate/deploy env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@coderabbitai Bot commented Jul 4, 2026

📝 Walkthrough

This pull request adds a new Talos machine-network configuration for a control-plane WireGuard interface on wg0, with the private key supplied from WG_SERVER_PRIVATE_KEY. It also threads that secret through the production deploy action, CI/CD workflows, cluster creation, and disaster-recovery runbook. In parallel, it updates the ingress firewall to allow world-reachable UDP port 51820 and revises the accompanying access-count comment.

Possibly related PRs

  • devantler-tech/platform#2330: Both PRs touch .github/workflows/ci.yaml and the heal-prod-on-failure deploy path that invokes the shared deploy composite.

Suggested labels: configuration, networking, deployment

Suggested reviewers: none

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly summarizes the WireGuard server listener added to Talos control planes. |
| Description check | ✅ Passed | The description matches the changeset and explains the WireGuard listener, firewall rule, and deployment plumbing. |

Comment @coderabbitai help to get the list of available commands.

@devantler devantler marked this pull request as ready for review July 4, 2026 17:08
@devantler devantler enabled auto-merge July 4, 2026 18:22
@devantler devantler disabled auto-merge July 4, 2026 18:33
@devantler devantler enabled auto-merge July 4, 2026 18:33
@devantler devantler disabled auto-merge July 4, 2026 19:05
@devantler devantler enabled auto-merge July 4, 2026 19:05
@devantler devantler added this pull request to the merge queue Jul 4, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 4, 2026
@devantler (Contributor Author) commented

🤖 Generated by the Daily AI Assistant

merge_group root-cause (run 28716734943, evicted 19:12Z) — do NOT re-queue yet; the failure is deterministic.

Talos rejected the control-plane config apply on all 3 CPs with private key is invalid: wrong key "" length: 0 / public key invalid: wrong key "" length: 0 — i.e. ${WG_SERVER_PRIVATE_KEY} and ${WG_GATEWAY_PUBLIC_KEY} expanded to empty strings. The reject was atomic (InvalidArgument at validation; nothing applied, prod unaffected) and the queue evicted the PR.

Two gaps have to close before this can merge:

  1. The secrets aren't seeded yet — this PR's own key-exchange order (generate server pair → tenant peer pubkey → tenant apply mints the gateway key → set WG_GATEWAY_PUBLIC_KEY) hasn't happened; both are maintainer-side secret writes.
  2. The deploy path can't deliver them anyway: the 🚀 Deploy to Production step passes a finite input set (sops-age-key / kube-config / talos-config / ghcr-token / hcloud-token) — there is no plumbing for the two WG env vars, so even seeded secrets would still expand empty at ksail load time. The deploy action + ci.yaml need the two inputs added (and the same for any future ${...} env-expanded Talos patch).

Suggested order: ship the plumbing commit on this branch first, then seed the secrets, then re-queue.
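
The error text above reports a key of length 0. As a rough local pre-check, one could confirm that a value decodes to the 32 bytes a WireGuard key occupies before re-queueing. The sketch below is an illustration under assumptions and not part of this PR: is_wg_key is a hypothetical name, and it checks only the base64 shape, not whether the key is the right one.

```shell
#!/bin/sh
# is_wg_key KEY: succeed only if KEY is base64 that decodes to exactly 32 bytes.
# Hypothetical helper; it checks shape only, not that the key is correct.
is_wg_key() {
  [ -n "$1" ] || return 1
  bytes=$(printf '%s' "$1" | base64 -d 2>/dev/null | wc -c)
  [ "$bytes" -eq 32 ]
}

# An empty expansion, like the one Talos rejected, fails the check.
if is_wg_key "${WG_SERVER_PRIVATE_KEY:-}"; then
  echo "WG_SERVER_PRIVATE_KEY looks like a WireGuard key"
else
  echo "WG_SERVER_PRIVATE_KEY is missing or malformed" >&2
fi
```

The same function applies to public keys, since both kinds are 32 bytes before encoding.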

@devantler devantler added this pull request to the merge queue Jul 4, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 4, 2026
…ATE_KEY through deploys

The merge_group deploy rejected the machine config on every control plane
('private key is invalid: wrong key "" length: 0'): neither WG env var
existed in CI. The gateway's public key cannot exist yet (the tenant that
mints it is paused until this server lands), so requiring the peer here
deadlocks both sides — the listener now ships without peers (safe: WireGuard
answers no unauthenticated packet) and the peer registration follows in #2473.
WG_SERVER_PRIVATE_KEY is threaded through the deploy-prod composite (ci+cd)
and dr-rebuild's cluster create; the prod environment secret is set.

@devantler (Contributor Author) commented

🤖 Generated by the Daily AI Assistant

Merge-queue eviction root-caused and fixed. The 19:10Z merge_group deploy failed in 🚀 Deploy to Prod: Talos rejected the rendered machine config on all control planes with private key is invalid: wrong key "" length: 0 — neither WG_* env var existed in CI, so ksail expanded them to empty strings.

Fixed by 028ce953:

  1. Peer split (deadlock break): the gateway's public key cannot exist until the unifi tenant's cluster_wireguard applies, and that tenant work is paused until this server lands (unifi#14), so requiring ${WG_GATEWAY_PUBLIC_KEY} here deadlocked both sides. The listener now ships peer-less (safe: WireGuard answers no unauthenticated packet); peer registration is tracked in #2473 (WS3 phase 1b: register the UniFi gateway peer on the control-plane wg0 listener).
  2. Secret + wiring: generated the server key pair, set WG_SERVER_PRIVATE_KEY in the prod environment, and threaded it through the deploy-prod composite (ci + cd) and dr-rebuild's cluster create. The private key was piped straight into the secret and never logged.

Server public key (for the tenant's cluster_wg_peer_public_key, #2473): zEJdr4QdIZBpGUPKzN7NF9Oidaw1lSstPqoQpIdNTUA=

Will re-queue once CI is green.

@botantler-1 botantler-1 Bot added this pull request to the merge queue Jul 4, 2026
Merged via the queue into main with commit 7b6c79d Jul 4, 2026
14 of 15 checks passed
@botantler-1 botantler-1 Bot deleted the claude/wireguard-server branch July 4, 2026 19:32
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jul 4, 2026
@botantler-1 Bot (Contributor) commented Jul 4, 2026

🎉 This PR is included in version 1.98.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@botantler-1 botantler-1 Bot added the released label Jul 4, 2026

@coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
talos/control-planes/wireguard.yaml (1)

45-58: 📐 Maintainability & Code Quality | 🔵 Trivial

Consider a pre-merge render/validate step for Talos patches.

The empty-key rejection that already occurred here would have been caught before hitting the merge-group deploy with a local dry-run (e.g., rendering the patch with envsubst/ksail and validating with talosctl validate, or ksail workload validate if applicable). Worth adding to CI for this and future talos/control-planes/*.yaml patches to fail fast instead of only failing atomically in the merge-group.

As per coding guidelines, "Keep manifest changes small and use YAML/schema validation before submitting a manifest PR; for files with cluster context, prefer ksail workload validate / kubectl kustomize / kubectl apply --dry-run=client as appropriate."
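
One cheap form of the fail-fast check suggested here is to list every ${VAR} placeholder in the Talos patches and confirm the environment supplies a non-empty value for each. The following is a minimal sketch under assumptions, not an existing CI step: missing_patch_vars is a hypothetical name, and it only inspects the environment; it does not run talosctl validate or any ksail command.

```shell
#!/bin/sh
# missing_patch_vars FILE...: print each ${VAR} placeholder found in the given
# files whose environment variable is unset or empty; return 1 if any is missing.
# Hypothetical helper for illustration only.
missing_patch_vars() {
  status=0
  for name in $(grep -ohE '[$][{][A-Za-z_][A-Za-z0-9_]*[}]' "$@" | tr -d '${}' | sort -u); do
    if [ -z "$(printenv "$name")" ]; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}
```

Run against talos/control-planes/*.yaml in a PR check, it would print the names of any variables the deploy environment has not supplied and fail the step, instead of letting them expand to empty strings at deploy time.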

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@talos/control-planes/wireguard.yaml` around lines 45 - 58, The Talos manifest
in wireguard.yaml should be validated before merge so empty or invalid keys are
caught earlier. Add a pre-merge render/validate step for the Talos patch
workflow that renders the config (for example via envsubst or ksail) and runs
the appropriate schema check (such as talosctl validate or ksail workload
validate) against the machine.network.interfaces/wireguard configuration. Wire
this into CI for talos/control-planes patches so failures surface during PR
checks instead of only at merge-group deploy time.

Source: Coding guidelines

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 44b363f9-01e7-4260-ab84-637f23b8a2dd

📥 Commits

Reviewing files that changed from the base of the PR and between 5a41d98 and 028ce95.

📒 Files selected for processing (6)
  • .github/actions/deploy-prod/action.yml
  • .github/workflows/cd.yaml
  • .github/workflows/ci.yaml
  • .github/workflows/dr-rebuild.yaml
  • docs/dr/runbook.md
  • talos/control-planes/wireguard.yaml
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • devantler-tech/actions (auto-detected)
  • devantler-tech/aws (auto-detected)
  • devantler-tech/reusable-workflows (auto-detected)
  • devantler-tech/ksail (auto-detected)
  • devantler-tech/ascoachingogvaner (auto-detected)
  • devantler-tech/wedding-app (auto-detected)
  • devantler-tech/agent-skills (auto-detected)
📜 Review details
⏰ Context from checks skipped due to timeout. (1)
  • GitHub Check: Analyze (python)
⚠️ CI failures not shown inline (2)

GitHub Actions: 🔀 Enable Auto-Merge / auto-merge: feat(talos): WireGuard server listener on control planes (WS3 phase 1)

Conclusion: failure

View job details

##[group]Run set +e
 set +e
 REVIEW_OUTPUT=$(gh pr review "$PR_NUMBER" --approve --repo "$REPOSITORY" 2>&1)
 REVIEW_EXIT_CODE=$?
 set -e

 if [[ $REVIEW_EXIT_CODE -eq 0 ]]; then
   echo "✅ PR #${PR_NUMBER} approved"
 elif [[ "$REVIEW_OUTPUT" == *"Can not approve your own pull request"* ]]; then
   echo "::warning::Could not approve PR #${PR_NUMBER} because GitHub does not allow self-approval. Skipping approval."
 else
   echo "::error::Failed to approve PR #${PR_NUMBER}."

GitHub Actions: 🔀 Enable Auto-Merge / 0_auto-merge.txt: feat(talos): WireGuard server listener on control planes (WS3 phase 1)

Conclusion: failure

View job details

##[group]Run set +e
 set +e
 REVIEW_OUTPUT=$(gh pr review "$PR_NUMBER" --approve --repo "$REPOSITORY" 2>&1)
 REVIEW_EXIT_CODE=$?
 set -e

 if [[ $REVIEW_EXIT_CODE -eq 0 ]]; then
   echo "✅ PR #${PR_NUMBER} approved"
 elif [[ "$REVIEW_OUTPUT" == *"Can not approve your own pull request"* ]]; then
   echo "::warning::Could not approve PR #${PR_NUMBER} because GitHub does not allow self-approval. Skipping approval."
 else
   echo "::error::Failed to approve PR #${PR_NUMBER}."
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.{yaml,yml}: Use Kustomize overlays rather than editing base resources directly; k8s/bases/ is immutable from overlays and changes should be made with patches: in provider or cluster overlays.
Keep manifest changes small and use YAML/schema validation before submitting a manifest PR; for files with cluster context, prefer ksail workload validate / kubectl kustomize / kubectl apply --dry-run=client as appropriate.

Files:

  • talos/control-planes/wireguard.yaml
🔀 Multi-repo context devantler-tech/ksail, devantler-tech/actions, devantler-tech/reusable-workflows

Linked repositories findings

devantler-tech/ksail

  • docs/src/content/docs/configuration/declarative-configuration.mdx:57 — documents that environment variables are expanded in Talos patch files under talos/control-planes/, so ${WG_SERVER_PRIVATE_KEY} / ${WG_GATEWAY_PUBLIC_KEY} are supported by the existing patch-loading path. [::devantler-tech/ksail::]
  • docs/src/content/docs/distributions/talos.mdx:199,212 — says ksail cluster create / ksail cluster update load every patch from talos/control-planes/, and the example tree already includes ingress-firewall-rules.yaml for control-plane firewall rules. [::devantler-tech/ksail::]
  • docs/src/content/docs/guides/talos-native-patches.mdx:64-97 — lists the Talos patch directories and states ksail cluster update processes them, confirming the new wireguard.yaml patch will participate in normal update/apply behavior. [::devantler-tech/ksail::]
  • pkg/fsutil/generator/talos/generator.go:399-479 and pkg/fsutil/generator/talos/patchspec.go:151-181 — Talos ingress-firewall generation already emits control-plane NetworkRuleConfig documents, so the new 51820/udp rule is an extension of an existing generated patch set rather than a new mechanism. [::devantler-tech/ksail::]
  • pkg/fsutil/configmanager/ksail/distribution.go:885-928 — Talos ingress firewall patches are generated from the config manager when patch files are absent, and the control-plane patch path is part of that pipeline. [::devantler-tech/ksail::]
  • pkg/apis/cluster/v1alpha1/options.go:199-204 and web/ui/src/generated/ksail-config.ts:536-542 - ingressFirewall is an existing cluster option that already maps to Talos NetworkRuleConfig generation. [::devantler-tech/ksail::]

devantler-tech/actions

  • No direct references found to deploy-prod, WG_SERVER_PRIVATE_KEY, or the new WireGuard/Talos inputs in this repo. [::devantler-tech/actions::]

devantler-tech/reusable-workflows

  • No direct references found to deploy-prod, WG_SERVER_PRIVATE_KEY, or the WireGuard/Talos deploy plumbing in this repo. [::devantler-tech/reusable-workflows::]
🔇 Additional comments (6)
talos/control-planes/wireguard.yaml (1)

9-17: 🩺 Stability & Availability

Confirm WG_SERVER_PRIVATE_KEY is seeded before re-queuing.

This mirrors the already-reported merge-group failure (empty key rejected by Talos). The plumbing added in this cohort (deploy-prod action, cd/ci workflows) fixes the "deploy path can't pass the var" blocker, but the secret itself still needs to exist in the prod GitHub environment before this can land successfully.

.github/actions/deploy-prod/action.yml (1)

27-32: LGTM!

Also applies to: 319-322

.github/workflows/cd.yaml (1)

47-47: LGTM!

.github/workflows/ci.yaml (1)

175-180: LGTM!

Also applies to: 231-237

.github/workflows/dr-rebuild.yaml (1)

107-110: LGTM!

docs/dr/runbook.md (1)

42-50: LGTM!

@devantler (Contributor Author) commented

🤖 Generated by the Daily AI Assistant

CodeRabbit outside-diff finding (pre-merge Talos render/validate step) — valid, captured as #2477. Adding that CI capability is out of scope for this PR (it's a new PR-event validation lane for all talos/** patches, needing placeholder key-material design), so it joins the backlog rather than growing this diff. The concrete failure it would have caught is fixed here directly (secret set + threaded; peer-less listener).
