feat(talos): WireGuard server listener on control planes (WS3 phase 1) #2462
Adds a wg0 WireGuard server on the control planes so a home UniFi gateway can dial in (pairs with devantler-tech/unifi PR #9, which configures the gateway as the client). Home initiates, so the dynamic residential IP never matters.

- talos/control-planes/wireguard.yaml: wg0 at 10.200.0.1/24, listen 51820/udp, privateKey + gateway peer pubkey via ksail ${VAR} expansion (never committed). Same identity on all 3 control planes (repoint the peer endpoint on failover).
- ingress-firewall.yaml: one world-reachable 51820/udp rule (5 total, well under the ENOBUFS threshold the header warns about).

SCOPE: the LISTENER only. Routing tunnel traffic onward to cluster services or an internal admin gateway VIP is follow-on datapath design, not solved here.

Applying needs a one-time `talosctl patch mc` (ksail does not push machine-config to existing nodes), the key exchange documented in the file header, and the WG_* secrets wired into the validate/deploy env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
📝 Walkthrough
This pull request adds a new Talos machine-network configuration for a control-plane WireGuard interface on
Possibly related issues
Possibly related PRs
Suggested labels: configuration, networking, deployment
Suggested reviewers: none
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
merge_group root-cause (run 28716734943, evicted 19:12Z) — do NOT re-queue yet; the failure is deterministic. Talos rejected the control-plane config apply on all 3 CPs with Two gaps have to close before this can merge:
Suggested order: ship the plumbing commit on this branch first, then seed the secrets, then re-queue.
…ATE_KEY through deploys
The merge_group deploy rejected the machine config on every control plane
('private key is invalid: wrong key "" length: 0'): neither WG env var
existed in CI. The gateway's public key cannot exist yet (the tenant that
mints it is paused until this server lands), so requiring the peer here
deadlocks both sides — the listener now ships without peers (safe: WireGuard
answers no unauthenticated packet) and the peer registration follows in #2473.
WG_SERVER_PRIVATE_KEY is threaded through the deploy-prod composite (ci+cd)
and dr-rebuild's cluster create; the prod environment secret is set.
Merge-queue eviction root-caused and fixed. The 19:10Z Fixed by
Server public key (for the tenant's Will re-queue once CI is green. |
🎉 This PR is included in version 1.98.0 🎉
The release is available on GitHub release
Your semantic-release bot 📦🚀
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
talos/control-planes/wireguard.yaml (1)
45-58: 📐 Maintainability & Code Quality | 🔵 Trivial
Consider a pre-merge render/validate step for Talos patches.
The empty-key rejection that already occurred here would have been caught before hitting the merge-group deploy with a local dry-run (e.g., rendering the patch with `envsubst`/`ksail` and validating with `talosctl validate`, or `ksail workload validate` if applicable). Worth adding to CI for this and future `talos/control-planes/*.yaml` patches to fail fast instead of only failing atomically in the merge-group.
As per coding guidelines, "Keep manifest changes small and use YAML/schema validation before submitting a manifest PR; for files with cluster context, prefer `ksail workload validate` / `kubectl kustomize` / `kubectl apply --dry-run=client` as appropriate."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@talos/control-planes/wireguard.yaml` around lines 45 - 58, The Talos manifest in wireguard.yaml should be validated before merge so empty or invalid keys are caught earlier. Add a pre-merge render/validate step for the Talos patch workflow that renders the config (for example via envsubst or ksail) and runs the appropriate schema check (such as talosctl validate or ksail workload validate) against the machine.network.interfaces/wireguard configuration. Wire this into CI for talos/control-planes patches so failures surface during PR checks instead of only at merge-group deploy time.
Source: Coding guidelines
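One lightweight form of the fail-fast check suggested above is to confirm, before any deploy step runs, that the key variable actually holds a well-formed WireGuard key. The sketch below is not the repository's CI code: the function name `wg_key_is_valid` is hypothetical, and it assumes GNU coreutils (`base64 -d`, `wc`). It relies only on the fact that a WireGuard key is 32 bytes, base64-encoded, which is the property Talos rejected with `wrong key "" length: 0`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when the argument is base64 that decodes
# to exactly 32 bytes (the size of a WireGuard key), 1 otherwise.
wg_key_is_valid() {
  local key="$1" decoded_len
  # An unset or empty variable is the exact case the merge-group deploy hit.
  [[ -n "$key" ]] || return 1
  # Count the decoded bytes; a failed pipeline is treated as invalid.
  decoded_len=$(printf '%s' "$key" | base64 -d 2>/dev/null | wc -c) || return 1
  [[ "$decoded_len" -eq 32 ]]
}

# Example guard for a deploy step (commented out so sourcing has no side effects):
# wg_key_is_valid "${WG_SERVER_PRIVATE_KEY:-}" || {
#   echo "WG_SERVER_PRIVATE_KEY is missing or malformed" >&2
#   exit 1
# }
```

This only checks the shape of the key; it does not replace a full render and schema validation of the machine config.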
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 44b363f9-01e7-4260-ab84-637f23b8a2dd
📒 Files selected for processing (6)
- .github/actions/deploy-prod/action.yml
- .github/workflows/cd.yaml
- .github/workflows/ci.yaml
- .github/workflows/dr-rebuild.yaml
- docs/dr/runbook.md
- talos/control-planes/wireguard.yaml
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
- devantler-tech/actions (auto-detected)
- devantler-tech/aws (auto-detected)
- devantler-tech/reusable-workflows (auto-detected)
- devantler-tech/ksail (auto-detected)
- devantler-tech/ascoachingogvaner (auto-detected)
- devantler-tech/wedding-app (auto-detected)
- devantler-tech/agent-skills (auto-detected)
📜 Review details
⏰ Context from checks skipped due to timeout. (1)
- GitHub Check: Analyze (python)
⚠️ CI failures not shown inline (2)
GitHub Actions: 🔀 Enable Auto-Merge / auto-merge: feat(talos): WireGuard server listener on control planes (WS3 phase 1)
Conclusion: failure
##[group]Run set +e
set +e
REVIEW_OUTPUT=$(gh pr review "$PR_NUMBER" --approve --repo "$REPOSITORY" 2>&1)
REVIEW_EXIT_CODE=$?
set -e

if [[ $REVIEW_EXIT_CODE -eq 0 ]]; then
  echo "✅ PR #${PR_NUMBER} approved"
elif [[ "$REVIEW_OUTPUT" == *"Can not approve your own pull request"* ]]; then
  echo "::warning::Could not approve PR #${PR_NUMBER} because GitHub does not allow self-approval. Skipping approval."
else
  echo "::error::Failed to approve PR #${PR_NUMBER}."
GitHub Actions: 🔀 Enable Auto-Merge / 0_auto-merge.txt: feat(talos): WireGuard server listener on control planes (WS3 phase 1)
Conclusion: failure
##[group]Run set +e
set +e
REVIEW_OUTPUT=$(gh pr review "$PR_NUMBER" --approve --repo "$REPOSITORY" 2>&1)
REVIEW_EXIT_CODE=$?
set -e

if [[ $REVIEW_EXIT_CODE -eq 0 ]]; then
  echo "✅ PR #${PR_NUMBER} approved"
elif [[ "$REVIEW_OUTPUT" == *"Can not approve your own pull request"* ]]; then
  echo "::warning::Could not approve PR #${PR_NUMBER} because GitHub does not allow self-approval. Skipping approval."
else
  echo "::error::Failed to approve PR #${PR_NUMBER}."
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{yaml,yml}: Use Kustomize overlays rather than editing base resources directly; `k8s/bases/` is immutable from overlays and changes should be made with `patches:` in provider or cluster overlays.
Keep manifest changes small and use YAML/schema validation before submitting a manifest PR; for files with cluster context, prefer `ksail workload validate` / `kubectl kustomize` / `kubectl apply --dry-run=client` as appropriate.
Files:
talos/control-planes/wireguard.yaml
🔀 Multi-repo context devantler-tech/ksail, devantler-tech/actions, devantler-tech/reusable-workflows
Linked repositories findings
devantler-tech/ksail
- `docs/src/content/docs/configuration/declarative-configuration.mdx:57`: documents that environment variables are expanded in Talos patch files under `talos/control-planes/`, so `${WG_SERVER_PRIVATE_KEY}` / `${WG_GATEWAY_PUBLIC_KEY}` are supported by the existing patch-loading path. [::devantler-tech/ksail::]
- `docs/src/content/docs/distributions/talos.mdx:199,212`: says `ksail cluster create` / `ksail cluster update` load every patch from `talos/control-planes/`, and the example tree already includes `ingress-firewall-rules.yaml` for control-plane firewall rules. [::devantler-tech/ksail::]
- `docs/src/content/docs/guides/talos-native-patches.mdx:64-97`: lists the Talos patch directories and states `ksail cluster update` processes them, confirming the new `wireguard.yaml` patch will participate in normal update/apply behavior. [::devantler-tech/ksail::]
- `pkg/fsutil/generator/talos/generator.go:399-479` and `pkg/fsutil/generator/talos/patchspec.go:151-181`: Talos ingress-firewall generation already emits control-plane `NetworkRuleConfig` documents, so the new `51820/udp` rule is an extension of an existing generated patch set rather than a new mechanism. [::devantler-tech/ksail::]
- `pkg/fsutil/configmanager/ksail/distribution.go:885-928`: Talos ingress firewall patches are generated from the config manager when patch files are absent, and the control-plane patch path is part of that pipeline. [::devantler-tech/ksail::]
- `pkg/apis/cluster/v1alpha1/options.go:199-204` and `web/ui/src/generated/ksail-config.ts:536-542`: `ingressFirewall` is an existing cluster option that already maps to Talos `NetworkRuleConfig` generation. [::devantler-tech/ksail::]
devantler-tech/actions
- No direct references found to `deploy-prod`, `WG_SERVER_PRIVATE_KEY`, or the new WireGuard/Talos inputs in this repo. [::devantler-tech/actions::]
devantler-tech/reusable-workflows
- No direct references found to `deploy-prod`, `WG_SERVER_PRIVATE_KEY`, or the WireGuard/Talos deploy plumbing in this repo. [::devantler-tech/reusable-workflows::]
🔇 Additional comments (6)
talos/control-planes/wireguard.yaml (1)
9-17: 🩺 Stability & Availability
Confirm `WG_SERVER_PRIVATE_KEY` is seeded before re-queuing.
This mirrors the already-reported merge-group failure (empty key rejected by Talos). The plumbing added in this cohort (deploy-prod action, cd/ci workflows) fixes the "deploy path can't pass the var" blocker, but the secret itself still needs to exist in the `prod` GitHub environment before this can land successfully.
.github/actions/deploy-prod/action.yml (1)
27-32: LGTM!
Also applies to: 319-322
.github/workflows/cd.yaml (1)
47-47: LGTM!
.github/workflows/ci.yaml (1)
175-180: LGTM!
Also applies to: 231-237
.github/workflows/dr-rebuild.yaml (1)
107-110: LGTM!
docs/dr/runbook.md (1)
42-50: LGTM!
CodeRabbit outside-diff finding (pre-merge Talos render/validate step) — valid, captured as #2477. Adding that CI capability is out of scope for this PR (it's a new PR-event validation lane for all |
WS3, phase 1 of the "VPN in front of critical services" rollout. Adds the dial-in WireGuard server on the Talos control planes — the encrypted endpoint the home UniFi gateway (the client side, unifi#9, now on Crossplane) connects to.
What this adds
- `talos/control-planes/wireguard.yaml`: `wg0` at `10.200.0.1/24`, `listenPort 51820`, no peers yet. The gateway's key pair only exists once the unifi tenant applies, and the tenant is sequenced behind this server landing, so the peer registration follows in WS3 phase 1b: register the UniFi gateway peer on the control-plane wg0 listener #2473. A peer-less listener is safe (WireGuard answers no unauthenticated packet, nothing can connect). The server private key is `${WG_SERVER_PRIVATE_KEY}`, env-expanded by ksail and never committed; the `prod` environment secret is set and threaded through the deploy-prod composite (ci+cd) and the DR rebuild.
- `talos/control-planes/ingress-firewall.yaml`: one world-reachable `51820/udp` rule (WireGuard never replies to an unauthenticated packet, so an open source is safe; 5 rules total, well under the ENOBUFS threshold).

Scope: this is the LISTENER only
No tunnel comes up from this PR alone: the gateway peer (#2473) and onward routing/enforcement are follow-on. Merging is safe and unblocks the tenant side, which waits for this server to exist.
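To make the peer-less listener under "What this adds" concrete, here is a rough sketch of rendering such a patch with the key substituted. The actual `wireguard.yaml` is not shown in this thread and ksail performs the real `${VAR}` expansion, so the function name `render_wireguard_patch` is hypothetical and the YAML layout is an assumption based on the Talos `machine.network.interfaces` WireGuard schema. Only the interface name, address, and port come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the env expansion ksail performs: prints a
# peer-less wg0 patch, refusing to render when the key is unset or empty.
render_wireguard_patch() {
  local key="${WG_SERVER_PRIVATE_KEY:-}"
  if [[ -z "$key" ]]; then
    echo "WG_SERVER_PRIVATE_KEY is empty; refusing to render" >&2
    return 1
  fi
  # Field layout is assumed, not copied from the PR's file.
  cat <<EOF
machine:
  network:
    interfaces:
      - interface: wg0
        addresses:
          - 10.200.0.1/24
        wireguard:
          privateKey: ${key}
          listenPort: 51820
          # no peers yet; the gateway peer is registered in #2473
EOF
}
```

Failing at render time rather than at apply time means an empty secret shows up as a local error instead of a rejected machine config on every control plane.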
Safety
`cluster update` doesn't push machine-config changes to already-running nodes, so merging this changes nothing live. It makes every node correct-by-construction on create/recreate; to apply to the running control planes, patch once out-of-band (a no-reboot field): `talosctl --nodes <cp-ips> patch mc --patch @wireguard.yaml` (with `${WG_SERVER_PRIVATE_KEY}` filled).

Next (needs a design decision; will follow separately)
Making the admin UIs (Coroot/Hubble/OpenCost/Longhorn/OpenBao/Headlamp/KSail) reachable only through the tunnel is non-trivial here: the cluster fronts them on a public Hetzner LB (no Cilium LB-IPAM/L2-announcement), and iptables masquerade rewrites tunnel-sourced traffic to the node CIDR — so the clean "internal VIP" and "source-CIDR NetworkPolicy" approaches don't work as-is. The enforcement approach will be a separate PR once we've picked a direction and can validate the datapath against the live tunnel.
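For the one-time out-of-band apply described under Safety, a small wrapper that prints the command by default can reduce the chance of patching the wrong nodes. This is only a sketch: the function name `patch_control_planes` and the `APPLY` switch are hypothetical, and the `talosctl` invocation is the one quoted in the Safety section, pointed at an already-rendered patch file.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the one-time patch from the Safety section.
# $1: comma-separated control-plane IPs, $2: path to the rendered patch file.
# Prints the command unless APPLY=1 is set, in which case it runs it.
patch_control_planes() {
  local nodes="$1" patch_file="$2"
  local cmd=(talosctl --nodes "$nodes" patch mc --patch "@${patch_file}")
  if [[ "${APPLY:-0}" == "1" ]]; then
    "${cmd[@]}"
  else
    # Dry run: show the command without touching the cluster.
    printf '%s\n' "${cmd[*]}"
  fi
}

# Example (dry run): patch_control_planes "<cp-ips>" rendered-wireguard.yaml
```

Keeping the default as a dry run suits a step that is run by hand once and then superseded by create/recreate.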