Possible connectivity disruption on agent restart with WireGuard + native routing #31979
Open
Labels
area/encryption
Impacts encryption support such as IPSec, WireGuard, or kTLS.
feature/wireguard
Relates to Cilium's Wireguard feature
kind/bug
This is a bug in the Cilium logic.
sig/agent
Cilium agent related.
What happened?
A temporary connectivity disruption can occur on agent restart when Cilium runs in native routing mode with WireGuard encryption enabled. On reception of the node event for a given remote node, the agent recreates that peer's list of Allowed IPs from scratch, which can remove entries for valid endpoints that have not yet been rediscovered through the CiliumEndpoint CRD or the corresponding kvstore representation. Until those entries are added back, WireGuard does not carry traffic for the affected endpoints. The current tunnel mode implementation is not affected: there we encrypt the encapsulated traffic, whose source and destination addresses are always Node Internal IPs, and those are added as Allowed IPs immediately.
A possible solution would be to restore the list of Allowed IPs for each peer from the WireGuard state after agent restart, and then run a GC pass to remove stale entries once ipcache synchronization has completed. Here, ipcache synchronization should account for CiliumEndpoint synchronization (if the CiliumEndpoint CRD is enabled), kvstore synchronization (if kvstore mode is enabled), and clustermesh synchronization (if clustermesh is enabled).
Cilium Version
Tested on tip of main, but likely all versions are affected