Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wireguard: Add fallback to userspace implementation #17451

Merged
merged 9 commits into from Oct 25, 2021

Conversation

gandro
Copy link
Member

@gandro gandro commented Sep 22, 2021

WireGuard on Linux usually uses the kernel-mode implementation, as it is
faster and does not rely on a userspace process running to process
packets.

However, some kernels do not ship with WireGuard support, either because
they are too old, or because WireGuard support has explicitly compiled
out. For such cases, we can fall back to the WireGuard user-space
implementation based on a TUN device. While this is not recommended in
production, it can be useful for testing or in cases where some machines
cannot upgrade to a more recent kernel.

The userspace implementation requires a daemon which encrypts and
decrypts the packets sent over the TUN device. In Cilium, this task is
performed by the cilium-agent process which calls into the wireguard-go
library. This has the consequence that if cilium-agent is down,
connectivity between WireGuard peers, and therefore connectivity between
Cilium-managed pods will also be unavailable. Thus, running WireGuard in
userspace mode is not recommeneded for production workloads.

The fallback to userspace mode must be explicitly enabled via the
enable-wireguard-userspace-fallback option.

Review per commit. The first few commits contain drive-by cleanups.

@gandro gandro added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. labels Sep 22, 2021
@gandro gandro requested a review from brb September 22, 2021 11:44
@gandro gandro marked this pull request as ready for review September 22, 2021 11:44
@gandro gandro requested a review from a team September 22, 2021 11:44
@gandro gandro requested review from a team as code owners September 22, 2021 11:44
@gandro gandro requested a review from qmonnet September 22, 2021 11:44
@gandro gandro requested a review from rolinh September 22, 2021 11:44
@gandro gandro force-pushed the pr/gandro/wireguard-user-mode branch 2 times, most recently from e88a950 to b7370f9 Compare September 22, 2021 11:54
Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@errordeveloper
Copy link
Contributor

I wonder if it might be possible to unit test the functions that setup the server, you could run two servers on one host and simulate traffic between them, what do you think? That would only cover user-space server functionality, and won't cover more desirable setup where one node uses the kernel implementation and another relies on user-space fallback, but at least we could have some tests for this new functionality...

@gandro
Copy link
Member Author

gandro commented Sep 22, 2021

I wonder if it might be possible to unit test the functions that setup the server, you could run two servers on one host and simulate traffic between them, what do you think? That would only cover user-space server functionality, and won't cover more desirable setup where one node uses the kernel implementation and another relies on user-space fallback, but at least we could have some tests for this new functionality...

The reason why I actually started this work is to add WireGuard to the Clustermesh CI 3.0, which runs on VMs which don't have WireGuard support. That CI test will cover the functionality (and more) added by this PR much better than unit tests. Do you think that's a suitable alternative?

With regards to compatibility between userspace-mode and kernel-mode: This is something guaranteed by upstream (wireguard-go is developed by same people who are responsible for the kernel-mode). Do you have a particular concern in mind here? After all, there are no parameters for us to configure which could break this compatibility.

pkg/wireguard/agent/agent.go Outdated Show resolved Hide resolved
pkg/option/config.go Outdated Show resolved Hide resolved
pkg/wireguard/agent/agent.go Show resolved Hide resolved
pkg/wireguard/agent/agent.go Outdated Show resolved Hide resolved
Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly related to this PR but something that came to my mind while reviewing:

In case the wireguard kernel implementation is built as a module (which seems to be the default configuration and is the case on the few systems I checked): do we check whether the module is loaded before attempting to set up wireguard or would it make sense to add such a check (similar to the way we check for iptables modules being loaded)? Or is it guaranteed to be autoloaded once we create the wgX interface?

@gandro gandro requested a review from a team as a code owner September 29, 2021 11:58
@gandro gandro requested a review from a team September 29, 2021 11:58
@gandro gandro force-pushed the pr/gandro/wireguard-user-mode branch 5 times, most recently from 5807ece to 5f8c332 Compare September 29, 2021 15:16
Copy link
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! :shipit:

@gandro
Copy link
Member Author

gandro commented Oct 5, 2021

/test

Regarding my last push: Rebased on master before running CI.

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sChaosTest Connectivity demo application Endpoint can still connect while Cilium is not running

Failure Output

FAIL: DNS entry is not ready after timeout

Edit:

Edit 2: The net-next has been updated and some other workflows disabled. Rerunning tests.

@gandro
Copy link
Member Author

gandro commented Oct 18, 2021

/test

Init used to be called after the WireGuard tunnel IP was allocatd,
however this is not the case anymore since commit 0ecab37. Init is
still called after the watcher initialization, but it does not require
an IP address anymore.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
It should be propagated to the caller, as we exit the function when
rp_filter cannot be disabled.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
To aid in troubleshooting when Init() fails.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
The operator does not have any functionality that conditionally depends
on WireGuard anymore.

Fixes: 0ecab37 ("wireguard: Remove operator")

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Previously, the Init() function did not lock the WireGuard agent under
the assumption that there can be no concurrent calls into the agent.
That assumption however is false, both Close() (via signal handler
cleanup) and UpdatePeer() (via node manager) can be called concurrently
while Init() is being invoked.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/wireguard-user-mode branch from b6a1fea to 290b847 Compare October 20, 2021 13:35
This commit adds the packages used for the WireGuard userspace
implementation added in a subsequent commit.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
WireGuard on Linux usually uses the kernel-mode implementation, as it is
faster and does not rely on a userspace process running to process
packets.

However, some kernels do not ship with WireGuard support, either because
they are too old, or because WireGuard support has explicitly compiled
out. For such cases, we can fall back to the WireGuard user-space
implementation based on a TUN device. While this is not recommended in
production, it can be useful for testing or in cases where some machines
cannot upgrade to a more recent kernel.

The userspace implementation requires a daemon which encrypts and
decrypts the packets sent over the TUN device. In Cilium, this task is
performed by the cilium-agent process which calls into the wireguard-go
library. This has the consequence that if cilium-agent is down,
connectivity between WireGuard peers, and therefore connectivity between
Cilium-managed pods will also be unavailable. Thus, running WireGuard in
userspace mode is not recommeneded for production workloads.

The fallback to userspace mode must be explicitly enabled via the
`enable-wireguard-userspace-fallback` option.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Mentioning the caveats that apply.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
To enable the fallback to user-space mode implemented in the previous
commits.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/wireguard-user-mode branch from 290b847 to 718c4f7 Compare October 20, 2021 13:45
@gandro
Copy link
Member Author

gandro commented Oct 20, 2021

/test

Had to rebase due to conflicts.

Job 'Cilium-PR-K8s-1.21-kernel-4.9' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sConformance Portmap Chaining Check connectivity-check compliance with portmap chaining

Failure Output

FAIL: unable to retrieve all nodes with 'kubectl get nodes -o json | jq '.items | length'': Exitcode: -1 

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.21-kernel-4.9 so I can create a new GitHub issue to track it.

Edit: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/1659/testReport/junit/Suite-k8s-1/21/K8sConformance_Portmap_Chaining_Check_connectivity_check_compliance_with_portmap_chaining/

This looks like a failiure with K8s itself, not Cilium. kube-controller-manager failed with 2021-10-20T15:31:35.000867000Z F1020 15:31:35.000592 1 controllermanager.go:284] leaderelection lost

@gandro
Copy link
Member Author

gandro commented Oct 25, 2021

test-1.21-4.9

Edit: Given the above flake is unrelated to WireGuard and I would like to get this PR in for the 1.11 feature freeze, I will restart the failed test.

@gandro
Copy link
Member Author

gandro commented Oct 25, 2021

Marking this ready-to-merge. It has gotten a lot of reviews and is blocking further CI work (#17453). Plus I also would like to get this into Cilium 1.11 before the feature freeze.

@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 25, 2021
@kkourt kkourt merged commit 19cd3f3 into cilium:master Oct 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants