Network performance with Wireguard extremely poor #28413

Open
roarvroom opened this issue Oct 5, 2023 · 28 comments
Labels: area/encryption, feature/wireguard, kind/bug, kind/community-report, kind/performance, need-more-info, needs/triage, pinned, sig/datapath

Comments

@roarvroom

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Environment:

  • EKS cluster in AWS (1.27)
  • EKS node interfaces come with MTU set to 9001 on all interfaces
  • Cilium version: 1.14.2, wireguard encryption enabled

Cilium helm values:

cni:
  chainingMode: aws-cni
  exclusive: false
enableIPv4Masquerade: false
encryption:
  enabled: true
  nodeEncryption: false
  type: wireguard
endpointRoutes:
  enabled: true
hubble:
  listenAddress: :4244
  metrics:
    dashboards:
      enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
l7proxy: false
operator:
  dashboards:
    enabled: true
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
prometheus:
  dashboards:
    enabled: true
  enabled: true
  serviceMonitor:
    enabled: true
routingMode: native
tunnel: disabled

Issue:

After installing Cilium with the above configuration, I observed very poor network performance (around 80Mb/s) when running an iperf test between two pods scheduled on different worker nodes, even though the network capacity is 4.6Gbit/s (AWS instances t3a.2xlarge).

Upon further investigation, I found that even after setting the MTU to 8000 (an arbitrary value below 9001) in the Cilium configmap and restarting Cilium, the MTU for the container interfaces remained at 9001. Meanwhile, the MTU for the cilium_wg0 interface was correctly set to 7920, and the cilium_net@cilium_host and cilium_host@cilium_net interfaces were set to 8000.

After manually adjusting the MTU inside the containers, I found that setting the MTU to 8928 or lower resulted in a significant improvement in network performance, achieving around 3.5 Gbit/s.

root@ubuntu-pod-2:/# ip link set eth0 mtu 8928; iperf -c 100.64.250.163
------------------------------------------------------------
Client connecting to 100.64.250.163, TCP port 5001
TCP window size: 1.49 MByte (default)
------------------------------------------------------------
[  3] local 100.64.250.124 port 40584 connected with 100.64.250.163 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  4.22 GBytes  3.62 Gbits/sec
root@ubuntu-pod-2:/# ip link set eth0 mtu 8929; iperf -c 100.64.250.163
------------------------------------------------------------
Client connecting to 100.64.250.163, TCP port 5001
TCP window size: 1.65 MByte (default)
------------------------------------------------------------
[  3] local 100.64.250.124 port 56988 connected with 100.64.250.163 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  97.0 MBytes  80.1 Mbits/sec
8: cilium_wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 7920 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/none
9: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 8000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 4a:fe:67:61:b8:28 brd ff:ff:ff:ff:ff:ff
10: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 8000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether ba:a9:25:fb:36:22 brd ff:ff:ff:ff:ff:ff
...
17: eni4b916ae9eaf@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether de:f2:60:83:3c:bb brd ff:ff:ff:ff:ff:ff link-netns cni-749f0b8d-684d-40b8-7ba0-97947f237abb
19: enia83678c0670@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether fe:83:83:bf:d2:30 brd ff:ff:ff:ff:ff:ff link-netns cni-8b3d612a-4bff-43bc-baea-772c40d3a614
20: enid820b7170dd@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 7a:74:45:3f:84:86 brd ff:ff:ff:ff:ff:ff link-netns cni-74e6c8e2-e69a-452e-b834-4019cfc0dcde
21: eni957c2224e78@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 92:05:4d:9f:57:86 brd ff:ff:ff:ff:ff:ff link-netns cni-5bad9d43-c60d-f297-111a-5bfbb0733b52
22: eni74deebfb075@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 92:74:f4:c1:f7:01 brd ff:ff:ff:ff:ff:ff link-netns cni-153f153e-375d-dacf-5243-4f9cf5ce98c8

Expected Behavior:

When adjusting the MTU in the Cilium configuration, it should be reflected on all relevant interfaces, including the container interfaces. Additionally, with the default MTU settings and WireGuard encryption enabled, the network performance should not be as poor as observed.

Steps to Reproduce:

  • Set up an EKS cluster in AWS with node interfaces having an MTU of 9001.
  • Install Cilium using the provided Helm chart values.
  • Run an iperf test between two pods on different nodes and observe the poor network performance.
  • Adjust the MTU in the Cilium configmap, restart Cilium, and observe that the MTU for the container interfaces remains at 9001 (a sketch of this step follows the list).
  • Manually adjust the MTU inside the containers and observe the improvement in network performance once the MTU is set to 8928 or lower.
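
For step 4, a minimal sketch of what adjusting the MTU in the Cilium configuration can look like. The exact commands aren't included above, so the "mtu" configmap key, the kube-system namespace, and the verification command are assumptions for illustration:

# Sketch only; the "mtu" key and the kube-system namespace are assumptions.
kubectl -n kube-system patch configmap cilium-config --type merge -p '{"data":{"mtu":"8000"}}'
kubectl -n kube-system rollout restart daemonset/cilium
# Afterwards, check the resulting interface MTU from the agent:
kubectl -n kube-system exec ds/cilium -- ip -o link show cilium_wg0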

Cilium Version

cilium-cli: v0.15.7 compiled with go1.21.0 on darwin/arm64
cilium image (default): v1.14.1
cilium image (stable): v1.14.2
cilium image (running): 1.14.2

Kernel Version

5.10.184-175.749.amzn2.x86_64

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"clean", BuildDate:"2023-02-22T13:32:21Z", GoVersion:"go1.20.1", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.4-eks-2d98532", GitCommit:"3d90c097c72493c2f1a9dd641e4a22d24d15be68", GitTreeState:"clean", BuildDate:"2023-07-28T16:51:44Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

cilium-sysdump-20231005-125744.zip

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
roarvroom added the kind/bug, kind/community-report, and needs/triage labels on Oct 5, 2023
tklauser added the kind/performance and area/encryption labels on Oct 5, 2023
@mmerickel

Yeah, this is a show stopper I think? I configured my cluster as in #28387 and things work great until you try to send larger amounts of data, at which point throughput falls on its face. MTUs are identical to the OP's: 9001 on everything except 8921 on cilium_wg0.

Sending packets with MSS 8852:

root@something-2:/# iperf3 -c 2600:1f16:40:5306:969::9 -M 8852
Connecting to host 2600:1f16:40:5306:969::9, port 5201
[  5] local 2600:1f16:40:5306:4d9::18 port 53884 connected to 2600:1f16:40:5306:969::9 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   240 MBytes  2.01 Gbits/sec    6   1.46 MBytes
[  5]   1.00-2.00   sec   310 MBytes  2.59 Gbits/sec   11   1.48 MBytes

Sending packets with MSS 8853:

root@something-2:/# iperf3 -c 2600:1f16:40:5306:969::9 -M 8853
Connecting to host 2600:1f16:40:5306:969::9, port 5201
[  5] local 2600:1f16:40:5306:4d9::18 port 46234 connected to 2600:1f16:40:5306:969::9 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   320 KBytes  2.61 Mbits/sec    3   8.63 KBytes
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   8.63 KBytes

@archoversight

archoversight commented Oct 6, 2023

I am seeing the same issues as @mmerickel.

IPv6 routers do not fragment packets, and without working path MTU discovery there is no way for the connection to find a valid MTU automatically, so connectivity just drops to the floor.

@mmerickel

We did test configuring enable-pmtu-discovery: "true", and there is no icmp6 traffic over cilium_wg0, so as @archoversight said, I'm led to believe that this feature is not working for this configuration of IPv6, WireGuard, or both.

Also, it's worth noting that things work great if you turn off WireGuard encryption! :-)
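
For reference, a sketch of the kind of check behind that observation (the exact commands aren't in the comment); with working path MTU discovery you would expect ICMPv6 "packet too big" messages to show up here while re-running the iperf3 test:

# Run on a node while the pod-to-pod iperf3 test is running.
tcpdump -ni cilium_wg0 icmp6
# ICMPv6 type 2 (Packet Too Big) is what IPv6 path MTU discovery relies on.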

@PhilipSchmid
Contributor

PhilipSchmid commented Oct 6, 2023

Hi folks,
I just tried to reproduce this with a cluster that I still had running on EC2 instances. "Unfortunately", I didn't see any noticeable performance impact. However, the config is also quite different from yours (VXLAN, no CNI chaining, no EKS).

  • AWS EC2 instances: m5.large
  • Cilium version: 1.14.2
  • OS Kernel: 5.15.0-1039-aws
  • OS: Ubuntu 20.04
  • K8s: v1.24.15

Cilium values:

encryption:
  enabled: true
  type: wireguard
  nodeEncryption: false
tunnel: "vxlan"

Without Wireguard:

root@iperf3-deployment-5d6946cbd9-r7k2r:~# iperf3 -c 10.2.2.98 -p 12345
Connecting to host 10.2.2.98, port 12345
[  4] local 10.2.3.7 port 39422 connected to 10.2.2.98 port 12345
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   591 MBytes  4.95 Gbits/sec    0   2.14 MBytes
[  4]   1.00-2.00   sec   589 MBytes  4.94 Gbits/sec    0   2.14 MBytes
[  4]   2.00-3.00   sec   589 MBytes  4.94 Gbits/sec   49   2.24 MBytes
[  4]   3.00-4.00   sec   588 MBytes  4.94 Gbits/sec    0   2.26 MBytes
[  4]   4.00-5.00   sec   589 MBytes  4.94 Gbits/sec    0   2.33 MBytes
[  4]   5.00-6.00   sec   589 MBytes  4.94 Gbits/sec    0   2.33 MBytes
[  4]   6.00-7.00   sec   589 MBytes  4.94 Gbits/sec    0   2.61 MBytes
[  4]   7.00-8.00   sec   589 MBytes  4.94 Gbits/sec    0   2.66 MBytes
[  4]   8.00-9.00   sec   589 MBytes  4.94 Gbits/sec    0   2.66 MBytes
[  4]   9.00-10.00  sec   589 MBytes  4.94 Gbits/sec    0   2.66 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  5.75 GBytes  4.94 Gbits/sec   49             sender
[  4]   0.00-10.00  sec  5.75 GBytes  4.94 Gbits/sec                  receiver

I enabled WireGuard, restarted all Cilium and iperf3 pods, ensured the iperf3 pods were running on different nodes, and reran the tests:

With Wireguard:

root@iperf3-deployment-5c59548b57-76bcr:~# iperf3 -c 10.2.3.57 -p 12345
Connecting to host 10.2.3.57, port 12345
[  4] local 10.2.2.103 port 53710 connected to 10.2.3.57 port 12345
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   591 MBytes  4.95 Gbits/sec    7   1.91 MBytes
[  4]   1.00-2.00   sec   589 MBytes  4.94 Gbits/sec    0   1.91 MBytes
[  4]   2.00-3.00   sec   589 MBytes  4.94 Gbits/sec    0   1.95 MBytes
[  4]   3.00-4.00   sec   588 MBytes  4.93 Gbits/sec    0   1.97 MBytes
[  4]   4.00-5.00   sec   589 MBytes  4.94 Gbits/sec   56   2.24 MBytes
[  4]   5.00-6.00   sec   589 MBytes  4.94 Gbits/sec    0   2.24 MBytes
[  4]   6.00-7.00   sec   588 MBytes  4.93 Gbits/sec    0   2.24 MBytes
[  4]   7.00-8.00   sec   589 MBytes  4.94 Gbits/sec    0   2.24 MBytes
[  4]   8.00-9.00   sec   586 MBytes  4.92 Gbits/sec   15   2.25 MBytes
[  4]   9.00-10.00  sec   589 MBytes  4.94 Gbits/sec    0   2.26 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  5.75 GBytes  4.94 Gbits/sec   78             sender
[  4]   0.00-10.00  sec  5.74 GBytes  4.93 Gbits/sec                  receiver

I'll try to reproduce it again with a proper direct-routing, EKS VPC-CNI chained setup soon.

@roarvroom
Author

@PhilipSchmid, it's hard to believe your test is set up right. If I create two t3a.xlarge instances with a recent Ubuntu LTS on AWS, the performance drop is still noticeable (no Cilium, just two nodes talking WireGuard to each other). I don't have the numbers here, but from what I remember the drop was from 4.6 Gbit/s to something like 3.5 Gbit/s. That's still orders of magnitude better than what I'm seeing :)

@PhilipSchmid
Contributor

@roarvroom Maybe, but WireGuard is definitely activated:

root@ip-10-1-2-208:/home/cilium# cilium encrypt status
Encryption: Wireguard
Interface: cilium_wg0
	Public key: ozRtQnV2nbHmn+FVKCC6fueHLMlaYB41/5qstTww1Vc=
	Number of peers: 5

... and the pods are definitely not running on the same node. If they were, it would look like this:

root@iperf3-deployment-5c59548b57-4swrv:~# iperf3 -c 10.2.3.57 -p 12345
Connecting to host 10.2.3.57, port 12345
[  4] local 10.2.3.242 port 57094 connected to 10.2.3.57 port 12345
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  3.61 GBytes  31.0 Gbits/sec  148   2.44 MBytes
[  4]   1.00-2.00   sec  3.52 GBytes  30.3 Gbits/sec  648   1.83 MBytes
[  4]   2.00-3.00   sec  3.60 GBytes  31.0 Gbits/sec  247   1.83 MBytes
[  4]   3.00-4.00   sec  3.63 GBytes  31.2 Gbits/sec  111   1.83 MBytes
[  4]   4.00-5.00   sec  3.65 GBytes  31.4 Gbits/sec    0   1.83 MBytes
[  4]   5.00-6.00   sec  3.59 GBytes  30.8 Gbits/sec  227   1.85 MBytes
[  4]   6.00-7.00   sec  3.58 GBytes  30.8 Gbits/sec    0   1.85 MBytes
[  4]   7.00-8.00   sec  3.54 GBytes  30.4 Gbits/sec   68   1.85 MBytes
[  4]   8.00-9.00   sec  3.55 GBytes  30.5 Gbits/sec   25   1.85 MBytes
[  4]   9.00-10.00  sec  3.58 GBytes  30.7 Gbits/sec    0   1.85 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  35.9 GBytes  30.8 Gbits/sec  1474             sender
[  4]   0.00-10.00  sec  35.9 GBytes  30.8 Gbits/sec                  receiver

Hence, I'm still looking into it.

@PhilipSchmid
Contributor

PhilipSchmid commented Oct 6, 2023

Found something: As soon as I enable the beta Node-to-Node encryption as well (rollout restart all Cilium agents), the performance drops significantly.

encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true      # <- this
root@iperf3-deployment-5c59548b57-w7mmb:~# iperf3 -c 10.2.3.57 -p 12345
Connecting to host 10.2.3.57, port 12345
[  4] local 10.2.1.144 port 45620 connected to 10.2.3.57 port 12345
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  17.7 MBytes   148 Mbits/sec  263    113 KBytes
[  4]   1.00-2.00   sec  9.06 MBytes  76.0 Mbits/sec  187   43.3 KBytes
[  4]   2.00-3.00   sec  9.35 MBytes  78.5 Mbits/sec  131    104 KBytes
[  4]   3.00-4.00   sec  9.06 MBytes  76.0 Mbits/sec  179   26.0 KBytes
[  4]   4.00-5.00   sec  8.47 MBytes  71.0 Mbits/sec  112   60.6 KBytes
[  4]   5.00-6.00   sec  7.87 MBytes  66.1 Mbits/sec  101   60.6 KBytes
[  4]   6.00-7.00   sec  9.83 MBytes  82.4 Mbits/sec  163    208 KBytes
[  4]   7.00-8.00   sec  8.23 MBytes  69.0 Mbits/sec  162   52.0 KBytes
[  4]   8.00-9.00   sec  8.47 MBytes  71.0 Mbits/sec  122    173 KBytes
[  4]   9.00-10.00  sec  8.88 MBytes  74.5 Mbits/sec  151   86.6 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  96.9 MBytes  81.3 Mbits/sec  1571             sender
[  4]   0.00-10.00  sec  94.8 MBytes  79.6 Mbits/sec                  receiver

iperf Done.

Setting encryption.nodeEncryption=false again and restarting all nodes and Cilium pods brings the performance back to healthy values:

root@iperf3-deployment-d8f65bb65-fkqdd:~# iperf3 -c 10.2.3.241 -p 12345
Connecting to host 10.2.3.241, port 12345
[  4] local 10.2.1.152 port 53434 connected to 10.2.3.241 port 12345
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   592 MBytes  4.97 Gbits/sec    0   1.92 MBytes
[  4]   1.00-2.00   sec   586 MBytes  4.92 Gbits/sec    0   2.24 MBytes
[  4]   2.00-3.00   sec   589 MBytes  4.94 Gbits/sec   54   2.40 MBytes
[  4]   3.00-4.00   sec   589 MBytes  4.94 Gbits/sec    0   2.40 MBytes
[  4]   4.00-5.00   sec   582 MBytes  4.89 Gbits/sec    0   2.44 MBytes
[  4]   5.00-6.00   sec   589 MBytes  4.94 Gbits/sec    0   2.44 MBytes
[  4]   6.00-7.00   sec   585 MBytes  4.91 Gbits/sec    0   2.80 MBytes
[  4]   7.00-8.00   sec   589 MBytes  4.94 Gbits/sec    0   2.93 MBytes
[  4]   8.00-9.00   sec   588 MBytes  4.93 Gbits/sec    0   2.93 MBytes
[  4]   9.00-10.00  sec   589 MBytes  4.94 Gbits/sec    0   2.93 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  5.74 GBytes  4.93 Gbits/sec   54             sender
[  4]   0.00-10.00  sec  5.74 GBytes  4.93 Gbits/sec                  receiver

I guess in my case it only shows up after enabling node-to-node encryption because I'm running with a VXLAN overlay while you're running with direct routing. So don't get me wrong: I don't think it's directly related to node-to-node encryption, but rather to the similarities between running "WireGuard & direct routing" and "WireGuard & VXLAN & node-to-node encryption".

@michaltorbinski

michaltorbinski commented Oct 6, 2023

This issue (or a similar one) can also be seen on a Kind cluster.

Environment:
Host: Ubuntu 22.04
Kind: kind version 0.20.0
Worker OS: Debian GNU/Linux 11 (bullseye)
Kernel: 6.2.0-34-generic
Kubernetes: v1.27.3

Cilium without wireguard:

helm upgrade --install cilium cilium/cilium --version 1.14.2 \
   --namespace kube-system \
   --set image.pullPolicy=IfNotPresent \
   --set ipam.mode=kubernetes \
   --set bpf.masquerade=true \
   --set hubble.relay.enabled=true \
   --set hubble.ui.enabled=true \
   --set kubeProxyReplacement=true \
   --set k8sServiceHost=kind-control-plane \
   --set k8sServicePort=6443

Iperf:

Client on 172.18.0.3:  Connecting to host iperf3-server, port 5201
Client on 172.18.0.3:  [  5] local 10.244.3.75 port 34776 connected to 10.96.235.239 port 5201
Client on 172.18.0.3:  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
Client on 172.18.0.3:  [  5]   0.00-1.00   sec  2.13 GBytes  18.3 Gbits/sec    0    653 KBytes       
Client on 172.18.0.3:  [  5]   1.00-2.00   sec  2.18 GBytes  18.7 Gbits/sec    0    653 KBytes       
Client on 172.18.0.3:  [  5]   2.00-3.00   sec  2.09 GBytes  17.9 Gbits/sec    0    694 KBytes       
Client on 172.18.0.3:  [  5]   3.00-4.00   sec  2.09 GBytes  18.0 Gbits/sec    0   1003 KBytes       
Client on 172.18.0.3:  [  5]   4.00-5.00   sec  2.08 GBytes  17.9 Gbits/sec    0   1003 KBytes       
Client on 172.18.0.3:  [  5]   5.00-6.00   sec  2.05 GBytes  17.6 Gbits/sec    0   1003 KBytes       
Client on 172.18.0.3:  [  5]   6.00-7.00   sec  2.13 GBytes  18.3 Gbits/sec    0   1003 KBytes       
Client on 172.18.0.3:  [  5]   7.00-8.00   sec  2.15 GBytes  18.5 Gbits/sec    0   1003 KBytes       
Client on 172.18.0.3:  [  5]   8.00-9.00   sec  2.13 GBytes  18.3 Gbits/sec    0   1003 KBytes       
Client on 172.18.0.3:  [  5]   9.00-10.00  sec  2.07 GBytes  17.8 Gbits/sec    0   1003 KBytes       
Client on 172.18.0.3:  - - - - - - - - - - - - - - - - - - - - - - - - -
Client on 172.18.0.3:  [ ID] Interval           Transfer     Bitrate         Retr
Client on 172.18.0.3:  [  5]   0.00-10.00  sec  21.1 GBytes  18.1 Gbits/sec    0             sender
Client on 172.18.0.3:  [  5]   0.00-10.04  sec  21.1 GBytes  18.0 Gbits/sec                  receiver
Client on 172.18.0.3:  
Client on 172.18.0.3:  iperf Done.

Wireguard enabled and cilium restarted:

Client on 172.18.0.3:  Connecting to host iperf3-server, port 5201
Client on 172.18.0.3:  [  5] local 10.244.3.27 port 56594 connected to 10.96.227.70 port 5201
Client on 172.18.0.3:  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
Client on 172.18.0.3:  [  5]   0.00-1.00   sec   242 MBytes  2.03 Gbits/sec  1639    765 KBytes       
Client on 172.18.0.3:  [  5]   1.00-2.00   sec   245 MBytes  2.05 Gbits/sec  213    915 KBytes       
Client on 172.18.0.3:  [  5]   2.00-3.00   sec   245 MBytes  2.05 Gbits/sec  175   1.00 MBytes       
Client on 172.18.0.3:  [  5]   3.00-4.00   sec   248 MBytes  2.08 Gbits/sec   64    802 KBytes       
Client on 172.18.0.3:  [  5]   4.00-5.00   sec   241 MBytes  2.02 Gbits/sec   96    973 KBytes       
Client on 172.18.0.3:  [  5]   5.00-6.00   sec   244 MBytes  2.04 Gbits/sec   74   1.08 MBytes       
Client on 172.18.0.3:  [  5]   6.00-7.00   sec   244 MBytes  2.04 Gbits/sec   49    928 KBytes       
Client on 172.18.0.3:  [  5]   7.00-8.00   sec   245 MBytes  2.05 Gbits/sec   27   1.05 MBytes       
Client on 172.18.0.3:  [  5]   8.00-9.00   sec   242 MBytes  2.03 Gbits/sec   37   1.15 MBytes       
Client on 172.18.0.3:  [  5]   9.00-10.00  sec   246 MBytes  2.07 Gbits/sec    2   1.26 MBytes       
Client on 172.18.0.3:  - - - - - - - - - - - - - - - - - - - - - - - - -
Client on 172.18.0.3:  [ ID] Interval           Transfer     Bitrate         Retr
Client on 172.18.0.3:  [  5]   0.00-10.00  sec  2.38 GBytes  2.05 Gbits/sec  2376             sender
Client on 172.18.0.3:  [  5]   0.00-10.05  sec  2.38 GBytes  2.04 Gbits/sec                  receiver
Client on 172.18.0.3:  
Client on 172.18.0.3:  iperf Done.

I also enabled node encryption at the end but there was no significant change.

Cilium interfaces on the worker nodes:

2: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether b6:66:34:24:86:56 brd ff:ff:ff:ff:ff:ff
3: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 56:e3:1e:6d:d5:f2 brd ff:ff:ff:ff:ff:ff
    inet 10.244.3.97/32 scope global cilium_host
       valid_lft forever preferred_lft forever
4: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 2a:be:f8:dd:01:b0 brd ff:ff:ff:ff:ff:ff
...
19: cilium_wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
23: eth0@if24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.18.0.3/16 brd 172.18.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fc00:f853:ccd:e793::3/64 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe12:3/64 scope link 
       valid_lft forever preferred_lft forever
28: lxc_health@if27: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 7a:cb:67:36:97:04 brd ff:ff:ff:ff:ff:ff link-netnsid 1
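
For completeness, one possible way WireGuard could have been switched on for a Helm install like the one above (the exact command isn't shown in the comment; --reuse-values and the restart step are assumptions based on the values quoted elsewhere in this thread):

helm upgrade cilium cilium/cilium --version 1.14.2 \
   --namespace kube-system \
   --reuse-values \
   --set encryption.enabled=true \
   --set encryption.type=wireguard
kubectl -n kube-system rollout restart daemonset/cilium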

@archoversight

archoversight commented Oct 6, 2023

It's not an issue with AWS; you could replicate the same setup with two hosts on bare metal. It's the way that the WireGuard tunnel MTU is set lower while the pods' veths are all set to the same value as eth0.

What we need:

pod eth0 (mtu: 8921) -> veth (mtu: 8921) -> wg0 (mtu: 8921) -> eth0 (mtu: 9001)

So that packets are not fragmented. This is especially important for IPv6, where routers along the path are not allowed to fragment packets (unlike IPv4), so the packet just gets dropped once it hits the interface with the smaller MTU.

Or, the better fix:

Path MTU discovery really should work, and I am not sure why it doesn't. That way, packets not destined for the WireGuard tunnel could still use the full MTU of the interface they are routed out of (eth0, 9001).

@mmerickel

I haven't been able to track it down, but building on @archoversight's note about the ideal flow: if you look at the iperf3 output at #28413 (comment), things still do not work unless we drop the originating MSS even lower, to 8852.

There is some unexplained slop (69 bytes) on top of the 80 bytes required for bare-metal WireGuard-over-IPv6, which it seems Cilium is injecting into the system somewhere?

@archoversight

@michaltorbinski when you are doing your iperf3 testing, try setting the -M flag to 40 lower than the MTU of the WireGuard tunnel (and potentially even lower, until performance comes back up). That flag sets the MSS iperf3 uses.

On IPv4, the massive slowdown you are seeing is because the packets are being fragmented en route.

@michaltorbinski

Thank you for the suggestion, @archoversight. Surprisingly, -M does not improve performance at any value:

root@ubuntu-pod-2:/# iperf3 -c 10.244.2.75 -M 1380
Connecting to host 10.244.2.75, port 5201
[  5] local 10.244.3.126 port 49226 connected to 10.244.2.75 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   269 MBytes  2.26 Gbits/sec  1332    872 KBytes       
[  5]   1.00-2.00   sec   267 MBytes  2.24 Gbits/sec   60    826 KBytes       

root@ubuntu-pod-2:/# iperf3 -c 10.244.2.75 -M 1340
Connecting to host 10.244.2.75, port 5201
[  5] local 10.244.3.126 port 42848 connected to 10.244.2.75 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   231 MBytes  1.94 Gbits/sec  878    564 KBytes       
[  5]   1.00-2.00   sec   247 MBytes  2.08 Gbits/sec  306    712 KBytes       
[  5]   2.00-3.00   sec   241 MBytes  2.02 Gbits/sec  201    839 KBytes         

root@ubuntu-pod-2:/# iperf3 -c 10.244.2.75 -M 1300
Connecting to host 10.244.2.75, port 5201
[  5] local 10.244.3.126 port 35042 connected to 10.244.2.75 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   219 MBytes  1.84 Gbits/sec  485    464 KBytes       
[  5]   1.00-2.00   sec   246 MBytes  2.06 Gbits/sec  225    639 KBytes        

root@ubuntu-pod-2:/# iperf3 -c 10.244.2.75 -M 1100
Connecting to host 10.244.2.75, port 5201
[  5] local 10.244.3.126 port 49232 connected to 10.244.2.75 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   187 MBytes  1.57 Gbits/sec  863    508 KBytes       
[  5]   1.00-2.00   sec   182 MBytes  1.53 Gbits/sec  429    578 KBytes       
[  5]   2.00-3.00   sec   185 MBytes  1.55 Gbits/sec   90    540 KBytes    

@roarvroom
Author

@PhilipSchmid, regarding the "encryption: enabled" output from "cilium status":

During my journey with Cilium, I remember seeing a situation where Cilium reported encryption as enabled, but there were no _wg0 interfaces and "wg show" returned empty output. I only saw this once, but can you verify that as well?

I would really be interested to see why your WireGuard-protected connection is not losing tens of percent of its performance.
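
For reference, the checks being asked about amount to something like the following (a sketch; the exact invocations aren't in the comment, but cilium encrypt status and wg show are the tools already mentioned in this thread):

# From the Cilium agent pod or the node:
cilium encrypt status        # should report "Encryption: Wireguard" and a peer count
ip -o link show cilium_wg0   # the interface should exist
wg show                      # should list cilium_wg0 with peers rather than empty output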

@archoversight

> Thank you for the suggestion, @archoversight. Surprisingly, -M does not improve performance at any value:

This is really interesting... dropping the MSS should let you avoid fragmentation entirely, thereby increasing throughput, since less work has to be done to chop each packet into multiple packets.

@squeed
Contributor

squeed commented Oct 11, 2023

Just dropping by with a quick observation: Cilium generally sets a larger MTU on the interface and then restricts the MTU via routes. For example, on a test cluster I have lying around, in an arbitrary pod:

$ ip link show eth0
20: eth0@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

$ ip route         
default via 10.244.1.93 dev eth0 mtu 1450 

(I'm not saying that MTU isn't the issue, but it's worth noting).

Also, I observed issues with ipv6 fragmentation and filed an issue here: #25135

julianwiedmann added the feature/wireguard and sig/datapath labels on Nov 13, 2023
@github-actions (bot)

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@mmerickel

not stale

@3u13r
Contributor

3u13r commented Jan 18, 2024

We're also experiencing this issue with Cilium v1.15.0-pre.3 (VXLAN, WireGuard, node encryption).
For the node-to-node iperf3 case there are multiple ways to get from ~78 Mbits/sec to ~655 Mbits/sec.
One way is to set the MSS in iperf3 to <= 1350 or the iperf3 server pod eth0 mtu to <= 1390.
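
For what it's worth, those two thresholds line up with simply stacking the usual overheads on a 1500-byte underlying MTU (a back-of-the-envelope check, assuming 50 bytes for VXLAN and 60 bytes for WireGuard over IPv4):

1500 (eth0) - 50 (VXLAN) - 60 (WireGuard/IPv4) = 1390   -> matches "pod eth0 mtu <= 1390"
1390        - 20 (IPv4)  - 20 (TCP)            = 1350   -> matches "MSS <= 1350"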

There might be multiple issues at play:

  1. Pod-to-pod encryption: WG tunneling #29000 is faulty, since when encapsulate-and-encrypt is enabled (which will be the default), the route MTU needs to reflect both the encryption and the tunneling overhead. When I enable the flag, I get 102 Mbits/sec from a standard pod-to-pod iperf3 run. When I account for both overheads, the bandwidth increases to 656 Mbits/sec.
  2. Node-to-node traffic is still slow even with this patch. This is due to IP fragmentation of WireGuard's UDP packets. There was a discussion about this on the WireGuard mailing list. As far as I understand, we cannot simply fix the MTUs on the interfaces (please let me know if you succeed).

For issue 2, I looked into the BPF code and, from what I can tell, the redirect in wireguard.h does not have any MTU checks. Also, since the programs only see the eth0 interface, they assume an MSS of eth0_MTU - 40, which is not correct since the smallest MTU is on the WireGuard interface; hence setting the MSS in iperf3 results in a strong performance improvement.

The easiest, but not fully complete, solution is to set the MSS value on a route via

ip route add 192.168.1.0/24 dev eth0 advmss 1420

What I still need to wrap my head around is the fact that IP fragmentation of the WireGuard UDP packets costs 90% of the bandwidth.
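
A variation of that workaround is to clamp the advertised MSS on the pod's existing default route instead of adding a new one (a sketch only; the gateway address is the one from squeed's example above and is just a placeholder, and 1380 assumes a 1420-byte cilium_wg0 MTU minus 40 bytes of IPv4/TCP headers):

# Inside the pod's network namespace; values are illustrative.
ip route change default via 10.244.1.93 dev eth0 advmss 1380
ip route show default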

@github-actions (bot)

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@mmerickel

/not-stale

learnitall self-assigned this on Mar 25, 2024
@brb
Member

brb commented Apr 9, 2024

@mmerickel What's your cluster configuration? AWS CNI chaining?

@mmerickel

The entire config is in the issue description, but yes, it's using CNI chaining with the aws-vpc-cni.

brb added the pinned label on Apr 12, 2024
@brb
Member

brb commented Apr 12, 2024

Just relaying relevant bits @learnitall said in a private thread:

As for why this happens, I believe the AWS VPC CNI isn't playing nice and is somehow forcing pods to use an MTU of 9001. The AWS VPC CNI doesn't know that WireGuard is enabled in the cluster, so it doesn't understand that it needs to lower the MTU of each Pod's interface. The MTU that the AWS VPC CNI assigns to each pod interface can be found in the CNI config on the node: <...>

In release v1.16.4, support was added for configuring the MTU that the AWS VPC CNI would use for pod interfaces (see aws/amazon-vpc-cni-k8s#2791). I patched the DaemonSet to tell the AWS VPC CNI to use an MTU of 8921, restarted the netperf client and server pods to recreate their interfaces, and throughput consistently increased to ~4.5 Gb/s.

  • Apply the following patch to the aws-node DaemonSet (a command-line equivalent is sketched after this list):
spec:
  template:
    spec:
      containers:
        - name: aws-node
          env:
          - name: POD_MTU
            value: "8921"
  • Restart workload pods and ensure their interfaces have an MTU of 8921 via ip a.
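
One possible command-line equivalent of those two steps (a sketch; the kube-system namespace for aws-node is assumed, and <some-pod> is a placeholder for any recreated workload pod):

kubectl -n kube-system set env daemonset/aws-node POD_MTU=8921
kubectl -n kube-system rollout status daemonset/aws-node
# Recreate the workload pods so their interfaces are rebuilt, then verify:
kubectl exec -it <some-pod> -- ip a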

@learnitall
Contributor

Thanks for relaying, @brb. After this comment, I did some more investigating and found more context as to why this happens. It's not that the AWS VPC CNI is forcing pods to use a pod MTU of 9001; it's that Cilium doesn't configure route MTUs in chaining mode.

The WireGuard fragmentation problem seems to be specific to a chained configuration. An EKS installation without a chained CNI config seems to perform just fine.

The cilium-cni plugin is responsible for setting up the route on a Pod's interface with a WireGuard-compatible MTU; however, it skips adding routes when configured for chaining. In a non-chained configuration, the cilium-cni calls into plugins/cilium-cni/cmd/interface.go#interfaceAdd, which sets the route MTU here. For a chained configuration with AWS, the cilium-cni calls into plugins/cilium-cni/chaining/generic-veth/generic-veth.go#GenericVethChainer.Add, which only sets the route MTU if the cilium-cni is configured with the EnableRouteMTU option. The enable-route-mtu option is not touched by Cilium during installation; rather, it is a manual configuration option that can be set by users.

As a temporary workaround, using a manual cilium-cni configuration which enables the enable-route-mtu option should fix the MTU issue.
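
For anyone trying that workaround, a minimal sketch of what the relevant cilium-cni entry in the chained CNI conflist could look like (only the option named above is shown; the surrounding aws-cni plugin entries and the exact file path on the node are environment-specific and omitted):

{
  "type": "cilium-cni",
  "enable-route-mtu": true
}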

@brb
Member

brb commented Apr 18, 2024

Could you please try the following images https://github.com/cilium/cilium/actions/runs/8737485006/job/23975228008?pr=32047#step:4:19 (based on v1.15) and report back whether they resolve the performance issues?

@davidspek

davidspek commented May 22, 2024

Just referencing #32244 since it might be relevant for some people. That PR is included in Cilium v1.15.5, v1.14.11 and v1.13.16.

julianwiedmann added the need-more-info label on May 22, 2024
@mvishnevsky

> Just referencing #32244 since it might be relevant for some people. That PR is included in Cilium v1.15.5, v1.14.11 and v1.13.16.

It does not seem to solve the issue for the AWS CNI chaining case. The routes in the Pod are still created without an MTU set.

@learnitall
Contributor

Yes, #32244 is specific to Azure and Alibaba Cloud IPAM.
