v1.9 backports 2021-05-07 #16048

brb · 2021-05-07T14:03:40Z

v1.9 backports 2021-05-07

docs: Update our community docs page #14968 -- docs: Update our community docs page (@pchaigno)
node-neigh: add metric to count arping requests #14816 -- node-neigh: add metric to count arping requests (@jaffcheng)
test: Format test-only's kernel_version to avoid mistakes #15743 -- test: Format test-only's kernel_version to avoid mistakes (@pchaigno)
Drop a @ in clustermesh-apiserver helm chart #15934 -- Drop a @ in clustermesh-apiserver helm chart (@anthr76)
bwm: queue mapping & cong fixes #15964 -- bwm: queue mapping & cong fixes (@borkmann)
test: Fix fragment tracking test on GKE #15959 -- test: Fix fragment tracking test on GKE (@pchaigno)
connectivity-check: Reduce chances of port conflict with proxy #15988 -- connectivity-check: Reduce chances of port conflict with proxy (@pchaigno)
Update weekly community meeting timeslot #15985 -- Update weekly community meeting timeslot (@joestringer)
plugins/cilium-cni: fix aws-cni cni chaining #15915 -- plugins/cilium-cni: fix aws-cni cni chaining (@aanm)
node-neigh: Locking, logging, misc improvements #15783 -- node-neigh: Locking, logging, misc improvements (@brb)
cilium: Encryption EKS 4.14 kernel (default) fixes #15867 -- cilium: Encryption EKS 4.14 kernel (default) fixes (@jrfastab)
node-neigh: Avoid flooding the same next hop #15882 -- node-neigh: Avoid flooding the same next hop (@brb)

NOT BACKPORTED:

test: Fix the search for VIPs in cilium service list #15968 -- test: Fix the search for VIPs in cilium service list (@pchaigno) <-- many merge conflicts

Once this PR is merged, you can update the PR labels via:

$ for pr in 14968 14816 15743 15934 15964 15959 15988 15985 15915 15783 15867 15882; do contrib/backporting/set-labels.py $pr done 1.9; done

[ upstream commit 63996d6 ] - Update the information on SIGs. - Split the Slack channels in three categories and add a few new suggestions. - Add a section on the Weekly Community Meeting. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit 42eb1ed ] Could be used to decide if the arping refresh interval needs to be adjusted in order to limit spams. Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit fb08c6c ] I often try to start test-only builds with e.g.: test-only --kernel_version=4.19 --focus="..." That fails because our tests expect "419". We can extend the Python script used to parse argument to recognize that and update kernel_version to the expected format. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit d3dc63d ] A Pod could potentially emit non-IPv4/v6 traffic and in that case we must not set the aggregate since we also don't clear it in edt_sched_departure() again. Hence, move the edt_set_aggregate() to the two supported protos. Fixes: #15960 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit c233140 ] Consider a socket which has SO_MAX_PACING_RATE of 4Gbit/s and the socket being part of a Pod. This is currently broken given skb->tstamps are cleared on BPF redirect as well as netns traversal even though fq in hostns manages the socket's pacing. Rates would result being unpredictable: root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 40.04 655.52 root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 40.07 1274.70 root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 40.07 1519.32 root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 40.06 849.96 We are working on a kernel side solution to retain skb->tstamps which would fix this issue and result in a stable 4Gbit/s rate for this example. Once that is merged we can reenable BBR from BWM side for those kernels (and fallback to cubic for those that do not have it). Related: #15324 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit 9fc9583 ] The fragment tracking test was disabled on GKE because it is incompatible with endpoint routes [1]. Until that incompatiblity is fixed, we can disable endpoint routes when running on GKE. 1 - #15958 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit e97c7c3 ] Cilium's DNS proxy listens on a port allocated by the kernel. That port can however conflict with the port used by echo-c-host-container from the connectivity checks, leading to CI flakes when run regularly. A kernel comment on the function allocating the port [1] gives us a way out: * if snum is zero it means select any available local port. * We try to allocate an odd port (and leave even ports for connect()) So using an even-numbered port for echo-c-host-container in the connectivity checks should strongly reduce the likelihood of this flake happening again. Other hostns pods already use even-numbered ports. 1 - https://elixir.bootlin.com/linux/v5.12.1/source/net/ipv4/inet_connection_sock.c#L354 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit 785c2e7 ] As discussed during community meeting 2021-05-03. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit ea381e7 ] AWS-CNI seems to require more fields than the ones hard coded in Cilium image. This patch adds the missing fields. Error messages that might show up in pod describe are similar as: ``` network: invalid character '{' after top-level value ``` or ``` \n{\n \"code\": 100,\n \"msg\": \"add cmd: failed to assign an IP address to container\"\n}": invalid character '{' after top-level value ``` Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit a994147 ] This commit introduces a dedicated lock for managing node neighbor entries. Previously, the neighbor subsystem was sharing the same lock with the node manager. This resulted in the lock starvation when the periodic neighbor refresh was taking longer than expected and it was queuing up the refresh requests. To avoid the starvation which can slow down node updates, we introduce the dedicated lock. Also, we make sure that insertNeighbor() won't block NodeUpdate() in the case of the scenario above by running it in a separate goroutine (same applies to deleteNeighbor() and NodeDelete()). Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit 7c45d17 ] There is no default case in the select, so the select will block until one of its cases is fulfilled. Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit 0d72941 ] - Lock insertNeighbor() before accessing shared structures. - Defer neighbor BPF map retirement so that it's moved from the contend path. - Make getSrcAndNextHopIPv4() a function to indicate that it's not using any shared structures. Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit a82137b ] This commit changes how insertNeighbor() method takes the lock. Instead of holding the lock for the entire execution duration, we release the lock while arpinging. This allows parallel arpings which should make the periodic ARP refresher faster. Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit 38d0049 ] It can happen that the insertNeighbor(refresh=true) gets called before insertNeighbor(refresh=false), and only the latter can increment the neigh refcounting. Remove the optimization for the sake of refcounting correctness. Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit e779e96 ] Instead of deriving the value from net.ipv4.neigh.default.base_reachable_time_ms which is 30s by default use a user specified value for the ARP periodic refresher. Running ARP refresher every 15-45s is too often, as in most clusters node hw addrs don't change at all. Users who have highly dynamic environments will be able to schedule the refresher in higher frequency by using the flag. Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit 0ff85f2 ] Most of arping failures which we observed are transient, so these false positives might mislead users. Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit fbadf99 ] When using subnet mode for encryption IPs subnets are not bound to any specific node. So rather than use xfrm stack to write src/dst IPs in the IPSec path we use bpf_host.o attached to cilium_host qdisc to complete the header. This is necessary because we want to avoid having an entry per pod in the xfrm tables and instead have at max an entry per node. Currently, we use the srcIP of the first interface we find in the egress interface list. This works on some more forgiving networks, but EKS uses source based IP routing, but does not have a source IP in the source routing tables for all network facing interfaces. What happens then is we rewrite the SIP of the IPSec header to the IPAddr assigned to the network interface. This is passed to the stack for routing. The routing table doesn't have a source entry for the SIP so it uses the default route. The default may or may not point to the first interface. If it doesn't then the network appears to do a firewall rule where the NIC/SIP lookup is unexpected and the network drops the packet either by firewall or just missing routing rules. It appears common for the first interface "eth0" to match the default route so often get working by depending on this ordering, buts its very fragile and easy to break. Fix by using the SIP of cilium_host which always has an entry in the source routing table. Note: Above only applies to non-fib case (kernels <4.17) with fib lookup we do a lookup in BPF datapath and write MACs and IPs correctly without stack intervention. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit d56f565 ] We can not do a redirect to the egress interface if the fib lookup is not supported. If we do try to do the redirect there is no guarantee the MAC addresses are correct because we didn't rewrite them from the BPF datapath. This may work, if the network is very fogiving, but more likely the NIC will drop the packet because of MAC addresses being incorrect. Also without a correct ENCRYPT_IFINDEX the FIB lookup will fail. So lets guard fib lookups with both encrypt ifindex and fib support this way we will pass the packets to the stack instead of dropping them when we don't have the correct set of kernel features or config. EKS creates a unique problem here. EKS does source routing and has multiple egress interfaces. The correct egress interface is unclear at init time (there are multiple interfaces) and can change at runtime. Rather than try to pick an interface (fragile!) let the stack route the packet and skip FIB lookup and redirect from BPF datapath. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit 69e885c ] We moved the subnet detection lower into the init path so that it handled reinitialization, but unfortunately this broke setting IP_POOLS define in header writing. Do the simplest fix and just always enable IP_POOLS for ENI case. Fixes: a42d442 ("cilium: auto-discovery pod subnets for ENI IPAM") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit e34115a ] In order to push auto-discovery of the subnet into the nodeSubent fields we need to do the discovery early enough in the setup that the subnets exist. At the moment we do the auto-discovery too late and the values are not pushed into the nodeConfiguration so we miss updating ipsec route table correctly. Rather than overwrite the config options this patch pushes the code into the nodeConfigurationChanged() and updates the node configuration directly. This has two advantages: we avoid stomping on user config and then we also catch any subnet updates. Fixes: a42d442 ("cilium: auto-discovery pod subnets for ENI IPAM") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit adb5ccf ] The dmac may not be set in the case of !HAVE_FIB_SUPPORT so we need to ensure the stack doesn't mark this packet as OTHERHOST and drop it. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit c3d971a ] When nodes are connected via L3 GW, the periodic ARP refresher pings the same IP (GW). To avoid that, introduce neighLastPingByNextHop which stores a timestamp of the last successful arping for a given next hop. If delta between the last and the next arping is less than option.Config.ARPPingRefreshPeriod, then the arping is skipped until the next time the refresher is scheduled. Signed-off-by: Martynas Pumputis <m@lambda.lt>

[ upstream commit a753b38 ] The impact of the wrong removal was very small - unnecessary neigh insert via netlink and dangling unused value. This could have happened under very rare circumstances when a next hop to a node had changed. Fixes: 0483ba0 ("node-neigh: Do not inc neighbor refcount for the same node") Signed-off-by: Martynas Pumputis <m@lambda.lt>

pchaigno

👍 for my changes.

brb · 2021-05-07T14:40:02Z

test-backport-1.9

aanm

Looks good for my commits

jrajahalme · 2021-05-07T21:28:52Z

@pchaigno Verifier test hit by #16050, not sure if this can be considered as unrelated flake for this PR or not?

pchaigno · 2021-05-07T21:30:04Z

@pchaigno Verifier test hit by #16050, not sure if this can be considered as unrelated flake for this PR or not?

We can ignore if it's the same error message (it is here). It already fails on v1.9.

pchaigno and others added 23 commits May 7, 2021 15:51

node-neigh: add metric to count arping requests

222c862

[ upstream commit 42eb1ed ] Could be used to decide if the arping refresh interval needs to be adjusted in order to limit spams. Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>

Update weekly community meeting timeslot

caf85ef

[ upstream commit 785c2e7 ] As discussed during community meeting 2021-05-03. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>

node-neigh: Remove unnecessary looping over recv

692830a

[ upstream commit 7c45d17 ] There is no default case in the select, so the select will block until one of its cases is fulfilled. Signed-off-by: Martynas Pumputis <m@lambda.lt>

node-neigh: Log arping failures only in debug mode

4d118d4

[ upstream commit 0ff85f2 ] Most of arping failures which we observed are transient, so these false positives might mislead users. Signed-off-by: Martynas Pumputis <m@lambda.lt>

brb added kind/backports This PR provides functionality previously merged into master. backport/1.9 labels May 7, 2021

brb requested a review from a team as a code owner May 7, 2021 14:03

pchaigno approved these changes May 7, 2021

View reviewed changes

aanm reviewed May 7, 2021

View reviewed changes

jrajahalme mentioned this pull request May 7, 2021

v1.9: CI: K8sVerifier Runs the kernel verifier against Cilium's BPF datapath on 5.4 #16050

Closed

brb added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 10, 2021

ti-mo merged commit 7dc246b into v1.9 May 10, 2021

ti-mo deleted the pr/backport-1.9-2021-05-07 branch May 10, 2021 13:59

joestringer mentioned this pull request May 13, 2021

Prepare for release v1.9.7 #16131

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.9 backports 2021-05-07 #16048

v1.9 backports 2021-05-07 #16048

brb commented May 7, 2021

pchaigno left a comment

brb commented May 7, 2021

aanm left a comment

jrajahalme commented May 7, 2021

pchaigno commented May 7, 2021

v1.9 backports 2021-05-07 #16048

v1.9 backports 2021-05-07 #16048

Conversation

brb commented May 7, 2021

pchaigno left a comment

Choose a reason for hiding this comment

brb commented May 7, 2021

aanm left a comment

Choose a reason for hiding this comment

jrajahalme commented May 7, 2021

pchaigno commented May 7, 2021