-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.9 backports 2021-05-07 #16048
Merged
Merged
v1.9 backports 2021-05-07 #16048
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[ upstream commit 63996d6 ] - Update the information on SIGs. - Split the Slack channels in three categories and add a few new suggestions. - Add a section on the Weekly Community Meeting. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 42eb1ed ] Could be used to decide if the arping refresh interval needs to be adjusted in order to limit spams. Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit fb08c6c ] I often try to start test-only builds with e.g.: test-only --kernel_version=4.19 --focus="..." That fails because our tests expect "419". We can extend the Python script used to parse argument to recognize that and update kernel_version to the expected format. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit d3dc63d ] A Pod could potentially emit non-IPv4/v6 traffic and in that case we must not set the aggregate since we also don't clear it in edt_sched_departure() again. Hence, move the edt_set_aggregate() to the two supported protos. Fixes: #15960 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit c233140 ] Consider a socket which has SO_MAX_PACING_RATE of 4Gbit/s and the socket being part of a Pod. This is currently broken given skb->tstamps are cleared on BPF redirect as well as netns traversal even though fq in hostns manages the socket's pacing. Rates would result being unpredictable: root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 40.04 655.52 root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 40.07 1274.70 root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 40.07 1519.32 root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 40.06 849.96 We are working on a kernel side solution to retain skb->tstamps which would fix this issue and result in a stable 4Gbit/s rate for this example. Once that is merged we can reenable BBR from BWM side for those kernels (and fallback to cubic for those that do not have it). Related: #15324 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 9fc9583 ] The fragment tracking test was disabled on GKE because it is incompatible with endpoint routes [1]. Until that incompatiblity is fixed, we can disable endpoint routes when running on GKE. 1 - #15958 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit e97c7c3 ] Cilium's DNS proxy listens on a port allocated by the kernel. That port can however conflict with the port used by echo-c-host-container from the connectivity checks, leading to CI flakes when run regularly. A kernel comment on the function allocating the port [1] gives us a way out: * if snum is zero it means select any available local port. * We try to allocate an odd port (and leave even ports for connect()) So using an even-numbered port for echo-c-host-container in the connectivity checks should strongly reduce the likelihood of this flake happening again. Other hostns pods already use even-numbered ports. 1 - https://elixir.bootlin.com/linux/v5.12.1/source/net/ipv4/inet_connection_sock.c#L354 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 785c2e7 ] As discussed during community meeting 2021-05-03. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit ea381e7 ] AWS-CNI seems to require more fields than the ones hard coded in Cilium image. This patch adds the missing fields. Error messages that might show up in pod describe are similar as: ``` network: invalid character '{' after top-level value ``` or ``` \n{\n \"code\": 100,\n \"msg\": \"add cmd: failed to assign an IP address to container\"\n}": invalid character '{' after top-level value ``` Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit a994147 ] This commit introduces a dedicated lock for managing node neighbor entries. Previously, the neighbor subsystem was sharing the same lock with the node manager. This resulted in the lock starvation when the periodic neighbor refresh was taking longer than expected and it was queuing up the refresh requests. To avoid the starvation which can slow down node updates, we introduce the dedicated lock. Also, we make sure that insertNeighbor() won't block NodeUpdate() in the case of the scenario above by running it in a separate goroutine (same applies to deleteNeighbor() and NodeDelete()). Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 7c45d17 ] There is no default case in the select, so the select will block until one of its cases is fulfilled. Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 0d72941 ] - Lock insertNeighbor() before accessing shared structures. - Defer neighbor BPF map retirement so that it's moved from the contend path. - Make getSrcAndNextHopIPv4() a function to indicate that it's not using any shared structures. Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit a82137b ] This commit changes how insertNeighbor() method takes the lock. Instead of holding the lock for the entire execution duration, we release the lock while arpinging. This allows parallel arpings which should make the periodic ARP refresher faster. Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 38d0049 ] It can happen that the insertNeighbor(refresh=true) gets called before insertNeighbor(refresh=false), and only the latter can increment the neigh refcounting. Remove the optimization for the sake of refcounting correctness. Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit e779e96 ] Instead of deriving the value from net.ipv4.neigh.default.base_reachable_time_ms which is 30s by default use a user specified value for the ARP periodic refresher. Running ARP refresher every 15-45s is too often, as in most clusters node hw addrs don't change at all. Users who have highly dynamic environments will be able to schedule the refresher in higher frequency by using the flag. Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 0ff85f2 ] Most of arping failures which we observed are transient, so these false positives might mislead users. Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit fbadf99 ] When using subnet mode for encryption IPs subnets are not bound to any specific node. So rather than use xfrm stack to write src/dst IPs in the IPSec path we use bpf_host.o attached to cilium_host qdisc to complete the header. This is necessary because we want to avoid having an entry per pod in the xfrm tables and instead have at max an entry per node. Currently, we use the srcIP of the first interface we find in the egress interface list. This works on some more forgiving networks, but EKS uses source based IP routing, but does not have a source IP in the source routing tables for all network facing interfaces. What happens then is we rewrite the SIP of the IPSec header to the IPAddr assigned to the network interface. This is passed to the stack for routing. The routing table doesn't have a source entry for the SIP so it uses the default route. The default may or may not point to the first interface. If it doesn't then the network appears to do a firewall rule where the NIC/SIP lookup is unexpected and the network drops the packet either by firewall or just missing routing rules. It appears common for the first interface "eth0" to match the default route so often get working by depending on this ordering, buts its very fragile and easy to break. Fix by using the SIP of cilium_host which always has an entry in the source routing table. Note: Above only applies to non-fib case (kernels <4.17) with fib lookup we do a lookup in BPF datapath and write MACs and IPs correctly without stack intervention. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit d56f565 ] We can not do a redirect to the egress interface if the fib lookup is not supported. If we do try to do the redirect there is no guarantee the MAC addresses are correct because we didn't rewrite them from the BPF datapath. This may work, if the network is very fogiving, but more likely the NIC will drop the packet because of MAC addresses being incorrect. Also without a correct ENCRYPT_IFINDEX the FIB lookup will fail. So lets guard fib lookups with both encrypt ifindex and fib support this way we will pass the packets to the stack instead of dropping them when we don't have the correct set of kernel features or config. EKS creates a unique problem here. EKS does source routing and has multiple egress interfaces. The correct egress interface is unclear at init time (there are multiple interfaces) and can change at runtime. Rather than try to pick an interface (fragile!) let the stack route the packet and skip FIB lookup and redirect from BPF datapath. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 69e885c ] We moved the subnet detection lower into the init path so that it handled reinitialization, but unfortunately this broke setting IP_POOLS define in header writing. Do the simplest fix and just always enable IP_POOLS for ENI case. Fixes: a42d442 ("cilium: auto-discovery pod subnets for ENI IPAM") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit e34115a ] In order to push auto-discovery of the subnet into the nodeSubent fields we need to do the discovery early enough in the setup that the subnets exist. At the moment we do the auto-discovery too late and the values are not pushed into the nodeConfiguration so we miss updating ipsec route table correctly. Rather than overwrite the config options this patch pushes the code into the nodeConfigurationChanged() and updates the node configuration directly. This has two advantages: we avoid stomping on user config and then we also catch any subnet updates. Fixes: a42d442 ("cilium: auto-discovery pod subnets for ENI IPAM") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit adb5ccf ] The dmac may not be set in the case of !HAVE_FIB_SUPPORT so we need to ensure the stack doesn't mark this packet as OTHERHOST and drop it. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit c3d971a ] When nodes are connected via L3 GW, the periodic ARP refresher pings the same IP (GW). To avoid that, introduce neighLastPingByNextHop which stores a timestamp of the last successful arping for a given next hop. If delta between the last and the next arping is less than option.Config.ARPPingRefreshPeriod, then the arping is skipped until the next time the refresher is scheduled. Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit a753b38 ] The impact of the wrong removal was very small - unnecessary neigh insert via netlink and dangling unused value. This could have happened under very rare circumstances when a next hop to a node had changed. Fixes: 0483ba0 ("node-neigh: Do not inc neighbor refcount for the same node") Signed-off-by: Martynas Pumputis <m@lambda.lt>
brb
added
kind/backports
This PR provides functionality previously merged into master.
backport/1.9
labels
May 7, 2021
pchaigno
approved these changes
May 7, 2021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for my changes.
test-backport-1.9 |
aanm
reviewed
May 7, 2021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for my commits
brb
added
the
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
label
May 10, 2021
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
kind/backports
This PR provides functionality previously merged into master.
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
v1.9 backports 2021-05-07
@
in clustermesh-apiserver helm chart #15934 -- Drop a@
in clustermesh-apiserver helm chart (@anthr76)NOT BACKPORTED:
cilium service list
#15968 -- test: Fix the search for VIPs incilium service list
(@pchaigno) <-- many merge conflictsOnce this PR is merged, you can update the PR labels via: