v1.11 backports 2022-01-10 #18418
Conversation
My change LGTM, thanks.
👍 for my changes, thanks
Thanks!
882b731 to 247a52c (Compare)
My change LGTM! Thanks!
[ upstream commit 5052445 ] The db5300d commit normalized the KPR initialisation routines by making them return an error. Unfortunately, it forgot to add a check for the error return of finishKubeProxyReplacementInit(). This made the agent hide any error returned by the routine. Fixes: db5300d ("choir: normalize error handling in kube_proxy_replacement.go") Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Paul Chaignon <paul@cilium.io>
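A minimal Go sketch of the shape of this fix, assuming a simplified call site; finishKubeProxyReplacementInit is the routine named above, but the surrounding function, error message, and simulated failure are illustrative only:

```go
package main

import (
	"fmt"
	"log"
)

// Stand-in for the real routine; it only simulates a failure so the
// previously swallowed error path becomes visible.
func finishKubeProxyReplacementInit() error {
	return fmt.Errorf("NodePort BPF is required for this configuration")
}

// initKubeProxyReplacement shows the checked form: the error returned by
// finishKubeProxyReplacementInit() is now propagated instead of dropped.
func initKubeProxyReplacement() error {
	if err := finishKubeProxyReplacementInit(); err != nil {
		return fmt.Errorf("failed to finish kube-proxy replacement initialization: %w", err)
	}
	return nil
}

func main() {
	if err := initKubeProxyReplacement(); err != nil {
		log.Fatal(err)
	}
}
```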
[ upstream commit 5e3a7b3 ] This fixes the `first-interface-index` section in the ENI docs, where we introduced a new default value and wanted to document that new default, but in doing so accidentally changed a value in the examples. This commit actually fixes the default value and reverts the example to its proper meaning. Ref: cilium#14801 Fixes: 231a217 ("docs: first-interface-index new ENI default") Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit aea1b9f ] buildAllocationResult may return an error in case of inconsistencies found in the local CN's status. For example, there are situations where an IP is already part of spec.ipam.pool (including the resource/ENI the IP comes from), while the corresponding ENI is not part of status.eni.enis yet. If that is the case, the IP would be allocated (e.g. by allocateNext) and then marked as allocated (via a.markAllocated). Shortly after that, a.buildAllocationResult() would fail and NOT undo the changes done by a.markAllocated(). This then results in the IP never being freed up again. At the same time, kubelet keeps scheduling PODs onto the same node without knowing that IPs have run out, causing new PODs to never get an IP. Why exactly this inconsistency between the spec and the status arises is a different topic and should maybe be investigated further. This commit/PR fixes the issue by simply moving a.markAllocated() after the a.buildAllocationResult() call, so that the function bails out early enough.

Some additional info on how I encountered this issue and maybe how to reproduce it. We have a cluster running that does automatic downscaling of all deployments at night and then relies on cluster-autoscaler to also shut down nodes. Next morning, all deployments are upscaled again, causing cluster-autoscaler to also start many nodes at once. This causes many nodes to appear in k8s at the same time, all being `NotReady` at the beginning. Cilium agents are then started on each node. When the cilium agents start to get ready, the nodes are also marked `Ready`, causing the k8s scheduler to immediately schedule dozens of PODs onto the `Ready` nodes, long before cilium-operator had a chance to attach new ENIs and IPs to the fresh nodes. This means that all PODs scheduled to the fresh nodes run into a temporary state where the CNI plugin reports that there are no more IPs available. All this is expected and normal up to this point. After a few seconds, cilium-operator finishes attaching new ENIs to the fresh nodes and then tries to update the CN. The update to spec.pool seems to be successful, causing the agent to allocate the IP. But as the update to the status seems to fail, the agent then bails out with the IP still marked as used, causing the leak. This only happens with very high load on the apiserver. At the same time, I can observe errors like these in cilium-operator:

```
level=warning msg="Failed to update CiliumNode" attempt=1 error="Operation cannot be fulfilled on ciliumnodes.cilium.io \"ip-100-66-62-168.eu-central-1.compute.internal\": the object has been modified; please apply your changes to the latest version and try again" instanceID=i-009466ca3d82a1ec0 name=ip-100-66-62-168.eu-central-1.compute.internal subsys=ipam updateStatus=true
```

Please note the `attempt=1` in the log line; it indicates that the first attempt also failed and that no further attempt is made (looking at the many `for retry := 0; retry < 2; retry++` loops found in the code). I assume (without knowing 100%) that this is the reason for the inconsistency between spec and status. Signed-off-by: Alexander Block <ablock84@gmail.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
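A hedged Go sketch of the reordering described above; the types and method bodies are simplified stand-ins for the agent's allocator, not the actual Cilium code:

```go
package ipam

import "errors"

// allocationResult and ipAllocator are toy stand-ins for the agent's types.
type allocationResult struct{ IP string }

type ipAllocator struct {
	allocated map[string]bool
}

// buildAllocationResult may fail when spec.ipam.pool and status.eni.enis
// disagree about which ENI an IP belongs to.
func (a *ipAllocator) buildAllocationResult(ip string) (*allocationResult, error) {
	return nil, errors.New("unable to find ENI for IP in CiliumNode status")
}

func (a *ipAllocator) markAllocated(ip string) { a.allocated[ip] = true }

// allocateNext shows the fixed ordering: build the result first and bail out
// on error, and only mark the IP as allocated once nothing can fail anymore,
// so an inconsistent CiliumNode no longer leaks the IP.
func (a *ipAllocator) allocateNext(ip string) (*allocationResult, error) {
	result, err := a.buildAllocationResult(ip)
	if err != nil {
		return nil, err
	}
	a.markAllocated(ip)
	return result, nil
}
```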
[ upstream commit ca80a99 ] Fix a typo in the masquerading documentation concerning the CIDR mask separator. Signed-off-by: yanhongchang <yanhongchang5@163.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 635ba6c ] For kube-proxy replacement (specifically, socket-based load-balancing) to work correctly in KIND clusters, the BPF cgroup programs need to be attached at the correct cgroup hierarchy. For this to happen, the KIND nodes need to have their own separate cgroup namespace. More details in PR - cilium#16259. While cgroup namespaces are supported across both cgroup v1 and v2 modes, container runtimes like Docker enable private cgroup namespace mode by default only with cgroup v2 [1]. With cgroup v1, the default is host cgroup namespace, whereby KIND node containers (and also cilium agent pods) are created in the same cgroup namespace as the underlying host. [1] https://docs.docker.com/config/containers/runmetrics/#running-docker-on-cgroup-v2 Signed-off-by: Aditi Ghag <aditi@cilium.io> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit e3dca63 ] to make their purpose more explicit Signed-off-by: Gilberto Bertin <gilberto@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit cdfc30d ] In the context of egress gateway, when traffic is leaving the cluster we need to check twice whether it matches an egress NAT policy:

* the first time in handle_ipv4_from_lxc(), on the node where the client pod is running (to determine if it should be forwarded to a gateway node)
* the second time in snat_v4_needed(), on the actual gateway node (to determine if it should be SNATed)

Currently the two checks diverge slightly in how traffic destined outside the cluster is identified:

* in the first case we use is_cluster_destination(), which uses the information stored in the ipcache and EP maps
* in the second case we just rely on the IPV4_SNAT_EXCLUSION_DST_CIDR

The issue with the IPV4_SNAT_EXCLUSION_DST_CIDR logic is that we may incorrectly exclude from egress gateway SNAT traffic that is supposed to be SNATed: case in point, an EKS environment where the primary VPC is shared between the cluster and some other EC2 nodes that don't belong to the cluster. To fix this, this commit changes the snat_v4_needed() logic to match the one we use in handle_ipv4_from_lxc() and executes it before the IPV4_SNAT_EXCLUSION_DST_CIDR check. Signed-off-by: Gilberto Bertin <gilberto@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit a38aaba ] In snat_v4_needed(), one of the conditions to determine if a packet should be SNATed with an egress IP is: !local_ep || is_cluster_destination() The intent of the first check (!local_ep) was to tell us that traffic was redirected by an egress gateway policy to a different node to be masqueraded, but in practice it's not needed: as long as the packet is destined outside the cluster, is not reply traffic, and is matched by an egress NAT policy, it should be SNATed with the egress gateway IP (moreover, we should not assume that there's no local EP, since it's possible that the node where the client pod is running is the same node that will act as egress gateway). Signed-off-by: Gilberto Bertin <gilberto@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 018c945 ] The 0.11.1 release bumps the base ubuntu image to 21.04 [1], which should fix the issue we are seeing with the current test:

++ docker exec -i kind-control-plane /bin/sh -c 'echo $(( $(ip -o l show eth0 | awk "{print $1}" | cut -d: -f1) ))'
[..]
Reading package lists...
E: The repository 'http://security.ubuntu.com/ubuntu groovy-security Release' does not have a Release file.
E: The repository 'http://archive.ubuntu.com/ubuntu groovy Release' does not have a Release file.
E: The repository 'http://archive.ubuntu.com/ubuntu groovy-updates Release' does not have a Release file.
E: The repository 'http://archive.ubuntu.com/ubuntu groovy-backports Release' does not have a Release file.
Error: Process completed with exit code 100.

[1] https://github.com/kubernetes-sigs/kind/releases/tag/v0.11.1 Signed-off-by: Gilberto Bertin <gilberto@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit d8577ff ] Previously, the Kubespray documentation recommended changing the role variables. However, changing the role files in an Ansible playbook could lead to problems. So, with this commit, the documentation recommends using the extra variables or editing the group_vars files. Co-authored-by: Yasin Taha Erol <yasintahaerol@gmail.com> Signed-off-by: necatican <necaticanyildirim@gmail.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit ab9bfd7 ] When a new egress gateway manager is created, it will wait for the k8s cache to be fully synced before running the first reconciliation. Currently the logic is based on the WaitUntilK8sCacheIsSynced method of the Daemon object, which waits on the k8sCachesSynced channel to be closed (which indicates that the cache has indeed been synced). The issue with this approach is that the Daemon object is passed to the NewEgressGatewayManager method _before_ its k8sCachesSynced channel is properly initialized. This in turn causes the WaitUntilK8sCacheIsSynced method to never return. Since NewEgressGatewayManager must be called before that channel is initialized, we need to switch to a polling approach, where the k8sCachesSynced status is checked periodically. Signed-off-by: Gilberto Bertin <gilberto@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
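A rough Go sketch of the polling approach, assuming a small interface for the daemon's sync check; the interface, method names, and polling interval here are illustrative rather than the exact Cilium API:

```go
package egressgateway

import (
	"context"
	"time"
)

// cacheStatus abstracts "have the k8s caches synced yet?"; the real manager
// asks the Daemon, but this interface is an assumption for the sketch.
type cacheStatus interface {
	K8sCacheIsSynced() bool
}

// waitForK8sCacheSync polls periodically instead of waiting on a channel,
// because the daemon's k8sCachesSynced channel may not be initialized yet
// when the egress gateway manager is constructed.
func waitForK8sCacheSync(ctx context.Context, d cacheStatus) error {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()
	for {
		if d.K8sCacheIsSynced() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```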
[ upstream commit 155eacc ] Fixes: cilium#17914 Signed-off-by: Weilong Cui <cuiwl@google.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
… node [ upstream commit 13cfe29 ] Imagine a scenario where a node has 2 unused IPs and pre-allocate set to 1. Let's say one of the IPs is in the middle of a handshake and a new pod is scheduled on the node. The other unused IP would be allocated to the pod. Now, when the operator re-evaluates, the node is no longer considered to be in excess. Without this commit, the operator does not act further on IPs in this state. This results in a scenario where no new IPs are allocated to the node and the agent cannot allocate the unused IPs because they're in the middle of a handshake. Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
…ol() [ upstream commit 2778f5a ] With the addition of the IP release handshake, maintainIPPool() became too long and not very readable, so this moves the release and allocate logic into their own functions. Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
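A sketch, in Go, of the resulting shape of maintainIPPool() after the split; the helper names and the early-return-on-release behaviour are assumptions made for illustration, not the operator's actual code:

```go
package ipam

import "context"

// poolMaintainer stands in for the operator's per-node IPAM logic; only the
// two helpers extracted by this refactor are shown.
type poolMaintainer interface {
	// handleIPRelease drives the IP release handshake and reports whether
	// any release action was taken this round.
	handleIPRelease(ctx context.Context) (released bool, err error)
	// handleIPAllocation allocates additional IPs when the pool runs low.
	handleIPAllocation(ctx context.Context) error
}

// maintainIPPool is reduced to two focused calls instead of one long body.
func maintainIPPool(ctx context.Context, n poolMaintainer) error {
	released, err := n.handleIPRelease(ctx)
	if err != nil || released {
		return err
	}
	return n.handleIPAllocation(ctx)
}
```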
[ upstream commit d9c6eb1 ] Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 13f5c8c ] UpdateIdentities needs to lock SelectorCache.mutex, which might be held by SelectorCache.AddFQDNSelector while endpoints/policies are being regenerated. If in that situation idMDMU is locked already, AddFQDNSelector will deadlock while calling ipcache.GetIDMetadataByIP deep inside the call chain. Example stacktraces of 2 goroutines that show the deadlock:

```
goroutine 236 [semacquire, 61 minutes]:
sync.runtime_SemacquireMutex(0x5000107, 0x0, 0xffffffffffffffff)
        /usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*Mutex).lockSlow(0xc0004d13b0)
        /usr/local/go/src/sync/mutex.go:138 +0x165
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc001365f58)
        /usr/local/go/src/sync/rwmutex.go:111 +0x36
github.com/cilium/cilium/pkg/policy.(*SelectorCache).UpdateIdentities(0xc0004d13b0, 0xc001f0dbf0, 0x0, 0x8)
        /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:949 +0x65
github.com/cilium/cilium/pkg/ipcache.InjectLabels({0x27fe323, 0xf}, {0x2bfdbe0, 0xc0004d13b0}, {0x7f1837d6d080, 0xc0009e0e20})
        /go/src/github.com/cilium/cilium/pkg/ipcache/metadata.go:179 +0xa53
github.com/cilium/cilium/pkg/ipcache.(*IPCache).TriggerLabelInjection.func1({0xc0013674d0, 0xc001367190})
        /go/src/github.com/cilium/cilium/pkg/ipcache/metadata.go:550 +0x31
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc0005a4360)
        /go/src/github.com/cilium/cilium/pkg/controller/controller.go:206 +0x1cb
created by github.com/cilium/cilium/pkg/controller.(*Manager).updateController
        /go/src/github.com/cilium/cilium/pkg/controller/manager.go:111 +0xb67

goroutine 2578413 [semacquire, 61 minutes]:
sync.runtime_SemacquireMutex(0xe, 0x0, 0xc000c76000)
        /usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*RWMutex).RLock(...)
        /usr/local/go/src/sync/rwmutex.go:63
github.com/cilium/cilium/pkg/ipcache.GetIDMetadataByIP({0xc001c5b830, 0x17})
        /go/src/github.com/cilium/cilium/pkg/ipcache/metadata.go:68 +0x68
github.com/cilium/cilium/pkg/ipcache.AllocateCIDRs({0xc0018ca660, 0x4, 0xc00041fce0}, 0x0)
        /go/src/github.com/cilium/cilium/pkg/ipcache/cidr.go:62 +0x20a
github.com/cilium/cilium/pkg/ipcache.AllocateCIDRsForIPs({0xc00009bc20, 0x5, 0xc003a46e80}, 0x37)
        /go/src/github.com/cilium/cilium/pkg/ipcache/cidr.go:99 +0x2f
github.com/cilium/cilium/daemon/cmd.cachingIdentityAllocator.AllocateCIDRsForIPs({0xc00041fc70}, {0xc00009bc20, 0xc00278f510, 0x3}, 0x40e3cb)
        /go/src/github.com/cilium/cilium/daemon/cmd/identity.go:113 +0x2a
github.com/cilium/cilium/pkg/policy.(*fqdnSelector).allocateIdentityMappings(0xc002698370, {0x7f1837c69138, 0xc000ac05a0}, 0x25)
        /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:491 +0x258
github.com/cilium/cilium/pkg/policy.(*SelectorCache).AddFQDNSelector(0xc0004d13b0, {0x2bfdbc0, 0xc002851a00}, {{0x0, 0x0}, {0xc004c7ec60, 0xa}})
        /go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:820 +0x40b
github.com/cilium/cilium/pkg/policy.(*L4Filter).cacheFQDNSelector(0xc002851a00, {{0x0, 0xc00000c180}, {0xc004c7ec60, 0xc000e6eb10}}, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/pkg/policy/l4.go:479 +0x65
github.com/cilium/cilium/pkg/policy.(*L4Filter).cacheFQDNSelectors(0xc0018ca280, {0xc000988380, 0xe, 0x2c732c0}, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/pkg/policy/l4.go:474 +0x8f
github.com/cilium/cilium/pkg/policy.createL4Filter({0x2c732c0, 0xc0018ca060}, {0xc0018ca280, 0x1, 0x1}, {0x2c24098, 0xc0029a9740}, {{0x2bd17b2, 0x1}, {0x27d4b6a, ...}}, ...)
        /go/src/github.com/cilium/cilium/pkg/policy/l4.go:574 +0x398
github.com/cilium/cilium/pkg/policy.createL4EgressFilter(...)
        /go/src/github.com/cilium/cilium/pkg/policy/l4.go:697
github.com/cilium/cilium/pkg/policy.mergeEgressPortProto({0x2c732c0, 0xc0018ca060}, 0x0, {0xc0018ca280, 0xc0025cfe38, 0x44cd45}, {0x2c24098, 0xc0029a9740}, {{0x2bd17b2, 0x1}, ...}, ...)
        /go/src/github.com/cilium/cilium/pkg/policy/rule.go:773 +0x108
github.com/cilium/cilium/pkg/policy.mergeEgress({0x2c732c0, 0xc0018ca060}, 0xc001f0e3f0, {0xc0018ca280, 0x1, 0x1}, {0x2c2a268, 0x4875340}, {0x2c2a218, 0x4875340}, ...)
        /go/src/github.com/cilium/cilium/pkg/policy/rule.go:678 +0x35c
github.com/cilium/cilium/pkg/policy.(*rule).resolveEgressPolicy(0xc0023fb860, {0x2c732c0, 0xc0018ca060}, 0xc001f0e3f0, 0xc002790308, 0xc0029a9620, {0x0, 0x0, 0x0}, {0x0, ...})
        /go/src/github.com/cilium/cilium/pkg/policy/rule.go:809 +0x84f
github.com/cilium/cilium/pkg/policy.ruleSlice.resolveL4EgressPolicy({0xc00299af40, 0x6, 0xb7}, {0x2c732c0, 0xc0018ca060}, 0xc001f0e3f0)
        /go/src/github.com/cilium/cilium/pkg/policy/rules.go:102 +0x470
github.com/cilium/cilium/pkg/policy.(*Repository).resolvePolicyLocked(0xc0004d1490, 0xc001fcdbc0)
        /go/src/github.com/cilium/cilium/pkg/policy/repository.go:704 +0x531
github.com/cilium/cilium/pkg/policy.(*PolicyCache).updateSelectorPolicy(0xc000852a98, 0xc001fcdbc0)
        /go/src/github.com/cilium/cilium/pkg/policy/distillery.go:119 +0x14c
github.com/cilium/cilium/pkg/policy.(*PolicyCache).UpdatePolicy(...)
        /go/src/github.com/cilium/cilium/pkg/policy/distillery.go:153
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regeneratePolicy(0xc000ee3880)
        /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:230 +0x22b
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).runPreCompilationSteps(0xc000ee3880, 0xc000bff400)
        /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:815 +0x2dd
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerateBPF(0xc000ee3880, 0xc000bff400)
        /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:584 +0x19d
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate(0xc000ee3880, 0xc000bff400)
        /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:405 +0x7b3
github.com/cilium/cilium/pkg/endpoint.(*EndpointRegenerationEvent).Handle(0xc003a950d0, 0xc001fcd620)
        /go/src/github.com/cilium/cilium/pkg/endpoint/events.go:53 +0x32c
github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run.func1()
        /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:245 +0x13b
sync.(*Once).doSlow(0xc000c25901, 0x43f325)
        /usr/local/go/src/sync/once.go:68 +0xd2
sync.(*Once).Do(...)
        /usr/local/go/src/sync/once.go:59
github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run(0xc00304ae40)
        /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:233 +0x45
created by github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run
        /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:229 +0x7b
```

Signed-off-by: Alexander Block <ablock84@gmail.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
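A self-contained Go illustration of the lock-ordering pattern behind these two stack traces, with plain mutexes standing in for SelectorCache.mutex and the ipcache metadata lock (idMDMU); the real call chains are far deeper, so this is only a minimal reproduction of the opposing acquisition order, not the actual Cilium code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	selectorCacheMu sync.Mutex // stands in for SelectorCache.mutex
	idMDMu          sync.Mutex // stands in for the ipcache metadata mutex (idMDMU)
)

// injectLabels mimics InjectLabels -> UpdateIdentities: it holds the
// metadata lock and then needs the selector cache lock.
func injectLabels() {
	idMDMu.Lock()
	defer idMDMu.Unlock()
	time.Sleep(10 * time.Millisecond)
	selectorCacheMu.Lock() // blocks forever: held by addFQDNSelector
	defer selectorCacheMu.Unlock()
}

// addFQDNSelector mimics AddFQDNSelector -> GetIDMetadataByIP: it holds the
// selector cache lock and then needs the metadata lock.
func addFQDNSelector() {
	selectorCacheMu.Lock()
	defer selectorCacheMu.Unlock()
	time.Sleep(10 * time.Millisecond)
	idMDMu.Lock() // blocks forever: held by injectLabels
	defer idMDMu.Unlock()
}

func main() {
	go injectLabels()
	go addFQDNSelector()
	time.Sleep(time.Second)
	fmt.Println("both goroutines are now blocked on each other's mutex")
}
```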
…tLabels and RemoveLabelsExcluded [ upstream commit 2e59bbd ] This implements a temporary solution to protect against parallel execution of InjectLabels and RemoveLabelsExcluded, which might cause a different form of the same deadlock as fixed by the previous commit. A better solution should be implemented in the future, as mentioned in the PR comment cilium#18343 (comment) Signed-off-by: Alexander Block <ablock84@gmail.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
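A minimal Go sketch of such a guard, assuming a package-level mutex; the variable and function names are illustrative and not the actual cilium ipcache identifiers:

```go
package ipcache

import "sync"

// injectionMu serializes the two operations so they can no longer interleave
// into the lock-ordering deadlock described in the previous commit.
var injectionMu sync.Mutex

func injectLabels() {
	injectionMu.Lock()
	defer injectionMu.Unlock()
	// ... resolve identities and upsert the corresponding ipcache entries ...
}

func removeLabelsExcluded() {
	injectionMu.Lock()
	defer injectionMu.Unlock()
	// ... drop label metadata and release identities that are no longer used ...
}
```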
[ upstream commit ecdff12 ] Fix TX queue selection problem on the phys device as reported by Laurent. At high throughput, they noticed a significant amount of TCP retransmissions that they tracked back to qdisc drops (fq_codel was used). Suspicion is that kernel commit edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence") caused this due to its unconditional skb_record_rx_queue(), which sets the queue mapping to 1, and thus this gets propagated all the way to the physical device, hitting only a single queue in an mq device. Let's have bpf_lxc reset it as a workaround until we have a kernel fix. Doing this unconditionally is good anyway in order to avoid Pods messing with TX queue selection. Kernel will catch up with the fix in 710ad98c363a ("veth: Do not record rx queue hint in veth_xmit"). Fixes: cilium#18311 Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Laurent Bernaille <laurent.bernaille@datadoghq.com> Link (Bug): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=edbea922025169c0e5cdca5ebf7bf5374cc5566c Link (Fix): https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=710ad98c363a66a0cd8526465426c5c5f8377ee0 Signed-off-by: Paul Chaignon <paul@cilium.io>
247a52c to 7271333 (Compare)
/test-backport-1.11
Travis hit rate limit. One of the builds passed though. Looks like Runtime hit an infra issue - test-runtime
#18327 -- docs: Fix `first-interface-index` documentation (@gandro)

None of the PRs are conflicts!
Once this PR is merged, you can update the PR labels via: