
v1.10 backports 2022-01-14 #18487

Merged
3 commits merged into cilium:v1.10 on Jan 18, 2022

Conversation

aditighag
Member

Once this PR is merged, you can update the PR labels via:

$ for pr in 18352 18370 18388; do contrib/backporting/set-labels.py $pr done 1.10; done

@aditighag requested a review from a team as a code owner on January 14, 2022 18:19
@maintainer-s-little-helper bot added the backport/1.10 and kind/backports (This PR provides functionality previously merged into master.) labels on Jan 14, 2022
@aditighag
Member Author

test-backport-1.10

codablock and others added 3 commits January 18, 2022 06:04
[ upstream commit aea1b9f ]

buildAllocationResult may return an error in case of inconsistencies found
in the local CN's status. For example, there are situations where an IP
is already part of spec.ipam.pool (including the resource/ENI where the IP
comes from), while the corresponding ENI is not part of status.eni.enis
yet.

If that is the case, the IP would be allocated (e.g. by allocateNext)
and then marked as allocated (via a.markAllocated). Shortly after that,
a.buildAllocationResult() would fail and then NOT undo the changes done
by a.markAllocated(). This then results in the IP never being freed up
again. At the same time, kubelet keeps scheduling Pods onto the same
node without knowing that IPs have run out, causing new Pods to never
get an IP.

Why exactly this inconsistency between the spec and the status arises is
a different topic and should perhaps be investigated further.

This commit/PR fixes the issue by moving a.markAllocated() after the
a.buildAllocationResult() call, so that the function bails out early
enough on error and the IP is never marked as allocated.
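
For illustration, here is a minimal Go sketch of that reordering. The names
markAllocated and buildAllocationResult come from the commit text, but the
allocator type and signatures below are simplified assumptions; the real code
in Cilium's pkg/ipam differs.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// Hypothetical, heavily simplified allocator used only to show the ordering.
type allocator struct {
	allocated map[string]bool
}

func (a *allocator) markAllocated(ip net.IP) {
	a.allocated[ip.String()] = true
}

func (a *allocator) buildAllocationResult(ip net.IP) (string, error) {
	// Fails when the IP is already in spec.ipam.pool but the owning ENI
	// is not yet reflected in status.eni.enis.
	return "", errors.New("unable to find ENI for IP in CiliumNode status")
}

// allocateNext shows the fixed ordering: bail out on a buildAllocationResult
// error *before* marking the IP as allocated, so a failure no longer leaks
// the IP.
func (a *allocator) allocateNext(ip net.IP) (string, error) {
	result, err := a.buildAllocationResult(ip)
	if err != nil {
		return "", err // IP was never marked, nothing to undo
	}
	a.markAllocated(ip)
	return result, nil
}

func main() {
	a := &allocator{allocated: map[string]bool{}}
	if _, err := a.allocateNext(net.ParseIP("100.66.62.168")); err != nil {
		fmt.Println("allocation failed, IP stays free:", err)
	}
}
```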

Some additional info on how I encountered this issue and maybe how to
reproduce it. We have a cluster running that does automatic downscaling
of all deployments at night and then relies on cluster-autoscaler to also
shut down nodes. Next morning, all deployments are upscaled again, causing
cluster-autoscaler to also start many nodes at once.

This causes many nodes to appear in k8s at the same time, all being
`NotReady` at the beginning. Cilium agents are then started on each node.
When the Cilium agents start to become ready, the nodes are also marked
`Ready`, causing the k8s scheduler to immediately schedule dozens of Pods
onto the `Ready` nodes, long before cilium-operator had a chance to attach
new ENIs and IPs to the fresh nodes.

This means that all Pods scheduled to the fresh nodes run into a temporary
state where the CNI plugin reports that there are no more IPs available.
All of this is expected and normal up to this point.

After a few seconds, cilium-operator finishes attaching new ENIs to the
fresh nodes and then tries to update the CN. The update to spec.pool then
seems to succeed, causing the agent to allocate the IP. But as the update
to the status seems to fail, the agent bails out with the IP marked as
used, causing the leak.

This is only happening with very high load on the apiserver. At the same
time, I can observe errors like these happening in cilium-operator:
```
level=warning msg="Failed to update CiliumNode" attempt=1 error="Operation cannot be fulfilled on ciliumnodes.cilium.io \"ip-100-66-62-168.eu-central-1.compute.internal\": the object has been modified; please apply your changes to the latest version and try again" instanceID=i-009466ca3d82a1ec0 name=ip-100-66-62-168.eu-central-1.compute.internal subsys=ipam updateStatus=true
```

Please note the `attempt=1` in the log line: it indicates that the first
attempt also failed and that no further attempt is made (see the many
`for retry := 0; retry < 2; retry++` loops found in the code; a sketch of
this bounded retry follows below). I assume (without knowing for sure)
that this is the reason for the inconsistency between spec and status.
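
A self-contained Go sketch of that bounded retry pattern, under the
assumption described above; updateCiliumNodeStatus and the update callback
are hypothetical stand-ins for the real CiliumNode status update call.

```go
package main

import (
	"errors"
	"log"
)

// conflictErr stands in for the apiserver's "object has been modified"
// conflict error seen in the cilium-operator log above.
var conflictErr = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// updateCiliumNodeStatus mimics the bounded retry: at most two attempts,
// so a second consecutive conflict is logged and the update is abandoned,
// potentially leaving spec and status out of sync.
func updateCiliumNodeStatus(update func() error) error {
	var err error
	for retry := 0; retry < 2; retry++ {
		if err = update(); err == nil {
			return nil
		}
		log.Printf("Failed to update CiliumNode attempt=%d error=%v", retry, err)
	}
	return err
}

func main() {
	// Simulate an apiserver under heavy load that rejects both attempts.
	_ = updateCiliumNodeStatus(func() error { return conflictErr })
}
```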

Signed-off-by: Alexander Block <ablock84@gmail.com>
Signed-off-by: Aditi Ghag <aditi@cilium.io>
[ upstream commit 018c945 ]

The kind 0.11.1 release bumps the base Ubuntu image to 21.04 [1], which should
fix the issue we are seeing with the current test:

    ++ docker exec -i kind-control-plane /bin/sh -c 'echo $(( $(ip -o l show eth0 | awk "{print $1}" | cut -d: -f1) ))'
    [..]
    Reading package lists...
    E: The repository 'http://security.ubuntu.com/ubuntu groovy-security Release' does not have a Release file.
    E: The repository 'http://archive.ubuntu.com/ubuntu groovy Release' does not have a Release file.
    E: The repository 'http://archive.ubuntu.com/ubuntu groovy-updates Release' does not have a Release file.
    E: The repository 'http://archive.ubuntu.com/ubuntu groovy-backports Release' does not have a Release file.
    Error: Process completed with exit code 100.

[1] https://github.com/kubernetes-sigs/kind/releases/tag/v0.11.1

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Aditi Ghag <aditi@cilium.io>
[ upstream commit ecdff12 ]

Fix a TX queue selection problem on the phys device as reported by Laurent.
At high throughput, they noticed a significant amount of TCP retransmissions
that they tracked back to qdisc drops (fq_codel was used).

Suspicion is that kernel commit edbea9220251 ("veth: Store queue_mapping
independently of XDP prog presence") caused this due to its unconditional
skb_record_rx_queue() which sets queue mapping to 1, and thus this gets
propagated all the way to the physical device hitting only single queue
in a mq device.

Let's have bpf_lxc reset it as a workaround until we have a kernel fix.
Doing this unconditionally is good anyway in order to prevent Pods from
messing with TX queue selection.
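
A minimal standalone tc BPF sketch of that workaround; the actual change
lives in Cilium's bpf_lxc.c and uses Cilium's own ctx helpers, so program
and section names here are illustrative only.

```c
/* Clear the queue_mapping hint recorded on the veth RX side so the
 * physical multiqueue device selects its own TX queue again. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("tc")
int reset_queue_mapping(struct __sk_buff *skb)
{
	/* queue_mapping is writable from tc programs; 0 means "no hint",
	 * letting the stack pick the TX queue on the egress device. */
	skb->queue_mapping = 0;
	return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";
```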

Kernel will catch up with fix in 710ad98c363a ("veth: Do not record rx queue
hint in veth_xmit").

Fixes: cilium#18311
Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Link (Bug): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=edbea922025169c0e5cdca5ebf7bf5374cc5566c
Link (Fix): https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=710ad98c363a66a0cd8526465426c5c5f8377ee0
Signed-off-by: Aditi Ghag <aditi@cilium.io>
@aditighag
Member Author

Need to rebase the PR.


All tests passed.

@aditighag
Member Author

aditighag commented Jan 18, 2022

test-backport-1.10

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests NodePort

Failure Output

FAIL: Can not connect to service "tftp://192.168.36.12:32038/hello" from outside cluster (2/10)

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

Member

@jibi left a comment


👍 for my commit, thanks

@aditighag
Member Author

net-next failure: #12511.
CodeQL errors have been reported offline and are misleading.

@aditighag merged commit 42df137 into cilium:v1.10 on Jan 18, 2022
Labels
kind/backports This PR provides functionality previously merged into master.

4 participants