
v1.10 backports 2022-01-14 #18487

Merged
3 commits merged into cilium:v1.10 on Jan 18, 2022

Conversation

aditighag
Member

Once this PR is merged, you can update the PR labels via:

$ for pr in 18352 18370 18388; do contrib/backporting/set-labels.py $pr done 1.10; done

@aditighag requested a review from a team as a code owner on January 14, 2022 18:19
@maintainer-s-little-helper bot added the backport/1.10 and kind/backports (This PR provides functionality previously merged into master.) labels on Jan 14, 2022
@aditighag
Member Author

test-backport-1.10

codablock and others added 3 commits January 18, 2022 06:04
[ upstream commit aea1b9f ]

buildAllocationResult may return an error in case of inconsistencies found
in the local CN's status. For example, there are situations where an IP
is already part of spec.ipam.pool (including the resource/ENI where the IP
comes from), while the corresponding ENI is not part of status.eni.enis
yet.

If that is the case, the IP would be allocated (e.g. by allocateNext)
and then marked as allocated (via a.markAllocated). Shortly after that,
a.buildAllocationResult() would fail and then NOT undo the changes done
by a.markAllocated(). This then results in the IP never being freed up
again. At the same time, kubelet keeps scheduling Pods onto the same
node without knowing that IPs have run out, causing new Pods to never
get an IP.

Why exactly this inconsistency between the spec and the status arises is
a different topic and should perhaps be investigated further.

This commit/PR fixes the issue by moving a.markAllocated() after the
a.buildAllocationResult() call, so that the function bails out early
enough on error and the IP is never marked as allocated.
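
For illustration, here is a minimal Go sketch of that reordering. The names
markAllocated and buildAllocationResult come from the commit text, but the
allocator type and signatures below are simplified assumptions; the real code
in Cilium's pkg/ipam differs.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// Hypothetical, heavily simplified allocator used only to show the ordering.
type allocator struct {
	allocated map[string]bool
}

func (a *allocator) markAllocated(ip net.IP) {
	a.allocated[ip.String()] = true
}

func (a *allocator) buildAllocationResult(ip net.IP) (string, error) {
	// Fails when the IP is already in spec.ipam.pool but the owning ENI
	// is not yet reflected in status.eni.enis.
	return "", errors.New("unable to find ENI for IP in CiliumNode status")
}

// allocateNext shows the fixed ordering: bail out on a buildAllocationResult
// error *before* marking the IP as allocated, so a failure no longer leaks
// the IP.
func (a *allocator) allocateNext(ip net.IP) (string, error) {
	result, err := a.buildAllocationResult(ip)
	if err != nil {
		return "", err // IP was never marked, nothing to undo
	}
	a.markAllocated(ip)
	return result, nil
}

func main() {
	a := &allocator{allocated: map[string]bool{}}
	if _, err := a.allocateNext(net.ParseIP("100.66.62.168")); err != nil {
		fmt.Println("allocation failed, IP stays free:", err)
	}
}
```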

Some additional info on how I encountered this issue and maybe how to
reproduce it. We have a cluster running that does automatic downscaling
of all deployments at night and then relies on cluster-autoscaler to also
shut down nodes. Next morning, all deployments are upscaled again, causing
cluster-autoscaler to also start many nodes at once.

This causes many nodes to appear in k8s at the same time, all being
`NotReady` at the beginning. Cilium agents are then started on each node.
When the Cilium agents start to become ready, the nodes are also marked
`Ready`, causing the k8s scheduler to immediately schedule dozens of Pods
onto the `Ready` nodes, long before cilium-operator had a chance to attach
new ENIs and IPs to the fresh nodes.

This means that all Pods scheduled to the fresh nodes run into a temporary
state where the CNI plugin reports that there are no more IPs available.
All of this is expected and normal up to this point.

After a few seconds, cilium-operator finishes attaching new ENIs to the
fresh nodes and then tries to update the CN. The update to spec.pool then
seems to succeed, causing the agent to allocate the IP. But as the update
to the status seems to fail, the agent bails out with the IP marked as
used, causing the leak.

This is only happening with very high load on the apiserver. At the same
time, I can observe errors like these happening in cilium-operator:
```
level=warning msg="Failed to update CiliumNode" attempt=1 error="Operation cannot be fulfilled on ciliumnodes.cilium.io \"ip-100-66-62-168.eu-central-1.compute.internal\": the object has been modified; please apply your changes to the latest version and try again" instanceID=i-009466ca3d82a1ec0 name=ip-100-66-62-168.eu-central-1.compute.internal subsys=ipam updateStatus=true
```

Please note the `attempt=1` in the log line: it indicates that the first
attempt also failed and that no further attempt is made (see the many
`for retry := 0; retry < 2; retry++` loops found in the code; a sketch of
this bounded retry follows below). I assume (without knowing for sure)
that this is the reason for the inconsistency between spec and status.
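
A self-contained Go sketch of that bounded retry pattern, under the
assumption described above; updateCiliumNodeStatus and the update callback
are hypothetical stand-ins for the real CiliumNode status update call.

```go
package main

import (
	"errors"
	"log"
)

// conflictErr stands in for the apiserver's "object has been modified"
// conflict error seen in the cilium-operator log above.
var conflictErr = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// updateCiliumNodeStatus mimics the bounded retry: at most two attempts,
// so a second consecutive conflict is logged and the update is abandoned,
// potentially leaving spec and status out of sync.
func updateCiliumNodeStatus(update func() error) error {
	var err error
	for retry := 0; retry < 2; retry++ {
		if err = update(); err == nil {
			return nil
		}
		log.Printf("Failed to update CiliumNode attempt=%d error=%v", retry, err)
	}
	return err
}

func main() {
	// Simulate an apiserver under heavy load that rejects both attempts.
	_ = updateCiliumNodeStatus(func() error { return conflictErr })
}
```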

Signed-off-by: Alexander Block <ablock84@gmail.com>
Signed-off-by: Aditi Ghag <aditi@cilium.io>
[ upstream commit 018c945 ]

The kind 0.11.1 release bumps the base Ubuntu image to 21.04 [1], which should
fix the issue we are seeing with the current test:

    ++ docker exec -i kind-control-plane /bin/sh -c 'echo $(( $(ip -o l show eth0 | awk "{print $1}" | cut -d: -f1) ))'
    [..]
    Reading package lists...
    E: The repository 'http://security.ubuntu.com/ubuntu groovy-security Release' does not have a Release file.
    E: The repository 'http://archive.ubuntu.com/ubuntu groovy Release' does not have a Release file.
    E: The repository 'http://archive.ubuntu.com/ubuntu groovy-updates Release' does not have a Release file.
    E: The repository 'http://archive.ubuntu.com/ubuntu groovy-backports Release' does not have a Release file.
    Error: Process completed with exit code 100.

[1] https://github.com/kubernetes-sigs/kind/releases/tag/v0.11.1

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
Signed-off-by: Aditi Ghag <aditi@cilium.io>
[ upstream commit ecdff12 ]

Fix a TX queue selection problem on the phys device as reported by Laurent.
At high throughput, they noticed a significant amount of TCP retransmissions
that they tracked back to qdisc drops (fq_codel was used).

Suspicion is that kernel commit edbea9220251 ("veth: Store queue_mapping
independently of XDP prog presence") caused this due to its unconditional
skb_record_rx_queue() which sets queue mapping to 1, and thus this gets
propagated all the way to the physical device hitting only single queue
in a mq device.

Let's have bpf_lxc reset it as a workaround until we have a kernel fix.
Doing this unconditionally is good anyway in order to prevent Pods from
messing with TX queue selection.
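
A minimal standalone tc BPF sketch of that workaround; the actual change
lives in Cilium's bpf_lxc.c and uses Cilium's own ctx helpers, so program
and section names here are illustrative only.

```c
/* Clear the queue_mapping hint recorded on the veth RX side so the
 * physical multiqueue device selects its own TX queue again. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("tc")
int reset_queue_mapping(struct __sk_buff *skb)
{
	/* queue_mapping is writable from tc programs; 0 means "no hint",
	 * letting the stack pick the TX queue on the egress device. */
	skb->queue_mapping = 0;
	return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";
```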

Kernel will catch up with fix in 710ad98c363a ("veth: Do not record rx queue
hint in veth_xmit").

Fixes: cilium#18311
Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Link (Bug): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=edbea922025169c0e5cdca5ebf7bf5374cc5566c
Link (Fix): https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=710ad98c363a66a0cd8526465426c5c5f8377ee0
Signed-off-by: Aditi Ghag <aditi@cilium.io>
@aditighag
Member Author

Need to rebase the PR.


All tests passed.

@aditighag
Member Author

aditighag commented Jan 18, 2022

test-backport-1.10

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests NodePort

Failure Output

FAIL: Can not connect to service "tftp://192.168.36.12:32038/hello" from outside cluster (2/10)

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

Member

@jibi left a comment


👍 for my commit, thanks

@aditighag
Member Author

net-next failure: #12511.
CodeQL errors have been reported offline and are misleading.

@aditighag merged commit 42df137 into cilium:v1.10 on Jan 18, 2022
Labels
kind/backports This PR provides functionality previously merged into master.

4 participants