Fix possible IP leak in case ENIs are not present in the CN yet #18352
Conversation
FYI, this was found/fixed/tested on the v1.11 branch. Backporting to at least 1.11 is required, I assume.
Some additional info on how I encountered this issue and maybe how to reproduce it.

We have a cluster running that does automatic downscaling of all deployments at night and then relies on cluster-autoscaler to also shut down nodes. Next morning, all deployments are upscaled again, causing cluster-autoscaler to also start many nodes at once. This causes many nodes to appear in k8s at the same time, all being `NotReady` at the beginning. Cilium agents are then started on each node. When cilium agents start to get ready, the nodes are also marked `Ready`, causing the k8s scheduler to immediately schedule dozens of PODs onto the `Ready` nodes, long before cilium-operator had a chance to attach new ENIs and IPs to the fresh nodes. This means that all PODs scheduled to the fresh nodes run into a temporary state where the CNI plugin reports that there are no more IPs available. All this is expected and normal up to this point.

After a few seconds, cilium-operator finishes attaching new ENIs to the fresh nodes and then tries to update the CN. The update to spec.pool seems to be successful then, causing the agent to allocate the IP. But as the update to the status seems to fail, the agent then bails out with the IP being marked as used, thus causing the leak. This only happens with very high load on the apiserver. At the same time, I can observe errors like these happening in cilium-operator:

```
level=warning msg="Failed to update CiliumNode" attempt=1 error="Operation cannot be fulfilled on ciliumnodes.cilium.io \"ip-100-66-62-168.eu-central-1.compute.internal\": the object has been modified; please apply your changes to the latest version and try again" instanceID=i-009466ca3d82a1ec0 name=ip-100-66-62-168.eu-central-1.compute.internal subsys=ipam updateStatus=true
```

Please note the `attempt=1` in the log line; it indicates that the first attempt also failed and that no further attempt is done (looking at the many `for retry := 0; retry < 2; retry++` loops found in the code). I assume (without 100% knowing) that this is the reason for the inconsistency in spec vs status.
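For reference, a minimal Go sketch of the bounded-retry pattern described above (the `updateCiliumNode` helper and the error value are hypothetical stand-ins, not the actual operator code): with `retry < 2` there are two attempts in total, so a conflict on `attempt=1` is final and the status update is dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the apiserver's optimistic-concurrency error
// seen in the log line above.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// updateCiliumNode is a hypothetical stand-in for the operator's CN update
// call; under heavy apiserver load every attempt may hit a conflict.
func updateCiliumNode() error {
	return errConflict
}

func main() {
	var err error
	// Mirrors the `for retry := 0; retry < 2; retry++` loops mentioned
	// above: two attempts total, i.e. a single retry.
	for retry := 0; retry < 2; retry++ {
		if err = updateCiliumNode(); err == nil {
			break
		}
		fmt.Printf("Failed to update CiliumNode attempt=%d error=%q\n", retry, err)
	}
	if err != nil {
		// If attempt=1 also fails, the status update is silently lost
		// and spec/status can drift apart.
		fmt.Println("giving up after 2 attempts")
	}
}
```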
Thanks for investigating and submitting this PR, great work!
Just one comment below on error / log handling.
Additionally from #18352 (comment):
Pinging @jaffcheng, who worked on this code last, for a quick look.
@christarazi Thanks for taking a look at the PR. I've applied your suggestion regarding the error message and also added the repro/description/analysis to the commit message. I'll file the issue about the retries tomorrow.
Thanks!
Here is the issue for the retries: #18366
buildAllocationResult may return an error in case of inconsistencies found in the local CN's status. For example, there are situations where an IP is already part of spec.ipam.pool (including the resource/ENI where the IP comes from), while the corresponding ENI is not part of status.eni.enis yet.

If that is the case, the IP would be allocated (e.g. by allocateNext) and then marked as allocated (via a.markAllocated). Shortly after that, a.buildAllocationResult() would fail and then NOT undo the changes done by a.markAllocated(). This then results in the IP never being freed up again. At the same time, kubelet keeps scheduling PODs onto the same node without knowing that IPs have run out, causing new PODs to never get an IP.

Why exactly this inconsistency between the spec and the status arises is a different topic and should maybe be investigated further.

This commit/PR fixes the issue by simply moving a.markAllocated() after the a.buildAllocationResult() call, so that the function bails out early enough.

Some additional info on how I encountered this issue and maybe how to reproduce it. We have a cluster running that does automatic downscaling of all deployments at night and then relies on cluster-autoscaler to also shut down nodes. Next morning, all deployments are upscaled again, causing cluster-autoscaler to also start many nodes at once. This causes many nodes to appear in k8s at the same time, all being `NotReady` at the beginning. Cilium agents are then started on each node. When cilium agents start to get ready, the nodes are also marked `Ready`, causing the k8s scheduler to immediately schedule dozens of PODs onto the `Ready` nodes, long before cilium-operator had a chance to attach new ENIs and IPs to the fresh nodes.

This means that all PODs scheduled to the fresh nodes run into a temporary state where the CNI plugin reports that there are no more IPs available. All this is expected and normal up to this point.

After a few seconds, cilium-operator finishes attaching new ENIs to the fresh nodes and then tries to update the CN. The update to spec.pool seems to be successful then, causing the agent to allocate the IP. But as the update to the status seems to fail, the agent then bails out with the IP being marked as used, thus causing the leak. This only happens with very high load on the apiserver. At the same time, I can observe errors like these happening in cilium-operator:

```
level=warning msg="Failed to update CiliumNode" attempt=1 error="Operation cannot be fulfilled on ciliumnodes.cilium.io \"ip-100-66-62-168.eu-central-1.compute.internal\": the object has been modified; please apply your changes to the latest version and try again" instanceID=i-009466ca3d82a1ec0 name=ip-100-66-62-168.eu-central-1.compute.internal subsys=ipam updateStatus=true
```

Please note the `attempt=1` in the log line; it indicates that the first attempt also failed and that no further attempt is done (looking at the many `for retry := 0; retry < 2; retry++` loops found in the code). I assume (without 100% knowing) that this is the reason for the inconsistency in spec vs status.

Signed-off-by: Alexander Block <ablock84@gmail.com>
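To make the fix concrete, here is a minimal Go sketch of the reordering described in the commit message. The types and method signatures are illustrative stand-ins, not the actual code from Cilium's IPAM package:

```go
package ipam

import "fmt"

// AllocationResult and allocator are simplified stand-ins for the real
// types in Cilium's IPAM code; the method bodies are illustrative only.
type AllocationResult struct{ IP string }

type allocator struct {
	allocated map[string]bool // IPs currently marked as in use
}

// buildAllocationResult fails when the ENI backing the IP is present in
// spec.ipam.pool but missing from status.eni.enis.
func (a *allocator) buildAllocationResult(ip string) (*AllocationResult, error) {
	return nil, fmt.Errorf("unable to find ENI for IP %s in CN status", ip)
}

func (a *allocator) markAllocated(ip string) {
	a.allocated[ip] = true
}

// allocateNext shows the fixed ordering: the result is built first, and
// the error path returns before any allocator state is mutated. Before
// the fix, markAllocated ran first, so a buildAllocationResult failure
// left the IP permanently marked as used (the leak).
func (a *allocator) allocateNext(ip string) (*AllocationResult, error) {
	result, err := a.buildAllocationResult(ip)
	if err != nil {
		// Bail out early: nothing to undo, the IP stays free.
		return nil, err
	}
	a.markAllocated(ip)
	return result, nil
}
```

The key property is that `markAllocated` only runs on the success path, so an inconsistent CN status leaves no state to roll back.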
Very nice, thank you :)
/test

Job 'Cilium-PR-K8s-GKE' hit: #17270 (93.27% similarity)
Marked ready to merge as the CI failures are unrelated flakes. The L4LB test has been broken for some time due to infra issues.
CC @twpayne, this seems like it should be marked as a "bug" in the release notes rather than "minor". "Minor" would be anything the user should know about that isn't a bug.