
Fix possible IP leak in case ENI's are not present in the CN yet #18352

Merged
merged 1 commit into cilium:master from codablock:fix-ip-leak on Jan 4, 2022

Conversation

codablock
Contributor

buildAllocationResult may return an error when it finds inconsistencies in
the local CN's status. For example, there are situations where an IP is
already part of spec.ipam.pool (including the resource/ENI the IP comes
from), while the corresponding ENI is not yet part of status.eni.enis.

If that is the case, the IP is allocated (e.g. by allocateNext) and then
marked as allocated (via a.markAllocated). Shortly afterwards,
a.buildAllocationResult() fails and the changes made by a.markAllocated()
are NOT undone. As a result, the IP is never freed again. At the same time,
the Kubernetes scheduler keeps placing pods onto the same node without
knowing that its IPs have run out, so new pods never get an IP.

Why exactly this inconsistency between the spec and the status arises is
a different topic and should perhaps be investigated further.

This commit/PR fixes the issue by simply moving a.markAllocated() after
the a.buildAllocationResult() call, so that the function bails out early
enough.
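
For illustration, here is a minimal, self-contained Go sketch of the ordering problem and of the fix. The `allocator`, `markAllocated`, and `buildAllocationResult` stand-ins below are simplified assumptions for demonstration only, not the actual code in pkg/ipam/crd.go:

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// Minimal stand-ins for the real Cilium IPAM types; only what the sketch needs.
type allocationResult struct{ IP net.IP }

type allocator struct {
	allocated map[string]bool // IPs currently marked as in use
	eniReady  bool            // whether status.eni.enis already lists the backing ENI
}

func (a *allocator) markAllocated(ip net.IP) { a.allocated[ip.String()] = true }

// buildAllocationResult fails when the ENI backing the IP is not yet present
// in the CiliumNode status, mirroring the inconsistency described above.
func (a *allocator) buildAllocationResult(ip net.IP) (*allocationResult, error) {
	if !a.eniReady {
		return nil, errors.New("ENI for IP not yet present in status.eni.enis")
	}
	return &allocationResult{IP: ip}, nil
}

// Buggy order: mark first, build later, never undo on failure -> leaked IP.
func (a *allocator) allocateNextBuggy(ip net.IP) (*allocationResult, error) {
	a.markAllocated(ip)
	return a.buildAllocationResult(ip)
}

// Fixed order: bail out before mutating state, so a failure leaks nothing.
func (a *allocator) allocateNextFixed(ip net.IP) (*allocationResult, error) {
	res, err := a.buildAllocationResult(ip)
	if err != nil {
		return nil, err
	}
	a.markAllocated(ip)
	return res, nil
}

func main() {
	ip := net.ParseIP("100.66.62.168")

	buggy := &allocator{allocated: map[string]bool{}}
	if _, err := buggy.allocateNextBuggy(ip); err != nil {
		fmt.Printf("buggy: err=%v leaked=%v\n", err, buggy.allocated[ip.String()]) // leaked=true
	}

	fixed := &allocator{allocated: map[string]bool{}}
	if _, err := fixed.allocateNextFixed(ip); err != nil {
		fmt.Printf("fixed: err=%v leaked=%v\n", err, fixed.allocated[ip.String()]) // leaked=false
	}
}
```

In the buggy ordering the error path returns with the IP still recorded as allocated; after the reordering the failing call happens before any state is mutated, so nothing has to be rolled back.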

Fix possible IP leak in case ENI's are not present in the CN yet

@codablock codablock requested review from a team and twpayne January 3, 2022 10:40
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label on Jan 3, 2022
@codablock
Contributor Author

FYI, this was found/fixed/tested on the v1.11 branch. I assume backporting to at least 1.11 is required.

@codablock
Contributor Author

codablock commented Jan 3, 2022

Some additional info on how I encountered this issue and how it might be reproduced: we have a cluster that automatically scales down all deployments at night and then relies on cluster-autoscaler to also shut down nodes. The next morning, all deployments are scaled up again, causing cluster-autoscaler to start many nodes at once.

This causes many nodes to appear in k8s at the same time, all of them NotReady at first. Cilium agents are then started on each node. When the cilium agents become ready, the nodes are also marked Ready, causing the k8s scheduler to immediately schedule dozens of pods onto the Ready nodes, long before cilium-operator has had a chance to attach new ENIs and IPs to the fresh nodes.

This means that all pods scheduled onto the fresh nodes run into a temporary state where the CNI plugin reports that there are no more IPs available. Up to this point, all of this is expected and normal.

After a few seconds, cilium-operator finishes attaching new ENIs to the fresh nodes and then tries to update the CN. The update to spec.pool then appears to succeed, causing the agent to allocate the IP. But as the update to the status appears to fail, the agent bails out with the IP still marked as used, causing the leak.

This only happens under very high load on the apiserver. At the same time, I can observe errors like these in cilium-operator:

level=warning msg="Failed to update CiliumNode" attempt=1 error="Operation cannot be fulfilled on ciliumnodes.cilium.io \"ip-100-66-62-168.eu-central-1.compute.internal\": the object has been modified; please apply your changes to the latest version and try again" instanceID=i-009466ca3d82a1ec0 name=ip-100-66-62-168.eu-central-1.compute.internal subsys=ipam updateStatus=true

Please note the `attempt=1` in the log line: it indicates that the first attempt also failed and that no further attempt is made (looking at the many `for retry := 0; retry < 2; retry++` loops found in the code). I assume (without knowing for certain) that this is the reason for the inconsistency between spec and status.
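
For reference, the usual client-go pattern for this class of error is `retry.RetryOnConflict` from `k8s.io/client-go/util/retry`, which retries only on 409 conflicts after re-reading the object. The sketch below is not the cilium-operator code (that, as noted above, uses fixed two-attempt loops); `updateCNStatus` is a hypothetical stand-in that simulates a single conflict:

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/util/retry"
)

var attempts int

// updateCNStatus is a hypothetical stand-in for a CiliumNode status update;
// the first call fails with a 409 conflict, the second succeeds.
func updateCNStatus(name string) error {
	attempts++
	if attempts == 1 {
		gr := schema.GroupResource{Group: "cilium.io", Resource: "ciliumnodes"}
		return apierrors.NewConflict(gr, name, fmt.Errorf("the object has been modified"))
	}
	return nil
}

func main() {
	name := "ip-100-66-62-168.eu-central-1.compute.internal"

	// RetryOnConflict re-runs the closure only while it returns a conflict
	// error, instead of giving up after a fixed number of attempts.
	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// In real code: re-read the latest CiliumNode here, re-apply the
		// change, then call Update/UpdateStatus.
		return updateCNStatus(name)
	})
	fmt.Printf("final error: %v, attempts: %d\n", err, attempts)
}
```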

Member

@christarazi christarazi left a comment


Thanks for investigating and submitting this PR, great work!

Just one comment below on error / log handling.

pkg/ipam/crd.go (review comment: outdated, resolved)
@christarazi
Member

Additionally from #18352 (comment):

  • Could you add the repro steps and explanation in the commit msg itself? This is helpful for developers in the future coming across this and being reminded how the issue manifested.
  • Could you file an issue to fix the insufficient number of attempts to resolve the custom resource update conflict in the Operator?

@christarazi
Member

Pinging @jaffcheng, who worked on this code last, for a quick look.

@codablock
Contributor Author

codablock commented Jan 3, 2022

@christarazi Thanks for taking a look at the PR. I've applied your suggestion in regard to the error message and also added the repro/description/analysis to the commit message.

I'll file the issue about the retries tomorrow.

Member

@christarazi christarazi left a comment


Thanks!

@codablock
Contributor Author

Here is the issue for the retry problem: #18366

Fix possible IP leak in case ENI's are not present in the CN yet

Signed-off-by: Alexander Block <ablock84@gmail.com>
@twpayne twpayne added the release-note/minor label on Jan 4, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label on Jan 4, 2022
@twpayne twpayne added the dont-merge/needs-release-note-label and needs-backport/1.11 labels on Jan 4, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label on Jan 4, 2022
Contributor

@twpayne twpayne left a comment


Very nice, thank you :)

@twpayne
Contributor

twpayne commented Jan 4, 2022

/test

Job 'Cilium-PR-K8s-GKE' hit: #17270 (93.27% similarity)

@christarazi christarazi added the ready-to-merge label on Jan 4, 2022
@christarazi
Member

Marked ready to merge as the CI failures are unrelated and are flakes. The L4LB test has been broken for some time due to infra issues.

@christarazi christarazi merged commit aea1b9f into cilium:master Jan 4, 2022
@codablock codablock deleted the fix-ip-leak branch January 4, 2022 21:37
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.7 Jan 14, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.7 Jan 18, 2022
@joestringer joestringer added the backport-done/1.11 label and removed the backport-pending/1.11 label on Jan 18, 2022
@joestringer joestringer added the release-note/bug label and removed the release-note/minor label on Jan 19, 2022
@joestringer
Member

CC @twpayne, this seems like it should be marked as a "bug" in the release notes rather than "minor". "Minor" would be anything the user should know about that isn't a bug.
