Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon: add cleanup for stale local ciliumendpoints that aren't being managed. #20350

Merged
merged 6 commits into from
Nov 21, 2022

Conversation

tommyp1ckles
Copy link
Contributor

@tommyp1ckles tommyp1ckles commented Jun 29, 2022

It's possible for CiliumEndpoints to become stale where they still reference existing Pods that are no longer being managed by Cilium.
In this scenario, the operator will not GC these CEPs as they have a valid pod owner reference.
This commit adds an init cleanup which cleans up stale ceps. As well, cep/ces K8s watchers will mark such CEPs for deletion and a controller GC routine will periodically GC the old CEPs.

Fixes #17631

Background, we've seen some instances where Pods CEP have become stale & out-of-sync with the actual Pod they're meant to be managing. Particularly in the following two cases:

  1. Pods somehow un-managed, while retaining their CEP. One way that this can happen (i.e. I reproduced by) if the /etc/cni... Cilium config files get removed and the Node is restarted and loses its Cilium bpf state. In this case the same Pod might get re-sandboxed with another CNI (i.e. if the containers have to be restarted). At this point you'll have a Pod with a CEP but no endpoint in the endpoint manager. In this situation the CEP IP and actual Pod IP are likely to differ since the Pod has been restarted under a different CNI. Controller will not GC as the Pod.UUID and owner reference have not changed.

  2. Pod becomes un-managed due to lost state. This can happen if the bpf state gets deleted for an endpoint (such as with a temporary fs following a reboot). Once you restart the Cilium pod, upon restore the existing endpoint will not be restored. But, the Pod is still running with all the Cilium state still intact.

In both cases, the agent can determine if the CEP should be deleted by checking against it's managed endpoints. If none exist then we know that that Pod is unmanaged. Endpoints that are changed, such as in the case of a Pod container being killed will have its CiliumEndpoint eventually resynced via the k8s sync controller.

Added Agent init check that removes all CiliumEndpoints referencing local Node that are not managed. This fixes issues where sometimes CiliumEndpoints referencing still running Pods can become unmanaged during Cilium restart.

@maintainer-s-little-helper
Copy link

Commit 0740077f1c203ddf5604d04d5c4da5fdd003313c does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 29, 2022
@tommyp1ckles
Copy link
Contributor Author

/test

@maintainer-s-little-helper
Copy link

Commits 0740077f1c203ddf5604d04d5c4da5fdd003313c, 6a286990870d06c7fff055023656a6efba62837a do not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/stale-cep-cleanup branch from 6a28699 to 76ff3fd Compare June 29, 2022 22:07
@maintainer-s-little-helper
Copy link

Commit 76ff3fddd544401c9cd9215d7d42e16b5204767f does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/stale-cep-cleanup branch from 76ff3fd to a1c4ca5 Compare June 30, 2022 05:37
@maintainer-s-little-helper
Copy link

Commit a1c4ca5169db32ad772c747d343f1bc0016b0ebd does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/stale-cep-cleanup branch from a1c4ca5 to d110e68 Compare June 30, 2022 06:39
@maintainer-s-little-helper
Copy link

Commit d110e68257a3f6c28805fc63a855d5c913aeeaa1 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@tommyp1ckles tommyp1ckles marked this pull request as ready for review June 30, 2022 06:40
@tommyp1ckles tommyp1ckles requested a review from a team June 30, 2022 06:40
@tommyp1ckles tommyp1ckles requested a review from a team as a code owner June 30, 2022 06:40
@tommyp1ckles tommyp1ckles requested a review from a team June 30, 2022 06:40
@tommyp1ckles tommyp1ckles requested a review from a team as a code owner June 30, 2022 06:40
@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/stale-cep-cleanup branch from d110e68 to 8643d0e Compare June 30, 2022 07:01
@maintainer-s-little-helper
Copy link

Commit 8643d0ee2e2d648474e90b01ed52345e619b4e94 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/stale-cep-cleanup branch from 8643d0e to 0d96615 Compare June 30, 2022 07:02
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 30, 2022
@tommyp1ckles
Copy link
Contributor Author

Lots of failures, going to take a look at these

@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/stale-cep-cleanup branch 2 times, most recently from d054300 to 20be299 Compare July 4, 2022 22:44
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! A few comments below. Overall the approach is sound.

daemon/cmd/ciliumendpoints.go Outdated Show resolved Hide resolved
daemon/cmd/ciliumendpoints.go Outdated Show resolved Hide resolved
daemon/cmd/ciliumendpoints.go Outdated Show resolved Hide resolved
daemon/cmd/ciliumendpoints_test.go Outdated Show resolved Hide resolved
pkg/k8s/factory_functions.go Outdated Show resolved Hide resolved
pkg/k8s/watchers/cilium_endpoint.go Outdated Show resolved Hide resolved
pkg/option/config.go Outdated Show resolved Hide resolved
daemon/cmd/ciliumendpoints.go Outdated Show resolved Hide resolved
@bimmlerd bimmlerd mentioned this pull request Dec 1, 2022
14 tasks
@pippolo84 pippolo84 mentioned this pull request Dec 6, 2022
8 tasks
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.11.12 Dec 6, 2022
@gandro gandro mentioned this pull request Dec 6, 2022
10 tasks
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Needs backport from master in 1.11.12 Dec 7, 2022
@gandro
Copy link
Member

gandro commented Dec 7, 2022

@tommyp1ckles For the tophats, this seems to be non-trivial to backport to v1.12, 1.11 and 1.11 due to changes in the agent structure. Could you attempt the backport yourself? Thanks!

@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.12.5 Dec 7, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.10.18 Dec 7, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.11.12 Dec 7, 2022
@gandro
Copy link
Member

gandro commented Dec 7, 2022

Removed the backport labels, as this has been picked up for automation multiple times already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. kind/feature This introduces new functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CEPs might be kept for unmanaged pods