daemon: add cleanup for stale local ciliumendpoints that aren't being managed. #20350
Commit 0740077f1c203ddf5604d04d5c4da5fdd003313c does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
/test
Commits 0740077f1c203ddf5604d04d5c4da5fdd003313c, 6a286990870d06c7fff055023656a6efba62837a do not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Commit 76ff3fddd544401c9cd9215d7d42e16b5204767f does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Commit a1c4ca5169db32ad772c747d343f1bc0016b0ebd does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Commit d110e68257a3f6c28805fc63a855d5c913aeeaa1 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Commit 8643d0ee2e2d648474e90b01ed52345e619b4e94 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
/test
Lots of failures, going to take a look at these.
/test
Nice work! A few comments below. Overall the approach is sound.
@tommyp1ckles For the tophats, this seems to be non-trivial to backport to v1.12, 1.11 and 1.11 due to changes in the agent structure. Could you attempt the backport yourself? Thanks!
Removed the backport labels, as this has been picked up for automation multiple times already.
CiliumEndpoints (CEPs) can become stale: they still reference existing Pods that are no longer being managed by Cilium.
In this scenario, the operator will not GC these CEPs because they have a valid Pod owner reference.
This commit adds an init-time cleanup that removes stale CEPs. In addition, the CEP/CES K8s watchers mark such CEPs for deletion, and a controller GC routine periodically deletes the marked CEPs.
Fixes #17631
Background: we've seen some instances where a Pod's CEP has become stale and out of sync with the actual Pod it is meant to describe, particularly in the following two cases:
Pods somehow become unmanaged while retaining their CEP. One way this can happen (and how I reproduced it) is if the /etc/cni... Cilium config files get removed and the Node is restarted and loses its Cilium BPF state. In this case the same Pod might get re-sandboxed with another CNI (e.g. if the containers have to be restarted). At this point you'll have a Pod with a CEP but no endpoint in the endpoint manager. In this situation the CEP IP and the actual Pod IP are likely to differ, since the Pod has been restarted under a different CNI. The controller will not GC the CEP, as the Pod.UUID and owner reference have not changed.
The Pod becomes unmanaged due to lost state. This can happen if the BPF state gets deleted for an endpoint (such as with a temporary fs following a reboot). Once you restart the Cilium pod, the existing endpoint will not be restored, but the Pod is still running with all the Cilium state still intact.
In both cases, the agent can determine whether the CEP should be deleted by checking it against its managed endpoints. If no matching endpoint exists, we know that the Pod is unmanaged. Endpoints that have changed, such as when a Pod container is killed, will have their CiliumEndpoint eventually resynced via the k8s sync controller.
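The check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `CEP`, `EndpointLookup`, `localEndpoints`, and `isStale` are hypothetical, simplified names, and the real CiliumEndpoint object and endpoint manager carry far more state.

```go
package main

import "fmt"

// CEP is a hypothetical, simplified stand-in for a CiliumEndpoint object.
type CEP struct {
	Namespace string
	Name      string
	NodeIP    string // node the CEP claims to be running on
}

// EndpointLookup is a hypothetical view of the agent's endpoint manager:
// it only answers whether a local endpoint exists for a given Pod.
type EndpointLookup interface {
	HasEndpoint(namespace, name string) bool
}

// localEndpoints is a toy in-memory lookup keyed by "namespace/name".
type localEndpoints map[string]struct{}

func (l localEndpoints) HasEndpoint(namespace, name string) bool {
	_, ok := l[namespace+"/"+name]
	return ok
}

// isStale reports whether a CEP belongs to this node but has no matching
// managed endpoint. CEPs on other nodes are never considered stale here,
// because only the owning agent can know what it manages.
func isStale(cep CEP, localNodeIP string, eps EndpointLookup) bool {
	if cep.NodeIP != localNodeIP {
		return false
	}
	return !eps.HasEndpoint(cep.Namespace, cep.Name)
}

func main() {
	eps := localEndpoints{"default/managed": {}}
	node := "10.0.0.1"

	for _, cep := range []CEP{
		{Namespace: "default", Name: "managed", NodeIP: node},
		{Namespace: "default", Name: "orphan", NodeIP: node},
		{Namespace: "default", Name: "remote", NodeIP: "10.0.0.2"},
	} {
		fmt.Printf("%s/%s stale=%v\n", cep.Namespace, cep.Name, isStale(cep, node, eps))
	}
}
```

Restricting the check to local CEPs matches the scope in the PR title: each agent only cleans up CEPs for its own node.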