kube-apiserver IP addresses disappear from bpf ipcache 10 minutes after restarting cilium-agent #24502
Comments
Hi @svanschie, thanks for the report. I'm wondering, in the case that doesn't work, does Cilium successfully sync with kube-apiserver on startup? Also, do you see any other matches for the kube-apiserver IP addresses in the logs beyond the output that you've shared? The intent behind the grace period logic is to restore information about IPs (such as kube-apiserver IPs) from the previous cilium-agent run, and give enough time for cilium-agent to sync with the kube-apiserver to pick up the new kube-apiserver IPs. If that synchronization step doesn't occur, I could see how these IPs end up getting deleted from the ipcache later on.
Hi @joestringer, I'm not entirely sure what to look for in the logs in regard to syncing with kube-apiserver, hopefully this helps (there's lots of logs in between these entries but these seemed most relevant):
The logs I provided include all matches for the API server IPs; this is based on logs for the first 20 minutes after a restart of the Cilium agent.
Hey guys, I'm having exactly the same issue... everything was working great, then suddenly all endpoints that require access to the kube-apiserver start getting dropped... is there a workaround?
@voltagebots is this also on EKS?
Working with @voltagebots here; yes, EKS 1.23 in our case. What other info can I grab for you? I can confirm that after rolling back to 1.12.8 I don't see this issue anymore.
I tried to naively reproduce this on EKS with images from master, no luck; even after 30 minutes, the identities were still there. I'll try and reproduce with 1.13.1 now.
Hello, I have this problem too. I have an egress network policy to kube-apiserver; in 1.13.0 it was still OK, but after upgrading to 1.13.1 the kube-apiserver IP address disappears from bpf after 10-20 minutes. When I rolled back to 1.13.0 I don't see the issue anymore. Upgrade via helm chart, btw.
I'm seeing a very similar issue. I rolled back to 1.13.0, the problem persists.
In my case I see drops such as (10.21.96.226 is the IP of the Kube API server):
The identity does not show the
If I list Identities, I do see an identity with the correct IP and labels:
But the ipcache has that IP mapped to the
As with OP, restarting Cilium fixes it for a while, but it comes back after about 20 minutes. I was seeing the exact same behavior with 1.13.1, and rolled back to 1.13.0 but it persists. I will roll back to 1.12.8 to see if that resolves it, but (assuming this is the same issue) it does appear to be present in both 1.13.1 and 1.13.0.
Edited to add: This is an EKS cluster.
Update: Was unable to reproduce in 1.12.8.
I was, again, not able to reproduce this by merely creating a cluster. I find it curious that in all of these cases, there is some kind of duplication around identities. Specifically, the IPs of the apiservers show up in multiple identities. Out of curiosity, are you doing anything explicit that would cause this to happen? Manually referencing those IPs in a CiliumNetworkPolicy? Anything else to help me reproduce?
So in my case, the drops weren't immediate... like we have had cilium running in our env - IIRC for now. We intend to deploy 1.13.0/1.13.1 on our test env and try to reproduce the issue... will get back to you once done.
In my case there are other policies in the cluster which reference the same CIDRs (/24) that the API server is in. These are the private subnets of the VPC. I don't know if this would impact anything, but worth noting that we're using
One observation: restarting the cilium agent allocates a new numerical identity for the apiserver IPs. This is surprising to me.
So, garbage collection relies on a specific identity not existing to keep from deleting the CIDR identity. This strikes me as incredibly fragile. Am I missing something, @christarazi?
I think the logic is correct there, but it assumes that the identities are associated with the expected labels. The association of labels with the identities seems to be the mismatch here.
Let's say you have a policy which allows to
So in the end, you have
When the kube-apiserver logic kicks in and notices that
For the reference counting to be done properly, the 99 identity should have been deallocated. It seems like what's happening in this bug is that we are not only deallocating the old identity (99), but also the new one somehow. My suspicion is something to do with how the CIDR restoration logic (i.e. the "Reallocated restored CIDR identity" log msg) interacts with the ipcache / identity allocation / labels.
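To make the reference-counting point above more concrete, here is a minimal, self-contained Go sketch. It is not Cilium's actual allocator (the types, label keys, and IDs are made up for illustration); it only shows how identities keyed by their label set behave when the kube-apiserver label gets added to an IP that previously only had a CIDR-only identity, and why only the old identity should be released:

```go
package main

import "fmt"

// identity is a toy stand-in for a security identity: a numeric ID,
// the canonical label-set key it was allocated for, and a reference count.
type identity struct {
	id       int
	labels   string
	refCount int
}

// allocator hands out identities keyed by label set, a very simplified
// version of the behavior described in the comment above.
type allocator struct {
	next    int
	byLabel map[string]*identity
}

func (a *allocator) acquire(labels string) *identity {
	if id, ok := a.byLabel[labels]; ok {
		id.refCount++
		return id
	}
	a.next++
	id := &identity{id: a.next, labels: labels, refCount: 1}
	a.byLabel[labels] = id
	return id
}

func (a *allocator) release(labels string) {
	if id, ok := a.byLabel[labels]; ok {
		id.refCount--
		if id.refCount == 0 {
			delete(a.byLabel, labels)
		}
	}
}

func main() {
	a := &allocator{byLabel: map[string]*identity{}}

	// A policy referencing the CIDR allocates a CIDR-only identity (think "99").
	oldID := a.acquire("cidr:10.21.96.0/24")

	// When the kube-apiserver logic kicks in, the same IP is re-resolved with
	// an extra label. That is a different key, hence a new numeric identity.
	newID := a.acquire("cidr:10.21.96.0/24,reserved:kube-apiserver")

	// For the counts to balance, only the old identity should be released.
	// The bug discussed here behaves as if the new one ends up released as well.
	a.release(oldID.labels)

	fmt.Printf("old=%d new=%d identities remaining=%d\n", oldID.id, newID.id, len(a.byLabel))
}
```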
So, on startup, the basic flow is
Step 4 doesn't delete the kube-apiserver identities because it only deletes identities that have cidr labels. Specifically, it looks for identities with a complete set of cidr labels and no others. It seems like, when this bug occurs, step 3 is somehow broken in that it creates new identities for the apiserver, but doesn't delete the restored ones. The question: why?
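For illustration, here is a rough Go sketch of what the step-4 check described above amounts to; the label strings and identity numbers are assumptions for the example, not Cilium's real types. Only identities whose labels are exclusively CIDR-derived would be released by the grace-period cleanup, so an identity that also carries reserved:kube-apiserver is left alone:

```go
package main

import (
	"fmt"
	"strings"
)

// hasOnlyCIDRLabels mimics the "complete set of cidr labels and no others"
// check: every label must be CIDR-derived for the identity to qualify.
func hasOnlyCIDRLabels(labels []string) bool {
	if len(labels) == 0 {
		return false
	}
	for _, l := range labels {
		if !strings.HasPrefix(l, "cidr:") {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical restored identities and their labels after an agent restart.
	restored := map[int][]string{
		16777219: {"cidr:10.21.96.226/32", "reserved:kube-apiserver"},
		16777221: {"cidr:10.0.0.0/8"},
	}

	for id, labels := range restored {
		if hasOnlyCIDRLabels(labels) {
			fmt.Printf("identity %d: CIDR-only, grace-period cleanup would release it\n", id)
		} else {
			fmt.Printf("identity %d: has non-CIDR labels, cleanup leaves it alone\n", id)
		}
	}
}
```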
My initial idea was to track restored IDs, rather than restored CIDRs, and release those upon grace-period cleanup. However, that's not going to work right now. This is because
So, either we need to change how the CIDR release mechanism works, or think of a new approach.
I think this problem, which affects me too, may be showing another one. For me, when it occurs, flux cannot reconcile anymore because of a problem on all services.
I have to remove all
@n0rad that seems - at first glance - like a different issue. Would you mind opening a separate bug?
@schaffino @dctrwatson @svanschie I have a proposed fix for this. Would you be willing to try a test image and see if this resolves this issue for you? If so, please reach out to me on the Cilium slack, or via email. Thanks!
InjectLabels is one of the functions responsible for synchronizing the ipcache metadata store and ip store. As such, it shouldn't shortcut when the numeric identity is the same, but the source is different; this means that an update to the ipcache isn't complete. This can happen, for example, when there are two identities for the same IP, which can happen on daemon restart whenever a CIDR is referenced.
Fixes: cilium#24502
Signed-off-by: Casey Callendrello <cdc@isovalent.com>
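For readers following along, the shortcut described in this commit message can be sketched roughly as follows. This is a hedged toy example in Go with made-up types, not the actual ipcache implementation: the idea is that an update should only be skipped when both the numeric identity and the source of the cached entry match the incoming one, since a same-ID/different-source pair still needs to be written through to keep the metadata store and the ip store in sync.

```go
package main

import "fmt"

// entry is a toy ipcache entry: the numeric identity an IP maps to and the
// source that produced the mapping (e.g. restored state vs. kube-apiserver sync).
type entry struct {
	numericID int
	source    string
}

// needsUpdate reflects the fixed behavior described above: shortcut only when
// the cached entry matches the incoming one on both identity and source.
func needsUpdate(cached, incoming entry) bool {
	if cached.numericID == incoming.numericID && cached.source == incoming.source {
		return false // genuinely identical, safe to skip
	}
	// Same numeric identity but a different source still needs an update,
	// otherwise the ipcache update is left incomplete.
	return true
}

func main() {
	cached := entry{numericID: 16777219, source: "restored"}
	incoming := entry{numericID: 16777219, source: "kube-apiserver"}
	fmt.Println("update needed:", needsUpdate(cached, incoming))
}
```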
@squeed |
My tests are still failing with the same issue with
@dctrwatson looking back over your initial report, I saw this comment:
This comment seems to indicate that policy add/delete plays a role in the issue in your case.
Do you have a writeup somewhere around exactly how this test works? Do you think we could reproduce it by just following some instructions or running a script?
I can split off into a separate issue if that would help.
It's an internal e2e test suite. Gist is, it's adding/removing a handful of "services" (usually, 6 pods each) in parallel, each with their own network policy with ingress+egress rules to allow only access to pods in the same service. There is a global CCNP that allows egress to
@dctrwatson interesting. At first glance, that seems like a separate issue. When the bug happens, do all pods (with an applicable policy) on the node in question lose access to the apiserver? If you trigger it, can you confirm that the apiserver's IP is missing from
I have tried installing v1.13.2 and within 10 minutes I started seeing a lot of unexpected drops, so I tried the quay.io/cilium/cilium-dev:v1.13.2-with-24875 image and indeed it fixed these unwanted drops. From the image above it is possible to see that after deploying the image with the fix at 0820 UTC the pattern of the expected drops came back to normal, and it's been like that for ~11h. I will keep monitoring and if I spot something new I will update here.
No, only pods that have their specific network policy deleted are affected. Other pods on the node continue to work fine until then.
When I checked affected nodes, both apiserver IPs were there. That's why I wasn't sure if this was the same issue. It just manifested similarly.
[ upstream commit 8d3a498 ]
InjectLabels is one of the functions responsible for synchronizing the ipcache metadata store and ip store. As such, it shouldn't shortcut when the numeric identity is the same, but the source is different; this means that an update to the ipcache isn't complete. This can happen, for example, when there are two identities for the same IP, which can happen on daemon restart whenever a CIDR is referenced.
Fixes: cilium#24502
Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Apologies for the slow follow up. We've now done the v1.13.2 upgrade on a number of v1.13.0 and v1.13.1 clusters, setting the agent image specifically to quay.io/cilium/cilium-dev:v1.13.2-with-24875, and are no longer facing this issue.
@squeed This batch of about 30 clusters was all 1.13.1 to 1.13.2 upgrades. Unfortunately this doesn't seem completely solved, specifically in the case of an upgrade. About a third of the clusters exhibited 1 or more nodes with pods having issues communicating with the k8s master after upgrade; it also seemed to happen within less than the previous 10-minute window.
Post upgrade, agent restarts don't seem to be problematic or cause any outages that we have seen yet. I used to be able to consistently exhibit the problem by manually restarting an agent. This is no longer the case, so it looks fixed going forward. Now that things are upgraded and the nodes with k8s connectivity issues restarted, everything is gravy. But I wouldn't say the upgrade was smooth.
Additionally, there also seemed to be more significant network disruption with this rollout than I've previously experienced. Services exposed by external load balancers were getting 502s for periods of up to a couple of minutes during upgrades, when previously this seemed to only be a few seconds, if noticeable at all.
For context, this is on GKE, all clusters updated to 1.25.7-gke.1000 prior to the upgrade.
@schaffino thanks for the update. Of the pods that had issues, did they have unconstrained network access, or was it restricted by (Cilium)NetworkPolicy? Would you be able to get a sysdump off a node experiencing this issue? Feel free to give it to me privately. Do you use KubeProxyReplacement?
@dctrwatson |
I created a separate issue for what I'm seeing: #25172
Not seeing this issue on v1.12.x up through v1.12.9
@squeed

# Allow egress to the kubernetes API
- toServices:
  - k8sService:
      serviceName: kubernetes
      namespace: default
- toCIDR:
  - {{ .Values.private_kube_endpoint }}/32
  toPorts:
  - ports:
    - port: "443"
      protocol: TCP

I have noticed that some other users give a policy like the following. But that wouldn't make any difference, would it? Given the above has been working previously
We don't use KubeProxyReplacement. We use pretty much the default GKE config, except ipv4NativeRoutingCIDR is a wider range than just the clusters.
I feel like this must be something that is specific to cilium in GKE, because after a number of weeks of not having any problems, this has started randomly occurring on fresh cilium installs (with the patched agent) on newly deployed clusters. The GKE version has bumped a couple of times since. Symptoms are the same: no connectivity to the kube master. Fix is also the same: restart the agent on the problem node. I'll try and get a sysdump for you. Would you prefer I open a new issue?
Is there an existing issue for this?
What happened?
When I restart the cilium agent, everything works fine for 10 minutes. After that, connectivity to kube-apiserver breaks. I expect this happens due to the IP addresses of the API no longer being in the ipcache. The IP addresses seem to be removed from this cache as part of the functionality behind the --identity-restore-grace-period flag, which is set to the default of 10 minutes.
The IP addresses of my API server are:
After a restart, they both show up twice in the output of cilium identity list:
The corresponding entries in cilium bpf ipcache list:
After 10 minutes, identities 16777219 and 16777220 are removed:
And entries from ipcache for the API server IP addresses are also removed:
No new entries are added to the ipcache for the API server IP addresses.
When allowing outbound traffic to port 443 anywhere, it works fine, but with a policy that only allows traffic to the cluster (or kube-apiserver specifically) entities, the traffic is dropped:
Example of a dropped request:
Cilium Version
1.13.1
Kernel Version
5.15.79
Kubernetes Version
v1.24.8-eks-ffeb93d
Sysdump
No response
Relevant log output
Anything else?
When it's in a "broken" state and I restart the cilium agent, it will work again. After this restart it will also keep working past the 10 minutes. If the cilium agent is restarted again after that, it will break after 10 minutes again.
I was unable to reproduce the issue with Cilium v1.12.7. I compared the debug logging between v1.12.7 and v1.13.1 for what happened after 10 minutes. The primary differences are as follows:
v1.12.7:
v1.13.1:
Code of Conduct