Deadlock in Cilium Agent #23707
Comments
Thanks for the report! The log you shared contains the following error:
This could indicate that the liveness probe failed to respond because the agent crashed due to a concurrent map write. There is, however, also a recent FQDN-related deadlock we fixed (#23377) which might be of interest to you: that deadlock results in the following log line (which subsequently also causes the liveness check to fail):
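For context: in Go, unsynchronized writes to the same map from multiple goroutines abort the whole process with an unrecoverable fatal error, which is the failure mode the crash theory above refers to. Below is a minimal, self-contained sketch (illustrative only, not Cilium code) showing the usual mutex guard; removing the lock can trigger the crash:

```go
package main

import "sync"

func main() {
	m := map[string]int{}
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Without mu.Lock()/mu.Unlock(), concurrent writes to m can
			// abort the process with "fatal error: concurrent map writes",
			// which is not recoverable via recover().
			mu.Lock()
			m["worker"] = i
			mu.Unlock()
		}(i)
	}
	wg.Wait()
}
```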
I checked back with my colleagues and unfortunately, we do not recommend running an image built with the deadlock detection flag in production.
This fixes a panic in the `totalPoolSize` function. Previously, `totalPoolSize` required that the `crdAllocator` mutex was held. This however is not sufficient to block concurrent writes to the `allocationPoolSize` map, since that map is written to by `nodeStore.updateLocalNodeResource`, which only holds the `nodeStore` mutex.

This commit fixes the issue by moving the `totalPoolSize` function to the `nodeStore` and having it explicitly take the `nodeStore` mutex (instead of requiring the `crdAllocator` mutex to be held). This ensures that all access to `allocationPoolSize` is now protected by the `nodeStore` mutex.

The lock ordering is also preserved: the `crdAllocator` calls into `nodeStore`, but not vice versa. Thus, the lock ordering is always that the `crdAllocator` lock is held first, and the `nodeStore` lock second.

Related to: cilium#23707

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
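To illustrate the fix described in the commit message above, here is a minimal sketch of the resulting locking pattern, using simplified, hypothetical types (the real `crdAllocator` and `nodeStore` carry many more fields and methods):

```go
package ipamsketch

import "sync"

// nodeStore owns the allocationPoolSize map; every access goes through its mutex.
type nodeStore struct {
	mutex              sync.RWMutex
	allocationPoolSize map[string]int // pool name -> size
}

// updateLocalNodeResource writes to the map while holding only the nodeStore mutex.
func (n *nodeStore) updateLocalNodeResource(pool string, size int) {
	n.mutex.Lock()
	defer n.mutex.Unlock()
	n.allocationPoolSize[pool] = size
}

// totalPoolSize now lives on nodeStore and takes the nodeStore mutex itself,
// so readers no longer race with updateLocalNodeResource.
func (n *nodeStore) totalPoolSize() int {
	n.mutex.RLock()
	defer n.mutex.RUnlock()
	total := 0
	for _, size := range n.allocationPoolSize {
		total += size
	}
	return total
}

// crdAllocator calls into nodeStore, never the other way around, so the lock
// order is always: crdAllocator mutex first, nodeStore mutex second.
type crdAllocator struct {
	mutex sync.Mutex
	store *nodeStore
}

func (a *crdAllocator) poolSize() int {
	a.mutex.Lock()
	defer a.mutex.Unlock()
	return a.store.totalPoolSize()
}
```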
I've opened a PR to fix the panic. It's still possible, however, that you also observed a deadlock. Thus, if you have additional thread dumps from other runs, those would be helpful to confirm that there are indeed no other unknown problems.
Hey @gandro, thank you so much for looking into this issue. We indeed saw the error you pointed out. I've checked #23377 and we have indeed seen some occurrences of this log message followed by agent restarts (3 in Dev, 1 in Prod, in the last ~15 days), so we will definitely get this into our image. We haven't gotten any other thread dumps yet, but we will add them here when we get them. Thanks again!
Could the error you pointed out in #23713 have resulted in a deadlock / the agent becoming unresponsive / the health check failing?
Yes, at least in theory. The panic stack dump you provided above has its root in one of the status probes (see the code reference below). If the probe never returns a status (such as when it panics), it is eventually considered stale: Line 221 in 314ca7b
If there are any stale probes, then Cilium as a whole will not be marked as "ready"/"live": Lines 652 to 657 in 428c6d2
The only issue with that theory is that we should see the following warning emitted 15 seconds after the panic (which I do not see in the logs): Line 198 in 314ca7b
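For readers unfamiliar with the mechanism, here is a rough, self-contained sketch of the staleness pattern described above, with hypothetical names and intervals (Cilium's actual status collector is more involved; see the referenced lines):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// collector tracks when each probe last reported a result.
type collector struct {
	mu           sync.Mutex
	lastReported map[string]time.Time
	staleAfter   time.Duration
}

// report is called by a probe when it finishes; a probe that panics
// never calls it, so its timestamp stops advancing.
func (c *collector) report(probe string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastReported[probe] = time.Now()
}

// healthy returns false as soon as any probe is stale, which is what
// ultimately makes the agent's liveness endpoint fail.
func (c *collector) healthy(now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	for probe, last := range c.lastReported {
		if now.Sub(last) > c.staleAfter {
			fmt.Printf("probe %q is stale\n", probe)
			return false
		}
	}
	return true
}

func main() {
	c := &collector{
		lastReported: map[string]time.Time{},
		staleAfter:   15 * time.Second,
	}
	c.report("example-probe")
	fmt.Println("healthy:", c.healthy(time.Now()))                     // true: probe reported recently
	fmt.Println("healthy:", c.healthy(time.Now().Add(30*time.Second))) // false: probe never reported again
}
```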
Are you still experiencing this in v1.12.8?
We've merged a deadlock fix around IPCache and FQDN (#24672), which sounds like it could have been the issue here. Please test again with v1.12.9, which should be released next week.
Hello @christarazi, we did have one more crash occurrence with the patches you mentioned (we had backported them to our image). @gandro Great, I'll test it out then! To be fair, this is really rare; we last encountered the issue 20 days ago. Hopefully this fix is the right one! :) Thanks again!
This issue has been automatically marked as stale because it has not had recent activity.
This issue has not seen any activity since it was marked stale. |
Hi @mantoine96, it seems we are facing the same issue (however, we only observe the issue in the AWS Zurich region, and not in the AWS Frankfurt region). Did you upgrade to the latest version of Cilium? Did it resolve the issue? Thx!
Note that we fixed another related issue, #26242, which however is still in the process of being backported to v1.13.
Hello @gandro, just confirming that in the last ~74 days we have not had a single occurrence of the crash using our custom image based on v1.12.9. #26242 also looks similar; we are in the process of upgrading to v1.13, so I will look out for it! Thanks for your work on this! @lukaselmer I would therefore recommend you try out 1.12.9 if you're experiencing similar issues. Like I mentioned, we haven't had an issue in over 74 days, which is pretty good going.
Thanks @mantoine96 @gandro 🙏 In that case we'll upgrade asap 🚀🚀🚀
Is there an existing issue for this?
What happened?
After running for a while (typically, for us, several days), a Cilium agent will become unresponsive and start failing its liveness checks, which causes it to be restarted by Kubernetes. Since we rely heavily on FQDN rules, this ends up causing network disruption on the affected host.
Cilium Version
1.12.5 820a308 2022-12-22T16:16:56+00:00 go version go1.18.9 linux/amd64
This is essentially the released 1.12.5 + this PR backported: #22252
Kernel Version
5.4.228-131.415.amzn2.x86_64
Kubernetes Version
v1.23.14-eks-ffeb93d
Sysdump
No response
Relevant log output
No response
Anything else?
The entire thread dump can be found here: cilium-wh7xn.log
This was captured by setting a preStop lifecycle hook to run `kill -s ABRT 1`. This issue started occurring with v1.12.2 for us.
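As an aside, goroutine dumps like the one attached can also be collected without terminating the process. Below is a small, self-contained sketch (hypothetical, not part of the Cilium agent) that writes all goroutine stacks to stderr whenever it receives SIGUSR1:

```go
package main

import (
	"os"
	"os/signal"
	"runtime"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	go func() {
		for range sigs {
			// runtime.Stack with all=true formats the stacks of every
			// goroutine, similar to the dump produced when the runtime
			// aborts on a fatal signal.
			buf := make([]byte, 1<<20)
			n := runtime.Stack(buf, true)
			os.Stderr.Write(buf[:n])
		}
	}()

	// Placeholder workload so the process stays alive: `kill -USR1 <pid>`
	// prints the dump without killing the process.
	for {
		time.Sleep(time.Second)
	}
}
```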
cc @joaoubaldo @carloscastrojumo
From our dashboards, this is what we can observe:
The restart of the agent (or the agent going unhealthy) happens shortly after a large number of endpoints become not-ready.
We have other metrics and logs if needed so let us know if you need more information for troubleshooting.
Would you recommend running an image compiled with the deadlock detection flag in production? We unfortunately cannot replicate this issue in a testing cluster.
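For reference, one common way such deadlock-detection builds are put together in Go (a generic sketch of the pattern; not necessarily how Cilium's build flag is wired up) is to alias the mutex types to a detecting implementation such as github.com/sasha-s/go-deadlock behind a build tag:

```go
//go:build lockdebug

// Package lock sketches a build-tag-gated mutex alias: with -tags lockdebug,
// all callers transparently get deadlock detection; without it, a sibling
// file guarded by //go:build !lockdebug aliases plain sync mutexes instead.
package lock

import (
	"time"

	"github.com/sasha-s/go-deadlock"
)

// Mutex and RWMutex are drop-in replacements that report lock-order
// inversions and locks waited on for longer than the configured timeout.
type (
	Mutex   = deadlock.Mutex
	RWMutex = deadlock.RWMutex
)

func init() {
	// Report a potential deadlock if a lock cannot be acquired within 30s
	// (an illustrative value, not a Cilium default).
	deadlock.Opts.DeadlockTimeout = 30 * time.Second
}
```

The bookkeeping these wrappers do on every lock acquisition adds measurable overhead, which is the usual reason such builds are discouraged in production, as noted in the maintainers' reply above.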
Code of Conduct