[backport-v1.11] agent: dump stack on stale probes #24977
[ backport of d85c093 ]
[ upstream commit 87f7a11 ]

Most of the time, when we see a stale probe, it's due to a deadlock. So, write a stack dump to disk (since we're probably going to be restarted soon due to a liveness probe).

To prevent any sort of excessive resource consumption, only dump stack once every 5 minutes, and always write to the same file. Also, let's make the check lock-free while we're at it.

Also, make sure we capture this file in bugtool.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
/test-backport-1.11

Job 'Cilium-PR-K8s-1.18-kernel-4.9' failed.

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.18-kernel-4.9/2698/

If it is a flake and a GitHub issue doesn't already exist to track it, comment Then please upload the Jenkins artifacts to that issue.
Travis failure is somewhat comical: we output too many log lines.
/test-1.18-4.9
known flake #24394 in test-1.18-4.9.
/test-1.18-4.9
/test-1.18-4.9
known flakes only, merging.
Backport of #24213
(This needed to be manually backported since the original commit used Go stdlib functions that are new in Go 1.19.)
Most of the time, when we see a stale probe, it's due to a deadlock. So, write a stack dump to disk (since we're probably going to be restarted soon due to a liveness probe).
To prevent any sort of excessive resource consumption, only dump stack once every 5 minutes, and always write to the same file. Also, let's make the check lock-free while we're at it.
Also, make sure we capture this file in bugtool.