[backport-v1.11] agent: dump stack on stale probes #24977

Merged
jrajahalme merged 1 commit into cilium:v1.11 on May 11, 2023

@squeed squeed commented Apr 19, 2023

Backport of #24213

(This needed to be manually backported because the original commit used Go stdlib functions that are new in 1.19.)

Most of the time, when we see a stale probe, it's due to a deadlock. So, write a stack dump to disk (since we're probably going to be restarted soon due to a liveness probe).

To prevent any sort of excessive resource consumption, only dump stack once every 5 minutes, and always write to the same file. Also, let's make the check lock-free while we're at it.

Also, make sure we capture this file in bugtool.

[ backport of d85c093 ]
[ upstream commit 87f7a11 ]

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed squeed requested a review from a team as a code owner April 19, 2023 11:54
@squeed squeed added kind/backports This PR provides functionality previously merged into master. backport/1.11 This PR represents a backport for Cilium 1.11.x of a PR that was merged to main. labels Apr 19, 2023
squeed commented Apr 19, 2023

/test-backport-1.11

Job 'Cilium-PR-K8s-1.18-kernel-4.9' failed:

Test Name

K8sHubbleTest Hubble Observe Test L3/L4 Flow

Failure Output

FAIL: Found 2 k8s-app=cilium logs matching list of errors that must be investigated:

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.18-kernel-4.9/2698/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.18-kernel-4.9 so I can create one.

Then please upload the Jenkins artifacts to that issue.

squeed commented Apr 19, 2023

Travis failure is somewhat comical: we output too many log lines.

jrajahalme commented

/test-1.18-4.9

jrajahalme commented

known flake #24394 in test-1.18-4.9.

@jrajahalme jrajahalme added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 11, 2023
jrajahalme commented

/test-1.18-4.9

@jrajahalme jrajahalme removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 11, 2023
jrajahalme commented

/test-1.18-4.9

jrajahalme commented

known flakes only, merging.

@jrajahalme jrajahalme merged commit d3bf9d7 into cilium:v1.11 May 11, 2023
51 of 53 checks passed