Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tetragon: fix hang on error in tetragonExecute #1770

Merged
merged 1 commit into from
Nov 17, 2023

Conversation

willfindlay
Copy link
Contributor

There has been a longstanding bug where if Tetragon encounters an error inside of tetragonExecute, the process will hang instead of exiting as expected. When looking at the goroutine stacktrace dump provided by the runtime on SIGABRT, we can immediately see the problem. The main thread is stuck on a channel send inside of observer.RemoveSensors(). Further investigation reveals that the channel is never opened because InitSensorManager() is waiting on the waitChan to be closed, which does not happen until we have loaded the base sensor.

To fix this issue, we simply need to move the defer call into observer.RemoveSensors() to after we indicate that InitSensorManager() is cleared to run. This patch does exactly that. Since we haven't loaded any BPF progs yet until the base sensor has been loaded anyway, this should be safe to do.

Fix an issue that caused Tetragon to hang when it encounters an error early on in its init phase.

There has been a longstanding bug where if Tetragon encounters an error inside of
tetragonExecute, the process will hang instead of exiting as expected. When looking at the
goroutine stacktrace dump provided by the runtime on SIGABRT, we can immediately see the
problem. The main thread is stuck on a channel send inside of observer.RemoveSensors().
Further investigation reveals that the channel is never opened because InitSensorManager()
is waiting on the waitChan to be closed, which does not happen until we have loaded the
base sensor.

To fix this issue, we simply need to move the defer call into observer.RemoveSensors() to
after we indicate that InitSensorManager() is cleared to run. This patch does exactly
that. Since we haven't loaded any BPF progs yet until the base sensor has been loaded
anyway, this should be safe to do.

Signed-off-by: William Findlay <will@isovalent.com>
@willfindlay willfindlay added release-blocker This PR or issue is blocking the next release. release-note/bug This PR fixes an issue in a previous release of Tetragon. needs-backport/1.0 This PR needs backporting to 1.0 labels Nov 17, 2023
@willfindlay willfindlay requested a review from a team as a code owner November 17, 2023 17:26
@willfindlay
Copy link
Contributor Author

Side note: William is now automatically suspicious of any defers he sees during future code reviews.

@willfindlay willfindlay removed the needs-backport/1.0 This PR needs backporting to 1.0 label Nov 17, 2023
@jrfastab jrfastab merged commit 87ccfc6 into main Nov 17, 2023
33 checks passed
@jrfastab jrfastab deleted the pr/willfindlay/fix-tetragon-hang-on-error branch November 17, 2023 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-blocker This PR or issue is blocking the next release. release-note/bug This PR fixes an issue in a previous release of Tetragon.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants