tetragon: fix hang on error in tetragonExecute #1770

willfindlay · 2023-11-17T17:26:30Z

There has been a longstanding bug where, if Tetragon encounters an error inside tetragonExecute, the process hangs instead of exiting as expected. The goroutine stack trace dump provided by the runtime on SIGABRT shows the problem immediately: the main goroutine is stuck on a channel send inside observer.RemoveSensors(). Further investigation reveals that the send can never complete, because InitSensorManager() is still waiting for waitChan to be closed, which does not happen until we have loaded the base sensor.

To fix this issue, we simply need to move the deferred call to observer.RemoveSensors() so that it is registered only after we indicate that InitSensorManager() is cleared to run. This patch does exactly that. Since no BPF programs are loaded until the base sensor has been loaded anyway, this should be safe to do.
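The reordering can be sketched as follows. This is an illustration under assumptions, not the actual patch: manager, newManager, removeSensors, and execute are hypothetical names, and the timeout inside removeSensors exists only so the sketch can report a blocked send rather than hang.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// manager is a hypothetical stand-in for the sensor manager.
type manager struct {
	ops chan struct{}
}

// newManager starts a goroutine that serves requests only after ready is closed.
func newManager(ready <-chan struct{}) *manager {
	m := &manager{ops: make(chan struct{})}
	go func() {
		<-ready
		for range m.ops {
			// handle the request (no-op in this sketch)
		}
	}()
	return m
}

// removeSensors stands in for observer.RemoveSensors(). It returns false if
// the send would have blocked (the real call has no such timeout).
func (m *manager) removeSensors() bool {
	select {
	case m.ops <- struct{}{}:
		return true
	case <-time.After(100 * time.Millisecond):
		return false
	}
}

var errEarly = errors.New("early init error")

// execute mimics the shape of the fixed startup path: the cleanup is deferred
// only after the manager has been cleared to run.
func execute(failEarly bool) (cleaned bool, err error) {
	ready := make(chan struct{})
	m := newManager(ready)

	if failEarly {
		// Nothing has been loaded yet, so there is nothing to remove and no
		// deferred cleanup is registered. (The waiting goroutine is simply
		// abandoned here; in a real process the program would exit.)
		return false, errEarly
	}

	close(ready) // the manager may now serve requests
	defer func() { cleaned = m.removeSensors() }()

	// ... the rest of startup and the main loop would run here ...
	return cleaned, nil
}

func main() {
	fmt.Println(execute(true))
	fmt.Println(execute(false))
}
```

On the early-error path the function returns immediately without touching the manager. On the normal path the deferred cleanup runs against a manager that is already receiving, so the send completes.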

Fix an issue that caused Tetragon to hang when it encounters an error early on in its init phase.

Signed-off-by: William Findlay <will@isovalent.com>

willfindlay · 2023-11-17T17:29:11Z

Side note: William is now automatically suspicious of any defers he sees during future code reviews.

willfindlay added the release-blocker, release-note/bug, and needs-backport/1.0 labels Nov 17, 2023

willfindlay requested review from kkourt and jrfastab November 17, 2023 17:26

willfindlay requested a review from a team as a code owner November 17, 2023 17:26

tpapagian approved these changes Nov 17, 2023

kkourt approved these changes Nov 17, 2023

willfindlay removed the needs-backport/1.0 label Nov 17, 2023

jrfastab merged commit 87ccfc6 into main Nov 17, 2023
33 checks passed

jrfastab deleted the pr/willfindlay/fix-tetragon-hang-on-error branch November 17, 2023 18:55
