watchdog: add new action to capture backtraces#44620
Conversation
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
|
CC @envoyproxy/coverage-shephards: FYI only for changes made to |
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
|
/retest |
| // Async-signal-safe: reads a thread-local cached on each watched thread when | ||
| // it registered with the watchdog (see worker_impl.cc / server.cc), so this | ||
| // is just a TLS load by the time we reach the signal handler. | ||
| const int64_t mytid = Thread::getCurrentThreadId(); |
There was a problem hiding this comment.
Something to note here is that I'm still not 100% sure this is actually async signal safe even if it is guaranteed that the thread local TID has already been initialized. From what I have gathered, it seems like it's possible in some cases that a lock may be acquired when accessing TLS. Although, it seems rather unlikely here.
It might be possible to come up with some other scheme for claiming slots, but I don't have one in mind at the moment. Alternatively, we can just use pipes to communicate the backtrace a thread that isn't handling the signal since write is guaranteed to be async signal safe. Pipes have some caveats too, though.
|
Needs a main merge |
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
|
/coverage |
|
Coverage for this Pull Request will be rendered here: https://storage.googleapis.com/envoy-cncf-pr/44620/coverage/index.html For comparison, current coverage on https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html The coverage results are (re-)rendered each time the CI |
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
|
/retest |
| source/extensions/wasm_runtime/wamr: 0.0 # Not enabled in coverage build | ||
| source/extensions/wasm_runtime/wasmtime: 0.0 # Not enabled in coverage build | ||
| source/extensions/watchdog: 83.3 # Death tests within extensions | ||
| source/extensions/watchdog/backtrace_action: 91.0 |
There was a problem hiding this comment.
Can we add more unit tests to increase coverage above the limit?
There was a problem hiding this comment.
I thought this was addressed in our prior comment chain: #44620 (comment)
Commit Message: watchdog: add new action to capture backtraces
Additional Description: Adds
envoy.watchdog.backtrace_action, a new watchdog action that captures stack traces of stuck threads. When triggered, the action signals each offending thread via SIGUSR2, which captures the trace in-place in the signal handler, then logs it on the dispatcher thread. A configurable per-thread cooldown (default: 10s) prevents trace spam on persistent stalls.Risk Level: Low.
Testing: Added unit tests.
Docs Changes: Updated
watchdog.rst.Release Notes: Added.
Platform Specific Features: Uses POSIX signals to collect backtraces. Windows is not supported.