
watchdog: add new action to capture backtraces #44620

Open
jmsadair wants to merge 27 commits into envoyproxy:main from jmsadair:backtrace-watchdog-action

@jmsadair jmsadair commented Apr 23, 2026

Commit Message: watchdog: add new action to capture backtraces
Additional Description: Adds envoy.watchdog.backtrace_action, a new watchdog action that captures stack traces of stuck threads. When triggered, the action sends SIGUSR2 to each offending thread. The thread captures its stack trace in place inside the signal handler, and the trace is then logged on the dispatcher thread. A configurable per-thread cooldown (default: 10s) prevents trace spam on persistent stalls.
Risk Level: Low.
Testing: Added unit tests.
Docs Changes: Updated watchdog.rst.
Release Notes: Added.
Platform Specific Features: Uses POSIX signals to collect backtraces. Windows is not supported.
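
To illustrate the per-thread cooldown mentioned in the description, here is a minimal sketch of how such a check could be tracked. The `CooldownTracker` class and its `shouldCapture` method are hypothetical names chosen for illustration and are not taken from this PR. The time is passed in explicitly so the logic can be exercised without sleeping.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Hypothetical helper (not the PR's class): remembers when each thread was
// last traced and suppresses further traces until the cooldown has elapsed.
class CooldownTracker {
public:
  using Clock = std::chrono::steady_clock;

  explicit CooldownTracker(std::chrono::milliseconds cooldown) : cooldown_(cooldown) {
    assert(cooldown.count() >= 0);
  }

  // Returns true if a trace for `tid` should be captured at `now`, and
  // records `now` as the latest capture time for that thread.
  bool shouldCapture(int64_t tid, Clock::time_point now) {
    auto it = last_capture_.find(tid);
    if (it != last_capture_.end() && now - it->second < cooldown_) {
      return false; // Still cooling down: skip to avoid trace spam.
    }
    last_capture_[tid] = now;
    return true;
  }

private:
  const std::chrono::milliseconds cooldown_;
  std::unordered_map<int64_t, Clock::time_point> last_capture_;
};
```

With a 10s cooldown, a thread that stays stuck would be traced at most once every 10 seconds, while a different stuck thread is tracked independently.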

James Adair added 10 commits April 19, 2026 15:10
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #44620 was opened by jmsadair.


@jmsadair jmsadair marked this pull request as ready for review April 23, 2026 22:14
@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #44620 was ready_for_review by jmsadair.


James Adair added 5 commits April 23, 2026 22:28
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@repokitteh-read-only

CC @envoyproxy/coverage-shephards: FYI only for changes made to (test/coverage.yaml).
envoyproxy/coverage-shephards assignee is @RyanTheOptimist

🐱

Caused by: #44620 was synchronize by jmsadair.


James Adair added 5 commits April 24, 2026 03:29
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@jmsadair
Contributor Author

/retest

Comment on lines +53 to +56
// Async-signal-safe: reads a thread-local cached on each watched thread when
// it registered with the watchdog (see worker_impl.cc / server.cc), so this
// is just a TLS load by the time we reach the signal handler.
const int64_t mytid = Thread::getCurrentThreadId();
Contributor Author


One caveat: I'm still not 100% sure this is actually async-signal-safe, even if the thread-local TID is guaranteed to be initialized already. From what I have gathered, accessing TLS can acquire a lock in some cases, although that seems rather unlikely here.

It might be possible to come up with some other scheme for claiming slots, but I don't have one in mind at the moment. Alternatively, we could use pipes to communicate the backtrace to a thread that isn't handling the signal, since write is guaranteed to be async-signal-safe. Pipes have some caveats too, though.
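
For reference, the pipe alternative could look roughly like the sketch below. It is not code from this PR: `TraceRecord`, `captureFrames`, `installBacktracePipe`, and `readRecord` are hypothetical names, and `captureFrames` is only a placeholder (it relies on the GCC/Clang builtin `__builtin_return_address`) standing in for a real async-signal-safe unwinder. The handler only calls `write`, and a normal thread does the reading.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <signal.h>
#include <unistd.h>

// Hypothetical fixed-size record, kept under _POSIX_PIPE_BUF (512 bytes) so
// that a single write() to the pipe is atomic.
struct TraceRecord {
  int32_t depth;
  void* frames[16];
};
static_assert(sizeof(TraceRecord) <= 512, "record must fit in one atomic pipe write");

// Write end of the pipe; set before the handler is installed.
static volatile sig_atomic_t g_write_fd = -1;

// Placeholder for a real async-signal-safe stack walker (assumption).
static int captureFrames(void** out, int max) {
  if (max < 1) {
    return 0;
  }
  out[0] = __builtin_return_address(0);
  return 1;
}

extern "C" void onBacktraceSignal(int) {
  const int saved_errno = errno; // Preserve errno for the interrupted code.
  TraceRecord rec{};
  rec.depth = captureFrames(rec.frames, 16);
  // write() is on the POSIX async-signal-safe list. A failed or short write
  // simply drops the record in this sketch.
  const ssize_t ignored = write(g_write_fd, &rec, sizeof(rec));
  (void)ignored;
  errno = saved_errno;
}

// Creates the pipe and installs the handler. Returns the read end, or -1.
int installBacktracePipe(int signum) {
  assert(signum > 0);
  int fds[2];
  if (pipe(fds) != 0) {
    return -1;
  }
  g_write_fd = fds[1];
  struct sigaction sa{};
  sa.sa_handler = onBacktraceSignal;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  if (sigaction(signum, &sa, nullptr) != 0) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }
  return fds[0];
}

// Blocking read of one full record, meant for a normal (non-signal) thread.
bool readRecord(int read_fd, TraceRecord& out) {
  char* p = reinterpret_cast<char*>(&out);
  size_t got = 0;
  while (got < sizeof(out)) {
    const ssize_t n = read(read_fd, p + got, sizeof(out) - got);
    if (n > 0) {
      got += static_cast<size_t>(n);
    } else if (n < 0 && errno == EINTR) {
      continue; // Interrupted by a signal; retry.
    } else {
      return false; // EOF or a real error.
    }
  }
  return true;
}
```

One caveat this sketch leaves open: with a blocking write end, a full pipe would stall the handler, so a real version would likely need a non-blocking write end and a policy for dropped records.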

@KBaichoo KBaichoo self-assigned this Apr 27, 2026
@RyanTheOptimist
Contributor

Needs a main merge
/wait

Signed-off-by: James Adair <jadair@netflix.com>
Comment thread test/coverage.yaml Outdated
James Adair added 2 commits May 2, 2026 00:05
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@jmsadair
Contributor Author

jmsadair commented May 2, 2026

/coverage

@repokitteh-read-only

Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-cncf-pr/44620/coverage/index.html

For comparison, current coverage on main branch is here:

https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html

The coverage results are (re-)rendered each time the CI Envoy/Checks (coverage) job completes.

🐱

Caused by: a #44620 (comment) was created by @jmsadair.


Signed-off-by: James Adair <jadair@netflix.com>
@jmsadair jmsadair requested a review from RyanTheOptimist May 2, 2026 02:04
James Adair added 3 commits May 3, 2026 21:45
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@jmsadair
Contributor Author

jmsadair commented May 8, 2026

/retest

Comment thread test/coverage.yaml
source/extensions/wasm_runtime/wamr: 0.0 # Not enabled in coverage build
source/extensions/wasm_runtime/wasmtime: 0.0 # Not enabled in coverage build
source/extensions/watchdog: 83.3 # Death tests within extensions
source/extensions/watchdog/backtrace_action: 91.0
Contributor


Can we add more unit tests to increase coverage above the limit?

Contributor Author

@jmsadair jmsadair May 12, 2026


I thought this was addressed in our prior comment chain: #44620 (comment)


5 participants