fix(native): Respect shutdown timeout in daemon#1691
Conversation
Pass the configured SDK shutdown timeout through the native crash context so the daemon uses the same limit as the SDK during transport shutdown. Add a sentry_example argument for tests that need a longer timeout under sanitizers while keeping most native tests on the production default. Co-Authored-By: OpenAI Codex <noreply@openai.com>
0c8a59c to
619a5df
Compare
| // envelopes (crash envelope + logs envelope, etc.). | ||
| int rv = sentry__transport_shutdown( |
There was a problem hiding this comment.
Bug: The default shutdown timeout for the crash daemon was reduced from 10s to 2s, which may cause crash reports to fail sending on slow networks.
Severity: HIGH
Suggested Fix
Revert the default shutdown timeout for the crash daemon to 10,000ms to ensure reliable delivery of crash envelopes in various network conditions. Alternatively, if the 2,000ms default is desired, this breaking change should be clearly documented for users, advising them to manually set sentry_options_set_shutdown_timeout if they require a longer timeout for reliability.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/backends/native/sentry_crash_daemon.c#L3431-L3432
Potential issue: The crash daemon's transport shutdown timeout has been reduced from a
hardcoded 10,000ms to the SDK's default of 2,000ms. This change is silent for users who
have not explicitly configured `sentry_options_set_shutdown_timeout`. The shorter
2-second window significantly increases the likelihood that crash envelopes will fail to
be delivered, particularly in production environments with slow or high-latency
networks. The previous 10-second timeout was intended for reliability beyond just
accommodating slower sanitizer builds, which are handled separately in test
infrastructure.
Did we get this right? 👍 / 👎 to inform future reviews.
There was a problem hiding this comment.
Yeah, it's a difficult balance. If I remember correctly, it was originally the same 2 seconds but was bumped up to 10 seconds specifically for TSan and ASan tests, so now we're only using it for a small subset of tests when running with sanitizers.
While we do generally want most envelopes to go through at crash time, it can feel a bit awkward if an app window is stuck on the screen for 10 seconds, which can easily happen with large attachments.
There was a problem hiding this comment.
Here's the commit where it was bumped up from 2 to 10 seconds: e7ef468. Unfortunately, logs for the preceding CI runs have expired, so we cannot see which exact tests failed with ASan or TSan back then.
Pass the configured SDK shutdown timeout through the native crash context so the daemon uses the same limit as the SDK during transport shutdown.
This is likely the first thing users will try to adjust when it comes to large attachment uploads on shutdown. While a full-screen game may want a relatively high timeout to allow as many crash reports as possible to get uploaded right away, in a desktop environment, it may feel less nice if a GUI app window is stuck in a crashed state for a long time...
As for tests, add a
shutdown-timeoutargument for those tests that most likely need a longer 10s timeout under sanitizers, while keeping most native tests on the production default (2s).