Fix Edge worker SIGTERM storm during API server outage #65830
Merged
dheerajturaga merged 1 commit into apache:main on Apr 25, 2026
When an Edge worker drains while its API server is unreachable, the forked supervisor subprocess inherits the parent's asyncio signal wakeup fd. Signals delivered to the child therefore re-fired the parent's shutdown handler at ~9.5 kHz, flooding the logs with tens of thousands of "SIGTERM received" messages per second.

This change resets the inherited signal state (the wakeup fd and the SIGTERM/SIGINT/SIG_STATUS handlers) in the forked supervisor child before running supervise(), and makes shutdown_handler idempotent so that any residual re-fires are no-ops.

It also drops the racy os.setpgid(child_pid, 0) call from the handler: the call was a no-op when the child had already called setpgrp(), and it raised EPERM once the child had exec'd.
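For readers unfamiliar with the mechanism, the two ideas in the description (clearing inherited signal state in a forked child, and an idempotent shutdown handler) can be sketched roughly as follows. This is an illustration only, not the code from this PR: `reset_inherited_signal_state` and `ShutdownHandler` are hypothetical names, restoring `SIG_DFL` is an assumption about what "reset" means here, and the project-specific `SIG_STATUS` signal is left out.

```python
import signal


def reset_inherited_signal_state(signals=(signal.SIGTERM, signal.SIGINT)):
    """Hypothetical helper: call first thing in a forked child (main thread only)."""
    # A forked child inherits the wakeup fd that the parent's event loop
    # registered. Clearing it stops signals received by the child from being
    # written to that shared fd and waking the parent's loop.
    signal.set_wakeup_fd(-1)
    # Drop the Python-level handlers copied from the parent. Restoring the
    # default action is an assumption made for this sketch.
    for sig in signals:
        signal.signal(sig, signal.SIG_DFL)


class ShutdownHandler:
    """Hypothetical idempotent handler: only the first call has any effect."""

    def __init__(self, on_shutdown):
        self._on_shutdown = on_shutdown
        self._requested = False

    def __call__(self, signum=None, frame=None):
        if self._requested:
            return  # residual re-fires become no-ops
        self._requested = True
        self._on_shutdown()
```

In a real worker the reset would run in the child immediately after `os.fork()` and before any supervising logic, while the parent would register a single `ShutdownHandler` instance for its termination signals.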
eladkal approved these changes on Apr 25, 2026
jscheffl approved these changes on Apr 25, 2026
Contributor
jscheffl
left a comment
Thanks for the improvement!
Was generative AI tooling used to co-author this PR?
ClaudeCode Opus 4.7