
Fix Edge worker SIGTERM storm during API server outage #65830

Merged
dheerajturaga merged 1 commit into apache:main from
dheerajturaga:bugfix/edge3-sigterm-storm
Apr 25, 2026

dheerajturaga (Member) commented Apr 25, 2026

When an Edge worker drains while its API server is unreachable, the
supervisor subprocess's inherited asyncio signal wakeup fd caused
signals in the child to re-fire the parent's shutdown handler at
~9.5 kHz, flooding logs with tens of thousands of SIGTERM received
messages per second.
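
A minimal sketch of the leak described above, not code from this PR: the parent registers a wakeup fd with `signal.set_wakeup_fd`, forks, and the child raises a signal on itself. Because the child inherits both the handler and the wakeup fd, the signal number lands in the parent's socket. An asyncio loop in the parent would read that byte and dispatch its own handler for that signal. The function name and the use of SIGUSR1 are illustrative assumptions.

```python
import os
import signal
import socket


def child_signal_reaches_parent() -> bytes:
    """Return what the parent's wakeup socket receives when a forked child gets a signal."""
    reader, writer = socket.socketpair()
    reader.setblocking(False)
    writer.setblocking(False)  # set_wakeup_fd rejects blocking fds on Unix

    # A Python-level handler must be installed for the wakeup fd to be written.
    old_handler = signal.signal(signal.SIGUSR1, lambda signum, frame: None)
    old_fd = signal.set_wakeup_fd(writer.fileno())
    try:
        pid = os.fork()
        if pid == 0:
            # Child: handler and wakeup fd are inherited unchanged from the parent.
            try:
                os.kill(os.getpid(), signal.SIGUSR1)
            finally:
                os._exit(0)
        os.waitpid(pid, 0)
        try:
            return reader.recv(16)
        except BlockingIOError:
            return b""  # nothing was written by the child
    finally:
        # Restore the parent's previous signal state.
        signal.set_wakeup_fd(old_fd)
        signal.signal(signal.SIGUSR1, old_handler)
        reader.close()
        writer.close()
```

Calling `signal.set_wakeup_fd(-1)` in the child before the signal arrives would leave the parent's socket empty.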

Reset the inherited signal state (wakeup fd, SIGTERM/SIGINT/SIG_STATUS
handlers) in the forked supervisor child before running supervise(),
and make shutdown_handler idempotent so any residual re-fires are
no-ops. Also drop the racy os.setpgid(child_pid, 0) call from the
handler — it was a no-op when the child had already called setpgrp()
and raised EPERM once the child had exec'd.
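
The two changes above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code merged in this PR: `reset_inherited_signal_state` and `DrainState` are hypothetical names, resetting to `signal.SIG_DFL` is an assumed choice of disposition, and only SIGTERM and SIGINT are covered by default (further signals can be passed in).

```python
import signal


def reset_inherited_signal_state(signals=(signal.SIGTERM, signal.SIGINT)):
    """Detach a freshly forked child from the parent's signal plumbing.

    Must run in the child's main thread, before any supervising work starts.
    """
    # -1 disables the wakeup fd, so signals in the child no longer write
    # into the socket that the parent's event loop is reading from.
    signal.set_wakeup_fd(-1)
    for sig in signals:
        # Assumption: default dispositions are acceptable for the child.
        signal.signal(sig, signal.SIG_DFL)


class DrainState:
    """Hypothetical holder for the worker's shutdown flag."""

    def __init__(self):
        self.draining = False
        self.shutdown_requests = 0  # counts only the calls that did real work

    def shutdown_handler(self, signum, frame):
        # Idempotent: once draining has started, repeated signals are no-ops.
        if self.draining:
            return
        self.draining = True
        self.shutdown_requests += 1
```

With a guard like this, a residual re-fire costs one attribute check instead of another log line and another round of shutdown work.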


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    ClaudeCode Opus 4.7

@dheerajturaga dheerajturaga requested a review from jscheffl as a code owner April 25, 2026 04:20
@boring-cyborg boring-cyborg Bot added area:providers provider:edge Edge Executor / Worker (AIP-69) / edge3 labels Apr 25, 2026
jscheffl (Contributor) left a comment

Thanks for the improvement!

@dheerajturaga dheerajturaga merged commit b274abc into apache:main Apr 25, 2026
89 checks passed
@dheerajturaga dheerajturaga deleted the bugfix/edge3-sigterm-storm branch April 25, 2026 12:58
