
Sweep stale queue_processes rows on worker startup #480

Merged
dereuromark merged 2 commits into master from bugfix/sweep-stale-processes-on-worker-startup
Apr 30, 2026

Conversation

@dereuromark
Owner

Summary

Workers that die without a graceful shutdown (SIGKILL, OOM, container restart, anything that bypasses the SIGTERM handler) leave their queue_processes row behind with terminate=0. Without intervention these ghost rows accumulate and count toward Queue.maxworkers via QueueProcessesTable::validateCount(), which can refuse to register new workers ("Too many workers running") even though the actual processes are long dead.

cleanEndedProcesses() already exists and removes any row whose `modified` is older than Queue.defaultRequeueTimeout, but it is only invoked from:

  • the worker's end-of-loop cleanup, gated by gcprob (default 10%, so ~10 minutes between actual sweeps with one-minute worker turnover);
  • a fallback inside the PersistenceFailedException catch block, which only runs after a worker has already failed to start.
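The ~10-minute figure follows from the gate being geometric: with a 10% fire probability per cycle, the expected number of cycles between actual sweeps is 1/0.1 = 10. A quick standalone simulation (plain PHP, illustrative only, not plugin code):

```php
<?php
// Illustrative only (not plugin code): with a gcprob gate of 10%,
// the number of worker cycles until a sweep actually runs is
// geometrically distributed with mean 1/p = 10 cycles, i.e. about
// ten minutes at one-minute worker turnover.
$p = 0.10;
$trials = 50000;
mt_srand(1); // fixed seed for reproducibility
$total = 0;
for ($i = 0; $i < $trials; $i++) {
    $cycles = 1;
    // Each cycle, the sweep fires with probability $p.
    while (mt_rand(1, 100) > (int) round(100 * $p)) {
        $cycles++;
    }
    $total += $cycles;
}
$mean = $total / $trials;
printf("mean cycles between sweeps: %.2f\n", $mean); // close to 10
```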

Neither path runs before initPid(), so the maxworkers check sees stale rows. This PR moves an unconditional cleanEndedProcesses() call to the top of Processor::run() so every fresh worker sweeps dead siblings before attempting to register.
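A minimal sketch of that change (the surrounding code and exact property names are assumptions based on this description, not copied from the plugin source):

```php
// Sketch only; the method body and property names are assumed from the
// PR description, not taken from the plugin.
public function run(): void
{
    // Unconditional sweep BEFORE registering: remove queue_processes
    // rows whose `modified` is older than Queue.defaultRequeueTimeout,
    // so dead siblings no longer count toward Queue.maxworkers.
    $this->QueueProcesses->cleanEndedProcesses();

    // validateCount() runs during registration; it now only sees live rows.
    $this->initPid();

    // ... main work loop ...
}
```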

Reproduction

  1. Queue.maxworkers = 2, Queue.defaultRequeueTimeout = 60
  2. Start a worker that picks up a long-running job
  3. SIGKILL the worker mid-job (or let the container OOM-kill it)
  4. The queue_processes row stays with terminate = 0 and modified frozen at the last cycle
  5. After two such kills the third worker fails to start: "Too many workers running (2/2)"

After this change, the third worker first sweeps the stale rows and proceeds normally.
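The reproduction can be modelled in a few lines of plain PHP (a toy model of the rows and checks described above, not plugin code):

```php
<?php
// Toy model of the reproduction above; plain arrays stand in for
// queue_processes rows. Not plugin code.
$maxworkers = 2;      // Queue.maxworkers
$requeueTimeout = 60; // Queue.defaultRequeueTimeout (seconds assumed)
$now = 1000;          // arbitrary "current" timestamp

// Two workers were SIGKILLed mid-job: terminate stayed 0 and
// `modified` froze at the last completed cycle.
$processes = [
    ['pid' => 101, 'terminate' => 0, 'modified' => $now - 300],
    ['pid' => 102, 'terminate' => 0, 'modified' => $now - 200],
];

// validateCount(): registration is refused once maxworkers rows exist,
// whether or not the PIDs behind them are alive.
$canRegister = fn(array $rows): bool => count($rows) < $maxworkers;
assert($canRegister($processes) === false); // "Too many workers running (2/2)"

// The fix: sweep rows older than the timeout before registering.
$cleanEndedProcesses = fn(array $rows): array => array_values(array_filter(
    $rows,
    fn(array $r): bool => $r['modified'] >= $now - $requeueTimeout,
));
$processes = $cleanEndedProcesses($processes);

assert($canRegister($processes) === true);
echo 'third worker registers; stale rows left: ' . count($processes) . "\n";
```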

Tests

Added to QueueProcessesTableTest:

  • testCleanEndedProcessesRemovesStaleRowsOnly — covers the existing method's threshold logic (no behaviour change here, just coverage that was missing).
  • testStaleRowsCountTowardMaxWorkersUntilCleaned — regression for the scenario above.

Both pass; existing tests in the file unchanged.

Risk

Low. cleanEndedProcesses() is a single DELETE WHERE modified < threshold, runs in milliseconds, and the threshold is the same value already used by the existing call sites. The only behavioural change is when it runs — once per worker startup vs. occasionally at worker shutdown.
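In CakePHP ORM terms, the sweep the risk note describes boils down to roughly the following (a sketch; the plugin's actual method body, config key, and timeout units may differ):

```php
// Sketch of what cleanEndedProcesses() amounts to per the risk note:
// one bulk DELETE on queue_processes. The Configure key and cutoff
// arithmetic here are assumptions taken from this PR's text.
$timeout = (int)Configure::read('Queue.defaultRequeueTimeout');
$threshold = (new \DateTime())->modify("-{$timeout} seconds");

// Table::deleteAll() issues a single DELETE ... WHERE modified < :threshold
$this->deleteAll(['modified <' => $threshold]);
```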

@codecov-commenter

codecov-commenter commented Apr 30, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.27%. Comparing base (0e8126f) to head (bbf578a).
⚠️ Report is 1 commit behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #480      +/-   ##
============================================
+ Coverage     77.25%   77.27%   +0.02%     
  Complexity      955      955              
============================================
  Files            45       45              
  Lines          3214     3213       -1     
============================================
  Hits           2483     2483              
+ Misses          731      730       -1     


The previous commit moved cleanEndedProcesses() to run unconditionally
at every worker startup. That makes the two existing call sites
redundant:

1. The shutdown gate (`$this->exit || gcprob`) — only the queued_jobs
   `cleanOldJobs()` call there is still useful; the next worker's
   startup sweep already handles process cleanup within ~1 minute.

2. The retry-sweep inside the PersistenceFailedException catch — the
   pre-`initPid` sweep already ran. If initPid still fails, the limit
   is genuinely reached and re-sweeping would not change that.

Removing both keeps the cleanup discipline clear (one canonical sweep
point per worker lifecycle) and saves a couple of needless DELETE
queries per worker. No behaviour change for the happy path.
dereuromark merged commit 72e35b0 into master Apr 30, 2026
16 checks passed
dereuromark deleted the bugfix/sweep-stale-processes-on-worker-startup branch on April 30, 2026 at 11:23
