
Sweep stale queue_processes rows on worker startup #480

Merged
dereuromark merged 2 commits into master from bugfix/sweep-stale-processes-on-worker-startup
Apr 30, 2026

Conversation

@dereuromark
Owner

Summary

Workers that die without a graceful shutdown (SIGKILL, OOM, container restart, anything that bypasses the SIGTERM handler) leave their queue_processes row behind with terminate=0. Without intervention these ghost rows accumulate and count toward Queue.maxworkers via QueueProcessesTable::validateCount(), which can refuse to register new workers ("Too many workers running") even though the actual processes are long dead.

cleanEndedProcesses() already exists and removes any row whose `modified` is older than Queue.defaultRequeueTimeout, but it is only invoked from:

  • the worker's end-of-loop cleanup, gated by gcprob (default 10%, so ~10 minutes between actual sweeps with one-minute worker turnover);
  • a fallback inside the PersistenceFailedException catch block, which only runs after a worker has already failed to start.
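The ~10-minute figure follows from the gate being geometric: with a 10% fire probability per cycle, the expected number of cycles between actual sweeps is 1/0.1 = 10. A quick standalone simulation (plain PHP, illustrative only, not plugin code):

```php
<?php
// Illustrative only (not plugin code): with a gcprob gate of 10%,
// the number of worker cycles until a sweep actually runs is
// geometrically distributed with mean 1/p = 10 cycles, i.e. about
// ten minutes at one-minute worker turnover.
$p = 0.10;
$trials = 50000;
mt_srand(1); // fixed seed for reproducibility
$total = 0;
for ($i = 0; $i < $trials; $i++) {
    $cycles = 1;
    // Each cycle, the sweep fires with probability $p.
    while (mt_rand(1, 100) > (int) round(100 * $p)) {
        $cycles++;
    }
    $total += $cycles;
}
$mean = $total / $trials;
printf("mean cycles between sweeps: %.2f\n", $mean); // close to 10
```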

Neither path runs before initPid(), so the maxworkers check sees stale rows. This PR moves an unconditional cleanEndedProcesses() call to the top of Processor::run() so every fresh worker sweeps dead siblings before attempting to register.
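A minimal sketch of that change (the surrounding code and exact property names are assumptions based on this description, not copied from the plugin source):

```php
// Sketch only; the method body and property names are assumed from the
// PR description, not taken from the plugin.
public function run(): void
{
    // Unconditional sweep BEFORE registering: remove queue_processes
    // rows whose `modified` is older than Queue.defaultRequeueTimeout,
    // so dead siblings no longer count toward Queue.maxworkers.
    $this->QueueProcesses->cleanEndedProcesses();

    // validateCount() runs during registration; it now only sees live rows.
    $this->initPid();

    // ... main work loop ...
}
```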

Reproduction

  1. Queue.maxworkers = 2, Queue.defaultRequeueTimeout = 60
  2. Start a worker that picks up a long-running job
  3. SIGKILL the worker mid-job (or let the container OOM-kill it)
  4. The queue_processes row stays with terminate = 0 and modified frozen at the last cycle
  5. After two such kills the third worker fails to start: "Too many workers running (2/2)"

After this change, the third worker first sweeps the stale rows and proceeds normally.
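The reproduction can be modelled in a few lines of plain PHP (a toy model of the rows and checks described above, not plugin code):

```php
<?php
// Toy model of the reproduction above; plain arrays stand in for
// queue_processes rows. Not plugin code.
$maxworkers = 2;      // Queue.maxworkers
$requeueTimeout = 60; // Queue.defaultRequeueTimeout (seconds assumed)
$now = 1000;          // arbitrary "current" timestamp

// Two workers were SIGKILLed mid-job: terminate stayed 0 and
// `modified` froze at the last completed cycle.
$processes = [
    ['pid' => 101, 'terminate' => 0, 'modified' => $now - 300],
    ['pid' => 102, 'terminate' => 0, 'modified' => $now - 200],
];

// validateCount(): registration is refused once maxworkers rows exist,
// whether or not the PIDs behind them are alive.
$canRegister = fn(array $rows): bool => count($rows) < $maxworkers;
assert($canRegister($processes) === false); // "Too many workers running (2/2)"

// The fix: sweep rows older than the timeout before registering.
$cleanEndedProcesses = fn(array $rows): array => array_values(array_filter(
    $rows,
    fn(array $r): bool => $r['modified'] >= $now - $requeueTimeout,
));
$processes = $cleanEndedProcesses($processes);

assert($canRegister($processes) === true);
echo 'third worker registers; stale rows left: ' . count($processes) . "\n";
```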

Tests

Added to QueueProcessesTableTest:

  • testCleanEndedProcessesRemovesStaleRowsOnly — covers the existing method's threshold logic (no behaviour change here, just coverage that was missing).
  • testStaleRowsCountTowardMaxWorkersUntilCleaned — regression for the scenario above.

Both pass; existing tests in the file unchanged.

Risk

Low. cleanEndedProcesses() is a single DELETE WHERE modified < threshold, runs in milliseconds, and the threshold is the same value already used by the existing call sites. The only behavioural change is when it runs — once per worker startup vs. occasionally at worker shutdown.
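In CakePHP ORM terms, the sweep the risk note describes boils down to roughly the following (a sketch; the plugin's actual method body, config key, and timeout units may differ):

```php
// Sketch of what cleanEndedProcesses() amounts to per the risk note:
// one bulk DELETE on queue_processes. The Configure key and cutoff
// arithmetic here are assumptions taken from this PR's text.
$timeout = (int)Configure::read('Queue.defaultRequeueTimeout');
$threshold = (new \DateTime())->modify("-{$timeout} seconds");

// Table::deleteAll() issues a single DELETE ... WHERE modified < :threshold
$this->deleteAll(['modified <' => $threshold]);
```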

@codecov-commenter

codecov-commenter commented Apr 30, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.27%. Comparing base (0e8126f) to head (bbf578a).
⚠️ Report is 1 commit behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #480      +/-   ##
============================================
+ Coverage     77.25%   77.27%   +0.02%     
  Complexity      955      955              
============================================
  Files            45       45              
  Lines          3214     3213       -1     
============================================
  Hits           2483     2483              
+ Misses          731      730       -1     


The previous commit moved cleanEndedProcesses() to run unconditionally
at every worker startup. That makes the two existing call sites
redundant:

1. The shutdown gate (`$this->exit || gcprob`) — only the queued_jobs
   `cleanOldJobs()` call there is still useful; the next worker's
   startup sweep already handles process cleanup within ~1 minute.

2. The retry-sweep inside the PersistenceFailedException catch — the
   pre-`initPid` sweep already ran. If initPid still fails, the limit
   is genuinely reached and re-sweeping would not change that.

Removing both keeps the cleanup discipline clear (one canonical sweep
point per worker lifecycle) and saves a couple of needless DELETE
queries per worker. No behaviour change for the happy path.
dereuromark merged commit 72e35b0 into master Apr 30, 2026
16 checks passed
dereuromark deleted the bugfix/sweep-stale-processes-on-worker-startup branch on April 30, 2026 at 11:23
