Sweep stale queue_processes rows on worker startup #480
Merged
dereuromark merged 2 commits into master on Apr 30, 2026
Workers that die without graceful shutdown (SIGKILL, OOM, container
restart, anything that bypasses the SIGTERM handler) leave their
queue_processes row behind with terminate=0. Without intervention these
ghost rows accumulate and count toward Queue.maxworkers via
QueueProcessesTable::validateCount(), which can refuse to register new
workers ("Too many workers running") even though the actual processes
are long dead.
cleanEndedProcesses() already exists and removes any row whose `modified`
is older than Queue.defaultRequeueTimeout. Today it is only invoked from:
- the worker's end-of-loop cleanup, gated by gcprob (default 10%, so
  ~10 minutes between actual sweeps with one-minute worker turnover);
- a fallback inside the PersistenceFailedException catch block, which
  only runs after a worker has already failed to start.
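The "~10 minutes" figure can be checked with a short calculation. This is a minimal sketch, assuming each worker exit triggers the sweep independently with probability gcprob and ignoring any other condition on the gate; the function names are hypothetical and not part of the plugin.

```python
def expected_minutes_between_sweeps(gc_probability, turnover_minutes):
    """Mean gap between sweeps when each worker exit sweeps with the given probability."""
    if not 0.0 < gc_probability <= 1.0:
        raise ValueError("gc_probability must be in (0, 1]")
    # Geometric distribution: on average 1/p exits are needed for one sweep.
    return turnover_minutes / gc_probability


def chance_of_no_sweep(gc_probability, worker_exits):
    """Probability that none of the given number of exits triggers a sweep."""
    return (1.0 - gc_probability) ** worker_exits


print(expected_minutes_between_sweeps(0.10, 1.0))  # 10.0
print(round(chance_of_no_sweep(0.10, 10), 3))      # 0.349
```

Under these assumptions the mean gap is 10 minutes, but roughly one time in three no sweep happens within 10 worker exits, so ghost rows can linger noticeably longer than the average suggests.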
Neither path runs before initPid(), so the maxworkers check sees stale
rows. Move an unconditional cleanEndedProcesses() call to the top of
Processor::run() so every fresh worker sweeps dead siblings before
attempting to register.
Adds two tests:
- testCleanEndedProcessesRemovesStaleRowsOnly: confirms the existing
  method's threshold logic (no behaviour change here, just coverage
  that was missing).
- testStaleRowsCountTowardMaxWorkersUntilCleaned: regression for the
  scenario described above.
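The ordering problem can be modelled outside the plugin. The sketch below is a simplified in-memory stand-in, not the plugin's PHP code: `ProcessRegistry`, `TooManyWorkers`, and `start_worker` are hypothetical names, timestamps are plain seconds, and the limit check is assumed to reject registration once the row count reaches the maximum.

```python
from dataclasses import dataclass, field


class TooManyWorkers(Exception):
    """Stand-in for the "Too many workers running" failure."""


@dataclass
class ProcessRegistry:
    max_workers: int
    requeue_timeout: float  # seconds a row may go unmodified before it counts as stale
    rows: dict = field(default_factory=dict)  # pid -> last "modified" timestamp

    def clean_ended_processes(self, now):
        """Remove rows whose last update is older than the timeout."""
        threshold = now - self.requeue_timeout
        stale = [pid for pid, modified in self.rows.items() if modified < threshold]
        for pid in stale:
            del self.rows[pid]
        return len(stale)

    def register(self, pid, now):
        """Add a row for a new worker, refusing if the limit is already reached."""
        if len(self.rows) >= self.max_workers:
            raise TooManyWorkers(
                f"Too many workers running ({len(self.rows)}/{self.max_workers})"
            )
        self.rows[pid] = now


def start_worker(registry, pid, now, sweep_first=True):
    """Startup sequence: optionally sweep dead siblings, then register."""
    if sweep_first:
        registry.clean_ended_processes(now)
    registry.register(pid, now)


registry = ProcessRegistry(max_workers=2, requeue_timeout=60)
registry.register("ghost-a", now=0)  # both workers die later without cleanup
registry.register("ghost-b", now=0)

try:
    start_worker(registry, "fresh", now=300, sweep_first=False)
except TooManyWorkers as exc:
    print(exc)  # Too many workers running (2/2)

start_worker(registry, "fresh", now=300)  # sweep runs before the limit check
print(sorted(registry.rows))  # ['fresh']
```

Without the sweep the third worker is rejected by rows that belong to dead processes; with the sweep placed before registration it starts normally.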
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             master     #480    +/-   ##
==========================================
+ Coverage     77.25%   77.27%   +0.02%
  Complexity      955      955
==========================================
  Files            45       45
  Lines          3214     3213      -1
==========================================
  Hits           2483     2483
+ Misses          731      730      -1
The previous commit moved cleanEndedProcesses() to run unconditionally at every worker startup. That makes the two existing call sites redundant:
1. The shutdown gate (`$this->exit || gcprob`): only the queued_jobs `cleanOldJobs()` call there is still useful; the next worker's startup sweep already handles process cleanup within ~1 minute.
2. The retry-sweep inside the PersistenceFailedException catch: the pre-`initPid` sweep already ran. If initPid still fails, the limit is genuinely reached and re-sweeping would not change that.
Removing both keeps the cleanup discipline clear (one canonical sweep point per worker lifecycle) and saves a couple of needless DELETE queries per worker. No behaviour change for the happy path.
Summary
Workers that die without graceful shutdown (SIGKILL, OOM, container restart, anything that bypasses the SIGTERM handler) leave their `queue_processes` row behind with `terminate=0`. Without intervention these ghost rows accumulate and count toward `Queue.maxworkers` via `QueueProcessesTable::validateCount()`, which can refuse to register new workers (`Too many workers running`) even though the actual processes are long dead.
`cleanEndedProcesses()` already exists and removes any row whose `modified` is older than `Queue.defaultRequeueTimeout`, but it is only invoked from:
- the worker's end-of-loop cleanup, gated by `gcprob` (default 10%, so ~10 minutes between actual sweeps with one-minute worker turnover);
- a fallback inside the `PersistenceFailedException` catch block, which only runs after a worker has already failed to start.
Neither path runs before `initPid()`, so the `maxworkers` check sees stale rows. This PR moves an unconditional `cleanEndedProcesses()` call to the top of `Processor::run()` so every fresh worker sweeps dead siblings before attempting to register.
Reproduction
- `Queue.maxworkers = 2`, `Queue.defaultRequeueTimeout = 60`
- `queue_processes` row stays with `terminate = 0` and `modified` frozen at the last cycle
- `Too many workers running (2/2)`
After this change, the third worker first sweeps the stale rows and proceeds normally.
Tests
Added to `QueueProcessesTableTest`:
- `testCleanEndedProcessesRemovesStaleRowsOnly`: covers the existing method's threshold logic (no behaviour change here, just coverage that was missing).
- `testStaleRowsCountTowardMaxWorkersUntilCleaned`: regression for the scenario above.
Both pass; existing tests in the file unchanged.
Risk
Low. `cleanEndedProcesses()` is a single `DELETE WHERE modified < threshold`, runs in milliseconds, and the threshold is the same value already used by the existing call sites. The only behavioural change is when it runs: once per worker startup vs. occasionally at worker shutdown.
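To make the shape of that single statement concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout is an assumption for illustration only (integer timestamps, three columns); the plugin's real schema and its ORM call are not reproduced here, and `clean_ended_processes` is a hypothetical stand-in.

```python
import sqlite3


def clean_ended_processes(conn, now, timeout):
    """Delete rows not updated within `timeout` seconds; return how many were removed."""
    threshold = now - timeout
    cursor = conn.execute(
        "DELETE FROM queue_processes WHERE modified < ?", (threshold,)
    )
    conn.commit()
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
# Assumed, simplified schema: `modified` holds seconds instead of a datetime.
conn.execute(
    "CREATE TABLE queue_processes ("
    "pid TEXT PRIMARY KEY, "
    "terminate INTEGER NOT NULL DEFAULT 0, "
    "modified INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO queue_processes (pid, modified) VALUES (?, ?)",
    [("ghost-1", 0), ("ghost-2", 10), ("live", 290)],
)

removed = clean_ended_processes(conn, now=300, timeout=60)
remaining = [row[0] for row in conn.execute("SELECT pid FROM queue_processes ORDER BY pid")]
print(removed, remaining)  # 2 ['live']
```

A worker that updated its row within the timeout window is left alone, so running the sweep on every startup affects only rows that would have been removed by the older call sites anyway.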