Auto-evict stale queue_processes rows on worker startup#493
Merged
Conversation
Adds Queue.staleHeartbeatThreshold config (default 90s). QueueProcessesTable::add() now evicts a stale (pid, server) row whose modified is older than the threshold *before* attempting the insert. Fixes the crash loop seen on container restarts (FrankenPHP / Docker), where killed workers leave orphan rows and the new worker - getting the same recycled PID - fails the unique-index insert with a QueryException that the existing PersistenceFailedException catch in Processor::run() doesn't handle. The threshold is intentionally separate from defaultRequeueTimeout (default 10min): that one governs in-flight job requeueing and is deliberately long; workers heartbeat far more often, so a row not refreshed in 90s almost certainly belongs to a dead worker.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #493 +/- ##
============================================
+ Coverage 77.51% 77.58% +0.07%
- Complexity 971 972 +1
============================================
Files 45 45
Lines 3255 3266 +11
============================================
+ Hits 2523 2534 +11
Misses 732 732 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
3 tasks
…ware-pid-slot # Conflicts: # tests/TestCase/Model/Table/QueueProcessesTableTest.php
dereuromark
added a commit
that referenced
this pull request
May 8, 2026
…494) PID is not a stable worker identity. The OS recycles low PIDs across container restarts, and the unique index on (pid, server) collides whenever a containerized worker is replaced before its row is cleaned. workerkey is already a UUID-style hash generated per process - the correct identity, and already uniquely indexed on its own. Schema: - New migration drops the unique index on (pid, server) and re-adds it as a non-unique index for query performance. - Test fixture mirrors the change. Behavior: - update() and remove() now look up by workerkey. - Processor stores the workerkey alongside the pid and passes it on heartbeat / shutdown. - endProcess() picks the most recently heartbeated row when multiple rows share a PID. terminateProcess() already wipes by pid which remains correct - the OS guarantees only one live process per pid at any moment. Open questions for review: - Should worker end / kill accept --workerkey as an alternative to PID for precise targeting? - Can the eviction logic added in PR #493 be removed once this lands, or kept as belt-and-braces?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Queue.staleHeartbeatThreshold(default 90s).QueueProcessesTable::add()now evicts a stale(pid, server)row whosemodifiedis older than the threshold before attempting the insert.Why a separate threshold (not
defaultRequeueTimeout)defaultRequeueTimeout(default 10 min) is the threshold for treating an in-flight job as abandoned. Deliberately long, to avoid double-execution. Workers heartbeat every loop iteration — far more often. A row not refreshed in ~90s almost certainly belongs to a dead worker, but waiting 10 min before evicting it makes container redeploys painful.Caught vs. uncaught exception path
Worth highlighting: the unique
(pid, server)collision throwsCake\\Database\\Exception\\QueryExceptionat the DB level — notPersistenceFailedException. The existing catch inProcessor::run()only handles the latter, so without this PR the worker crashes hard. The new testtestAddDoesNotEvictRecentRowOnSamePidServerasserts that exact exception type so any future swap of the constraint mechanism is loud.Relationship to other PRs
--forceflag onqueue worker clean) is a complementary operator escape hatch.workerkeythe canonical identity and dropping the(pid, server)unique index entirely; that would make this eviction logic redundant.