Implement a watchdog-based recovery design for this workflow engine.
Goals:
- Keep the user experience simple: if a normal queue worker is running, recovery should work without any extra setup step.
- Make the watchdog slim and focused.
- Let ordinary queue retry/redelivery handle the cases it is already good at.
- Add high-signal tests that prove the final design works under real failure conditions.
High-level design:
- Split recovery into two paths.
- Path 1: if a workflow job was already queued and a worker dies while it is running, recovery should happen through normal queue redelivery. The workflow runtime should recognize that redelivery and resume cleanly instead of getting stuck in a re-release loop.
- Path 2: if a workflow start is persisted but the actual workflow job never makes it to the queue, a global watchdog should recover it. The watchdog should only handle these stale startup states and should not own recovery for actively running jobs.
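The split between the two paths can be sketched as a small decision function. This is only an illustration under stated assumptions: the `WorkflowRecord` shape, the status names ("pending", "running", "completed"), and the `stale_after` threshold are all hypothetical and are not taken from the repo.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryPath(Enum):
    NONE = "none"
    QUEUE_REDELIVERY = "queue_redelivery"  # Path 1
    WATCHDOG = "watchdog"                  # Path 2


@dataclass
class WorkflowRecord:
    # Hypothetical fields, for illustration only.
    status: str        # "pending" = start persisted, no job has picked it up yet
    updated_at: float  # seconds, last time the record was touched


def recovery_path(record: WorkflowRecord, now: float,
                  stale_after: float = 30.0) -> RecoveryPath:
    """Decide which mechanism, if any, owns recovery for this record."""
    if record.status == "completed":
        return RecoveryPath.NONE
    if record.status == "running":
        # A job was queued and picked up: the queue redelivers it if the
        # worker dies, so the watchdog must leave it alone.
        return RecoveryPath.QUEUE_REDELIVERY
    if now - record.updated_at >= stale_after:
        # Persisted but never progressed: a stale startup state.
        return RecoveryPath.WATCHDOG
    return RecoveryPath.NONE  # still fresh, give normal dispatch a chance
```

The point of the sketch is that the watchdog branch is reachable only from the stale startup state, never from a running job.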
Watchdog requirements:
- The watchdog should be a single global delayed background job, not a fleet of jobs.
- It should re-arm itself using queue-native behavior, so it does not depend on dispatching a fresh copy of itself after every run.
- It should only scan for stale workflow records that represent “started in durable storage but no active workflow job is making progress yet”.
- When it finds one, it should safely clear the uniqueness barrier and resume the workflow.
- It should tolerate races where another actor already recovered or completed the workflow.
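The watchdog requirements above can be sketched against an in-memory stand-in for durable storage. Everything here is an assumption made for illustration: the `Store` class, its `claim_stale` method, the `enqueue` and `release` callbacks, and the default intervals are hypothetical names, not the engine's real API. A real implementation would do the claim as an atomic conditional update in the database rather than under a Python lock.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Store:
    # Hypothetical in-memory stand-in for the durable workflow table.
    records: Dict[str, dict] = field(default_factory=dict)
    unique_locks: Set[str] = field(default_factory=set)  # uniqueness barrier
    _mutex: threading.Lock = field(default_factory=threading.Lock)

    def claim_stale(self, wf_id: str, now: float, stale_after: float) -> bool:
        """Atomically claim a stale pending workflow; False if we lost a race."""
        with self._mutex:
            rec = self.records.get(wf_id)
            if rec is None or rec["status"] != "pending":
                return False  # someone else already recovered or completed it
            if now - rec["updated_at"] < stale_after:
                return False  # not stale yet
            rec["updated_at"] = now             # touch so a concurrent scan skips it
            self.unique_locks.discard(wf_id)    # clear the uniqueness barrier
            return True


def handle_watchdog_job(store: Store,
                        enqueue: Callable[[str], None],
                        release: Callable[[float], None],
                        now: float,
                        stale_after: float = 30.0,
                        interval: float = 60.0) -> List[str]:
    """One watchdog run: recover stale starts, then re-arm the same job."""
    recovered = []
    for wf_id in list(store.records):
        if store.claim_stale(wf_id, now, stale_after):
            enqueue(wf_id)  # resume the workflow through the normal queue
            recovered.append(wf_id)
    release(interval)  # put this job back with a delay instead of spawning a copy
    return recovered
```

Because the claim checks status and staleness in one step, a second scan, or a workflow that finished in the meantime, simply falls through without enqueueing anything.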
Bootstrap and liveness requirements:
- A workflow start should still try to kick the watchdog as a fast path, but recovery must not rely on that kick alone.
- Normal queue workers should also act as the independent liveness source by periodically ensuring the watchdog is present while they are looping.
- This must close the bootstrap gap where the original dispatch-time watchdog kick could be lost in a producer crash.
- Do this without requiring users to run a second special process.
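The liveness requirement above can be sketched as an idempotent "ensure" call made from the worker loop. This is a toy model, not the repo's queue: `FakeQueue`, `push_unique`, `WATCHDOG_KEY`, `ensure_watchdog`, and the `Worker.tick` hook are hypothetical names, and the 60-second cadence is an arbitrary assumption.

```python
from typing import Callable, List, Optional, Set

WATCHDOG_KEY = "workflow-watchdog"  # hypothetical unique key for the single global job


class FakeQueue:
    """Toy queue whose unique keys make repeated pushes a no-op."""

    def __init__(self) -> None:
        self.jobs: List[dict] = []
        self.unique_keys: Set[str] = set()

    def push_unique(self, key: str, payload: dict) -> bool:
        if key in self.unique_keys:
            return False  # a job with this key is already present
        self.unique_keys.add(key)
        self.jobs.append(payload)
        return True


def ensure_watchdog(queue: FakeQueue) -> bool:
    """Idempotent: safe to call from dispatch and from every worker."""
    return queue.push_unique(WATCHDOG_KEY, {"job": "watchdog"})


class Worker:
    def __init__(self, queue: FakeQueue, clock: Callable[[], float],
                 ensure_every: float = 60.0) -> None:
        self.queue = queue
        self.clock = clock
        self.ensure_every = ensure_every
        self._last_ensure: Optional[float] = None

    def tick(self) -> None:
        """Called once per loop iteration, before polling for the next job."""
        now = self.clock()
        if self._last_ensure is None or now - self._last_ensure >= self.ensure_every:
            ensure_watchdog(self.queue)
            self._last_ensure = now
```

Since the check runs inside the ordinary worker loop, a lost dispatch-time kick is repaired by whichever worker loops next, with no second process to run.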
Scope and constraints:
- Reuse existing repo patterns, queue behavior, and test infrastructure.
- Do not add a manual recovery command.
- Do not require a scheduler as the primary solution.
- Keep the implementation small and operationally clear.
- Preserve the repo’s current transactional/after-commit safety model.
Proof requirements:
Add focused executable tests that prove:
- a producer crash can strand a workflow before its real job is queued
- the watchdog can recover that stranded workflow
- a normal queue worker can bootstrap the watchdog even if the original dispatch-time kick is lost
- a running workflow self-heals through queue redelivery after worker loss
- the watchdog does not need its own watchdog once it is alive
- previously completed work is not duplicated during recovery
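As a rough illustration of the last two redelivery-related proofs, the sketch below models a workflow that replays from a persisted history so that a redelivered job skips steps that already finished. It is a toy, not the engine's runtime: `run_workflow`, `WorkerLost`, the `history` dict, and the `die_before` fault-injection parameter are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


class WorkerLost(Exception):
    """Simulates a worker dying in the middle of a workflow run."""


def run_workflow(steps: List[Tuple[str, Callable[[], int]]],
                 history: Dict[str, int],
                 executed: List[str],
                 die_before: Optional[str] = None) -> Dict[str, int]:
    """Run steps in order, skipping any whose result is already persisted."""
    for name, fn in steps:
        if name in history:
            continue  # completed on an earlier attempt: replay, do not re-run
        if name == die_before:
            raise WorkerLost(name)  # injected failure before this step runs
        history[name] = fn()   # persist the result before moving on
        executed.append(name)  # records real executions, for the test to inspect
    return history
```

A test built on this shape would run once with a failure injected, run again as the redelivered attempt, and assert that each step appears in `executed` exactly once.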
Deliverables:
- the implementation
- tests proving the behavior