Watchdog #369

@rmcdaniel

Implement a watchdog-based recovery design for this workflow engine.

Goals:

  • Keep the user experience simple: if a normal queue worker is running, recovery should work without any extra setup step.
  • Make the watchdog slim and focused.
  • Let ordinary queue retry/redelivery handle the cases it is already good at.
  • Add high-signal tests that prove the final design works under real failure conditions.

High-level design:

  • Split recovery into two paths.
  • Path 1: if a workflow job was already queued and a worker dies while it is running, recovery should happen through normal queue redelivery. The workflow runtime should recognize that redelivery and resume cleanly instead of getting stuck in a re-release loop.
  • Path 2: if a workflow start is persisted but the actual workflow job never makes it to the queue, a global watchdog should recover it. The watchdog should only handle these stale startup states and should not own recovery for actively running jobs.
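
As a rough illustration of Path 1, the sketch below shows one way a job handler could tell a redelivery apart from a genuine conflict. Everything here is hypothetical and not taken from the repo: the `Workflow` record, the `attempt` counter (assumed to be supplied by the queue), and the `die_after` parameter, which exists only to simulate worker loss.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Hypothetical durable workflow record."""
    id: str
    status: str = "pending"  # "pending" | "running" | "completed"
    completed_steps: list = field(default_factory=list)


class WorkerLost(Exception):
    """Simulates a worker dying in the middle of a job."""


def handle_job(workflow, steps, attempt, die_after=None):
    """Process one delivery of a workflow job.

    Assumption: attempt > 1 means the queue redelivered the job after
    the previous worker was lost.
    """
    if workflow.status == "completed":
        return "skipped"
    if workflow.status == "running" and attempt == 1:
        # A different job owns this workflow: hand this one back to the queue.
        return "released"
    # Fresh start or redelivery: take over instead of re-releasing.
    workflow.status = "running"
    for step in steps:
        if step in workflow.completed_steps:
            continue  # finished before the crash, so do not repeat it
        if die_after is not None and len(workflow.completed_steps) >= die_after:
            raise WorkerLost(workflow.id)
        workflow.completed_steps.append(step)
    workflow.status = "completed"
    return "completed"
```

The key point is the second branch: without the `attempt` check, a redelivered job would see `running` and release itself forever, which is the re-release loop the design wants to avoid.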

Watchdog requirements:

  • The watchdog should be a single global delayed background job, not a fleet of jobs.
  • It should re-arm itself using queue-native behavior so it is not dependent on spawning fresh copies every time.
  • It should only scan for stale workflow records that represent “started in durable storage but no active workflow job is making progress yet”.
  • When it finds one, it should safely clear the uniqueness barrier and resume the workflow.
  • It should tolerate races where another actor already recovered or completed the workflow.
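
The scan described above might look roughly like the following. This is a minimal in-memory sketch, not the engine's API: `StartRecord`, `claim_if_unchanged`, `watchdog_scan`, and the `stale_after` threshold are all assumed names, and the compare-and-set is simulated with a plain dictionary where a real store would need an atomic conditional update. Re-arming the watchdog job itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class StartRecord:
    """Hypothetical row for a workflow start persisted in durable storage."""
    id: str
    status: str          # "pending" | "running" | "completed"
    updated_at: float
    unique_lock: bool = True  # stand-in for the uniqueness barrier


def claim_if_unchanged(store, wf_id, seen_updated_at, now):
    """Compare-and-set stand-in; a real store would do this atomically."""
    record = store.get(wf_id)
    if record is None or record.status != "pending":
        return False  # another actor already started or finished it
    if record.updated_at != seen_updated_at:
        return False  # another actor touched it since the scan
    record.updated_at = now
    return True


def watchdog_scan(store, now, stale_after, resume):
    """Recover only stale startup states; never touch running jobs."""
    candidates = [
        (r.id, r.updated_at)
        for r in store.values()
        if r.status == "pending" and now - r.updated_at >= stale_after
    ]
    recovered = []
    for wf_id, seen in candidates:
        if not claim_if_unchanged(store, wf_id, seen, now):
            continue  # lost the race, which is fine
        store[wf_id].unique_lock = False  # clear the uniqueness barrier
        resume(wf_id)
        recovered.append(wf_id)
    return recovered
```

Because the claim bumps `updated_at`, a second scan shortly afterwards finds nothing to do, and a concurrent actor holding the old timestamp fails its own claim.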

Bootstrap and liveness requirements:

  • A workflow start should still try to kick the watchdog as a fast path, but recovery must not rely on that kick alone.
  • Normal queue workers should also act as the independent liveness source by periodically ensuring the watchdog is present while they are looping.
  • This must close the bootstrap gap where the original dispatch-time watchdog kick could be lost in a producer crash.
  • Do this without requiring users to run a second special process.
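
One possible shape for the worker-side liveness check is sketched below. It assumes the queue offers some idempotent "push if not already present" primitive, modelled here by a hypothetical `FakeQueue.push_unique`. `WATCHDOG_KEY`, `ensure_watchdog`, `worker_loop`, and the 60-second interval are illustrative choices, not values from the repo.

```python
class FakeQueue:
    """In-memory stand-in for a queue with a uniqueness primitive."""

    def __init__(self):
        self.jobs = []
        self._unique = set()

    def push_unique(self, key, job):
        """Push the job unless one with the same key is already present."""
        if key in self._unique:
            return False
        self._unique.add(key)
        self.jobs.append(job)
        return True


WATCHDOG_KEY = "workflow-watchdog"  # hypothetical uniqueness key


def ensure_watchdog(queue):
    """Idempotent: a no-op when the watchdog is already queued."""
    return queue.push_unique(WATCHDOG_KEY, {"type": "watchdog"})


def worker_loop(queue, clock_ticks, ensure_every=60.0):
    """Simulated worker loop; clock_ticks yields the time at each iteration."""
    last_ensure = None
    ensure_calls = 0
    for now in clock_ticks:
        if last_ensure is None or now - last_ensure >= ensure_every:
            ensure_watchdog(queue)
            ensure_calls += 1
            last_ensure = now
        # ... an ordinary job would be popped and processed here ...
    return ensure_calls
```

Since the check rides on the ordinary worker loop, no second process is needed: any running worker bootstraps the watchdog even when the dispatch-time kick never happened.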

Scope and constraints:

  • Reuse existing repo patterns, queue behavior, and test infrastructure.
  • Do not add a manual recovery command.
  • Do not require a scheduler as the primary solution.
  • Keep the implementation small and operationally clear.
  • Preserve the repo’s current transactional/after-commit safety model.

Proof requirements:

Add focused executable tests that prove:

  • a producer crash can strand a workflow before its real job is queued
  • the watchdog can recover that stranded workflow
  • a normal queue worker can bootstrap the watchdog even if the original dispatch-time kick is lost
  • a running workflow self-heals through queue redelivery after worker loss
  • the watchdog does not need its own watchdog once it is alive
  • previously completed work is not duplicated during recovery
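
To make the first two and the last of these proofs concrete, a test could follow the outline below. It is a deliberately simplified in-memory model and does not use the repo's test infrastructure. `start_workflow`, `watchdog`, `work`, and `ProducerCrash` are hypothetical, and staleness is reduced to "pending and not in the queue" instead of a time threshold.

```python
class ProducerCrash(Exception):
    """Simulates the producer dying between persist and enqueue."""


def start_workflow(store, queue, wf_id, crash_before_enqueue=False):
    store[wf_id] = {"status": "pending", "done": []}  # durable start
    if crash_before_enqueue:
        raise ProducerCrash(wf_id)
    queue.append(wf_id)


def watchdog(store, queue):
    """Re-enqueue starts that were persisted but never queued."""
    for wf_id, record in store.items():
        if record["status"] == "pending" and wf_id not in queue:
            queue.append(wf_id)


def work(store, queue, steps, side_effects):
    """Drain the queue, skipping steps that were already recorded as done."""
    while queue:
        wf_id = queue.pop(0)
        record = store[wf_id]
        if record["status"] == "completed":
            continue
        record["status"] = "running"
        for step in steps:
            if step not in record["done"]:
                side_effects.append((wf_id, step))
                record["done"].append(step)
        record["status"] = "completed"


def test_stranded_start_is_recovered_once():
    store, queue, effects = {}, [], []
    try:
        start_workflow(store, queue, "wf-1", crash_before_enqueue=True)
    except ProducerCrash:
        pass
    assert queue == []  # stranded: persisted but never queued
    work(store, queue, ["a", "b"], effects)
    assert store["wf-1"]["status"] == "pending"  # workers alone cannot see it
    watchdog(store, queue)
    watchdog(store, queue)  # a second pass must not enqueue a duplicate
    assert queue == ["wf-1"]
    work(store, queue, ["a", "b"], effects)
    assert store["wf-1"]["status"] == "completed"
    assert effects == [("wf-1", "a"), ("wf-1", "b")]  # each step ran once
```

A real test would replace the crash flag with an actual process kill and the list with the real queue driver, but the assertions would keep the same shape.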

Deliverables:

  • the implementation
  • tests proving the behavior
