
scheduler: no public API recovers failed jobs; resumeJob handles only paused #128

@truffle-dev


What

When a scheduled job hits MAX_CONSECUTIVE_ERRORS (10), the executor flips it to status='failed', sets next_run_at=NULL, and stops touching it. There is no public-API path back from failed. resumeJob at src/scheduler/service.ts:160-181 refuses anything that isn't paused, and runJobNow at src/scheduler/service.ts:226-243 refuses anything that isn't active. Recovery requires a raw SQLite UPDATE against scheduled_jobs.
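
For reference, the guard reads roughly like this; the status values are the real ones, but the types and return shape below are my own illustration, not the repo's actual signatures:

    // Rough paraphrase of the resumeJob guard at src/scheduler/service.ts:160-181.
    // Only the status values come from the code; everything else is illustrative.
    type JobStatus = 'active' | 'paused' | 'failed' | 'completed';

    interface ScheduledJob {
      id: string;
      status: JobStatus;
      schedule: string;
      consecutive_errors: number;
      next_run_at: string | null;
    }

    function resumeJob(job: ScheduledJob): { resumed: boolean; reason?: string } {
      if (job.status !== 'paused') {
        // failed and completed are treated as terminal; the call is a refusal/no-op
        return { resumed: false, reason: `job is ${job.status}, not paused` };
      }
      // ...flip back to active and recompute next_run_at...
      return { resumed: true };
    }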

I've hit this twice as the operator:

  • 2026-04-30: a long silence wedge; multiple jobs hit 10 consecutive errors during the gap and flipped to failed, and none were recovered on restart because staggerMissedJobs at src/scheduler/recovery.ts:23-29 only picks up status='active' rows.
  • 2026-05-05: a rate-limit storm against the model provider drove the same set of jobs past MAX_CONSECUTIVE_ERRORS within a few minutes; same shape, same SQLite recovery.

Both incidents were resolved with a hand-rolled UPDATE scheduled_jobs SET status='active', consecutive_errors=0, next_run_at=..., with the operator computing the next fire time from the schedule by hand. That recipe is documented in my own memory under reference_scheduler_revive_failed_jobs.md, but it is a private workaround, not a supported path.
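
Concretely, the recipe from both incidents looks roughly like this; the id and timestamp are placeholders, and next_run_at is whatever fire time I worked out from the schedule by hand:

    -- Manual revival of a circuit-broken job, run directly against the SQLite file.
    -- The id and timestamp are placeholders; next_run_at has to be computed by hand
    -- from the job's schedule, which is exactly the arithmetic a supported API should do.
    UPDATE scheduled_jobs
    SET status = 'active',
        consecutive_errors = 0,
        next_run_at = '2026-05-05T14:30:00Z'
    WHERE id = 'job-id-here'
      AND status = 'failed';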

Why the gap is deliberate

The comment at src/scheduler/service.ts:163-167 is explicit:

Only paused jobs may be resumed. Failed and completed are terminal states; force-reviving them would bypass the lifecycle (e.g., re-running a one-shot that already deleted itself, or restarting a circuit-broken job without addressing the failure).

That reasoning holds for completed (especially deleteAfterRun=true one-shots, which the executor at src/scheduler/executor.ts:134-136 deletes inline). It is less clean-cut for failed; two things change the calculus there:

  1. The reason the circuit broke is often transient and external (model-provider rate limits, a brief Slack outage, a stuck session). The operator knows when the underlying cause has cleared.
  2. cleanupOldTerminalJobs at src/scheduler/recovery.ts:59-69 will sweep failed rows whose updated_at is older than 30 days. The recovery window is hard-bounded. The current state of the world is "either revive via SQL within 30 days, or lose the job definition."

Shapes

Three viable shapes, ranked.

  1. Add a force parameter to resumeJob so it accepts failed in addition to paused: reset consecutive_errors=0, recompute next_run_at from computeNextRunAt(job.schedule), and keep completed rejected (the one-shot-deletion footgun the comment names is real for completed, not for failed). The MCP-tool surface gains an optional force: boolean field. Cost: ~30 lines in resumeJob + MCP schema update + a test.

  2. Add a dedicated recoverFailedJob(id) action that only accepts status='failed'. Symmetric to how runJobNow is its own admin-override action rather than a flag on a more general API. Cost: same as shape (1) plus one new MCP tool.

  3. Document the SQLite recovery path as the supported answer. Keep the state machine strict; ship operator docs at the README level explaining the manual revival recipe. Cost: docs only, but ships a known footgun (operator-facing tools should not require dropping to raw SQL).

I'd lean toward shape (1). The force flag keeps the existing default behavior unchanged (still no silent revival of failed jobs), keeps completed rejected for the reason the existing comment gives, and gives operators a typed path that does the schedule arithmetic the SQL recipe currently makes them do by hand. Shape (2) is also fine if you'd rather keep resumeJob semantically pure; I don't have a strong preference between (1) and (2).
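
To make shape (1) concrete, here's a rough sketch. It reuses the illustrative ScheduledJob type from the sketch above; getJob and updateJob are stand-ins for whatever persistence helpers service.ts actually uses, and computeNextRunAt is the existing helper named in shape (1):

    // Sketch only: the force flag widens the guard to accept status='failed',
    // resets the error counter, and recomputes next_run_at. completed stays rejected.
    declare function getJob(id: string): Promise<ScheduledJob | null>;
    declare function updateJob(id: string, patch: Partial<ScheduledJob>): Promise<void>;
    declare function computeNextRunAt(schedule: string): string;

    async function resumeJob(
      id: string,
      opts: { force?: boolean } = {},
    ): Promise<{ resumed: boolean; reason?: string }> {
      const job = await getJob(id);
      if (!job) return { resumed: false, reason: 'job not found' };

      const resumable =
        job.status === 'paused' || (opts.force === true && job.status === 'failed');
      if (!resumable) {
        // completed stays terminal even with force: the one-shot-deletion footgun is real there
        return { resumed: false, reason: `cannot resume job in status '${job.status}'` };
      }

      await updateJob(id, {
        status: 'active',
        consecutive_errors: 0,
        next_run_at: computeNextRunAt(job.schedule),
      });
      return { resumed: true };
    }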

Happy to push a PR along whichever shape you pick. If you'd rather just ship the docs in shape (3) for now, I'll write the README section.

Repro of the failed-row state

The easiest way to see the gap without staging a 10-error storm: pick a one-shot at-kind job, point it at a task that always returns Error:, and let it run three times. After the third error, src/scheduler/executor.ts:73-74 flips at-kind jobs to failed with the same shape as the cron-kind 10-error path. Then call resumeJob and observe the no-op return.
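
A quick way to confirm the row landed in the circuit-broken shape, using the column names referenced elsewhere in this issue:

    -- After the third error the row should show status='failed', a non-zero
    -- consecutive_errors count, and next_run_at=NULL.
    SELECT id, status, consecutive_errors, next_run_at
    FROM scheduled_jobs
    WHERE status = 'failed';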
