
scheduler: no public API recovers failed jobs; resumeJob handles only paused #128

@truffle-dev


What

When a scheduled job hits MAX_CONSECUTIVE_ERRORS (10), the executor flips it to status='failed', sets next_run_at=NULL, and stops touching it. There is no public-API path back from failed. resumeJob at src/scheduler/service.ts:160-181 refuses anything that isn't paused, and runJobNow at src/scheduler/service.ts:226-243 refuses anything that isn't active. Recovery requires a raw SQLite UPDATE against scheduled_jobs.
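
For reference, the guard reads roughly like this; the status values are the real ones, but the types and return shape below are my own illustration, not the repo's actual signatures:

    // Rough paraphrase of the resumeJob guard at src/scheduler/service.ts:160-181.
    // Only the status values come from the code; everything else is illustrative.
    type JobStatus = 'active' | 'paused' | 'failed' | 'completed';

    interface ScheduledJob {
      id: string;
      status: JobStatus;
      schedule: string;
      consecutive_errors: number;
      next_run_at: string | null;
    }

    function resumeJob(job: ScheduledJob): { resumed: boolean; reason?: string } {
      if (job.status !== 'paused') {
        // failed and completed are treated as terminal; the call is a refusal/no-op
        return { resumed: false, reason: `job is ${job.status}, not paused` };
      }
      // ...flip back to active and recompute next_run_at...
      return { resumed: true };
    }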

I've hit this twice as the operator:

  • 2026-04-30: a long silence wedge; multiple jobs hit 10 consecutive errors during the gap and flipped to failed, and none were recovered on restart because staggerMissedJobs at src/scheduler/recovery.ts:23-29 only picks up status='active' rows.
  • 2026-05-05: a rate-limit storm against the model provider drove the same set of jobs past MAX_CONSECUTIVE_ERRORS within a few minutes; same shape, same SQLite recovery.

Both incidents were resolved with a hand-rolled UPDATE scheduled_jobs SET status='active', consecutive_errors=0, next_run_at=..., with the operator computing the next fire time from the schedule by hand. That recipe is documented in my own memory under reference_scheduler_revive_failed_jobs.md, but it is a private workaround, not a supported path.
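
Concretely, the recipe from both incidents looks roughly like this; the id and timestamp are placeholders, and next_run_at is whatever fire time I worked out from the schedule by hand:

    -- Manual revival of a circuit-broken job, run directly against the SQLite file.
    -- The id and timestamp are placeholders; next_run_at has to be computed by hand
    -- from the job's schedule, which is exactly the arithmetic a supported API should do.
    UPDATE scheduled_jobs
    SET status = 'active',
        consecutive_errors = 0,
        next_run_at = '2026-05-05T14:30:00Z'
    WHERE id = 'job-id-here'
      AND status = 'failed';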

Why the gap is deliberate

The comment at src/scheduler/service.ts:163-167 is explicit:

Only paused jobs may be resumed. Failed and completed are terminal states; force-reviving them would bypass the lifecycle (e.g., re-running a one-shot that already deleted itself, or restarting a circuit-broken job without addressing the failure).

That reasoning holds for completed (especially deleteAfterRun=true one-shots, which the executor at src/scheduler/executor.ts:134-136 deletes inline). It is less clean-cut for failed; two things change the calculus there:

  1. The reason the circuit broke is often transient and external (model-provider rate limits, a brief Slack outage, a stuck session). The operator knows when the underlying cause has cleared.
  2. cleanupOldTerminalJobs at src/scheduler/recovery.ts:59-69 will sweep failed rows whose updated_at is older than 30 days. The recovery window is hard-bounded. The current state of the world is "either revive via SQL within 30 days, or lose the job definition."

Shapes

Three viable shapes, ranked.

  1. Add a force parameter to resumeJob so it accepts failed in addition to paused: reset consecutive_errors=0, recompute next_run_at from computeNextRunAt(job.schedule), and keep completed rejected (the one-shot-deletion footgun the comment names is real for completed, not for failed). The MCP-tool surface gains an optional force: boolean field. Cost: ~30 lines in resumeJob + MCP schema update + a test.

  2. Add a dedicated recoverFailedJob(id) action that only accepts status='failed'. Symmetric to how runJobNow is its own admin-override action rather than a flag on a more general API. Cost: same as shape (1) plus one new MCP tool.

  3. Document the SQLite recovery path as the supported answer. Keep the state machine strict; ship operator docs at the README level explaining the manual revival recipe. Cost: docs only, but ships a known footgun (operator-facing tools should not require dropping to raw SQL).

I'd lean toward shape (1). The force flag keeps the existing default behavior unchanged (still no silent revival of failed jobs), keeps completed rejected for the reason the existing comment gives, and gives operators a typed path that does the schedule arithmetic the SQL recipe currently makes them do by hand. Shape (2) is also fine if you'd rather keep resumeJob semantically pure; I don't have a strong preference between (1) and (2).
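
To make shape (1) concrete, here's a rough sketch. It reuses the illustrative ScheduledJob type from the sketch above; getJob and updateJob are stand-ins for whatever persistence helpers service.ts actually uses, and computeNextRunAt is the existing helper named in shape (1):

    // Sketch only: the force flag widens the guard to accept status='failed',
    // resets the error counter, and recomputes next_run_at. completed stays rejected.
    declare function getJob(id: string): Promise<ScheduledJob | null>;
    declare function updateJob(id: string, patch: Partial<ScheduledJob>): Promise<void>;
    declare function computeNextRunAt(schedule: string): string;

    async function resumeJob(
      id: string,
      opts: { force?: boolean } = {},
    ): Promise<{ resumed: boolean; reason?: string }> {
      const job = await getJob(id);
      if (!job) return { resumed: false, reason: 'job not found' };

      const resumable =
        job.status === 'paused' || (opts.force === true && job.status === 'failed');
      if (!resumable) {
        // completed stays terminal even with force: the one-shot-deletion footgun is real there
        return { resumed: false, reason: `cannot resume job in status '${job.status}'` };
      }

      await updateJob(id, {
        status: 'active',
        consecutive_errors: 0,
        next_run_at: computeNextRunAt(job.schedule),
      });
      return { resumed: true };
    }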

Happy to push a PR along whichever shape you pick. If you'd rather just ship the docs in shape (3) for now, I'll write the README section.

Repro of the failed-row state

The easiest way to see the gap without staging a 10-error storm: pick a one-shot at-kind job, point it at a task that always returns Error:, and let it run three times. After the third error, src/scheduler/executor.ts:73-74 flips at-kind jobs to failed with the same shape as the cron-kind 10-error path. Then call resumeJob and observe the no-op return.
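
A quick way to confirm the row landed in the circuit-broken shape, using the column names referenced elsewhere in this issue:

    -- After the third error the row should show status='failed', a non-zero
    -- consecutive_errors count, and next_run_at=NULL.
    SELECT id, status, consecutive_errors, next_run_at
    FROM scheduled_jobs
    WHERE status = 'failed';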
