feat(workflows): add continue_on_error step field for non-halting failures by doquanghuy · Pull Request #2663 · github/spec-kit

doquanghuy · 2026-05-21T14:34:47Z

Description

Closes #2591.

Adds an optional continue_on_error: bool field on every step.
When set to true and the step fails, the engine records the
result (exit_code, stderr, status) into steps.<id>.output and
continues to the next sibling step instead of halting the run.
Downstream if, switch, or gate steps can then branch on
{{ steps.<id>.output.exit_code }} to route the recovery path.

This is the shape @mnriem proposed in the issue discussion —
it composes with primitives that already exist (the exit code
is already captured, the expression engine already resolves it,
and if/switch/gate are already available). The only gap
was that a non-zero exit hard-stopped the pipeline before any
downstream step could evaluate it.

Canonical usage

- id: heavy-thing
  type: command
  integration: claude
  command: speckit.heavy-thing
  continue_on_error: true

- id: check-result
  type: if
  condition: "{{ steps.heavy-thing.output.exit_code != 0 }}"
  then:
    - id: review
      type: gate
      message: "Step failed (exit {{ steps.heavy-thing.output.exit_code }}). Retry or skip?"
      on_reject: skip
  else:
    - id: next-thing
      command: speckit.next-thing

Engine

WorkflowEngine._execute_steps now consults the step config when
a step returns StepStatus.FAILED:

- Gate aborts (output.aborted) always halt the run; operator
  decisions take precedence over the flag.
- Otherwise, if continue_on_error: true, log a
  step_continue_on_error event and proceed to the next sibling.
- Otherwise, behave as before: log step_failed, set
  RunStatus.FAILED, and return.

Exactly one event per failure-resolution path is logged so the
log timeline is unambiguous: either the run continued past the
failure or it halted.
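
The precedence described above can be sketched as a small decision function. This is an illustrative sketch only, not the code in WorkflowEngine._execute_steps: the `Resolution` enum and the `resolve_failure` helper are hypothetical names, and the step config and output are assumed to be plain dicts.

```python
from enum import Enum


class Resolution(Enum):
    """Hypothetical labels for the three failure-resolution paths."""
    HALT_ON_GATE_ABORT = "halt_on_gate_abort"
    CONTINUE = "continue"
    HALT = "halt"


def resolve_failure(step_config: dict, output: dict) -> Resolution:
    """Decide what happens after a step reports a failed status."""
    # 1. An operator abort at a gate always wins, regardless of the flag.
    if output.get("aborted"):
        return Resolution.HALT_ON_GATE_ABORT
    # 2. Opt-in path: the engine would log step_continue_on_error here
    #    and move on to the next sibling step.
    if step_config.get("continue_on_error") is True:
        return Resolution.CONTINUE
    # 3. Default path: the engine would log step_failed and halt the run.
    return Resolution.HALT
```

Because each call returns exactly one resolution, a caller that logs one event per returned value gets the single-event-per-failure property described above.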

Validation

_validate_steps rejects non-bool values for continue_on_error.
Coerced strings like "true" are not accepted so authoring
mistakes surface at validation time rather than silently
changing run semantics.
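
A minimal sketch of what a strict boolean check could look like, assuming steps are parsed into dicts. The helper name `validate_continue_on_error` and the error message wording are hypothetical and are not taken from _validate_steps.

```python
def validate_continue_on_error(step: dict) -> list[str]:
    """Return validation errors for the continue_on_error field (hypothetical helper)."""
    errors = []
    if "continue_on_error" in step:
        value = step["continue_on_error"]
        # isinstance(value, bool) is False for strings such as "true" and for
        # ints such as 1, so neither is silently coerced to a flag value.
        if not isinstance(value, bool):
            errors.append(
                f"step {step.get('id', '<unknown>')!r}: continue_on_error "
                f"must be a boolean, got {type(value).__name__}"
            )
    return errors
```

Checking the type rather than the truthiness matters here: the non-empty string "false" is truthy in Python, so a truthiness check would treat it as enabling the flag.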

Default behaviour preserved

When continue_on_error is omitted, every code path is
byte-equivalent to before this change. Existing workflows see no
difference.

Verdict coverage (from the issue discussion)

| Scenario | How |
| --- | --- |
| Skip | `continue_on_error: true` + `if` branches around the failure |
| Abort | Omit the flag; today's default halts the run |
| Retry | `continue_on_error: true` + `gate` → operator approves → `resume` re-runs from gate |

Fully unattended retry-on-transient (e.g. retry a 429 at 3 AM
without operator attendance) is intentionally out of scope here.
The skip and abort verdicts work without a human; the
retry verdict still pauses for one at the gate. A future
loop/retry-count primitive or an auto-approving gate type could
close that gap on top of this mechanism without further engine
changes — happy to follow up on that in a separate issue if
useful.

Testing

- Tested locally with uv run specify --help
- Ran existing tests with uv sync && uv run pytest
  → 2967 passed, 35 skipped (was 2960 before; +7 new
  tests added in this PR).
- Tested with a sample project: ran a 3-step workflow where
  the middle step exits non-zero. Without
  continue_on_error, the run halts at the failing step (as
  before). With continue_on_error: true, the failing step
  records exit_code and the third step executes. A
  downstream if branching on
  {{ steps.flaky.output.exit_code != 0 }} routes into a
  recovery gate cleanly.
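
The 3-step scenario can be imitated with a toy runner. This is a sketch under stated assumptions, not the spec-kit engine: `run_steps` and `make_steps` are hypothetical, steps are plain dicts whose `run` callable returns an exit code, and the "failed"/"completed" strings are placeholders (the PR does not say which run status the engine reports after a continued failure).

```python
def run_steps(steps):
    """Toy runner: record each step's exit code, halt on failure unless flagged."""
    outputs = {}
    for step in steps:
        exit_code = step["run"]()
        # The result is recorded even when the step fails, so later steps
        # can branch on outputs[<id>]["exit_code"].
        outputs[step["id"]] = {"exit_code": exit_code}
        if exit_code != 0 and step.get("continue_on_error") is not True:
            return "failed", outputs
    return "completed", outputs


def make_steps(flag):
    """Build a 3-step workflow whose middle step always exits non-zero."""
    flaky = {"id": "flaky", "run": lambda: 1}
    if flag:
        flaky["continue_on_error"] = True
    return [
        {"id": "first", "run": lambda: 0},
        flaky,
        {"id": "third", "run": lambda: 0},
    ]


status_default, out_default = run_steps(make_steps(flag=False))
status_flagged, out_flagged = run_steps(make_steps(flag=True))

# Without the flag the third step never runs; with it, the third step runs
# and the failing step's exit code is still available for branching.
print(status_default, sorted(out_default))
print(status_flagged, sorted(out_flagged))
```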

New test coverage

TestContinueOnError in tests/test_workflows.py:

| Test | What it locks |
| --- | --- |
| `test_undeclared_failure_halts_run` | Default behaviour byte-equivalent: no flag → run halts on first non-zero exit. |
| `test_declared_and_fired_continues_run` | Flag set + step fails → run continues, exit_code recorded. |
| `test_declared_but_step_succeeded_is_noop` | Flag set + step succeeds → no behaviour change. |
| `test_if_branch_routes_around_failure` | End-to-end recovery pattern from the issue discussion. |
| `test_gate_abort_still_halts_with_continue_on_error` | Operator-driven gate abort always halts, even with the flag set. |
| `test_validation_rejects_non_bool_continue_on_error` | `"true"` (string) fails validation. |
| `test_validation_accepts_bool_continue_on_error` | `true` and `false` pass validation cleanly. |

AI Disclosure

- [ ] I did not use AI assistance for this contribution
- [x] I did use AI assistance (described below)

Used Claude Opus to draft the engine change, the test suite, the
docs section, and this PR body. The shape (continue_on_error +
exit-code-as-API + branch on it via existing primitives) was
proposed by @mnriem on the issue thread; this PR implements that
proposal. Code, tests, and design decisions were human-reviewed
before submission.

Closes github#2591.

Adds an optional `continue_on_error: bool` field on every step. When set to `true` and the step fails, the engine records the result (exit_code, stderr, status) into `steps.<id>.output` and continues to the next sibling step instead of halting the run. Downstream `if`, `switch`, or `gate` steps can then branch on `{{ steps.<id>.output.exit_code }}` to route the recovery path.

This composes with primitives that already exist (the exit code is already captured, the expression engine already resolves it, and `if`/`switch`/`gate` are already available); the only gap was that a non-zero exit hard-stopped the pipeline before any downstream step could evaluate it.

### Engine

`WorkflowEngine._execute_steps` now consults the step config when a step returns `StepStatus.FAILED`:

- Gate aborts (`output.aborted`) always halt the run; operator decisions take precedence over the flag.
- Otherwise, if `continue_on_error: true`, log a `step_continue_on_error` event and proceed to the next sibling.
- Otherwise, behave as before: set `RunStatus.FAILED` and return.

### Validation

`_validate_steps` rejects non-bool values for `continue_on_error`. Coerced strings like `"true"` are not accepted so authoring mistakes surface at validation time rather than silently changing run semantics.

### Default behaviour preserved

When `continue_on_error` is omitted, every code path is byte-equivalent to before this change. Existing workflows see no difference.

### Tests

New `TestContinueOnError` class in `tests/test_workflows.py` covers all four scenarios from the issue's acceptance criteria plus two extras:

- undeclared (default) failure halts the run.
- declared-and-fired continues past the failure.
- declared-but-step-succeeded is a no-op (flag only matters on FAILED).
- if-branch end-to-end exercising the canonical recovery pattern from the issue discussion.
- gate abort still halts even with `continue_on_error: true` set.
- validation rejects non-bool values; accepts both `true` and `false` cleanly.

### Docs

Adds an "Error Handling" section to `workflows/README.md` documenting the field, the gate-abort precedence rule, and the canonical recovery pattern.

### Follow-on

Auto-retry-on-transient (e.g. retry a 429 at 3 AM without operator attendance) is intentionally out of scope. The current proposal covers the **skip** and **abort** verdicts from the original discussion; the **retry** verdict still pauses for an operator at the gate step. A future loop/retry-count primitive or an auto-approving gate could close that gap on top of this mechanism without further engine changes.

doquanghuy requested a review from mnriem as a code owner May 21, 2026 14:34

doquanghuy force-pushed the feat/continue-on-error branch from e00e687 to f34ab4c Compare May 21, 2026 14:45

doquanghuy force-pushed the feat/continue-on-error branch from f34ab4c to da8ed4d Compare May 21, 2026 14:48

doquanghuy wants to merge 1 commit into github:main from doquanghuy:feat/continue-on-error
