feat(workflows): add continue_on_error step field for non-halting failures#2663

Open
doquanghuy wants to merge 1 commit into github:main from doquanghuy:feat/continue-on-error

@doquanghuy doquanghuy commented May 21, 2026

Description

Closes #2591.

Adds an optional continue_on_error: bool field on every step.
When set to true and the step fails, the engine records the
result (exit_code, stderr, status) into steps.<id>.output and
continues to the next sibling step instead of halting the run.
Downstream if, switch, or gate steps can then branch on
{{ steps.<id>.output.exit_code }} to route the recovery path.

This is the shape @mnriem proposed in the issue discussion —
it composes with primitives that already exist (the exit code
is already captured, the expression engine already resolves it,
and if/switch/gate are already available). The only gap
was that a non-zero exit hard-stopped the pipeline before any
downstream step could evaluate it.

Canonical usage

- id: heavy-thing
  type: command
  integration: claude
  command: speckit.heavy-thing
  continue_on_error: true

- id: check-result
  type: if
  condition: "{{ steps.heavy-thing.output.exit_code != 0 }}"
  then:
    - id: review
      type: gate
      message: "Step failed (exit {{ steps.heavy-thing.output.exit_code }}). Retry or skip?"
      on_reject: skip
  else:
    - id: next-thing
      command: speckit.next-thing

Engine

WorkflowEngine._execute_steps now consults the step config when
a step returns StepStatus.FAILED:

  • Gate aborts (output.aborted) always halt the run — operator
    decisions take precedence over the flag.
  • Otherwise, if continue_on_error: true, log a
    step_continue_on_error event and proceed to the next sibling.
  • Otherwise, behave as before: log step_failed, set
    RunStatus.FAILED, and return.

Exactly one event per failure-resolution path is logged so the
log timeline is unambiguous: either the run continued past the
failure or it halted.
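
The three-way decision above can be sketched as a small pure function. This is only an illustration of the precedence order, not the code in `WorkflowEngine._execute_steps`; the `Resolution` enum and `resolve_failure` name are hypothetical, and only the `aborted` output key and the `continue_on_error` config key come from this PR.

```python
from enum import Enum


class Resolution(Enum):
    """Hypothetical labels for the three outcomes described above."""
    HALT_GATE_ABORT = "halt_gate_abort"
    CONTINUE = "continue"
    HALT_FAILED = "halt_failed"


def resolve_failure(step_config: dict, output: dict) -> Resolution:
    """Decide what happens after a step has reported a FAILED status."""
    # 1. An operator abort at a gate always wins, regardless of the flag.
    if output.get("aborted"):
        return Resolution.HALT_GATE_ABORT
    # 2. The flag must be the boolean True; validation rejects anything else.
    if step_config.get("continue_on_error") is True:
        return Resolution.CONTINUE
    # 3. Flag omitted or False: halt the run, as before the change.
    return Resolution.HALT_FAILED
```

Checking the abort case first is what makes the flag unable to override an operator decision.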

Validation

_validate_steps rejects non-bool values for continue_on_error.
Coerced strings like "true" are not accepted so authoring
mistakes surface at validation time rather than silently
changing run semantics.
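
A minimal sketch of the strict check, assuming a validator that collects error strings. The function name and message wording are hypothetical and are not taken from `_validate_steps`.

```python
def validate_continue_on_error(step: dict) -> list[str]:
    """Return validation errors for the continue_on_error field of one step."""
    errors: list[str] = []
    if "continue_on_error" not in step:
        return errors  # the field is optional
    value = step["continue_on_error"]
    # isinstance(..., bool) is False for "true", 1, and None, so none of
    # these are silently coerced into a truthy flag.
    if not isinstance(value, bool):
        errors.append(
            f"Step {step.get('id', '?')!r}: 'continue_on_error' must be a "
            f"boolean, got {type(value).__name__}."
        )
    return errors
```

With this shape, a YAML value written as `"true"` (quoted) produces one error, while unquoted `true` or `false` passes.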

Default behaviour preserved

When continue_on_error is omitted, the engine behaves exactly as
it did before this change: a failed step halts the run. Existing
workflows see no difference.

Verdict coverage (from the issue discussion)

| Scenario | How |
| --- | --- |
| Skip | continue_on_error: true + if branches around the failure |
| Abort | Omit the flag; today's default halts the run |
| Retry | continue_on_error: true + gate → operator approves → resume re-runs from gate |

Fully unattended retry-on-transient (e.g. retry a 429 at 3 AM
without operator attendance) is intentionally out of scope here.
The skip and abort verdicts work without a human; the
retry verdict still pauses for one at the gate. A future
loop/retry-count primitive or an auto-approving gate type could
close that gap on top of this mechanism without further engine
changes — happy to follow up on that in a separate issue if
useful.

Testing

  • Tested locally with uv run specify --help
  • Ran existing tests with uv sync && uv run pytest
    2967 passed, 35 skipped (was 2960 before; +7 new
    tests added in this PR).
  • Tested with a sample project: ran a 3-step workflow where
    the middle step exits non-zero. Without
    continue_on_error, run halts at the failing step (as
    before). With continue_on_error: true, the failing step
    records exit_code and the third step executes. A
    downstream if branching on
    {{ steps.flaky.output.exit_code != 0 }} routes into a
    recovery gate cleanly.
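
The three-step manual test can be approximated with a toy runner. This is a simulation under stated assumptions, not the real engine: each step carries a canned `exit_code` instead of running a command, and `run_steps` and its return shape are hypothetical.

```python
def run_steps(steps: list[dict]) -> tuple[bool, dict]:
    """Toy runner. Returns (halted, outputs keyed by step id)."""
    outputs: dict = {}
    for step in steps:
        exit_code = step["exit_code"]  # stand-in for executing the step
        outputs[step["id"]] = {"exit_code": exit_code}
        failed = exit_code != 0
        if failed and not step.get("continue_on_error", False):
            return True, outputs  # default: halt at the failing step
    return False, outputs


without_flag = [
    {"id": "first", "exit_code": 0},
    {"id": "flaky", "exit_code": 1},
    {"id": "third", "exit_code": 0},
]
with_flag = [
    {"id": "first", "exit_code": 0},
    {"id": "flaky", "exit_code": 1, "continue_on_error": True},
    {"id": "third", "exit_code": 0},
]

halted_a, outputs_a = run_steps(without_flag)  # halts; "third" never runs
halted_b, outputs_b = run_steps(with_flag)     # continues; "third" runs
```

In the second run the recorded `outputs_b["flaky"]["exit_code"]` is what a downstream `if` condition would branch on.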

New test coverage

TestContinueOnError in tests/test_workflows.py:

| Test | What it locks |
| --- | --- |
| test_undeclared_failure_halts_run | Default behaviour byte-equivalent: no flag → run halts on first non-zero exit. |
| test_declared_and_fired_continues_run | Flag set + step fails → run continues, exit_code recorded. |
| test_declared_but_step_succeeded_is_noop | Flag set + step succeeds → no behaviour change. |
| test_if_branch_routes_around_failure | End-to-end recovery pattern from the issue discussion. |
| test_gate_abort_still_halts_with_continue_on_error | Operator-driven gate abort always halts, even with the flag set. |
| test_validation_rejects_non_bool_continue_on_error | "true" (string) fails validation. |
| test_validation_accepts_bool_continue_on_error | true and false pass validation cleanly. |

AI Disclosure

  • I did not use AI assistance for this contribution
  • I did use AI assistance (described below)

Used Claude Opus to draft the engine change, the test suite, the
docs section, and this PR body. The shape (continue_on_error +
exit-code-as-API + branch on it via existing primitives) was
proposed by @mnriem on the issue thread; this PR implements that
proposal. Code, tests, and design decisions were human-reviewed
before submission.

@doquanghuy doquanghuy requested a review from mnriem as a code owner May 21, 2026 14:34
@doquanghuy doquanghuy force-pushed the feat/continue-on-error branch from e00e687 to f34ab4c Compare May 21, 2026 14:45
Closes github#2591.

Adds an optional `continue_on_error: bool` field on every step.
When set to `true` and the step fails, the engine records the
result (exit_code, stderr, status) into `steps.<id>.output` and
continues to the next sibling step instead of halting the run.
Downstream `if`, `switch`, or `gate` steps can then branch on
`{{ steps.<id>.output.exit_code }}` to route the recovery path.

This composes with primitives that already exist (the exit code
is already captured, the expression engine already resolves it,
and `if`/`switch`/`gate` are already available) — the only gap
was that a non-zero exit hard-stopped the pipeline before any
downstream step could evaluate it.

### Engine

`WorkflowEngine._execute_steps` now consults the step config when
a step returns `StepStatus.FAILED`:

- Gate aborts (`output.aborted`) always halt the run — operator
  decisions take precedence over the flag.
- Otherwise, if `continue_on_error: true`, log a
  `step_continue_on_error` event and proceed to the next sibling.
- Otherwise, behave as before: set `RunStatus.FAILED` and return.

### Validation

`_validate_steps` rejects non-bool values for `continue_on_error`.
Coerced strings like `"true"` are not accepted so authoring
mistakes surface at validation time rather than silently
changing run semantics.

### Default behaviour preserved

When `continue_on_error` is omitted, every code path is
byte-equivalent to before this change. Existing workflows see no
difference.

### Tests

New `TestContinueOnError` class in `tests/test_workflows.py`
covers all four scenarios from the issue's acceptance criteria
plus two extras:

- undeclared (default) failure halts the run.
- declared-and-fired continues past the failure.
- declared-but-step-succeeded is a no-op (flag only matters on
  FAILED).
- if-branch end-to-end exercising the canonical recovery pattern
  from the issue discussion.
- gate abort still halts even with `continue_on_error: true` set.
- validation rejects non-bool values; accepts both `true` and
  `false` cleanly.

### Docs

Adds an "Error Handling" section to `workflows/README.md`
documenting the field, the gate-abort precedence rule, and the
canonical recovery pattern.

### Follow-on

Auto-retry-on-transient (e.g. retry a 429 at 3 AM without
operator attendance) is intentionally out of scope. The current
proposal covers the **skip** and **abort** verdicts from the
original discussion; the **retry** verdict still pauses for an
operator at the gate step. A future loop/retry-count primitive
or an auto-approving gate could close that gap on top of this
mechanism without further engine changes.
@doquanghuy doquanghuy force-pushed the feat/continue-on-error branch from f34ab4c to da8ed4d Compare May 21, 2026 14:48
Development

Successfully merging this pull request may close these issues.

[Feature]: Step-level on_failure: hook for recovery gates
