ASI-08: Add circuit breaker for repeatedly failing workflows #28776

@lpcox

Problem

Per the OWASP Agentic Top 10 — ASI-08 (Cascading Failures & Denial-of-Wallet), agentic workflows should have circuit breakers to prevent runaway execution and cost accumulation when workflows fail repeatedly.

Current behavior: A workflow that fails 100 consecutive times will continue to trigger and execute on every event. There is no failure budget or automatic disabling mechanism.

Existing Failure Safeguards

| Mechanism | File | What it does | Limitation |
| --- | --- | --- | --- |
| stop-after | stop_after.go | Time-based cutoff (e.g., "+6h") | Time-based, not failure-based |
| Concurrency | concurrency.go | Limits concurrent runs | Manages parallelism, not failure budget |
| Secret validation | compiler_activation_job_builder.go | Validates tokens exist | Checks format, not validity |

None of these prevent repeated execution of a failing workflow.

Parent Issue

Part of #28770 (OWASP Agentic Top 10 Compliance Evaluation)

Proposed Solution

Frontmatter Configuration

Add a circuit-breaker field to frontmatter:

---
circuit-breaker:
  max-consecutive-failures: 5    # Open circuit after N consecutive failures
  time-window: 24h               # Only count failures within this window
  cooldown: 1h                   # Time to wait before allowing retry after circuit opens
  notify: true                   # Post workflow annotation when circuit opens
---

Defaults (when circuit-breaker is not specified):

  • Disabled by default (backward compatible)
  • Can be enabled globally via a feature flag: features.circuit-breaker: true (uses sensible defaults: 5 failures, 24h window, 1h cooldown)
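
To make the parsing and default rules concrete, here is a minimal sketch of how the frontmatter could be read into a config value. It is not the project's implementation: the struct field names, the `extractCircuitBreakerConfig` signature, the `toInt` helper, and the `Notify` default are all assumptions. Only the YAML keys, the type name `CircuitBreakerConfig`, and the 5 / 24h / 1h defaults come from the proposal.

```go
package main

import (
	"fmt"
	"time"
)

// CircuitBreakerConfig mirrors the proposed circuit-breaker frontmatter field.
// Field names are assumptions; the proposal only names the type.
type CircuitBreakerConfig struct {
	MaxConsecutiveFailures int
	TimeWindow             time.Duration
	Cooldown               time.Duration
	Notify                 bool
}

// defaultCircuitBreakerConfig returns the defaults stated in the proposal
// (5 failures, 24h window, 1h cooldown). The Notify default is an assumption.
func defaultCircuitBreakerConfig() *CircuitBreakerConfig {
	return &CircuitBreakerConfig{MaxConsecutiveFailures: 5, TimeWindow: 24 * time.Hour, Cooldown: time.Hour, Notify: true}
}

// toInt accepts the integer representations that YAML and JSON decoders commonly produce.
func toInt(v any) (int, bool) {
	switch n := v.(type) {
	case int:
		return n, true
	case int64:
		return int(n), true
	case uint64:
		return int(n), true
	case float64:
		return int(n), n == float64(int(n))
	}
	return 0, false
}

// extractCircuitBreakerConfig returns nil when the breaker is disabled.
func extractCircuitBreakerConfig(frontmatter map[string]any) (*CircuitBreakerConfig, error) {
	raw, ok := frontmatter["circuit-breaker"]
	if !ok {
		// No explicit field: fall back to the feature flag, otherwise stay disabled.
		features, _ := frontmatter["features"].(map[string]any)
		if features["circuit-breaker"] == true {
			return defaultCircuitBreakerConfig(), nil
		}
		return nil, nil
	}
	fields, ok := raw.(map[string]any)
	if !ok {
		return nil, fmt.Errorf("circuit-breaker must be a mapping, got %T", raw)
	}
	cfg := defaultCircuitBreakerConfig()
	if v, present := fields["max-consecutive-failures"]; present {
		n, isInt := toInt(v)
		if !isInt || n < 1 {
			return nil, fmt.Errorf("max-consecutive-failures must be a positive integer, got %v", v)
		}
		cfg.MaxConsecutiveFailures = n
	}
	durations := map[string]*time.Duration{"time-window": &cfg.TimeWindow, "cooldown": &cfg.Cooldown}
	for key, dst := range durations {
		v, present := fields[key]
		if !present {
			continue
		}
		s, _ := v.(string)
		d, err := time.ParseDuration(s)
		if err != nil || d <= 0 {
			return nil, fmt.Errorf("%s must be a positive duration such as \"1h\", got %v", key, v)
		}
		*dst = d
	}
	if v, present := fields["notify"]; present {
		b, isBool := v.(bool)
		if !isBool {
			return nil, fmt.Errorf("notify must be a boolean, got %v", v)
		}
		cfg.Notify = b
	}
	return cfg, nil
}

func main() {
	cfg, err := extractCircuitBreakerConfig(map[string]any{
		"circuit-breaker": map[string]any{"max-consecutive-failures": 3, "cooldown": "30m"},
	})
	fmt.Println(cfg, err)
}
```

One design point worth deciding early: `time.ParseDuration` accepts units up to hours ("24h") but not days, so a value such as "1d" would be rejected unless the parser is extended.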

Implementation Architecture

1. Failure Counter (GitHub Actions Artifacts)

Use workflow run artifacts to persist the failure counter across runs:

Run N (success) → upload artifact: {consecutive_failures: 0, last_success: <timestamp>}
Run N+1 (fail)  → download prev artifact → upload: {consecutive_failures: 1, last_failure: <timestamp>}
Run N+2 (fail)  → download prev artifact → upload: {consecutive_failures: 2, last_failure: <timestamp>}
...
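Run N+5 (fail)  → download prev artifact → upload: {consecutive_failures: 5, last_failure: <timestamp>} → consecutive_failures >= max → CIRCUIT OPEN
Run N+6 (trigger) → circuit open → skip activation until cooldown elapses; once elapsed: allow one retry (half-open)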
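
The counter transitions in the trace above can be sketched as a small pure function. The JSON keys match the payloads shown in the trace; the Go type and function names (`breakerState`, `nextState`, `replay`) are hypothetical and exist only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// breakerState is the JSON document persisted between runs.
// The keys follow the trace above; the Go names are hypothetical.
type breakerState struct {
	ConsecutiveFailures int        `json:"consecutive_failures"`
	LastSuccess         *time.Time `json:"last_success,omitempty"`
	LastFailure         *time.Time `json:"last_failure,omitempty"`
}

// nextState applies one run outcome to the previous state:
// a success resets the counter, a failure increments it.
func nextState(prev breakerState, succeeded bool, now time.Time) breakerState {
	now = now.UTC().Truncate(time.Second)
	if succeeded {
		return breakerState{LastSuccess: &now}
	}
	return breakerState{ConsecutiveFailures: prev.ConsecutiveFailures + 1, LastFailure: &now}
}

// replay feeds a sequence of outcomes (true = success) through nextState,
// spacing the simulated runs one hour apart from a fixed start time.
func replay(outcomes ...bool) breakerState {
	at := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	var s breakerState
	for _, succeeded := range outcomes {
		s = nextState(s, succeeded, at)
		at = at.Add(time.Hour)
	}
	return s
}

func main() {
	// One success followed by five failures, as in the trace.
	s := replay(true, false, false, false, false, false)
	out, _ := json.Marshal(s)
	fmt.Println(string(out))
}
```

Keeping the transition free of I/O makes it straightforward to unit test separately from the artifact download and upload steps.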

2. Activation Job Integration

Add a circuit breaker check step before the activation condition in compiler_activation_job_builder.go:

// In activationJobBuildContext (L17-41)
type activationJobBuildContext struct {
    // ... existing fields ...
    circuitBreakerConfig *CircuitBreakerConfig
}

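New steps in activation job (before validate-secret):

- name: Download previous failure counter
  uses: actions/download-artifact@v4
  continue-on-error: true   # No counter exists before the first recorded run
  with:
    name: circuit-breaker-state
    run-id: ${{ env.PREVIOUS_RUN_ID }}   # Placeholder: v4 reads another run's artifacts only with run-id and github-token
    github-token: ${{ github.token }}
- name: Check circuit breaker
  id: check-circuit-breaker
  run: bash check_circuit_breaker.sh
  # Compare the downloaded counter against the threshold
  # Output: is_open=true/false, consecutive_failures=N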

Activation condition becomes:

if: >-
  ${{ steps.check-circuit-breaker.outputs.is_open != 'true' &&
      <existing-activation-conditions> }}
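
On the compiler side, composing this condition might look like the sketch below. The helper name `withCircuitBreakerGuard` is hypothetical; only the step id and output name come from the proposal. The existing condition is wrapped in parentheses so that a top-level `||` in it cannot bypass the guard.

```go
package main

import (
	"fmt"
	"strings"
)

// circuitBreakerGuard is the expression from the proposal that blocks
// activation while the circuit is open.
const circuitBreakerGuard = "steps.check-circuit-breaker.outputs.is_open != 'true'"

// withCircuitBreakerGuard prepends the guard to an existing activation
// condition. The helper name is hypothetical.
func withCircuitBreakerGuard(existing string) string {
	existing = strings.TrimSpace(existing)
	if existing == "" {
		return circuitBreakerGuard
	}
	// Parenthesize so operator precedence in the existing condition is preserved.
	return fmt.Sprintf("%s && (%s)", circuitBreakerGuard, existing)
}

func main() {
	fmt.Println(withCircuitBreakerGuard("github.event_name == 'issues' || github.event_name == 'push'"))
}
```

Note that the `steps` context is only available to step-level conditions within the same job. If the guard has to gate a downstream job, the value would need to be exposed as a job output and read through the `needs` context instead.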

3. Post-Execution Counter Update

Add a step at the end of the agent job to update the failure counter:

- name: Update circuit breaker counter
  if: always()
  env:
    # steps.* is job-scoped: if the check runs in the activation job, pass this value through a job output instead
    PREV: ${{ steps.check-circuit-breaker.outputs.consecutive_failures }}
  run: |
    NOW="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    if [ "${{ job.status }}" = "success" ]; then
      echo "{\"consecutive_failures\": 0, \"last_success\": \"$NOW\"}" > circuit-breaker-state.json
    else
      # Default to 0 when no previous counter is available
      echo "{\"consecutive_failures\": $(( ${PREV:-0} + 1 )), \"last_failure\": \"$NOW\"}" > circuit-breaker-state.json
    fi
- uses: actions/upload-artifact@v4
  if: always()
  with:
    name: circuit-breaker-state
    path: circuit-breaker-state.json

4. Circuit Breaker States

Follow the standard circuit breaker pattern:

CLOSED (normal) ──[N consecutive failures]──→ OPEN (blocking)
                                                  │
                                          [cooldown elapsed]
                                                  │
                                                  ▼
                                            HALF-OPEN (probe)
                                              │         │
                                          [success]  [failure]
                                              │         │
                                              ▼         ▼
                                           CLOSED      OPEN
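
The diagram can be derived from the persisted counter alone, without storing the state explicitly. The sketch below shows one possible reading, with assumptions stated plainly: all names are hypothetical, and the time window is interpreted as "a failure streak older than the window no longer counts", which the proposal does not spell out.

```go
package main

import (
	"fmt"
	"time"
)

type circuitState string

const (
	stateClosed   circuitState = "closed"
	stateOpen     circuitState = "open"
	stateHalfOpen circuitState = "half-open"
)

// evaluate derives the breaker state from the failure counter and the time
// elapsed since the last recorded failure.
func evaluate(failures, maxFailures int, sinceLastFailure, window, cooldown time.Duration) circuitState {
	switch {
	case failures < maxFailures:
		return stateClosed
	case sinceLastFailure >= window:
		// Assumption: a streak that ended longer ago than the window has expired.
		return stateClosed
	case sinceLastFailure >= cooldown:
		// Cooldown elapsed: allow a single probe run.
		return stateHalfOpen
	default:
		return stateOpen
	}
}

// probeTimeline evaluates the proposal's defaults (5 failures, 24h window,
// 1h cooldown) at three points in time after the fifth failure.
func probeTimeline() []circuitState {
	var states []circuitState
	for _, elapsed := range []time.Duration{10 * time.Minute, 2 * time.Hour, 25 * time.Hour} {
		states = append(states, evaluate(5, 5, elapsed, 24*time.Hour, time.Hour))
	}
	return states
}

func main() {
	fmt.Println(probeTimeline()) // prints "[open half-open closed]"
}
```

The two exits from HALF-OPEN need no extra logic under this scheme: a successful probe resets the counter to zero (CLOSED), while a failed probe increments it and refreshes the last-failure timestamp, which restarts the cooldown (OPEN).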

Key Files to Modify

| File | Change |
| --- | --- |
| pkg/workflow/frontmatter_types.go | Add CircuitBreaker *CircuitBreakerConfig field |
| pkg/workflow/circuit_breaker.go | New file: config parsing, step generation |
| pkg/workflow/compiler_activation_job_builder.go | Add circuit breaker check step before activation |
| pkg/workflow/compiler_yaml_main_job.go | Add counter update step after agent execution |
| pkg/parser/schemas/ | Add circuit-breaker to frontmatter JSON schema |
| actions/setup/sh/check_circuit_breaker.sh | Runtime script for state checking |

Pattern to Follow

Follow the same pattern as stop_after.go:

  • extractCircuitBreakerConfig() — parse frontmatter
  • generateCircuitBreakerSteps() — generate activation/post-execution steps
  • Integration in buildActivationJob() and main job builder
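
As a rough illustration of the step-generation half of this pattern, the sketch below emits the YAML lines for the check step. The function signature, the indentation depth, the `CB_*` environment variable names, and the way the script is invoked are all assumptions. Only the function name, the step name and id, and the script path appear in this proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// generateCircuitBreakerSteps returns the YAML lines for the activation-job
// check step. The thresholds are passed to the script through hypothetical
// CB_* environment variables.
func generateCircuitBreakerSteps(maxFailures int, window, cooldown string) []string {
	return []string{
		"      - name: Check circuit breaker",
		"        id: check-circuit-breaker",
		"        env:",
		fmt.Sprintf("          CB_MAX_FAILURES: \"%d\"", maxFailures),
		fmt.Sprintf("          CB_TIME_WINDOW: %q", window),
		fmt.Sprintf("          CB_COOLDOWN: %q", cooldown),
		// Assumed invocation of the runtime script listed under Key Files to Modify.
		"        run: bash actions/setup/sh/check_circuit_breaker.sh",
	}
}

func main() {
	fmt.Println(strings.Join(generateCircuitBreakerSteps(5, "24h", "1h"), "\n"))
}
```

Passing thresholds through `env` rather than interpolating them into the script body keeps the generated shell free of frontmatter-controlled text.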

Acceptance Criteria

  • circuit-breaker frontmatter field parsed and validated
  • Circuit breaker check runs in activation job before agent execution
  • Failure counter persisted via artifacts across workflow runs
  • Circuit opens after N consecutive failures within time window
  • Half-open state allows one retry after cooldown period
  • Workflow annotation posted when circuit opens/closes
  • Feature flag features.circuit-breaker: true enables with defaults
  • Backward compatible — no behavior change for existing workflows
  • Unit tests for config parsing, state transitions, step generation
  • Documentation for circuit-breaker frontmatter configuration
