feat(ASI-08): Add circuit breaker for repeatedly failing workflows#28778
feat(ASI-08): Add circuit breaker for repeatedly failing workflows#28778
Conversation
Implements OWASP Agentic Top 10 ASI-08 (Cascading Failures & Denial-of-Wallet) protection by adding a circuit breaker mechanism that prevents runaway execution of consistently failing workflows. Key changes: - Add CircuitBreakerConfig struct to frontmatter types - Add circuit-breaker field to WorkflowData - Create circuit_breaker.go with config extraction and step generation - Add pre-activation check step that reads artifact state from previous runs - Add post-execution state update steps (if: always()) to persist failure counts - Add check_circuit_breaker.cjs and update_circuit_breaker.cjs runtime scripts - Add circuit-breaker to JSON schema with validation - Add constants: CheckCircuitBreakerStepID, UpdateCircuitBreakerStepID, etc. - Add unit tests covering extraction, defaults, duration parsing, step generation Supports: circuit-breaker frontmatter config object, boolean shorthand, and features.circuit-breaker: true feature flag for enabling with defaults. Agent-Logs-Url: https://github.com/github/gh-aw/sessions/f1c289ae-c6dc-4e12-a67a-33f91178acce Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
…o use download-artifact for ZIP extraction - Replace single-step circuit breaker check with 3-step approach: 1. find_circuit_breaker_artifact.cjs: find previous run with artifact 2. actions/download-artifact@v4: native ZIP extraction (no adm-zip needed) 3. check_circuit_breaker.cjs: read extracted JSON and evaluate state - Add FindCircuitBreakerArtifactStepID constant - Split overly-long log message in compiler_jobs.go into 3 grouped lines Agent-Logs-Url: https://github.com/github/gh-aw/sessions/f1c289ae-c6dc-4e12-a67a-33f91178acce Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
…nd add clarifying comment Agent-Logs-Url: https://github.com/github/gh-aw/sessions/f1c289ae-c6dc-4e12-a67a-33f91178acce Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
🧪 Test Quality Sentinel ReportTest Quality Score: 91/100✅ Excellent test quality
Test Classification DetailsView All Test Classifications
Flagged Tests — Requires ReviewNo tests are flagged for issues. The two single-scenario step-generation tests ( Language SupportTests analyzed:
Verdict
📖 Understanding Test ClassificationsDesign Tests (High Value) verify what the system does:
Implementation Tests (Low Value) verify how the system does it:
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators. References: §25013046517
|
There was a problem hiding this comment.
✅ Test Quality Sentinel: 91/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All 7 tests enforce behavioral contracts with proper build tags, descriptive assertion messages, and no mock library violations.
Generated by Design Decision Gate workflow run 25013046554.
|
Commit pushed:
|
🏗️ Design Decision Gate — ADR RequiredThis PR makes significant changes to core business logic (684 new lines in AI has analyzed the PR diff and generated a draft ADR to help you get started: 📄 Draft ADR: 📋 ADR Summary — what the AI inferred from the diffDecision: Implement the closed → open → half-open circuit breaker state machine as an opt-in frontmatter feature ( Driver: Workflows that fail repeatedly consume runner resources indefinitely (OWASP Agentic Top-10 ASI-08). GitHub Actions has no native mechanism to suppress a workflow that fails consecutively. Alternatives considered:
Key consequences:
What to do next
Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision. Why ADRs MatterADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you. 📋 Michael Nygard ADR Format ReferenceAn ADR must contain these four sections to be considered complete:
All ADRs are stored in
References: §25013046554
|
There was a problem hiding this comment.
Pull request overview
Adds an optional circuit breaker mechanism to prevent repeatedly failing workflows from continuing to trigger indefinitely, using persisted GitHub Actions artifacts to track failure state across runs.
Changes:
- Introduces
circuit-breakerfrontmatter (boolean + object forms) and wires it into workflow compilation. - Injects pre-activation steps to locate/download prior state and gate activation, plus post-execution steps to update/upload state.
- Adds schema/constants updates and initial Go unit tests for config extraction and YAML generation.
Show a summary per file
| File | Description |
|---|---|
| pkg/workflow/frontmatter_types.go | Adds CircuitBreakerConfig and exposes it on FrontmatterConfig. |
| pkg/workflow/compiler_yaml_main_job.go | Appends circuit breaker post-steps to update/upload state after agent execution. |
| pkg/workflow/compiler_types.go | Adds CircuitBreaker to WorkflowData. |
| pkg/workflow/compiler_pre_activation_job.go | Adds circuit breaker check steps and gates activation on circuit_breaker_ok. |
| pkg/workflow/compiler_orchestrator_workflow.go | Extracts circuit breaker config from frontmatter into WorkflowData. |
| pkg/workflow/compiler_jobs.go | Ensures pre-activation job is generated when circuit breaker is enabled; improves gate logging. |
| pkg/workflow/circuit_breaker.go | Implements config extraction/defaulting + YAML generation for circuit breaker steps. |
| pkg/workflow/circuit_breaker_test.go | Adds unit tests for extraction/defaults/duration parsing and YAML step generation. |
| pkg/parser/schemas/main_workflow_schema.json | Adds schema support for circuit-breaker boolean/object forms. |
| pkg/constants/job_constants.go | Adds step IDs and output constants for the circuit breaker feature. |
| actions/setup/js/find_circuit_breaker_artifact.cjs | New script to find the latest run containing the state artifact. |
| actions/setup/js/check_circuit_breaker.cjs | New script to evaluate CLOSED/OPEN/HALF-OPEN and set circuit_breaker_ok. |
| actions/setup/js/update_circuit_breaker.cjs | New script to update and persist the circuit breaker state artifact. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comments suppressed due to low confidence (2)
actions/setup/js/check_circuit_breaker.cjs:63
- The time-window logic doesn’t actually “forget” old failures: when
!failureIsRecentthe circuit is treated as CLOSED, butconsecutive_failuresis still carried forward in state. That means a single new failure (even after the window has elapsed) can immediately trip the circuit becauseupdate_circuit_breaker.cjswill increment from the old count. Consider resetting the failure counter when the last failure is outside the window (e.g., treat it as 0 in this check and/or reset it inupdate_circuit_breaker.cjswhennow - last_failure > window).
const nowMs = Date.now();
const windowMs = timeWindowMinutes * 60 * 1000;
const failureIsRecent = lastFailure !== null && nowMs - lastFailure.getTime() <= windowMs;
// CLOSED state: fewer failures than threshold, or failures are outside the time window
if (consecutiveFailures < maxFailures || !failureIsRecent) {
core.info(`✅ Circuit breaker is CLOSED — workflow execution allowed`);
core.setOutput("circuit_breaker_ok", "true");
core.setOutput("consecutive_failures", String(consecutiveFailures));
return;
actions/setup/js/update_circuit_breaker.cjs:34
- The comment says “The check step would have written the state … if one was found”, but the state file is created by the
actions/download-artifactstep (the check script only reads it). Updating this comment would prevent confusion when debugging state propagation.
// Load the previous state from the artifact downloaded in the check step (if available).
// The check step would have written the state to /tmp/gh-aw/circuit-breaker-state.json
// if one was found; otherwise we start fresh.
let previousState = { consecutive_failures: 0 };
- Files reviewed: 14/14 changed files
- Comments generated: 5
| yaml.WriteString(" env:\n") | ||
| yaml.WriteString(" GH_AW_CB_JOB_STATUS: ${{ job.status }}\n") | ||
| fmt.Fprintf(yaml, " GH_AW_CB_MAX_FAILURES: \"%d\"\n", data.CircuitBreaker.MaxConsecutiveFailures) | ||
| fmt.Fprintf(yaml, " GH_AW_WORKFLOW_NAME: %q\n", data.Name) | ||
| yaml.WriteString(" with:\n") | ||
| yaml.WriteString(" script: |\n") |
| // the equivalent number of minutes as an integer string suitable for passing to the check script. | ||
| func circuitBreakerDurationToMinutes(d string) (int, error) { | ||
| dur, err := time.ParseDuration(d) | ||
| if err != nil { | ||
| return 0, fmt.Errorf("invalid circuit-breaker duration %q: %w", d, err) | ||
| } | ||
| return int(dur.Minutes()), nil |
| async function main() { | ||
| const maxFailures = parseInt(process.env.GH_AW_CB_MAX_FAILURES?.trim() || "5", 10); | ||
| const timeWindowMinutes = parseInt(process.env.GH_AW_CB_TIME_WINDOW_MINUTES?.trim() || "1440", 10); | ||
| const cooldownMinutes = parseInt(process.env.GH_AW_CB_COOLDOWN_MINUTES?.trim() || "60", 10); | ||
| const notify = (process.env.GH_AW_CB_NOTIFY?.trim() || "true") === "true"; | ||
| const workflowName = process.env.GH_AW_WORKFLOW_NAME || "Unknown Workflow"; | ||
|
|
||
| core.info(`🔌 Circuit breaker check for workflow '${workflowName}'`); | ||
| core.info(` Configuration: max=${maxFailures} failures, window=${timeWindowMinutes}m, cooldown=${cooldownMinutes}m`); | ||
|
|
||
| // Read the circuit breaker state downloaded by the preceding download-artifact step. | ||
| // The artifact is extracted to /tmp/gh-aw, so the JSON file is at: | ||
| // /tmp/gh-aw/circuit-breaker-state.json | ||
| const stateFile = path.join("/tmp/gh-aw", "circuit-breaker-state.json"); | ||
| let state = { consecutive_failures: 0 }; | ||
|
|
||
| try { | ||
| if (fs.existsSync(stateFile)) { | ||
| const content = fs.readFileSync(stateFile, "utf8"); | ||
| state = JSON.parse(content); | ||
| core.info(` Loaded state: consecutive_failures=${state.consecutive_failures}`); | ||
| } else { | ||
| core.info(` No previous state found — starting fresh (circuit CLOSED)`); | ||
| } | ||
| } catch (error) { | ||
| // If we can't load the previous state, assume circuit is closed (fail-open for availability) | ||
| core.warning(`Could not read previous circuit breaker state from ${stateFile}: ${getErrorMessage(error)}. Assuming circuit is closed.`); | ||
| } | ||
|
|
||
| const consecutiveFailures = state.consecutive_failures ?? 0; | ||
| const lastFailure = state.last_failure ? new Date(state.last_failure) : null; | ||
|
|
||
| core.info(` Consecutive failures: ${consecutiveFailures} / ${maxFailures}`); | ||
|
|
||
| const nowMs = Date.now(); | ||
| const windowMs = timeWindowMinutes * 60 * 1000; | ||
| const failureIsRecent = lastFailure !== null && nowMs - lastFailure.getTime() <= windowMs; | ||
|
|
||
| // CLOSED state: fewer failures than threshold, or failures are outside the time window | ||
| if (consecutiveFailures < maxFailures || !failureIsRecent) { | ||
| core.info(`✅ Circuit breaker is CLOSED — workflow execution allowed`); | ||
| core.setOutput("circuit_breaker_ok", "true"); | ||
| core.setOutput("consecutive_failures", String(consecutiveFailures)); | ||
| return; | ||
| } | ||
|
|
||
| // Circuit is OPEN — check if cooldown has elapsed (HALF-OPEN state) | ||
| const cooldownMs = cooldownMinutes * 60 * 1000; | ||
| const cooldownElapsed = lastFailure !== null && nowMs - lastFailure.getTime() >= cooldownMs; | ||
|
|
||
| if (cooldownElapsed) { | ||
| core.info(`🔄 Circuit breaker is HALF-OPEN — cooldown elapsed, allowing one retry`); | ||
| core.setOutput("circuit_breaker_ok", "true"); | ||
| core.setOutput("consecutive_failures", String(consecutiveFailures)); | ||
| return; | ||
| } |
| async function main() { | ||
| const jobStatus = (process.env.GH_AW_CB_JOB_STATUS || "").toLowerCase(); | ||
| const maxFailures = parseInt(process.env.GH_AW_CB_MAX_FAILURES?.trim() || "5", 10); | ||
| const workflowName = process.env.GH_AW_WORKFLOW_NAME || "Unknown Workflow"; | ||
|
|
||
| const { | ||
| repo: { owner, repo }, | ||
| runId, | ||
| } = context; | ||
|
|
||
| core.info(`🔌 Updating circuit breaker state for workflow '${workflowName}'`); | ||
| core.info(` Job status: ${jobStatus}`); | ||
|
|
||
| // Load the previous state from the artifact downloaded in the check step (if available). | ||
| // The check step would have written the state to /tmp/gh-aw/circuit-breaker-state.json | ||
| // if one was found; otherwise we start fresh. | ||
| let previousState = { consecutive_failures: 0 }; | ||
|
|
||
| const stateDir = "/tmp/gh-aw"; | ||
| const stateFile = path.join(stateDir, "circuit-breaker-state.json"); | ||
|
|
||
| try { | ||
| if (fs.existsSync(stateFile)) { | ||
| const content = fs.readFileSync(stateFile, "utf8"); | ||
| previousState = JSON.parse(content); | ||
| core.info(` Loaded previous state: consecutive_failures=${previousState.consecutive_failures}`); | ||
| } | ||
| } catch (error) { | ||
| core.warning(`Could not load existing circuit breaker state: ${getErrorMessage(error)}. Starting fresh.`); | ||
| } | ||
|
|
||
| const nowISO = new Date().toISOString(); | ||
| let newState; | ||
|
|
||
| if (jobStatus === "success") { | ||
| // Success — reset the circuit breaker | ||
| newState = { | ||
| consecutive_failures: 0, | ||
| last_success: nowISO, | ||
| last_failure: previousState.last_failure ?? null, | ||
| circuit_opened_at: null, | ||
| }; | ||
| core.info(`✅ Job succeeded — resetting circuit breaker (was ${previousState.consecutive_failures} failures)`); | ||
| } else { | ||
| // Failure or cancellation — increment the failure counter | ||
| const newCount = (previousState.consecutive_failures ?? 0) + 1; | ||
| // Preserve the original circuit_opened_at timestamp from when the circuit first opened. | ||
| // Using ?? ensures we only record the timestamp on the first opening (newCount === maxFailures), | ||
| // and keep that value on all subsequent failures without overwriting it. | ||
| const circuitOpenedAt = newCount >= maxFailures ? (previousState.circuit_opened_at ?? nowISO) : null; | ||
|
|
||
| newState = { | ||
| consecutive_failures: newCount, | ||
| last_failure: nowISO, | ||
| last_success: previousState.last_success ?? null, | ||
| circuit_opened_at: circuitOpenedAt, | ||
| }; | ||
|
|
||
| core.info(`❌ Job failed — consecutive failures: ${newCount} / ${maxFailures}`); | ||
| if (newCount >= maxFailures) { | ||
| core.warning(`🔴 Circuit breaker threshold reached (${newCount} consecutive failures). Circuit is now OPEN.`); | ||
| } | ||
| } |
| async function main() { | ||
| const { | ||
| repo: { owner, repo }, | ||
| runId, | ||
| } = context; | ||
|
|
||
| core.info(`🔌 Looking for previous circuit-breaker-state artifact`); | ||
|
|
||
| try { | ||
| // Get the workflow ID of the current run | ||
| const { data: runData } = await github.rest.actions.getWorkflowRun({ | ||
| owner, | ||
| repo, | ||
| run_id: runId, | ||
| }); | ||
| const workflowId = runData.workflow_id; | ||
| core.info(` Workflow ID: ${workflowId}`); | ||
|
|
||
| // List recent completed runs for this workflow (excluding the current one) | ||
| const { data: runsData } = await github.rest.actions.listWorkflowRuns({ | ||
| owner, | ||
| repo, | ||
| workflow_id: workflowId, | ||
| status: "completed", | ||
| per_page: 20, | ||
| }); | ||
|
|
||
| core.info(` Found ${runsData.workflow_runs.length} recent completed runs`); | ||
|
|
||
| for (const run of runsData.workflow_runs) { | ||
| if (run.id === runId) continue; | ||
|
|
||
| try { | ||
| const { data: artifactsData } = await github.rest.actions.listWorkflowRunArtifacts({ | ||
| owner, | ||
| repo, | ||
| run_id: run.id, | ||
| }); | ||
|
|
||
| const artifact = artifactsData.artifacts.find(a => a.name === "circuit-breaker-state" && !a.expired); | ||
| if (artifact) { | ||
| core.info(` Found circuit-breaker-state artifact in run #${run.id}`); | ||
| core.setOutput("previous_run_id", String(run.id)); | ||
| return; | ||
| } | ||
| } catch (error) { | ||
| core.debug(` Could not list artifacts for run #${run.id}: ${getErrorMessage(error)}`); | ||
| continue; | ||
| } | ||
| } | ||
|
|
||
| core.info(` No previous circuit-breaker-state artifact found`); | ||
| core.setOutput("previous_run_id", ""); | ||
| } catch (error) { | ||
| core.warning(`Could not search for previous circuit breaker state: ${getErrorMessage(error)}`); | ||
| core.setOutput("previous_run_id", ""); | ||
| } |
|
@copilot address this review feedback #28778 (review) |
1. circuitBreakerDurationToMinutes: round sub-minute durations up to 1 min
(prevents 0-minute window/cooldown that would break check logic)
2. generateCircuitBreakerUpdateSteps: pass GH_AW_CB_TIME_WINDOW_MINUTES to
update script so stale out-of-window failures are properly discarded
3. update_circuit_breaker.cjs: use time window to reset counter when
last_failure falls outside the window before incrementing
4. check_circuit_breaker.cjs + update_circuit_breaker.cjs: add
GH_AW_CB_STATE_DIR env var (default /tmp/gh-aw) for test isolation
5. Add vitest tests for all three new JavaScript scripts:
- check_circuit_breaker.test.cjs (10 tests): CLOSED/OPEN/HALF-OPEN states,
window expiry, cooldown expiry, custom threshold, notify flag, corrupt file
- update_circuit_breaker.test.cjs (9 tests): success reset, failure
increment, window-based reset, circuit_opened_at preservation, corrupt file
- find_circuit_breaker_artifact.test.cjs (8 tests): pagination, run
exclusion, expired artifact skip, API error handling
All 27 JS tests and all Go unit tests pass.
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/aaf984e4-72ca-4d4a-ad68-c82454e21bb1
Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
All four review items addressed in commit
3–4. Vitest tests for all three new scripts — added Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
|
@lpcox I think this should be agenetic maintenance job instead |
|
How does an AW restart once it was restarted? |
Workflows that fail repeatedly continue to trigger and consume resources indefinitely. This adds a circuit breaker mechanism (OWASP Agentic Top-10 ASI-08) that tracks consecutive failures via GitHub Actions artifacts and blocks execution when a threshold is exceeded.
State machine
CLOSED→ (N consecutive failures) →OPEN→ (cooldown elapsed) →HALF-OPEN→ (success) →CLOSEDFrontmatter configuration
Disabled by default; fully backward-compatible.
Implementation
Pre-activation job (3 injected steps):
find_circuit_breaker_artifact.cjs— locates the most recent run that has acircuit-breaker-stateartifact (GitHub API)actions/download-artifact@v4— downloads and extracts the JSON state file natively (no extra deps)check_circuit_breaker.cjs— reads state, evaluates closed/open/half-open, setscircuit_breaker_okoutput that gates theactivatedconditionAgent job (2
if: always()steps appended after execution):update_circuit_breaker.cjs— resets counter on success, increments on failure (respecting the configured time window so stale failures are discarded), writes state to diskactions/upload-artifact@v4— persists state for next runKey files
pkg/workflow/circuit_breaker.gopkg/workflow/circuit_breaker_test.goactions/setup/js/find_circuit_breaker_artifact.cjsactions/setup/js/find_circuit_breaker_artifact.test.cjsactions/setup/js/check_circuit_breaker.cjsactions/setup/js/check_circuit_breaker.test.cjsactions/setup/js/update_circuit_breaker.cjsactions/setup/js/update_circuit_breaker.test.cjspkg/parser/schemas/main_workflow_schema.jsonpkg/constants/job_constants.goCorrectness fixes
circuitBreakerDurationToMinutesrounds sub-minute durations (e.g."30s") up to 1 minute, preventing a 0-minute window/cooldown that would break the check logic.GH_AW_CB_TIME_WINDOW_MINUTESis now passed to the update script so that failures whoselast_failurefalls outside the configured window are discarded before incrementing the counter, preventing unbounded accumulation of stale failures.check_circuit_breaker.cjsandupdate_circuit_breaker.cjsaccept aGH_AW_CB_STATE_DIRenv var (default/tmp/gh-aw) for test isolation.