Skip to content

[aw-failures] Fix P1: Detection parse_error causes workflow run conclusion=failure even when agent succeeds #28969

@github-actions

Description

@github-actions

Parent investigation: #28947

Problem Statement

The threat detection job is failing with parse_error on a significant fraction of workflow runs (at least 5 confirmed in the last 6 hours), causing the overall GitHub Actions workflow run conclusion to show as failure even when the agent completed its task successfully. This creates false failure issues, noisy dashboards, and obscures real failures.

Affected Workflows & Runs (2026-04-28 13:00–19:12 UTC)

Run Workflow Engine Time (UTC) Agent Result
§25072244854 Design Decision Gate claude 19:05 noop ✅
§25071332891 PR Triage Agent copilot 18:44 success ✅
§25067188919 Design Decision Gate claude 17:13 noop ✅
§25063756460 Daily Repository Chronicle copilot 16:02 success ✅
§25062034363 Design Decision Gate claude 15:28 success ✅

The detection warnings tracker (#28866) shows parse_error has been firing since at least 06:26 UTC, with 10+ unique workflows affected across the full day.

Failure Signature

jobs:
  activation:  success
  agent:       success     ← Agent completed; noop or write actions emitted
  detection:   failure     ← parse_error → job conclusion=failure
  conclusion:  success
  safe_outputs: skipped

run conclusion: failure    ← propagated from detection job failure

In GitHub Actions, continue-on-error: true on the detection job allows downstream jobs to run but does not prevent the overall workflow run from being marked as failure. This means every run where detection fails shows as a broken workflow in the GitHub UI and triggers false failure issues.

Probable Root Cause

The detection parser fails to parse the agent output for certain output shapes:

  1. Runs that produce a noop (minimal output) — Design Decision Gate (3×)
  2. Runs that produce heavy copilot output (large JSON blobs) — PR Triage Agent, Daily Repository Chronicle
  3. Possible JSON syntax edge-case or encoding issue in the agent-stdio log that the parser cannot handle

The detection system already records the exact reason in #28866 (parse_error) but the job itself exits non-zero, escalating to a workflow failure.

Proposed Remediation

  1. Quick fix: Change the detection job's exit code behavior — on parse_error, the job should exit 0 (warning) rather than non-zero (failure), since parse errors are not security threats
  2. Root fix: Add structured error logging to the detection parser to surface the exact line/token that fails to parse; fix the underlying parsing bug
  3. Observability: Extend [aw] Detection Runs #28866 comments to include the specific error location (not just the reason code) so failures are debuggable without reading raw logs

Success Criteria

  • Detection parse_error no longer causes workflow run conclusion: failure
  • Zero false-positive failure issues created due to detection parse failures
  • parse_error rate reported in [aw] Detection Runs #28866 decreases as root cause is fixed

References:

Note

🔒 Integrity filter blocked 3 items

The following items were blocked because they don't meet the GitHub integrity level.

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by [aw] Failure Investigator (6h) · ● 439.9K ·

  • expires on May 5, 2026, 7:25 PM UTC

Metadata

Metadata

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions