
Add safe-output-health workflow to monitor safe output job failures #2525

Merged
pelikhan merged 2 commits into main from copilot/add-safe-output-health-workflow
Oct 26, 2025


Copilot AI commented Oct 26, 2025

Adds daily monitoring for safe output job health across all agentic workflows. Safe output jobs (create_discussion, create_issue, add_comment, create_pull_request, etc.) run after agent execution to write GitHub API outputs, and failures in these jobs can cause silent data loss.

Workflow Design

  • Runs daily at midnight UTC or on manual dispatch
  • Uses gh-aw MCP server to fetch 24h of workflow logs
  • Scopes analysis exclusively to safe output job failures (not agent or detection failures)
  • Clusters errors by type: API failures, parsing errors, validation issues, permission problems
  • Stores patterns in cache-memory for historical trend analysis
  • Outputs structured discussion reports in "audits" category with:
    • Job success/failure rates by type
    • Root cause analysis for error clusters
    • Actionable work item plans with priority assessment
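The clustering and rate reporting described above can be sketched roughly as follows. This is only an illustration, not the workflow's actual logic (the workflow delegates analysis to the agent via a prompt). The four cluster names come from the list above; the regular expressions, the `other` fallback bucket, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Cluster names follow the PR description; the patterns are illustrative
# assumptions, not rules taken from the workflow itself.
CLUSTER_PATTERNS = {
    "permission": re.compile(r"\b403\b|forbidden|permission|not accessible", re.IGNORECASE),
    "validation": re.compile(r"\b422\b|validation|invalid", re.IGNORECASE),
    "parsing": re.compile(r"parse|unexpected token|malformed", re.IGNORECASE),
    "api_failure": re.compile(r"\b5\d\d\b|rate limit|timeout|timed out", re.IGNORECASE),
}


def classify(message):
    """Return the first cluster whose pattern matches, or 'other'."""
    for name, pattern in CLUSTER_PATTERNS.items():
        if pattern.search(message):
            return name
    return "other"


def cluster_errors(messages):
    """Count error messages per cluster."""
    return Counter(classify(m) for m in messages)


def failure_rates(runs):
    """Compute the failure rate per job type.

    `runs` is an iterable of (job_type, succeeded) pairs (hypothetical shape).
    """
    totals, failures = Counter(), Counter()
    for job_type, succeeded in runs:
        totals[job_type] += 1
        if not succeeded:
            failures[job_type] += 1
    return {job: failures[job] / totals[job] for job in totals}
```

Patterns are checked in dictionary order and the first match wins, so a message mentioning both a 403 status and a timeout would be counted as a permission problem in this sketch.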

Implementation

Created .github/workflows/safe-output-health.md with:

  • Claude engine with 30min timeout
  • Imports shared/mcp/gh-aw.md for MCP server setup
  • Instructions to parse workflow-logs directories for job-specific errors
  • Template for comprehensive audit reports with KPIs and recommendations

The workflow explicitly excludes agent/detection job analysis to avoid overlap with existing monitoring workflows (audit-workflows.md, ci-doctor.md).
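As a rough sketch of that scoping rule, a filter could keep only failed jobs whose names are known safe output jobs. The job names below are the ones listed in the description (which ends in "etc.", so the set is not exhaustive); the record shape with `name` and `conclusion` keys and the function name are assumptions for illustration.

```python
# Safe output job names mentioned in the PR description; not an exhaustive list.
SAFE_OUTPUT_JOBS = {
    "create_discussion",
    "create_issue",
    "add_comment",
    "create_pull_request",
}


def failed_safe_output_jobs(jobs):
    """Keep failed jobs that are safe output jobs.

    Each job is assumed to be a dict with 'name' and 'conclusion' keys.
    Agent and detection jobs are dropped because their names are not in
    SAFE_OUTPUT_JOBS.
    """
    return [
        job
        for job in jobs
        if job["name"] in SAFE_OUTPUT_JOBS and job["conclusion"] == "failure"
    ]
```

Using an allowlist rather than excluding `agent` and `detection` by name means any unrelated job added later is ignored by default.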

Original prompt

Add a new agentic workflow, "safe-output-health" that scans for safe output job errors and generates a report.

  • use shared workflow gh-aw to load agentic workflow MCP
  • get 24h of logs and look for errors in the safe output jobs (not agent or detection)
  • cluster and analyze potential root cause
  • suggest potential work item / issues plans to address problems

Runs daily or on dispatch. Creates discussion.



Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add new agentic workflow for safe output health scanning" to "Add safe-output-health workflow to monitor safe output job failures" Oct 26, 2025
Copilot AI requested a review from pelikhan October 26, 2025 10:45
@pelikhan pelikhan marked this pull request as ready for review October 26, 2025 10:50
Copilot AI review requested due to automatic review settings October 26, 2025 10:50
@pelikhan pelikhan merged commit a354027 into main Oct 26, 2025
5 checks passed
@pelikhan pelikhan deleted the copilot/add-safe-output-health-workflow branch October 26, 2025 10:50
@github-actions

Agentic Changeset Generator triggered by this pull request.


Copilot AI left a comment


Pull Request Overview

This PR adds a new agentic workflow for daily monitoring of safe output job health across all workflows. The workflow addresses a gap in observability by specifically tracking failures in jobs that write GitHub API outputs (discussions, issues, comments, PRs) after agent execution completes. These failures can cause silent data loss if not monitored.

Key changes:

  • Creates a daily automated health check that analyzes 24 hours of safe output job execution logs
  • Implements error clustering and root cause analysis specifically scoped to output job failures
  • Generates structured audit reports with actionable work items and recommendations


engine: claude
tools:
  cache-memory: true
timeout: 300

Copilot AI Oct 26, 2025


The timeout: 300 field is not a valid frontmatter field. This appears to be intended as a step-level timeout but is placed at the top-level. According to the schema, workflows use timeout_minutes at the top level (which is already set to 30 on line 22), and individual steps can have timeout configurations within their definition.

Suggested change (remove this line)
- timeout: 300

category: "audits"
max: 1
timeout_minutes: 30
strict: true

Copilot AI Oct 26, 2025


The strict: true field is not a valid frontmatter field according to the workflow schema. This field does not exist in the allowed frontmatter fields (on, permissions, engine, tools, steps, safe-outputs, timeout_minutes, imports, etc.) and will cause validation errors during compilation.

Suggested change (remove this line)
- strict: true

Comment on lines +15 to +17
env:
  GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: ./gh-aw logs --start-date -1d -o /tmp/gh-aw/aw-mcp/logs

Copilot AI Oct 26, 2025


The step attempts to run ./gh-aw binary directly, but this conflicts with line 46 which states 'DO NOT ATTEMPT TO USE GH AW DIRECTLY, it is not authenticated. Use the MCP server instead.' Either remove this step and rely on the MCP server's logs tool, or update the documentation to clarify when direct binary usage is appropriate.

Suggested change
- env:
-   GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- run: ./gh-aw logs --start-date -1d -o /tmp/gh-aw/aw-mcp/logs
+ run: mcp logs --start-date -1d -o /tmp/gh-aw/aw-mcp/logs

permissions:
  contents: read
  actions: read
engine: claude

Copilot AI Oct 26, 2025


[nitpick] The workflow uses engine: claude with tools.cache-memory: true, but according to the coding guidelines, cache-memory is only documented to work with Claude and Custom engines. While this is valid, it would be clearer to use the object notation engine: { id: claude } for consistency with other engine configurations in the codebase.

Suggested change
- engine: claude
+ engine:
+   id: claude


3 participants