Add safe-output-health workflow to monitor safe output job failures #2525
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Agentic Changeset Generator triggered by this pull request.
Pull Request Overview
This PR adds a new agentic workflow for daily monitoring of safe output job health across all workflows. The workflow addresses a gap in observability by specifically tracking failures in jobs that write GitHub API outputs (discussions, issues, comments, PRs) after agent execution completes. These failures can cause silent data loss if not monitored.
Key changes:
- Creates a daily automated health check that analyzes 24 hours of safe output job execution logs
- Implements error clustering and root cause analysis specifically scoped to output job failures
- Generates structured audit reports with actionable work items and recommendations
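As a rough illustration of the error clustering mentioned above, here is a minimal Python sketch. It is not the workflow's actual implementation: the record shape (`run_id`, `job`, `conclusion`, `error`) and the helper names are hypothetical, and the job list contains only the four safe output jobs named in this PR. The idea is to drop everything that is not a failed safe output job, mask run-specific details in the error text, and group the remainder by job and normalized message.

```python
import re
from collections import defaultdict

# Safe output job names taken from the PR description (the list there ends with "etc.").
SAFE_OUTPUT_JOBS = {"create_discussion", "create_issue", "add_comment", "create_pull_request"}


def normalize(message: str) -> str:
    """Mask run-specific details so similar errors share one signature."""
    msg = message.lower()
    msg = re.sub(r"https?://\S+", "<url>", msg)  # mask URLs first
    msg = re.sub(r"\d+", "<n>", msg)             # then mask numbers (ids, status codes)
    return re.sub(r"\s+", " ", msg).strip()      # collapse whitespace


def cluster_failures(records):
    """Group failed safe output jobs by (job name, normalized error message).

    `records` is a hypothetical list of dicts; the real log format may differ.
    """
    clusters = defaultdict(list)
    for rec in records:
        if rec["job"] not in SAFE_OUTPUT_JOBS:
            continue  # agent and detection jobs are out of scope
        if rec["conclusion"] != "failure":
            continue
        clusters[(rec["job"], normalize(rec["error"]))].append(rec["run_id"])
    return dict(clusters)


# Illustrative data only, not real log output.
records = [
    {"run_id": 1, "job": "create_issue", "conclusion": "failure",
     "error": "HTTP 403: Resource not accessible (request 1234)"},
    {"run_id": 2, "job": "create_issue", "conclusion": "failure",
     "error": "HTTP 403: resource not accessible (request 9981)"},
    {"run_id": 3, "job": "agent", "conclusion": "failure", "error": "timeout"},
    {"run_id": 4, "job": "add_comment", "conclusion": "success", "error": ""},
]

for (job, signature), run_ids in cluster_failures(records).items():
    print(job, "|", signature, "|", run_ids)
```

Grouping on a normalized signature rather than the raw message keeps recurring failures in one cluster even when request ids or URLs differ between runs.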
engine: claude
tools:
  cache-memory: true
timeout: 300
The `timeout: 300` field is not a valid frontmatter field. This appears to be intended as a step-level timeout but is placed at the top level. According to the schema, workflows use `timeout_minutes` at the top level (which is already set to 30 on line 22), and individual steps can have timeout configurations within their definition.
Suggested change: remove `timeout: 300`.
category: "audits"
max: 1
timeout_minutes: 30
strict: true
The `strict: true` field is not a valid frontmatter field according to the workflow schema. It is not among the allowed frontmatter fields (`on`, `permissions`, `engine`, `tools`, `steps`, `safe-outputs`, `timeout_minutes`, `imports`, etc.) and will cause validation errors during compilation.
Suggested change: remove `strict: true`.
env:
  GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: ./gh-aw logs --start-date -1d -o /tmp/gh-aw/aw-mcp/logs
The step attempts to run the `./gh-aw` binary directly, but this conflicts with line 46, which states 'DO NOT ATTEMPT TO USE GH AW DIRECTLY, it is not authenticated. Use the MCP server instead.' Either remove this step and rely on the MCP server's `logs` tool, or update the documentation to clarify when direct binary usage is appropriate.
Suggested change:
- env:
-   GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- run: ./gh-aw logs --start-date -1d -o /tmp/gh-aw/aw-mcp/logs
+ run: mcp logs --start-date -1d -o /tmp/gh-aw/aw-mcp/logs
permissions:
  contents: read
  actions: read
engine: claude
[nitpick] The workflow uses `engine: claude` with `tools.cache-memory: true`, but according to the coding guidelines, `cache-memory` is only documented to work with Claude and Custom engines. While this is valid, it would be clearer to use the object notation `engine: { id: claude }` for consistency with other engine configurations in the codebase.
Suggested change:
- engine: claude
+ engine:
+   id: claude
Adds daily monitoring for safe output job health across all agentic workflows. Safe output jobs (`create_discussion`, `create_issue`, `add_comment`, `create_pull_request`, etc.) run after agent execution to write GitHub API outputs, and failures in these jobs can cause silent data loss.

Workflow Design

Implementation

Created `.github/workflows/safe-output-health.md` with:
- `shared/mcp/gh-aw.md` for MCP server setup

The workflow explicitly excludes agent/detection job analysis to avoid overlap with existing monitoring workflows (`audit-workflows.md`, `ci-doctor.md`).