## Summary

Agentic workflows that rely on MCP servers (`github`, `safeoutputs`) are failing at a high rate due to transient HTTP 401 responses from the Copilot MCP registry endpoint. When this occurs, the Copilot CLI blocks all non-default MCP servers as a safety measure, leaving the agent unable to produce structured safe outputs. The agent completes its analysis correctly via shell fallback but produces `{"items":[]}`, which gh-aw interprets as a failure.
## Environment

- Repository: Azure/azure-sdk-for-net
- Workflows affected: `issue-triage.md` (issue triage) and `update-samples-and-docs.md` (documentation gap detection)
- Engine: Copilot (default)
- gh-aw version: latest as of April 2026
- Runner: `ubuntu-latest`
## Reproduction

This is not manually reproducible — it's a transient infrastructure failure. It manifests as temporal clusters (e.g., 8 failures in ~36 hours) and then clears up.
## Failure Pattern

### Agent process log signature

```
GET api.github.com/copilot/mcp_registry → 401 Unauthorized
```

Followed by:

```
filtered [...github, safeoutputs] (blocked by policy)
```
### What happens

- Copilot CLI fetches `api.github.com/copilot/mcp_registry` during startup to validate MCP servers against org policy
- The endpoint returns HTTP 401
- The CLI blocks ALL non-default MCP servers as a safety measure — both `github` and `safeoutputs` are filtered out
- The agent falls back to shell commands (`curl`, `grep`, the `gh` CLI) and performs correct analysis
- The agent cannot call any safe-output MCP tools (`add-labels`, `add-comment`, `assign-to-user`, etc.)
- The agent's output artifact is `{"items":[]}`
- The gh-aw conclusion job detects empty outputs and files a failure report issue
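The fallback path above can be sketched as follows. This is a minimal illustration, not the agent's literal transcript: the commented `gh`/`curl` commands stand in for the kind of shell fallback observed, and issue 58113 is used only as an example from the table below.

```shell
# Illustrative sketch of the fallback path (not the agent's literal transcript).
# Analysis still works through plain shell tools, e.g.:
#   gh issue view 58113 --json title,labels
#   curl -s https://api.github.com/repos/Azure/azure-sdk-for-net/issues/58113
# But with the safeoutputs MCP server blocked, no add-labels / add-comment
# call is possible, so the only safe-output artifact the agent can write
# is the empty document that gh-aw later flags as a failure:
printf '%s\n' '{"items":[]}'
```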
### Key observation

The agent's reasoning and analysis are correct. The failure is purely in the MCP server initialization path — the agent has no mechanism to write structured outputs when MCP servers are blocked.
## Evidence: 10 confirmed failures (Apr 8–13, 2026)

All failures were confirmed by downloading the agent artifacts (`gh run download <run_id> --name agent`) and inspecting `sandbox/agent/logs/process-*.log` for the `mcp_registry` 401 pattern.
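The per-run check can be sketched like this. The sample log is reconstructed from the signature quoted above; in practice the file would come from `gh run download <run_id> --name agent` rather than being written locally.

```shell
# Hypothetical verification sketch. In practice the log is fetched with:
#   gh run download <run_id> --name agent
# Here we reconstruct a sample log from the signature quoted above.
log=$(mktemp)
cat > "$log" <<'EOF'
GET api.github.com/copilot/mcp_registry → 401 Unauthorized
filtered [...github, safeoutputs] (blocked by policy)
EOF

# A run is attributed to this root cause only if both markers are present.
if grep -q 'copilot/mcp_registry' "$log" && grep -q '401 Unauthorized' "$log"; then
  echo "mcp_registry-401 failure confirmed"
else
  echo "different failure mode"
fi
rm -f "$log"
```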
| Issue | Workflow | Date (UTC) | Run ID (from issue body) |
|---|---|---|---|
| #58113 | Triage | 2026-04-13 09:13 | See issue |
| #58107 | Triage | 2026-04-11 18:00 | See issue |
| #58075 | Triage | 2026-04-10 16:14 | See issue |
| #58072 | Docs | 2026-04-10 15:28 | See issue |
| #58059 | Triage | 2026-04-10 06:34 | See issue |
| #58055 | Docs | 2026-04-10 00:59 | See issue |
| #58054 | Docs | 2026-04-10 00:17 | See issue |
| #58053 | Triage | 2026-04-10 00:16 | See issue |
| #58048 | Triage | 2026-04-09 22:14 | See issue |
| #58044 | Docs | 2026-04-09 21:09 | See issue |
**Temporal pattern:** Heaviest cluster Apr 9–10 (8 failures in ~36 hours), continuing intermittently through Apr 13.
## Impact

- 10 of the 13 most recent agentic workflow failures (77%) share this exact root cause
- Each failure auto-files a GitHub issue with the `agentic-workflows` label, creating noise in the issue tracker
- Issues that should have been triaged, or had docs gaps filed, remain unprocessed until a human notices and re-triggers the workflow
- No workaround exists on the workflow-author side — the MCP registry check is internal to the Copilot CLI
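As a quick sanity check on the quoted share, 10 of 13 failures rounds to the 77% stated above:

```shell
# 10 of the 13 recent agentic workflow failures share this root cause.
awk 'BEGIN { printf "%.0f%%\n", 10 / 13 * 100 }'
```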
## Mitigation applied (workflow-side)

We have set `report-failure-as-issue: false` on both workflows to suppress the noisy auto-filed failure issues. This is a stopgap — it also suppresses reports for real workflow bugs.
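For concreteness, a sketch of where this setting lives in a workflow's frontmatter. Only the `report-failure-as-issue: false` line is taken from this report; the surrounding keys are illustrative and may not match the actual gh-aw schema.

```yaml
# issue-triage.md frontmatter (surrounding keys are illustrative)
---
on:
  issues:
    types: [opened]
engine: copilot
report-failure-as-issue: false  # stopgap: suppress auto-filed failure issues
---
```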
## Suggested improvements

- **Fix the transient 401s** — address the root cause in the MCP registry auth path
- **Graceful degradation** — if the registry check fails transiently, consider retrying before blocking all MCP servers, or allowing a configurable fallback policy
- **Distinguish infrastructure failures from agent failures** — the current `{"items":[]}` output doesn't distinguish "agent chose not to act" from "agent couldn't access its tools"
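One way the third distinction could be drawn, sketched under the assumption that the conclusion job can see both the output artifact and the process log; the classification labels are invented for illustration, and the log text is the signature quoted earlier.

```shell
# Hypothetical classifier: pair the empty output artifact with the process-log
# signature to separate infrastructure failures from deliberate no-ops.
# Labels ("infrastructure-failure", etc.) are invented for illustration.
output='{"items":[]}'
log='GET api.github.com/copilot/mcp_registry → 401 Unauthorized
filtered [...github, safeoutputs] (blocked by policy)'

if [ "$output" = '{"items":[]}' ] && printf '%s\n' "$log" | grep -q 'copilot/mcp_registry'; then
  echo "infrastructure-failure: MCP servers blocked"
elif [ "$output" = '{"items":[]}' ]; then
  echo "agent-no-action"
else
  echo "agent-acted"
fi
```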