## Summary

Agentic workflows that rely on MCP servers (`github`, `safeoutputs`) are failing at a high rate due to transient HTTP 401 responses from the Copilot MCP registry endpoint. When this occurs, the Copilot CLI blocks all non-default MCP servers as a safety measure, leaving the agent unable to produce structured safe outputs. The agent completes its analysis correctly via shell fallback but produces `{"items":[]}`, which gh-aw interprets as a failure.
## Environment

- Repository: Azure/azure-sdk-for-net
- Workflows affected: `issue-triage.md` (issue triage) and `update-samples-and-docs.md` (documentation gap detection)
- Engine: Copilot (default)
- gh-aw version: latest as of April 2026
- Runner: `ubuntu-latest`
## Reproduction

This is not manually reproducible — it's a transient infrastructure failure. It manifests as temporal clusters (e.g., 8 failures in ~36 hours) and then clears up.
## Failure Pattern

### Agent process log signature

```
GET api.github.com/copilot/mcp_registry → 401 Unauthorized
```

Followed by:

```
filtered [...github, safeoutputs] (blocked by policy)
```
### What happens

- Copilot CLI fetches `api.github.com/copilot/mcp_registry` during startup to validate MCP servers against org policy
- The endpoint returns HTTP 401
- The CLI blocks ALL non-default MCP servers as a safety measure — both `github` and `safeoutputs` are filtered out
- The agent falls back to shell commands (`curl`, `grep`, the `gh` CLI) and performs correct analysis
- The agent cannot call any safe-output MCP tools (`add-labels`, `add-comment`, `assign-to-user`, etc.)
- The agent's output artifact is `{"items":[]}`
- The gh-aw conclusion job detects empty outputs and files a failure report issue
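The fallback path above can be sketched as follows. This is a minimal illustration, not the agent's literal transcript: the commented `gh`/`curl` commands stand in for the kind of shell fallback observed, and issue 58113 is used only as an example from the table below.

```shell
# Illustrative sketch of the fallback path (not the agent's literal transcript).
# Analysis still works through plain shell tools, e.g.:
#   gh issue view 58113 --json title,labels
#   curl -s https://api.github.com/repos/Azure/azure-sdk-for-net/issues/58113
# But with the safeoutputs MCP server blocked, no add-labels / add-comment
# call is possible, so the only safe-output artifact the agent can write
# is the empty document that gh-aw later flags as a failure:
printf '%s\n' '{"items":[]}'
```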
### Key observation

The agent's reasoning and analysis are correct. The failure is purely in the MCP server initialization path — the agent has no mechanism to write structured outputs when MCP servers are blocked.
## Evidence: 10 confirmed failures (Apr 8–13, 2026)

All failures were confirmed by downloading the agent artifacts (`gh run download <run_id> --name agent`) and inspecting `sandbox/agent/logs/process-*.log` for the `mcp_registry` 401 pattern.
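The per-run check can be sketched like this. The sample log is reconstructed from the signature quoted above; in practice the file would come from `gh run download <run_id> --name agent` rather than being written locally.

```shell
# Hypothetical verification sketch. In practice the log is fetched with:
#   gh run download <run_id> --name agent
# Here we reconstruct a sample log from the signature quoted above.
log=$(mktemp)
cat > "$log" <<'EOF'
GET api.github.com/copilot/mcp_registry → 401 Unauthorized
filtered [...github, safeoutputs] (blocked by policy)
EOF

# A run is attributed to this root cause only if both markers are present.
if grep -q 'copilot/mcp_registry' "$log" && grep -q '401 Unauthorized' "$log"; then
  echo "mcp_registry-401 failure confirmed"
else
  echo "different failure mode"
fi
rm -f "$log"
```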
| Issue | Workflow | Date (UTC) | Run ID (from issue body) |
|---|---|---|---|
| #58113 | Triage | 2026-04-13 09:13 | See issue |
| #58107 | Triage | 2026-04-11 18:00 | See issue |
| #58075 | Triage | 2026-04-10 16:14 | See issue |
| #58072 | Docs | 2026-04-10 15:28 | See issue |
| #58059 | Triage | 2026-04-10 06:34 | See issue |
| #58055 | Docs | 2026-04-10 00:59 | See issue |
| #58054 | Docs | 2026-04-10 00:17 | See issue |
| #58053 | Triage | 2026-04-10 00:16 | See issue |
| #58048 | Triage | 2026-04-09 22:14 | See issue |
| #58044 | Docs | 2026-04-09 21:09 | See issue |
**Temporal pattern:** Heaviest cluster Apr 9–10 (8 failures in ~36 hours), continuing intermittently through Apr 13.
## Impact

- 10 of the 13 most recent agentic workflow failures (77%) share this exact root cause
- Each failure auto-files a GitHub issue with the `agentic-workflows` label, creating noise in the issue tracker
- Issues that should have been triaged, or had docs gaps filed, remain unprocessed until a human notices and re-triggers the workflow
- No workaround exists on the workflow-author side — the MCP registry check is internal to the Copilot CLI
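As a quick sanity check on the quoted share, 10 of 13 failures rounds to the 77% stated above:

```shell
# 10 of the 13 recent agentic workflow failures share this root cause.
awk 'BEGIN { printf "%.0f%%\n", 10 / 13 * 100 }'
```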
## Mitigation applied (workflow-side)

We have set `report-failure-as-issue: false` on both workflows to suppress the noisy auto-filed failure issues. This is a stopgap — it also suppresses reports for real workflow bugs.
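For concreteness, a sketch of where this setting lives in a workflow's frontmatter. Only the `report-failure-as-issue: false` line is taken from this report; the surrounding keys are illustrative and may not match the actual gh-aw schema.

```yaml
# issue-triage.md frontmatter (surrounding keys are illustrative)
---
on:
  issues:
    types: [opened]
engine: copilot
report-failure-as-issue: false  # stopgap: suppress auto-filed failure issues
---
```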
## Suggested improvements

- **Fix the transient 401s** — address the root cause in the MCP registry auth path
- **Graceful degradation** — if the registry check fails transiently, consider retrying before blocking all MCP servers, or allowing a configurable fallback policy
- **Distinguish infrastructure failures from agent failures** — the current `{"items":[]}` output doesn't distinguish "agent chose not to act" from "agent couldn't access its tools"
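One way the third distinction could be drawn, sketched under the assumption that the conclusion job can see both the output artifact and the process log; the classification labels are invented for illustration, and the log text is the signature quoted earlier.

```shell
# Hypothetical classifier: pair the empty output artifact with the process-log
# signature to separate infrastructure failures from deliberate no-ops.
# Labels ("infrastructure-failure", etc.) are invented for illustration.
output='{"items":[]}'
log='GET api.github.com/copilot/mcp_registry → 401 Unauthorized
filtered [...github, safeoutputs] (blocked by policy)'

if [ "$output" = '{"items":[]}' ] && printf '%s\n' "$log" | grep -q 'copilot/mcp_registry'; then
  echo "infrastructure-failure: MCP servers blocked"
elif [ "$output" = '{"items":[]}' ]; then
  echo "agent-no-action"
else
  echo "agent-acted"
fi
```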