Problem Statement
The Design Decision Gate workflow reliably fails when the GitHub MCP server's HTTP connection drops at ~100–110 seconds of uptime. The agent exits with code 1 after two terminal connection errors, even when it has already completed its primary work (ADR write, PR review).
Three instances confirmed today (2026-04-30):
| Run | Branch | Start Time | MCP uptime at drop |
|---|---|---|---|
| §25177070075 | copilot/refactor-semantic-function-clustering | 16:29 UTC | 110s |
| §25179263531 | copilot/deep-report-enable-firewall-artifacts | 17:18 UTC | 105s |
| §25181104179 | copilot/update-agentic-maintenance-action | 17:59 UTC | 101s |
Root Cause
The GitHub MCP server maintains an HTTP streaming connection (SSE or long-poll). Claude Code drops the connection when it has been up for ~100–110s. The error signature is consistent across all three runs:
MCP server "github": HTTP connection dropped after 10Xs uptime
MCP server "github": Connection error: The operation was aborted.
MCP server "github": Terminal connection error 1/3
MCP server "github": Terminal connection error 2/3
Process exiting with code: 1
The design-decision-gate agent takes 8–13 minutes total, so the MCP connection always drops well within the run. The agent typically completes its ADR write and bundle push before the drop, but the drop causes Claude Code to exit with a failure exit code regardless.
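To confirm that a failed run matches this signature, the agent log can be scanned for the drop line and the terminal-error counter. The following is a minimal sketch, assuming the log is available as plain text in the format quoted above; the function name `summarize_mcp_drop` and the sample log (which uses the 110s uptime from the first run in the table) are illustrative, not part of any existing tooling.

```python
import re

# Patterns mirror the log lines quoted in the error signature above.
DROP_RE = re.compile(
    r'MCP server "(?P<server>[^"]+)": HTTP connection dropped after (?P<uptime>\d+)s uptime'
)
TERMINAL_RE = re.compile(
    r'MCP server "(?P<server>[^"]+)": Terminal connection error (?P<n>\d+)/(?P<max>\d+)'
)


def summarize_mcp_drop(log_text: str) -> dict:
    """Summarize the first connection drop and the terminal errors in a log.

    Hypothetical helper: returns the server name and uptime of the first
    drop, the highest terminal-error count seen, and whether the process
    reported exit code 1.
    """
    drop = DROP_RE.search(log_text)
    terminal_counts = [int(m.group("n")) for m in TERMINAL_RE.finditer(log_text)]
    return {
        "server": drop.group("server") if drop else None,
        "uptime_s": int(drop.group("uptime")) if drop else None,
        "terminal_errors": max(terminal_counts, default=0),
        "exited_nonzero": "Process exiting with code: 1" in log_text,
    }


# Illustrative sample built from the signature, with a concrete uptime filled in.
SAMPLE_LOG = """\
MCP server "github": HTTP connection dropped after 110s uptime
MCP server "github": Connection error: The operation was aborted.
MCP server "github": Terminal connection error 1/3
MCP server "github": Terminal connection error 2/3
Process exiting with code: 1
"""

summary = summarize_mcp_drop(SAMPLE_LOG)
```

A run whose summary shows a drop, two terminal errors, and a nonzero exit can be classified as this failure mode rather than as a genuine gate failure.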
Impact
- 3 workflow failures in a 3-hour window
- Each failure loses the safe_outputs (although the agent's push_to_pull_request_branch inline call succeeds before the drop)
- Automation marks PRs as having a failed gate check even when the gate agent completed its work
Proposed Remediation
Option A (Recommended): Increase GitHub MCP keepalive/connection timeout
- Investigate whether the MCP server has a configurable connection timeout (120s?) and raise it to ≥15 minutes for long-running agent workflows
Option B: Add MCP reconnect logic to Claude Code
- Configure the Claude Code MCP client to reconnect and resume after a terminal connection error instead of exiting
Option C: Restructure DDG prompt to complete all work within 90s
- Limit the DDG agent to a maximum of 2–3 MCP calls so it finishes before the ~100s timeout window
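The reconnect behavior described in Option B could look roughly like the sketch below. It is a generic retry-with-backoff wrapper, not Claude Code's actual MCP client API: `TerminalConnectionError`, `call_with_reconnect`, and the `reconnect` callback are hypothetical names introduced only for illustration, and the attempt limit of 3 is assumed from the "1/3" counter in the log.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TerminalConnectionError(Exception):
    """Hypothetical stand-in for the client's terminal connection error."""


def call_with_reconnect(
    call: Callable[[], T],
    reconnect: Callable[[], None],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, reconnecting and retrying after a terminal connection error.

    The delay doubles after each failed attempt (exponential backoff).
    The error is re-raised only once `max_attempts` calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TerminalConnectionError:
            if attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
            reconnect()  # re-establish the session before the next attempt
    raise ValueError("max_attempts must be at least 1")


# Usage: a call that fails twice, then succeeds once the session is restored.
_state = {"calls": 0}


def flaky_mcp_call() -> str:
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise TerminalConnectionError("HTTP connection dropped")
    return "ok"


reconnects: list = []
delays: list = []
result = call_with_reconnect(
    flaky_mcp_call,
    reconnect=lambda: reconnects.append("reconnected"),
    sleep=delays.append,  # record delays instead of sleeping, for demonstration
)
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting. Whether an MCP session can actually be resumed after a drop depends on the client and server, which this sketch does not model.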
Success Criteria
Design Decision Gate runs on branches complete without GitHub MCP connection errors, and audit logs show agent: success for DDG runs lasting 8+ minutes.
Parent: #29232
Generated by [aw] Failure Investigator (6h) · ● 397.8K · ◷