[aw-failures] Fix: Design Decision Gate GitHub MCP HTTP connection drops at ~100s — 3 failures today #29371

@github-actions

Problem Statement

The Design Decision Gate (DDG) workflow reliably fails when the GitHub MCP server's HTTP connection drops at ~100–110 seconds of uptime. The agent exits with code 1 after two terminal connection errors, even when it has already completed its primary work (ADR write, PR review).

Three instances confirmed today (2026-04-30):

| Run | Branch | Start time | MCP uptime at drop |
|---|---|---|---|
| §25177070075 | copilot/refactor-semantic-function-clustering | 16:29 UTC | 110s |
| §25179263531 | copilot/deep-report-enable-firewall-artifacts | 17:18 UTC | 105s |
| §25181104179 | copilot/update-agentic-maintenance-action | 17:59 UTC | 101s |

Root Cause

The GitHub MCP server maintains an HTTP streaming connection (SSE or long-poll). Claude Code reports this connection as dropped once it has been up for ~100–110s. The error signature is consistent across all three runs (only the uptime value differs):

MCP server "github": HTTP connection dropped after 10Xs uptime
MCP server "github": Connection error: The operation was aborted.
MCP server "github": Terminal connection error 1/3
MCP server "github": Terminal connection error 2/3
Process exiting with code: 1
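
To confirm the signature across further runs, the agent log can be scanned mechanically. The following is a minimal sketch, assuming the log is available as plain text and its lines match the format quoted above. The helper name `find_mcp_drops` and both regular expressions are illustrative and not part of any existing tooling; the sample log is assembled from the signature above using the 105s value from the table.

```python
import re

# Matches the "dropped after Ns uptime" line and captures the uptime in seconds.
DROP_RE = re.compile(
    r'MCP server "(?P<server>[^"]+)": HTTP connection dropped after (?P<uptime>\d+)s uptime'
)
# Matches the "Terminal connection error n/max" lines.
TERMINAL_RE = re.compile(
    r'MCP server "(?P<server>[^"]+)": Terminal connection error (?P<n>\d+)/(?P<max>\d+)'
)


def find_mcp_drops(log_text: str) -> dict:
    """Summarise MCP connection drops found in an agent log (hypothetical helper)."""
    uptimes = [int(m.group("uptime")) for m in DROP_RE.finditer(log_text)]
    terminal = [int(m.group("n")) for m in TERMINAL_RE.finditer(log_text)]
    return {
        "drop_uptimes_s": uptimes,
        "terminal_errors": max(terminal, default=0),
        "exited_nonzero": "Process exiting with code: 1" in log_text,
    }


# Sample log built from the signature quoted in this issue.
sample = '''MCP server "github": HTTP connection dropped after 105s uptime
MCP server "github": Connection error: The operation was aborted.
MCP server "github": Terminal connection error 1/3
MCP server "github": Terminal connection error 2/3
Process exiting with code: 1'''

print(find_mcp_drops(sample))
```

A run matches the failure mode described here when `drop_uptimes_s` is non-empty and `exited_nonzero` is true.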

The design-decision-gate agent takes 8–13 minutes in total, so a connection that drops at ~100–110s always does so well within the run. The agent typically completes its ADR write and bundle push before the drop, but Claude Code still exits with a non-zero exit code once the drop occurs.

Impact

  • 3 workflow failures in a 3-hour window
  • Each failure loses the safe_outputs (although the agent's push_to_pull_request_branch inline call succeeds before the drop)
  • Automation marks PRs as having a failed gate check even when the gate agent completed its work

Proposed Remediation

Option A (Recommended): Increase GitHub MCP keepalive/connection timeout

  • Investigate whether the MCP server has a configurable connection timeout (possibly around 120s) and, if so, raise it to ≥15 minutes for long-running agent workflows

Option B: Add MCP reconnect logic to Claude Code

  • Configure the Claude Code MCP client to reconnect and resume after a terminal connection error instead of exiting
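
The reconnect-and-resume behaviour in Option B could look roughly like the sketch below. It is generic retry logic under stated assumptions, not Claude Code's actual MCP client: `TerminalConnectionError`, `call_with_reconnect`, and the `reconnect` callback are hypothetical names, and whether Claude Code exposes an equivalent hook is not established in this issue.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TerminalConnectionError(Exception):
    """Stand-in for whatever error a real MCP client raises on a dropped connection."""


def call_with_reconnect(
    call: Callable[[], T],
    reconnect: Callable[[], None],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`; on a connection error, back off, reconnect, and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TerminalConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
            reconnect()
    raise AssertionError("unreachable")
```

With a wrapper of this kind, a single dropped connection would cost one reconnect rather than the whole run; only repeated failures would propagate as an error.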

Option C: Restructure DDG prompt to complete all work within 90s

  • Limit the DDG agent to a maximum of 2–3 MCP calls so it finishes before the ~100s timeout window

Success Criteria

Design Decision Gate runs on branches complete without GitHub MCP connection errors, and audit logs show agent: success for DDG runs lasting 8 minutes or longer.
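
These criteria can be expressed as a mechanical check. The sketch below assumes a hypothetical `RunRecord` shape for audit-log entries (the real audit schema may differ) and adds one assumption of its own: at least one run of 8+ minutes must be present, so that the criterion is actually exercised.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical shape of one audit-log entry; field names are assumptions."""
    run_id: str
    duration_s: int
    agent_status: str  # e.g. "success" or "failure"
    mcp_connection_errors: int


def meets_success_criteria(runs: list[RunRecord], min_duration_s: int = 8 * 60) -> bool:
    """Check the two criteria: no MCP connection errors, and long runs succeed."""
    long_runs = [r for r in runs if r.duration_s >= min_duration_s]
    no_mcp_errors = all(r.mcp_connection_errors == 0 for r in runs)
    long_runs_ok = all(r.agent_status == "success" for r in long_runs)
    # Require at least one long run, otherwise the check passes vacuously.
    return bool(long_runs) and no_mcp_errors and long_runs_ok
```

For example, a set of runs containing a 13-minute run with `agent_status="failure"` and two connection errors would return `False`.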

Parent: #29232

Generated by [aw] Failure Investigator (6h) · ● 397.8K ·

  • expires on May 7, 2026, 7:24 PM UTC
