[repository-quality] 🎯 Repository Quality Improvement Report - MCP Integration Robustness & Error Recovery #34015
Closed
This discussion was automatically closed because it expired on 2026-05-23T13:46:23.843Z.
Analysis Date: 2026-05-22
Focus Area: MCP Integration Robustness & Error Recovery
Strategy Type: Custom
Custom Area: Yes — This focus area addresses gh-aw's unique MCP server integration patterns, which are central to the repository's mission as a workflow orchestration platform. The 5-minute inactivity timeout and error propagation challenges are specific to this project's architecture.
Executive Summary
The gh-aw codebase has extensive MCP (Model Context Protocol) integration with 170+ MCP-related Go files and 234 compiled workflows using MCP servers. While the codebase demonstrates solid error handling fundamentals (1,121 context timeout references, 202 retry instances, 80 cleanup/defer patterns), analysis reveals critical gaps in MCP-specific error recovery, observability, and resilience patterns.
Key Findings:
Impact: The 5-minute MCP inactivity timeout is a known workflow failure mode (documented in AGENTS.md), yet the codebase lacks systematic prevention, detection, and recovery mechanisms. This creates workflow fragility and debugging challenges.
Full Analysis Report
Focus Area: MCP Integration Robustness & Error Recovery
Current State Assessment
The gh-aw repository integrates MCP servers extensively for workflow orchestration, providing tools for GitHub operations, scripts execution, and AI agent interactions. The MCP integration spans multiple layers:
pkg/cli/mcp*.go (server management, inspection, validation)
pkg/workflow/mcp*.go (config, rendering, gateway setup)
234 compiled workflows (.lock.yml files) actively use MCP servers
mcp_setup_generator.go (919 lines): largest MCP file, generates setup steps
mcp_gateway_config.go (185 lines): gateway configuration
mcp_config_validation.go (224 lines): validation logic
Metrics Collected:
Findings
Strengths
CLI commands (gh aw mcp inspect, gh aw mcp list), audit capabilities, and operational runbooks
Validation code (mcp_config_validation.go, mcp_validation.go) with 224+ lines of validation logic
pkg/cli/mcp_error.go provides structured MCP error creation with JSON-RPC error codes
.github/aw/runbooks/workflow-health.md provides investigation procedures for MCP failures
Areas for Improvement
1. Reactive vs. Proactive Error Handling (Severity: High)
2. Observability Gaps (Severity: High)
Existing tools (gh aw audit, gh aw logs) don't expose MCP connection metrics (idle time, request count, error breakdown)
3. Recovery Mechanisms (Severity: Medium)
4. Test Coverage Gaps (Severity: Medium)
5. Documentation Fragmentation (Severity: Low)
Skills exist (error-recovery-patterns, debugging-workflows) but don't provide MCP-specific guidance
Detailed Analysis
MCP Connection Lifecycle Management
The documented 5-minute inactivity timeout in AGENTS.md reveals a fundamental challenge:
Current mitigation: Documentation instructs workflow authors to "use bash, not MCP tools, for build/test validation at the end of a session."
Gap: This is a manual workaround, not a systematic solution. Better approaches:
Detect context canceled errors and retry with a fresh connection
Error Taxonomy and Observability
The mcp_error.go file provides structured error creation.
But there's no centralized error taxonomy classifying MCP failures.
Opportunity: Create an MCP error classifier that categorizes errors, determines if they're retryable, and provides recovery suggestions.
Audit and Debugging Tools
Existing tools are strong:
gh aw audit <run-id>: Comprehensive run analysis including MCP failures
gh aw logs: Download and analyze workflow logs
gh aw mcp inspect: Inspect MCP server configuration for a workflow
.github/aw/runbooks/workflow-health.md
Gap: These are post-mortem tools. Missing:
Testing Strategy
Current test coverage:
Integration tests (mcp_server_stdio_integration_test.go, mcp_server_http_integration_test.go)
Gaps:
🤖 Tasks for Copilot Agent
NOTE TO PLANNER AGENT: Split the following tasks into individual work items.
Improvement Tasks
Task 1: Create MCP Error Taxonomy and Classifier
Priority: High
Estimated Effort: Medium
Focus Area: Error Recovery & Observability
Description: Create a centralized MCP error taxonomy that classifies all MCP failure modes (connection, timeout, protocol, auth, runtime) and provides structured error information including retryability, recovery suggestions, and user-facing messages. This will replace ad-hoc error handling with a systematic approach.
Acceptance Criteria:
Create pkg/cli/mcp_error_taxonomy.go with error classification types (ConnectionError, TimeoutError, ProtocolError, AuthError, RuntimeError)
Add a ClassifyMCPError(err error) *MCPErrorInfo function that categorizes errors and returns metadata (type, retryable, recovery_suggestion)
Update mcp_error.go to use the new taxonomy when creating structured errors
Add tests in pkg/cli/mcp_error_taxonomy_test.go covering all error categories and edge cases (nil errors, wrapped errors, unknown errors)
Update the error-recovery-patterns skill with MCP-specific error classification examples
Code Region: pkg/cli/mcp_error*.go
Task 2: Implement MCP Connection Keepalive Mechanism
Priority: High
Estimated Effort: Large
Focus Area: Resilience & Timeout Prevention
Description: Implement an automatic MCP connection keepalive mechanism to prevent the documented 5-minute inactivity timeout. This will eliminate the current manual workaround ("use bash for validation") and make workflows more reliable by default.
Acceptance Criteria:
Periodic keepalive request (list_tools or ping every 3-4 minutes during idle periods)
Add keepalive-interval configuration option to MCP server frontmatter (default: 240 seconds, disable: 0)
Code Region: pkg/cli/mcp_server.go, pkg/workflow/mcp_*.go
Default: 240 seconds (4 minutes). Document in docs/src/content/docs/reference/mcp-gateway.md.
Update AGENTS.md:
Add integration tests:
Use time.Sleep() with short intervals or mock time for faster tests
Follow channel lifecycle guidelines from AGENTS.md:
Use the defer close(done) pattern for cleanup signals
Validation
Run make build && make fmt after first code change
Run make agent-report-progress before PR
go test -v ./pkg/cli/mcp*keepalive* -run TestKeepalive
Task 4: Create MCP Failure Injection Test Suite
Priority: Medium
Estimated Effort: Medium
Focus Area: Testing & Resilience Validation
Description: Build a comprehensive failure injection test suite that simulates real-world MCP failure scenarios (server crashes, network partitions, timeouts, slow responses) to validate error handling, recovery, and observability mechanisms.
Acceptance Criteria:
Create pkg/cli/mcp_failure_injection_test.go with chaos engineering tests
Tests pass under the race detector (go test -race)
Code Region: pkg/cli/mcp_*test.go
Timeout progression tests:
Reconnection tests:
Validate error classification:
Check each injected failure with ClassifyMCPError(err) (from Task 1)
Run all tests with race detector:
go test -race ./pkg/cli/mcp_failure_injection_test.go
Validation
Run make build && make fmt after first code change
Run make agent-report-progress before PR
go test -v ./pkg/cli/mcp_failure_injection_test.go
go test -race ./pkg/cli/mcp_failure_injection_test.go
Consolidate content from existing docs:
.github/aw/runbooks/workflow-health.md → Detection/Troubleshooting
debugging.md → Detection/Troubleshooting
Update AGENTS.md:
Update runbook:
.github/aw/runbooks/workflow-health.md: "For MCP-specific error recovery, see MCP Error Recovery Playbook"
Follow documentation skill guidelines:
Validation
No make build needed (documentation only)
cd docs && npm run dev