[repository-quality] 🎯 Repository Quality Improvement Report - MCP Integration Robustness & Error Recovery #34015

2026-05-22T13:46:24Z

github-actions[bot]
Bot May 22, 2026

Analysis Date: 2026-05-22
Focus Area: MCP Integration Robustness & Error Recovery
Strategy Type: Custom
Custom Area: Yes — This focus area addresses gh-aw's unique MCP server integration patterns, which are central to the repository's mission as a workflow orchestration platform. The 5-minute inactivity timeout and error propagation challenges are specific to this project's architecture.

Executive Summary

The gh-aw codebase has extensive MCP (Model Context Protocol) integration with 170+ MCP-related Go files and 234 compiled workflows using MCP servers. While the codebase demonstrates solid error handling fundamentals (1,121 context timeout references, 202 retry instances, 80 cleanup/defer patterns), analysis reveals critical gaps in MCP-specific error recovery, observability, and resilience patterns.

Key Findings:

- Documentation exists but is scattered: MCP timeout guidance lives in AGENTS.md, troubleshooting docs, and operational runbooks, but there is no centralized error recovery playbook
- Limited proactive resilience: Only 25 instances of circuit breaker/backoff patterns; no systematic MCP connection health checks
- Test coverage gaps: Only 22 MCP test files test timeout/cancellation; 34 error test cases exist but lack comprehensive failure injection scenarios
- Missing observability: 171 logging statements in MCP code, but no structured error metrics or connection lifecycle tracking
- Graceful degradation absent: Only 2 graceful shutdown patterns found; no fallback mechanisms when MCP servers become unavailable

Impact: The 5-minute MCP inactivity timeout is a known workflow failure mode (documented in AGENTS.md), yet the codebase lacks systematic prevention, detection, and recovery mechanisms. This creates workflow fragility and debugging challenges.

Full Analysis Report

Focus Area: MCP Integration Robustness & Error Recovery

Current State Assessment

The gh-aw repository integrates MCP servers extensively for workflow orchestration, providing tools for GitHub operations, scripts execution, and AI agent interactions. The MCP integration spans multiple layers:

- CLI Layer: 90+ MCP-related files in `pkg/cli/mcp*.go` (server management, inspection, validation)
- Workflow Layer: 70+ MCP-related files in `pkg/workflow/mcp*.go` (config, rendering, gateway setup)
- Production Usage: 234 compiled workflows (`.lock.yml` files) actively use MCP servers
- Core Infrastructure Files:
  - `mcp_setup_generator.go` (919 lines): largest MCP file, generates setup steps
  - `mcp_gateway_config.go` (185 lines): gateway configuration
  - `mcp_config_validation.go` (224 lines): validation logic

Metrics Collected:

| Metric | Value | Status |
|--------|-------|--------|
| MCP-related Go files | 170+ | ✅ Extensive coverage |
| Workflows using MCP | 234 | ✅ High adoption |
| Context timeout handling | 1,121 instances | ✅ Good baseline |
| Retry logic instances | 202 | ⚠️ Present but unsystematic |
| Cleanup/defer patterns | 80 | ⚠️ Adequate but could improve |
| Circuit breaker/backoff | 25 instances | ❌ Limited resilience patterns |
| Health check implementations | <10 relevant | ❌ No systematic health checks |
| Graceful shutdown patterns | 2 | ❌ Insufficient lifecycle management |
| MCP timeout tests | 22 test files | ⚠️ Partial coverage |
| MCP error test cases | 34 | ⚠️ Needs expansion |
| Logging statements (MCP) | 171 | ✅ Good but lacks structure |

Findings

Strengths

- Comprehensive MCP tooling: Excellent CLI commands for MCP inspection (`gh aw mcp inspect`, `gh aw mcp list`), audit capabilities, and operational runbooks
- Documentation of known issues: The 5-minute inactivity timeout is clearly documented in AGENTS.md with workarounds ("Use bash for end-of-session validation")
- Strong validation foundation: Dedicated validation files (`mcp_config_validation.go`, `mcp_validation.go`) with 224+ lines of validation logic
- Error abstraction layer: `pkg/cli/mcp_error.go` provides structured MCP error creation with JSON-RPC error codes
- Testing mindset: 22 test files specifically test timeout/cancellation scenarios; 34 error test cases demonstrate awareness
- Operational maturity: Runbook at `.github/aw/runbooks/workflow-health.md` provides investigation procedures for MCP failures

Areas for Improvement

1. Reactive vs. Proactive Error Handling (Severity: High)

- Current approach documents the 5-minute timeout problem but relies on workflow authors to avoid it
- No automatic keepalive mechanism to prevent MCP connection timeouts during long analysis phases
- No MCP connection health monitoring or early warning system
- Missing circuit breaker patterns to prevent cascading failures when MCP servers are degraded
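
To make the circuit breaker point concrete, here is a minimal sketch of the pattern. `Breaker`, `NewBreaker`, and `ErrCircuitOpen` are hypothetical names chosen for illustration and are not existing gh-aw types; the thresholds are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrCircuitOpen is returned while the breaker is rejecting calls.
var ErrCircuitOpen = errors.New("circuit open: MCP server marked degraded")

// Breaker is a hypothetical minimal circuit breaker; it is not a gh-aw type.
type Breaker struct {
	mu          sync.Mutex
	failures    int           // consecutive failures seen so far
	maxFailures int           // consecutive failures that open the circuit
	cooldown    time.Duration // how long calls are rejected once open
	openedAt    time.Time     // when the circuit last opened
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Do runs fn unless the circuit is open. Once the cooldown has elapsed, calls
// are let through again: a success closes the circuit, a failure re-opens it.
func (b *Breaker) Do(fn func() error) error {
	b.mu.Lock()
	open := b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return ErrCircuitOpen
	}

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures = 0
		return nil
	}
	b.failures++
	if b.failures >= b.maxFailures {
		b.openedAt = time.Now()
	}
	return err
}

func main() {
	b := NewBreaker(2, time.Minute)
	refused := func() error { return errors.New("connection refused") }
	for i := 1; i <= 3; i++ {
		fmt.Printf("call %d: %v\n", i, b.Do(refused))
	}
}
```

In this sketch the third call is rejected locally without reaching the degraded server, which is the property that limits cascading failures.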

2. Observability Gaps (Severity: High)

- 171 logging statements exist but lack structured metrics (counters, histograms, error rates)
- No MCP connection lifecycle tracing (connect, idle, timeout, disconnect events)
- Missing centralized error taxonomy for MCP failures (connection, timeout, auth, protocol errors)
- Audit tools (`gh aw audit`, `gh aw logs`) don't expose MCP connection metrics (idle time, request count, error breakdown)

3. Recovery Mechanisms (Severity: Medium)

- Only 2 graceful shutdown patterns found; most MCP operations fail-fast without recovery
- No automatic retry with exponential backoff for transient MCP errors
- Missing fallback strategies when MCP tools are unavailable (e.g., fallback to direct GitHub CLI)
- No timeout budget tracking to warn agents when approaching the 5-minute inactivity threshold
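
As an illustration of retry with exponential backoff for transient errors, the following sketch doubles the delay after each failed attempt and stops early for errors the caller marks as non-retryable. `retryWithBackoff`, `errTransient`, and `demoRetry` are hypothetical names introduced here, not gh-aw functions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a retryable MCP failure (an assumption).
var errTransient = errors.New("connection reset")

// retryWithBackoff calls op up to attempts times, doubling the delay after
// each retryable failure. It stops early on success, on a non-retryable
// error, or when ctx is cancelled.
func retryWithBackoff(ctx context.Context, attempts int, base time.Duration,
	isRetryable func(error) bool, op func(context.Context) error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = op(ctx); err == nil || !isRetryable(err) {
			return err
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return errors.Join(err, ctx.Err())
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// demoRetry fails twice with a transient error, then succeeds.
func demoRetry() (int, error) {
	calls := 0
	op := func(context.Context) error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	retryable := func(err error) bool { return errors.Is(err, errTransient) }
	err := retryWithBackoff(context.Background(), 5, time.Millisecond, retryable, op)
	return calls, err
}

func main() {
	calls, err := demoRetry()
	fmt.Printf("calls=%d err=%v\n", calls, err)
}
```

A production version would probably add jitter and a maximum delay; both are left out to keep the sketch short.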

4. Test Coverage Gaps (Severity: Medium)

- Only 22 test files specifically test MCP timeout/cancellation scenarios out of 170+ MCP files
- Missing failure injection tests (simulate MCP server crashes, network partitions, slow responses)
- No integration tests for MCP reconnection after timeout
- Missing chaos engineering tests for MCP gateway degradation

5. Documentation Fragmentation (Severity: Low)

- MCP error guidance scattered across AGENTS.md, docs/troubleshooting, runbooks, and error-recovery-patterns skill
- No single "MCP Error Recovery Playbook" consolidating prevention, detection, and recovery strategies
- Skills exist (error-recovery-patterns, debugging-workflows) but don't provide MCP-specific guidance

Detailed Analysis

MCP Connection Lifecycle Management

The documented 5-minute inactivity timeout in AGENTS.md reveals a fundamental challenge:

> MCP connections (HTTP/WebSocket transports) time out after approximately 5 minutes of inactivity.
> Workflows with long file-exploration or analysis phases routinely exceed this threshold.
> When the agent finally attempts an end-of-session validation call via an MCP tool,
> the MCP transport has been torn down, resulting in: MCP error -32003: context canceled

Current mitigation: Documentation instructs workflow authors to "use bash, not MCP tools, for build/test validation at the end of a session."

Gap: This is a manual workaround, not a systematic solution. Better approaches:

- Automatic keepalive: Send periodic MCP ping/list_tools requests during idle periods
- Timeout budget tracking: Track idle time and warn agents before reaching 5-minute threshold
- Lazy reconnection: Automatically reconnect MCP transport on first use after timeout
- Graceful degradation: Detect context canceled errors and retry with fresh connection
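
The keepalive and idle-tracking ideas above can be sketched as a small goroutine. This is an illustration under assumptions: `idleTracker`, `runKeepalive`, and `demoKeepalive` are hypothetical names, and `ping` is a stand-in for whatever lightweight MCP request the client would actually send.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// idleTracker records the last time a connection was used.
type idleTracker struct {
	mu   sync.Mutex
	last time.Time
}

func (t *idleTracker) touch() {
	t.mu.Lock()
	t.last = time.Now()
	t.mu.Unlock()
}

func (t *idleTracker) idle() time.Duration {
	t.mu.Lock()
	defer t.mu.Unlock()
	return time.Since(t.last)
}

// runKeepalive pings once the connection has been idle for interval and
// returns when ctx is cancelled. An interval of 0 disables keepalive.
func runKeepalive(ctx context.Context, t *idleTracker, interval time.Duration,
	ping func(context.Context) error) {
	if interval <= 0 {
		return
	}
	for {
		wait := interval - t.idle()
		if wait <= 0 {
			if err := ping(ctx); err != nil {
				fmt.Println("keepalive failed:", err)
			} else {
				t.touch()
			}
			wait = interval // also avoids a tight loop after a failed ping
		}
		timer := time.NewTimer(wait)
		select {
		case <-ctx.Done():
			timer.Stop()
			return
		case <-timer.C:
		}
	}
}

// demoKeepalive leaves a connection idle and counts the pings that were sent.
func demoKeepalive() int {
	ctx, cancel := context.WithCancel(context.Background())
	t := &idleTracker{}
	t.touch()
	pings := 0
	done := make(chan struct{})
	go func() {
		defer close(done) // signals that the keepalive goroutine has exited
		runKeepalive(ctx, t, 20*time.Millisecond, func(context.Context) error {
			pings++
			return nil
		})
	}()
	time.Sleep(110 * time.Millisecond)
	cancel()
	<-done // pings is only read after the goroutine is gone
	return pings
}

func main() {
	fmt.Println("keepalive pings sent:", demoKeepalive())
}
```

Real requests would call `touch` as well, so the timer only fires after genuine inactivity; the demo uses a 20 ms interval in place of a multi-minute one.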

Error Taxonomy and Observability

The mcp_error.go file provides structured error creation:

func newMCPError(code int64, msg string, data any) error {
    return &jsonrpc.Error{Code: code, Message: msg, Data: mcpErrorData(data)}
}

But there's no centralized error taxonomy classifying MCP failures:

- Connection errors: Server unreachable, network partition, DNS failure
- Timeout errors: Inactivity timeout (5 min), request timeout, context deadline exceeded
- Protocol errors: Invalid JSON-RPC, schema mismatch, version incompatibility
- Auth errors: Missing credentials, expired tokens, insufficient permissions
- Runtime errors: Server crash, out of memory, tool execution failure

Opportunity: Create an MCP error classifier that categorizes errors, determines if they're retryable, and provides recovery suggestions.
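
A classifier along these lines might look like the sketch below. The category names and `ClassifyMCPError` follow the proposal in this report; the `UnknownError` fallback, the `containsAny` and `classifyMessage` helpers, the matched substrings, and the decision to treat `context.Canceled` as a timeout (because the inactivity timeout surfaces as "context canceled") are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

type Category string

const (
	ConnectionError Category = "ConnectionError"
	TimeoutError    Category = "TimeoutError"
	ProtocolError   Category = "ProtocolError"
	AuthError       Category = "AuthError"
	RuntimeError    Category = "RuntimeError"
	UnknownError    Category = "UnknownError" // fallback category (assumption)
)

// MCPErrorInfo carries the classification metadata for one error.
type MCPErrorInfo struct {
	Category           Category
	Retryable          bool
	RecoverySuggestion string
	Original           error
}

func containsAny(s string, subs ...string) bool {
	for _, sub := range subs {
		if strings.Contains(s, sub) {
			return true
		}
	}
	return false
}

// ClassifyMCPError inspects err (including wrapped errors) and its message.
// Substring matching is deliberately crude; "401" would also match a port
// number, so a real classifier should prefer typed errors where available.
func ClassifyMCPError(err error) *MCPErrorInfo {
	if err == nil {
		return nil
	}
	info := &MCPErrorInfo{Category: UnknownError, Original: err}
	msg := strings.ToLower(err.Error())
	switch {
	case errors.Is(err, context.Canceled), errors.Is(err, context.DeadlineExceeded),
		containsAny(msg, "context canceled", "timeout", "deadline exceeded"):
		info.Category, info.Retryable = TimeoutError, true
		info.RecoverySuggestion = "reconnect the MCP transport and retry"
	case containsAny(msg, "connection refused", "connection reset", "no such host"):
		info.Category, info.Retryable = ConnectionError, true
		info.RecoverySuggestion = "check that the MCP server is reachable, then retry"
	case containsAny(msg, "401", "403", "unauthorized", "forbidden"):
		info.Category = AuthError
		info.RecoverySuggestion = "check credentials and token permissions"
	case containsAny(msg, "parse error", "invalid request", "method not found"):
		info.Category = ProtocolError
		info.RecoverySuggestion = "check client and server protocol versions"
	case containsAny(msg, "out of memory", "signal: killed"):
		info.Category = RuntimeError
		info.RecoverySuggestion = "inspect the MCP server logs"
	}
	return info
}

// classifyMessage is a convenience wrapper for plain error strings.
func classifyMessage(msg string) Category {
	return ClassifyMCPError(errors.New(msg)).Category
}

func main() {
	wrapped := fmt.Errorf("tools/call failed: %w", context.Canceled)
	fmt.Println(ClassifyMCPError(wrapped).Category)
	fmt.Println(classifyMessage("dial tcp: connection refused"))
}
```

The `Retryable` flag is what an automated recovery path would branch on, for example to decide whether a backoff-and-retry loop is worth entering at all.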

Audit and Debugging Tools

Existing tools are strong:

- `gh aw audit <run-id>`: Comprehensive run analysis including MCP failures
- `gh aw logs`: Download and analyze workflow logs
- `gh aw mcp inspect`: Inspect MCP server configuration for a workflow
- Operational runbook: `.github/aw/runbooks/workflow-health.md`

Gap: These are post-mortem tools. Missing:

- Live MCP health dashboard: Real-time view of MCP connection states across active workflows
- MCP connection profiler: Track request counts, idle time, error rates per MCP server
- Proactive alerting: Detect MCP degradation patterns before workflow failures
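
A connection profiler of the kind described above needs little more than a mutex-guarded map of per-server counters. The sketch below is illustrative only; `Collector`, `RecordRequest`, and `Summary` are hypothetical names, not existing gh-aw APIs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// connStats holds the counters tracked for one MCP server.
type connStats struct {
	requests     int
	errors       int
	lastActivity time.Time
}

// Collector aggregates per-server connection metrics; safe for concurrent use.
type Collector struct {
	mu    sync.Mutex
	conns map[string]*connStats
}

func NewCollector() *Collector {
	return &Collector{conns: make(map[string]*connStats)}
}

// RecordRequest counts one request to server and whether it succeeded.
func (c *Collector) RecordRequest(server string, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	s := c.conns[server]
	if s == nil {
		s = &connStats{}
		c.conns[server] = s
	}
	s.requests++
	if !ok {
		s.errors++
	}
	s.lastActivity = time.Now()
}

// Summary is a point-in-time snapshot for one server.
type Summary struct {
	Requests int
	Errors   int
	Idle     time.Duration // time since the last recorded request
}

// Summary reports the snapshot for server; the bool is false if it is unknown.
func (c *Collector) Summary(server string) (Summary, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	s, found := c.conns[server]
	if !found {
		return Summary{}, false
	}
	return Summary{Requests: s.requests, Errors: s.errors, Idle: time.Since(s.lastActivity)}, true
}

func main() {
	c := NewCollector()
	c.RecordRequest("github", true)
	c.RecordRequest("github", false)
	if s, ok := c.Summary("github"); ok {
		fmt.Printf("requests=%d errors=%d idle=%s\n", s.Requests, s.Errors, s.Idle)
	}
}
```

The `Idle` value is the number an early-warning check would compare against the 5-minute threshold.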

Testing Strategy

Current test coverage:

- 22 test files test timeout/cancellation (13% of 170 MCP files)
- 34 error test cases exist
- Strong integration tests for stdio, HTTP, JSON transports (`mcp_server_stdio_integration_test.go`, `mcp_server_http_integration_test.go`)

Gaps:

- No failure injection tests (kill MCP server mid-request, simulate network partition)
- No timeout progression tests (verify behavior at 4:30, 4:50, 5:00, 5:10 idle time)
- No reconnection tests (verify MCP client can recover after timeout)
- No chaos engineering tests (random MCP server crashes, slow responses, partial failures)
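
Failure injection for the HTTP transport can be approximated with the standard library's `net/http/httptest`. The sketch below is one possible shape, assuming a plain JSON-over-HTTP exchange rather than the real MCP client; `newFailingServer`, `probe`, `isTimeout`, and the mode strings are hypothetical.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// newFailingServer returns a test server that misbehaves according to mode.
func newFailingServer(mode string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch mode {
		case "slow":
			// Stall until the client gives up (or two seconds pass).
			select {
			case <-time.After(2 * time.Second):
			case <-r.Context().Done():
			}
		case "crash":
			// Drop the connection without writing any response.
			if hj, ok := w.(http.Hijacker); ok {
				if conn, _, err := hj.Hijack(); err == nil {
					conn.Close()
				}
			}
		case "protocol_error":
			fmt.Fprint(w, "{not valid json-rpc")
		default:
			fmt.Fprint(w, `{"jsonrpc":"2.0","id":1,"result":{}}`)
		}
	}))
}

// probe sends one request to a server in the given mode and returns any
// transport or decoding error.
func probe(mode string, timeout time.Duration) error {
	srv := newFailingServer(mode)
	defer srv.Close()
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, srv.URL, nil)
	if err != nil {
		return err
	}
	resp, err := srv.Client().Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	var msg map[string]any
	return json.NewDecoder(resp.Body).Decode(&msg)
}

// isTimeout reports whether err was caused by the request deadline.
func isTimeout(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	for _, mode := range []string{"ok", "slow", "crash", "protocol_error"} {
		err := probe(mode, 100*time.Millisecond)
		fmt.Printf("%-15s err=%v timeout=%v\n", mode, err, isTimeout(err))
	}
}
```

Short deadlines keep the whole run well under a second, which is the same trick the timeout progression tests would need in place of real 5-minute waits.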

🤖 Tasks for Copilot Agent

NOTE TO PLANNER AGENT: Split the following tasks into individual work items.

Improvement Tasks

Task 1: Create MCP Error Taxonomy and Classifier

Priority: High
Estimated Effort: Medium
Focus Area: Error Recovery & Observability

Description: Create a centralized MCP error taxonomy that classifies all MCP failure modes (connection, timeout, protocol, auth, runtime) and provides structured error information including retryability, recovery suggestions, and user-facing messages. This will replace ad-hoc error handling with a systematic approach.

Acceptance Criteria:

- [ ] Create `pkg/cli/mcp_error_taxonomy.go` with error classification types (ConnectionError, TimeoutError, ProtocolError, AuthError, RuntimeError)
- [ ] Add `ClassifyMCPError(err error) *MCPErrorInfo` function that categorizes errors and returns metadata (type, retryable, recovery_suggestion)
- [ ] Update `mcp_error.go` to use the new taxonomy when creating structured errors
- [ ] Add unit tests in `pkg/cli/mcp_error_taxonomy_test.go` covering all error categories and edge cases (nil errors, wrapped errors, unknown errors)
- [ ] Update error-recovery-patterns skill with MCP-specific error classification examples

Code Region: pkg/cli/mcp_error*.go

You are improving MCP error handling in gh-aw by creating a centralized error taxonomy.

## Context
Currently, MCP errors are created ad-hoc using `newMCPError()` in `pkg/cli/mcp_error.go`. There's no systematic classification of error types, no structured metadata about retryability, and no recovery suggestions. This makes debugging harder and prevents automated error recovery.

## Task
1. Create `pkg/cli/mcp_error_taxonomy.go` with:
   - Error category types (ConnectionError, TimeoutError, ProtocolError, AuthError, RuntimeError)
   - `MCPErrorInfo` struct with fields: category, retryable (bool), recovery_suggestion (string), original_error
   - `ClassifyMCPError(err error) *MCPErrorInfo` function that inspects errors and returns classification
   - Support for wrapped errors (use `errors.Unwrap()` and `errors.As()`)
   - Handle common error patterns: "context canceled", "connection refused", "timeout", "401", "403", etc.

2. Update `pkg/cli/mcp_error.go`:
   - Add optional taxonomy integration to `newMCPError()` (backward compatible)
   - Include error category in the JSON-RPC error data field when available

3. Create comprehensive tests in `pkg/cli/mcp_error_taxonomy_test.go`:
   - Test each error category classification
   - Test wrapped error handling
   - Test unknown/unclassified errors (should return fallback category)
   - Test nil error handling
   - Table-driven tests with realistic error messages

4. Update `.github/skills/error-recovery-patterns/SKILL.md`:
   - Add section "MCP Error Classification" with examples of each category
   - Show how to use `ClassifyMCPError()` for automated recovery decisions
   - Provide example error messages and their classifications

## Validation
- Run `make build && make fmt` after first substantial code change (Checkpoint 1)
- Run `make agent-report-progress` before creating PR (Checkpoint 2)
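- Verify all tests pass: `go test -v ./pkg/cli/ -run MCPError` (pass the package directory rather than a file glob so the rest of the package compiles; adjust the `-run` pattern to the actual test names)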
- Check that error classification examples in skill are accurate and helpful

Task 2: Implement MCP Connection Keepalive Mechanism

Priority: High
Estimated Effort: Large
Focus Area: Resilience & Timeout Prevention

Description: Implement an automatic MCP connection keepalive mechanism to prevent the documented 5-minute inactivity timeout. This will eliminate the current manual workaround ("use bash for validation") and make workflows more reliable by default.

Acceptance Criteria:

- [ ] Add configurable keepalive mechanism to MCP client connections (send list_tools or ping every 3-4 minutes during idle periods)
- [ ] Track MCP connection idle time and automatically trigger keepalive before timeout threshold
- [ ] Add `keepalive-interval` configuration option to MCP server frontmatter (default: 240 seconds, disable: 0)
- [ ] Update AGENTS.md to document the keepalive feature and remove reliance on manual bash workaround
- [ ] Add debug logging for keepalive events (when sent, success/failure, connection state)
- [ ] Create integration tests that verify keepalive prevents timeout during 6+ minute idle periods

Code Region: pkg/cli/mcp_server.go, pkg/workflow/mcp_*.go

You are implementing automatic MCP connection keepalive to prevent the 5-minute inactivity timeout.

## Context
MCP connections (HTTP/WebSocket) time out after ~5 minutes of inactivity. This is documented in AGENTS.md:

> MCP connections time out after approximately 5 minutes of inactivity.
> When the agent finally attempts an end-of-session validation call via an MCP tool,
> the MCP transport has been torn down, resulting in: MCP error -32003: context canceled

Current mitigation: Documentation tells workflow authors to "use bash, not MCP tools, for validation." This is a manual workaround that's easy to forget.

## Task
1. Add keepalive mechanism to MCP client connections:
   - Track last activity timestamp for each MCP connection
   - Start a background goroutine that monitors idle time
   - Send lightweight `list_tools` request every 3-4 minutes if idle
   - Stop keepalive goroutine when connection closes (use context cancellation)
   - Log keepalive events with `logger.New("cli:mcp_keepalive")`

2. Make keepalive configurable in workflow frontmatter:
   ```yaml
   tools:
     github:
       toolsets: [default]
       keepalive-interval: 240  # seconds, 0 to disable
   ```

   Default: 240 seconds (4 minutes). Document in `docs/src/content/docs/reference/mcp-gateway.md`.

3. Update `AGENTS.md`:
   - Update "MCP Connection Inactivity Timeout" section to explain keepalive feature
   - Change guidance from "always use bash" to "keepalive is enabled by default; use bash if keepalive is disabled"
   - Document how to configure `keepalive-interval`

4. Add integration tests:
   - Test that keepalive prevents timeout during 6-minute idle period
   - Test that keepalive can be disabled (verify timeout occurs)
   - Test that keepalive stops when connection closes (no goroutine leak)
   - Use `time.Sleep()` with short intervals or mock time for faster tests

5. Follow channel lifecycle guidelines from `AGENTS.md`:
   - Document ownership of keepalive goroutine channels
   - Use `defer close(done)` pattern for cleanup signals
   - Ensure proper context cancellation when connection closes

## Validation
- Run `make build && make fmt` after first code change
- Run `make agent-report-progress` before PR
- Verify integration tests pass: `go test -v ./pkg/cli/ -run TestKeepalive`
- Test manually: create workflow with 6-minute analysis phase, verify no timeout


---

#### Task 3: Add MCP Connection Lifecycle Observability

**Priority**: Medium
**Estimated Effort**: Medium
**Focus Area**: Observability & Debugging

**Description:** Enhance MCP observability by adding structured lifecycle event tracking (connect, idle, timeout, disconnect) and exposing connection metrics in audit/logs commands. This will make MCP issues easier to diagnose and prevent.

**Acceptance Criteria:**
- [ ] Add structured lifecycle logging for MCP connections (connect, first_request, idle_start, keepalive_sent, timeout_warning, disconnect)
- [ ] Track connection metrics: request count, idle time, total duration, error count per connection
- [ ] Extend `gh aw audit` output to include MCP connection summary (connections opened, idle time distribution, timeout events)
- [ ] Add MCP connection profiler output to `gh aw logs` (show connection states, request rates, error rates per MCP server)
- [ ] Create `pkg/cli/mcp_telemetry.go` for centralized connection state tracking
- [ ] Add unit tests for telemetry collection and aggregation

**Code Region:** `pkg/cli/mcp_server.go`, `pkg/cli/audit*.go`, `pkg/cli/logs*.go`

You are adding MCP connection lifecycle observability to improve debugging and monitoring.

## Context
Currently, MCP connections have 171 logging statements but no structured metrics. When debugging timeout issues, there's no visibility into:
- How long connections have been idle
- How many requests were sent before timeout
- When keepalive was triggered
- Connection lifecycle progression

This makes it hard to diagnose whether timeouts are due to genuine inactivity or keepalive failures.

## Task
1. Create `pkg/cli/mcp_telemetry.go` with:
   - `MCPConnectionState` struct tracking: server_name, started_at, last_activity, request_count, error_count, state (connected/idle/timeout/disconnected)
   - `MCPTelemetryCollector` that aggregates connection states
   - Thread-safe state updates using sync.Mutex
   - Methods: `RecordConnect()`, `RecordRequest()`, `RecordIdle()`, `RecordTimeout()`, `RecordDisconnect()`
   - `GetConnectionSummary()` returning aggregated metrics

2. Instrument MCP connection lifecycle:
   - Call telemetry methods at key lifecycle events in `pkg/cli/mcp_server.go`
   - Add lifecycle event logging: `mcpLifecycleLog.Printf("Connection opened: %s", serverName)`
   - Track idle time progression: log at 3 min, 4 min, 4.5 min idle
   - Log warning at 4:30 idle: "Approaching 5-minute timeout, triggering keepalive"

3. Extend `gh aw audit` to show MCP connection summary:
   - Add section "MCP Connection Summary" to audit output
   - Show: total connections, avg idle time, max idle time, timeout events, error rate
   - Use console formatting for readability

4. Extend `gh aw logs` to show MCP connection profiler:
   - Add `--mcp-profile` flag to enable detailed MCP connection analysis
   - Output: per-server request counts, idle time histogram, error breakdown
   - Format as table using existing console rendering patterns

5. Add comprehensive tests:
   - Unit tests for `MCPTelemetryCollector` state transitions
   - Test concurrent updates (use `go test -race`)
   - Integration test: verify audit output includes connection summary
   - Test that telemetry doesn't leak memory (connections are cleaned up)

## Validation
- Run `make build && make fmt` after first code change
- Run `make agent-report-progress` before PR
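- Run with race detector: `go test -race ./pkg/cli/ -run Telemetry` (pass the package directory rather than a file glob so the rest of the package compiles; adjust the `-run` pattern to the actual test names)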
- Verify audit output is readable and accurate

Task 4: Create MCP Failure Injection Test Suite

Priority: Medium
Estimated Effort: Medium
Focus Area: Testing & Resilience Validation

Description: Build a comprehensive failure injection test suite that simulates real-world MCP failure scenarios (server crashes, network partitions, timeouts, slow responses) to validate error handling, recovery, and observability mechanisms.

Acceptance Criteria:

- [ ] Create `pkg/cli/mcp_failure_injection_test.go` with chaos engineering tests
- [ ] Test scenarios: MCP server crash mid-request, network timeout, slow response (>30s), connection refused, protocol error
- [ ] Verify timeout progression: test behavior at 4:30, 4:50, 5:00, 5:10 idle time (use mock time or accelerated timeouts)
- [ ] Test reconnection after timeout: verify client can recover and retry
- [ ] Validate error classification: each failure scenario should produce correct error category
- [ ] All tests must pass with race detector enabled (`go test -race`)

Code Region: pkg/cli/mcp_*test.go

You are creating a failure injection test suite to validate MCP resilience.

## Context
Current test coverage:
- 22 test files test timeout/cancellation (13% of 170 MCP files)
- 34 error test cases exist
- Strong integration tests for normal operation

Gaps:
- No failure injection (simulate crashes, network issues)
- No timeout progression tests (what happens at 4:30, 4:50, 5:00 idle?)
- No reconnection tests (can client recover after timeout?)
- No chaos engineering (random failures, partial degradation)

## Task
1. Create `pkg/cli/mcp_failure_injection_test.go` with:
   - Helper function `newFailingMCPServer(failureMode string)` that returns test server simulating failures
   - Failure modes: "crash", "timeout", "slow", "refused", "protocol_error"
   - Use httptest.Server for HTTP transport tests
   - Use channels for simulating stdio transport failures

2. Test scenarios (use table-driven tests):
   ```go
   tests := []struct {
       name          string
       failureMode   string
       expectError   bool
       errorCategory string
   }{
       {"server crash mid-request", "crash", true, "ConnectionError"},
       {"network timeout", "timeout", true, "TimeoutError"},
       {"slow response >30s", "slow", true, "TimeoutError"},
       {"connection refused", "refused", true, "ConnectionError"},
       {"invalid JSON-RPC", "protocol_error", true, "ProtocolError"},
   }
   ```

3. Timeout progression tests:
   - Create test MCP client with configurable idle timeout (use 10s for tests, not 5min)
   - Simulate requests at: 0s, 8s (before timeout), verify no issue
   - Simulate 11s idle, then request → verify timeout error
   - Verify keepalive prevents timeout (if implemented)

4. Reconnection tests:
   - Simulate timeout, then send new request
   - Verify client detects stale connection and reconnects
   - Verify second request succeeds

5. Validate error classification:
   - For each failure scenario, call `ClassifyMCPError(err)` (from Task 1)
   - Assert error category matches expected (ConnectionError, TimeoutError, etc.)

6. Run all tests with the race detector: `go test -race ./pkg/cli/ -run FailureInjection` (this assumes the test names contain `FailureInjection`; passing a single `_test.go` file would not compile against the rest of the package)

## Validation
- Run `make build && make fmt` after first code change
- Run `make agent-report-progress` before PR
- All tests pass: `go test -v ./pkg/cli/ -run FailureInjection`
- Race detector clean: `go test -race ./pkg/cli/ -run FailureInjection`
- Tests complete in <30 seconds (use short timeouts, not real 5-minute waits)


---

#### Task 5: Consolidate MCP Error Recovery Documentation

**Priority**: Low
**Estimated Effort**: Small
**Focus Area**: Documentation & Knowledge Sharing

**Description:** Create a centralized "MCP Error Recovery Playbook" that consolidates scattered guidance from AGENTS.md, troubleshooting docs, runbooks, and skills into a single authoritative reference for preventing, detecting, and recovering from MCP failures.

**Acceptance Criteria:**
- [ ] Create `docs/src/content/docs/reference/mcp-error-recovery.md` as the single source of truth for MCP error handling
- [ ] Include sections: Error Taxonomy, Prevention (keepalive, timeout budgets), Detection (audit tools, logs), Recovery (retry strategies, fallbacks)
- [ ] Consolidate guidance from AGENTS.md "MCP Connection Inactivity Timeout", runbook "workflow-health.md", and skill "error-recovery-patterns"
- [ ] Add cross-references from existing docs to the new playbook
- [ ] Include code examples, workflow frontmatter examples, and CLI command examples
- [ ] Update AGENTS.md to reference the playbook instead of duplicating content

**Code Region:** `docs/src/content/docs/reference/mcp-error-recovery.md`, `AGENTS.md`, `.github/aw/runbooks/workflow-health.md`

You are consolidating MCP error recovery documentation into a single playbook.

## Context
MCP error guidance is scattered across:
- AGENTS.md: "MCP Connection Inactivity Timeout" section
- docs/troubleshooting/debugging.md: MCP debugging steps
- .github/aw/runbooks/workflow-health.md: MCP failure investigation
- .github/skills/error-recovery-patterns/SKILL.md: General error handling
- .github/skills/debugging-workflows/SKILL.md: Workflow debugging

This fragmentation makes it hard to find comprehensive guidance. Need one authoritative reference.

## Task
1. Create `docs/src/content/docs/reference/mcp-error-recovery.md` with structure:
   ```markdown
   ---
   title: MCP Error Recovery Playbook
   description: Comprehensive guide to preventing, detecting, and recovering from MCP failures
   sidebar:
     order: 260
   ---

   ## Overview
   [Brief intro to MCP integration and common failure modes]

   ## Error Taxonomy
   [Table of error categories from Task 1, with examples]

   ## Prevention
   ### Keepalive Configuration
   [Document keepalive feature from Task 2]
   ### Timeout Budget Awareness
   [Best practices for long-running workflows]
   ### Workflow Design Patterns
   [Interleave MCP calls with bash commands to reset idle timer]

   ## Detection
   ### Audit Tools
   [How to use `gh aw audit` to spot MCP issues]
   ### Log Analysis
   [Patterns to grep for in `gh aw logs` output]
   ### Connection Telemetry
   [Reading connection lifecycle events from Task 3]

   ## Recovery
   ### Automatic Retry
   [When and how gh-aw retries MCP errors]
   ### Manual Recovery
   [Steps to recover from timeout: restart workflow, use bash alternative]
   ### Fallback Strategies
   [Graceful degradation when MCP unavailable]

   ## Troubleshooting
   ### Common Scenarios
   [Runbook-style procedures for specific error messages]
   ### Debug Checklist
   [Step-by-step debugging flowchart]

   ## See Also
   - [Debugging Workflows](/troubleshooting/debugging)
   - [MCP Gateway Reference](/reference/mcp-gateway)
   - [Error Recovery Patterns skill](/.github/skills/error-recovery-patterns/SKILL.md)
   ```

2. Consolidate content from existing docs:
   - Copy "MCP Connection Inactivity Timeout" section from AGENTS.md → Prevention section
   - Copy investigation steps from `.github/aw/runbooks/workflow-health.md` → Detection/Troubleshooting
   - Extract MCP-specific guidance from debugging.md → Detection/Troubleshooting
   - Reference error-recovery-patterns skill for general patterns

3. Update `AGENTS.md`:
   - Replace detailed MCP timeout guidance with: "See MCP Error Recovery Playbook for comprehensive guidance."
   - Keep the critical rule: "Use bash for end-of-session validation" (until Task 2 keepalive is deployed)

4. Update runbook:
   - Add note at top of `.github/aw/runbooks/workflow-health.md`: "For MCP-specific error recovery, see MCP Error Recovery Playbook"

5. Follow documentation skill guidelines:
   - Use Diataxis framework (this is "Reference" + "How-to Guide")
   - Use GitHub Flavored Markdown
   - Add front matter with title, description, sidebar order
   - Use h3 (###) or lower for headers (h1 is title, h2 reserved)
   - Include code examples and workflow snippets

## Validation
- No `make build` needed (documentation only)
- Preview docs locally: `cd docs && npm run dev`
- Verify all cross-references resolve correctly
- Check that consolidated content is cohesive (no duplicates or contradictions)


---

</details>

---

## 📊 Historical Context

<details>
<summary>Previous Focus Areas</summary>

| Date | Focus Area | Type | Custom | Key Outcomes |
|------|------------|------|--------|--------------|
| 2026-05-20 | Large File Refactoring & Maintainability | Custom | Y | Identified 20 files >800 lines; 5 tasks for modular refactoring |
| 2026-05-21 | Error Message Quality & User Experience | Custom | Y | Console formatting adoption, output stream compliance; 5 tasks |
| 2026-05-22 | MCP Integration Robustness & Error Recovery | Custom | Y | Error taxonomy, keepalive, observability; 5 tasks |

</details>

---

## 🎯 Recommendations

### Immediate Actions (This Week)
1. **Task 1: MCP Error Taxonomy** — Priority: High — Provides foundation for all other improvements; enables systematic error classification and recovery
2. **Task 3: Connection Observability** — Priority: Medium — Improves debugging immediately; helps validate keepalive effectiveness once implemented

### Short-term Actions (This Month)
1. **Task 2: Keepalive Mechanism** — Priority: High — Eliminates the 5-minute timeout problem; makes workflows reliable by default
2. **Task 4: Failure Injection Tests** — Priority: Medium — Validates resilience improvements; catches regressions early

### Long-term Actions (This Quarter)
1. **Task 5: Documentation Consolidation** — Priority: Low — Knowledge sharing; reduces onboarding friction for new contributors
2. **Circuit breaker patterns**: Implement systematic circuit breakers for MCP connections to prevent cascading failures
3. **MCP health dashboard**: Build real-time monitoring for production MCP usage across all workflows

---

## 📈 Success Metrics

- **MCP timeout failures**: Current: documented issue → Target: <1% of workflow runs (via keepalive)
- **MCP error classification**: Current: ad-hoc → Target: 100% classified with recovery suggestions
- **Test coverage (MCP error scenarios)**: Current: 34 test cases → Target: 100+ covering all failure modes
- **Mean time to debug MCP issues**: Current: ~30 min (manual log analysis) → Target: <5 min (via telemetry)
- **Documentation centralization**: Current: scattered across 5+ docs → Target: single playbook

---

## Next Steps

1. Review and prioritise the tasks above
2. Assign tasks to Copilot coding agent via planner agent
3. Track progress on improvement items
4. Re-evaluate this focus area in 2 weeks (validate keepalive effectiveness)

---

*Generated by Repository Quality Improvement Agent*
*Next analysis: 2026-05-23 — Focus area selected by diversity algorithm*




> Generated by [⚡ Repository Quality Improvement Agent](https://github.com/github/gh-aw/actions/runs/26291026265) · ● 1.3M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Frepository-quality-improver%22&type=discussions)
> - [x] expires <!-- gh-aw-expires: 2026-05-23T13:46:23.843Z --> on May 23, 2026, 1:46 PM UTC

<!-- gh-aw-agentic-workflow: Repository Quality Improvement Agent, engine: copilot, version: 1.0.48, model: claude-sonnet-4.5, id: 26291026265, workflow_id: repository-quality-improver, run: https://github.com/github/gh-aw/actions/runs/26291026265 -->

<!-- gh-aw-workflow-id: repository-quality-improver -->
<!-- gh-aw-workflow-call-id: github/gh-aw/repository-quality-improver -->

2026-05-23T14:13:45Z

github-actions[bot]
Bot May 23, 2026
Author

This discussion was automatically closed because it expired on 2026-05-23T13:46:23.843Z.

Closed by Workflow
