Copilot AI commented Jan 3, 2026

Addresses MCP error -32603 failures in the static analysis workflow by fixing three causes: insufficient health checking after server startup, timeouts too tight for 146 workflows, and missing retry logic.

Changes

MCP Server Health Check (.github/workflows/shared/mcp/gh-aw.md)

  • Replace 2-second sleep with 15-second TCP connection test loop
  • Verify server accepts connections on port 8765, not just process existence
  • Add progress feedback and fail-fast on unexpected death

Timeouts (.github/workflows/static-analysis-report.md)

  • Tool timeout: 300s → 600s
  • Workflow timeout: 30min → 45min
  • Rationale: 146 workflows × ~15s ≈ 2,190s ≈ 36.5min needed
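
The rationale is simple arithmetic, and a short sketch makes the headroom explicit. The function name `estimate_scan_minutes` is hypothetical, and the ~15s per compile is the estimate quoted in this PR, not a measured value.

```shell
# Hypothetical helper; the name and the round-up choice are illustrative.
# Prints the estimated full-scan time in whole minutes, rounded up.
estimate_scan_minutes() {
  local workflows="$1" seconds_each="$2"
  local total=$(( workflows * seconds_each ))
  echo $(( (total + 59) / 60 ))
}

estimate_scan_minutes 146 15   # prints 37 (2190 s, i.e. 36.5 min rounded up)
```

Under these assumptions the old 30-minute workflow timeout could not cover a full scan, while the new 45-minute limit leaves roughly 8 minutes of headroom.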

Error Handling (.github/workflows/static-analysis-report.md)

  • Add retry-once instruction for MCP -32603 errors
  • Fallback to cache-memory historical data on failure
  • Graceful degradation with partial results

Implementation

# Before: Process check only
sleep 2
if ! kill -0 $MCP_PID 2>/dev/null; then
  echo "MCP server failed to start"
  exit 1
fi

# After: TCP connection verification
for i in {1..15}; do
  if ! kill -0 $MCP_PID 2>/dev/null; then
    echo "Error: MCP server process died unexpectedly"
    exit 1
  fi
  if timeout 1 bash -c "echo > /dev/tcp/localhost/8765" 2>/dev/null; then
    echo "MCP server is accepting connections on port 8765"
    exit 0
  fi
  echo "Waiting for server to accept connections... (attempt $i/15)"
  sleep 1
done
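
The excerpt above does not show what happens when all 15 attempts are exhausted. A minimal sketch of the same check as a reusable function with an explicit timeout failure might look like the following; `wait_for_port`, its arguments, and its return codes are hypothetical and not taken from the merged change.

```shell
# Hypothetical helper; wait_for_port and its arguments are illustrative names,
# not part of gh-aw.
# Returns 0 once the port accepts a TCP connection, 1 if the process has
# exited, and 2 if the port never opens within the allotted attempts.
wait_for_port() {
  local pid="$1" port="$2" attempts="${3:-15}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Fail fast: there is no point waiting on a dead server.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "Error: MCP server process died unexpectedly" >&2
      return 1
    fi
    # /dev/tcp is a bash feature; the redirect succeeds only if the connect does.
    if timeout 1 bash -c "echo > /dev/tcp/localhost/$port" 2>/dev/null; then
      echo "MCP server is accepting connections on port $port"
      return 0
    fi
    echo "Waiting for server to accept connections... (attempt $i/$attempts)"
    sleep 1
    i=$(( i + 1 ))
  done
  echo "Error: port $port did not open after $attempts attempts" >&2
  return 2
}

# Example call, assuming MCP_PID holds the server's process ID:
#   wait_for_port "$MCP_PID" 8765 15 || exit 1
```

Returning a status instead of calling `exit` inside the loop lets the caller decide whether a startup failure should abort the step or trigger a fallback.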

Lock files for 12 workflows updated via make recompile.

Original prompt

This section details the original issue you should resolve

<issue_title>Improve MCP server reliability for static analysis workflow</issue_title>
<issue_description>## Summary

This PR improves the reliability of the MCP server used in the static analysis workflow to prevent MCP error -32603 issues reported in discussion #8763.

Issues Found (from investigation)

Static Analysis Workflow

  • Historical Issue: MCP compilation errors (-32603) preventing security scans
  • Root Causes Identified:
    1. Insufficient health checking after MCP server startup
    2. Tight timeouts for large repository (146 workflows)
    3. No retry logic for transient MCP errors
    4. No graceful degradation on failures

Changes Made

1. MCP Server Health Check (.github/workflows/shared/mcp/gh-aw.md)

  • Added: Robust TCP connection health check with 15-second timeout
  • Improvement: Waits for server to actually accept connections, not just start process
  • Safety: Fails fast if server dies unexpectedly
  • Feedback: Progress messages every second during startup
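# Before: 2-second sleep + process check only
# After: 15-second retry loop with TCP connection test
for i in {1..15}; do
  if timeout 1 bash -c "echo > /dev/tcp/localhost/8765" 2>/dev/null; then
    echo "MCP server is accepting connections on port 8765"
    exit 0
  fi
  # Pause between attempts; a refused connection can return immediately,
  # so without this the loop would finish in well under 15 seconds.
  sleep 1
done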

2. Increased Timeouts (.github/workflows/static-analysis-report.md)

  • Tool timeout: 300s → 600s (5 min → 10 min)
  • Workflow timeout: 30 min → 45 min
  • Rationale: 146 workflows × ~15s per compile ≈ 36.5 minutes needed for full scan

3. Agent Error Handling (.github/workflows/static-analysis-report.md)

  • Added: Retry logic instructions for agent
  • Graceful degradation: Use historical data if compile fails
  • User experience: Partial results better than total failure
**Error Handling**: If you receive an MCP error (such as -32603):
- Wait 10 seconds and retry the compile operation once
- If the second attempt fails, provide summary based on:
  * Historical data from cache-memory
  * Partial results if any workflows were successfully compiled
  * Recommendations for manual investigation
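
In this change the retry is a prompt instruction to the agent, not a script. For readers who want the same retry-once behavior in a shell step, a minimal sketch follows; `retry_once` and `RETRY_DELAY` are hypothetical names, and the 10-second default mirrors the instruction above.

```shell
# Hypothetical helper; retry_once and RETRY_DELAY are illustrative names.
# Runs the given command; on failure, waits and retries exactly once.
# The exit status is that of the last attempt.
retry_once() {
  if "$@"; then
    return 0
  fi
  echo "First attempt failed; retrying in ${RETRY_DELAY:-10}s" >&2
  sleep "${RETRY_DELAY:-10}"
  "$@"
}

# Example call with placeholder commands (neither exists in gh-aw):
#   retry_once compile_workflows || summarize_from_cache
```

A single retry keeps the worst-case added time bounded, which matters when the step is already close to its timeout.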

Expected Improvements

  • Reduced MCP startup failures - Health check ensures server is ready before agent starts
  • Fewer timeout errors - Doubled tool timeout prevents premature cancellation
  • Better error recovery - Retry logic handles transient network/resource issues
  • Graceful degradation - Partial results provided even if compile fails
  • Improved monitoring - Better progress feedback during MCP server startup

Testing

All changes are non-breaking and additive:

  • Health check is backward compatible (still checks process, plus TCP test)
  • Timeout increases only affect long-running operations
  • Agent instructions don't change behavior when MCP is working
  • No lock file changes needed

Validation

Changes follow workflow best practices:

  • Only modified .md workflow files (no .lock.yml files)
  • Preserved existing working configurations
  • Added defensive programming without over-engineering
  • Clear documentation of all changes

References


Note: Lock files will be automatically compiled after merge per repository workflow.

AI generated by Q


[!NOTE]
This was originally intended as a pull request, but the git push operation failed.

Workflow Run: View run details and download patch artifact

The patch file is available as an artifact (aw.patch) in the workflow run linked above.

To apply the patch locally:

# Download the artifact from the workflow run https://github.com/githubnext/gh-aw/actions/runs/20679332677
# (Use GitHub MCP tools if gh CLI is not available)
gh run download 20679332677 -n aw.patch

# Apply the patch
git am aw.patch
Show patch preview (97 of 97 lines)
From 1b7e20ba97e8985ecb2ab7c6b01c6dc7b2e7ccfc Mon Sep 17 00:00:00 2001
From: "Q (via Copilot)" <copilot@github.com>
Date: Sat, 3 Jan 2026 15:42:57 +0000
Subject: [PATCH] Improve MCP server reliability for static analysis workflow

- Add health check with TCP connection test to MCP server startup
- Increase tool timeout from 300s to 600s (10 minutes)
- Increase workflow timeout from 30 to 45 minutes
- Add retry logic instructions for agent on MCP errors
- Add graceful error handling with historical data fallback

Addresses MCP error -32603 issues reported in githubnext/gh-aw#8763
---
 .github/workflows/shared/mcp/gh-aw.md       | 30 ++++++++++++++----...

</details>



- Fixes githubnext/gh-aw#8774

Copilot AI and others added 2 commits January 3, 2026 15:57
…d timeouts

Co-authored-by: mnkiefer <8320933+mnkiefer@users.noreply.github.com>
Co-authored-by: mnkiefer <8320933+mnkiefer@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Improve MCP server reliability for static analysis workflow" to "Improve MCP server reliability with robust health checks and increased timeouts" Jan 3, 2026
Copilot AI requested a review from mnkiefer January 3, 2026 16:04
@pelikhan pelikhan marked this pull request as ready for review January 3, 2026 16:16
@pelikhan pelikhan merged commit 2c1f68a into main Jan 3, 2026
4 checks passed
@pelikhan pelikhan deleted the copilot/improve-mcp-server-reliability branch January 3, 2026 16:16