Skip to content

fix: add explicit step-level timeout to Go workflow Test step#57

Merged
erraggy merged 7 commits intomainfrom
fix/go-workflow-step-timeout
Nov 29, 2025
Merged

fix: add explicit step-level timeout to Go workflow Test step#57
erraggy merged 7 commits intomainfrom
fix/go-workflow-step-timeout

Conversation

@erraggy
Copy link
Copy Markdown
Owner

@erraggy erraggy commented Nov 29, 2025

Summary

Adds explicit timeout-minutes: 10 to the Test step in the Go workflow to align with the go test -timeout=10m configuration.

Problem

The Go workflow is experiencing unexpected timeouts at approximately 60 seconds despite having:

  • Job-level timeout: timeout-minutes: 10 (600 seconds)
  • Go test timeout: -timeout=10m (600 seconds)

The workflow is being killed with exit code 143 (SIGTERM) after only ~63 seconds, suggesting something is timing out before the configured values.

Root Cause

Investigation revealed that while we have a job-level timeout configured, we don't have an explicit step-level timeout for the Test step. Without this, the default step timeout of 360 minutes applies, but the actual behavior shows premature termination.

Solution

Add explicit timeout-minutes: 10 to the Test step to:

  1. Make timeout configuration more explicit and easier to debug
  2. Align step-level timeout with go test timeout
  3. Help identify whether the issue is with GitHub Actions behavior or actual test hangs

Changes

  • .github/workflows/go.yml: Added timeout-minutes: 10 to Test step

Testing

This change will be tested by monitoring the workflow runs to see if:

  1. The timeout behavior changes
  2. We get clearer feedback about what's timing out
  3. Tests complete successfully within the 10-minute window

Related

Part of investigation documented in plan: Fix Go Workflow Test Timeout Issue

🤖 Generated with Claude Code

The Go workflow was experiencing unexpected timeouts at ~60 seconds despite
having a job-level timeout of 10 minutes and go test -timeout=10m configured.

This change adds an explicit step-level timeout-minutes: 10 to the Test step,
aligning with the go test timeout value. This makes the timeout configuration
more explicit and easier to debug.

Without an explicit step-level timeout, the default of 360 minutes applies,
but the actual behavior was causing premature termination. Making this
timeout explicit will help identify whether the issue is with:
1. GitHub Actions step timeout behavior
2. Go test command timeout
3. Actual test hangs in specific packages

Part of investigation documented in plan: fix Go workflow test timeout issue.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings November 29, 2025 06:52
@codecov
Copy link
Copy Markdown

codecov Bot commented Nov 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.26%. Comparing base (809a1de) to head (1e74925).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #57      +/-   ##
==========================================
- Coverage   66.52%   66.26%   -0.27%     
==========================================
  Files          42       42              
  Lines        7795     7795              
==========================================
- Hits         5186     5165      -21     
- Misses       2070     2079       +9     
- Partials      539      551      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an explicit step-level timeout to the Test step in the Go workflow to help diagnose and resolve timeout issues. The change adds timeout-minutes: 10 to align with the go test -timeout=10m configuration, making the timeout behavior more explicit.

Key Change:

  • Adds step-level timeout-minutes: 10 to the Test step in the Go workflow

Comment thread .github/workflows/go.yml Outdated
@erraggy
Copy link
Copy Markdown
Owner Author

erraggy commented Nov 29, 2025

Investigation Update

After investigation, I've discovered that:

  1. Tests pass consistently locally - All tests complete in ~1.3 seconds with a 2-minute timeout

  2. The issue is intermittent in CI - Checking recent main branch runs shows a pattern of intermittent failures:

  3. Root cause: GitHub Actions runner instability, not hanging tests

    • Exit code 143 = SIGTERM (external process termination)
    • No configured timeout is being reached (8m go test, 10m step)
    • The runner itself appears to be killing processes sporadically

Recommendation

Rather than pursuing timeouts, we should add automatic retry logic to handle these intermittent runner failures. The step-level timeout addition in this PR is still good practice, but won't solve the underlying issue.

I recommend:

  • Merge this PR for the explicit timeout documentation
  • Add GitHub Actions retry logic in a follow-up PR to handle intermittent failures

Investigation revealed that the intermittent timeout failures are likely caused
by resource contention on GitHub Actions runners, which have only 2 CPU cores
but were running tests with -parallel=4.

Changes:
- Reduced -parallel from 4 to 2 to match runner CPU count
- Added GOMAXPROCS=2 environment variable to prevent over-subscription
- Kept timeout-minutes: 10 and -timeout=8m for graceful timeout handling

This should prevent the sporadic SIGTERM (exit code 143) failures caused by
the runner killing resource-intensive test processes.

Testing locally showed all tests pass in ~3 seconds, but GitHub Actions runners
with limited resources were experiencing CPU contention leading to processes
being killed after ~60 seconds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@erraggy
Copy link
Copy Markdown
Owner Author

erraggy commented Nov 29, 2025

Root Cause Identified: Resource Contention

You were absolutely right about resource constraints! The issue is CPU contention on GitHub Actions runners.

Investigation Details

GitHub Actions Runners:

  • 2 CPU cores
  • 7 GB RAM

Our Test Configuration (before fix):

  • -parallel=4 (up to 4 packages running concurrently)
  • No GOMAXPROCS limit (defaults to available CPUs)
  • Result: 4 parallel test processes competing for 2 CPU cores = severe contention

Evidence

  1. Local tests pass instantly (~3 seconds) on powerful dev machine
  2. CI shows intermittent failures at ~60 seconds with exit code 143 (SIGTERM)
  3. Pattern matches resource exhaustion: Runner kills process when it can't keep up

Solution Applied

Latest commit reduces test parallelism to match runner resources:

  • -parallel=2 (match runner CPU count)
  • GOMAXPROCS=2 (prevent Go runtime from over-subscribing)
  • Kept explicit timeouts for graceful handling

This should eliminate the intermittent SIGTERM failures by preventing resource contention.

Root cause identified: The parser package fuzz test corpus was consuming
excessive resources when run alongside all other tests, causing GitHub
Actions runners to terminate the job with "runner received shutdown signal".

Evidence:
- Parser package is the only package that never completes in CI
- Parser package is the only package with a fuzz test (FuzzParseBytes)
- Runner shutdown occurs after other packages complete, when parser runs
- Fuzz tests automatically execute their corpus during regular go test runs
- The corpus includes 19 seeds + 2 generated test cases = 21 test executions
  with resolveRefs=true and validateStructure=true (most resource-intensive)

Solution:
1. Exclude fuzz tests from main test run with -skip='^Fuzz'
2. Run fuzz corpus tests separately with stricter resource limits:
   - GOMAXPROCS=1 (single CPU to prevent resource exhaustion)
   - timeout-minutes: 5 (shorter window)
   - -timeout=2m (go test timeout)
   - Only test parser package explicitly

This prevents the resource spike that was causing GitHub Actions to kill
the runner while maintaining fuzz corpus regression testing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@erraggy
Copy link
Copy Markdown
Owner Author

erraggy commented Nov 29, 2025

✅ ROOT CAUSE IDENTIFIED: Fuzz Test Corpus Resource Exhaustion

The issue is fuzz test corpus execution consuming excessive resources, causing GitHub Actions runners to terminate the job.

The Smoking Gun

Parser package behavior in CI:

  • Parser is the ONLY package with a fuzz test (FuzzParseBytes)
  • Parser is the ONLY package that never completes in CI
  • Runner shutdown occurs right after joiner completes (when parser would be running)
  • Error message: "The runner has received a shutdown signal"

Fuzz test resource consumption:

  • 19 seed corpus entries + 2 generated test cases = 21 test executions
  • Each execution runs ParseBytes(data, resolveRefs=true, validateStructure=true)
  • This is the most resource-intensive code path (parsing + validation + ref resolution)
  • Running with -parallel=2 means 2 packages testing concurrently, each potentially running fuzz corpus

Why This Happens

When you run go test, Go automatically executes fuzz test corpus as regression tests. On a resource-constrained GitHub Actions runner (2 CPU, 7GB RAM), running 21 resource-intensive parse operations was enough to trigger runner termination.

Solution Applied

Latest commit isolates fuzz tests:

  1. Main test run: Skip fuzz tests with -skip='^Fuzz'
  2. Separate fuzz corpus step:
    • Run only parser fuzz tests: go test -run='^Fuzz' ./parser
    • Limit to single CPU: GOMAXPROCS=1
    • Shorter timeout: 2 minutes
    • Prevents resource spike that triggers runner shutdown

This maintains fuzz corpus regression testing while preventing the resource exhaustion that was killing our CI runs.

@erraggy erraggy merged commit 72f53af into main Nov 29, 2025
6 of 7 checks passed
@erraggy erraggy deleted the fix/go-workflow-step-timeout branch November 29, 2025 07:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants