Prevent workflow_call worker fan-out cancellations by namespacing agent concurrency + enabling queued pending runs #35173

Merged
pelikhan merged 6 commits into main from copilot/fix-safe-outputs-concurrency-issue
May 27, 2026

Copilot AI commented May 27, 2026

safe-outputs.call-workflow fan-out was lossy because compiler-injected agent job concurrency used gh-aw-{engine}-${{ github.workflow }}; in reusable workers, ${{ github.workflow }} resolves to the caller, causing parallel worker invocations to collide in one group and cancel each other. This change makes default worker grouping stable per worker identity and retains queued invocations instead of dropping pending runs.

  • Agent concurrency group: workflow-call aware

    • In GenerateJobConcurrencyConfig, default generated group now uses workflow ID namespace for reusable workflows:
      • from gh-aw-{engine}-${{ github.workflow }}
      • to gh-aw-{engine}-{workflow-id} when on.workflow_call is present.
    • Non-workflow_call behavior remains unchanged.
  • Generated job concurrency now preserves pending fan-out

    • Compiler-generated default agent job concurrency now emits queue: max (respecting existing features.group-concurrency-queue: false opt-out).
    • This addresses GitHub’s single-pending default behavior that was dropping queued runs under bursty fan-out.
  • Coverage updates

    • Updated existing concurrency expectations to include queue: max where defaults are generated.
    • Added focused cases for:
      • workflow_call workers using workflow-ID-based group namespace.
      • feature-flag disabling of generated queue: max.
# before (worker called via workflow_call)
concurrency:
  group: "gh-aw-copilot-${{ github.workflow }}"

# after
concurrency:
  group: "gh-aw-copilot-reviewer-worker"
  queue: max
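
The grouping rule described above can be sketched in Go. This is a minimal illustration under stated assumptions, not the code in pkg/workflow/concurrency.go: `defaultAgentGroup` and `renderConcurrency` are hypothetical names, and the `isWorkflowCall` and `queueMax` booleans stand in for the compiler's own trigger detection and feature-flag lookup.

```go
package main

import "fmt"

// defaultAgentGroup is a hypothetical helper that mirrors the rule in the PR
// description: reusable (workflow_call) workers get a group keyed by their
// compile-time workflow ID; everything else keeps the github.workflow form.
func defaultAgentGroup(engineID, workflowID string, isWorkflowCall bool) string {
	if isWorkflowCall && workflowID != "" {
		return fmt.Sprintf("gh-aw-%s-%s", engineID, workflowID)
	}
	return fmt.Sprintf("gh-aw-%s-${{ github.workflow }}", engineID)
}

// renderConcurrency renders the YAML block and appends queue: max on request.
func renderConcurrency(group string, queueMax bool) string {
	out := fmt.Sprintf("concurrency:\n  group: \"%s\"", group)
	if queueMax {
		out += "\n  queue: max"
	}
	return out
}

func main() {
	group := defaultAgentGroup("copilot", "reviewer-worker", true)
	fmt.Println(renderConcurrency(group, true))
}
```

With the inputs shown in `main`, the rendered block matches the "after" YAML above.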

Copilot AI and others added 2 commits May 27, 2026 10:20
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…queue

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix concurrency issue in safe-outputs.call-workflow" to "Prevent workflow_call worker fan-out cancellations by namespacing agent concurrency + enabling queued pending runs" May 27, 2026
Copilot AI requested a review from pelikhan May 27, 2026 10:31
@pelikhan pelikhan marked this pull request as ready for review May 27, 2026 10:45
Copilot AI review requested due to automatic review settings May 27, 2026 10:45

Copilot AI left a comment

Pull request overview

Fixes lossy safe-outputs.call-workflow fan-out by namespacing the compiler-generated agent job concurrency group per worker (using the worker's compile-time workflow ID instead of ${{ github.workflow }}, which resolves to the caller in reusable workflows) and by emitting queue: max so pending invocations are queued instead of dropped.

Changes:

  • In GenerateJobConcurrencyConfig, switch the default group to gh-aw-{engine}-{workflow-id} when on.workflow_call is present and a workflow ID is known.
  • Append queue: max to compiler-generated agent job concurrency, gated by the existing group-concurrency-queue feature flag.
  • Update concurrency test expectations and add cases covering the workflow_call namespace and the queue-max opt-out.
Show a summary per file
| File | Description |
| --- | --- |
| pkg/workflow/concurrency.go | Adds workflow_call-aware group namespacing and queue: max emission (with feature-flag opt-out) for default agent job concurrency. |
| pkg/workflow/concurrency_test.go | Updates existing expected outputs to include queue: max and adds workflow_call namespace + queue-max opt-out test cases. |
| .github/workflows/daily-compiler-threat-spec-optimizer.lock.yml | Recompiled lock file (schedule cron change, heredoc hash changes, allowDomains update); agent concurrency block does not appear to have been regenerated to include queue: max. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 1

Comment on lines +65 to +66
- cron: "6 3 * * 1"
# Friendly format: weekly on monday around 03:00 (scattered)

github-actions Bot commented May 27, 2026

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

github-actions Bot commented May 27, 2026

PR Code Quality Reviewer completed the code quality review.

github-actions Bot commented May 27, 2026

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #35173 does not have the 'implementation' label and has only 48 new lines of code in default business logic directories (threshold: 100).

github-actions Bot commented May 27, 2026

🧪 Test Quality Sentinel completed test quality analysis.

@github-actions github-actions Bot mentioned this pull request May 27, 2026

@github-actions github-actions Bot left a comment

Skills-Based Review 🧠

Applied /diagnose and /tdd — clean fix with good coverage. Approving with two minor test-gap suggestions.

📋 Key Themes & Highlights

Root cause correctly addressed

The fix is precise: ${{ github.workflow }} is a late-binding GitHub Actions expression that resolves to the caller name inside a reusable worker, causing all fan-out workers to share one concurrency group. Using the compile-time WorkflowID instead removes the dependency on runtime context entirely.
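
The collision can be illustrated with a small Go simulation. It is a sketch, not GitHub's expression engine: `resolveAtRuntime` only substitutes the one expression discussed here, and the caller name `Orchestrator` and the second worker ID `triage-worker` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveAtRuntime imitates expression expansion for a single expression.
// Inside a reusable workflow, github.workflow is the caller's name.
func resolveAtRuntime(groupTemplate, callerWorkflowName string) string {
	return strings.ReplaceAll(groupTemplate, "${{ github.workflow }}", callerWorkflowName)
}

// sharesGroup reports whether two group templates resolve to the same
// concurrency group when both workers are invoked by the same caller.
func sharesGroup(a, b, caller string) bool {
	return resolveAtRuntime(a, caller) == resolveAtRuntime(b, caller)
}

func main() {
	const caller = "Orchestrator" // hypothetical caller workflow name

	// Before: every worker compiled the same late-binding template.
	old := "gh-aw-copilot-${{ github.workflow }}"
	fmt.Println(sharesGroup(old, old, caller)) // true: workers collide

	// After: each worker carries its own compile-time workflow ID.
	fmt.Println(sharesGroup("gh-aw-copilot-reviewer-worker", "gh-aw-copilot-triage-worker", caller)) // false
}
```

Parallel invocations of the same worker still share one group after the change, which is where the queue: max setting discussed next comes in.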

queue: max default

Adding queue: max by default is the right call for fan-out workflows — GitHub's single-pending-run default silently drops queued runs under bursty load. The existing features.group-concurrency-queue: false opt-out is correctly plumbed and tested.
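
The opt-out semantics (on unless explicitly disabled) can be sketched as follows. `queueEnabled` is a hypothetical stand-in for the compiler's flag check, and the `map[string]bool` shape of the features block is an assumption made for this example only.

```go
package main

import "fmt"

// queueEnabled treats the flag as on unless it is explicitly set to false.
// Reading a missing key (or a nil map) yields set == false, so the default wins.
func queueEnabled(features map[string]bool) bool {
	enabled, set := features["group-concurrency-queue"]
	return !set || enabled
}

func main() {
	fmt.Println(queueEnabled(nil))                                               // true: default
	fmt.Println(queueEnabled(map[string]bool{"group-concurrency-queue": false})) // false: opted out
}
```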

Test gaps (inline comments)

  1. No pure workflow_call-only trigger test — the new case pairs workflow_call with schedule; a workflow_call-only variant would nail the canonical scenario.
  2. No empty WorkflowID + workflow_call test — the silent fallback to ${{ github.workflow }} should be documented as a known edge case or eliminated with an earlier invariant check.

Positive highlights

  • ✅ Surgical change — non-workflow_call paths are untouched
  • ✅ Feature-flag opt-out works symmetrically in both concurrency.go and notify_comment.go
  • ✅ Lock file diff is a clean downstream artefact (hash rotation + cron change + domain addition)

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · sonnet46 1.5M

workflowData: &WorkflowData{
  On:         "on:\n  workflow_call:\n    inputs:\n      item:\n        required: false\n        type: string\n  schedule:\n    - cron: '0 0 * * *'",
  WorkflowID: "reviewer-worker",
  EngineConfig: &EngineConfig{

[/tdd] The new workflow_call test pairs the trigger with schedule, but there is no case for a pure workflow_call-only trigger. Because hasWorkflowCallTrigger is the gating condition, a workflow_call-only workflow is the canonical scenario — worth an explicit case.

💡 Suggested additional case
{
  name: "workflow_call-only trigger uses workflow-id namespace",
  workflowData: &WorkflowData{
    On:         "on:\n  workflow_call:\n    inputs:\n      item:\n        type: string",
    WorkflowID: "reviewer-worker",
    EngineConfig: &EngineConfig{ID: "copilot"},
  },
  expected: `concurrency:
  group: "gh-aw-copilot-reviewer-worker"
  queue: max`,
  description: "pure workflow_call trigger must also use compile-time WorkflowID",
},

// For workflow_call workers, github.workflow resolves to the caller workflow
// name, so use the compile-time workflow ID to avoid cross-worker collisions.
groupValue := fmt.Sprintf("gh-aw-%s-${{ github.workflow }}", engineID)
if hasWorkflowCallTrigger(workflowData.On) && workflowData.WorkflowID != "" {

[/diagnose] The silent fallback when WorkflowID is empty deserves a test. If a workflow_call workflow somehow compiles with an empty WorkflowID, the group silently reverts to the buggy ${{ github.workflow }} form — the original problem returns with no warning. Consider adding a test case (empty WorkflowID + workflow_call → still gets ${{ github.workflow }} group) so the behaviour is documented and any future regression is caught.

💡 Suggested test case
{
  name: "workflow_call with empty WorkflowID falls back to github.workflow group",
  workflowData: &WorkflowData{
    On:           "on:\n  workflow_call:\n",
    WorkflowID:   "", // empty — should fall back
    EngineConfig: &EngineConfig{ID: "copilot"},
  },
  expected: `concurrency:
  group: "gh-aw-copilot-${{ github.workflow }}"
  queue: max`,
  description: "empty WorkflowID falls back gracefully; documents the known limitation",
},

This also serves as a living contract: if WorkflowID is guaranteed non-empty for workflow_call workflows, a require.NotEmpty assertion in the compilation path would be safer than a silent fallback.
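
One way to express that invariant outside of tests is an explicit error return, since `require.NotEmpty` from testify needs a test handle. The sketch below is an assumption about how such a guard might look; `checkWorkflowCallID` and `errMissingWorkflowID` are hypothetical names and do not appear in the PR.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errMissingWorkflowID is a hypothetical sentinel error for this sketch.
var errMissingWorkflowID = errors.New("workflow_call workflow has no workflow ID; refusing to fall back to the github.workflow group")

// checkWorkflowCallID fails fast instead of silently reverting to the
// colliding group when a reusable worker has no compile-time ID.
func checkWorkflowCallID(isWorkflowCall bool, workflowID string) error {
	if isWorkflowCall && strings.TrimSpace(workflowID) == "" {
		return errMissingWorkflowID
	}
	return nil
}

func main() {
	fmt.Println(checkWorkflowCallID(true, "reviewer-worker"))
	fmt.Println(checkWorkflowCallID(true, ""))
}
```

Whether a hard failure is preferable to the documented fallback depends on whether WorkflowID can legitimately be empty at that point, which the PR does not state.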

🧪 Test Quality Sentinel Report

Test Quality Score: 80/100 — Excellent

Analyzed 2 new test cases in pkg/workflow/concurrency_test.go: 2 design tests, 0 implementation tests, 0 guideline violations.

📊 Metrics & Test Classification (2 tests analyzed)
| Metric | Value |
| --- | --- |
| New/modified tests analyzed | 2 |
| ✅ Design tests (behavioral contracts) | 2 (100%) |
| ⚠️ Implementation tests (low value) | 0 (0%) |
| Tests with error/edge cases | 2 (100%) |
| Duplicate test clusters | 0 |
| Test inflation detected | Yes (39 test lines added vs 9 production lines; ratio ≈ 3.9:1, threshold 2:1) |
| 🚨 Coding-guideline violations | 0 |

Note on inflation: The 39 added test lines include ~20 lines updating expected values in existing table rows (adding queue: max to expected concurrency output). These updates are required to keep existing tests passing against the new behavior, not gratuitous padding. The 10-point inflation penalty is applied mechanically but does not reflect a real quality concern here.

Test Classification Details

| Test | File | Classification | Issues Detected |
| --- | --- | --- | --- |
| "workflow_call default group uses workflow id namespace" | pkg/workflow/concurrency_test.go:544 | ✅ Design | None; verifies observable concurrency group output for workflow_call trigger |
| "generated job concurrency can disable queue max feature" | pkg/workflow/concurrency_test.go:568 | ✅ Design | None; verifies feature flag opt-out disables queue: max emission |

Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 2 tests — unit (//go:build !integration)
  • 🟨 JavaScript (*.test.cjs, *.test.js): 0 tests

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). Both new table rows verify observable outputs of GenerateJobConcurrencyConfig() under distinct trigger conditions, providing genuine behavioral coverage of the workflow_call namespacing fix and the group-concurrency-queue feature flag.

📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

🧪 Test quality analysis by Test Quality Sentinel · sonnet46 1.6M ·

@github-actions github-actions Bot left a comment

✅ Test Quality Sentinel: 80/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%).

@github-actions github-actions Bot left a comment

Non-blocking observations

The core fix (stable workflow_call group + queue: max for fan-out) is correct and well-motivated. Two non-blocking concerns worth addressing:

Findings

queue: max default scope is wider than the problem it fixes. The fan-out cancellation problem only affects workflow_call workers (and discriminated fan-outs), but queue: max is now emitted for all schedule/workflow_call/mixed workflows. Slow scheduled workflows with no fan-out will now accumulate an unbounded queue instead of keeping the GitHub default of 1 pending run. The group-concurrency-queue: false opt-out exists but isn't documented prominently. Commented inline on concurrency.go:85.

Test gap: workflow_call + ConcurrencyJobDiscriminator together. The two new code paths are tested independently but the combined case (worker with a discriminator) has no test. Commented on concurrency_test.go.

The stale lock file (missing queue: max in the concurrency block) was already flagged in a previous review comment — recompiling after the Go change is merged will resolve it.

🔎 Code quality review by PR Code Quality Reviewer · sonnet46 1.6M

}
concurrencyConfig := fmt.Sprintf("concurrency:\n  group: \"%s\"", groupValue)
if isGroupConcurrencyQueueEnabled(workflowData) {
  concurrencyConfig += "\n  queue: max"

queue: max as the default for all schedule/workflow_call workflows may cause unbounded run pile-up for long-running workflows. If a scheduled workflow takes longer than its cadence, or a workflow_dispatch workflow is triggered in rapid bursts, runs queue indefinitely, each waiting until the runs ahead of it finish or are manually cancelled. The old GitHub default (1 pending run) was safer for non-fan-out cases.

💡 Detail and suggestion

The fix for fan-out worker cancellations is correct, but applying queue: max globally (rather than only when workflow_call is present or a discriminator is set) imposes a surprising default on ordinary scheduled workflows with no fan-out.

The group-concurrency-queue: false feature flag opt-out exists but isn't prominently documented. Worth noting in AGENTS.md or the authoring guide:

# For non-fan-out scheduled workflows, opt out of unbounded queuing:
features:
  group-concurrency-queue: false

Alternatively, only emit queue: max when workflow_call is detected or when a discriminator is present (the actual fan-out cases), leaving plain schedule-only workflows at the default single-pending-run behaviour.
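
The narrower alternative could look roughly like this. It is a sketch of the reviewer's suggestion, not merged behavior: `shouldEmitQueueMax` is a hypothetical name, and its three parameters stand in for the feature-flag lookup, the workflow_call detection, and the ConcurrencyJobDiscriminator value.

```go
package main

import "fmt"

// shouldEmitQueueMax limits queue: max to the fan-out cases named in the
// review: reusable workers and workflows with a job discriminator.
func shouldEmitQueueMax(queueFeatureEnabled, isWorkflowCall bool, discriminator string) bool {
	if !queueFeatureEnabled {
		return false // explicit opt-out always wins
	}
	return isWorkflowCall || discriminator != ""
}

func main() {
	fmt.Println(shouldEmitQueueMax(true, false, ""))                      // schedule-only workflow
	fmt.Println(shouldEmitQueueMax(true, true, ""))                       // workflow_call worker
	fmt.Println(shouldEmitQueueMax(true, false, "${{ inputs.item_id }}")) // discriminated fan-out
	fmt.Println(shouldEmitQueueMax(false, true, ""))                      // feature flag disabled
}
```

Under this rule a schedule-only workflow would keep GitHub's single-pending-run default without needing the opt-out.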

},
{
name: "Job discriminator ignored when push trigger (special trigger, no default group)",
workflowData: &WorkflowData{

Missing test for workflow_call + ConcurrencyJobDiscriminator combination. The new branch that substitutes WorkflowID for ${{ github.workflow }} and the discriminator-appending logic are tested independently but never together; a regression in their interaction would go undetected.

💡 Suggested test case to add
{
    name: "workflow_call with discriminator uses workflow-id-based group",
    workflowData: &WorkflowData{
        On:         "on:\n  workflow_call:\n    inputs:\n      item_id:\n        required: false\n        type: string",
        WorkflowID: "reviewer-worker",
        EngineConfig: &EngineConfig{ID: "copilot"},
        ConcurrencyJobDiscriminator: "${{ inputs.item_id }}",
    },
    expected: `concurrency:
  group: "gh-aw-copilot-reviewer-worker-${{ inputs.item_id }}"
  queue: max`,
    description: "workflow_call with discriminator should append discriminator to workflow-id group",
},

Also worth adding a case with workflow_call trigger and empty WorkflowID to cover the fallback to ${{ github.workflow }}.

@pelikhan

@copilot review all comments and review comments

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI commented May 27, 2026

@copilot review all comments and review comments

All 157 lock files have been recompiled and now include queue: max on the agent job concurrency block (e.g. daily-compiler-threat-spec-optimizer.lock.yml lines 349-351). The only changes are the queue: max additions and the corresponding heredoc EOF marker hash updates that result from the changed content. Committed in chore: recompile lock files to emit queue: max on agent job concurrency.

@pelikhan pelikhan merged commit f03abf5 into main May 27, 2026
18 checks passed
@pelikhan pelikhan deleted the copilot/fix-safe-outputs-concurrency-issue branch May 27, 2026 12:22

Development

Successfully merging this pull request may close these issues.

safe-outputs.call-workflow workers cancel each other under parallel fan-out due to gh-aw-copilot-${{ github.workflow }} job concurrency group

4 participants