fix: hard-bake workflow ID in workflow_call concurrency group to prevent deadlock#43078
Conversation
…ent deadlock
When a gateway workflow calls a worker via `call_workflow`, GitHub Actions
resolves `github.workflow` to the *calling* workflow's name in reusable
workflows. This caused both the gateway lockfile and the worker lockfile to
emit the same concurrency group (`gh-aw-${{ github.workflow }}`), resulting
in a deadlock that GitHub Actions detected and cancelled.
Fix `buildConcurrencyGroupKeys` to mirror the existing logic in
`GenerateJobConcurrencyConfig`: when the workflow has a `workflow_call`
trigger and a known compile-time `WorkflowID`, substitute the hard-baked
filename-based ID instead of the `${{ github.workflow }}` expression.
Recompile all 258 workflow lockfiles so existing worker lockfiles pick up
the updated concurrency group immediately.
Closes #42846
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #43078 does not have the 'implementation' label and has only 39 new lines of code in business logic directories (threshold: 100). |
|
✅ Test Quality Sentinel completed test quality analysis. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ PR Code Quality Reviewer completed the code quality review. |
There was a problem hiding this comment.
Pull request overview
Fixes a GitHub Actions workflow_call deadlock caused by github.workflow resolving to the caller’s workflow name inside reusable workflows, making gateway + worker share the same concurrency group.
Changes:
- Update workflow-level concurrency key generation to prefer the compile-time
WorkflowIDforworkflow_callworkflows (falling back to${{ github.workflow }}whenWorkflowIDis unavailable). - Add unit tests covering both the
WorkflowIDhard-bake behavior and the fallback behavior. - Recompile generated workflow lockfiles so
workflow_callworkers use the new concurrency group value.
Show a summary per file
| File | Description |
|---|---|
| pkg/workflow/concurrency.go | Uses compile-time WorkflowID for workflow_call workflow-level concurrency groups to avoid gateway/worker collisions. |
| pkg/workflow/concurrency_test.go | Adds regression tests for workflow_call concurrency group behavior with and without WorkflowID. |
| .github/workflows/smoke-workflow-call.lock.yml | Updates generated concurrency group to the hard-baked workflow ID. |
| .github/workflows/smoke-workflow-call-with-inputs.lock.yml | Updates generated concurrency group to the hard-baked workflow ID. |
Review details
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 4/4 changed files
- Comments generated: 0
- Review effort level: Low
There was a problem hiding this comment.
Review: fix concurrency deadlock in workflow_call workers
The fix is correct, well-scoped, and consistent with the existing job-level concurrency fix in GenerateJobConcurrencyConfig.
Root cause analysis ✅
GitHub Actions resolves github.workflow to the caller's name when a reusable workflow is invoked via workflow_call. Both gateway and worker therefore produced the same concurrency group string, and with cancel-in-progress semantics GitHub detects a deadlock and cancels the child run before any job records are created.
Implementation ✅
- The fix in
buildConcurrencyGroupKeys(line 236–238) mirrors the identical fix already present inGenerateJobConcurrencyConfig(line 74–76), keeping both levels consistent. - Fallback to
${{ github.workflow }}whenWorkflowID == ""preserves backward compatibility for programmatic compilation paths that don't carry a file path. - The two affected lock files (
smoke-workflow-callandsmoke-workflow-call-with-inputs) are the onlyworkflow_callworkers in the repository, which is consistent with the diff showing only 4 changed files despite 258 total lock files.
Tests ✅
Two new test cases cover both branches of the condition — hard-baked ID when WorkflowID is set, and ${{ github.workflow }} fallback when it is not.
Minor observations (non-blocking)
- Mixed-trigger workflows (
workflow_call+ others): the hard-baked ID is applied even when the workflow might be triggered viaworkflow_dispatch. This is safe because in those cases${{ github.workflow }}would resolve to the workflow's own name anyway — the same value asWorkflowID. cancel-in-progressis not enabled for pureworkflow_callworkers (gated onisPullRequestWorkflow), so these groups function as serialization queues rather than cancellation groups. This is the intended behavior for worker workflows.
🧵 Reviewed using Impeccable skills by Impeccable Skills Reviewer · 27.5 AIC · ⌖ 6.04 AIC · ⊞ 4.9K
🧪 Test Quality Sentinel Report✅ Test Quality Score: 90/100 — Excellent
📊 Metrics (2 tests)
|
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnosing-bugs and /codebase-design — this is a well-targeted fix for a gnarly GitHub Actions behaviour.
📋 Key Themes & Highlights
Key Themes
- Pattern duplication: the
hasWorkflowCallTrigger && WorkflowID != ""guard is now present at two independent sites inconcurrency.go. Extracting a small helper (e.g.workflowConcurrencyID) would make the logic single-source-of-truth and easier to evolve. - Test completeness: the two new test cases cover the primary happy path and the empty-ID fallback well. A mixed-trigger case (
workflow_call+workflow_dispatch) would close the last gap.
Positive Highlights
- ✅ Root cause properly identified and explained — the PR description traces the exact GitHub Actions evaluation rule at fault.
- ✅ Fix mirrors the identical pattern already present in
GenerateJobConcurrencyConfig, so the codebase stays consistent. - ✅ Regression tests included in the same commit — both cases are meaningful specifications, not just coverage padding.
- ✅ Comment in the changed lines is clear and explains the why, not just the what.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 55.4 AIC · ⌖ 6.79 AIC · ⊞ 6.6K
Comment /matt to run again
| // share the same concurrency group, causing a deadlock. Hard-bake the compile-time | ||
| // workflow ID (filename without extension) to give each worker a unique group. | ||
| workflowKey := "${{ github.workflow }}" | ||
| if hasWorkflowCallTrigger(workflowData.On) && workflowData.WorkflowID != "" { |
There was a problem hiding this comment.
[/codebase-design] The guard hasWorkflowCallTrigger(workflowData.On) && workflowData.WorkflowID != "" is now duplicated verbatim at line 74 (GenerateJobConcurrencyConfig) and here. If the rule evolves, both sites must be updated in sync.
💡 Suggested refactor
Extract a small helper that returns the correct workflow-level key:
// workflowConcurrencyID returns the stable key for a workflow.
// For workflow_call workers, github.workflow resolves to the caller
// name at runtime, so we return the compile-time WorkflowID when available.
func workflowConcurrencyID(data *WorkflowData) string {
if hasWorkflowCallTrigger(data.On) && data.WorkflowID != "" {
return data.WorkflowID
}
return "${{ github.workflow }}"
}Both call sites become a single readable expression and the reasoning lives in one place.
@copilot please address this.
There was a problem hiding this comment.
Addressed in 91ceef3: extracted workflowCallBaseKey(...) and reused it from both GenerateJobConcurrencyConfig and buildConcurrencyGroupKeys so workflow_call key selection is single-source-of-truth.
| }, | ||
| isAliasTrigger: false, | ||
| expected: `concurrency: | ||
| group: "gh-aw-call-test-copilot-call-worker"`, |
There was a problem hiding this comment.
[/tdd] The test covers the happy path and fallback, but is missing a case where workflow_call is combined with other triggers (e.g., workflow_call + workflow_dispatch). GitHub Actions allows mixed triggers, and it is worth confirming the fix still kicks in under that scenario.
💡 Suggested additional test case
{
name: "workflow_call mixed with workflow_dispatch still uses hard-baked ID",
workflowData: &WorkflowData{
On: `on:
workflow_call:
workflow_dispatch:`,
WorkflowID: "mixed-call-worker",
Concurrency: "",
},
isAliasTrigger: false,
expected: `concurrency:
group: "gh-aw-mixed-call-worker"`,
description: "workflow_call + workflow_dispatch still prefers compile-time ID",
},This guards against future trigger-detection changes inadvertently flipping the logic on mixed workflows.
@copilot please address this.
There was a problem hiding this comment.
Addressed in 91ceef3: added a mixed-trigger regression in TestGenerateConcurrencyConfig (workflow_call + workflow_dispatch) to verify workflow_call detection still applies.
|
@copilot run pr-finisher skill |
There was a problem hiding this comment.
The deadlock fix is correct in intent but introduces a new concurrency starvation risk for fan-out callers. Three issues need to be addressed before merge.
Findings summary
High — Static group starves parallel worker invocations (line 239)
The generated group "gh-aw-<WorkflowID>" is fully static. GitHub Actions allows at most 1 running + 1 queued run per group (no cancel-in-progress, no queue: max at workflow level). A third concurrent call to the same worker drops the second pending run silently. For fan-out patterns this is a reliability regression. GenerateJobConcurrencyConfig already handles the analogous job-level case with queue: max — the workflow-level group needs similar treatment or an explicit run-isolating token.
Medium — Fallback to ${{ github.workflow }} is silent (line 236)
When WorkflowID == "", the code silently reverts to the original deadlock-prone value. No log line, no compile warning. Any programmatic compile without a path will reproduce the 100%-failure bug with no diagnostic signal.
Low — Test suite misses the mixed workflow_call + workflow_dispatch trigger case (concurrency_test.go)
The new tests cover pure workflow_call only. workflow_call + workflow_dispatch workers (a common real-world pattern) are untested.
🔎 Code quality review by PR Code Quality Reviewer · 187.2 AIC · ⌖ 5.39 AIC · ⊞ 5.4K
Comment /review to run again
| if hasWorkflowCallTrigger(workflowData.On) && workflowData.WorkflowID != "" { | ||
| workflowKey = workflowData.WorkflowID | ||
| } | ||
| keys := []string{"gh-aw", workflowKey} |
There was a problem hiding this comment.
Static concurrency group will serialise (or starve) parallel invocations of the same reusable worker.
The group "gh-aw-<WorkflowID>" is now fully static. GitHub Actions allows at most 1 running + 1 pending run per concurrency group when cancel-in-progress is absent. A third concurrent invocation of the same worker silently cancels the second pending one — dropped work with no error surfaced to the caller.
call_workflow is designed for fan-out: multiple jobs in the same (or different) gateway workflows can invoke the same worker simultaneously. Under the new static group those runs queue behind each other and the third drops the second.
💡 Suggested fix
Include a caller-unique token so parallel invocations can coexist:
if hasWorkflowCallTrigger(workflowData.On) && workflowData.WorkflowID != "" {
// github.run_id is unique per invocation; prevents queue collisions for fan-out callers
workflowKey = fmt.Sprintf("%s-${{ github.run_id }}", workflowData.WorkflowID)
}Alternatively, if sequential serialisation is intentional for all workflow_call workers, document it explicitly and note that the third concurrent caller will lose its run silently.
Note: GenerateJobConcurrencyConfig already handles this at job level by emitting queue: max (line 85). The workflow-level group created here has no equivalent protection.
There was a problem hiding this comment.
Addressed in 91ceef3: workflow-level workflow_call concurrency now appends ${{ github.run_id }} (<workflow-id>-${{ github.run_id }}) to prevent fan-out invocations from contending on a single static group.
| // share the same concurrency group, causing a deadlock. Hard-bake the compile-time | ||
| // workflow ID (filename without extension) to give each worker a unique group. | ||
| workflowKey := "${{ github.workflow }}" | ||
| if hasWorkflowCallTrigger(workflowData.On) && workflowData.WorkflowID != "" { |
There was a problem hiding this comment.
Silent regression: fallback to ${{ github.workflow }} for a workflow_call worker re-introduces the original deadlock with no warning emitted.
When WorkflowID == "" the code falls back to the broken pre-fix value. The PR description acknowledges this fallback exists ("programmatic compilation without a path"), but nothing in the logs or output surfaces it. A compile invoked without a path will silently produce the same deadlock-prone group that caused the 100% failure rate described in the PR.
💡 Suggested fix
At minimum, emit a warning when the fallback is taken so the misconfiguration is visible:
workflowKey := "${{ github.workflow }}"
if hasWorkflowCallTrigger(workflowData.On) {
if workflowData.WorkflowID != "" {
workflowKey = workflowData.WorkflowID
} else {
concurrencyLog.Print("WARNING: workflow_call worker has no WorkflowID; " +
"falling back to github.workflow which will cause deadlock " +
"if gateway and worker share the same caller name")
}
}There was a problem hiding this comment.
Addressed in 91ceef3: when workflow_call is detected and WorkflowID is empty, we now emit a warning log from workflowCallBaseKey before falling back to ${{ github.workflow }}.
| group: "gh-aw-${{ github.workflow }}-${{ github.event.issue.number || github.run_id }}"`, | ||
| description: "Rendered slash_command YAML (issue_comment + workflow_dispatch) uses issue number via isIssueWorkflow", | ||
| }, | ||
| { |
There was a problem hiding this comment.
Missing test coverage for the mixed workflow_call + workflow_dispatch trigger case — which is exactly what the two updated lock files use.
Both smoke-workflow-call.lock.yml and smoke-workflow-call-with-inputs.lock.yml declare only workflow_call in their on: section (verified in the lock files), so the fix is correct for those specific workers. However, there is no test asserting the behaviour when On contains both workflow_call and workflow_dispatch (a common real-world pattern for workers that can also be manually triggered). If a future worker uses that combination, the fix applies — but no test validates the group produced or the absence of cancel-in-progress.
💡 Suggested addition
{
name: "workflow_call + workflow_dispatch worker still uses hard-baked ID",
workflowData: &WorkflowData{
On: `on:
workflow_call:
workflow_dispatch:`,
WorkflowID: "my-worker",
Concurrency: "",
},
isAliasTrigger: false,
expected: `concurrency:
group: "gh-aw-my-worker"`,
description: "Mixed triggers: workflow_call detection still wins and hard-bakes the ID",
},There was a problem hiding this comment.
Addressed in 91ceef3: added explicit mixed-trigger test coverage (workflow_call + workflow_dispatch) and asserted the generated group keeps workflow_call behavior.
…rage Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
@copilot please run the
|
Ran the I also inspected the latest PR workflow runs for this branch: they are |
When a gateway workflow invokes a worker via
call_workflow, GitHub Actions evaluatesgithub.workflowin the reusable workflow to the caller's name — not the callee's. Both lockfiles ended up with the same concurrency group (e.g.gh-aw-Test Copilot Call Workflow), causing GitHub to detect a deadlock and cancel the worker before any job records were created, marking the runfailure. This was a 100% failure rate for everycall_workflowtest since the feature shipped.Changes
pkg/workflow/concurrency.go—buildConcurrencyGroupKeys: When the workflow declares aworkflow_calltrigger and has a known compile-timeWorkflowID, substitute the hard-baked filename-based ID instead of${{ github.workflow }}. Falls back to${{ github.workflow }}whenWorkflowIDis absent (e.g. programmatic compilation without a path).This mirrors the identical fix already present in
GenerateJobConcurrencyConfigfor job-level concurrency.pkg/workflow/concurrency_test.go: Two new cases inTestGenerateConcurrencyConfig— one asserting the hard-baked ID whenWorkflowIDis set, one asserting thegithub.workflowfallback when it is not.All 258
.lock.ymllockfiles recompiled to pick up the updated concurrency groups for any existingworkflow_callworkers.