fix: add retry/backoff to Fetch open non-draft PR queue in pr-sous-chef#43383
Fetch open non-draft PR queue in pr-sous-chef#43383Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…us-chef Wrap the `gh pr list` GraphQL call with a bounded retry loop (4 attempts, exponential backoff) that retries on transient 5xx/429/network errors and fails hard on deterministic errors. This prevents a single transient HTTP 502 from killing the entire run before the agent starts. Closes #43260 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Fetch open non-draft PR queue in pr-sous-chef
This comment has been minimized.
This comment has been minimized.
|
Hey One thing worth noting before this is ready for merge:
If you'd like a hand, you can assign this prompt to your coding agent:
|
|
🔍 Design Decision Gate 🏗️ is checking for design decision records on this pull request... |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
✅ Test Quality Sentinel completed test quality analysis. No test files were added or modified in this PR. Test Quality Sentinel skipped. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
There was a problem hiding this comment.
Pull request overview
This PR hardens the PR Sous Chef workflow’s “Fetch open non-draft PR queue” step against transient GitHub API/GraphQL failures by adding a bounded retry loop with exponential backoff, then recompiles the corresponding lock file.
Changes:
- Add retry + backoff logic around
gh pr listinpr-sous-chefto reduce failures from transient 5xx/429/network errors. - Recompile the generated workflow lock file to reflect the workflow source changes.
- Reformat bundled
action_pins.jsondata files (whitespace/indentation-only change).
Show a summary per file
| File | Description |
|---|---|
pkg/workflow/data/action_pins.json |
Reformats pinned-actions JSON data (indentation/whitespace). |
pkg/actionpins/data/action_pins.json |
Same JSON reformat in the actionpins package copy. |
.github/workflows/pr-sous-chef.md |
Adds retry/backoff wrapper for gh pr list in the PR queue fetch step. |
.github/workflows/pr-sous-chef.lock.yml |
Regenerated compiled workflow output reflecting the updated source workflow. |
Review details
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/4 changed files
- Comments generated: 1
- Review effort level: Low
| set +e | ||
| pr_list_output=$(gh pr list --repo "$EXPR_GITHUB_REPOSITORY" \ | ||
| --state open \ | ||
| --search "is:pr is:open -is:draft sort:updated-desc" \ | ||
| --limit 30 \ | ||
| --json number,title,url,headRefOid,headRefName,updatedAt,author,mergeStateStatus \ | ||
| 2>&1) | ||
| pr_list_status=$? | ||
| set -e | ||
| if [ "$pr_list_status" -eq 0 ]; then | ||
| echo "$pr_list_output" > "$candidate_file" | ||
| pr_list_success=true | ||
| elif echo "$pr_list_output" | grep -qE "HTTP (5[0-9]{2}|429)|context deadline exceeded|connection reset|EOF"; then | ||
| if [ "$pr_list_attempt" -lt "$pr_list_max_attempts" ]; then | ||
| backoff=$((2 ** pr_list_attempt)) | ||
| echo "Transient error on attempt $pr_list_attempt/$pr_list_max_attempts (exit $pr_list_status); retrying in ${backoff}s..." >&2 | ||
| echo "$pr_list_output" >&2 | ||
| sleep "$backoff" | ||
| fi | ||
| else | ||
| echo "gh pr list failed (exit $pr_list_status):" >&2 | ||
| echo "$pr_list_output" >&2 | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnosing-bugs and /tdd — requesting changes on two correctness issues in the transient-error classifier and a missing regression test.
📋 Key Themes & Highlights
Key Themes
- Transient signal regex has edge cases — bare
EOFis over-broad and the empty-output path bypasses retries entirely (two inline comments on line 79) - Last-attempt logging gap — the transient error message is not printed on the final attempt, making failure logs harder to interpret (inline comment on line 85)
- No regression test — the fix ships without automated coverage; a future refactor could silently reintroduce the fragility (inline comment on line 65)
Positive Highlights
- ✅ Root cause properly addressed:
set +e/set -esandwich correctly isolates the failable command - ✅ Fail-fast on deterministic errors is the right default — avoids masking real bugs with spurious retries
- ✅ Logging on every retry attempt with attempt count and backoff duration is excellent debuggability
- ✅ Post-loop failure gate (
if [ "$pr_list_success" != "true" ]) is a clean safety net - ✅
2>&1redirect ensuresgh's stderr error messages are captured for the grep classifier - ✅ Lock file is auto-recompiled — correct process followed for workflow source changes
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 76.3 AIC · ⌖ 5.72 AIC · ⊞ 6.6K
Comment /matt to run again
| if [ "$pr_list_status" -eq 0 ]; then | ||
| echo "$pr_list_output" > "$candidate_file" | ||
| pr_list_success=true | ||
| elif echo "$pr_list_output" | grep -qE "HTTP (5[0-9]{2}|429)|context deadline exceeded|connection reset|EOF"; then |
There was a problem hiding this comment.
[/diagnosing-bugs] The bare EOF token matches any error message containing that substring (e.g. Unexpected end-of-file, JSON parse errors, script errors), which could cause the code to silently retry what is actually a deterministic failure.
💡 Suggested fix
Anchor EOF to reduce false positives — for example:
elif echo "$pr_list_output" | grep -qE "HTTP (5[0-9]{2}|429)|context deadline exceeded|connection reset|unexpected EOF|read: EOF"; thenThe risk: an error message that incidentally contains EOF would be retried up to 3 more times before failing, wasting ~14s of CI time and obscuring the root cause.
@copilot please address this.
| if [ "$pr_list_status" -eq 0 ]; then | ||
| echo "$pr_list_output" > "$candidate_file" | ||
| pr_list_success=true | ||
| elif echo "$pr_list_output" | grep -qE "HTTP (5[0-9]{2}|429)|context deadline exceeded|connection reset|EOF"; then |
There was a problem hiding this comment.
[/diagnosing-bugs] If gh pr list exits with a non-zero status but produces no output (e.g., hard kill, OOM, runner timeout), $pr_list_output is empty — the grep -qE match fails silently and the code takes the else branch, exiting immediately with a generic error instead of retrying.
💡 Suggested fix
Add an explicit guard for empty output:
elif [ -z "$pr_list_output" ] || echo "$pr_list_output" | grep -qE "HTTP (5[0-9]{2}|429)|context deadline exceeded|connection reset|EOF"; thenAn empty output from a network command is almost always a transient connectivity/timeout issue, not a deterministic error, so it deserves a retry.
@copilot please address this.
| echo "Transient error on attempt $pr_list_attempt/$pr_list_max_attempts (exit $pr_list_status); retrying in ${backoff}s..." >&2 | ||
| echo "$pr_list_output" >&2 | ||
| sleep "$backoff" | ||
| fi |
There was a problem hiding this comment.
[/diagnosing-bugs] When the final attempt hits a transient error, the inner if [ "$pr_list_attempt" -lt "$pr_list_max_attempts" ] is false — so no log message is printed and no sleep is needed. But the fi just falls through, the loop condition fails, and control lands on the post-loop failure message. This is correct but the last transient error output is only shown in the post-loop message, which makes it hard to tell whether the last attempt was transient or not.
💡 Suggested improvement
Log the transient error even on the last attempt so the failure message makes sense in context:
elif echo "$pr_list_output" | grep -qE "..."; then
echo "Transient error on attempt $pr_list_attempt/$pr_list_max_attempts (exit $pr_list_status); ${backoff}s..." >&2
echo "$pr_list_output" >&2
if [ "$pr_list_attempt" -lt "$pr_list_max_attempts" ]; then
backoff=$((2 ** pr_list_attempt))
sleep "$backoff"
fiMove the logging outside the inner if so it fires on every transient hit, including the last.
@copilot please address this.
| pr_list_max_attempts=4 | ||
| pr_list_attempt=0 | ||
| pr_list_success=false | ||
| until [ "$pr_list_success" = "true" ] || [ "$pr_list_attempt" -ge "$pr_list_max_attempts" ]; do |
There was a problem hiding this comment.
[/tdd] This fix directly addresses a production outage (15 consecutive failures over ~8h) but ships with no regression test. A future degradation event could silently break the retry logic with no automated signal.
💡 What to test
Key scenarios that would prevent regressions:
- Transient error → retries up to max_attempts — mock
gh pr listto returnHTTP 502for N attempts then succeed; assertpr_list_success=true. - All attempts exhausted — mock
gh pr listto always returnHTTP 503; assert the script exits non-zero and the error is reported. - Deterministic error fails fast — mock
gh pr listto return exit 1 witherror: HTTP 422; assert exactly one attempt and immediate exit. - Empty output fails fast (edge case from the grep guard) — mock
gh pr listto return exit 1 with empty output.
Even a minimal shell-based test with bats or a mock gh binary would document the expected contract.
@copilot please address this.
There was a problem hiding this comment.
The retry/backoff logic is sound and the problem it solves is real, but the implementation introduced a correctness regression that needs to be fixed before merging.
### Blocking issues
2>&1 corrupts candidate_file on success (HIGH) — gh pr list --json output is captured with stderr merged in. On a successful call any gh warning (upgrade notice, deprecation, etc.) silently turns the written JSON into garbage, breaking all downstream jq processing. Capture stdout and stderr separately; write only stdout to $candidate_file.
Silent final transient attempt (MEDIUM) — On the 4th (final) transient failure the inner if guard short-circuits logging and the error text from the last attempt is lost. When an operator inspects an all-retries-exhausted failure log, they see only the post-loop message with no indication of what the last error actually was. Move the log outside the if, inside the elif.
🔎 Code quality review by PR Code Quality Reviewer · 105.4 AIC · ⌖ 6.68 AIC · ⊞ 5.4K
Comment /review to run again
| pr_list_status=$? | ||
| set -e | ||
| if [ "$pr_list_status" -eq 0 ]; then | ||
| echo "$pr_list_output" > "$candidate_file" |
There was a problem hiding this comment.
2>&1 contaminates candidate_file on success: pr_list_output contains merged stdout+stderr, so any warning gh emits to stderr on a successful run silently corrupts the JSON written to candidate_file.
💡 Suggested fix
Capture stdout and stderr separately so only the clean JSON stdout lands in candidate_file:
set +e
pr_list_stderr_tmp=$(mktemp)
pr_list_output=$(gh pr list --repo "$EXPR_GITHUB_REPOSITORY" \
--state open \
--search "is:pr is:open -is:draft sort:updated-desc" \
--limit 30 \
--json number,title,url,headRefOid,headRefName,updatedAt,author,mergeStateStatus \
2>"$pr_list_stderr_tmp")
pr_list_status=$?
pr_list_stderr=$(cat "$pr_list_stderr_tmp"; rm -f "$pr_list_stderr_tmp")
set -eThen use "${pr_list_output}${pr_list_stderr}" for the transient-error grep, and $pr_list_output alone when writing to $candidate_file.
Why it matters: gh routinely writes upgrade notices, deprecation warnings, and experimental-feature advisories to stderr while still exiting 0. The original code (gh pr list ... > "$candidate_file") was safe because stdout alone went to the file. The new code merges stderr into the captured variable and then writes it unconditionally into the JSON file. Any non-empty stderr turns the JSON into garbage that jq will reject downstream — yet the step shows no error until then.
| echo "Transient error on attempt $pr_list_attempt/$pr_list_max_attempts (exit $pr_list_status); retrying in ${backoff}s..." >&2 | ||
| echo "$pr_list_output" >&2 | ||
| sleep "$backoff" | ||
| fi |
There was a problem hiding this comment.
Silent exit on final transient failure: when the 4th attempt hits a transient error, the inner guard [ "$pr_list_attempt" -lt "$pr_list_max_attempts" ] is false, so the if block is skipped — no log is emitted, no sleep, the loop simply ends, and the post-loop failure handler fires with no diagnostic output showing whether the last error was transient or not.
💡 Suggested fix
Log the final transient failure unconditionally, outside the if:
elif echo "$pr_list_output" | grep -qE "HTTP (5[0-9]{2}|429)|context deadline exceeded|connection reset|EOF"; then
echo "Transient error on attempt $pr_list_attempt/$pr_list_max_attempts (exit $pr_list_status)" >&2
echo "$pr_list_output" >&2
if [ "$pr_list_attempt" -lt "$pr_list_max_attempts" ]; then
backoff=$((2 ** pr_list_attempt))
echo "Retrying in ${backoff}s..." >&2
sleep "$backoff"
fiWithout this, an operator looking at a 4-attempt-exhausted failure sees the post-loop message but zero context about what the last error actually was.
There was a problem hiding this comment.
Review: retry/backoff for Fetch open non-draft PR queue
The fix is well-motivated and the overall structure (bounded retry, exponential backoff, fail-fast on deterministic errors, set +e/set -e hygiene) is correct. Two issues need addressing before merge.
Blocking
-
Observability gap on the last attempt (line 80–85,
pr-sous-chef.md)
When all retries are exhausted via a transient error, the inner guardif [ "$pr_list_attempt" -lt "$pr_list_max_attempts" ]prevents any diagnostic from being emitted. The loop exits silently; only the post-loopiflogs it. This makes it impossible to tell from logs that the final attempt was also transient. -
Bare
EOFpattern is too broad (line 79,pr-sous-chef.md)
EOFmatches any line containing those characters as a substring (e.g.,SIZEOF,PROOF,EOF in configuration). This could silently retry non-transient errors instead of fail-fast. Tighten tounexpected EOFor\bEOF\b.
Non-blocking / positive observations
2>&1redirect into the variable is correct — stderr must be captured to pattern-match on error messages.- Fail-fast
exit 1on deterministic errors (4xx, auth) is the right behaviour. - The retry counter is incremented at the top of the loop so attempt numbers in log messages are human-readable (1-based).
- Lock file is correctly regenerated from the updated source.
Action pins change (756 additions/deletions)
This is a formatting-only change (indentation/style) across both `pkg/actionpins/data/action_pins.json` and `pkg/workflow/data/action_pins.json`. No pin versions were changed. Not reviewed in detail.🧵 Reviewed using Impeccable skills by Impeccable Skills Reviewer · 78.5 AIC · ⌖ 6.1 AIC · ⊞ 4.9K
| if [ "$pr_list_attempt" -lt "$pr_list_max_attempts" ]; then | ||
| backoff=$((2 ** pr_list_attempt)) | ||
| echo "Transient error on attempt $pr_list_attempt/$pr_list_max_attempts (exit $pr_list_status); retrying in ${backoff}s..." >&2 | ||
| echo "$pr_list_output" >&2 | ||
| sleep "$backoff" | ||
| fi |
There was a problem hiding this comment.
Observability gap on the last attempt: when pr_list_attempt equals pr_list_max_attempts, the transient-error branch is entered but the inner if [ "$pr_list_attempt" -lt "$pr_list_max_attempts" ] guard is false, so no diagnostic is logged and no exit occurs. The loop silently ends and the post-loop if catches it. This makes log triage harder because the retry count message and the final error output are separated, with nothing indicating the last attempt was also transient.
Suggestion — always log, only gate the sleep:
elif echo "$pr_list_output" | grep -qE '...'; then
echo "Transient error on attempt $pr_list_attempt/$pr_list_max_attempts (exit $pr_list_status)" >&2
echo "$pr_list_output" >&2
if [ "$pr_list_attempt" -lt "$pr_list_max_attempts" ]; then
backoff=$((2 ** pr_list_attempt))
echo "Retrying in ${backoff}s..." >&2
sleep "$backoff"
fi
fi@copilot please address this.
| if [ "$pr_list_status" -eq 0 ]; then | ||
| echo "$pr_list_output" > "$candidate_file" | ||
| pr_list_success=true | ||
| elif echo "$pr_list_output" | grep -qE "HTTP (5[0-9]{2}|429)|context deadline exceeded|connection reset|EOF"; then |
There was a problem hiding this comment.
Bare EOF match is too broad: the pattern EOF will match any output line containing that substring (e.g., "EOF in configuration", "PROOF", "SIZEOF"). This could cause legitimate non-transient errors to be silently retried rather than fail-fast.
Consider tightening to:
\bEOF\b(word boundary) — avoids substring false positives- or
unexpected EOF— matches only the Go net/http error text that actually indicates a truncated response
@copilot please address this.
|
@copilot please run the
|
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Addressed the outstanding github-actions review feedback in commits |
|
@copilot please run the
After resolving any remaining review feedback, post a concise completion summary on the PR.
|
Handled. There are no remaining code changes needed on this branch: the blocking I re-ran local checks on current HEAD and they pass: I also re-checked prior failing CI logs: the Design Decision Gate failure was infra-only ( No additional code changes were required in this pass. |
🤖 PR Triage — Run §28715668077
|
|
@copilot please run the
|
Checked the current PR state and the blocking |
|
@copilot please run the Blocking reviews to reconcile:
Run: https://github.com/github/gh-aw/actions/runs/28718950400
|
Confirmed on current HEAD ( |
A single transient GraphQL 502 in the
gh pr listcall kills the entire PR Sous Chef run before the agent starts — caused 15 consecutive failures (~8h of 100% red) during a GitHub GraphQL degradation window.Changes
pr-sous-chef.md— wrapsgh pr listin a bounded retry loop: 4 attempts, exponential backoff (2s → 4s → 8s), retrying on transient signals (HTTP 5xx,429,context deadline exceeded,connection reset,EOF); fails fast on deterministic errors (4xx, auth, etc.) so real bugs still surface immediatelypr-sous-chef.lock.yml— recompiled from updated source