Classify the 870s session.idle timeout as a real engine timeout (and stop hard-redding sessions that finished with output) — it silently reds PR Code Quality Reviewer after ~175 AIC is already spent.
This is a new, untracked P1 signature identified in the last 6h window. Grouped under the current open failure-tracking parent #43031; the root cause is unrelated to that report's BYOK-400 cluster.
Problem statement
PR Code Quality Reviewer fails intermittently (mixed pass/fail in the same hour) when the copilot-sdk-driver hits its idle timeout. In the audited run the driver logged:
[sdk-driver] error: Timeout after 870000ms waiting for session.idle
[sdk-driver] warning: SDK idle-timeout with collected output and no pending tool calls — treating as completed
[sdk-driver] session completed: hasOutput=true durationMs=870248
Despite hasOutput=true and "treating as completed", the Execute GitHub Copilot CLI step exits 1 and the run hard-reds. The conclusion job then classifies every known error class as false — including Agentic engine timeout: false — so the failure surfaces as an unclassified hard-red even though the driver plainly reported a timeout. ~175.4 AIC and ~77.5k output tokens are consumed before the red.
Affected workflow & run IDs
- Workflow: PR Code Quality Reviewer (
.github/workflows/pr-code-quality-reviewer.lock.yml)
- Failing step:
Execute GitHub Copilot CLI (agent job)
- Representative (audited, clear 870s timeout): §28703868952
- Same-signature, pattern-matched (not individually audited): §28703807560, §28703040649, §28701059943, §28700970833
- Comparator (nearest success, same workflow): §28702852738 — passed
- Intermittent: 5 failures interleaved with passes in the same window → persistent pattern, not a one-off.
Probable root cause
Large-diff reviews run the agent session past the driver's 870000ms (14.5m) session.idle timeout. Two defects compound:
- Misclassification: the conclusion classifier's
agentic_engine_timeout heuristic does not match the Timeout after <n>ms waiting for session.idle driver message, so timeouts are reported as unclassified reds — corrupting red/false-red accounting and skipping timeout-specific handling.
- Exit code vs. graceful completion: the driver says it collected output and "treats as completed", yet the step still exits 1 — so a session that produced usable output is thrown away as a hard failure.
audit-diff of the failed run vs the nearest success (28702852738) shows no firewall/domain/tooling regression (0 new domains, 0 status changes) — this is purely a duration/idle-timeout issue, not a network or MCP problem.
Proposed remediation
- Add the
Timeout after <n>ms waiting for session.idle driver signature to the timeout classifier so these runs set agentic_engine_timeout=true (correct labeling, proper retry/suppression, accurate accounting).
- When the driver reports
hasOutput=true and "treating as completed" on idle-timeout, propagate a graceful exit for the step rather than exit 1 — don't hard-red a session that produced output.
- Tune the idle timeout (or add streaming keepalives) for large-PR reviews so long-but-healthy reviews don't trip the idle threshold.
Success criteria / verification
- Runs hitting the SDK idle-timeout are classified
agentic_engine_timeout=true in the conclusion outputs.
- PR Code Quality Reviewer no longer emits unclassified hard-reds when the session gracefully completes with output.
- Over the next 6h window, the
session.idle failure signature for PR Code Quality Reviewer drops (runs pass or are correctly classified as timeouts).
References: §28703868952, §28702852738, §28703040649
Related to #43031
Generated by 🔍 [aw] Failure Investigator (6h) · 229.2 AIC · ⌖ 38.4 AIC · ⊞ 5.2K · ◷
Classify the 870s
session.idletimeout as a real engine timeout (and stop hard-redding sessions that finished with output) — it silently reds PR Code Quality Reviewer after ~175 AIC is already spent.This is a new, untracked P1 signature identified in the last 6h window. Grouped under the current open failure-tracking parent #43031; the root cause is unrelated to that report's BYOK-400 cluster.
Problem statement
PR Code Quality Reviewerfails intermittently (mixed pass/fail in the same hour) when thecopilot-sdk-driverhits its idle timeout. In the audited run the driver logged:Despite
hasOutput=trueand "treating as completed", theExecute GitHub Copilot CLIstep exits 1 and the run hard-reds. Theconclusionjob then classifies every known error class asfalse— includingAgentic engine timeout: false— so the failure surfaces as an unclassified hard-red even though the driver plainly reported a timeout. ~175.4 AIC and ~77.5k output tokens are consumed before the red.Affected workflow & run IDs
.github/workflows/pr-code-quality-reviewer.lock.yml)Execute GitHub Copilot CLI(agent job)Probable root cause
Large-diff reviews run the agent session past the driver's 870000ms (14.5m)
session.idletimeout. Two defects compound:agentic_engine_timeoutheuristic does not match theTimeout after <n>ms waiting for session.idledriver message, so timeouts are reported as unclassified reds — corrupting red/false-red accounting and skipping timeout-specific handling.audit-diffof the failed run vs the nearest success (28702852738) shows no firewall/domain/tooling regression (0 new domains, 0 status changes) — this is purely a duration/idle-timeout issue, not a network or MCP problem.Proposed remediation
Timeout after <n>ms waiting for session.idledriver signature to the timeout classifier so these runs setagentic_engine_timeout=true(correct labeling, proper retry/suppression, accurate accounting).hasOutput=trueand "treating as completed" on idle-timeout, propagate a graceful exit for the step rather than exit 1 — don't hard-red a session that produced output.Success criteria / verification
agentic_engine_timeout=truein theconclusionoutputs.session.idlefailure signature for PR Code Quality Reviewer drops (runs pass or are correctly classified as timeouts).References: §28703868952, §28702852738, §28703040649
Related to #43031