[audit-workflows] Daily Workflow Audit 2026-06-08: 🔴 Active copilot PAT-400 incident (88%→58%) #37950
Closed
Smoke cave bot here. Me poke tools. Fire still warm.
Warning: Firewall blocked 6 domains
The following domains were blocked by the firewall during workflow execution:
network:
  allowed:
    - defaults
    - "accounts.google.com"
    - "android.clients.google.com"
    - "clients2.google.com"
    - "contentautofill.googleapis.com"
    - "safebrowsingohttpgateway.googleapis.com"
    - "www.google.com"
See Network Configuration for more information.
This discussion has been marked as outdated by Agentic Workflow Audit Agent. A newer discussion is available at Discussion #38221.
Overview
🔴 ACTIVE PRODUCTION INCIDENT. A new copilot auth-token failure (PAT-not-supported-400) began at 2026-06-08T19:40Z and was still firing in the most recent runs (21:37Z). It collapsed fleet success from a healthy 88.0% (pre-incident) to 57.9% (post-incident). ~15 runs failed identically, including production-main schedules. An in-flight fix branch, pelikhan/fix-pat-400-retry, was observed, so the team is already aware.
Window: 2026-06-08T18:21–21:40Z (~3.3h evening cluster, partial: the logs MCP tool timed out at the 120s cap; 93 of 260 run dirs had full summaries). EffectiveTokens came back 0/unpopulated for every run this window, so effective-token cap proximity could not be assessed.
Summary
🔴 Critical: copilot PAT-not-supported-400 (NEW, active)
Both the copilot-harness and copilot-sdk-driver token-check paths POST a Personal Access Token to an endpoint that now rejects it.
The run exits 1 with turns=0 (agent never starts). Onset 19:40:19Z, continuous through the latest observed run (21:37Z). Affected prod-main schedules: Daily Ambient Context Optimizer, Contribution Check, Daily Project Performance Summary Generator. Note that these reddened via PAT-400, not their usual classes (token-429 / safe-output). The incident also hit many PRs/smokes: Matt Pocock Skills Reviewer ×3, PR Code Quality Reviewer ×3, PR Description Updater ×2, Agent Container Smoke Test, Smoke Copilot ×2, Daily Max AI Credits Test.
Recommendation (CRITICAL): Land/expedite pelikhan/fix-pat-400-retry or roll back the token-source change for the copilot token-check path; confirm whether the endpoint or the credential changed. Add a harness smoke assertion that the token-check uses a credential type the endpoint accepts.
Trend Charts
The 30-day health view shows a sharp spike in failures (22, highest since the 05-23 outlier) and the success rate dropping to 75%, entirely attributable to the evening PAT-400 incident; the day was tracking at 88% before 19:40Z.
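As a sanity check on how the pre-incident (88.0%), post-incident (57.9%), and whole-window (75%) figures relate, the arithmetic can be sketched as below. The report gives only percentages and the failure total (22); the per-window run counts used here are assumptions chosen because they reproduce those figures, not values read from the run logs.

```python
def success_rate(successes: int, total: int) -> float:
    """Return the success rate as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def combine(*windows: tuple[int, int]) -> tuple[int, int]:
    """Sum (successes, total) pairs from several windows."""
    return (sum(w[0] for w in windows), sum(w[1] for w in windows))


# ASSUMED (successes, total) counts, hypothetical and for illustration only.
# They were picked to match the reported percentages; the real split may differ.
PRE_INCIDENT = (44, 50)    # runs before 19:40Z
POST_INCIDENT = (22, 38)   # runs from 19:40Z onward

day_successes, day_total = combine(PRE_INCIDENT, POST_INCIDENT)

print(f"pre-incident:  {success_rate(*PRE_INCIDENT):.1f}%")              # 88.0%
print(f"post-incident: {success_rate(*POST_INCIDENT):.1f}%")             # 57.9%
print(f"whole window:  {success_rate(day_successes, day_total):.1f}%")   # 75.0%
print(f"failures:      {day_total - day_successes}")                     # 22
```

Under these assumed counts, the 22 failures split into 6 before the incident and 16 after it, which is in line with the ~15 identical PAT-400 failures plus the 7 runs in other failure classes.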
Token usage (24.4M) sits below the 7-day moving average and continues a two-week downward drift. Note that this is a partial 3.3h window plus many turns=0 failed runs that consumed zero tokens, so true daily usage is understated.
Other failure classes (7 runs)
awf-cli-proxy container errored at startup → DIFC proxy liveness probe to localhost:18443 refused → agent never starts (turns=0). Workflows: PR Description Updater (27158226210), Issue Monster (27167452907). Both pre-incident (18:xx); looks like a separate infra flake. Rec: bounded retry/health-wait on the probe + surface proxy startup logs.
Timeout 870000ms waiting for session.idle → exit 1, discarding the work and reddening the job. Rec: treat "output already emitted + idle-timeout" as success-with-warning.
permission denied: read(compiler_safe_outputs_config_test.go).
EACCES /tmp scandir), Smoke Copilot-AOAI ×2 (o4-mini-aw, transient_bad_request retries; possibly PAT-400-adjacent).
Positives & deltas vs prior days
Top cost & data-quality notes
Top cost (claude-measured): [aw] Failure Investigator (6h) $3.21 · Daily Code Metrics and Trend Tracking Agent $2.12 · Daily Safe Output Tool Optimizer $2.09; all successful.
Data quality: (1) logs tool timed out at 120s again, so only the newest ~93/260 runs had full summaries (~18h likely unobserved). (2) EffectiveTokens field was 0 for all runs this window. (3) Cost is claude-only; copilot/codex report $0.
Recommendations
Land/expedite pelikhan/fix-pat-400-retry (or roll back the token change) for the copilot token-check path; add a smoke assertion on the token/credential type.
Add a bounded retry/health-wait to the awf-cli-proxy DIFC liveness probe + surface its startup logs.
Treat "output already emitted + idle-timeout" as success-with-warning on a session.idle timeout.
References: §27164602401 (Ambient/PAT-400) · §27160990743 (sdk idle-timeout) · §27158226210 (cli-proxy)
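The smoke assertion on the token/credential type recommended above could look roughly like the sketch below. It classifies a token by GitHub's documented prefixes and fails fast when the token-check would send a Personal Access Token. The names classify_token, check_token_kind, and REJECTED_KINDS are hypothetical, and treating both classic and fine-grained PATs as rejected is an assumption; the report only says the endpoint rejects a PAT.

```python
# Documented GitHub token prefixes mapped to a short kind label.
TOKEN_PREFIXES = {
    "github_pat_": "fine_grained_pat",
    "ghp_": "classic_pat",
    "gho_": "oauth",
    "ghu_": "app_user_to_server",
    "ghs_": "app_server_to_server",
    "ghr_": "refresh",
}

# ASSUMPTION: both PAT flavours are rejected by the token-check endpoint.
REJECTED_KINDS = {"classic_pat", "fine_grained_pat"}


def classify_token(token: str) -> str:
    """Return the token kind based on its prefix, or "unknown"."""
    for prefix, kind in TOKEN_PREFIXES.items():
        if token.startswith(prefix):
            return kind
    return "unknown"


def check_token_kind(token: str) -> None:
    """Raise ValueError if the token is of a kind the endpoint rejects.

    The token value itself is never included in the error message.
    """
    kind = classify_token(token)
    if kind in REJECTED_KINDS:
        raise ValueError(
            f"token-check would send a {kind}; "
            "expect PAT-not-supported-400 from the endpoint"
        )
```

A harness smoke test would call check_token_kind on whatever credential the token-check path resolves, before any request is made, so a misconfigured token source fails in the smoke run instead of reddening scheduled workflows with turns=0.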