Skip to content

[aw-failures] Daily Cache Strategy Analyzer ~50% red — Codex pinned to alpha snapshot gpt-5-codex-alpha-2025-11-07 that 404s on [Content truncated due to length] #41787

Description

@github-actions

Recommendation

Stop pinning the preview Codex snapshot gpt-5-codex-alpha-2025-11-07 — it intermittently 404s on the AIC api-proxy and wholesale-fails Daily Cache Strategy Analyzer ~50% of runs. Pin a stable/GA Codex model (or add a fallback) and make the harness fail fast on 404 model not found instead of burning 8 retries on zero useful work.

Problem statement

Daily Cache Strategy Analyzer (Codex CLI engine) fails because every sampling request to the configured model gpt-5-codex-alpha-2025-11-07 returns 404 Not Found: Model not found. All 5 in-CLI retries and all 3 harness retries are exhausted, the turn fails before any tool runs, and the job exits 1. The run does zero useful work yet burns ~7 minutes of CI.

Affected workflows & runs

This is a single-workflow, recurring (~50%) failure that is untracked by any open agentic-workflows issue.

Run Workflow Created (UTC) Conclusion
§28259048576 (representative) Daily Cache Strategy Analyzer 2026-06-26T18:58 failure (404 model-not-found)
§28193778404 (comparator) Daily Cache Strategy Analyzer 2026-06-25T19:02 success
§28050001864 Daily Cache Strategy Analyzer 2026-06-23T19:03 failure
§27914144734 Daily Cache Strategy Analyzer 2026-06-21T18:52 failure
§27843593919 Daily Cache Strategy Analyzer 2026-06-19T19:00 failure

Failure cadence is roughly every other day (6/19, 6/21, 6/23, 6/26 failed; 6/20, 6/22, 6/24, 6/25 green), consistent with the alpha snapshot being intermittently absent from the proxy.

Probable root cause

The workflow targets a preview/alpha Codex snapshot (gpt-5-codex-alpha-2025-11-07) that is not reliably served by the AIC api-proxy ((172.30.0.30/redacted) When the proxy lacks that snapshot, *every* /responses call 404s, so there is no degraded mode — the entire run fails. The Codex harness retries the sampling request 5× then retries the whole process 3× (all 3 retries exhausted — giving up`), none of which can succeed because the model name itself is unresolvable.

audit-diff (base 28259048576 failed vs compare 28193778404 success) confirms this is not a firewall or agent-logic problem:

  • Firewall: has_anomalies=false, new/removed_domain_count=0 in both runs.
  • Failed run: 0 model requests, 0 input/output tokens — the agent never completed a single sampling call.
  • Success run: 1,251,039 input / 19,430 output tokens, 28.3 AIC, 12m34s.

The failed run is a pure model-availability 404, distinct from the BYOK auth (#41195), Copilot false-red exit-1 (#41636), and managed-gateway firewall (#41455/#41711) families.

Evidence — error tail (§28259048576) + audit-diff deltas
{"type":"error","message":"unexpected status 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07, url: (172.30.0.30/redacted) ..."}
{"type":"turn.failed","error":{"message":"unexpected status 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07, ..."}}
[codex-harness] attempt 4 failed: exitCode=1 ... isInvalidModelError=false retriesRemaining=0
[codex-harness] all 3 retries exhausted — giving up (exitCode=1)
##[error]Process completed with exit code 1.
audit-diff (base=28259048576 failed vs compare=28193778404 success):
  firewall_diff.summary: has_anomalies=false, new_domain_count=0, removed_domain_count=0
  run1 (failed): total_requests=0, input_tokens=0, output_tokens=0, duration=7m0s
  run2 (success): input_tokens=1,251,039, output_tokens=19,430, aic=28.302, duration=12m34s

Note: the harness logs isInvalidModelError=false despite the 404 being exactly an invalid/missing model — the classifier does not recognize this 404 shape, so it retries instead of failing fast.

Proposed remediation

  1. Repin the model. Change Daily Cache Strategy Analyzer from gpt-5-codex-alpha-2025-11-07 to a stable/GA Codex model that the api-proxy serves reliably, or configure a fallback model the harness tries when the primary 404s.
  2. Fail fast on model-not-found. Teach the Codex harness retry classifier to treat 404 ... Model not found <name> as isInvalidModelError=true (terminal), so it surfaces a clear config error immediately instead of consuming 5 sampling retries + 3 process retries (~7m) that cannot succeed.
  3. Guard against decommissioned snapshots. Add a preflight/validation step (or audit anomaly) that flags when a workflow pins an -alpha-/preview model snapshot, so removals are caught before they cause recurring red runs.

Success criteria / verification

  • Daily Cache Strategy Analyzer returns to green on its next scheduled cycle, or fails with a real, distinct signature.
  • A 404 model-not-found terminates the run within seconds with an explicit "model not available" message rather than after ~7m of retries.
  • No open agentic-workflows issue, and no future run, shows the gpt-5-codex-alpha-2025-11-07 404 signature.

Relationship to existing issues

New, standalone signature — no open agentic-workflows issue covers Codex model availability / 404 model-not-found. Distinct from the Copilot BYOK 403 (#41195), Copilot CLI false-red exit-1 (#41636), and managed-gateway firewall (#41455, #41711) families, all of which involve a reachable model and at least partial agent execution; here the model name is unresolvable and the agent never samples.

Generated by 🔍 [aw] Failure Investigator (6h) · 213.8 AIC · ⌖ 34.5 AIC · ⊞ 5.7K ·

  • expires on Jul 3, 2026, 11:33 AM UTC-08:00

Metadata

Metadata

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions