Recommendation
Stop pinning the preview Codex snapshot gpt-5-codex-alpha-2025-11-07 — it intermittently 404s on the AIC api-proxy and wholesale-fails Daily Cache Strategy Analyzer ~50% of runs. Pin a stable/GA Codex model (or add a fallback) and make the harness fail fast on 404 model not found instead of burning 8 retries on zero useful work.
Problem statement
Daily Cache Strategy Analyzer (Codex CLI engine) fails because every sampling request to the configured model gpt-5-codex-alpha-2025-11-07 returns 404 Not Found: Model not found. All 5 in-CLI retries and all 3 harness retries are exhausted, the turn fails before any tool runs, and the job exits 1. The run does zero useful work yet burns ~7 minutes of CI.
Affected workflows & runs
This is a single-workflow, recurring (~50%) failure that is untracked by any open agentic-workflows issue.
| Run |
Workflow |
Created (UTC) |
Conclusion |
| §28259048576 (representative) |
Daily Cache Strategy Analyzer |
2026-06-26T18:58 |
failure (404 model-not-found) |
| §28193778404 (comparator) |
Daily Cache Strategy Analyzer |
2026-06-25T19:02 |
success |
| §28050001864 |
Daily Cache Strategy Analyzer |
2026-06-23T19:03 |
failure |
| §27914144734 |
Daily Cache Strategy Analyzer |
2026-06-21T18:52 |
failure |
| §27843593919 |
Daily Cache Strategy Analyzer |
2026-06-19T19:00 |
failure |
Failure cadence is roughly every other day (6/19, 6/21, 6/23, 6/26 failed; 6/20, 6/22, 6/24, 6/25 green), consistent with the alpha snapshot being intermittently absent from the proxy.
Probable root cause
The workflow targets a preview/alpha Codex snapshot (gpt-5-codex-alpha-2025-11-07) that is not reliably served by the AIC api-proxy ((172.30.0.30/redacted) When the proxy lacks that snapshot, *every* /responses call 404s, so there is no degraded mode — the entire run fails. The Codex harness retries the sampling request 5× then retries the whole process 3× (all 3 retries exhausted — giving up`), none of which can succeed because the model name itself is unresolvable.
audit-diff (base 28259048576 failed vs compare 28193778404 success) confirms this is not a firewall or agent-logic problem:
- Firewall:
has_anomalies=false, new/removed_domain_count=0 in both runs.
- Failed run: 0 model requests, 0 input/output tokens — the agent never completed a single sampling call.
- Success run: 1,251,039 input / 19,430 output tokens, 28.3 AIC, 12m34s.
The failed run is a pure model-availability 404, distinct from the BYOK auth (#41195), Copilot false-red exit-1 (#41636), and managed-gateway firewall (#41455/#41711) families.
Evidence — error tail (§28259048576) + audit-diff deltas
{"type":"error","message":"unexpected status 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07, url: (172.30.0.30/redacted) ..."}
{"type":"turn.failed","error":{"message":"unexpected status 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07, ..."}}
[codex-harness] attempt 4 failed: exitCode=1 ... isInvalidModelError=false retriesRemaining=0
[codex-harness] all 3 retries exhausted — giving up (exitCode=1)
##[error]Process completed with exit code 1.
audit-diff (base=28259048576 failed vs compare=28193778404 success):
firewall_diff.summary: has_anomalies=false, new_domain_count=0, removed_domain_count=0
run1 (failed): total_requests=0, input_tokens=0, output_tokens=0, duration=7m0s
run2 (success): input_tokens=1,251,039, output_tokens=19,430, aic=28.302, duration=12m34s
Note: the harness logs isInvalidModelError=false despite the 404 being exactly an invalid/missing model — the classifier does not recognize this 404 shape, so it retries instead of failing fast.
Proposed remediation
- Repin the model. Change
Daily Cache Strategy Analyzer from gpt-5-codex-alpha-2025-11-07 to a stable/GA Codex model that the api-proxy serves reliably, or configure a fallback model the harness tries when the primary 404s.
- Fail fast on model-not-found. Teach the Codex harness retry classifier to treat
404 ... Model not found <name> as isInvalidModelError=true (terminal), so it surfaces a clear config error immediately instead of consuming 5 sampling retries + 3 process retries (~7m) that cannot succeed.
- Guard against decommissioned snapshots. Add a preflight/validation step (or audit anomaly) that flags when a workflow pins an
-alpha-/preview model snapshot, so removals are caught before they cause recurring red runs.
Success criteria / verification
Daily Cache Strategy Analyzer returns to green on its next scheduled cycle, or fails with a real, distinct signature.
- A 404 model-not-found terminates the run within seconds with an explicit "model not available" message rather than after ~7m of retries.
- No open
agentic-workflows issue, and no future run, shows the gpt-5-codex-alpha-2025-11-07 404 signature.
Relationship to existing issues
New, standalone signature — no open agentic-workflows issue covers Codex model availability / 404 model-not-found. Distinct from the Copilot BYOK 403 (#41195), Copilot CLI false-red exit-1 (#41636), and managed-gateway firewall (#41455, #41711) families, all of which involve a reachable model and at least partial agent execution; here the model name is unresolvable and the agent never samples.
Generated by 🔍 [aw] Failure Investigator (6h) · 213.8 AIC · ⌖ 34.5 AIC · ⊞ 5.7K · ◷
Recommendation
Stop pinning the preview Codex snapshot
gpt-5-codex-alpha-2025-11-07— it intermittently 404s on the AIC api-proxy and wholesale-failsDaily Cache Strategy Analyzer~50% of runs. Pin a stable/GA Codex model (or add a fallback) and make the harness fail fast on404 model not foundinstead of burning 8 retries on zero useful work.Problem statement
Daily Cache Strategy Analyzer(Codex CLI engine) fails because every sampling request to the configured modelgpt-5-codex-alpha-2025-11-07returns404 Not Found: Model not found. All 5 in-CLI retries and all 3 harness retries are exhausted, the turn fails before any tool runs, and the job exits 1. The run does zero useful work yet burns ~7 minutes of CI.Affected workflows & runs
This is a single-workflow, recurring (~50%) failure that is untracked by any open
agentic-workflowsissue.Failure cadence is roughly every other day (6/19, 6/21, 6/23, 6/26 failed; 6/20, 6/22, 6/24, 6/25 green), consistent with the alpha snapshot being intermittently absent from the proxy.
Probable root cause
The workflow targets a preview/alpha Codex snapshot (
gpt-5-codex-alpha-2025-11-07) that is not reliably served by the AIC api-proxy ((172.30.0.30/redacted) When the proxy lacks that snapshot, *every*/responsescall 404s, so there is no degraded mode — the entire run fails. The Codex harness retries the sampling request 5× then retries the whole process 3× (all 3 retries exhausted — giving up`), none of which can succeed because the model name itself is unresolvable.audit-diff(base28259048576failed vs compare28193778404success) confirms this is not a firewall or agent-logic problem:has_anomalies=false,new/removed_domain_count=0in both runs.The failed run is a pure model-availability 404, distinct from the BYOK auth (#41195), Copilot false-red exit-1 (#41636), and managed-gateway firewall (#41455/#41711) families.
Evidence — error tail (§28259048576) + audit-diff deltas
Note: the harness logs
isInvalidModelError=falsedespite the 404 being exactly an invalid/missing model — the classifier does not recognize this 404 shape, so it retries instead of failing fast.Proposed remediation
Daily Cache Strategy Analyzerfromgpt-5-codex-alpha-2025-11-07to a stable/GA Codex model that the api-proxy serves reliably, or configure a fallback model the harness tries when the primary 404s.404 ... Model not found <name>asisInvalidModelError=true(terminal), so it surfaces a clear config error immediately instead of consuming 5 sampling retries + 3 process retries (~7m) that cannot succeed.-alpha-/preview model snapshot, so removals are caught before they cause recurring red runs.Success criteria / verification
Daily Cache Strategy Analyzerreturns to green on its next scheduled cycle, or fails with a real, distinct signature.agentic-workflowsissue, and no future run, shows thegpt-5-codex-alpha-2025-11-07404 signature.Relationship to existing issues
New, standalone signature — no open
agentic-workflowsissue covers Codex model availability / 404 model-not-found. Distinct from the Copilot BYOK 403 (#41195), Copilot CLI false-red exit-1 (#41636), and managed-gateway firewall (#41455, #41711) families, all of which involve a reachable model and at least partial agent execution; here the model name is unresolvable and the agent never samples.