fix(research): per-sector-team S3 persistence + 429 backoff + isolation — re-runs never re-pay completed teams by cipher813 · Pull Request #194 · cipher813/alpha-engine-research

cipher813 · 2026-05-16T16:56:29Z

Diagnosed mechanism

The 2026-05-16 Saturday SF recovery run failed: the Research Lambda returned status:ERROR with "sector team(s) failed: defensives/financials/technology: RateLimitError 429 — org rate limit of 450,000 input tokens/min, claude-haiku-4-5".

graph/research_graph.py build_graph() compiles the LangGraph with NO checkpointer → the Lambda runs it stateless → an SF re-run re-dispatches all 6 sector teams via Send() and re-pays every Haiku call (quant→qual→peer_review per team).
The 6-team parallel Send() fan-out bursts over the org's 450K Haiku input-TPM ceiling → RateLimitError 429.
score_aggregator hard-failed the WHOLE run if ANY team carried an error (429 = error, not the tolerated partial). Successful teams' outputs lived only in in-memory graph state and were discarded on the ERROR return — nothing persisted them for a re-run to reuse.

The 3-part fix

A. 429-aware retry/backoff. New shared invoke_with_rate_limit_retry in agents/langchain_utils.py — catches anthropic.RateLimitError / 429 APIStatusError, honors the retry-after response header when present, otherwise jittered exponential backoff with an attempt cap. Non-429 errors propagate immediately and unchanged (strict-mode / partial / isolation contracts preserved). Wrapped around every sector-team Haiku .invoke(): quant ReAct + structured extract, qual ReAct + structured extract, peer_review (quant-addition / Pass-1 selection / Pass-2 rationale), and the #193 held-thesis include_raw=True call. Explicit max_retries=SECTOR_TEAM_LLM_MAX_RETRIES (8) on every ChatAnthropic(...) constructor — langchain's default of 2 is insufficient for a sustained org 429.
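The retry behavior in part A can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in agents/langchain_utils.py: `FakeRateLimitError`, `retry_on_rate_limit`, and the default values of `max_attempts`, `base_delay`, and `max_delay` are hypothetical stand-ins. The real helper catches anthropic.RateLimitError and reads the retry-after response header; here the header value is modeled as a plain attribute.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for a 429 error carrying an optional retry-after hint."""

    def __init__(self, retry_after: Optional[float] = None) -> None:
        super().__init__("429 rate limited")
        self.retry_after = retry_after


def retry_on_rate_limit(
    call: Callable[[], T],
    max_attempts: int = 6,       # assumed cap, for illustration only
    base_delay: float = 1.0,     # assumed
    max_delay: float = 60.0,     # assumed
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on 429 only; every other exception propagates unchanged."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except FakeRateLimitError as exc:
            if attempt == max_attempts:
                raise  # attempt cap reached: surface the 429 to the caller
            if exc.retry_after is not None:
                delay = exc.retry_after  # honor the server's hint when present
            else:
                # Full-jitter exponential backoff: uniform in [0, capped 2^n].
                ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
                delay = random.uniform(0, ceiling)
            sleep(delay)
    raise AssertionError("unreachable")


# Usage: two 429s with a retry-after hint, then success.
calls = {"n": 0}
slept = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimitError(retry_after=0.5)
    return "ok"


result = retry_on_rate_limit(flaky, sleep=slept.append)
```

Injecting `sleep` keeps the sketch testable without real waits; jitter matters here because six teams hitting the same org ceiling would otherwise retry in lockstep.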

B. Per-sector-team S3 persistence on success. ArchiveManager.save_sector_team_run / load_sector_team_run, deterministic key archive/sector_team_runs/{run_date}/{team_id}.json (matches the existing archive/{category}/... prefix convention, same _s3_put/_s3_get/bucket). sector_team_node persists each team's full output the moment it finishes without an error — before any other team can fail and ERROR the overall run. Errored teams are NOT persisted so a re-run gets a fresh attempt at them.
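As a rough sketch of part B, the deterministic key and the persist-on-success rule might look like this. `InMemoryArchive` is a hypothetical dict-backed stand-in for the S3-backed ArchiveManager, `persist_if_succeeded` is an illustrative helper name, and the `error` field on the team output is an assumed shape; only the key format and the save/load method names come from the description above.

```python
import json
from typing import Dict, Optional


def sector_team_key(run_date: str, team_id: str) -> str:
    """Deterministic key, following the format given in the PR description."""
    return f"archive/sector_team_runs/{run_date}/{team_id}.json"


class InMemoryArchive:
    """Hypothetical stand-in: a dict plays the role of the S3 bucket."""

    def __init__(self) -> None:
        self.objects: Dict[str, str] = {}

    def save_sector_team_run(self, run_date: str, team_id: str, output: dict) -> None:
        self.objects[sector_team_key(run_date, team_id)] = json.dumps(output)

    def load_sector_team_run(self, run_date: str, team_id: str) -> Optional[dict]:
        raw = self.objects.get(sector_team_key(run_date, team_id))
        return None if raw is None else json.loads(raw)


def persist_if_succeeded(
    archive: InMemoryArchive, run_date: str, team_id: str, output: dict
) -> bool:
    """Persist only error-free outputs so a re-run retries errored teams."""
    if output.get("error") is not None:
        return False
    archive.save_sector_team_run(run_date, team_id, output)
    return True


# Usage: one team succeeds and is saved; one fails and is left for the re-run.
archive = InMemoryArchive()
saved_ok = persist_if_succeeded(
    archive, "2026-05-16", "technology", {"error": None, "picks": []}
)
saved_err = persist_if_succeeded(
    archive, "2026-05-16", "financials", {"error": "RateLimitError 429"}
)
```

Because the key depends only on (run_date, team_id), a later invocation can locate a completed team's output without any shared in-memory state.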

C. Resume + isolation.

Resume: sector_team_node checks S3 for (run_date, team_id) before any LLM/tool work; on a present, well-formed, run_date-matched persisted output it loads it and short-circuits the team — zero Haiku calls — feeding it into sector_team_outputs exactly as a fresh run would.
Isolation: score_aggregator no longer hard-fails on a single team error. Failed teams are tolerated exactly like the existing partial_teams philosophy; the run only ERRORs when every team is failed-or-partial AND zero recommendations survive ("nothing for CIO to rank").
Idempotency/safety: persisted output is tied to run_date (a new run_date never reuses a prior date's teams); stale/corrupt/cross-wired JSON logs a structured warning and falls back to re-running that team, never crashes.
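The three rules in part C can be sketched together as below. This is an illustration under assumptions, not the actual sector_team_node or score_aggregator code: the function names, the dict-backed `store`, and the `run_date`, `team_id`, `error`, `partial`, and `recommendations` field names are all hypothetical.

```python
import json
import logging
from typing import Callable, Dict, List, Optional

log = logging.getLogger("sector_team_resume_sketch")


def load_valid_run(raw: Optional[str], run_date: str, team_id: str) -> Optional[dict]:
    """Return persisted output only if well-formed and identity-matched."""
    if raw is None:
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        log.warning("corrupt persisted run %s/%s; re-running team", run_date, team_id)
        return None
    if (
        not isinstance(data, dict)
        or data.get("run_date") != run_date
        or data.get("team_id") != team_id
    ):
        log.warning("stale or cross-wired run %s/%s; re-running team", run_date, team_id)
        return None
    return data


def run_team(
    store: Dict[str, str], run_date: str, team_id: str, execute: Callable[[], dict]
) -> dict:
    """Resume from the store when possible; otherwise execute and persist on success."""
    key = f"archive/sector_team_runs/{run_date}/{team_id}.json"
    cached = load_valid_run(store.get(key), run_date, team_id)
    if cached is not None:
        return cached  # short-circuit: `execute` (the LLM work) is never called
    record = {**execute(), "run_date": run_date, "team_id": team_id}
    if record.get("error") is None:
        store[key] = json.dumps(record)
    return record


def run_should_error(outputs: List[dict]) -> bool:
    """Isolation rule: ERROR only if every team is failed-or-partial AND no recs survive."""
    all_bad = all(o.get("error") is not None or o.get("partial") for o in outputs)
    no_recs = not any(o.get("recommendations") for o in outputs)
    return all_bad and no_recs


# Usage: the second invocation for the same run_date makes zero "LLM" calls.
store: Dict[str, str] = {}
llm_calls: List[int] = []


def execute() -> dict:
    llm_calls.append(1)
    return {"error": None, "recommendations": ["AAA"]}  # "AAA" is a placeholder


first = run_team(store, "2026-05-16", "technology", execute)
second = run_team(store, "2026-05-16", "technology", execute)
```

Embedding run_date and team_id in the persisted record, rather than trusting the key alone, is what lets the loader reject a cross-wired object and fall back to re-running the team.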

Load-bearing invariant

A re-invocation must NEVER re-pay a sector team that already succeeded for that run_date. Persist-on-success happens before any other team can fail, so an SF re-run reuses the completed teams from S3 and only re-executes the previously-failed ones.

Tests

+16 in tests/test_sector_team_persist_backoff.py (429-then-success on backoff; retry-after honored; non-429 propagation; persist to expected S3 key; round-trip load; run-date isolation; corrupt/identity-mismatch fallback; resume short-circuit asserting zero LLM calls; errored-team-not-persisted) and tests/test_score_aggregator_failure.py reframed to the new isolation contract, including the exact 2026-05-16 multi-team-429 regression shape. Full suite: 1283 passed, 1 pre-existing unrelated test_scoring.py::TestRSIScoring::test_bull_overbought_matches_neutral_post_revert failure (known stale-local-config artifact, passes on CI — left untouched per scope).

DEPLOY IS HELD — Research is Lambda-deploy-gated; do not merge/deploy until the user directs. After merge the Research Lambda must be redeployed before this fix is live.

🤖 Generated with Claude Code

…on — re-runs never re-pay completed teams

The 2026-05-16 Saturday SF recovery run failed: Research Lambda returned status:ERROR with "sector team(s) failed: defensives/financials/technology: RateLimitError 429 — org rate limit of 450,000 input tokens/min".

Mechanism: build_graph() compiles the LangGraph with NO checkpointer, so the Lambda runs it stateless: an SF re-run re-dispatches all 6 sector teams via Send() and re-pays every Haiku call. The 6-team parallel fan-out bursts over the org's 450K Haiku input-TPM ceiling, and score_aggregator hard-failed the WHOLE run if ANY team carried an error, discarding the successful teams (in-memory graph state only).

Three coordinated parts (load-bearing invariant: a re-invocation must NEVER re-pay a sector team that already succeeded for that run_date):

A. 429-aware retry/backoff. New shared invoke_with_rate_limit_retry in agents/langchain_utils.py honors the retry-after response header, falls back to jittered exponential backoff, caps attempts, and propagates non-429 errors unchanged. Wrapped around every sector-team Haiku .invoke() (quant ReAct + extract, qual ReAct + extract, peer_review ×3, held-thesis structured). Explicit max_retries on every ChatAnthropic constructor (langchain default of 2 is insufficient for sustained org 429).

B. Per-sector-team S3 persistence. ArchiveManager.save_sector_team_run / load_sector_team_run, key archive/sector_team_runs/{run_date}/{team_id}.json. sector_team_node persists each team the moment it succeeds (before any other team can fail and ERROR the run); errored teams are NOT persisted so a re-run retries them.

C. Resume + isolation. sector_team_node checks S3 before any LLM work and short-circuits on a present, well-formed, run_date-matched persisted output (zero Haiku calls). score_aggregator no longer hard-fails on a single team error: failed teams are tolerated like partials; the run only ERRORs when every team is failed-or-partial AND zero recommendations survive.

Stale/corrupt/cross-wired persisted JSON falls back to re-run (structured warning), never crashes.

Tests: +16 in test_sector_team_persist_backoff.py (429-then-success, retry-after honored, non-429 propagation, persist key, round-trip, run-date isolation, corrupt/mismatch fallback, resume zero-LLM-calls, errored-not-persisted) + reframed test_score_aggregator_failure.py to the new isolation contract incl. the exact 2026-05-16 multi-team-429 regression. Full suite: 1283 passed, 1 pre-existing unrelated test_scoring RSI failure (known stale-local-config, passes on CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…arantine (reworks #193/#194) (#195)

Authoritative directive (Brian, 2026-05-16), superseding #193/#194's degrade-and-continue philosophy: "If the sector agents don't run, Research shouldn't complete until all sectors are run. We should have a long retry mechanism and after this long period if we still don't have all sectors it should fail. We don't get anything from this process if the sectors, or any other agent for that matter, fail/don't run."

Four behavior changes:

1. Long 429 retry. invoke_with_rate_limit_retry is now an overall wall-clock DEADLINE (default RATE_LIMIT_RETRY_DEADLINE_SECONDS = 75 min, env-overridable, clamped 5 min .. 3 hr) of persistent 429 retry with capped expo backoff, honoring retry-after, NOT a fixed ~6-attempt cap. Non-429 errors still propagate immediately.

2. All-agents hard-fail. score_aggregator raises (-> handler status:ERROR, NO signals.json / email / DB write) if ANY sector team is missing (absent from ALL_TEAM_IDS), failed, or partial. CIO + macro economist + macro critic LLM calls wrapped in the same deadline helper with explicit max_retries; their failure hard-fails the run (strict-mode default raise), with no synthetic substitute promoted.

3. Removed #193 carry-forward. _update_thesis_for_held_stock no longer carries the prior thesis forward on persistent failure: a bounded parse re-roll for the transient tool-XML schema leak, then RAISE (no isolation fallback). A 429 past the deadline fails fast.

4. Reverted #194 isolation. A failed/partial sector team aborts the whole run even when other teams produced usable picks.

KEPT from #194 (composes with the directive): per-team S3 persistence + sector_team_node resume short-circuit.
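Behavior change 1 in the #195 message, a wall-clock deadline rather than an attempt cap, might be sketched as below. The 75 min default and the 5 min .. 3 hr clamp come from the message; everything else is an assumption: `FakeRateLimitError`, `resolve_deadline_seconds`, `retry_until_deadline`, the backoff constants, and in particular the use of RATE_LIMIT_RETRY_DEADLINE_SECONDS as the environment variable name (the message says only that the value is env-overridable).

```python
import os
import random
import time
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")

DEFAULT_DEADLINE_SECONDS = 75 * 60   # default from the commit message
MIN_DEADLINE_SECONDS = 5 * 60        # lower clamp from the commit message
MAX_DEADLINE_SECONDS = 3 * 60 * 60   # upper clamp from the commit message


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for a 429 error with an optional retry-after hint."""

    def __init__(self, retry_after: Optional[float] = None) -> None:
        super().__init__("429 rate limited")
        self.retry_after = retry_after


def resolve_deadline_seconds(env: Mapping[str, str] = os.environ) -> float:
    """Read the override (env var name is assumed) and clamp it to the allowed range."""
    raw = env.get("RATE_LIMIT_RETRY_DEADLINE_SECONDS")
    try:
        value = float(raw) if raw is not None else float(DEFAULT_DEADLINE_SECONDS)
    except ValueError:
        value = float(DEFAULT_DEADLINE_SECONDS)  # unparsable override: use default
    return max(MIN_DEADLINE_SECONDS, min(MAX_DEADLINE_SECONDS, value))


def retry_until_deadline(
    call: Callable[[], T],
    deadline_seconds: float,
    base_delay: float = 1.0,    # assumed
    max_delay: float = 60.0,    # assumed
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Keep retrying 429s until the next wait would cross the deadline."""
    give_up_at = clock() + deadline_seconds
    attempt = 0
    while True:
        try:
            return call()
        except FakeRateLimitError as exc:
            attempt += 1
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                ceiling = min(max_delay, base_delay * 2 ** min(attempt, 16))
                delay = random.uniform(0, ceiling)
            if clock() + delay > give_up_at:
                raise  # deadline exhausted: fail the run
            sleep(delay)


# Usage with a fake clock: a 300 s deadline and a constant 120 s retry-after hint.
now = [0.0]


def fake_sleep(seconds: float) -> None:
    now[0] += seconds


def always_429() -> str:
    raise FakeRateLimitError(retry_after=120.0)


try:
    retry_until_deadline(always_429, 300, sleep=fake_sleep, clock=lambda: now[0])
    gave_up = False
except FakeRateLimitError:
    gave_up = True
```

Bounding by elapsed time instead of attempt count means a long retry-after hint and a short one consume the same budget, which is what a "fail after this long period" directive asks for.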
Extended the same persist+resume pattern to CIO + macro (save_agent_run / load_agent_run, archive/agent_runs/{run_date}/{agent_id}.json) so an SF redrive after a hard-fail reuses every already-succeeded agent with zero LLM calls and the long retry only re-attempts the still-missing one(s). This is what makes a 60-90 min window bounded.

Stub-quarantine root-cause + structural fix (the dangerous bug): s3://alpha-engine-research/signals/2026-05-15/signals.json (written 2026-05-16T17:08:46Z) shipped synthetic [DRY-RUN] stub theses promoted as real (GOOG/AFL/AXP/ABT/APD/ADBE/AMD).

Leak path: install_dry_run_stubs only no-op'd write_signals_json + upload_db, NOT save_sector_team_run. The stub-pass runs the full graph; _stub_run_sector_team returns error=None so sector_team_node PERSISTED synthetic [DRY-RUN] output to archive/sector_team_runs/{run_date}/{team_id}.json. The subsequent REAL pass's #194 resume short-circuit LOADED that stub-persisted output and promoted it, with zero real Haiku calls.

Fix (defense in depth):

(a) install_dry_run_stubs now also no-ops save_sector_team_run + save_agent_run so the stub-pass cannot write resume keys;

(b) graph.stub_quarantine.assert_no_stub_output at the top of archive_writer refuses to write if the [DRY-RUN marker appears in any promotable surface or a sector team is missing.

Tests: full suite 1305 passed, 1 pre-existing unrelated test_scoring RSI failure (known stale-local-config, passes on CI).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
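A guard in the spirit of fix (b) could look roughly like the following. This is a sketch under assumptions, not the contents of graph.stub_quarantine: the function signature, `StubLeakError`, the recursive scan, and the shape of the team outputs are hypothetical; only the "[DRY-RUN" marker text and the two refusal conditions (marker present, team missing) come from the message above.

```python
from typing import Any, Iterable, Mapping

STUB_MARKER = "[DRY-RUN"  # marker prefix as quoted in the commit message


class StubLeakError(RuntimeError):
    """Hypothetical error raised when output must not be promoted."""


def contains_marker(value: Any) -> bool:
    """Recursively search strings, mappings, and sequences for the stub marker."""
    if isinstance(value, str):
        return STUB_MARKER in value
    if isinstance(value, Mapping):
        return any(contains_marker(k) or contains_marker(v) for k, v in value.items())
    if isinstance(value, (list, tuple, set)):
        return any(contains_marker(item) for item in value)
    return False


def assert_no_stub_output_sketch(
    team_outputs: Mapping[str, Any], expected_team_ids: Iterable[str]
) -> None:
    """Refuse to proceed if a team is missing or any output carries the marker."""
    missing = sorted(set(expected_team_ids) - set(team_outputs))
    if missing:
        raise StubLeakError(f"missing sector teams: {missing}")
    tainted = sorted(t for t, out in team_outputs.items() if contains_marker(out))
    if tainted:
        raise StubLeakError(f"stub marker found in teams: {tainted}")


# Usage: a nested stub thesis is caught before anything is written.
outputs = {
    "technology": {"theses": [{"ticker": "AAA", "text": "[DRY-RUN] placeholder"}]},
    "financials": {"theses": [{"ticker": "BBB", "text": "real analysis"}]},
}
try:
    assert_no_stub_output_sketch(outputs, ["technology", "financials"])
    blocked = False
except StubLeakError:
    blocked = True
```

Checking at the write boundary complements fix (a): even if some future code path persists stub output again, the marker scan still stops it from being promoted. The tickers "AAA" and "BBB" are placeholders.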

cipher813 merged commit c0b25a4 into main May 16, 2026
1 check passed

cipher813 deleted the fix/research-sector-team-persist-backoff branch May 16, 2026 17:02

cipher813 mentioned this pull request May 16, 2026

fix(research): all-agents-strict — long 429 retry, hard-fail, stub quarantine (reworks #193/#194) #195

Merged

cipher813 merged 1 commit into main from fix/research-sector-team-persist-backoff