fix(research): per-sector-team S3 persistence + 429 backoff + isolation — re-runs never re-pay completed teams #194
Merged
The 2026-05-16 Saturday SF recovery run failed: Research Lambda returned
status:ERROR with "sector team(s) failed: defensives/financials/technology:
RateLimitError 429 — org rate limit of 450,000 input tokens/min".
Mechanism: build_graph() compiles the LangGraph with NO checkpointer, so
the Lambda runs it stateless — an SF re-run re-dispatches all 6 sector
teams via Send() and re-pays every Haiku call. The 6-team parallel
fan-out bursts over the org's 450K Haiku input-TPM ceiling, and
score_aggregator hard-failed the WHOLE run if ANY team carried an error,
discarding the successful teams (in-memory graph state only).
Three coordinated parts (load-bearing invariant: a re-invocation must
NEVER re-pay a sector team that already succeeded for that run_date):
A. 429-aware retry/backoff. New shared invoke_with_rate_limit_retry in
agents/langchain_utils.py honors the retry-after response header, falls
back to jittered exponential backoff, caps attempts, and propagates
non-429 errors unchanged. Wrapped around every sector-team Haiku
.invoke() (quant ReAct + extract, qual ReAct + extract, peer_review
×3, held-thesis structured). Explicit max_retries on every
ChatAnthropic constructor (langchain default of 2 is insufficient for
sustained org 429).
B. Per-sector-team S3 persistence. ArchiveManager.save_sector_team_run /
load_sector_team_run, key archive/sector_team_runs/{run_date}/{team_id}.json.
sector_team_node persists each team the moment it succeeds (before any
other team can fail and ERROR the run); errored teams are NOT persisted
so a re-run retries them.
C. Resume + isolation. sector_team_node checks S3 before any LLM work and
short-circuits on a present, well-formed, run_date-matched persisted
output (zero Haiku calls). score_aggregator no longer hard-fails on a
single team error — failed teams are tolerated like partials; the run
only ERRORs when every team is failed-or-partial AND zero
recommendations survive. Stale/corrupt/cross-wired persisted JSON
falls back to re-run (structured warning), never crashes.
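
The retry behavior described in part A can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the contents of agents/langchain_utils.py: RateLimited is a hypothetical stand-in for the provider's 429 error, and invoke_with_retry, max_attempts, base_delay, and max_delay are illustrative names and values.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a 429 error, with an optional retry-after hint."""

    def __init__(self, retry_after=None):
        super().__init__("429 rate limited")
        self.retry_after = retry_after  # seconds, or None if no header was sent


def invoke_with_retry(call, max_attempts=6, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Run call(); on RateLimited, wait and retry, up to max_attempts calls."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # attempt cap reached: surface the 429 to the caller
            if exc.retry_after is not None:
                # The server told us how long to wait: honor it.
                delay = float(exc.retry_after)
            else:
                # Jittered exponential backoff: random wait in [0, cap],
                # where cap doubles each attempt up to max_delay.
                cap = min(max_delay, base_delay * 2 ** (attempt - 1))
                delay = random.uniform(0, cap)
            sleep(delay)
        # Any other exception type is not caught, so it propagates unchanged.
```

Injecting sleep keeps the helper testable without real waiting; a caller would wrap each model call as invoke_with_retry(lambda: model.invoke(prompt)), where model and prompt are whatever the call site already has.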
Tests: +16 in test_sector_team_persist_backoff.py (429-then-success,
retry-after honored, non-429 propagation, persist key, round-trip,
run-date isolation, corrupt/mismatch fallback, resume zero-LLM-calls,
errored-not-persisted) + reframed test_score_aggregator_failure.py to
the new isolation contract incl. the exact 2026-05-16 multi-team-429
regression. Full suite: 1283 passed, 1 pre-existing unrelated
test_scoring RSI failure (known stale-local-config, passes on CI).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request on May 16, 2026
…arantine (reworks #193/#194) (#195)
Authoritative directive (Brian, 2026-05-16), superseding #193/#194's degrade-and-continue philosophy: "If the sector agents don't run, Research shouldn't complete until all sectors are run. We should have a long retry mechanism and after this long period if we still don't have all sectors it should fail. We don't get anything from this process if the sectors, or any other agent for that matter, fail/don't run."
Four behavior changes:
1. Long 429 retry. invoke_with_rate_limit_retry is now an overall wall-clock DEADLINE (default RATE_LIMIT_RETRY_DEADLINE_SECONDS = 75 min, env-overridable, clamped 5 min .. 3 hr) of persistent 429 retry with capped exponential backoff, honoring retry-after, NOT a fixed ~6-attempt cap. Non-429 errors still propagate immediately.
2. All-agents hard-fail. score_aggregator raises (-> handler status:ERROR, NO signals.json / email / DB write) if ANY sector team is missing (absent from ALL_TEAM_IDS), failed, or partial. CIO + macro economist + macro critic LLM calls are wrapped in the same deadline helper with explicit max_retries; their failure hard-fails the run (strict-mode default raise), with no synthetic substitute promoted.
3. Removed #193 carry-forward. _update_thesis_for_held_stock no longer carries the prior thesis forward on persistent failure: a bounded parse re-roll for the transient tool-XML schema leak, then RAISE (no isolation fallback). A 429 past the deadline fails fast.
4. Reverted #194 isolation. A failed/partial sector team aborts the whole run even when other teams produced usable picks.
KEPT from #194 (composes with the directive): per-team S3 persistence + sector_team_node resume short-circuit.
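
The deadline-based retry in change 1 above might look roughly like the sketch below. It is illustrative only: RateLimited is a hypothetical stand-in for the provider's 429 error, retry_until_deadline and resolve_deadline_seconds are invented names, and reading the override from an environment variable named RATE_LIMIT_RETRY_DEADLINE_SECONDS is an assumption. Only the 75 min default and the 5 min .. 3 hr clamp come from the text.

```python
import os
import random
import time

DEFAULT_DEADLINE_S = 75 * 60    # 75 min default, per the description above
MIN_DEADLINE_S = 5 * 60         # clamp floor: 5 min
MAX_DEADLINE_S = 3 * 60 * 60    # clamp ceiling: 3 hr


class RateLimited(Exception):
    """Hypothetical stand-in for a 429 error, with an optional retry-after hint."""

    def __init__(self, retry_after=None):
        super().__init__("429 rate limited")
        self.retry_after = retry_after


def resolve_deadline_seconds(env=None):
    """Read the deadline override (assumed env var name) and clamp it."""
    env = os.environ if env is None else env
    raw = env.get("RATE_LIMIT_RETRY_DEADLINE_SECONDS")
    try:
        value = float(raw) if raw is not None else DEFAULT_DEADLINE_S
    except ValueError:
        value = DEFAULT_DEADLINE_S  # unparseable override: use the default
    return max(MIN_DEADLINE_S, min(MAX_DEADLINE_S, value))


def retry_until_deadline(call, deadline_s, base_delay=1.0, max_delay=60.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Retry call() on RateLimited until deadline_s of wall-clock time elapses."""
    start = clock()
    attempt = 0
    while True:
        try:
            return call()
        except RateLimited as exc:
            attempt += 1
            remaining = deadline_s - (clock() - start)
            if remaining <= 0:
                raise  # deadline exhausted: fail the run
            if exc.retry_after is not None:
                delay = float(exc.retry_after)
            else:
                # Capped, jittered exponential backoff.
                cap = min(max_delay, base_delay * 2 ** min(attempt - 1, 30))
                delay = random.uniform(0, cap)
            sleep(min(delay, remaining))  # never sleep past the deadline
        # Non-429 exceptions are not caught and propagate immediately.
```

Bounding by elapsed time rather than attempt count means a sustained org-level 429 is waited out for as long as the deadline allows, while the per-wait cap keeps individual sleeps short.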
Extended the same persist+resume pattern to CIO + macro (save_agent_run / load_agent_run, archive/agent_runs/{run_date}/{agent_id}.json) so an SF redrive after a hard-fail reuses every already-succeeded agent with zero LLM calls and the long retry only re-attempts the still-missing one(s), which is what makes a 60-90 min window bounded.
Stub-quarantine root-cause + structural fix (the dangerous bug): s3://alpha-engine-research/signals/2026-05-15/signals.json (written 2026-05-16T17:08:46Z) shipped synthetic [DRY-RUN] stub theses promoted as real (GOOG/AFL/AXP/ABT/APD/ADBE/AMD).
Leak path: install_dry_run_stubs only no-op'd write_signals_json + upload_db, NOT save_sector_team_run. The stub-pass runs the full graph; _stub_run_sector_team returns error=None so sector_team_node PERSISTED synthetic [DRY-RUN] output to archive/sector_team_runs/{run_date}/{team_id}.json. The subsequent REAL pass's #194 resume short-circuit LOADED that stub-persisted output and promoted it, with zero real Haiku calls.
Fix (defense in depth):
(a) install_dry_run_stubs now also no-ops save_sector_team_run + save_agent_run so the stub-pass cannot write resume keys;
(b) graph.stub_quarantine.assert_no_stub_output at the top of archive_writer refuses to write if the [DRY-RUN marker appears in any promotable surface or a sector team is missing.
Tests: full suite 1305 passed, 1 pre-existing unrelated test_scoring RSI failure (known stale-local-config, passes on CI).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
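
A guard in the spirit of fix (b) could be sketched as follows. The real graph.stub_quarantine.assert_no_stub_output signature is not shown in this PR, so the arguments here (a dict of team outputs plus the expected team ids), the StubQuarantineError type, and the contains_marker helper are all assumptions made for illustration.

```python
DRY_RUN_MARKER = "[DRY-RUN"  # marker prefix named in the description above


class StubQuarantineError(RuntimeError):
    """Hypothetical error type: raised instead of writing tainted output."""


def contains_marker(value, marker=DRY_RUN_MARKER):
    """Recursively search strings, dicts, and sequences for the marker."""
    if isinstance(value, str):
        return marker in value
    if isinstance(value, dict):
        return any(contains_marker(k, marker) or contains_marker(v, marker)
                   for k, v in value.items())
    if isinstance(value, (list, tuple, set)):
        return any(contains_marker(item, marker) for item in value)
    return False  # numbers, None, etc. cannot carry the marker


def assert_no_stub_output(team_outputs, expected_team_ids):
    """Refuse to proceed if a team is missing or any output carries the marker."""
    missing = sorted(set(expected_team_ids) - set(team_outputs))
    if missing:
        raise StubQuarantineError(f"missing sector team(s): {missing}")
    tainted = sorted(team_id for team_id, output in team_outputs.items()
                     if contains_marker(output))
    if tainted:
        raise StubQuarantineError(f"stub output in team(s): {tainted}")
```

Checking at the write boundary complements fix (a): even if some future code path persists stub output again, the marker scan stops it before it is promoted.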
Diagnosed mechanism
The 2026-05-16 Saturday SF recovery run failed: the Research Lambda returned status:ERROR with "sector team(s) failed: defensives/financials/technology: RateLimitError 429 — org rate limit of 450,000 input tokens/min, claude-haiku-4-5".
1. graph/research_graph.py build_graph() compiles the LangGraph with NO checkpointer → the Lambda runs it stateless → an SF re-run re-dispatches all 6 sector teams via Send() and re-pays every Haiku call (quant→qual→peer_review per team).
2. The 6-team parallel Send() fan-out bursts over the org's 450K Haiku input-TPM ceiling → RateLimitError 429.
3. score_aggregator hard-failed the WHOLE run if ANY team carried an error (429 = error, not the tolerated partial).
4. Successful teams' outputs lived only in in-memory graph state and were discarded on the ERROR return; nothing persisted them for a re-run to reuse.
The 3-part fix
A. 429-aware retry/backoff. New shared invoke_with_rate_limit_retry in agents/langchain_utils.py: catches anthropic.RateLimitError / 429 APIStatusError, honors the retry-after response header when present, and otherwise uses jittered exponential backoff with an attempt cap. Non-429 errors propagate immediately and unchanged (strict-mode / partial / isolation contracts preserved). Wrapped around every sector-team Haiku .invoke(): quant ReAct + structured extract, qual ReAct + structured extract, peer_review (quant-addition / Pass-1 selection / Pass-2 rationale), and the #193 held-thesis include_raw=True call. Explicit max_retries=SECTOR_TEAM_LLM_MAX_RETRIES (8) on every ChatAnthropic(...) constructor; langchain's default of 2 is insufficient for a sustained org 429.
B. Per-sector-team S3 persistence on success.
ArchiveManager.save_sector_team_run / load_sector_team_run, deterministic key archive/sector_team_runs/{run_date}/{team_id}.json (matches the existing archive/{category}/... prefix convention, same _s3_put / _s3_get / bucket). sector_team_node persists each team's full output the moment it finishes without an error, before any other team can fail and ERROR the overall run. Errored teams are NOT persisted so a re-run gets a fresh attempt at them.
C. Resume + isolation.
sector_team_node checks S3 for (run_date, team_id) before any LLM/tool work; on a present, well-formed, run_date-matched persisted output it loads it and short-circuits the team (zero Haiku calls), feeding it into sector_team_outputs exactly as a fresh run would.
score_aggregator no longer hard-fails on a single team error. Failed teams are tolerated exactly like the existing partial_teams philosophy; the run only ERRORs when every team is failed-or-partial AND zero recommendations survive ("nothing for CIO to rank").
Persisted outputs are scoped by run_date (a new run_date never reuses a prior date's teams); stale/corrupt/cross-wired JSON logs a structured warning and falls back to re-running that team, never crashes.
Load-bearing invariant
A re-invocation must NEVER re-pay a sector team that already succeeded for that run_date. Persist-on-success happens before any other team can fail, so an SF re-run reuses the completed teams from S3 and only re-executes the previously-failed ones.
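
The persist-on-success and resume path behind this invariant can be illustrated with the sketch below. It is not the actual ArchiveManager or sector_team_node code: store is a plain dict standing in for S3, and team_key, load_team_run, run_team, and the payload fields (run_date, team_id, output) are hypothetical. Only the key layout comes from the description above.

```python
import json


def team_key(run_date, team_id):
    """Deterministic key, following the layout given in the description."""
    return f"archive/sector_team_runs/{run_date}/{team_id}.json"


def load_team_run(store, run_date, team_id):
    """Return the persisted payload only if present, well-formed, and matching."""
    raw = store.get(team_key(run_date, team_id))
    if raw is None:
        return None
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None  # corrupt JSON: caller falls back to a fresh run
    if not isinstance(data, dict) or "output" not in data:
        return None
    if data.get("run_date") != run_date or data.get("team_id") != team_id:
        return None  # stale or cross-wired payload
    return data


def run_team(store, run_date, team_id, compute):
    """Reuse a persisted success, else compute and persist only on success."""
    cached = load_team_run(store, run_date, team_id)
    if cached is not None:
        return cached["output"]  # resume short-circuit: compute is never called
    output = compute(team_id)
    if output.get("error") is None:
        # Persist immediately, before any other team has a chance to fail.
        store[team_key(run_date, team_id)] = json.dumps(
            {"run_date": run_date, "team_id": team_id, "output": output})
    return output  # errored outputs are returned but not persisted
```

Because errored outputs never reach the store, a re-invocation for the same run_date recomputes only the teams that failed, and a new run_date starts from an empty key prefix.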
Tests
+16 in tests/test_sector_team_persist_backoff.py (429-then-success on backoff; retry-after honored; non-429 propagation; persist to expected S3 key; round-trip load; run-date isolation; corrupt/identity-mismatch fallback; resume short-circuit asserting zero LLM calls; errored-team-not-persisted) and tests/test_score_aggregator_failure.py reframed to the new isolation contract, including the exact 2026-05-16 multi-team-429 regression shape. Full suite: 1283 passed, 1 pre-existing unrelated test_scoring.py::TestRSIScoring::test_bull_overbought_matches_neutral_post_revert failure (known stale-local-config artifact, passes on CI; left untouched per scope).
DEPLOY IS HELD: Research is Lambda-deploy-gated; do not merge/deploy until the user directs. After merge the Research Lambda must be redeployed before this fix is live.
🤖 Generated with Claude Code