
fix(research): per-sector-team S3 persistence + 429 backoff + isolation — re-runs never re-pay completed teams#194

Merged
cipher813 merged 1 commit into main from fix/research-sector-team-persist-backoff
May 16, 2026

Conversation

@cipher813
Owner

Diagnosed mechanism

The 2026-05-16 Saturday SF recovery run failed: the Research Lambda returned status:ERROR with "sector team(s) failed: defensives/financials/technology: RateLimitError 429 — org rate limit of 450,000 input tokens/min, claude-haiku-4-5".

  • graph/research_graph.py build_graph() compiles the LangGraph with NO checkpointer → the Lambda runs it stateless → an SF re-run re-dispatches all 6 sector teams via Send() and re-pays every Haiku call (quant→qual→peer_review per team).
  • The 6-team parallel Send() fan-out bursts over the org's 450K Haiku input-TPM ceiling → RateLimitError 429.
  • score_aggregator hard-failed the WHOLE run if ANY team carried an error (429 = error, not the tolerated partial). Successful teams' outputs lived only in in-memory graph state and were discarded on the ERROR return — nothing persisted them for a re-run to reuse.

The 3-part fix

A. 429-aware retry/backoff. New shared invoke_with_rate_limit_retry in agents/langchain_utils.py — catches anthropic.RateLimitError / 429 APIStatusError, honors the retry-after response header when present, otherwise jittered exponential backoff with an attempt cap. Non-429 errors propagate immediately and unchanged (strict-mode / partial / isolation contracts preserved). Wrapped around every sector-team Haiku .invoke(): quant ReAct + structured extract, qual ReAct + structured extract, peer_review (quant-addition / Pass-1 selection / Pass-2 rationale), and the #193 held-thesis include_raw=True call. Explicit max_retries=SECTOR_TEAM_LLM_MAX_RETRIES (8) on every ChatAnthropic(...) constructor — langchain's default of 2 is insufficient for a sustained org 429.
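
A minimal sketch of the retry shape described in part A, not the code in agents/langchain_utils.py. Assumptions: the `RateLimitError` class here is a local stand-in for anthropic.RateLimitError so the example runs without the SDK, the `retry_after` attribute stands in for the parsed retry-after header, and `max_attempts`, `base_delay`, and `max_delay` are illustrative values rather than the PR's constants.

```python
import random
import time


class RateLimitError(Exception):
    """Local stand-in for anthropic.RateLimitError (assumption, keeps the sketch SDK-free)."""

    def __init__(self, retry_after=None):
        super().__init__("429 rate limit")
        self.retry_after = retry_after  # seconds from the retry-after header, if any


def invoke_with_rate_limit_retry(call, max_attempts=6, base_delay=1.0,
                                 max_delay=60.0, sleep=time.sleep):
    """Run call(); retry only on RateLimitError, at most max_attempts times in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimitError as exc:
            if attempt == max_attempts:
                raise  # attempt cap reached: surface the 429 to the caller
            if exc.retry_after is not None:
                delay = float(exc.retry_after)  # honor the server's hint
            else:
                # Jittered exponential backoff: uniform in [0, min(cap, base * 2^(attempt-1))].
                ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
                delay = random.uniform(0.0, ceiling)
            sleep(delay)
        # Any other exception type is not caught here and propagates unchanged.
```

Injecting `sleep` keeps the helper testable without real waits; a caller would wrap each model call as `invoke_with_rate_limit_retry(lambda: llm.invoke(messages))`, where `llm` and `messages` are placeholders.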

B. Per-sector-team S3 persistence on success. ArchiveManager.save_sector_team_run / load_sector_team_run, deterministic key archive/sector_team_runs/{run_date}/{team_id}.json (matches the existing archive/{category}/... prefix convention, same _s3_put/_s3_get/bucket). sector_team_node persists each team's full output the moment it finishes without an error — before any other team can fail and ERROR the overall run. Errored teams are NOT persisted so a re-run gets a fresh attempt at them.
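
The persist-on-success contract in part B can be sketched as follows. This is an illustration under assumptions, not ArchiveManager itself: `InMemoryArchive` replaces the S3 bucket with a dict, the payload envelope (`run_date`, `team_id`, `output`) and the `error` field name are hypothetical, and only the key format is taken from the description above.

```python
import json


def sector_team_run_key(run_date: str, team_id: str) -> str:
    """Deterministic key, following the format given in the PR description."""
    return f"archive/sector_team_runs/{run_date}/{team_id}.json"


class InMemoryArchive:
    """Hypothetical stand-in for ArchiveManager; a dict plays the role of the bucket."""

    def __init__(self):
        self._objects = {}

    def save_sector_team_run(self, run_date, team_id, output):
        payload = {"run_date": run_date, "team_id": team_id, "output": output}
        self._objects[sector_team_run_key(run_date, team_id)] = json.dumps(payload)

    def load_sector_team_run(self, run_date, team_id):
        raw = self._objects.get(sector_team_run_key(run_date, team_id))
        return None if raw is None else json.loads(raw)


def persist_on_success(archive, run_date, team_id, output) -> bool:
    """Persist only error-free outputs so a re-run retries the failed teams."""
    if output.get("error"):
        return False  # errored team: leave no resume key behind
    archive.save_sector_team_run(run_date, team_id, output)
    return True
```

Because the key depends only on (run_date, team_id), a re-run on the same date finds the object without any index, and a different run_date can never collide with it.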

C. Resume + isolation.

  • Resume: sector_team_node checks S3 for (run_date, team_id) before any LLM/tool work; on a present, well-formed, run_date-matched persisted output it loads it and short-circuits the team — zero Haiku calls — feeding it into sector_team_outputs exactly as a fresh run would.
  • Isolation: score_aggregator no longer hard-fails on a single team error. Failed teams are tolerated exactly like the existing partial_teams philosophy; the run only ERRORs when every team is failed-or-partial AND zero recommendations survive ("nothing for CIO to rank").
  • Idempotency/safety: persisted output is tied to run_date (a new run_date never reuses a prior date's teams); stale/corrupt/cross-wired JSON logs a structured warning and falls back to re-running that team, never crashes.
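
The isolation rule above reduces to a single predicate. The sketch below shows that rule as stated in this PR (the follow-up commit referenced further down this page reverts it); the field names `error`, `partial`, and `recommendations` are assumptions chosen for illustration, not the graph's actual state schema.

```python
def should_error_run(team_outputs: dict) -> bool:
    """True only when every team is failed-or-partial AND no recommendations survive."""
    all_degraded = all(
        out.get("error") is not None or out.get("partial", False)
        for out in team_outputs.values()
    )
    surviving = sum(len(out.get("recommendations") or []) for out in team_outputs.values())
    return all_degraded and surviving == 0
```

Under this rule a single 429-failed team no longer aborts the run as long as at least one healthy team exists, or a degraded team still contributes a recommendation.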

Load-bearing invariant

A re-invocation must NEVER re-pay a sector team that already succeeded for that run_date. Persist-on-success happens before any other team can fail, so an SF re-run reuses the completed teams from S3 and only re-executes the previously-failed ones.
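
One way to picture the invariant is a resume check that validates the persisted object before trusting it and otherwise falls back to a fresh run. This is a hedged sketch: the function signatures, the payload envelope, and the `run_team` callback are hypothetical, and the real sector_team_node reads from S3 rather than taking raw JSON as an argument.

```python
import json
import logging

logger = logging.getLogger("sector_team_resume")


def try_resume(raw_json, run_date, team_id):
    """Return the persisted output if it is usable, else None (caller re-runs the team)."""
    if raw_json is None:
        return None  # nothing persisted for this (run_date, team_id)
    try:
        payload = json.loads(raw_json)
    except (TypeError, ValueError):
        logger.warning("corrupt persisted output run_date=%s team_id=%s", run_date, team_id)
        return None
    well_formed = (
        isinstance(payload, dict)
        and payload.get("run_date") == run_date
        and payload.get("team_id") == team_id
        and "output" in payload
    )
    if not well_formed:
        logger.warning("stale or cross-wired output run_date=%s team_id=%s", run_date, team_id)
        return None
    return payload["output"]


def sector_team_node(raw_json, run_date, team_id, run_team):
    """Short-circuit on a valid persisted output; otherwise do the paid work."""
    resumed = try_resume(raw_json, run_date, team_id)
    if resumed is not None:
        return resumed  # zero LLM calls on this path
    return run_team(team_id)
```

Every failure mode of the persisted object (absent, unparseable, wrong date, wrong team) lands on the same fallback, so a bad object costs one re-run of that team and never crashes the graph.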

Tests

+16 in tests/test_sector_team_persist_backoff.py (429-then-success on backoff; retry-after honored; non-429 propagation; persist to expected S3 key; round-trip load; run-date isolation; corrupt/identity-mismatch fallback; resume short-circuit asserting zero LLM calls; errored-team-not-persisted) and tests/test_score_aggregator_failure.py reframed to the new isolation contract, including the exact 2026-05-16 multi-team-429 regression shape. Full suite: 1283 passed, 1 pre-existing unrelated test_scoring.py::TestRSIScoring::test_bull_overbought_matches_neutral_post_revert failure (known stale-local-config artifact, passes on CI — left untouched per scope).


DEPLOY IS HELD — Research is Lambda-deploy-gated; do not merge/deploy until the user directs. After merge the Research Lambda must be redeployed before this fix is live.

🤖 Generated with Claude Code

…on — re-runs never re-pay completed teams

The 2026-05-16 Saturday SF recovery run failed: Research Lambda returned
status:ERROR with "sector team(s) failed: defensives/financials/technology:
RateLimitError 429 — org rate limit of 450,000 input tokens/min".

Mechanism: build_graph() compiles the LangGraph with NO checkpointer, so
the Lambda runs it stateless — an SF re-run re-dispatches all 6 sector
teams via Send() and re-pays every Haiku call. The 6-team parallel
fan-out bursts over the org's 450K Haiku input-TPM ceiling, and
score_aggregator hard-failed the WHOLE run if ANY team carried an error,
discarding the successful teams (in-memory graph state only).

Three coordinated parts (load-bearing invariant: a re-invocation must
NEVER re-pay a sector team that already succeeded for that run_date):

A. 429-aware retry/backoff. New shared invoke_with_rate_limit_retry in
   agents/langchain_utils.py honors the retry-after response header, falls
   back to jittered exponential backoff, caps attempts, and propagates
   non-429 errors unchanged. Wrapped around every sector-team Haiku
   .invoke() (quant ReAct + extract, qual ReAct + extract, peer_review
   ×3, held-thesis structured). Explicit max_retries on every
   ChatAnthropic constructor (langchain default of 2 is insufficient for
   sustained org 429).

B. Per-sector-team S3 persistence. ArchiveManager.save_sector_team_run /
   load_sector_team_run, key archive/sector_team_runs/{run_date}/{team_id}.json.
   sector_team_node persists each team the moment it succeeds (before any
   other team can fail and ERROR the run); errored teams are NOT persisted
   so a re-run retries them.

C. Resume + isolation. sector_team_node checks S3 before any LLM work and
   short-circuits on a present, well-formed, run_date-matched persisted
   output (zero Haiku calls). score_aggregator no longer hard-fails on a
   single team error — failed teams are tolerated like partials; the run
   only ERRORs when every team is failed-or-partial AND zero
   recommendations survive. Stale/corrupt/cross-wired persisted JSON
   falls back to re-run (structured warning), never crashes.

Tests: +16 in test_sector_team_persist_backoff.py (429-then-success,
retry-after honored, non-429 propagation, persist key, round-trip,
run-date isolation, corrupt/mismatch fallback, resume zero-LLM-calls,
errored-not-persisted) + reframed test_score_aggregator_failure.py to
the new isolation contract incl. the exact 2026-05-16 multi-team-429
regression. Full suite: 1283 passed, 1 pre-existing unrelated
test_scoring RSI failure (known stale-local-config, passes on CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit c0b25a4 into main May 16, 2026
1 check passed
@cipher813 cipher813 deleted the fix/research-sector-team-persist-backoff branch May 16, 2026 17:02
cipher813 added a commit that referenced this pull request May 16, 2026
…arantine (reworks #193/#194) (#195)

Authoritative directive (Brian, 2026-05-16) — supersedes #193/#194's
degrade-and-continue philosophy:

  "If the sector agents don't run, Research shouldn't complete until
   all sectors are run. We should have a long retry mechanism and
   after this long period if we still don't have all sectors it
   should fail. We don't get anything from this process if the
   sectors, or any other agent for that matter, fail/don't run."

Four behavior changes:

1. Long 429 retry. invoke_with_rate_limit_retry is now an overall
   wall-clock DEADLINE (default RATE_LIMIT_RETRY_DEADLINE_SECONDS =
   75 min, env-overridable, clamped 5 min .. 3 hr) of persistent
   429 retry with capped expo backoff, honoring retry-after — NOT a
   fixed ~6-attempt cap. Non-429 errors still propagate immediately.

2. All-agents hard-fail. score_aggregator raises (-> handler
   status:ERROR, NO signals.json / email / DB write) if ANY sector
   team is missing (absent from ALL_TEAM_IDS), failed, or partial.
   CIO + macro economist + macro critic LLM calls wrapped in the
   same deadline helper with explicit max_retries; their failure
   hard-fails the run (strict-mode default raise) — no synthetic
   substitute promoted.

3. Removed #193 carry-forward. _update_thesis_for_held_stock no
   longer carries the prior thesis forward on persistent failure: a
   bounded parse re-roll for the transient tool-XML schema leak, then
   RAISE (no isolation fallback). A 429 past the deadline fails fast.

4. Reverted #194 isolation. A failed/partial sector team aborts the
   whole run even when other teams produced usable picks.
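
The deadline-based retry in item 1 differs from an attempt cap in that the stop condition is elapsed wall-clock time. A rough sketch under assumptions: `RateLimitError` is again a local stand-in for the SDK exception, the environment variable is assumed to share the constant's name, and the backoff parameters are illustrative; only the 75 min default and the 5 min .. 3 hr clamp come from the commit message.

```python
import os
import random
import time

DEFAULT_DEADLINE_SECONDS = 75 * 60   # default stated in the commit message
MIN_DEADLINE_SECONDS = 5 * 60        # lower clamp
MAX_DEADLINE_SECONDS = 3 * 60 * 60   # upper clamp


class RateLimitError(Exception):
    """Local stand-in for anthropic.RateLimitError (assumption)."""

    def __init__(self, retry_after=None):
        super().__init__("429 rate limit")
        self.retry_after = retry_after


def resolve_deadline_seconds(env=os.environ):
    """Read the override (env var name is an assumption) and clamp it to the allowed range."""
    raw = env.get("RATE_LIMIT_RETRY_DEADLINE_SECONDS")
    try:
        value = float(raw) if raw is not None else DEFAULT_DEADLINE_SECONDS
    except ValueError:
        value = DEFAULT_DEADLINE_SECONDS  # unparseable override: fall back to the default
    return max(MIN_DEADLINE_SECONDS, min(MAX_DEADLINE_SECONDS, value))


def invoke_until_deadline(call, deadline_seconds, base_delay=1.0, max_delay=60.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Retry 429s until the next wait would cross the wall-clock deadline."""
    start = clock()
    attempt = 0
    while True:
        try:
            return call()  # non-429 exceptions propagate immediately
        except RateLimitError as exc:
            attempt += 1
            if exc.retry_after is not None:
                delay = float(exc.retry_after)
            else:
                ceiling = min(max_delay, base_delay * 2 ** min(attempt - 1, 30))
                delay = random.uniform(0.0, ceiling)  # capped, jittered backoff
            if clock() - start + delay > deadline_seconds:
                raise  # out of budget: let the run hard-fail
            sleep(delay)
```

Passing `clock` and `sleep` as parameters lets a test simulate a 75-minute window instantly.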

KEPT from #194 (composes with the directive): per-team S3
persistence + sector_team_node resume short-circuit. Extended the
same persist+resume pattern to CIO + macro (save_agent_run /
load_agent_run, archive/agent_runs/{run_date}/{agent_id}.json) so an
SF redrive after a hard-fail reuses every already-succeeded agent
with zero LLM calls and the long retry only re-attempts the
still-missing one(s) — what makes a 60-90 min window bounded.

Stub-quarantine root-cause + structural fix (the dangerous bug):
s3://alpha-engine-research/signals/2026-05-15/signals.json (written
2026-05-16T17:08:46Z) shipped synthetic [DRY-RUN] stub theses
promoted as real (GOOG/AFL/AXP/ABT/APD/ADBE/AMD). Leak path:
install_dry_run_stubs only no-op'd write_signals_json + upload_db,
NOT save_sector_team_run. The stub-pass runs the full graph;
_stub_run_sector_team returns error=None so sector_team_node
PERSISTED synthetic [DRY-RUN] output to
archive/sector_team_runs/{run_date}/{team_id}.json. The subsequent
REAL pass's #194 resume short-circuit LOADED that stub-persisted
output and promoted it — zero real Haiku calls. Fix (defense in
depth): (a) install_dry_run_stubs now also no-ops
save_sector_team_run + save_agent_run so the stub-pass cannot write
resume keys; (b) graph.stub_quarantine.assert_no_stub_output at the
top of archive_writer refuses to write if the [DRY-RUN marker
appears in any promotable surface or a sector team is missing.
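
A guard of the kind described in (b) could look roughly like this. It is a sketch, not graph.stub_quarantine: the exception class, the recursive scan, the shape of `team_outputs`, and the three-team `ALL_TEAM_IDS` subset are all assumptions made so the example runs on its own.

```python
DRY_RUN_MARKER = "[DRY-RUN"
# Illustrative subset only; the real ALL_TEAM_IDS lists every sector team.
ALL_TEAM_IDS = ("defensives", "financials", "technology")


class StubQuarantineError(RuntimeError):
    """Hypothetical error raised when output must not be promoted."""


def _contains_marker(value) -> bool:
    """Recursively search strings, dict keys/values, and sequences for the marker."""
    if isinstance(value, str):
        return DRY_RUN_MARKER in value
    if isinstance(value, dict):
        return any(_contains_marker(k) or _contains_marker(v) for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return any(_contains_marker(item) for item in value)
    return False


def assert_no_stub_output(team_outputs: dict, required_team_ids=ALL_TEAM_IDS) -> None:
    """Refuse to proceed if a team is missing or any synthetic text is present."""
    missing = [team for team in required_team_ids if team not in team_outputs]
    if missing:
        raise StubQuarantineError(f"missing sector teams: {missing}")
    if _contains_marker(team_outputs):
        raise StubQuarantineError("synthetic dry-run content found in promotable output")
```

Running the check at the write boundary means a stub that slips past the no-op'd savers is still stopped before anything is published.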

Tests: full suite 1305 passed, 1 pre-existing unrelated
test_scoring RSI failure (known stale-local-config, passes on CI).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>