fix(research): all-agents-strict — long 429 retry, hard-fail, stub quarantine (reworks #193/#194) by cipher813 · Pull Request #195 · cipher813/alpha-engine-research

cipher813 · 2026-05-16T20:44:27Z

DEPLOY HELD

DEPLOY HELD — reworks just-merged #193/#194; do not merge/deploy/re-run; after merge the Research Lambda must be redeployed before live; user will re-run manually. This reverses the degrade-and-continue philosophy of two PRs that merged earlier today. The user reviews before deploy.

The directive (Brian, 2026-05-16)

"If the sector agents don't run, Research shouldn't complete until all sectors are run. We should have a long retry mechanism and after this long period if we still don't have all sectors it should fail. We don't get anything from this process if the sectors, or any other agent for that matter, fail/don't run."

This OVERRIDES #193's carry-forward and #194's per-team isolation. All-agents-strict scope: sector teams, held-thesis updates, CIO, macro economist (+ critic) — every agent.

The 4 behavior changes

1. Long 429 retry (`agents/langchain_utils.py`)

invoke_with_rate_limit_retry is now an overall wall-clock deadline of persistent 429 retry, not a fixed ~6-attempt cap. Capped exponential backoff (base 4s, cap 60s) + decorrelated jitter between attempts, honoring the retry-after header. When the deadline is exceeded the 429 propagates and the caller hard-fails.

Deadline constant: RATE_LIMIT_RETRY_DEADLINE_SECONDS, default 75 min (4500s), set module-level in agents/langchain_utils.py via _resolve_deadline_seconds(). Env override RATE_LIMIT_RETRY_DEADLINE_SECONDS, clamped to 5 min .. 3 hr so a typo can't make it unbounded or trivial. The wrapper reads the module constant at call time (tests monkeypatch it; the Lambda sets the env before cold-start import).
The max_attempts kwarg was removed (replaced by deadline_seconds). Non-429 errors still propagate immediately and unchanged.

2. All-agents hard-fail (`graph/research_graph.py` `score_aggregator`)

Reverts #194. score_aggregator raises RuntimeError (→ handler status:ERROR, no signals.json / email / DB write) if ANY sector team is missing (absent from ALL_TEAM_IDS coverage), failed (carries error), or partial. Surviving picks from other teams no longer save the run.

CIO (agents/investment_committee/ic_cio.py), macro economist + critic (agents/macro_agent.py): their .invoke() calls are wrapped in the same deadline helper, with explicit max_retries=SECTOR_TEAM_LLM_MAX_RETRIES on every ChatAnthropic. A 429 past the deadline → propagate → strict-mode (production default) raise → run hard-fails. No synthetic/empty CIO or macro substitute is promoted.

3. Removed #193 carry-forward (`agents/sector_teams/sector_team.py`)

_update_thesis_for_held_stock no longer carries the prior thesis forward on persistent failure. It does a bounded parse re-roll (_MAX_PARSE_ATTEMPTS = 3) for the transient tool-XML schema leak (the 2026-05-16 catalysts string-not-list nondeterminism — these recover on a re-roll), then RAISES. A 429 (detected via _is_rate_limit_error) past the deadline-bounded wrapper fails fast — re-rolling can't fix an org TPM ceiling. The raise surfaces as the team's hard error → score_aggregator hard-fails.

4. Reverted #194 isolation

Covered by change 2 — a failed/partial team aborts the run even with other teams' picks.

What #194 behavior is KEPT and why it composes

Per-team S3 persistence + sector_team_node resume short-circuit are KEPT unchanged (archive/manager.py save_sector_team_run/load_sector_team_run, the archive/sector_team_runs/{run_date}/{team_id}.json scheme). It composes with the directive: each agent that succeeds is persisted the moment it completes, so when one agent hard-fails and the run ERRORs, an SF redrive reuses the persisted succeeded teams (zero Haiku calls) and the long 75-min retry only ever re-attempts the still-missing agent(s). That is exactly what makes a 60-90 min window affordable and bounded rather than re-paying the whole 6-team fan-out every redrive.

Extended the same persist+resume to CIO + macro (new ArchiveManager.save_agent_run/load_agent_run, archive/agent_runs/{run_date}/{agent_id}.json; resume short-circuit + persist-on-success in cio_node and macro_economist_node). Macro runs upstream of sector dispatch (Stage B) and CIO is the single most expensive Sonnet call — without this, every redrive after a sector hard-fail would re-pay them before even reaching the still-missing team.

Stub-quarantine — root-cause + structural fix (the dangerous one)

Evidence: s3://alpha-engine-research/signals/2026-05-15/signals.json, written 2026-05-16T17:08:46Z by the recovery-postfix194 run, shipped with synthetic dry_run.py stub output promoted as real — every new buy_candidate (GOOG/AFL/AXP/ABT/APD/ADBE/AMD) had a thesis_summary starting "[DRY-RUN] Strong fundamentals…" and the email rendered them as real picks, on a run that was NOT dry_run_llm.

Precise leak path (root-caused): The default dry-run gate runs a synthetic stub-pass first via install_dry_run_stubs(archive), then _restore(). install_dry_run_stubs only no-op'd write_signals_json + upload_db. The stub-pass runs the full graph: sector_team_node calls save_sector_team_run the moment a team "succeeds", and _stub_run_sector_team returns a normal-looking dict with error=None — so the stub-pass persisted synthetic [DRY-RUN] sector-team output to archive/sector_team_runs/{run_date}/{team_id}.json. The subsequent real pass's sector_team_node resume short-circuit (#194) then loaded that stub-persisted output via load_sector_team_run and promoted the synthetic theses straight into signals.json + the email — with zero real Haiku calls. The per-team resume persistence path (added by #194) was never suppressed for the stub-pass, so the stub-pass leaked synthetic state into the real pass.

Structural fix (defense in depth):

install_dry_run_stubs now also no-ops save_sector_team_run + save_agent_run so the stub-pass cannot write the resume keys at all.
New graph/stub_quarantine.py assert_no_stub_output, called at the top of archive_writer (before any write; archive_writer precedes email_sender so a raise here also blocks the email). It raises StubQuarantineError if the [DRY-RUN marker (canonical dry_run.DRY_RUN_MARKER, embedded in every stub string) appears anywhere in the signals payload / consolidated report / investment theses / sector-team outputs / entry theses / IC decisions / macro report, or if any sector team is missing/failed/partial. A promoted artifact may ONLY be produced by a fully-real, all-agents-complete pass. The signals payload is built once and reused at the write site (no double-build, no TOCTOU).

Pre-existing #193/#194 tests changed (in-scope contract changes)

tests/test_held_thesis_isolation.py → tests/test_held_thesis_strict.py (renamed). Every test rewritten to the new contract:
- test_malformed_catalysts_degrades_to_prior_thesis → test_malformed_catalysts_raises_after_reroll (carry-forward removed → now RAISES).
- test_retry_recovers_when_second_attempt_valid → test_reroll_recovers_when_a_later_attempt_valid (kept; parse budget 2→3, post-budget = raise not carry).
- test_validation_error_does_not_propagate_in_strict_mode → test_validation_error_DOES_propagate (the OPPOSITE assertion — the whole point of the rework).
- test_no_prior_thesis_isolation_marks_score_failed → test_no_prior_thesis_still_raises (no isolation fallback).
tests/test_score_aggregator_failure.py rewritten. TestScoreAggregatorPartialTolerance / TestScoreAggregatorIsolation pinned fix(research): per-sector-team S3 persistence + 429 backoff + isolation — re-runs never re-pay completed teams #194's tolerate-and-continue; replaced with TestPartialNoLongerTolerated / TestNoIsolation asserting RAISE. test_passes_through_when_no_errors / _when_error_key_absent (a single clean team used to pass) → test_missing_teams_hard_fail* + a test_full_clean_set_passes_through that supplies the full 6-team set. The test_2026_05_16_multi_team_429_does_not_abort_when_picks_survive regression is inverted to test_2026_05_16_multi_team_429_now_hard_fails.
tests/test_sector_team_persist_backoff.py: test_persistent_429_eventually_raises → test_persistent_429_raises_at_deadline_not_attempt_count (the max_attempts kwarg was removed). Added test_deadline_constant_default_is_75_min + test_long_429_then_success_within_deadline.
tests/test_thesis_update_recompute.py: the _state helper now pads absent ALL_TEAM_IDS teams with clean stubs so these tests (which exercise the recompute/normalization path after the strict gate, using a non-ALL_TEAM_IDS team_id like "energy") reach the path under test. Zero recommendations/theses contributed by the padding → every test's investment_theses assertion is unchanged.
tests/test_max_tokens_lint.py: allowlist line for the macro critic max_tokens=512 moved 462 → 476 (my import + max_retries= kwarg shifted it). Same intentional 512-literal call.

New tests added: tests/test_stub_quarantine.py (the precise 2026-05-15 failure shape — stub-pass cannot persist resume keys + write-site guard blocks the marker/missing-team), tests/test_cio_macro_strict_persist.py (CIO/macro 429-past-deadline propagation + agent_run persist/resume).

Test count

Full suite: 1305 passed, 1 pre-existing unrelated failure (tests/test_scoring.py::TestRSIScoring::test_bull_overbought_matches_neutral_post_revert — known stale-local-config artifact, passes on CI). Validated with ~/Development/alpha-engine-research/.venv/bin/python -m pytest tests/ -q -p no:cacheprovider.

Ambiguity resolved

The directive said "extend persist+resume to CIO/macro if practical; if high-risk, persist sector teams only and document why." It was practical and low-risk — save_agent_run/load_agent_run mirror the proven #194 sector-team envelope + identity-guard contract exactly, and the stub-quarantine suppression of save_agent_run keeps it consistent with the structural fix — so I extended it to both CIO and macro.

🤖 Generated with Claude Code

…arantine (reworks #193/#194) Authoritative directive (Brian, 2026-05-16) — supersedes #193/#194's degrade-and-continue philosophy: "If the sector agents don't run, Research shouldn't complete until all sectors are run. We should have a long retry mechanism and after this long period if we still don't have all sectors it should fail. We don't get anything from this process if the sectors, or any other agent for that matter, fail/don't run." Four behavior changes: 1. Long 429 retry. invoke_with_rate_limit_retry is now an overall wall-clock DEADLINE (default RATE_LIMIT_RETRY_DEADLINE_SECONDS = 75 min, env-overridable, clamped 5 min .. 3 hr) of persistent 429 retry with capped expo backoff, honoring retry-after — NOT a fixed ~6-attempt cap. Non-429 errors still propagate immediately. 2. All-agents hard-fail. score_aggregator raises (-> handler status:ERROR, NO signals.json / email / DB write) if ANY sector team is missing (absent from ALL_TEAM_IDS), failed, or partial. CIO + macro economist + macro critic LLM calls wrapped in the same deadline helper with explicit max_retries; their failure hard-fails the run (strict-mode default raise) — no synthetic substitute promoted. 3. Removed #193 carry-forward. _update_thesis_for_held_stock no longer carries the prior thesis forward on persistent failure: a bounded parse re-roll for the transient tool-XML schema leak, then RAISE (no isolation fallback). A 429 past the deadline fails fast. 4. Reverted #194 isolation. A failed/partial sector team aborts the whole run even when other teams produced usable picks. KEPT from #194 (composes with the directive): per-team S3 persistence + sector_team_node resume short-circuit. Extended the same persist+resume pattern to CIO + macro (save_agent_run / load_agent_run, archive/agent_runs/{run_date}/{agent_id}.json) so an SF redrive after a hard-fail reuses every already-succeeded agent with zero LLM calls and the long retry only re-attempts the still-missing one(s) — what makes a 60-90 min window bounded. Stub-quarantine root-cause + structural fix (the dangerous bug): s3://alpha-engine-research/signals/2026-05-15/signals.json (written 2026-05-16T17:08:46Z) shipped synthetic [DRY-RUN] stub theses promoted as real (GOOG/AFL/AXP/ABT/APD/ADBE/AMD). Leak path: install_dry_run_stubs only no-op'd write_signals_json + upload_db, NOT save_sector_team_run. The stub-pass runs the full graph; _stub_run_sector_team returns error=None so sector_team_node PERSISTED synthetic [DRY-RUN] output to archive/sector_team_runs/{run_date}/{team_id}.json. The subsequent REAL pass's #194 resume short-circuit LOADED that stub-persisted output and promoted it — zero real Haiku calls. Fix (defense in depth): (a) install_dry_run_stubs now also no-ops save_sector_team_run + save_agent_run so the stub-pass cannot write resume keys; (b) graph.stub_quarantine.assert_no_stub_output at the top of archive_writer refuses to write if the [DRY-RUN marker appears in any promotable surface or a sector team is missing. Tests: full suite 1305 passed, 1 pre-existing unrelated test_scoring RSI failure (known stale-local-config, passes on CI). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

yfinance-centralization arc (plan: alpha-engine-docs/private/yfinance-centralization-260516.md), PR 2 — R3, the last HOT research yfinance socket. Rewrote agents/sector_teams/quant_tools.py::get_balance_sheet to read alpha-engine-data's weekly Finnhub fundamentals snapshot instead of yfinance.Ticker().info. Removed `import yfinance as yf` from that path — quant_tools.py is now yfinance-free for PR2's scope (the get_balance_sheet `import yfinance as yf` was the only yfinance reference). Mechanics (mirrors the in-repo factor_profiles pattern + predictor's inference/stages/fetch_alt_data.py read/date-resolution): - New module-level `read_fundamentals_from_s3()` — reads archive/fundamentals/{date}.json; if no run_date, scans the prefix and reads the most-recent snapshot (predictor's fallback; the snapshot is a slow weekly signal so stale-by-days is acceptable). Returns None on any failure (caller graceful-degrades, never raises). - Cached once at tool-creation time (one S3 read per sector team per Saturday SF run, same as factor_profiles) via a new `fundamentals_data` context key (also injectable for tests — no S3/network in tests). Schema-mapping correction vs the plan doc: the data-module collector does NOT persist raw Finnhub keys — collectors/fundamentals.py writes a NORMALIZED/CLIPPED per-ticker schema {pe_ratio, pb_ratio, debt_to_equity, revenue_growth_yoy, fcf_yield, gross_margin, roe, current_ratio}. Followed the plan's intent (consume the S3 key the predictor already reads) and mapped that ACTUAL schema onto the tool's exact existing output keys: debt_to_equity ← debt_to_equity (already a ratio — NO yfinance %/100 scaling, per plan) current_ratio ← current_ratio pe_ratio ← pe_ratio price_to_book ← pb_ratio revenue_growth ← revenue_growth_yoy gross_margins ← gross_margin forward_pe → None (not persisted by the collector) market_cap → None (computed transiently for fcf_yield, not persisted) The agent consumes these as soft directional context (not a hard gate — the hard D/E gate R2 is being deleted in PR1), so the normalized-scale / trailing-vs-TTM delta is tolerable per the plan's risk note. Graceful-degrade preserved: exact return-schema keys unchanged; a ticker absent from the snapshot (or an empty/missing snapshot) returns {"error": ...} per the existing contract — never raises (all-agents-strict, #195). No read that degraded before now raises. Tests: added tests/test_get_balance_sheet_s3.py (8 tests, monkeypatch/fakes, NO unittest.mock.patch per the documented full-suite bleed) — mapped schema from a faked fundamentals JSON, D/E not %/100-rescaled, {"error":...} on missing ticker + empty snapshot (no raise), no yfinance import, S3-reader scan-for-latest + None-on-failure. Full suite: 1329 passed (1321 baseline + 8 new), 1 pre-existing acceptable failure (tests/test_scoring.py::TestRSIScoring::test_bull_overbought_matches_neutral_post_revert — stale-local-config, passes on CI). Zero new failures. **DEPLOY HELD — research auto-deploys on merge; do not merge until user directs. Part of the yfinance-centralization arc (plan doc alpha-engine-docs/private/yfinance-centralization-260516.md); intended to land before the held Research re-run.** Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… (R5) (#199) yfinance-centralization arc (plan: alpha-engine-docs/private/yfinance-centralization-260516.md), PR 3 — R5. With PR1+PR2 this makes alpha-engine-research yfinance-free at runtime (only local/time_scanner.py keeps yfinance — explicitly excluded by the plan as a dev-only scanner not in the Lambda/SF path). scoring/performance_tracker.py::run_performance_checks — replaced the `yf.download(period="2d")` FALLBACK leg with the in-repo data/fetchers/feature_store_reader.py::read_latest_daily_closes() (alpha-engine-data's staging/daily_closes/ parquet → {ticker: close}). Deleted `import yfinance as yf` + `import pandas as pd` (pandas was only used by the removed yfinance multi-ticker DataFrame parsing); updated the module docstring. Polygon grouped-daily stays the PRIMARY path — completely unchanged. Only the fallback leg swapped. The new get_latest_price() returns polygon_prices[t] first, then daily_closes.get(t) — exact same precedence as before (polygon, then fallback). Graceful-degrade preserved (all-agents-strict, #195): - read_latest_daily_closes() returns {ticker: close} or None and NEVER raises (broad except → None), strictly matching the old try/except contract — no read that degraded before now raises. - If both polygon and the fallback yield nothing, the function returns _compute_accuracy_stats() with no new evaluations recorded — identical net behavior to the old yfinance-failed branch (which proceeded with price_data=None and recorded nothing). - Missing-ticker → row skipped (current_price None), unchanged. Tests: reworked tests/test_performance_tracker.py — the 5 tests that patched `scoring.performance_tracker.yf.download` would break (yf removed), so rewrote them to fake feature_store_reader.read_latest_daily_closes via `monkeypatch` (NOT unittest.mock.patch — documented full-suite bleed; mirrors tests/test_held_thesis_strict.py). Replaced the brittle `@patch("...yf.download")` shape; removed the now-unused yf-shaped `_mock_price_data` + pandas import. Added: module-is-yfinance-free assertion, daily_closes-fallback 10d/beat-SPY/missing-ticker evals, graceful-degrade-when-fallback-unavailable, and an explicit polygon-primary-path-unaffected test with a tripwire that fails if the fallback reader is called when polygon covers all tickers. test_archive.py only references performance_tracker in a docstring — unaffected. Full suite: 1323 passed (perf-tracker class 5 yfinance tests → 7 daily-closes tests; net +2 vs the 1321 baseline), 1 pre-existing acceptable failure (tests/test_scoring.py::TestRSIScoring::test_bull_overbought_matches_neutral_post_revert — stale-local-config, passes on CI). Zero new failures. **DEPLOY HELD — research auto-deploys on merge; do not merge until user directs. Part of the yfinance-centralization arc (plan doc alpha-engine-docs/private/yfinance-centralization-260516.md); intended to land before the held Research re-run.** Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit bf86285 into main May 16, 2026
1 check passed

cipher813 deleted the fix/research-all-agents-strict branch May 16, 2026 20:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(research): all-agents-strict — long 429 retry, hard-fail, stub quarantine (reworks #193/#194)#195

fix(research): all-agents-strict — long 429 retry, hard-fail, stub quarantine (reworks #193/#194)#195
cipher813 merged 1 commit into
mainfrom
fix/research-all-agents-strict

cipher813 commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cipher813 commented May 16, 2026

DEPLOY HELD

The directive (Brian, 2026-05-16)

The 4 behavior changes

1. Long 429 retry (agents/langchain_utils.py)

2. All-agents hard-fail (graph/research_graph.py score_aggregator)

3. Removed #193 carry-forward (agents/sector_teams/sector_team.py)

4. Reverted #194 isolation

What #194 behavior is KEPT and why it composes

Stub-quarantine — root-cause + structural fix (the dangerous one)

Pre-existing #193/#194 tests changed (in-scope contract changes)

Test count

Ambiguity resolved

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Long 429 retry (`agents/langchain_utils.py`)

2. All-agents hard-fail (`graph/research_graph.py` `score_aggregator`)

3. Removed #193 carry-forward (`agents/sector_teams/sector_team.py`)