Skip to content

fix(research): all-agents-strict — long 429 retry, hard-fail, stub quarantine (reworks #193/#194)#195

Merged
cipher813 merged 1 commit into
mainfrom
fix/research-all-agents-strict
May 16, 2026
Merged

fix(research): all-agents-strict — long 429 retry, hard-fail, stub quarantine (reworks #193/#194)#195
cipher813 merged 1 commit into
mainfrom
fix/research-all-agents-strict

Conversation

@cipher813
Copy link
Copy Markdown
Owner

DEPLOY HELD

DEPLOY HELD — reworks just-merged #193/#194; do not merge/deploy/re-run; after merge the Research Lambda must be redeployed before live; user will re-run manually. This reverses the degrade-and-continue philosophy of two PRs that merged earlier today. The user reviews before deploy.

The directive (Brian, 2026-05-16)

"If the sector agents don't run, Research shouldn't complete until all sectors are run. We should have a long retry mechanism and after this long period if we still don't have all sectors it should fail. We don't get anything from this process if the sectors, or any other agent for that matter, fail/don't run."

This OVERRIDES #193's carry-forward and #194's per-team isolation. All-agents-strict scope: sector teams, held-thesis updates, CIO, macro economist (+ critic) — every agent.

The 4 behavior changes

1. Long 429 retry (agents/langchain_utils.py)

invoke_with_rate_limit_retry is now an overall wall-clock deadline of persistent 429 retry, not a fixed ~6-attempt cap. Capped exponential backoff (base 4s, cap 60s) + decorrelated jitter between attempts, honoring the retry-after header. When the deadline is exceeded the 429 propagates and the caller hard-fails.

  • Deadline constant: RATE_LIMIT_RETRY_DEADLINE_SECONDS, default 75 min (4500s), set module-level in agents/langchain_utils.py via _resolve_deadline_seconds(). Env override RATE_LIMIT_RETRY_DEADLINE_SECONDS, clamped to 5 min .. 3 hr so a typo can't make it unbounded or trivial. The wrapper reads the module constant at call time (tests monkeypatch it; the Lambda sets the env before cold-start import).
  • The max_attempts kwarg was removed (replaced by deadline_seconds). Non-429 errors still propagate immediately and unchanged.

2. All-agents hard-fail (graph/research_graph.py score_aggregator)

Reverts #194. score_aggregator raises RuntimeError (→ handler status:ERROR, no signals.json / email / DB write) if ANY sector team is missing (absent from ALL_TEAM_IDS coverage), failed (carries error), or partial. Surviving picks from other teams no longer save the run.

CIO (agents/investment_committee/ic_cio.py), macro economist + critic (agents/macro_agent.py): their .invoke() calls are wrapped in the same deadline helper, with explicit max_retries=SECTOR_TEAM_LLM_MAX_RETRIES on every ChatAnthropic. A 429 past the deadline → propagate → strict-mode (production default) raise → run hard-fails. No synthetic/empty CIO or macro substitute is promoted.

3. Removed #193 carry-forward (agents/sector_teams/sector_team.py)

_update_thesis_for_held_stock no longer carries the prior thesis forward on persistent failure. It does a bounded parse re-roll (_MAX_PARSE_ATTEMPTS = 3) for the transient tool-XML schema leak (the 2026-05-16 catalysts string-not-list nondeterminism — these recover on a re-roll), then RAISES. A 429 (detected via _is_rate_limit_error) past the deadline-bounded wrapper fails fast — re-rolling can't fix an org TPM ceiling. The raise surfaces as the team's hard errorscore_aggregator hard-fails.

4. Reverted #194 isolation

Covered by change 2 — a failed/partial team aborts the run even with other teams' picks.

What #194 behavior is KEPT and why it composes

Per-team S3 persistence + sector_team_node resume short-circuit are KEPT unchanged (archive/manager.py save_sector_team_run/load_sector_team_run, the archive/sector_team_runs/{run_date}/{team_id}.json scheme). It composes with the directive: each agent that succeeds is persisted the moment it completes, so when one agent hard-fails and the run ERRORs, an SF redrive reuses the persisted succeeded teams (zero Haiku calls) and the long 75-min retry only ever re-attempts the still-missing agent(s). That is exactly what makes a 60-90 min window affordable and bounded rather than re-paying the whole 6-team fan-out every redrive.

Extended the same persist+resume to CIO + macro (new ArchiveManager.save_agent_run/load_agent_run, archive/agent_runs/{run_date}/{agent_id}.json; resume short-circuit + persist-on-success in cio_node and macro_economist_node). Macro runs upstream of sector dispatch (Stage B) and CIO is the single most expensive Sonnet call — without this, every redrive after a sector hard-fail would re-pay them before even reaching the still-missing team.

Stub-quarantine — root-cause + structural fix (the dangerous one)

Evidence: s3://alpha-engine-research/signals/2026-05-15/signals.json, written 2026-05-16T17:08:46Z by the recovery-postfix194 run, shipped with synthetic dry_run.py stub output promoted as real — every new buy_candidate (GOOG/AFL/AXP/ABT/APD/ADBE/AMD) had a thesis_summary starting "[DRY-RUN] Strong fundamentals…" and the email rendered them as real picks, on a run that was NOT dry_run_llm.

Precise leak path (root-caused): The default dry-run gate runs a synthetic stub-pass first via install_dry_run_stubs(archive), then _restore(). install_dry_run_stubs only no-op'd write_signals_json + upload_db. The stub-pass runs the full graph: sector_team_node calls save_sector_team_run the moment a team "succeeds", and _stub_run_sector_team returns a normal-looking dict with error=None — so the stub-pass persisted synthetic [DRY-RUN] sector-team output to archive/sector_team_runs/{run_date}/{team_id}.json. The subsequent real pass's sector_team_node resume short-circuit (#194) then loaded that stub-persisted output via load_sector_team_run and promoted the synthetic theses straight into signals.json + the email — with zero real Haiku calls. The per-team resume persistence path (added by #194) was never suppressed for the stub-pass, so the stub-pass leaked synthetic state into the real pass.

Structural fix (defense in depth):

  • install_dry_run_stubs now also no-ops save_sector_team_run + save_agent_run so the stub-pass cannot write the resume keys at all.
  • New graph/stub_quarantine.py assert_no_stub_output, called at the top of archive_writer (before any write; archive_writer precedes email_sender so a raise here also blocks the email). It raises StubQuarantineError if the [DRY-RUN marker (canonical dry_run.DRY_RUN_MARKER, embedded in every stub string) appears anywhere in the signals payload / consolidated report / investment theses / sector-team outputs / entry theses / IC decisions / macro report, or if any sector team is missing/failed/partial. A promoted artifact may ONLY be produced by a fully-real, all-agents-complete pass. The signals payload is built once and reused at the write site (no double-build, no TOCTOU).

Pre-existing #193/#194 tests changed (in-scope contract changes)

  • tests/test_held_thesis_isolation.pytests/test_held_thesis_strict.py (renamed). Every test rewritten to the new contract:
    • test_malformed_catalysts_degrades_to_prior_thesistest_malformed_catalysts_raises_after_reroll (carry-forward removed → now RAISES).
    • test_retry_recovers_when_second_attempt_validtest_reroll_recovers_when_a_later_attempt_valid (kept; parse budget 2→3, post-budget = raise not carry).
    • test_validation_error_does_not_propagate_in_strict_modetest_validation_error_DOES_propagate (the OPPOSITE assertion — the whole point of the rework).
    • test_no_prior_thesis_isolation_marks_score_failedtest_no_prior_thesis_still_raises (no isolation fallback).
  • tests/test_score_aggregator_failure.py rewritten. TestScoreAggregatorPartialTolerance / TestScoreAggregatorIsolation pinned fix(research): per-sector-team S3 persistence + 429 backoff + isolation — re-runs never re-pay completed teams #194's tolerate-and-continue; replaced with TestPartialNoLongerTolerated / TestNoIsolation asserting RAISE. test_passes_through_when_no_errors / _when_error_key_absent (a single clean team used to pass) → test_missing_teams_hard_fail* + a test_full_clean_set_passes_through that supplies the full 6-team set. The test_2026_05_16_multi_team_429_does_not_abort_when_picks_survive regression is inverted to test_2026_05_16_multi_team_429_now_hard_fails.
  • tests/test_sector_team_persist_backoff.py: test_persistent_429_eventually_raisestest_persistent_429_raises_at_deadline_not_attempt_count (the max_attempts kwarg was removed). Added test_deadline_constant_default_is_75_min + test_long_429_then_success_within_deadline.
  • tests/test_thesis_update_recompute.py: the _state helper now pads absent ALL_TEAM_IDS teams with clean stubs so these tests (which exercise the recompute/normalization path after the strict gate, using a non-ALL_TEAM_IDS team_id like "energy") reach the path under test. Zero recommendations/theses contributed by the padding → every test's investment_theses assertion is unchanged.
  • tests/test_max_tokens_lint.py: allowlist line for the macro critic max_tokens=512 moved 462 → 476 (my import + max_retries= kwarg shifted it). Same intentional 512-literal call.

New tests added: tests/test_stub_quarantine.py (the precise 2026-05-15 failure shape — stub-pass cannot persist resume keys + write-site guard blocks the marker/missing-team), tests/test_cio_macro_strict_persist.py (CIO/macro 429-past-deadline propagation + agent_run persist/resume).

Test count

Full suite: 1305 passed, 1 pre-existing unrelated failure (tests/test_scoring.py::TestRSIScoring::test_bull_overbought_matches_neutral_post_revert — known stale-local-config artifact, passes on CI). Validated with ~/Development/alpha-engine-research/.venv/bin/python -m pytest tests/ -q -p no:cacheprovider.

Ambiguity resolved

The directive said "extend persist+resume to CIO/macro if practical; if high-risk, persist sector teams only and document why." It was practical and low-risk — save_agent_run/load_agent_run mirror the proven #194 sector-team envelope + identity-guard contract exactly, and the stub-quarantine suppression of save_agent_run keeps it consistent with the structural fix — so I extended it to both CIO and macro.

🤖 Generated with Claude Code

…arantine (reworks #193/#194)

Authoritative directive (Brian, 2026-05-16) — supersedes #193/#194's
degrade-and-continue philosophy:

  "If the sector agents don't run, Research shouldn't complete until
   all sectors are run. We should have a long retry mechanism and
   after this long period if we still don't have all sectors it
   should fail. We don't get anything from this process if the
   sectors, or any other agent for that matter, fail/don't run."

Four behavior changes:

1. Long 429 retry. invoke_with_rate_limit_retry is now an overall
   wall-clock DEADLINE (default RATE_LIMIT_RETRY_DEADLINE_SECONDS =
   75 min, env-overridable, clamped 5 min .. 3 hr) of persistent
   429 retry with capped expo backoff, honoring retry-after — NOT a
   fixed ~6-attempt cap. Non-429 errors still propagate immediately.

2. All-agents hard-fail. score_aggregator raises (-> handler
   status:ERROR, NO signals.json / email / DB write) if ANY sector
   team is missing (absent from ALL_TEAM_IDS), failed, or partial.
   CIO + macro economist + macro critic LLM calls wrapped in the
   same deadline helper with explicit max_retries; their failure
   hard-fails the run (strict-mode default raise) — no synthetic
   substitute promoted.

3. Removed #193 carry-forward. _update_thesis_for_held_stock no
   longer carries the prior thesis forward on persistent failure: a
   bounded parse re-roll for the transient tool-XML schema leak, then
   RAISE (no isolation fallback). A 429 past the deadline fails fast.

4. Reverted #194 isolation. A failed/partial sector team aborts the
   whole run even when other teams produced usable picks.

KEPT from #194 (composes with the directive): per-team S3
persistence + sector_team_node resume short-circuit. Extended the
same persist+resume pattern to CIO + macro (save_agent_run /
load_agent_run, archive/agent_runs/{run_date}/{agent_id}.json) so an
SF redrive after a hard-fail reuses every already-succeeded agent
with zero LLM calls and the long retry only re-attempts the
still-missing one(s) — what makes a 60-90 min window bounded.

Stub-quarantine root-cause + structural fix (the dangerous bug):
s3://alpha-engine-research/signals/2026-05-15/signals.json (written
2026-05-16T17:08:46Z) shipped synthetic [DRY-RUN] stub theses
promoted as real (GOOG/AFL/AXP/ABT/APD/ADBE/AMD). Leak path:
install_dry_run_stubs only no-op'd write_signals_json + upload_db,
NOT save_sector_team_run. The stub-pass runs the full graph;
_stub_run_sector_team returns error=None so sector_team_node
PERSISTED synthetic [DRY-RUN] output to
archive/sector_team_runs/{run_date}/{team_id}.json. The subsequent
REAL pass's #194 resume short-circuit LOADED that stub-persisted
output and promoted it — zero real Haiku calls. Fix (defense in
depth): (a) install_dry_run_stubs now also no-ops
save_sector_team_run + save_agent_run so the stub-pass cannot write
resume keys; (b) graph.stub_quarantine.assert_no_stub_output at the
top of archive_writer refuses to write if the [DRY-RUN marker
appears in any promotable surface or a sector team is missing.

Tests: full suite 1305 passed, 1 pre-existing unrelated
test_scoring RSI failure (known stale-local-config, passes on CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit bf86285 into main May 16, 2026
1 check passed
@cipher813 cipher813 deleted the fix/research-all-agents-strict branch May 16, 2026 20:50
cipher813 added a commit that referenced this pull request May 17, 2026
yfinance-centralization arc (plan: alpha-engine-docs/private/yfinance-centralization-260516.md),
PR 2 — R3, the last HOT research yfinance socket.

Rewrote agents/sector_teams/quant_tools.py::get_balance_sheet to read
alpha-engine-data's weekly Finnhub fundamentals snapshot instead of
yfinance.Ticker().info. Removed `import yfinance as yf` from that path —
quant_tools.py is now yfinance-free for PR2's scope (the get_balance_sheet
`import yfinance as yf` was the only yfinance reference).

Mechanics (mirrors the in-repo factor_profiles pattern + predictor's
inference/stages/fetch_alt_data.py read/date-resolution):
- New module-level `read_fundamentals_from_s3()` — reads
  archive/fundamentals/{date}.json; if no run_date, scans the prefix and
  reads the most-recent snapshot (predictor's fallback; the snapshot is a
  slow weekly signal so stale-by-days is acceptable). Returns None on any
  failure (caller graceful-degrades, never raises).
- Cached once at tool-creation time (one S3 read per sector team per
  Saturday SF run, same as factor_profiles) via a new `fundamentals_data`
  context key (also injectable for tests — no S3/network in tests).

Schema-mapping correction vs the plan doc: the data-module collector does
NOT persist raw Finnhub keys — collectors/fundamentals.py writes a
NORMALIZED/CLIPPED per-ticker schema {pe_ratio, pb_ratio, debt_to_equity,
revenue_growth_yoy, fcf_yield, gross_margin, roe, current_ratio}. Followed
the plan's intent (consume the S3 key the predictor already reads) and
mapped that ACTUAL schema onto the tool's exact existing output keys:
  debt_to_equity ← debt_to_equity (already a ratio — NO yfinance %/100
                   scaling, per plan)
  current_ratio  ← current_ratio
  pe_ratio       ← pe_ratio
  price_to_book  ← pb_ratio
  revenue_growth ← revenue_growth_yoy
  gross_margins  ← gross_margin
  forward_pe     → None  (not persisted by the collector)
  market_cap     → None  (computed transiently for fcf_yield, not persisted)
The agent consumes these as soft directional context (not a hard gate —
the hard D/E gate R2 is being deleted in PR1), so the normalized-scale /
trailing-vs-TTM delta is tolerable per the plan's risk note.

Graceful-degrade preserved: exact return-schema keys unchanged; a ticker
absent from the snapshot (or an empty/missing snapshot) returns
{"error": ...} per the existing contract — never raises (all-agents-strict,
#195). No read that degraded before now raises.

Tests: added tests/test_get_balance_sheet_s3.py (8 tests, monkeypatch/fakes,
NO unittest.mock.patch per the documented full-suite bleed) — mapped schema
from a faked fundamentals JSON, D/E not %/100-rescaled, {"error":...} on
missing ticker + empty snapshot (no raise), no yfinance import, S3-reader
scan-for-latest + None-on-failure. Full suite: 1329 passed
(1321 baseline + 8 new), 1 pre-existing acceptable failure
(tests/test_scoring.py::TestRSIScoring::test_bull_overbought_matches_neutral_post_revert
— stale-local-config, passes on CI). Zero new failures.

**DEPLOY HELD — research auto-deploys on merge; do not merge until user
directs. Part of the yfinance-centralization arc (plan doc
alpha-engine-docs/private/yfinance-centralization-260516.md); intended to
land before the held Research re-run.**

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 17, 2026
… (R5) (#199)

yfinance-centralization arc (plan: alpha-engine-docs/private/yfinance-centralization-260516.md),
PR 3 — R5. With PR1+PR2 this makes alpha-engine-research yfinance-free at
runtime (only local/time_scanner.py keeps yfinance — explicitly excluded by
the plan as a dev-only scanner not in the Lambda/SF path).

scoring/performance_tracker.py::run_performance_checks — replaced the
`yf.download(period="2d")` FALLBACK leg with the in-repo
data/fetchers/feature_store_reader.py::read_latest_daily_closes()
(alpha-engine-data's staging/daily_closes/ parquet → {ticker: close}).
Deleted `import yfinance as yf` + `import pandas as pd` (pandas was only
used by the removed yfinance multi-ticker DataFrame parsing); updated the
module docstring.

Polygon grouped-daily stays the PRIMARY path — completely unchanged. Only
the fallback leg swapped. The new get_latest_price() returns
polygon_prices[t] first, then daily_closes.get(t) — exact same precedence
as before (polygon, then fallback).

Graceful-degrade preserved (all-agents-strict, #195):
- read_latest_daily_closes() returns {ticker: close} or None and NEVER
  raises (broad except → None), strictly matching the old try/except
  contract — no read that degraded before now raises.
- If both polygon and the fallback yield nothing, the function returns
  _compute_accuracy_stats() with no new evaluations recorded — identical
  net behavior to the old yfinance-failed branch (which proceeded with
  price_data=None and recorded nothing).
- Missing-ticker → row skipped (current_price None), unchanged.

Tests: reworked tests/test_performance_tracker.py — the 5 tests that
patched `scoring.performance_tracker.yf.download` would break (yf removed),
so rewrote them to fake feature_store_reader.read_latest_daily_closes via
`monkeypatch` (NOT unittest.mock.patch — documented full-suite bleed;
mirrors tests/test_held_thesis_strict.py). Replaced the brittle
`@patch("...yf.download")` shape; removed the now-unused yf-shaped
`_mock_price_data` + pandas import. Added: module-is-yfinance-free
assertion, daily_closes-fallback 10d/beat-SPY/missing-ticker evals,
graceful-degrade-when-fallback-unavailable, and an explicit
polygon-primary-path-unaffected test with a tripwire that fails if the
fallback reader is called when polygon covers all tickers. test_archive.py
only references performance_tracker in a docstring — unaffected.
Full suite: 1323 passed (perf-tracker class 5 yfinance tests → 7
daily-closes tests; net +2 vs the 1321 baseline), 1 pre-existing
acceptable failure
(tests/test_scoring.py::TestRSIScoring::test_bull_overbought_matches_neutral_post_revert
— stale-local-config, passes on CI). Zero new failures.

**DEPLOY HELD — research auto-deploys on merge; do not merge until user
directs. Part of the yfinance-centralization arc (plan doc
alpha-engine-docs/private/yfinance-centralization-260516.md); intended to
land before the held Research re-run.**

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant