
feat(sf): run Research and PredictorTraining in parallel (data-independent; per-branch error isolation)#251

Merged
cipher813 merged 1 commit into main from feat/sf-research-predictor-parallel
May 17, 2026

@cipher813
Owner

DEPLOY HELD

DEPLOY HELD — prod SF-topology change; do not merge/redeploy any SF/trigger any execution until the user directs. This restructures the live Saturday Step Function topology. CLAUDE.md:100 "spread API load" rationale is stale and must be corrected on merge (flagged here, NOT edited — ~/Development/CLAUDE.md is outside this repo and not in a clean git repo).

Rationale (stale "spread API load", verified)

Research and PredictorTraining are data-independent (CLAUDE.md Architecture: "Research and Predictor Training are independent — no data flows between them"). They ran sequentially today only to "spread API load" (CLAUDE.md:100) — a now-stale, verifiably wrong rationale:

  • Predictor training (alpha-engine-predictor/training/train_handler.py) reads ArcticDB universe features, then trains a CPU LightGBM model and a Ridge meta-learner. It makes no Anthropic calls. The yfinance fallback was removed by predictor PR #6 ("Add PredictorHealthCheck to weekday pipeline"); the residual yfinance docstrings in train_handler.py are stale.
  • Research's only heavy load is Anthropic (LangGraph multi-agent fan-out).
  • They do not contend on the rate-limited Anthropic API. The sole reason to serialize them no longer holds.

Cost of the stale serialization: PredictorTraining (~91-min EC2 spot) is gated behind Research's variable runtime, which is now strict (research #195: all-agents, ~75-min retry, no partial output), and the full Research-chain wall-clock is added serially on top.

Topology

Replace the sequential Research…→PredictorTraining run with an SF Parallel state ResearchPredictorParallel (entered where the research chain previously began):

  • Branch A (StartAt: CheckSkipResearch) — everything that consumes Research output, current order, every CheckSkip* gate / Wait-Check quartet / Retry / fail-soft Catch intact: CheckSkipResearch → Research → CheckResearchStatus → CheckSkipDataPhase2 → DataPhase2 → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence → EvalJudgeSubmit{FirstSaturday,Weekly} → EvalJudgePoll{Choice,Wait,Poll,Decision} → EvalJudgeProcess → EvalRollingMean → CheckSkipRationaleClustering → RationaleClustering → CheckSkipReplayConcordance → ReplayConcordance → CheckSkipCounterfactual → Counterfactual (+ ExtractResearchError). Terminals BranchAComplete/BranchAFailed.
  • Branch B (StartAt: CheckSkipPredictorTraining) — PredictorTraining quartet + skip-gate intact: CheckSkipPredictorTraining → PredictorTraining → WaitForPredictorTraining → CheckPredictorStatus → PredictorWait (+ ExtractPredictorError). Terminals BranchBComplete/BranchBFailed.
  • Join → AggregateBranchOutcomes → CheckBranchOutcomes → CheckSkipDriftDetection → DriftDetection → Backtester → Parity → Evaluator → … unchanged. The Parallel join is the natural sync point (Backtester needs both branches done).

Inbound edges (RegimeRetrospectiveEval Next + Catch, CheckSkipRegimeRetrospectiveEval skip choice) re-pointed to ResearchPredictorParallel. The Parallel has no InputPath/Parameters, so each branch gets the full input incl. $.ec2_instance_id (Branch B's SSM calls resolve unchanged); ResultPath: $.parallel_result does not clobber input.
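A minimal sketch of the input-passing claim above, written as a Python dict rather than the real ASL. The branch bodies are elided, the instance id is made up, and `branches_see_full_input` / `apply_result_path` are hypothetical helpers that only model the two properties described (no input filtering on the Parallel, and a ResultPath that adds a key instead of replacing the input).

```python
# Hypothetical, trimmed skeleton: branch bodies are elided ("States": {}), so this
# is not deployable ASL. Only the state names and ResultPath come from the PR text.
research_predictor_parallel = {
    "Type": "Parallel",
    "Branches": [
        {"StartAt": "CheckSkipResearch", "States": {}},           # Branch A
        {"StartAt": "CheckSkipPredictorTraining", "States": {}},  # Branch B
    ],
    "ResultPath": "$.parallel_result",
    "Next": "AggregateBranchOutcomes",
}


def branches_see_full_input(state):
    """True when the Parallel state does not filter or reshape its input."""
    return "InputPath" not in state and "Parameters" not in state


def apply_result_path(state_input, result, result_path):
    """Toy model of ResultPath for a single top-level key such as '$.parallel_result'."""
    assert result_path.startswith("$.") and "." not in result_path[2:]
    merged = dict(state_input)          # original input is preserved
    merged[result_path[2:]] = result    # result is attached under the new key
    return merged


saturday_input = {"ec2_instance_id": "i-0123456789abcdef0"}  # made-up instance id
after_join = apply_result_path(
    saturday_input,
    [{"branch_a_status": "OK"}, {"branch_b_status": "OK"}],  # assumed branch outputs
    research_predictor_parallel["ResultPath"],
)
```

Under these assumptions `after_join` still carries `ec2_instance_id` next to the new `parallel_result` key, which is the property the downstream states rely on.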

Per-branch error isolation (the correctness-critical design)

SF Parallel default semantics: if one branch errors, the whole Parallel fails and sibling branches are abandoned. Because strict Research hard-fails and PredictorTraining is an expensive, weight-promoting spot run (Promoted: True mid-run → weights land in S3), a naive Parallel would cancel or waste an in-flight or completed PredictorTraining.

A branch NEVER throws. Each branch ends in a branch-local Pass terminal with End: true:

  • BranchAComplete/BranchBComplete: records branch_{a,b}_status = OK.
  • BranchAFailed/BranchBFailed: records branch_{a,b}_status = FAILED plus the captured $.error, as data, then End: true.

Every in-branch hard-fail edge that previously routed to the shared HandleFailure is re-pointed to that branch's *Failed terminal (Branch A: Research/DataPhase2 Catch + ExtractResearchError; Branch B: PredictorTraining/WaitForPredictorTraining Catch + ExtractPredictorError). The eval/agent-justification Catches stay fail-soft (forward within Branch A, never to *Failed/HandleFailure).
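The re-pointing described above can be illustrated with a heavily simplified Branch B. This is a sketch, not the deployed definition: the quartet states are omitted, the `Resource` is a placeholder, and the `Parameters` shape of the terminals and the `ResultPath: $.error` on the Catch are assumptions made for illustration. `catch_targets` and `terminals` are hypothetical helpers.

```python
# Simplified Branch B. The real WaitForPredictorTraining / CheckPredictorStatus /
# PredictorWait / ExtractPredictorError states and the skip-gate are omitted.
branch_b = {
    "StartAt": "PredictorTraining",
    "States": {
        "PredictorTraining": {
            "Type": "Task",
            "Resource": "<placeholder>",  # not the real integration ARN
            "Catch": [
                # Hard-fail edge goes to the branch-local terminal, not HandleFailure.
                {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "BranchBFailed"}
            ],
            "Next": "BranchBComplete",
        },
        # Assumed terminal shapes: a Pass state that records the outcome as data.
        "BranchBComplete": {"Type": "Pass", "Parameters": {"branch_b_status": "OK"}, "End": True},
        "BranchBFailed": {
            "Type": "Pass",
            "Parameters": {"branch_b_status": "FAILED", "error.$": "$.error"},
            "End": True,
        },
    },
}


def catch_targets(branch):
    """Every state name that a Catch inside this branch routes to."""
    return {c["Next"] for s in branch["States"].values() for c in s.get("Catch", [])}


def terminals(branch):
    """Names of states that end the branch."""
    return {name for name, s in branch["States"].items() if s.get("End")}
```

In this sketch every Catch lands on `BranchBFailed` and both terminals are Pass states, so the branch always finishes successfully from the Parallel engine's point of view.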

Because both branches always succeed, the Parallel engine never cancels a sibling — both run to their own completion. Failure is surfaced AFTER the join: AggregateBranchOutcomes hoists both statuses; CheckBranchOutcomes routes to ExtractParallelBranchError → HandleFailure → FailExecution iff either branch is FAILED, else continues. This mirrors the existing per-state Extract*Error → HandleFailure convention — no new error channel. The Parallel-level Catch (States.ALL → HandleFailure) is defense-in-depth for a genuine SF-engine Parallel error only; the Parallel-level Retry (States.ALL, MaxAttempts: 0) is an explicit no-op so a completed PredictorTraining is never re-run.
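The post-join routing can be modelled in a few lines of Python. This is a sketch of the decision logic only, assuming each branch's output carries its own `branch_{a,b}_status` key; a Parallel state emits its branch outputs as a list in branch order, so index 0 is Branch A and index 1 is Branch B. The function names mirror the state names but are otherwise hypothetical.

```python
def aggregate_branch_outcomes(parallel_result):
    """Hoist both statuses out of the Parallel output (list in branch order: A, B)."""
    branch_a, branch_b = parallel_result
    return {
        "branch_a_status": branch_a["branch_a_status"],
        "branch_b_status": branch_b["branch_b_status"],
    }


def check_branch_outcomes(outcomes):
    """Name of the next state: fail iff either branch recorded FAILED."""
    if "FAILED" in outcomes.values():
        return "ExtractParallelBranchError"  # → HandleFailure → FailExecution
    return "CheckSkipDriftDetection"         # normal continuation


# Research hard-failed while PredictorTraining completed.
outcomes = aggregate_branch_outcomes(
    [{"branch_a_status": "FAILED", "error": {"Error": "States.TaskFailed"}},
     {"branch_b_status": "OK"}]
)
next_state = check_branch_outcomes(outcomes)
```

With these inputs `next_state` is `ExtractParallelBranchError`, and it is reached only after both branches have produced output.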

Recovery semantics — Research fail with Predictor done

Branch A hard-fails; Branch B's PredictorTraining completed and already promoted weights to S3. With per-branch isolation: Branch B runs to completion uninterrupted (BranchBComplete=OK), BranchAFailed=FAILED+error. Post-join CheckBranchOutcomes sees branch_a_status=FAILED → SF fails (FAILED SNS alert names which branch failed/completed). The promoted weights are live in S3, untouched. Recovery re-run (composes with the live-SF-derived skip-set practice): re-run the Saturday SF with {"skip_predictor_training": true} — Branch B's CheckSkipPredictorTraining routes straight to BranchBComplete (no spot, no re-train), Branch A re-runs end-to-end. Symmetric case (Predictor fails, Research done): recovery with {"skip_research": true, …} per the same practice.
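As a rough illustration of the recovery rule, the skip input for a re-run could be derived from the recorded branch statuses. `recovery_input` is a hypothetical helper, not part of this change, and it emits only the two flags named above; the additional flags elided in the symmetric case are not known here and are deliberately left out.

```python
def recovery_input(outcomes):
    """Skip flags for a recovery re-run: skip whichever branch genuinely completed.

    Hypothetical helper. Only the two flags named in the PR text are produced;
    any further skip flags a real recovery needs are not modelled.
    """
    skip = {}
    if outcomes["branch_a_status"] == "OK":
        skip["skip_research"] = True
    if outcomes["branch_b_status"] == "OK":
        skip["skip_predictor_training"] = True
    return skip


# Research failed, Predictor done: re-run Branch A only.
research_failed = recovery_input({"branch_a_status": "FAILED", "branch_b_status": "OK"})
# Predictor failed, Research done: re-run Branch B only.
predictor_failed = recovery_input({"branch_a_status": "OK", "branch_b_status": "FAILED"})
```

If both branches failed, the helper returns an empty dict, which corresponds to re-running both branches.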

Tests

  • New tests/test_sf_research_predictor_parallel_wiring.py (72 tests): sibling Parallel branches; no Research→Predictor serial edge anywhere; Branch-A/B contents + preserved skip-gates/quartets; per-branch error isolation incl. test_no_branch_state_routes_to_top_level_handle_failure (the core cross-branch-cancellation guard); Research-hardfail/DataPhase2/Predictor failure → *Failed; eval-chain fail-soft Catches preserved; post-join fail-if-either-FAILED; Parallel Catch=shared HandleFailure + Retry no-op; $.ec2_instance_id reaches Branch B; inbound rewire; Backtester strictly after join; no dangling Next/Default/Catch (top level + in-branch); JSON parses.
  • Updated tests/test_sf_eval_judge_wiring.py — flattened state fixture (states moved into Branch A; all shape/payload/retry/timeout assertions still hold), old cross-boundary Counterfactual → CheckSkipPredictorTraining assertions retargeted to the new BranchAComplete terminal.
  • Updated tests/test_sf_regime_substrate_wiring.py — inbound RegimeRetrospectiveEval/skip edge → ResearchPredictorParallel.
  • Full suite: 1207 passed, 1 skipped (alpha-engine-data/.venv/bin/python -m pytest tests/ -q). Zero new failures. Pre-existing: 1 skip + 5 pandas FutureWarnings in daily_append.py (concat with empty/all-NA — unrelated to this change).
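For readers unfamiliar with the "no dangling Next/Default/Catch (top level + in-branch)" check, a minimal version might look like the following. This is a sketch under assumptions, not the code in tests/test_sf_research_predictor_parallel_wiring.py: `dangling_targets` and the toy machine are hypothetical. It relies on the ASL rule that a state inside a branch can only transition to states in the same branch, which is why an in-branch edge to the top-level HandleFailure shows up as dangling.

```python
import copy


def dangling_targets(machine):
    """Return (state, target) pairs whose target is not defined in the same scope."""
    states = machine["States"]
    missing = []
    if machine["StartAt"] not in states:
        missing.append(("<StartAt>", machine["StartAt"]))
    for name, state in states.items():
        targets = [state.get("Next"), state.get("Default")]
        targets += [rule.get("Next") for rule in state.get("Choices", [])]
        targets += [catch.get("Next") for catch in state.get("Catch", [])]
        missing += [(name, t) for t in targets if t is not None and t not in states]
        for branch in state.get("Branches", []):  # each branch is its own scope
            missing += dangling_targets(branch)
    return missing


# Toy machine: names come from the PR, the wiring is reduced to the bare minimum.
toy = {
    "StartAt": "ResearchPredictorParallel",
    "States": {
        "ResearchPredictorParallel": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "BranchAComplete",
                 "States": {"BranchAComplete": {"Type": "Pass", "End": True}}},
                {"StartAt": "BranchBComplete",
                 "States": {"BranchBComplete": {"Type": "Pass", "End": True}}},
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "Next": "AggregateBranchOutcomes",
        },
        "AggregateBranchOutcomes": {"Type": "Pass", "End": True},
        "HandleFailure": {"Type": "Pass", "End": True},
    },
}

# Break the toy: route an in-branch state to the top-level HandleFailure.
broken = copy.deepcopy(toy)
branch_a_states = broken["States"]["ResearchPredictorParallel"]["Branches"][0]["States"]
branch_a_states["BranchAComplete"] = {"Type": "Pass", "Next": "HandleFailure"}
```

Here `dangling_targets(toy)` is empty, while `dangling_targets(broken)` reports the escaped edge from `BranchAComplete`.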

DEPLOY HELD

DEPLOY HELD — prod SF-topology change; do not merge/redeploy/trigger until the user directs. CLAUDE.md:100 "spread API load" rationale is stale and must be corrected on merge (flagged, not edited here — that file is outside this repo).

🤖 Generated with Claude Code

…ndent; per-branch error isolation)

Research and PredictorTraining are data-independent (CLAUDE.md
Architecture: "no data flows between them"). They ran sequentially only
to "spread API load" — a now-stale rationale: predictor training
(alpha-engine-predictor/training/train_handler.py) reads ArcticDB + CPU
LightGBM and makes NO Anthropic calls (yfinance fallback removed by
predictor PR #6; train_handler yfinance docstrings are stale). Research's
only heavy load is Anthropic. They do not contend on the rate-limited API.

Restructure the sequential Research…→PredictorTraining run into an SF
Parallel state (ResearchPredictorParallel):
- Branch A: CheckSkipResearch → Research → DataPhase2 → EvalJudge chain →
  EvalRollingMean → RationaleClustering → ReplayConcordance →
  Counterfactual (everything that consumes Research output, current order,
  all CheckSkip*/quartets/fail-soft Catches intact).
- Branch B: PredictorTraining quartet + skip-gate intact.
- Join → AggregateBranchOutcomes → CheckBranchOutcomes →
  CheckSkipDriftDetection → Backtester → Parity → Evaluator (unchanged).

Per-branch error isolation (the correctness-critical requirement): SF
Parallel's default cancels siblings when one branch errors. To prevent a
strict-Research hard-fail from aborting/wasting an in-flight or
completed+S3-promoted PredictorTraining, each branch ends in a
branch-local Pass terminal (End:true) recording OK/FAILED as data — a
branch NEVER throws. The SF is failed AFTER the join (post-aggregation)
if either branch recorded FAILED, so the other branch's completed work
(incl. already-promoted predictor weights in S3) persists and the
recovery re-run's skip-set can skip whichever branch genuinely completed
(Research-fail + Predictor-done → re-run with skip_predictor_training).
Parallel-level Catch → existing shared HandleFailure (no new error
channel); Parallel Retry is a documented no-op (MaxAttempts:0) so a
completed PredictorTraining is never re-run.

Inbound edges (RegimeRetrospectiveEval Next+Catch,
CheckSkipRegimeRetrospectiveEval skip choice) re-pointed to
ResearchPredictorParallel.

Tests: new tests/test_sf_research_predictor_parallel_wiring.py (72 tests:
sibling branches; Branch-A/B contents; per-branch isolation incl. no
in-branch escape to HandleFailure; post-join fail-if-either-FAILED;
ec2_instance_id reaches Branch B; Backtester after join; no dangling
targets anywhere). Updated test_sf_eval_judge_wiring.py (flattened state
fixture + old cross-boundary edge assertions retargeted to BranchAComplete)
and test_sf_regime_substrate_wiring.py (inbound edge → Parallel). Full
suite green: 1207 passed, 1 skipped (pre-existing pandas FutureWarnings
in daily_append.py, unrelated).

DEPLOY HELD — prod SF-topology change; do not merge/redeploy/trigger
until the user directs. CLAUDE.md:100 "spread API load" rationale is
stale and must be corrected on merge (flagged, not edited — that file is
outside this repo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 64acbd0 into main May 17, 2026
1 check passed
@cipher813 cipher813 deleted the feat/sf-research-predictor-parallel branch May 17, 2026 00:47