feat(sf): run Research and PredictorTraining in parallel (data-independent; per-branch error isolation)#251
Merged
Conversation
…ndent; per-branch error isolation) Research and PredictorTraining are data-independent (CLAUDE.md Architecture: "no data flows between them"). They ran sequentially only to "spread API load" — a now-stale rationale: predictor training (alpha-engine-predictor/training/train_handler.py) reads ArcticDB + CPU LightGBM and makes NO Anthropic calls (yfinance fallback removed by predictor PR #6; train_handler yfinance docstrings are stale). Research's only heavy load is Anthropic. They do not contend on the rate-limited API. Restructure the sequential Research…→PredictorTraining run into an SF Parallel state (ResearchPredictorParallel): - Branch A: CheckSkipResearch → Research → DataPhase2 → EvalJudge chain → EvalRollingMean → RationaleClustering → ReplayConcordance → Counterfactual (everything that consumes Research output, current order, all CheckSkip*/quartets/fail-soft Catches intact). - Branch B: PredictorTraining quartet + skip-gate intact. - Join → AggregateBranchOutcomes → CheckBranchOutcomes → CheckSkipDriftDetection → Backtester → Parity → Evaluator (unchanged). Per-branch error isolation (the correctness-critical requirement): SF Parallel's default cancels siblings when one branch errors. To prevent a strict-Research hard-fail from aborting/wasting an in-flight or completed+S3-promoted PredictorTraining, each branch ends in a branch-local Pass terminal (End:true) recording OK/FAILED as data — a branch NEVER throws. The SF is failed AFTER the join (post-aggregation) if either branch recorded FAILED, so the other branch's completed work (incl. already-promoted predictor weights in S3) persists and the recovery re-run's skip-set can skip whichever branch genuinely completed (Research-fail + Predictor-done → re-run with skip_predictor_training). Parallel-level Catch → existing shared HandleFailure (no new error channel); Parallel Retry is a documented no-op (MaxAttempts:0) so a completed PredictorTraining is never re-run. Inbound edges (RegimeRetrospectiveEval Next+Catch, CheckSkipRegimeRetrospectiveEval skip choice) re-pointed to ResearchPredictorParallel. Tests: new tests/test_sf_research_predictor_parallel_wiring.py (72 tests: sibling branches; Branch-A/B contents; per-branch isolation incl. no in-branch escape to HandleFailure; post-join fail-if-either-FAILED; ec2_instance_id reaches Branch B; Backtester after join; no dangling targets anywhere). Updated test_sf_eval_judge_wiring.py (flattened state fixture + old cross-boundary edge assertions retargeted to BranchAComplete) and test_sf_regime_substrate_wiring.py (inbound edge → Parallel). Full suite green: 1207 passed, 1 skipped (pre-existing pandas FutureWarnings in daily_append.py, unrelated). DEPLOY HELD — prod SF-topology change; do not merge/redeploy/trigger until the user directs. CLAUDE.md:100 "spread API load" rationale is stale and must be corrected on merge (flagged, not edited — that file is outside this repo). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
DEPLOY HELD
DEPLOY HELD — prod SF-topology change; do not merge/redeploy any SF/trigger any execution until the user directs. This restructures the live Saturday Step Function topology. CLAUDE.md:100 "spread API load" rationale is stale and must be corrected on merge (flagged here, NOT edited —
~/Development/CLAUDE.mdis outside this repo and not in a clean git repo).Rationale (stale "spread API load", verified)
Research and PredictorTraining are data-independent (CLAUDE.md Architecture: "Research and Predictor Training are independent — no data flows between them"). They ran sequentially today only to "spread API load" (CLAUDE.md:100) — a now-stale, verifiably wrong rationale:
alpha-engine-predictor/training/train_handler.py) reads ArcticDB universe features + trains CPU LightGBM + a Ridge meta-learner. It makes no Anthropic calls. The yfinance fallback was removed by predictor PR Add PredictorHealthCheck to weekday pipeline #6; the residual yfinance docstrings intrain_handler.pyare stale.Cost of the stale serialization: PredictorTraining (~91-min EC2 spot) is gated behind Research's now-strict (research #195: all-agents, ~75-min retry, no partial output) variable runtime, and the full Research-chain wall-clock is serially added on top.
Topology
Replace the sequential
Research…→PredictorTrainingrun with an SFParallelstateResearchPredictorParallel(entered where the research chain previously began):StartAt: CheckSkipResearch) — everything that consumes Research output, current order, everyCheckSkip*gate / Wait-Check quartet / Retry / fail-soft Catch intact:CheckSkipResearch → Research → CheckResearchStatus → CheckSkipDataPhase2 → DataPhase2 → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence → EvalJudgeSubmit{FirstSaturday,Weekly} → EvalJudgePoll{Choice,Wait,Poll,Decision} → EvalJudgeProcess → EvalRollingMean → CheckSkipRationaleClustering → RationaleClustering → CheckSkipReplayConcordance → ReplayConcordance → CheckSkipCounterfactual → Counterfactual(+ExtractResearchError). TerminalsBranchAComplete/BranchAFailed.StartAt: CheckSkipPredictorTraining) — PredictorTraining quartet + skip-gate intact:CheckSkipPredictorTraining → PredictorTraining → WaitForPredictorTraining → CheckPredictorStatus → PredictorWait(+ExtractPredictorError). TerminalsBranchBComplete/BranchBFailed.AggregateBranchOutcomes → CheckBranchOutcomes → CheckSkipDriftDetection → DriftDetection → Backtester → Parity → Evaluator → …unchanged. The Parallel join is the natural sync point (Backtester needs both branches done).Inbound edges (
RegimeRetrospectiveEvalNext + Catch,CheckSkipRegimeRetrospectiveEvalskip choice) re-pointed toResearchPredictorParallel. The Parallel has noInputPath/Parameters, so each branch gets the full input incl.$.ec2_instance_id(Branch B's SSM calls resolve unchanged);ResultPath: $.parallel_resultdoes not clobber input.Per-branch error isolation (the correctness-critical design)
SF
Paralleldefault semantics: one branch erroring fails the wholeParalleland abandons siblings. With strict-Research hard-failing and PredictorTraining being an expensive weight-promoting spot (Promoted: Truemid-run → weights land in S3), the naive Parallel would cancel/waste an in-flight or completed PredictorTraining.A branch NEVER throws. Each branch ends in a branch-local
Passterminal withEnd: true:BranchAComplete/BranchBComplete→branch_{a,b}_status = OK.BranchAFailed/BranchBFailed→branch_{a,b}_status = FAILED+ captured$.error, as data, thenEnd: true.Every in-branch hard-fail edge that previously routed to the shared
HandleFailureis re-pointed to that branch's*Failedterminal (Branch A:Research/DataPhase2Catch +ExtractResearchError; Branch B:PredictorTraining/WaitForPredictorTrainingCatch +ExtractPredictorError). The eval/agent-justification Catches stay fail-soft (forward within Branch A, never to*Failed/HandleFailure).Because both branches always succeed, the Parallel engine never cancels a sibling — both run to their own completion. Failure is surfaced AFTER the join:
AggregateBranchOutcomeshoists both statuses;CheckBranchOutcomesroutes toExtractParallelBranchError → HandleFailure → FailExecutioniff either branch is FAILED, else continues. This mirrors the existing per-stateExtract*Error → HandleFailureconvention — no new error channel. The Parallel-levelCatch (States.ALL → HandleFailure)is defense-in-depth for a genuine SF-engine Parallel error only; the Parallel-levelRetry (States.ALL, MaxAttempts: 0)is an explicit no-op so a completed PredictorTraining is never re-run.Recovery semantics — Research fail with Predictor done
Branch A hard-fails; Branch B's PredictorTraining completed and already promoted weights to S3. With per-branch isolation: Branch B runs to completion uninterrupted (
BranchBComplete=OK),BranchAFailed=FAILED+error. Post-joinCheckBranchOutcomesseesbranch_a_status=FAILED→ SF fails (FAILED SNS alert names which branch failed/completed). The promoted weights are live in S3, untouched. Recovery re-run (composes with the live-SF-derived skip-set practice): re-run the Saturday SF with{"skip_predictor_training": true}— Branch B'sCheckSkipPredictorTrainingroutes straight toBranchBComplete(no spot, no re-train), Branch A re-runs end-to-end. Symmetric case (Predictor fails, Research done): recovery with{"skip_research": true, …}per the same practice.Tests
tests/test_sf_research_predictor_parallel_wiring.py(72 tests): sibling Parallel branches; no Research→Predictor serial edge anywhere; Branch-A/B contents + preserved skip-gates/quartets; per-branch error isolation incl.test_no_branch_state_routes_to_top_level_handle_failure(the core cross-branch-cancellation guard); Research-hardfail/DataPhase2/Predictor failure →*Failed; eval-chain fail-soft Catches preserved; post-join fail-if-either-FAILED; Parallel Catch=shared HandleFailure + Retry no-op;$.ec2_instance_idreaches Branch B; inbound rewire; Backtester strictly after join; no dangling Next/Default/Catch (top level + in-branch); JSON parses.tests/test_sf_eval_judge_wiring.py— flattened state fixture (states moved into Branch A; all shape/payload/retry/timeout assertions still hold), old cross-boundaryCounterfactual → CheckSkipPredictorTrainingassertions retargeted to the newBranchACompleteterminal.tests/test_sf_regime_substrate_wiring.py— inboundRegimeRetrospectiveEval/skip edge →ResearchPredictorParallel.alpha-engine-data/.venv/bin/python -m pytest tests/ -q). Zero new failures. Pre-existing: 1 skip + 5 pandasFutureWarnings indaily_append.py(concat with empty/all-NA — unrelated to this change).DEPLOY HELD
DEPLOY HELD — prod SF-topology change; do not merge/redeploy/trigger until the user directs. CLAUDE.md:100 "spread API load" rationale is stale and must be corrected on merge (flagged, not edited here — that file is outside this repo).
🤖 Generated with Claude Code