feat(sf): run Research and PredictorTraining in parallel (data-independent; per-branch error isolation) by cipher813 · Pull Request #251 · cipher813/alpha-engine-data

cipher813 · 2026-05-17T00:41:53Z

DEPLOY HELD

DEPLOY HELD — prod SF-topology change; do not merge/redeploy any SF/trigger any execution until the user directs. This restructures the live Saturday Step Function topology. CLAUDE.md:100 "spread API load" rationale is stale and must be corrected on merge (flagged here, NOT edited — ~/Development/CLAUDE.md is outside this repo and not in a clean git repo).

Rationale (stale "spread API load", verified)

Research and PredictorTraining are data-independent (CLAUDE.md Architecture: "Research and Predictor Training are independent — no data flows between them"). They ran sequentially today only to "spread API load" (CLAUDE.md:100) — a now-stale, verifiably wrong rationale:

Predictor training (alpha-engine-predictor/training/train_handler.py) reads ArcticDB universe features + trains CPU LightGBM + a Ridge meta-learner. It makes no Anthropic calls. The yfinance fallback was removed by predictor PR Add PredictorHealthCheck to weekday pipeline #6; the residual yfinance docstrings in train_handler.py are stale.
Research's only heavy load is Anthropic (LangGraph multi-agent fan-out).
They do not contend on the rate-limited Anthropic API. The sole reason to serialize them no longer holds.

Cost of the stale serialization: PredictorTraining (~91-min EC2 spot) is gated behind Research's now-strict (research #195: all-agents, ~75-min retry, no partial output) variable runtime, and the full Research-chain wall-clock is serially added on top.

Topology

Replace the sequential Research…→PredictorTraining run with an SF Parallel state ResearchPredictorParallel (entered where the research chain previously began):

Branch A (StartAt: CheckSkipResearch) — everything that consumes Research output, current order, every CheckSkip* gate / Wait-Check quartet / Retry / fail-soft Catch intact: CheckSkipResearch → Research → CheckResearchStatus → CheckSkipDataPhase2 → DataPhase2 → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence → EvalJudgeSubmit{FirstSaturday,Weekly} → EvalJudgePoll{Choice,Wait,Poll,Decision} → EvalJudgeProcess → EvalRollingMean → CheckSkipRationaleClustering → RationaleClustering → CheckSkipReplayConcordance → ReplayConcordance → CheckSkipCounterfactual → Counterfactual (+ ExtractResearchError). Terminals BranchAComplete/BranchAFailed.
Branch B (StartAt: CheckSkipPredictorTraining) — PredictorTraining quartet + skip-gate intact: CheckSkipPredictorTraining → PredictorTraining → WaitForPredictorTraining → CheckPredictorStatus → PredictorWait (+ ExtractPredictorError). Terminals BranchBComplete/BranchBFailed.
Join → then AggregateBranchOutcomes → CheckBranchOutcomes → CheckSkipDriftDetection → DriftDetection → Backtester → Parity → Evaluator → … unchanged. The Parallel join is the natural sync point (Backtester needs both branches done).

Inbound edges (RegimeRetrospectiveEval Next + Catch, CheckSkipRegimeRetrospectiveEval skip choice) re-pointed to ResearchPredictorParallel. The Parallel has no InputPath/Parameters, so each branch gets the full input incl. $.ec2_instance_id (Branch B's SSM calls resolve unchanged); ResultPath: $.parallel_result does not clobber input.

Per-branch error isolation (the correctness-critical design)

SF Parallel default semantics: one branch erroring fails the whole Parallel and abandons siblings. With strict-Research hard-failing and PredictorTraining being an expensive weight-promoting spot (Promoted: True mid-run → weights land in S3), the naive Parallel would cancel/waste an in-flight or completed PredictorTraining.

A branch NEVER throws. Each branch ends in a branch-local Pass terminal with End: true:

BranchAComplete/BranchBComplete → branch_{a,b}_status = OK.
BranchAFailed/BranchBFailed → branch_{a,b}_status = FAILED + captured $.error, as data, then End: true.

Every in-branch hard-fail edge that previously routed to the shared HandleFailure is re-pointed to that branch's *Failed terminal (Branch A: Research/DataPhase2 Catch + ExtractResearchError; Branch B: PredictorTraining/WaitForPredictorTraining Catch + ExtractPredictorError). The eval/agent-justification Catches stay fail-soft (forward within Branch A, never to *Failed/HandleFailure).

Because both branches always succeed, the Parallel engine never cancels a sibling — both run to their own completion. Failure is surfaced AFTER the join: AggregateBranchOutcomes hoists both statuses; CheckBranchOutcomes routes to ExtractParallelBranchError → HandleFailure → FailExecution iff either branch is FAILED, else continues. This mirrors the existing per-state Extract*Error → HandleFailure convention — no new error channel. The Parallel-level Catch (States.ALL → HandleFailure) is defense-in-depth for a genuine SF-engine Parallel error only; the Parallel-level Retry (States.ALL, MaxAttempts: 0) is an explicit no-op so a completed PredictorTraining is never re-run.

Recovery semantics — Research fail with Predictor done

Branch A hard-fails; Branch B's PredictorTraining completed and already promoted weights to S3. With per-branch isolation: Branch B runs to completion uninterrupted (BranchBComplete=OK), BranchAFailed=FAILED+error. Post-join CheckBranchOutcomes sees branch_a_status=FAILED → SF fails (FAILED SNS alert names which branch failed/completed). The promoted weights are live in S3, untouched. Recovery re-run (composes with the live-SF-derived skip-set practice): re-run the Saturday SF with {"skip_predictor_training": true} — Branch B's CheckSkipPredictorTraining routes straight to BranchBComplete (no spot, no re-train), Branch A re-runs end-to-end. Symmetric case (Predictor fails, Research done): recovery with {"skip_research": true, …} per the same practice.

Tests

New tests/test_sf_research_predictor_parallel_wiring.py (72 tests): sibling Parallel branches; no Research→Predictor serial edge anywhere; Branch-A/B contents + preserved skip-gates/quartets; per-branch error isolation incl. test_no_branch_state_routes_to_top_level_handle_failure (the core cross-branch-cancellation guard); Research-hardfail/DataPhase2/Predictor failure → *Failed; eval-chain fail-soft Catches preserved; post-join fail-if-either-FAILED; Parallel Catch=shared HandleFailure + Retry no-op; $.ec2_instance_id reaches Branch B; inbound rewire; Backtester strictly after join; no dangling Next/Default/Catch (top level + in-branch); JSON parses.
Updated tests/test_sf_eval_judge_wiring.py — flattened state fixture (states moved into Branch A; all shape/payload/retry/timeout assertions still hold), old cross-boundary Counterfactual → CheckSkipPredictorTraining assertions retargeted to the new BranchAComplete terminal.
Updated tests/test_sf_regime_substrate_wiring.py — inbound RegimeRetrospectiveEval/skip edge → ResearchPredictorParallel.
Full suite: 1207 passed, 1 skipped (alpha-engine-data/.venv/bin/python -m pytest tests/ -q). Zero new failures. Pre-existing: 1 skip + 5 pandas FutureWarnings in daily_append.py (concat with empty/all-NA — unrelated to this change).

DEPLOY HELD

DEPLOY HELD — prod SF-topology change; do not merge/redeploy/trigger until the user directs. CLAUDE.md:100 "spread API load" rationale is stale and must be corrected on merge (flagged, not edited here — that file is outside this repo).

🤖 Generated with Claude Code

…ndent; per-branch error isolation) Research and PredictorTraining are data-independent (CLAUDE.md Architecture: "no data flows between them"). They ran sequentially only to "spread API load" — a now-stale rationale: predictor training (alpha-engine-predictor/training/train_handler.py) reads ArcticDB + CPU LightGBM and makes NO Anthropic calls (yfinance fallback removed by predictor PR #6; train_handler yfinance docstrings are stale). Research's only heavy load is Anthropic. They do not contend on the rate-limited API. Restructure the sequential Research…→PredictorTraining run into an SF Parallel state (ResearchPredictorParallel): - Branch A: CheckSkipResearch → Research → DataPhase2 → EvalJudge chain → EvalRollingMean → RationaleClustering → ReplayConcordance → Counterfactual (everything that consumes Research output, current order, all CheckSkip*/quartets/fail-soft Catches intact). - Branch B: PredictorTraining quartet + skip-gate intact. - Join → AggregateBranchOutcomes → CheckBranchOutcomes → CheckSkipDriftDetection → Backtester → Parity → Evaluator (unchanged). Per-branch error isolation (the correctness-critical requirement): SF Parallel's default cancels siblings when one branch errors. To prevent a strict-Research hard-fail from aborting/wasting an in-flight or completed+S3-promoted PredictorTraining, each branch ends in a branch-local Pass terminal (End:true) recording OK/FAILED as data — a branch NEVER throws. The SF is failed AFTER the join (post-aggregation) if either branch recorded FAILED, so the other branch's completed work (incl. already-promoted predictor weights in S3) persists and the recovery re-run's skip-set can skip whichever branch genuinely completed (Research-fail + Predictor-done → re-run with skip_predictor_training). Parallel-level Catch → existing shared HandleFailure (no new error channel); Parallel Retry is a documented no-op (MaxAttempts:0) so a completed PredictorTraining is never re-run. Inbound edges (RegimeRetrospectiveEval Next+Catch, CheckSkipRegimeRetrospectiveEval skip choice) re-pointed to ResearchPredictorParallel. Tests: new tests/test_sf_research_predictor_parallel_wiring.py (72 tests: sibling branches; Branch-A/B contents; per-branch isolation incl. no in-branch escape to HandleFailure; post-join fail-if-either-FAILED; ec2_instance_id reaches Branch B; Backtester after join; no dangling targets anywhere). Updated test_sf_eval_judge_wiring.py (flattened state fixture + old cross-boundary edge assertions retargeted to BranchAComplete) and test_sf_regime_substrate_wiring.py (inbound edge → Parallel). Full suite green: 1207 passed, 1 skipped (pre-existing pandas FutureWarnings in daily_append.py, unrelated). DEPLOY HELD — prod SF-topology change; do not merge/redeploy/trigger until the user directs. CLAUDE.md:100 "spread API load" rationale is stale and must be corrected on merge (flagged, not edited — that file is outside this repo). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit 64acbd0 into main May 17, 2026
1 check passed

cipher813 deleted the feat/sf-research-predictor-parallel branch May 17, 2026 00:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(sf): run Research and PredictorTraining in parallel (data-independent; per-branch error isolation)#251

feat(sf): run Research and PredictorTraining in parallel (data-independent; per-branch error isolation)#251
cipher813 merged 1 commit into
mainfrom
feat/sf-research-predictor-parallel

cipher813 commented May 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cipher813 commented May 17, 2026

DEPLOY HELD

Rationale (stale "spread API load", verified)

Topology

Per-branch error isolation (the correctness-critical design)

Recovery semantics — Research fail with Predictor done

Tests

DEPLOY HELD

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant