fix(daily_append): write macro before universe-coverage guard #104
Merged
2026-04-27 failure mode: 7 stock tickers (PAYC, ASGN, LW, GTM, MOH, KMPR, MTCH) went missing from daily_closes. The missing-from-closes guard at step 2b raised at threshold>5 BEFORE the macro write block (old "step 5") ran, so SPY/VIX/sector ETFs never landed in ArcticDB for the day. The downstream EOD reconcile then hard-failed on stale SPY (by design: alpha against stale SPY is meaningless) and the EOD email + eod_pnl row did not get produced.

Root cause was an architectural coupling: macro freshness is gated on stock-universe coverage, but the two have nothing to do with each other. Macro keys are a fixed list of ~18 well-known tickers (SPY, VIX, 11 sector ETFs, etc.); whether 5 or 50 stocks went missing in the universe doesn't change whether SPY needs to land.

Fix: reorder the function so the macro/sector-ETF write runs as step 2a (immediately after ArcticDB libs are opened) and the universe-coverage guard remains as step 2b. Macro lands in ArcticDB first, then the guard raises non-zero on threshold violations (operator still gets paged on the stock issue, pipeline still exits 1, downstream Step Function still marks the run failed). Net effect on this class of failure: EOD email goes out, daily-data still exits non-zero, so the actual problem still fails loudly and independently.

Tests:
- test_macro_write_runs_before_universe_coverage_guard: locks the source-order invariant (macro write call site precedes the missing-from-closes raise). Future refactors that reverse the ordering fail loudly here.
- test_macro_write_does_not_block_on_universe_coverage: functional simulation of the 2026-04-27 failure: 10 stocks missing from closes (well above threshold 5), but macro keys + sector ETFs all present. Asserts the function raises on the universe guard AND that all 7 macro keys + 11 sector ETFs were written first.
- All 261 existing tests pass (48 daily_append + 213 others).

Operational note: this PR alone does not unblock today's run. ArcticDB is still missing SPY for 2026-04-27 because the previously ordered code already raised before reaching the write. Need to deploy this PR + rerun daily-data to backfill macros, then rerun EOD to send the email.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
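The reordering described in the fix can be illustrated with a minimal sketch. This is not the actual daily_append code: the function name, the `UniverseCoverageError` class, and the dict-based `store` standing in for ArcticDB are all assumptions made for illustration. Only the ordering (macro write as step 2a, coverage guard as step 2b, threshold of 5) comes from the description above.

```python
class UniverseCoverageError(RuntimeError):
    """Hypothetical error type: too many universe tickers missing from closes."""


def daily_append_sketch(closes, universe, macro_keys, store, threshold=5):
    """Write macro rows first, then enforce the universe-coverage guard.

    `closes` maps ticker -> close price; `store` is a plain dict standing in
    for the ArcticDB library (an assumption of this sketch).
    """
    # Step 2a: macro/sector-ETF write, independent of stock coverage.
    for key in macro_keys:
        if key in closes:
            store[key] = closes[key]

    # Step 2b: the guard still fails loudly, but only after macro has landed.
    missing = sorted(t for t in universe if t not in closes)
    if len(missing) > threshold:
        raise UniverseCoverageError(
            f"{len(missing)} tickers missing from closes: {missing}"
        )

    # Per-stock writes run only when coverage is acceptable.
    for ticker in universe:
        if ticker in closes:
            store[ticker] = closes[ticker]
    return missing
```

With seven universe tickers absent from `closes`, the call raises, yet `store` already holds the macro keys, which is the behaviour the functional test above is described as asserting.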
cipher813 added a commit that referenced this pull request Apr 27, 2026
…ma (#105)

Background: 2026-04-27 EOD-email blackout investigation
========================================================
The structural fix in PR #104 decoupled macro/SPY freshness from stock-coverage correctness. Validation today exposed a second, latent issue: with the universe-coverage guard now passing, daily_append's per-stock writes finally execute, and 100% of them fail with an ArcticDB schema-mismatch error.

Schema audit (2026-04-27 22:14 UTC) revealed heterogeneous universe state:
- 816 symbols (~90%): 64 cols, no VWAP at all
- 88 symbols (~10%): 65 cols, VWAP at idx=64 (appended at end)

daily_append writes via OHLCV_COLS = [Open, High, Low, Close, Volume, VWAP, ...features], which puts VWAP at idx=5. ArcticDB update() requires column order match, so both schema variants fail. Per-stock universe writes have therefore been failing since the polygon-VWAP work landed on 2026-04-24 (PRs #90/#91/#92), masked until today by the macro-coupled universe-coverage guard.

Operational design (yfinance EOD → polygon morning)
====================================================
- yfinance EOD post-close hook writes daily_closes parquet with VWAP=NaN (yfinance does not expose true volume-weighted VWAP).
- polygon morning enrichment overwrites the parquet with real VWAP values from polygon grouped-daily.
- daily_append runs end-of-day and writes whatever VWAP is in the parquet to ArcticDB universe: NaN initially, real values after the morning enrichment re-runs daily_append.

For that flow to work, VWAP must be a first-class column in the universe schema with a stable position. This migration normalizes every symbol to the canonical layout:

    [Open, High, Low, Close, Volume, VWAP] + FEATURES

NaN-fills VWAP historically for the 816 symbols that didn't have it. Repositions VWAP for the 88 symbols that had it appended at idx=64. Existing FEATURES block keeps its relative order. Idempotent: symbols already in canonical order are skipped.

Per-symbol error isolation: one symbol's write failure does not abort the batch (records into errors[], continues with the rest).

Tests
=====
- _canonical_column_order: VWAP inserted at idx=5, feature block preserved in relative order, drops nothing.
- _is_canonical: recognizes correct layout, rejects appended-VWAP and missing-VWAP variants.
- migrate_universe_vwap apply path:
  - Inserts VWAP at idx=5 with FLOAT64 NaN when absent.
  - Relocates VWAP from idx=last when appended (preserving values).
  - Skips already-canonical symbols (idempotent).
  - Honors --tickers override for canary / subset runs.
  - Per-symbol error isolation: partial-status return on partial failure.
- All 275 existing tests still pass (261 + 14 new).

Operational follow-up (not in this PR)
======================================
After merge, deploy + run on ae-trading:

    python -m builders.migrate_universe_vwap --apply

Expected: 904 symbols migrated (816 + 88), audit JSON written to s3://alpha-engine-research/builders/migrate_universe_vwap_audit/. Then rerun alpha-engine-daily-data.service (per-stock writes succeed) and alpha-engine-eod.service (held-stock close lookups succeed; EOD email + 2026-04-27 eod_pnl row land).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
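The canonical-layout normalization and per-symbol error isolation described above can be sketched as follows. This is an illustrative approximation, not the code in builders.migrate_universe_vwap: the function names mirror the test names but drop the leading underscore, each symbol is modelled as a plain dict of column name to list of values rather than an ArcticDB frame, and the `read`/`write` callables and the shape of the returned dict are assumptions.

```python
import math

# Canonical leading columns; VWAP sits at idx=5, as the commit describes.
BASE_COLS = ["Open", "High", "Low", "Close", "Volume", "VWAP"]


def canonical_column_order(columns):
    """Base columns first, then features in their existing relative order."""
    features = [c for c in columns if c not in BASE_COLS]
    return BASE_COLS + features


def is_canonical(columns):
    """True only when the layout already matches the canonical order exactly."""
    return list(columns) == canonical_column_order(columns)


def migrate(symbols, read, write):
    """Normalize each symbol; a failure on one symbol is recorded, not raised."""
    migrated, skipped, errors = [], [], []
    for symbol in symbols:
        try:
            frame = dict(read(symbol))  # column name -> list of values
            if is_canonical(frame):
                skipped.append(symbol)  # idempotent: nothing to do
                continue
            n_rows = len(frame["Close"])
            # NaN-fill VWAP when absent; keep existing values when present.
            frame.setdefault("VWAP", [math.nan] * n_rows)
            order = canonical_column_order(frame)
            write(symbol, {col: frame[col] for col in order})
            migrated.append(symbol)
        except Exception as exc:  # per-symbol isolation
            errors.append((symbol, repr(exc)))
    status = "partial" if errors else "ok"
    return {"status": status, "migrated": migrated,
            "skipped": skipped, "errors": errors}
```

Running `migrate` a second time over the same symbols should report them all as skipped, which is the idempotency property the commit message calls out.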
cipher813 added a commit that referenced this pull request May 3, 2026
Caught 2026-05-03 in SF eval-pipeline-validation-5: Research succeeded and wrote new-format captures to S3, but the eval-judge state silently never fired because the operator had passed skip_backtester=true to skip the long-running backtester for validation purposes.

PR 4c (#140) wired the eval-pipeline states between Backtester success and SaturdayHealthCheck:

    CheckBacktesterStatus.Success → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence → EvalJudgeFirstSaturday or EvalJudgeWeekly → EvalRollingMean → SaturdayHealthCheck

But CheckSkipBacktester.skip routed directly to SaturdayHealthCheck, bypassing the eval-pipeline entirely. Production Sat 5/9 won't hit this (skip_backtester defaults false; Backtester runs and routes through eval-judge correctly), but operator manual skips for any non-eval validation purpose silently dropped the eval state.

Fix: route skip_backtester=true → CheckSkipEvalJudge instead of SaturdayHealthCheck. Eval pipeline now fires on every SF execution where the operator hasn't explicitly skip_eval_judge'd it.

tests/test_sf_eval_judge_wiring.py, TestSkipBacktesterPreservesEvalJudge: pins the routing so a future "simplification" can't re-introduce the silent bypass. Tests 433 → 434 (+1 wiring assertion).

Pairs with alpha-engine-research PR #104 (RubricEvalLLMOutput defense + judge max_tokens to strategic tier; closes the 5/32 remaining failure class observed in this same SF run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
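A wiring assertion of the kind described could look roughly like the sketch below. The state names are taken from the commit message, but the definition dict is a hand-written stand-in: the real state machine's JSON shape, the `$.skip_backtester` variable path, and the `skip_target` helper are assumptions, loosely following the Amazon States Language layout for Choice states.

```python
# Hypothetical, trimmed-down state machine definition (not the real one).
definition = {
    "States": {
        "CheckSkipBacktester": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.skip_backtester",
                    "BooleanEquals": True,
                    # After the fix: the skip path enters the eval-judge gate
                    # instead of jumping straight to SaturdayHealthCheck.
                    "Next": "CheckSkipEvalJudge",
                }
            ],
            "Default": "Backtester",
        },
    }
}


def skip_target(definition, state_name, variable):
    """Return the Next state taken when `variable` is true, or None if unrouted."""
    state = definition["States"][state_name]
    for choice in state.get("Choices", []):
        if choice.get("Variable") == variable and choice.get("BooleanEquals") is True:
            return choice["Next"]
    return None


def check_skip_backtester_preserves_eval_judge(definition):
    """Pin the routing so the eval-judge gate cannot be silently bypassed."""
    target = skip_target(definition, "CheckSkipBacktester", "$.skip_backtester")
    assert target == "CheckSkipEvalJudge", f"skip path routes to {target!r}"
```

Pinning the target by name, rather than only asserting that some route exists, is what makes a later rerouting to SaturdayHealthCheck fail the test.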
cipher813 added a commit that referenced this pull request May 3, 2026
* fix(sf): skip_backtester preserves eval-judge skip-gate path

Caught 2026-05-03 in SF eval-pipeline-validation-5: Research succeeded and wrote new-format captures to S3, but the eval-judge state silently never fired because the operator had passed skip_backtester=true to skip the long-running backtester for validation purposes.

PR 4c (#140) wired the eval-pipeline states between Backtester success and SaturdayHealthCheck:

    CheckBacktesterStatus.Success → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence → EvalJudgeFirstSaturday or EvalJudgeWeekly → EvalRollingMean → SaturdayHealthCheck

But CheckSkipBacktester.skip routed directly to SaturdayHealthCheck, bypassing the eval-pipeline entirely. Production Sat 5/9 won't hit this (skip_backtester defaults false; Backtester runs and routes through eval-judge correctly), but operator manual skips for any non-eval validation purpose silently dropped the eval state.

Fix: route skip_backtester=true → CheckSkipEvalJudge instead of SaturdayHealthCheck. Eval pipeline now fires on every SF execution where the operator hasn't explicitly skip_eval_judge'd it.

tests/test_sf_eval_judge_wiring.py, TestSkipBacktesterPreservesEvalJudge: pins the routing so a future "simplification" can't re-introduce the silent bypass. Tests 433 → 434 (+1 wiring assertion).

Pairs with alpha-engine-research PR #104 (RubricEvalLLMOutput defense + judge max_tokens to strategic tier; closes the 5/32 remaining failure class observed in this same SF run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: drop dead ALPHA_ENGINE_LIB_TOKEN PAT plumbing

alpha-engine-lib was flipped public 2026-05-03; PAT auth machinery that existed to install from a private repo is now dead weight. Removed across 6 files (net −87 lines).

CI:
- .github/workflows/ci.yml: drop "Configure git auth" step
- .github/workflows/deploy.yml: drop the secondary actions/checkout for cipher813/alpha-engine-lib + the LIB_REPO_DIR env on the deploy step

Docker / deploy:
- Dockerfile: replace `COPY vendor/alpha-engine-lib` + local pip install with `pip install "alpha-engine-lib[flow_doctor] @ git+https://github.com/cipher813/alpha-engine-lib@v0.3.0"`. The [flow_doctor]-only install for Lambda is preserved (Lambda doesn't need [arcticdb] or [rag]); requirements.txt's [arcticdb,flow_doctor,rag] extras still apply for the EC2 install path.
- infrastructure/deploy.sh: drop the vendor/alpha-engine-lib staging block + cleanup_lib_staging trap. Replace with one-line comment explaining lib comes from public git+https now.

EC2 spot scripts:
- infrastructure/spot_data_weekly.sh: drop SSM PAT fetch + insteadOf rewrite from the DEPS step. Update inline comments referencing the old mechanism (3 spots).
- infrastructure/spot_drift_detection.sh: same removal.

Companion follow-ups (not in this PR):
- Delete ALPHA_ENGINE_LIB_TOKEN GitHub Actions secret on this repo
- Delete /alpha-engine/lib-token SSM SecureString (us-east-1)
- vendor/alpha-engine-lib local checkout can be removed (gitignored, not in any commit)

Per ROADMAP follow-up "P3 Drop ALPHA_ENGINE_LIB_TOKEN PAT plumbing" added 2026-05-03. Second of 6 consumer-repo PRs in this cleanup arc; prototype landed in alpha-engine PR #128.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>