fix(daily_append): write macro before universe-coverage guard#104

Merged: cipher813 merged 1 commit into main from fix/daily-append-macro-before-universe-guard on Apr 27, 2026

@cipher813 (Owner)

2026-04-27 failure mode: 7 stock tickers (PAYC, ASGN, LW, GTM, MOH, KMPR, MTCH) went missing from daily_closes. The missing-from-closes guard at step 2b raised at threshold>5 BEFORE the macro write block (old "step 5") ran, so SPY/VIX/sector ETFs never landed in ArcticDB for the day. The downstream EOD reconcile then hard-failed on stale SPY (by design — alpha against stale SPY is meaningless) and the EOD email + eod_pnl row did not get produced.

Root cause was an architectural coupling: macro freshness is gated on stock-universe coverage, but the two have nothing to do with each other. Macro keys are a fixed list of ~18 well-known tickers (SPY, VIX, 11 sector ETFs, etc.); whether 5 or 50 stocks went missing in the universe doesn't change whether SPY needs to land.

Fix: reorder the function so the macro/sector-ETF write runs as step 2a (immediately after ArcticDB libs are opened) and the universe-coverage guard remains as step 2b. Macro lands in ArcticDB first, then the guard raises on threshold violations and the process exits non-zero (operator still gets paged on the stock issue, pipeline still exits 1, downstream Step Function still marks the run failed). Net effect on this class of failure: the EOD email goes out while daily-data still exits non-zero, so the actual problem fails loudly and independently.
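A minimal sketch of the reordering, not the actual `daily_append` code. The function signature, the `write_macro` callback, and `UniverseCoverageError` are hypothetical names chosen for illustration; only the step ordering and the threshold of 5 come from the description above.

```python
class UniverseCoverageError(RuntimeError):
    """Hypothetical error type for the missing-from-closes guard."""


MISSING_THRESHOLD = 5  # per the description: the guard raises at threshold > 5


def daily_append(closes, universe, macro_keys, write_macro):
    """Write macro rows first, then enforce universe coverage.

    closes: dict of ticker -> close value for the day (assumed shape).
    write_macro: callable(key, value) standing in for the ArcticDB write.
    """
    # Step 2a: macro/sector-ETF write. It does not depend on stock coverage.
    for key in macro_keys:
        if key in closes:
            write_macro(key, closes[key])

    # Step 2b: universe-coverage guard. It still fails loudly, but only
    # after the macro rows have landed.
    missing = sorted(t for t in universe if t not in closes)
    if len(missing) > MISSING_THRESHOLD:
        raise UniverseCoverageError(
            f"{len(missing)} tickers missing from daily_closes: {missing}"
        )
    return missing
```

With seven universe tickers absent, this sketch still writes SPY and VIX before raising, which is the behavior the fix is after.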

Tests:

  • test_macro_write_runs_before_universe_coverage_guard — locks source-order invariant (macro write call site precedes the missing-from-closes raise). Future refactors that reverse the ordering fail loudly here.
  • test_macro_write_does_not_block_on_universe_coverage — functional simulation of the 2026-04-27 failure: 10 stocks missing from closes (well above threshold 5), but macro keys + sector ETFs all present. Asserts the function raises on the universe guard AND that all 7 macro keys + 11 sector ETFs were written first.
  • All 261 existing tests pass (48 daily_append + 213 others).
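One way a source-order invariant like the first test could be checked is by comparing the positions of two markers in the function's source text. The sketch below is an assumption about the approach, not the repository's test: the marker strings and `SAMPLE_SOURCE` are hypothetical, and a real test would more likely obtain the text with `inspect.getsource` on the function under test.

```python
def first_index(source: str, marker: str) -> int:
    """Return the offset of the first occurrence of marker, or fail."""
    idx = source.find(marker)
    if idx == -1:
        raise AssertionError(f"marker not found in source: {marker!r}")
    return idx


def assert_source_order(source: str, earlier: str, later: str) -> None:
    """Fail if `earlier` does not appear before `later` in the source text."""
    if first_index(source, earlier) >= first_index(source, later):
        raise AssertionError(f"{earlier!r} must precede {later!r}")


# Hypothetical stand-in for the function body being guarded.
SAMPLE_SOURCE = '''
def daily_append(...):
    _write_macro(libs, closes)
    if len(missing) > threshold:
        raise RuntimeError("missing from closes")
'''

assert_source_order(SAMPLE_SOURCE, "_write_macro(", "raise RuntimeError")
```

A text-position check is coarse (it ignores control flow), which is presumably why it is paired with the functional simulation in the second test.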

Operational note: this PR alone does not unblock today's run. ArcticDB is still missing SPY for 2026-04-27 because the previously ordered code raised before reaching the write. To recover: deploy this PR, rerun daily-data to backfill macros, then rerun EOD to send the email.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 57cceec into main Apr 27, 2026
1 check passed
@cipher813 cipher813 deleted the fix/daily-append-macro-before-universe-guard branch April 27, 2026 21:28
cipher813 added a commit that referenced this pull request Apr 27, 2026
…ma (#105)

Background — 2026-04-27 EOD-email blackout investigation
========================================================
The structural fix in PR #104 decoupled macro/SPY freshness from
stock-coverage correctness. Validation today exposed a second, latent
issue: with the universe-coverage guard now passing, daily_append's
per-stock writes finally execute — and 100% of them fail with an
ArcticDB schema-mismatch error.

Schema audit (2026-04-27 22:14 UTC) revealed heterogeneous universe state:

  - 816 symbols (~90%): 64 cols, no VWAP at all
  - 88  symbols (~10%): 65 cols, VWAP at idx=64 (appended at end)

daily_append writes via OHLCV_COLS = [Open, High, Low, Close, Volume,
VWAP, ...features], which puts VWAP at idx=5. ArcticDB update() requires
column order match — both schema variants fail. Per-stock universe
writes have therefore been failing since the polygon-VWAP work landed
on 2026-04-24 (PRs #90/#91/#92), masked until today by the macro-coupled
universe-coverage guard.
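The audit above can be pictured as a classification of each symbol's column list. This is a sketch under assumptions: `classify_schema` and `audit` are hypothetical helpers, the schemas are passed in as plain lists, and the step that reads column names out of ArcticDB is deliberately left out.

```python
from collections import Counter

OHLCV = ["Open", "High", "Low", "Close", "Volume"]


def classify_schema(columns):
    """Label a column list by where (or whether) VWAP appears."""
    cols = list(columns)
    if "VWAP" not in cols:
        return "missing_vwap"      # the 64-column variant
    if cols[:6] == OHLCV + ["VWAP"]:
        return "canonical"         # VWAP at idx=5, as daily_append writes it
    if cols[-1] == "VWAP":
        return "appended_vwap"     # the 65-column variant, VWAP at the end
    return "other"


def audit(schemas):
    """Count schema variants across a dict of symbol -> column list."""
    return Counter(classify_schema(cols) for cols in schemas.values())
```

Neither non-canonical variant matches a write that places VWAP at idx=5, which is consistent with every per-stock write failing.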

Operational design (yfinance EOD → polygon morning)
====================================================
- yfinance EOD post-close hook writes daily_closes parquet with
  VWAP=NaN (yfinance does not expose true volume-weighted VWAP).
- polygon morning enrichment overwrites the parquet with real VWAP
  values from polygon grouped-daily.
- daily_append runs end-of-day and writes whatever VWAP is in the
  parquet to ArcticDB universe — NaN initially, real values after the
  morning enrichment re-runs daily_append.

For that flow to work, VWAP must be a first-class column in the
universe schema with a stable position. This migration normalizes
every symbol to the canonical layout:

    [Open, High, Low, Close, Volume, VWAP] + FEATURES

The migration NaN-fills VWAP across history for the 816 symbols that lacked it
and repositions VWAP for the 88 symbols that had it appended at idx=64.
The existing FEATURES block keeps its relative order.
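The canonical layout can be computed from any of the observed column lists. The sketch below is illustrative only: it borrows the names of the helpers listed under Tests but is not their implementation, and it assumes every symbol already has all five OHLCV columns.

```python
OHLCV = ["Open", "High", "Low", "Close", "Volume"]


def canonical_column_order(columns):
    """Return OHLCV + VWAP + features, keeping features in their original order."""
    features = [c for c in columns if c not in OHLCV and c != "VWAP"]
    return OHLCV + ["VWAP"] + features


def is_canonical(columns):
    """True when the columns are already in the canonical layout."""
    return list(columns) == canonical_column_order(columns)
```

In this sketch a symbol with no VWAP column gets the name inserted at idx=5; filling that new column with NaN would be a separate step on the data itself.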

Idempotent — symbols already in canonical order are skipped.
Per-symbol error isolation — one symbol's write failure does not abort
the batch (records into errors[], continues with the rest).
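Idempotence and per-symbol error isolation can be sketched as a loop that catches failures per symbol. `migrate_all` and the `migrate_one` callback are hypothetical; the result keys, including the "partial" status, are assumptions modeled on the wording in this commit message.

```python
def migrate_all(symbols, migrate_one):
    """Run migrate_one(symbol) for each symbol without aborting on failure.

    migrate_one is assumed to return True when it rewrote the symbol and
    False when the symbol was already canonical (the idempotent skip).
    """
    migrated, skipped, errors = [], [], []
    for symbol in symbols:
        try:
            changed = migrate_one(symbol)
        except Exception as exc:  # isolate the failure to this symbol
            errors.append({"symbol": symbol, "error": repr(exc)})
            continue
        (migrated if changed else skipped).append(symbol)
    return {
        "status": "partial" if errors else "ok",
        "migrated": migrated,
        "skipped": skipped,
        "errors": errors,
    }
```

Collecting errors instead of raising means a rerun only has to deal with the symbols listed in `errors`, since the rest are skipped as already canonical.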

Tests
=====
- _canonical_column_order: VWAP inserted at idx=5, feature block
  preserved in relative order, drops nothing.
- _is_canonical: recognizes correct layout, rejects appended-VWAP and
  missing-VWAP variants.
- migrate_universe_vwap apply path:
  - Inserts VWAP at idx=5 with FLOAT64 NaN when absent.
  - Relocates VWAP from idx=last when appended (preserving values).
  - Skips already-canonical symbols (idempotent).
  - Honors --tickers override for canary / subset runs.
  - Per-symbol error isolation — partial-status return on partial failure.
- All 275 tests pass (261 existing + 14 new).

Operational follow-up (not in this PR)
======================================
After merge, deploy + run:
    python -m builders.migrate_universe_vwap --apply
on ae-trading. Expected: 904 symbols migrated (816 + 88), audit JSON
written to s3://alpha-engine-research/builders/migrate_universe_vwap_audit/.
Then rerun alpha-engine-daily-data.service (per-stock writes succeed)
and alpha-engine-eod.service (held-stock close lookups succeed; EOD
email + 2026-04-27 eod_pnl row land).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 3, 2026
Caught 2026-05-03 in SF eval-pipeline-validation-5: Research succeeded
and wrote new-format captures to S3, but the eval-judge state silently
never fired because the operator had passed skip_backtester=true to
skip the long-running backtester for validation purposes.

PR 4c (#140) wired the eval-pipeline states between Backtester success
and SaturdayHealthCheck:

  CheckBacktesterStatus.Success
    → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
        → EvalJudgeFirstSaturday or EvalJudgeWeekly → EvalRollingMean
    → SaturdayHealthCheck

But CheckSkipBacktester.skip routed directly to SaturdayHealthCheck,
bypassing the eval-pipeline entirely. Production Sat 5/9 won't hit
this (skip_backtester defaults false; Backtester runs and routes
through eval-judge correctly), but operator manual skips for any
non-eval validation purpose silently dropped the eval state.

Fix: route skip_backtester=true → CheckSkipEvalJudge instead of
SaturdayHealthCheck. Eval pipeline now fires on every SF execution
where the operator hasn't explicitly skip_eval_judge'd it.

tests/test_sf_eval_judge_wiring.py — TestSkipBacktesterPreservesEvalJudge:
  pins the routing so a future "simplification" can't re-introduce
  the silent bypass.
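A wiring assertion of this kind might look like the sketch below. The state-machine fragment is hypothetical (including the `$.skip_backtester` variable path and the `Backtester` default); only the state names `CheckSkipBacktester`, `CheckSkipEvalJudge`, and `SaturdayHealthCheck` come from the message above. The fragment follows the Amazon States Language shape for a Choice state.

```python
# Hypothetical fragment of the Step Functions definition after the fix.
DEFINITION = {
    "States": {
        "CheckSkipBacktester": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.skip_backtester",
                    "BooleanEquals": True,
                    "Next": "CheckSkipEvalJudge",
                }
            ],
            "Default": "Backtester",
        }
    }
}


def choice_target(definition, state_name, variable):
    """Return the Next state taken when `variable` is true in a Choice state."""
    state = definition["States"][state_name]
    for rule in state.get("Choices", []):
        if rule.get("Variable") == variable and rule.get("BooleanEquals") is True:
            return rule["Next"]
    raise AssertionError(f"no true-branch for {variable} in {state_name}")


# The skip path must enter the eval pipeline rather than jump to the health check.
target = choice_target(DEFINITION, "CheckSkipBacktester", "$.skip_backtester")
assert target == "CheckSkipEvalJudge"
assert target != "SaturdayHealthCheck"
```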

Tests 433 → 434 (+1 wiring assertion).

Pairs with alpha-engine-research PR #104 (RubricEvalLLMOutput
defense + judge max_tokens to strategic tier — closes the 5/32
remaining failure class observed in this same SF run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 3, 2026
* fix(sf): skip_backtester preserves eval-judge skip-gate path

Caught 2026-05-03 in SF eval-pipeline-validation-5: Research succeeded
and wrote new-format captures to S3, but the eval-judge state silently
never fired because the operator had passed skip_backtester=true to
skip the long-running backtester for validation purposes.

PR 4c (#140) wired the eval-pipeline states between Backtester success
and SaturdayHealthCheck:

  CheckBacktesterStatus.Success
    → CheckSkipEvalJudge → ComputeEvalCadence → CheckMonthlyCadence
        → EvalJudgeFirstSaturday or EvalJudgeWeekly → EvalRollingMean
    → SaturdayHealthCheck

But CheckSkipBacktester.skip routed directly to SaturdayHealthCheck,
bypassing the eval-pipeline entirely. Production Sat 5/9 won't hit
this (skip_backtester defaults false; Backtester runs and routes
through eval-judge correctly), but operator manual skips for any
non-eval validation purpose silently dropped the eval state.

Fix: route skip_backtester=true → CheckSkipEvalJudge instead of
SaturdayHealthCheck. Eval pipeline now fires on every SF execution
where the operator hasn't explicitly skip_eval_judge'd it.

tests/test_sf_eval_judge_wiring.py — TestSkipBacktesterPreservesEvalJudge:
  pins the routing so a future "simplification" can't re-introduce
  the silent bypass.

Tests 433 → 434 (+1 wiring assertion).

Pairs with alpha-engine-research PR #104 (RubricEvalLLMOutput
defense + judge max_tokens to strategic tier — closes the 5/32
remaining failure class observed in this same SF run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: drop dead ALPHA_ENGINE_LIB_TOKEN PAT plumbing

alpha-engine-lib was flipped public 2026-05-03; PAT auth machinery
that existed to install from a private repo is now dead weight.
Removed across 6 files (net −87 lines).

CI:
- .github/workflows/ci.yml — drop "Configure git auth" step
- .github/workflows/deploy.yml — drop the secondary
  actions/checkout for cipher813/alpha-engine-lib + the LIB_REPO_DIR
  env on the deploy step

Docker / deploy:
- Dockerfile — replace `COPY vendor/alpha-engine-lib` + local pip
  install with `pip install "alpha-engine-lib[flow_doctor] @
  git+https://github.com/cipher813/alpha-engine-lib@v0.3.0"`. The
  [flow_doctor]-only install for Lambda is preserved (Lambda doesn't
  need [arcticdb] or [rag]); requirements.txt's
  [arcticdb,flow_doctor,rag] extras still apply for the EC2 install
  path.
- infrastructure/deploy.sh — drop the vendor/alpha-engine-lib
  staging block + cleanup_lib_staging trap. Replace with one-line
  comment explaining lib comes from public git+https now.

EC2 spot scripts:
- infrastructure/spot_data_weekly.sh — drop SSM PAT fetch + insteadOf
  rewrite from the DEPS step. Update inline comments referencing the
  old mechanism (3 spots).
- infrastructure/spot_drift_detection.sh — same removal.

Companion follow-ups (not in this PR):
- Delete ALPHA_ENGINE_LIB_TOKEN GitHub Actions secret on this repo
- Delete /alpha-engine/lib-token SSM SecureString (us-east-1)
- vendor/alpha-engine-lib local checkout can be removed (gitignored,
  not in any commit)

Per ROADMAP follow-up "P3 Drop ALPHA_ENGINE_LIB_TOKEN PAT plumbing"
added 2026-05-03. Second of 6 consumer-repo PRs in this cleanup arc;
prototype landed in alpha-engine PR #128.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>