fix(backfill): apply daily_closes delta + regression preflight#130

Merged
cipher813 merged 1 commit into main from fix/backfill-no-clobber-on-stale-source on May 2, 2026

@cipher813 cipher813 commented May 2, 2026

Summary

  • DataPhase1 step 8 (builders.backfill) silently regressed the ArcticDB macro and universe last_date by 1 day on the 2026-05-02 Saturday SF run. Postflight caught it; SF halted.
  • Root cause: backfill loaded the 10y price cache, which passed the mtime "current" check today and therefore skipped its weekly refresh even though its data ended 4/30. Backfill computed features over that stale cache, and its full-series lib.write() clobbered the 5/1 row MorningEnrich had appended at 09:18.
  • The fix adds two layers: (1) merge staging/daily_closes/{date}.parquet on top of the cache before any compute (mirrors features/compute.py::_apply_daily_delta), and (2) a defense-in-depth _assert_no_arctic_regression preflight that hard-fails before compute if the planned data is older than what's already in ArcticDB.
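
The delta merge in layer (1) can be illustrated with a minimal, standard-library sketch. It is not the code from this PR: the nested-dict `PriceCache` shape, the `apply_daily_delta` name and signature, and the prices are all assumptions made for illustration (the real code presumably works on parquet-backed frames). It only shows why overlaying the per-day closes moves the source's last date from 4/30 to 5/1.

```python
from datetime import date

# Simplified stand-in for a price frame: {ticker: {trading day: close}}.
PriceCache = dict[str, dict[date, float]]


def apply_daily_delta(
    cache: PriceCache, daily_closes: dict[date, dict[str, float]]
) -> PriceCache:
    """Overlay per-day closes on the cached series; delta values win on conflict."""
    # Copy so the cached input is left untouched.
    merged = {ticker: dict(series) for ticker, series in cache.items()}
    for day in sorted(daily_closes):
        for ticker, close in daily_closes[day].items():
            # Assumption: only tickers already present in the cache are extended.
            if ticker in merged:
                merged[ticker][day] = close
    return merged


# Placeholder prices; the dates mirror the incident (cache ends 4/30, delta holds 5/1).
cache = {"SPY": {date(2026, 4, 29): 100.0, date(2026, 4, 30): 101.0}}
daily_closes = {date(2026, 5, 1): {"SPY": 102.0}}

merged = apply_daily_delta(cache, daily_closes)
print(max(merged["SPY"]))  # 2026-05-01
```

With the delta applied, a full-series write built from `merged` would no longer end a day behind what MorningEnrich appended.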

The 2026-04-22 SOLS regression was patched same-day with a ticker_filter-only skip_macro guard; this PR closes the full-universe path that the same bug class still ran through.

Test plan

  • tests/test_backfill_no_regression.py — 8 new tests cover the preflight (pass / macro-regression / universe-regression / first-write absent-from-arctic) and the wiring (delta call / preflight call / ticker_filter skip / dry_run skip)
  • tests/test_backfill_unified_and_macro_scoping.py — existing 9 tests still pass after extending mock setups with delta + preflight stubs
  • tests/test_daily_append_backfill_safe.py — 5 existing tests unaffected
  • Full suite green in CI (344 passed)
  • Live verification via Saturday SF redrive (after merge): MorningEnrich appends 5/1 → backfill applies delta + preflight passes → ArcticDB lands at 5/1 → postflight passes

🤖 Generated with Claude Code

DataPhase1 step 8 (builders.backfill) silently regressed ArcticDB macro
and universe last_date by 1 day on the 2026-05-02 Saturday SF run. Root
cause: backfill loaded the 10y price cache (which passed the mtime
"current" check and skipped refresh, so its data ended 4/30), computed
features over it, and full-series lib.write() clobbered the 5/1 row
MorningEnrich had appended at 09:18. Postflight rejected the regression
at 09:53; the SF halted at DataPhase1.

Two layers gate this class of incident going forward:

1. Apply daily_closes delta before any compute. backfill now mirrors
   features/compute.py::_apply_daily_delta — staging/daily_closes/{date}
   parquets are merged on top of the cache so the source captures
   MorningEnrich's polygon-T+1 fill (and any other post-cache-refresh
   appends). This makes the source as fresh as ArcticDB.

2. Regression preflight against ArcticDB. _assert_no_arctic_regression
   reads SPY + a 20-symbol universe sample (matches postflight's
   _UNIVERSE_SAMPLE_SIZE) and refuses to run if planned data is older
   than what's already in ArcticDB. Hard-fails BEFORE the multi-minute
   feature compute with an actionable error message pointing at the
   recovery path (force a price-cache refresh).
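
The regression preflight in item 2 reduces to a last-date comparison per sampled symbol. The sketch below is an illustration under stated assumptions, not the body of _assert_no_arctic_regression: the function name, the `ArcticRegressionError` type, the dict-based inputs, and the `NEWCO` ticker are hypothetical, and reading the stored dates from ArcticDB is left out entirely.

```python
from datetime import date
from typing import Optional


class ArcticRegressionError(RuntimeError):
    """Hypothetical error type for this sketch."""


def assert_no_regression(
    planned_last: dict[str, date],
    stored_last: dict[str, Optional[date]],
) -> None:
    """Raise if any symbol's planned last date is older than the stored one."""
    regressions = []
    for symbol, planned in sorted(planned_last.items()):
        stored = stored_last.get(symbol)
        if stored is None:
            continue  # first write: symbol absent from the store, nothing to regress
        if planned < stored:
            regressions.append(f"{symbol}: planned {planned} < stored {stored}")
    if regressions:
        raise ArcticRegressionError(
            "backfill source is older than the store; force a price-cache refresh: "
            + "; ".join(regressions)
        )


stored = {"SPY": date(2026, 5, 1)}

# Equal dates and a first-write symbol both pass.
assert_no_regression({"SPY": date(2026, 5, 1), "NEWCO": date(2026, 5, 1)}, stored)

# A source that ends 4/30 is rejected before any compute would start.
try:
    assert_no_regression({"SPY": date(2026, 4, 30)}, stored)
except ArcticRegressionError as exc:
    message = str(exc)
```

Collecting every regressed symbol before raising keeps the error message actionable when both the macro series and universe symbols are behind.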

The 2026-04-22 SOLS regression was patched the same day with a
ticker_filter-only guard (skip_macro path); this PR closes the
full-universe path that the same bug class still ran through.

8 new tests in tests/test_backfill_no_regression.py cover the
preflight (pass / macro-regression / universe-regression / first-write
absent-from-arctic) and the wiring (delta call / preflight call /
ticker_filter skip / dry_run skip). Existing
test_backfill_unified_and_macro_scoping.py mock setups extended with
delta + preflight stubs.

Pre-existing failures in test_flow_doctor_wiring.py
(test_enabled_attaches_flow_doctor_handler,
test_exclude_patterns_plumbed_to_handler) are unrelated — flow-doctor
0.3.0 (pinned by alpha-engine-lib[flow_doctor]>=0.3.0,<0.4.0) doesn't
support the s3 notifier type the production yaml + tests reference.
That's a separate alpha-engine-lib version-coordination issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 608b72d into main May 2, 2026
1 check passed
@cipher813 cipher813 deleted the fix/backfill-no-clobber-on-stale-source branch May 2, 2026 13:22
cipher813 added a commit that referenced this pull request May 2, 2026
…t list (#132)

MorningEnrich's missing-from-closes hard-fail was tripping every S&P
churn week. Today (2026-05-02): 8 tickers got dropped from the index
this past week (ASGN, GTM, HOLX, KMPR, LW, MOH, MTCH, PAYC). They're
still in ArcticDB universe (awaiting next prune cycle); they're absent
from the new constituents.json Phase 1 wrote at 09:20; MorningEnrich
no longer requests them from polygon. The pre-existing check
(arctic_universe - closes) saw 8 churn-outs + 4 chronic polygon-
coverage gaps = 12 missing > threshold of 5 → SF halt at MorningEnrich.

Add an optional ``expected_tickers`` parameter to ``daily_append``. When
the caller passes its request list, the check scopes to
``arctic ∩ expected`` instead of the full ArcticDB universe. Tickers
absent from the request (S&P churn-out stragglers) are excluded from
the alarm and logged at INFO so operators see drift building up between
prune cycles. Backward compatible — callers that don't pass it retain
the prior whole-universe behavior.
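
A minimal sketch of the scoping, using the ticker counts from this incident. The function below is hypothetical and is not the daily_append implementation: its name, return shape, and the `AAPL`/`MSFT` tickers are assumptions, and caret-prefix handling is deliberately left out.

```python
from typing import Iterable, Optional

MISSING_THRESHOLD = 5  # threshold quoted in the commit message


def missing_from_closes(
    arctic_universe: Iterable[str],
    closes: Iterable[str],
    expected_tickers: Optional[Iterable[str]] = None,
) -> tuple[set[str], set[str]]:
    """Return (missing, stragglers); stragglers is empty in legacy mode."""
    arctic, have = set(arctic_universe), set(closes)
    if expected_tickers is None:
        return arctic - have, set()  # legacy whole-universe check
    expected = set(expected_tickers)
    # Scope the alarm to arctic ∩ expected; the rest are churn-out stragglers.
    return (arctic & expected) - have, arctic - expected


churned = {"ASGN", "GTM", "HOLX", "KMPR", "LW", "MOH", "MTCH", "PAYC"}
chronic = {"BF-B", "BRK-B", "MOG-A", "PSTG"}
covered = {"AAPL", "MSFT"}  # hypothetical tickers that polygon did return

arctic = churned | chronic | covered
requested = chronic | covered  # the constituents-derived request list

legacy_missing, _ = missing_from_closes(arctic, covered)
scoped_missing, stragglers = missing_from_closes(arctic, covered, requested)

print(len(legacy_missing) > MISSING_THRESHOLD)  # True
print(len(scoped_missing) > MISSING_THRESHOLD)  # False
```

In the scoped call the 8 churned tickers come back as stragglers, which is the set a caller would log at INFO instead of alarming on.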

Both call sites (``_run_morning_enrich`` and ``_run_daily``) now pass
their constituents-derived ticker list. The ticker list was already
in scope at both sites.

Net effect on the 2026-05-02 SF redrive: missing-from-closes count
drops 12 → 4 (only the chronic BF-B/BRK-B/MOG-A/PSTG remain — well
under the 5-threshold WARN-only path), MorningEnrich completes, Phase 1
runs, PR #130's backfill regression preflight passes, ArcticDB lands
at 5/1, postflight passes, downstream Research/Predictor Training/
Backtester all run.

5 new tests in tests/test_daily_append_missing_from_closes.py (360
total) cover: stragglers excluded, real constituents-gap still raises,
caret-prefix stripping, None preserves legacy behavior, straggler-count
INFO log fires.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 2, 2026
* feat(preflight): add sf_preflight.py — Saturday SF dry-rehearsal

Predicts whether the Saturday SF would succeed BEFORE launching a spot.
Today's recovery cycle (5 SF redrives, ~5 polygon API calls each) burned
free-tier quota and operator hours discovering bugs sequentially. This
module simulates the critical pre-Phase-1 path against real S3 + ArcticDB
state and reports per-step pass/fail in ~30s with 1 polygon call total.

Eight independent checks, mapped to today's incident stack:

  PR #130 (backfill regression)         → check_backfill_source_freshness
  PR #131 (polygon coverage flake)      → check_polygon_grouped_coverage
  PR #132 (missing-from-closes scoping) → check_predicted_missing_from_closes
  PR #133 (freshness scan scoping)      → check_universe_sample_freshness
  PR #134 (workflow ordering)           → check_universe_drift
  PR #135 (return shape)                → check_constituents_fetch
  Postflight contracts                  → check_postflight_contracts
  ArcticDB reachability                 → check_arctic_connectivity

Each check is a pure function that takes a PreflightContext and returns a
CheckResult. The orchestrator runs them all (catching per-check
exceptions so one failure doesn't abort the suite) and emits
human-readable or JSON output. Exit code 1 on any failure.
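
The orchestration pattern described here can be sketched as follows. Only the PreflightContext and CheckResult names come from this commit message; their fields, the `run_checks` helper, and the three stub checks are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PreflightContext:
    # Hypothetical field: shared state such as already-initialized library handles.
    state: dict = field(default_factory=dict)


@dataclass
class CheckResult:
    name: str
    status: str  # "PASS", "WARN", or "FAIL"
    detail: str = ""


Check = Callable[[PreflightContext], CheckResult]


def run_checks(checks: list[Check], ctx: PreflightContext) -> tuple[list[CheckResult], int]:
    """Run every check; an exception in one becomes a FAIL result, not an abort."""
    results = []
    for check in checks:
        try:
            results.append(check(ctx))
        except Exception as exc:
            results.append(CheckResult(check.__name__, "FAIL", f"{type(exc).__name__}: {exc}"))
    exit_code = 1 if any(r.status == "FAIL" for r in results) else 0
    return results, exit_code


def check_ok(ctx: PreflightContext) -> CheckResult:
    return CheckResult("check_ok", "PASS")


def check_raises(ctx: PreflightContext) -> CheckResult:
    raise ConnectionError("store unreachable")


def check_warns(ctx: PreflightContext) -> CheckResult:
    return CheckResult("check_warns", "WARN", "API key unset; skipped")


results, exit_code = run_checks([check_ok, check_raises, check_warns], PreflightContext())
for r in results:
    print(f"{r.status:4} {r.name} {r.detail}")
```

The check after the raising one still runs, and a WARN alone would leave the exit code at 0.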

Two macOS-specific design notes:

1. ArcticDB libs are initialized once in check_arctic_connectivity and
   reused across downstream checks via the context — re-initializing
   adb.Arctic() crashes Aws::S3::S3Client::S3Client on macOS.
2. Checks are ordered with arctic_connectivity FIRST so its bundled AWS
   SDK loads before boto3 (which gets pulled in by collectors imports).

Polygon check skips gracefully (WARN, not FAIL) when POLYGON_API_KEY
is unset — supports laptop-side preflight where the .env isn't loaded.
On the spot the key is present and the check fires.

18 tests in tests/test_sf_preflight.py cover the happy path, the failure
modes each check is designed to catch, and orchestrator isolation.

394 tests total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(sf_preflight): set POLYGON_API_KEY in polygon-coverage tests

CI runs without POLYGON_API_KEY in env, so the no-key skip-to-WARN
guard short-circuited the 3 polygon-coverage tests before they
reached the mocked client. Set the env var via monkeypatch so the
guard passes through to the polygon mock. Also add an explicit test
for the no-key path.
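
The guard interaction can be reproduced in a few lines. This sketch swaps pytest's monkeypatch for the standard library's unittest.mock.patch.dict so it runs without pytest; `polygon_check_status`, its signature, and the placeholder row are hypothetical and are not the real check.

```python
import os
from unittest import mock


def polygon_check_status(fetch_rows) -> str:
    """Return WARN without calling polygon when no key is set; else PASS or FAIL."""
    if not os.environ.get("POLYGON_API_KEY"):
        return "WARN"
    return "PASS" if fetch_rows() else "FAIL"


# Stands in for the mocked polygon client; the row is a placeholder.
fake_client = mock.Mock(return_value=[{"ticker": "SPY"}])

# No key (the CI situation): the guard short-circuits and the mock is never reached.
with mock.patch.dict(os.environ, {}, clear=True):
    no_key_status = polygon_check_status(fake_client)
calls_without_key = fake_client.call_count

# Key set for the duration of the block: the check reaches the mock.
with mock.patch.dict(os.environ, {"POLYGON_API_KEY": "test-key"}):
    with_key_status = polygon_check_status(fake_client)
```

Both patches are undone when their block exits, so the surrounding environment is left as it was.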

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>