feat(data): migrate daily_closes parquet to staging/ prefix + lifecycle policy#112
Merged
…le policy

Closes ROADMAP P1 "VWAP centralization — Writer" (line 576) staleness + ROADMAP "Phase 7e: Deprecate legacy Parquet" namespace cleanup goal.

The VWAP writer + universe schema canonicalization itself was shipped 2026-04-17 (PR #51) and 2026-04-28 morning respectively; what was still open was eliminating `predictor/daily_closes/` from the predictor/ namespace, where it lived alongside authoritative artifacts when it's actually intermediate state between API fetch and ArcticDB ingest.

The daily_closes parquet's only role is restartability when daily_append fails after the upstream fetch succeeded; the canonical home for daily OHLCV is the ArcticDB universe library. Moving the prefix from `predictor/daily_closes/` → `staging/daily_closes/` makes that role explicit in the path, and the new infrastructure/s3_lifecycle_staging.json expires staging/ objects after 7 days so old parquets stop accumulating.

## Hard-cutover, no fallback (per feedback_no_silent_fails)

Reader-side fallback was explicitly rejected as the silent-fail class that has bitten production repeatedly this month. All consumers hard-cut to the new prefix; if it's missing, every reader fails loud (RuntimeError or NoSuchKey) rather than falling through to legacy.

## Files changed in this PR (writer + 2 readers + orchestrator)

- collectors/daily_closes.py: default s3_prefix "predictor/daily_closes/" → "staging/daily_closes/"; module docstring rewritten to reflect the intermediate-state role + 7-day lifecycle.
- builders/daily_append.py: _load_daily_closes reads staging/daily_closes/{date_str}.parquet.
- features/compute.py: delta-load loop reads staging/daily_closes/{d}.parquet (+ docstring update).
- weekly_collector.py: both --daily and --morning-enrich call sites default daily_cfg.s3_prefix to "staging/daily_closes/".
- config.yaml.example: documented default updated; added comment explaining intermediate-staging contract.
- README.md: writes table updated; entry now flags 7-day lifecycle + ArcticDB as canonical home.
- tests/test_vwap_ingestion.py: docstring path corrected.

## Coordinated cross-repo cutover (separate PRs in dependency order)

- alpha-engine-research (PR 2): feature_store_reader.read_latest_daily_closes must cut to staging/. It currently reads predictor/daily_closes/ (called from graph/research_graph.py:222-223 in the live Saturday SF Research Lambda).
- alpha-engine-dashboard (PR 3): health_checker.py:158/161 daily_closes probe + pages/4_System_Health.py:266 S3 object count must cut to staging/.
- alpha-engine (PR 4, IAM cleanup): drop stale predictor/daily_closes/* grant from alpha-engine-executor-role/alpha-engine-s3-access.json:46 (executor migrated 2026-04-17 in PR #60; the IAM grant is dead).

PRs 2 + 3 must merge in the same deploy window as this PR. The brief window between the data PR deploy and the consumer Lambda redeploys may show 404s on the new path; that's the intended "fail loud" signal under the hard-cutover contract.

## New artifacts

- infrastructure/s3_lifecycle_staging.json: 7-day expiration scoped to staging/ prefix; 1-day abort for incomplete multipart uploads.
- infrastructure/apply_s3_lifecycle.sh: idempotent put-bucket-lifecycle-configuration applier; --dry-run support.

**Operational deploy step (after PR merges):** `bash infrastructure/apply_s3_lifecycle.sh` from this repo to push the lifecycle policy to alpha-engine-research bucket.

## Old prefix cleanup (after consumers migrated)

Once PRs 2 + 3 deploy and one clean Saturday SF + weekday SF cycle confirms staging/ writes + reads are clean, the old predictor/daily_closes/ prefix can be deleted via `aws s3 rm --recursive s3://alpha-engine-research/predictor/daily_closes/`. That closes ROADMAP Phase 7e for this prefix.

## Tests

6 new in tests/test_staging_prefix_migration.py: they lock the prefix in place across writer + 2 readers + orchestrator + lifecycle artifact JSON validity. Source-text invariants forbid both the literal legacy string and a regression of the staging/ default. Full suite 300 → 306 pass; no existing test regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
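The hard-cutover contract described in this commit message can be illustrated with a minimal sketch. It is not the code from builders/daily_append.py: `load_daily_closes_bytes` and the injected `get_object` fetcher are hypothetical names, and an in-memory dict stands in for the bucket. Only the `staging/daily_closes/{date_str}.parquet` key shape and the "RuntimeError instead of legacy fallback" behavior come from the text above.

```python
from typing import Callable

STAGING_PREFIX = "staging/daily_closes/"  # new prefix named in this PR


def load_daily_closes_bytes(get_object: Callable[[str], bytes], date_str: str) -> bytes:
    """Fetch the staged parquet for one date, failing loud if it is missing.

    `get_object` is a hypothetical injected fetcher (key -> bytes) that raises
    KeyError when the key does not exist; a real reader would wrap an S3 client.
    """
    key = f"{STAGING_PREFIX}{date_str}.parquet"
    try:
        return get_object(key)
    except KeyError as exc:
        # Deliberately no retry against predictor/daily_closes/: a missing
        # staging object must surface as an error, not a silent fallback.
        raise RuntimeError(f"daily_closes parquet missing at {key}") from exc


# Usage with an in-memory stand-in for the bucket.
fake_bucket = {"staging/daily_closes/2026-04-28.parquet": b"PAR1"}
present = load_daily_closes_bytes(fake_bucket.__getitem__, "2026-04-28")

try:
    load_daily_closes_bytes(fake_bucket.__getitem__, "2026-04-29")
    missing_error = None
except RuntimeError as err:
    missing_error = str(err)
```

Injecting the fetcher keeps the sketch runnable without AWS credentials; the point is only that the missing-key branch raises rather than consulting the old prefix.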
cipher813 added a commit that referenced this pull request Apr 29, 2026
…113)

PR #112 shipped the lifecycle policy file with only the new staging/ rule. put-bucket-lifecycle-configuration REPLACES the full document on apply, so running the script as shipped would have clobbered the pre-existing feature-store-retention rule on the features/ prefix (90-day STANDARD_IA transition + 365-day expiration). Caught while running the apply post-#112 merge: inspected the current bucket lifecycle BEFORE applying and merged both rules into the file.

Also drops the _comment fields the original JSON had. The AWS S3 lifecycle config schema doesn't recognize them; had they been left in for context, they'd have been rejected at apply time. Comments now live in the apply script's docstring instead, with an explicit "this file is the single source of truth for ALL bucket lifecycle rules" warning to stop a future contributor from making the same mistake.

Live bucket verified post-apply via get-bucket-lifecycle-configuration: both rules present (staging/ 7d expiration + features/ 365d/90d-IA).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
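Because applying a lifecycle configuration replaces the whole document, the fix amounts to merging the new rule into the rules already on the bucket before applying. The following is a rough sketch of that merge, not the contents of infrastructure/s3_lifecycle_staging.json: the helper name `merge_lifecycle_rules` and the rule ID `staging-expiration` are assumptions, while the prefixes, day counts, and the `feature-store-retention` ID are taken from the commit message.

```python
import json


def merge_lifecycle_rules(existing: dict, new_rules: list) -> dict:
    """Return a full lifecycle document: existing rules plus the new ones.

    Rules are keyed by "ID"; a new rule with the same ID replaces the old one.
    """
    merged = {rule["ID"]: rule for rule in existing.get("Rules", [])}
    for rule in new_rules:
        merged[rule["ID"]] = rule
    return {"Rules": list(merged.values())}


# What get-bucket-lifecycle-configuration might return before the change.
live = {
    "Rules": [
        {
            "ID": "feature-store-retention",
            "Status": "Enabled",
            "Filter": {"Prefix": "features/"},
            "Transitions": [{"Days": 90, "StorageClass": "STANDARD_IA"}],
            "Expiration": {"Days": 365},
        }
    ]
}

staging_rule = {
    "ID": "staging-expiration",  # hypothetical ID, not confirmed by the PR
    "Status": "Enabled",
    "Filter": {"Prefix": "staging/"},
    "Expiration": {"Days": 7},
    "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1},
}

full_document = merge_lifecycle_rules(live, [staging_rule])
rule_ids = [rule["ID"] for rule in full_document["Rules"]]
print(json.dumps(full_document, indent=2))
```

The merged document contains no `_comment` keys, consistent with the note that the lifecycle schema does not accept them.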
cipher813 added a commit that referenced this pull request May 19, 2026
…ence/ (#270)

ROADMAP P1 "predictor/ S3 namespace rationalization Wave 3": start the write-both soak that migrates the 10y price_cache parquet tree from predictor/price_cache/ (under the predictor module's namespace) to reference/price_cache/ (long-lived data-module references). Mirrors the shape of Wave 1's predictor/daily_closes/ -> staging/daily_closes/ but uses write-both + soak instead of hard-cutover because this writer only rewrites STALE tickers: a hard cut would leave fresh tickers in legacy and the new prefix incomplete for a full yfinance refresh cycle. CLAUDE.md S3 Contract Safety mandates the write-both + >=1 week soak for any path change of this shape.

## What ships in PR1 (producer-side only, zero reader changes)

- builders/_price_cache_writeboth.py (new): the single chokepoint. `price_cache_write_prefixes(primary)` returns [legacy, new] for the production default and [primary] for any custom string. Legacy is ordered first so a fail-loud on the legacy write preserves pre-Wave-3 failure semantics; the new prefix never silently masks a legacy write error.
- collectors/prices.py: yfinance refresh upload now writes both prefixes.
- collectors/fred_history.py: FRED backfill upload now writes both prefixes.
- weekly_collector.py: chronic-gap self-heal patch writes both prefixes (the get_object read stays on legacy since readers haven't migrated).
- infrastructure/backfill_reference_price_cache.sh (new): one-shot `aws s3 sync` operator script to seed reference/price_cache/ with the ~934 objects currently in predictor/price_cache/. Idempotent; --dry-run supported. Run ONCE as part of PR1's deploy.
- tests/test_price_cache_writeboth.py (new, 7 tests): helper contract (legacy default returns both, custom returns single, ordering pinned) + each of the 3 production writers exercised end-to-end with stubbed s3 + recording asserts that BOTH keys land per ticker with identical bodies.
- tests/test_fred_history_fetcher.py: updated the pre-existing test_uploads_to_s3_when_not_dry_run from asserting a single upload to asserting write-both behavior. Required by zero-tolerance test policy.

## What does NOT ship in PR1

- Reader migrations: ~10 read sites across alpha-engine-data, alpha-engine-predictor, alpha-engine-backtester, alpha-engine-dashboard stay on the legacy prefix. PR3+ migrates them with legacy fallback.
- IAM grant expansion to cover reference/price_cache/*: PR2 mirrors Wave 1 #120's IAM pattern on the alpha-engine repo's alpha-engine-s3-access.json.
- builders/daily_append.py:_load_parquet_warmup (reader, not writer): migrates in PR3.
- sector_map.json (separate concern: write-once-per-Saturday, not part of the stale-ticker churn). Handled at cutover or PR3.
- The cutover itself: PR4 will flip primary -> reference/, drop the legacy entry from price_cache_write_prefixes, retire reader fallbacks, and `aws s3 rm --recursive` the legacy prefix. Gated on >=1 week of clean write-both observation.

## Soak contract

PR1 merge -> deploy this commit live -> run the backfill script ONCE to seed the new prefix -> next Saturday SF firing's first write to both prefixes starts the soak clock -> after >=4 Saturday firings (matches Wave 4's discipline) with no parity divergence, PR3 reader migrations go in, then PR4 cutover.

## Tests

pytest tests/ -q -> 1387 passed, 1 skipped, 0 failed

Composes with: ROADMAP Wave 4 slim-deletion arc currently in flight (institutional pattern for data-tier prefix changes: dual-read / dual-write + lib reconcile observation), Wave 1 PR #112 (template), S3 Contract Safety in CLAUDE.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
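The write-both chokepoint described above could look roughly like the sketch below. It follows the stated contract of `price_cache_write_prefixes` (legacy default returns both prefixes with legacy first, any custom string returns only itself), but it is not the contents of builders/_price_cache_writeboth.py. Treating the legacy prefix as the production default is an assumption, and `write_both`, the injected `put_object` callable, and the `AAPL.parquet` key are hypothetical.

```python
from typing import Callable, List

LEGACY_PREFIX = "predictor/price_cache/"
NEW_PREFIX = "reference/price_cache/"


def price_cache_write_prefixes(primary: str = LEGACY_PREFIX) -> List[str]:
    """Return the prefixes a price_cache writer should upload to.

    Assumption: the production default is the legacy prefix until cutover.
    """
    if primary == LEGACY_PREFIX:
        return [LEGACY_PREFIX, NEW_PREFIX]  # legacy first, order is significant
    return [primary]  # custom prefixes are written once, as given


def write_both(put_object: Callable[[str, bytes], None], primary: str,
               filename: str, body: bytes) -> List[str]:
    """Upload `body` under every prefix, in order, and return the keys written.

    `put_object` is a hypothetical uploader (key, body). Any exception it
    raises propagates immediately, so a failed legacy write stops the loop
    before the new prefix is attempted.
    """
    keys = []
    for prefix in price_cache_write_prefixes(primary):
        key = f"{prefix}{filename}"
        put_object(key, body)
        keys.append(key)
    return keys


# Usage with a dict standing in for the bucket.
store = {}
written = write_both(store.__setitem__, LEGACY_PREFIX, "AAPL.parquet", b"PAR1")
custom = price_cache_write_prefixes("scratch/price_cache/")
```

Keeping the prefix decision in one function means the eventual cutover only has to change that function rather than each of the three writers.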
## Summary

First PR of a coordinated 4-PR arc closing ROADMAP P1 "VWAP centralization — Writer" (line 576) staleness + Phase 7e namespace cleanup goal. The VWAP writer itself shipped 2026-04-17 (PR #51) and the universe schema canonicalization shipped 2026-04-28 morning. What's still open is eliminating `predictor/daily_closes/` from the `predictor/` namespace, where it lives alongside authoritative artifacts when it's actually intermediate state between API fetch and ArcticDB ingest.

Moving the prefix from `predictor/daily_closes/` → `staging/daily_closes/` makes that role explicit in the path, and the new `infrastructure/s3_lifecycle_staging.json` expires `staging/` objects after 7 days so old parquets stop accumulating.

## Hard-cutover, no fallback (per `feedback_no_silent_fails`)

Reader-side fallback was explicitly rejected as the silent-fail class that has bitten production repeatedly this month. All consumers hard-cut to the new prefix; if it's missing, every reader fails loud (`RuntimeError` or `NoSuchKey`) rather than falling through to legacy.

## Coordinated cross-repo cutover (PRs 2 + 3 must merge in same deploy window)

- alpha-engine-research (PR 2): `feature_store_reader.read_latest_daily_closes` cuts to `staging/` (called from `graph/research_graph.py:222-223` in the live Saturday SF Research Lambda)
- alpha-engine-dashboard (PR 3): `health_checker.py:158/161` + `pages/4_System_Health.py:266` cut to `staging/`
- alpha-engine (PR 4): drop `predictor/daily_closes/*` from executor IAM (executor migrated 2026-04-17 in PR #60)

The brief window between the data PR deploy and the consumer Lambda redeploys may show 404s on the new path; that's the intended "fail loud" signal under the hard-cutover contract.

## Files changed (writer + 2 readers + orchestrator)

- `collectors/daily_closes.py`: default `s3_prefix` `"predictor/daily_closes/"` → `"staging/daily_closes/"`; module docstring rewritten to reflect the intermediate-state role + 7-day lifecycle
- `builders/daily_append.py`: `_load_daily_closes` reads `staging/daily_closes/{date_str}.parquet`
- `features/compute.py`: delta-load loop reads `staging/daily_closes/{d}.parquet` (+ docstring)
- `weekly_collector.py`: both `--daily` and `--morning-enrich` call sites default `daily_cfg.s3_prefix` to `"staging/daily_closes/"`
- `config.yaml.example`: documented default updated; comment explaining intermediate-staging contract
- `README.md`: writes table updated; flags 7-day lifecycle + ArcticDB as canonical home
- `tests/test_vwap_ingestion.py`: docstring path corrected

## New artifacts

- `infrastructure/s3_lifecycle_staging.json`: 7-day expiration scoped to `staging/` prefix; 1-day abort for incomplete multipart uploads
- `infrastructure/apply_s3_lifecycle.sh`: idempotent `put-bucket-lifecycle-configuration` applier; `--dry-run` support

## Operational steps after merge

1. `boot-pull` on next instance start (or manual `git pull`).
2. `bash infrastructure/apply_s3_lifecycle.sh` from this repo. Pushes the 7-day expiration to the alpha-engine-research bucket.
3. Once PRs 2 + 3 have deployed and staging/ reads and writes are confirmed clean: `aws s3 rm --recursive s3://alpha-engine-research/predictor/daily_closes/`. Closes ROADMAP Phase 7e for this prefix.

## Test plan

- `pytest tests/test_staging_prefix_migration.py -v` -> 6 pass (writer default, no legacy strings, both readers, orchestrator, lifecycle artifact valid)
- `pytest tests/` -> 306 pass (was 300), 0 regressions
- `python3 -c 'import json; json.load(open("infrastructure/s3_lifecycle_staging.json"))'` -> valid
- `bash infrastructure/apply_s3_lifecycle.sh --dry-run` -> prints planned put-bucket-lifecycle-configuration command
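The "no legacy strings" check in the test plan is a source-text invariant: it scans file contents rather than exercising behavior. A minimal sketch of that idea follows; it is not the code in tests/test_staging_prefix_migration.py. The helper `find_legacy_references` and the two temporary files are hypothetical, and only the two prefix literals come from the PR.

```python
import tempfile
from pathlib import Path

LEGACY_LITERAL = "predictor/daily_closes/"
STAGING_LITERAL = "staging/daily_closes/"


def find_legacy_references(paths):
    """Return the paths whose text still contains the legacy prefix literal."""
    return [str(p) for p in paths
            if LEGACY_LITERAL in Path(p).read_text(encoding="utf-8")]


# Demonstration against two throwaway files instead of the real modules.
with tempfile.TemporaryDirectory() as tmp:
    migrated = Path(tmp) / "migrated.py"
    migrated.write_text('s3_prefix = "staging/daily_closes/"\n', encoding="utf-8")
    stale = Path(tmp) / "stale.py"
    stale.write_text('s3_prefix = "predictor/daily_closes/"\n', encoding="utf-8")

    offenders = find_legacy_references([migrated, stale])
    offender_names = [Path(p).name for p in offenders]
    # The second half of the invariant: the staging default must be present.
    has_staging_default = STAGING_LITERAL in migrated.read_text(encoding="utf-8")
```

In a real test the path list would name the writer, readers, and orchestrator modules, and the assertion would be that the offenders list is empty.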