
feat(daily_closes): skip_if_canonical optimization for windowed yfinance pass (PR 2/5) #200

Merged
cipher813 merged 1 commit into main from feat/daily-closes-skip-if-canonical
May 10, 2026
Conversation

@cipher813

Summary

PR 2 of the windowed-data-reconciliation arc. Plan doc: alpha-engine-docs/private/windowed-data-reconciliation-260510.md. Builds on PR 1's structural window orchestration (#199).

Adds skip_if_canonical: bool = False parameter to collect(). When True (set automatically by _collect_window):

  • yfinance_only / auto modes: read the existing parquet, identify "canonical" rows (source ∈ {"yfinance", "polygon"} AND non-null Close), skip yfinance fetch for those, merge preserved canonical rows into the output. Net effect: steady-state yfinance batch cost stays near zero across the 14-day window because most cells are already populated by prior passes.
  • polygon_only mode: flag is ignored per option (a) from the design discussion — polygon always re-overwrites within the window so corporate-action backfills are absorbed. grouped-daily call rate stays at 1 per date.
  • Legacy post-close-skip is bypassed when skip_if_canonical=True — the whole point is to fill NaN cells in older window dates that legacy logic would skip.
  • Read-failure fallback: if the existing parquet can't be read (corrupt / network), the per-date call falls back to the legacy refetch+overwrite path so a single bad parquet doesn't take down the window.
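
The canonical-row check described above can be sketched in plain Python. This is a minimal illustration, not the code in `collect()`: the `source` and `Close` column names come from this PR, while the `ticker` key, the row-as-dict representation, and the helper names `canonical_tickers` / `tickers_to_fetch` are assumptions made for the example.

```python
import math
from typing import Iterable, List, Mapping, Set

# Sources treated as canonical, per the PR description.
CANONICAL_SOURCES = {"yfinance", "polygon"}


def _is_null(value: object) -> bool:
    """Treat None and float NaN as a missing Close."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def canonical_tickers(rows: Iterable[Mapping[str, object]]) -> Set[str]:
    """Return tickers whose existing row is canonical.

    A row from a legacy parquet has no "source" key, so .get() yields None
    and the row is never treated as canonical.
    """
    found: Set[str] = set()
    for row in rows:
        if row.get("source") in CANONICAL_SOURCES and not _is_null(row.get("Close")):
            found.add(str(row["ticker"]))  # "ticker" key is hypothetical
    return found


def tickers_to_fetch(universe: List[str], rows: Iterable[Mapping[str, object]]) -> List[str]:
    """Keep only tickers that still need a yfinance fetch, preserving order."""
    skip = canonical_tickers(rows)
    return [t for t in universe if t not in skip]
```

A row with a canonical source but a NaN `Close` stays in the fetch list, which matches the "NaN-Close still refetches" case named in the test plan.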

Source-precedence ladder enforced

NaN < "yfinance" < "polygon". Each pass writes only to cells ranked below its own source, with one exception: polygon also rewrites its own cells (option a):

  • yfinance pass: skips canonical {yfinance, polygon}; never demotes polygon, never re-fetches its own work.
  • polygon pass: overwrites NULL / "yfinance" / "polygon" cells (option a). Polygon never introduces NaN: a polygon-empty cell retains whatever was there.
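
One way to read the ladder is as a single predicate that answers "may this pass write this cell?". The sketch below is an assumption-laden illustration: the function name `may_write` and its arguments are hypothetical, and it assumes the pass actually has a non-null value to write.

```python
from typing import Optional

CANONICAL_SOURCES = {"yfinance", "polygon"}


def may_write(pass_source: str, existing_source: Optional[str], existing_close_is_null: bool) -> bool:
    """Decide whether a pass may write a cell, given what is already there.

    Hypothetical helper: assumes the pass holds a non-null value for the cell.
    """
    if pass_source == "polygon":
        # Option (a): polygon overwrites NULL, "yfinance", and "polygon" cells.
        return True
    if pass_source == "yfinance":
        # yfinance skips canonical cells: a known source AND a non-null Close.
        canonical = existing_source in CANONICAL_SOURCES and not existing_close_is_null
        return not canonical
    raise ValueError(f"unknown pass source: {pass_source!r}")
```

Under this reading, a yfinance-sourced cell whose `Close` is NaN is still writable by the yfinance pass, because it does not meet the canonical definition.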

Out of scope (later PRs in the arc)

  • PR 3: SF wiring + window_days=14 config knob (default flag-gated OFF for first cycle's observation).
  • PR 4: simulator gap-warning metric refactor reading the source column.
  • PR 5: chronic_polygon_gaps allowlist deprecation.

Test plan

  • pytest tests/test_daily_closes_skip_if_canonical.py — 8 new tests pinning skip semantics, NaN-Close still refetches, legacy parquet without source column doesn't skip, post-close-skip bypass, polygon_only ignores flag, corrupt-parquet fallback, default-False preserves legacy
  • pytest tests/test_daily_closes_window_days.py — +1 test pinning window-mode propagates skip_if_canonical=True
  • Full data suite: 642 passed, 1 skipped (no regressions; +9 new tests vs PR 1 baseline)
  • Live exercise: gated on PR 3's SF wiring + flag flip

🤖 Generated with Claude Code

…nce pass

PR 2 of the windowed-data-reconciliation arc (plan doc:
alpha-engine-docs/private/windowed-data-reconciliation-260510.md).
Builds on PR 1's structural window orchestration.

**What this adds**

New ``skip_if_canonical: bool = False`` parameter on ``collect()``.
When True (set by ``_collect_window`` automatically per the
windowed-arc design):

- ``yfinance_only`` / ``auto`` modes: read the existing parquet,
  identify "canonical" rows (``source ∈ {"yfinance", "polygon"}`` AND
  non-null ``Close``), skip yfinance fetch for those tickers, and
  merge the preserved canonical rows into the output parquet. Net
  effect: steady-state yfinance batch cost stays near zero across the
  14-day window because most cells are already populated by prior
  passes.

- ``polygon_only`` mode: flag is *ignored* per Brian's 2026-05-10
  option (a) — polygon always re-overwrites within the window so
  corporate-action backfills (where polygon's adjusted close shifts
  retroactively) are picked up. ``grouped-daily`` call rate stays at
  one per date in the window regardless, honoring the 14/day free-tier
  contract.

- Legacy post-close-skip short-circuit is bypassed when
  ``skip_if_canonical=True`` because the whole point of windowed
  reconciliation is to look INSIDE the existing parquet for NaN cells
  in older window dates that the legacy "skip if file exists post-close"
  semantic would otherwise skip.

- If reading the existing parquet fails (corrupt, network), fall back
  to legacy refetch+overwrite for the date — don't take down the whole
  window because of one unreadable parquet.
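
The read-failure fallback might look roughly like the following. This is a sketch under stated assumptions: `resolve_skip_set` and the `read_existing` callable are hypothetical names, and returning `None` to signal "use the legacy refetch+overwrite path" is an illustrative convention rather than the project's actual control flow.

```python
import logging
from typing import Callable, Optional, Set

logger = logging.getLogger(__name__)


def resolve_skip_set(read_existing: Callable[[], Set[str]], date_str: str) -> Optional[Set[str]]:
    """Return the tickers to skip for one date, or None to request legacy refetch.

    read_existing is a hypothetical callable that loads the date's parquet and
    returns its canonical tickers; it may raise on corrupt data or I/O errors.
    """
    try:
        return read_existing()
    except Exception as exc:  # broad on purpose: any read failure degrades to legacy
        logger.warning("existing parquet unreadable for %s (%s); refetching", date_str, exc)
        return None
```

Because the failure is contained per date, the remaining dates in the window still go through the skip path.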

**Source-precedence-ladder semantics**

``NaN < "yfinance" < "polygon"``. Each pass writes only "below itself":

- yfinance pass skips cells where source ∈ {yfinance, polygon} —
  yfinance never demotes polygon, never re-fetches its own work.
- polygon pass overwrites cells where source ∈ {NULL, "yfinance"} —
  polygon canonicalizes ahead of yfinance — AND overwrites polygon
  cells too (option a, corporate-action handling). Polygon never
  introduces NaN: a polygon-empty cell retains whatever was there.
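
The "polygon never introduces NaN" rule can be illustrated with a small merge over per-ticker cells. The `(Close, source)` tuple layout and the `merge_polygon` name are assumptions for this example only.

```python
import math
from typing import Dict, Optional, Tuple

# Hypothetical cell layout: (Close, source)
Cell = Tuple[Optional[float], Optional[str]]


def merge_polygon(existing: Dict[str, Cell], polygon_closes: Dict[str, Optional[float]]) -> Dict[str, Cell]:
    """Overlay polygon closes on existing cells without ever writing a NaN."""
    merged = dict(existing)
    for ticker, close in polygon_closes.items():
        if close is None or math.isnan(close):
            # Polygon-empty cell: retain whatever was there (possibly nothing).
            continue
        # Option (a): overwrite NULL, "yfinance", and "polygon" cells alike.
        merged[ticker] = (close, "polygon")
    return merged
```

The input dict is copied, so the existing cells are left unmodified if the caller needs to compare before and after.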

**Out of scope (later PRs in the arc)**

- PR 3: SF wiring + ``window_days=14`` config knob. Today's default
  ``skip_if_canonical=False`` preserves legacy single-date behavior;
  PR 3 flips ``window_days=14`` and the SF callers automatically pick
  up the skip optimization.
- PR 4: simulator gap-warning metric refactor reading the ``source``
  column.
- PR 5: ``chronic_polygon_gaps`` allowlist deprecation.

**Test coverage**

+9 new tests across two files:

- ``tests/test_daily_closes_skip_if_canonical.py`` (8 tests):
  - skips canonical yfinance + canonical polygon tickers
  - does not skip NaN-Close tickers (refetch fills the NaN)
  - does not skip when ``source`` column is missing (legacy parquet)
  - bypasses post-close-skip short-circuit
  - polygon_only mode IGNORES the flag (option a contract)
  - corrupt parquet falls back to legacy refetch
  - default (skip_if_canonical=False) preserves legacy short-circuit
- ``tests/test_daily_closes_window_days.py`` (+1 test): window-mode
  call sets skip_if_canonical=True per design.

Suite: 642 passed (was 633 after PR 1; +9 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 54a1746 into main May 10, 2026
1 check passed
@cipher813 cipher813 deleted the feat/daily-closes-skip-if-canonical branch May 10, 2026 14:34
cipher813 added a commit that referenced this pull request May 10, 2026
… both daily_closes call sites (#201)

PR 3 of the windowed-data-reconciliation arc (plan doc:
alpha-engine-docs/private/windowed-data-reconciliation-260510.md).
Builds on PR 1 (#199, window_days orchestration) + PR 2 (#200,
skip_if_canonical optimization).

**What this adds**

Two SF call sites in ``weekly_collector.py`` now pass the windowed-
reconciliation knobs through to ``daily_closes.collect``:

- **MorningEnrich** (line 961, ``polygon_only``) — reads
  ``daily_cfg.get("window_days", 1)`` + ``daily_cfg.get(
  "skip_if_canonical", False)`` and forwards. Polygon ignores the
  skip flag per option (a) but still benefits from windowed
  ``grouped-daily`` calls (one per BDay in the window — 14 calls/day
  when ``window_days=14``, the free-tier rate-limit ceiling).
- **EOD pass** (line 1196, ``yfinance_only``) — same config read +
  forward. With ``skip_if_canonical=true`` the yfinance batch cost
  stays near zero in steady state because most cells are already
  canonical from prior pass days.
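
For a sense of what "one per BDay in the window" means, a window of business days can be enumerated with the standard library alone. This sketch is not the project's `_collect_window`: the function name is hypothetical, it treats only Saturday and Sunday as non-business days, and it ignores market holidays.

```python
from datetime import date, timedelta
from typing import List


def business_days_ending(end: date, window_days: int) -> List[date]:
    """Return the last `window_days` weekdays up to and including `end`, oldest first.

    Assumption: weekends are the only non-business days; holidays are not handled.
    """
    if window_days < 1:
        raise ValueError("window_days must be >= 1")
    days: List[date] = []
    cursor = end
    while len(days) < window_days:
        if cursor.weekday() < 5:  # Monday=0 .. Friday=4
            days.append(cursor)
        cursor -= timedelta(days=1)
    return days[::-1]
```

With `window_days=14`, the list has 14 entries, which lines up with the 14 `grouped-daily` calls per day mentioned above.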

Adds ``window_days: 1`` and ``skip_if_canonical: false`` to
``config.yaml.example`` with documentation on the production-target
values (``window_days: 14`` + ``skip_if_canonical: true``) + the
staged-rollout protocol from the plan doc.

**Default behavior preserved**

When the new config keys are absent (current production state), both
call sites pass ``window_days=1`` + ``skip_if_canonical=False``,
which is byte-identical to legacy single-date behavior. No live
behavior change from this PR landing — the cutover happens via a
separate alpha-engine-config commit that flips the values once the
wiring is observed clean.
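
The default-preserving config read, including the string coercion the tests below mention, could be sketched as follows. The `daily_cfg.get(...)` defaults come from this commit message; the helper names and the exact set of accepted boolean strings are assumptions.

```python
from typing import Any, Mapping, Tuple

# Accepted spellings are an assumption for this sketch.
_TRUE = {"true", "1", "yes", "on"}
_FALSE = {"false", "0", "no", "off", ""}


def _to_bool(value: Any) -> bool:
    """Coerce YAML-ish values to bool; bool("false") would wrongly be True."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in _TRUE:
            return True
        if text in _FALSE:
            return False
        raise ValueError(f"not a boolean: {value!r}")
    return bool(value)


def read_window_knobs(daily_cfg: Mapping[str, Any]) -> Tuple[int, bool]:
    """Return (window_days, skip_if_canonical), defaulting to legacy behavior."""
    window_days = int(daily_cfg.get("window_days", 1))
    skip_if_canonical = _to_bool(daily_cfg.get("skip_if_canonical", False))
    return window_days, skip_if_canonical
```

An empty config yields `(1, False)`, the legacy single-date behavior described above.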

**Out of scope (later PRs in the arc)**

- Cutover commit in alpha-engine-config (the actual ``window_days: 14``
  flip) — flag-gated rollout per the plan doc:
  1 clean Sat SF + 5 clean weekday SFs at ``window_days=14`` before
  ``skip_if_canonical: true`` flips.
- PR 4: simulator gap-warning metric refactor reading the ``source``
  column.
- PR 5: ``chronic_polygon_gaps`` allowlist deprecation.

**Test coverage**

+6 new tests in ``tests/test_weekly_collector_window_days_wiring.py``:
- absent config keys default to legacy ``window_days=1`` (both call sites)
- configured ``window_days=14`` + ``skip_if_canonical=true`` flow through
- string YAML coercion (``"14"`` → ``14``, ``"true"`` → ``True``)
- end-to-end roundtrip via ``daily_closes.collect`` mock

Suite: 648 passed (was 642 after PR 2; +6 new).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 12, 2026
…clobber (#219)

The windowed-reconciliation cutover (PRs #199/#200/#201 + alpha-engine-config
flip to daily_closes:{window_days:14, skip_if_canonical:true}, activated
2026-05-11) amplified a latent bug in _fetch_fred_closes: the FRED query
used sort_order=desc + limit=5 with no upper bound, so per-date calls
across the rolling window all returned today's most-recent observation.
Every historical date's parquet got today's VIX/VIX3M/TNX/IRX/TWO/HYOAS/
BAA10Y stamped on it, clobbering correct historical closes.

FlowDoctor surfaced the regression 2026-05-12 ~13:01/13:04 UTC with paired
"polygon_only OVERWRITE VIX" ERROR alerts for 2026-04-22 and 2026-04-28,
both showing identical pre (18.36) and post (17.19) closes — the signature
of "every per-date stamp got today's latest".

Fix:
- _fetch_fred_closes sends observation_end=date_str so per-date calls
  return that date's actual FRED observation (or most-recent on-or-before
  for the same-day case where FRED hasn't published yet — preserves the
  legacy "today's parquet carries yesterday's FRED close" semantic).
- Defensive guard refuses to write a future-dated observation if FRED
  somehow returns one despite observation_end.
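
A rough sketch of the bounded query and the guard, assuming the FRED `series/observations` JSON shape (each observation has string `date` and `value` fields, with `"."` marking a missing value). The function names are hypothetical, and where the fix above refuses to write a future-dated observation, this sketch simply skips it and keeps looking at older ones.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple


def fred_params(series_id: str, date_str: str, api_key: str) -> Dict[str, str]:
    """Build query params; observation_end is the upper bound the old query lacked."""
    return {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "sort_order": "desc",
        "limit": "5",
        "observation_end": date_str,
    }


def pick_observation(observations: List[Dict[str, str]], date_str: str) -> Optional[Tuple[str, float]]:
    """Return the newest usable (date, value) on or before date_str, else None.

    Expects observations newest-first, as returned with sort_order=desc.
    """
    target = date.fromisoformat(date_str)
    for obs in observations:
        if date.fromisoformat(obs["date"]) > target:
            continue  # defensive guard: never accept a future-dated observation
        if obs["value"] == ".":
            continue  # FRED's marker for a missing value
        return obs["date"], float(obs["value"])
    return None
```

With the upper bound in place, a per-date call for an older window date can no longer pick up today's latest observation.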

Repair tool (collectors/daily_closes_fred_repair.py) re-fetches correct
FRED values across an operator-specified window and rewrites only the
FRED-ticker rows of each affected daily_closes parquet. Polygon stock
rows are untouched (their fetcher was always per-date-correct). Idempotent.

Tests: +7 per-date regression tests pinning observation_end + same-day
fallback + future-date refusal + missing-value skip; +11 repair tests
covering business-day enumeration + on-or-before lookup + idempotent
no-op + dry-run + missing-parquet skip. Suite 774 → 792.

Operator follow-up: after merge, run
  python -m collectors.daily_closes_fred_repair \
    --bucket alpha-engine-research \
    --start 2026-04-22 --end 2026-05-12 [--dry-run]
to repair the clobbered window before tomorrow's MorningEnrich (which
now writes correct per-date FRED values going forward).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 20, 2026
…ine_lib.alerts CLI (#277)

Per ROADMAP L146 (SOTA / institutional-approach sub-sub-rule). Second
inline-alerts site named in the ROADMAP — sibling PR alpha-engine #200
migrates the alpha-engine/infrastructure/health_checker.sh half.

infrastructure/lambdas/changelog-incident-mirror/deploy.sh: raw `aws
sns publish` → `python -m alpha_engine_lib.alerts publish`. SNS target
stays identical (default `alpha-engine-alerts` topic resolution), so
the changelog-incident-mirror Lambda still receives the message and
the smoke test still verifies end-to-end. `--no-telegram` keeps the
deliberate-per-deploy noise off the operator channel; `severity=info`
matches the smoke-test semantics.

Lib pin v0.20.0 → v0.21.0 in BOTH requirements.txt AND Dockerfile
(lockstep, per the test_lib_pin_lockstep regression test). v0.21.0 is
the alerts-module floor; v0.20 → v0.21 is additive (just the alerts
module).

Suite: 1401 passed (vs 1400 baseline — lib v0.21 doesn't break any
existing consumer).

Closes alpha-engine-data half of ROADMAP L146 (P2). After both PRs
merge, `grep -rE "aws sns publish.*alpha-engine-alerts|api.telegram.org/bot" infrastructure/ deploy/ --include="*.sh"`
returns zero hits across both alpha-engine and alpha-engine-data.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>