feat(provenance): per-row source column on ArcticDB universe writes by cipher813 · Pull Request #196 · cipher813/alpha-engine-data

cipher813 · 2026-05-09T17:47:21Z

Summary

Closes the audit-trail gap surfaced during the chronic-polygon-gap arc: daily_closes.collect already records per-row source ("polygon" / "yfinance" / "fred") in the staging parquet, but the canonical ArcticDB universe library (consumed by predictor training + inference) dropped the column. Provenance lived only in 7-day-TTL staging parquets, then evaporated.

This PR threads source end-to-end:

_load_daily_closes extracts source per ticker (defaults "unknown" for pre-migration parquets)
Daily_append's per-ticker write carries source into ArcticDB rows alongside OHLCV + features
_apply_daily_delta propagates source through cache+delta merge (pre-delta = "yfinance", delta = from parquet, dedup keep="last")
builders/backfill includes source in per-ticker keep_cols
New schema-bridge helper _align_schema_for_update handles the migration boundary (drop extras / add missing / no-op on match)
Dtype-cast skip-list: source is string metadata, NOT subject to float32 fallback

Migration trajectory

Mon-Fri (post-merge): daily_append writes new rows with source; pre-migration ArcticDB series (no source col) get the schema bridge — new row's source dropped, update() succeeds.
Sat (next backfill): writes full series with source column. Pre-delta rows tagged yfinance; delta rows tagged from staging parquet source. Schema upgraded universe-wide.
Mon onward: daily_append writes source; existing series now has source col → no bridge fires.

Test plan

8 new tests in tests/test_provenance_source_column.py
pytest full suite 595 passed (vs 587 on origin/main)
Mon 5/11 weekday SF: daily_append writes new row, schema bridge drops source for pre-migration series; update() succeeds. CW logs confirm no schema-mismatch errors.
Sat 5/16 retrain backfill: full universe rewritten with source column; per-row provenance restored. Spot-check a few tickers via lib.read(sym).data.source should show "yfinance" for old rows, source-from-delta for recent rows.
Mon 5/18 weekday SF: daily_append writes new row with source; update() succeeds against the post-backfill schema (source column present).

🤖 Generated with Claude Code

Closes the audit-trail gap surfaced during the chronic-gap arc: ``daily_closes.collect`` already records per-row source ("polygon" / "yfinance" / "fred") in the staging parquet, but the canonical ArcticDB universe library writes (consumed by predictor training + inference) dropped the column at the parquet boundary. Result: no way to tell at row level whether a stored OHLCV bar came from polygon (true VWAP, T+1 settled) or yfinance (no VWAP, real-time same-day) or fred (index passthrough). Provenance lived only in the 7-day-TTL staging parquets, then evaporated. This PR threads ``source`` end-to-end: 1. ``builders/daily_append._load_daily_closes`` extracts ``source`` per ticker; defaults to ``"unknown"`` when the staging parquet predates the migration (graceful degrade vs hard-fail). 2. Daily_append's per-ticker write loop carries ``source`` through to the ArcticDB universe row alongside OHLCV + features. Source pulled from ``new_row`` directly (not from the post-feature-compute frame) so the provenance contract is decoupled from ``compute_features`` column-passthrough behaviour. 3. ``_apply_daily_delta`` propagates ``source`` through the cache+ delta merge: pre-delta rows tagged ``"yfinance"`` (price_cache parquets are yfinance-sourced); delta rows tagged from the daily_closes parquet's source field; dedup ``keep="last"`` lets the delta source win on date overlap. 4. ``builders/backfill`` includes ``source`` in the per-ticker write's keep_cols. Default-fills ``"yfinance"`` when the merged frame lacks the column (catches the cache-only path where no delta files exist). 5. New schema-bridge helper ``_align_schema_for_update`` in ``daily_append`` handles the migration boundary: existing ArcticDB series without ``source`` + new row with ``source`` → drop source from new row; existing with source + new row without → add NaN source to new row, reorder columns. Idempotent + no-op when schemas already match (preserves DataFrame identity for mock equality checks). 6. Dtype-cast skip-list in daily_append per-ticker write: ``source`` is string metadata, NOT subject to the float32 fallback path that fires for pure-feature columns. Migration trajectory: Mon-Fri (post-merge): daily_append writes new rows with source. Existing pre-migration ArcticDB series (no source col) get the schema bridge — new row's source dropped, update() succeeds. Sat (next backfill): writes full series with source column. Pre- delta rows tagged yfinance; delta rows tagged from staging parquet source. Schema upgraded universe-wide in one rewrite. Mon onward: daily_append writes source; existing series now has source col → no bridge fires; update() carries source forward. Tests (8 new in test_provenance_source_column.py): - schema-bridge: noop on match / drop extras / add missing / pass-through on empty existing - _load_daily_closes surfaces source per ticker; defaults "unknown" when parquet predates migration - _apply_daily_delta: pre-delta = yfinance, delta = from parquet - _apply_daily_delta: no delta files keeps cache-as-is Full suite 595/595 (vs 587 on origin/main). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…reduction) (#211) PR #196's per-row source column (object dtype) bloats memory ~125KB per ticker on the full-history universe pass — across 900 tickers that's ~110MB. Categorical dtype with the closed-set categories (polygon, yfinance, fred, unknown) cuts it to ~1 byte per row (category code) + ~50 bytes per series (category metadata), so ~2.5KB per ticker / ~2MB universe-wide. This was a contributing factor in the 2026-05-11 MorningEnrich OOM on the t3.small trading instance: daily_append's working set ran ~1GB, +~110MB from the new object-dtype source columns in _apply_daily_delta, alongside a crash-looping daemon (~200-300MB transient × 38s churn from flow-doctor s3-notifier ConfigError) on a 2GB box. Both contributors needed addressing; this PR closes the source-column half. Three sites converted: 1. features/compute._apply_daily_delta — base["source"] = "yfinance" was the largest single object-string allocation (2500 rows × 900 tickers). Now built via make_source_series with the categorical dtype. 2. features/compute._apply_daily_delta — delta_df["source"] construction switched to make_source_series for the same reason + dtype consistency at concat (mismatched dtype would fall back to object). 3. builders/backfill._build_symbol_df — default-fill when featured_df lacks the source column. Saturday full-universe rewrite path; same scale of savings. Categorical dtype is preserved through ArcticDB write/read so the storage savings carry forward to the predictor's read path too — though the load-bearing benefit is in-memory during the write pass on the t3.small. `make_source_series` coerces unknown values to "unknown" rather than expanding the category set, keeping the codes stable across writers so downstream readers can rely on them. 4 new tests in test_provenance_source_column.py: - SOURCE_CATEGORIES constant is the closed set - make_source_series returns categorical - make_source_series coerces unknown values - _apply_daily_delta produces categorical source column Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…undary (#224) 2026-05-12 EOD: weekly_collector.py --daily exited 1 in the ArcticDB append stage with ArcticDbNotYetImplemented: Symbol: BRK-B DataFrame/Series contains categorical data, cannot append or update Categorical columns: ['source'] Root cause: PR #211 (perf(provenance): categorical dtype for source column, ~108MB memory reduction) converted the per-row source column to pd.CategoricalDtype in features.compute._apply_daily_delta. ArcticDB's _handle_categorical_columns raises on every append/update path. PR #211 solved a real OOM (2026-05-11 MorningEnrich) but the in-memory dtype leaked through to update_batch / write_batch. Institutional fix: keep PR #211's in-memory memory win, normalize only at the storage boundary, via a single named helper called immediately before every ArcticDB write. Changes: - store/arctic_store.to_arctic_safe(df) — fast-path returns input unchanged for empty / no-categorical frames; otherwise copies + casts every CategoricalDtype column to object dtype (matches PR #196's pre-#211 storage representation). Does not mutate the caller's frame. - builders/daily_append.py — wrap update_batch's UpdatePayload data and write_batch's WritePayload data. - builders/backfill.py — wrap universe_lib.write + macro_lib.write (features + raw series + sector ETFs). Macro paths never used Categorical, but uniform wrapping is single-source-of-truth and the fast path makes it free. Tests (+10, suite 802 → 812): - Categorical source → object after to_arctic_safe; values + index + column order preserved. - Input frame is not mutated (PR #211's in-memory Categorical must survive intact through the compute path). - Fast paths: returns the input object (no copy) on empty + on no-categorical frames. - Handles multiple Categorical columns (defends future writers). - Source-level call-site regressions pin the wrap at all 5 ArcticDB write sites (2× daily_append, 3× backfill). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…er (#238) 2026-05-14 EOD pipeline failure: weekly_collector.py --daily exited 1 in the ArcticDB append stage with 99.9% (903 / 904) tickers raising arcticdb::StreamDescriptorMismatch on universe_lib.update_batch: existing="…, VWAP idx=6, source idx=7, rsi_14 idx=8, …" new_val ="…, VWAP idx=6, rsi_14 idx=7, …, source idx=71" Root cause: the bare `today_row[PROVENANCE_COL] = …` assignment at line 1133 appends `source` at the END of the frame, but every persisted universe symbol carries `source` at idx 7 (between VWAP and the first feature) — laid down by `_write_row_backfill_safe` and the earlier per-ticker write paths from PR #196. ArcticDB's update_batch enforces strict column-order match against the existing version's descriptor, so EOD's append-at-head path fails for every existing ticker. The backfill-branch write_batch path masked the bug because full rewrite ignores the prior descriptor — yesterday's EOD went update path on a different schema vintage and slipped under the 5% error-rate threshold. Verified live: sampled 80 universe symbols (LAD/LEA/AAPL …) all show source at column position 6 (= 0-indexed pandas col, idx 7 in storage when including the date index), last_idx 2026-05-13. Fix: explicit re-projection to [OHLCV, source, FEATURES] before the write queue. Pinned by a source-level regression test in test_arctic_write_contract.py — a future PR that removes the re-projection trips here, preventing a silent re-introduction of the same EOD outage. Tests: full suite 1026 passed (was 1025, +1 column-order pin). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit b5a2736 into main May 9, 2026
1 check passed

cipher813 deleted the feat/provenance-source-column branch May 9, 2026 17:48

cipher813 mentioned this pull request May 11, 2026

perf(provenance): categorical dtype for source column (~108MB memory reduction) #211

Merged

3 tasks

cipher813 mentioned this pull request May 12, 2026

fix(arctic): normalize Categorical columns at every ArcticDB write boundary #224

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(provenance): per-row source column on ArcticDB universe writes#196

feat(provenance): per-row source column on ArcticDB universe writes#196
cipher813 merged 1 commit into
mainfrom
feat/provenance-source-column

cipher813 commented May 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cipher813 commented May 9, 2026

Summary

Migration trajectory

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant