feat(provenance): per-row source column on ArcticDB universe writes#196
Merged
Conversation
Closes the audit-trail gap surfaced during the chronic-gap arc:
``daily_closes.collect`` already records per-row source ("polygon" /
"yfinance" / "fred") in the staging parquet, but the canonical
ArcticDB universe library writes (consumed by predictor training +
inference) dropped the column at the parquet boundary. Result: no
way to tell at row level whether a stored OHLCV bar came from polygon
(true VWAP, T+1 settled) or yfinance (no VWAP, real-time same-day) or
fred (index passthrough). Provenance lived only in the 7-day-TTL
staging parquets, then evaporated.
This PR threads ``source`` end-to-end:
1. ``builders/daily_append._load_daily_closes`` extracts ``source``
per ticker; defaults to ``"unknown"`` when the staging parquet
predates the migration (graceful degrade vs hard-fail).
2. Daily_append's per-ticker write loop carries ``source`` through
to the ArcticDB universe row alongside OHLCV + features. Source
pulled from ``new_row`` directly (not from the post-feature-compute
frame) so the provenance contract is decoupled from
``compute_features`` column-passthrough behaviour.
3. ``_apply_daily_delta`` propagates ``source`` through the cache+
delta merge: pre-delta rows tagged ``"yfinance"`` (price_cache
parquets are yfinance-sourced); delta rows tagged from the
daily_closes parquet's source field; dedup ``keep="last"`` lets
the delta source win on date overlap.
4. ``builders/backfill`` includes ``source`` in the per-ticker
write's keep_cols. Default-fills ``"yfinance"`` when the merged
frame lacks the column (catches the cache-only path where no
delta files exist).
5. New schema-bridge helper ``_align_schema_for_update`` in
``daily_append`` handles the migration boundary: existing
ArcticDB series without ``source`` + new row with ``source`` →
drop source from new row; existing with source + new row without
→ add NaN source to new row, reorder columns. Idempotent + no-op
when schemas already match (preserves DataFrame identity for
mock equality checks).
6. Dtype-cast skip-list in daily_append per-ticker write: ``source``
is string metadata, NOT subject to the float32 fallback path
that fires for pure-feature columns.
Migration trajectory:
Mon-Fri (post-merge): daily_append writes new rows with source.
Existing pre-migration ArcticDB series (no source col) get the
schema bridge — new row's source dropped, update() succeeds.
Sat (next backfill): writes full series with source column. Pre-
delta rows tagged yfinance; delta rows tagged from staging
parquet source. Schema upgraded universe-wide in one rewrite.
Mon onward: daily_append writes source; existing series now has
source col → no bridge fires; update() carries source forward.
Tests (8 new in test_provenance_source_column.py):
- schema-bridge: noop on match / drop extras / add missing /
pass-through on empty existing
- _load_daily_closes surfaces source per ticker; defaults
"unknown" when parquet predates migration
- _apply_daily_delta: pre-delta = yfinance, delta = from parquet
- _apply_daily_delta: no delta files keeps cache-as-is
Full suite 595/595 (vs 587 on origin/main).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
cipher813
added a commit
that referenced
this pull request
May 11, 2026
…reduction) (#211) PR #196's per-row source column (object dtype) bloats memory ~125KB per ticker on the full-history universe pass — across 900 tickers that's ~110MB. Categorical dtype with the closed-set categories (polygon, yfinance, fred, unknown) cuts it to ~1 byte per row (category code) + ~50 bytes per series (category metadata), so ~2.5KB per ticker / ~2MB universe-wide. This was a contributing factor in the 2026-05-11 MorningEnrich OOM on the t3.small trading instance: daily_append's working set ran ~1GB, +~110MB from the new object-dtype source columns in _apply_daily_delta, alongside a crash-looping daemon (~200-300MB transient × 38s churn from flow-doctor s3-notifier ConfigError) on a 2GB box. Both contributors needed addressing; this PR closes the source-column half. Three sites converted: 1. features/compute._apply_daily_delta — base["source"] = "yfinance" was the largest single object-string allocation (2500 rows × 900 tickers). Now built via make_source_series with the categorical dtype. 2. features/compute._apply_daily_delta — delta_df["source"] construction switched to make_source_series for the same reason + dtype consistency at concat (mismatched dtype would fall back to object). 3. builders/backfill._build_symbol_df — default-fill when featured_df lacks the source column. Saturday full-universe rewrite path; same scale of savings. Categorical dtype is preserved through ArcticDB write/read so the storage savings carry forward to the predictor's read path too — though the load-bearing benefit is in-memory during the write pass on the t3.small. `make_source_series` coerces unknown values to "unknown" rather than expanding the category set, keeping the codes stable across writers so downstream readers can rely on them. 4 new tests in test_provenance_source_column.py: - SOURCE_CATEGORIES constant is the closed set - make_source_series returns categorical - make_source_series coerces unknown values - _apply_daily_delta produces categorical source column Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
cipher813
added a commit
that referenced
this pull request
May 12, 2026
…undary (#224) 2026-05-12 EOD: weekly_collector.py --daily exited 1 in the ArcticDB append stage with ArcticDbNotYetImplemented: Symbol: BRK-B DataFrame/Series contains categorical data, cannot append or update Categorical columns: ['source'] Root cause: PR #211 (perf(provenance): categorical dtype for source column, ~108MB memory reduction) converted the per-row source column to pd.CategoricalDtype in features.compute._apply_daily_delta. ArcticDB's _handle_categorical_columns raises on every append/update path. PR #211 solved a real OOM (2026-05-11 MorningEnrich) but the in-memory dtype leaked through to update_batch / write_batch. Institutional fix: keep PR #211's in-memory memory win, normalize only at the storage boundary, via a single named helper called immediately before every ArcticDB write. Changes: - store/arctic_store.to_arctic_safe(df) — fast-path returns input unchanged for empty / no-categorical frames; otherwise copies + casts every CategoricalDtype column to object dtype (matches PR #196's pre-#211 storage representation). Does not mutate the caller's frame. - builders/daily_append.py — wrap update_batch's UpdatePayload data and write_batch's WritePayload data. - builders/backfill.py — wrap universe_lib.write + macro_lib.write (features + raw series + sector ETFs). Macro paths never used Categorical, but uniform wrapping is single-source-of-truth and the fast path makes it free. Tests (+10, suite 802 → 812): - Categorical source → object after to_arctic_safe; values + index + column order preserved. - Input frame is not mutated (PR #211's in-memory Categorical must survive intact through the compute path). - Fast paths: returns the input object (no copy) on empty + on no-categorical frames. - Handles multiple Categorical columns (defends future writers). - Source-level call-site regressions pin the wrap at all 5 ArcticDB write sites (2× daily_append, 3× backfill). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
May 14, 2026
…er (#238) 2026-05-14 EOD pipeline failure: weekly_collector.py --daily exited 1 in the ArcticDB append stage with 99.9% (903 / 904) tickers raising arcticdb::StreamDescriptorMismatch on universe_lib.update_batch: existing="…, VWAP idx=6, source idx=7, rsi_14 idx=8, …" new_val ="…, VWAP idx=6, rsi_14 idx=7, …, source idx=71" Root cause: the bare `today_row[PROVENANCE_COL] = …` assignment at line 1133 appends `source` at the END of the frame, but every persisted universe symbol carries `source` at idx 7 (between VWAP and the first feature) — laid down by `_write_row_backfill_safe` and the earlier per-ticker write paths from PR #196. ArcticDB's update_batch enforces strict column-order match against the existing version's descriptor, so EOD's append-at-head path fails for every existing ticker. The backfill-branch write_batch path masked the bug because full rewrite ignores the prior descriptor — yesterday's EOD went update path on a different schema vintage and slipped under the 5% error-rate threshold. Verified live: sampled 80 universe symbols (LAD/LEA/AAPL …) all show source at column position 6 (= 0-indexed pandas col, idx 7 in storage when including the date index), last_idx 2026-05-13. Fix: explicit re-projection to [OHLCV, source, FEATURES] before the write queue. Pinned by a source-level regression test in test_arctic_write_contract.py — a future PR that removes the re-projection trips here, preventing a silent re-introduction of the same EOD outage. Tests: full suite 1026 passed (was 1025, +1 column-order pin). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the audit-trail gap surfaced during the chronic-polygon-gap arc:
daily_closes.collectalready records per-rowsource("polygon" / "yfinance" / "fred") in the staging parquet, but the canonical ArcticDB universe library (consumed by predictor training + inference) dropped the column. Provenance lived only in 7-day-TTL staging parquets, then evaporated.This PR threads
sourceend-to-end:_load_daily_closesextracts source per ticker (defaults"unknown"for pre-migration parquets)_apply_daily_deltapropagates source through cache+delta merge (pre-delta ="yfinance", delta = from parquet, dedupkeep="last")builders/backfillincludes source in per-ticker keep_cols_align_schema_for_updatehandles the migration boundary (drop extras / add missing / no-op on match)Migration trajectory
Test plan
tests/test_provenance_source_column.pypytestfull suite 595 passed (vs 587 on origin/main)lib.read(sym).data.sourceshould show "yfinance" for old rows, source-from-delta for recent rows.🤖 Generated with Claude Code