Skip to content

feat(provenance): per-row source column on ArcticDB universe writes#196

Merged
cipher813 merged 1 commit into
mainfrom
feat/provenance-source-column
May 9, 2026
Merged

feat(provenance): per-row source column on ArcticDB universe writes#196
cipher813 merged 1 commit into
mainfrom
feat/provenance-source-column

Conversation

@cipher813
Copy link
Copy Markdown
Owner

Summary

Closes the audit-trail gap surfaced during the chronic-polygon-gap arc: daily_closes.collect already records per-row source ("polygon" / "yfinance" / "fred") in the staging parquet, but the canonical ArcticDB universe library (consumed by predictor training + inference) dropped the column. Provenance lived only in 7-day-TTL staging parquets, then evaporated.

This PR threads source end-to-end:

  1. _load_daily_closes extracts source per ticker (defaults "unknown" for pre-migration parquets)
  2. Daily_append's per-ticker write carries source into ArcticDB rows alongside OHLCV + features
  3. _apply_daily_delta propagates source through cache+delta merge (pre-delta = "yfinance", delta = from parquet, dedup keep="last")
  4. builders/backfill includes source in per-ticker keep_cols
  5. New schema-bridge helper _align_schema_for_update handles the migration boundary (drop extras / add missing / no-op on match)
  6. Dtype-cast skip-list: source is string metadata, NOT subject to float32 fallback

Migration trajectory

  • Mon-Fri (post-merge): daily_append writes new rows with source; pre-migration ArcticDB series (no source col) get the schema bridge — new row's source dropped, update() succeeds.
  • Sat (next backfill): writes full series with source column. Pre-delta rows tagged yfinance; delta rows tagged from staging parquet source. Schema upgraded universe-wide.
  • Mon onward: daily_append writes source; existing series now has source col → no bridge fires.

Test plan

  • 8 new tests in tests/test_provenance_source_column.py
  • pytest full suite 595 passed (vs 587 on origin/main)
  • Mon 5/11 weekday SF: daily_append writes new row, schema bridge drops source for pre-migration series; update() succeeds. CW logs confirm no schema-mismatch errors.
  • Sat 5/16 retrain backfill: full universe rewritten with source column; per-row provenance restored. Spot-check a few tickers via lib.read(sym).data.source should show "yfinance" for old rows, source-from-delta for recent rows.
  • Mon 5/18 weekday SF: daily_append writes new row with source; update() succeeds against the post-backfill schema (source column present).

🤖 Generated with Claude Code

Closes the audit-trail gap surfaced during the chronic-gap arc:
``daily_closes.collect`` already records per-row source ("polygon" /
"yfinance" / "fred") in the staging parquet, but the canonical
ArcticDB universe library writes (consumed by predictor training +
inference) dropped the column at the parquet boundary. Result: no
way to tell at row level whether a stored OHLCV bar came from polygon
(true VWAP, T+1 settled) or yfinance (no VWAP, real-time same-day) or
fred (index passthrough). Provenance lived only in the 7-day-TTL
staging parquets, then evaporated.

This PR threads ``source`` end-to-end:

  1. ``builders/daily_append._load_daily_closes`` extracts ``source``
     per ticker; defaults to ``"unknown"`` when the staging parquet
     predates the migration (graceful degrade vs hard-fail).

  2. Daily_append's per-ticker write loop carries ``source`` through
     to the ArcticDB universe row alongside OHLCV + features. Source
     pulled from ``new_row`` directly (not from the post-feature-compute
     frame) so the provenance contract is decoupled from
     ``compute_features`` column-passthrough behaviour.

  3. ``_apply_daily_delta`` propagates ``source`` through the cache+
     delta merge: pre-delta rows tagged ``"yfinance"`` (price_cache
     parquets are yfinance-sourced); delta rows tagged from the
     daily_closes parquet's source field; dedup ``keep="last"`` lets
     the delta source win on date overlap.

  4. ``builders/backfill`` includes ``source`` in the per-ticker
     write's keep_cols. Default-fills ``"yfinance"`` when the merged
     frame lacks the column (catches the cache-only path where no
     delta files exist).

  5. New schema-bridge helper ``_align_schema_for_update`` in
     ``daily_append`` handles the migration boundary: existing
     ArcticDB series without ``source`` + new row with ``source`` →
     drop source from new row; existing with source + new row without
     → add NaN source to new row, reorder columns. Idempotent + no-op
     when schemas already match (preserves DataFrame identity for
     mock equality checks).

  6. Dtype-cast skip-list in daily_append per-ticker write: ``source``
     is string metadata, NOT subject to the float32 fallback path
     that fires for pure-feature columns.

Migration trajectory:
  Mon-Fri (post-merge): daily_append writes new rows with source.
    Existing pre-migration ArcticDB series (no source col) get the
    schema bridge — new row's source dropped, update() succeeds.
  Sat (next backfill): writes full series with source column. Pre-
    delta rows tagged yfinance; delta rows tagged from staging
    parquet source. Schema upgraded universe-wide in one rewrite.
  Mon onward: daily_append writes source; existing series now has
    source col → no bridge fires; update() carries source forward.

Tests (8 new in test_provenance_source_column.py):
  - schema-bridge: noop on match / drop extras / add missing /
    pass-through on empty existing
  - _load_daily_closes surfaces source per ticker; defaults
    "unknown" when parquet predates migration
  - _apply_daily_delta: pre-delta = yfinance, delta = from parquet
  - _apply_daily_delta: no delta files keeps cache-as-is

Full suite 595/595 (vs 587 on origin/main).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit b5a2736 into main May 9, 2026
1 check passed
@cipher813 cipher813 deleted the feat/provenance-source-column branch May 9, 2026 17:48
cipher813 added a commit that referenced this pull request May 11, 2026
…reduction) (#211)

PR #196's per-row source column (object dtype) bloats memory ~125KB per
ticker on the full-history universe pass — across 900 tickers that's
~110MB. Categorical dtype with the closed-set categories
(polygon, yfinance, fred, unknown) cuts it to ~1 byte per row
(category code) + ~50 bytes per series (category metadata), so
~2.5KB per ticker / ~2MB universe-wide.

This was a contributing factor in the 2026-05-11 MorningEnrich OOM on
the t3.small trading instance: daily_append's working set ran ~1GB,
+~110MB from the new object-dtype source columns in
_apply_daily_delta, alongside a crash-looping daemon (~200-300MB
transient × 38s churn from flow-doctor s3-notifier ConfigError) on a
2GB box. Both contributors needed addressing; this PR closes the
source-column half.

Three sites converted:
  1. features/compute._apply_daily_delta — base["source"] = "yfinance"
     was the largest single object-string allocation (2500 rows × 900
     tickers). Now built via make_source_series with the categorical
     dtype.
  2. features/compute._apply_daily_delta — delta_df["source"]
     construction switched to make_source_series for the same reason
     + dtype consistency at concat (mismatched dtype would fall back
     to object).
  3. builders/backfill._build_symbol_df — default-fill when
     featured_df lacks the source column. Saturday full-universe
     rewrite path; same scale of savings.

Categorical dtype is preserved through ArcticDB write/read so the
storage savings carry forward to the predictor's read path too — though
the load-bearing benefit is in-memory during the write pass on the t3.small.

`make_source_series` coerces unknown values to "unknown" rather than
expanding the category set, keeping the codes stable across writers so
downstream readers can rely on them.

4 new tests in test_provenance_source_column.py:
  - SOURCE_CATEGORIES constant is the closed set
  - make_source_series returns categorical
  - make_source_series coerces unknown values
  - _apply_daily_delta produces categorical source column

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 12, 2026
…undary (#224)

2026-05-12 EOD: weekly_collector.py --daily exited 1 in the ArcticDB
append stage with
  ArcticDbNotYetImplemented: Symbol: BRK-B
  DataFrame/Series contains categorical data, cannot append or update
  Categorical columns: ['source']

Root cause: PR #211 (perf(provenance): categorical dtype for source
column, ~108MB memory reduction) converted the per-row source column
to pd.CategoricalDtype in features.compute._apply_daily_delta. ArcticDB's
_handle_categorical_columns raises on every append/update path. PR #211
solved a real OOM (2026-05-11 MorningEnrich) but the in-memory dtype
leaked through to update_batch / write_batch.

Institutional fix: keep PR #211's in-memory memory win, normalize only
at the storage boundary, via a single named helper called immediately
before every ArcticDB write.

Changes:
- store/arctic_store.to_arctic_safe(df) — fast-path returns input
  unchanged for empty / no-categorical frames; otherwise copies + casts
  every CategoricalDtype column to object dtype (matches PR #196's
  pre-#211 storage representation). Does not mutate the caller's frame.
- builders/daily_append.py — wrap update_batch's UpdatePayload data
  and write_batch's WritePayload data.
- builders/backfill.py — wrap universe_lib.write + macro_lib.write
  (features + raw series + sector ETFs). Macro paths never used
  Categorical, but uniform wrapping is single-source-of-truth and the
  fast path makes it free.

Tests (+10, suite 802 → 812):
- Categorical source → object after to_arctic_safe; values + index +
  column order preserved.
- Input frame is not mutated (PR #211's in-memory Categorical must
  survive intact through the compute path).
- Fast paths: returns the input object (no copy) on empty + on
  no-categorical frames.
- Handles multiple Categorical columns (defends future writers).
- Source-level call-site regressions pin the wrap at all 5 ArcticDB
  write sites (2× daily_append, 3× backfill).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 14, 2026
…er (#238)

2026-05-14 EOD pipeline failure: weekly_collector.py --daily exited 1
in the ArcticDB append stage with 99.9% (903 / 904) tickers raising
arcticdb::StreamDescriptorMismatch on universe_lib.update_batch:
  existing="…, VWAP idx=6, source idx=7, rsi_14 idx=8, …"
  new_val ="…, VWAP idx=6, rsi_14 idx=7, …, source idx=71"

Root cause: the bare `today_row[PROVENANCE_COL] = …` assignment at
line 1133 appends `source` at the END of the frame, but every
persisted universe symbol carries `source` at idx 7 (between VWAP
and the first feature) — laid down by `_write_row_backfill_safe` and
the earlier per-ticker write paths from PR #196. ArcticDB's
update_batch enforces strict column-order match against the existing
version's descriptor, so EOD's append-at-head path fails for every
existing ticker. The backfill-branch write_batch path masked the bug
because full rewrite ignores the prior descriptor — yesterday's EOD
went update path on a different schema vintage and slipped under
the 5% error-rate threshold.

Verified live: sampled 80 universe symbols (LAD/LEA/AAPL …) all show
source at column position 6 (= 0-indexed pandas col, idx 7 in storage
when including the date index), last_idx 2026-05-13.

Fix: explicit re-projection to [OHLCV, source, FEATURES] before the
write queue. Pinned by a source-level regression test in
test_arctic_write_contract.py — a future PR that removes the
re-projection trips here, preventing a silent re-introduction of
the same EOD outage.

Tests: full suite 1026 passed (was 1025, +1 column-order pin).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant