daily_append: use update() instead of append() (dedup at source) #35
Merged
…rce)

Root cause fix for the 2026-04-15 predictor retrain outage where 904/909 tickers in the universe library had duplicate date rows when read back from ArcticDB. That failure was worked around defensively in the predictor loader (alpha-engine-predictor PR #26) and the inference loader has had equivalent defensive dedup for some time (load_prices.py line 403). This PR fixes the accumulation at the write site.

Change: swap `lib.append(symbol, today_row)` for `lib.update(symbol, today_row)` at the three daily write sites in builders/daily_append.py:
- universe_lib.update(ticker, today_row) (line 251)
- macro_lib.update(key, new_row) (line 269)
- macro_lib.update(sym, new_row) (line 286)

append() adds rows without dedup: if daily_append runs twice for the same date (race, retry, concurrent Saturday+Sunday pipelines), rows accumulate. update() is idempotent: ArcticDB replaces any existing rows whose dates overlap with the input DataFrame, so a re-run with the same or updated row produces at most one row per date regardless of invocation count.

The read-check at line 195 (if today_ts in hist.index: skip) stays. It's an efficiency guard that avoids the write entirely when the row already exists. update() is the safety net when that check misses.

tests/test_daily_append_semantics.py: source-level regression guards against a future revert to append() on any of the three sites.

Follow-up: once this has been in place for 1-2 full Saturday cycles, remove the defensive dedup in alpha-engine-predictor data/dataset.py (`_load_ticker_parquet`). Track on ROADMAP under Data Platform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
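For illustration, a source-level guard of the kind the commit message describes might look roughly like the sketch below. The contents of tests/test_daily_append_semantics.py are not shown on this page, so everything here is an assumption: `SAMPLE_SOURCE` is a made-up excerpt standing in for builders/daily_append.py, `find_lib_appends` is a hypothetical helper, and the "variable name ends in `_lib`" rule is just one plausible way to tell library writes apart from ordinary list appends.

```python
import ast

# Hypothetical excerpt standing in for builders/daily_append.py (not the real file).
SAMPLE_SOURCE = '''
def write_rows(universe_lib, macro_lib, ticker, key, today_row, new_row):
    universe_lib.update(ticker, today_row)
    macro_lib.update(key, new_row)
'''


def find_lib_appends(source):
    """Return (line, name) for each `<name>_lib.append(...)` call in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "append"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id.endswith("_lib")  # assumed naming convention
        ):
            hits.append((node.lineno, node.func.value.id))
    return hits


def test_no_append_on_libraries():
    # Fails if any library write site is reverted to append().
    assert find_lib_appends(SAMPLE_SOURCE) == []


# Simulate a revert at one site to show that the guard would catch it.
reverted = SAMPLE_SOURCE.replace("macro_lib.update(key", "macro_lib.append(key")
reverted_names = [name for _, name in find_lib_appends(reverted)]
```

In a real test the source would presumably be read from the module on disk rather than from an inline string. Parsing with `ast` rather than searching the text avoids false positives from comments or docstrings that mention `append()`.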
Summary
Root cause fix for the 2026-04-15 predictor retrain outage: 904/909 tickers in the production universe library had duplicate date rows when read back from ArcticDB.
The workaround landed in alpha-engine-predictor (#26 — defensive dedup in the training loader, mirroring inference's long-standing defensive dedup at `inference/stages/load_prices.py:403`). This PR fixes the accumulation at the write site so both defenses become unnecessary going forward.
Change: swap three `lib.append(symbol, today_row)` calls for `lib.update(symbol, today_row)` in `builders/daily_append.py` (universe, macro key, macro sym-path for sector ETFs). `append()` adds rows without dedup; `update()` replaces any existing rows whose dates overlap with the input, which is idempotent under re-runs, races, or concurrent pipeline invocations.
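The difference in semantics can be sketched with a toy in-memory model. This is not the ArcticDB API: `ToyLibrary`, `run_twice`, the ticker `"AAPL"`, and the prices are all hypothetical, and rows are plain `(date, value)` tuples rather than DataFrames. The sketch only mirrors the behavior described above, where `append()` adds rows unconditionally and `update()` replaces stored rows that fall inside the input's date range.

```python
from datetime import date


class ToyLibrary:
    """In-memory stand-in for a time-series library (NOT the ArcticDB API)."""

    def __init__(self):
        self._data = {}  # symbol -> list of (date, value) rows

    def append(self, symbol, rows):
        # Adds rows to the end with no check for dates already present.
        self._data.setdefault(symbol, []).extend(rows)

    def update(self, symbol, rows):
        # Drops stored rows inside the input's date range, then inserts
        # the new rows and keeps the series sorted by date.
        if not rows:
            return
        lo = min(d for d, _ in rows)
        hi = max(d for d, _ in rows)
        kept = [r for r in self._data.get(symbol, []) if not lo <= r[0] <= hi]
        self._data[symbol] = sorted(kept + list(rows), key=lambda r: r[0])

    def read(self, symbol):
        return list(self._data.get(symbol, []))


def run_twice(method_name):
    lib = ToyLibrary()
    lib.append("AAPL", [(date(2026, 4, 14), 100.0)])  # existing history
    today_row = [(date(2026, 4, 15), 101.0)]
    write = getattr(lib, method_name)
    write("AAPL", today_row)
    write("AAPL", today_row)  # simulated re-run for the same date
    return lib.read("AAPL")


appended = run_twice("append")
updated = run_twice("update")
print(len(appended), len(updated))  # prints "3 2"
```

With `append`, the second invocation leaves two rows for 2026-04-15. With `update`, the series still has exactly one row per date.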
Why duplicates accumulated: the read-check at line 195 (`if today_ts in hist.index: skip`) was the only dedup guard. It fails under any of:
`update()` removes all three failure modes by making the write itself idempotent.
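As a hedged illustration of why a read-check alone is not enough, the sketch below simulates one possible interleaving: two overlapping runs both read the history before either one writes, so both pass the check. The names `guarded_write` and `simulate` are hypothetical, the store is a plain list of `(date, value)` tuples, and the interleaving is an assumed example of a race rather than a reconstruction of the actual incident.

```python
from datetime import date

TODAY = date(2026, 4, 15)


def guarded_write(store, snapshot, row, idempotent):
    """Check-then-write, where `snapshot` is what this run read earlier."""
    if any(d == row[0] for d, _ in snapshot):
        return  # read-check: row already present, skip the write
    if idempotent:
        # update-style write: first drop any stored row with the same date
        store[:] = [r for r in store if r[0] != row[0]]
    store.append(row)


def simulate(idempotent):
    store = [(date(2026, 4, 14), 100.0)]
    # Two overlapping runs both read before either one writes,
    # so neither snapshot contains today's row.
    snapshot_a = list(store)
    snapshot_b = list(store)
    guarded_write(store, snapshot_a, (TODAY, 101.0), idempotent)
    guarded_write(store, snapshot_b, (TODAY, 101.0), idempotent)
    return [d for d, _ in store].count(TODAY)


print(simulate(idempotent=False), simulate(idempotent=True))  # prints "2 1"
```

The check is still useful because it skips the write in the common case, but correctness no longer depends on it once the write replaces overlapping dates.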
Test plan
Related
🤖 Generated with Claude Code