daily_append: use update() instead of append() (dedup at source) #35

Merged
cipher813 merged 1 commit into main from fix/dedup-arcticdb-writes-at-source on Apr 15, 2026

@cipher813
Owner

Summary

Root cause fix for the 2026-04-15 predictor retrain outage: 904/909 tickers in the production universe library had duplicate date rows when read back from ArcticDB.

The workaround landed in alpha-engine-predictor (#26 — defensive dedup in the training loader, mirroring inference's long-standing defensive dedup at `inference/stages/load_prices.py:403`). This PR fixes the accumulation at the write site so both defenses become unnecessary going forward.
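For readers unfamiliar with the workaround, a defensive dedup in a loader can be sketched as below. This is a plain-Python model, not the code in either loader: `dedup_keep_last` is a hypothetical name, rows are modeled as `(date, value)` tuples rather than a DataFrame, and keeping the last row per date is an assumption (the source does not say which duplicate the loaders keep). With pandas, the equivalent idiom would be `df[~df.index.duplicated(keep="last")]`.

```python
def dedup_keep_last(rows):
    """Collapse rows that share a date, keeping the last one seen.

    `rows` is an iterable of (date, value) pairs; the result is sorted by date.
    """
    latest = {}
    for d, value in rows:
        latest[d] = value  # a later duplicate overwrites an earlier one
    return sorted(latest.items())


# Hypothetical read-back with a duplicated 2026-04-15 row.
raw = [
    ("2026-04-14", 101.0),
    ("2026-04-15", 102.0),
    ("2026-04-15", 102.5),
]
clean = dedup_keep_last(raw)
print(clean)  # [('2026-04-14', 101.0), ('2026-04-15', 102.5)]
```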

Change: swap three `lib.append(symbol, today_row)` calls for `lib.update(symbol, today_row)` in `builders/daily_append.py` (universe, macro key, macro sym-path for sector ETFs). `append()` adds rows without dedup; `update()` replaces any existing rows whose dates overlap with the input, which is idempotent under re-runs, races, or concurrent pipeline invocations.
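The difference between the two write semantics can be illustrated with a toy in-memory model. This is a sketch of the behaviour described above, not the ArcticDB API: `model_append` and `model_update` are hypothetical helpers, and a symbol is modeled as a list of `(date, close)` tuples.

```python
from datetime import date


def model_append(rows, new_rows):
    """Append-style write: add the new rows with no check for existing dates."""
    return rows + list(new_rows)


def model_update(rows, new_rows):
    """Update-style write: drop stored rows inside the input's date range,
    then insert the input rows. Assumes `new_rows` is non-empty."""
    new_rows = sorted(new_rows)
    lo, hi = new_rows[0][0], new_rows[-1][0]
    kept = [r for r in rows if not (lo <= r[0] <= hi)]
    return sorted(kept + new_rows)


history = [(date(2026, 4, 13), 100.0), (date(2026, 4, 14), 101.0)]
today_row = [(date(2026, 4, 15), 102.0)]

# Write the same row twice, as a retry or a second pipeline run would.
appended = model_append(model_append(history, today_row), today_row)
updated = model_update(model_update(history, today_row), today_row)

print(len(appended))  # 4: the 2026-04-15 row is stored twice
print(len(updated))   # 3: one row per date
```

In this model the second `model_update` call is a no-op on the stored data, which is the idempotence property the change relies on.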

Why duplicates accumulated: the read-check at line 195 (`if today_ts in hist.index: skip`) was the only dedup guard. It fails under any of:

  • Concurrent Saturday + Sunday pipeline invocations (already flagged in ROADMAP under Research: weekend-dated signals)
  • Retries during partial failures
  • A read served from a cache that lags slightly behind the actual write state

`update()` removes all three failure modes by making the write itself idempotent.
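A small simulation shows why a read-check alone is not enough. It is a sketch under stated assumptions: the store is a plain list, `run_daily_append`, `append_write`, and `update_write` are hypothetical stand-ins, and the stale read is modeled by giving both invocations a snapshot taken before either one writes.

```python
from datetime import date

TODAY = date(2026, 4, 15)


def run_daily_append(store, snapshot, write):
    """One pipeline invocation: check a (possibly stale) snapshot, then write."""
    if any(d == TODAY for d, _ in snapshot):
        return "skipped"
    write(store, (TODAY, 102.0))
    return "written"


def append_write(store, row):
    store.append(row)  # no dedup at the write site


def update_write(store, row):
    # Replace any stored row with the same date, then add the new row.
    store[:] = [r for r in store if r[0] != row[0]] + [row]


def simulate(write):
    store = [(date(2026, 4, 14), 101.0)]
    # Both invocations read before either has written, so both pass the check.
    snap_a, snap_b = list(store), list(store)
    run_daily_append(store, snap_a, write)
    run_daily_append(store, snap_b, write)
    return sum(1 for d, _ in store if d == TODAY)


print(simulate(append_write))  # 2
print(simulate(update_write))  # 1
```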

Test plan

  • Source-level regression test: `tests/test_daily_append_semantics.py` locks the `update()` semantic — a future revert to `append()` on any of the 3 sites fails the test.
  • Full suite: 43 passed.
  • Next weekday pipeline run (2026-04-16 Thu): confirm no accumulation in universe_lib. Spot check: `lib.read('AAPL').data.index.has_duplicates` should be False.
  • After 1-2 full Saturday cycles have cleaned state, remove the defensive dedup in alpha-engine-predictor `data/dataset.py:_load_ticker_parquet` (tracked on ROADMAP).
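The single-ticker spot check above could be extended to a whole library along these lines. This is a sketch only: `duplicate_dates` and `symbols_with_duplicates` are hypothetical helpers, and the demo uses a dict of ISO date strings in place of a real library. Against ArcticDB, the `read_index` callable might be `lambda s: lib.read(s).data.index`, as in the spot check.

```python
from collections import Counter


def duplicate_dates(index):
    """Return the sorted index values that occur more than once."""
    counts = Counter(index)
    return sorted(d for d, n in counts.items() if n > 1)


def symbols_with_duplicates(read_index, symbols):
    """Map each symbol that has duplicated dates to those dates.

    `read_index` is any callable returning an iterable of index values
    for a symbol.
    """
    report = {}
    for sym in symbols:
        dups = duplicate_dates(read_index(sym))
        if dups:
            report[sym] = dups
    return report


# Stand-in for a library: symbol -> date index (hypothetical data).
fake_library = {
    "AAPL": ["2026-04-14", "2026-04-15"],
    "MSFT": ["2026-04-14", "2026-04-15", "2026-04-15"],
}
report = symbols_with_duplicates(fake_library.__getitem__, sorted(fake_library))
print(report)  # {'MSFT': ['2026-04-15']}
```

An empty report after the next weekday run would indicate that no new duplicates accumulated.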

Related

  • alpha-engine-predictor #26 (merged): defensive dedup workaround in training loader.
  • alpha-engine-docs ROADMAP Data Platform / P1 — "Eliminate duplicate-date rows in ArcticDB writes" (this PR resolves the write-side half).

🤖 Generated with Claude Code

…rce)

Root cause fix for the 2026-04-15 predictor retrain outage where
904/909 tickers in the universe library had duplicate date rows when
read back from ArcticDB. That failure was worked around defensively in
the predictor loader (alpha-engine-predictor PR #26) and the inference
loader has had equivalent defensive dedup for some time (load_prices.py
line 403). This PR fixes the accumulation at the write site.

Change: swap `lib.append(symbol, today_row)` for `lib.update(symbol,
today_row)` at the three daily write sites in builders/daily_append.py:
  - universe_lib.update(ticker, today_row)  (line 251)
  - macro_lib.update(key, new_row)           (line 269)
  - macro_lib.update(sym, new_row)           (line 286)

append() adds rows without dedup — if daily_append runs twice for the
same date (race, retry, concurrent Saturday+Sunday pipelines), rows
accumulate. update() is idempotent: ArcticDB replaces any existing
rows whose dates overlap with the input DataFrame, so a re-run with
the same or updated row produces at most one row per date regardless
of invocation count.

The read-check at line 195 (if today_ts in hist.index: skip) stays —
it's an efficiency guard that avoids the write entirely when the row
already exists. update() is the safety net when that check misses.

tests/test_daily_append_semantics.py — source-level regression guards
against a future revert to append() on any of the three sites.
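One way such a source-level guard could work is to parse the module and inspect its write calls. The sketch below is not the contents of tests/test_daily_append_semantics.py: `write_calls` is a hypothetical helper, the `*_lib` naming rule is an assumption based on the variable names listed above, and `SAMPLE` stands in for the real source, which a test would more likely load from builders/daily_append.py.

```python
import ast


def write_calls(source, method):
    """Return the names of `*_lib` objects on which `.method(...)` is called."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == method
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id.endswith("_lib")
        ):
            found.append(node.func.value.id)
    return sorted(found)


# Stand-in for the module under test (hypothetical source text).
SAMPLE = '''
def daily_append(universe_lib, macro_lib, ticker, key, sym, today_row, new_row):
    rows = []
    rows.append(today_row)  # plain list append, must not trip the guard
    universe_lib.update(ticker, today_row)
    macro_lib.update(key, new_row)
    macro_lib.update(sym, new_row)
'''

# A revert to append() on any library object would make the first check fail.
assert write_calls(SAMPLE, "append") == []
assert write_calls(SAMPLE, "update") == ["macro_lib", "macro_lib", "universe_lib"]
```

Matching on the AST rather than on raw text keeps ordinary list `append()` calls from producing false positives.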

Follow-up: once this has been in place for 1-2 full Saturday cycles,
remove the defensive dedup in alpha-engine-predictor data/dataset.py
(`_load_ticker_parquet`). Track on ROADMAP under Data Platform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 merged commit 0fbd6d5 into main on Apr 15, 2026
1 check passed
cipher813 deleted the fix/dedup-arcticdb-writes-at-source branch on April 15, 2026 18:10