fix(arctic): normalize Categorical columns at every ArcticDB write boundary by cipher813 · Pull Request #224 · cipher813/alpha-engine-data

cipher813 · 2026-05-12T20:57:36Z

Summary

Closes the 2026-05-12 EOD failure mode. weekly_collector.py --daily exited 1 in the ArcticDB append stage with ArcticDbNotYetImplemented: ... DataFrame/Series contains categorical data, cannot append or update on BRK-B. PR perf(provenance): categorical dtype for source column (~108MB memory reduction) #211's Categorical source column leaked from the in-memory compute path into update_batch.
Adds store.arctic_store.to_arctic_safe(df) — a single named helper at the ArcticDB storage boundary that strips CategoricalDtype columns to object dtype (matches PR feat(provenance): per-row source column on ArcticDB universe writes #196's pre-perf(provenance): categorical dtype for source column (~108MB memory reduction) #211 storage representation).
Wraps every ArcticDB write call site: 2× in builders/daily_append.py (update_batch + write_batch), 3× in builders/backfill.py (universe_lib.write + 3× macro_lib.write). Uniform wrapping is single-source-of-truth; the fast path returns the input unchanged when there's nothing to convert, so wrapping at the macro paths is free.
Keeps PR perf(provenance): categorical dtype for source column (~108MB memory reduction) #211's ~108MB in-memory memory win intact — the Categorical representation survives through _apply_daily_delta and compute_and_write; only the write-bound copy gets cast.

Why this shape

Institutional pattern: in-memory representations can be optimized (categoricals for compactness); the storage boundary enforces a strict contract that matches what the storage system accepts. The cast happens at exactly one named point. Same shape as JSON serialization, DataFrame→Parquet schema validation, ORM→SQL boundaries. Defends future writers (a new PR that adds another Categorical column — e.g. a quality flag or vendor enum — automatically gets normalized without changing the writer).

Bare .astype(str) at each call site would solve the immediate failure but encode the rule "ArcticDB doesn't accept Categorical" as folklore. The helper encodes it as code.

Test plan

python -m pytest tests/test_arctic_write_contract.py -v — all 10 new tests pass
Full suite: 811 passed, 1 skipped, 5 warnings (warnings pre-existing, unrelated to this PR; suite 802 → 812)
Source-level call-site regressions pin the wrap at every ArcticDB write site — a future revert or new unwrapped write site fails loudly
Tomorrow's 2026-05-13 MorningEnrich exercises update_batch in production (first post-merge live exercise)
Saturday SF 2026-05-16 exercises backfill.py's universe_lib.write + macro writes (first post-merge backfill exercise)

Refs

PR perf(provenance): categorical dtype for source column (~108MB memory reduction) #211 (2026-05-11) introduced the Categorical source dtype that broke update_batch
PR feat(provenance): per-row source column on ArcticDB universe writes #196 — original object-dtype source column, the pre-perf(provenance): categorical dtype for source column (~108MB memory reduction) #211 storage representation that round-trips cleanly through ArcticDB
feedback_test_fixture_shape_must_match_production — same institutional principle (shape-validation at storage boundaries), different boundary

🤖 Generated with Claude Code

…undary 2026-05-12 EOD: weekly_collector.py --daily exited 1 in the ArcticDB append stage with ArcticDbNotYetImplemented: Symbol: BRK-B DataFrame/Series contains categorical data, cannot append or update Categorical columns: ['source'] Root cause: PR #211 (perf(provenance): categorical dtype for source column, ~108MB memory reduction) converted the per-row source column to pd.CategoricalDtype in features.compute._apply_daily_delta. ArcticDB's _handle_categorical_columns raises on every append/update path. PR #211 solved a real OOM (2026-05-11 MorningEnrich) but the in-memory dtype leaked through to update_batch / write_batch. Institutional fix: keep PR #211's in-memory memory win, normalize only at the storage boundary, via a single named helper called immediately before every ArcticDB write. Changes: - store/arctic_store.to_arctic_safe(df) — fast-path returns input unchanged for empty / no-categorical frames; otherwise copies + casts every CategoricalDtype column to object dtype (matches PR #196's pre-#211 storage representation). Does not mutate the caller's frame. - builders/daily_append.py — wrap update_batch's UpdatePayload data and write_batch's WritePayload data. - builders/backfill.py — wrap universe_lib.write + macro_lib.write (features + raw series + sector ETFs). Macro paths never used Categorical, but uniform wrapping is single-source-of-truth and the fast path makes it free. Tests (+10, suite 802 → 812): - Categorical source → object after to_arctic_safe; values + index + column order preserved. - Input frame is not mutated (PR #211's in-memory Categorical must survive intact through the compute path). - Fast paths: returns the input object (no copy) on empty + on no-categorical frames. - Handles multiple Categorical columns (defends future writers). - Source-level call-site regressions pin the wrap at all 5 ArcticDB write sites (2× daily_append, 3× backfill). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n dry path — closes DriftDetection skip-exception) (#261) Adds a `--preflight-only` modifier to infrastructure/spot_drift_detection.sh, mirroring the merged #259 (spot_data_weekly.sh) / predictor #175 / backtester #224 pattern. Closes the DriftDetection skip-exception in ROADMAP "Friday shell-run — per-module dry-path activation" — the one per-module SF step still SKIPPED rather than dry-run on the Friday shell_run. Insertion point --------------- `PREFLIGHT_ONLY=0` modifier var initialised before the arg-parse loop (orthogonal to RUN_MODE, `set -u` safe); `--preflight-only) PREFLIGHT_ONLY=1` added to the case loop. The guard block is inserted AFTER the smoke-only block and strictly BEFORE the "# ── Full drift detection ──" section (the `run_remote bash -s <<DRIFT` heredoc) and before the trailing `aws cloudwatch put-metric-data` heartbeat. No-scan / no-write proof ------------------------ `monitoring.drift_detector` (in alpha-engine-predictor, on the sibling-clone PYTHONPATH) is the SOLE code path that does any S3 get_object/put_object of the drift report or SNS publish on alert; the launcher's CloudWatch put-metric-data heartbeat trails it. The PREFLIGHT_ONLY guard `exit 0`s strictly before the `<<DRIFT` heredoc, so the scan, the SNS publish, the S3 put_object, and the CloudWatch emit are all statically unreachable. The preflight itself runs only BasePreflight.check_env_vars (env read) + BasePreflight.check_s3_bucket (bucket HEAD) + an `importlib.import_module` of the drift module (import-only — boto3 clients + check_drift()/main() sit behind `if __name__ == "__main__"`, which an import does not trigger). Zero external API data fetch, zero S3/CW/SNS/config mutation; exit 0 because a passed preflight is a healthy outcome (SSM/SF report Success). Preflight substrate reused -------------------------- The drift workload binary lives in alpha-engine-predictor (no --preflight-only of its own; out of scope to modify here) and this repo's preflight.py DataPreflight modes (daily/morning_enrich/phase1/phase2) are data-collection scoped — none maps to drift. Per the canonical-lib fallback the preflight composes `alpha_engine_lib.preflight.BasePreflight` DIRECTLY (env-vars + S3 HEAD) — no bespoke preflight scaffolding duplicated. Verbatim flag name: `--preflight-only` Tests ----- New tests/test_spot_drift_detection_preflight_only.py (5 static greps/source-position assertions, mirroring tests/test_preflight_only_dry_path.py): flag parses as a modifier; guard precedes DRIFT + heartbeat; exit 0 before DRIFT; no scan/S3/CW/SNS in block; canonical BasePreflight reused (no scaffolding). `bash -n` clean. Full data suite: 1342 passed, 1 skipped (pre-existing), 5 pre-existing warnings. Independent of #260: that PR touches spot_data_weekly.sh + the Lambda dry-run keystone (a different file); the Saturday/Friday SF rewire to route the DriftDetection state at this `--preflight-only` flag under the Friday shell_run is a separate follow-on (no step_function.json change here). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit 9d2a2c5 into main May 12, 2026
1 check passed

cipher813 deleted the fix/arctic-write-boundary-categorical-normalizer branch May 12, 2026 21:00

cipher813 mentioned this pull request May 18, 2026

feat(data): spot_drift_detection.sh --preflight-only (Friday shell-run dry path — closes DriftDetection skip-exception) #261

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(arctic): normalize Categorical columns at every ArcticDB write boundary#224

fix(arctic): normalize Categorical columns at every ArcticDB write boundary#224
cipher813 merged 1 commit into
mainfrom
fix/arctic-write-boundary-categorical-normalizer

cipher813 commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cipher813 commented May 12, 2026

Summary

Why this shape

Test plan

Refs

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant