fix(arctic): normalize Categorical columns at every ArcticDB write boundary#224
Merged
Merged
Conversation
…undary 2026-05-12 EOD: weekly_collector.py --daily exited 1 in the ArcticDB append stage with ArcticDbNotYetImplemented: Symbol: BRK-B DataFrame/Series contains categorical data, cannot append or update Categorical columns: ['source'] Root cause: PR #211 (perf(provenance): categorical dtype for source column, ~108MB memory reduction) converted the per-row source column to pd.CategoricalDtype in features.compute._apply_daily_delta. ArcticDB's _handle_categorical_columns raises on every append/update path. PR #211 solved a real OOM (2026-05-11 MorningEnrich) but the in-memory dtype leaked through to update_batch / write_batch. Institutional fix: keep PR #211's in-memory memory win, normalize only at the storage boundary, via a single named helper called immediately before every ArcticDB write. Changes: - store/arctic_store.to_arctic_safe(df) — fast-path returns input unchanged for empty / no-categorical frames; otherwise copies + casts every CategoricalDtype column to object dtype (matches PR #196's pre-#211 storage representation). Does not mutate the caller's frame. - builders/daily_append.py — wrap update_batch's UpdatePayload data and write_batch's WritePayload data. - builders/backfill.py — wrap universe_lib.write + macro_lib.write (features + raw series + sector ETFs). Macro paths never used Categorical, but uniform wrapping is single-source-of-truth and the fast path makes it free. Tests (+10, suite 802 → 812): - Categorical source → object after to_arctic_safe; values + index + column order preserved. - Input frame is not mutated (PR #211's in-memory Categorical must survive intact through the compute path). - Fast paths: returns the input object (no copy) on empty + on no-categorical frames. - Handles multiple Categorical columns (defends future writers). - Source-level call-site regressions pin the wrap at all 5 ArcticDB write sites (2× daily_append, 3× backfill). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
May 18, 2026
…n dry path — closes DriftDetection skip-exception) (#261) Adds a `--preflight-only` modifier to infrastructure/spot_drift_detection.sh, mirroring the merged #259 (spot_data_weekly.sh) / predictor #175 / backtester #224 pattern. Closes the DriftDetection skip-exception in ROADMAP "Friday shell-run — per-module dry-path activation" — the one per-module SF step still SKIPPED rather than dry-run on the Friday shell_run. Insertion point --------------- `PREFLIGHT_ONLY=0` modifier var initialised before the arg-parse loop (orthogonal to RUN_MODE, `set -u` safe); `--preflight-only) PREFLIGHT_ONLY=1` added to the case loop. The guard block is inserted AFTER the smoke-only block and strictly BEFORE the "# ── Full drift detection ──" section (the `run_remote bash -s <<DRIFT` heredoc) and before the trailing `aws cloudwatch put-metric-data` heartbeat. No-scan / no-write proof ------------------------ `monitoring.drift_detector` (in alpha-engine-predictor, on the sibling-clone PYTHONPATH) is the SOLE code path that does any S3 get_object/put_object of the drift report or SNS publish on alert; the launcher's CloudWatch put-metric-data heartbeat trails it. The PREFLIGHT_ONLY guard `exit 0`s strictly before the `<<DRIFT` heredoc, so the scan, the SNS publish, the S3 put_object, and the CloudWatch emit are all statically unreachable. The preflight itself runs only BasePreflight.check_env_vars (env read) + BasePreflight.check_s3_bucket (bucket HEAD) + an `importlib.import_module` of the drift module (import-only — boto3 clients + check_drift()/main() sit behind `if __name__ == "__main__"`, which an import does not trigger). Zero external API data fetch, zero S3/CW/SNS/config mutation; exit 0 because a passed preflight is a healthy outcome (SSM/SF report Success). Preflight substrate reused -------------------------- The drift workload binary lives in alpha-engine-predictor (no --preflight-only of its own; out of scope to modify here) and this repo's preflight.py DataPreflight modes (daily/morning_enrich/phase1/phase2) are data-collection scoped — none maps to drift. Per the canonical-lib fallback the preflight composes `alpha_engine_lib.preflight.BasePreflight` DIRECTLY (env-vars + S3 HEAD) — no bespoke preflight scaffolding duplicated. Verbatim flag name: `--preflight-only` Tests ----- New tests/test_spot_drift_detection_preflight_only.py (5 static greps/source-position assertions, mirroring tests/test_preflight_only_dry_path.py): flag parses as a modifier; guard precedes DRIFT + heartbeat; exit 0 before DRIFT; no scan/S3/CW/SNS in block; canonical BasePreflight reused (no scaffolding). `bash -n` clean. Full data suite: 1342 passed, 1 skipped (pre-existing), 5 pre-existing warnings. Independent of #260: that PR touches spot_data_weekly.sh + the Lambda dry-run keystone (a different file); the Saturday/Friday SF rewire to route the DriftDetection state at this `--preflight-only` flag under the Friday shell_run is a separate follow-on (no step_function.json change here). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
weekly_collector.py --dailyexited 1 in the ArcticDB append stage withArcticDbNotYetImplemented: ... DataFrame/Series contains categorical data, cannot append or updateon BRK-B. PR perf(provenance): categorical dtype for source column (~108MB memory reduction) #211's Categoricalsourcecolumn leaked from the in-memory compute path intoupdate_batch.store.arctic_store.to_arctic_safe(df)— a single named helper at the ArcticDB storage boundary that stripsCategoricalDtypecolumns to object dtype (matches PR feat(provenance): per-row source column on ArcticDB universe writes #196's pre-perf(provenance): categorical dtype for source column (~108MB memory reduction) #211 storage representation).builders/daily_append.py(update_batch+write_batch), 3× inbuilders/backfill.py(universe_lib.write+ 3×macro_lib.write). Uniform wrapping is single-source-of-truth; the fast path returns the input unchanged when there's nothing to convert, so wrapping at the macro paths is free._apply_daily_deltaandcompute_and_write; only the write-bound copy gets cast.Why this shape
Institutional pattern: in-memory representations can be optimized (categoricals for compactness); the storage boundary enforces a strict contract that matches what the storage system accepts. The cast happens at exactly one named point. Same shape as JSON serialization, DataFrame→Parquet schema validation, ORM→SQL boundaries. Defends future writers (a new PR that adds another Categorical column — e.g. a
qualityflag orvendorenum — automatically gets normalized without changing the writer).Bare
.astype(str)at each call site would solve the immediate failure but encode the rule "ArcticDB doesn't accept Categorical" as folklore. The helper encodes it as code.Test plan
python -m pytest tests/test_arctic_write_contract.py -v— all 10 new tests passupdate_batchin production (first post-merge live exercise)backfill.py'suniverse_lib.write+ macro writes (first post-merge backfill exercise)Refs
sourcedtype that brokeupdate_batchsourcecolumn, the pre-perf(provenance): categorical dtype for source column (~108MB memory reduction) #211 storage representation that round-trips cleanly through ArcticDBfeedback_test_fixture_shape_must_match_production— same institutional principle (shape-validation at storage boundaries), different boundary🤖 Generated with Claude Code