fix(arctic): normalize Categorical columns at every ArcticDB write boundary #224

Merged
cipher813 merged 1 commit into main from fix/arctic-write-boundary-categorical-normalizer
May 12, 2026
Summary

Why this shape

Institutional pattern: in-memory representations can be optimized (categoricals for compactness), while the storage boundary enforces a strict contract that matches what the storage system accepts. The cast happens at exactly one named point. This is the same shape as JSON serialization, DataFrame→Parquet schema validation, and ORM→SQL boundaries. It also defends future writers: a new PR that adds another Categorical column (e.g. a quality flag or vendor enum) automatically gets normalized without changing the writer.

Bare .astype(str) at each call site would solve the immediate failure but encode the rule "ArcticDB doesn't accept Categorical" as folklore. The helper encodes it as code.

Test plan

  • python -m pytest tests/test_arctic_write_contract.py -v — all 10 new tests pass
  • Full suite: 811 passed, 1 skipped, 5 warnings (warnings pre-existing, unrelated to this PR; suite 802 → 812)
  • Source-level call-site regressions pin the wrap at every ArcticDB write site — a future revert or new unwrapped write site fails loudly
  • Tomorrow's 2026-05-13 MorningEnrich exercises update_batch in production (first post-merge live exercise)
  • Saturday SF 2026-05-16 exercises backfill.py's universe_lib.write + macro writes (first post-merge backfill exercise)

Refs

🤖 Generated with Claude Code

…undary

2026-05-12 EOD: weekly_collector.py --daily exited 1 in the ArcticDB
append stage with
  ArcticDbNotYetImplemented: Symbol: BRK-B
  DataFrame/Series contains categorical data, cannot append or update
  Categorical columns: ['source']

Root cause: PR #211 (perf(provenance): categorical dtype for source
column, ~108MB memory reduction) converted the per-row source column
to pd.CategoricalDtype in features.compute._apply_daily_delta. ArcticDB's
_handle_categorical_columns raises on every append/update path. PR #211
solved a real OOM (2026-05-11 MorningEnrich) but the in-memory dtype
leaked through to update_batch / write_batch.
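
A minimal sketch of why PR #211's dtype change saves memory, using made-up data (the labels and row count below are illustrative assumptions, not the production frame). A low-cardinality string column stored as Categorical keeps small integer codes plus one copy of each label, so it is much smaller than the same column stored as Python objects:

```python
import pandas as pd

# Hypothetical provenance labels: many rows, only three distinct values.
source = pd.Series(["vendor_a", "vendor_b", "vendor_a", "derived"] * 25_000)

as_object = source.astype(object)           # one Python string reference per row
as_categorical = source.astype("category")  # integer codes + 3 category labels

object_bytes = as_object.memory_usage(deep=True)
categorical_bytes = as_categorical.memory_usage(deep=True)

print(f"object:      {object_bytes:,} bytes")
print(f"categorical: {categorical_bytes:,} bytes")

# This is the dtype that later reached the ArcticDB write path.
is_categorical = isinstance(as_categorical.dtype, pd.CategoricalDtype)
```

The exact byte counts depend on the pandas version and platform, so none are quoted here.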

Institutional fix: keep PR #211's in-memory savings and normalize only
at the storage boundary, via a single named helper called immediately
before every ArcticDB write.

Changes:
- store/arctic_store.to_arctic_safe(df) — fast-path returns input
  unchanged for empty / no-categorical frames; otherwise copies + casts
  every CategoricalDtype column to object dtype (matches PR #196's
  pre-#211 storage representation). Does not mutate the caller's frame.
- builders/daily_append.py — wrap update_batch's UpdatePayload data
  and write_batch's WritePayload data.
- builders/backfill.py — wrap universe_lib.write + macro_lib.write
  (features + raw series + sector ETFs). Macro paths never used
  Categorical, but uniform wrapping is single-source-of-truth and the
  fast path makes it free.
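
The helper described in the first bullet could look roughly like the following. This is a sketch reconstructed from the description above, not the merged code in store/arctic_store; the body and the demo frame are assumptions.

```python
import pandas as pd


def to_arctic_safe(df: pd.DataFrame) -> pd.DataFrame:
    """Return a frame with no Categorical columns (illustrative sketch only)."""
    # Fast path 1: nothing to normalize in an empty frame.
    if df.empty:
        return df

    categorical_cols = [
        col for col, dtype in df.dtypes.items()
        if isinstance(dtype, pd.CategoricalDtype)
    ]
    # Fast path 2: no Categorical columns, so hand back the same object.
    if not categorical_cols:
        return df

    # Copy first so the caller's in-memory Categorical survives untouched.
    safe = df.copy()
    for col in categorical_cols:
        safe[col] = safe[col].astype(object)
    return safe


# Hypothetical frame standing in for one symbol's daily delta.
frame = pd.DataFrame(
    {"close": [101.5, 102.0], "source": pd.Categorical(["vendor_a", "derived"])}
)
plain = pd.DataFrame({"close": [101.5, 102.0]})

safe = to_arctic_safe(frame)
same = to_arctic_safe(plain)  # returns `plain` itself, no copy
```

Assigning into an existing column keeps the column order, which is why the sketch casts in place on the copy instead of rebuilding the frame.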

Tests (+10, suite 802 → 812):
- Categorical source → object after to_arctic_safe; values + index +
  column order preserved.
- Input frame is not mutated (PR #211's in-memory Categorical must
  survive intact through the compute path).
- Fast paths: returns the input object (no copy) on empty + on
  no-categorical frames.
- Handles multiple Categorical columns (defends future writers).
- Source-level call-site regressions pin the wrap at all 5 ArcticDB
  write sites (2× daily_append, 3× backfill).
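
One way a source-level call-site regression can be written is sketched below. It is an assumption about the approach, not the contents of tests/test_arctic_write_contract.py: the set of call names, the function `unwrapped_write_sites`, and the sample snippets are all hypothetical. The check parses source text and flags any write-like call whose arguments never mention `to_arctic_safe(`.

```python
import ast

# Assumed write-like call names, taken from the commit message above.
WRITE_CALLS = {"write", "UpdatePayload", "WritePayload"}


def unwrapped_write_sites(source: str) -> list[int]:
    """Return line numbers of write-like calls that do not wrap their data."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # lib.write(...) is an Attribute; UpdatePayload(...) is a Name.
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name in WRITE_CALLS:
            segment = ast.get_source_segment(source, node) or ""
            if "to_arctic_safe(" not in segment:
                offenders.append(node.lineno)
    return sorted(offenders)


# Hypothetical snippets; the code is only parsed, never executed.
GOOD = '''
universe_lib.write(symbol, to_arctic_safe(df))
payload = UpdatePayload(symbol, to_arctic_safe(delta))
'''
BAD = '''
universe_lib.write(symbol, to_arctic_safe(df))
macro_lib.write(name, series_df)
'''

good_offenders = unwrapped_write_sites(GOOD)
bad_offenders = unwrapped_write_sites(BAD)
```

In a real test the source would come from reading the builder modules from disk, and a non-empty offender list would fail the test.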

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 merged commit 9d2a2c5 into main May 12, 2026
1 check passed
cipher813 deleted the fix/arctic-write-boundary-categorical-normalizer branch May 12, 2026 21:00
cipher813 added a commit that referenced this pull request May 18, 2026
…n dry path — closes DriftDetection skip-exception) (#261)

Adds a `--preflight-only` modifier to infrastructure/spot_drift_detection.sh,
mirroring the merged #259 (spot_data_weekly.sh) / predictor #175 /
backtester #224 pattern. Closes the DriftDetection skip-exception in
ROADMAP "Friday shell-run — per-module dry-path activation" — the one
per-module SF step still SKIPPED rather than dry-run on the Friday shell_run.

Insertion point
---------------
`PREFLIGHT_ONLY=0` modifier var initialised before the arg-parse loop
(orthogonal to RUN_MODE, `set -u` safe); `--preflight-only) PREFLIGHT_ONLY=1`
added to the case loop. The guard block is inserted AFTER the smoke-only
block and strictly BEFORE the "# ── Full drift detection ──" section (the
`run_remote bash -s <<DRIFT` heredoc) and before the trailing
`aws cloudwatch put-metric-data` heartbeat.

No-scan / no-write proof
------------------------
`monitoring.drift_detector` (in alpha-engine-predictor, on the sibling-clone
PYTHONPATH) is the SOLE code path that does any S3 get_object/put_object of
the drift report or SNS publish on alert; the launcher's CloudWatch
put-metric-data heartbeat trails it. The PREFLIGHT_ONLY guard `exit 0`s
strictly before the `<<DRIFT` heredoc, so the scan, the SNS publish, the S3
put_object, and the CloudWatch emit are all statically unreachable. The
preflight itself runs only BasePreflight.check_env_vars (env read) +
BasePreflight.check_s3_bucket (bucket HEAD) + an `importlib.import_module`
of the drift module (import-only — boto3 clients + check_drift()/main()
sit behind `if __name__ == "__main__"`, which an import does not trigger).
Zero external API data fetch, zero S3/CW/SNS/config mutation; exit 0
because a passed preflight is a healthy outcome (SSM/SF report Success).

Preflight substrate reused
--------------------------
The drift workload binary lives in alpha-engine-predictor (no
--preflight-only of its own; out of scope to modify here) and this repo's
preflight.py DataPreflight modes (daily/morning_enrich/phase1/phase2) are
data-collection scoped — none maps to drift. Per the canonical-lib
fallback the preflight composes `alpha_engine_lib.preflight.BasePreflight`
DIRECTLY (env-vars + S3 HEAD) — no bespoke preflight scaffolding duplicated.

Verbatim flag name: `--preflight-only`

Tests
-----
New tests/test_spot_drift_detection_preflight_only.py (5 static
greps/source-position assertions, mirroring
tests/test_preflight_only_dry_path.py): flag parses as a modifier;
guard precedes DRIFT + heartbeat; exit 0 before DRIFT; no scan/S3/CW/SNS
in block; canonical BasePreflight reused (no scaffolding). `bash -n`
clean. Full data suite: 1342 passed, 1 skipped (pre-existing), 5
pre-existing warnings.
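
A source-position assertion of the kind listed above can be sketched as follows. The embedded script is a simplified, hypothetical stand-in for infrastructure/spot_drift_detection.sh, and `preflight_guard_checks` is an illustrative name; the real test file may be structured differently.

```python
# Simplified stand-in for the launcher script (assumed layout).
SCRIPT = '''\
PREFLIGHT_ONLY=0
for arg in "$@"; do
  case "$arg" in
    --preflight-only) PREFLIGHT_ONLY=1 ;;
  esac
done
if [ "$PREFLIGHT_ONLY" -eq 1 ]; then
  echo "preflight passed"
  exit 0
fi
# ── Full drift detection ──
run_remote bash -s <<DRIFT
python -m monitoring.drift_detector
DRIFT
aws cloudwatch put-metric-data --namespace Example --metric-name Heartbeat --value 1
'''


def preflight_guard_checks(script: str) -> dict[str, bool]:
    """Static checks on script text; nothing is executed."""
    guard_start = script.index('if [ "$PREFLIGHT_ONLY" -eq 1 ]')
    guard_end = script.index("\nfi\n", guard_start)
    guard_block = script[guard_start:guard_end]
    heredoc = script.index("<<DRIFT")
    heartbeat = script.index("aws cloudwatch put-metric-data")
    return {
        "flag_parsed": "--preflight-only) PREFLIGHT_ONLY=1" in script,
        "guard_before_heredoc": guard_end < heredoc,
        "guard_before_heartbeat": guard_end < heartbeat,
        "exits_zero": "exit 0" in guard_block,
        "no_side_effects": not any(
            token in guard_block for token in ("aws ", "drift_detector", "sns")
        ),
    }


results = preflight_guard_checks(SCRIPT)
print(results)
```

Because the checks compare string offsets, moving the guard below the heredoc or dropping the `exit 0` flips the corresponding entry to False without running the script.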

Independent of #260: that PR touches spot_data_weekly.sh + the Lambda
dry-run keystone (a different file); the Saturday/Friday SF rewire to
route the DriftDetection state at this `--preflight-only` flag under the
Friday shell_run is a separate follow-on (no step_function.json change here).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>