
feat(data): spot_data_weekly.sh --preflight-only (Friday shell-run dry path)#259

Merged
cipher813 merged 1 commit into main from feat/spot-data-weekly-preflight-only on May 18, 2026

Conversation

@cipher813
Owner

ROADMAP "Friday shell-run — per-module dry-path activation" owed-item #1.

Under the Friday shell_run, the DataPhase1/MorningEnrich + RAGIngestion spot states now boot the spot for real, run their EXISTING preflight + imports + AWS/SSM/ArcticDB connectivity checks, then exit 0 with ZERO external API data fetch and ZERO S3/ArcticDB/config/email/SNS writes — catching bootstrap-class breakage (lib-pin drift, sys.path collision, stale ArcticDB symbol, SSM timeout, Dockerfile/image gap) ~12h before the real Saturday run. Reuses the existing preflight substrate; no parallel preflight written.

Where the gate sits + zero-fetch / zero-write proof

weekly_collector.py — new --preflight-only argparse flag. main() exits immediately after the existing DataPreflight(config["bucket"], mode).run() and strictly before run_weekly(config, args) via raise SystemExit(0). run_weekly() is the sole function in the module that performs any collector fetch (polygon/FMP/FRED/yfinance) or any S3/ArcticDB/parquet/config/module-health write — gating in front of it makes every fetch/write code path statically unreachable. The preflight itself only does read-only/auth probes (S3 HEAD, polygon/FRED reference-data auth calls that fetch no collector data, ArcticDB list_libraries) plus a self-cleaning S3 PUT+DELETE sentinel under preflight/ (the preflight's own liveness probe, not a data write). Ordering pinned by an AST-source test.
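The gate ordering described above can be sketched as follows. This is a minimal illustration, not the actual weekly_collector.py: the `DataPreflight` and `run_weekly` bodies are stand-ins, the `calls` list exists only to show which stages ran, and the flag-to-mode mapping and the `example-bucket` config value are assumptions.

```python
import argparse

calls = []  # illustration only: records which stages actually ran


class DataPreflight:
    """Hypothetical stand-in for the real preflight class."""

    def __init__(self, bucket, mode):
        self.bucket = bucket
        self.mode = mode

    def run(self):
        calls.append(f"preflight:{self.mode}")


def run_weekly(config, args):
    """Stand-in for the sole function that fetches and writes."""
    calls.append("run_weekly")


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--phase", type=int, default=None)
    parser.add_argument("--morning-enrich", action="store_true")
    parser.add_argument("--preflight-only", action="store_true")
    args = parser.parse_args(argv)

    config = {"bucket": "example-bucket"}  # placeholder config
    # Assumed mapping from flags to preflight mode.
    mode = "morning_enrich" if args.morning_enrich else "phase1"

    DataPreflight(config["bucket"], mode).run()
    if args.preflight_only:
        # Gate: exit before run_weekly, so no fetch/write path is reachable.
        raise SystemExit(0)
    run_weekly(config, args)
```

Calling `main(["--phase", "1", "--preflight-only"])` in this sketch raises `SystemExit` with code 0 after the preflight stand-in has run and before `run_weekly` is reached.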

rag/pipelines/run_weekly_ingestion.sh — new --preflight-only flag. Exits 0 after Step 0 (python -m rag.preflight: check_env_vars + check_s3_bucket HEAD — read-only) and strictly before Step 1 (ingest_sec_filings). Every ingest_* pipeline, Voyage embedding call, and Postgres/pgvector + parquet write lives in Steps 1-9 — all unreachable once the guard exits.

infrastructure/spot_data_weekly.sh — new --preflight-only flag sets PREFLIGHT_ONLY=1, a modifier orthogonal to RUN_MODE so it composes with the data path AND --rag-only. A dedicated data-path block runs weekly_collector.py --morning-enrich --preflight-only and/or weekly_collector.py --phase 1 --preflight-only (gated by the existing DO_MORNING_ENRICH/DO_PHASE1 preflight-task-split) then exit 0 before the real WORKLOADS heredoc — no prune (prune-audit JSON write), no RAG, no CloudWatch heartbeat, no S3 log upload.

--rag-only --preflight-only behavior

Runs ONLY the RAG-path preflight: boot + SSM secret fetch (so rag.preflight's check_env_vars sees the 4 RAG secrets) + run_weekly_ingestion.sh --preflight-only (step-0-only + exit 0). No real RAG ingestion, no rag-ingestion heartbeat. --preflight-only alone runs ONLY the DataPhase1/MorningEnrich preflight.
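How the modifier composes with the run mode can be modelled in a few lines. The launcher itself is a shell script, so this Python function is only a model of the behavior described above; the function name, its parameters, and the placeholder string for the real workload are hypothetical.

```python
def planned_commands(rag_only: bool, preflight_only: bool,
                     do_morning_enrich: bool = True,
                     do_phase1: bool = True) -> list[str]:
    """Model which commands the launcher would run for a flag combination."""
    if not preflight_only:
        # The real workload path is out of scope for this model.
        return ["<real workload, not modelled here>"]
    if rag_only:
        # --rag-only --preflight-only: RAG preflight only (step 0, then exit 0).
        return ["run_weekly_ingestion.sh --preflight-only"]
    # --preflight-only alone: data-path preflight(s), gated by the task split.
    commands = []
    if do_morning_enrich:
        commands.append("weekly_collector.py --morning-enrich --preflight-only")
    if do_phase1:
        commands.append("weekly_collector.py --phase 1 --preflight-only")
    return commands
```

In this model, `planned_commands(rag_only=True, preflight_only=True)` yields only the RAG preflight command, and `planned_commands(rag_only=False, preflight_only=True, do_morning_enrich=False)` yields only the phase 1 preflight command.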

Universe-freshness tolerance note (ROADMAP owed-item #5)

The Friday shell-run uses the phase1 / morning_enrich preflight modes. Per preflight.py::DataPreflight.run, neither runs check_arcticdb_fresh — they only do _check_arcticdb_libraries_present (a presence read, not a freshness gate). morning_enrich deliberately omits freshness (it is part of what makes ArcticDB fresh); phase1 populates ArcticDB. The only freshness gate (check_arcticdb_fresh macro/SPY 4d) lives in the daily mode, which the Saturday/Friday data path never selects. So a Friday run predating Friday's settled polygon aggregate does not spuriously fail on a Thursday-last-bar — no --preflight-only-scoped tolerance code is required for the data path. Documented inline so a future mode-mapping change re-audits this invariant.
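The invariant in this note can be written down as a small lookup. The table below is a sketch that lists only the ArcticDB checks named in this description; the real DataPreflight.run may perform other checks per mode, and the constant and function names here are hypothetical.

```python
# ArcticDB-related checks per preflight mode, as described in this note only.
ARCTICDB_CHECKS_BY_MODE = {
    "daily": ("check_arcticdb_fresh",),
    "morning_enrich": ("_check_arcticdb_libraries_present",),
    "phase1": ("_check_arcticdb_libraries_present",),
}

# Modes the Friday shell-run data path selects.
FRIDAY_SHELL_RUN_MODES = ("phase1", "morning_enrich")


def modes_with_freshness_gate(modes):
    """Return the modes that would run the freshness gate."""
    return [
        mode for mode in modes
        if "check_arcticdb_fresh" in ARCTICDB_CHECKS_BY_MODE[mode]
    ]


# The Friday modes never hit the freshness gate, so a Thursday last bar
# cannot fail the dry path.
assert modes_with_freshness_gate(FRIDAY_SHELL_RUN_MODES) == []
```

If a future change maps the Friday path to a mode that includes the freshness gate, an assertion of this shape would fail and prompt the re-audit mentioned above.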

Tests

New tests/test_preflight_only_dry_path.py (10 tests, static greps + AST-source assertions, matching the existing test_spot_data_weekly_run_modes.py / test_weekly_collector_preflight_mode_mapping.py convention): flag parsing on all 3 files, the exit-0-after-preflight-before-fetch/write ordering invariant, --rag-only --preflight-only step-0-only behavior, and the no-prune/no-RAG/no-heartbeat/no-S3-upload hard invariant.
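An AST-source ordering assertion of the kind described could look roughly like this. It is a sketch and not the contents of tests/test_preflight_only_dry_path.py: the embedded `SOURCE` is a simplified stand-in for `main()`, and the helper names are hypothetical.

```python
import ast

# Simplified stand-in for the relevant part of main(); never executed.
SOURCE = '''
def main():
    args = parse_args()
    config = load_config()
    DataPreflight(config["bucket"], mode).run()
    if args.preflight_only:
        raise SystemExit(0)
    run_weekly(config, args)
'''


def is_call_to(name):
    """Build a predicate matching a direct call to the given name."""
    def check(node):
        return (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == name)
    return check


def is_system_exit_raise(node):
    """Match a `raise SystemExit(...)` statement."""
    return (isinstance(node, ast.Raise)
            and isinstance(node.exc, ast.Call)
            and isinstance(node.exc.func, ast.Name)
            and node.exc.func.id == "SystemExit")


def first_line(tree, predicate):
    """Line number of the first matching node; ValueError if none match."""
    return min(node.lineno for node in ast.walk(tree) if predicate(node))


tree = ast.parse(SOURCE)
preflight_line = first_line(tree, is_call_to("DataPreflight"))
exit_line = first_line(tree, is_system_exit_raise)
run_weekly_line = first_line(tree, is_call_to("run_weekly"))

# Ordering invariant: preflight, then the exit gate, then run_weekly.
assert preflight_line < exit_line < run_weekly_line
```

Because the check works on parsed source rather than a live run, it needs no AWS credentials or network access, which matches the static-test convention named above.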

  • Full suite: 1229 passed, 1 skipped (pre-existing skip)
  • bash -n clean on both shell scripts
  • No new deps, no secrets

Flag name (verbatim, for the SF keystone follow-on): --preflight-only

🤖 Generated with Claude Code

…y path)

ROADMAP "Friday shell-run — per-module dry-path activation" owed-item #1.
Under the Friday shell_run, the DataPhase1/MorningEnrich + RAGIngestion
spot states now boot the spot for real, run their EXISTING preflight,
then exit 0 with ZERO external API data fetch and ZERO
S3/ArcticDB/config/email/SNS writes — catching bootstrap-class breakage
(lib-pin drift, sys.path collision, stale ArcticDB symbol, SSM timeout,
Dockerfile/image gap) ~12h before the real Saturday run.

Reuses the existing preflight substrate; no parallel preflight written.

Where the gate sits / zero-fetch zero-write proof:

- weekly_collector.py: new `--preflight-only` argparse flag. main()
  exits HERE — `raise SystemExit(0)` immediately after the existing
  `DataPreflight(config["bucket"], mode).run()` and strictly BEFORE
  `run_weekly(config, args)`. run_weekly() is the SOLE function in the
  module that performs ANY collector fetch (polygon/FMP/FRED/yfinance)
  or ANY S3/ArcticDB/parquet/config/module-health write — gating in
  front of it makes every fetch/write code path statically unreachable.
  The preflight itself only does read-only/auth probes (S3 HEAD,
  polygon/FRED reference-data auth calls that fetch no collector data,
  ArcticDB list_libraries) plus a self-cleaning S3 PUT+DELETE sentinel
  under preflight/ (the preflight's own liveness probe, not a data
  write). Ordering pinned by an AST-source test.

- rag/pipelines/run_weekly_ingestion.sh: new `--preflight-only` flag.
  Exits 0 after Step 0 (`python -m rag.preflight`: check_env_vars +
  check_s3_bucket HEAD — read-only, zero fetch, zero write) and strictly
  BEFORE Step 1 (ingest_sec_filings). Every ingest_* pipeline, Voyage
  embedding call, and Postgres/pgvector + parquet write lives in Steps
  1-9 — all unreachable once the guard exits.

- infrastructure/spot_data_weekly.sh: new `--preflight-only` flag sets
  PREFLIGHT_ONLY=1, a MODIFIER orthogonal to RUN_MODE so it composes
  with the data path AND --rag-only. A dedicated data-path block runs
  `weekly_collector.py --morning-enrich --preflight-only` and/or
  `weekly_collector.py --phase 1 --preflight-only` (gated by the
  existing DO_MORNING_ENRICH/DO_PHASE1 split) then exit 0 before the
  real WORKLOADS heredoc — no prune (prune-audit JSON write), no RAG,
  no CloudWatch heartbeat, no S3 log upload.

--rag-only --preflight-only behavior: runs ONLY the RAG-path preflight
(boot + SSM secret fetch so rag.preflight's check_env_vars sees them +
`run_weekly_ingestion.sh --preflight-only` = step-0-only + exit 0). No
real RAG ingestion, no rag-ingestion heartbeat. `--preflight-only` alone
runs ONLY the DataPhase1/MorningEnrich preflight.

Universe-freshness tolerance note (ROADMAP owed-item #5): the Friday
shell-run uses the phase1 / morning_enrich preflight modes. Per
preflight.py::DataPreflight.run, NEITHER mode runs check_arcticdb_fresh
— they only do _check_arcticdb_libraries_present (a presence read, not a
freshness gate). morning_enrich deliberately omits freshness (it is part
of what *makes* ArcticDB fresh); phase1 *populates* ArcticDB. The only
freshness gate (check_arcticdb_fresh macro/SPY 4d) lives in the "daily"
mode, which the Saturday/Friday data path never selects. So a Friday run
predating Friday's settled polygon aggregate does NOT spuriously fail on
a Thursday-last-bar — no --preflight-only-scoped tolerance code is
required for the data path. Documented inline so a future mode-mapping
change re-audits this invariant.

Tests: new tests/test_preflight_only_dry_path.py (10 tests, static
greps + AST-source assertions, matching the existing
test_spot_data_weekly_run_modes.py /
test_weekly_collector_preflight_mode_mapping.py convention) pins: flag
parsing on all 3 files, the exit-0-after-preflight-before-fetch/write
ordering invariant, --rag-only --preflight-only step-0-only behavior,
and the no-prune/no-RAG/no-heartbeat/no-S3-upload hard invariant. Full
suite: 1229 passed, 1 skipped (pre-existing). bash -n clean on both
shell scripts. No new deps, no secrets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit dece8ec into main May 18, 2026
1 check passed
@cipher813 cipher813 deleted the feat/spot-data-weekly-preflight-only branch May 18, 2026 20:18
cipher813 added a commit that referenced this pull request May 18, 2026
…n dry path — closes DriftDetection skip-exception) (#261)

Adds a `--preflight-only` modifier to infrastructure/spot_drift_detection.sh,
mirroring the merged #259 (spot_data_weekly.sh) / predictor #175 /
backtester #224 pattern. Closes the DriftDetection skip-exception in
ROADMAP "Friday shell-run — per-module dry-path activation" — the one
per-module SF step still SKIPPED rather than dry-run on the Friday shell_run.

Insertion point
---------------
`PREFLIGHT_ONLY=0` modifier var initialised before the arg-parse loop
(orthogonal to RUN_MODE, `set -u` safe); `--preflight-only) PREFLIGHT_ONLY=1`
added to the case loop. The guard block is inserted AFTER the smoke-only
block and strictly BEFORE the "# ── Full drift detection ──" section (the
`run_remote bash -s <<DRIFT` heredoc) and before the trailing
`aws cloudwatch put-metric-data` heartbeat.
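
A source-position check for this insertion point could be sketched as below. The embedded script is a hypothetical, heavily simplified stand-in for spot_drift_detection.sh (the guard condition, the echo lines, and the heartbeat arguments are placeholders); only the marker strings quoted above are taken from this message.

```python
# Simplified stand-in for the launcher; never executed, only searched.
SCRIPT = '''\
PREFLIGHT_ONLY=0
for arg in "$@"; do
  case "$arg" in
    --preflight-only) PREFLIGHT_ONLY=1 ;;
  esac
done
if [ "$PREFLIGHT_ONLY" = "1" ]; then
  echo "preflight passed"
  exit 0
fi
# ── Full drift detection ──
run_remote bash -s <<DRIFT
echo "scan"
DRIFT
aws cloudwatch put-metric-data --namespace Example --metric-name Heartbeat --value 1
'''


def guard_precedes_workload(script: str) -> bool:
    """True if init, guard, exit 0, heredoc, and heartbeat appear in order."""
    init = script.index("PREFLIGHT_ONLY=0")
    parse = script.index("--preflight-only) PREFLIGHT_ONLY=1")
    guard = script.index('if [ "$PREFLIGHT_ONLY" = "1" ]')
    exit_zero = script.index("exit 0", guard)
    heredoc = script.index("<<DRIFT")
    heartbeat = script.index("aws cloudwatch put-metric-data")
    return init < parse < guard < exit_zero < heredoc < heartbeat


assert guard_precedes_workload(SCRIPT)
```

`str.index` raises `ValueError` when a marker is missing, so a renamed or deleted guard fails the check loudly instead of passing by accident.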

No-scan / no-write proof
------------------------
`monitoring.drift_detector` (in alpha-engine-predictor, on the sibling-clone
PYTHONPATH) is the SOLE code path that does any S3 get_object/put_object of
the drift report or SNS publish on alert; the launcher's CloudWatch
put-metric-data heartbeat trails it. The PREFLIGHT_ONLY guard `exit 0`s
strictly before the `<<DRIFT` heredoc, so the scan, the SNS publish, the S3
put_object, and the CloudWatch emit are all statically unreachable. The
preflight itself runs only BasePreflight.check_env_vars (env read) +
BasePreflight.check_s3_bucket (bucket HEAD) + an `importlib.import_module`
of the drift module (import-only — boto3 clients + check_drift()/main()
sit behind `if __name__ == "__main__"`, which an import does not trigger).
Zero external API data fetch, zero S3/CW/SNS/config mutation; exit 0
because a passed preflight is a healthy outcome (SSM/SF report Success).
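
The import-only property relied on here can be demonstrated in isolation. The module below is a throwaway stand-in written to a temporary directory, not monitoring.drift_detector; its name and contents are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Throwaway module: main() only runs when executed as a script.
MODULE_SOURCE = '''
ran_main = False


def main():
    global ran_main
    ran_main = True


if __name__ == "__main__":
    main()
'''

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "fake_drift_detector.py").write_text(MODULE_SOURCE)
    sys.path.insert(0, tmp)
    try:
        # Import executes top-level statements but not the __main__ block.
        module = importlib.import_module("fake_drift_detector")
    finally:
        sys.path.remove(tmp)

# The guarded main() did not run during import.
assert module.ran_main is False
```

Top-level statements still execute on import, so the zero-side-effect claim holds only because, as stated above, the boto3 clients and check_drift()/main() sit behind the `__main__` guard.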

Preflight substrate reused
--------------------------
The drift workload binary lives in alpha-engine-predictor (no
--preflight-only of its own; out of scope to modify here) and this repo's
preflight.py DataPreflight modes (daily/morning_enrich/phase1/phase2) are
data-collection scoped — none maps to drift. Per the canonical-lib
fallback the preflight composes `alpha_engine_lib.preflight.BasePreflight`
DIRECTLY (env-vars + S3 HEAD) — no bespoke preflight scaffolding duplicated.

Verbatim flag name: `--preflight-only`

Tests
-----
New tests/test_spot_drift_detection_preflight_only.py (5 static
greps/source-position assertions, mirroring
tests/test_preflight_only_dry_path.py): flag parses as a modifier;
guard precedes DRIFT + heartbeat; exit 0 before DRIFT; no scan/S3/CW/SNS
in block; canonical BasePreflight reused (no scaffolding). `bash -n`
clean. Full data suite: 1342 passed, 1 skipped (pre-existing), 5
pre-existing warnings.

Independent of #260: that PR touches spot_data_weekly.sh + the Lambda
dry-run keystone (a different file); the Saturday/Friday SF rewire to
route the DriftDetection state at this `--preflight-only` flag under the
Friday shell_run is a separate follow-on (no step_function.json change here).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>