Hard-fail daily_append + propagate exit code in weekday SSM #24
Merged
Root cause: ArcticDB universe library last wrote 2026-04-12. No writes on
4/13 or 4/14 even though the weekday Step Function's DailyData step
reported SUCCEEDED both days.
Two silent-fail chains were masking this:
1. `daily_append.py` swallowed `universe_lib.read(ticker)` exceptions at
`log.debug` level and counted them as `n_skip`, so an ArcticDB-wide
auth/URI failure would report `status=ok` with `tickers_appended=0`.
Same pattern on macro-series reads, per-ticker appends, and macro
bar appends. `_load_daily_closes` returned `{}` on any S3 error.
2. The weekday Step Function's DailyData SSM command was
`python ... | tee ... ; echo EXIT_CODE=$?`. The final `echo` is what
SSM sees, so the shell always exited 0 regardless of Python's exit
code. Saturday's `alpha-engine-saturday-pipeline` already uses
`set -eo pipefail`; the weekday one was never updated. Weekday also
did not git-pull, so any post-Saturday change on EC2 was invisible.
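The exit-code masking described in item 2 can be reproduced locally. The following is a minimal sketch, assuming `bash` is on `PATH`; the failing child process and the `/dev/null` tee target are stand-ins for illustration, not the actual SSM command.

```python
import shlex
import subprocess
import sys

# Stand-in for the real collector: a Python process that exits with code 3.
FAILING = f"{shlex.quote(sys.executable)} -c 'import sys; sys.exit(3)'"

# Old weekday shape: the trailing echo is the last command, so bash exits 0.
OLD = f"{FAILING} | tee /dev/null ; echo EXIT_CODE=$?"

# Hardened shape: pipefail surfaces the failing stage and -e stops on it.
NEW = f"set -eo pipefail; {FAILING} | tee /dev/null"


def run(script: str) -> subprocess.CompletedProcess:
    """Run a script under bash and capture its output."""
    return subprocess.run(["bash", "-c", script], capture_output=True, text=True)


old, new = run(OLD), run(NEW)
print("old:", old.returncode, old.stdout.strip())
print("new:", new.returncode)
```

Without `pipefail`, `$?` after the pipeline holds the status of `tee`, so even the logged `EXIT_CODE` value hides the Python failure.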
Changes:
- `builders/daily_append.py`: `_load_daily_closes` raises on missing
file / zero rows; macro reads raise on SPY missing and warn on others;
per-ticker read failures count `n_err` (not `n_skip`) at warning level;
the function raises `RuntimeError` if `n_ok == 0` or `n_err > 5%` of
stock tickers. Returns structured result on success.
- `weekly_collector.py`: `_run_daily` now uses `logger.exception` so the
traceback reaches SSM logs; status propagation to `main()`'s
`SystemExit(1)` path unchanged.
- `infrastructure/step_function_daily.json`: `set -eo pipefail` + weekday
git pull, matching the Saturday step. Dropped the `echo EXIT_CODE` line
so the final command is actually the Python script.
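The counting policy in the `builders/daily_append.py` bullet can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: `append_universe`, `read_fn`, and `append_fn` are hypothetical names, and the real function handles more cases (skips, macro series).

```python
import logging

log = logging.getLogger("daily_append_sketch")

MAX_ERR_RATE = 0.05  # 5% tolerance from the commit message


def append_universe(tickers, read_fn, append_fn):
    """Append one bar per ticker; failures count as n_err, never as skips."""
    n_ok = n_err = 0
    for ticker in tickers:
        try:
            append_fn(ticker, read_fn(ticker))
            n_ok += 1
        except Exception as exc:
            # Previously this was logged at debug level and counted as n_skip.
            log.warning("append failed for %s: %s", ticker, exc)
            n_err += 1
    if n_ok == 0 or n_err > MAX_ERR_RATE * len(tickers):
        raise RuntimeError(f"daily_append failed: n_ok={n_ok} n_err={n_err}")
    return {"status": "ok", "tickers_appended": n_ok, "n_err": n_err}


def broken_read(ticker):
    """Hypothetical reader simulating a library-wide auth failure."""
    raise ConnectionError("auth failure")
```

With `broken_read`, every ticker lands in `n_err` and the call raises instead of returning `status=ok` with `tickers_appended=0`.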
Remaining silent-fail patterns across the rest of alpha-engine-data are
deferred to a repo-wide audit PR (tracked).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses inconsistency in the first commit: non-SPY macro series (VIX, VIX3M, TNX, IRX, GLD, USO) and sector ETF reads/writes were log.warning + continue, which is visible but doesn't halt.
Per feedback_hard_fail_until_stable, every non-ok condition must exit non-zero while Alpha Engine is unstable. A missing VIX silently produces zero-valued regime-interaction features; a missing sector ETF silently corrupts features for every stock in that sector.
All macro + sector ETF read failures and append failures now raise RuntimeError. Per-ticker universe read/append keeps its 5% tolerance, since individual tickers can legitimately be new / not-yet-backfilled.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request on Apr 14, 2026
Follow-up to PR #24 (daily_append). Audits the rest of the alpha-engine-data daily production path for the same "except Exception: log.debug / pass" pattern that masked ArcticDB staleness for two days. Scope limited to files on the DailyData + feature-store write path; RAG ingestion, emailer, and fundamentals scan deferred to a later sweep.
features/compute.py
- `_load_daily_closes_delta`: per-date NoSuchKey upgraded to WARNING; every other S3 exception raises; raise if ALL dates in the business-day range were missing (the fingerprint of an upstream daily_closes outage).
- Per-ticker feature computation failure: log.debug → log.warning; new RuntimeError if `n_err / len(universe_tickers) > 5%`, matching daily_append.
- Empty store_rows now raises instead of returning `status=error` (status return is legacy; raising is consistent with hard-fail).
- `_load_cached_alternative` outer except: log.debug → log.warning so auth / network failures surface even when "no alt data" is expected.
- Schema version write: log.debug → log.warning with a comment explaining why this one stays non-raising (drift-check metadata, not feature data).
collectors/daily_closes.py
- head_object idempotency check: bare `except Exception: pass` → catch only ClientError with 404/NoSuchKey. Auth / throttling now raises instead of silently proceeding.
- Per-ticker yfinance extract: log.debug → log.warning so partial yfinance failures are visible in the daily log.
collectors/macro.py
- `load_from_s3`: pointer-missing still returns None (expected), but every other error now raises instead of masquerading as "no data."
weekly_collector.py
- S3 constituents load fallback: bare `except Exception: pass` → warning with context. The Wikipedia fallback remains (legitimate failover).
- Wikipedia constituents failure: bare `except Exception: pass` → ERROR log. The downstream `if not tickers` guard already hard-fails.
Out of scope (tracked for follow-up audit)
- rag/* (SEC / 8-K / earnings / theses ingestion)
- emailer.py `except Exception: pass` in finalize email
- features/compute.py fundamentals/alternative per-key fallback chains
- collectors/fundamentals.py per-ticker fetch log.debug
Dead code flagged for later removal
- features/reader.py: read_feature_snapshot / read_feature_range / latest_available_date / read_registry. No callers remain inside alpha-engine-data. Consumers in sibling repos (predictor / backtester / research) will migrate away as the ArcticDB cutover completes; safe to delete once the cross-repo migration is confirmed clean.
Tests: 41/41 pass. No new tests added: the hard-fail paths are essentially `if cond: raise` and are better exercised by the existing integration test suite against a live S3 bucket (follow-up).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
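The `_load_daily_closes_delta` rule above (tolerate individual missing dates, propagate every other error, fail if the whole range is missing) could look roughly like this. It is a sketch under assumptions: `MissingKey` is a hypothetical stand-in for an S3 NoSuchKey error, `fetch` is an injected loader, and holidays are ignored when building the business-day range.

```python
import logging
from datetime import date, timedelta

log = logging.getLogger("compute_sketch")


class MissingKey(Exception):
    """Hypothetical stand-in for an S3 NoSuchKey error."""


def business_days(start: date, end: date):
    """Yield weekdays (Mon-Fri) in [start, end]; holidays are ignored here."""
    day = start
    while day <= end:
        if day.weekday() < 5:
            yield day
        day += timedelta(days=1)


def load_closes_delta(start: date, end: date, fetch):
    """Load per-date closes; tolerate gaps, but not a fully empty range."""
    days = list(business_days(start, end))
    loaded = {}
    for day in days:
        try:
            loaded[day] = fetch(day)
        except MissingKey:
            log.warning("daily_closes missing for %s", day)
        # Any other exception (auth, throttling, network) propagates.
    if days and not loaded:
        raise RuntimeError(f"no daily_closes for any date in {start}..{end}")
    return loaded


def always_missing(day):
    """Hypothetical loader simulating an upstream daily_closes outage."""
    raise MissingKey(str(day))
```

Calling `load_closes_delta(date(2026, 4, 10), date(2026, 4, 14), always_missing)` raises `RuntimeError`, while a loader that misses only one of the three weekdays returns the other two with a warning.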
cipher813 added a commit that referenced this pull request on Apr 14, 2026
Hardened failures from PRs #24 + #25 now raise cleanly up through _run_daily and weekly_collector.main(). The pipeline exits non-zero, but there's no alerting: a 6 AM failure is only visible if someone manually reads CloudWatch / the SSM log. Flow-doctor closes that loop: any ERROR-level log (including traceback-emitting exceptions from logger.exception) fires email + a deduped GitHub issue.
Mirrors the executor's integration (alpha-engine/executor/log_config.py, alpha-engine/flow-doctor.yaml) verbatim:
- requirements.txt: flow-doctor[diagnosis]>=0.3.0,<0.4.0
- log_config.py: FlowDoctor singleton + FlowDoctorHandler at ERROR level, attached to the root logger when FLOW_DOCTOR_ENABLED=1. Import is inside _attach_flow_doctor so local dev without the dep installed still works.
- flow-doctor.yaml.example: committed template. Real file is gitignored; it will be staged from alpha-engine-config at deploy time (same pattern as predictor.yaml, risk.yaml).
Out of scope for this PR (deployment steps, user action required)
- Add flow-doctor.yaml to alpha-engine-config repo at path matching what the Step Functions expect.
- Verify FLOW_DOCTOR_ENABLED=1 and FLOW_DOCTOR_GITHUB_TOKEN are already in /home/ec2-user/.alpha-engine.env (executor already uses these, so likely yes, but needs confirmation).
- First live fire: next daily or Saturday Step Function run. Expect an email + issue if any of the new hard-fail paths trigger.
Out of scope (tracked)
- Same integration for predictor Lambda (different deploy model: needs flow-doctor packaged into the Lambda image or layer).
- Same integration for research Lambda.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
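The handler wiring described for log_config.py follows a standard `logging` pattern, sketched below with the standard library only. `AlertHandler` is a hypothetical stand-in for FlowDoctorHandler (the flow-doctor API itself is not shown here), and the demo sets `FLOW_DOCTOR_ENABLED` in-process purely for illustration.

```python
import logging
import os


class AlertHandler(logging.Handler):
    """Hypothetical stand-in for FlowDoctorHandler: collects ERROR records."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.alerts = []

    def emit(self, record: logging.LogRecord) -> None:
        # A real handler would send email / open a deduped issue here.
        self.alerts.append(self.format(record))


def attach_alerts(root: logging.Logger):
    """Attach the handler only when FLOW_DOCTOR_ENABLED=1; else return None."""
    if os.environ.get("FLOW_DOCTOR_ENABLED") != "1":
        return None
    handler = AlertHandler()
    root.addHandler(handler)
    return handler


os.environ["FLOW_DOCTOR_ENABLED"] = "1"  # demo only; normally set in the env file
handler = attach_alerts(logging.getLogger())
log = logging.getLogger("weekly_collector")
try:
    raise RuntimeError("daily_append failed")
except RuntimeError:
    log.exception("DailyData failed")  # ERROR level, traceback included
log.warning("below ERROR, so no alert is recorded")
```

Because the handler sits on the root logger, any module's `logger.exception` call reaches it through propagation, and the formatted record carries the traceback.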
cipher813 added a commit that referenced this pull request on Apr 14, 2026
Addresses the class of failure surfaced 2026-04-14: daily_append silently not writing to ArcticDB for two weekdays because arcticdb wasn't in the deploy image and the import error was swallowed. A preflight check catches this in ~1s instead of letting the pipeline run to "success" with stale data.
Pattern D (simplest): inline _preflight() in weekly_collector.py, called from main() after config load, before run_weekly(). No new files, no shared library. If the check pattern proves valuable across the other modules, we can extract a common helper later, but the per-repo checks are small enough (~30 LOC) that inlining is fine for now.
Scoped to external-world handshakes (env vars, S3, ArcticDB), NOT correctness of the collection itself. The hardened collectors from PRs #24 + #25 still own data-integrity hard-fails.
Checks by mode
- daily: AWS_REGION env, S3 bucket reachable, ArcticDB universe library readable, SPY freshness ≤ 4 days (covers Fri → Tue long weekend + buffer). 4-day stale SPY would have caught today's bug on 2026-04-14 instead of letting Friday's write look healthy until Saturday.
- phase1: AWS_REGION + FRED_API_KEY + POLYGON_API_KEY + S3 reachable.
- phase2: AWS_REGION + FMP_API_KEY + EDGAR_IDENTITY + S3 reachable.
Failures raise RuntimeError. main() already exits 1 on any SystemExit path, and flow-doctor (#26, once deployed) will dispatch the corresponding ERROR log as email + GitHub issue.
Out of scope (tracked)
- Same pattern in predictor inference + training, research Lambda, executor entrypoints, backtester. Rolling out after this first consumer proves the shape.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
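Two of the daily-mode checks (required env vars and SPY freshness) are simple enough to sketch. The function names below are hypothetical, the S3 and ArcticDB probes are omitted, and measuring staleness in calendar days is an assumption; the real `_preflight()` may differ.

```python
import os
from datetime import date

MAX_SPY_STALENESS_DAYS = 4  # threshold from the commit message


def check_env_vars(names, env=os.environ):
    """Raise if any required environment variable is unset or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"preflight: missing env vars: {', '.join(missing)}")


def check_freshness(last_write: date, today: date, max_days=MAX_SPY_STALENESS_DAYS):
    """Raise if the latest SPY row is more than max_days calendar days old."""
    age = (today - last_write).days
    if age > max_days:
        raise RuntimeError(f"preflight: SPY is {age} days stale (max {max_days})")


# Friday write checked the following Tuesday: exactly 4 days, so this passes.
check_env_vars(["AWS_REGION"], env={"AWS_REGION": "us-east-1"})
check_freshness(date(2026, 4, 10), today=date(2026, 4, 14))
```

Passing `env` and `today` explicitly keeps the checks testable without touching the real environment or clock.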
cipher813 added a commit that referenced this pull request on Apr 14, 2026
Applies the PR #24/#25/#28 pattern to the Saturday RAG ingestion path. The shell script was silently swallowing failures in all 5 pipelines, making "RAG Weekly Ingestion Complete" a lie whenever any step failed.
Shell script (rag/pipelines/run_weekly_ingestion.sh)
- Removed `|| echo "WARNING: ... (non-fatal)"` from all 5 ingestion steps, the CloudWatch heartbeat, and the completion email. set -e was already active but these swallowers defeated it.
- Removed the runtime `if [ -n "$FINNHUB_API_KEY" ]; then ... else echo SKIPPED fi` branch. All required env vars are now hard-failed by preflight (step 0) before any ingestion runs. A silently-skipped earnings transcript step defeats the purpose of having transcripts at all.
- Added `Step 0/5: python -m rag.preflight` at the top.
- The hardcoded 'status: ok' completion email is now truthful rather than aspirational: with set -e active and no swallowers, reaching the email means all 5 pipelines actually succeeded.
New file: rag/preflight.py
- RAGPreflight(BasePreflight) subclass: composes check_env_vars (AWS_REGION, VOYAGE_API_KEY, FINNHUB_API_KEY, EDGAR_IDENTITY, RAG_DATABASE_URL) + check_s3_bucket.
- main() uses alpha-engine-lib's setup_logging with the shared flow-doctor.yaml path, so a preflight failure fires email + issue via the existing dispatch.
rag/db.py
- is_available(): log.debug → log.warning for the exception path. The function was otherwise unchanged; it's a non-raising probe for future retrieval-side consumers. Flagged as unused inside alpha-engine-data (zero callers); defer deletion until cross-repo audit completes, since predictor / research / backtester may import from it.
rag/pipelines/ingest_8k_filings.py
- Per-URL download failure: log.debug → log.warning. Caller still treats None as "skip this filing" (aggregate counts are reported), so no behavior change; the failure rate is just visible now.
Dead code flagged (no change in this PR)
- rag/db.py::is_available: zero local callers. Keep for now, flag for future cross-repo sweep.
Out of scope (tracked)
- Adopt alpha-engine-lib setup_logging in each ingestion script's main() for consistent log formatting + flow-doctor capture of per-pipeline errors. Currently only preflight.py uses the lib; ingestion scripts still use Python's default root logger. Minor follow-up.
- Date-parsing `except ValueError: continue` patterns in ingest_sec_filings, ingest_8k_filings, ingest_theses, ingest_earnings_transcripts. Reviewed case-by-case: all are legitimate "skip this malformed entry" flows with aggregate counts upstream. Not silent fails.
Test plan
- [x] pytest tests/ (41 pass)
- [x] Syntax check on all modified Python files
- [x] bash -n on run_weekly_ingestion.sh
- [ ] Next Saturday Step Function run exercises the hardened path. Forced failure test: unset FINNHUB_API_KEY on EC2 and re-run; must fail at preflight (step 0), not silently skip step 3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
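The way `|| echo ...` defeats `set -e` can be checked with a two-line script. This is a minimal sketch, assuming `bash` is on `PATH`; `false` stands in for a failing ingestion step and the scripts are illustrative, not excerpts from run_weekly_ingestion.sh.

```python
import subprocess

# A failing step followed by "|| echo" is exempt from set -e, so the script
# carries on and reports completion.
SWALLOWED = 'set -e; false || echo "WARNING: step failed (non-fatal)"; echo "Complete"'

# Without the swallower, set -e aborts at the failing step.
HARDENED = 'set -e; false; echo "Complete"'


def run(script: str) -> subprocess.CompletedProcess:
    """Run a script under bash and capture its output."""
    return subprocess.run(["bash", "-c", script], capture_output=True, text=True)


swallowed, hardened = run(SWALLOWED), run(HARDENED)
print("swallowed:", swallowed.returncode, swallowed.stdout.split())
print("hardened:", hardened.returncode, hardened.stdout.split())
```

Only the hardened script makes "Complete" conditional on every earlier step succeeding, which is what lets the completion email be trusted.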
cipher813 added a commit that referenced this pull request on Apr 14, 2026
2026-04-14 live run discovery: the n_ok==0 hard-fail guard added in PR #24 is a false positive on legitimate no-op reruns. When 900/902 tickers already have today's row in ArcticDB (because this morning's Step Function write succeeded), the loop correctly takes the "today already exists" skip path for each: n_ok=0, n_skip=900, n_err=2 (2 newly-listed tickers Q and SOLS not yet backfilled). My guard raised RuntimeError on that, treating "nothing to write because all done" as "failed to write anything."
The real silent-fail this guard was meant to catch (ArcticDB-wide auth/connectivity failure → every read throws) was reclassified to n_err (not n_skip) as part of PR #24. So the err_rate > 5% threshold already catches the true failure case, without false positives on no-op reruns or idempotent retries.
Kept: the err_rate > 5% threshold. If ArcticDB is genuinely broken, n_err will exceed 45 tickers on a 902-ticker run and this will fire.
Net behavior:
- n_ok=0, n_skip=902, n_err=0 → pass (idempotent rerun: everyone already wrote)
- n_ok=900, n_skip=0, n_err=2 → pass (normal run with 2 missing tickers)
- n_ok=0, n_skip=0, n_err=902 → fail (ArcticDB auth broken)
- n_ok=800, n_skip=0, n_err=102 → fail (err rate > 5%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
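The revised guard reduces to a single error-rate check. The sketch below is an illustration: `check_append_counts` is a hypothetical name, and it assumes every ticker lands in exactly one of the three counters so that their sum is the universe size.

```python
MAX_ERR_RATE = 0.05  # 5% threshold kept by this commit


def check_append_counts(n_ok: int, n_skip: int, n_err: int) -> None:
    """Raise only when the error rate exceeds 5%; n_ok == 0 alone is fine."""
    total = n_ok + n_skip + n_err
    if total and n_err / total > MAX_ERR_RATE:
        raise RuntimeError(
            f"daily_append err rate {n_err}/{total} exceeds {MAX_ERR_RATE:.0%}"
        )


def passes(n_ok: int, n_skip: int, n_err: int) -> bool:
    """Convenience wrapper for the table of cases above."""
    try:
        check_append_counts(n_ok, n_skip, n_err)
        return True
    except RuntimeError:
        return False


for counts in [(0, 902, 0), (900, 0, 2), (0, 0, 902), (800, 0, 102)]:
    print(counts, "pass" if passes(*counts) else "fail")
```

On a 902-ticker run the boundary sits between 45 errors (about 4.99%, passes) and 46 errors (about 5.10%, fails), consistent with the "exceed 45 tickers" figure in the message.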
Summary
ArcticDB universe library last wrote 2026-04-12. No writes on 4/13 or 4/14 despite the weekday Step Function's DailyData step reporting SUCCEEDED both days. Two silent-fail chains were masking this:
`daily_append.py` silently turned ArcticDB-wide failures into `status=ok`. `universe_lib.read(ticker)` exceptions logged at `debug` and counted as `n_skip`, so if all 909 tickers failed the function returned success with `tickers_appended=0`. Same pattern on macro reads, per-ticker appends, and macro bar appends. `_load_daily_closes` returned `{}` on S3 error.
Changes
Out of scope (tracked follow-ups)
Test plan
Deployment
Merge → EC2 will git-pull on next DailyData run. Step Function JSON must be pushed with `aws stepfunctions update-state-machine` (not auto-deployed).
🤖 Generated with Claude Code